John Watson: Operations Engineer at Disqus
George Courtsunis: Software Engineer at Disqus
Matt Pfeil: Founder at DataStax
TL;DR: Disqus is a discussion platform for the web that connects publishers with users and allows them to have a public discourse across the web.
Disqus uses Cassandra in a number of different places, primarily in their product. Cassandra powers the analytics and content engine behind Disqus’ content recommendations.
Disqus was originally using a hybrid system to track counts and sets with Redis. Redis worked great for a while, but as they grew, it didn’t scale along with the size of their data. Disqus needed a data store that would allow them to store billions and billions of rows of data, and allow them to access it very quickly. Cassandra was the natural choice to grow and build their system.
Hello, Planet Cassandra viewers. This is Matt Pfeil, and today I’m joined by John Watson and George Courtsunis, engineers at Disqus, for an Apache Cassandra use case. Gentlemen, thanks for joining today. Why don’t we kick off, as we usually do, by having you tell everyone what Disqus does?
George: Disqus is a discussion platform for the web. We connect publishers with users and allow them to have a public discourse in a medium that allows communication across the web. Users can feel connected to the publishers, and then publishers can foster better communities on their websites.
What’s an example of that?
George: Pretend you’re browsing CNN.com and you reach an interesting article about what’s currently going on in Syria. Below the fold, below the article, you can interact with other users also reading about the Syrian civil war, and maybe have some kind of discourse about it: “did Assad really use sarin gas, or did the rebels?” It lets you exchange opinions with other users.
Very cool, and how do you guys use Cassandra to make that a reality?
George: We use Cassandra in a number of different places, mainly in the product; it’s used for content recommendation and also a little bit of advertising. Let’s say you’re on that article reading about the war in Syria and you notice there’s another interesting article about what the British PM has released as a public statement on whether or not it’s legal to go to war, and maybe you’re interested in reading that response. Cassandra powers the analytics and content engine behind how we recommend content.
Very cool. What does your data model look like in that case? In other words, at a high level, how are you laying out all that data? Because it sounds like a lot.
George: It is a lot of data. We started using Cassandra in 0.8, before sets were released; we have a lot of counting and set-type data. What we do in that specific case is an algorithm called co-visitation: we keep sets of pages or articles that users have viewed, for a large number of users. When you visit a page, we look at other users who have viewed the same page and take the differences between your set of pages and theirs. The idea is that some of those pages would be interesting for you to read as well.
There’s some other data too, like counting the number of comments, views, and posts, which is globally interesting. We’ve tried to do some things with that, but we’ve found that keeping it simple actually yielded the best recommendations.
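The co-visitation idea George describes can be sketched in a few lines of Python. This is only an illustration with made-up names and toy data structures; Disqus’ production version runs on Cassandra counters and sets at a vastly larger scale.

```python
# Co-visitation sketch: recommend pages that co-visitors of the
# current page have read, but the current user hasn't seen yet.
# All data here is hypothetical, for illustration only.
from collections import Counter

# page -> set of user ids who viewed it
page_viewers = {
    "syria-article": {"alice", "bob", "carol"},
    "uk-parliament-response": {"bob", "carol"},
    "sports-recap": {"carol"},
}
# user -> set of pages they viewed
user_pages = {
    "alice": {"syria-article"},
    "bob": {"syria-article", "uk-parliament-response"},
    "carol": {"syria-article", "uk-parliament-response", "sports-recap"},
}

def recommend(user, current_page, top_n=2):
    """Score pages by how many co-visitors of current_page read them,
    then drop pages the user has already seen (the set difference)."""
    co_visitors = page_viewers.get(current_page, set()) - {user}
    scores = Counter()
    for other in co_visitors:
        for page in user_pages[other] - user_pages[user]:
            scores[page] += 1
    return [page for page, _ in scores.most_common(top_n)]

print(recommend("alice", "syria-article"))
# -> ['uk-parliament-response', 'sports-recap']
```

The whole algorithm reduces to set differences and counting, which is exactly the shape of data George says they keep in Cassandra.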
I’m a firm believer in the KISS principle: keep it simple, stupid. Complexity is a real challenge for most people.
George: Yeah. Complexity really, really sucks.
It’s funny; people don’t necessarily pay attention to complexity, but when something breaks in a complex system, it goes downhill extremely quickly.
George: Totally. It’s like trying to debug spidery code, or, probably a better example, trying to figure out what went wrong in a plane crash.
On the Cassandra front, what was your primary motivation for looking at Cassandra? What other technologies did it either replace in your infrastructure or go up against during the evaluation process?
George: We were originally using a hybrid system, built on Redis, to track counts and sets. Redis worked great for us for a while, but as we grew, it didn’t scale along with the size of our data. We needed a data store that would let us store billions and billions of rows and access them very quickly. Cassandra was the natural choice for us to grow out and build our system. While moving off Redis, we also played around with Mongo, but it really didn’t satisfy our primary use case.
What’s the number one feature about Cassandra that you think makes it successful for your company?
George: For us, it’s the horizontal scalability. The ability to add nodes and increase our total capacity is, without question, the primary reason we switched. As you scale up, being able to grow the size of your data on demand is super important.
The second most important thing is probably the tunable consistency guarantees: being able to read at a consistency level of ONE when you don’t necessarily need the data to be perfectly up to date at that moment.
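The trade-off behind that choice comes down to Cassandra’s quorum arithmetic: with N replicas, a read at consistency level R is only guaranteed to overlap a write at level W when R + W > N. A minimal pure-Python illustration of that rule (this is arithmetic only, not the driver API):

```python
def overlaps(n_replicas, write_cl, read_cl):
    """Cassandra's quorum rule: a read is guaranteed to see the latest
    acknowledged write only when R + W > N, because at least one
    replica is then common to both the read and the write."""
    return read_cl + write_cl > n_replicas

N = 3                  # a common replication factor
QUORUM = N // 2 + 1    # = 2 when N = 3

# QUORUM writes + QUORUM reads always overlap:
print(overlaps(N, QUORUM, QUORUM))   # True

# Writing at QUORUM but reading at ONE -- the trade-off George
# describes -- may return slightly stale data, in exchange for
# lower-latency reads:
print(overlaps(N, QUORUM, 1))        # False
```

Reading at ONE is exactly the case where you accept that the row might be a moment stale because a single replica can answer immediately.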
Very cool. Speaking of the technology, what does your infrastructure look like today?
John: We’re all on dedicated hardware and, currently, our main cluster is 24 nodes. We found that CPU was actually turning out to be a bit of a bottleneck at times, so we bought the biggest CPUs we could get, which are two 6-core Xeons running at 3GHz. RAM doesn’t seem to be much of an issue, so we decided to save a little there: each node only has 24GB of RAM. The heap’s only at 8GB, so it definitely fits, and our hot set takes up the remaining 15GB or so. We found that even 32 or 48GB of RAM really wasn’t helping performance all that much, so we saved there.
We originally had SSDs in them and, just like with the RAM, we found that counters really weren’t saturating the I/O, so we had all this extra I/O and not enough CPU to drive it. With SoftLayer, we saved a little by going with 8 disks in RAID 10. It doesn’t give us as much I/O as the SSDs, but it’s slightly cheaper. We can handle losing a node, but it’s even easier to lose just a disk.
Currently it’s all backed by gigabit networking. We’ve toyed around with a couple of Cassandra nodes on 10-gigabit Ethernet, but this cluster, again, is mostly CPU-bound: at full load we were pushing about 1.5 million increments per second when the cluster was only 8 nodes, so with 24 nodes we don’t plan on hitting a write limitation anytime soon.
Reads are the reason we scaled out, because reads are a little heavier. But at 24 nodes it’s handling our load of about 30,000 reads a second very well, and we’re only at 20% capacity now, which is great.
It’s all bare metal, and we’ve been really happy with it. It handles network trouble quite well: we’ve had a couple of times where we lost a back-end router and Cassandra lost a couple of nodes, but it kept chugging along, the nodes came back, Cassandra replayed the hints, and we were back to normal. We’ve been pretty happy with it.
Guys, that’s great information. As one final question, is there anything you’d like to share about either Cassandra or the community in general?
George: Really, we have no complaints. The community has been awesome, and any time we’ve needed help, either DataStax or people on IRC have always been available. I’m super excited about lightweight transactions and some of the more concrete data structure work, so it’ll be exciting to see how that develops.
It will be, especially with 2.0 and compare-and-swap coming out soon. It’s an extremely exciting time for Cassandra, and an extremely rapid evolution of the offering.
Guys, I really want to thank you for your time today. Everyone, check out Disqus. They also recently did a Cassandra meetup in San Francisco. They’re actually the organizers for that group, and they’re very, very active in the community, so check out the video of that presentation.