Jason Harvey: Systems Administrator at Reddit
Brady Gentile: Community Manager at DataStax
TL;DR: Reddit started out as a place for people to go and post links, find interesting things on the Internet and share information with other people. Today it’s commonly known as the “front page of the internet”.
Cassandra was selected at Reddit for its self-healing read-repair feature and its horizontal scalability. Being able to add nodes and scale out writes was probably the largest factor in their initial decision process.
Reddit uses Cassandra for multiple use cases, including listings and all of the links on Reddit, voting data, users’ voting history, liked pages and SubReddit pages. Reddit currently runs its 10-node Cassandra cluster in AWS EC2 on SSDs, managing 17 million votes daily.
Hello Planet Cassandra users. Today we have Jason Harvey joining us; Jason is the Systems Administrator for the website Reddit.com. Jason, thanks for taking the time.
Happy to be here, Brady.
To start things off could you tell us a little bit about what Reddit is all about?
Reddit.com started out as a place for people to go and post links, find interesting things on the Internet and share information with other people. Over the years it has evolved and now we like to say it’s the “front page of the internet”; most of what is happening on the web you’ll find on Reddit.com. Users will submit links and other users will vote up and down on those links, curating the content themselves. In addition to that, you have a huge number of sub-communities called SubReddits that have their own rules and traditions; they are generally focused on specific content.
Excellent. How is Reddit using Apache Cassandra?
We are using Cassandra for quite a few different things actually. Most of the listings and all of the links on Reddit come out of Cassandra in one way or another. The biggest thing we’re using Cassandra for is denormalized lookups; this is data that we need to pull out in denormalized form, so we don’t have to hit the database constantly… our database backend for that is PostgreSQL. Also, anytime that you vote on something (making sure that little arrow is going to be colored in the next time you view the page), that’s going to be pulled from Cassandra as well; most of the voting data is stored in Cassandra, along with your liked pages and SubReddit pages.
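To make the idea of a denormalized lookup concrete, here is a minimal Python sketch. It is a toy model built on plain dictionaries, not Reddit's code or schema: the names `votes_table`, `votes_by_user`, `record_vote` and `arrows_for_page`, and the assumption that every vote is written both to a source of truth and to a per-user view, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for the relational source of truth
# (PostgreSQL in the interview): one row per vote.
votes_table = []  # rows of (user, link_id, direction)

# Hypothetical denormalized view: one wide row per user, keyed by link id.
votes_by_user = defaultdict(dict)


def record_vote(user, link_id, direction):
    """Write the vote to the source of truth and to the denormalized view."""
    votes_table.append((user, link_id, direction))
    votes_by_user[user][link_id] = direction


def arrows_for_page(user, link_ids):
    """Return the user's vote on each link of a page using only the view."""
    row = votes_by_user.get(user, {})
    # 0 means the user has not voted on that link.
    return {link_id: row.get(link_id, 0) for link_id in link_ids}


record_vote("alice", "t3_abc", 1)
record_vote("alice", "t3_def", -1)
page = arrows_for_page("alice", ["t3_abc", "t3_def", "t3_xyz"])
print(page)
```

The point of the design is that rendering a page needs one keyed read of the view rather than a query against the relational table for every link.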
What was your motivation for choosing Cassandra over other database offerings?
Sure. One of the nice things about Cassandra, when comparing it to other offerings, is its self-healing feature in terms of read-repair; that was pretty enticing to us when we were first looking at it. If we do a look-up and something’s behind, then we have a way to get that addressed instead of having to go manually dig it out.
With PostgreSQL we had a single master for the data, and that can only scale so far… so the horizontal scaling, being able to add nodes and scale out our writes, was probably the biggest factor in the initial decision process.
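The read-repair behavior described in this answer can be sketched in a few lines of Python. This is a simplified toy model, not Cassandra's implementation: the `Replica` class and `read_with_repair` function are hypothetical, every replica is consulted on every read, and conflicts are resolved purely by the newest timestamp. A real cluster decides how many replicas to consult based on the requested consistency level.

```python
class Replica:
    """Toy replica: maps key -> (timestamp, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, timestamp, value):
        # Keep the write only if it is newer than what we already hold.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        return self.data.get(key)


def read_with_repair(replicas, key):
    """Read every replica, return the newest value, repair stale copies."""
    responses = [(replica, replica.read(key)) for replica in replicas]
    found = [resp for _, resp in responses if resp is not None]
    if not found:
        return None
    newest = max(found, key=lambda resp: resp[0])
    for replica, resp in responses:
        if resp != newest:
            # This replica is behind (or missing the key): push the newest copy.
            replica.write(key, newest[0], newest[1])
    return newest[1]


a, b, c = Replica("a"), Replica("b"), Replica("c")
for replica in (a, b, c):
    replica.write("link:1:score", 1, 10)
# Replica c misses the second write, so it falls behind.
a.write("link:1:score", 2, 11)
b.write("link:1:score", 2, 11)

value = read_with_repair([a, b, c], "link:1:score")
print(value, c.read("link:1:score"))
```

After the read, replica `c` holds the same copy as `a` and `b`, which is the "something's behind, and it gets addressed" behavior without any manual digging.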
Interesting. Are you storing this data in your own data centers or in the cloud?
It is all on Amazon EC2.
Can you share anything about the size of the cluster and number of operations you’re doing?
I can give you some stats: we have 10 nodes right now running on Amazon SSD instances, which are super hefty boxes, and there’s roughly 2TB of data within Cassandra that we’re storing right now across those nodes; it’s all on version 1.1.7.
I can also give you an idea of the vote data we store: we store votes in multiple areas, and users make around 17 million votes a day, so it’s a lot of data getting tossed in there every day.
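For a rough sense of scale, the figures quoted here can be turned into averages with some back-of-the-envelope arithmetic. The sketch below uses only the numbers from the interview (17 million votes a day, roughly 2TB, 10 nodes) and assumes, purely for illustration, that votes arrive evenly through the day and that data is spread evenly across nodes; real traffic has peaks, and each vote is stored in multiple areas, so the actual write rate would be higher.

```python
VOTES_PER_DAY = 17_000_000      # from the interview
SECONDS_PER_DAY = 24 * 60 * 60
NODES = 10                      # from the interview
TOTAL_DATA_TB = 2.0             # "roughly 2TB", from the interview

# Assumes votes are spread evenly over the day (they are not in practice).
votes_per_second = VOTES_PER_DAY / SECONDS_PER_DAY

# Assumes data is spread evenly across the nodes.
data_per_node_tb = TOTAL_DATA_TB / NODES

print(f"average votes per second: {votes_per_second:.0f}")
print(f"average data per node: {data_per_node_tb:.1f} TB")
```

That works out to an average of about 197 votes per second and about 0.2 TB per node.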
And for someone who’s just getting started with Apache Cassandra, do you have any advice? Maybe something that, in hindsight, you might have done differently?
I would say the biggest thing to consider is what type of data model you want to lay out and how you can utilize some of Cassandra’s features in order to implement that data model. If you don’t really have a clear concept of what you want to use your primary data store for, then it might be difficult trying to figure out how you’re going to be using Cassandra, what type of data model you’re going to use or how many nodes you might need.
Make sure to get an idea of what you’re going to be storing in Cassandra and the model that you’d like to store it in. With Cassandra, you’re not locked into as rigid a schema as you would normally see in PostgreSQL, so you can be much more flexible on that end and play to those features.
Could you walk us through your experience with Apache Cassandra?
Sure. When we started using Cassandra, we started on Version 0.6, which was quite a while ago of course; we were kind of seen as the early adopters. There was a lot of fear on the internet about “why are you guys keeping all this important data” or “where is all this data getting stored.” But we felt pretty strongly about it; there were some hiccups along the way, but the flexibility that we got from Cassandra is probably the biggest reason why we kept on using it. In times where there were errors, especially in the pre-1.0 versions, we were generally able to recover the data because of Cassandra’s self-healing nature. If something bad ever did occur in Cassandra, maybe on the software side or even on the hardware side, we were able to recover pretty reasonably; that rarely occurs in other database offerings… if you’re going to lose a large chunk of data, or something bad happens to some of your data, then you might be down for quite a while during a manual repair.
We started with 0.6 and have gone through each version since then: 0.7, 0.8, 0.9 and then finally to 1.0. For each one, we ran into our own little problems because we were using it so heavily, compared to a lot of other people at the time. Because we were using it in so many different ways, we ran into a lot of use cases that probably a lot of other people didn’t hit so often; that was something pretty unique about us.
Since we hit the 1.0 version, I would say things have been pretty stable; we’re happy with how things have turned out and we’re looking forward to some of the stuff that’s coming up. We’re currently on Version 1.1, and we’re looking at upgrading to 1.2 and at some of the stuff that’s on the horizon as well.
What are your future plans for Cassandra inside the organization?
As you may know, Amazon has multiple availability zones and our future plans are to spread things across these multiple availability zones to make use of some of the things like global forum, to scale more horizontally and across multiple regions. This will help us create better availability for our users.
What are you looking forward to in updated versions of Apache Cassandra?
The big thing that we’re interested in is virtual nodes (Vnodes); I know that’s in version 1.2, but we’re not quite there yet. It will be really interesting to see how we can utilize Vnodes with some of our data models. We don’t really use the CQL stuff on the application side, but improvement of the command line utilities is something that, as an administrator, I am really looking forward to.
Excellent. Is there anything that you’d like to add about Apache Cassandra?
Nothing specific. It’s been a good piece of software; we’ve had our tribulations with it, but in the end it’s been more positive than negative, so we’re pretty happy and looking forward to the new features that are on their way.