August 21st, 2013



Juan Valencia: Principal Software Engineer at ShareThis


Daniel: Hey it’s Daniel and I’m here with Juan from ShareThis. It’s super interesting to learn a little bit more about how ShareThis is using Cassandra, so I thought we might do an informal chat. Juan, maybe you can start with what’s interesting to you about Cassandra, why ShareThis is using Cassandra today and a little bit on how you’re using it, as well.


Juan: Sure. As our data has grown, we started using it more and more. Traditional relational databases just weren’t meeting our needs in terms of write throughput. Before we switched over to Cassandra, we were running MongoDB, and it had reached its limit (at least for where it was back at that time). It used a master-slave architecture, and once we pushed the limits of that, we started hitting some walls.


Daniel: MongoDB is usually touted as the thing that you switch to when MySQL isn’t working.  The fact that you’re hitting up against the walls of Mongo, what kind of throughput or data is ShareThis doing? What do the numbers look like? This sounds like pretty big scale stuff.


Juan: I think generally we have to be prepared for about 40 to 50 million events being processed at any given time. For our purposes of needing to get data in real time, this translates to about 50,000 writes per second.


MongoDB can certainly do that; but of course, when we were looking at these technologies two or three years ago, MongoDB was not full-fledged. It was still in development, and we kept hitting the technological limits it had at the time.


We needed something that had gotten a little bit more mature, and Cassandra, which we were also using at the time, gave us a few fewer headaches in terms of throughput; that’s the bottom line.


Daniel: For those that don’t know much about Cassandra maybe you can just give a quick overview of Cassandra.


Juan: Sure. Essentially, rather than keeping tabular data, it keeps a key-value store. It writes in a log-structured fashion, so it’s optimized for commodity hard disks. You get a lot of benefits in the sense that you divvy up your data in a ring; the Cassandra nodes distribute the data amongst themselves intelligently, giving you the throughput you get by putting that data on different machines.

It stops being a relational database once you start writing to different parts of the disk or different machines. You lose some things such as joins (which are very popular) and some of the nice aggregation statistics that come with MySQL. What you gain is really high throughput: you can write a lot, which matters if you’re planning to scale.
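The ring Juan describes can be sketched in a few lines. This is a toy illustration, not Cassandra's actual implementation: real Cassandra uses Murmur3 partitioning and virtual nodes, while here MD5 and four hand-picked tokens stand in for them. Each node owns a token, and a key is routed to the first node whose token is at or past the key's hash, wrapping around the ring.

```python
# Toy sketch of a token ring: each node owns a token, and a key goes to
# the first node whose token is >= the key's hash (wrapping around).
# MD5 stands in for Cassandra's Murmur3 partitioner; the node names and
# tokens are made up for illustration.
import bisect
import hashlib

RING_SIZE = 2**32

def key_token(key: str) -> int:
    """Hash a key onto the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

class TokenRing:
    def __init__(self, node_tokens: dict):
        # Sort tokens so we can binary-search for a key's owner.
        self.tokens = sorted(node_tokens.values())
        self.by_token = {t: n for n, t in node_tokens.items()}

    def node_for(self, key: str) -> str:
        t = key_token(key)
        # First token >= hash; modulo wraps past the last token back to
        # the start of the ring.
        i = bisect.bisect_left(self.tokens, t) % len(self.tokens)
        return self.by_token[self.tokens[i]]

ring = TokenRing({"node-a": 2**30, "node-b": 2**31,
                  "node-c": 3 * 2**30, "node-d": RING_SIZE - 1})
for key in ["example.com:2013-08-21", "another.org:2013-08-21"]:
    print(key, "->", ring.node_for(key))
```

The routing is deterministic, which is the point: the same key always lands on the same node, so writes for one key never need coordination across the cluster.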


Daniel: Where are you hosting this? Is this all up on AWS? What kind of commodity box are you using?


Juan: We’re basically using AWS. We have multiple clusters, and usually we’ll run extra-large instances; though for certain needs, we’ve had to go to SSDs because we have a lot of reads that are very random.


Daniel: What about some of the things that have really worked well or really haven’t worked well? You mentioned some limitations of Cassandra, such as no joins. It sounds like the throughput is the main reason for using it.


Juan: Sure. I think the hardest thing about Cassandra – and this is actually more of an art than a technical limitation – is that our data tends to be pretty heavily skewed toward one side, so to speak. We’ll have a few data points that get a lot of traffic and a lot of data points, a long tail, that get nothing. That works pretty well, unless the data is so unbalanced that Cassandra’s division of it puts it all on one node, which then ends up with all that traffic.


The keys that you use to insert into Cassandra have to be thought through in advance; like I said, it’s an art. You have to certainly know what your data looks like and be able to say, “Okay, if we’re going to key the data and it’s going to get divided up, how are we going to key it in such a way that it’s easily spread out?” Because we have this long tail where most datasets are small but a few are very large, those large datasets have to be broken up in some way.


If we key by domain, for example, that’s not going to be good, because some domains’ datasets are going to be really large; so we have to key by domain and date, or by domain, date, and data type.
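The composite-key fix Juan describes can be demonstrated with a toy example. The domain name, node count, and `hash(key) % nodes` placement below are all made-up stand-ins for a real token ring; the point is only the distribution: keying a hot domain by domain alone sends every write to one node, while adding the date to the key spreads the same writes out.

```python
# Hedged sketch of why composite keys help with skewed data.
# "bigpublisher.com", 6 nodes, and hash-mod placement are illustrative
# stand-ins for a real cluster's token ring.
import hashlib
from collections import Counter

NODES = 6

def node_for(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NODES

dates = [f"2013-08-{d:02d}" for d in range(1, 22)]

# Keyed by domain only: one hot domain, so every write hits one node.
by_domain = Counter(node_for("bigpublisher.com") for _ in dates)

# Keyed by (domain, date): the same 21 writes spread across nodes.
by_domain_date = Counter(node_for(f"bigpublisher.com:{d}") for d in dates)

print("domain-only key:", dict(by_domain))       # all 21 writes on one node
print("domain+date key:", dict(by_domain_date))  # almost certainly spread out
```

The same idea extends to Juan's third component: appending a data type to the key splits each (domain, date) partition further, at the cost of having to query more partitions to reassemble a domain's full history.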


Daniel: Something more granular that’s thought out beforehand.


Juan: Yeah, and that’s one of the places where we’ve had a couple of gotchas. You think you know what your data looks like, you start serving it at high volume, and then you realize: “Oh, it’s not as even as I thought it was,” or “these keys, the way that I’ve distributed them, are not as even as I thought they were.” At that point, you might have to rebalance the cluster or change the way it divides things up.


Daniel: How hard is that and how much control do you have?


Juan: If it’s not too badly distributed, you can certainly redistribute the tokens on which it’s hashed. Say you have two or three different tokens, and the data that falls between two tokens goes “here” and “here.” If those ranges aren’t so different, you can just insert tokens in between and redistribute the data on your cluster; that’s not terribly difficult with Cassandra.


However, if it’s really bad, say one is very small and then the next five are fairly large, you have a data problem; you keyed it wrong and you really need to go back.
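The token-insertion idea can be shown with toy arithmetic. In a real cluster this is what adding a node, or moving one with `nodetool move <new-token>`, accomplishes; the ring size of 400 and the token positions below are invented purely to show the range split.

```python
# Toy illustration of splitting a hot token range by inserting a token.
# Ring size 400 and these token positions are made up for the example.
RING = 400
tokens = [0, 100, 200, 300]

def range_owned(tokens, i):
    """Size of the range ending at tokens[i] (owned by that node)."""
    prev = tokens[i - 1] if i > 0 else tokens[-1] - RING  # wrap around
    return tokens[i] - prev

# Suppose the node at token 200 owns a hot range (100, 200].
hot = range_owned(tokens, 2)  # 100 units of the ring

# Insert a new token halfway through the hot range, splitting its load.
tokens = sorted(tokens + [150])
split = [range_owned(tokens, i) for i in range(len(tokens))]
print(split)  # the old 100-unit range becomes two 50-unit ranges
```

This is also why the badly keyed case Juan mentions can't be fixed this way: inserting tokens splits ranges of the hash space, but if one *key* is hot, all its traffic hashes to a single point and no token boundary can divide it.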


A funny story: we have a number of unregistered publishers that work with us, and at one point in one of our systems, we keyed them all on the same “unregistered” key. All of the unregistered data went to this one machine and totally stopped the thing up. So obviously, that was a gotcha.

Daniel: For anyone who’s thinking about Cassandra today, how do you think it stacks up to the other options like Mongo? Would you do it the same way if you were starting from scratch right now, like if someone is looking at NoSQL solutions?



Juan: I know it’s cliché but you have to know your data. Certainly, there are some solutions where an in-memory store works really well. Even something as simple as Memcached might end up getting better performance in what you’re doing.


Similarly, if you’re working with documents (which is really what Mongo’s designed for), I think Mongo has some pretty good advantages. It does give you some very basic aggregation functions which don’t make sense with Cassandra. Cassandra is really about the throughput and as long as you can evenly divide out your keys and get the right hardware underneath, you can really scale up Cassandra to probably more than what most people are using it for. I’d say if that was your goal, then Cassandra is certainly what you want to be using.


If you have smaller datasets, what are you going to do with that data? If you want aggregations, that’s going to influence your decisions a bit and Cassandra really is not going to be good there. If you want to be moving data in and out of a database and need high reads for example, those things are going to affect your decision as well.


I think Mongo has come a long way since we started using it, so there is definitely some positive growth there. But at the end of the day, it comes back to your current application and how you’re going to use it.


Daniel: Do you have any context around ShareThis’ usage of Cassandra or any numbers you know of off the top of your head? This might be helpful for everyone who’s watching to know what kind of numbers you’re talking about.


Juan: It’s always a matter of cost, so balancing how big the cluster is. We can get it to perform, which is nice, but then how much data do you want to run through there? What we try to do is aggregate the sets of data that we have up to some smaller dataset, and we end up with roughly four clusters that we use in production as well as their QA counterparts.  


It depends on the application but if the database is being used as your “typical database” we’ll have 12 extra-large nodes from AWS (that defines two of our clusters).


We have one that services counts and it has a lot of read traffic and a lot of random read traffic, so the caching is hit or miss. That one has 9 regular nodes and 9 SSD nodes to handle that read throughput and it works pretty well. We have another one, which is 12 nodes; we needed high write throughput for this one so we have some special machines that handle Cassandra for that purpose.


Daniel: That’s awesome. You’re really working with large datasets, which is cool. I had heard that you’re a very large data user of Cassandra.


Juan: Cassandra really shines when you’re using large datasets, I think.  And we’re not the size of Netflix, one of the largest users of it, but we certainly store a lot of data. I think we have at least 50 large machines in production, plus our QA setup.


Daniel: That’s cool. Is there anything else about Cassandra that you want to share? If I’m watching this and thinking about implementing, is there any important advice that you’d like to share?


Juan: You really want to know your data. Every time we’ve chosen Cassandra, even back when we first adopted it, we’ve looked at the new product and asked, “Okay, what’s the right database for this?”


We profile the application based on its expected use, and we run tests on the database based on that expected use. If you’re thinking about it, I would urge you to really go through that process, because it’s valuable. It will ultimately save you time down the road, when you don’t hit the gotchas of “Oops, I picked the wrong one.”
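A minimal sketch of the profiling step Juan recommends: shape a workload like your real traffic (here, a long-tail key distribution like the one discussed earlier) and measure write throughput against a candidate store. The 90/10 hot-key split and the in-memory dict are assumptions for illustration; in practice the `write` callable would wrap whichever database you're evaluating.

```python
# Toy workload profiler: drive a write(key, value) callable with a
# long-tail key distribution and report achieved writes per second.
# The 90/10 split and the dict-backed store are illustrative stand-ins.
import random
import time

def run_workload(write, n_events: int = 100_000) -> float:
    """Return writes/sec for a skewed (long-tail) write workload."""
    random.seed(42)  # fixed seed so runs are repeatable
    # 90% of events hit 10 hot keys; 10% spread over a huge tail.
    keys = [f"hot-{random.randrange(10)}" if random.random() < 0.9
            else f"tail-{random.randrange(1_000_000)}"
            for _ in range(n_events)]
    start = time.perf_counter()
    for i, k in enumerate(keys):
        write(k, i)
    return n_events / (time.perf_counter() - start)

store = {}
rate = run_workload(store.__setitem__)
print(f"{rate:,.0f} writes/sec against an in-memory dict")
```

Running the same harness against each candidate database, with the key distribution matched to your real data, is one concrete way to "go through that process" before committing.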


Daniel: Awesome advice. Well, thanks for the time, this was great.


Juan: Thanks for having me.