Robert Adler: President at Datadio
Christian Hasker: Editor at Planet Cassandra, a DataStax Community Service
TL;DR: Datadio found that Apache Cassandra has a variety of use cases which have allowed them to expand their toolset. Datadio is an aggregator of data sources from services such as Tumblr, Twitter, Facebook, blogs and forums.
They switched from MySQL to Cassandra because they weren’t able to scale past a certain point. Datadio trialed MongoDB for a period of a month and a half; during this trial period, after attempting to use an auto-sharding feature, they lost half of their test data.
Their setup is 6 Cassandra nodes on economy machines with 7,200 RPM SATA II drives. With this setup, Datadio is able to push upwards of 25,000 inserts a second.
Robert says that there was a lot to learn when switching to Cassandra, but the speed and scalability that Cassandra provided made it all worth it.
Hello Planet Cassandra users. We have Robert Adler joining us today to talk about how Datadio uses Apache Cassandra. Robert, to start things off, could you tell us about what Datadio does?
Datadio is a social tool to help marketers make better decisions through data analysis and communicating in real-time.
And can you tell us about how you’re using Apache Cassandra?
What we like about Cassandra is that there are so many different applications we can use it for; so many, in fact, that it’s actually changed the direction we were taking our toolset in. Mainly, we’re an aggregator of data sources; that means we watch social networks such as Twitter, Tumblr, etc. and also aggregate blogs and forums.
We try to make sense of the data on our end to save people time. You could read 500 words about a post… but why read 500 words if we can tell you that the post is about a guy criticizing a product or company?
We try to get the best information possible and sum it up for people, and give them the ability and the medium to respond to people quicker and with more intelligence.
You talked a little bit about how Cassandra has allowed you to move your toolset in a different direction. Can you talk a little bit about what you originally wanted Cassandra to do for you and how it’s taken you in a different direction?
Sure. I’ve always been a traditional coder on the database side of things. When we first started out, we used MySQL, and I’ve done some pretty amazing things as far as benchmarks in a MySQL build. But when we started scaling higher up, going from 10,000 to 15,000 inserts a second and then even beyond that, it started buckling.
My questions became “Can we potentially load-balance it? Can we shard it? Etc.” There was really nothing out there that we could implement to do that; it was all on our end. So we decided that we would have to do our own distribution of data, our own sharding. This was not necessarily a problem, but Cassandra allows us to say “OK, we have four more servers to turn up.” We turn up four more servers and then we set the tokens. The next thing you know, it’s balancing out the ring.
We don’t have to worry about transitioning data ourselves. With just the performance it’s given us so far, we’ve been able to push a lot of our features much farther than being the standard “Here’s a Tweet. Let’s reply to the Tweet.” Now it’s, “Here’s a Tweet, who’s in it? Who is it talking about? What’s the person about? Who does the person work for?”
It now gives us the ability to have different data in different ranges and not worry about that aspect, but focus more on our actual product. It cut out a lot of maintenance that we no longer need to do.
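To illustrate the "set the tokens" step Robert mentions, here is a minimal sketch of how evenly spaced tokens can be computed for a ring. It assumes the RandomPartitioner, whose token space runs from 0 to 2**127 - 1; the interview does not say which partitioner Datadio uses, and the function name `balanced_tokens` is hypothetical. Each value would go into a node's `initial_token` setting.

```python
from typing import List

# Assumption: RandomPartitioner, whose token space is 0 .. 2**127 - 1.
# The interview does not state which partitioner Datadio runs.
RING_SIZE = 2 ** 127


def balanced_tokens(node_count: int) -> List[int]:
    """Return one evenly spaced token per node, starting at 0."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Integer arithmetic keeps the tokens exact for very large values.
    return [i * RING_SIZE // node_count for i in range(node_count)]


if __name__ == "__main__":
    # Six nodes, matching the cluster size described in the summary.
    for index, token in enumerate(balanced_tokens(6)):
        print(f"node {index}: initial_token = {token}")
```

With evenly spaced tokens, each node owns roughly the same share of the ring, which is what lets the cluster balance itself once new nodes are turned up.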
Great. We frequently see in the community migrations from MySQL to Cassandra, especially when you’re hitting that very high threshold of data volume and needing predictable performance. I’ve been wondering, did you look at anything else before Cassandra?
Actually, we gave MongoDB a try. I’m not going to say that MongoDB wasn’t good. We tried it for about a month and a half and everything about it seemed good but when we needed additional features, we ran into small problems. I don’t know whether or not these problems were due to user error on our end or problems due to “it’s a new feature and we’re running into a small problem”.
For example, MongoDB does have the ability for auto-sharding and we tried that probably four or five times. The first three times, our nodes lost around half of our data. Luckily, it was just test data, but it’s not a good feeling to sit there and have it go, “Oh, look. Everything’s perfect. This is great … and now it’s gone.”
I’m not saying it’s a bad product. I’m just saying that based on our experience, the transition to Cassandra was a lot more smooth. We haven’t really run into issues like that with Cassandra. For the issues we’ve run into, all it takes is a quick message to a couple of people in the community; whether it’s sending it off through email, or putting it in the IRC room. I end up with a solution usually in 30 minutes, so I can’t complain.
How was it coming from relational to Cassandra? What did you find difficult, or what have you learnt since that you wish you’d known back then?
I wish that I knew not to waste time trying to do six secondary indexes on one column family, because I was so used to doing SQL style indexes. When I started out in Cassandra, I said, “I wonder why my writes per second were cut by 80%?”
There was a lot that we had to learn coming from that, but it’s nothing you can’t adapt to. Once we actually embraced changing some of those indexes, and even moving some of them onto a different platform, it really made us appreciate the speed of Cassandra.
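As a rough back-of-the-envelope illustration of the secondary index comment above, the sketch below uses a toy cost model: every insert pays one unit for the base write plus a fixed extra cost for each secondary index that must be updated. This model and both function names are assumptions made for illustration; they are not measurements from Datadio or a description of Cassandra's internals.

```python
def relative_write_throughput(num_indexes: int, cost_per_index: float) -> float:
    """Fraction of baseline write throughput that remains under the toy model.

    The base write costs 1 unit; each index adds cost_per_index units.
    """
    return 1.0 / (1.0 + num_indexes * cost_per_index)


def implied_cost_per_index(num_indexes: int, remaining_fraction: float) -> float:
    """Per-index cost that would explain an observed remaining fraction."""
    return (1.0 / remaining_fraction - 1.0) / num_indexes


if __name__ == "__main__":
    # Six indexes and an 80% drop (20% of throughput left), as in the interview.
    cost = implied_cost_per_index(num_indexes=6, remaining_fraction=0.2)
    print(f"implied extra cost per index: {cost:.2f} base writes")
    for k in range(7):
        print(k, round(relative_write_throughput(k, cost), 2))
```

Under this simplified model, the reported 80% drop with six indexes would correspond to each index costing about two thirds of a base write, which shows how quickly index maintenance can add up on a write-heavy workload.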
Robert, you mentioned earlier getting questions answered pretty quickly in the IRC. I’m wondering, have you had other experiences with the Cassandra community?
Not yet. I had looked once or twice to see if there was a local group but it didn’t look like there was; on the flip side, I did actually run into a couple people at the local Ruby group that use Cassandra on a two or three node level. They were great for answering my initial questions when I first started.
Can you walk us through a little bit about what your deployment of Cassandra looks like?
When we first started, we were running through what they call “bare metal” servers so it’s not cloud but it’s virtualization on physical hardware. Then, we moved on to regular dedicated servers. Then, after that, we kept transitioning until we needed more than one Cassandra node. For a couple of months, we were running on a single node. I was actually really impressed with that, considering I’m running 7,200 RPM SATA II drives, just standard drives, no SSDs or hybrids. I was still pushing 25,000 inserts a second (single node).
Just the ability to pull that on economy drives was really nice to see. As a result, we went ahead and bought some more physical nodes and now we’re running on six hardware nodes whose sole job is Cassandra. We went from one node in the ring to six. Believe it or not, that transition was a lot smoother than I thought it was going to be.
In terms of scaling, we’ve been all over the board. We started with one. We went with cloud and then bare metal. We always prefer to be on dedicated hardware, just so we know that if we’re maxing a box, we’re the ones maxing the box. In general, it’s been very predictable in terms of the resource usage. It’s not like a load of four, a load of four, and then two minutes later, a load of 15. It’s been very good so far – very consistent.
As you move forward with Cassandra, is there anything that it doesn’t do that you wish it could do?
There’s always going to be that list coming from the relational aspect, but I know that that’s not logical. In terms of what I’ve seen so far, I can’t think of anything off the top of my head. It really does have everything that you could want.
The only thing I could potentially even think of is that when moving nodes to different tokens, I’ve ended up having slight problems where it would not finalize the transition and the log would not update. That was more a case of me being impatient than Cassandra having a problem.
The one thing I’ve always wondered is whether or not there is a way to view a current status for more than just the compactions; for example, on moves. Moves normally have the status of sending files but they don’t have this in between sending to two different nodes.
Robert, this has been great. Thank you very much for taking the time to share your experience of Cassandra at Datadio.