August 1st, 2013

Brady Gentile: Community Manager at DataStax

Theo Hultberg: Chief Architect at Burt

Brady: Hello Planet Cassandra users. This is Brady Gentile, Community Manager at DataStax. We have Chief Architect Theo Hultberg from Burt with us today. Theo, thank you so much for joining us. What does Burt do?


Theo: Sure, Burt is a business intelligence company for publishers; so, we work with digital publishers to help them understand what their revenues are, and how they are making money online. That means tracking advertising and other things surrounding advertising, because that’s how digital publishers make money. We really build analytic systems.


Brady: Excellent, and how are you guys using Apache Cassandra at Burt?


Theo: We have a few different places where we use Cassandra. The primary reason why we use it is that we do a lot of writes but not necessarily lots of reads, so we need something that can write lots and lots of data and then occasionally read it back. We use it for some easy persistence of things. We use it for building indexes and histories. For example, one use case is that we built analysis of visitors to a web page, including histories of their visits, all the pages that they visited, all the ads that they’ve seen and other media they have consumed within the scope of the publisher’s various web properties.

We also use it as storage for some of our pre-calculated metrics as well. So we have a few different use cases, and Cassandra fits into them quite nicely.


Brady: How did you come to choose Cassandra?


Theo: As I said, it’s the write capacity. We write so much, and we actually read so little back. For example, we decided to use Cassandra for one application where the publisher has a web store and wants to know when someone buys something and what affected that purchase.


Maybe you want to see all the advertising that they’ve seen before they made the purchase. Only 1 person in 10,000 may actually buy something online, but you need the history for everyone to be there when that purchase happens. You will only look at 1 in 10,000, so we need to write lots and lots and lots of things and then read it back once.


Cassandra is really good for that kind of thing. We’ve had problems in the past with other databases that haven’t been able to keep up with the write load and most of the data ages very quickly, so we want to get rid of that as quickly as we can. That’s also a place where Cassandra really shines, getting rid of all the data that you don’t want any more.
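The access pattern Theo describes (record every event, read a history back only on the rare purchase, and let old data age out) can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions: the `VisitorHistoryStore` class, its method names, and the `ttl_seconds` parameter are hypothetical and are neither Burt's code nor a Cassandra API.

```python
import time
from collections import defaultdict


class VisitorHistoryStore:
    """Hypothetical in-memory stand-in for a write-heavy history table with expiry."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._events = defaultdict(list)  # visitor_id -> [(written_at, event)]

    def record(self, visitor_id, event):
        # Hot path: every page view or ad impression is appended.
        self._events[visitor_id].append((self.clock(), event))

    def history(self, visitor_id):
        # Rare path: read back only when a purchase happens.
        cutoff = self.clock() - self.ttl_seconds
        return [
            event
            for written_at, event in self._events.get(visitor_id, [])
            if written_at >= cutoff
        ]

    def purge_expired(self):
        # Drop events older than the TTL and return how many were removed.
        cutoff = self.clock() - self.ttl_seconds
        removed = 0
        for visitor_id in list(self._events):
            kept = [(t, e) for t, e in self._events[visitor_id] if t >= cutoff]
            removed += len(self._events[visitor_id]) - len(kept)
            if kept:
                self._events[visitor_id] = kept
            else:
                del self._events[visitor_id]
        return removed
```

In this sketch nearly all calls would be `record`, with `history` invoked only for the few visitors who buy something; a real deployment would rely on the database's own expiry mechanism rather than a manual purge.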


Brady: Excellent. You had mentioned that there were other technologies that you had been using prior to Cassandra. Could you dive in a little bit about those?


Theo: Yes, we have a love-hate relationship with MongoDB. I definitely don’t want to speak ill of Mongo. We still use it for some use cases where it really makes sense. But for a primary use case like ours, where you constantly write lots of data and need to get rid of old data, MongoDB couldn’t actually get rid of the data faster than we could write new data, so we were speeding into this brick wall. We knew it was coming but there was no way to put the brakes on.


Brady: Very interesting. Was Mongo the only one you had looked at prior to Cassandra, or were there any others?


Theo: We definitely used lots of different technologies. MongoDB is probably the only other database product we used heavily. Cassandra is the go-to database for new things right now.


Brady: Could you share with us some insight into what your Cassandra deployment looks like?


Theo: Sure. We have a few quite small clusters. The biggest one is not more than six or seven nodes. We have three of them in production and then about the same amount in our staging environment. We are running on EC2. Everything is running on EC2 so we spread our Cassandra clusters across three availability zones, which has worked really really well with Cassandra.


We currently don’t do any multi-data center replication, but I’m really looking forward to the day when I can push that out because that seems to be a really great feature to have when you need it. I don’t think we really have anything special in terms of infrastructure when it comes to Cassandra. We run it on top of m1.xlarge instances. I think we’ve got two Cassandra 1.1 clusters and one Cassandra 1.2 cluster. We want to move everything up to Cassandra 1.2 because that is just great.

Actually, our whole platform is written in JRuby, and the drivers for Cassandra in the Ruby world have been quite bad, so we’ve always used Java drivers to talk to Cassandra. That kind of works but it always feels a bit dirty.

A few weeks ago I released a new Ruby driver for Cassandra, a pure CQL3 driver, so we want to move everything up to Cassandra 1.2 so we can run on a native Ruby driver from now on. That will be great.


Brady: You had mentioned that you were interested in doing multi-data center replication. In regard to future versions of Cassandra and features that you are looking forward to, what are you most looking forward to? What would you imagine you would like to see in future versions?


Theo: I’m a bit excited about the compare and swap feature that’s coming in 2.0. I’m not really sure that’s something that I will use very much myself, but there are so many questions on Stack Overflow from people who are new to Cassandra, who haven’t really understood the data model yet and are trying to do things that will require an operation like compare and swap.


Currently you also have others who are like ‘No, you really need to change how you think about your data model. You need to change things fundamentally.’

That’s kind of a hard thing to tell someone and also it’s very hard to give them a good, short concise answer of how to actually do that.

I think a simple thing like compare and swap can actually make those things easier for people who are new to Cassandra, who haven’t yet understood the whole data model thing and how to model with it, and who just need to do a few simple things where they really need to be able to atomically change two things at the same time, for example.
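For readers unfamiliar with the operation, a compare and swap writes a new value only if the current value still matches what the caller expects. The sketch below shows that semantic in a single Python process. It is an illustration only: the `Register` class and `claim_username` helper are hypothetical names, and a distributed database has to coordinate the same check across replicas, which this sketch does not attempt.

```python
import threading


class Register:
    """Hypothetical single-process register; not how Cassandra implements the feature."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._value

    def compare_and_set(self, expected, new):
        # The check and the write happen under one lock,
        # so no other writer can slip in between them.
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def claim_username(register, user):
    # Succeeds only for the first caller, while the register is still empty.
    return register.compare_and_set(None, user)
```

The typical newcomer question this covers is "insert this row only if it does not exist yet": two clients may race to claim the same name, and exactly one of them gets `True` back.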


Otherwise, CQL3 has been a great addition to Cassandra. It might be the biggest thing that’s happened since I started using Cassandra. It’s a great step forward; it feels like Cassandra has grown up and has moved from the old way that always felt a little hacky into something that is nicely packaged. There are definitely a few features in the CQL protocol that I, as a driver implementer, would like to see, along with other improvements around CQL.


Brady: You mentioned the data model for Cassandra. It appears to be a really popular topic at Cassandra meet-ups. Are you involved in meet-ups locally?


Theo: We don’t have a very big community specifically around Cassandra. I’m running a Göteborg Distributed Systems Meetup Group and a Göteborg Ruby Meetup Group and help organize a few other meet-up groups as well. There are not very many people in my area who have deep experience in running Cassandra, so we haven’t gotten that far yet.


Maybe we’ll have a meet-up one day where we talk more in depth about data modeling and Cassandra. Right now people are more on the level that they need to hear about the different new databases that are available. People are so stuck in the MySQL/Oracle world, at least here in my area, in my experience.


I am trying to answer as many questions as I can on Stack Overflow, and that is definitely the one big thing people have problems with: data modeling, and thinking about how they can model their problems with Cassandra, because they come from the relational database world where things are very, very different.


Brady: Thanks Theo. Is there anything else that you would like to add?


Theo: No, not more than just plug my Ruby driver. I am hoping to get some more traction. I think that historically Cassandra hasn’t been a big thing in the Ruby community, and the lack of good drivers is probably one big reason for that. I’m hoping that with CQL3 and my drivers maybe we can make some inroads into the Ruby community.


I think CQL3 is something that will make it so much easier for people to get into Cassandra and start using it.


Brady: Anyone can download your CQL3 Ruby driver from Planet Cassandra’s drivers download page.


Theo: Cool.


Brady: All right, Theo, thank you so much for joining us today, and best of luck to you and Burt. I hope to talk to you soon.


Theo: Sure thing. Thanks.