Brady Gentile: Community Manager at DataStax
Ben: At Mollom we are focused on preventing spam and unwanted content on blogs and forums. We largely focus on the Drupal community, but we also have plugins for many other CMSs such as WordPress and Joomla, and we have an open API that people can program against.
Brady: Very cool. How does Apache Cassandra fit into the mix of what you guys do at Mollom?
Ben: One of the big problems that Mollom needed to solve was that we have a massive inflow of information coming into the API. We're handling more than 700 requests per second, which are all pieces of content that we need to analyze. Every piece of content basically gets split up into many different small tokens and things that we need to track. For example, we track reputations for users and we compute spam scores for a certain piece of content, and all of this leads to very high write loads. This was one of the prime reasons we chose Cassandra.
Another really important reason we chose Cassandra is that we need to be highly resilient; we mustn't go down, because 70,000 really big websites depend on us being up. We have several servers running in many different data centers across the globe, and keeping a consistent state in the data store across different data centers is really non-trivial. Cassandra solved this really well for us.
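The write pattern Ben describes, where each incoming piece of content fans out into many small token updates, can be sketched roughly as follows. This is a simplified illustration with invented names and a naive scoring rule, not Mollom's actual pipeline:

```python
import re
from collections import Counter

# Illustration only (assumed names, not Mollom's real system): every piece
# of content becomes many small counter writes, which is what makes the
# workload write-heavy.
ham_counts: Counter = Counter()
spam_counts: Counter = Counter()

def tokenize(text: str) -> list[str]:
    """Split content into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def record(text: str, is_spam: bool) -> int:
    """Update one counter per token; returns the number of writes issued."""
    counts = spam_counts if is_spam else ham_counts
    tokens = tokenize(text)
    counts.update(tokens)
    return len(tokens)

def spam_score(text: str) -> float:
    """Naive score: fraction of tokens seen more often in spam than in ham."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    spammy = sum(1 for t in tokens if spam_counts[t] > ham_counts[t])
    return spammy / len(tokens)
```

Even this toy version shows the shape of the load: one 50-word comment triggers dozens of counter writes, and at 700 requests per second that multiplies quickly.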
Brady: Excellent. You’re fond of multi-data center replication then?
Ben: Yeah. It's one of the really killer features of Cassandra.
Brady: For someone who’s just getting started in multi-data center, would you have any advice for them, any tips or tricks? Maybe something that they shouldn’t do?
Ben: It's really important to understand how Cassandra assigns replicas across the token ring, and then to make sure the replication architecture you plan across the different data centers matches that behavior. If you start using Cassandra from scratch on a small project, everything works really nicely, but if you then need to scale it to a bigger system, you need to understand well how the different replicas are spread across the different servers on the ring, and which data centers those servers are in. You really need to plan ahead for this. That would be the biggest tip.
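Ben's point about replica placement can be made concrete with a toy model of a token ring. This is a rough sketch under simplifying assumptions (made-up tokens and nodes, racks ignored), not Cassandra's exact algorithm, but it shows why the node-to-data-center layout on the ring matters:

```python
from bisect import bisect_right

# Hypothetical six-node ring across two data centers (invented layout,
# not Mollom's). Each entry is (token, node, datacenter), sorted by token.
RING = sorted([
    (0,   "node1", "dc1"), (25,  "node2", "dc2"), (50,  "node3", "dc1"),
    (75,  "node4", "dc2"), (100, "node5", "dc1"), (125, "node6", "dc2"),
])
TOKENS = [t for t, _, _ in RING]

def replicas(partition_token: int, rf_per_dc: dict) -> list:
    """Walk the ring clockwise from the partition's position, taking the
    first rf nodes encountered in each data center (a simplified stand-in
    for NetworkTopologyStrategy-style placement)."""
    start = bisect_right(TOKENS, partition_token) % len(RING)
    chosen, need = [], dict(rf_per_dc)
    for i in range(len(RING)):
        _, node, dc = RING[(start + i) % len(RING)]
        if need.get(dc, 0) > 0:
            chosen.append((node, dc))
            need[dc] -= 1
    return chosen

# Two replicas in each data center for a partition hashing to token 60:
print(replicas(60, {"dc1": 2, "dc2": 2}))
```

Because replicas are picked by walking the ring, interleaving nodes from different data centers (as in this toy layout) is what lets each data center hold a full copy; a ring where one data center's nodes are bunched together behaves very differently under failure.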
Brady: That’s excellent. Are you using any sort of cluster management tools at all, at the moment?
Ben: We use DataStax OpsCenter for monitoring. We use Munin for monitoring all of our different services, and we also have Munin graphing for Cassandra metrics, which is really helpful.
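As an illustration of what a Munin plugin for a Cassandra metric can look like, here is a minimal sketch. The metric choice, field names, and the `nodetool compactionstats` parsing are assumptions, not Mollom's actual plugins; Munin's real protocol is that it runs a plugin with `config` to describe the graph and with no argument to fetch values:

```python
# Minimal Munin plugin sketch for one Cassandra metric (assumed details,
# not Mollom's actual setup).
def parse_pending(compactionstats_output: str) -> int:
    """Pull the pending-task count out of `nodetool compactionstats` text,
    which includes a line like 'pending tasks: N'."""
    for line in compactionstats_output.splitlines():
        if line.lower().startswith("pending tasks"):
            return int(line.split(":")[1].split()[0])
    return 0

def munin_output(pending: int, mode: str = "fetch") -> str:
    """Munin protocol: 'config' describes the graph; otherwise emit
    'field.value N' lines for Munin to record and graph."""
    if mode == "config":
        return ("graph_title Cassandra pending compactions\n"
                "graph_category cassandra\n"
                "pending.label pending tasks")
    return f"pending.value {pending}"

# In a real plugin this string would come from running `nodetool`.
sample = "pending tasks: 7"
print(munin_output(parse_pending(sample)))
```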
Brady: How did you start out with Cassandra?
Ben: We started Mollom about six or seven years ago, and in the early days we had a MySQL datastore with a Java server. Everything went smoothly until we reached about a thousand customers, and then, because of the high write loads, we started evaluating storing everything on solid-state disks, which were a compelling solution at that time. It was still really early for solid state; it was expensive, and we had several hardware failures. Then we started looking for a different storage solution. This was in the early days, so there weren't that many around. We were really looking for one that was optimized for write performance, and we stumbled on Cassandra. This was in the 0.5 days, which was really early, and we went into production on the 0.6 version.
We became experts in Cassandra by just going along with all the different versions and all the different improvements, and learning to work around some of the early bugs. We very quickly saw very big potential in Cassandra, because it outperformed writing files to solid-state disks even when running on normal rotating disks, and replicating across data centers was super easy, even in the earlier versions. This was really one of the biggest reasons we switched to Cassandra.
In the 0.6 days there were still big bugs in Cassandra, but even then we decided to go with it, because it was the right technology for what we needed.
Brady: That's great. It sounds like you guys were very early adopters of Cassandra. That's always awesome to hear, when it worked for people and they've continued to use it after the early days. Would you be able to share some insight into what your deployment looks like?
Ben: We currently run on SoftLayer hosting, where we have dedicated hardware. It's not a cloud solution. We currently run six nodes per data center across three data centers. There is a Cassandra ring, and then we have a bunch of API nodes to do the API handling in the different data centers. It's not a massively big ring; we have a moderately sized data set. It's not petabytes of data but hundreds of terabytes, and it still fits on a couple of machines. Because the load is write-heavy, a limited number of machines works perfectly well with correct sizing of the caches and the heap.
Brady: What would you like to see out of Apache Cassandra in future versions? Are there any features that would help you specifically, or your use case?
Ben: It's still non-trivial to deploy configuration changes or updates to a larger ring. Help here would definitely be nice. In my mind, the different nodes in the ring shouldn't have individual configuration files; there should be more centralized configuration management, where you can describe the ring properties and then things get propagated nicely through the ring.
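A centralized scheme like the one Ben describes could look roughly like this sketch: one shared ring description rendered into per-node settings, rather than a hand-edited configuration file on every node. All names and fields here are hypothetical; in practice teams often approximate this with configuration-management tools such as Chef or Puppet:

```python
# Hypothetical sketch: one central ring spec, rendered per node (invented
# field names, not Cassandra's actual configuration format).
RING_SPEC = {
    "cluster_name": "example_cluster",
    "seeds": ["10.0.1.1", "10.0.2.1"],
    "nodes": [
        {"host": "10.0.1.1", "dc": "dc1"},
        {"host": "10.0.1.2", "dc": "dc1"},
        {"host": "10.0.2.1", "dc": "dc2"},
    ],
}

def render_node_config(spec: dict, host: str) -> str:
    """Derive one node's settings from the shared ring description, so a
    ring-wide change is made once and propagated everywhere."""
    node = next(n for n in spec["nodes"] if n["host"] == host)
    return "\n".join([
        f"cluster_name: {spec['cluster_name']}",
        f"listen_address: {host}",
        f"seeds: {','.join(spec['seeds'])}",
        f"data_center: {node['dc']}",
    ])

print(render_node_config(RING_SPEC, "10.0.1.2"))
```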
Another thing would be built-in tool support to make it easier to set up cross-data-center replication: tools that are really aware of how the replication works and can come up with sensible allocations.
Maybe a final one: it is still non-trivial to get the cache sizes right. There are some internal metrics that help you, like the cache hit rate, but they are still hidden away somewhere in JMX. It would be neat if these things could auto-tune, or if there were better support for figuring them out. I know DataStax has solutions for that.
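The kind of auto-tuning Ben asks for can be sketched very simply. The thresholds and step sizes below are invented for illustration; in a live cluster the hit and miss counts would come from Cassandra's cache metrics exposed over JMX:

```python
# Toy cache-sizing heuristic (invented thresholds, for illustration only).
def cache_hit_rate(hits: int, misses: int) -> float:
    """Fraction of lookups served from the cache."""
    total = hits + misses
    return hits / total if total else 0.0

def suggest_cache_size_mb(current_mb: int, hits: int, misses: int,
                          target_rate: float = 0.9, max_mb: int = 2048) -> int:
    """Double the cache while the hit rate is below target, up to a cap."""
    if cache_hit_rate(hits, misses) < target_rate:
        return min(current_mb * 2, max_mb)
    return current_mb
```

A real tuner would also have to weigh the cache against the JVM heap that Ben mentions, since both compete for the same memory on each node.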
Brady: Okay, and what is your experience with the Apache Cassandra community?
Ben: We found the Cassandra development mailing lists very productive. From the early days, when we found issues and bugs, we posted them on the mailing list, and Jonathan Ellis and these guys would just jump on them and fix them in a couple of days. That was really one of the reasons we chose Cassandra: we saw that there was a real community of people caring for it. It's truly an open source project.
Brady: That’s really good to hear that you had a positive experience with the community. Thanks for joining us today.