Matt Kemp Team Lead at Signal
"The main benefit for us is the automatic replication both intra and inter-region... we’ve spent zero hours ever dealing with Cassandra replication issues. What this means for our clients is that once they have saved data, it’s available to them no matter where they are geographically."
Matt Kemp Team Lead at Signal

Signal is the global leader in real-time, cross-channel marketing technology. The Signal Open Data Platform helps marketers collect data from any offline or online source, synchronize the data across all touchpoints, and distribute it to any marketing or analytics endpoint – all in real time. The platform is ecosystem-neutral and makes data and marketing technologies work better together for increased engagement, loyalty and conversions.

Evaluating CouchDB, MongoDB, and Riak

Before settling on Cassandra a couple of us evaluated it alongside CouchDB, MongoDB and Riak. At that time Cassandra 0.8 was the latest release. For every one of the databases we worked up a trial to see what it would actually look like if we used this database as the backend. It might have been a little ridiculous to have four different backend implementations at one point, but it was extremely useful in our evaluation as we got first hand experience with with each of these technologies.

While we were flexible on the underlying data model, we had some hard constraints around performance, capacity and multi-data center replication. We wanted to make sure if a client updates their data in one region, it propagated to the other in a reasonable amount of time. Out of the box cross-data center replication was the ultimate requirement that led us to Cassandra over the other databases.

We use Cassandra for two very different parts of our infrastructure. The original use case was to power our Fuse platform. Fuse allows our clients to combine and utilize their data across previously silo-ed channels: web, mobile, CRM systems and more. Additionally clients have access to the data no matter where they are located geographically. The second use case was KairosDB plus Cassandra to power our client facing stats. Going forward we’re looking at other potential uses of Cassandra as well.

A changing Cassandra

When we first launched our Cassandra ring in production, version 1.0 had just been released. We upgraded to 1.1 in 2012 and then subsequently to 1.2 at the end of 2013. We made a similar evolution of drivers starting with Hector, then moving to Astyanax, and finally ending up with the Datastax driver.  We also started with Netflix’s Priam for security group management, but moved on to write our own “sidecar” which uses our our cloud-agnostic infrastructure reference API instead of Priam.

Overall we’ve liked the change from the Thrift interface to CQL. It’s definitely lowered the learning curve to get people started with Cassandra. The biggest challenge has been getting people to understand the differences between Cassandra and a traditional relational database both how the data is stored on on disk and how you can query that data. For example that you can’t perform joins and if you want a reverse or secondary index you basically have to build it yourself.

We switched to virtual nodes at the same time we upgraded to Cassandra 1.2. Our conversion process was a little complex, but ultimately went very smoothly. We set up a brand new ring alongside the existing one. We converted our data access service to read from both rings, but only write to the new one. That allowed us to deal with all organic traffic. To expedite this process we ran script that scanned the old ring and forced re-saving the data into the new ring (using our service).

We had some initial confusion about how virtual nodes operated. We thought that virtual nodes would split the whole ring into X slices. Once we understood that each node would take its own X slices of the rings, everything kind of made sense. Now that we’re on virtual nodes, we are really enjoying some of the benefits. For example the ability to scale each region independently based on volume. The only real downside is that it makes running repairs a bit more complicated.

Replication out of the box

We have several Cassandra rings in production. Our largest ring is currently around eighty nodes distributed across four AWS regions. We have a smaller stats ring running KairosDB backed by Cassandra.

All our Cassandra nodes are running on i2.xlarge instances. Initially we started with the m2 instance line with striped disks. Once we switched over to the i2 line we saw just how beneficial running on SSDs is.

The main benefit for us is the automatic replication both intra and inter-region. Besides some work to automatically run repairs via cron, we’ve spent zero hours ever dealing with Cassandra replication issues. What this means for our clients is that once they have saved data, it’s available to them no matter where they are geographically.

A secondary benefit is that we only recently had someone start to manage our Cassandra nodes full time. Prior to that there were four of us working with Cassandra part time doing maintenance or upgrades as necessary.

Getting started

Our advice would be to focus on two items: the data model and a load testing framework.

In some senses data modelling is the easiest piece, but also the most crucial to get correct. Our approach has been to figure out what questions we want to answer and work backwards from there. Each table in our Cassandra schema is designed to answer one (or sometimes more) of those questions. Due to the lack of joins in CQL this often leads to denormalized data in your tables.

The load testing framework isn’t the end goal, but rather a tool to help plan for capacity and performance test changes before they’re introduced to the production ring. If you can figure out want loads you expect including the mix of reads and writes then you can using a load driver to generate appropriate traffic to a test ring. It’s also useful for finding the hockey stick point for the load each Cassandra node can handle. We use JMeter to drive load (via HTTP) at our data access service. From there we can monitor performance of our application as well as Cassandra. We’re able to test schema changes, application query pattern changes, driver upgrades and Cassandra upgrades with the same tool.

We also recommend looking into open source community’s tooling around Cassandra in addition to what Datastax offers. There’s larger projects like Astyanax and Priam from Netflix as well as smaller but still extremely useful projects like Al Toby’s Page Cache Stats.

Want to work on Cassandra at Signal? Check out their careers page.

Follow @twitter