Hi, I’m Will, the CTO of CivicScience. CivicScience is trying to power the world’s opinions.
The essence of our technology is simple: we ask people questions. The individual interactions are short, but we aggregate them over time. As you might imagine, things get interesting once we have large, longitudinal profiles of many people. A fun example is a deep dive on what it means to like different styles of chicken wings or the sauce you choose for those chicken wings. While somewhat silly, it does illustrate the kind of broad and deep analysis that you can get out of our data.
We started in 2008. As many young companies do, we started with a simple database solution (MySQL) and waited to see our pain points. Our first bottleneck was a need to arbitrarily scale out writes.
Arbitrarily scaling out reads was a “solved problem”: you just add caching or read replicas. At that point in time, scaling out writes meant sharding (partitioning), and that was a painful process that added complexity to both the application and the infrastructure. This was the period when distributed, horizontally scalable NoSQL systems were appearing. I looked around for a better approach, and sure enough there were some open-source projects already; Apache Cassandra was one of them.
On the business side of the decision, my company was still quite small, so the open-source model was key. I liked the active community around Cassandra. And having corporate backing (DataStax) gave us a path to buy support if needed.
On the technology side of the decision, the home runs were minimal configuration, linear scaling, and tunable consistency. Homogeneous configuration was not a typical feature of other products at that time, and it made deployment easier for me, as I’m not really a system administrator at heart; I’m more of a programmer. The fact that I didn’t have to adapt my application code or mess with Cassandra settings to get sharding, and thus linear scalability, seemed kind of magical. Lastly, I liked the idea that it has a tunable level of consistency to trade speed vs. reliability.
And last but not least, high availability. It was vital for us to make sure we were rock solid, since we are embedded on other websites.
We transitioned two MySQL tables to Cassandra, and these two still make up the vast majority of our Cassandra use by volume. The first is session tracking, where a session is an interaction with someone answering questions (a non-programmer would have just called a session a “poll” or “survey”). The second is observations (again, a non-programmer would have just said “answers”).
For sessions, I had to use a stronger level of consistency. For observations, eventual consistency was fine, so I elected for lower consistency and higher throughput.
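The replica arithmetic behind that trade-off can be sketched in a few lines. This is only an illustration, not CivicScience's code: the replication factor of 3 is an assumption (the actual value is not stated here), and the function names are hypothetical. The rule it encodes is the standard one: a read is guaranteed to overlap the latest acknowledged write when the replicas read plus the replicas written exceed the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Map a consistency level name to the number of replicas it waits for."""
    counts = {
        "ONE": 1,
        "TWO": 2,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return counts[level]


def reads_see_latest_write(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when every read set must intersect every write set (R + W > RF)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


RF = 3  # assumed replication factor, for illustration only

# Stronger setting: quorum writes and quorum reads always overlap.
print(reads_see_latest_write("QUORUM", "QUORUM", RF))  # True  (2 + 2 > 3)

# Faster setting: a single replica each way is only eventually consistent.
print(reads_see_latest_write("ONE", "ONE", RF))        # False (1 + 1 <= 3)
```

Waiting on fewer replicas means lower latency and higher throughput, at the cost of a window in which a read may return an older value.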
We started small with a 4-node cluster. We’re up to 9 nodes now, running 1.2.x. We usually stay one major version behind until minor release 4 to 6, though obviously we’ve fallen behind, as 2.1 was just released before we managed to get to the 2.0 line! In terms of 1.2, vnodes were a big win for us, massively simplifying token management and server placement.
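To show why single-token management is tedious without vnodes, here is a minimal sketch. It assumes the RandomPartitioner token space of 0 to 2^127 (which partitioner this cluster used is not stated here) and a hypothetical helper name. Each node gets one evenly spaced token, so growing the ring changes the ideal token of almost every existing node.

```python
RING_SIZE = 2 ** 127  # assumed: RandomPartitioner token space


def balanced_tokens(node_count: int) -> list[int]:
    """One evenly spaced token per node, as assigned by hand before vnodes."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


four_nodes = balanced_tokens(4)
five_nodes = balanced_tokens(5)

# Tokens that stay valid after adding a fifth node: only token 0 survives,
# so the other three nodes would need to be moved to rebalance the ring.
unchanged = set(four_nodes) & set(five_nodes)
print(len(unchanged))  # 1
```

With vnodes, each node owns many small token ranges instead of one large one, so a new node can take over ranges from across the ring without anyone recomputing and reassigning tokens by hand.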
We’re in the Amazon cloud, split across 3 availability zones in one region. We’ve recently transitioned everything (including Cassandra) from EC2 Classic to VPC. We’ve also switched Cassandra’s storage from ephemeral disks to SSD-based EBS.
We did use Cassandra’s multi-data center support at one point: one ring for production and one ring for Hadoop. We’ve since moved back down to a single data center. In the background, I transfer the SSTables to S3, where I can access them directly from Hadoop.
Years later, everything I hoped about Cassandra has happily turned out to be true. The mailing list (community) has been a great help on many occasions. I’ve been able to get some extremely helpful consulting support from DataStax. Identical homogeneous nodes were easy to manage. I was able to scale by just adding more nodes. Finally, in terms of uptime, Cassandra has been our most reliable system.
Looking forward, I’m very confident that Cassandra can support our future needs; I don’t have any concerns about scale or logistics.
More than anything, you have to understand the way the tool works and whether that aligns with your use case. Between the mailing list, the documentation, and the ability to look at the source, nothing about how Cassandra works is opaque. If you’re willing to get your hands dirty, you’ll be successful.