“Eventually, we needed to scale beyond a single-master system,
and we chose Cassandra for its linear scalability and relative ease of operation.”
– Josh Dzielak, VP of Engineering, Keen IO
Josh Dzielak VP of Engineering at Keen IO
We are a hosted, API-driven analytics backend-as-a-service. Customers use our modern, developer-friendly API instead of spending time and money building and maintaining their own big data infrastructure. Instead of worrying about storage and scale, our customers focus on exploring data, creating visualizations, running reports, and generally answering the questions that they care about. I like to say ‘we do the hard stuff so you can do the fun stuff’.
Apache Cassandra at Keen IO
Cassandra is the event data store in the most recent version of our architecture. Right now, Cassandra is running in production and supporting nearly all of our biggest customers. The real-time ingestion side of the API collects JSON and stores it in Cassandra. We don’t store the data as raw JSON, but convert it to a much more efficient format that leverages Cassandra’s columnar orientation. We serve ad-hoc queries directly from this optimized format and are able to perform very low latency queries (seconds) over very large datasets (gigabytes).
Linear scalability, low latency + ease of operation
Previously we were using a single-master document store, which lets us develop and operate quickly as we validated our market and developed initial customers. Eventually, we needed to scale beyond a single-master system, and we chose Cassandra for its linear scalability and relative ease of operation. We didn’t go with something like Hadoop or HDFS because they aren’t designed for low latency access (and are also single-master). And while there are other low-latency linearly-scalable distributed data sources out there, we found that Cassandra gave us the most options in terms of modeling the data and tuning consistency.
We also have a high write rate. Given that our API stores critical data for our customers, write availability is priority number one. You can always re-run a query. But you can’t replay a lost event if you don’t have it. Writes are critical, and one of the reasons we chose Cassandra was it’s tunable consistency and mechanisms for achieving eventual consistency. Another reason we chose Cassandra it that its immutable storage format reduces the wear and tear on our SSDs.
Managing their Deployment
We are hosted in SoftLayer. We’ve been with SoftLayer from day one, and we’re satisfied customers. We run on their dedicated hardware. We run hot-hot in two separate data centers, and as a result are both highly available and highly redundant. Our Cassandra nodes are large boxes with 16 cores, 32GB RAM, and 1.6TB or more of directly attached SSD storage.
I’ve been on the mailing list for a while, and see a lot of really good stuff come through there. It’s the first place I go with questions. I also try and keep up with the content that’s out on Planet Cassandra. It’s probably worth mentioning that we use Storm on top of Cassandra to implement complex, distributed queries. There is a healthy overlap between the Storm and Cassandra communities. So occasionally at Storm meetups I’ll end up learning something new about Cassandra.