March 27th, 2014

By GitHub

“The result is a data store that’s extremely stable in production, with predictable latency, growth, and failure behavior when the system is partially down.”

- Derek Greentree, Software Developer at GitHub

 


 

GitHub

My company, GitHub, helps software developers work better together; we host source code repositories and provide tools for builders to collaborate on their projects. I work on the analytics team at GitHub. My role includes the care and feeding of our data analysis pipeline and the user-facing features we build on top of it.

 

Analytics with Cassandra

We are using Cassandra version 2.0.3 as a performant, available, fault-tolerant data store for reporting output from our analytics warehouse. This reporting output is used directly in user-facing features on the site; a small API layer handles retrieving data directly from Cassandra.

One such feature shows GitHub users an overview of HTTP traffic to their repositories, including referrers, unique visitors, top content within the repository, and more.
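As a rough sketch of what an API layer like this can look like (the keyspace, table, and column names below are hypothetical, not GitHub's actual schema), a read path using the DataStax Python driver might be:

```python
from cassandra.cluster import Cluster

# Hypothetical schema: daily traffic rollups, partitioned by repository
# with the day as a clustering column. Hosts are placeholders.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])
session = cluster.connect("analytics")

def repo_traffic(repo_id, start_day, end_day):
    """Fetch per-day visitor and view counts for one repository."""
    rows = session.execute(
        """
        SELECT day, unique_visitors, total_views
        FROM repo_traffic_daily
        WHERE repo_id = %s AND day >= %s AND day <= %s
        """,
        (repo_id, start_day, end_day),
    )
    return [(r.day, r.unique_visitors, r.total_views) for r in rows]
```

With a layout like this, a date-range request for one repository reads a single partition, which is what keeps the latency of these lookups low and predictable.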

 

Getting to Cassandra

We wanted a data store we could use to serve results directly (and with low latency) on our highly trafficked website. We wanted that data store to have an easily understood scaling pattern past a single node and to be tolerant of node failures.

 

Linear scalability, fault tolerance, and per-query consistency are important to us. Cassandra’s scalability allows us to easily reason about our architecture’s future expansion and cost based on current usage and growth. Cassandra’s fault tolerance features, including tunable consistency per query, allow us to practice many different failure scenarios internally and harden the service to them.
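To make "tunable consistency per query" concrete, here is a minimal sketch using the DataStax Python driver (the table and values are illustrative): each statement carries its own consistency level, so a write can require a majority of replicas to acknowledge while a latency-sensitive read is served by a single replica.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])
session = cluster.connect("analytics")

# Write at QUORUM: succeeds only if a majority of replicas acknowledge.
insert = SimpleStatement(
    "INSERT INTO repo_traffic_daily (repo_id, day, total_views) "
    "VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, (42, "2014-03-27", 1000))

# Read at ONE: lowest latency, and still answerable when only a
# single replica for the row is reachable.
select = SimpleStatement(
    "SELECT total_views FROM repo_traffic_daily "
    "WHERE repo_id = %s AND day = %s",
    consistency_level=ConsistencyLevel.ONE,
)
rows = list(session.execute(select, (42, "2014-03-27")))
```

Dialing these levels up or down per statement is what lets you trade consistency against availability for each individual query rather than for the whole cluster.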

 

The result is a data store that’s extremely stable in production, with predictable latency, growth, and failure behavior when the system is partially down.

 

GitHub’s deployment

Our deployment is relatively small: 4 large nodes on Amazon EC2 (cc2.8xlarge). Currently, they are all in one availability zone (us-east-1a) to take advantage of higher interconnect speeds between nodes, although we previously experimented with nodes spread across different regions to tolerate an entire AWS region being offline.

 

While our Cassandra deployment isn’t massive in scale or novel in terms of its use, I’d like to reiterate that, as a data store, it’s been stable and performed with reliable latency for several months without much attention. As a developer and administrator, I’ve been extremely pleased with its ongoing behavior.

 

Planning for disaster

If you’re planning a Cassandra deployment, make sure you include some failure testing where you down one or more nodes at a time while reading and writing to the cluster, or otherwise introduce node failures or network partitions.
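One simple way to run such a drill (a sketch under assumed details: the host addresses and the failure_drill table are hypothetical, and the comments assume a replication factor of 3) is to keep a read/write loop running at your production consistency level while you stop Cassandra on one node at a time, then watch whether error rates and latency stay within tolerance:

```python
import time
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])
session = cluster.connect("analytics")

write = SimpleStatement(
    "INSERT INTO failure_drill (id, ts) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
read = SimpleStatement(
    "SELECT ts FROM failure_drill WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)

errors = 0
for i in range(100000):
    try:
        session.execute(write, (i % 100, time.time()))
        session.execute(read, (i % 100,))
    except Exception:
        # With one of four nodes down and (assumed) replication factor 3,
        # QUORUM operations should still succeed; count anything else.
        errors += 1
    time.sleep(0.01)

print("errors:", errors)
```

Stopping the Cassandra process on a node covers the clean-shutdown case; dropping traffic between nodes with firewall rules is one way to approximate the network-partition case as well.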

 

Cassandra community

Positive! I’ve been able to easily find answers to my questions on mailing lists or via the large volume of articles and blog posts about C*. My impression is that the C* community is large, active, and friendly.

 
