May 18th, 2015

Austin Ouyang, Program Director at Insight Data Engineering
I am an electrical engineer by training and have bounced from hardware/software integration to biomedical research and now into data engineering. I currently help academics and others in industry make their transition into the data engineering field. Improving my ability to help others understand complex concepts has always been a passion of mine, and I strive to continually expand my knowledge in the data engineering field.

David Drummond, Program Director at Insight Data Engineering
I am a passionate learner of anything and everything! As a physicist, I have pushed myself to understand how the natural world works, using that knowledge to design new technologies in quantum computing. As a data engineer, I love to learn new technologies and tools for taming large data sets, enabling me to build awesome products with the ever-increasing wealth of information available. I also have a passion for helping others learn and communicating difficult concepts in a clear, concise way.

The Technology Highlight series covers the leading open source tools learned in our seven-week Data Engineering Fellowship. In this post, Austin Ouyang – an Insight alum and Program Director – discusses how he and recent Fellows used the NoSQL database Apache Cassandra.

This blog post is also being published here on the Insight Data Engineering blog.

Cassandra has become one of the most popular NoSQL databases in use today at companies such as Netflix, Instagram, and many more. Originally designed at Facebook, Cassandra combines the distribution design of Amazon’s highly available Dynamo with Google’s BigTable data model. It is a highly available and decentralized database with tunable read and write consistency. All nodes communicate with each other through a Gossip protocol similar to Dynamo and Riak, exchanging information about themselves and about the other nodes they have gossiped with. Cassandra can also be deployed across multiple data centers, which can act as backups or reduce latency across regions.

Getting Started with Cassandra

Almost all of the incoming Fellows had little experience with NoSQL databases, but were able to quickly pick up Cassandra. They got started by spinning up clusters on Amazon Web Services (AWS) using the preconfigured DataStax Amazon Machine Image, which allowed Fellows to quickly begin using Cassandra “out of the box”. Using a SQL-like language known as the Cassandra Query Language (CQL), we were able to learn and test queries immediately using the CQL shell.
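For instance, connecting to any node in the cluster is enough to start exploring (the address below is illustrative):

```bash
# open the CQL shell against one of the cluster's nodes
cqlsh 10.0.0.1

# then, inside the shell, list the keyspaces that exist on the cluster
DESCRIBE KEYSPACES;
```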

We also got a great introduction to NoSQL and Cassandra internals from Insight alumnus Mike Grinolds and from Patricio Echague, Director of Engineering at RelateIQ and a committer to Hector – the Java client for Cassandra. Echague also discussed data modeling for Cassandra, which requires a different approach than relational databases like MySQL or PostgreSQL.

Patricio Echague gave an excellent introduction to NoSQL and Cassandra

Unlike relational models, where the schema and tables are designed to represent objects and their relationships, data in NoSQL should be modeled around the specific questions the end-user will ask. This often means keeping multiple tables with the same data, but the redundancy of this denormalized approach is usually worth the better performance on massive volumes of data. While Cassandra doesn’t support ad-hoc data manipulation with operations like JOIN, SUM, and MAX, it can handle a higher throughput of data if you design the data model specifically for the desired queries.

Cassandra’s data model, derived from Google’s BigTable and similar to HBase’s, is a partitioned row store organized into tables (formerly known as column families), where rows are sets of key-value pairs. Each table has a primary key that uniquely identifies each row of data. The primary key can be composed of multiple columns, resulting in a compound primary key. The first part of the primary key is always the partition key, whose consistent hash determines where each row of data will reside in the cluster of nodes. The partition key can itself be composed of multiple columns (known as a composite partition key), which can mitigate scenarios where specific nodes end up with significantly more data than others in the cluster. The columns in the primary key that follow the partition key are known as clustering columns: sorted keys that map to values. Since each row can contain up to 2 billion columns and the values can be sorted by clustering column types like timestamp, Cassandra is great for storing time series data.

Cassandra for Messaging Analytics

While this approach to designing tables was initially difficult (especially when coming from a background in relational databases), the best way to understand the data model is to explore an example use case. The first step when creating a table is to determine the type of queries a user would make. For example, let’s suppose the management team of a messaging app would like to request the total number of messages sent on a specific date across all US counties. They would also like to see how the number of messages sent each day has progressed this year for a specific county.

Example of a cluster with a composite partition key

The above diagram outlines the keyspace design and how data resides on each node for the messaging app example. All tables must reside within a keyspace (analogous to a SQL schema), which defines whether the data will reside in multiple datacenters as well as the replication factor for its tables. Our table consists of columns labeled state, county, date, and count. In this example, both the state and county have been assigned to the partition key. One reason for this design might be that a majority of messages come from California and New York; if only the state were chosen as the partition key, the nodes responsible for these states would have a much higher workload than the others in the cluster. The diagram shows how each partition key is hashed to one of the five nodes in this Cassandra cluster (the replication factor is set to one for simplicity).
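In CQL, this design might look like the following sketch (the keyspace and table names, messaging and messages_by_county, are illustrative, as is the choice of a timestamp for the date column):

```sql
CREATE KEYSPACE messaging
  WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': 1 };

-- (state, county) forms the composite partition key; date is a clustering column
CREATE TABLE messaging.messages_by_county (
    state  text,
    county text,
    date   timestamp,
    count  int,
    PRIMARY KEY ((state, county), date)
);
```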

Example of a compound partition key with sorted clustering columns

Data for each partition residing on the same node will be clustered by the date column, as shown in the second diagram. The dates are also kept sorted, which makes range seeks extremely efficient. With this design, both queries from the example can be served easily. The first, the total number of messages sent on a specific date across all US counties, touches every county at once; since date is not part of the partition key above, the idiomatic approach is a second, denormalized table partitioned by date, echoing the multiple-tables point from earlier.
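A sketch of that companion table and the first query (messages_by_date is an illustrative name):

```sql
-- companion table: one partition per day holds every county's count
CREATE TABLE messaging.messages_by_date (
    date   timestamp,
    state  text,
    county text,
    count  int,
    PRIMARY KEY (date, state, county)
);

SELECT state, county, count
FROM messaging.messages_by_date
WHERE date = '2015-05-18';
```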

The second, the number of messages sent each day this year for a specific county, runs directly against the table defined above.
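A sketch (the state and county values are illustrative):

```sql
-- the partition key is fully specified, so the range scan over the
-- sorted date clustering column is efficient
SELECT date, count
FROM messaging.messages_by_county
WHERE state = 'CA' AND county = 'San Francisco'
  AND date >= '2015-01-01';
```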

Reads and Writes Anywhere with Multiple Datacenters

Since Cassandra uses a decentralized approach, it works well for use cases that require multiple datacenters in different locations. This was one of the most interesting features for Guang, one of our recent Fellows who is now a data engineer at a stealth startup. He built a data pipeline to analyze game data from the real-time strategy game Starcraft II, which is played worldwide. If his data needed to be quickly accessible from several locations, Cassandra could easily scale to multiple datacenters with flexible configurations. For example, a piece of data can be stored in two replicas on the east coast and three replicas on the west coast using:
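A sketch of such a keyspace definition, assuming the cluster’s snitch reports datacenters named us-east and us-west:

```sql
CREATE KEYSPACE starcraft
  WITH REPLICATION = {
    'class'  : 'NetworkTopologyStrategy',
    'us-east': 2,  -- two replicas in the east coast datacenter
    'us-west': 3   -- three replicas in the west coast datacenter
  };
```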

Moreover, Cassandra has built-in support for AWS datacenters (which we use during the Insight program) that optimizes the Gossip protocol for the best read and write performance. This ability to easily and flexibly sync data geographically is an important use case for national and international companies like Netflix and Spotify.

Multi-datacenter clusters for geographically separate data
Tunable Consistency for Each Query

In the language of Brewer’s CAP theorem – which states that a partition-tolerant distributed system can be either consistent or available, but not both – Cassandra is considered available and partition-tolerant (AP). Since there is no ‘master’ node to coordinate reads and writes, data is always available on one of the nodes, but it’s possible a read could be out of date for about a second before the other nodes gossip about new updates. However, one of the main differences from other NoSQL databases is Cassandra’s ability to tune the level of consistency for each query.

For example, another recent Fellow, Silvia – who is now a data engineer at Silicon Valley Data Science – built a platform for a fantasy football service. As a consumer-facing product that needs to avoid downtime, she chose Cassandra, where she can easily check a player’s points with a simple query:
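A sketch of such a lookup (the keyspace, table, and column names are illustrative):

```sql
SELECT points
FROM fantasy.player_stats
WHERE player_name = 'Tom Brady';
```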

At the same time, she wanted to implement a feature that allowed users to make real-time player trades during games. For this feature, it’s important that different users see consistent data – it would be unfair if one user had more recent data than another. Fortunately, she can tune the consistency for this use case by raising the consistency level of the query (set with the CONSISTENCY command in the CQL shell, or per statement in the drivers):
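A sketch in the CQL shell, reusing the illustrative table from above:

```sql
-- require a majority of replicas to respond for subsequent queries
CONSISTENCY QUORUM;

SELECT points
FROM fantasy.player_stats
WHERE player_name = 'Tom Brady';
```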

which reads the data from multiple nodes and only returns the result once a quorum (i.e. a majority) of the responsible replicas has responded, returning the most recent of their values. The consistency of both reads and writes can be controlled for each query: from contacting a single node, to requiring all the replicas to respond, with many options in between.

Cassandra with Spark for Ad-hoc Analysis

Cassandra offers only limited support for exploratory operations such as aggregates on grouped data or joins on selected columns – the ideal use case is to know the desired queries before the data is written. When this post-write processing was desired, many Fellows used Apache Spark’s in-memory processing framework to easily extract and transform data with the spark-cassandra connector developed by DataStax. The spark-cassandra connector and its dependencies need to be built into a JAR file with sbt, which can then be included when starting the spark-shell REPL. The following example shows how we can read a Cassandra table into Spark, compute averages for each specified group using Spark’s newly implemented DataFrames, and write the new table back into Cassandra.

Suppose we begin with a Cassandra table, by_county_month, which represents the total number of messages sent in each US county for each month from the previous example.

Messaging data by county and month
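A sketch of what this input schema might look like (the month encoding is an assumption):

```sql
CREATE TABLE messaging.by_county_month (
    state  text,
    county text,
    month  int,  -- e.g. 1 through 12 for the current year
    count  int,
    PRIMARY KEY ((state, county), month)
);
```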

Let’s say we would like to calculate the average number of messages sent per month for each county. First, we will need to create an output Cassandra table to hold the computed averages.
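For example (by_county_avg is an illustrative name):

```sql
CREATE TABLE messaging.by_county_avg (
    state     text,
    county    text,
    avg_count double,
    PRIMARY KEY ((state, county))
);
```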

We can then start the spark-shell REPL from the terminal and include the spark-cassandra connector jar file.
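Something along these lines, where the assembly JAR was built with sbt and the version matches the installed Spark release (the file name and host are illustrative):

```bash
spark-shell \
  --jars spark-cassandra-connector-assembly-1.3.0.jar \
  --conf spark.cassandra.connection.host=10.0.0.1
```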

The following shows how to pull a table from Cassandra into Spark as a Resilient Distributed Dataset (RDD) composed of case classes, convert the RDD into a DataFrame for aggregate computations, and write the results back into a Cassandra table.
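A sketch of that flow, assuming the schemas above and the Spark 1.3-era DataFrame API:

```scala
import com.datastax.spark.connector._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.avg

// case class mirroring the by_county_month schema
case class CountyMonth(state: String, county: String, month: Int, count: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// pull the Cassandra table into Spark as an RDD of case classes
val countsRdd = sc.cassandraTable[CountyMonth]("messaging", "by_county_month")

// convert the RDD into a DataFrame and average the monthly counts per county
val avgDf = countsRdd.toDF()
  .groupBy("state", "county")
  .agg(avg("count").as("avg_count"))

// write the aggregated rows back into the output Cassandra table
avgDf.rdd
  .map(row => (row.getString(0), row.getString(1), row.getDouble(2)))
  .saveToCassandra("messaging", "by_county_avg",
    SomeColumns("state", "county", "avg_count"))
```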

Finally, we can view our results in the CQL shell with:
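For instance, selecting the averages along with their remaining time to live:

```sql
SELECT state, county, avg_count, TTL(avg_count)
FROM messaging.by_county_avg;
```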

Messaging by county averaged over months with TTL

Writing this aggregate operation is extremely straightforward with Spark’s new DataFrame feature, which is built on Spark SQL. The example also relies on Cassandra’s native support for time to live (TTL), which allows columns of data to expire automatically. This feature was useful for another Fellow, Shafi – now at Yelp – who used TTL to keep a rolling count of the top trending stocks on Twitter over the last ten minutes. A new feature in the spark-cassandra connector allows users to set a TTL option on the data written into a Cassandra table.
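A sketch of that option, reusing the write from above with a ten-minute TTL (the ttl setting is part of the connector’s WriteConf write configuration):

```scala
import com.datastax.spark.connector.writer.{TTLOption, WriteConf}

// same write as before, but each row now expires after 600 seconds
avgDf.rdd
  .map(row => (row.getString(0), row.getString(1), row.getDouble(2)))
  .saveToCassandra("messaging", "by_county_avg",
    SomeColumns("state", "county", "avg_count"),
    writeConf = WriteConf(ttl = TTLOption.constant(600)))
```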

Looking forward

Cassandra is one of the many NoSQL databases in use today, and working with it helped Fellows better understand some of the key differences from relational databases. Fellows enjoyed learning to choose an appropriate NoSQL database based on their queries and to weigh the tradeoffs of other NoSQL options. Fellows will keep pushing the envelope and continue exploring various technologies for different use cases, but Cassandra will likely remain a popular choice as a highly available, decentralized database.

Find out more about the Insight Data Engineering Fellows Program and apply for our next sessions in Silicon Valley and New York by June 29th.