August 26th, 2014

The State of the Art

The state of the art in big data analysis has changed over the last couple of years.  Hadoop has dominated the space for many years now, but there are a number of new tools available that are worth a look, whether you’re new to the analysis game or have been using Hadoop for a long time.

So one of the first questions you might ask is, “What’s wrong with Hadoop?”

Well, for starters, it’s 11 years old–hard to believe, I know. Some of you are thinking, wow, I’m late to the game. To those people I would say, this is good news for you, because you’re unencumbered by any existing technology investment.

Hadoop was groundbreaking at its introduction, but by today’s standards it’s actually pretty slow and inefficient. It has several shortcomings:

  1. Everything gets written to disk, including all the interim steps.

  2. In many cases you need a chain of jobs to perform your analysis, making #1 even worse.

  3. Writing MapReduce code sucks, because the API is rudimentary, hard to test, and easy to screw up. Tools like Cascading, Pig, Hive, etc., make this easier, but that’s just more evidence that the core API is fundamentally flawed.

  4. It requires lots of code to perform even the simplest of tasks.

  5. The amount of boilerplate is crazy.

  6. It doesn’t do anything out of the box. There’s a good bit of configuration and far too many processes to run just to get a simple single-node installation working.

A New Set of Tools

Fortunately we are seeing a new crop of tools designed to solve many of these problems, like Apache Drill, Cloudera’s Impala, proprietary tools (Splunk, etc.), Spark, and Shark. But the Hadoop ecosystem is already suffering from an overabundance of bolt-on components to do this and that, so it’s easy for the really revolutionary stuff to get lost in the shuffle.

Let’s quickly break these options down to see what they offer.

Cloudera’s Impala and Apache Drill are both massively parallel query-processing tools designed to run against your existing Hadoop data. Basically they give you fast SQL-style queries for HDFS and HBase, with the theoretical possibility of integrating with other InputFormats, although I’m not aware of anyone having done this successfully with Cassandra.

There are also a number of vendors that offer a variety of analysis tools, which are mostly cloud-based solutions that require you to send them your data, at which point you’re able to run queries using their UIs and/or APIs.

Spark, together with its SQL query API, provides a general-purpose in-memory distributed analysis framework.  You deploy it as a cluster and submit jobs to it, very much like you would with Hadoop. Shark is essentially Hive on Spark: it uses your existing Hive metastore, but it executes the queries on your Spark cluster.  At Spark Summit this year, it was announced that Shark will be replaced by Spark SQL.

… And the Winner Is?

So which tool or tools should you use? Well, you don’t really have to choose only one. Most of the major Hadoop vendors have embraced Spark, as has DataStax, and it makes an ideal Hadoop replacement. Or if you haven’t started using Hadoop, at this point it makes sense to simply redirect your efforts towards Spark instead.

There’s been a lot of investment in Hadoop connectors, called InputFormats, and these can all be leveraged in Spark. So, for example, you can still use your Mongo-Hadoop connector inside Spark, even though it was designed for Hadoop. But you don’t have to install Hadoop at all, unless you decide to use HDFS as Spark’s distributed file system. If you don’t, you can choose one of the other DFS’s that it supports.

Unlike Hadoop, Spark supports both batch and streaming analysis, meaning you can use a single framework for your batch processing as well as your near real time use cases. And Spark introduces a fantastic functional programming model, which is arguably better suited for data analysis than Hadoop’s Map/Reduce API.  I’ll show you the difference in a minute.

Perhaps most importantly, since this is a Cassandra blog, I assume you’re all interested in how well this works with Cassandra. The good news is you don’t even have to use the Hadoop InputFormat, as DataStax has built an excellent direct Spark driver. And yes, it’s open source!

Spark vs. Hadoop

What is Spark?

Spark is:

  • An in-memory cluster computing framework

  • Built in Scala, which means it runs on the JVM

  • 10-100 times faster than Hadoop Map/Reduce, primarily because it keeps data in memory and avoids writing intermediate results to disk

  • A lot like the Scala collections API, except you’re working with large distributed datasets

  • Batch and stream processing in one framework, using a common API

Spark has native language bindings for Scala, Python, and Java, and there is a separate project called SparkR designed to enable R on Spark. It also offers some interesting additional features, including a native graph processing library called GraphX and a machine learning library called MLlib (think Mahout, but much better).

Also, Spark SQL gives you the ability to run SQL queries over any Spark collection, so there’s no additional bolt-on tool or query framework required. That’s right, any collection! SQL queries on Cassandra tables, or log data, or text files–like Hive without needing an extra component.

One of my favorite features of Spark is the ability to join datasets across multiple disparate data sources. Imagine doing joins across Cassandra, Mongo, and HDFS data in a single job.  It’s quite powerful.  Oh, and did I mention you can write SQL across that joined data?
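To make that concrete, here’s a rough sketch of a cross-source join, assuming an existing SparkContext called sc, the DataStax connector on the classpath, and a hypothetical test.users table in Cassandra plus tab-delimited click logs in HDFS:

    import org.apache.spark.SparkContext._          // pair-RDD operations like join
    import com.datastax.spark.connector._           // adds cassandraTable to the SparkContext

    // (user_id, name) pairs from a hypothetical Cassandra table
    val users = sc.cassandraTable("test", "users")
      .map(row => (row.getLong("user_id"), row.getString("name")))

    // (user_id, url) pairs parsed from tab-delimited files in HDFS
    val clicks = sc.textFile("hdfs://namenode:8020/data/clicks")
      .map(_.split("\t"))
      .map(fields => (fields(0).toLong, fields(1)))

    // A standard pair-RDD join: (user_id, (name, url))
    val userClicks = users.join(clicks)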


Spark + Cassandra 

DataStax has recently open sourced a fantastic Spark driver that’s available on their GitHub repo. It gets rid of all the job config cruft that early adopters of Spark on Cassandra previously had to deal with, as there’s no longer a need to go through the Spark Hadoop API.

It supports server-side filters (basically WHERE clauses) that allow Cassandra to filter the data prior to pulling it in for analysis.  Obviously this can save a huge amount of processing time and overhead, since Spark never has to look at data you don’t care about.  Additionally, it’s token aware, meaning it knows where your data lives on the cluster and will try to load data locally if possible to avoid network overhead.



So what does this all look like on a physical cluster?  Well, let’s start by looking at an analysis data center running Cassandra with Hadoop, since many people are running this today.  You need a Hadoop master with the NameNode, SecondaryNameNode, and JobTracker, then your Cassandra ring with co-located DataNodes and TaskTrackers.  This is the canonical setup for Hadoop and Cassandra running together.


Spark has a similar topology.  Assuming you’re running HDFS, you’ll still have your Hadoop NameNode and SecondaryNameNode, and DataNodes running on each Cassandra node.  But you’ll replace your JobTracker with a Spark Master, and your TaskTrackers with Spark Workers.



Building a Spark Application

Let’s take a high-level look at what it looks like to actually build a Spark application. To start, you’ll configure your app by creating a SparkConf, then use it to instantiate your SparkContext:
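A minimal sketch of that setup might look like the following; the master URL, application name, and Cassandra host are all placeholders you’d replace with your own values:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf(true)
      .setMaster("spark://sparkmaster:7077")
      .setAppName("cassandra-analytics-demo")
      .set("spark.cassandra.connection.host", "cassandraseed")

    val sc = new SparkContext(conf)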

Next, you’ll pull in some data, in this case using the DataStax driver’s cassandraTable method. For this example, we’ll assume you have a simple table with a Long for a key and a single column of type String:
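Here’s roughly what that looks like, assuming a hypothetical test.kv table created as (key bigint PRIMARY KEY, value text):

    import com.datastax.spark.connector._   // brings cassandraTable into scope

    // Each element is a CassandraRow; pull out typed values as needed
    val rdd = sc.cassandraTable("test", "kv")
    val pairs = rdd.map(row => (row.getLong("key"), row.getString("value")))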

This gives us an RDD, or Resilient Distributed Dataset. The RDD is at the core of Spark’s functionality: it’s a distributed collection of items, and it underpins Spark’s fault tolerance, because any partition can be recomputed in the case of slow or failed nodes. Spark does this automatically!

You can perform two types of operations on an RDD:

  • Transformations are lazy operations that result in another RDD. They don’t require actual materialization of the data into memory. Operations like filter, map, flatMap, distinct, groupBy, etc., are all transformations, and therefore lazy.

  • Actions are immediately evaluated, and therefore do materialize the data. You can tell an action from a transformation because it doesn’t return an RDD. Operations like count, fold, and reduce are actions, and therefore NOT lazy.

 An example:
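Building on the hypothetical test.kv RDD from above, a short illustration of the difference might look like this:

    // Transformations are lazy: nothing executes when these lines run
    val lengths = pairs
      .map { case (key, value) => value.length }  // transformation
      .filter(_ > 10)                             // transformation

    // count is an action, so this line triggers the actual computation
    val total = lengths.count()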

Saving Data to Cassandra

Writing data into a Cassandra table is very straightforward. First, you generate an RDD of tuples that matches the CQL row schema you want to write, then you call saveToCassandra on that RDD:
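A small sketch, again assuming the hypothetical test.kv table from earlier:

    import com.datastax.spark.connector._

    // Tuples matching the (key bigint, value text) schema of test.kv
    val rows = sc.parallelize(Seq((1L, "first"), (2L, "second"), (3L, "third")))

    rows.saveToCassandra("test", "kv", SomeColumns("key", "value"))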


Running SQL Queries

To run SQL queries (using Spark 1.0 or later), you’ll need a case class to represent your schema. After that, you’ll register the RDD as a table, then you can run SQL statements against the table, as follows:
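Here’s a sketch against the hypothetical test.kv table. Note that early Spark 1.0 releases call the registration method registerAsTable, which later versions rename to registerTempTable:

    import org.apache.spark.sql.SQLContext
    import com.datastax.spark.connector._

    // The case class describes the schema of the rows we want to query
    case class KeyValue(key: Long, value: String)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit conversion from RDD to SchemaRDD

    val kv = sc.cassandraTable("test", "kv")
      .map(row => KeyValue(row.getLong("key"), row.getString("value")))

    kv.registerAsTable("kv")   // later Spark releases rename this to registerTempTable

    // The result is itself an RDD of rows
    val results = sqlContext.sql("SELECT key, value FROM kv WHERE key > 100")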

The result is an RDD, just like the filter call produced in the earlier example.


Server-side Filters

There are two ways to filter data on the server: using a select call to reduce the number of columns and using a where call to filter CQL rows. Here’s an example that does both:
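For instance, against a hypothetical test.events time-series table:

    // Hypothetical schema:
    //   CREATE TABLE test.events (series text, ts timestamp, payload text,
    //                             PRIMARY KEY (series, ts))
    val recent = sc.cassandraTable("test", "events")
      .select("series", "ts")            // only fetch the columns we need
      .where("ts > ?", "2014-08-01")     // CQL predicate evaluated by Cassandra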

As a result, you have less data being ingested by Spark, and therefore less work to do in your analysis job. This is a good idea, as you should always filter early and often.


Spark Streaming

With the Spark Streaming Context we can now process streams just like any other data. Spark slices up the stream and produces RDDs on a defined interval. Because it simply produces an RDD, you can perform the same operations on the stream source that you would on any other RDD.

Streams themselves can come from a variety of sources. There are built-in components to pull from sources like Flume, a Kafka queue, Twitter, etc. Or you can build your own source pretty easily. Take a look at the streaming package for API details.
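As a small illustration, here’s a sketch that uses one of the simplest built-in sources, a TCP socket, with a five-second batch interval; the host and port are placeholders:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations

    // Reuse the existing SparkContext and emit an RDD every 5 seconds
    val ssc = new StreamingContext(sc, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999)

    // The familiar RDD-style operations apply to each batch
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()

    ssc.start()
    ssc.awaitTermination()

Under the hood each five-second batch is just an RDD, which is why the same operations work unchanged.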


Spark in the Real World

I have shown you some simple examples that demonstrate how easy it is to get started with Spark, but you might be wondering how you structure your application for real-world task distribution.

For starters, RDD operations are similar to the Scala collections API, but …

  • Your choice of operations and the order in which they are applied is critical to performance.

  • You must organize your processes with task distribution and memory in mind.

The first thing to do is determine whether your data is partitioned appropriately. A partition in this context is merely a block of data. If possible, partition your data before Spark even ingests it. If this is not practical or possible, you may choose to repartition the data immediately following the load. You can repartition to increase the number of partitions, or coalesce to reduce the number of partitions.

The number of partitions should, as a lower bound, be at least 2x the number of cores that are going to operate on the data. Having said that, you will also want to ensure any task you perform takes at least 100ms to justify the distribution across the network. Note that a repartition will always cause a shuffle, whereas coalesce typically won’t. If you’ve worked with MapReduce, you know shuffling is what takes most of the time in a real job.
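A quick sketch of both operations; the HDFS path and partition counts are made-up numbers for, say, a 16-core cluster:

    // A source that arrives with too few partitions
    val raw = sc.textFile("hdfs://namenode:8020/data/events")

    // Spread the data over at least 2x the available cores (this forces a shuffle)
    val spread = raw.repartition(32)

    // Later, shrink the partition count without a full shuffle
    val compacted = spread.coalesce(8)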

As I stated earlier, filter early and often. Assuming the data source is not preprocessed for reduction, the earliest and best place to reduce the amount of data Spark will need to process is the initial data query. This is often achieved by adding a where clause. Do not bring in any data not necessary to obtain your target result. Bringing in any extra data will affect how much data may be shuffled across the network and written to disk. Moving data around unnecessarily is a real killer and should be avoided at all costs.

At each step you should look for opportunities to filter, distinct, reduce, or aggregate the data as much as possible before proceeding to the next operation.

Use pipelines as much as possible. Pipelines are a series of transformations that represent independent operations on a piece of data and do not require a reorganization of the data as a whole (a shuffle). For example, a map from string -> string length is independent, whereas a sort by value requires a comparison against other data elements and a reorganization of data across the network (a shuffle).

In jobs that require a shuffle, see if you can employ partial aggregation or reduction before the shuffle step (similar to a combiner in MapReduce). This will reduce data movement during the shuffle phase.
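For example, reduceByKey does exactly this kind of map-side combine, whereas groupByKey ships every value across the network before aggregating; the data layout here is hypothetical:

    import org.apache.spark.SparkContext._   // pair-RDD operations

    // (key, 1) pairs from hypothetical tab-delimited event logs
    val events = sc.textFile("hdfs://namenode:8020/data/events")
      .map(_.split("\t"))
      .map(fields => (fields(0), 1))

    // Combines values within each partition before the shuffle (cheap)
    val counted = events.reduceByKey(_ + _)

    // Ships every single value across the network before summing (expensive)
    val grouped = events.groupByKey().mapValues(_.sum)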

Some common tasks that are costly and require a shuffle are sorts, group by key, and reduce by key. These operations require the data to be compared against other data elements, which is expensive. It is important to learn the Spark API well in order to choose the best combination of transformations and where to position them in your job. Create the simplest and most efficient algorithm necessary to answer the question.

Good luck, and happy coding!

Get the Apache Spark + Cassandra OSS Connector at GitHub

View more information about Apache Spark + Cassandra integration on Planet Cassandra:

“The New Analytics Toolbox with Apache Spark — Going Beyond Hadoop” was created by Robbie Strickland, Software Development Manager, and Matt Kew, Internet Application Developer, both at The Weather Channel.