May 14th, 2015

Brian O’Neill, Chief Technology Officer at Health Market Science
Brian is Chief Technology Officer at Health Market Science (HMS), where he heads development of their data management and analytics platform, powered by Storm and Cassandra.  Brian won InfoWorld’s Technology Leadership award in 2013 and authored Storm: Blueprints for Realtime Computation.  He holds a number of patents and a B.S. in C.S. from Brown University.

Spark SQL is awesome.  It allows you to query any Resilient Distributed Dataset (RDD) using SQL, including data stored in Cassandra!

The first thing to do is to create a SQLContext from your SparkContext.  I’m using Java, so…
(sorry, I’m still not hip enough for Scala)
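With the Spark 1.3+ Java API, that looks roughly like the sketch below; the application name and the Cassandra host setting are placeholders for whatever your cluster uses.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SQLContext;

    SparkConf conf = new SparkConf()
            .setAppName("spark-sql-cassandra-example")              // placeholder app name
            .set("spark.cassandra.connection.host", "localhost");   // your Cassandra host
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Wrap the SparkContext so we can register tables and run SQL against them.
    SQLContext sqlContext = new SQLContext(sc);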

Now you have a SQLContext, but you have no data.  Go ahead and create an RDD, just like you would in regular Spark:
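Here is a sketch along those lines, reading Product beans straight from Cassandra.  The keyspace and table names are illustrative, and Product is assumed to be a plain JavaBean with getters and setters.

    import org.apache.spark.api.java.JavaRDD;
    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
    import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;

    // Read the Cassandra table into an RDD of Product beans and cache it,
    // so the warm-up count further down keeps the data in memory.
    JavaRDD<Product> productsRDD = javaFunctions(sc)
            .cassandraTable("test_keyspace", "products", mapRowTo(Product.class))
            .cache();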

(The example above comes from the spark-on-cassandra-quickstart project, as described in my previous post.)

Now that we have a plain vanilla RDD,  we need to spice it up with a schema, and let the sqlContext know about it.  We can do that with the following lines:
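Here is one way to do it, sketched with the DataFrame API from Spark 1.3+ (earlier releases exposed applySchema and SchemaRDD instead); "products" is just the table name I picked.

    import org.apache.spark.sql.DataFrame;

    // Infer the schema by reflecting over the Product JavaBean,
    // then register the result so SQL can reference it by name.
    DataFrame productsDF = sqlContext.createDataFrame(productsRDD, Product.class);
    productsDF.registerTempTable("products");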

Shazam.  Now your sqlContext is ready for querying.  Notice that it inferred the schema from the Java bean (Product.class).  (In the next blog post, I’ll show how to do this dynamically.)

You can prime the pump with a count:
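Something as simple as counting the RDD from the sketch above will do:

    // Force a full pass over the data so the cached RDD is materialized.
    System.out.println("Total records: " + productsRDD.count());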

The count operation forces Spark to load the data into memory, which makes queries like the following lightning fast:
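For instance, something like this (the column names are assumptions about the Product bean’s fields):

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import java.util.List;

    // Query the registered "products" table and print each matching row.
    DataFrame results = sqlContext.sql("SELECT id, name FROM products WHERE price > 100");
    List<Row> rows = results.collectAsList();
    for (Row row : rows) {
        System.out.println(row);
    }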

That’s it.  You’re off to the SQL races.

P.S.  If you try querying the sqlContext without applying a schema and/or without registering the RDD as a table, you will see an error complaining that the table could not be found.

“Spark SQL Against Cassandra Example” was created by DataStax Cassandra MVP Brian O’Neill; to view more postings by Brian, check out his blog.