July 24th, 2013


Solr and Hadoop are two major open source technologies that we have integrated into DataStax Enterprise on top of Cassandra. For those just joining us, Solr provides full-text search, and Hadoop provides a distributed file system and the ability to process large datasets via MapReduce. In the traditional world, if you wanted to run MapReduce over some data and also search that same data, you would have to ETL the data into your Solr cluster, with all the pitfalls of keeping two clusters in sync. The beauty of DataStax Enterprise is that, with the right replication settings, you can run searches and MapReduce jobs over the same dataset with ease. In this example I’ll be using a modified dataset from a survey by the Pew Research Center about Facebook habits and attitudes.


This demonstration was run on my EC2 cluster: two m1.large instances running Ubuntu 12.04 with a binary install of DSE 3.0.4.


The cluster has been set up with two virtual datacenters (DCs): an Analytics DC with a node running Hadoop, and a Solr DC with a node running Solr.


To begin, we need to get the survey file: Omnibus_Dec_2012_csv
I’ve modified this survey file from the original by removing many of the columns. Our primary focus will be two columns, pial1a and pial4vb, each of which maps to one survey question.

Secondly, we need to create a Solr schema file so that DSE Solr understands how to import and index the data and store it in Cassandra. Save the schema in a file called answers_schema.xml. This schema tells Solr how to index our documents, and will be mirrored in DSE by a Cassandra table.

And lastly, we are going to use the solrconfig.xml from the Wikipedia demo that ships with DataStax Enterprise.


We will create the keyspace to store our survey data first and set the replication strategy and options such that data will be available in both the Solr DC and the Analytics DC. By default DSE Solr would only store data in the Solr DC.
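As a rough sketch of what that keyspace definition could look like, the Python helper below builds a CQL3 `CREATE KEYSPACE` statement that uses NetworkTopologyStrategy with one replica in each datacenter. The helper name `build_keyspace_cql`, the datacenter names `Solr` and `Analytics`, and the replica counts are assumptions for illustration; check the names your own cluster reports before using the statement.

```python
def build_keyspace_cql(keyspace, dc_replicas):
    """Return a CQL3 CREATE KEYSPACE statement using NetworkTopologyStrategy.

    dc_replicas maps a datacenter name to its replica count. The names used
    below are assumed; they must match what your cluster actually reports.
    """
    if not dc_replicas:
        raise ValueError("at least one datacenter is required")
    # Sort the datacenters so the generated statement is deterministic.
    options = ", ".join(
        "'{}': {}".format(dc, int(count))
        for dc, count in sorted(dc_replicas.items())
    )
    return (
        "CREATE KEYSPACE {} WITH replication = "
        "{{'class': 'NetworkTopologyStrategy', {}}};".format(keyspace, options)
    )


# One replica in each virtual datacenter, so both Solr and Hadoop see the data.
statement = build_keyspace_cql("answers", {"Solr": 1, "Analytics": 1})
print(statement)
```

The printed statement can then be pasted into a CQL shell. Listing both datacenters in the replication options is what makes the same rows available to the search node and the analytics node.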

Now we can upload the solrconfig and answers_schema XML files to DSE Solr. This process automatically creates a column family named fbsurvey under the answers keyspace, along with the columns and the appropriate metadata.
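The upload itself is a pair of HTTP POSTs. The sketch below only builds the requests and does not send them. It assumes the DSE Search convention of posting to `/solr/resource/<keyspace>.<column family>/<resource name>` on port 8983, and that the schema is stored under the resource name `schema.xml` even though our local file is called answers_schema.xml. The helper name `build_resource_upload`, the host, and the placeholder payloads are made up for illustration.

```python
import urllib.request


def build_resource_upload(host, core, resource_name, payload, port=8983):
    """Build (but do not send) a POST request that uploads a Solr resource.

    The URL layout is an assumption based on the DSE Search convention
    /solr/resource/<keyspace>.<column family>/<resource name>.
    """
    url = "http://{}:{}/solr/resource/{}/{}".format(host, port, core, resource_name)
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


core = "answers.fbsurvey"
# Placeholder payloads; in practice, read the bytes of the real XML files.
config_request = build_resource_upload("localhost", core, "solrconfig.xml", b"<config/>")
schema_request = build_resource_upload("localhost", core, "schema.xml", b"<schema/>")

print(config_request.full_url)
print(schema_request.full_url)
```

Passing each request to `urllib.request.urlopen` would perform the upload against a running node. The config would normally go first, since posting the schema is the step that creates the column family.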

Now we can upload the survey CSV data and have Solr process it and store it back into Cassandra. We can then do a quick count to see the number of records and check that the data transferred over.
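One simple way to sanity-check the import is to count the records in the CSV file locally and compare that number with the count reported after the upload. The sketch below assumes the file has a single header row; the function name `count_csv_records` and the sample rows are made up for illustration and are not taken from the survey.

```python
import csv
import io


def count_csv_records(csv_file, has_header=True):
    """Count the data rows in an open CSV file, skipping blank lines."""
    reader = csv.reader(csv_file)
    rows = sum(1 for row in reader if row)
    # The header line is not a record, so leave it out of the count.
    if has_header and rows > 0:
        rows -= 1
    return rows


# Made-up sample standing in for the real survey file.
sample = io.StringIO("id,pial1a,pial4vb\n1,1,NO COMPUTER\n2,2,FAMILY REASONS\n")
record_count = count_csv_records(sample)
print(record_count)
```

For the real file, open it with `open(path, newline="")` and pass the file object in. If the local count and the stored count differ, some rows were probably rejected during indexing.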

Now we can search using Solr’s HTTP API and find out how many people mentioned COMPUTER or FAMILY in their response to why they stopped using Facebook.
The query I’m using here has some added parameters that indent the response for us and limit the output to the two columns I’m interested in looking at: the id and pial4vb, which contains the person’s response.
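To make the shape of such a query concrete, the sketch below assembles a select URL from the standard Solr parameters `q` (the query), `fl` (the fields to return), and `indent`. It assumes the core is addressed as `answers.fbsurvey` on port 8983 and that the two terms are combined with OR in a single query; the helper name `build_search_url` and the host are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(host, core, field, terms, returned_fields, port=8983):
    """Build a Solr select URL that ORs several terms on one field."""
    query = " OR ".join("{}:{}".format(field, term) for term in terms)
    params = {
        "q": query,                       # what to search for
        "fl": ",".join(returned_fields),  # which fields to return
        "indent": "true",                 # pretty-print the response
    }
    return "http://{}:{}/solr/{}/select?{}".format(host, port, core, urlencode(params))


url = build_search_url(
    "localhost", "answers.fbsurvey", "pial4vb", ["COMPUTER", "FAMILY"], ["id", "pial4vb"]
)
print(url)

# Decode the query string again to confirm what Solr would receive.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
```

Opening the printed URL in a browser or with curl against a running node returns the matching documents, and the response also reports how many were found.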

No computer? Ouch.


Now we hop over to our Hadoop node so we can run some MapReduce jobs over the data that we imported via Solr. In this example we will use Hive, whose SQL-like syntax will be familiar to many of you and makes MapReduce easy to use. We can reference the data in Cassandra by using the name of the keyspace as our database and the name of the column family as our table, in SQL parlance. Let’s see who answered yes to owning an e-reader and gave a significant response as to why they don’t use Facebook anymore.
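A query along those lines might be put together as in the sketch below, which generates the HiveQL as text. Several details are assumptions, because the survey's column layout is not shown here: `ereader_col` is a hypothetical name for the e-reader question's column, `'1'` is assumed to be the code for a yes answer, and a "significant" response is approximated as one longer than a chosen number of characters. The helper name `build_hive_statements` is also made up.

```python
def build_hive_statements(keyspace, table, ereader_column, response_column, min_length):
    """Return HiveQL statements that select the keyspace and filter responses.

    The keyspace plays the role of the Hive database and the column family
    plays the role of the table. Column names and answer codes are assumed.
    """
    use_statement = "USE {};".format(keyspace)
    select_statement = (
        "SELECT id, {resp} FROM {tbl} "
        "WHERE {ereader} = '1' AND length({resp}) > {n};"
    ).format(resp=response_column, tbl=table, ereader=ereader_column, n=int(min_length))
    return [use_statement, select_statement]


statements = build_hive_statements(
    keyspace="answers",
    table="fbsurvey",
    ereader_column="ereader_col",  # hypothetical column name
    response_column="pial4vb",
    min_length=10,                 # arbitrary cutoff for a "significant" response
)
for statement in statements:
    print(statement)
```

Pasting the printed statements into the Hive shell on the analytics node runs the filter as a MapReduce job over the same rows that Solr indexed.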


This example is just the tip of the iceberg of what you can do with Cassandra, Solr, and Hadoop. In DataStax Enterprise your data can be used however you see fit, without having to wait for or worry about ETL. I glossed over a lot of concepts about how Hadoop and Solr tie into Cassandra in this demonstration, so if you want to know more, continue on to the additional reading. If you want to try DataStax Enterprise yourself, it is available for download.

Additional Reading
DataStax Enterprise Hadoop

DataStax Enterprise Search