September 5th, 2013

This blog posting was created by Venkatesh Ragi at the Imaginea Blog.

Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.

Cassandra’s data model is a partitioned row store with tunable consistency. Rows are organized into tables; the first component of a table’s primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key. Other columns may be indexed separately from the primary key.

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

CQL3 (the Cassandra Query Language) provides a new API to work with Cassandra. Where the legacy thrift API exposes the internal storage structure of Cassandra pretty much directly, CQL3 provides a thin abstraction layer over this internal structure.
Earlier versions of CQL, CQL2 was close to the concepts of thrift which required clients to decode the CompositeTypes and also exposes the internal storage engine.

Hive support for cassandra:
The source code for hive support for cassandra is hosted at under cas-support-cql branch.

Though the code includes support for both CQL3 tables and thrift tables separately by CqlStorageHandler and CassandraStorageHandler, this blog covers using CqlStorageHandler for creating CQL3 tables in cassandra through hive.

Note: The CqlStorageHandler supports tables created through CQL3 only. CQL2 tables cannot be accessed using this handler.Create a CQL3 table through hive:Create a database in hive. Then execute the following command.

hive> CREATE EXTERNAL TABLE test.messages(message_id string, author string, body string) STORED BY‘org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler’ WITH SERDEPROPERTIES (“cql.primarykey” = “message_id, author”, “comment”=”check”, “read_repair_chance” = “0.2″, “dclocal_read_repair_chance” = “0.14″, “gc_grace_seconds” = “989898″, “bloom_filter_fp_chance” = “0.2″, “compaction” = “{‘class’ : ‘LeveledCompactionStrategy’}”, “replicate_on_write” = “false”, “caching” = “all”)

where ‘test’ is the keyspace in cassandra. The above query also creates a column family in cassandra if does not exist. To create a keyspace that doesnot exist in cassandra execute the following query

hive> CREATE EXTERNAL TABLE test.messages(row_key string, col1 string, col2 string) STORED BY ‘org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler’ WITH SERDEPROPERTIES  (“cql.primarykey” = “row_key”)  TBLPROPERTIES (“” = “mycqlks”, “cassandra.ks.stratOptions”=”‘DC’:1, ‘DC2′:1″, “cassandra.ks.strategy”=”NetworkTopologyStrategy”);

Note: For brevity, only minimal SERDEPROPERTIES are  given in the above query.

If ‘test’ keyspace does not exist in cassandra it will be created.

Inserting values into CQL3 table through hive:

hive> INSERT INTO TABLE messages SELECT * FROM tweets;

The values from tweets table are appended to messages table.

Note: With Cassandra INSERT OVERWRITE is same as INSERT INTO as Cassandra merges changes if keys are same.

Retrieving values from a CQL3 table using hive:

hive> SELECT * FROM messages;