This post was created by Stéphane Moreau on the LogikDevelopment Blog.
I recently tried out Apache Cassandra, a NoSQL database initially developed at Facebook. It is designed to handle very large amounts of data spread across many commodity servers while providing a highly available service with no single point of failure.
To populate the database, I used Apache Flume with the Flume NG Apache Cassandra Sink to inject logs into it. But let’s focus on Cassandra here; I will write about Flume in later posts.
This is the Cassandra schema I was using (which is the one suggested by the Cassandra sink):
After adding the data to the database, I wanted to fetch it to make sure everything went well.
I tried three different ways:
- Cassandra CLI
To return the first 100 rows (and all associated columns) from the records column family, I ran the following command:
However, the rows looked like this:
RowKey: 6c6f67696b6465763a32653630306661362d633664652d346336612d386561352d323636326533353661616332
=> (column=data, value=39312e36362e3233392e323530202d202d205b32392f4465632f323031323a30393a31383a3433202d303730305d2022474554202f77702d636f6e74656e742f706c7567696e732f73796e746178686967686c6967687465722f73796e746178686967686c696768746572322f736372697074732f636c6970626f6172642e73776620485454502f312e31222032303020313635392022687474703a2f2f7777772e6c6f67696b6465762e636f6d2f323031302f30372f30372f7573652d736572766572786d6c687474702d7468726f7567682d70726f78792f2220224d6f7a696c6c612f352e30202857696e646f7773204e5420362e313b20574f5736343b2072763a31372e3029204765636b6f2f32303130303130312046697265666f782f31372e3022, timestamp=1357169131235135)
=> (column=host, value=39312e36362e3233392e323530, timestamp=1357169131235134)
=> (column=src, value=6c6f67696b646576, timestamp=1357169131235133)
=> (column=ts, value=323031322d31322d32395431363a31383a34332e3030305a, timestamp=1357169131235132)
This behavior is clearly explained on the DataStax page Getting Started Using the Cassandra CLI:
Cassandra stores all data internally as hex byte arrays by default. If you do not specify a default row key validation class, column comparator and column validation class when you define the column family, Cassandra CLI will expect input data for row keys, column names, and column values to be in hex format (and data will be returned in hex format).
To pass and return data in human-readable format, you can pass a value through an encoding function. Available encodings are:
* integer (a generic variable-length integer type)
This means that we need to tell the CLI which encoding to use when returning column family data. We can do so for the entire client session using the following commands:
So if we now run the previous command, the rows look like this:
RowKey: logikdev:2e600fa6-c6de-4c6a-8ea5-2662e356aac2
=> (column=data, value=91.66.239.250 - - [29/Dec/2012:09:18:43 -0700] "GET /wp-content/plugins/syntaxhighlighter/syntaxhighlighter2/scripts/clipboard.swf HTTP/1.1" 200 1659 "http://www.logikdev.com/2010/07/07/use-serverxmlhttp-through-proxy/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0", timestamp=1357169131235135)
=> (column=host, value=91.66.239.250, timestamp=1357169131235134)
=> (column=src, value=logikdev, timestamp=1357169131235133)
=> (column=ts, value=2012-12-29T16:18:43.000Z, timestamp=1357169131235132)
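As a quick sanity check, the hex values from the raw output can also be decoded outside the CLI. Here is a minimal Python sketch, assuming the stored bytes are UTF-8 text, which is what the log data here appears to be. The helper names `decode_hex` and `encode_hex` are mine, for illustration only, and are not part of any Cassandra tool.

```python
def decode_hex(value: str, encoding: str = "utf-8") -> str:
    """Turn a hex string, as printed by the CLI, back into text."""
    return bytes.fromhex(value).decode(encoding)


def encode_hex(text: str, encoding: str = "utf-8") -> str:
    """Turn text into the hex form the CLI expects when no encoding is set."""
    return text.encode(encoding).hex()


# Column values copied from the raw (hex) CLI output shown earlier in this post.
raw_row = {
    "host": "39312e36362e3233392e323530",
    "src": "6c6f67696b646576",
    "ts": "323031322d31322d32395431363a31383a34332e3030305a",
}

# Decode every column value, assuming each one holds UTF-8 text.
decoded_row = {name: decode_hex(value) for name, value in raw_row.items()}

for name, value in decoded_row.items():
    print(f"{name} = {value}")
```

Going the other way, `encode_hex("logikdev")` gives back `6c6f67696b646576`, the hex form of the `src` value above. This only mirrors the text case; columns holding integers or UUIDs would need a different decoding.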
In order to retrieve columns in the records column family, we can use the following SELECT command:
However, the output looks like: