Be sure to check out more blog postings by Edward Capriolo at his blog found here.
Creating legacy tables
You can create Thrift/CLI-compatible tables in CQL 3 using the COMPACT STORAGE directive. The compact storage directive used with the CREATE TABLE command provides backward compatibility with older Cassandra applications; new applications should generally avoid it.
Compact storage stores an entire row in a single column on disk instead of storing each non-primary key column in a column that corresponds to one column on disk. Using compact storage prevents you from adding new columns that are not part of the PRIMARY KEY.
I stumbled on this today and find it interesting. First, my beloved sstables have been dubbed “legacy” tables. Second, I do not agree with or understand some of the rationale.
“Compact storage stores an entire row in a single column on disk….” This does not seem true. I cannot even digest the rest of the sentence.
“Using compact storage prevents you from adding new columns that are not part of the PRIMARY KEY.” This goes back to my last blog post about what a “row”, a “primary key”, or a “row key” is, and the need to disambiguate all these terms. All I know is that you have always been able to add columns to rows with Cassandra, because that is what the ColumnFamily data model is.
Anyway, I reject the statement that you should “generally” avoid legacy tables. There are cases where “legacy” tables are better. When are they better? You might ask… http://thelastpickle.com/blog/2013/01/11/primary-keys-in-cql.html
[default@dev] list events;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 2:201302
=> (column=2013-02-20 10\:58\:40+1300:, value=, timestamp=1357869160739000)
=> (column=2013-02-20 10\:58\:40+1300:is_dam_dirty_apes, value=01, timestamp=1357869160739000)
=> (column=2013-02-20 10\:58\:40+1300:pressure, value=000011d0, timestamp=1357869160739000)
=> (column=2013-02-20 10\:58\:40+1300:temperature, value=00000015, timestamp=1357869160739000)
-------------------
RowKey: 3:201302
=> (column=2013-02-20 10\:58\:45+1300:, value=, timestamp=1357869161380000)
=> (column=2013-02-20 10\:58\:45+1300:is_dam_dirty_apes, value=01, timestamp=1357869161380000)
=> (column=2013-02-20 10\:58\:45+1300:pressure, value=00001ed2, timestamp=1357869161380000)
=> (column=2013-02-20 10\:58\:45+1300:temperature, value=0000001f, timestamp=1357869161380000)
Notice that CQL3 puts an extra column, with an empty name and an empty value, at the start of each row above. It was done so CQL will not show you tombstones (rows that have all the columns deleted). This is nice, although tombstones never actually bothered me. I am not sure I want to give up the disk/RAM/memtable space for this.
In a worst-case scenario, imagine a row is one column. To write one column, I have to write two columns. Also remember that sstables are write-once: if one upserts over the same column at a later time, what would have been two columns is now four columns until Cassandra gets around to compacting them. That does not sound too bad, but what if your row is spread across 10 sstables?
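The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope model, not anything Cassandra exposes: the function name `cells_on_disk` is hypothetical, and it assumes that every write lands in a different sstable and that every write also rewrites the extra marker column, as the paragraph above does.

```python
def cells_on_disk(columns_per_write: int, sstables: int, marker: bool) -> int:
    """Count the cells stored for one row before compaction merges them.

    Assumptions (illustrative only): each write ends up in its own sstable,
    and without compact storage each write carries one extra marker cell.
    """
    per_write = columns_per_write + (1 if marker else 0)
    return per_write * sstables


if __name__ == "__main__":
    # Compare a one-column row with and without the marker column.
    for sstables in (1, 2, 10):
        compact = cells_on_disk(1, sstables, marker=False)
        with_marker = cells_on_disk(1, sstables, marker=True)
        print(f"{sstables:>2} sstable(s): compact={compact}, with marker={with_marker}")
```

Under these assumptions, a one-column row spread across 10 sstables costs 20 cells instead of 10 until compaction catches up. Whether that doubling matters depends on how wide your rows really are.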
As always, look at your data and understand how it will be laid out on disk. Do some benchmarking. One extra column per row, or a slightly less efficient method for storing data, can make a huge performance difference when working with millions or billions of rows. Or maybe it doesn’t. Either way, I am not on board with calling compact storage “legacy” or with the suggestion of generally avoiding it. I feel the decision is much like choosing MyISAM over InnoDB.
As the Oracle says: “You can’t see past a choice you do not understand.”