February 28th, 2013

Schema design is the cornerstone of making awesome databases. If you do not understand your data, what your users need, and the limitations of your hardware and software, you cannot effectively design a schema.

To understand schema design in Cassandra, I think you should start with what Cassandra is:


Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the data model from Google’s BigTable. Like Dynamo, Cassandra is eventually consistent. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems. 

This statement used to be on the front page of http://cassandra.apache.org, but the message gets re-crafted periodically. The current boilerplate is:

Cassandra’s ColumnFamily data model offers the convenience of column indexes with the performance of log-structured updates, strong support for materialized views, and powerful built-in caching.

Saying the boilerplate changed is a bit misleading: Cassandra has not fundamentally changed, but there is some window dressing on top of the core features.

For me, this all goes back to what “ColumnFamily model” means and how “schemaless” Cassandra should be. Let's dive into the column family model.

From http://research.google.com/archive/bigtable.html:

A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

(row:string, column:string, time:int64) → string

A subtle difference between BigTable and Cassandra is that in BigTable (and in HBase) columns are multi-dimensional by time. That is to say, if one were to insert (row:1, column:1, time:1) and then (row:1, column:1, time:2), BigTable would keep both versions of column 1, whereas Cassandra keeps only the last version. (In Cassandra we can use TimeUUIDs and composite columns to achieve a similar thing.)
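As a rough sketch of that trick in CQL3 (the table and column names here are just illustrative), you can fold a timeuuid into the clustering key so that every write becomes its own column and old versions survive:

CREATE TABLE versioned_columns (
  row_key text,
  column_name text,
  version timeuuid,   -- stands in for BigTable's timestamp dimension
  value text,
  PRIMARY KEY (row_key, column_name, version)
);

-- Both writes are kept: each timeuuid makes a distinct physical
-- column instead of overwriting the previous one.
INSERT INTO versioned_columns (row_key, column_name, version, value)
VALUES ('row:1', 'column:1', now(), 'first version');
INSERT INTO versioned_columns (row_key, column_name, version, value)
VALUES ('row:1', 'column:1', now(), 'second version');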

Let's take a moment to look at the simple API BigTable provides:

Table *T = OpenOrDie("/bigtable/web/webtable");
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
Operation op;
Apply(&op, &r1);

The original Thrift API is very close to the BigTable API: a simple yet powerful interface to set, get, and search opaque binary columns.

What makes the schemaless ColumnFamily model awesome? One thing is that sometimes schema gets in the way. For example, in an RDBMS you might spend a lot of time trying to craft the schema. You have to ask yourself questions like: “Is field 3 an int? Is it a long? Is it a tinyint? Is it a varchar? Is it a char? Should field 3 be indexed so users can search on it?”

The irony is that the answers to these questions are only micro-optimizations. Imagine you chose field 3 to be an int: more data will fit on disk than if you chose a long. But in the end, if your database is not horizontally scalable, it will not matter what you chose! Eventually, after 100, 1,000, or one billion records, you will run out of disk, or your single-node database will slow down for other reasons.

After I read the BigTable white paper I did what I think everyone should do: try to understand it by doing it yourself. Whether you use Cassandra or HBase does not matter; just try to implement the Webtable using the ColumnFamily model.

This is where the learning begins. You are probably used to relational database management systems and normalization. Every bone in your body is going to look for auto-ids and ways to break data up into normalized tables.

But now is the time to give up! DENORMALIZE! Watch, it is really easy! Say we model the web crawler database in SQL. Look how much work it is!

create table page (
  id bigint auto_increment not null,
  page varchar(1024) not null,
  tstamp bigint not null,

  primary key (id),
  unique index page_idx (page(300)),
  index time_idx (tstamp)
);

create table image (
  page_id bigint not null,
  image_link varchar(1024) not null,

  index img_idx (page_id, image_link(255))
);

create table link (
  page_id bigint not null,
  link_text varchar(255) not null,
  link_target varchar(1024) not null,

  index link_idx (page_id, link_text)
);

create table raw_data (
  page_id bigint not null,
  raw_data text not null,

  primary key (page_id)
);
Now watch how easy the de-normalized approach is!
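Here is one way to sketch it in CQL3 terms (names are illustrative): the entire crawl record collapses into one wide row per URL.

CREATE TABLE webtable (
  url text,
  column_name text,   -- e.g. 'anchor:www.c-span.org', 'image:logo.gif', 'raw:data'
  value text,
  PRIMARY KEY (url, column_name)
);

-- Anchors, images, links, and raw content are all just columns in
-- the same row: no joins, no auto-increment ids, and reading a whole
-- page is one contiguous slice of one row.
INSERT INTO webtable (url, column_name, value)
VALUES ('com.cnn.www', 'anchor:www.c-span.org', 'CNN');
INSERT INTO webtable (url, column_name, value)
VALUES ('com.cnn.www', 'raw:data', '<html>...</html>');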

WOW! All that schema does not help you scale if your database does not. When you are trying to store 1,000,000,000,000 URLs, a tinyint instead of a bigint is not going to be the deciding factor in whether the schema meets your application's need for low-latency reads!

Also, you can't just add indexes after the fact! If it takes 40 minutes of blocked-write downtime to add an index to a MySQL table, how long is it going to take on a table with a trillion rows?

If you want to see my very simple head-to-head relational DB vs. BigTable proof of concept from about three years ago, look here: http://www.jointhegrid.com/svn/hbench/trunk/src/com/jointhegrid/hbench/. It is a pretty neat application because you can pump N URLs with M pieces of random content into the system and see where each data store falls over.

Here we see how schema gets in the way. But then why is schema being added to Cassandra, and is schema useful at all? My answers are “a lot of reasons” and “yes”. I will explain the second answer first. Schema is useful for validation; it is a simple way to prevent yourself from packing the wrong bytes into a column. If we know that column 3 should always be a number, there is no downside to enforcing this. This knowledge helps us display data reasonably in a CLI or client application, and if our data store wishes to create some type of reverse index on the column, knowing the type will help it do that.
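For example, with a hypothetical CQL3 table (names are illustrative), declaring a type means Cassandra will refuse bytes that do not validate:

CREATE TABLE crawl_page (
  url text PRIMARY KEY,
  tstamp bigint
);

-- The declared type validates the bytes on the way in.
INSERT INTO crawl_page (url, tstamp) VALUES ('com.cnn.www', 1362009600);  -- accepted
INSERT INTO crawl_page (url, tstamp) VALUES ('com.cnn.www', 'today');     -- rejected: not a bigint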

What is CQL3 and why does it have so much schema? Well, there is this Matrix quote that sums up some of how I feel about CQL3 and Cassandra schemas (see image below).

Schemas, relational databases, and built-in indexes are the “System” to me. So are tools like Hibernate, ORMs, and Enterprise JavaBeans; I will even take a shot at “the cloud” here. They are the “System” because they do not seek to help you understand how your data is organized on disk; rather, they seek to abstract you away from it.

This is not going to be a CQL rant. In fact, I like a lot of CQL. I like how CQL will pack bytes for you, and I like how CQL helps you insert and slice rather complex composite columns. What I do not particularly like is its huge departure from the BigTable white paper. CQL3 does not support SET (from the BigTable white paper); instead, you write INSERT statements. I do not always want the ColumnFamily model abstracted or hidden from me.
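To make that concrete, here is the same write both ways, reusing the illustrative webtable from above:

-- The BigTable/Thrift way: one SET on an opaque column.
--   r1.Set("anchor:www.c-span.org", "CNN");
-- The CQL3 way: the same write phrased as an INSERT. Under the hood
-- it is still one composite-column write into the 'com.cnn.www' row.
INSERT INTO webtable (url, column_name, value)
VALUES ('com.cnn.www', 'anchor:www.c-span.org', 'CNN');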

I do not like how CQL uses the terms “PRIMARY KEY” and “TABLE”. These are terms from the “System”. If you do not understand what a column family is, you are blind to the truth, and you will not be able to design an effective Cassandra data store.

“Add index” is a tool of the “System” as well. To denormalize, you should be building your own indexes, so that you understand what an index really means. You can learn more about Cassandra schema design from this SQLite document than you can by asking a tool to help you and remaining blind.

An index is another table similar to the original “fruitsforsale” table but with the content (the fruit column in this case) stored in front of the rowid and with all rows in content order.

This is just another way of saying the most basic Cassandra schema advice: “Design your data as you wish to read it, eliminate seeks, use wide rows.” An index is just another ordering of data. You lay the data down on disk in the way you wish to read it. If that means you have to lay it down on disk N times to make some searches faster, do that. (That is really what you have been doing all along with relational databases and indexes.) Understanding how the data lays out on disk is more important than it being easy, like Mongo, or MySQL, or whatever.
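As a sketch of what that looks like in practice (hypothetical tables, illustrative names), here is the same data laid down twice, each copy ordered for one read path:

-- Read path 1: fetch a page by URL.
CREATE TABLE pages_by_url (
  url text PRIMARY KEY,
  tstamp bigint,
  raw_data text
);

-- Read path 2: a hand-built index. The same rows re-keyed by crawl
-- day, so a time-range scan is one sequential slice of one wide row.
CREATE TABLE pages_by_day (
  day text,        -- e.g. '2013-02-28'
  tstamp bigint,
  url text,
  PRIMARY KEY (day, tstamp, url)
);

-- The application writes to both tables; every read is a contiguous
-- slice, never a seek-heavy scatter.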