February 28th, 2013

Developers often ask me how Apache Cassandra's storage works internally, so I think it is a good idea to explain how Cassandra deals with data under the hood. This article describes the data structures Cassandra uses to provide such fast access (especially write access) to data.

NoSQL does not mean schema-less

"NoSQL equals schema-less": this statement is not valid, especially for Cassandra. Starting with version 0.7, Cassandra encourages developers to declare schema information to achieve more transparency.

For example, creating a ColumnFamily (table) with CQL:

CREATE TABLE timeline (
    user_id varchar,
    tweet_id int,
    author varchar,
    body varchar,
    PRIMARY KEY (user_id)
);

Looks pretty similar to SQL? But it does not work in a similar way.
In an RDBMS, the storage engine is typically based on B-trees[1], while Apache Cassandra implements a log-structured merge-tree[2].
The rough difference between an RDBMS and Cassandra: if you insert only a primary key (not a full row) into an RDBMS, resources are allocated for the complete row. In Cassandra, by contrast, each row is sparse: it stores only the columns present in the inserted data. This is possible thanks in part to the log-structured merge-tree.
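
To make the idea of sparse rows on top of a log-structured store more concrete, here is a minimal sketch in Python. It is a toy model, not Cassandra's actual implementation: the class name TinyLSMStore, the memtable_limit threshold, and the sample keys are all hypothetical, and it leaves out the commit log, write timestamps, and compaction that a real system needs. It only shows that writes go to an in-memory table, that flushed segments are never rewritten, and that a row holds just the columns that were inserted.

```python
# Toy log-structured merge store: an illustrative sketch, not Cassandra's code.
class TinyLSMStore:
    def __init__(self, memtable_limit=2):
        self.memtable = {}   # row key -> {column: value}, mutable, in memory
        self.sstables = []   # flushed, immutable segments, oldest first
        self.memtable_limit = memtable_limit  # hypothetical flush threshold (rows)

    def insert(self, key, columns):
        # A write only touches the in-memory table; flushed segments stay untouched.
        self.memtable.setdefault(key, {}).update(columns)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable as a key-sorted segment and start a fresh one.
        if self.memtable:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        # Merge column fragments from oldest to newest so newer values win.
        row = {}
        for segment in self.sstables + [self.memtable]:
            row.update(segment.get(key, {}))
        return row


store = TinyLSMStore()
store.insert("alice", {"tweet_id": 1})                   # only supplied columns are stored
store.insert("bob", {"tweet_id": 2, "author": "carol"})  # second row triggers a flush
store.insert("alice", {"body": "hello"})                 # later write lands in a new memtable

print(store.read("alice"))  # {'tweet_id': 1, 'body': 'hello'}
print(store.read("bob"))    # {'tweet_id': 2, 'author': 'carol'}
```

Note that the row for "alice" never gets an author column, and that its two fragments live in different segments until a read merges them. Nothing is allocated for columns that were never written.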

Originally posted at Sergey Enin – ruby and BigData expertise