August 2nd, 2013

Visit DataStax Academy to learn data modeling with Apache Cassandra step-by-step, via the free self-paced course DS220: Cassandra Data Modeling

What Was Old is New Again

Creating static schema in Cassandra may seem unusual, but is frequently done in production data models. Not every table created needs to have dynamic columns. In some cases it’s useful to have the data stored in fixed column names with type validation. The following examples will get you started with defining static schema tables and in some cases create hybrid data models that add a dynamic column component for greater flexibility.

Fixed schema – User Data Pattern 1

To start, we’ll work on a very basic table of data that you might find in any relational database. If you have created a table in MySQL or Oracle, this should look very familiar to you. In the syntax below, the username is the only primary key, so we put the declaration to the right of field type. You can use this reference to see the different types that can be used:


   username text PRIMARY KEY,

   first_name text,

   last_name text,

   address1 text,

   city text,

   postal_code text,

   last_login timestamp


Once we have our table definition, it should be as equally as familiar to insert data using the INSERT INTO syntax.

INSERT INTO users (username,first_name,last_name,address1,city,postal_code,last_login)

VALUES (‘cstar’,’Cassandra’,’Database’,’123 Main St’,’San Mateo’,’11111′,’2013-4-4′);

Once the data is in, you can use the SELECT command.

SELECT first_name, last_name

FROM users

WHERE username = ‘cstar’;

One large difference between relational databases and Cassandra CQL is the WHERE clause syntax. Because of how the data is stored, you can only use the elements declared in the primary key. Unless you want every row, you need to specify the row key at a minimum. We’ll use the next example to dive further into this difference.

Fixed Schema  on a Wide Row – User Data Pattern 2

For the this pattern, we’ll again be storing what looks like static row oriented data. The subtle difference here is in how the data is stored by Cassandra. By using multiple fields in the PRIMARY KEY definition, we are specifying that this data will be stored in a wide row. Let’s use a playlist model as an example:

CREATE TABLE user_playlist (

      username text,

      playlist_name text,

      song_name text,

      artist text,

      list_order int,

      songid UUID

      PRIMARY KEY (username, playlist_name, song_name)


At first glance, it would appear that we are creating a table that will contain multiple rows with 3 unique fields. The first element ‘username’, will be used as the row key. The remaining elements will be be used by CQL to create unique columns in the underlying storage engine. The result set presented to the user is a large table of data similar to a relational database. Let’s insert some data and see how it works:

INSERT INTO user_playlist (username, playlist_name, song_name, artist, list_order, songid)

VALUES (‘cstar’,’Classic Rock’,’Stairway to heaven’,’Led Zepplin’,1,c9df5407-7dd8-46a4-84ee-1c358106c926);

INSERT INTO user_playlist (username, playlist_name, song_name, artist, list_order, songid)

VALUES (‘cstar’,’Classic Rock’,’Sweet home Alabama’,’Lynyrd Skynyrd’,2,d2fb151c-cba8-46ff-bc03-66b32c45434a);

INSERT INTO user_playlist (username, playlist_name, song_name, artist, list_order, songid)

VALUES (‘cstar’,’Alternative’,’The Queen is dead’,’The Smiths’,1,24106aa0-10d5-4de1-9dab-a3a652e32c49);

INSERT INTO user_playlist (username, playlist_name, song_name, artist, list_order, songid)

VALUES (‘cstar’,’Alternative’,’Blue Monday’,’New Order’,2,7a0b82ba-5a24-4554-903c-1970d1e3aaea);

All of the data inserted will be on one row with the key ‘cstar‘, however, we we access that data with CQL, it will appear as many rows.

SELECT playlist_name,song_name,artist,list_order from user_playlist

WHERE username=’cstar’;


 playlist_name | song_name          | artist         | list_order


   Alternative |        Blue Monday |      New Order |          2

   Alternative | The Queen is dead |     The Smiths |          1

  Classic Rock | Stairway to heaven |    Led Zepplin |          1

  Classic Rock | Sweet home Alabama | Lynyrd Skynyrd |          2

The output is nicely formatted and easier to read by the user. The underlying storage engine has actually created a single row of data using unique column names derived from the PRIMARY KEY definition. The result is a faster access pattern on reads when all data is co-located on the same node and sequentially read from disk. You can read more in-depth here:

Fixed Schema with a Dynamic Collection – User Data Pattern 3

When you need some dynamic element to the static table, this can be accomplished by using the collection feature of CQL. A collection can be a Map, Set or List and can be read about more here:

The example we’ll use is a user table with an added requirement to dynamically store one or more locations sorted by time. The map collection can be used to create an ordered pair of data where the key is a timestamp and the value is a location stored as text.

CREATE TABLE user_with_location (

      username text PRIMARY KEY,

      first_name text,

      last_name text,

      address1 text,

      city text,

      postal_code text,

      last_login timestamp,

      location_by_date map<timestamp,text>


First we insert the entire user record. Note: When inserting a collection, you will overwrite any of the previous values if the row key exists.

INSERT INTO user_with_location 


VALUES (‘cstar’,’Cassandra’,’Database’,’123 Main St’,’San Mateo’,’11111′,’2013-4-4′,{ ‘2013-4-4’ : ‘San Francisco’});

Then we add a couple of locations for this user. This will be an update so not overwrite the previous location data in the collection. If you use the same map key as a previous data point, you will overwrite just that map value.

UPDATE user_with_location SET location_by_date[‘2013-4-5’] = ‘San Diego’

WHERE username = ‘cstar’;

UPDATE user_with_location SET location_by_date[‘2013-4-2’] = ‘Santa Rosa’

WHERE username = ‘cstar’;

When we select the row of data for the user ‘cstar‘, the result set will contain the entire map of locations. Collections have to be returned completely, so keep that in mind as you design your model. Too much in one collection could cause a lot of data to be returned.

SELECT first_name, last_name, location_by_date

FROM user_with_location

WHERE username = ‘cstar’;


 first_name | last_name | location_by_date


  Cassandra |  Database | {2013-04-02 00:00:00-0700: Santa Rosa,

                           2013-04-04 00:00:00-0700: San Francisco,

                           2013-04-05 00:00:00-0700: San Diego}

You might also notice that the data in the first element of the map is sorted. This is a feature of the map to sort by the type of the key.


As you can see, creating user based schema can be simple and with some of the CQL features shown, powerful in creating wide row data sets. Cassandra has had the reputation of only fitting large dynamic data sets, but with CQL you can create fixed schema and strong validation typing. There are many other patterns for user profile management. These three should get you started on your user centered data model.