November 26th, 2013

The question is: in what ways does the distributed nature of Cassandra affect your Java application programming? In other words, how will it affect your application’s need to create, read, update, and delete data? To explain this properly, we first need to define certain terms, such as a materialized view. In Cassandra, think of a materialized view as a table designed for a specific query or purpose, typically driven by the UI. Let’s delve into the PlayList use case to explain this more clearly.


Assuming you have a Play List Service which contains a connected Play List DAO, let’s see what is written first to the Cassandra database and why, and how it will be used later.


In our application, let’s call it Zandora, there is a Play List which contains Audio Files. Each audio file has all the data which it needs to be fully relevant to the system. However, the audio file itself is independent. What if you wanted to know all the songs that are available from a certain artist, such as Sade Adu? How would you accomplish this? Let’s first look at the audio file table, and then we can look at a materialized view. Then we can discuss the ramifications of this for your Cassandra work. 


The CQL for the audiofile table:


CREATE TABLE audiofile (audioid uuid PRIMARY KEY, artist text, album text, track_title text, genre text, language text, track_length int, track_num int, year int, rating int, cover_art blob, audio_actual blob);




The Java code which will use the DataStax Java Driver:


/* all code by Laurent Weichberger */

ByteBuffer coverArt = this.readCoverArt();
ByteBuffer audioActual = this.readAudioTrack();

AudioFile af = new AudioFile(  // Data Transfer Object
  UUID.randomUUID(),     // need a unique id
  "Kami Nixon",          // artist
  "Fertile Girl",        // album
  "Not Another One",     // track
  "Crossover",           // genre
  "English",             // language
  279,                   // track length in seconds
  4,                     // track number
  2007,                  // year recorded
  5,                     // rating (stars)
  coverArt,              // the album cover art
  audioActual);          // the audio track as binary




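/* dao is a PlayListDAO object; its saveOrUpdate method performs the INSERT */
/* needs com.datastax.driver.core.PreparedStatement and BoundStatement */

public void saveOrUpdate(CassandraEntity ce) {

  AudioFile af = (AudioFile) ce;

  PreparedStatement statement = getSession().prepare(
   "INSERT INTO myplaylist.audiofile " +
   "(audioid, artist, album, track_title, genre, language, track_length, " +
   "track_num, year, rating, cover_art, audio_actual) " +
   "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);");

  BoundStatement boundStatement = new BoundStatement(statement);

  // The original listing breaks off here. A likely completion, assuming
  // AudioFile exposes getters with these (hypothetical) names:
  getSession().execute(boundStatement.bind(
   af.getAudioId(), af.getArtist(), af.getAlbum(), af.getTrackTitle(),
   af.getGenre(), af.getLanguage(), af.getTrackLength(), af.getTrackNum(),
   af.getYear(), af.getRating(), af.getCoverArt(), af.getAudioActual()));
}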


The listing starts with the CQL used to create the table itself. We then switch to Java to show how to insert data into that table with the DataStax Java Driver, calling the session’s execute method with a Bound Statement. Once this row exists in the table, we can see that we may in fact need another table in order to respond rapidly to a query for music by a certain artist. Rather than use the main audiofile table, we can create a materialized view to answer that query directly. This is the Table per Query pattern we discussed earlier.


Although we duplicate the data by writing it to the materialized view as well, the upside of the rapid query is worth the pain of the extra write. Let’s look at the CQL for the materialized view:


CREATE TABLE audioByArtist
  (audioid uuid, artist text, album text, track_title text,
  genre text, language text, track_length int, track_num int,
  year int, rating int, PRIMARY KEY(artist, audioid));



As we can see, the table name describes what the view (or query) is for: we want to see the music by the artist in question. Notice we chose not to put the actual audio file in this table. Why? Because in our use case, when the audio file itself is needed to play the music, we can get it from the audiofile table. This table is more for seeing what the music is about: the metadata. In our use case, people will browse lots of music before ever playing a song, so this pattern works well. Let’s move on to the update and then, lastly, the delete considerations.
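To see why a query by artist is quick against a table with PRIMARY KEY(artist, audioid), it can help to picture the key as a two-level map: the partition key (artist) selects one partition, and the clustering column (audioid) identifies a row inside it. The following is a minimal in-memory sketch of that idea in plain Java. It is not driver code and not how Cassandra is implemented; the names `AudioByArtistModel`, `insert`, and `findByArtist` are invented for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

/** Illustrative in-memory model of PRIMARY KEY(artist, audioid); not Cassandra itself. */
class AudioByArtistModel {
    // partition key (artist) -> clustering column (audioid) -> track title
    private final Map<String, Map<UUID, String>> partitions = new HashMap<>();

    /** Writes a row; reusing the same (artist, audioid) pair overwrites it. */
    void insert(String artist, UUID audioId, String trackTitle) {
        partitions.computeIfAbsent(artist, k -> new LinkedHashMap<>())
                  .put(audioId, trackTitle);
    }

    /** Models SELECT ... WHERE artist = ?: one partition lookup, other artists untouched. */
    Map<UUID, String> findByArtist(String artist) {
        return Collections.unmodifiableMap(
                partitions.getOrDefault(artist, Collections.emptyMap()));
    }
}
```

Looking up an artist touches only that artist's entry, which is the property the query table is designed around. The heavy blobs are left out of the model for the same reason they are left out of the table.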


Before we do, one last word on the INSERT. When inserting new data in Cassandra, the developer needs to consider carefully exactly which materialized views (or query tables) must also receive that same data. Remember, writes are cheap in Cassandra, and we want to maximize read responsiveness with the query tables. And as we have learned elsewhere, an insert to an existing primary key value is allowed in Cassandra and is known as an UPSERT: it overwrites the columns you supply without raising an error, so be careful. Let’s move on.
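One way to keep track of which tables must receive a new audio file is to build every INSERT from a single place. The sketch below only assembles the CQL strings that a DAO would then prepare and bind; it does not talk to Cassandra. The column lists are copied from the two CREATE TABLE statements in this post. The class and method names are hypothetical, and placing audioByArtist in the myplaylist keyspace is an assumption, since the post does not say which keyspace it lives in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative helper: builds one INSERT per table that must receive a new audio file. */
class TablePerQueryInserts {
    // Column lists copied from the two CREATE TABLE statements in this post.
    static final List<String> AUDIOFILE_COLUMNS = Arrays.asList(
            "audioid", "artist", "album", "track_title", "genre", "language",
            "track_length", "track_num", "year", "rating", "cover_art", "audio_actual");
    static final List<String> AUDIO_BY_ARTIST_COLUMNS = Arrays.asList(
            "audioid", "artist", "album", "track_title", "genre", "language",
            "track_length", "track_num", "year", "rating");

    /** Builds "INSERT INTO keyspace.table (c1, c2) VALUES (?, ?);" for later preparation. */
    static String insertCql(String keyspace, String table, List<String> columns) {
        String names = String.join(", ", columns);
        String markers = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + keyspace + "." + table
                + " (" + names + ") VALUES (" + markers + ");";
    }

    /** Every statement the DAO would need to execute for a single new audio file. */
    static List<String> statementsForNewAudioFile() {
        List<String> statements = new ArrayList<>();
        statements.add(insertCql("myplaylist", "audiofile", AUDIOFILE_COLUMNS));
        // Keyspace for the query table is assumed; the post does not state it.
        statements.add(insertCql("myplaylist", "audioByArtist", AUDIO_BY_ARTIST_COLUMNS));
        return statements;
    }
}
```

Adding another query table later then means adding one column list and one line in `statementsForNewAudioFile`, which makes it harder to forget a table when the write path changes.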


Update is naturally also important. Keep in mind that data stored in Partition Key or Clustering Columns cannot be updated. So there are two major considerations with updates:


1. Update your query tables: just a strong reminder that updating one table may not be enough in Cassandra if there is an expectation that the query tables are reflecting the state of that main table (as we just saw).
2. How to update a Primary Key: since Cassandra will not allow updates to Partition Key or Clustering Column values, you must manually update them.



Manually Updating Primary Key values:
A. Read the entire row.
B. Create a new statement which would reflect that row, with the values you wish changed in the Partition Key or Clustering Columns, keeping whatever state you need from the old row.
C. Delete the old row.
D. Insert the new row using the freshly crafted statement.


As we can see, updating a Primary Key is not trivial, so consider your keys and columns carefully when modeling.
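The manual key change described above (read, build the new row, delete, insert) can be sketched with an in-memory stand-in for the audioByArtist table. This is a toy model in plain Java, not driver code; `PrimaryKeyChange` and its methods are hypothetical names. Against a real cluster the delete and the insert are two separate writes, so the application has to decide what should happen if one succeeds and the other fails.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** Illustrative in-memory stand-in for audioByArtist, keyed by (artist, audioid). */
class PrimaryKeyChange {
    private final Map<List<Object>, Map<String, Object>> rows = new HashMap<>();

    void insert(String artist, UUID audioId, Map<String, Object> otherColumns) {
        rows.put(Arrays.<Object>asList(artist, audioId), new HashMap<>(otherColumns));
    }

    Map<String, Object> read(String artist, UUID audioId) {
        return rows.get(Arrays.<Object>asList(artist, audioId));
    }

    void delete(String artist, UUID audioId) {
        rows.remove(Arrays.<Object>asList(artist, audioId));
    }

    /** Steps A to D from the text; returns false when there is no row to move. */
    boolean changeArtist(String oldArtist, String newArtist, UUID audioId) {
        Map<String, Object> oldRow = read(oldArtist, audioId);  // A. read the entire row
        if (oldRow == null) {
            return false;
        }
        Map<String, Object> newRow = new HashMap<>(oldRow);     // B. carry over the old state
        delete(oldArtist, audioId);                             // C. delete the old row
        insert(newArtist, audioId, newRow);                     // D. insert under the new key
        return true;
    }
}
```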


And now lastly, deletes. Patrick McFadin is fond of reminding us that a delete is just an update which writes “a tombstone.” So the delete still comes in as a write and goes through the entire write path. Therefore a lot of deletes can affect the system and trigger compaction. If you are in a situation in which you must perform a tremendous number of deletes in a short time span, consider carefully the effect this will have on your MemTables, SSTables, and Compactions.
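The idea that a delete is a write can be made concrete with a toy model. In the sketch below, deleting a key stores a marker instead of removing anything, reads hide the marked data, and a later compaction step drops markers once they are older than a grace period (in Cassandra the table option gc_grace_seconds plays that role). This is a deliberately simplified illustration in plain Java with invented names (`TombstoneModel`, `Cell`, `compact`); it does not reproduce Cassandra's actual storage engine.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of "a delete is a write": deleting stores a tombstone rather than removing data. */
class TombstoneModel {
    /** A stored cell; a null value marks a tombstone. */
    static final class Cell {
        final String value;
        final long writtenAtSeconds;

        Cell(String value, long writtenAtSeconds) {
            this.value = value;
            this.writtenAtSeconds = writtenAtSeconds;
        }

        boolean isTombstone() {
            return value == null;
        }
    }

    private final Map<String, Cell> cells = new HashMap<>();

    void write(String key, String value, long nowSeconds) {
        cells.put(key, new Cell(value, nowSeconds));
    }

    /** A delete follows the same path as a write and leaves a marker behind. */
    void delete(String key, long nowSeconds) {
        cells.put(key, new Cell(null, nowSeconds));
    }

    /** Readers never see deleted data, even though the marker still occupies space. */
    String read(String key) {
        Cell cell = cells.get(key);
        return (cell == null || cell.isTombstone()) ? null : cell.value;
    }

    int storedCells() {
        return cells.size();
    }

    /** Models compaction: tombstones at least as old as the grace period are dropped. */
    void compact(long nowSeconds, long gracePeriodSeconds) {
        cells.values().removeIf(cell ->
                cell.isTombstone() && nowSeconds - cell.writtenAtSeconds >= gracePeriodSeconds);
    }
}
```

In this model a burst of deletes leaves the number of stored cells unchanged until compaction runs after the grace period, which is why heavy delete workloads deserve the same capacity planning as heavy write workloads.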