April 4th, 2016

Hanneli Tavante, Software Developer, Codeminer 42
Hanneli (@hannelita) is a developer addicted to code, learn new programming languages, blow capacitors, do some C programming and commit useful (or not) code for random Open Source Projects that she finds at Github. She tries to help community with blog posts and organising NoSQL, programming languages and Math/Physics/Science meetups around the globe. She also likes Math, Lego, dogs, hardware and coffee.

Back in 2010, a new feature was introduced for Cassandra 0.7: Secondary Indexes. Unlike a relational database index, it is an index associated to the values of a column which provides a key for all rows in a table. For example imagine the table “Users”. It has a primary index as “user_id”. Now imagine that you want to access the email of an user. You must search by ID and then collect the value that you are interested in (email, for example). This isn’t practical with any kind of performance. See the image below:

Screen Shot 2016-04-03 at 9.35.28 PM

Humans can fetch the information we want in a single step by making the inverse operation. Given an email, recover the ID. In terms of computing, how could we achieve a similar process in our database? Secondary indexes solve these scenarios by using email, fetching the user ID, which acts as an inverse function of the query. This feature can be powerful and practical. Applying secondary indexes for small sets of data makes queries extremely efficient. It’s pretty easy to create a secondary index in Cassandra:

It is also possible to define several secondary indexes by table since we are not limited to just one.


Secondary Index Problems

Unfortunately, this is not silver bullet. Since secondary indexes are managed locally and not globally as the regular indexes, they may lead to some problems. Especially if we are dealing with large datasets containing sparse and very distinct information. For our user email example, if we have a huge amount of users. We may have an equally huge amount of emails and they will all be unique. To handle with a secondary index at ’email’ column, Cassandra would have to ‘join’ all the information spread across the rings because this index is local. This may be harmful in terms of performance. If you have five instances, you would need five readings for a single query including five disk operations. On the other hand, for a regular index, this would not be necessary since ‘user_id’ is globally handled. Due to this reason, some developers criticise the use of secondary indexes.

Several strategies for secondary indexes have been proposed – [1](http://brianoneill.blogspot.com.br/2012/03/cassandra-indexing-good-bad-and-ugly.html) , [2](http://blog.websudos.com/2014/08/23/a-series-on-cassandra-part-2-indexes-and-keys/), [3](http://www.slideshare.net/edanuff/indexing-in-cassandra) and [4](https://dzone.com/articles/cassandra-indexing-good-bad) are all good examples.

To enhance this set, Apple recently open sourced its own secondary index strategy: SSTableAttachedSecondaryIndex, or simply [SASI](https://github.com/xedin/sasi). This strategy has been shown extremely efficient and powerful. By presenting robust and flexible queries, SASI manages CPU, I/O and memory properly in order to save computational resources.


SASI by Example

Let’s look at a simple example to understand how SASI works. At the moment there is Thrift and CQL3 support. For an arbitrary ‘foo’ keyspace:

Then we create ‘bar’ table:

We use “CREATE CUSTOM INDEX” command in order to create a secondary index with SASI:

Note the configuration ‘case_sensitive’: ‘false’. It may remind you of full text search tools. Take a look into the class [Analyzer](https://github.com/xedin/sasi/blob/master/src/java/org/apache/cassandra/db/index/sasi/analyzer/StandardA

nalyzer.java). SASI allows you to do some interesting customization for indexes. Take a look at the next example:

Here we create an index with a suffix analyser. That evaluates data by suffix and not by the entire term.
There is also the ‘SPARSE’ mode:

‘SPARSE’ mode allows us to make queries by timestamp intervals in an efficient way. If no mode is declared, ‘NORMAL’ is the default, providing exact case sensitive matches only.


Understanding SASI Basics

SASI is divided in two parts: Indexing and Search. Cassandra provides a distinction between memory and disk resources and SASI takes advantage of that, as well as write only, immutability and sorted data to perform properly. The built indexes are then flushed from a memtable to disk. SASI data structures are created in memory while SSTable begins the write process and flushed to disk before the process terminates. This is good for performance since writing the indexes requires only a sequential write on disk.

For each SSTable, SASI writes one index file for each indexed column. The file contents are built in memory by the class [OnDiskIndexBuilder](https://github.com/xedin/sasi/blob/master/src/java/org/apache/cassandra/db/index/sasi/disk/OnDiskIndexBuilder.java).

Once flushed to disk, the information will be read by [OnDiskIndex](https://github.com/xedin/sasi/blob/master/src/java/org/apache/cassandra/db/index/sasi/disk/OnDiskIndex.java), which contains an optimised search data structure. A key value mapping for the SSTable positions is created and then after the disk write, this information is stored in memory to speed up the index access.

When we perform a search, there is another group of classes that parses and interprets the indexes into a tree structure, optimises it and analyses what should be searched. To perform Unions and Intersections of data, SASI has a group of smart iterators, such as [RangeUnionIterator](https://github.com/xedin/sasi/blob/master/src/java/org/apache/cassandra/db/index/sasi/utils/RangeUnionIt

erator.java), that can perform union operations across hashes, reading only the minimum necessary of each set to satisfy query needs. There is also [RangeIntersectionIterator](https://github.com/xedin/sasi/blob/master/src/java/org/apache/cassandra/db/index/sasi/utils/RangeIntersec

tionIterator.java) for intersections.

Given this indexing and searching toolbox powered by optimised data structures, we can see that SASI is not magic. SASI juggles the data and performs smart operations taking advantage of Cassandra architecture and structure.

SASI has some limitations. For example, it requires that your cluster is configured to produce LongTokens (ex: Murmur3Partitioner). ByteOrderedPartitioner and RandomPartitioner do not work with SASI and only Cassandra on versions > 3.4 are supported.


Final Considerations

The main goal of this article was to provide a quick and general idea about Cassandra secondary indexes and present the basics from SASI. There is still work that needs to be done into this tool, of course. However, it has been proving itself as a robust and powerful alternative to the previous Secondary Index implementation. Visit the [official Github repository](https://github.com/xedin/sasi) for detailed information about the project, issues and releases.