In Cassandra Lunch #57: Using Secondary Indexes in Cassandra, guest speaker Anil Mittana presented on using Secondary Indexes in Cassandra. This blog post is to give an overview of the presentation. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!
Secondary Indexes in Cassandra
In this blog post, we will be writing an overview of the presentation given by Anil Mittana covering secondary indexes in Cassandra.
Query First Approach
In Cassandra, tables are created with the intention of facilitating future queries. For example, the following command in CQL would create a Table with the intention of querying for a rating by movie title. We won’t dive into the details here as this blog post is intended to be about the secondary indexes.
Secondary Indexes are meant to help facilitate queries on columns in a table that are not a part of the primary key. However, they can cause performance issues, especially if a query needs to access multiple nodes, and should be applied carefully. They should only be used on columns or tables that have low cardinality, do not contain a counter, are infrequently updated, or tables that do not have large partitions.
How Cassandra Stores an Index
When indexes are created, a hidden table is created in a background process. To query a secondary index the partition key and secondary index column should be included in order to be successful. By including the partition key and the secondary index column only one node will need to be queried.
Distributed Index vs. Local Index
Tables and materialized views are examples of distributed indexing. A table or view data structure is distributed across all nodes in a cluster based on a partition key. When retrieving data using a partition key, Cassandra knows exactly which replica nodes may contain the result. For example, given a 100-node cluster with the replication factor of 5, at most 5 replica nodes and 1 coordinator node are needed to participate in a query.
In contrast, secondary indexes are examples of local indexing. A secondary index is represented by many independent data structures that index data stored on each node. When retrieving data using only an indexed column, Cassandra has no way to determine which nodes may have necessary data and has to query all nodes in a cluster. For example, given a 100-node cluster with any replication factor, all 100 nodes have to search their local index data structures. This does not scale well.
Write and Read paths of Secondary Indexes
Whenever a mutation, writing to a table, is applied to a base table in memory (memtable), it is dispatched as notification to all registered indices on this table so that each index implementation can apply the necessary processing. Index memtable and base memtable will generally be flushed to SSTables at the same time but there is no strong guarantee of this behavior. Once flushed to disk, index data will have a different life-cycle than base data e.g. the index table may be compacted independently of base table compaction.
The local read path for native secondary index is quite straightforward. First Cassandra reads the index table to retrieve the primary key of all matching rows and for each of them, it will read the original table to fetch out the data.
Restricting the query to a single server.
All secondary index implementations work best when Cassandra can narrow down the number of nodes to query
Secondary indexes can be very helpful in analytics workloads (Spark batch jobs) where you don’t have an SLA that’s measured in milliseconds.
Secondary Indexes should not be used on columns that have high cardinality, a large number of unique values. Additionally, columns that have extremely low cardinality, such as a column storing booleans, are also not going to be particularly useful. Secondary indexes should not be used on tables that are frequently updated. Interestingly, Cassandra does not eliminate tombstones beyond 100 thousand cells. Once the tombstone limit is reached a query using the indexed value will fail. Secondary indexes should also be avoided in looking for values contained in a large partition unless the query is very narrow.
Problems and Limitations
Secondary Indexes do not support ranged queries ( WHERE Age > 18 ). They can only be used on equality queries. Also, maintaining indexes through hidden tables means they are going through a separate compaction process. . Independently compacting sstables and indexes means the location of the data and the index information are completely decoupled. If the data is compacted, a new sstable is written, and our index is now incorrect. This means we can’t simply point to a location on disk in an index because the location of the data can change.
There are two types of secondary indexes. Regular secondary index (2i) that uses hash tables to index data and supports equality (=) predicates. SSTable-attached secondary index (SASI) is an experimental and more efficient secondary index that uses B+ trees to index data and can support equality (=), inequality (<, <=, >, >=) and even text pattern matching (LIKE). However, SASI indexes are not currently supported in production.
Special thanks to Anil Mittana for putting together his presentation and speaking at Cassandra Lunch #57.
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!