Apache Cassandra Lunch #57: Using Secondary Indexes in Cassandra

6/20/2022

This resource is based on an article originally published here.

In Cassandra Lunch #57: Using Secondary Indexes in Cassandra, guest speaker Anil Mittana presented on using secondary indexes in Cassandra. This blog post gives an overview of the presentation. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!

Secondary Indexes in Cassandra

This post summarizes Anil Mittana's presentation on secondary indexes in Cassandra.

Query First Approach

In Cassandra, tables are designed around the queries they will serve. For example, the following CQL command creates a table intended for looking up a movie's rating by its title. We won't dive into the details here, as this blog post is about secondary indexes.

Image of a CQL statement creating a table with the intention of searching for ratings of a movie by movie title.

Secondary Indexes

Secondary indexes help facilitate queries on columns that are not part of a table's primary key. However, they can cause performance issues, especially when a query needs to access multiple nodes, so they should be applied carefully. They are best suited to columns that have relatively low cardinality, are not counters, and are infrequently updated, in tables that do not have large partitions.

How Cassandra Stores an Index

When an index is created, Cassandra builds a hidden table for it in a background process. For a secondary index query to perform well, it should include both the partition key and the indexed column. With both present, only one node needs to be queried.

Secondary Index table comparison to a regular Cassandra table.
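
As a rough illustration of the hidden table described above, the Python sketch below builds a mapping from each value of an indexed column to the primary keys of the rows that hold it. The table contents, the column names, and the build_hidden_index helper are all hypothetical, and the real on-disk layout of a Cassandra index differs; this is only a mental model.

```python
from collections import defaultdict

# Hypothetical base table held by one node: primary key -> row.
base_table = {
    "user-1": {"name": "Ana", "country": "PT"},
    "user-2": {"name": "Bruno", "country": "BR"},
    "user-3": {"name": "Carla", "country": "PT"},
}


def build_hidden_index(base_rows, column):
    """Map each value of `column` to the primary keys of the rows holding it.

    This mirrors the idea of a hidden index table: the indexed value acts
    as the lookup key and the base table's primary keys are the payload.
    """
    index = defaultdict(list)
    for primary_key, row in base_rows.items():
        index[row[column]].append(primary_key)
    return dict(index)


country_index = build_hidden_index(base_table, "country")
print(country_index)
```

The index only stores primary keys, so answering a query still requires a second lookup in the base table.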

Distributed Index vs. Local Index

Tables and materialized views are examples of distributed indexing. A table or view data structure is distributed across all nodes in a cluster based on a partition key. When retrieving data using a partition key, Cassandra knows exactly which replica nodes may contain the result. For example, given a 100-node cluster with the replication factor of 5, at most 5 replica nodes and 1 coordinator node are needed to participate in a query.
In contrast, secondary indexes are examples of local indexing. A secondary index is represented by many independent data structures that index data stored on each node. When retrieving data using only an indexed column, Cassandra has no way to determine which nodes may have necessary data and has to query all nodes in a cluster. For example, given a 100-node cluster with any replication factor, all 100 nodes have to search their local index data structures. This does not scale well.
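
The node counts in the 100-node example can be sketched in Python as follows. The placement rule used here (hash the partition key, then take consecutive node numbers) is a deliberate simplification and an assumption of this sketch, not Cassandra's actual partitioner or replication strategy; the function names are hypothetical.

```python
import hashlib


def replicas_for(partition_key, cluster_size, replication_factor):
    """Return the node ids that hold a partition under a simplified ring.

    Assumption: replicas are the `replication_factor` consecutive nodes
    starting at the node chosen by hashing the partition key.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    first = int.from_bytes(digest[:8], "big") % cluster_size
    return [(first + i) % cluster_size for i in range(replication_factor)]


def nodes_for_index_only_query(cluster_size):
    """Without a partition key, every node must search its local index."""
    return list(range(cluster_size))


cluster_size, replication_factor = 100, 5
by_key = replicas_for("The Matrix", cluster_size, replication_factor)
by_index = nodes_for_index_only_query(cluster_size)
print(len(by_key), "replicas (plus 1 coordinator) for a partition-key query")
print(len(by_index), "nodes for an index-only query")
```

The first count stays fixed at the replication factor as the cluster grows, while the second grows with the cluster, which is why index-only queries do not scale well.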

Write and Read paths of Secondary Indexes

Whenever a mutation (a write to a table) is applied to the base table in memory (the memtable), it is dispatched as a notification to all indexes registered on that table so that each index implementation can apply the necessary processing. The index memtable and the base memtable will generally be flushed to SSTables at the same time, but there is no strong guarantee of this behavior. Once flushed to disk, index data has a different life cycle than base data; for example, the index table may be compacted independently of the base table.

The local read path for a native secondary index is quite straightforward. First, Cassandra reads the index table to retrieve the primary keys of all matching rows; then, for each of them, it reads the original table to fetch the data.
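
The write and read paths described above can be modeled on a single node with a short Python sketch. The Node and LocalIndex classes are hypothetical names invented for illustration. The sketch keeps everything in memory, ignores flushing and compaction, and removes stale index entries eagerly on overwrite, which is a simplifying assumption rather than a statement about how Cassandra cleans up its indexes.

```python
from collections import defaultdict


class LocalIndex:
    """In-memory stand-in for one node's index on a single column."""

    def __init__(self, column):
        self.column = column
        self.entries = defaultdict(set)  # indexed value -> primary keys

    def on_mutation(self, primary_key, old_row, new_row):
        # Simplification: drop the entry for the overwritten value eagerly.
        if old_row is not None and self.column in old_row:
            self.entries[old_row[self.column]].discard(primary_key)
        if self.column in new_row:
            self.entries[new_row[self.column]].add(primary_key)


class Node:
    """In-memory stand-in for one node's base table and its indexes."""

    def __init__(self):
        self.memtable = {}  # primary key -> row
        self.indexes = []

    def register_index(self, index):
        self.indexes.append(index)

    def write(self, primary_key, row):
        # Write path: apply the mutation, then notify every registered index.
        old_row = self.memtable.get(primary_key)
        self.memtable[primary_key] = row
        for index in self.indexes:
            index.on_mutation(primary_key, old_row, row)

    def read_by_index(self, index, value):
        # Read path: index lookup first, then one base-table read per key.
        keys = sorted(index.entries.get(value, ()))
        return [self.memtable[key] for key in keys]


node = Node()
country_index = LocalIndex("country")
node.register_index(country_index)
node.write("u1", {"name": "Ana", "country": "PT"})
node.write("u2", {"name": "Bruno", "country": "BR"})
node.write("u3", {"name": "Carla", "country": "PT"})
print(node.read_by_index(country_index, "PT"))
```

Every indexed read costs one index lookup plus one base-table read per matching row, which is part of why indexes that match many rows become expensive.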

Use Cases

Restricting the query to a single server: all secondary index implementations work best when Cassandra can narrow down the number of nodes to query.
Secondary indexes can be very helpful in analytics workloads (Spark batch jobs) where you don’t have an SLA that’s measured in milliseconds.

Anti-Patterns

Secondary indexes should not be used on columns that have high cardinality, that is, a large number of unique values. Columns with extremely low cardinality, such as a column storing booleans, are also not going to be particularly useful. Secondary indexes should not be used on columns that are frequently updated or deleted, because those operations generate tombstones. Cassandra stores tombstones in the index until the tombstone limit of 100,000 cells is reached; once that limit is exceeded, a query using the indexed value will fail. Secondary indexes should also be avoided when looking for values contained in a large partition unless the query is very narrow.

Problems and Limitations

Regular secondary indexes do not support range queries (for example, WHERE age > 18); they can only be used in equality queries. Also, maintaining indexes through hidden tables means they go through a separate compaction process. Compacting SSTables and indexes independently means the location of the data and the index information are completely decoupled. If the data is compacted, a new SSTable is written, and any index entry that pointed to the old location on disk would now be wrong. This is why an index cannot simply point to a location on disk: the location of the data can change.

SASI Indexes

There are two types of secondary indexes. The regular secondary index (2i) uses hash tables to index data and supports equality (=) predicates. The SSTable-attached secondary index (SASI) is an experimental and more efficient secondary index that uses B+ trees to index data and can support equality (=), inequality (<, <=, >, >=), and even text pattern matching (LIKE). However, SASI indexes are still experimental and are not currently supported for production use.
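
The difference in supported predicates follows from the data structures involved, and a small Python analogy may help. Below, a dict stands in for a hash-based index and a sorted list searched with bisect stands in for an ordered structure such as a B+ tree. Neither is how Cassandra implements 2i or SASI; the data and function names are hypothetical.

```python
import bisect

# Hypothetical data: user id -> age.
ages_by_user = {"u1": 17, "u2": 25, "u3": 42, "u4": 25}

# Hash-style index: each value maps directly to the keys that hold it.
hash_index = {}
for user, age in ages_by_user.items():
    hash_index.setdefault(age, []).append(user)

# Ordered index: (value, key) pairs kept in sorted order.
sorted_index = sorted((age, user) for user, age in ages_by_user.items())
sorted_ages = [age for age, _ in sorted_index]


def equal_to(age):
    """Equality lookup: a single probe into the hash-style index."""
    return hash_index.get(age, [])


def greater_than(age):
    """Range lookup: binary search for the boundary, then scan forward."""
    start = bisect.bisect_right(sorted_ages, age)
    return [user for _, user in sorted_index[start:]]


print(equal_to(25))
print(greater_than(18))
```

A hash lookup has no notion of neighboring values, so answering age > 18 with the dict alone would mean checking every entry; the ordered structure finds the boundary once and reads forward from there.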

Resources

Special thanks to Anil Mittana for putting together his presentation and speaking at Cassandra Lunch #57.

https://docs.datastax.com/

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!




