Escaping Disco-Era Data Modeling – Redux
Original posted on Planet Cassandra, 2014-10-21
After spending 11 years answering questions in the Cassandra tag on Stack Overflow, I have seen many folks asking about data modeling. I believe much of the confusion out there stems from the fact that data modeling for Apache Cassandra® requires a different approach than other database technologies.
The majority of new Cassandra users have their understanding of database modeling grounded in experience with relational databases. Additionally, trends in developer circles indicate that the industry as a whole is moving toward a stance of “not caring” about the database layer.
The result is that many folks start to build a Cassandra data model with the goal of finding the most efficient way to store their data. But when that fails to meet their query requirements, they look to secondary indexes in a vain attempt to satisfy those access patterns.
This approach usually manifests itself in questions like this:
“Here is my schema:“
CREATE TABLE chats (
    id bigint,
    ad_id bigint,
    from_d_email text,
    from_email text,
    from_id text,
    message_body text,
    message_datetime text,
    message_id text,
    message_type text,
    to_d_email text,
    to_email text,
    to_id text,
    PRIMARY KEY (message_id, message_datetime));

CREATE INDEX user_fromid ON chats (from_id);
CREATE INDEX user_toid ON chats (to_id);
CREATE INDEX user_adid ON chats (ad_id);
“Why doesn’t my SELECT query work?“
SELECT * FROM chats WHERE from_id='aa23' AND to_id='thx1138' ORDER BY message_datetime DESC;
Or (for the sake of argument) replace the ORDER BY clause with ALLOW FILTERING and ask:
“Why is my SELECT query so slow?“
This model does a great job of storing the data in a logical way, keyed by a unique identifier (message_id) and clustered on a date/time column. That works well, right up until you want to query it by something other than message_id. This is the point when multiple secondary indexes get added onto the table.
The problem with Cassandra secondary indexes (and, if we’re being honest, distributed secondary indexes in any database) is that the data is distributed across the cluster according to its partition key. Querying by anything other than that key therefore requires exhaustively polling every node in the cluster. That’s why Cassandra refuses such queries outright, unless you create a secondary index. This is where network latency enters the equation.
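To make that concrete, here is an illustrative query against the chats table above, filtering on a column (message_type) that is neither part of the key nor indexed:

```sql
-- Illustrative: filtering on a non-key, non-indexed column.
-- Cassandra rejects this rather than silently scanning every node;
-- it fails with an InvalidRequest error suggesting ALLOW FILTERING.
SELECT * FROM chats WHERE message_type = 'reply';
```

Appending ALLOW FILTERING makes the query run, but it does so by doing exactly the expensive cluster-wide scan that Cassandra was trying to protect you from.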
The unfortunate truth is that secondary indexes are designed for “convenience.” That is, the convenience of not having to build a specialized table to serve the query. But the trouble starts when they are used as a fix for the wrong types of problems.
So when is it ok to use a secondary index?
Secondary indexes in Cassandra tend to work better on columns with a “middle of the road” level of cardinality. Consider a table for tracking online orders. A query pulling back orders with a status of “pending” might work acceptably (if there are several other possible statuses), as long as the cluster isn’t too big.
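As a sketch of that scenario (the orders table and its column names here are hypothetical, invented for illustration):

```sql
-- Hypothetical table for tracking online orders.
CREATE TABLE orders (
    order_id uuid,
    order_datetime timestamp,
    customer_id text,
    status text,     -- a handful of values: 'pending', 'shipped', etc.
    total decimal,
    PRIMARY KEY (order_id));

-- "Middle of the road" cardinality: several distinct values,
-- each spread across many rows.
CREATE INDEX orders_by_status ON orders (status);

-- Workable on a small cluster; still a cluster-wide operation on a large one:
SELECT * FROM orders WHERE status = 'pending';
```

Note that even in this favorable case, the index query still fans out across nodes; it is merely tolerable at small scale, not fast.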
Unfortunately, conditions where secondary index use is deemed appropriate depend on several factors and are not abundantly clear. However, the conditions where their use is inappropriate are well-known:
- Large clusters (more nodes, more network traffic)
- High-cardinality columns (near-unique values)
- Low-cardinality columns (e.g., boolean values)
Cassandra queries work best when given a specific, precise row key. The use of additional, differently-keyed tables populated with the same (redundant) data will perform faster than a query which uses a secondary index. Therefore, the proper way to solve this problem is to create a table to match each query. This is known as query-based modeling.
Here is one way to design a table to suit the afore-mentioned query:
CREATE TABLE mydata.chats_by_fromid_and_toid (
    id bigint,
    ad_id bigint,
    from_d_email text,
    from_email text,
    from_id text,
    message_body text,
    message_datetime text,
    message_id text,
    message_type text,
    to_d_email text,
    to_email text,
    to_id text,
    PRIMARY KEY ((from_id, to_id), message_datetime, message_id));
With this solution, the data will be partitioned on from_id and to_id, and clustered by message_datetime. The message_id column is added to the PRIMARY KEY definition to ensure uniqueness. This allows the user’s query to be keyed on from_id and to_id, while still supporting an ORDER BY on the clustering column message_datetime.
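The original query can then be rewritten against the new table, reusing the example values from above:

```sql
-- Keyed on the full partition key, so this reads a single partition
-- instead of polling the whole cluster:
SELECT * FROM mydata.chats_by_fromid_and_toid
WHERE from_id = 'aa23' AND to_id = 'thx1138'
ORDER BY message_datetime DESC;
```

No secondary index, no ALLOW FILTERING; the table’s key structure does all the work.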
From the perspective of answering questions on Stack Overflow, the reality is that it is hard to change the way people think within the bounds of a single conversation. Here is a typical response to the above suggested solution:
“I know my original design somewhat violates the whole NoSQL concept, but it saves a lot of disk space.“
That is a very 1970s way of thinking. Relational database theory originated at a time when disk space was expensive. In 1975, some vendors were selling disk space at a staggering eleven thousand dollars (USD) per megabyte (depending on the vendor and model). Even in 1980, if you wanted to buy a gigabyte’s worth of storage space, you could still expect to spend around a million dollars. Today (2023), you can buy a terabyte drive for forty bucks. Disk space is cheap; operation time is the expensive part.
The important takeaway is that the overuse of secondary indexes will increase your operation time. They are not something to be relied upon. With Cassandra, you should take a query-based modeling approach. While that may result in storage of redundant data…that’s ok.
In summary, we have managed to free ourselves from bell-bottoms, disco, and shag carpeting. Now it is time that we also free ourselves from the notion that our data models need to be normalized to the max, as well as from the assumption that all data storage technologies are engineered to work with normalized structures.
With Apache Cassandra, it is essential to model your data to the expected query patterns. If you do that properly, then you shouldn’t even need a secondary index…let alone three or four of them.