Top 15 Apache Cassandra Best Practices Checklist. There have been some issues reported around insecurity in NoSQL databases – Cassandra, Elastic, Mongo. Hence, the aim of this article is to outline Cassandra best practices so you can secure your Apache Cassandra clusters. In my experience, most vulnerabilities stem from how Cassandra is deployed and managed rather than from inherent security bugs in the software. Let's start!
Cassandra delivers a highly reliable data storage engine for applications that require immense scale.
You can create well-organized, high-performance Cassandra clusters through careful data analysis and modeling. But you also need to follow a basic tuning checklist to ensure that the cluster comes up and runs without early hiccups.
Review the 15-point Cassandra checklist below to meet your desired goals.
Because tables cannot be joined in Cassandra, you should merge related data into a single denormalized table when a query needs data from more than one table. In Cassandra table design, denormalization is the key, since Cassandra supports neither joins nor derived tables. It is also important to design for how data will be distributed around the cluster.
When it comes to sorting in Cassandra, it can only be done on the clustering columns of the primary key.
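As a sketch of both points (the table and column names here are hypothetical, chosen only for illustration), a denormalized table can pre-join user and order data so a single query serves one access pattern, with a clustering column providing the sort order:

```sql
-- Hypothetical denormalized table: user data is duplicated into the
-- orders table so "orders for a user" needs no join.
CREATE TABLE orders_by_user (
    user_id    uuid,
    order_time timestamp,
    order_id   uuid,
    user_name  text,       -- duplicated from a users table
    total      decimal,
    PRIMARY KEY ((user_id), order_time, order_id)
) WITH CLUSTERING ORDER BY (order_time DESC);

-- Rows within each partition come back already sorted by order_time.
SELECT * FROM orders_by_user WHERE user_id = 9f2b6c1a-0000-0000-0000-000000000001;
```

Note that the sort order is fixed per partition by the clustering column; you cannot ORDER BY an arbitrary column at query time.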
Cassandra distributes its data across the nodes, and Cassandra drivers use built-in, token-aware algorithms to direct each request to the right node.
Adding a load balancer introduces an additional layer that breaks the driver's intelligent routing and creates a single point of failure — and there should be no single point of failure in the Cassandra world.
In Cassandra, secondary indexes are local to each node, so querying one can touch many nodes in the cluster and cause performance trouble. If you do want to use a secondary index, use it only occasionally and only on a low-cardinality column. Don't use one on a high-cardinality column, and in general try to avoid secondary indexes.
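As an illustrative sketch (table and index names are hypothetical), a secondary index on a low-cardinality column looks like this:

```sql
-- Hypothetical example: a secondary index on a low-cardinality column.
CREATE TABLE users (
    user_id uuid PRIMARY KEY,
    name    text,
    country text          -- low cardinality: a bounded set of values
);

CREATE INDEX users_by_country ON users (country);

-- The index lets you filter without a partition key, but the query
-- still fans out to multiple nodes, so use this pattern sparingly.
SELECT * FROM users WHERE country = 'NL';
```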
A full table scan can cause extreme heap pressure because Cassandra distributes partitions among all the nodes in the cluster. In a large cluster with billions of rows, a full table scan becomes real trouble. You can tweak the data model so queries never need to scan the whole table, for better performance and no bottlenecks.
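To illustrate the difference (the `events` table and its `sensor_id` key are hypothetical), compare a scan with a partition-restricted query:

```sql
-- Full table scan: touches every partition on every node (avoid).
SELECT * FROM events;

-- Partition-restricted query: served only by the replicas that own
-- this one partition, so the rest of the cluster is untouched.
SELECT * FROM events WHERE sensor_id = 7c1d2e3f-0000-0000-0000-000000000002;
```

A data model that supports the second form for every read path is the goal.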
You should keep partitions within 100 MB to keep the heap healthy and compaction smooth. The practical upper limit for a partition is two billion cells, but long before that, creating large partitions puts more pressure on the heap and slows compaction.
Limiting the partition sizes can enhance the cluster’s performance and deliver optimized and better results.
You should avoid using batches for bulk loading when multiple partition keys are involved, as doing so puts significant pressure on the coordinator node and degrades performance. A batch is appropriate when you need to keep a set of denormalized tables in sync — so use batches for that, and avoid bulk loading with them.
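The legitimate use of a batch — keeping denormalized tables in sync — can be sketched like this (both table names are hypothetical):

```sql
-- Logged batch keeping two denormalized views of the same user in
-- sync: the same data, keyed two different ways for two query paths.
BEGIN BATCH
    INSERT INTO users_by_id    (user_id, email, name)
        VALUES (8a1b2c3d-0000-0000-0000-000000000003, 'a@example.com', 'Ada');
    INSERT INTO users_by_email (email, user_id, name)
        VALUES ('a@example.com', 8a1b2c3d-0000-0000-0000-000000000003, 'Ada');
APPLY BATCH;
```

Here the batch buys atomicity across the two writes, not throughput.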
You can create a decent Cassandra model that limits the partition size, distributes data across cluster nodes, and minimizes the partitions returned from the query.
You need to ensure that your Cassandra model design follows these patterns and helps you achieve your desired results with finesse.
You need to ensure symmetrical distribution of partition keys to minimize the pressure on the nodes.
You can choose partition keys with a bounded number of possible values for increased performance, and keep partition sizes between roughly 10 MB and 100 MB, as discussed in the partition-size practice above.
Also, minimize the number of partitions read by a single query, because reading more partitions gets expensive when each partition may reside on a separate node.
For each partition requested, the coordinator issues separate commands to separate nodes. This adds overhead and increases the variation in latency.
The tables in Cassandra have a set of columns termed the primary key. The primary key also shapes the data structure and determines the uniqueness of the row.
The primary key has two parts: the partition key and the clustering key (or clustering columns). The first column, or first set of columns, in the primary key is the partition key, and it has great importance.
The clustering columns are the columns that follow the partition key. They are optional, unlike the partition key, which is required. The clustering key determines the default sort order of rows within each partition.
During the design process, make sure the partition key distributes data across the nodes of the cluster, and avoid keys with a small domain of possible values such as school grades, status, or gender. The number of distinct partition key values should be much higher than the number of cluster nodes, and you should avoid keys with highly skewed value distributions.
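The anatomy described above can be sketched in one table definition (names are hypothetical):

```sql
-- (sensor_id, day) is the composite partition key: it determines how
-- data is spread across the cluster, and bucketing by day keeps any
-- one partition from growing without bound.
-- reading_time is the clustering column: it sets the sort order of
-- rows inside each partition.
CREATE TABLE readings_by_sensor (
    sensor_id    uuid,
    day          date,
    reading_time timestamp,
    value        double,
    PRIMARY KEY ((sensor_id, day), reading_time)
);
```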
Use prepared statements whenever you execute a query with the same structure multiple times. Cassandra parses the query string once and caches the prepared statement.
The next time you need the query, you bind new variables to the cached prepared statement. This improves performance by bypassing the parsing phase for each execution.
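Preparation itself happens through the driver's API, but the statement shape is plain CQL with bind markers (the table here is hypothetical):

```sql
-- Prepared once by the driver; only the bind markers (?) change per
-- execution, so the server skips re-parsing the statement each time.
SELECT user_id, name FROM users_by_email WHERE email = ?;
```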
Using an IN clause with a large number of values across different partitions puts significant pressure on the coordinator node and degrades its performance. If the coordinator node fails to process the query because of excessive load, you need to retry the entire thing.
Instead, you can issue separate queries per partition: this bypasses the single point of failure and spreads the pressure across multiple coordinator nodes.
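A quick sketch of the two shapes (the `users` table is hypothetical):

```sql
-- One query fanning out from a single coordinator (avoid for large
-- lists of partition keys):
SELECT * FROM users WHERE user_id IN (?, ?, ?, ?);

-- Better: one query per partition key, each routed to its own
-- replicas, spreading load across coordinators:
SELECT * FROM users WHERE user_id = ?;
```

Drivers can run the individual queries concurrently, so the split usually costs little in wall-clock time.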
The leveled compaction strategy can ensure that about 90% of reads are served from a single sorted string table (SSTable), provided rows are of uniform size. It is great for read-latency-sensitive, read-heavy use cases, but it causes more compaction and requires more I/O during compaction.
It's best to choose the leveled compaction strategy at table creation, because once the table is created, changing the approach later is tricky.
It can be changed later, but one mistake can overload the nodes with too much I/O.
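Setting the strategy at creation time is a one-line table option (the table is hypothetical):

```sql
-- Leveled compaction chosen up front, at table creation.
CREATE TABLE users_by_email (
    email   text PRIMARY KEY,
    user_id uuid,
    name    text
) WITH compaction = {'class': 'LeveledCompactionStrategy'};
```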
You need to limit the number of tables in the cluster to avoid the excessive memory and heap pressure that degrades performance. A number of tables beyond a reasonable limit can overwhelm the cluster.
It's hard to pin down the exact right number of tables, but tests suggest keeping the count around 200 as a warning level and treating 500 as a failure level that should not be crossed.
You should use local consistency levels in a multi-datacenter environment, so responses can be served without the latency of inter-datacenter communication.
Local consistency levels won't fit every use case, but where the use case permits, prefer them in multi-datacenter environments.
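In cqlsh, the consistency level can be set for a session like this (the query's table is hypothetical; drivers expose an equivalent per-statement setting):

```sql
-- Scope reads and writes to a quorum of replicas in the local
-- datacenter only, avoiding cross-datacenter round trips.
CONSISTENCY LOCAL_QUORUM;
SELECT * FROM users WHERE user_id = 8a1b2c3d-0000-0000-0000-000000000003;
```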
You should avoid creating queue-like data models because they generate many tombstones. Slice queries against such tables are suboptimal, since they must scan through the tombstones to filter for a match.
This increases heap pressure and latency, because the query wades through garbage data to spot the small amount of live data it can actually use.
Great effort! We have covered the top 15 Apache Cassandra best practices.
Now that you know the checklist for handling a Cassandra cluster and sidestepping common difficulties, it's time to update your own checklist and add any missing points. These practices can increase your efficiency and help you achieve better results managing Cassandra.