
Efficient Full Table Scans with ScyllaDB Tablets

“Tablets” data distribution makes full table scans on ScyllaDB more performant than ever.

Full scans are resource-intensive operations that read through an entire dataset. They’re often required by analytical queries such as counting total records, identifying users from specific regions, or deriving top-K rankings. This article describes how ScyllaDB’s shift to tablets significantly improves full scan performance and processing time, and how it eliminates the complex tuning heuristics often needed with the previous vNodes-based approach.

It’s been quite some time since we last touched on the subject of handling full table scans on ScyllaDB. Previously, Avi Kivity described how the CQL token() function could be used in a divide-and-conquer approach to maximize running analytics on top of ScyllaDB. We also provided sample Go code and demonstrated how easily and efficiently full scans could be done. With the recent introduction of tablets, it turns out that full scans are more performant than ever.

Token Ring Revisited

Prior to tablets, nodes in a ScyllaDB cluster owned fractions of the token ring, also known as token ranges. A token range is nothing more than a contiguous segment represented by two (very large) numbers. By default, each node used to own 256 ranges, also known as vNodes. When data gets written to the cluster, the Murmur3 hashing function is responsible for distributing data to the replicas of a given token range. A full table scan thus involved parallelizing several token ranges until clients eventually traversed the entire ring.

As a refresher, a scan involves iterating through multiple subranges (smaller vNode ranges) with the help of the token() function, like this:

SELECT ... FROM t WHERE token(key) >= ? AND token(key) < ?

To fully traverse the ring as fast as possible, clients needed to keep parallelism high enough (number of nodes x shard count x some smudge factor) to fully benefit from all available processing power. In other words, different cluster topologies would require different parallelism settings, which could often change as nodes got added or removed.

Traversing vNodes worked nicely, but the approach introduced some additional drawbacks:

- Sparse tables result in wasted work, because most token ranges contain little or no data.
- Popular and high-density ranges could require fine-grained tuning to prevent uneven load distribution and resource contention. Otherwise, they would be prone to processing bottlenecks and suboptimal utilization.
- It was impossible to scan a token range owned by a single shard, and particularly difficult to even scan a range owned by a single replica. This increased coordination overhead and created a performance ceiling on how fast a single token range could be processed.

The old way: system.size_estimates

To assist applications during range scans, ScyllaDB provided a node-local system.size_estimates table (something we inherited from Apache Cassandra) whose schema looks like this:

CREATE TABLE system.size_estimates (
    keyspace_name text,
    table_name text,
    range_start text,
    range_end text,
    mean_partition_size bigint,
    partitions_count bigint,
    PRIMARY KEY (keyspace_name, table_name, range_start, range_end)
)

Every token range owned by a given replica provides an estimated number of partitions along with a mean partition size. The product of both columns therefore provides a raw estimate of how much data needs to be retrieved if a scan reads through the entire range.
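For illustration, here is a minimal Go sketch (using the gocql driver) of how a scan planner might read these estimates. The contact point, the example table ks.t, and the rangeEstimate helper type are assumptions made for this example; since system.size_estimates is node-local, a real planner would repeat this query against every node in the cluster.

package main

import (
    "fmt"
    "log"

    "github.com/gocql/gocql"
)

// rangeEstimate is an illustrative helper, not part of any driver API.
type rangeEstimate struct {
    start, end     string
    estimatedBytes int64
}

func main() {
    // Assumption: one node reachable at 127.0.0.1. Because the estimates are
    // node-local, this query would be repeated for every node in the cluster.
    cluster := gocql.NewCluster("127.0.0.1")
    session, err := cluster.CreateSession()
    if err != nil {
        log.Fatal(err)
    }
    defer session.Close()

    iter := session.Query(
        `SELECT range_start, range_end, partitions_count, mean_partition_size
           FROM system.size_estimates
          WHERE keyspace_name = ? AND table_name = ?`, "ks", "t").Iter()

    var start, end string
    var partitions, meanSize int64
    var ranges []rangeEstimate
    for iter.Scan(&start, &end, &partitions, &meanSize) {
        // partitions_count * mean_partition_size is the raw size estimate
        // for this vNode range.
        ranges = append(ranges, rangeEstimate{start, end, partitions * meanSize})
    }
    if err := iter.Close(); err != nil {
        log.Fatal(err)
    }
    for _, r := range ranges {
        fmt.Printf("range (%s, %s] ~ %d bytes\n", r.start, r.end, r.estimatedBytes)
    }
}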
This design works nicely for small clusters and when data isn’t changing frequently. Since the data is node-local, an application in charge of the full scan would be required to keep track of 256 vNodes × number of nodes entries to submit its queries. Therefore, larger clusters could introduce higher processing overhead. Even then, (as the table name suggests) the number of partitions and their sizes are just estimates, which can be under- or overestimated. Underestimating a token range’s size makes a scan more prone to timeouts, particularly when its data contains a few large partitions alongside many smaller keys. Overestimating it means a scan may take longer to complete due to wasted cycles while scanning through sparse ranges.

Parsing the system.size_estimates table’s data is precisely what connectors like Trino and Spark do when you integrate them with either Cassandra or ScyllaDB. To address estimate skews, these tools often allow you to manually tune settings like split-size in a trial-and-error fashion until it somewhat works for your workload. The rationale works like this:

- Clients parse the system.size_estimates data from every node in the cluster (since vNodes are non-overlapping ranges, this fully describes the ring distribution)
- The size of a specific range is determined by partitionsCount * meanPartitionSize
- The client then calculates the estimated number of partitions and the total size of the table to be scanned
- Each vNode range is evenly split into subranges, taking its corresponding ring fraction into account
- Subranges are parallelized across workers and routed to natural replicas as an additional optimization

Finally, prior to tablets there was no deterministic way to scan a particular range and target a specific ScyllaDB shard. vNodes have no 1:1 token/shard mapping, meaning a single coordinator request would often need to communicate with other replica shards, making it particularly easy to introduce CPU contention.

A layer of indirection: system.tablets

Starting with ScyllaDB 2024.2, tablets are production ready. Tablets are the foundation behind ScyllaDB’s elasticity, while also effectively addressing the drawbacks involved with full table scans under the old vNode structure. In case you missed it, I highly encourage you to watch Avi Kivity’s talk on Tablets: Rethinking Replication for an in-depth understanding of how tablets evolved from the previous static vNode topologies. During his talk, Avi mentions that tablets are implemented as a layer of indirection mapping a token range to a (replica, shard) tuple. This layer of indirection is exposed in ScyllaDB as the system.tablets table, whose schema looks like this:

CREATE TABLE system.tablets (
    table_id uuid,
    last_token bigint,
    keyspace_name text STATIC,
    resize_seq_number bigint STATIC,
    resize_type text STATIC,
    table_name text STATIC,
    tablet_count int STATIC,
    new_replicas frozen<list<frozen<tuple<uuid, int>>>>,
    replicas frozen<list<frozen<tuple<uuid, int>>>>,
    session uuid,
    stage text,
    transition text,
    PRIMARY KEY (table_id, last_token)
)

A tablet represents a contiguous token range owned by a group of replicas and shards. Unlike the previous static vNode topology, tablets are created on a per-table basis and get dynamically split or merged on demand.
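To make the indirection concrete, the self-contained Go sketch below shows how a client could turn rows read from system.tablets into per-tablet token ranges and group them by their owning (replica, shard) pairs. The host IDs, shards, and token values are made-up samples standing in for a real driver query; each tablet owns the range between the previous tablet’s last_token (exclusive) and its own last_token (inclusive).

package main

import (
    "fmt"
    "math"
)

// replicaShard identifies one owner of a tablet: a host ID plus a shard.
type replicaShard struct {
    hostID string
    shard  int
}

// tablet mirrors the information a client reads from system.tablets:
// the tablet's last token and its (replica, shard) owners.
type tablet struct {
    lastToken int64
    replicas  []replicaShard
}

// tokenRange is a contiguous (start, end] segment of the ring.
type tokenRange struct {
    start, end int64
}

func main() {
    // Made-up rows, as if read from system.tablets ordered by last_token.
    tablets := []tablet{
        {lastToken: -3074457345618258603, replicas: []replicaShard{{"host-a", 0}, {"host-b", 1}}},
        {lastToken: 3074457345618258602, replicas: []replicaShard{{"host-b", 0}, {"host-c", 1}}},
        {lastToken: math.MaxInt64, replicas: []replicaShard{{"host-a", 1}, {"host-c", 0}}},
    }

    // Each tablet owns (previous last_token, its own last_token]. Group the
    // ranges by owner so workers can target a single replica/shard directly.
    perOwner := make(map[replicaShard][]tokenRange)
    start := int64(math.MinInt64)
    for _, t := range tablets {
        r := tokenRange{start: start, end: t.lastToken}
        for _, owner := range t.replicas {
            perOwner[owner] = append(perOwner[owner], r)
        }
        start = t.lastToken
    }

    for owner, ranges := range perOwner {
        fmt.Printf("%s/shard-%d owns %d range(s): %v\n", owner.hostID, owner.shard, len(ranges), ranges)
    }
}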
Per-table creation and dynamic resizing matter because workloads may vary significantly:

- Some are very throughput intensive over frequently accessed (and small) data sets and will have fewer tablets. These take less time to scan.
- Others may become considerably storage bound over time, spanning multiple terabytes (or even petabytes) of disk space. These take longer to scan.

A single tablet targets a geometric average size of 5GB before it gets split. Therefore, splits are done when a tablet reaches 10GB and merges at 2.5GB. Note that the average size is configurable, and the default might change in the future. Scanning each tablet-owned range therefore allows full scans to place a deterministic upper bound on how much data they read. The only exception to this rule is when very large (larger than the average) partitions are involved, although this is an edge case.

Consider the following set of operations: we start by defining that tables within the ks keyspace should start with 128 tablets each. After we create table t, observe that the tablet_count matches what we set upfront. If we had asked for a number that is not a power of two, the tablet_count would be rounded to the next power of two. The tablet_count represents the total number of tablets across the cluster, and the replicas column holds the host ID/shard tuples that are replicas of each tablet, matching our defined replication factor.

Therefore, the previous logic can be optimized like this:

- Clients parse the system.tablets table and retrieve the existing tablet distribution
- Tablet ranges spanning the same replica-shards get grouped and split together (as sketched earlier)
- Workers route requests to natural replica/shard endpoints via shard awareness by setting a routingKey for every request

Tablet-based full scans benefit greatly from these improvements. By directly querying specific shards, we eliminate the cost of cross-CPU and cross-node communication. Traversing the ring is not only more efficient, it also effectively removes the problem of sparse ranges and of different tuning logic for small and large tables. Finally, given that a tablet has a predetermined size, long gone are the days of fine-tuning split sizes!

Example

This GitHub repo contains boilerplate code demonstrating how to carry out these tasks efficiently. The process involves splitting tablets into smaller pieces of work and scheduling them evenly across their corresponding replica/shards. The scheduler keeps every replica shard busy with at least 2 in-flight requests, and the least-loaded replica always consumes pending work for processing. The code also simulates real-world latency variability by introducing some jitter during each request’s processing.

Conclusion

This is just the beginning of our journey with tablets. The logic explained in this blog is provided for application builders to follow as part of their full scan jobs. It is worth mentioning that the previous vNode technique is backward compatible and still works if you use tablets. Remember that full scans often require reading through lots of data, so we highly recommend using BYPASS CACHE to avoid invalidating important cached rows. Furthermore, ScyllaDB Workload Prioritization helps with isolation and ensures that latencies from concurrent workloads are kept low. Happy scanning!