February 28th, 2013

Bulk loading options for Cassandra was created by Robert Coli.

Cassandra’s bulk loading interfaces are most useful in two use cases: initial migration or restore from another datastore or cluster, and regular ETL. Bulk loading assumes that it’s important to you to avoid loading via the Thrift interface, for example because you haven’t integrated a client library yet or because throughput is critical.

There are two alternative techniques used for bulk loading into Cassandra: “copy-the-sstables” and sstableloader. Copying the sstables is a filesystem level operation, while sstableloader utilizes Cassandra’s internal streaming system. Neither is without disadvantages; the best choice depends on your specific use case. If you are using Counter columnfamilies, neither method has been extensively tested and you are safer writing via thrift.

The key to understanding bulk-loading throughput is that it depends significantly on the nature of the operation, on the configuration of the source and target clusters, and on factors such as the number of sstables, sstable size, and tolerance for potentially duplicate data. As a minor note, sstableloader in 1.1 is slightly improved over the (freshly rewritten) version in 1.0. [1]

Below are good cases for and notable aspects of each strategy.

Copy-the-sstables/“nodetool refresh” can be useful if:

  1. Your target cluster is not running, or if it is running, is not sensitive to latency from bulk loading at “top speed” and associated operations.
  2. You are willing to de-duplicate sstable names, either manually or with a tool, and are willing to figure out where to copy them to in any case other than copy-all-to-all. You are willing to run cleanup and/or major compaction, and understand that some disk space is wasted until you do. [2]
  3. You don’t want to deal with the potential failure modes of streaming, which are especially bad in non-LAN deploys including EC2.
  4. You are restoring in a case where RF=N, because you can just copy one node’s data to all nodes in the new RF=N cluster and start the cluster without bootstrapping (auto_bootstrap: false in cassandra.yaml).
  5. The sstables you want to import are a different version than the target cluster currently creates. Example: loading -hc- (1.0) sstables into a -hd- (1.1) cluster with sstableloader is reported not to work. [3]
  6. You have your source sstables in something like S3, which can easily parallelize copies to all target nodes. Transfer between S3 and EC2 is fast and free, which is close to the best case for the inefficiency of the copy stage.
  7. You want to increase RF on a running cluster, and are ok with running cleanup and/or major compaction after you do.
  8. You want to restore from a cluster with RF=[x] to a cluster whose RF is the same or smaller and whose size is a multiple of [x]. Example: when restoring a 9-node RF=3 cluster to a 3-node RF=3 cluster, you copy three source nodes’ worth of sstables to each target node.
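
The name de-duplication mentioned in item 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a tested tool: the helper plan_renames is hypothetical, and it assumes component files are named <prefix>-<version>-<generation>-<Component> (for example Users-hc-3-Data.db), which you should check against the files your Cassandra version actually writes. It only plans new names; it does not touch the filesystem.

```python
import re

# Assumed filename layout: <prefix>-<version>-<generation>-<Component>,
# e.g. "Users-hc-3-Data.db". Verify this against your own data directory.
SSTABLE_RE = re.compile(
    r"^(?P<prefix>.+)-(?P<version>[a-z]+)-(?P<gen>\d+)-(?P<component>[^-]+)$"
)


def plan_renames(incoming, existing):
    """Map each incoming file name to a name whose generation is unused on the target."""
    highest = {}  # prefix -> highest generation already present on the target
    for name in existing:
        m = SSTABLE_RE.match(name)
        if m:
            prefix = m.group("prefix")
            highest[prefix] = max(highest.get(prefix, 0), int(m.group("gen")))

    assigned = {}  # (prefix, version, old generation) -> new generation
    plan = {}
    for name in sorted(incoming):
        m = SSTABLE_RE.match(name)
        if not m:
            continue  # not recognised as an sstable component; left out of the plan
        prefix, version = m.group("prefix"), m.group("version")
        old = (prefix, version, int(m.group("gen")))
        if old not in assigned:
            # All components of one sstable share the same new generation.
            highest[prefix] = highest.get(prefix, 0) + 1
            assigned[old] = highest[prefix]
        plan[name] = f"{prefix}-{version}-{assigned[old]}-{m.group('component')}"
    return plan
```

For example, plan_renames(["Users-hc-3-Data.db", "Users-hc-3-Index.db"], ["Users-hc-7-Data.db"]) maps both incoming files to generation 8, so they no longer collide with anything already on the target. You would apply the plan while copying, then load the new files (for instance with “nodetool refresh”).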

sstableloader/JMX “bulkLoad” can be useful if:

  1. You have a running target cluster, and want the bulk loading to respect, for example, streaming throttle limits.
  2. You don’t have access to the data directory on your target cluster, and/or JMX to call “refresh” on it.
  3. Your replica placement strategy on the target cluster is so different from the source that the overhead of understanding where to copy sstables to is unacceptable, and/or you don’t want to call cleanup on a superset of sstables.
  4. You have limited network bandwidth between the source of sstables and the target(s). In this case, copying a superset of sstables around is especially inefficient.
  5. Your infrastructure makes it easy to temporarily copy sstables to a set of sstableloader nodes or nodes on which you call “bulkLoad” via JMX. These nodes are either non-cluster-member hosts that can otherwise participate in the cluster as pseudo-members from an access perspective, or cluster members with sufficient headroom to bulk load.
  6. You can tolerate the potential data duplication and/or operational complexity which results from the fragility of streaming. LAN is best case here. A notable difference between “bulkLoad” and sstableloader is that “bulkLoad” does not have sstableloader’s “--ignore” option, which means you can’t tell it to ignore replica targets on failure. [4]
  7. You understand that, because this approach uses streaming, which works on a per-sstable basis and respects a throughput cap, your performance is bounded in terms of ability to parallelize or burst, despite “bulk” loading.
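
To make the bound in item 7 concrete, here is a rough back-of-the-envelope sketch in Python. The function name and parameters are hypothetical, and it assumes the cap is expressed in megabits per second with one megabit taken as 1,000,000 bits; Cassandra’s own accounting may differ, so treat the result as an optimistic lower bound rather than a prediction.

```python
def min_transfer_seconds(total_bytes, throttle_megabits_per_sec):
    """Lower bound on the time to stream total_bytes through one capped sender."""
    if throttle_megabits_per_sec <= 0:
        raise ValueError("throttle must be positive")
    # Assumption: 1 megabit = 1,000,000 bits; 8 bits per byte.
    bytes_per_sec = throttle_megabits_per_sec * 1_000_000 / 8
    return total_bytes / bytes_per_sec
```

Under these assumptions, streaming 100 GB (100 * 10**9 bytes) from a single node capped at 400 megabits per second cannot finish in less than 2000 seconds, however the sstables are arranged. Real loads will take longer because of per-sstable overhead and any retries.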