
11/1/2024


Spark and Cassandra’s SSTable loader


This resource is based on an article originally published here.

Arunkumar
May 13, 2018


Why: We had a lot of very useful data in our warehouse and wanted to take advantage of it in some of our production services to enhance the user experience. So we chose to serve it from Cassandra for all its pros, which I'm not going to get into in this blog.

As a first stage, we wrote a spark-cassandra exporter. It's pretty simple, only a couple of lines.
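The original snippet is not reproduced on this page. A minimal sketch of such an exporter, assuming the DataFrame API of the spark-cassandra-connector, might look like the following. The object name `CassandraRowExporter` and the helper `writeOptions` are hypothetical, and the target table is assumed to exist already.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical exporter: writes each DataFrame row to Cassandra through
// the spark-cassandra-connector data source.
object CassandraRowExporter {

  // Options the connector's data source reads to locate the target table.
  def writeOptions(keyspace: String, table: String): Map[String, String] =
    Map("keyspace" -> keyspace, "table" -> table)

  // Assumes the table already exists and its columns match the DataFrame.
  def export(df: DataFrame, keyspace: String, table: String): Unit =
    df.write
      .format("org.apache.spark.sql.cassandra")
      .options(writeOptions(keyspace, table))
      .mode(SaveMode.Append)
      .save()
}
```

The Cassandra contact points would be set separately, for example through `spark.cassandra.connection.host` in the Spark configuration.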

This worked and took around 30 minutes to write ~150 million rows. But once our services went live, we saw read latencies climb during the bulk insertion.

Latencies during Cassandra row writes

The spark-cassandra-connector we are using here has a few configs that can be used to tune writes. We tried a bunch of tuning along the lines of reducing the number of concurrent writes and reducing throughput_mb_per_sec. It helped a bit, but there was still a clear increase in read latency.
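As a rough illustration of that kind of tuning, the settings could be applied as below. The property names are the ones used by 2.x releases of the connector (newer releases rename the throughput setting, so check the reference for your version), and the values are placeholders rather than the ones we ended up with.

```scala
import org.apache.spark.SparkConf

object WriteTuning {
  // Illustrative values only: fewer concurrent batches per task and a
  // cap on write throughput per core, to leave headroom for reads.
  val conf: SparkConf = new SparkConf()
    .set("spark.cassandra.output.concurrent.writes", "2")
    .set("spark.cassandra.output.throughput_mb_per_sec", "5")
}
```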

Cassandra ships with sstableloader, and we thought of testing it for this case. So we changed the code to use it and saw barely any notable increase in read latency during the load (only a slight increase in the 99th percentile, caused by IO waits).

Latencies during Cassandra SSTable loads

Also, if you look at the network graph, the traffic is only on “network in”, since we now generate the SSTables in Spark and push them directly to Cassandra. The last spike in the network graph below is from the SSTable method; the rest are from batched writes.

Network Traffic (Row writes vs SSTable load)

Now let’s get into how to do that in code:

  • Using CQLSSTableWriter, build the SSTables for each Spark partition.
  • We need to define the create and insert statements, but it’s easy to build those from the Spark dataframe.
  • Then stream the SSTables to Cassandra. We pick a random Cassandra server and stream the SSTable to it; choosing the host at random balances the network traffic better.
  • And finally, the code that runs it all.
  • For the number of partitions: Cassandra suggests that SSTables be at least several tens of megabytes in size to minimize the cost of compaction, and we use a maximum of 256 MB per SSTable. “sizeInMB” can be calculated from HDFS.
  • Let’s say the size is 60 GB: we will then have 240 SSTables of 256 MB each (60 × 1024 / 256).
  • Set this config “mapreduce.output.bulkoutputformat.streamthrottlembits” to throttle traffic to Cassandra.
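The steps above can be sketched roughly as follows. This is an illustration under several assumptions, not the code from the linked repository: the object and method names are hypothetical, the CREATE TABLE statement is passed in rather than derived from the dataframe, the `cassandra-all` library is on the executor classpath, and streaming is done by shelling out to the `sstableloader` command-line tool (assumed to be installed on every executor) with its `-t` throttle flag instead of the Hadoop config key mentioned above.

```scala
import java.nio.file.Files
import scala.sys.process._
import scala.util.Random
import org.apache.cassandra.io.sstable.CQLSSTableWriter
import org.apache.spark.sql.DataFrame

object SSTableExportSketch {
  val MaxSSTableSizeMB = 256

  // One Spark partition per 256 MB of input, rounded up.
  def numPartitions(sizeInMB: Long): Int =
    math.max(1, math.ceil(sizeInMB.toDouble / MaxSSTableSizeMB).toInt)

  // INSERT statement with one bind marker per dataframe column.
  def insertStatement(keyspace: String, table: String, columns: Seq[String]): String = {
    val markers = columns.map(_ => "?").mkString(", ")
    s"INSERT INTO $keyspace.$table (${columns.mkString(", ")}) VALUES ($markers)"
  }

  // sstableloader invocation: -d = initial host, -t = throttle in Mbit/s.
  def loaderCommand(host: String, throttleMbits: Int, dir: String): Seq[String] =
    Seq("sstableloader", "-d", host, "-t", throttleMbits.toString, dir)

  // createStatement must be a keyspace-qualified CREATE TABLE matching df's columns.
  def export(df: DataFrame, keyspace: String, table: String, createStatement: String,
             hosts: Seq[String], sizeInMB: Long, throttleMbits: Int): Unit = {
    val insert = insertStatement(keyspace, table, df.columns.toSeq)
    df.repartition(numPartitions(sizeInMB)).rdd.foreachPartition { rows =>
      if (rows.hasNext) {
        // sstableloader expects a directory path ending in <keyspace>/<table>.
        val dir = Files.createTempDirectory("sstables").resolve(keyspace).resolve(table)
        Files.createDirectories(dir)
        val writer = CQLSSTableWriter.builder()
          .inDirectory(dir.toString)
          .forTable(createStatement)
          .using(insert)
          .build()
        try {
          rows.foreach { row =>
            // Values are passed in column order, boxed as Java objects.
            val values: java.util.List[AnyRef] =
              java.util.Arrays.asList(row.toSeq.map(_.asInstanceOf[AnyRef]): _*)
            writer.addRow(values)
          }
        } finally writer.close()
        // Pick a random node so the streaming load is spread across the cluster.
        val host = hosts(Random.nextInt(hosts.size))
        val exit = loaderCommand(host, throttleMbits, dir.toString).!
        require(exit == 0, s"sstableloader exited with code $exit")
      }
    }
  }
}
```

With a 60 GB input, `numPartitions(60L * 1024)` gives 240 partitions. Note that the writer may flush more than one SSTable per partition depending on its buffer size, and that the Spark column types must map to Java types the writer accepts for the table's CQL types.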

FYI:

  • SSTables have to be at least several tens of megabytes in size to minimize the cost of compacting the partitions on the server side.
  • This method increases IO wait, since it writes directly to disk rather than to memory as regular Cassandra writes do. Depending on the size of the data and the throughput, you need an SSD with high IOPS.

We’ve been using this method in production for over 6 months now, writing around 300 million rows in under 30 minutes without any impact on read latencies.

Full example code can be found here: https://github.com/therako/sparkles/blob/master/src/main/scala/util/cassandra/SSTableExporter.scala

