Why: We had a lot of very useful data in our warehouse and wanted to take advantage of it in some of our production services to enhance the user experience. We chose to serve it from Cassandra for all its pros, which I’m not going to get into in this blog.
As a first stage, we went about writing a Spark-to-Cassandra exporter. It’s pretty simple and only a couple of lines; a sketch is below.
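For illustration, a minimal sketch of what that first exporter looked like, using the spark-cassandra-connector’s DataFrame writer. The source path and the keyspace/table names (ks.events) are hypothetical, and the target table is assumed to already exist:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cassandra-export").getOrCreate()

// Hypothetical source: read the warehouse data from HDFS.
val df = spark.read.parquet("hdfs:///warehouse/events")

// Batched writes to Cassandra via the spark-cassandra-connector.
df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "events"))
  .mode("append")
  .save()
```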
This works, and took around 30 minutes to write ~150 million rows. But once our services went live, we saw read latencies climb noticeably during the bulk-insert window.
The spark-cassandra-connector we are using has a few configs for tuning writes. We tried a bunch of tuning along the lines of reducing spark.cassandra.output.concurrent.writes and spark.cassandra.output.throughput_mb_per_sec. They helped a bit, but there was still a clear increase in read latency.
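For reference, a sketch of the kind of tuning we tried. The keys are the connector’s output settings; the values here are illustrative, not the ones we settled on:

```scala
import org.apache.spark.sql.SparkSession

// Dial down the connector's write pressure. The defaults are 5 concurrent
// batches and unthrottled throughput; the values below are illustrative.
val spark = SparkSession.builder()
  .config("spark.cassandra.output.concurrent.writes", "2")
  .config("spark.cassandra.output.throughput_mb_per_sec", "10")
  .getOrCreate()
```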
Cassandra ships with sstableloader, and we thought of testing it for this case. So we changed the code to use it, and saw barely any notable read latency during the task (only a slight increase in the 99th percentile, caused by IO waits).
Also, if you look at the network graph, the traffic is only on “network in”, since we now generate SSTables in Spark and then push them directly to Cassandra. The last spike in the network graph below is from the SSTable method; the rest are from the batched writes.
Now let’s get into how to do that in code,
- Using CQLSSTableWriter, build the SSTables per Spark partition (a sketch of all the pieces follows this list)
- We need to define the CREATE and INSERT statements, but it’s easy to build them from the Spark DataFrame schema
- And a script to stream the SSTables to Cassandra. We pick a random Cassandra server and stream the SSTables to it; the host is chosen at random for better load balancing of the network traffic.
- And finally, the code that runs it all,
- For the number of partitions: Cassandra’s suggestion is that SSTables be several tens of megabytes large to minimize the cost of compacting, so we size the partitions for a max of 256 MB per SSTable. “sizeInMB” can be calculated from the data’s footprint on HDFS.
- Say the size is 60 GB: we get 240 partitions, each producing one SSTable of ~256 MB.
- Set this config “mapreduce.output.bulkoutputformat.streamthrottlembits” to throttle traffic to Cassandra.
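Putting the pieces above together, here is a minimal, self-contained sketch of the flow. It is illustrative, not the production code: all names (SSTableExportSketch, exportPartition, ks.events, the /tmp staging path) are made up, the type mapping covers only a few primitives, and it streams by shelling out to the sstableloader CLI, whose -t flag throttles in Mbits and plays the same role as the streamthrottlembits setting above (that setting applies when streaming through Cassandra’s Hadoop bulk output format). The real exporter is linked at the end of the post.

```scala
import java.io.File
import java.util.UUID

import scala.collection.JavaConverters._
import scala.sys.process._
import scala.util.Random

import org.apache.cassandra.io.sstable.CQLSSTableWriter
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types._

object SSTableExportSketch {

  // Map Spark SQL types to CQL types; extend this for your own schema.
  // Non-primitive types (timestamps, decimals, ...) also need value
  // conversions before addRow, which this sketch skips.
  private def cqlType(dt: DataType): String = dt match {
    case StringType  => "text"
    case IntegerType => "int"
    case LongType    => "bigint"
    case DoubleType  => "double"
    case BooleanType => "boolean"
    case other       => sys.error(s"no CQL mapping for $other")
  }

  // Build the CREATE TABLE and INSERT statements from the DataFrame schema.
  def statements(df: DataFrame, keyspace: String, table: String,
                 partitionKey: String): (String, String) = {
    val fields  = df.schema.fields
    val colDefs = fields.map(f => s"${f.name} ${cqlType(f.dataType)}").mkString(", ")
    val create  = s"CREATE TABLE $keyspace.$table ($colDefs, PRIMARY KEY ($partitionKey))"
    val insert  = s"INSERT INTO $keyspace.$table (${fields.map(_.name).mkString(", ")}) " +
      s"VALUES (${fields.map(_ => "?").mkString(", ")})"
    (create, insert)
  }

  // Write one Spark partition into SSTables, then stream them to a random host.
  def exportPartition(rows: Iterator[Row], create: String, insert: String,
                      keyspace: String, table: String,
                      hosts: Seq[String], throttleMbits: Int): Unit = {
    // sstableloader expects the data directory path to end in <keyspace>/<table>.
    val dir = new File(s"/tmp/sstable-export/${UUID.randomUUID()}/$keyspace/$table")
    dir.mkdirs()

    val writer = CQLSSTableWriter.builder()
      .inDirectory(dir)
      .forTable(create)
      .using(insert)
      .withBufferSizeInMB(256) // flush roughly 256 MB SSTables, per the sizing note
      .build()
    try rows.foreach(r => writer.addRow(r.toSeq.map(_.asInstanceOf[AnyRef]).asJava))
    finally writer.close()

    // Random host for better load balancing of the streaming traffic;
    // sstableloader's -t flag throttles the stream in Mbits.
    val host = Random.shuffle(hosts).head
    val exit = Seq("sstableloader", "-d", host, "-t", throttleMbits.toString,
      dir.getAbsolutePath).!
    require(exit == 0, s"sstableloader exited with $exit")
  }

  // Driver: size the partitions so each produces roughly one 256 MB SSTable.
  def run(df: DataFrame, sizeInMB: Long, keyspace: String, table: String,
          partitionKey: String, hosts: Seq[String], throttleMbits: Int): Unit = {
    val numPartitions    = math.max(1, math.ceil(sizeInMB / 256.0).toInt)
    val (create, insert) = statements(df, keyspace, table, partitionKey)
    df.repartition(numPartitions).foreachPartition { rows: Iterator[Row] =>
      exportPartition(rows, create, insert, keyspace, table, hosts, throttleMbits)
    }
  }
}
```

With sizeInMB measured from HDFS (61,440 for the 60 GB example), run(df, 61440, ...) repartitions into 240 chunks, one SSTable each.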
Fyi,
- SSTables have to be at least several tens of megabytes in size to minimize the cost of compacting the partitions on the server side.
- This method increases IO wait, since it writes directly to disk rather than to memory first, as regular Cassandra writes do. Depending on the data size and throughput, you may need SSDs with high IOPS.
We’ve been using this method in production for over six months now, writing around 300 million rows in under 30 minutes without any impact on read latencies.
Full example code can be found here: https://github.com/therako/sparkles/blob/master/src/main/scala/util/cassandra/SSTableExporter.scala