Apache Cassandra Lunch #49: Spark SQL for Cassandra Data Operations

6/14/2022


This resource is based on an article originally published here.

In Apache Cassandra Lunch #49: Spark SQL for Cassandra Data Operations, we discuss how to use Spark SQL for Cassandra data operations. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!

In the walkthrough below, we use the Catalog method from DataStax’s Spark Cassandra Connector. We could use Spark’s SQL shell instead, but it produces more log output, so the Spark Scala shell is more succinct for our purposes.

The walkthrough covers three things: basic Cassandra schema commands, basic Cassandra data operations (Cassandra to Cassandra), and basic Cassandra data operations (source file to Cassandra). There are some limitations to what we can do with Spark SQL for Cassandra data operations; you can find out more in the live recording of Cassandra Lunch #49 embedded below.

Walkthrough

You can follow along below or use the README.md in this GitHub repository.

Prerequisites

Docker
Spark 3.0.X

1. Set Up Dockerized Apache Cassandra

1.1 – Clone repo and cd into it

git clone https://github.com/Anant/example-cassandra-spark-sql.git

cd example-cassandra-spark-sql

1.2 – Start Apache Cassandra Container and Mount Directory

docker run --name cassandra -p 9042:9042 -d -v "$(pwd)":/example-cassandra-spark-sql cassandra:latest

1.3 – Run `cqlsh`

docker exec -it cassandra cqlsh

1.4 – Run `setup.cql`

source '/example-cassandra-spark-sql/setup.cql'

2. Start Spark Shell

2.1 – Navigate to the Spark directory and start the master in standalone cluster mode

./sbin/start-master.sh

2.2 – Start worker and point it at the master

You can find your Spark master URL in the master web UI at localhost:8080; it has the form spark://<host>:7077.

./sbin/start-slave.sh <master-url>

2.3 – Start Spark Shell

./bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 \
--master <spark-master-url> \
--conf spark.cassandra.connection.host=127.0.0.1 \
--conf spark.cassandra.connection.port=9042 \
--conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions \
--conf spark.sql.catalog.cassandra=com.datastax.spark.connector.datasource.CassandraCatalog
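
The last --conf line above registers a catalog under the name that follows spark.sql.catalog. (here, cassandra), which is why every table in the rest of the walkthrough is addressed as cassandra.<keyspace>.<table>. A minimal plain-Scala sketch of that naming rule follows; the helper qualified is hypothetical and exists only for illustration.

```scala
// The catalog name is whatever follows "spark.sql.catalog." in the --conf key.
val confKey = "spark.sql.catalog.cassandra"
val catalogName = confKey.stripPrefix("spark.sql.catalog.")

// Hypothetical helper (not part of Spark or the connector):
// builds the three-part identifier used in the spark.sql(...) calls.
def qualified(catalog: String, keyspace: String, table: String): String =
  s"$catalog.$keyspace.$table"

val tableName = qualified(catalogName, "demo", "previous_employees_by_job_title")
println(tableName)
```

If you registered the catalog under a different name, the first part of every table identifier in the later steps would change to match.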

3. Basic Cassandra Schema Commands

We will cover some basic Cassandra schema commands that we can run with Spark SQL. More on this can be found here.

3.1 – Create Table

spark.sql("CREATE TABLE cassandra.demo.testTable (key_1 Int, key_2 Int, key_3 Int, cc1 STRING, cc2 String, cc3 String, value String) USING cassandra PARTITIONED BY (key_1, key_2, key_3) TBLPROPERTIES (clustering_key='cc1.asc, cc2.desc, cc3.asc', compaction='{class=SizeTieredCompactionStrategy,bucket_high=1001}')")
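
The clustering_key property above packs each clustering column and its sort direction into one string. As a rough reading aid, the sketch below uses a hypothetical helper (not part of the connector) to spell that string out as the CLUSTERING ORDER BY clause familiar from CQL. Treating a missing direction as ascending is an assumption made for this sketch.

```scala
// Hypothetical helper: rewrite a clustering_key property value such as
// "cc1.asc, cc2.desc" as a CQL-style CLUSTERING ORDER BY clause.
def clusteringOrder(prop: String): String =
  prop.split(",").map(_.trim).map { entry =>
    entry.split("\\.") match {
      case Array(col, dir) => s"$col ${dir.toUpperCase}"
      case Array(col)      => s"$col ASC" // assumption: ascending when no direction is given
    }
  }.mkString("CLUSTERING ORDER BY (", ", ", ")")

val clause = clusteringOrder("cc1.asc, cc2.desc, cc3.asc")
println(clause)
```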

3.2 – Alter Table

spark.sql("ALTER TABLE cassandra.demo.testTable ADD COLUMNS (newCol INT)")

spark.sql("DESCRIBE TABLE cassandra.demo.testTable").show

3.3 – Drop Table

spark.sql("DROP TABLE cassandra.demo.testTable")

spark.sql("SHOW TABLES FROM cassandra.demo").show

4. Basic Data Operations (Cassandra to Cassandra)

4.1 – Read

Perform a basic read

spark.sql("SELECT * FROM cassandra.demo.previous_employees_by_job_title").show

4.2 – Write

Write data from one table into another, using SQL functions to compute the number of days worked.

spark.sql("INSERT INTO cassandra.demo.days_worked_by_previous_employees_by_job_title SELECT job_title, employee_id, employee_name, abs(datediff(last_day, first_day)) AS number_of_days_worked FROM cassandra.demo.previous_employees_by_job_title")
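
Spark SQL's datediff(end, start) returns the number of days from start to end, and abs makes the result independent of which date comes first. A minimal plain-Scala sketch of the same arithmetic follows, using java.time rather than Spark; the function name numberOfDaysWorked and the sample dates are illustrative only.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Mirrors abs(datediff(last_day, first_day)) for plain calendar dates.
def numberOfDaysWorked(firstDay: LocalDate, lastDay: LocalDate): Long =
  math.abs(ChronoUnit.DAYS.between(firstDay, lastDay))

// Illustrative dates, not taken from the demo data set.
val forward  = numberOfDaysWorked(LocalDate.parse("2020-01-01"), LocalDate.parse("2020-01-31"))
val reversed = numberOfDaysWorked(LocalDate.parse("2020-01-31"), LocalDate.parse("2020-01-01"))
println(s"$forward $reversed") // prints "30 30"
```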

4.3 – Joins

Join data from two tables.

spark.sql("""
SELECT p.job_title, p.employee_name, p.first_day, p.last_day, d.number_of_days_worked
FROM cassandra.demo.previous_employees_by_job_title AS p
LEFT JOIN cassandra.demo.days_worked_by_previous_employees_by_job_title AS d ON p.employee_id = d.employee_id
WHERE d.job_title = 'Dentist'
""").show

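One detail of the join query is worth noting: because the WHERE clause tests a column from the right-hand table, rows with no match (NULL on the right) are filtered out, so the query returns the same rows an inner join would. The sketch below models that with plain Scala collections. The case classes and rows are made up for illustration, and the sketch assumes at most one right-hand row per employee_id.

```scala
// Hypothetical rows; names and values are illustrative, not from the demo data set.
case class Employee(employeeId: Int, jobTitle: String, name: String)
case class WorkedRow(employeeId: Int, jobTitle: String, days: Long)

val employees = Seq(
  Employee(1, "Dentist", "A"),
  Employee(2, "Nurse", "B"),
  Employee(3, "Dentist", "C")
)
// Employee 3 has no row on the right-hand side.
val workedRows = Seq(
  WorkedRow(1, "Dentist", 120L),
  WorkedRow(2, "Nurse", 45L)
)

// LEFT JOIN: every employee is kept; a missing match becomes None (SQL NULL).
val leftJoined: Seq[(Employee, Option[WorkedRow])] =
  employees.map(e => (e, workedRows.find(_.employeeId == e.employeeId)))

// WHERE right.job_title = 'Dentist': a missing right side can never satisfy the predicate.
val filtered = leftJoined.filter { case (_, w) => w.exists(_.jobTitle == "Dentist") }

// INNER JOIN with the same filter, for comparison.
val innerJoined = for {
  e <- employees
  w <- workedRows if w.employeeId == e.employeeId && w.jobTitle == "Dentist"
} yield (e, w)

println(filtered.map(_._1.employeeId))    // List(1)
println(innerJoined.map(_._1.employeeId)) // List(1)
```

To keep unmatched employees in the result, the job-title condition would need to move into the ON clause or test the left-hand table instead.
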
5. Truncate tables with `CQLSH`

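In the cqlsh session from step 1.3, empty both tables so that the file load in the next section starts from a clean state:

TRUNCATE TABLE demo.previous_employees_by_job_title;

TRUNCATE TABLE demo.days_worked_by_previous_employees_by_job_title;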

6. Basic Data Operations (Source File to Cassandra)

6.1 – Restart Spark Shell

./bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 \
--master <spark-master-url> \
--conf spark.cassandra.connection.host=127.0.0.1 \
--conf spark.cassandra.connection.port=9042 \
--conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions \
--conf spark.sql.catalog.cassandra=com.datastax.spark.connector.datasource.CassandraCatalog \
--files /path/to/example-cassandra-spark-sql/previous_employees_by_job_title.csv

6.2 – Load CSV data into a DataFrame

val csv_df = spark.read.format("csv").option("header", "true").load("/path/to/example-cassandra-spark-sql/previous_employees_by_job_title.csv")

6.3 – Create temp view to use Spark SQL

csv_df.createOrReplaceTempView("source")

6.4 – Write into Cassandra table using Spark SQL

spark.sql("INSERT INTO cassandra.demo.previous_employees_by_job_title SELECT * FROM source")

That wraps up our basic walkthrough on Spark SQL for Cassandra data operations. Again, if you want to watch the demo, be sure to check out the embedded live recording below! Also, if you missed last week’s Apache Cassandra Lunch #48: Airflow and Cassandra, be sure to check that out as well!

Resources

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!
