Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark

6/17/2022

Reading time:5

Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark - Business Platform Team

This resource is based on an article originally published here.

In Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark, we discussed how we can do Cassandra ETL processes using Airflow and Spark. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!

In Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark, we show you how you can set up a basic Cassandra ETL process with Airflow and Spark. If you want to hear why we used the bash operator vs the Spark submit operator like in Data Engineer’s Lunch #25: Airflow and Spark, be sure to check out the live recording of Cassandra Lunch #53 below!

In this walkthrough, we will cover how we can use Airflow to trigger Spark ETL jobs that move data into and within Cassandra. This demo will be relatively simple; however, it can be expanded upon with the addition of other technologies like Kafka, setting scheduling on the Spark jobs to make it a concurrent process, or in general creating more complex Cassandra ETL pipelines. We will focus on showing you how to connect Airflow, Spark, and Cassandra, and in our case today, specifically DataStax Astra. The reason we are using DataStax Astra is that we want everyone to be able to do this demo without having to worry about OS incompatibilities and the sort. For that reason, we will also be using Gitpod, and thus the entire walkthrough can be done within your browser!

For this walkthrough, we will use 2 Spark jobs. The first Spark job will load 100k rows from a CSV and then write it into a Cassandra table. The second Spark job will read the data from the prior Cassandra table, do some transformations, and then write the transformed data into a different Cassandra table. We also used PySpark to reduce the number of steps to get this working. If we used Scala, we would be required to build the JAR’s and that would require more time. If you are interested in seeing how to use the Airflow Spark Submit Operator and run Scala Spark jobs, check out this walkthrough!

You can also do the walkthrough using this GitHub repo! As mentioned above, the live recording is embedded below if you want to watch the walkthrough live.

1. Set-up DataStax Astra

1.1 – Sign up for a free DataStax Astra account if you do not have one already

1.2 – Hit the `Create Database` button on the dashboard

1.3 – Hit the `Get Started` button on the dashboard

This will be a pay-as-you-go method, but they won’t ask for a payment method until you exceed $25 worth of operations on your account. We won’t be using nearly that amount, so it’s essentially a free Cassandra database in the cloud.

1.4 – Define your database

Database name: whatever you want
Keyspace name: whatever you want
Cloud: whichever GCP region applies to you.
Hit create database and wait a couple minutes for it to spin up and become active.

1.5 – Generate application token

Once your database is active, connect to it.
Once on dashboard/<your-db-name>, click the Settings menu tab.
Select Admin User for role and hit generate token.
COPY DOWN YOUR CLIENT ID AND CLIENT SECRET as they will be used by Spark

1.6 – Download `Secure Bundle`

Hit the Connect tab in the menu
Click on Node.js (doesn’t matter which option under Connect using a driver)
Download Secure Bundle
Drag-and-Drop the Secure Bundle into the running Gitpod container.

1.7 – Copy and paste the contents of `setup.cql` into the CQLSH terminal

2. Set up Airflow

We will be using the quick start script that Airflow provides here.

bash setup.sh

3. Start Spark in standalone mode

3.1 – Start master

./spark-3.0.1-bin-hadoop2.7/sbin/start-master.sh

3.2 – Start worker

Open port 8081 in the browser, copy the master URL, and paste in the designated spot below

./spark-3.0.1-bin-hadoop2.7/sbin/start-slave.sh <master-URL>

4. Move spark_dag.py to ~/airflow/dags

4.1 – Create ~/airflow/dags

mkdir ~/airflow/dags

4.2 – Move spark_dag.py

mv spark_dag.py ~/airflow/dags

5. Update the TODO’s in properties.config with your specific parameters

vim properties.conf

6, Open port 8080 to see Airflow UI and check if `example_cassandra_etl` exists.

If it does not exist yet, give it a few seconds to refresh.

7. Update Spark Connection, unpause the `example_cassandra_etl`, and drill down by clicking on `example_cassandra_etl`.

7.1 – Under the `Admin` section of the menu, select `spark_default` and update the host to the Spark master URL. Save once done.

7.2 – Select the `DAG` menu item and return to the dashboard. Unpause the `example_cassandra_etl`, and then click on the `example_cassandra_etl`link.

8. Trigger the DAG from the tree view and click on the graph view afterwards

9. Confirm data in Astra

9.1 – Check `previous_employees_by_job_title`

select * from <your-keyspace>.previous_employees_by_job_title where job_title='Dentist';

9.2 – Check `days_worked_by_previous_employees_by_job_title`

select * from <your-keyspace>.days_worked_by_previous_employees_by_job_title where job_title='Dentist';

And that will wrap up our walkthrough. Again, this is an introduction on how to set up a basic Cassandra ETL process run by Airflow and Spark. As mentioned above, these baby steps can be used to further expand and create more complex and scheduled/repeated Cassandra ETL processes run by Airflow and Spark. The live recording of Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark is embedded below, so if you want to watch the walkthrough live, be sure to check it out!

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!

Posted in Data & Analytics, Events | Comments Off on Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark

Related Articles

mongo

nocode

elasticsearch

GitHub - ibagroup-eu/Visual-Flow: Visual-Flow main repository

12/2/2024

mongo

nocode

elasticsearch

GitHub - ibagroup-eu/Visual-Flow: Visual-Flow main repository

12/2/2024

sstable

cassandra

spark

Spark and Cassandra’s SSTable loader

11/1/2024

analytics

cassandra

spark

GitHub - apache/cassandra-analytics: Apache cassandra

9/4/2024

cassandra

event.driven

spark

Build an Event-Driven Architecture with Apache Kafka, Apache Spark, and Apache Cassandra

8/3/2024

astra

cassandra

datastax.astra

Vector Databases Compared - Evaluating DataStax Astra DB Serverless (Vector) and Pinecone Vector Database

2/4/2024

etl

cassandra

Using Oracle GoldenGate for Big Data

12/9/2023

python

cassandra

spark

GitHub - andreia-negreira/Data_streaming_project: Data streaming project with robust end-to-end pipeline, combining tools such as Airflow, Kafka, Spark, Cassandra and containerized solution to easy deployment.

12/2/2023

python

cassandra

spark

GitHub - airscholar/e2e-data-engineering: An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

12/2/2023

hive

elasticsearch

cassandra

GitHub - embulk/embulk: Embulk: Pluggable Bulk Data Loader.

12/1/2023

Explore Further

cassandra.lunch

stargate

cassandra.lunch

cassandra

Apache Cassandra Lunch #87: Cassandra.api, Astra, and Stargate - Business Platform Team

7/8/2022

cqlsh

cassandra.lunch

cassandra

Apache Cassandra Lunch #77: Connect to DataStax Astra via Standalone CQLSH - Business Platform Team

7/2/2022

datastax

cassandra.basics

cassandra.lunch

Cassandra Lunch #75: Getting Started with DataStax Enterprise (DSE) on Docker - Business Platform Team

6/29/2022

cassandra.basics

cassandra.lunch

cassandra

Cassandra Lunch #70: Basics of Apache Cassandra - Business Platform Team

6/27/2022

etl

mongo

nocode

elasticsearch

GitHub - ibagroup-eu/Visual-Flow: Visual-Flow main repository

12/2/2024

mongo

nocode

elasticsearch

GitHub - ibagroup-eu/Visual-Flow: Visual-Flow main repository

12/2/2024

etl

cassandra

Using Oracle GoldenGate for Big Data

12/9/2023

hive

elasticsearch

cassandra

GitHub - embulk/embulk: Embulk: Pluggable Bulk Data Loader.

12/1/2023

cassandra

acid

open.source

cassandra

GitHub - pmcfadin/awesome-accord: Repository of all kinds of things to help you get up and running with ACID transactions on Apache Cassandra®

1/16/2025

mongo

nocode

elasticsearch

GitHub - ibagroup-eu/Visual-Flow: Visual-Flow main repository

12/2/2024

mongo

nocode

elasticsearch

GitHub - ibagroup-eu/Visual-Flow: Visual-Flow main repository

12/2/2024

migration

proxy

cassandra

GitHub - datastax/cql-proxy: A client-side CQL proxy/sidecar.

11/1/2024

1. Set-up DataStax Astra

1.1 – Sign up for a free DataStax Astra account if you do not have one already

1.2 – Hit the `Create Database` button on the dashboard

1.3 – Hit the `Get Started` button on the dashboard

1.4 – Define your database

1.5 – Generate application token

1.6 – Download `Secure Bundle`

1.7 – Copy and paste the contents of `setup.cql` into the CQLSH terminal

2. Set up Airflow

3. Start Spark in standalone mode

3.1 – Start master

3.2 – Start worker

4. Move spark_dag.py to ~/airflow/dags

4.1 – Create ~/airflow/dags

4.2 – Move spark_dag.py

5. Update the TODO’s in properties.config with your specific parameters

6, Open port 8080 to see Airflow UI and check if `example_cassandra_etl` exists.

7. Update Spark Connection, unpause the `example_cassandra_etl`, and drill down by clicking on `example_cassandra_etl`.

7.1 – Under the `Admin` section of the menu, select `spark_default` and update the host to the Spark master URL. Save once done.

7.2 – Select the `DAG` menu item and return to the dashboard. Unpause the `example_cassandra_etl`, and then click on the `example_cassandra_etl`link.

8. Trigger the DAG from the tree view and click on the graph view afterwards

9. Confirm data in Astra

9.1 – Check `previous_employees_by_job_title`

9.2 – Check `days_worked_by_previous_employees_by_job_title`

Cassandra.Link

Become part of our

growing community!

Planet Cassandra is a service for the Apache Cassandra® user community to share with each other. From tutorials and guides, to discussions and updates, we're here to help you get the most out of Cassandra. Connect with us and become part of our growing community today.

Get Involved with Planet Cassandra!

We believe that the power of the Planet Cassandra community lies in the contributions of its members. Do you have content, articles, videos, or use cases you want to share with the world?

1. Set-up DataStax Astra

1.1 – Sign up for a free DataStax Astra account if you do not have one already

1.2 – Hit the Create Database button on the dashboard

1.3 – Hit the Get Started button on the dashboard

1.4 – Define your database

1.5 – Generate application token

1.6 – Download Secure Bundle

1.7 – Copy and paste the contents of setup.cql into the CQLSH terminal

2. Set up Airflow

3. Start Spark in standalone mode

3.1 – Start master

3.2 – Start worker

4. Move spark_dag.py to ~/airflow/dags

4.1 – Create ~/airflow/dags

4.2 – Move spark_dag.py

5. Update the TODO’s in properties.config with your specific parameters

6, Open port 8080 to see Airflow UI and check if example_cassandra_etl exists.

7. Update Spark Connection, unpause the example_cassandra_etl, and drill down by clicking on example_cassandra_etl.

7.1 – Under the Admin section of the menu, select spark_default and update the host to the Spark master URL. Save once done.

7.2 – Select the DAG menu item and return to the dashboard. Unpause the example_cassandra_etl, and then click on the example_cassandra_etllink.

8. Trigger the DAG from the tree view and click on the graph view afterwards

9. Confirm data in Astra

9.1 – Check previous_employees_by_job_title

9.2 – Check days_worked_by_previous_employees_by_job_title

Cassandra.Link

Become part of our

growing community!

Planet Cassandra is a service for the Apache Cassandra® user community to share with each other. From tutorials and guides, to discussions and updates, we're here to help you get the most out of Cassandra. Connect with us and become part of our growing community today.

Get Involved with Planet Cassandra!

We believe that the power of the Planet Cassandra community lies in the contributions of its members. Do you have content, articles, videos, or use cases you want to share with the world?

1.2 – Hit the `Create Database` button on the dashboard

1.3 – Hit the `Get Started` button on the dashboard

1.6 – Download `Secure Bundle`

1.7 – Copy and paste the contents of `setup.cql` into the CQLSH terminal

6, Open port 8080 to see Airflow UI and check if `example_cassandra_etl` exists.

7. Update Spark Connection, unpause the `example_cassandra_etl`, and drill down by clicking on `example_cassandra_etl`.

7.1 – Under the `Admin` section of the menu, select `spark_default` and update the host to the Spark master URL. Save once done.

7.2 – Select the `DAG` menu item and return to the dashboard. Unpause the `example_cassandra_etl`, and then click on the `example_cassandra_etl`link.

9.1 – Check `previous_employees_by_job_title`

9.2 – Check `days_worked_by_previous_employees_by_job_title`