In Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark, we discussed how we can do Cassandra ETL processes using Airflow and Spark. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!
In Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark, we show you how you can set up a basic Cassandra ETL process with Airflow and Spark. If you want to hear why we used the bash operator vs the Spark submit operator like in Data Engineer’s Lunch #25: Airflow and Spark, be sure to check out the live recording of Cassandra Lunch #53 below!
In this walkthrough, we will cover how we can use Airflow to trigger Spark ETL jobs that move data into and within Cassandra. This demo will be relatively simple; however, it can be expanded upon by adding other technologies like Kafka, scheduling the Spark jobs so the pipeline runs on a recurring basis, or building more complex Cassandra ETL pipelines in general. We will focus on showing you how to connect Airflow, Spark, and Cassandra, and in our case today, specifically DataStax Astra. The reason we are using DataStax Astra is that we want everyone to be able to do this demo without having to worry about OS incompatibilities and the like. For that reason, we will also be using Gitpod, and thus the entire walkthrough can be done within your browser!
For this walkthrough, we will use 2 Spark jobs. The first Spark job will load 100k rows from a CSV and then write them into a Cassandra table. The second Spark job will read the data from that Cassandra table, do some transformations, and then write the transformed data into a different Cassandra table. We also used PySpark to reduce the number of steps needed to get this working. If we used Scala, we would need to build the JARs, and that would require more time. If you are interested in seeing how to use the Airflow Spark Submit Operator and run Scala Spark jobs, check out this walkthrough!
You can also do the walkthrough using this GitHub repo! As mentioned above, the live recording is embedded below if you want to watch the walkthrough live.
1. Set up DataStax Astra
1.1 – Sign up for a free DataStax Astra account if you do not have one already
1.2 – Hit the Create Database button on the dashboard
1.3 – Hit the Get Started button on the dashboard
This will be a pay-as-you-go method, but they won’t ask for a payment method until you exceed $25 worth of operations on your account. We won’t be using nearly that amount, so it’s essentially a free Cassandra database in the cloud.
1.4 – Define your database
- Database name: whatever you want
- Keyspace name: whatever you want
- Cloud: whichever GCP region applies to you.
- Hit Create Database and wait a couple of minutes for it to spin up and become active.
1.5 – Generate application token
- Once your database is active, connect to it.
- Once on dashboard/<your-db-name>, select Admin User for the role and hit Generate Token.
- COPY DOWN YOUR CLIENT ID AND CLIENT SECRET as they will be used by Spark
1.6 – Download the Secure Connect Bundle
- Hit the Connect tab in the menu
- Click on Node.js (it doesn't matter which option under Connect using a driver)
- Drag-and-drop the Secure Bundle into the running Gitpod container.
1.7 – Copy and paste the contents of setup.cql into the CQLSH terminal
2. Set up Airflow
We will be using the quick start script that Airflow provides in its documentation.
3. Start Spark in standalone mode
3.1 – Start the master: run ./sbin/start-master.sh from your Spark installation directory
3.2 – Start the worker
Open port 8081 in the browser, copy the master URL (it looks like spark://<host>:7077), and paste it into the worker start command: ./sbin/start-worker.sh <master-URL> (start-slave.sh on older Spark versions)
4. Move spark_dag.py to ~/airflow/dags
4.1 – Create ~/airflow/dags
mkdir ~/airflow/dags
4.2 – Move spark_dag.py
mv spark_dag.py ~/airflow/dags
5. Update the TODOs in properties.config with your specific parameters
6. Open port 8080 to see the Airflow UI and check if the example_cassandra_etl DAG exists. If it does not show up yet, give it a few seconds and refresh.
7. Update the Spark connection, unpause the example_cassandra_etl DAG, and drill down by clicking on it
7.1 – Under the Admin section of the menu, select Connections, then spark_default, and update the host to the Spark master URL. Save once done.
7.2 – Select the DAGs menu item and return to the dashboard. Unpause example_cassandra_etl, and then click on its name to drill down into the DAG.
8. Trigger the DAG from the tree view and click on the graph view afterwards
9. Confirm data in Astra
9.1 – Check the previous_employees_by_job_title table:
select * from <your-keyspace>.previous_employees_by_job_title where job_title='Dentist';
9.2 – Check the days_worked_by_previous_employees_by_job_title table:
select * from <your-keyspace>.days_worked_by_previous_employees_by_job_title where job_title='Dentist';
And that will wrap up our walkthrough. Again, this is an introduction to setting up a basic Cassandra ETL process run by Airflow and Spark. As mentioned above, these first steps can be expanded to create more complex and scheduled/repeated Cassandra ETL processes run by Airflow and Spark. The live recording of Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark is embedded below, so if you want to watch the walkthrough live, be sure to check it out!
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!