Apache Cassandra Lunch #67: Moving Data from Cassandra to DataStax Astra with DSBulk

6/24/2022


This resource is based on an article originally published here.

In Apache Cassandra Lunch #67: Moving Data from Cassandra to DataStax Astra, we discussed how to move data from open-source Cassandra to DataStax Astra using DSBulk. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!

Apache Cassandra

Apache Cassandra is an open-source, distributed NoSQL database designed to handle large volumes of data across many servers
Cassandra clusters can be scaled either by improving the hardware on current nodes (vertical scalability) or by adding more nodes (horizontal scalability)
- Horizontal scalability is part of why Cassandra is so powerful: inexpensive machines can be added to a cluster to increase its capacity and throughput
Note: The demo runs the open-source version of Cassandra (not DSE)
- The same steps work nearly identically with DSE

DataStax Astra

Astra website: https://www.datastax.com/products/datastax-astra
DataStax Astra is a fully managed, serverless database built on Apache Cassandra and provided by DataStax
Some additional features of Astra:
- Stargate APIs: Make it easy for developers to work with data in a Cassandra-based database like Astra without deep knowledge of CQL
- Zero Lock-In: Deploy on AWS, GCP, or Azure and still maintain compatibility with open-source Cassandra
- Global Scale: Data replication across multiple data centers, availability zones, and regions
  - Additionally, allows a user to scale an Astra database up to multiple petabytes of data without impacting speed or performance
- 80 GB of storage and 20 million read/write operations for free every month (as of the time of this presentation)

DSBulk

DSBulk: DataStax Bulk Loader for Apache Cassandra is open-source software used to load and unload CSV or JSON data into and out of supported databases
Supported databases:
- DataStax Astra cloud database
- DataStax Enterprise (DSE) 4.7 and later
- Open source Apache Cassandra 2.1 and later
More information about DSBulk, including an introduction and further documentation, can be found here: https://docs.datastax.com/en/dsbulk/doc/dsbulk/dsbulkAbout.html
GitHub repository for the DataStax DSBulk project: https://github.com/datastax/dsbulk
Commands that will be used in today’s presentation/demo:
- dsbulk load
  - Loads data from a CSV or JSON source into a Cassandra/Astra database. When no configuration file is used, the necessary parameters have to be passed on the command line (listed below)
- dsbulk unload
  - Unloads data from a Cassandra/Astra database into CSV or JSON files. As with load, the necessary parameters have to be passed on the command line when no configuration file is used
- dsbulk count
  - Returns the number of rows in a Cassandra/Astra table, which is useful for checking loaded data
Parameters/flags that have to be passed when these commands are used without a configuration file:
- -k: keyspace
- -t: table
- -h: host of the Cassandra database (used when connecting to local Cassandra, as in the demo commands)
- -b: path to the secure connect bundle (only necessary when connecting to Astra)
- -u: username, -p: password (for the database)
  - Since an Astra update earlier this year, the Client ID and Client Secret of a generated token are used in place of the username and password
  - Can be omitted if the local Cassandra database is left at its defaults (user/password cassandra/cassandra)
- -url: URL of the .csv or .json file to load, or a local directory to unload data into

Demo Project

For the demo project, we will be running through some sample commands based on the following GitHub repository: https://github.com/DataStax-Examples/dsbulk-to-astra/. Some notes before getting started:

Make sure your local Cassandra database is running. To start an open-source Cassandra database locally with Docker, use the following command:
- docker run -p 9042:9042 --rm --name my-cassandra -d cassandra
Create an Astra database after registering for an account on the Astra website: https://astra.datastax.com/register
- After creating a database, generate a token with a role whose permissions allow both reading from and writing to the database (for example, an administrator role). Write down the Client ID and Client Secret.
- Additionally, download the secure connect bundle for the database. This will be necessary to allow dsbulk to connect to the Astra database.
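
Cassandra takes a little while to accept connections after the container starts, so commands run immediately afterwards can fail. The sketch below is one way to wait for it: a small retry helper that re-runs a probe command until it succeeds. The helper name wait_until and the retry values are our own choices for illustration. The probe in the usage comment assumes the container name my-cassandra from the docker command above and that cqlsh is available inside the container.

```shell
# Retry a probe command until it succeeds or the attempts run out.
# $1 = maximum attempts, $2 = seconds to sleep between attempts,
# remaining arguments = the probe command to run.
wait_until() {
  wu_attempts="$1"
  wu_delay="$2"
  shift 2
  wu_try=1
  while [ "$wu_try" -le "$wu_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0   # probe succeeded
    fi
    sleep "$wu_delay"
    wu_try=$((wu_try + 1))
  done
  return 1       # probe never succeeded
}

# Assumed usage with the demo container (up to 30 tries, 5 seconds apart):
#   wait_until 30 5 docker exec my-cassandra cqlsh -e "DESCRIBE KEYSPACES"
```

Passing the probe as arguments keeps the helper independent of Docker, so the same function can wrap any readiness check.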

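After making sure that your local Cassandra database is running, we need to set up both the keyspace and the table schema for this demo. The following commands, which define the keyspace and table we will use, should be run in both Astra’s CQLSH console and your local Cassandra’s CQLSH console. If Astra rejects the CREATE KEYSPACE statement, create the keyspace testkeyspace from the Astra dashboard instead and run only the CREATE TABLE statement there: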

CREATE KEYSPACE testkeyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}  AND durable_writes = true;

CREATE TABLE IF NOT EXISTS testkeyspace.video_ratings_by_user (
    videoid uuid,
    userid uuid,
    rating int,
    PRIMARY KEY (videoid, userid)
);

Some example output from local Cassandra after running these commands, followed by a SELECT * query:

screenshot of Cassandra console — Commands used to create the necessary keyspace and table in local Cassandra

Once this is done for both Astra and your local Cassandra database, we can proceed with DSBulk. DSBulk must first be downloaded from the following URL, which also includes installation instructions: https://docs.datastax.com/en/dsbulk/doc/dsbulk/install/dsbulkInstall.html. Now we can begin running DSBulk commands.
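
Before running the commands that follow, it can help to confirm that the dsbulk executable is where we expect it. The following is a minimal sketch: the DSBULK variable and the check_dsbulk function are names we made up for illustration, and the default path simply mirrors the one used in the commands in this post.

```shell
# Path to the dsbulk executable; the default matches the commands in this post.
DSBULK="${DSBULK:-./dsbulk-1.8.0/bin/dsbulk}"

# Report whether the dsbulk executable can be found and run.
check_dsbulk() {
  if command -v "$DSBULK" >/dev/null 2>&1; then
    echo "found: $DSBULK"
  else
    echo "not found: $DSBULK" >&2
    return 1
  fi
}

# Assumed usage: check the path, then print the installed version.
#   check_dsbulk && "$DSBULK" --version
```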

Loading from a file at a url into local Cassandra:

./dsbulk-1.8.0/bin/dsbulk load -url https://raw.githubusercontent.com/DataStax-Examples/dsbulk-to-astra/master/data.csv -h localhost -k testkeyspace -t video_ratings_by_user -u cassandra -p cassandra

Note that the first part of the command is the path to the dsbulk executable in your local DSBulk installation. Some sample output from the above command:

Loading from a file at a url into Astra:

./dsbulk-1.8.0/bin/dsbulk load -url https://raw.githubusercontent.com/DataStax-Examples/dsbulk-to-astra/master/data.csv -b ./secure-connect-testdb3.zip -k testkeyspace -t video_ratings_by_user -u IwxQhWdajNMpHisNlWeFlPYq -p AJ,pr7SG_H3P,,AZxWrYCqSkzUzjxXvbUrWH-c6GAII.h,YCK1S6ghAaItKCC-I0l27ybK6PuTusPbb_vJRz3igAdyvL1KepRF-tACkiMRSRx3jZW,xhBd3LgeIA,Dy2

Note that the parameters that come after -u and -p are not quite username and password, but rather the Client ID and Client Secret Key that are obtained by generating a token for your Astra database. Additionally, the path after the -b flag should point to the secure connect bundle for your Astra database.
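
Because the Client ID and Client Secret are credentials, one option is to read them from environment variables rather than typing them into each command. The sketch below assumes two environment variable names of our own choosing, ASTRA_CLIENT_ID and ASTRA_CLIENT_SECRET, and a hypothetical helper astra_load. It uses only the flags already shown in this post.

```shell
# Path to the dsbulk executable; the default matches the commands in this post.
DSBULK="${DSBULK:-./dsbulk-1.8.0/bin/dsbulk}"

# Load a CSV source into an Astra table, taking the token from the environment.
# $1 = file, directory, or URL to load   $2 = path to the secure connect bundle
# $3 = keyspace                          $4 = table
astra_load() {
  # Refuse to run if either credential is missing or empty.
  if [ -z "${ASTRA_CLIENT_ID:-}" ] || [ -z "${ASTRA_CLIENT_SECRET:-}" ]; then
    echo "set ASTRA_CLIENT_ID and ASTRA_CLIENT_SECRET first" >&2
    return 1
  fi
  "$DSBULK" load -url "$1" -b "$2" -k "$3" -t "$4" \
    -u "$ASTRA_CLIENT_ID" -p "$ASTRA_CLIENT_SECRET"
}

# Assumed usage, after exporting both variables:
#   astra_load ./data.csv ./secure-connect-testdb3.zip testkeyspace video_ratings_by_user
```

This keeps the secret out of the script itself. The values are still passed to dsbulk as command-line arguments, so they remain visible to other processes on the same machine while the command runs.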

In both of the above cases, we are loading from a .csv file at a URL into either local Cassandra or Astra. To move data from local Cassandra into Astra, we will also need the DSBulk unload command. We first run the following command in Astra’s cqlsh to make sure that the Astra table does not have any data in it:

TRUNCATE testkeyspace.video_ratings_by_user;

Some sample output from running that command:

screenshot of Astra console — Empty table in Astra after using TRUNCATE command in CQLSH

Now we do a two-step process to completely move data from local Cassandra into Astra:

Step 1: Unload data from local Cassandra into .csv files in a local folder:

./dsbulk-1.8.0/bin/dsbulk unload -h localhost -k testkeyspace -t video_ratings_by_user -url ./my_data

Note that the very last parameter is the path to a local folder, which must be empty.

Step 2: Load the .csv files from that folder into Astra:

./dsbulk-1.8.0/bin/dsbulk load -url ./my_data -b ./secure-connect-testdb3.zip -k testkeyspace -t video_ratings_by_user -u IwxQhWdajNMpHisNlWeFlPYq -p AJ,pr7SG_H3P,,AZxWrYCqSkzUzjxXvbUrWH-c6GAII.h,YCK1S6ghAaItKCC-I0l27ybK6PuTusPbb_vJRz3igAdyvL1KepRF-tACkiMRSRx3jZW,xhBd3LgeIA,Dy2

And the data migration process of a table from local Cassandra into Astra is complete. For a complete run-through of the commands mentioned above, along with additional commentary, please see the recorded live session below on YouTube!
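
To check that the migration moved everything, we can compare row counts on both sides with dsbulk count. The sketch below is one possible way to do that, not part of the demo. The function names table_count and verify_migration and the environment variables ASTRA_CLIENT_ID and ASTRA_CLIENT_SECRET are our own, and it assumes that dsbulk count prints the row count as the last line on standard output while its log lines go to standard error.

```shell
# Path to the dsbulk executable; the default matches the commands in this post.
DSBULK="${DSBULK:-./dsbulk-1.8.0/bin/dsbulk}"

# Print the row count of a table. $1 = keyspace, $2 = table; any remaining
# arguments are connection flags (-h ..., or -b ... -u ... -p ...).
# Assumption: the count is the last line on standard output.
table_count() {
  tc_keyspace="$1"
  tc_table="$2"
  shift 2
  "$DSBULK" count -k "$tc_keyspace" -t "$tc_table" "$@" 2>/dev/null | tail -n 1
}

# Compare the local and Astra row counts. $1 = keyspace, $2 = table,
# $3 = path to the secure connect bundle.
verify_migration() {
  vm_local="$(table_count "$1" "$2" -h localhost)"
  vm_astra="$(table_count "$1" "$2" -b "$3" \
    -u "${ASTRA_CLIENT_ID:-}" -p "${ASTRA_CLIENT_SECRET:-}")"
  if [ -n "$vm_local" ] && [ "$vm_local" = "$vm_astra" ]; then
    echo "OK: $vm_local rows in both tables"
  else
    echo "MISMATCH: local=$vm_local astra=$vm_astra" >&2
    return 1
  fi
}

# Assumed usage:
#   verify_migration testkeyspace video_ratings_by_user ./secure-connect-testdb3.zip
```

Matching counts do not prove that every value is identical, but a mismatch is a quick signal that the load should be inspected.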

Recording of the live session is below:

References

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!
