Apache Cassandra Lunch #55: Migrating PostgreSQL to Cassandra

6/19/2022

Reading time:5

Apache Cassandra Lunch #55: Migrating PostgreSQL to Cassandra - Business Platform Team

This resource is based on an article originally published here.

In Apache Cassandra Lunch #55: Migrating PostgreSQL to Cassandra, we will discuss the differences between Relational and Non Relational database, as well as the data modeling process for Cassandra. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!

Relational Databases

The good

Before we talk about the advantages and disadvantages of Relational Databases like PostgreSQL it’s important to note that one is not better than the other, rather they both have suitable use cases. So starting off with Relational Databases the good thing about them is that SQL databases use ACID transactional guarantees and they query data using JOIN’s on top of existing normalized relational data models. They have Consistent global secondary indexes and Scale vertically. They are great for medium-sized data.

The bad

The problem here becomes when we need to scale this database. Whenever we want to scale we either have to buy increasingly more expensive hardware or try to manually shard. This is not ideal when working with Big Data and we are faced with the problem of having a large team coordinating, manually moving data and schema changes.

Migrating PostgreSQL to Cassandra

So the solution to our scaling problem is a non-relational database such as Cassandra. Cassandra scales linearly, it’s highly available so that means there is no downtime, it is easy to manage operationally. This solves a lot of our problems when handling Big Data but it has a unique data model and it takes some getting used to if you want to use it effectively.

The Ring

Image of a multi-cluster database schema using cassandra.

In Cassandra Data is replicated automatically, the Replication Factor is per keyspace. Consistency levels in Cassandra can be configured to manage availability versus data accuracy. Configure consistency for a session or per individual read or write operation. Within cqlsh, use CONSISTENCY, to set the consistency level for all queries in the current cqlsh session.

Data Modeling

Now the tricky part about migrating PostgreSQL to Cassandra is the way we model our Data for Cassandra. The process is different in a lot of ways and we will cover the basics here.

Table that compares and contrasts Relational databases to Cassandra databases

The first thing we can notice is that Cassandra is query-driven. The way to minimize partition reads is to model your data to fit your queries. Don’t model around relations. Don’t model around objects. Model around your queries. Try to determine exactly what queries you need to support this is different from relational databases like PostgreSQL where you think about the queries after you defined your schema. Primary Keys are different and they represent a lot more in Cassandra.

Migrating PostgreSQL to Cassandra understanding the difference between data modelling strategies. This image shows the Cassandra data modelling strategy.

In order to utilize the best out of Cassandra we need to make sure our Data Model is done right, so to make sure we get the right Data Model we need to follow the Cassandra modeling methodology. First, we have to create our Conceptual Data Model. This is where we take a step back and Identify entities and relationships and their attributes.
This part of the modeling does not require being a developer, and so that means more people without technical skills can be involved in this process.

Migrating PostgreSQL to Cassandra requires understanding the relationships between entities before hand. This image displays an entity relationship for Cassandra.

The next thing in the process is to create our Application workflow. This is generally where we determine what queries we need for our application. Remember we are modeling our Data Model around our queries so we have to determine what those queries are going to be early in the process.

Image of an application workflow when modelling for a Cassandra database.

After we have these two parts ready we can move onto the Logical Data Model.

High level view of the data modelling for a Cassandra Database

Image of a Chebatko diagram used when modelling for a Cassandra Database.

Now that we have the Logical Data Model it’s time to transform it into Physical Data Model and to do that we just have to assign the CQL Types.

Image of the physical data model table used when modelling for a Cassandra Database.

Primary Key / Partition key + Clustering Column

A key thing to understand when migrating from PostgreSQL to Cassandra, As I mentioned before the Primary Key is quite different between PostgreSQL and Cassandra, in fact, this is the most important thing to get right when you are modeling with Cassandra so let’s dive in and look at what exactly is a Partition key and Clustering column.

Let’s start off by taking a look at the CQL and SQL, we can notice that they are quite similar and if you have worked with SQL, picking up CQL will be no challenge. The partition key has a special use in Apache Cassandra beyond showing the uniqueness of the record in the database. The other purpose, and one that very critical in distributed systems, is determining data locality. When data is inserted into the cluster, the first step is to apply a hash function to the partition key. The output is used to determine what node (and replicas) will get the data. The algorithm used by Apache Cassandra utilizes Murmur3 which will take an arbitrary input and create a consistent token value. That token value will be inside the range of tokens owned by a single node. In simpler terms, a partition key will always belong to one node and that partition’s data will always be found on that node.

Hopefully, this will give you enough information to start the process, but this is just scratching the surface there is a lot more information you need before you can start actually migrate a PostgreSQL to Cassandra so make sure to research deeply so you can do it right.

As mentioned above, the live recording which includes a live walkthrough of the demo is embedded below, so be sure to check it out and subscribe. Also, if you missed last week’s Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2, be sure to check that out as well!

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!

Posted in Data & Analytics, Events | Comments Off on Apache Cassandra Lunch #55: Migrating PostgreSQL to Cassandra

Related Articles

api.management

snowflake

oracle

GitHub - dreamfactorysoftware/dreamfactory: DreamFactory API Management Platform

3/7/2024

stargate

cassandra.lunch

cassandra

Apache Cassandra Lunch #87: Cassandra.api, Astra, and Stargate - Business Platform Team

7/8/2022

cqlsh

cassandra.lunch

cassandra

Apache Cassandra Lunch #77: Connect to DataStax Astra via Standalone CQLSH - Business Platform Team

7/2/2022

datastax

cassandra.basics

cassandra.lunch

Cassandra Lunch #75: Getting Started with DataStax Enterprise (DSE) on Docker - Business Platform Team

6/29/2022

cassandra.basics

cassandra.lunch

cassandra

Cassandra Lunch #70: Basics of Apache Cassandra - Business Platform Team

6/27/2022

sparksql

cassandra.lunch

cassandra

Apache Cassandra Lunch #65: Spark Cassandra Connector Pushdown - Business Platform Team

6/27/2022

datastax

cassandra.lunch

cassandra

Apache Cassandra Lunch #68: DataStax Apache Kafka Connector - Business Platform Team

6/25/2022

cassandra.lunch

grafana.dashboard

cassandra

Apache Cassandra Lunch #62: Grafana Dashboard for Apache Cassandra - Business Platform Team

6/25/2022

cassandra.lunch

cassandra

nifi

Apache Cassandra Lunch #60: Apache Cassandra and Apache Nifi - Business Platform Team

6/25/2022

datastax

cassandra.lunch

cassandra

Apache Cassandra Lunch #67: Moving Data from Cassandra to DataStax Astra with DSBulk - Business Platform Team

6/24/2022

Explore Further

cassandra.lunch

stargate

cassandra.lunch

cassandra

Apache Cassandra Lunch #87: Cassandra.api, Astra, and Stargate - Business Platform Team

7/8/2022

cqlsh

cassandra.lunch

cassandra

Apache Cassandra Lunch #77: Connect to DataStax Astra via Standalone CQLSH - Business Platform Team

7/2/2022

datastax

cassandra.basics

cassandra.lunch

Cassandra Lunch #75: Getting Started with DataStax Enterprise (DSE) on Docker - Business Platform Team

6/29/2022

cassandra.basics

cassandra.lunch

cassandra

Cassandra Lunch #70: Basics of Apache Cassandra - Business Platform Team

6/27/2022

postgresql

api.management

snowflake

oracle

GitHub - dreamfactorysoftware/dreamfactory: DreamFactory API Management Platform

3/7/2024

python

cassandra

spark

GitHub - andreia-negreira/Data_streaming_project: Data streaming project with robust end-to-end pipeline, combining tools such as Airflow, Kafka, Spark, Cassandra and containerized solution to easy deployment.

12/2/2023

python

cassandra

spark

GitHub - airscholar/e2e-data-engineering: An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

12/2/2023

flink

beam

dataflow

• Google Dataflow - Awesome-Astra

5/10/2023

cassandra

cluster

troubleshooting

datastax

GitHub - arodrime/Montecristo: Datastax Cluster Health Check Tooling

4/3/2024

cassandra

langchain

llamaindex

GitHub - michelderu/chat-with-your-data-in-cassandra: Chat with your data stored in DataStax Enterprise, Astra DB and Apache Cassandra - In Natural Language!

3/26/2024

spring

angular

rest

GitHub - jhipster/jhipster-sample-app-cassandra: This is a sample application created with JHipster, with the Cassandra option

3/7/2024

api.management

node

api

LoopBack

3/7/2024

Relational Databases

The good

The bad

Migrating PostgreSQL to Cassandra

The Ring

Data Modeling

Primary Key / Partition key + Clustering Column

Cassandra.Link

Become part of our

growing community!

Planet Cassandra is a service for the Apache Cassandra® user community to share with each other. From tutorials and guides, to discussions and updates, we're here to help you get the most out of Cassandra. Connect with us and become part of our growing community today.

Get Involved with Planet Cassandra!

We believe that the power of the Planet Cassandra community lies in the contributions of its members. Do you have content, articles, videos, or use cases you want to share with the world?