March 7th, 2016

Matija Gobec, Co-Founder & CTO, SmartCat
Matija Gobec (@mad_max0204) is the co-founder and CTO of SmartCat, a big data consultancy company focused on solving problems in data-intensive environments. He shares his knowledge through public speaking and the company blog, and also works on open source tools to help the Cassandra community.

SmartCat delivers top-notch, certified Big Data systems implementation services at lower rates than our competitors in Western countries. We are evangelists of Big Data stream processing technologies in the region, actively involved in teaching, presenting, and training the new generation of data scientists, NoSQL engineers, and Big Data integrators. We are the agents of change.


The Problem
Developing a product usually means that requirements and technologies will change along the way, especially on an R&D type of project. Cassandra is known for great performance, but that performance comes from the fact that data modeling is query based, which means extending functionality is not as simple as it is with a traditional relational database, where the data model is entity based. We were working on our second R&D project when we felt the need for something that would help us apply schema changes and handle live data in production. The early stages of development and prototyping are easy because you can always drop and recreate the database to reflect the new schema, but that was no longer an option for us: some clients were already testing the implemented functionality and evaluating the system we were working on. Their feedback was really crucial for our product because it helped us shape new functionality and make changes to the existing features.


The first steps towards a schema migration tool made us do a bit of research. There were a few projects aimed at solving this issue, but none was quite what we needed: a lightweight framework that would allow us to make changes to the schema and take care of the data. A big bonus would be the ability to execute the tool with different parameters at different stages of deployment. You can find more details about our investigation on the SmartCat blog.


We decided to build a simple tool for executing database schema transformations that keeps track of the schema version in the database itself. This is required if we want to make changes to the schema while the database is operational, and to make those changes through our code so that we can test them before executing them in production. Simple schema changes are easy because Cassandra allows adding or removing columns, changing types (with some restrictions), and other straightforward updates, but we also needed to make drastic changes to some tables, like changing the primary key. Since we were working on a prototype that was serving clients at the same time, we couldn't just drop and recreate the schema; we had to keep all the data after the update. Most schema and data migrations can be done using Spark or any other ETL tool, and many people solve these problems that way, but Spark brings a lot of weight if you need it only for migrations. Being able to write migrations and execute them in a unit test gave us confidence before going live, and using the application models made it all really simple to use.


The migration tool started as just a runner for migration implementations, but we soon realized there were two stages at which we wanted to execute migrations. The first stage is when the application code is built and we want to deploy it to the server. For the application to run, we need to update the schema, so in this step we execute schema migrations, after which everything is set for the application to start. But in some cases, after updating the schema, we also need to handle a certain amount of already existing data. For this purpose we defined data migrations. These are typically used when we create a new table for an existing application model in order to serve new queries but there is already data in the database, or when a new field is added and we need to calculate its values based on some criteria. Executing a data migration in the pre-deploy stage would take time and prolong the application's downtime. We wanted the application up and running as soon as possible, so this work had to be done asynchronously. Newly implemented queries wouldn't return all results until the data migration finished, but we could still serve requests and handle incoming data. Here, uptime won over consistency and the application had minimal downtime. The possibilities are not limited to this use case; it is entirely up to the user to decide when and how to execute migrations.


In this example we have a data model that stores items by their system identifier, along with a number and an external system id. The table looks like this:
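A minimal version of such a table might look like the following; the table and column names are assumptions for illustration, not the exact schema from the project:

```sql
-- Items keyed by their system identifier, carrying a number and an
-- external system id (illustrative names).
CREATE TABLE items (
    id text,
    number text,
    external_id text,
    PRIMARY KEY (id)
);
```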

The problem arises if we need to change the data model in order to support querying by number and external system id. The existing model doesn't support that, and if the table already holds data, we need to handle that data too.

We can solve this by creating a schema migration with an execute method looking like this:
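A sketch of such a schema migration, assuming the tool's abstract Migration class, a MigrationType enum, and a session exposed by the base class; the class name, constructor shape, and table definition are illustrative, while executeWithSchemaAgreement is the helper discussed below:

```java
import com.datastax.driver.core.SimpleStatement;

// Illustrative sketch: class name, constructor signature, and schema are
// assumptions, not the exact code from the project.
public class AddItemsByNumberTableMigration extends Migration {

    public AddItemsByNumberTableMigration(int version) {
        // MigrationType.SCHEMA marks this as a schema (pre-deploy) migration.
        super(MigrationType.SCHEMA, version);
    }

    @Override
    public void execute() throws MigrationException {
        final String createTable =
            "CREATE TABLE IF NOT EXISTS items_by_number_external_id ("
          + "  number text,"
          + "  external_id text,"
          + "  id text,"
          + "  PRIMARY KEY ((number, external_id))"
          + ");";
        // Executes the statement and waits for schema agreement on all
        // nodes, so later migrations can rely on the new table existing.
        executeWithSchemaAgreement(new SimpleStatement(createTable));
    }
}
```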

In this example we execute a schema migration, and the migration tool automatically updates the schema version stored in the database. If you look closely, when executing the schema-altering statement we use the executeWithSchemaAgreement method. This method is implemented in the abstract Migration class and executes the statement while waiting for schema agreement on all nodes. This is necessary if you have chained migrations that expect the new schema.

Next, we execute a data migration that iterates over the rows of the original table and inserts them into the newly created one:
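A sketch of that data migration, under the same assumptions as above (the class name is hypothetical, and we assume the Migration base class exposes the driver session as a `session` field):

```java
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.querybuilder.QueryBuilder;

// Illustrative sketch: copies every existing row from the old table into
// the new query table. Names are assumptions, not the project's exact code.
public class PopulateItemsByNumberMigration extends Migration {

    public PopulateItemsByNumberMigration(int version) {
        // MigrationType.DATA marks this as a data (post-deploy) migration.
        super(MigrationType.DATA, version);
    }

    @Override
    public void execute() throws MigrationException {
        // Full scan of the original table; fine for moderate data volumes.
        ResultSet rows = session.execute(
                QueryBuilder.select().all().from("items"));
        for (Row row : rows) {
            session.execute(QueryBuilder.insertInto("items_by_number_external_id")
                    .value("number", row.getString("number"))
                    .value("external_id", row.getString("external_id"))
                    .value("id", row.getString("id")));
        }
    }
}
```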

By using one of the continuous deployment tools, schema migrations can be executed on a successful build with this code:
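A pre-deploy runner might look like the following sketch; MigrationEngine and MigrationResources follow the tool's API as we understand it, and the contact point, keyspace, migration class, and version number are all hypothetical placeholders:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Sketch of a runner a CI/CD pipeline could invoke after a successful build.
public class SchemaMigrationRunner {
    public static void main(String[] args) {
        // Contact point and keyspace are placeholders for real config.
        try (Cluster cluster = Cluster.builder()
                                      .addContactPoint("127.0.0.1")
                                      .build();
             Session session = cluster.connect("my_keyspace")) {
            MigrationResources resources = new MigrationResources();
            // Hypothetical schema migration; version number is illustrative.
            resources.addMigration(new AddItemsByNumberTableMigration(1));
            MigrationEngine.withSession(session).migrate(resources);
        }
    }
}
```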

The application startup code can execute data migrations like this:
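A sketch of the startup hook, run asynchronously so the application can serve requests while old rows are being copied; again, MigrationEngine/MigrationResources follow the tool's API as we understand it, and the migration class and version are hypothetical:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.datastax.driver.core.Session;

// Sketch: trigger the data migration in the background on startup, trading
// temporary query completeness for minimal downtime (as described above).
public class StartupMigrations {
    public static void runDataMigrations(final Session session) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.submit(() -> {
            MigrationResources resources = new MigrationResources();
            // Hypothetical data migration; version number is illustrative.
            resources.addMigration(new PopulateItemsByNumberMigration(2));
            MigrationEngine.withSession(session).migrate(resources);
        });
        executor.shutdown(); // finish the queued migration, accept no new work
    }
}
```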

After the migrations finish, our database has the updated schema, all data is available in both tables, and queries by number and external id are supported.

If this change is required on a production application, we can choose when to execute each migration: the schema migration can be executed as part of the build and deploy process, and when the application starts it can trigger the data migration, which populates the new table in the background.



What is important here is how easy it is to upgrade the schema and handle historical data. Conveniently, this is a Java application, so there are essentially no limits to what a migration can execute. For example, we can enrich the existing data with data received from a third-party service by implementing the service calls in the migration's execute method and running it in the background. Alternatively, a dedicated service can execute migrations triggered via an API or a messaging system; we have also used this tool to migrate data from MySQL to Cassandra.

After successfully using this tool on three projects in production, our experience is that it makes handling data model updates much easier. All updates are stored in a schema table created by the tool, which both tracks versions and serves as an audit trail.

Feel free to check out the project on the SmartCat-labs GitHub; questions and feedback are welcome.