August 25th, 2015

Andrey Kashlev, Big Data Researcher at Wayne State University
Andrey Kashlev is a PhD candidate in big data, working in the Department of Computer Science at Wayne State University. His research focuses on big data, including data modeling for NoSQL, big data workflows, and provenance management. He has published numerous research articles in peer-reviewed international journals and conferences, including IEEE Transactions on Services Computing, Data and Knowledge Engineering, International Journal of Computers and Their Applications, and the IEEE International Congress on Big Data. Catch Andrey at Cassandra Summit 2015, presenting “World’s Best Data Modeling Tool for Apache Cassandra”.


Data modeling is one of the most important steps ensuring performance and scalability of Cassandra-powered applications. The existing Chebotko data modeling methodology [1,2] lays out important data modeling principles, rules and patterns to translate a real-world domain model into a running Cassandra schema. While this approach enables rigorous and sound schema design, it requires specialized training and experience.

In this article, we will showcase our online tool that streamlines and automates the Cassandra data modeling methodology proposed by Chebotko et al. [1,2]. Our tool, called KDM, minimizes schema design effort on the user’s part by automating the entire data modeling cycle, starting from a conceptual data model and ending with a CQL script. Specifically, KDM automates the most complex, error-prone, and time-consuming data modeling tasks: conceptual-to-logical mapping, logical-to-physical mapping, physical optimization, and CQL generation. We will design data models for two real-life use cases from the IoT and media cataloguing domains. KDM is available for free and can be used by developers to build Cassandra data models, as well as by course instructors and students to teach and learn NoSQL data modeling.


The Big Picture

Fig. 1 presents an overview of the data modeling process in KDM. A Cassandra solution architect, a role that encompasses both database design and application design tasks, starts data modeling by building a conceptual data model and specifying data access patterns (steps 1 and 2). The access patterns are blueprints of queries that the application will need to run against the database. Next, KDM automatically produces a set of correct logical data models. A logical data model specifies Cassandra tables that efficiently support application queries described during step 2. The solution architect selects the preferred data model (step 3), and KDM generates a physical data model, which the user needs to configure (step 4). Finally, KDM generates a CQL script that the solution architect can download to instantiate the physical data model in a Cassandra cluster (step 5).


Fig. 1. KDM’s data modeling automation workflow.


The crux of Cassandra data modeling is the conceptual-to-logical mapping, performed according to the rules defined in [1,2]. An error in this mapping may lead to data loss, poor query performance, or even an inability to support application queries. The crucial innovation of KDM is that, through automation, the tool eliminates human error from the conceptual-to-logical mapping, thereby ensuring sound schema design.


Use Case: An IoT Application

We use KDM to create a data model for an IoT application. The application needs to store and query information about sensor networks, sensors themselves, and their measurements, which include temperature, humidity, and pressure. Common application tasks include finding information about sensors in a given network as well as displaying all measurements of a particular sensor.


Step 1: Design a Conceptual Data Model

We start data modeling by describing the data that our application will manage. We drag and drop entity, relationship, and attribute types and connect them, as shown in Fig. 2. Whenever needed, we can pan the canvas by dragging it with the right mouse button. Two things are essential here.

First, it is important that we specify a key for each entity type. A key is a minimal set of attribute types that uniquely identifies an entity or a relationship. For example, name is the key of the Network entity type, id is the key of the Sensor entity type, and the key of the Measurement type is the combination of the sensor id, the measurement timestamp, and the parameter being measured, such as temperature or pressure.

A key is specified by selecting the appropriate attribute type(s) (e.g., name) and pressing the KEY button in the toolbar. The key of the Measurement type is specified by right-clicking on Measurement -> Set keys, which opens a dialog for defining a compound key. Optionally, we may also specify keys for relationship types.

Second, we must specify cardinality constraints, such as 1:1, 1:n, and m:n. Fig. 2 shows the complete conceptual data model for our use case. We will later see how KDM uses the information we have provided when generating a logical data model.


Fig. 2. A conceptual data model for our IoT application.


Step 2: Specify Access Patterns

Cassandra data modeling is a query-driven process that treats data access patterns as first-class citizens. An access pattern prescribes what a query searches on, what it searches for, and how the results should be ordered. Based on the queries that our application will run against the database, we compose a list of three access patterns:

Q1: Find information about all sensors in a given network.

Q2: Find measurements of a given sensor on a given date. Order results by timestamp, showing the most recent results first, i.e., order by timestamp (DESC).

Q3: Find measurements of a given sensor for a particular quantity (i.e., a parameter, such as temperature, humidity, etc.). Order results by timestamp (DESC).

To specify data access patterns, we switch to the second tab (step 2). For Q1, since we perform an equality search on the network name, we right-click on the name attribute type and choose “given value (=)”, as shown in Fig. 3(a). The name attribute should appear in the Q1 tab of the Access Patterns panel under “GIVEN”. We then specify the searched-for attribute types: sensor id, location, and characteristics. We right-click on the id attribute and choose “find value”, as shown in Fig. 3(b).


Fig. 3. Specifying the Q1 access pattern.


We do the same for location and characteristics. For documentation purposes, it is a good idea to type a verbal description of the access pattern into the text area. This completes the specification of Q1. The Access Patterns panel should now look as follows:


Fig. 4. The Access Patterns panel after specifying Q1.


The Q2 access pattern is specified similarly. Recall that Q2 asks to find all measurements of a given sensor on a given date and to order the results by timestamp (DESC). Thus, two attributes are in the “GIVEN” category: sensor id and date. To request that the query results be ordered by timestamp in descending order, we right-click on the timestamp attribute type and choose “find and sort DESC” instead of “find value”:


Fig. 5. Requesting the query results to be ordered by timestamp in descending order in Q2.

The Q2 tab of the Access Patterns panel should look as follows:


Fig. 6. The Q2 access pattern.

The access pattern Q3 is specified similarly:



Fig. 7. The Q3 access pattern.

After defining the access patterns, we need to perform the conceptual-to-logical mapping. This is a critical step that translates the conceptual data model into a set of tables, based on the specified access patterns. It is crucial that this mapping is performed correctly: an incorrect data model may lead to data loss or an inability to support application queries. Using its advanced algorithms, KDM automatically produces correct logical data models, eliminating the need to manually map the conceptual data model onto Cassandra tables. To generate logical data models, we click the corresponding button on the toolbar.


Step 3: Select a Logical Data Model

KDM has produced table schemas for each of the three access patterns (Fig. 8). It automatically determined logically correct primary keys, and produced all possible solutions that meet the requirements we have specified in the previous two steps.


Fig. 8. A logical data model generated by KDM for our IoT use case.


The above table schemas are shown using the Chebotko notation [1,2], where K denotes a partition key column and C denotes a clustering key column whose values are stored in ascending (↑) or descending (↓) clustering order. Finally, {} denotes a set column, while [] and <> denote list and map columns, respectively.
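For instance, a minimal CQL sketch of how the sensors_by_network schema for Q1 reads under this notation, assuming name as the partition key (K), id as an ascending clustering column (C↑), and illustrative data types (including a map for characteristics) that are not necessarily KDM's defaults:

-- name (K), id (C↑), location, characteristics rendered as CQL
CREATE TABLE sensors_by_network (
  name text,                        -- K: partition key
  id int,                           -- C↑: clustering column, ascending order
  location text,                    -- regular column
  characteristics map<text, text>,  -- <>: a map column
  PRIMARY KEY ((name), id)
) WITH CLUSTERING ORDER BY (id ASC);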

While for some queries (such as Q1) there is only one correct table schema, other queries (Q2 and Q3) can be supported by several alternative table schemas. In such cases KDM produces all of these schemas, so that the user can choose the preferred one. By default, the first table schema for each query is selected (highlighted in red). If needed, the user can select a different schema. For example, we select the last table schema for Q2 (table3), since it has smaller partitions than the first two schemas, table0 and table1. Intuitively, because {id, date} is a more selective partition key than {id} (or {date}), each partition will store fewer rows. We also select the measurements_by_sensor3 table for Q3, using a similar rationale. Whenever possible, KDM suggests meaningful table names, e.g., sensors_by_network; otherwise, a default name such as table0 or table1 is produced.
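To make the partition-size trade-off concrete, here is a rough sketch of the two extremes for Q2 (the table names, the non-key columns, and the data types are assumptions for illustration, not KDM's literal output):

-- Alternative with {id} alone as the partition key: a single partition holds
-- every measurement a sensor has ever taken and keeps growing over time.
CREATE TABLE measurements_by_sensor_wide (
  id int,
  date date,
  timestamp timestamp,
  parameter text,
  value float,
  PRIMARY KEY ((id), date, timestamp, parameter)
) WITH CLUSTERING ORDER BY (date DESC, timestamp DESC, parameter ASC);

-- table3-style alternative with the composite {id, date} partition key: each
-- partition holds only one day of one sensor's readings, so partitions stay
-- small while Q2 still reads from a single partition.
CREATE TABLE measurements_by_sensor_daily (
  id int,
  date date,
  timestamp timestamp,
  parameter text,
  value float,
  PRIMARY KEY ((id, date), timestamp, parameter)
) WITH CLUSTERING ORDER BY (timestamp DESC, parameter ASC);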

After selecting a logical data model, we press the corresponding button to have KDM generate a physical data model with default settings.


Step 4: Configure the Physical Data Model

Fig. 9 shows the physical data model produced by KDM for our use case. Underneath each table schema, a CQL query equivalent to the corresponding access pattern is shown.


Fig. 9. A physical data model with default data types generated by KDM for our IoT use case.
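As a hedged illustration of what those generated queries look like (the literal values and non-key column names are placeholders, and the Q2 table keeps its default name table3 from the logical model; actual names depend on the customization done in step 4):

-- Q1: all sensors in a given network
SELECT id, location, characteristics
FROM sensors_by_network
WHERE name = 'backyard-network';

-- Q2: measurements of a given sensor on a given date, returned newest first
-- thanks to the DESC clustering order on timestamp
SELECT timestamp, parameter, value
FROM table3
WHERE id = 1 AND date = '2015-08-25';

-- Q3: measurements of a given sensor for a particular parameter, newest first
SELECT timestamp, value
FROM measurements_by_sensor3
WHERE id = 1 AND parameter = 'temperature';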


As needed, we customize table and column names. Next, using the dropdowns, we select data types for all the columns. Finally, we perform appropriate physical optimizations, such as splitting or merging partitions, adding/removing columns, etc. This should be done with great caution, particularly when it comes to altering the primary key structure.

As an example of a physical optimization, consider the measurements_by_sensor3 table above, which stores all measurements of a particular parameter (e.g., temperature) taken by a given sensor. Each partition of the table stores all the measurements of that parameter for one sensor. Suppose we decide to split partitions into buckets, such that each bucket stores one month’s worth of measurements. To achieve this, we use the corresponding toolbar button to add a new column to the partition key and name it “month”. The configured physical data model looks as follows:


Fig. 10. A configured physical data model for our IoT use case.
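A rough sketch of what the bucketed table and its query might look like after this optimization (the month encoding, the non-key columns, and the data types are assumptions; the key structure follows the partitioning described above):

-- measurements_by_sensor3 with "month" added to the partition key: each
-- partition now holds one month of one sensor's readings of one parameter
CREATE TABLE measurements_by_sensor3 (
  id int,
  parameter text,
  month text,               -- bucket column added during optimization, e.g. '2015-08'
  timestamp timestamp,
  value float,
  PRIMARY KEY ((id, parameter, month), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

-- Q3 now also supplies the bucket so it still reads a single partition
SELECT timestamp, value
FROM measurements_by_sensor3
WHERE id = 1 AND parameter = 'temperature' AND month = '2015-08';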

Once we have configured and optimized the physical data model, we press the corresponding button to generate a CQL script capturing our data model.


Step 5: Download a CQL Script

KDM frees us from having to write the CREATE TABLE statements by generating a CQL script capturing the database schema (Fig. 11). We choose a keyspace name and a file name, specify the replication strategy and replication factor, and download the .cql file. The demo.cql script can then be executed against Cassandra to instantiate our physical data model.


Fig. 11. A CQL script generated by KDM for our IoT use case.
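The overall shape of such a script is standard CQL; a minimal sketch, assuming a keyspace named demo with SimpleStrategy and a replication factor of 3 (the actual names and settings are whatever was chosen in the dialog, and KDM may or may not emit IF NOT EXISTS clauses):

CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE demo;

CREATE TABLE IF NOT EXISTS sensors_by_network (
  name text,
  id int,
  location text,
  characteristics map<text, text>,
  PRIMARY KEY ((name), id)
);

-- ...followed by the CREATE TABLE statements for the measurement tables
-- configured in step 4.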


In this article, we introduced automated data modeling for Cassandra with KDM using a basic example from the IoT domain. We demonstrated how KDM automates the most complex and error-prone steps of the data modeling process: conceptual-to-logical mapping, logical-to-physical mapping, physical optimization, and CQL generation. This use case is available in KDM under the “Use Cases -> Internet of Things” menu. In the second part of this series we will build a more elaborate data model for a media cataloguing application and demonstrate some of KDM’s advanced automation features, such as support for roles, alternative keys, cyclic access patterns, and queries involving multiple entities and relationships.



This work would not have been possible without the inspiring ideas and helpful feedback of Dr. Artem Chebotko and Mr. Anthony Piazza. Andrey Kashlev would also like to thank Dr. Shiyong Lu for his support of this project.



[1] Artem Chebotko, Andrey Kashlev, and Shiyong Lu, “A Big Data Modeling Methodology for Apache Cassandra,” IEEE International Congress on Big Data, 2015 (in press).

[2] DataStax Training, DS220: Data Modeling with DataStax Enterprise.