September 2nd, 2015

Andrey Kashlev, Big Data Researcher at Wayne State University
Andrey Kashlev is a PhD candidate in big data, working in the Department of Computer Science at Wayne State University. His research focuses on big data, including data modeling for NoSQL, big data workflows, and provenance management. He has published numerous research articles in peer-reviewed international journals and conferences, including IEEE Transactions on Services Computing, Data and Knowledge Engineering, International Journal of Computers and Their Applications, and the IEEE International Congress on Big Data. Catch Andrey at Cassandra Summit 2015, presenting “World’s Best Data Modeling Tool for Apache Cassandra“.

This is the second article of our series on KDM, an online tool that automates Cassandra data modeling. We highly recommend that you first read part 1 of this series where we overview our tool and use it to build a data model for an IoT application. In this article, we will demonstrate a more complex use case and will cover some of the advanced features of KDM.


Use Case: A Media Cataloguing Application

We will create a data model for a media cataloguing application that manages information about artists, their albums, songs, users, and playlists. Users often browse through artists, albums and songs, create, play, and share playlists, and invite their friends to sign up for the app. We use KDM to design a database that will efficiently support our application. Database design using KDM consists of five steps, starting with a conceptual data model and ending with a Cassandra database schema.


Step 1: Design a Conceptual Data Model.

As we discussed in part 1, we first design a conceptual data model for our application (Fig. 1).

Screen Shot 2015-09-01 at 3.07.22 PM

Fig. 1. A conceptual data model for our media cataloguing application.

To ensure correctness of our data model, we must carefully specify key and cardinality constraints. Note that a user can be identified by either username, or email. Thus, we specify the two alternative keys by right-clicking on User -> Set keys, as shown in Fig. 2. We will later see that KDM uses this information when generating a logical data model. Optionally, we may annotate username and email as key1 and key2 on the conceptual data  model (Fig. 1). We also assume that an album is uniquely identified by a combination of title and year.

Screen Shot 2015-09-02 at 3.52.36 PM

Fig. 2. Specifying alternative keys username and email for the User entity type.

Step 2: Specify Access Patterns

We now specify data access patterns associated with the application tasks. Consider the application workflow shown in Fig. 3. It models a web-based application that allows users to interact with various web pages (tasks) to retrieve data using well-defined queries. For example, upon logging in, a user starts with browsing albums of a particular artist, genre or date (Q1Q3). Once the user finds an album of interest, he searches for playlists featuring this album (Q4, Q5), etc.

Screen Shot 2015-09-02 at 3.53.43 PM

Q1: Find albums by a given artist. Order results by year (DESC).

Q2: Find albums by genre released after a given year. Order results by year (DESC).

Q3: Find albums of a given genre released in a given country after a given year. Order results by year (DESC).

Q4: Find playlists by a given album.

Q5: Find distinct playlists by a given album.

Q6: Find music genres featured by a given playlist. Show distinct genres.

Q7: Find a user who created a given playlist.

Q8: Find users who shared a given playlist.

Q9: Find users invited by a given user.

Q10: Find a user who invited a given user.

Q11. Find distinct playlists created by a given user, featuring music of a given style.

Q12. Find distinct albums featured by a given playlist.

Fig. 3. An application workflow for our media cataloguing use case.

We now discuss several advanced features of KDM concerning access patterns.

The Q4 access pattern retrieves all playlists featuring songs from a given album. By default, using the “find value” feature on, name and tags attribute types would create a schema storing all playlists related to a given album. In such a case, if a playlist P features two songs from a given album, it will appear twice in the query result. While such information might be useful for some application tasks (e.g., to find how much a playlist is related to a given album), in many cases, such as in Q5, the application needs to display only distinct playlists. To accommodate the latter scenario, KDM provides an advanced feature, called “find distinct”, that we use on, name and tags to find distinct  playlists (i.e., distinct combinations of id, name and tags) in Q5, as shown in Fig. 4(a). Fig. 5 shows the complete Q5 access pattern. Similarly, to find distinct genres in Q6, we use “find distinct” on Album.genre, as shown in Fig. 4(b).

Screen Shot 2015-09-02 at 3.54.49 PM

Fig. 4. Finding distinct playlists in Q5 (a), and distinct genres in Q6 (b).

Screen Shot 2015-09-02 at 3.55.28 PM

Fig. 5. The Q5 access pattern.

Q7 and Q8 differ from any of the access patterns discussed so far in an important way. As shown in the figure below, Q7 and Q8 retrieve users who created and shared a given playlist, respectively.

Screen Shot 2015-09-02 at 3.56.25 PM

Note that there are multiple ways in which a user can be related to a playlist – via creates, shares, or plays relationships. Therefore, Q7 as well as Q8 must explicitly define which relationship path is considered, as this affects the resulting logical data model. Indeed, we will see that Q7 and Q8 require two different table schemas. Because specifying relationship paths explicitly is only needed in ER models with cycles, we call such access patterns cyclic.

As shown in Fig. 6, to specify Q7 we change the default “Simple Access Pattern” setting to “Cyclic Access Pattern”. Upon selecting what is given and what is to be found in the query using the “given value” and “find value” menu items respectively, KDM creates dropdown lists for entity types in the GIVEN and FIND tables (Fig. 6). In the case of both Q7 and Q8, we leave the default selections in these dropdowns, as shown in the figure. We will explain the purpose of these dropdowns when we discuss Q9 and Q10.


Screen Shot 2015-09-02 at 3.57.16 PM

Fig. 6. Specifying the Q7 access pattern.

KDM lists all entities/relationships involved in the query, and asks the user to configure each path. We click the “configure path” button, to have KDM find and list all the paths between the Playlist and the User types. We choose “,creates,” from the generated list, as shown in Fig. 6. This completes the definition of Q7. Q8 is specified similarly, as shown in Fig. 7.

Screen Shot 2015-09-02 at 3.58.11 PM

Fig. 7. Specifying the Q8 access pattern.

Screen Shot 2015-09-02 at 3.59.15 PMAs shown on the left, the Q9 and Q10 queries retrieve users invited by a given user, and a user who invited a given user, respectively. The Q9 access pattern is specified as shown in Fig. 8. First, since the User entity type appears twice in Q9 in two different roles, to distinguish between the two, we choose “User_inviter” in the dropdown under “GIVEN”, and “User_invitee” in the dropdown under “FIND” (Fig. 8). Next, we press the “configure path” button in the bottom. If an entity type appears on both ends of a path, e.g., User_inviter and User_invitee in both Q9 and Q10, we need to specify the path ourselves, by choosing the “specify custom path” option, as shown in Fig. 8(a). Fig. 8(a,b,c) shows the step-by-step process of entering the path between User_inviter and User_invitee, i.e. “User_inviter-inviter-invites-inviteeUser_invitee”. Fig. 9 shows the final Q9 tab.

Screen Shot 2015-09-02 at 4.01.24 PM

Fig. 8. Entering a relationship path User_inviter-inviter-invites-inviteeUser_invitee.

Screen Shot 2015-09-02 at 4.02.44 PM

Fig. 9. The complete Q9 access pattern.

The Q10 access pattern is specified similarly, except that the path direction is now reversed: “User_invitee-invitee-invites-inviter-User_inviter” (Fig. 10).

Screen Shot 2015-09-02 at 4.03.41 PM

Fig. 10. The Q10 access pattern.

The Q11 access pattern retrieves playlists created by a given user featuring music in a given style. Since there are multiple paths between the User entity type and the style attribute (whose parent entity type is Artist), we must specify the path between the User and the Artist types, as shown in Fig. 11.

Screen Shot 2015-09-02 at 4.04.36 PM

Fig. 11. The Q11 access pattern.

Once all the access patterns have been defined, we generate a Cassandra logical data model, by clicking the Screen Shot 2015-09-02 at 4.05.32 PM  button.

Step 3: Select a Logical Data Model

Fig. 12 shows a logical data model generated by KDM for our use case. It consists of a set of tables that can efficiently support the specified access patterns. The tables are shown in the Chebotko notation [1,2], where K denotes a partition key column, C denotes a clustering key column whose values are stored in ascending (↑) or descending (↓) clustering order, and S denotes a static column. Finally, {} denote a set column, [] and <> are used for list and map columns, respectively.
KDM automatically computes logically correct primary keys, and determines which columns are static (e.g., in table0). For some of our access patterns, several alternative table schemas are possible, in which case KDM produces a list of such schemas for the user to choose from. Whenever possible, KDM derives meaningful table names from the access patterns. The table schemas are generated according to the mapping rules defined in [1,2]. This frees us from having to manually perform the conceptual-to-logical mapping. We choose one table schema per access pattern, and click theScreen Shot 2015-09-02 at 4.06.44 PM  button to have KDM generate a physical data model with default data types.

Screen Shot 2015-09-02 at 4.07.17 PM

Fig. 12. A logical data model generated by KDM for our use case.

Step 4: Configure the Physical Data Model

The screenshot in Fig. 13 shows a part of a physical data model produced by KDM for our use case. A CQL query equivalent to the access pattern is shown underneath each corresponding table schema. We perform various physical optimizations concerning data types, table and column names and partition sizes. Among other things, we have changed table names table0 and table4 to albums_by_genre and albums_by_genre_country, respectively. The SELECT queries in the final CQL script will reflect this change.

Screen Shot 2015-09-02 at 4.08.52 PM

Fig. 13. The first three tables of the physical data model for our use case.

The complete physical data model is visualized in Fig. 14 as a Chebotko diagram. Chebotko diagram [1,2] presents a database schema as a combination of individual table schemas and query-driven application workflow transitions (KDM does not currently draw Chebotko diagrams). As part of the physical optimization, we have split partitions in the Playlist_by_album and Playlist_by_album_distinct tables (queries Q4 and Q5, respectively). We added a bucket column to each table, so that playlists will be uniformly distributed across the buckets. Each bucket will store the bucket number.

Screen Shot 2015-09-02 at 4.11.35 PM

Fig. 14. The Chebotko Diagram for our use case.

We now press the Screen Shot 2015-09-02 at 4.12.48 PM  button to generate a CQL script capturing our physical data model.

Step 5: Download a CQL Script

Fig. 15. shows a CQL script produced by KDM, that can be readily executed against a Cassandra cluster to create a schema.

Screen Shot 2015-09-02 at 4.13.28 PM

Fig. 15. A CQL script generated in KDM.


In this second article of our series, we have used KDM for a more complex real-life data modeling use case – a media cataloguing application. The use case involves roles, cyclic queries, and queries over multiple entities and relationships. KDM supports such complex real-life scenarios, by providing a number of advanced features, such as setting alternative keys, finding distinct values, and specifying cyclic access patterns. The use case is available in KDM, under “Use Cases -> Media Cataloguing” menu.



This work would not have been possible without the inspiring ideas and helpful feedback of Dr. Artem Chebotko and Mr. Anthony Piazza. Andrey Kashlev would also like to thank Dr. Shiyong Lu for his support of this project.



[1] Artem Chebotko, Andrey Kashlev, Shiyong Lu, “A Big Data Modeling Methodology for Apache Cassandra”, IEEE International Congress on Big Data, (In Press), 2015. download

[2] DataStax Training, DS220: Data Modeling with DataStax Enterprise