April 17th, 2014


Python Driver Overview Using Twissandra” was created by Lyuben Todorov, Software Engineer at DataStax.

Twissandra, a Twitter clone using Cassandra for storage, has had a makeover to use the new python driver. This allowed the clone to make the switch from the thrift API to using CQL3 over the native protocol. Let’s go through some examples of using the python driver, taken from the updated Twissandra code.

Twissandra Datamodel Overview

Twissandra is composed of six tables that store users, tweets, tweet order (of the user and their timeline) and who users follow (and are followed by). Since we can’t use joins in Cassandra, tables are partially denormalised to allow for necessary flexibility, meaning there are more writes to make reads more performant.

Twissandra ER Diagram

The users table simply stores usernames and passwords:

Tracking latest tweets

Tweets are stored in a simple table where the primary key is a UUID column, ensuring the tweet’s uniqueness. We don’t track when the tweet was added in this table as that’s handled by the user’s timeline (see the userline and timeline table creation below).

TimeUUIDs are used for tracking the time of the tweet, to ensure uniqueness in the primary key, as they are composed of a random component and a timestamp. This allows us to retrieve unique tweets by time and also allows for tracking when the tweet was added. Cassandra sorts the timeline and userline based on the clustering key time. Since the aim is to retrieve the latest tweets WITH CLUSTERING ORDER BY (time DESC) is added to the table creation statements to invert the sorting.

Because the username is the partition key, we can easily select the most recent tweets for a specific user. The LIMIT clause can then be added to enforce a limit on how many tweets are retrieved:

An important note. The data-model presented here is only partially denormalized. Denormalizing the tweets table completely into the timeline and userline tables would improve query time, by letting us directly query the tweets from them, instead of requiring a second set of SELECT’s to retrieve the content of the tweets

Tracking Followers
The followers table allows for retrieval of the users that are following you. The friends table allows for retrieval of the users that you follow. The primary key for both tables is a composite key. This is important because the first component of the composite key, the partition key, decides how to split data around the cluster. One set of replicas will store all the data for a specific user. The second component is the clustering key which is used to store data in a particular order on disk. Although the ordering itself isn’t important for either table, the clustering key means all rows for a particular user will be stored contiguously on disk. This optimises reading a user’s friends or followers by allowing for a sequential disk read.

To retrieve all the followers or friends for a specific user, the username is added to the WHERE clause just like in SQL. Something worth noting is that we can use the username in the WHERE clause because it’s part of the primary key.


Setting up a connection

To connect to Cassandra we first import the driver’s Cluster class. The next step is to create a cluster and a session. We then supply the list of IPs for nodes in the cluster and tell the session what keyspace to connect to. Note that sessions automatically manage a pool of connections so they should be long-lived and re-used for multiple requests.



The various things that twitter can do, whether it’s inserting a tweet, retrieving your followers, updating your password or unfollowing someone, are examples of create / read / update and delete operations that can be carried out on Cassandra.

Tweeting – Create
Adding tweets is done via Twissandra’s save_tweet function where four kinds of queries are carried out:

  1. Insert the tweet
  2. Update the current user’s userline with the tweet_id
  3. Update the public userline with the tweet_id
  4. Update the timelines of all of the user’s followers with the tweet_id

Inserting a message into Twissandra

Inserting the tweet message is as simple as supplying the username, the message, and generating a UUID. Note, if we didn’t need to save the UUID for use in later inserts, it could have been created using the uuid() function available in Cassandra 2.0. For a full list of CQL3 functions take a look at the DataStax docs.

Adding to the user’s and public userlines requires a username, the tweet’s ID and a time uuid:

Finally to complete the tweeting process, the tweet has to be inserted into each one of your follower’s timelines. This requires the username of the follower, the tweet’s creation time in the form of a Time UUID and the tweet’s ID in the form of a UUID.

Retrieving Tweets – Read

Retrieving tweets is done using one of two functions in Twissandra. The get_timeline and get_userline functions are both calls to _get_line. Retrieving either all of our tweets or all of someone else’s tweets is done via _get_line. To carry out the querying we require a username, a tweet starting time and the number of tweets to fetch. Since we don’t want to fetch the entire feed, first the range of tweets that we want to retrieve is selected.
Retrieving messages from Twissandra

If we need to start our page further back than the latest tweets, the less-than predicate, time < %s, can be used to retrieve tweets further back in the timeline.

Again, because we want to page through the timeline rather than retrieving all of it in a single query, we want to check if we reached the end of the timeline, and if not to store a marker to tell us where to start the page during the next query.

Once the array of tweet IDs is retrieved, they are used to fetch the actual tweets.

Queries are sometimes executed using session.execute and other times session.execute_async is used instead. The difference between the two is that execute waits for a response before returning whilst execute_async returns a “future” so it can send multiple messages concurrently, without waiting for responses, therefore there is no guarantee on the order of the responses. The returned ResponseFuture can be used to verify the query’s success for both serial and concurrent queries. On failure an exception would be raised.

Changing Password – Update
Updates and inserts have mostly identical behavior with Cassandra. They both blindly overwrite existing (or non-existing) data. Twissandra doesn’t use UPDATE statements but for completeness here is a theoretical example of updating a password:

Unfollowing – Delete

Removing a user from your feed requires two queries since in CQL3 there are no foreign keys to enforce relationships between the friends and followers table. The first query removes the user from your feed while the second tells them you are no longer following them. Prior tweets from this user won’t however be deleted from your timeline.


Enhancing Twissandra With New Cassandra Features

Modelling in Cassandra frequently requires denormalization as there is no joining of tables. Denormalization can be summed up as the process of adding redundant data to tables in order to optimise read performance. The frequent use-case in the relational model of having users with multiple email addresses is usually modelled by creating a user table and an email table where there is a one-to-many relationship. Cassandra’s alternative is to use CQL3 collections where a column can store a list, set or a map of fields. If Twissandra’s user table also required each user’s email address (see example below) and allowed for more than one, the set collection could be used to store them.

Light Weight Transactions

Lightweight transactions weight transactions (LWT) are another piece of functionality that was added to satisfy commonly used patterns requiring strong consistency, like for example the need to ensure that a username is unique before allowing someone to register said username. LWT aren’t available in version 0 of the python driver but are on their way in the new version 2.0. release. But here is an example of what inserting a username would look like using a LWT from cqlsh. We execute the INSERT as usual, but also append IF NOT EXISTS

LWT can also be used to verify a row exists by appending IF EXISTS to the end of the query: