Python Driver Overview Using Twissandra

April 17, 2014



“Python Driver Overview Using Twissandra” was created by Lyuben Todorov, Software Engineer at DataStax.

Twissandra, a Twitter clone using Cassandra for storage, has had a makeover to use the new python driver. This allowed the clone to switch from the Thrift API to CQL3 over the native protocol. Let’s go through some examples of using the python driver, taken from the updated Twissandra code.

Twissandra Datamodel Overview

Twissandra is composed of six tables that store users, tweets, tweet order (for the user’s own userline and their timeline) and who users follow (and are followed by). Since we can’t use joins in Cassandra, tables are partially denormalised to allow for the necessary flexibility: there are more writes so that reads are more performant.

Twissandra ER Diagram

The users table simply stores usernames and passwords:
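A reconstruction of the schema matching this description (column names may differ slightly from Twissandra’s):

```sql
CREATE TABLE users (
    username text PRIMARY KEY,
    password text
);
```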

Tracking Latest Tweets

Tweets are stored in a simple table where the primary key is a UUID column, ensuring the tweet’s uniqueness. We don’t track when the tweet was added in this table as that’s handled by the user’s timeline (see the userline and timeline table creation below).
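Reconstructed from the description (column names are assumptions):

```sql
CREATE TABLE tweets (
    tweet_id uuid PRIMARY KEY,
    username text,
    body text
);
```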

TimeUUIDs are used for tracking the time of the tweet and for ensuring uniqueness in the primary key, as they are composed of a random component and a timestamp. This allows us to retrieve unique tweets by time and also allows for tracking when the tweet was added. Cassandra sorts the timeline and userline rows by the clustering key, time. Since the aim is to retrieve the latest tweets, WITH CLUSTERING ORDER BY (time DESC) is added to the table creation statements to invert the sorting.
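Reconstructed table definitions consistent with the description (column names may differ slightly from Twissandra’s):

```sql
CREATE TABLE userline (
    username text,
    time timeuuid,
    tweet_id uuid,
    PRIMARY KEY (username, time)
) WITH CLUSTERING ORDER BY (time DESC);

CREATE TABLE timeline (
    username text,
    time timeuuid,
    tweet_id uuid,
    PRIMARY KEY (username, time)
) WITH CLUSTERING ORDER BY (time DESC);
```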

Because the username is the partition key, we can easily select the most recent tweets for a specific user. The LIMIT clause can then be added to enforce a limit on how many tweets are retrieved:
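For example (the username and limit are illustrative):

```sql
SELECT time, tweet_id FROM userline
WHERE username = 'lyubent'
LIMIT 40;
```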

An important note. The data-model presented here is only partially denormalized. Denormalizing the tweets table completely into the timeline and userline tables would improve query time, by letting us directly query the tweets from them, instead of requiring a second set of SELECT’s to retrieve the content of the tweets

Tracking Followers
The followers table allows for retrieval of the users that are following you. The friends table allows for retrieval of the users that you follow. The primary key for both tables is a composite key. This is important because the first component of the composite key, the partition key, decides how to split data around the cluster. One set of replicas will store all the data for a specific user. The second component is the clustering key which is used to store data in a particular order on disk. Although the ordering itself isn’t important for either table, the clustering key means all rows for a particular user will be stored contiguously on disk. This optimises reading a user’s friends or followers by allowing for a sequential disk read.
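Reconstructed definitions consistent with the description (the since column is an assumption):

```sql
CREATE TABLE friends (
    username text,
    friend text,
    since timestamp,
    PRIMARY KEY (username, friend)
);

CREATE TABLE followers (
    username text,
    follower text,
    since timestamp,
    PRIMARY KEY (username, follower)
);
```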

To retrieve all the followers or friends for a specific user, the username is added to the WHERE clause just like in SQL. Something worth noting is that we can use the username in the WHERE clause because it’s part of the primary key.
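For example (the username is illustrative):

```sql
SELECT friend FROM friends WHERE username = 'lyubent';
SELECT follower FROM followers WHERE username = 'lyubent';
```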


Setting up a connection

To connect to Cassandra we first import the driver’s Cluster class. The next step is to create a cluster and a session. We then supply the list of IPs for nodes in the cluster and tell the session what keyspace to connect to. Note that sessions automatically manage a pool of connections so they should be long-lived and re-used for multiple requests.
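A minimal sketch of the steps just described (the node addresses are illustrative; the import guard is only so the snippet reads without the driver installed — a real application would import unconditionally):

```python
# Guarded import so the sketch can be read without cassandra-driver installed.
try:
    from cassandra.cluster import Cluster
except ImportError:
    Cluster = None

def connect(hosts=('10.10.10.1', '10.10.10.2'), keyspace='twissandra'):
    """Create a long-lived Session bound to a keyspace.

    The hosts are contact points; the driver discovers the rest of the
    cluster from them. The returned session manages a connection pool,
    so it should be created once and re-used for multiple requests.
    """
    cluster = Cluster(list(hosts))
    return cluster.connect(keyspace)
```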



The various things that Twitter can do, whether it’s inserting a tweet, retrieving your followers, updating your password or unfollowing someone, are examples of create / read / update and delete operations that can be carried out on Cassandra.

Tweeting – Create
Adding tweets is done via Twissandra’s save_tweet function where four kinds of queries are carried out:

  1. Insert the tweet
  2. Update the current user’s userline with the tweet_id
  3. Update the public userline with the tweet_id
  4. Update the timelines of all of the user’s followers with the tweet_id

Inserting a message into Twissandra

Inserting the tweet message is as simple as supplying the username, the message, and generating a UUID. Note that if we didn’t need to save the UUID for use in later inserts, it could have been created using the uuid() function available in Cassandra 2.0. For a full list of CQL3 functions take a look at the DataStax docs.
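A sketch of the insert (the column names are assumptions; the execute call is shown in a comment since it needs a live session):

```python
import uuid

# Client-side UUID generation; uuid4() produces a random (version 4) UUID,
# the same kind Cassandra 2.0's uuid() function generates server-side.
tweet_id = uuid.uuid4()

INSERT_TWEET = """
    INSERT INTO tweets (tweet_id, username, body)
    VALUES (%s, %s, %s)
"""
# In Twissandra this is then run through the session, e.g.:
# session.execute(INSERT_TWEET, (tweet_id, username, body))
```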

Adding to the user’s and public userlines requires a username, the tweet’s ID and a TimeUUID:
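A sketch of the userline inserts (the sentinel public username and column names are assumptions):

```python
import uuid

# uuid1() builds a TimeUUID: a version-1 UUID embedding the current timestamp,
# which is what the userline's clustering key expects.
now = uuid.uuid1()

INSERT_USERLINE = """
    INSERT INTO userline (username, time, tweet_id)
    VALUES (%s, %s, %s)
"""
# Executed once for the tweeting user and once for a sentinel "public" user:
# session.execute(INSERT_USERLINE, (username, now, tweet_id))
# session.execute(INSERT_USERLINE, PUBLIC_USERLINE_KEY_PARAMS)
```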

Finally, to complete the tweeting process, the tweet has to be inserted into each of your followers’ timelines. This requires the username of the follower, the tweet’s creation time in the form of a TimeUUID and the tweet’s ID in the form of a UUID.
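A sketch of the per-follower inserts (the follower names are illustrative; in Twissandra they would come from the followers table):

```python
import uuid

tweet_id = uuid.uuid4()
now = uuid.uuid1()  # the same TimeUUID is reused for every follower's row

INSERT_TIMELINE = """
    INSERT INTO timeline (username, time, tweet_id)
    VALUES (%s, %s, %s)
"""

# One parameter tuple per follower; each would be run with
# session.execute(INSERT_TIMELINE, params).
followers = ['alice', 'bob']
params = [(follower, now, tweet_id) for follower in followers]
```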

Retrieving Tweets – Read

Retrieving tweets is done using one of two Twissandra functions, get_timeline and get_userline, both of which are thin calls to _get_line. Retrieving either all of our tweets or all of someone else’s tweets is done via _get_line. To carry out the query we need a username, a tweet starting time and the number of tweets to fetch. Since we don’t want to fetch the entire feed, the range of tweets that we want to retrieve is selected first.

Retrieving messages from Twissandra

If we need to start our page further back than the latest tweets, the less-than predicate, time < %s, can be used to retrieve tweets further back in the timeline.
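For example (reconstructed; the placeholders are filled in by the driver at execution time):

```sql
SELECT time, tweet_id FROM userline
WHERE username = %s AND time < %s
LIMIT %s;
```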

Again, because we want to page through the timeline rather than retrieving all of it in a single query, we check whether we have reached the end of the timeline and, if not, store a marker telling us where the next page should start.
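A sketch of the idea, not Twissandra’s exact code: the page is fetched with LIMIT limit + 1, and the rows are assumed to be (time, tweet_id) pairs.

```python
def page(rows, limit):
    """Trim a page fetched with LIMIT limit + 1 and find the next marker.

    Fetching one extra row is a simple way to detect whether more tweets
    remain; the time value of the last visible row becomes the marker
    passed to the next query's `time < %s` predicate.
    """
    if len(rows) > limit:
        rows = rows[:limit]
        next_start = rows[-1][0]  # TimeUUID of the last tweet on this page
    else:
        next_start = None  # reached the end of the line
    return rows, next_start
```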

Once the array of tweet IDs is retrieved, they are used to fetch the actual tweets.
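One way to do this is a single IN query over the collected IDs (a reconstruction; issuing one query per ID with execute_async is another option):

```sql
SELECT tweet_id, username, body FROM tweets
WHERE tweet_id IN %s;
```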

Queries are sometimes executed using session.execute, and other times session.execute_async is used instead. The difference between the two is that execute waits for a response before returning, whilst execute_async immediately returns a ResponseFuture, so multiple queries can be sent concurrently without waiting for responses; consequently there is no guarantee on the order in which responses arrive. The ResponseFuture can be used to verify the query’s success for both serial and concurrent queries; on failure, an exception is raised when its result is requested.
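A sketch of the concurrent pattern (fetch_concurrently is a hypothetical helper for illustration, not part of Twissandra):

```python
def fetch_concurrently(session, query, tweet_ids):
    """Start one async query per id, then wait for each result in turn.

    execute_async() returns a ResponseFuture immediately; calling
    result() blocks until that query's response arrives and raises the
    query's exception if it failed.
    """
    futures = [session.execute_async(query, (tid,)) for tid in tweet_ids]
    return [future.result() for future in futures]
```

Because all the queries are in flight before the first result() call, the total wait is roughly that of the slowest query rather than the sum of all of them.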

Changing Password – Update
Updates and inserts behave mostly identically in Cassandra. They both blindly overwrite existing (or non-existing) data. Twissandra doesn’t use UPDATE statements, but for completeness here is a theoretical example of updating a password:
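With illustrative values:

```sql
UPDATE users SET password = 'ch4ngem3'
WHERE username = 'lyubent';
```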

Unfollowing – Delete

Removing a user from your feed requires two queries, since in CQL3 there are no foreign keys to enforce relationships between the friends and followers tables. The first query removes the user from your feed, while the second tells them you are no longer following them. Prior tweets from this user won’t, however, be deleted from your timeline.
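The pair of deletes might look like the following (usernames are illustrative):

```sql
-- remove the user from your feed
DELETE FROM friends WHERE username = 'lyubent' AND friend = 'jsmith';

-- tell them you are no longer following them
DELETE FROM followers WHERE username = 'jsmith' AND follower = 'lyubent';
```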


Enhancing Twissandra With New Cassandra Features

Modelling in Cassandra frequently requires denormalization as there is no joining of tables. Denormalization can be summed up as the process of adding redundant data to tables in order to optimise read performance. The frequent use-case in the relational model of having users with multiple email addresses is usually modelled by creating a user table and an email table where there is a one-to-many relationship. Cassandra’s alternative is to use CQL3 collections where a column can store a list, set or a map of fields. If Twissandra’s user table also required each user’s email address (see example below) and allowed for more than one, the set collection could be used to store them.
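A sketch of what that table could look like (column names are assumptions):

```sql
CREATE TABLE users (
    username text PRIMARY KEY,
    password text,
    emails set<text>
);

-- adding an address to the set later:
UPDATE users SET emails = emails + {'lyuben@example.com'}
WHERE username = 'lyubent';
```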

Lightweight Transactions

Lightweight transactions (LWT) are another piece of functionality added to satisfy commonly used patterns that require strong consistency, for example ensuring that a username is unique before allowing someone to register it. LWT aren’t available in the current version of the python driver but are on their way in the new 2.0 release. Here is an example of what inserting a username would look like using a LWT from cqlsh: we execute the INSERT as usual, but append IF NOT EXISTS.
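With illustrative values:

```sql
INSERT INTO users (username, password)
VALUES ('lyubent', 's3cret')
IF NOT EXISTS;
```

If a row for that username already exists the insert is not applied, and the result includes an [applied] column reporting the outcome.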

LWT can also be used to verify a row exists by appending IF EXISTS to the end of the query:
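For example (again with illustrative values):

```sql
UPDATE users SET password = 'n3wpass'
WHERE username = 'lyubent'
IF EXISTS;
```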


Cassandra 2.0 Support for DataStax C# Driver

April 16, 2014


We’re glad to release today the first beta of our C# Driver 2.0, which supports Apache Cassandra 2.0 and DataStax Enterprise 4.0 while remaining fully compatible with Cassandra 1.2 and the DSE versions relying on it. This driver is intended to be aligned with the feature set that comes in our Java Driver 2.0.

In practice this means that C# developers can now enjoy:

We have several other improvements and changes to come in the coming weeks as we iterate through several beta versions:

  • Task-based API
  • Using some interfaces instead of classes in the API, to make it easier to mock every part of the driver
  • Automatic paging, a feature introduced in Cassandra 2.0 (not part of this first beta, but it will be included in the next one)

This new C# Driver 2.0.0-beta1 is now available on NuGet. Feel free to give it a try!



UPDATE: Cassandra Migration Yields Insane (10x) Performance Improvements at Rekko

April 16, 2014


“Cassandra Migration Yields Insane Performance Improvements” was created by Robert Thanh Parker, CEO and Founder at Rekko.


Wanted to let you guys know that I posted the message on our FB page. It’s amazing stuff and something the team has worked incredibly hard on, both refactoring our code and architecting and optimizing our infrastructure over Cassandra. More importantly, this update solves two remaining issues:

(1) From time to time we’d see spikes in load time which, while unusual, resulted from a key update lock in Mongo. The new infrastructure removes this 100%, and performance has been completely smooth since the update. Average server response times are less than 5ms for our most complex campaign delivery requests; previously they were closer to 50ms, with spikes that were much higher.

(2) Data manageability. We collect inordinate amounts of data (much more than most analytics providers), complicating the management of visitors as we scale. This makes performance an ongoing challenge. The new infrastructure largely solves this, but more importantly, at a cost structure that will continue to allow us to deliver increasingly sophisticated technology at lower costs: a key tenet of our long-term vision of bringing our technology to every online piece of real estate in the world. Big step there.


Here’s the quick post:

New Update: New Rekko Big Data Engine updated. Core services go live and we’re 10x faster overnight!

Speed. Speed. Speed.
We were fast before, but now we’ve concluded a major infrastructure refactoring. Our vision is to make accessible and automate big data personalization for small and medium sized businesses. This is a huge and crucial update towards our goal.

Lowering the Cost/Customer
Driving down the cost of providing enterprise level technology and services such that SMBs can EASILY leverage them is the most essential step in taking this technology mainstream. The first HD plasma TV I saw cost $29,999. The one I just bought cost significantly less. Our first Rekko customers paid $42k/month, our new ones pay a world less.

The Migration.
After a period of running Cassandra DB simultaneously with Mongo DB, the team completed the majority of our migration last night – we’re now completely live on Cassandra. While there are some small portions of infrastructure that will continue to use Mongo, almost everything material is now migrated.

The Results.
To summarize, the slowest response times on Cassandra (for real-time profiling and campaign delivery) average more than 10x better than the fastest we had using the Mongo DB code and infrastructure. We’re now able to intelligently deliver dynamic, tailored content to a visitor’s browser in less than 1/200th of a second. Consistently, and without spikes.

This is our last infrastructure step prior to rolling out…


Some days, we just smile because we know the hard work pushed us three steps forward. Today is one of those days…

Founder, Rekko

If you’re interested in learning more about migrating from MongoDB to Apache Cassandra, visit the MongoDB to Cassandra Migration page for resources and how-to’s.


Gilt Hackathon Dives into Apache Cassandra with DigBigData and DataStax

April 14, 2014




Lauri Apple, Technology Evangelism Specialist at Gilt


In late March Gilt’s Dublin team partnered up with Dublin-based consultancy DigBigData to offer a free Cassandra workshop near our Dublin office. Twenty-five technologists from Gilt and other area companies came together for a full day of hands-on learning, experimentation and fun taught by DigBigData’s Niall Milton (an official MVP for Apache Cassandra). Gilt’s Cassandra workshop was part of our free tech education initiative, by which we offer full-day tech courses at no charge to both Gilt and non-Gilt technologists. Since launching this program in June 2013, we’ve offered classes in Scala, Hadoop, R, Machine Learning and other topics of interest to us–and more courses are on the way. Nearly half of our Dublin team signed up for the Cassandra workshop, while other attendees came from Workday, Dun & Bradstreet and other companies.

Gilt currently doesn’t use Cassandra in production, but as NoSQL enthusiasts and open source advocates we’re quite interested in learning more about how it works. Several workshop attendees had prior experience working with older versions of Cassandra and wanted a quick refresher. Others on the team had very minimal experience, or had read the Dynamo and BigTable papers but never tried using it. Because everyone in the class was an experienced technologist, however, getting started posed very few problems.

“The biggest challenge for me was switching to working with a column-based database, having always worked with traditional row-based databases,” says Gilt Lead Software Engineer John Kenny. Adds Emerson Loureiro, another Gilt engineer: “I had no prior experience with Cassandra itself, but was familiar with most of the concepts behind it, so getting started was quite OK. To me it was more about looking at data modeling from a different perspective.”

After giving an introduction to Cassandra, Milton split the course into six teams who then set to work on building a variety of applications. Over the course of the day, teams asked lots of questions about performance, replication, fault tolerance, and other nuts-and-bolts aspects of Cassandra. By workshop’s end, the teams had created several exciting projects, including a CPU temperature monitor, a tweet sentiment analyzer, a multi-player, web-based game, and BigChat—a SnapChat-inspired service.

Though some of the students said they’d have benefited from more time to develop their projects, others were pleased with the end results of their work. “I think it was a nice use case for Cassandra,” says Emerson about the course. “It gave me the opportunity to stress the bits we had learned in the course and to get some more hands-on experience.”


Jonathan Ellis, Apache Cassandra Chair, discusses the history of Apache Cassandra

April 14, 2014

