September 26th, 2013


“Really easy to operate, easy to scale and could cope with heavy write loads.”

-Omid Aladini, Data Infrastructure Engineer at SoundCloud



Today we have with us Omid Aladini, Data Infrastructure Engineer at SoundCloud. Omid, thanks for joining us today; to get things started, could you tell us a little bit about what SoundCloud does?

SoundCloud is an online audio distribution platform which allows musicians to collaborate, promote and distribute their music. It has sometimes been described as being for audio what Flickr is for photos, or Vimeo is for video.


Excellent. And how are you using Apache Cassandra at SoundCloud?

SoundCloud uses Cassandra for multiple applications. One of these is the activity feed, which lets users see the new tracks and sounds from the people they follow.


Cassandra is also used in our real-time stats and premium stats product. This product allows creators to see statistics about their tracks and creations. Additionally, it gives insight into the real-time time series of how often a track has been played or downloaded, which the user can access in multiple places across the site.


I think our first Cassandra cluster was created three or four years ago for our activity feed. It had to handle very high write loads, and we needed to make sure we could scale across multiple machines because the data set was expected to be large, and it is large.


And what was the motivation for using Cassandra? Were there other technologies that it was evaluated against?

Cassandra looked like a good choice because it was really easy to operate, easy to scale and could cope with heavy write loads, in comparison to other choices which had fallen short. For example, MySQL was operationally more complicated to handle.  Cassandra seemed like a good decision.


Great. Can you share some insights on what your deployment looks like? 

We have four clusters in production across three data centers. Data set sizes range from 1 terabyte to 20 terabytes. In terms of node count, the smallest cluster has eight nodes while the largest, which runs on EC2, has 48. I think the two newest clusters are running on SSDs, while the older ones are running on spinning disks. The new clusters we are provisioning mostly run on SSDs. Cassandra is easy to scale and supports multiple data centers, which is a very nice feature. The Hadoop integration also comes in handy: we have many applications that aggregate data on Hadoop, and the integration is really nice and straightforward. Counters are also really nice. We chose Cassandra for our real-time time-series data because it supports distributed counters.


And what would you like to see out of Apache Cassandra in future versions?

At the moment there are multiple issues in Cassandra that make it operationally hard to use, namely anti-entropy repair, which is kind of hard to operate.


Also, counters are intricate, and I know the Cassandra community is working on making them better and easier to use. I’m looking forward to seeing how Cassandra is going to develop.


What is your experience with the Apache Cassandra community?

The Cassandra community is very responsive. I’ve had a positive experience with it: reporting bugs and submitting patches is very easy, and you get immediate feedback. You can also get features that you would like included in Cassandra. The mailing lists are very responsive, and it’s easy to get your requests answered quickly.