“Cassandra was the one that sounded like the best fit for us based on its lack of single points of failure and more importantly probably, the SSTable (Big Table) design.”
-Axel Liljencrantz, Backend Engineer at Spotify
I’m an avid Spotify user and almost every music lover knows what your service is, but for those of us out there who haven’t heard of Spotify: What is it that you do?
Spotify is a streaming music web service. We have over 20,000,000 music tracks that you can listen to. There’s a free tier, so that you don’t have to pay anything; if you don’t pay, you will not get access to the mobile version, except for radio, and you will also have some ads in there. If you pay $10 per month, in the US, you get access to every track on your cellphone, laptop, tablet, and desktop.
Why do you use Cassandra at Spotify?
Well, Spotify started out as strictly a PostgreSQL shop. I would say that PostgreSQL is a really awesome product; it has a very nice, desirable feature set, but after we had scaled up to one or two million users we started to experience some scalability problems with certain services.
Basically, once you hit multiple data centers, streaming replication in PostgreSQL doesn’t really work that well for high write volumes and so on. So, we looked at a big data solution, and Cassandra was the one that sounded like the best fit for us based on its lack of single points of failure and, probably more importantly, the SSTable (Big Table) design. It gives us a level of trust that we won’t lose data, even though it is still a young product. Basically, if there are bugs or crashes, we are confident that we won’t lose our data; that is very important to us.
How much data does Spotify store in Cassandra?
I can’t really give you an exact number because we have almost a dozen different services in separate clusters that use Cassandra. The largest service that we have with the most data has somewhere just north of 50 terabytes of compressed data.
What are some of the services within Spotify that Apache Cassandra is being used for?
Playlists, radio stations, the events for the little notification popups and the list of artists you follow are examples of some of the things we store in Cassandra.
Do you store your data in a physical data center or in the cloud?
We are, as of now, a physical data center shop.
What are your thoughts on the Cassandra community?
That is a very interesting question seeing how I am from Sweden; there isn’t really all that much of a Cassandra community in Sweden yet, but you, I and many others are trying to set up meetups in Stockholm and so on. So, hopefully we will see an exciting and vibrant community. Overall, I would say that there are a lot of interesting developments in the Stockholm area; it used to be all kinds of companies that were isolated, but I feel there is an increased level of interest, and people want to meet up, talk and share knowledge. Stockholm has always had a fantastic level of engineering competency, and I feel that we haven’t yet, in our beautiful capital of Sweden, done enough to leverage that in order to drive more things when compared to Silicon Valley, New York or some of the larger local Cassandra communities; it’s definitely beginning to change.
What’s something that you’ve learned along the way using Cassandra, that in hindsight, you would have done differently? Maybe something you’ve learned the hard way?
With Cassandra, if you poke it you lose a finger; it is still very rough in a lot of places when compared to SQL services that have been fine-tuned for every possible use case over the decades.
I think the really important thing to do, before you start doing any major development work, is to understand how Cassandra stores data and the underlying algorithms; it’s a really simple thing, because it’s a nice, well-maintained code base: you can just download it, open up your editor and you will actually understand what’s there. Keep very much in mind that Cassandra’s write performance is almost always awesome, but Cassandra’s read performance is highly dependent on your write patterns: the timing of your writes, what you’re writing and how much you’re overwriting. You will almost always, when you start using Cassandra in production, see very nice performance at the start, but the big question is how much your performance might degrade over time; it’s important that you’re using the right algorithms and schema.
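To make the point about write patterns concrete, here is a minimal sketch of a toy log-structured store in Python. It is not Cassandra's code or API: the class ToyLSM, the flush threshold and both workloads are hypothetical, and the model deliberately leaves out compaction, bloom filters and caching, all of which reduce this cost in a real cluster. It only illustrates the general idea that a row written once tends to live in one flushed table, while a row that is overwritten over a long period can end up with pieces in many of them, and a read has to merge all those pieces.

```python
class ToyLSM:
    """Toy log-structured store: one memtable plus immutable flushed tables.

    Hypothetical model for illustration only; no compaction, no bloom filters.
    """

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold
        self.memtable = {}           # partition key -> {column: value}
        self.sstables = []           # flushed snapshots, oldest first
        self.writes_since_flush = 0

    def write(self, key, column, value):
        # Writes only ever touch the in-memory table, so they stay cheap.
        self.memtable.setdefault(key, {})[column] = value
        self.writes_since_flush += 1
        if self.writes_since_flush >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Freeze the current memtable as a new immutable table.
        if self.memtable:
            self.sstables.append(self.memtable)
            self.memtable = {}
        self.writes_since_flush = 0

    def read(self, key):
        """Return (merged row, number of flushed tables that held a piece)."""
        row, touched = {}, 0
        for table in self.sstables:  # oldest first, so newer values win
            if key in table:
                row.update(table[key])
                touched += 1
        row.update(self.memtable.get(key, {}))
        return row, touched


def tables_touched_write_once():
    """100 rows, each written exactly once."""
    store = ToyLSM(flush_threshold=10)
    for i in range(100):
        store.write(f"user-{i}", "name", i)
    _, touched = store.read("user-0")
    return touched


def tables_touched_overwritten():
    """One row overwritten in every round, interleaved with other traffic."""
    store = ToyLSM(flush_threshold=10)
    for round_no in range(10):
        store.write("playlist-1", "title", f"version-{round_no}")
        for j in range(9):
            store.write(f"other-{round_no}-{j}", "x", j)
    _, touched = store.read("playlist-1")
    return touched


if __name__ == "__main__":
    print("written once:", tables_touched_write_once())
    print("overwritten over time:", tables_touched_overwritten())
```

In this toy model the row written once is found in a single flushed table, while the row overwritten across ten rounds has to be merged from ten tables even though it only holds one column. Both workloads write the same number of cells, which is why the write side looks equally fast in each case and the difference only shows up later, on reads.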