Behance is the leading online platform to showcase and discover creative work. At a high level, we are a social network for creatives to share their projects to gain visibility and feedback.
Each user on Behance has their own activity feed, much like a Facebook News Feed or Twitter Timeline, which displays work that their network has interacted with in some way. The original version of the activity feed, launched in 2009, was initially powered by MySQL and switched to MongoDB in 2011. As of March 2015, it’s completely powered by Apache Cassandra.
We had largely stretched the limit of what we could reasonably expect out of MongoDB for performance, partly because it was an old implementation that wasn’t given a lot of love, but also because our data size had grown out of control over the years. At some point we hit a ceiling, and MongoDB wasn’t scaling for us anymore.
With Cassandra, we really like that we are able to self-manage the platform. From an operations perspective it’s great because when a node goes down, the application remains stable and nobody needs to panic. It is true linear scalability with relatively low operational overhead. It’s fast, nice to use, and pretty simple once you get familiar with it.
Compared to our Mongo cluster, there is also significantly less hardware involved: less RAM, less disk, less CPU, and fewer nodes. It will reduce our bill for this service by about 70% each month.
For the migration piece, our implementation was very similar to the use case Unity recently described in their interview with Planet Cassandra. We both moved from MongoDB while running Cassandra alongside it for a period of time; they ran in parallel for six months, while we ran for about one month before cutting over completely.
While we were running in parallel, every write that went to MongoDB also went to Cassandra; that is how we backfilled our data. We never ran an explicit migration, instead passively waiting for data to come in. To help us gauge how reads would perform under production load, we submitted requests to our Cassandra cluster in the background each time a request to MongoDB occurred. This means we forked not just writes, but also reads behind the scenes. It allowed us to know, without a doubt, what our performance would look like in our actual use case.
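The parallel run can be sketched roughly as follows. This is an illustrative stand-in, not Behance's actual code: the class and method names are made up, and simple in-memory stores take the place of the real MongoDB and Cassandra clients.

```python
# Sketch of the dual-write / shadow-read pattern: every write is mirrored to
# the shadow store (backfilling it as traffic arrives), and every read also
# fires a shadow read whose result is discarded. The primary store stands in
# for MongoDB, the shadow store for Cassandra.

class FeedStore:
    """Minimal in-memory stand-in for a real datastore client."""
    def __init__(self):
        self.items = {}

    def write(self, user_id, item):
        self.items.setdefault(user_id, []).append(item)

    def read(self, user_id):
        return self.items.get(user_id, [])


class DualWriteFeed:
    def __init__(self, primary, shadow):
        self.primary = primary  # MongoDB: still authoritative during the parallel run
        self.shadow = shadow    # Cassandra: passively backfilled by mirrored writes

    def write(self, user_id, item):
        self.primary.write(user_id, item)
        self.shadow.write(user_id, item)   # backfill happens as writes come in

    def read(self, user_id):
        self.shadow.read(user_id)          # shadow read: exercises Cassandra under real load
        return self.primary.read(user_id)  # the user still sees MongoDB's answer
```

In production the shadow read would be asynchronous and its latency recorded, so MongoDB response times are unaffected while Cassandra is measured under the real request mix.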
We cut traffic over to Cassandra over the course of about an hour, starting with 2% of traffic and incrementing slowly to 100% while carefully monitoring dashboards and logs. Once fully cut over, our traffic looked exactly the same as our simulations.
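A gradual cutover like this is often done by hashing each user into a stable bucket and comparing against the rollout percentage. The sketch below is a hypothetical illustration of that technique, not Behance's routing code:

```python
# Hash each user id into a stable bucket 0-99 and send the user to Cassandra
# once the rollout percentage passes their bucket. The same user stays on the
# same backend as the dial moves from 2% toward 100%, so a feed never flaps
# between stores mid-ramp.
import zlib

def routes_to_cassandra(user_id: str, rollout_percent: int) -> bool:
    bucket = zlib.crc32(user_id.encode("utf-8")) % 100
    return bucket < rollout_percent
```

Because the bucket is derived from the user id rather than random per request, raising the percentage only ever moves users onto Cassandra, never back.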
We were running behind the scenes in production for about two months, but the last month was when we had a schema that was finalized and working well for us. We revamped our schema four or five times, which meant a decent amount of changed application code each time. There were single weeks in which we rewrote the entire schema and application. It was pretty crazy, and really quick iteration, but we found the right setup in the end.
We run all of our queries at consistency level ONE. Due to the nature of our data, it doesn’t matter much if something is slightly out of date, and it gives us a little extra speed for free. The fact that our data is ephemeral helps us run everything very quickly and smoothly.
We initially deployed on 8 nodes, but as soon as this post goes live, we’ll be doubling to 16 for some extra comfort. We are running in one data center, with NetworkTopologyStrategy and a replication factor of 2. We are spread across 3 availability zones on Amazon Web Services and using ephemeral storage. All nodes are running on i2.2xlarge instance types, which give us 2x800 GB of SSD storage.
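The replication settings above can be sketched as keyspace DDL. The keyspace and data-center names here are illustrative, not Behance's: with NetworkTopologyStrategy, a replication factor of 2, and the availability zones mapped to Cassandra racks, each row lives on two nodes, placed in different zones where possible.

```python
# Illustrative DDL for the topology described above; a driver session would
# run session.execute(KEYSPACE_DDL) against the cluster.
KEYSPACE_DDL = """
CREATE KEYSPACE IF NOT EXISTS activity
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 2
  };
"""
```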
Our primary language for backend services at Behance is PHP. Because there was no DataStax PHP driver at the time and we wanted to use officially supported software as much as we could, we wrote a microservice in Python, which we call Pyve. In front of that is a PHP web service we call BeHive, which acts as an interface to Cassandra for the Behance web application. Everything is done through HTTP API calls.
The DataStax Python driver is very easy to use. It gave us a lot of useful options to optimize with, like token-aware load balancing, setting routing keys, and concurrency out of the box.
The presentation that made things “click” for me was John Berryman’s talk at Cassandra Summit 2014 on how CQL maps to Cassandra’s internal data structure. It is immensely helpful to be able to visualize what CQL queries are actually doing and why something you’re doing could be really bad; that was a revelation when learning the basics of Cassandra. Patrick McFadin’s talk, “Real Data Models of Silicon Valley”, was helpful as well. Planet Cassandra is a really good resource, especially the Cassandra Twitter community. Every so often we’d run into an issue and tweet about it with #Cassandra, and the community would respond quickly, with help from people like Jon Haddad and Scott Hirleman.
One of the most valuable things we did was model our data around the queries we’d be running. It was really fascinating. While trying some of our failed data models, we attempted to make them work by tuning the JVM. This led to some small improvements, but nothing near what we saw when we found the model that actually worked for us. If we could give one piece of advice to first-time Cassandra users, it would be to always make sure that the data model is based on the queries being performed.
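As a hypothetical example of that query-first approach (the table and column names below are made up, not Behance's schema): an activity-feed table can be shaped around the single query it serves, "fetch the newest N feed items for a user".

```python
# user_id is the partition key, so one feed page is one partition read; the
# clustering order means rows come back newest-first with no sort at query time.
FEED_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS activity.feed (
    user_id   bigint,
    item_time timeuuid,
    actor_id  bigint,
    action    text,
    PRIMARY KEY ((user_id), item_time)
) WITH CLUSTERING ORDER BY (item_time DESC);
"""

FEED_PAGE_QUERY = "SELECT actor_id, action FROM activity.feed WHERE user_id = ? LIMIT 50"
```

The table exists to answer that one query; a different query would get its own table with its own partitioning, rather than a secondary index bolted onto this one.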
For more info on migration best practices, check out the MongoDB to Apache Cassandra migration guide.