June 14th, 2013

Nathan Milford: US Operations Manager for Outbrain

DataStax: Commercial company behind Apache Cassandra

 

DataStax: Nathan, we appreciate you taking time to chat with us today. What do you guys do at Outbrain?

 

Nathan:  Outbrain helps people discover the most interesting, relevant and trusted content out there. We are the leading company for content discovery. Folks like CNN, Fox News, and lots of other premium publishers and companies trust us on their sites. The company was founded in 2006 and we do somewhere around 90 billion recommendations on over 10 billion page views per month.  Right now we reach ~86% of the online US population.

 

DataStax: What’s your tech infrastructure look like?

 

Nathan:  Top to bottom we use open source technologies on the front and back end.  On the back end we employ a menagerie of technologies and data stores.  Cassandra is one of the central pieces of our back end and serves as one of our primary data stores.  We also use Hadoop and Hive heavily and rely on Solr, MogileFS, MySQL, Storm, and Kafka for various products and algorithms. We even have a few small MongoDB instances for various projects.

 

DataStax: How long have you been using Cassandra?

 

Nathan:  It’s been quite a while. We started way back with version 0.4.  As an ops guy, I like Cassandra because, after you do your data modeling, you can turn Cassandra on and pretty much leave it alone except when you need to bring new nodes online to increase capacity.

 

DataStax: What’s your deployment of Cassandra look like?

 

Nathan:  We’ve got a number of Cassandra clusters. We have a general use cluster, an ‘online’ cluster that sits directly in the path of serving recommendations (so its speed is critical), a few clusters tied to specific algorithms or processes, and then a number of other production and development clusters.

 

We like standing up different Cassandra clusters that are devoted to a specific application domain vs. having one big multi-tenant cluster that services many different apps. That seems to give us the best performance and makes finding problems easier, plus it’s also simpler to manage from an automation perspective.

 

We also run these clusters across multiple data centers; we have three data centers now and more on the way.

 

DataStax: What kind of use cases does Cassandra help you solve?

 

Nathan:  Cassandra tackles a number of different use cases for us such as creating personalized sets of recommendations for various user profiles, ingesting lots of content data that’s analyzed for recommendation purposes, and so on.

 

DataStax: What were the primary factors that caused you to pick Cassandra?

 

Nathan:  A number of years ago we were looking at what would be the best way to deploy our data across multiple data centers. A number of other databases we tried just wouldn’t work and were operationally immature. Our primary reason for using Cassandra over something like HBase is the ease with which we can span multiple data centers; it’s very simple. We’re talking very little if any operational overhead in doing what we do today with Cassandra.

 

By contrast, try doing something like that with MySQL: a network partition or other failure can, in many cases, be expensive to repair in terms of engineering time.

 

DataStax: What kind of difference do such things make where your business is concerned?

 

Nathan:  During Hurricane Sandy, we lost an entire data center. Completely. Lost. It.

Our application fail-over resulted in us losing just a few moments of serving requests for a particular region of the country, but our data in Cassandra never went offline.

 

Then, when that data center came back online, we spent a lot of time manually rebuilding other databases that we use, shipping drives across the country and whatnot. For Cassandra, all we did was turn it back on in that data center and run a single command, and it brought itself back up to speed with no other manual work.

 

DataStax: How do you decide what database to use for what task?

 

Nathan:  We look at how the data will be queried, its size, and how it needs to be distributed. We might use things like MySQL for historical reasons and MongoDB for smaller tasks, and then Cassandra for situations where data doesn’t all fit into memory or where it spans multiple machines and possibly data centers.

 

Cassandra is something we have a pretty good handle on now and know what to expect from it.  When we architect and design properly we trust it to not fold under pressure.

 

DataStax: What advice would you give to people who are just getting started with Cassandra?

 

Nathan:  The primary thing is to do your data modeling in a way that fits with Cassandra. Don’t treat it like MySQL, Mongo, or a pure KV store and you’ll get the full benefits of its replication and caching models.  Get that right and your management and maintenance isn’t hard. Any real problem we’ve had with Cassandra has occurred because we tried to treat the Cassandra data model like some other data store, and you can’t do that.

 

Additionally, don’t fall into the trap of running a few large servers.  Lots of medium-size servers are much better.
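
One way to see the reasoning behind that last point is with some back-of-the-envelope arithmetic. The sketch below is illustrative only and is not from Outbrain: it assumes equal-sized nodes with evenly distributed data and load, and the function names and node counts (4 and 16) are made up for the example.

```python
def capacity_lost_per_node_failure(node_count: int) -> float:
    """Fraction of total cluster capacity lost when a single node fails.

    Assumes equal-sized nodes with evenly distributed data (a simplification).
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return 1.0 / node_count


def load_increase_on_survivors(node_count: int) -> float:
    """Relative extra load each surviving node absorbs after one node fails.

    Each node goes from carrying 1/N of the load to 1/(N-1),
    which is a relative increase of 1/(N-1).
    """
    if node_count < 2:
        raise ValueError("need at least 2 nodes to have survivors")
    return 1.0 / (node_count - 1)


# Hypothetical cluster shapes with the same total capacity.
for label, nodes in (("few large servers", 4), ("many medium servers", 16)):
    lost = capacity_lost_per_node_failure(nodes)
    extra = load_increase_on_survivors(nodes)
    print(f"{label} ({nodes} nodes): "
          f"{lost:.1%} of capacity lost, "
          f"{extra:.1%} more load per surviving node")
```

Under these assumptions, losing one node in the smaller cluster removes a quarter of its capacity, while the larger cluster loses one sixteenth, so each failure is a smaller event to absorb and recover from.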