John Miller: Director of Technical Operations at Ampush
Christian Hasker: Editor at Planet Cassandra, A DataStax Community Service
TL;DR: Ampush is an ad technology company and Facebook Strategic Preferred Marketing Developer (sPMD). It helps large-scale brands and direct response advertisers achieve performance at scale using Facebook ads.
They just completed a migration from MySQL, where they were running into issues with data volume and especially with high availability and performance. They chose Cassandra because nothing else could meet their requirements for availability and performance at scale; Hadoop was a close second, but it didn’t fit their use case properly and its operational overhead was much higher than Cassandra’s.
Ampush is running tens of nodes in one data center and working with many terabytes of data, a volume expected to double within the next year.
Hi, everybody. I am joined today by John Miller, Director of Technical Operations at Ampush. John, why don’t you start out by telling us a little bit about what Ampush does?
Ampush is an ad technology company and Facebook Strategic Preferred Marketing Developer (sPMD). We help large-scale brands and direct response advertisers achieve performance at scale using Facebook ads. We deliver great advertiser ROI with our fully managed solutions, powered by the AMP Marketing Platform, across mobile and desktop native advertising platforms. We have clients in the gaming, retail, travel, financial services, and technology industries, and more. Some of the clients that we work with include MasterCard, Rdio, HotelTonight, Kellogg’s, Warrior Dash, and Sojo Studios. We were founded in 2009 and we’re based in San Francisco with offices in New York and Chicago.
John, you mentioned MasterCard. I have a MasterCard so if I’m on Facebook and I see a Sponsored Story from MasterCard, could that be Ampush behind that?
Absolutely. We work with the entirety of the Facebook ads platform including the News Feed, Sponsored Stories, and Mobile App Install Ads as well as many of the latest platform features.
Brilliant. John, could you tell us a little bit about how you came to use Cassandra at Ampush? Did you come from a relational background before that or was Ampush built up around Apache Cassandra?
Cassandra within Ampush is pretty new. Actually, we just completed a pretty major migration. We were MySQL-based beforehand. We were running into issues with data volume, and especially with high availability and performance within MySQL. Much of the data was time series events, so we started looking at our migration options, including Hadoop and Cassandra. We chose Cassandra because nothing else could meet the availability and performance at scale requirements we had. Hadoop was a close second, but it didn’t fit our use case properly and its operational overhead was much higher than Cassandra’s.
High availability, time series data, and massive volume: we see those time and time again as the top three criteria for why people choose Cassandra.
Yeah, absolutely. Being able to scale within the data center using commodity hardware is key, as is multi-datacenter replication, whether to another site, another regional locale, or even back into our staging and development clusters, so that we have real-time performance data.
You’re just touching right there on multi-datacenter replication. Could you talk a little bit about what the Cassandra environment looks like at Ampush?
Right now, we’re at around tens of nodes; I can’t give exact numbers. We’re working with many terabytes of data, and that volume looks set to double within the next year. Currently, our cluster lives in one data center, so we’re not actively using multi-datacenter replication. We’re going to start testing replication of our data back into our staging and development cluster, which will be located here at the office or within a different data center. We have developers in Argentina as well as New York, so making sure that the data is available for them to work with is hugely important. That makes multi-datacenter replication something that will come in very handy during the next phase of the project.
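As a rough illustration of the replication setup described above, the following Python sketch builds a CQL statement that defines a keyspace replicated across two data centers with Cassandra's NetworkTopologyStrategy. The keyspace name, the data center names, and the replication factors are hypothetical; the interview does not give Ampush's actual settings.

```python
def create_keyspace_cql(keyspace, replicas_per_dc):
    """Build a CREATE KEYSPACE statement using NetworkTopologyStrategy.

    replicas_per_dc maps a data center name to its replication factor.
    The data center names must match the names reported by the cluster's
    snitch; the ones used below are placeholders.
    """
    options = ", ".join(
        f"'{dc}': {rf}" for dc, rf in sorted(replicas_per_dc.items())
    )
    return (
        f"CREATE KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


# Hypothetical layout: three replicas in production, one in staging.
statement = create_keyspace_cql("ad_events", {"production": 3, "staging": 1})
print(statement)
```

With a layout like this, every write to the production data center would also be replicated to the single staging replica, which is the kind of arrangement that could keep a staging and development cluster supplied with current data.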
Great. For people coming from a relational background, maybe MySQL and going through the same journey that you did, what advice would you pass along?
Well, one of the major benefits that we found, at least from an operational perspective, is that with MySQL, when you have these overly large data sets, you end up having to use many SSDs and extremely large, expensive servers to get the performance you need. With Cassandra, you’re able to use standard spinning disks, which are much cheaper, and you can just stripe them in a RAID 0 setup on a per-node basis, which gives you great IOPS performance for a lower overall investment. Oh, and make sure you’re using virtual nodes if you’re building a new cluster, as there is much less hassle with token management.
John, that’s a question we get quite frequently about how you actually spec out hardware for your Cassandra environment. What did you guys do? Did you do some trial and error or did you just say, “Hey, let’s go with our lowest cost option and see how it performs?” What was the decision-making process there?
Since the majority of our aggregation and analytics is done outside of the Cassandra cluster, we don’t have heavy CPU or memory requirements, so we started off with a handful of nodes, each with only a single disk. We ran read and write performance tests to see what kind of IOPS we were getting out of the cluster, using data volumes similar to what we would be handling in production on a regular basis. The remainder of the process was further testing with different levels of replication and consistency, and with various numbers of required quorum responses. We ended up going with a couple of spindles behind a RAID controller, using striping to give us a performance kick for more IOPS without needing to go with the expensive option of SSDs. That brought us to where we are now, which is much better performance per dollar than any legacy MySQL system we had in place previously.
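The trade-off between replication factor and consistency level mentioned above comes down to simple arithmetic: a quorum is a majority of the replicas, and a read is guaranteed to overlap the most recent write whenever the read and write replica counts together exceed the replication factor. The sketch below works through that arithmetic in Python; the replication factors shown are illustrative and are not Ampush's settings.

```python
def quorum(replication_factor):
    """Number of replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, read_replicas, write_replicas):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


for rf in (1, 2, 3, 5):
    q = quorum(rf)
    print(
        f"RF={rf}: quorum={q}, "
        f"replicas that can be down={rf - q}, "
        f"QUORUM/QUORUM overlaps={reads_see_latest_write(rf, q, q)}, "
        f"ONE/ONE overlaps={reads_see_latest_write(rf, 1, 1)}"
    )
```

Requiring fewer responses per request lowers latency and tolerates more failed nodes, at the cost of possibly reading stale data, which is why testing several combinations under realistic load is worthwhile.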
And as a fairly new user of Cassandra, is there anything that you would like to see in future versions?
We are primarily a Python shop and jumped right onto the CQL3 bandwagon, so Python driver support for CQL3 is extremely important. More support for the Python driver and CQL3 would be excellent for us.
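To give a sense of what working with time series events through CQL3 from Python can look like, here is a minimal sketch. The table, its columns, and the helper functions are hypothetical and are not Ampush's schema. The `session` argument is assumed to be any object with an `execute(query, parameters)` method, such as the session object provided by the DataStax Python driver, and timestamps are assumed to be timezone-aware.

```python
from datetime import datetime, timezone

# Hypothetical CQL3 table: one partition per ad per day, with rows inside
# the partition ordered by event time, newest first.
CREATE_TABLE = """
CREATE TABLE ad_events (
    ad_id text,
    day text,
    event_time timestamp,
    event_type text,
    PRIMARY KEY ((ad_id, day), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
"""

# %s placeholders are the style the DataStax Python driver uses for
# non-prepared statements.
INSERT_EVENT = (
    "INSERT INTO ad_events (ad_id, day, event_time, event_type) "
    "VALUES (%s, %s, %s, %s)"
)


def day_bucket(ts):
    """Return the UTC calendar day of a timezone-aware datetime."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


def insert_event(session, ad_id, ts, event_type):
    """Write one event into the partition for its ad and day."""
    session.execute(INSERT_EVENT, (ad_id, day_bucket(ts), ts, event_type))


example_time = datetime(2013, 10, 1, 23, 30, tzinfo=timezone.utc)
print(day_bucket(example_time))
```

Bucketing by day keeps any single partition from growing without bound, a common choice for time series data in Cassandra; the right bucket size depends on event volume per ad.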