April 11th, 2013



Joe Stein: Chief Architect at Medialets

Brady Gentile: Community Manager at DataStax


Brady:             Hello Planet Cassandra users, I’m here today with Joe Stein who’s a Chief Architect at Medialets.  Joe, how are you doing today?


Joe:                 I’m doing good.  How are you doing Brady?


Brady:             I’m doing well; thanks for coming on the show today.  Joe, to start off, what does Medialets do? 


Joe:                 Medialets is the rich media mobile platform for mobile and tablet advertising.  We enable high value premium mobile advertising at scale.  We work really hard to make advertising easier and faster for advertisers and agencies and work with all the premium publishers in the mobile community to deliver really great creative and large campaigns. 


Brady:             And how does Cassandra (C*) work into the mix at Medialets? 


Joe:                 We’ve been using Cassandra in production for a little over a year and a half.  We’ve been following it since day zero, when Facebook originally released it to the Apache community; we originally brought Cassandra in to work behind our ad servers.  Cassandra basically powers all of the distributed user hash maps involved in the business logic for 100% of our ad server traffic.  Over the last year, we’ve been working on a new system that will continue to leverage the functionality we already get from Cassandra; we will start to stream all of our impressions, pixels, and events into the system in order to do real-time analytics.


Brady:             It sounds like you guys were very hard-core first adopters, so that’s awesome to hear. 


Joe:                 Yeah, it definitely took a while for Cassandra to get to the point where it made sense for us and made sense for the system.  We basically started our business the day the App Store opened; there were practically no mobile devices at that time.  It took about a year for enough people to get iPhones and for there to be enough inventory for this mobile ad industry to even start.  Then it took about another year after that for us to get our product and systems to a place where we needed a distributed hash map, not only to hold a very large amount of data but also to handle the number of requests that we receive at scale.  At any one time we can see 700,000 to 800,000 simultaneously connected devices, and we need to be able to respond to each of those requests within a few hundred microseconds.  Having to hop to another server is very time consuming, and being able to essentially stream that data out of memory from the Cassandra boxes and the ring really allows us to hit those response times.  It truly gives us time to respond to the massive number of mobile users out there. 
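
[The "distributed hash map" and "ring" Joe mentions can be pictured with a toy consistent-hash ring. The Python sketch below is only an illustration and is not Medialets' code or Cassandra's implementation: the class name `HashRing`, the node names, the device key, and the choice of MD5 with 8 tokens per node are all assumptions made for the example. It shows the general idea that each key hashes to a token and is owned by the node holding the next token around the ring.]

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring (hypothetical; not Cassandra's actual code)."""

    def __init__(self, nodes, tokens_per_node=8):
        self._ring = []  # sorted list of (token, node) pairs
        for node in nodes:
            for i in range(tokens_per_node):
                self._ring.append((self._token(f"{node}#{i}"), node))
        self._ring.sort()
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(value):
        # MD5 is used here only to spread keys evenly, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        # Owner is the first token at or after the key's token,
        # wrapping around to the start of the ring if necessary.
        index = bisect.bisect_left(self._tokens, self._token(key))
        return self._ring[index % len(self._ring)][1]


# Hypothetical three-node ring and device key.
ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("device-1234")
```

[A lookup needs only a hash and a binary search, so a client can tell which node owns a key without asking a coordinator. When a node is added, the only keys that change owner are the ones the new node takes over.]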


Brady:             Very interesting.  And it sounds like Medialets had some really good foresight driving the business. Have you always used C* or did you switch to C* from another database offering?


Joe:                 We didn’t feel that the features we wanted to build into our application were able to run on any other system.  Availability is extremely important to us, as well as performance, and it took a while (around 0.7) until Cassandra started to receive more features and functionality: counters in 0.8, and then compression in the 1.x line.  It kind of became the perfect storm: our business and our product were ready to begin holding a lot of this user information and Cassandra had matured by that point, so that we could have all of our data compressed. 


                        We have a lot of data, so compression was really important for cost-conscious folks like us.  Distributed counters were also very important to us.  It never really felt like there was another system besides Cassandra that could meet all the different use cases that we had, between features, functionality, performance, and distributed counters.  At the same time, having the availability that we need was important as well; we really have to be at 100% uptime because if we can’t get the data that we need from Cassandra, we basically can’t serve an ad… which means we can’t make money!


Brady:             You stated that you store a lot of data; could you give us a rough estimate on how much data you’re storing in Cassandra? 


Joe:                 Right now in Cassandra we hold a few terabytes of compressed data, but we’re only holding user information.  We’re not actually holding all the different designs and impression information that we get at any one time; that is in the range of a couple of terabytes of data per day.  Within the next two to three months we’ll be streaming all of that data into Cassandra, and we’re going to be hooking up Kafka behind our ad servers to log and hold all of the impressions and events that come in.  The data being logged and held by Kafka will then go into Cassandra.  And then the new front end that we’re working on will have essentially real-time analytics; within one to three seconds, everything that’s happening for any campaign will be viewable inside of our front end. 


Brady:             Wow. So for the data that you’re storing in Cassandra; is it in a data center or the cloud? 


Joe:                 Yes, everything we do is on bare metal.  We are very conscious of the amount of IOPS that we’re doing as well as the size of the bursts that we can have.  At any one time we’re at 30,000 or 40,000 concurrent writes, at a low point.  At normal day usage we could be at 100,000 to 200,000 concurrent writes.  And our bursts can get up into the millions.  Sharing infrastructure, whether it’s spindles or network, is just something that we haven’t had positive experiences with, even in testing environments.  At least for us, we like to have dedicated infrastructure to support that.


Brady:             That makes sense.  So what are your thoughts on the physical and/or virtual Cassandra community? 


Joe:                 I think that the mailing list and folks on Twitter and a lot of the other online communities have always been really great.  Everyone is always very helpful and willing to talk about different problems, and many folks have blogs, and that’s really great.  The New York City Cassandra meetup has been going strong and we’ve had a lot of great speakers.  I know there are some others in Austin and on the west coast (South Bay Cassandra Users) that have been doing very well.  And it’s been really great to see both the online and real-world communities growing over the years, and to see the Cassandra Summit 2013 and the other meetup groups start to come about. 


Brady:             Very good.  It sounds like you’ve had a great experience with the Cassandra community so far; that’s really great to hear.  Is there anything that you’ve learned with C* that, in hindsight, you might have done differently?  Maybe some tips or tricks for someone who’s just starting out? 


Joe:                 Yeah, it really is important to understand your use case and to make sure that, for whatever problem you’re trying to solve, you’re actually solving the problem. It’s definitely not about taking the system and figuring out “how can I modify” or “how can I have this work just to meet my needs” in order to solve your problem.  A lot of folks I’ve seen say that they may want to get into Cassandra (and that’s great, because it is a cool system) but sometimes MySQL is good enough… however, sometimes it’s not.  It really comes down to what problems you are trying to solve.  How do you want to solve them?  What languages do you need to work with?


                        Once you know what language you’d like to work with, and you feel that Cassandra is the solution, what client libraries are out there for you?  Going through the client libraries is really important, as is deciding whether you’re going to use CQL and then starting to really learn and understand it.  And then once that’s done, start looking at “how do I build my column families”, “how do I hold this data” and “how am I going to query it?”  Topics like these are important to touch on before even getting into the nuts and bolts around things like replication or virtual nodes.  It’s important to identify the use case and look at how you’re going to solve your big data problems from a design and development perspective. 


Brady:             Joe, thank you so much for meeting with me this afternoon. I really appreciate your time. Best of luck to you and Medialets! 


Joe:                 Thanks Brady. 


Brady:             Thanks Joe.