April 18th, 2013

By 

Dstillery

Rod Hook: CTO at Dstillery

Brady Gentile: Community Manager at DataStax

 

Brady: Hello, Planet Cassandra users. Today I have Rod Hook, CTO of Media6Degrees, here to discuss their use of Apache Cassandra. Rod, how are you doing?

 

Rod: I’m great.

 

Brady: Excellent. So to get started here, what does Media6Degrees do?

 

Rod: M6D is a marketing technology company. Our goal is to help marketers find customers online. We refer to this as prospecting, and we do it by first understanding how your existing customers act and where they go around the web. Second, we find people who act in a similar manner. We use programmatic buying to deliver high-performing ad campaigns to those prospects while they're consuming media around the web. Our campaigns currently focus on video, mobile and display media.

 

Brady: Interesting stuff! How are you using Cassandra at Media6Degrees?

 

Rod: We use Cassandra for a couple of different things. Primarily, we use it as a user store: as we see users around the web, we place certain pieces of information into this user store in real time. Those data points are then available to our real-time bidders instantly. With some of our prior database implementations, we had a lot of lag between when we collected a piece of data and when it was actionable.

 

Another piece of our system that we migrated to Cassandra was high-scale machine learning. At one point, we used to store all the training data within HDFS (the Hadoop Distributed File System), but we were having trouble with really fine-grained access to a particular marketer's training data. To be able to pull that out of HDFS quickly, you had to have really small partitions, which led to lots of small files. As you might know, the Hadoop namenode doesn't like lots of small files.

 

We decided that this particular use case would fit very well with Cassandra, so we started storing these training data instances within Cassandra, where it didn't matter so much that the files might be small. We could just add a key for a particular training time chunk and then put all the rows right underneath the key as columns.

 

Then we can just pull it right out with one get from Cassandra, whenever we want to access that particular time chunk or marketer for instance. That has worked really well for us in taking a lot of the load out of our Hadoop namenode and placing that load somewhere where it’s a little more suited.
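The layout Rod describes here (one row key per marketer and time chunk, with the training instances stored as columns under that key, read back with a single get) can be sketched as a toy in-memory model. This is only an illustration under stated assumptions: it is not M6D's code and not the Cassandra client API, and the names `WideRowStore`, `row_key`, the marketer "acme" and the instance values are all hypothetical.

```python
from collections import defaultdict


class WideRowStore:
    """Toy in-memory stand-in for a wide-row column family.

    Hypothetical illustration only; this is not the Cassandra API.
    """

    def __init__(self):
        # Maps row key -> {column name -> value}
        self._rows = defaultdict(dict)

    def insert(self, key, column, value):
        # Each training instance becomes one column under the row key
        self._rows[key][column] = value

    def get(self, key):
        # A single lookup returns every column under the key,
        # ordered by column name
        return sorted(self._rows.get(key, {}).items())


def row_key(marketer, time_chunk):
    # Assumed key format: marketer and time chunk joined with a colon
    return f"{marketer}:{time_chunk}"


store = WideRowStore()
key = row_key("acme", "2013-04-18T10")
store.insert(key, "instance-0002", "0.9,0.1,0")
store.insert(key, "instance-0001", "0.2,0.7,1")

# One get pulls the whole time chunk for that marketer
chunk = store.get(key)
for column, value in chunk:
    print(column, value)
```

The point of the sketch is that the unit of access is the row key, so many small chunks cost nothing extra in file-system metadata, which is the pressure Rod describes taking off the Hadoop namenode.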

 

Brady: You mentioned your prior implementations. It sounds like you used another database offering before switching to Cassandra. Could you tell us a little bit about that?

 

Rod: Sure. Some of the approaches we used in the past were things like pre-baked MySQL tables with lots of user IDs in them and various data points about those user IDs. We would publish that dataset out to the bidders periodically, so the bidders would have the data about the users. That process was a really clunky way of doing it; we basically had to run a MySQL instance on each of our bidders, each holding this large lookup table. We migrated that approach to Cassandra.

 

Brady: Very good. Is your C* data stored in the cloud or is it in a physical data center?

 

Rod: It’s in a physical data center.

 

Brady: Okay. How much data are you storing in it?

 

Rod: We have thirty-nine nodes in one of our clusters and they each have at least 100 gigabytes per node. We run several different clusters but that’s our largest one.

 

Brady: Interesting. What are your thoughts on the physical and/or virtual Cassandra community?

 

Rod: Most of my experience with the C* virtual/physical community has been with the New York C* meetup group. We've had a couple of sessions here at M6D with that group. I think it's been going really well; I saw a couple of different real laboratory experiments where we basically walked people through learning how to set up a Cassandra cluster and how to make it replicate to a second Cassandra cluster. I thought it was very cool to be able to do that in just a couple of hours.

 

Brady: Awesome. I guess for my last question, is there anything that you've learned while using Cassandra that in hindsight you might have done differently? Any tips or tricks for someone who's a new user?

 

Rod: One thing that we've learned: we initially set up one Cassandra cluster with lots of different column families in it and felt like we could just throw more hardware at that one cluster and that would be all we needed to do. But we've learned over time that if you have a really hot column family, or you know it's going to be large or take a lot of reads, you get more flexibility by peeling it out to a separate cluster that is dedicated to that column family. Then you can take more advantage of the cache and the tuning parameters for that column family, and it won't fight with the other ones.

 

Brady: Rod, thank you very much for meeting with me today. We really appreciate all of your insights into Cassandra and how Media6Degrees is using Cassandra. I wish the best of luck to you.

 

Rod: Sure.

 

Brady: Thanks.
