October 14th, 2013

By 

 

Brett Hoerner: Engineering Lead at Mass Relevance

Matt Pfeil: Founder at DataStax

 

TL;DR: Mass Relevance partners with all of the big social networks to help customers like ESPN and the Grammys bring social media to television and their websites.

 

As the company grew, Mass Relevance migrated from Redis to Cassandra. They use Cassandra primarily for their customers’ timelines: because they filter streaming social network content, they need to store millions of items for a single customer’s “stream” that they can search, display, filter, etc.


They have two Cassandra clusters replicating across multiple availability zones on AWS. One cluster is 8 SSD nodes; the other is 3 nodes and is used more as a cache, with a low replication factor, serving mostly availability and key-value storage.

 

Hello, Planet Cassandra listeners and readers, this is Matt Pfeil.  In today’s Apache Cassandra use case I’m joined by Brett Hoerner, Engineering Lead at Mass Relevance. Brett, thanks for taking some time today.

Sure, no problem.

 

Why don’t we start things off: can you tell everyone what Mass Relevance does?

We partner with all of the big social networks and help customers bring social media to television and their websites. If you ever see a Tweet on TV, it’s most likely us.

 

I’ve seen those a lot. What’s an example customer you can share with everyone?

The NBA, MLB, ESPN, FOX Sports, and every major US network, to name a few. Events like the Grammys are all customers on the television side, with many, many more on the Internet side.

 

That’s awesome. So how do you guys use Cassandra?

We got into it mostly for timelines, because we’re dealing with streaming social network content that we filter. We need to store millions of items for a single customer’s “stream” that they can search, display, filter, etc. From there we also now use a lot of counters and store some key-value data in Cassandra.
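
As a rough sketch of what a Thrift-era counter update looks like (using pycassa; the pool, column family, and key names are invented for illustration, not Mass Relevance’s actual schema):

    # Hypothetical counter increment with pycassa (Thrift API).
    import pycassa

    pool = pycassa.ConnectionPool('social', ['localhost:9160'])
    # 'StreamCounters' is assumed to be a counter column family.
    counters = pycassa.ColumnFamily(pool, 'StreamCounters')

    # Bump the items-seen counter for one stream by 1.
    counters.add('stream:42', 'items_seen', 1)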

 

For the stream example you gave, is the data model one where you’re storing a given stream per row or can you share some insight on that?

Yeah, each stream that a user sees is actually broken into multiple backend streams: approved content, stuff that they like and want to filter into a separate stream, or content that’s yet to be approved or that’s been rejected and deleted from the stream. Each of those is a single row. We still use the Thrift API for everything, with a TimeUUID as the column name, and the value is the entity ID, which could be something like “tweet:300”. That references another Cassandra database where we store the entities (shared across all customers) in key-value format.
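
To make that model concrete, here is a minimal sketch of a wide-row timeline in pycassa, assuming a column family whose comparator is TimeUUIDType (the keyspace, column family, and row key names below are made up):

    import uuid
    import pycassa

    pool = pycassa.ConnectionPool('social', ['localhost:9160'])
    # One row per backend stream; columns are TimeUUIDs, values are
    # entity IDs pointing into the shared entity store.
    streams = pycassa.ColumnFamily(pool, 'StreamTimelines')

    # Append an item: the TimeUUID column name keeps the row
    # time-ordered; the value references the entity ("tweet:300").
    streams.insert('stream:42:approved', {uuid.uuid1(): 'tweet:300'})

    # Read the 50 newest items by slicing the row in reverse order.
    recent = streams.get('stream:42:approved',
                         column_count=50,
                         column_reversed=True)
    for time_id, entity_id in recent.items():
        print(time_id, entity_id)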

 

That’s awesome. What was your initial motivation for using Cassandra and what other technologies were evaluated against it, or did it replace?

Before Cassandra we were using Redis for timelines, which, as you might imagine, got painful as we grew, because it was all in memory. Redis’s sorted sets were really easy to work with, which was nice, and moving to Cassandra was pretty simple from a data-modeling perspective. The actual migration predates me; my boss, our CTO, did it shortly before I joined. I would say there isn’t much else that makes a good option for huge, huge timelines. None of the other databases I’ve used would be a good fit for that, except maybe a really scary, large relational database. Cassandra pretty much won for timelines, and after that, since we had good operational experience, we used it for counters and everything else.
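
For contrast, the Redis approach described here would look something like the sketch below (redis-py 3.x style; the key names are invented). Every member and score lives in RAM, which is what becomes painful as timelines grow:

    import time
    import redis

    r = redis.Redis(host='localhost', port=6379)

    # Score each entity by arrival time so the set stays time-ordered.
    r.zadd('stream:42:approved', {'tweet:300': time.time()})

    # Newest 50 items, the rough equivalent of a reversed column slice.
    recent = r.zrevrange('stream:42:approved', 0, 49)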

 

That’s awesome. What can you share about what your infrastructure looks like?

We use AWS for everything. We’re in US East, and we use multiple availability zones wherever that’s possible right now, such as with Cassandra. We have two Cassandra clusters. One is 8 SSD nodes, the hi1.4xlarges that people are pretty familiar with. The other is 3 nodes and is used more like a cache: it has a low replication factor and mostly serves availability and key-value storage.
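
A low-replication keyspace like the one on that cache cluster could be defined through the same Thrift tooling roughly as follows (a sketch with assumed names and numbers, using pycassa’s SystemManager and the Ec2Snitch convention of naming the datacenter “us-east”):

    from pycassa.system_manager import (SystemManager,
                                        NETWORK_TOPOLOGY_STRATEGY)

    sys_mgr = SystemManager('localhost:9160')

    # Cache-style keyspace: NetworkTopologyStrategy with a low
    # replication factor in the single us-east datacenter.
    sys_mgr.create_keyspace('cache',
                            replication_strategy=NETWORK_TOPOLOGY_STRATEGY,
                            strategy_options={'us-east': '2'})
    sys_mgr.close()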

 

I see. When did you guys start using Cassandra?

Two years ago, almost exactly, I guess.

 

Great, in that 2 years, what’s the one thing that you’ve seen come out of the Cassandra evolution that’s made you the happiest?

Our new cluster has vnodes and the new hashing algorithm, and we finally got away from the PropertyFileSnitch. We still need to figure out the migration plan for our old cluster. The vnodes are great, because we go up and down in scale for events.