December 27th, 2013

By 

 

John Emhoff: Engineer at Embedly

 

Embedly provides a web API: a user sends a URL, and Embedly sends back metadata about that URL. For instance, if you send Embedly a YouTube URL, it sends back the HTML needed to embed that YouTube video directly, along with the title of the video and other details you might want to build around the player.

 

Part of Embedly’s value proposition is to provide customers with usage numbers and basic analytics that are updated hourly. Embedly uses Cassandra as a big bit bucket that they can shove all their data into: because Cassandra scales easily, they did not have to worry about sizing it up front and can simply add capacity later if needed.

 

Embedly is running Apache Cassandra 1.2.8 on a total of 12 nodes across 2 datacenters. Embedly uses multi-datacenter replication and raves about Cassandra’s “really hard to kill” foundation and time to live (TTL) options.

 

John, thanks for joining us today to talk about your Apache Cassandra use case. Why don’t you start off by telling us a little bit about what Embedly does?

Sure.  Embedly provides a web API where you send a URL, and we send back all sorts of metadata about that URL.  The idea is you can use that metadata to embed that page into your blog or into your forum or something.  A couple of quick examples: if you sent Embedly a YouTube URL, we would send back the necessary HTML to directly embed that YouTube video.  Then we also send the title of the video and various other things like that, which you might want to build or grow around the player.

                   

Or if you send a New York Times article, we will provide the title of the article and maybe a description and author and such so that you can put the little blurb directly into your news post or social media feed.  Really, though, since that URL metadata is a very general-purpose product, you can do anything you want with it.  Some people have been pretty creative.

 

Great.  Can you talk a little bit about why Cassandra?  How you came to choose it?  That would be great.

Sure.  One of the things we like to do, and need to do as well, is provide usage numbers and basic usage analytics to our customers.  That’s for billing, but also so people can look at their blog or their social media site and see what URLs people are sharing, which is actually very powerful.  We aggregate usage numbers hourly so people can see something about popularity.  Cassandra is basically just a big bit bucket that we can shove all this data into, and it’s been really nice because we didn’t have to worry about sizing it correctly; we knew we could always add capacity later.  That really enabled us to be pretty flexible.
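To make the hourly aggregation concrete, here is a minimal Python sketch of rolling raw request events up into per-hour counts. It is an illustration only, not Embedly’s actual pipeline: the `(timestamp, url)` event shape and the names `hour_bucket` and `aggregate_hourly` are assumptions made for this example.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_bucket(ts: datetime) -> str:
    """Truncate a timezone-aware timestamp to the start of its UTC hour."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:00Z")


def aggregate_hourly(events):
    """Count requests per (hour, url) pair.

    `events` is an iterable of (timestamp, url) tuples; this shape is
    hypothetical and chosen only for the example.
    """
    counts = Counter()
    for ts, url in events:
        counts[(hour_bucket(ts), url)] += 1
    return counts


# Made-up sample events for one URL.
events = [
    (datetime(2013, 12, 27, 10, 5, tzinfo=timezone.utc), "http://example.com/a"),
    (datetime(2013, 12, 27, 10, 55, tzinfo=timezone.utc), "http://example.com/a"),
    (datetime(2013, 12, 27, 11, 1, tzinfo=timezone.utc), "http://example.com/a"),
]
hourly = aggregate_hourly(events)
```

Each `(hour, url)` key maps naturally onto a row key plus a column in a write-heavy key-value store, which is the kind of data the interview describes.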

 

Talking about adding capacity.  What’s your environment like for Cassandra?

Right now we’re running in 2 datacenters.  We have, I think, an 8 node setup in Rackspace, and then we also have a 4 node cluster running in EC2.  I guess technically, to use the Cassandra nomenclature, those are two data centers in the same cluster.

 

Do you replicate between those two different data centers?

That’s right.
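For readers unfamiliar with how replication across data centers is declared, the sketch below builds a CQL `CREATE KEYSPACE` statement that uses `NetworkTopologyStrategy`. The keyspace name, the data center names (`DC1`, `DC2`), and the replication factors are all hypothetical; the interview does not state Embedly’s actual settings, and real data center names must match what the cluster’s snitch reports.

```python
def create_keyspace_cql(keyspace: str, replicas_per_dc: dict) -> str:
    """Build a CQL statement that replicates a keyspace across data centers.

    `replicas_per_dc` maps a data center name to the number of replicas
    kept there. All values passed in below are illustrative assumptions.
    """
    parts = ["'class': 'NetworkTopologyStrategy'"]
    for dc, rf in sorted(replicas_per_dc.items()):
        parts.append("'%s': %d" % (dc, int(rf)))
    options = "{" + ", ".join(parts) + "}"
    return "CREATE KEYSPACE %s WITH replication = %s;" % (keyspace, options)


# Hypothetical keyspace with three replicas in each of two data centers.
statement = create_keyspace_cql("analytics", {"DC1": 3, "DC2": 3})
print(statement)
```

With a keyspace defined this way, Cassandra places replicas of every row in both data centers, so writes made in one are replicated to the other.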

 

Great.  Knowing you needed to store and analyze this data, did you look at anything else?  Did you come straight to Cassandra, or did you try other things along the way first?

Honestly, the actual selection of Cassandra was before my time at Embedly, but I know that we didn’t want or need a traditional relational database because we’re just basically storing key-value data.  I know that we also wanted it to be optimized for writes because we’re going to write this data a whole lot more than we read it.  We’re constantly writing to it.  We can go back to last year and see exactly how many months Gangnam Style was in the top five YouTube videos.

 

How many months was Gangnam Style in the top five  YouTube videos?

Oh, geez.

 

A lot.  A lot of months.

Yeah.  Almost all of them.

 

It felt like an eternity. What  version of Cassandra are you running right now?

I think we are running 1.2.8.  

 

When you came into Embedly, I take it you were new to Cassandra?  Or had you used it elsewhere before you came over?

I was pretty new to it.  I knew about it, but that was more or less it.

 

What did you find hard, and what do you wish you had known at the time?

I think the hardest thing about administering Cassandra is that you really need to know how it works to be able to do it effectively.  I know that’s not necessarily a bad thing.  It’s always good to understand your tools so that you can use them correctly, but there is a bit of a learning curve.

 

What is easy?  What do you find easy about administering Cassandra?

What I find easy about it is that it seems to be really hard to kill.  We’ve certainly had some missteps with configuration and such, and we’ve never lost any data.  We’ve never had anything corrupted beyond repair.  After going through that a few times, you learn to be a little more comfortable with it.

 

Absolutely.  It gives you some confidence too that you can push the envelope a little bit and not risk losing any data.

Yeah, absolutely.

 

You’re writing tons of data, but I imagine it’s not a vast amount of data and that it takes up quite a small footprint.  Would that be accurate?

Yeah.  I would say we only have maybe a couple terabytes of data, and that could probably be pared down because we keep quite a bit.  It’s mostly just a place for us to shove whatever data we want to live for more than just a few hours.

 

Is it your goal, as you grow and move forward, to keep an all-time history?  Will you ever get rid of any data?  Or is it really just keeping it for as long as you can?

We keep almost all of it.  It’s actually been really helpful to have the analytics going back because you can think of some interesting new metric you want to look at, and you have all this data to run it against.  What we don’t keep is the per-minute analytics data.  That’s actually been kind of nice with the TTL feature of Cassandra, which just naturally works.  You can say keep this minutely analytics data for 1 month and it will take care of it.
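As a rough illustration of the TTL feature mentioned here, the sketch below builds a CQL `INSERT ... USING TTL` statement whose rows expire after about a month. The table and column names (`minutely_usage`, `url`, `minute`, `hits`) are hypothetical, and “1 month” is approximated as 30 days; neither detail comes from the interview.

```python
from datetime import timedelta

# Assumption: "1 month" is treated as 30 days, expressed in seconds,
# because CQL's USING TTL takes a lifetime in seconds.
MINUTELY_TTL_SECONDS = int(timedelta(days=30).total_seconds())


def insert_minutely_cql(ttl_seconds: int = MINUTELY_TTL_SECONDS) -> str:
    """Build a parameterized CQL insert whose row expires after the TTL.

    The table and column names are made up for this example.
    """
    return (
        "INSERT INTO minutely_usage (url, minute, hits) "
        "VALUES (?, ?, ?) USING TTL %d" % ttl_seconds
    )


statement = insert_minutely_cql()
print(statement)
```

Cassandra removes the expired columns on its own, so no separate cleanup job is needed for the per-minute data. Note that counter columns do not support TTL, which is why this sketch writes a plain value rather than incrementing a counter.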

 

Excellent.  John, is there anything else you’d like to add?  If not, you’ve provided some great information here, so thank you very much.

I’m sure I’ll think of something as soon as we’re done here.  Actually, the Cassandra IRC channel: the guys in there are very helpful.  There’s almost always someone there with pretty deep knowledge, and it’s good that they’re willing to help.

 

Brilliant.  Thank you very much, John.