December 31st, 2013

By Rodrigo Castro, Director of Software Engineering at NewsWhip

 

TL;DR: NewsWhip wants to find the stories that matter most to people. To do that, it draws on multiple sources, including newspapers, blogs, and magazines. NewsWhip finds articles, tracks how important they are becoming on social networks, and scores them to deliver the most interesting content to its users.

 

NewsWhip was in search of a database that could scale easily. Initially, both MongoDB and Cassandra were considered for the project. MongoDB fell short because NewsWhip felt it wasn’t a great fit for the kind of data they were trying to save, and they were not fond of MongoDB’s master-slave architecture. While they did not think MongoDB was a bad choice, Cassandra had “better features like multi data center replication, which works seamlessly, and the fact that it’s very easy to add nodes”.

 

NewsWhip first used Cassandra for internal time-series metrics about how the platform is doing, essentially a health check tool. Building on that success, they integrated Apache Cassandra with their actual platform. Now all the articles and stories they find around the web are stored in Cassandra: roughly 200,000 articles a day, along with their corresponding social data. Lastly, Cassandra manages time-series data for their news content creator ranking system, indexing the stories per journalist, blog writer, or other creator.

 

Cassandra at NewsWhip is hosted on Amazon Web Services in a small cluster: two m1.large nodes in the United States and one node in Ireland.

 

Hello, Planet Cassandra. This is Brady Gentile, Community Manager at DataStax. I’m here with Rodrigo Castro. Rodrigo is the Director of Software Engineering at NewsWhip. Rodrigo, thanks so much for joining us today.

Thanks for inviting me, Brady.

 

Absolutely. So, to start things off, Rodrigo, could you tell us a little bit about what NewsWhip does?

The basic idea behind NewsWhip is quite simple: finding the stories that matter to people in real-time. In order to do that, we have lots of sources: newspapers, blogs, and magazines. We find the articles they write, and we score them. We measure how fast they’re spreading through social networks, effectively surfacing the world’s most interesting content, which we display on our website: NewsWhip.com.

 

We have a professional tool for journalists, marketers, and media experts so they can find, in real-time, the stories they should be writing about while they’re still small. This tool, called Spike, is also quite useful for tailoring and defining content marketing strategies, as it gives power users a clear picture of what people are looking for right now. That’s the idea.

 

Excellent. As you’re doing all that, how does Apache Cassandra fit into the mix, there?

Well, there are multiple use cases where we use Apache Cassandra. I think we started using Apache Cassandra back in May, after doing some research. Our first use case was something very simple. We wanted to have an internal metrics tool about how the platform is doing, basically a health check tool.

 

We periodically persist values describing the state of the platform, so we can later display them in a graph. It’s a time-series use case. We thought Apache Cassandra was a great fit for this. This has been working great. That’s in production already. Obviously, people don’t see that, but it’s working very well.
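A common way to lay out this kind of time-series data in Cassandra is to partition by metric and time bucket and to order rows by timestamp inside each partition. The following is a minimal sketch of that idea, not NewsWhip's actual schema: the table name `platform_metrics`, its columns, the one-day bucket, and the `MetricStore` class are all assumptions made for illustration, and the Python class only mimics the layout in memory.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical CQL for the layout modelled below (all names are assumptions):
SCHEMA = """
CREATE TABLE platform_metrics (
    metric text,
    day text,
    ts timestamp,
    value double,
    PRIMARY KEY ((metric, day), ts)
);
"""


class MetricStore:
    """In-memory stand-in for a table partitioned by (metric, day), ordered by ts."""

    def __init__(self):
        # partition key (metric, day) -> {timestamp: value}
        self._partitions = defaultdict(dict)

    def record(self, metric, ts, value):
        """Write one sample; a repeated (metric, ts) overwrites, like a CQL upsert."""
        day = ts.strftime("%Y-%m-%d")
        self._partitions[(metric, day)][ts] = value

    def read_day(self, metric, day):
        """Return one partition's samples as (timestamp, value) pairs in time order."""
        return sorted(self._partitions.get((metric, day), {}).items())


store = MetricStore()
store.record("queue_depth", datetime(2013, 12, 31, 10, 5), 42.0)
store.record("queue_depth", datetime(2013, 12, 31, 10, 0), 40.0)
for ts, value in store.read_day("queue_depth", "2013-12-31"):
    print(ts.isoformat(), value)
```

Bucketing by day keeps each partition bounded, and a graph for one metric over one day becomes a single-partition read.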

 

Then, because we really liked Cassandra’s write performance, we tried using it somewhere else. We integrated Apache Cassandra into the actual NewsWhip platform, so now all the articles and all the stories we find on the web are stored in Apache Cassandra. That’s about 200,000 articles a day, along with their social data, which can be about 1,000 data points per article.
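To put those figures in perspective, here is a rough back-of-the-envelope estimate. It assumes every article reaches the quoted 1,000 data points, that each data point is a single write, and that writes are spread evenly across the day; all three are simplifying assumptions, so the result is an upper-bound sketch rather than a measured load.

```python
ARTICLES_PER_DAY = 200_000      # figure quoted in the interview
POINTS_PER_ARTICLE = 1_000      # "can be about 1,000", treated here as an upper bound
SECONDS_PER_DAY = 24 * 60 * 60


def points_per_day(articles=ARTICLES_PER_DAY, points=POINTS_PER_ARTICLE):
    """Upper-bound number of social data points stored per day."""
    return articles * points


def average_writes_per_second(daily_points):
    """Average write rate if the daily volume were spread evenly over 24 hours."""
    return daily_points / SECONDS_PER_DAY


daily = points_per_day()
print(f"{daily:,} data points per day")
print(f"about {average_writes_per_second(daily):,.0f} writes per second on average")
```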

 

We also rank the people who write the stories. We call them news creators, and we are using Cassandra for indexing the stories per journalist, blog writer, or creator. That’s been working very well too, because that dataset is also a time-series problem, which is a perfect fit for Cassandra.

 

Wow. That sounds like a lot of data that you’re storing, then.

Well, it’s not actually that huge if you compare it to Netflix, for instance. It’s not a small dataset, though.

 

Every month we aggregate all the information we collect using Hadoop queries. We rank publishers by how well they’re doing, we find the biggest stories of the month, and we pull out other interesting facts, which are published every month on our blog: blog.newswhip.com.

 

Excellent. Were there other NoSQL databases that were evaluated against Cassandra?

Yeah, we looked at MongoDB, but we didn’t think it was a great fit for the kind of data we were trying to persist. We also didn’t like the master-slave architecture of MongoDB. But, personally, I don’t think MongoDB would have been a bad choice; it was just that Cassandra had better features like multi data center replication, which works seamlessly, and the fact that it’s very easy to add nodes. So, if we need to scale up, it’s a matter of just adding one server, and that’s pretty much it.

 

Excellent. Would you be able to share with us some insight into your deployment? Are you hosted in the cloud or your own data center? How many data centers?

We’re hosted in Amazon Web Services. For Cassandra, we have a small cluster. We have two nodes in the United States, and I think they’re m1.large instances, and then we have one node in Ireland, which is pretty handy, as it acts as a real-time backup system and as a reporting server.

 

That’s where we run Hadoop every month. These queries never touch the data on the US side, so the servers that need to be real-time are never affected by the analytics system. If Hadoop is running and it queries the database too heavily, or if the node goes down, it doesn’t really matter. It’s not a problem.
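In Cassandra, this kind of split is typically expressed with the `NetworkTopologyStrategy` replication class, which sets a replica count per data center. The sketch below builds such a `CREATE KEYSPACE` statement. The keyspace name `newswhip`, the data center names `us-east` and `eu-west`, and the replica counts are assumptions for illustration only; the real data center names depend on the snitch configured on the cluster, and the interview does not state the replication factors.

```python
def keyspace_cql(name, replicas_per_dc):
    """Build a CREATE KEYSPACE statement with one replica count per data center."""
    parts = ["'class': 'NetworkTopologyStrategy'"]
    parts += [f"'{dc}': {count}" for dc, count in replicas_per_dc.items()]
    return "CREATE KEYSPACE {} WITH replication = {{{}}};".format(
        name, ", ".join(parts)
    )


# Hypothetical layout: two replicas in the US data center, one in Ireland.
print(keyspace_cql("newswhip", {"us-east": 2, "eu-west": 1}))
```

With a layout like this, analytics jobs that connect to the Ireland node and read at a data-center-local consistency level such as `LOCAL_QUORUM` are served by the local replica, which matches the isolation described above.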

 

Excellent. It sounds like it’s working really well for you, and the built-in multi-datacenter replication seems to be serving you well.

Yeah, we’re really happy with Cassandra.

 

Cool. So, I met you when I was visiting the Dublin Cassandra users group. It looks like you guys have an excellent community over there in Dublin. It’s really thriving. Could you tell us a little bit about your experience with the Cassandra community? How’s that been?

Sure. Well, it’s always been great. You know, whenever there’s free beats and free beer, I’m a very happy person. The presentations I’ve been to, the meetups, and the people have all been great. I’m happy that DataStax is supporting the community and I look forward to future meetups.

 

So, I think that’s all the questions that I have for you today, but before we sign off here, is there anything else that you would like to add?

 

Otherwise, I just wanted to add that the DataStax documentation is great. It really helped us out when we were first starting, and I really like the support DataStax gives to the community.
