November 11th, 2013

Alon Pilberg: Infrastructure Team Leader at Taboola.

 

Alon, can you give us a quick overview of what your company does?

Taboola is a world leader in content discovery, which means we help users find content that they wouldn’t otherwise come across. We live in an information era with a huge amount of data, and finding the appropriate content is a hard task, so that’s what we do. We do it by linking publishers and content marketers.

 

Let’s say you’re a user and you go to a site like Weather Channel, Huffington Post, USA Today, Fox Sports, etc. You’ll get recommendations both for content on that site and for external links that take you to other sites, and both kinds are relevant to you.

In the last year we grew about ten times and right now we serve roughly 3 billion recommendations each day. That’s our scale.

 

What business or technical challenges drove you toward using NoSQL technology?

When we started, we used a MySQL server that replicated to other machines. We ran a process that updated the content lists and rolled them out to that MySQL server; then all the data got replicated to the slaves. As the company grew, we saw that this solution just wasn’t holding up because we needed more power. We strengthened our back end, and that caused the slaves to develop replication lag, so all the work we did wasn’t really paying off because we couldn’t get the recommendations to the end user. We realized that MySQL was the bottleneck, and that’s when we started looking at NoSQL solutions and decided to use Cassandra.

 

When you began looking at NoSQL technologies, was Cassandra the first and only technology you evaluated or did you do a bake-off with others?

We looked at three major technologies – Cassandra, MongoDB and HBase. We liked Cassandra because it was easy to install, easy to set up and play with, and we liked the symmetric architecture of having no central single point of failure at all.

 

It’s basically a simple product to work with, which is really nice when you’re introducing a new technology. It works straight out of the box. We needed something that continuously replicates data across data centers and gives us very high input and output. Cassandra fit the bill. We looked at MongoDB, which seemed to have scalability issues, but Cassandra scaled very well. If you need more power, you just add more nodes and you’re good.

 

Here’s an example. Last year we found that in Istanbul we had more and more records in the data center. Every time we wanted to change the schema, we had to take down the servers for longer and longer periods.

 

What parts of your application is DataStax Enterprise serving?

I mentioned the first one, which is distributing the recommendation lists that are generated in the back end and serving them to the end users. That is the most basic thing we do. We’re also using Cassandra directly in the front end to store user data.

 

We prefer to save it on our end so users can easily move from server to server without noticing anything. Because you can write to Cassandra from any node, it’s much simpler than, let’s say, master/master replication.

 

Another thing that we use from the Enterprise solution is Solr. We store in Cassandra the metadata for all the items we serve, both text and video, and Solr allows us to provide our publishers with item-level reports and search capabilities that work seamlessly out of the box.

 

We’ve got another customer, Ooyala, that’s doing some of the same type of thing. You had said at the outset of the call that you serve up about three billion recommendations a day. Are those all being pushed through Cassandra in some form or fashion?

Nearly all of them, yes.

 

How do you support your recommendation engine from an infrastructure perspective? Do you do a lot on premise or in the cloud?

We have our own data centers. We have three production data centers and we’re working on bringing two more online at the end of this month. We have one master data center in which the recommendations are generated and the other data centers are used mostly to serve the traffic. We’re adding more data centers to deal with the increase in traffic that we are seeing. In each data center we have a Cassandra cluster.

 

Okay. Do you have any idea about how much data is held in your cluster, or clusters, across those data centers?

It’s about 2TB per cluster right now.

 

How easy is Cassandra to take care of and manage on an ongoing basis?

We use OpsCenter for monitoring and for running tasks. For day-to-day cleanup and repairs, we run command-line scripts.

 

As for setup and management, we don’t really have any issues. Setting up another data center, for example, was a bit tricky the first time we did it, but we had some assistance from the support guys and they really helped us through it.

 

We recently upgraded and the upgrade procedure was completely problem free; we had no downtime doing it. That was great, just following instructions. Cassandra is really central to our system, and if it broke down it would be instantly visible to the users. Having the ability to upgrade without any sort of downtime is really important to us.

 

Have you found DSE to be cost effective for you? Any other benefits you can point to since you’ve implemented Cassandra?

 

Well, I think the cost of such a solution is something you get back tenfold when you can scale up, add more clients and have no downtime. When we used MySQL exclusively we had a lot more issues, including downtime that was visible both externally and internally, which was driving our IT folks crazy trying to keep it all together. That really adds up, and with DataStax Enterprise we don’t have that.

 

If somebody were brand new to Cassandra or NoSQL in general, what kind of advice would you give them for a smooth transition?

Well, they need to think slightly differently when they start, especially if they come from a relational database background, which most people do, because you want to model your data in the way that is most suitable for Cassandra. You need to think about how you’re going to access the data and plan your schema ahead. Learn to think of it not as a MySQL table translated into a different model, but as a completely different solution. If you do that, then you can really tap into the potential of Cassandra.
