Nicolas Lalevée: Data Officer at Scoop.it
Nicolas, thanks for joining us to talk about your Apache Cassandra use case at Scoop.it. What does Scoop.it do?
Scoop.it combines big data semantic technology with content curation. Our vision is to organize the Web into a smarter and more relevant place, and to do so, we believe that both algorithms and humans must play a role. Scoop.it lets users easily discover, enrich and share awesome articles, blog posts, media and more on their topic of interest. Our proprietary search engine crawls over 10 million web pages daily. Roughly 100 million people have visited Scoop.it since our launch two years ago.
How is Cassandra in the mix there?
We are using Cassandra in 3 different cases: first, our live analytics, which use a lot of time series and counters; second, asynchronous analytics: after some heavy computation in a Hadoop stack, the job results are stored into Cassandra; and last but not least, our suggestion engine, which keeps track of which content has already been suggested and which has not, and for that we leverage Cassandra’s bloom filters.
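Cassandra keeps a bloom filter per SSTable so it can answer "definitely not here" without touching disk, which is what makes negative lookups such as "has this content already been suggested?" cheap. The following is a minimal, self-contained Python sketch of the idea only, not Cassandra's actual implementation; the class and the example URLs are invented for illustration.

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: never a false negative, tunable false-positive rate."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely never added"; True means "probably added".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("http://example.com/article-1")
print(bf.might_contain("http://example.com/article-1"))  # True
print(bf.might_contain("http://example.com/article-2"))  # False (with high probability)
```

The same trade-off applies inside Cassandra: a tiny amount of memory per SSTable buys the ability to skip most disk reads for keys that were never written.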
As of today, we have a cluster of 6 machines. It handles about 6k reads/s and 1k writes/s under nominal load; at peak, we’re at about 15k ops/s.
Where are you running Cassandra, in the cloud or a datacenter?
We are running Cassandra on our own servers in a datacenter. For us, managing the machines ourselves costs less than using a cloud infrastructure.
You mentioned earlier you made the switch from MySQL. What caused you to migrate to Cassandra?
Almost 2 years ago, we had a simple datastore architecture: one MySQL to store them all. Scoop.it became more and more popular, which made life hard for our database. We also knew that this growth would not end soon, so the new datastore had to scale as easily as we grew. Since we love the latest and greatest technology, we looked into NoSQL data stores, especially those that promise easy cluster expansion without downtime.
The first data we wanted to move out of MySQL was a large set of counters, some of which fit perfectly within Cassandra’s data model. That was a big win: just a switch of a DAO implementation. On the other hand, some counters required batch processing of raw data, but we found that having Hadoop write its job results into Cassandra works very nicely. So, still a win.
There are other comparable data stores that would fit our use case, HBase for example, but we felt a lot more confident with Cassandra’s operational simplicity.
That’s great. Lastly, what tips do you have for someone just starting with Apache Cassandra?
The most important first step is understanding the data model. We came from the classical SQL world, and Cassandra was our initiation into NoSQL, so we had to change our mindset a bit. If people struggle with this part like we did, I recommend they still take the time to understand it well, because as soon as you see how your data can fit, everything starts making sense.
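To make that mindset shift concrete, here is a hypothetical sketch, in plain Python, of the kind of query-driven, wide-row layout that daily counters map to in Cassandra: one partition per topic, one cell per day, a counter as the value. The nested dicts merely stand in for a counter column family; all names are invented for illustration.

```python
from collections import defaultdict

# Conceptual stand-in for a Cassandra counter column family:
# partition key = topic_id, column name = day, value = counter.
analytics = defaultdict(lambda: defaultdict(int))

def record_view(topic_id, day):
    # In Cassandra this would be a single counter increment
    # on the (topic_id) partition, (day) column.
    analytics[topic_id][day] += 1

record_view("big-data", "2013-07-01")
record_view("big-data", "2013-07-01")
record_view("big-data", "2013-07-02")

# Reading a topic's full history is one partition read,
# with the columns already sorted by day.
print(dict(analytics["big-data"]))  # → {'2013-07-01': 2, '2013-07-02': 1}
```

The point is that you model around the reads you need (one topic's history in one lookup) rather than normalizing first and joining later, as you would in SQL.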
Another point I think is important is monitoring. This seems obvious. But we get lazy sometimes and since Cassandra is quite easy to operate, if it is working like a charm, why bother?
At one point our cluster came under very high pressure. We were only monitoring CPU, I/O and RAM consumption, and we didn’t have many stats about Cassandra internals, so it took us some time to figure out what went wrong. That was a shame, because Cassandra exposes a lot of useful internal data over its JMX interface. Once we fixed our monitoring, we got a clear view of what was wrong. I recommend leveraging Cassandra’s monitoring capabilities sooner rather than later; it also helps you understand Cassandra’s behavior and plan capacity.