Anastasjev Oleg: Lead Developer at Odnoklassniki (Ok.ru)
Brady Gentile: Community Manager at DataStax
Brady: Hello, Planet Cassandra users. Joining us today is Oleg Anastasyev, Lead Developer at Odnoklassniki (Ok.ru). Oleg, thank you so much for joining us. Can you tell us a little bit about who you are and what Ok.ru is all about?
Oleg: Odnoklassniki one of the largest social networks for Russian-speaking audiences. It’s used across 22 countries and translated in multiple languages. We have 200 million registered users and roughly 80 million unique visits monthly. We started in 2006 and since that time we have been growing fast.
Brady: What does Ok.ru stand for?
Oleg: Ok.ru is a short name for Odnoklassniki.ru. “Odnoklassniki” means “classmates” in English. When we first started, the website was meant to help people find old classmates and relatives, so you can reunite. Today, Odnoklassniki is an entertainment social network. It has broad functionality, like photos, videos, music, games, etc. Many people use it for entertainment and communication.
Brady: Very good. And how does Apache Cassandra fit into the mix?
Oleg: Odnoklassniki is using cassandra for its business data, which doesn’t fit in RAM. This data is typically fast growing, frequently accessed by our users and must be always available, because it constitute our primary business as a social network.
The way we use Casssandra is somewhat unusual – we don’t use thrift or netty based native protocol to communicate with Cassandra nodes remotely. Instead, we co-locate Cassandra nodes in the same JVM with business service logic, exposing not generic data manipulation, but business level interface remotely. This way we avoid extra network roundtrips within a single business transaction and use internal calls to Cassandra classes to get information faster. Also this helps us to do many little hacks on its internals, making huge gains on efficiency and ease of distributed servers development.
For example, we can query the sstables bloom filters right from business logic to make fast preliminary filtering. We also can process a whole dataset right on our cassandra nodes, without moving data between servers. Also, we’ve developed some extensions that help us to make things faster and more stable.
Brady: Are the extensions open-sourced into the community? Can other people use these?
Oleg: Actually, yes. We have these extensions on the GitHub of Odnoklassniki (http://github.com/odnoklassniki).
Brady: Excellent. What was the motivation for using Cassandra, and were there other technologies that you evaluated it against?
Oleg: We started to evaluate different technologies back in 2010 when we reached the first 1,000 servers in our data center. We were searching for a storage solution that allows for multiple data center configurations and that can survive a data center failure.
We looked at HBase, Cassandra and Voldemort. We currently use Voldemort and Cassandra at Ok.ru, but for different tasks. With HBase, the system has too many moving parts and a well-known issue with NameNode being a single point of failure.
Now we are at 5,000 servers in our data centers and we survived actually several data center failures. For Cassandra’s clusters, it went very well, almost unnoticed for consumers; so, that was good.
Brady: Very nice. And what does your deployment look like?
Oleg: We have more than 5,000 servers across several data centers, all of which are located in Moscow. We have 23 clusters of Cassandra and 370 nodes in production; our largest cluster is 96 nodes. The most loaded cluster makes 1 million business operations per second, and the largest storage cluster has 80 terabytes of data, about 2 terabytes per node. We use both SSDs and spinning disks, depending on what we want: storage space or speed.
Brady: What are the applications running on those 23 clusters?
Oleg: I’ll break them down for you:
- Like! Button Data: 48 nodes with over 70 Billion likes, 1 Million ops/sec. 12TB total data store. This one is the most loaded system.
- Messages: (Stores all messages of private conversations between users) 96 nodes, >200 Billion messages, 80TB of data and up to 2TB per node. Unlike other clusters, which are almost all on SSDs, this one uses platter disks, 22 per node. We have to do some hacks to make data evently distributed among them.
- Timeline: 3 clusters, one stores events made by users, the next is a timeline feed for user-to-user subscriptions and the last calculates statistics on subscriptions. There are 54 nodes in this feature total.
- Social Graph Servers: 70 nodes total. These are not using Cassandra as a whole; instead, they use only persistance part of it (i.e. memtables & sstables) to persist social graph data. (This one is even more loaded than Like! but as they are not fair cassandra nodes, so I don’t count them as such)
- Mediatopics: Rich media blog posts by users with embedded music, images, video; this takes up 24 nodes total. Many smaller clusters, less than 20 nodes in size, store: music feeds, recommendation servers, video metadata, 3 clusters for photo related info storage, social game notifications, etc.
Brady: Excellent. Do you have plans to upgrade to Cassandra 2.0 when it becomes available? If so, is there anything that you’re looking forward to with that?
Oleg: Yes, actually. Most of our deployments are currently in a branched version. It’s kind of simple and fast, but sometimes we want simplicity for our programmers. We are now looking into the 2.0 version and trying to adopt it into production.
Brady: What would you say is your experience with the Cassandra community? I believe you had said you’re based in Moscow. Is that correct?
Oleg: Not exactly. Actually, we have three offices. I am personally in Riga, and we have a Moscow office and a St. Petersburg office, as well.
Brady: If Cassandra users in your Moscow office are interested, we have a Moscow Cassandra Users group that I enocurage them to join. What are you most looking forward to in future versions of Apache Cassandra?
Oleg: A weak point of Cassandra from our point of view is lack of conflict detection, because it has a simple time stamp based resolution of conflicts, and this is not enough. We were forced to use Voldemort for that kind of workload, where we had a lot of concurrent updates to the same row.
I know that in version 2.0, you’ve tried to address this issue implementing conditional modifications, based on Paxos. But Paxos requires you to make several network round trips to perform a conditional modification operation, which could be too much, if you have your operation latency limited to 1-2ms. Paxos will fail to accept modifications when quorum of nodes is not available, which could be critical if you need to accept modifications even when datacenters have a partition. A vector clock based conflict resolution could be one of the possible options here.
Another issue is that stock Cassandra nodes require a lot of scripting. You have to write scripts to make it run repair tasks, make snapshots, clear or copy archived commit logs, rebalance cluster, etc etc. If nodes would be self-managing, so you don’t need to script or install a separate python environment ops center and its agents in order to get the clusters up and running that would be good for us. It would simplify the management of these nodes in production.
Another interesting problem that we are now facing is the distributed queue; there are very few queue implementations that work with the high availability that we need. I believe this kind of distributed, always available queue is very important; not only for us, but for other users as well.
Brady: Excellent. Oleg, thank you so much for joining us today and for all the great information about how you’re using Apache Cassandra at Ok.ru. Before we end here, is there anything else that you’d like to touch on about Ok.ru, Cassandra or the community?
Oleg: Actually, I want to thank you guys for the job you are doing. This is very exciting; you are on the cutting edge of technology.
Brady: Thank you, Oleg.
Oleg: Thank you.