February 28th, 2013

A few weeks ago, a few friends and I were bouncing tweets back and forth about how awesome we think Cassandra’s multi-datacenter replication is. I am not going to link to the tweet, but more or less we got trolled by a statement to the effect of, “It is great when correctness does not matter.” That statement drums up an old, tired misinformation campaign against eventual consistency that has been debunked time and time again.

I was going to let the entire thing go (which, as most of you know, I am bad at). However, a situation came up that I believe highlights how “correctness” is not as easy to achieve as many people pretend it is.

We are using Apache Kafka, a distributed messaging system. The way Kafka works, all writes go directly to disk. Clients are organized into consumer groups, and each consumer writes an entry to ZooKeeper every N seconds to record its offset. That way, when clients die or topics are rebalanced, the same data is not consumed twice by the same group.
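To make the pattern concrete, here is a minimal sketch of that periodic offset commit, with a plain dict standing in for ZooKeeper; the class and constant names are mine, not Kafka’s API, though the znode path layout matches what Kafka of that era used.

```python
import time

# Stand-in for ZooKeeper: maps a znode-style path to a committed offset.
# Kafka keeps these under /consumers/<group>/offsets/<topic>/<partition>.
zk_store = {}

COMMIT_INTERVAL_SECS = 10  # the "every N seconds" from above (assumed value)

def offset_path(group, topic, partition):
    return f"/consumers/{group}/offsets/{topic}/{partition}"

class Consumer:
    def __init__(self, group, topic, partition):
        self.path = offset_path(group, topic, partition)
        # Resume from the last committed offset, or start at the beginning.
        self.offset = zk_store.get(self.path, 0)
        self.last_commit = time.monotonic()

    def consume(self, message):
        self.offset += 1
        # Commit periodically rather than per message --
        # every commit is a write to ZooKeeper.
        if time.monotonic() - self.last_commit >= COMMIT_INTERVAL_SECS:
            self.commit()

    def commit(self):
        zk_store[self.path] = self.offset
        self.last_commit = time.monotonic()
```

Note the trade-off baked into this design: if a consumer dies after reading but before its next commit, whoever picks up the partition re-reads up to a commit interval’s worth of messages. That is at-least-once delivery, not exactly-once.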

As we started ramping up our Kafka usage, we noticed a very high amount of IOWait on our ZooKeeper servers.


However, if you look at the overall disk traffic, the numbers are very low.

So now we are in a bit of a pickle. Unlike Cassandra, ZooKeeper can NOT have nodes added dynamically. Nodes are defined statically in its configuration file.
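For reference, ensemble membership lives in zoo.cfg and looks something like this (the hostnames here are made up):

```
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
# Every member of the ensemble is listed up front. Growing the
# cluster means editing this file and restarting servers.
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```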


Even if we could add nodes, each write goes to every server. Adding more nodes, as with MySQL master/slave replication, only scales reads; it does not scale writes.
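A quick sketch of that asymmetry, with illustrative numbers of my own choosing:

```python
def per_node_write_load(writes_per_sec, nodes):
    # Every write is replicated to every member of the ensemble,
    # so each node sees the full write stream no matter the size.
    return writes_per_sec

def total_read_capacity(reads_per_node, nodes):
    # Reads can be served locally by any node, so they scale linearly.
    return reads_per_node * nodes
```

Go from 3 nodes to 7 and read capacity more than doubles, but every single node still takes every single write.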

Since adding nodes will not help, we could upgrade our hardware. Let's investigate this. Adding RAM does not help, because the actual ZooKeeper database is less than 100 MB, so this is not a caching issue. We could go out and purchase a RAID card with a battery-backed write cache. That will help us to a point. We could purchase servers with large RAID systems.
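Some back-of-the-envelope math on why the disk hurts. ZooKeeper fsyncs its transaction log before acknowledging each write, and a spinning disk can only do on the order of a hundred fsyncs per second; the specific figures below are assumptions for illustration, not measurements.

```python
# Rough ceiling for one 7200 RPM spindle (assumed, order of magnitude).
FSYNCS_PER_SEC_DISK = 100

CLIENTS = 300               # "a few hundred clients" (assumed)
COMMIT_INTERVAL_SECS = 2    # each client commits its offset every N seconds (assumed)

# Each offset commit is a ZooKeeper write, and each write forces an fsync.
commits_per_sec = CLIENTS / COMMIT_INTERVAL_SECS
utilization = commits_per_sec / FSYNCS_PER_SEC_DISK

print(f"{commits_per_sec:.0f} fsyncs/sec demanded, "
      f"{utilization:.0%} of one disk's fsync budget")
```

With these numbers you are asking the disk for 150% of what it can physically deliver, even though the bytes written per second are tiny. That is exactly the signature we saw: high IOWait, low throughput.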

All of those options kinda suck. Scalable, to me, does not mean a forklift upgrade every time something gets slow. I decided to dig deeper, because frankly I was very surprised that a few hundred clients periodically writing offsets to ZooKeeper would cause this much IOWait.

When I dug in a bit (with the help of some folks from the Kafka IRC channel) I found this:


At this point, you might be getting an idea of where my story arc is going. I set forceSync to no.
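It is a one-line change; ZooKeeper files this under its “unsafe” options for exactly the reason that follows:

```
# Do not fsync the transaction log before acknowledging writes.
# Trades durability on power loss for far less time waiting on disk.
forceSync=no
```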

As you may have guessed, the IOWait dropped to near 0. But what does that mean? Am I a ZooKeeper tuning master? No, it does not mean that at all.


It means that if I use this setting and someone yanks the power cord from all my ZooKeeper servers at approximately the same time, it is possible that I could lose a write!
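Here is a toy model of that failure mode (names and structure invented for illustration): with forceSync on, a write is flushed to disk before the client is told it succeeded; with it off, the acknowledgment goes out while the data is still sitting in the OS buffer, where a power failure can eat it.

```python
class Log:
    """Toy write-ahead log where syncing before the ack is optional."""

    def __init__(self, force_sync):
        self.force_sync = force_sync
        self.os_buffer = []   # survives a process crash, NOT a power loss
        self.disk = []        # what is actually durable

    def append(self, record):
        self.os_buffer.append(record)
        if self.force_sync:
            self.fsync()
        return "ack"          # the client sees success either way

    def fsync(self):
        self.disk.extend(self.os_buffer)
        self.os_buffer.clear()

    def power_failure(self):
        # Yank the cord: anything not yet flushed is gone for good.
        self.os_buffer.clear()
        return self.disk
```

The client cannot tell the two configurations apart; both return "ack". Only the power cord knows the difference.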

This type of thing is not uncommon. Xen, for example, can virtualize an fsync call.

As always, be on the lookout when someone scoffs at Cassandra’s eventual consistency. You might want to ask them whether they decided to spend $100,000 on a big-iron ZooKeeper cluster, or turned off forceSync and lost durability in their quest for “absolute correctness.”