February 26th, 2013

Episode 8

Elapsed Time 24:10


Multi-Data Center Replication

Is it common for Cassandra users to be taking advantage of MDC replication features?

Yes, using MDC is common, but it's most likely not implemented by the majority of users; I'd say that a little over a third of deployments are MDC. Almost all of the Cassandra users in production are planning on going MDC but aren't willing to tackle it at the moment: either there's no need for it yet, they don't want to invest the extra money, or they want to make sure Cassandra works in one data center before going to two.


Do other SQL and NoSQL competitors share this MDC feature?

MDC in Cassandra is a pretty unique feature. There are other systems that do MDC deployments, but only a select few share a similar architecture to Cassandra in how it's done.


Matt’s Recommendation “Cassandra takes care of replicating the data that is missed because it intuitively understands that links between data centers go down and may come back up.”


Which other DB offerings perform the same type of MDC replication implementation as Cassandra?

Riak does, though it's only similar and doesn't fully mimic the way Cassandra handles MDC replication. Both systems expect data centers to go down and nodes to fail, and both recover from that very gracefully, without a whole lot of interaction required. In most cases, no interaction is required at all by the administrators.


It's a fairly unique approach shared by those two systems. None of the other NoSQL systems really do that; they are all some variation of master-slave, sharded master-slave, or something else.


How do relational systems go about doing MDC replication?

There are legacy systems that allow for MDC replication but, essentially, they just do log-shipping replication: a single master accepting writes in one data center. This means that any time you do a write from anywhere, you're always going to that one data center. If you have multiple data centers, then some of them are always transiting the long-haul, high-latency link to reach the master data center for the write, and the data is then shipped back over the logs to all of the other data centers. This is a single point of failure in its own right; if that master goes down, then you can't do your writes.


Matt’s Recommendation “I would go on to say that running a master-slave system in two data centers is not a true multi-datacenter deployment; it gives you very few of the benefits but all of the problems.”


If somebody were attempting an MDC implementation, is there anything they should be aware of?

Lack of Sufficient Bandwidth Between Data Centers

Cassandra does an excellent job of hiding the latency between data centers, but it does a terrible job when dealing with a smaller pipe than it needs. It's one thing if you're periodically spiking your traffic to max it out; Cassandra can deal with that just fine. But if you're trying to pump 12Mb of traffic through a 10Mb pipe between data centers, it's not going to work. Cassandra tries to deal with that, and sometimes the result is that you'll see nodes going down: if you look at the nodes in DC1, they'll say the nodes in DC2 are down, and if you look at the nodes in DC2, they'll say the nodes in DC1 are down. Of course, you'll see the pipe itself maxed out if you've got something monitoring throughput. You'll see traffic building up on both sides of the link and it gets worse from there, so make sure you have enough bandwidth.


Going from Single Data Center to Multi-Data Center

Adding a new DC to an existing deployment can be a pain. There are documents (http://planetcassandra.org/blog/post/datastax-developer-blog-multi-datacenter-replication-in-cassandra) that you should read carefully and follow before you do it; there is a strict order you need to do things in. For example, something NOT to do is create a new data center, put only one node in it, and then change your replication factor to say that the new data center is supposed to have a copy of the data. What that means is that all of the nodes in the initial data center will send all of their data to that one node, because it's the only one in the new data center.


The appropriate way to do this is to first change the property files, or whatever snitch you are using, which map IP addresses to the DC and rack they belong to. So you change the settings for the snitch first, to list the new machines, then you give the new data center a replication factor of 0. After that, you add however many nodes you're going to add in that data center and set all of those up. Once the nodes are online, you go back to your schema and change the replication for that data center to be whatever you want, and be sure to run a repair on the nodes in that data center. After all of those repairs have finished, you can start using your new data center.
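To make that ordering concrete, here is a minimal sketch of the schema side of the procedure using the DataStax Python driver. The keyspace name, data center names, contact point, and replication factors are all hypothetical, and the snitch edits and the repairs still happen on the nodes themselves, outside of CQL.

```python
# Sketch of the schema-side steps when adding a second data center (DC2).
# Keyspace name, DC names, and replication factors are placeholders.  The
# snitch change (e.g. cassandra-topology.properties for the PropertyFileSnitch)
# is done on the nodes first, outside of this script.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.1.10"])   # any node in the existing data center
session = cluster.connect()

# Step 1: declare the new data center with a replication factor of 0 (or simply
# leave it out) so the existing nodes don't stream everything to a half-built DC.
session.execute("""
    ALTER KEYSPACE my_keyspace
    WITH replication = {'class': 'NetworkTopologyStrategy',
                        'DC1': 3, 'DC2': 0}
""")

# Step 2: bring up all of the DC2 nodes, then raise the replication factor to
# the value you actually want for that data center.
session.execute("""
    ALTER KEYSPACE my_keyspace
    WITH replication = {'class': 'NetworkTopologyStrategy',
                        'DC1': 3, 'DC2': 3}
""")

# Step 3: run a repair on each node in DC2 (nodetool, outside this script) and
# only point clients at the new data center after all repairs have finished.
```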


Matt's Recommendation on MDC General Practice "You don't want to have your clients in one data center talking to Cassandra in the other data center. You should always point your clients at the local nodes in the data center they share, and have Cassandra do the replication across data centers. Also, if you're doing something like Amazon and running across data centers, you should use encryption, because the traffic isn't transmitting over a private network."
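As a client-side sketch of that advice with the DataStax Python driver: the addresses and the data center name are made up, and exactly where the load-balancing policy is passed varies by driver version.

```python
# Sketch: keep client traffic pinned to the client's own data center.
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

cluster = Cluster(
    contact_points=["10.0.1.10", "10.0.1.11"],       # nodes in the local DC only
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="DC1")),     # don't route requests to the remote DC
)
session = cluster.connect("my_keyspace")

# The cross-data-center encryption mentioned above is a server-side setting, not
# a driver option: internode_encryption under server_encryption_options in
# cassandra.yaml (e.g. "dc" or "all") encrypts the replication traffic that
# travels between data centers.
```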


Performance Tips & Tricks


It's somewhat counterintuitive, but you probably want to run with a smaller batch; sending 1,000 or 10,000 rows at a time in a batch, while doable, is not likely to increase performance. All of the rows that you send in a batch go out as one message, and when the coordinator node receives that message it has to break it up to figure out which of those rows go to which nodes.


You can often obtain higher throughput by making smaller batches and then running those batches in parallel. So, instead of doing 1 batch of 1,000, maybe you do 100 batches of 10, and then you issue those 100 batches in parallel from your clients. Cassandra will process each of those batches of 10 serially, but because you've got 100 batches, you utilize your hardware resources more efficiently. If a batch fails due to a node going down, a flaky network connection, a system crash or the like, then the whole batch has to be retried; if you have smaller batches, then only that small batch needs to be retried.
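A rough sketch of that pattern with the DataStax Python driver, where the table, columns, and the batch size of 10 are illustrative only:

```python
# Sketch: instead of one batch of 1,000 rows, issue many small batches in
# parallel and wait for them all.  Table and column names are hypothetical.
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(["10.0.1.10"])
session = cluster.connect("my_keyspace")
insert = session.prepare("INSERT INTO events (id, payload) VALUES (?, ?)")

rows = [(i, "payload-%d" % i) for i in range(1000)]
batch_size = 10    # tune empirically; somewhere between 10 and 100 is a starting point

futures = []
for start in range(0, len(rows), batch_size):
    batch = BatchStatement()
    for row in rows[start:start + batch_size]:
        batch.add(insert, row)
    futures.append(session.execute_async(batch))   # all batches in flight at once

for future in futures:
    future.result()   # raises on failure; only that one small batch would need a retry
```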


Matt's Recommendation: "Empirically, batches between 10 and 100 seem to work, but it's really something you need to test in order to get the best performance."


Matt's Recommendation: "It's cheaper to send 1 row with 10 columns in it than it is to send that same row 10 different times with one column each time. This may not be true for rows: if you have 10 rows with 1 column each, you would likely get better throughput and latency by sending each of those as a single request in parallel than you would by putting them all into one batch."
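A small sketch of that distinction, again with hypothetical table and column names: one row with several columns travels as a single statement, while several distinct rows are usually better off as separate requests issued in parallel rather than one batch.

```python
# One row with many columns: a single INSERT is the cheap case.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.1.10"])
session = cluster.connect("my_keyspace")

insert_wide = session.prepare(
    "INSERT INTO users (id, name, email, city) VALUES (?, ?, ?, ?)")
session.execute(insert_wide, (1, "Ann", "ann@example.com", "Austin"))

# Ten distinct rows: send them as ten parallel requests instead of one batch.
insert_row = session.prepare("INSERT INTO users (id, name) VALUES (?, ?)")
futures = [session.execute_async(insert_row, (i, "user-%d" % i)) for i in range(10)]
for f in futures:
    f.result()
```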


Is there anything that can be done when bringing nodes up or down to make the process more efficient?

The new version of OpsCenter (http://www.datastax.com/products/opscenter) allows you to shut down and restart nodes. You can even disable and enable Thrift and gossip on the nodes, independently of everything else. If you need to have a node up so you can inspect what's going on, you can make it appear down to the other nodes by disabling gossip or Thrift; the other nodes will mark that node as down. This prevents the other nodes from sending requests to it, and they begin to queue hints instead, which reduces the traffic load on that node so you're able to poke around.