September 18th, 2013


This post was written by Jon Haddad.

A question developers new to Cassandra frequently ask on the mailing list is whether it’s possible to start with a single node and scale up as their needs grow. It seems to come most often from people familiar with MySQL, Mongo, or another database that uses replication to scale reads.

The short answer to this question is yes, you can absolutely run a one node cluster. However, it’s important to understand the caveats of doing so. Cassandra was built with the intention of running in a cluster. This means that several defaults that are reasonable for a cluster either aren’t practical or don’t apply with a single node.

Like many other databases, Cassandra uses a commit log to save changes. The default setting is to sync the commit log to disk once every 10,000 ms (10 seconds). To understand why this is sane, it’s important to understand how writes are handled.

Let’s assume for a second we’re running a 3 node cluster with the replication factor set to 3. That means each row of data will exist on all 3 machines. When you write to Cassandra, you get to choose a consistency level. This effectively allows you to ask, “Did this INSERT successfully happen on X machines?” If you’re using quorum, X is your replication factor divided by 2, rounded down, plus 1. With a replication factor of 3, this means 2 nodes have to acknowledge the write for Cassandra to return success. When a node acknowledges a write, the data is in memory and has been written to the commit log, but the commit log itself may not have been synced to disk yet.
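The quorum arithmetic is easy to get wrong for even replication factors, so here is a minimal sketch of it. The function name `quorum` is just an illustration and not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must acknowledge a QUORUM request."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    # Integer division: the fractional part is dropped before adding 1.
    return replication_factor // 2 + 1


for rf in (1, 2, 3, 5):
    down_ok = rf - quorum(rf)
    print(f"RF={rf}: quorum={quorum(rf)}, tolerates {down_ok} replica(s) down")
```

Note that with a replication factor of 1, quorum is also 1, so on a single node every consistency level comes down to that one machine.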

Looking back, you can see the default sync period is 10 seconds. This is a sane default because it’s highly unlikely that you will lose both nodes that acknowledged the write within the same 10 second window. In a single node system, however, a crash or power loss can cost you up to roughly 10 seconds of acknowledged writes. So if you’re going to try to rock out with only one node, you should adjust this parameter to match your business requirements for data durability.
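To put a number on that exposure, here is a rough back-of-the-envelope sketch. It assumes a steady write rate and periodic commit log syncing, and the function name and the example rate of 500 writes per second are made up for illustration. In the `cassandra.yaml` I'm familiar with, the relevant settings are `commitlog_sync` and `commitlog_sync_period_in_ms`, but check the docs for your version since names can change between releases.

```python
def writes_at_risk(writes_per_second: float, sync_period_ms: int = 10_000) -> float:
    """Rough upper bound on acknowledged writes not yet synced to disk.

    Assumes a steady write rate and a commit log synced once per period
    (the 10,000 ms default discussed above). This is an estimate for one
    node, not a guarantee of how Cassandra behaves in every failure.
    """
    if writes_per_second < 0 or sync_period_ms < 0:
        raise ValueError("rate and sync period must be non-negative")
    return writes_per_second * sync_period_ms / 1000


# Hypothetical workload: 500 writes per second on a single node.
print(writes_at_risk(500))         # default 10 second sync period
print(writes_at_risk(500, 1_000))  # sync period lowered to 1 second
```

Shortening the sync period shrinks the window, at the cost of more frequent disk syncs, so the right value really is a business decision.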

Since with a single node we no longer have the advantages of Cassandra’s distributed architecture, does that mean it’s not worth building your application on it? This again will depend on your business requirements. If you aren’t familiar with a column-oriented data store, you’ll have a slight, but not terrible, learning curve. If you’re coming from a relational background, Cassandra may seem backwards. However, if you already know that you’re going to move to Cassandra in the long run, you might be better off sucking it up and writing your application with it from the start. There’s a lot of value in not having to switch databases while people are using your application in a live environment. It’s also perfectly possible to start with a single node, then add a second, a third, and so on. Just be sure to change your replication factor to more than 1 once you’re running multiple nodes. As always, read the docs before you put any unfamiliar database in production.
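For reference, changing the replication factor is a single CQL statement. The sketch below only builds that statement as a string, so it runs without a cluster. The keyspace name `my_app` is hypothetical, and `SimpleStrategy` is an assumption that fits a single data center setup; your keyspace may use a different strategy.

```python
import re


def alter_replication_cql(keyspace: str, replication_factor: int) -> str:
    """Build a CQL statement that changes a keyspace's replication factor.

    Assumes SimpleStrategy; the helper itself is illustrative and not
    part of any Cassandra driver.
    """
    # Only accept plain unquoted identifiers to avoid building odd CQL.
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", keyspace):
        raise ValueError(f"unexpected keyspace name: {keyspace!r}")
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return (
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {replication_factor}}};"
    )


# Hypothetical keyspace that has grown from one node to three.
print(alter_replication_cql("my_app", 3))
```

You would run the resulting statement in `cqlsh`. Changing the setting does not copy existing data by itself, so plan on running `nodetool repair` on the nodes afterwards so the new replicas receive the rows written earlier.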