Cassandra belongs to the NoSQL family of databases. It is a database we hear more and more about when the need is to manage large volumes of data efficiently and to scale easily.
Cassandra was designed from the start to be highly distributed, high-performance and highly scalable.
Cassandra is a very nice implementation of the peer-to-peer model. All nodes within a Cassandra cluster have the same role. There is no concept of master / slave …
In short, I’m not going to list the benefits of using Cassandra, as others have already done that very well before me. On the Cassandra wiki you will find a set of links to presentations that describe the key concepts around Cassandra. One last note: Cassandra v1.0 will be out soon.
Once you have made your choice (“OK, let’s go with Cassandra”), what issues will you run into when implementing it? I will try to provide some answers … Let’s go!
Design your data model:
One of the first things to do when you start working with Cassandra is to forget everything you have been told about relational databases and everything you do with them: the key word is denormalization.
Once you have understood Cassandra’s model, you must focus on how you will query your data. Cassandra’s query capabilities are limited (no joins, for example …). So you must start by making sure you can do what you want to do.
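To make the “design for your queries” idea concrete, here is a minimal sketch in Python. It does not talk to Cassandra at all: plain dictionaries stand in for column families, and the names (users_by_id, users_by_city, save_user, …) are hypothetical, chosen only for illustration. The point is that the same data is written twice, once per query you want to serve, instead of being joined at read time.

```python
# In-memory stand-ins for two column families (assumption: no real
# Cassandra client is used here). Each maps a row key to its columns.
users_by_id = {}    # row key = user id, columns = user attributes
users_by_city = {}  # row key = city, one column per user id


def save_user(user_id, name, city):
    """Write the user into every 'column family' that serves a query."""
    users_by_id[user_id] = {"name": name, "city": city}
    # Deliberate duplication: this row answers "who lives in city X?"
    users_by_city.setdefault(city, {})[user_id] = name


def get_user(user_id):
    """Query 1: look a user up by id (a single row read)."""
    return users_by_id[user_id]


def list_users_in_city(city):
    """Query 2: list users of a city (a single row read, no join)."""
    return sorted(users_by_city.get(city, {}).items())


save_user("u1", "Alice", "Paris")
save_user("u2", "Bob", "Lyon")
save_user("u3", "Chloe", "Paris")
print(list_users_in_city("Paris"))
```

The price of this approach is that every write has to update all the duplicated rows, which is exactly why you should list your queries before you design the model.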
Install and configure a Cassandra cluster
Cassandra’s strength is that it works with a large number of nodes in a cluster. The largest cluster I have heard of (in this interview with Jonathan Ellis) is said to consist of 400 nodes … just huge. Without going that far, you still need to get familiar quickly with how a Cassandra cluster operates and how to administer it. In short, you have to get your hands dirty. I recommend this video or this DataStax documentation describing how to quickly set up a Cassandra cluster. Cassandra also works in single-node mode, which can be used for development, but I will come back to that later.
You also need to understand the main configuration file of each Cassandra node: cassandra.yaml. Here is the documentation.
If your servers support Debian packaging, I strongly advise you to use the package provided by Cassandra. It lets you install quickly and, above all, update easily!
There is also a question of means … To test your cluster, you need a hardware environment that can show off Cassandra’s capabilities. It also lets you reassure yourself with valid performance tests before you reach production. Virtualizing several (“small”) nodes on a single server does not give conclusive performance results.
Administer and monitor a Cassandra cluster
Cassandra natively provides a set of tools to manage your cluster:
- cassandra-cli: the command-line utility to read from and write to your cluster
- nodetool: the command-line utility to monitor and administer the cluster nodes
On top of that, DataStax offers OpsCenter, a web application for monitoring and administering your Cassandra cluster. It is really a great tool. There is a free version for development. I understand (but I may be wrong) that there will soon be a free version for production as well, but it has not been released yet.
Manage “Eventually Consistent”
Cassandra offers an “Eventually Consistent” model. These words should normally make any developer’s ears prick up, and be a little scary.
Simply put, according to the CAP theorem, it is not possible to have consistency (Consistency), availability (Availability) and partition tolerance (Partition tolerance) all at once. It is only possible to have 2 out of 3. Cassandra has chosen to focus on “A” and “P” and to let you choose the level of consistency. It is therefore possible to have strong consistency with Cassandra, but at the cost of higher latency (because more nodes in the cluster have to agree on the value of a piece of data during a read, for example).
You can therefore choose from a range of consistency levels. Here is documentation on the DataStax site on the subject to guide you in your choice.
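The trade-off can be checked with a little arithmetic. The sketch below is plain Python with hypothetical helper names, and it covers only three of Cassandra’s consistency levels (ONE, QUORUM, ALL). It relies on the usual rule: a read is guaranteed to see the latest write when the number of replicas read plus the number of replicas written exceeds the replication factor, because the two sets must then overlap.

```python
def quorum(replication_factor):
    """A quorum is a strict majority of the replicas."""
    return replication_factor // 2 + 1


def replicas_required(level, replication_factor):
    """Number of replicas that must answer for a given consistency level."""
    required = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return required[level]


def is_strongly_consistent(read_level, write_level, replication_factor):
    """True when read and write replica sets are guaranteed to overlap."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor


# With 3 replicas, QUORUM reads plus QUORUM writes overlap (2 + 2 > 3),
# whereas ONE plus ONE does not (1 + 1 <= 3): faster, but only eventual.
print(is_strongly_consistent("QUORUM", "QUORUM", 3))
print(is_strongly_consistent("ONE", "ONE", 3))
```

The higher the level, the more nodes each request waits for, which is where the extra latency comes from.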
Cassandra has no locking mechanism. This is a drawback of its architecture. It is therefore not possible to prevent several clients from reading or writing data on the same key of your model at the same time. This can be very annoying, let’s not deny it … In this case: “the answer is elsewhere.”
It is not impossible to lock your data, but you have to use an external system to handle it. ZooKeeper lets you do this. It is a distributed synchronization system that lets you build such locks. ZooKeeper does not introduce a SPOF because it also runs as a cluster. Cage is a Java library that uses ZooKeeper and offers a lock system based on paths. Here is an article that explains in detail how useful it is as a complement to Cassandra.
OK, but a lock does not mean a transaction … Cassandra guarantees atomicity for data under the same key (even across different column families). However, what if you need atomicity across several keys because your data is strongly coupled and you cannot do otherwise?
The answer is “Just do it (yourself).” One approach is to implement a transaction log in a dedicated column family. The principle is:
- log the transaction by serializing the data that needs to be atomic (JSON, XML, …) and inserting the serialized result in one go into a column of that column family.
- do your processing by inserting your data into your model.
- mark or remove the transaction’s log entry once processing is complete.
This system then lets you replay the transaction log if a problem occurred during processing. It is discussed in this presentation (from slide 24).
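The three steps above can be sketched as follows. This is a simplified Python illustration under stated assumptions: dictionaries stand in for the dedicated column family and for the data model, JSON is the serialization format, and all names (run_transaction, replay_pending, the fail_midway flag used to simulate a crash) are hypothetical. It also assumes the mutations are idempotent, so that replaying an already half-applied transaction is harmless.

```python
import json
import uuid

transaction_log = {}  # stand-in for the dedicated column family: tx id -> JSON
data = {}             # stand-in for the data model: row key -> columns


def apply_mutations(mutations):
    """Insert columns row by row; applying the same mutations twice is harmless."""
    for row_key, columns in mutations.items():
        data.setdefault(row_key, {}).update(columns)


def run_transaction(mutations, fail_midway=False):
    """Log, apply, then clear the log entry. fail_midway simulates a crash."""
    tx_id = str(uuid.uuid4())
    # Step 1: a single write of the whole serialized transaction.
    transaction_log[tx_id] = json.dumps(mutations)
    if fail_midway:
        # Simulated crash: only the first row is written, the log entry stays.
        first_key = next(iter(mutations))
        apply_mutations({first_key: mutations[first_key]})
        return tx_id
    # Step 2: do the actual processing.
    apply_mutations(mutations)
    # Step 3: the transaction is complete, remove its log entry.
    del transaction_log[tx_id]
    return tx_id


def replay_pending():
    """Re-apply every transaction still present in the log."""
    for tx_id in list(transaction_log):
        apply_mutations(json.loads(transaction_log[tx_id]))
        del transaction_log[tx_id]
```

For example, calling run_transaction({"a": {"x": 1}, "b": {"y": 2}}, fail_midway=True) leaves row "b" unwritten and one entry in the log; a later call to replay_pending() completes the write and empties the log.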
How to interact with Cassandra
Cassandra exposes its entire client API via Thrift. Using Thrift directly is not recommended because it is a rather low-level API, aimed at developers of client libraries rather than at client applications.
You have to choose a client to talk to Cassandra: they are listed here.
In Java, there are several. Hector clearly seems to be the most used and the most configurable. It is the one I use.
If you are using Maven, most of the libraries you need are mavenized and available in the central repository (the exception is Cage, which requires declaring a specific repository).
If you like unit testing and TDD, no problem: it is possible and it works very well. You can run Cassandra and ZooKeeper “embedded” in your JVM.
It is even possible to start an embedded Cassandra and load data into it before running your unit tests. I suggest you take a look at cassandra-unit if you want to do that.
A constantly changing ecosystem
Whether it is Cassandra’s features, its tools or its ecosystem, there are many changes, developments and improvements.
Major new features are constantly emerging. For example, a few months ago, between versions 0.7.8 and 0.8.0, a query language, “CQL”, appeared, and Cassandra version 1.0 will be out soon … Things are moving very fast.
In general, you should keep watch on blogs, Twitter and mailing lists (Cassandra’s and Hector’s, for example) to see what is happening and what is being released. The community around Cassandra is growing and is very responsive.
Do not be afraid to undo your work
As things change and evolve, you should not be afraid to throw some code away, to modify it often and to refactor a lot. This comes either from a change in your understanding of Cassandra that makes you think you can do much better, or from the appearance of a new feature, or from something you read on a blog …
There is no secret: my best weapons for managing change are my unit tests and my integration tests, and it works well!
To go further
You can have a look at the Cassandra source code in SVN here or on the Git mirror here.
You can read the documentation on the wiki about Cassandra’s internal architecture here and watch this video.
You can help by starting here .
Using Cassandra as a backend to store your data has a real impact on the way you develop your application. On the one hand, this is fairly normal: if you use Cassandra, you normally have a lot of data to manage, and you have to take that into account one way or another when developing your application …
Once the effort has been made, what a pleasure it is to add a node to the cluster to scale …
In fact, Cassandra is very “DevOps” :-).