What is Apache Cassandra?
Apache Cassandra Overview
Apache Cassandra, a top-level Apache project born at Facebook and built on ideas from Amazon’s Dynamo and Google’s BigTable, is a distributed storage system for managing very large amounts of structured data spread across many commodity servers while providing highly available service with no single point of failure. Cassandra aims to run on an infrastructure of hundreds of nodes, possibly spread across data centers in multiple geographic areas. At this scale, components small and large fail continuously, and the way Cassandra manages persistent state in the face of these failures drives the reliability and scalability of the software systems that rely on it. While Cassandra resembles a database in many ways and shares many design and implementation strategies with familiar database environments, it does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format. Cassandra was designed to run on cheap commodity hardware and to handle high write throughput without sacrificing read efficiency, which helps drive down cost of ownership while increasing the value of a business’s big data environment.
These aims have been widely met, and the benefits of Cassandra are being realized across industries. Many companies have successfully deployed Apache Cassandra, including large ones such as Adobe, Comcast, eBay, Rackspace, Netflix, Twitter, and Cisco. The larger production environments hold hundreds of terabytes of data in clusters of more than 300 servers. Cassandra is available under the Apache License 2.0.
Key features of Apache Cassandra
- Elastic scalability - Allows you to easily add capacity online to accommodate more customers and more data whenever you need.
- Always on architecture - Contains no single point of failure (unlike traditional master/slave RDBMSs and some other NoSQL solutions), resulting in continuous availability for business-critical applications that can’t afford to go down, ever.
- Fast linear-scale performance - Enables sub-second response times with linear scalability (double your throughput with two nodes, quadruple it with four, and so on) to deliver the response times your customers have come to expect.
- Flexible data storage - Easily accommodates the full range of data formats (structured, semi-structured and unstructured) that run through today’s applications. Also dynamically accommodates changes to your data structures as your data needs evolve.
- Easy data distribution - Gives you maximum flexibility to distribute data where you need by replicating data across multiple datacenters, the cloud and even mixed cloud/on-premise environments – all of which are becoming extremely common deployment environments. Read and write to any node with all changes being automatically synchronized across a cluster.
- Operational simplicity - With all nodes in a cluster being the same, there is no complex configuration to manage, so administration duties are greatly simplified.
- Transaction support - Delivers the atomicity, isolation and durability of ACID: a commit log captures all writes, and built-in redundancy ensures data durability in the event of hardware failures. Consistency is tunable.
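To make "tunable consistency" concrete, here is a minimal Python sketch of the usual replica-overlap rule: with a replication factor of N, a write acknowledged by W replicas and a read that consults R replicas are guaranteed to share at least one replica when R + W > N. The function names are hypothetical helpers for illustration and are not part of any Cassandra driver.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that forms a majority (a QUORUM-style level)."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int,
                           write_acks: int,
                           read_replicas: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_acks > replication_factor


rf = 3                      # assumed replication factor, for illustration only
q = quorum(rf)              # majority of 3 replicas
print(q)
print(reads_see_latest_write(rf, q, q))   # quorum writes + quorum reads
print(reads_see_latest_write(rf, 1, 1))   # single-replica writes and reads
```

Under these assumptions, quorum reads combined with quorum writes overlap, while single-replica reads and writes trade that guarantee for lower latency.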
Performance Comparisons: Apache Cassandra, HBase and MongoDB
NoSQL Database Performance Testing Overview
Understanding the performance behavior of a NoSQL environment, including Cassandra, under various conditions is critical. Conducting a formal proof of concept (POC) in the environment in which the database will run is the best way to evaluate platforms. POC processes that include the right benchmark inputs, such as configurations, parameters, and anticipated data and concurrent user workloads, give both IT and business stakeholders powerful insight into the platforms under consideration and a view of how business applications will perform in the environments tested.
Independent benchmark analyses and tests of various NoSQL platforms have consistently identified Apache Cassandra as the platform of choice for businesses interested in adopting NoSQL. One benchmark analysis by engineers at the University of Toronto evaluated six different data stores and found Apache Cassandra the “clear winner throughout our experiments”. Their report, Solving Big Data Challenges for Enterprise Application Performance Management, is thorough and involved. In addition, End Point Corporation, a database and open source consulting company, benchmarked the top NoSQL databases, including Apache Cassandra, Apache HBase and MongoDB, “using a variety of different workloads on AWS EC2”, in an analysis commissioned by DataStax. The comprehensive results of their testing are available in the whitepaper Benchmarking Top NoSQL Databases.
Before viewing some of the general results, it’s worth briefly describing how the NoSQL databases tested differ.
- Apache Cassandra, as previously mentioned, is a highly scalable, high performance distributed database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
- Apache HBase is an open source, non-relational, distributed database modeled after Google’s BigTable and is written in Java. It is developed as part of Apache Software Foundation’s Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop.
- MongoDB is a cross-platform document-oriented database system that eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas, making the integration of data in certain types of applications easier and faster.
End Point conducted the benchmark of these NoSQL database options on Amazon Web Services EC2 instances, an industry-standard platform for hosting horizontally scalable services. To minimize the effect of AWS CPU and I/O variability, End Point performed each test three times on three different days. New EC2 instances were used for each test run to further reduce the impact on any one test of the “lame instance” or “noisy neighbor” effects sometimes experienced in cloud environments.
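As a rough illustration of why repeating each test helps, the following Python sketch summarizes several runs of the same test by their mean and spread. The function name and the sample numbers are placeholders for illustration only; they are not End Point's tooling or results.

```python
from statistics import mean, stdev


def summarize_runs(ops_per_sec):
    """Summarize repeated runs of one test: mean, sample stdev, and relative spread."""
    avg = mean(ops_per_sec)
    spread = stdev(ops_per_sec) if len(ops_per_sec) > 1 else 0.0
    return {
        "mean": avg,
        "stdev": spread,
        # Coefficient of variation: a large value suggests noisy runs.
        "cv": spread / avg if avg else 0.0,
    }


# Placeholder throughput figures for three runs (NOT measured data).
runs = [1000.0, 1100.0, 900.0]
summary = summarize_runs(runs)
print(summary)
```

A high relative spread across runs would be a signal to investigate instance variability before comparing databases.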
NoSQL Database Performance Testing Results
Read/Write mix workload
A read/write mix workload is an indicator of the throughput, or transactions per second, that an OLTP database environment can achieve. An environment’s ability to handle increased throughput is critical because it shows how well the database will handle growing levels of business, and therefore how well it can scale. Cassandra has been shown to handle growing transaction rates much more capably and efficiently than other environments, helping ensure that your business can scale and be successful.
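One simple way to quantify how well throughput scales is to compare measured throughput against the ideal linear figure extrapolated from a smaller cluster. The Python sketch below does this; the function name and the numbers are hypothetical and are not taken from the benchmarks discussed here.

```python
def scaling_efficiency(baseline_nodes: int, baseline_ops: float,
                       nodes: int, ops: float) -> float:
    """Measured throughput as a fraction of ideal linear scaling.

    1.0 means perfectly linear scaling from the baseline cluster;
    values below 1.0 mean throughput grew more slowly than node count.
    """
    ideal_ops = baseline_ops * nodes / baseline_nodes
    return ops / ideal_ops


# Placeholder figures (NOT benchmark results): going from 2 to 4 nodes.
efficiency = scaling_efficiency(baseline_nodes=2, baseline_ops=10_000,
                                nodes=4, ops=19_000)
print(f"{efficiency:.2f}")
```

Tracking this ratio as nodes are added during a POC gives a direct check on any linear-scalability claim for your own workload.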
Read latency across all workloads
For business applications, latency is a key productivity metric. It describes the amount of time a transaction, or set of transactions, needs to complete. Excessive latency can be a killer for a business that depends on a user experience or transaction rate that must be as close to instant as possible. Longer latency degrades the user experience, as Amazon was able to quantify: it determined that every 100 ms of latency cost it 1% of sales revenue. Latency typically increases with the workload, so deploying a database environment such as Cassandra can help keep latency in check at scale.
NoSQL Database Conclusion
The metrics analyzed in this review are just a few of the many that have solidified Apache Cassandra as the NoSQL database of choice for the business and technical leaders of hundreds of companies. Each database option (Cassandra, HBase and MongoDB) will certainly shine in some scenarios, so it’s important to select one that can adjust to the needs of your business today and tomorrow, whether you are primarily concerned with throughput or latency, as seen previously, or more interested in architectural benefits such as having no single point of failure or elastic scalability that keeps your environment in step with the pace of your business. While HBase is a very good option for analytics (HBase running on top of Hadoop is a mainstream option), it is not a good choice for business applications that rely on high transaction rates. It is also very complex from an operational standpoint and can have issues supporting multiple data centers.
MongoDB is a great choice for smaller environments or those that will not require much in the way of scalability. However, MongoDB lacks many of the features beyond scalability that enterprises demand, such as an architecture with no single point of failure, guaranteed uptime and a track record of multi-datacenter support.
One other aspect to consider – and certainly just as important as any of the technical aspects – is the level of activity in the community around the platform. The Apache Cassandra platform provides superior capabilities that help ensure the success of your applications, and in turn your business: massive scale (in both storage growth and throughput), success with multi-datacenter installations, manageability and more. In addition, the Apache Cassandra community is well represented by users of the platform across many industries and in a variety of roles, both technical and business-centric.