
Anticompaction in Cassandra 2.1

July 28, 2014


Cassandra 2.1 introduces incremental repairs, which make repair much more lightweight by skipping data that has already been repaired. Anticompaction is one of the things that makes incremental repairs possible. This blog post aims to explain what anticompaction is and how it affects regular compaction.

Incremental repairs

The concept of incremental repairs is explained in more detail in this post, but the general idea is that we mark every SSTable involved in a repair with a timestamp to indicate that it was repaired. We can then simply skip those SSTables in the next repair, since data that was once repaired will stay that way.


Since an SSTable can contain any range, we need to split out the ranges that were actually repaired; this is called anticompaction. It means that one SSTable is split in two: one containing repaired data and one containing unrepaired data.
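As a toy illustration of the idea (the list-based "SSTable", the token values, and the function name are all invented for this sketch, not Cassandra internals), anticompaction can be pictured as partitioning a table's keys around the repaired token range:

```python
# Toy model of anticompaction: one "SSTable" (here just a list of token
# values) is split into a repaired table holding tokens inside the
# repaired range, and an unrepaired table holding everything else.
def anticompact(sstable, repaired_lo, repaired_hi):
    repaired = [t for t in sstable if repaired_lo <= t <= repaired_hi]
    unrepaired = [t for t in sstable if not (repaired_lo <= t <= repaired_hi)]
    return repaired, unrepaired

repaired, unrepaired = anticompact([5, 17, 42, 99], 10, 50)
print(repaired)    # [17, 42]
print(unrepaired)  # [5, 99]
```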

If the range in the SSTable is fully contained within the range that was repaired, we don’t actually rewrite the SSTable; instead we just change the SSTable metadata to indicate when it was repaired.
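The fully-contained shortcut boils down to a simple range check, which can be sketched like this (the function and parameter names are illustrative, not Cassandra code):

```python
# Sketch: if the SSTable's token range sits entirely inside the repaired
# range, only its metadata needs updating; otherwise the file must be
# split (anticompacted).
def needs_rewrite(sstable_min, sstable_max, repaired_lo, repaired_hi):
    fully_contained = repaired_lo <= sstable_min and sstable_max <= repaired_hi
    return not fully_contained

print(needs_rewrite(20, 40, 10, 50))  # False: metadata-only update
print(needs_rewrite(5, 40, 10, 50))   # True: must be split
```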

Since we now have two sets of SSTables that we can’t compact together, we need to do some adjustments to our compaction strategies.

Size tiered compaction

With size tiered compaction it is pretty simple: we split the SSTables into two sets, one repaired and one unrepaired. We then look for compaction candidates within each of those sets and run compaction on the candidates that would give the biggest benefit.
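A rough sketch of that two-set candidate selection (the field layout, the 50% size tolerance, and the bucketing-by-first-element rule are invented for illustration; Cassandra's actual size-tiered implementation differs):

```python
# Partition SSTables by their repaired flag, then bucket SSTables of
# similar size within each set; each bucket is a compaction candidate.
def buckets(sizes, tolerance=0.5):
    out = []
    for size in sorted(sizes):
        # Join the current bucket if within tolerance of its smallest member.
        if out and size <= out[-1][0] * (1 + tolerance):
            out[-1].append(size)
        else:
            out.append([size])
    return out

# (size_in_mb, is_repaired) pairs, purely illustrative
sstables = [(10, True), (12, True), (100, True), (11, False), (13, False)]
repaired = [s for s, r in sstables if r]
unrepaired = [s for s, r in sstables if not r]
print(buckets(repaired))    # [[10, 12], [100]]
print(buckets(unrepaired))  # [[11, 13]]
```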

Major compaction will create two SSTables instead of one. If you use major compaction to get rid of tombstones, it should still work just as well: we can drop a tombstone if the SSTables included in the compaction contain the tombstone plus all older occurrences of its key, and here they will, since the unrepaired set of SSTables will almost always contain only newer data. This means we do not need to check whether the key is in one of the unrepaired SSTables, as the tombstone cannot cover any data there.

Leveled compaction

For leveled compaction we do leveling on the repaired SSTables and size tiered compaction on the unrepaired ones. This means that once you start doing incremental repairs you will have to keep doing them (there are ways to clear out the repair state and revert this; more about that later). Otherwise you will not run leveled compaction at all, just size tiered.

The complicated part is how to migrate to incremental repairs: only after the first incremental repair run do we want to separate repaired and unrepaired SSTables; before that, we want a leveling over all the SSTables.

After the incremental repair is done, we iterate over the SSTables included in the repair and run anticompaction on them one at a time. This means that after the first SSTable has been anticompacted, we have to move all the currently leveled but unrepaired SSTables out, ending up with only that first repaired, anticompacted SSTable in the leveling and possibly thousands in the unrepaired set. After that we continue and anticompact the rest of the SSTables.

We do a few things to make this better, though. When we clear the unrepaired SSTables out of the leveling, we keep each SSTable’s original level so that it can be re-added at its old position once it has been anticompacted. For example, if an SSTable is in L3 before anticompaction, it is likely that we can put it back in L3 afterwards. This is especially important because we anticompact one SSTable at a time during an anticompaction session, so many SSTables go into the unrepaired set only temporarily: they have already been repaired, just not yet anticompacted.
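The level bookkeeping can be pictured with a small sketch (the dictionary-based bookkeeping and names are invented for illustration; this is not Cassandra's implementation):

```python
# Sketch: remember each SSTable's level when it is moved out of the
# leveling during anticompaction, so it can be re-added near its old
# position afterwards.
saved_levels = {}

def move_to_unrepaired(sstable_name, level):
    saved_levels[sstable_name] = level

def readd_after_anticompaction(sstable_name):
    # Fall back to L0 if we never recorded this SSTable's old level.
    return saved_levels.pop(sstable_name, 0)

move_to_unrepaired("sstable-17", 3)
print(readd_after_anticompaction("sstable-17"))  # 3
```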


Running the first incremental repair will affect many nodes at the same time. To avoid that, there is a way to migrate one node at a time, though it requires a bit of manual labour:

  1. Disable compaction on the node (nodetool disableautocompaction)
  2. Run a classic full repair
  3. Stop the node
  4. Use the sstablerepairedset tool to mark as repaired all SSTables that were created before you did step 1
  5. Restart Cassandra

If you run regular repairs, you can note when you last ran a full repair on the node and use that time. SSTables are immutable, so if an SSTable has not changed since that repair started, it is still repaired. Note that you need to check when you last ran a full repair (not -pr), and you will need to do this on every node.

This tool can also be used to clear out the repaired state on an SSTable: stop the node, run tools/bin/sstablerepairedset --is-unrepaired <sstable> on all SSTables, and restart. All your data will now be leveled again.

“Anticompaction in Cassandra 2.1” was created by Marcus Eriksson

Revisiting 1 Million Writes per second

July 25, 2014

In an article we posted in November 2011, Benchmarking Cassandra Scalability on AWS – Over a million writes per second, we showed how Cassandra (C*) scales linearly as you add more nodes to a cluster. With the advent of new EC2 instance types, we decided to revisit this test. Unlike the initial post, we were not interested in proving C*’s scalability. Instead, we were looking to quantify the performance these newer instance types provide.
What follows is a detailed description of our new test, as well as the throughput and latency results of those tests.

Node Count, Software Versions & Configuration

The C* Cluster

The Cassandra cluster ran DataStax Enterprise 3.2.5, which incorporates C*. The C* cluster had 285 nodes. The instance type used was i2.xlarge. We ran JVM 1.7.40_b43 and set the heap to 12GB. The OS was Ubuntu 12.04 LTS. Data and logs were on the same mount point, formatted EXT3.
You will notice that the previous test used m1.xlarge instances. Although we could have had similar write throughput results with that less powerful instance type, in Production, for the majority of our clusters, we read more than we write. The choice of i2.xlarge (an SSD-backed instance type) is more realistic and better showcases read throughput and latencies.
The full schema follows:
create keyspace Keyspace1
 with placement_strategy = 'NetworkTopologyStrategy'
 and strategy_options = {us-east : 3}
 and durable_writes = true;
use Keyspace1;
create column family Standard1
 with column_type = 'Standard'
 and comparator = 'AsciiType'
 and default_validation_class = 'BytesType'
 and key_validation_class = 'BytesType'
 and read_repair_chance = 0.1
 and dclocal_read_repair_chance = 0.0
 and populate_io_cache_on_flush = false
 and gc_grace = 864000
 and min_compaction_threshold = 999999
 and max_compaction_threshold = 999999
 and replicate_on_write = true
 and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
 and caching = 'KEYS_ONLY'
 and column_metadata = [
   {column_name : 'C4',
   validation_class : BytesType},
   {column_name : 'C3',
   validation_class : BytesType},
   {column_name : 'C2',
   validation_class : BytesType},
   {column_name : 'C0',
   validation_class : BytesType},
   {column_name : 'C1',
   validation_class : BytesType}]
 and compression_options = {'sstable_compression' : ''};
You will notice that min_compaction_threshold and max_compaction_threshold were set high. Although we don’t set these parameters to exactly those values in Production, this reflects the fact that we prefer to control when compactions take place and initiate a full compaction on our own schedule.

The Client

The client application used was Cassandra Stress. There were 60 client nodes. The instance type used was r3.xlarge. This instance type has half the cores of the m2.4xlarge instances we used in the previous test. However, the r3.xlarge instances were still able to push the load required to reach the same throughput (while using 40% fewer threads) at almost half the price. The client ran JVM 1.7.40_b43 on Ubuntu 12.04 LTS.

Network Topology

Netflix deploys Cassandra clusters with a Replication Factor of 3. We also spread our Cassandra rings across 3 Availability Zones. We equate a C* rack to an Amazon Availability Zone (AZ). This way, in the event of an Availability Zone outage, the Cassandra ring still has 2 copies of the data and will continue to serve requests.
In the previous post, all clients were launched from the same AZ. This differs from our actual production deployment, where stateless applications are also deployed equally across three zones. Clients in one AZ always attempt to communicate with C* nodes in the same AZ. We call this zone-aware connections. This feature is built into Astyanax, Netflix’s C* Java client library. As a further speed enhancement, Astyanax also inspects the record’s key and sends requests to nodes that actually serve the token range of the record about to be written or read. Although any C* coordinator can fulfill any request, if the node is not part of the replica set there will be an extra network hop. We call this making token-aware requests.
Since this test uses Cassandra Stress, I do not use token-aware requests. However, through some simple grep and awk-fu, this test is zone-aware. This is more representative of our actual production network topology.

Latency & Throughput Measurements

We’ve documented our use of Priam as a sidecar to help with token assignment, backups & restores. Our internal version of Priam adds some extra functionality. We use the Priam sidecar to collect C* JMX telemetry and send it to our Insights platform, Atlas. We will be adding this functionality to the open source version of Priam in the coming weeks.
Below are the JMX properties we collect to measure latency and throughput.


  • AVG & 95%ile Coordinator Latencies
    • Read
      • StorageProxyMBean.getRecentReadLatencyHistogramMicros() provides an array from which the AVG & 95%ile can be calculated
    • Write
      • StorageProxyMBean.getRecentWriteLatencyHistogramMicros() provides an array from which the AVG & 95%ile can be calculated


  • Coordinator Operations Count
    • Read
      • StorageProxyMBean.getReadOperations()
    • Write
      • StorageProxyMBean.getWriteOperations()
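Deriving the average and 95th percentile from a histogram array like the ones above can be sketched as follows. The bucket boundaries and counts here are invented for the example; the real JMX histograms use Cassandra's own bucket offsets, and attributing each bucket's operations to its upper bound is only an approximation:

```python
# Approximate AVG and 95th percentile from a latency histogram.
# bounds[i] is the upper bound (microseconds) of bucket i; counts[i]
# is how many operations fell into that bucket.
def avg_and_p95(bounds, counts):
    total = sum(counts)
    avg = sum(b * c for b, c in zip(bounds, counts)) / total
    target = 0.95 * total
    seen = 0
    for b, c in zip(bounds, counts):
        seen += c
        if seen >= target:
            return avg, b
    return avg, bounds[-1]

bounds = [100, 500, 1000, 5000, 10000]  # illustrative bucket bounds
counts = [50, 30, 10, 8, 2]             # illustrative operation counts
avg, p95 = avg_and_p95(bounds, counts)
print(avg, p95)  # 900.0 5000
```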

The Test

I performed the following 4 tests:
  1. A full write test at CL One
  2. A full write test at CL Quorum
  3. A mixed test of writes and reads at CL One
  4. A mixed test of writes and reads at CL Quorum

100% Write

Unlike in the original post, this test shows a sustained >1 million writes/sec. Not many applications will only write data. However, a possible use case for this type of footprint is a telemetry system or the backend of an Internet of Things (IoT) application. The data can then be fed into a BI system for analysis.

CL One

Like in the original post, this test runs at CL One. The average coordinator latency is a little over 5 milliseconds, with a 95th percentile of 10 milliseconds.
Every client node ran the following Cassandra Stress command:
cassandra-stress -d [list of C* IPs] -t 120 -r -p 7102 -n 1000000000  -k -f [path to log] -o INSERT


CL Quorum

For the use case where a higher level of consistency for writes is desired, this test shows the throughput achieved when the million writes per second test runs at a CL of LOCAL_QUORUM.
The write throughput is hugging the 1 million writes/sec mark at an average coordinator latency of just under 6 milliseconds and a 95th percentile of 17 milliseconds.
Every client node ran the following Cassandra Stress command:


Mixed – 10% Write 90% Read

1 Million writes/sec makes for an attractive headline. Most applications, however, have a mix of reads and writes. After investigating some of the key applications at Netflix, I noticed a mix of 10% writes and 90% reads, so this mixed test reserves 10% of the total threads for writes and 90% for reads. The test is unbounded, which is still not representative of the actual footprint an app might experience. However, it is a good indicator of how much throughput the cluster can handle and what the latencies might look like when it is pushed hard.
To avoid reading data from memory or from the file system cache, I let the write test run for a few days until a compacted data to memory ratio of 2:1 was reached.

CL One

C* achieves the highest throughput and highest level of availability when used in a CL One configuration. This does require developers to embrace eventual consistency and to design their applications around this paradigm. More info on this subject can be found here.
The Write throughput is >200K writes/sec with an average coordinator latency of about 1.25 milliseconds and a 95th percentile of 2.5 milliseconds.
The Read throughput is around 900K reads/sec with an average coordinator latency of 2 milliseconds and a 95th percentile of 7.5 milliseconds.
Every client node ran the following 2 Cassandra Stress commands:


CL Quorum

Most application developers starting off with C* will default to CL Quorum writes and reads. This gives them the opportunity to dip their toes into the distributed database world without also having to tackle the extra challenge of rethinking their applications around eventual consistency.
The Write throughput is just below the 200K writes/sec with an average coordinator latency of 1.75 milliseconds and a 95th percentile of 20 milliseconds.
The Read throughput is around 600K reads/sec with an average coordinator latency of 3.4 milliseconds and a 95th percentile of 35 milliseconds.
Every client node ran the following 2 Cassandra Stress commands:


The total costs involved in running this test include the EC2 instance costs as well as the inter-zone network traffic costs. We use Boundary to monitor our C* network usage; it showed that we were transferring a total of about 30Gbps between Availability Zones.
Here is the breakdown of the costs incurred to run the 1 million writes per/second test. These are retail prices that can be referenced here.
Instance Type / Item | Cost per Minute | Total Price per Minute
Inter-zone traffic | $0.01 per GB | 3.75 GBps * 60 = 225GB per minute

Total Cost per minute | Total Cost per half Hour | Total Cost per Hour
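As a sanity check on the inter-zone line item, the arithmetic behind the 225 GB-per-minute figure can be worked through directly (using the quoted retail price; this is the arithmetic only, not Amazon's full billing rules):

```python
# Inter-zone traffic cost per minute at the quoted retail price:
# 30 Gbps = 3.75 GB/s; over 60 seconds that is 225 GB, at $0.01/GB.
gbps = 30                         # gigabits per second between AZs
gb_per_sec = gbps / 8             # 3.75 gigabytes per second
gb_per_min = gb_per_sec * 60      # 225 GB transferred per minute
cost_per_min = gb_per_min * 0.01  # dollars per minute
print(gb_per_min, cost_per_min)   # 225.0 2.25
```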

Final Thoughts

Most companies probably don’t need to process this much data. For those that do, this is a good indication of what types of cost, latencies and throughput one could expect while using the newer i2 and r3 AWS instance types. Every application is different, and your mileage will certainly vary.
This test was performed over the course of a week during my free time. This isn’t an exhaustive performance study, nor did I get into any deep C*, system, or JVM tuning. I know you can probably do better. If working with distributed databases at scale and squeezing out every last drop of performance is what drives you, please join the Netflix CDE team.
“Revisiting 1 Million Writes per second” was created by Christos Kalantzis, Engineering Manager of Cloud Database Engineering at Netflix

Apache Cassandra & Python for The New York Times ⨍aбrik Platform

July 24, 2014


In this session, you’ll learn how Apache Cassandra is used with Python in the NY Times ⨍aбrik messaging platform. Michael will start his talk with an overview of the NYT⨍aбrik global message bus platform and its “memory” features, then discuss their use of the open source Apache Cassandra Python driver by DataStax. Progressive benchmarks testing features and performance will be presented, from naive and synchronous to asynchronous with multiple IO loops, all tailored to usage at the NY Times. Code snippets, followed by beer, for those who survive. All code available on GitHub!

Presentation by Michael Laing, Infrastructure Architect at The New York Times

Global Gaming Company DeNA Levels Up Their Apache Cassandra EXP to Track Yours

July 23, 2014









“We had several criteria when we were evaluating NoSQL solutions: write speed, ability to handle large data (blob), ability to scale and fault tolerance.”

- Ashley Martens, Developer at DeNA






DeNA is primarily a mobile gaming company. I am currently an engineer who assists game teams.


Evolving from MySQL to Cassandra

At the time we were using MySQL for our solution, but we had hit its blob limits. We had several criteria when we were evaluating NoSQL solutions: write speed, ability to handle large data (blob), ability to scale, and fault tolerance. We evaluated Cassandra and MongoDB, but at the time MongoDB’s global write lock killed its performance.


Cassandra for game states

We use Cassandra as a large data store, mostly for things like game state. As a player progresses through a game they accumulate items, experience, etc., and Cassandra helps keep track of that. For games that do not want to run a server to track this data, we have a service that acts as a per-user, per-game key-value store.

We use two versions of Cassandra, 0.6 and 0.7, because of the cluster downtime required to upgrade and the Ruby driver’s inability to talk to two clusters of different versions at once.



We are in two data centers with 15 to 20 nodes per data center. In each data center we are storing around 10 TB of data. We modeled our IP ranges for the rack inferring snitch and use the network topology strategy to distribute the data.


Words of wisdom

Be proactive in adding nodes to the cluster. Don’t be afraid to bounce every machine in your cluster. Invest in tools for managing your cluster. Pay attention to the logs. Make sure you understand the data model and it fits your use case. Try to stay near current stable.


Community involvement

When I needed help the community was there for me. Now there aren’t that many people that know the older versions so I spend a lot of time in the source code to solve problems.

60+ Reasons to Attend the Cassandra Summit San Francisco

July 23, 2014


The agenda for this year’s Cassandra Summit is now available and we couldn’t be more excited. Time and again over the past four years fantastic speakers have stepped up to share their use cases, knowledge, and passion for Apache Cassandra and this year we welcome a host of new speakers including Anthony Vigil from Disney, Alexander Filipchik from Sony, John Sumison from Family Search, Robbie Strickland from The Weather Channel, Sean O Sullivan and Tim Czerniak from Demonware, Gary Stewart from ING, Rogers Stephens from FedEx, Roopa Tangilara from Netflix, Dan Cundiff from Target and many more. Go explore all the sessions, and put together your own agenda of talks you don’t want to miss.

If you still haven’t planned your attendance, register now and don’t miss out on this year’s one-and-only world’s largest gathering of Cassandra developers.

Thank you so much to everyone who submitted abstracts. The competition for spots was fierce and unfortunately we could not accommodate everyone. If you submitted, but were not accepted this year, we’d love to work with you on a presentation at a meetup, Cassandra Day, or webinar, so please let us know by emailing

So, in addition to great technical content, what else can you expect at this year’s Summit? Well… we heard you loud and clear about some of the room sizes and facilities last year, so this year there is no breakout room with a capacity under 200. We are also hosting receptions on both evenings to allow more time for networking and fun. We’ll also feature a zone where folks can see great applications built on Apache Cassandra, and we will welcome more sponsors than ever as the ecosystem around Cassandra continues to mature. Don’t miss the opportunity to get trained on Cassandra, whether you are a newbie or fancy yourself a veteran; Day 1 on September 10th features three training tracks: one for beginners, one for advanced data modeling, and one for performance tuning.

We hope to see you at Summit, but if you can’t join and want a bit of Summit flavor, stay tuned for where you can attend a meetup group to watch Jonathan Ellis’ keynote live. This year promises to offer a dive into the new features of Cassandra 2.1 which will be generally available by then.
