March 7th, 2013

Shodan, LLC

John Matherly | Founder 

We are currently migrating the bulk of our data (~1 billion rows) from MongoDB to Cassandra. The new database should go live on the production website within the next week, and all newly collected data is already being written solely to Cassandra.

You’re probably wondering why we made the switch. Well, here’s our story:

MySQL to MongoDB

The project started off with MySQL, but it became too cumbersome and something less structured was needed. MongoDB is very developer-friendly and easy to get started with (especially as a single-server installation). It did rather well for a long time, but I reached a point where writing to it got too slow (tons of page faults), and the sharding/replication model meant that new servers wouldn’t significantly help with write throughput (writes >>> reads for me). I did end up trying their replica sets just to see whether it would be feasible, so I could avoid switching technologies, but it ended up failing at very basic things (setting up the initial replica failed due to a known bug in the way it handles unique indices). I don’t want to spend the majority of my time on operations, therefore scaling MongoDB horizontally was out of the question.

Maybe Riak?

I then looked at Riak, but the write throughput was disappointing on my commodity hardware (3-node cluster, tried both LevelDB and Bitcask) and Python support was lackluster (problems with protobufs at the time). It was very easy to set up and has a nice administration interface, but the lack of a major developer ecosystem and the iffy write performance made me hesitant, so I continued looking around.

Finally, Cassandra

I was researching some YCSB data in which Cassandra came out very favorably (I think it was the DataStax article in the 2012 review of Cassandra found here) and decided to check it out more seriously. Previously, I was under the impression that it was similar to setting up Hadoop/HDFS/HBase, and that was too much for me to manage by myself. I wanted something simpler. After looking into it a bit more, it turned out I was completely wrong: not only was it easy to set up, but there was also a user-friendly administration interface provided by DataStax.

A few hours later I was up and running with a 3-node cluster (I had used the DataStax apt repository to install; it can be found here), writing the same data as I wrote to MongoDB and taking nodes in and out of the cluster without any problems. I’ve had to remodel the way I store data and rethink how certain operations can be performed, but the linear scalability, ease of maintenance, good developer community and raw performance have made it worth the switch. My only nitpick so far is that I wish CQL3 offered some of the flexibility that the Thrift interface provides, but maybe that will be fleshed out in the future (I’m using Thrift for my stuff because I wanted dynamic columns and I couldn’t get that done nicely with CQL).
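To make the "dynamic columns" point concrete, here is a minimal sketch of the idea in plain Python. It is a toy in-memory model, not a Cassandra client and not the actual Shodan schema: the class name WideRowStore, the row key, and the "timestamp:port" column naming are all hypothetical, chosen only to show how a row can pick up arbitrary, sorted columns without declaring them up front.

```python
from collections import defaultdict


class WideRowStore:
    """Toy in-memory stand-in for a Thrift-style column family (not a Cassandra client)."""

    def __init__(self):
        # row key -> {column name -> value}; column names are never declared up front
        self._rows = defaultdict(dict)

    def insert(self, row_key, columns):
        """Add or overwrite any number of columns on a row."""
        self._rows[row_key].update(columns)

    def get(self, row_key, column_start="", column_finish=None):
        """Return (name, value) pairs in sorted name order, optionally limited to a name range."""
        row = self._rows.get(row_key, {})
        return [
            (name, row[name])
            for name in sorted(row)
            if name >= column_start
            and (column_finish is None or name <= column_finish)
        ]


store = WideRowStore()

# Hypothetical layout: one row per IP, one column per "timestamp:port" observation.
store.insert("198.51.100.7", {"2013-03-01T10:00:00:80": "HTTP/1.1 200 OK"})
store.insert("198.51.100.7", {"2013-03-02T11:30:00:22": "SSH-2.0-OpenSSH_5.9"})
store.insert("198.51.100.7", {"2013-02-27T09:15:00:21": "220 FTP ready"})

# Because columns sort by name, a time window becomes a simple column-range slice.
march = store.get("198.51.100.7", "2013-03-01", "2013-03-31~")
for name, value in march:
    print(name, value)
```

The design choice being illustrated is that the column name itself carries data (here a timestamp and port), so new observations are just new columns on an existing row. With a fixed, declared set of columns this kind of layout is harder to express, which is roughly the flexibility I was missing in CQL3.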

Additional Information

On a side note, using Cassandra has allowed me to increase the amount of data I can collect, and I’m now looking at writing upwards of 500 million items/month. Those are the basics of my use case and how Cassandra has fit into it. I think a lot of my sentiments are echoed by the community, and my ramblings above probably sound like a rehash of what a lot of Cassandra users say. By the way, I love DataStax OpsCenter!