Brent Theisen, Lead Developer at Womply
Womply is a next-generation payments data company. We partner with merchant-facing companies, including credit card processors and acquirers, to analyze data for millions of merchants across the US.
My role at Womply is lead developer, which lets me work on a wide variety of technologies. On any given day I might be automating our VM provisioning with Chef, writing an ETL process in Java/Scala, or implementing a new feature in Ruby on Rails.
Cassandra and Spark
We use Cassandra for dashboards that allow merchants to analyze their revenue and compare it against merchant aggregates in their area and/or vertical.
We evaluated HBase pretty heavily but found that its operational demands were much greater than Cassandra's.
We are currently using Cassandra 2.0.8 in combination with Apache Spark 1.0. The revenue data we collect gets stored directly in Cassandra. From there we use Apache Spark to precompute and persist to Cassandra several time series aggregates with partition keys like category, city/state and nearest merchants.
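As a rough illustration of what such a precomputed aggregate table might look like, here is a hypothetical schema sketch (the table and column names are assumptions, not Womply's actual schema). The partition key groups a whole time series under one partition, and the clustering column orders it by day:

```sql
-- Hypothetical sketch of a precomputed daily-aggregate table,
-- keyed the way the article describes (e.g. by category).
CREATE TABLE daily_revenue_by_category (
    category  text,       -- partition key: one partition per category
    day       timestamp,  -- clustering column: time-ordered within the partition
    revenue   decimal,
    txn_count bigint,
    PRIMARY KEY (category, day)
);
```

Queries for a single category's time series then hit exactly one partition, which is what makes this access pattern cheap in Cassandra.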
Our revenue data needs to be aggregated into several different kinds of time series, which is an excellent fit for Cassandra: storing time series data is one of its sweet spots and a core feature of our product.
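To make the kind of roll-up concrete, here is a minimal plain-Scala sketch of aggregating raw transactions into a per-merchant daily revenue series. The names (`Txn`, `dailyRevenue`) are hypothetical; in production this logic would run as a distributed Spark job rather than over an in-memory collection.

```scala
import java.time.LocalDate

// Hypothetical transaction record; field names are assumptions.
case class Txn(merchantId: String, date: LocalDate, amountCents: Long)

// Roll raw transactions up into a (merchant, day) -> revenue time series.
def dailyRevenue(txns: Seq[Txn]): Map[(String, LocalDate), Long] =
  txns.groupBy(t => (t.merchantId, t.date))
      .map { case (key, ts) => key -> ts.map(_.amountCents).sum }

val sample = Seq(
  Txn("m1", LocalDate.of(2014, 7, 1), 1500),
  Txn("m1", LocalDate.of(2014, 7, 1), 2500),
  Txn("m1", LocalDate.of(2014, 7, 2), 1000)
)
// dailyRevenue(sample) groups the first two transactions into one day's total
```

The same shape of computation (group by a composite key, then reduce) is what the Spark jobs perform, with the results written back into Cassandra aggregate tables.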
It’s also relatively easy to manage from an operations perspective. On top of that, having no single point of failure (SPOF) is great, and it is easy to scale up/down. The Chef cookbook support is also beneficial.
Lastly, Cassandra is very affordable. Our cluster uses entirely open source software so there are no licensing fees to pay.
We have one Cassandra data center that also runs Spark locally on each node. Spark lets you set maximum core and RAM usage thresholds, so we've capped Spark at half of the virtual cores on each EC2 instance. This has allowed us to run Spark jobs on the same nodes that serve real-time queries from our web application, with minimal performance degradation.
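In Spark's standalone mode, per-node caps like this are typically set in `conf/spark-env.sh`. The values below are illustrative only (assuming a 4-vCPU instance), not Womply's actual configuration:

```shell
# conf/spark-env.sh on each node (illustrative values):
# cap the worker at half of a 4-vCPU instance's cores,
# leaving the rest for Cassandra's real-time reads.
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=4g
```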
If you need to run massively parallel processing jobs, consider using Spark instead of MapReduce. Spark lets you write jobs more quickly, and they run much faster. Getting it installed and integrated with Cassandra is also much easier and cheaper than with Hadoop.
Our experience with the community has been really good. While setting up our initial proof of concept I found a bug in the Cassandra Hadoop input format. The patch I submitted was merged up to trunk within 24 hours and was included in a release a few days after that. We’ve also had similarly good experiences with other open source projects in the Cassandra orbit like Calliope and the Cassandra Chef cookbook.