This article is one in a series of quick-hit interviews with companies using Apache Cassandra and/or DataStax Enterprise for key parts of their business. For this interview, we talked with Aaron Stannard, who is the founder and CTO at MarkedUp.
DataStax: Aaron, thanks for taking the time to chat with us today. What’s MarkedUp all about?
Aaron: I would describe MarkedUp as “Omniture for native apps”. MarkedUp is a premium analytics solution for software developers, currently those who work on the Windows platform. We help them understand how their app is used every day and we log crashes, but we also do things like provide data that helps them boost the trials of their software and ultimately the amount of software that they sell.
DataStax: How does your product work?
Aaron: There are a number of distinct components to our product. The MarkedUp SDK ends up being embedded inside a customer’s application. This helps us instrument and capture all the app data that we need.
When an app is run, it sends data to our API servers, which map the data to a specific application and in turn send it on to our database, which is DataStax Enterprise. So that’s our data collection piece.
Another component that the customer interacts with is our web-based analytics reporting dashboard that presents all the information on how their app is being used, where it’s being used, and much more. All of the analytics are served up by DataStax Enterprise.
We have a few hundred customers right now, and quite a number of large customers that use us every day. The large ones especially bring us a hefty volume of data to manage, which causes some scalability issues. This is ultimately why we turned to Apache Cassandra and DataStax.
DataStax: What does your database look like?
Aaron: We currently run our database cluster on Ubuntu Linux on AWS and used the DataStax AMI to get things going, which was very convenient and easy to use. However, we’ll likely be moving off of AWS and onto our own high-end blade servers soon.
Our database cluster has about 80% of its resources devoted to Cassandra, 10% to Hadoop, and 10% to Solr. One nice thing that happened when we migrated our data over to Cassandra was that, even though we’ve replicated the data 3x across our cluster for redundancy, Cassandra’s built-in compression reduced the data footprint by 75%.
We make heavy use of Cassandra’s distributed counters; our API constructs update these counters in real time as things happen. We do it this way vs. storing such things in Hadoop because it lets us respond to our customers in real time as the events are actually occurring.
Counters are great because they’re highly available, but they’re not good for everything. So, for example, when we report on the amount of total sales that a customer does and other similar things, we use Hadoop, and specifically Hive for that.
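To make the counter idea concrete, here is a minimal in-memory sketch in Python. It is not MarkedUp's code and it does not use Cassandra at all: the `EventCounters` class, the hourly bucket size, and the `app-1` and `crash` names are all assumptions made up for illustration. It only shows the general pattern of bumping a per-app, per-event, per-time-bucket counter as each event arrives, so that a dashboard read is a cheap lookup rather than a query over raw events.

```python
from collections import Counter
from datetime import datetime


class EventCounters:
    """Hypothetical in-memory stand-in for a distributed counter table."""

    def __init__(self):
        # Key: (app_id, event_name, hour bucket) -> running count.
        self._counts = Counter()

    def record(self, app_id, event, when):
        # Truncate the timestamp to the hour (assumed bucket size).
        hour = when.replace(minute=0, second=0, microsecond=0)
        self._counts[(app_id, event, hour)] += 1

    def read(self, app_id, event, hour):
        # Buckets that were never incremented read as 0.
        return self._counts[(app_id, event, hour)]


counters = EventCounters()
counters.record("app-1", "crash", datetime(2013, 5, 1, 10, 15))
counters.record("app-1", "crash", datetime(2013, 5, 1, 10, 45))
counters.record("app-1", "crash", datetime(2013, 5, 1, 11, 5))
```

The trade-off in the sketch mirrors the one described above: increments are cheap and immediately readable, but a plain count cannot answer questions such as total sales, which need the underlying records and a batch job.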
DataStax: What about Solr? How do you use that?
Aaron: For us, Solr solves two specific problems. First, counting large objects such as the total number of users across all apps and across a date range in a highly performant manner is actually a pretty tough computer science problem to solve. But Solr allows us to get back very fast results in an easy manner.
Second, customers want an easy way to search through things like their logs, and Solr’s search functionality makes this easy and fast to do.
DataStax: How do you manage and monitor everything?
Aaron: Actually, we just upgraded to version 3.0 of OpsCenter. We use OpsCenter primarily for performance monitoring, and for things like the built-in alerts so we can be notified about key events like a node going down. We also use OpsCenter for backups. That was the biggest thing we’ve noticed in the new version so far: the ability to easily do restores from backups.
DataStax: Is DataStax Enterprise all that you use for database work?
Aaron: We’re ex-Microsoft guys so we do use some SQL Server to manage some small things like our customers’ subscription plans and such where an RDBMS makes sense.
We had built a SQL Server-based analytics system at another company using Analysis Services, and we originally looked at doing the same thing here. But honestly, the hardware requirements and cost for handling the type of data we do, not to mention how unpredictable the load can be, would have been enormous. We knew pretty early on SQL Server wasn’t going to scale for us the way DataStax Enterprise does.
DataStax: You wrote a very descriptive blog post on how you evaluated various NoSQL options for your company. If you were to boil it down to a couple of reasons why you went with DataStax Enterprise, what would they be?
Aaron: First, Cassandra eliminated the need for us to constantly query over our dataset, which was a huge source of problems in our previous system. Our data is time series in format, and Cassandra is really fast at storing and retrieving slices of time over a large dataset. We can keep our data real-time, which makes our customers happy.
That need narrowed the list of NoSQL competitors to just HBase and Cassandra. Being ex-Microsoft guys, we’re not exactly Linux or Java gods, so we were looking for something that had all the pieces we knew we’d need for analytics and search work. DataStax Enterprise brought us everything we needed in a turn-key solution, which made things much easier for us to use and manage.
The whole DataStax package caused the NoSQL learning curve for us to drop dramatically. We were live with our new system in one-third the time it would have taken us if we’d gone down another path.
Lastly, seeing all the big companies out there that use DataStax and Cassandra made us feel real comfortable in going down the road we have. The talk we saw at the last Cassandra Summit by eBay really made a big impression on us because they’re doing time series data and analysis on top of it, which is exactly what we do. We felt like we were investing in a proven solution that would be easy to manage in production. So far, that’s been the case.
DataStax: What advice would you give new people coming to NoSQL?
Aaron: CQL [the Cassandra Query Language] can prove to be a double-edged sword. Again, we’re ex-Microsoft guys, so while CQL does help make the jump to NoSQL easy, you have to remember that the underlying data model is different and you need to work differently in Cassandra vs. the relational world.
For example, with Cassandra, you can have very wide rows with the advantage being that all the columns are stored next to each other and sorted. One thing we do, for example, is have a column name that’s the actual date/time of some event and then right next to it we’ll have a counter. This makes it real easy to just grab the data we want for a particular date range.
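The wide-row pattern described above can be sketched in plain Python. This is only an illustration of the idea, not Cassandra's storage engine or MarkedUp's schema: the `WideRow` class and its method names are hypothetical. The point it shows is that when column names are timestamps kept in sorted order, each with a counter beside it, fetching a date range is a contiguous slice rather than a scan.

```python
import bisect
from datetime import datetime


class WideRow:
    """Hypothetical model of one wide row: sorted timestamp columns, each holding a counter."""

    def __init__(self):
        self._names = []   # column names (timestamps), always kept sorted
        self._values = {}  # column name -> counter value

    def increment(self, ts, by=1):
        if ts not in self._values:
            # Insert the new column name at its sorted position.
            bisect.insort(self._names, ts)
            self._values[ts] = 0
        self._values[ts] += by

    def slice(self, start, end):
        # Binary-search both ends, then read the contiguous run between them.
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, end)
        return [(ts, self._values[ts]) for ts in self._names[lo:hi]]


row = WideRow()
# Columns may arrive out of order; the row keeps them sorted.
row.increment(datetime(2013, 5, 3))
row.increment(datetime(2013, 5, 1))
row.increment(datetime(2013, 5, 2), by=4)
row.increment(datetime(2013, 5, 1))

may_1_to_2 = row.slice(datetime(2013, 5, 1), datetime(2013, 5, 2))
```

Here `may_1_to_2` holds only the columns for May 1 and May 2, in time order, without touching the May 3 column.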
DataStax: Aaron, thanks for taking time to chat with us.
Aaron: Sure thing.