Rashmi Aroskar: Senior Software Development Engineer at General Sentiment
Joining us for today’s Apache Cassandra use case discussion is Rashmi Aroskar of General Sentiment. Thanks, Rashmi. To start things off, what does General Sentiment do?
Billions of people express opinions daily on the Web. That is a lot of unstructured yet rich data. At General Sentiment we analyse, structure and distill that gargantuan amount of information into metrics that offer insight into opinion trends and why they occur.
Our customers use these building blocks of social intelligence in a variety of ways:
– Monitoring brand health
– Conducting silent surveys of aggregate opinion
– Finding shared social affinities between a brand’s or topic’s social media audience and other brands or topics.
That’s great. How are you using Apache Cassandra?
We use Apache Cassandra to host all our text analytics. Cassandra handles storage and hosting, in tandem with Hadoop MapReduce for processing. Making optimal use of a Cassandra database depends on how you structure your schema. In our case, we have structured the schema so that we can present the same data in different ways depending on the user's query.
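As a rough illustration of the schema idea above, the sketch below writes each data point into two layouts, each keyed for a different query. It is not General Sentiment's actual schema, which the interview does not describe; the two in-memory dictionaries stand in for column families, and every name (`sentiment_by_topic`, `sentiment_by_date`, `record`, and so on) is hypothetical.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for two column families (CFs).
# Each maps row key -> {column name -> value}, mimicking a wide row.
sentiment_by_topic = defaultdict(dict)  # row key: topic, column: date
sentiment_by_date = defaultdict(dict)   # row key: date, column: topic


def record(topic: str, date: str, score: float) -> None:
    """Write the same data point into both layouts (denormalization)."""
    sentiment_by_topic[topic][date] = score
    sentiment_by_date[date][topic] = score


def topic_history(topic: str, start: str, end: str) -> list:
    """Time-series query: read one row, slice its columns by date range."""
    row = sentiment_by_topic.get(topic, {})
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    return [(d, row[d]) for d in sorted(row) if start <= d <= end]


def daily_snapshot(date: str) -> dict:
    """Cross-topic query: read one row holding every topic for that day."""
    return dict(sentiment_by_date.get(date, {}))


record("acme", "2013-10-01", 0.4)
record("acme", "2013-10-02", 0.6)
record("globex", "2013-10-01", -0.2)
print(topic_history("acme", "2013-10-01", "2013-10-02"))
print(daily_snapshot("2013-10-01"))
```

The trade-off in this kind of design is extra writes and storage in exchange for each query touching a single row.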
What was the motivation for using Cassandra and what other technologies was it evaluated against?
General Sentiment operates with 6 TB of temporal data (thousands of time-points spread across tens of millions of keys) that we update daily. We also generate 10+ GB of (mostly temporal) analytics data daily. On top of that we host data for multiple metrics, and each metric needed its own “table” (or CF in this case). Taken together, these requirements called for a column-oriented database that could scale to our needs.
Back when we started looking for a solution (2010), we had a choice between HBase (since we use Hadoop) and Cassandra. HBase was proving difficult to set up and run at the time, so we went with Cassandra 0.6.
Can you share some insight on what your deployment looks like?
Our Cassandra cluster is hosted on Amazon EC2 using general purpose extra large machines (m1.xlarge). Our 24-node cluster currently hosts about 6 TB with compression enabled. We spread this cluster across two availability zones with 12 nodes each.
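To put those numbers in perspective, here is a back-of-the-envelope sketch of a balanced layout for a cluster of this shape. It rests on several assumptions the interview does not confirm: that the 6 TB is the total on-disk footprint, that data is spread perfectly evenly, that the cluster uses RandomPartitioner (token space 0 to 2^127) with manually assigned tokens, and that zones alternate around the ring. The zone labels are placeholders.

```python
NODES = 24
ZONES = ["zone-a", "zone-b"]  # placeholder availability zone labels
TOTAL_TB = 6                  # compressed footprint quoted above

RING_SIZE = 2 ** 127  # RandomPartitioner token space (assumed partitioner)


def ring_layout(nodes: int, zones: list) -> list:
    """Evenly spaced tokens, with zones alternating around the ring."""
    step = RING_SIZE // nodes
    return [(i * step, zones[i % len(zones)]) for i in range(nodes)]


layout = ring_layout(NODES, ZONES)
per_node_gb = TOTAL_TB * 1024 / NODES  # assumes perfectly even distribution

for token, zone in layout[:4]:
    print(zone, token)
print(f"~{per_node_gb:.0f} GB per node")
```

Under these assumptions each node holds roughly 256 GB, and each zone ends up with 12 nodes interleaved with the other zone's nodes.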
What would you like to see out of Apache Cassandra in future versions?
We would like to see the vnodes feature support multi-rack topologies.
We’ve had trouble getting a balanced cluster configuration with a more recent version of Cassandra, v1.2 (more specifically v1.2.9, details below). If we are to upgrade to v2.0 and beyond, we will need to figure out whether this was end-user error or a problem with the version itself.
Better documentation of the nodetool command as well as cassandra.yaml would be nice. Since the introduction of leveled compaction in v1.0, we have found the documentation on which parameters (nodetool or yaml) affect compaction performance for the different strategies to be sparse and often confusing.
A better set of tools to monitor read, write and compaction performance as well as data footprint (size before compression) would be nice too. Some of the functionality in JMX could go into nodetool or something similar. If there were an easy way to plug diagnostics into custom tools, such as triggers for emails, that would be perfect.
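The kind of email trigger described above might look something like the following minimal sketch. It is not an existing Cassandra or nodetool feature: the metric names and thresholds are invented for illustration, and how the metrics dictionary gets populated (for example, from JMX) is left out entirely.

```python
from email.message import EmailMessage

# Hypothetical thresholds; the metric names are illustrative, not Cassandra's.
THRESHOLDS = {"pending_compactions": 100, "read_latency_ms": 50.0}


def breaches(metrics: dict, thresholds: dict = THRESHOLDS) -> dict:
    """Return only the metrics whose value exceeds its threshold."""
    return {k: v for k, v in metrics.items()
            if k in thresholds and v > thresholds[k]}


def build_alert(node: str, metrics: dict, to_addr: str):
    """Build an alert email for a node, or None if nothing is over limit."""
    over = breaches(metrics)
    if not over:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"[cassandra] {node}: {len(over)} metric(s) over threshold"
    msg["To"] = to_addr
    msg.set_content("\n".join(f"{k} = {v}" for k, v in sorted(over.items())))
    return msg


alert = build_alert(
    "node-1",
    {"pending_compactions": 250, "read_latency_ms": 12.0},
    "ops@example.com",
)
print(alert["Subject"] if alert else "all clear")
```

The message is only built here; sending it would be a separate step, for instance with the standard library's `smtplib`.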
What’s your experience with the Apache Cassandra community?
We’ve had excellent experiences with the Cassandra community. We’ve always received quick answers to all our questions and they’ve been useful for quick troubleshooting without exception. We are a small company and getting help from a community full of experienced and smart people certainly helps us in making the right decisions.