December 6th, 2013

Ebot Tabi: CEO at SiQueries

Ebot Tabi, CEO and founder of SiQueries, joins us to talk about the company's Apache Cassandra use case. Thanks, Ebot. To begin, what does SiQueries do?

SiQueries is a start-up, and our primary business is providing a SaaS for data analysis and visualization for small and medium companies. We let you connect various data sources such as MySQL, Postgres, Amazon Redshift, Google BigQuery and Oracle, and run real-time queries over these data sources to gain valuable business insights.

For end users with data from sources such as sensors and server logs, we also provide a highly available API that lets them push data into our data stores and run analyses over it to extract valuable knowledge. We believe small and medium companies can become more competitive when they understand their business and market trends and make smart decisions based on faster analysis of this data. At SiQueries we try to make this process extremely simple and easy to get started with.

How are you using Apache Cassandra?

The vast majority of the data pushed into our infrastructure via our REST API is represented internally as time series. The data is by and large ingested and consumed in real time by a couple of Storm topologies that read directly from our Kafka cluster, either to generate alerts or to be displayed on interactive graphs. We use Apache Cassandra to store this time series data.
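The interview does not describe the actual schema, so the following is only a minimal sketch of one common way to lay out time series for a store like Cassandra: group points by a source identifier plus a time bucket, and keep each group ordered by timestamp. The names `day_bucket`, `partition_key` and `group_into_partitions`, the one-day bucket size, and the `(source_id, timestamp, value)` point shape are all assumptions made for illustration, not SiQueries' design.

```python
from collections import defaultdict
from datetime import datetime, timezone


def day_bucket(ts: datetime) -> str:
    """Return the UTC calendar day of a timestamp (hypothetical bucket size)."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


def partition_key(source_id: str, ts: datetime) -> tuple:
    """Build a (source, day) key so that no single partition grows without bound."""
    return (source_id, day_bucket(ts))


def group_into_partitions(points):
    """Group (source_id, ts, value) points by partition key.

    Rows inside each partition are sorted by timestamp, mirroring how a
    clustering column would order them on disk.
    """
    partitions = defaultdict(list)
    for source_id, ts, value in points:
        partitions[partition_key(source_id, ts)].append((ts, value))
    for rows in partitions.values():
        rows.sort(key=lambda row: row[0])
    return dict(partitions)


if __name__ == "__main__":
    utc = timezone.utc
    sample = [
        ("sensor-1", datetime(2013, 12, 6, 23, 59, tzinfo=utc), 1.0),
        ("sensor-1", datetime(2013, 12, 7, 0, 1, tzinfo=utc), 2.0),
        ("sensor-1", datetime(2013, 12, 6, 8, 0, tzinfo=utc), 0.5),
    ]
    for key, rows in sorted(group_into_partitions(sample).items()):
        print(key, [value for _, value in rows])
```

Bucketing by day is a design choice rather than a requirement: a smaller bucket suits very chatty sources, a larger one suits sparse ones.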

 

What was the motivation for using Cassandra? Did you look at any other technologies?

Cassandra gives us a nice mix of low latency for durable writes, scalable storage and simple management. The beta version of our custom cloud and on-site deployments was built around MongoDB. At the beginning it was easy to set up, and developing features was pretty quick, but as soon as we hit massive data growth in just a couple of weeks, ad-hoc query performance dropped badly. So we researched alternatives and tried Apache HBase, Apache Cassandra and Druid from Metamarkets. We finally settled on building something that combines Apache Cassandra with Druid, Storm and Kafka to ingest data and serve analysis results in real time. This lets us respond to end users more quickly and with more intelligent reports. Right now Apache Cassandra holds over 60% of the data pushed into our infrastructure.

Can you share some insight on what your deployment looks like?

We have a six-node Cassandra cluster running on cloud servers, with S3 for backups. Our setup achieves some nice write throughput: 10,000 writes per second, which is reasonable for us in this startup phase. We are currently running demo phases with some clients, and we have ingested more than 19TB of data into our Apache Cassandra cluster, with the data growing at an exponential rate.
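To put the quoted figures in perspective, here is a rough back-of-the-envelope sketch. It uses only the six nodes and 10,000 writes per second mentioned above. The replication factor is not stated in the interview, so `replication_factor` is a hypothetical parameter, and the calculation assumes the load is sustained and evenly spread across nodes.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

WRITES_PER_SECOND = 10_000  # figure quoted in the interview
NODE_COUNT = 6              # figure quoted in the interview


def writes_per_day(writes_per_second: int) -> int:
    """Client writes per day if the quoted rate were sustained all day."""
    return writes_per_second * SECONDS_PER_DAY


def replica_writes_per_node(writes_per_second: int, nodes: int,
                            replication_factor: int = 1) -> float:
    """Average replica writes each node handles per second.

    replication_factor is an assumption: the interview does not state it.
    """
    return writes_per_second * replication_factor / nodes


if __name__ == "__main__":
    print(writes_per_day(WRITES_PER_SECOND))
    print(replica_writes_per_node(WRITES_PER_SECOND, NODE_COUNT))
    print(replica_writes_per_node(WRITES_PER_SECOND, NODE_COUNT, replication_factor=3))
```

Under these assumptions the cluster would absorb 864 million client writes per day, or roughly 1,667 writes per second per node before replication is taken into account.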

 

What would you like to see out of Apache Cassandra in future versions?

We would definitely welcome performance improvements, especially for fast random reads on large datasets.

What’s your experience with the Apache Cassandra community?

The community is just awesome, and DataStax has some great documentation. For every single challenge we faced, there was already an answer either in the community or in the DataStax documentation. We are definitely grateful for the awesome work done by the community, and we look forward to sharing and contributing to it.
