Pollfish is an online survey platform that delivers real-time surveys to mobile app users.
This is accomplished through its Android/iOS SDK, which is installed in thousands of mobile apps; the company works closely with mobile app developers to ensure a smooth integration. Pollfish is currently developing its own analytics pipeline to deliver insights about mobile users. My role here is that of a data engineer. I am part of a small yet great team made up of Mr. Nick Gavalas, the devops engineer who makes sure everything works fine around the clock, and Mr. Euaggelos Linardos, our data scientist, who builds real-time data insights.
I am currently working on ETL and data management use cases using the DataStax Enterprise product, mostly with Spark and Cassandra.
To summarize, I am responsible for bringing the data scientist's results into the production pipeline. Pretty cool!
Cassandra is our main database technology for storing and querying the raw data coming from users' mobile apps where the Pollfish SDK is installed. It also stores system events. In short, it is the raw data repository for the Pollfish application.
As part of the free DataStax Startup Program, we also use DataStax Enterprise extensions such as the Cassandra File System (CFS) to store large files instead of a native HDFS implementation. With Apache Spark on top of Cassandra (through the Spark Cassandra Connector), our data processing capabilities are practically unlimited. Restrictions at the Cassandra level can generally be worked around at the Spark level, assuming the data schema is reasonably tuned. As a recent last step, we integrated Cassandra with Apache Solr through DataStax Enterprise, allowing us to run advanced search queries on top of processed data. This is closer to an OLAP usage.
Our first need was an efficient storage technology for writing raw time series data coming from the user side. Cassandra, with its tunable consistency and its efficient write path, was a perfect match among NoSQL databases. Another important aspect is that we wanted a tool that would let us start fast and make progress, as a start-up has limited resources to support complex setups. Business, as always, matters the most, and Cassandra along with DataStax Enterprise served our goals.
It was mostly a use-case-driven evaluation process. After initial stress tests and model validation, we moved to this technology via DataStax Enterprise. DataStax provided not only the production-certified Cassandra database but also very easy integration with Apache Spark, our next targeted technology, and Hadoop FS compatibility through Cassandra FS. We considered other technologies such as MongoDB, but Cassandra was a clear match for our data modeling process, so we moved in that direction.
Cassandra gave us scalable write capacity, matching our use case and future needs. It is a perfect match for our time-series-based data. All in all, it is an easily managed technology for cloud deployments, offering fault tolerance and no downtime.
DataStax's free startup program gave us use of the product without limitations: fees do not apply until the company grows, and there is no limit on the number of nodes!
Currently we use DataStax Enterprise version 4.6.2, which ships with Cassandra version 188.8.131.524 (cqlsh 4.1.1 | CQL spec 3.1.1 | Thrift protocol 19.39.0).
Our production cluster setup has been documented on Microsoft Azure’s blog and is as follows:
- 4 nodes in the Cassandra DC, serving our application needs. These nodes are Standard_A4 machines (8 cores, 14GB of RAM).
- 3 nodes in the Analytics DC, running Spark and Hadoop (CFS) analytics. These are Standard_D13 machines (8 cores, 56GB of RAM).
- 1 node in the Solr DC, with replication of one single keyspace. This last node is a new addition to our cluster and we're still testing Solr capabilities. It's deployed on a Standard_D4 VM (8 cores, 28GB of RAM) and has its data directory pointing to the temporary SSD disk, again for testing purposes.
- All the machines (except the Solr node) have a 15-disk RAID-0 configuration, in order to achieve maximum disk throughput.
- All the machines operate on CentOS 6.5, on kernel 2.6.32-431.29.2.el6.x86_64.
Firstly, isolate your workloads. If, for example, you are designing an analytics pipeline, don't be tempted to mix it with your operational workload. Issues will arise, such as failing nodes and/or general instability of the cluster.
There are hundreds of configuration options available, but you should start with only the basics and tune on demand, and only after careful thought. For example, you could start with the heap size a Cassandra node uses and check whether it matches your use case. Follow best practices from the start and change one thing at a time.
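To make the heap-size example concrete, here is a rough sketch of the default maximum heap calculation as described in the comments of the `cassandra-env.sh` script shipped with Cassandra 2.x: `max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB))`. The function name is hypothetical, and the formula should be treated as an assumption to verify against the script in your own installation.

```python
def default_max_heap_mb(system_memory_mb):
    """Approximate Cassandra's default max heap, in megabytes.

    Assumed rule (check cassandra-env.sh in your own install):
    max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB)).
    """
    half = system_memory_mb // 2
    quarter = system_memory_mb // 4
    return max(min(half, 1024), min(quarter, 8192))


# A 14GB application node versus a 56GB analytics node.
app_node_heap = default_max_heap_mb(14 * 1024)
analytics_node_heap = default_max_heap_mb(56 * 1024)
print(app_node_heap, analytics_node_heap)
```

Under this rule the default never exceeds 8GB, so a node with much more RAM does not automatically get a larger heap; that is the kind of setting to revisit only after measuring.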
Monitoring is crucial in order to verify your production environment settings. OpsCenter is a very useful tool for this, if not the only one. Look out for full disks, an overflowing JVM old-generation heap, frequent GC cycles, and high read/write latencies.
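The checks listed above can be sketched as a simple threshold comparison. This is only an illustration: the metric names, the `NodeMetrics` type, and every threshold value below are hypothetical and do not come from OpsCenter or from our cluster.

```python
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    """Hypothetical snapshot of one node's health metrics."""
    disk_used_fraction: float     # 0.0 to 1.0
    old_gen_used_fraction: float  # JVM old generation, 0.0 to 1.0
    gc_pauses_per_minute: float
    read_latency_ms: float
    write_latency_ms: float


# Illustrative limits only; real values depend on hardware and workload.
THRESHOLDS = {
    "disk_used_fraction": 0.80,
    "old_gen_used_fraction": 0.75,
    "gc_pauses_per_minute": 10.0,
    "read_latency_ms": 50.0,
    "write_latency_ms": 10.0,
}


def find_warnings(metrics, thresholds=THRESHOLDS):
    """Return one message per metric that exceeds its threshold."""
    warnings = []
    for name, limit in thresholds.items():
        value = getattr(metrics, name)
        if value > limit:
            warnings.append(f"{name}={value} exceeds {limit}")
    return warnings


healthy = NodeMetrics(0.40, 0.50, 2.0, 5.0, 1.0)
nearly_full = NodeMetrics(0.95, 0.50, 2.0, 5.0, 1.0)
print(find_warnings(healthy))
print(find_warnings(nearly_full))
```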
Cassandra, like other products, is an evolving technology, so you may hit some issues such as limitations of the current implementation or bugs. As a general strategy, you should evaluate your use case and the roadmap for the features you need, and plan for upgrades or refactorings.
Another thing to consider is that if you integrate Cassandra and Spark in your Lambda architecture, you must design the schema for idempotent writes. Repeatability is a requirement in such a use case, e.g. because of Spark job failures or development bugs. While deleting data is an option, overwriting is best, as it exploits Cassandra's fast write path.
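As a minimal sketch of the idempotent-write idea, assume a hypothetical aggregate table keyed by (app_id, day). The in-memory dictionary below stands in for a Cassandra table, where inserting a row with an existing primary key overwrites it. The names `FakeTable` and `write_daily_counts` and the event fields are illustrative only, not part of the Pollfish pipeline.

```python
from collections import Counter


class FakeTable:
    """In-memory stand-in for a Cassandra table: writing a row whose
    primary key already exists overwrites it (upsert semantics)."""

    def __init__(self):
        self.rows = {}

    def upsert(self, primary_key, values):
        self.rows[primary_key] = values


def write_daily_counts(table, events):
    """Aggregate raw events and write one row per (app_id, day).

    The primary key is derived only from the input data, never from the
    job run (no run id, no wall-clock timestamp), so replaying the same
    batch rewrites the same rows instead of adding new ones.
    """
    counts = Counter((e["app_id"], e["day"]) for e in events)
    for key, total in counts.items():
        table.upsert(key, {"events": total})


events = [
    {"app_id": "app-1", "day": "2015-03-01"},
    {"app_id": "app-1", "day": "2015-03-01"},
    {"app_id": "app-2", "day": "2015-03-01"},
]

table = FakeTable()
write_daily_counts(table, events)
first_run = dict(table.rows)

# Simulate a retry after a failed or buggy job: same input, same rows.
write_daily_counts(table, events)
second_run = dict(table.rows)
```

Because the key depends only on the data, a rerun after a failure needs no cleanup step; the second pass simply overwrites what the first one wrote.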
Do not optimize early. With all the available settings, it is very tempting to start tuning. Refrain from doing that, and make sensible changes using a measured, justifiable process (e.g. use jconsole, a Java profiler, etc.). Finally, just learn your tool and enjoy it!
The Cassandra community is evolving along with the products and is getting pretty mature. My interaction has been through the mailing lists, and the communication has been really helpful. I am looking forward to learning more; it is a very exciting technology!