Patricio Echagüe, Sr. Software Engineer at RelateIQ
RelateIQ has rethought relationship management by automating, simplifying, and deepening the way people engage with professional relationships. It automatically tracks and analyzes the day-to-day interactions that drive relationships in business development, sales, and more, using data and algorithms to make you better at your job.
My role is to design systems at scale, as well as data pipelines that can help us mine user data and provide our users with the information they need to make better decisions.
Path to Cassandra
Like most companies, we used MongoDB as our first data store in our first iteration. It soon became a bottleneck, and Cassandra was the best candidate to replace it. Cassandra has been running non-stop for more than two years. We have upgraded nodes and doubled the capacity, all with no downtime. Cassandra just works.
Cost and high availability of Cassandra
The main reason we chose Cassandra as the first alternative was to be able to scale out the system as needed with minimal operational cost.
As a SaaS company, maintaining high availability is crucial for us. If we want to help people make more informed decisions, the system must be available at all times. The storage format and the naturally sorted columns play an important role in how we store and fetch the data that powers our main services.
Foundation of Cassandra
Cassandra is the foundation for three major pieces of infrastructure. First, it stores user email metadata, which is used to serve what we call “Timeline” views. Second, it is the home of feature extraction for machine learning algorithms. Finally, it stores time series data for activity reporting.
The naturally sorted column structure makes Cassandra a perfect fit for the job.
We are currently running 1.2 and will soon move to the latest 2.0 version. We started with two data centers of 6 and 3 nodes respectively, both on spinning disks. One is for online data and one is for analytics. A few months ago we upgraded those machines to bigger, SSD-backed ones, and we now run two data centers of 3 nodes each.
It is currently deployed on Amazon EC2 in two data centers with three nodes in each. All machines use SSD drives to guarantee low seek times. Provisioning is done via Chef.
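To illustrate why sorted columns suit timeline and time series reads, here is a minimal in-memory sketch in Python. It is not Cassandra code and not RelateIQ's schema: the `SortedRow` class, the timestamps, and the event strings are all hypothetical. It only models the idea that when the columns of a row are kept ordered by name (here, a timestamp), a "latest N events" view becomes a contiguous range read.

```python
import bisect


class SortedRow:
    """Hypothetical in-memory stand-in for one row whose columns stay sorted by name."""

    def __init__(self):
        self._names = []   # column names (here: timestamps), kept in sorted order
        self._values = {}  # column name -> value

    def insert(self, name, value):
        # Writing an existing column name overwrites its value in place.
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, end, limit=None, reverse=False):
        """Return (name, value) pairs with start <= name <= end, in column order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, end)
        names = self._names[lo:hi]
        if reverse:
            names = names[::-1]
        if limit is not None:
            names = names[:limit]
        return [(n, self._values[n]) for n in names]


# Events arrive out of order but are stored in timestamp order.
timeline = SortedRow()
timeline.insert(1400000300, "email: follow-up sent")
timeline.insert(1400000100, "email: intro received")
timeline.insert(1400000200, "email: reply sent")

# "Timeline" style read: the two most recent events in a time window.
latest_two = timeline.slice(1400000000, 1400000999, limit=2, reverse=True)
```

Because the data is already ordered at write time, the read needs no sort step; it locates the window boundaries and returns the adjacent entries.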
Advice on getting started
Make sure you understand how you are going to query the data and how it should be sorted. Above all, aim for idempotent operations wherever possible. If you use spinning disks, random IO can degrade performance, so make sure your hot data set fits in memory.
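As a rough illustration of the idempotency advice, the Python sketch below models a write as an upsert keyed by a deterministic identifier, so that retrying the same write leaves the stored state unchanged. The `apply_write` helper, the `user:42` row key, and the `msg-0001` column name are hypothetical; a plain dictionary stands in for the data store.

```python
def apply_write(store, row_key, column_name, value):
    """Upsert: applying the same (row_key, column_name, value) repeatedly is harmless."""
    store.setdefault(row_key, {})[column_name] = value


store = {}

# The column name comes from the event itself (a message id),
# not from a counter or an append position.
event = ("user:42", "msg-0001", {"subject": "Intro", "ts": 1400000100})

apply_write(store, *event)
snapshot = {key: dict(columns) for key, columns in store.items()}

# Simulated retry, for example after a client-side timeout.
apply_write(store, *event)
```

After the retry, `store` still equals `snapshot`. An append-style write would have produced a duplicate entry instead.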
Use the best tool for the job. DataStax Enterprise has released some good, new features that are worth exploring.
My experience with the community has been great. I was personally very involved as a committer for Hector (one of the Cassandra clients) and contributed patches, which were kindly accepted by the community. DataStax has been doing a great job on this front.