January 28th, 2014

Michael Rose: Senior Platform Engineer at FullContact

What does FullContact do and what is your role there?

FullContact is helping people contact each other via their preferred contact mediums. Part of that is finding the most up-to-date public information about people. We search the web millions of times daily and aggregate the information.

My role as Senior Platform Engineer is to build scalable data platforms capable of ingesting and making sense of large volumes of heterogeneous contact information.

How are you using Apache Cassandra?

We’re using Cassandra in two major roles, both in our distributed search platform. First as a repository of searched data, kind of like an augmented outbound cache that gets ingested on a regular basis via Hadoop MapReduce. Second, it’s the live-serving database directly underneath our Person API and holds results of each search including the aggregated ‘resolved’ profile of a person. These clusters both run Cassandra 1.2.10, and we’re utilizing CQL3 tables.

What was the motivation for using Cassandra and what other technologies was it evaluated against?

Cassandra embodies in its core the resilience and availability we need to continue serving our enterprise and internal customers even in the face of transient outages. We originally ran this system on MongoDB and while not very problematic, MongoDB was the wrong choice for the platform and scale it grew to. We’re also users of HBase, but HBase availability on AWS isn’t always perfect, nor does it lend itself well to multi-DC HA installations. Cassandra’s ability to easily support multi-DC masterless replication was a huge motivating factor for us, even if we don’t yet take advantage of it. Additionally, Cassandra itself is much less complex operationally.

Can you share some insight on what your deployment looks like?

We run two clusters, one of 12 nodes and one of 9 nodes. Both are deployed entirely in Amazon US-East-1 but across 3 different AZs. Our 12-node cluster runs with ~400-500GB of data per node and our 9-node live-serving cluster runs with ~200GB of data per node. Cassandra runs under the supervision of Netflix’s Priam for ease of configuration and its backup/restore capabilities. Most of the time we forget about Cassandra and it keeps on running.
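
For a sense of scale, the per-node figures above can be turned into approximate cluster-wide totals. The sketch below is only back-of-the-envelope arithmetic on the quoted numbers, not a measurement; the function name `cluster_footprint_gb` is illustrative, and because the figures are per node, the totals presumably include replicated copies of the data.

```python
def cluster_footprint_gb(nodes, per_node_gb_low, per_node_gb_high=None):
    """Return (low, high) approximate on-disk totals in GB for a cluster.

    The inputs are the rough per-node figures quoted in the interview;
    if only one figure is given, low and high are the same.
    """
    if per_node_gb_high is None:
        per_node_gb_high = per_node_gb_low
    return nodes * per_node_gb_low, nodes * per_node_gb_high


# 12-node cluster at roughly 400-500 GB per node
ingest_low, ingest_high = cluster_footprint_gb(12, 400, 500)

# 9-node live-serving cluster at roughly 200 GB per node
serving_low, serving_high = cluster_footprint_gb(9, 200)

print(f"12-node cluster: ~{ingest_low / 1000:.1f}-{ingest_high / 1000:.1f} TB on disk")
print(f"9-node cluster:  ~{serving_low / 1000:.1f} TB on disk")
```

Under these assumptions the 12-node cluster holds somewhere around 5-6 TB on disk and the live-serving cluster a little under 2 TB.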

What advice do you have for those just getting started with Cassandra?

Make sure it’s the right fit for what you’re doing. In general, it’s easiest just to set it up and give it a whirl.

What’s your experience with the Apache Cassandra community?

The community is excellent. The IRC channel was instrumental in helping us work through initial misconceptions and figure out various operational issues we’ve run across.

Read more about FullContact’s move to Cassandra from Michael’s blog post, Migrating from MongoDB to Cassandra.