January 29th, 2014

Robert Wille: Development Lead at Fold3

What does Fold3 do and what is your role there?
fold3.com is a website for original-source historical military documents. At present we have over 100 million images and over 400 million searchable records. I am the development lead for the website.

How are you using Cassandra?
Central to the organization of our data is our browse structure. The browse structure allows users to do a number of things. They can use it to browse our data. When a record is found via search, they can see the browse path in their search results metadata. When viewing an image, the browse structure gives them access to the next/previous image in the current document and a film-strip-like view of the entire collection. The browse structure is currently stored in a MySQL database, but we are migrating it to Cassandra 2.0. The browse structure has approximately 400 million nodes.
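As a rough illustration of the three uses described above (browsing, showing a browse path, and stepping to the next/previous image), here is a minimal in-memory sketch in Python. It is not Fold3's schema or code: the class names, the field names, and the sample titles are all hypothetical, and a real store with hundreds of millions of nodes would keep these relationships in database tables rather than in a dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class BrowseNode:
    """One node in a hypothetical browse tree (a title, a document, or an image)."""
    node_id: int
    title: str
    parent_id: Optional[int] = None
    child_ids: List[int] = field(default_factory=list)


class BrowseTree:
    """Toy browse structure: nodes keyed by id, children kept in display order."""

    def __init__(self) -> None:
        self.nodes: Dict[int, BrowseNode] = {}

    def add(self, node_id: int, title: str, parent_id: Optional[int] = None) -> BrowseNode:
        node = BrowseNode(node_id, title, parent_id)
        self.nodes[node_id] = node
        if parent_id is not None:
            # Append to the parent's ordered child list.
            self.nodes[parent_id].child_ids.append(node_id)
        return node

    def path(self, node_id: int) -> List[str]:
        """Browse path from the root down to the node, e.g. for search results."""
        titles: List[str] = []
        current: Optional[int] = node_id
        while current is not None:
            node = self.nodes[current]
            titles.append(node.title)
            current = node.parent_id
        return list(reversed(titles))

    def siblings(self, node_id: int) -> List[int]:
        """All nodes under the same parent, in order (the film-strip view)."""
        parent_id = self.nodes[node_id].parent_id
        if parent_id is None:
            return [node_id]
        return list(self.nodes[parent_id].child_ids)

    def neighbors(self, node_id: int) -> Tuple[Optional[int], Optional[int]]:
        """(previous, next) sibling ids, or None at either end."""
        sibs = self.siblings(node_id)
        i = sibs.index(node_id)
        prev_id = sibs[i - 1] if i > 0 else None
        next_id = sibs[i + 1] if i + 1 < len(sibs) else None
        return prev_id, next_id


# Hypothetical sample data: a title, one document, and three page images.
tree = BrowseTree()
tree.add(1, "Civil War")
tree.add(2, "Pension Files", parent_id=1)
tree.add(3, "Page 1", parent_id=2)
tree.add(4, "Page 2", parent_id=2)
tree.add(5, "Page 3", parent_id=2)

print(tree.path(4))       # browse path for an image
print(tree.neighbors(4))  # previous/next image ids
print(tree.siblings(4))   # film-strip view of the document
```

The sketch keeps children as an ordered list on the parent, which makes the next/previous lookup a simple index operation; that is only one possible design choice.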

What was the motivation for using Cassandra and what other technologies was it evaluated against?
The motivation for migrating to Cassandra is to make it possible to scale inexpensively. The database that currently hosts the browse structure is approaching the reasonable limits of cores and RAM for a server. We have a couple of very large titles that we would like to add to our website, but it is likely that we would outgrow our database server (which we just purchased a couple of months ago). We evaluated MongoDB and OrientDB as alternatives.

Can you share some insight on what your deployment looks like?
Initially, we’ll have a single datacenter. Eventually we will migrate our documents to Cassandra as well. At that point, we will probably create a second datacenter to serve as a hot backup and to run reports against. We should have close to a billion very small records in the browse structure. We’re currently planning on a four-node cluster with about a terabyte of total storage, with Cassandra consuming about half of that.

What advice do you have for those just getting started with Cassandra?
My advice for someone getting started would be to read a lot, and to subscribe to the users group and read everything. There is so much that a new person doesn’t even know that they don’t know. There are a lot of subtleties that can impact schema design and code architecture. The ramp-up time is much longer than for something like MySQL or Postgres, so it’s best to learn early and learn a lot before you start designing schemas and writing code.

Anything else that you’d like to add?
The community has been really great. I’ve learned a lot and have gotten answers to a number of questions and issues I’ve had.