October 11th, 2013

By 

 

Sean Crawford:  Marketing Manager at Quandl

Blake Hilscher:  Software Engineer at Quandl

Christian Hasker: Editor at Planet Cassandra, A DataStax Community Service

 

TL;DR:  Quandl is a search engine for numerical data. The platform offers access to millions of open and free financial, economic, and social datasets, indexed from hundreds of sources globally.

 

Quandl is using Cassandra to store all of their numerical data. They currently have around 250 gigabytes. While that is not a huge amount at the moment, they expect it to keep growing and wanted a database that would grow with them. The other technologies considered were MongoDB, HBase and Riak. They evaluated each one based on availability, scalability, write and read performance, failover tolerance, and the community.

 

Quandl found that on most fronts, Cassandra performed better. They liked Cassandra’s high availability and high scalability, along with its many great tools and strong community support, such as Netflix’s Astyanax client and OpsCenter.

 

Hello everyone, this is Christian Hasker with Planet Cassandra for today’s Apache Cassandra Use Case. I am joined today by Sean and Blake from Quandl. To get things started, why don’t you tell us a little bit about what Quandl does?

Sean:  Quandl is a search engine for numerical data. The platform offers access to millions of open and free financial, economic, and social datasets, indexed from hundreds of sources globally. It takes disparate data hidden deep within the web in a myriad of forms and opens them up, making them easily accessible in whichever format is desired. Quandl minimizes the time required searching for data, and empowers people to begin their analysis work sooner. Our long-term goal at Quandl is to make all numerical data easy to find and easy to use.

 

Perfect, now if you wouldn’t mind introducing yourself and Blake and outlining a little bit what you guys are responsible for at Quandl? 

Sean:  I mostly deal with community, marketing, and partnerships. My day to day is composed of interacting with users, making sure they’re well served and getting what they need from Quandl; getting people aware of and interested in Quandl; and connecting with other data and analysis services that we may be able to work together with.

 

Blake:  I’m one of the software engineers at Quandl.  I’ve been working on the product in a variety of forms, from building infrastructure to assessing how we can better store and access our data.  That was one of the big things that I worked on in the last half year, which eventually led to setting up Cassandra.

 

That brings us to why Cassandra, and what role it is playing there at Quandl. If you could also talk about whether you looked at anything else before Cassandra, that would be great.

Blake:  We’re using Cassandra to store all of our numerical data of which we currently have around 250 gigabytes.  It’s not a huge amount of data at the moment, but we’re expecting that it’s going to continue to grow and grow and grow.  Some of the different technologies that I considered were MongoDB, HBase and Riak.  The way that I evaluated the different technologies was based on availability, scalability, write and read performance, failover tolerance, and the community and their availability.

 

Looking at all these different things, I found that on most fronts, Cassandra performed better.  I liked that it has high availability and high scalability. I also found it had a lot of really great tools and community support, like Netflix’s Astyanax and OpsCenter.

 

When I looked at profiling the performance of the different technologies, I found that MongoDB just didn’t perform well enough to meet our needs.  It’s a lot slower.  I found HBase wasn’t quite as performant as Cassandra.  That’s pretty much my estimate.

 

Great.  The other thing we’d like you to talk about, which the community finds very helpful, is the environment that you’re running Cassandra in: the version of Cassandra, maybe what specific features you’re utilizing, and what your hardware looks like.

Blake:  Right now we’re running a single data center in EC2.  We have six nodes in our cluster running on M1 extra large instances, and we’re handling our deployment using Chef, which is a DevOps tool from the Ruby ecosystem.  Our configuration is RAID 10 running on ephemeral drives, and we’re using a replication factor of three. In the future we expect to deploy additional data centers in other availability zones, and obviously to add nodes as demand increases.

 

We’re running Version 1.2.1.  I saw that 2.0 came out recently, so I’ll probably upgrade to that soon.  When I was going through the process, I initially was using, I think, Version 1, the one before C2.0, and we’re using the DataStax Java driver to interface with Cassandra.
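
To put rough numbers on the cluster Blake describes, here is a minimal back-of-the-envelope sketch. The node count, replication factor, and data size come from the interview; the helper function names are hypothetical, and the sketch assumes data is spread evenly across nodes and ignores compression and compaction overhead.

```python
# Back-of-the-envelope numbers for the cluster described in the interview.
# From the interview: 6 nodes, replication factor 3, about 250 GB of data.
# Assumptions (not from the interview): even data distribution across nodes,
# no compression, no compaction or snapshot overhead.

def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1


def replicas_that_can_fail(replication_factor: int) -> int:
    """Replicas of a given row that can be down while QUORUM still succeeds."""
    return replication_factor - quorum(replication_factor)


def per_node_gb(raw_gb: float, replication_factor: int, nodes: int) -> float:
    """Approximate data held by each node once every row is replicated."""
    return raw_gb * replication_factor / nodes


NODES = 6
RF = 3
RAW_GB = 250

print("quorum size:", quorum(RF))
print("replicas that can be down per row:", replicas_that_can_fail(RF))
print("approx. GB per node:", per_node_gb(RAW_GB, RF, NODES))
```

Under these assumptions, a QUORUM operation needs two of the three replicas, so one replica of any row can be unavailable, and each node carries roughly 125 GB.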

 

Great.  What was your background before coming to Cassandra, how did you find that transition, and what advice would you pass along to other people who are looking to get started or may have just gotten started with it?

Blake:  For the previous decade I was using SQL technologies, primarily MySQL and, more recently, Postgres.  As far as advice, I’m not too sure.  I’ve found working with Cassandra and scaling it out is a breeze compared to scaling out Postgres.  Scaling Postgres was insane.

 

We hear that over and over again.  It really is designed to scale out.  Do you have plans to go to multiple data centers?

Blake:  Yeah, as our volume of data increases I expect that we’ll add additional data centers and additional availability zones on AWS so that we’re more fault tolerant.  We’re about to add a bunch of new user generated stuff on Quandl, so that will become important in the upcoming months.
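
As an illustration of what moving to multiple data centers can involve, the sketch below builds the kind of CQL statement that sets per-data-center replication with `NetworkTopologyStrategy`. It is only a sketch under stated assumptions: the keyspace name `quandl_data` and the data center names `us-east` and `us-west` are hypothetical, not taken from the interview, and the code only generates the statement text rather than connecting to a cluster.

```python
# Sketch: generate a CQL statement for per-data-center replication.
# Hypothetical names (not from the interview): keyspace "quandl_data",
# data centers "us-east" and "us-west". Nothing here talks to a cluster.

def alter_keyspace_cql(keyspace: str, replicas_per_dc: dict) -> str:
    """Return an ALTER KEYSPACE statement using NetworkTopologyStrategy."""
    parts = ["'class': 'NetworkTopologyStrategy'"]
    # Sort by data center name so the output is deterministic.
    for dc, rf in sorted(replicas_per_dc.items()):
        parts.append("'%s': %d" % (dc, int(rf)))
    options = ", ".join(parts)
    return "ALTER KEYSPACE %s WITH replication = {%s};" % (keyspace, options)


def local_quorum(replicas_in_dc: int) -> int:
    """Replicas in the local data center needed for LOCAL_QUORUM."""
    return replicas_in_dc // 2 + 1


layout = {"us-east": 3, "us-west": 3}  # hypothetical two-data-center layout
print(alter_keyspace_cql("quandl_data", layout))
print("LOCAL_QUORUM per data center:", local_quorum(3))
```

With three replicas in each data center, LOCAL_QUORUM requests need two local replicas and do not have to wait on the remote data center.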

 

Excellent.  Earlier you mentioned community is one of your factors in choosing Cassandra.  What’s your experience been at community events like meet-ups, or on IRC or the mailing list?  If you could talk to that a little bit, that would be great.

Blake:  I used IRC a fair amount when I was trying to get everything up and running, and when I was running into technical issues with the different drivers that I tested out.  IRC was very useful.  The other aspect of the community is just the fact that there are a lot of different libraries and things available for Cassandra that make it easy to play around with.