September 13th, 2013

By

Dan Foody: Founder & CEO at Cloze, Inc.
Alex Coté: Founder & CMO at Cloze, Inc.
Christian Hasker: Editor at Planet Cassandra, a DataStax Community Service

 

TL;DR: You can think of Cloze as a noise-cancelling inbox for email and social. The Cloze engine does two things: first, it analyzes a user’s history, the analytics that figure out who the important people are; second, it handles the online portion, giving people access to their inboxes with all the noise filtered out.

 

Cloze chose Cassandra for its ease of use, management, and deployment. The other factor was high availability; they needed to keep their service up 24/7, even in the event of failures.

 

They are running a cluster of 12 nodes on Amazon XL instances, with three-way replication across three different Amazon data centers (availability zones, in Amazon-speak). They have not had any downtime.

 

Hello everyone, this is Christian Hasker with Planet Cassandra; I am joined today by Dan and Alex from Cloze. To get things started, why don’t you tell us a little bit about what Cloze does?

You can think of Cloze as a noise-cancelling inbox for email and social. What Cloze does is figure out who’s most important to you. We analyze all of your past email history, everyone you’ve ever talked with, to figure out the most important people. We use that to then filter your social feeds and your email, so that you get a view of just what’s important and not all the other junk that comes through.

 

I need that, thank you. In my mind I’m seeing how Cassandra could be a really good fit for that, but if you could outline how you’re using Cassandra as the data store, that would be great.

There are two parts to what our engine does: the first is analyzing your history, the analytics that figure out who the important people are; the second is the online portion, giving people access to their inboxes with all the noise filtered out.

 

We actually use Cassandra for both aspects of that. So when we connect someone, we import all of their data into Cassandra. That data is analyzed, and then we keep the most recent history in Cassandra; we offload all of the older material once the first analysis is done.

 

A typical user might bring in a gigabyte of data on a first analysis, so that gets sucked in and analyzed, and all of their older history then gets dropped once we’ve analyzed it. We keep the recent history and a summary of “here’s what’s most important to you.” So, Cassandra is our core data store for messages and for the analysis we do on them.
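
To make the “figure out who’s important” idea concrete, here is a toy sketch of interaction-based scoring in Python; the message fields, weights, and decay are invented for illustration and are not Cloze’s actual algorithm.

    # Toy contact-importance scoring over an email history.
    # Fields and weights are hypothetical, not Cloze's real model.
    from collections import defaultdict
    from datetime import datetime

    def score_contacts(messages, now=None):
        """messages: dicts like {"sender": str, "recipients": [str],
        "sent_by_user": bool, "date": datetime}"""
        now = now or datetime.utcnow()
        scores = defaultdict(float)
        for msg in messages:
            # Mail the user wrote is a stronger signal than mail received.
            weight = 3.0 if msg["sent_by_user"] else 1.0
            # Recent interactions count more than old ones (simple decay).
            weight *= 0.99 ** (now - msg["date"]).days
            people = msg["recipients"] if msg["sent_by_user"] else [msg["sender"]]
            for person in people:
                scores[person] += weight
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)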

 

Especially now, when every other week we hear about a high-profile privacy leak. Is that a concern for your users, and how do you mitigate it?

Obviously privacy is important to all of our users, so we’ve built our own layers of security on top of everything we do. Everything in Cassandra is encrypted, so even we can’t get at our users’ email.
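
As a rough illustration of that kind of application-level encryption, where Cassandra only ever stores ciphertext, here is a minimal sketch using Python’s cryptography library and the DataStax driver; the keyspace, table, and key handling are assumptions for the example, not Cloze’s actual scheme.

    # Sketch: encrypt message bodies client-side, so the database only
    # ever sees ciphertext. Real key management would live outside the DB.
    from cryptography.fernet import Fernet
    from cassandra.cluster import Cluster

    key = Fernet.generate_key()   # hypothetically, one key per user
    fernet = Fernet(key)

    session = Cluster(["127.0.0.1"]).connect("cloze")
    insert = session.prepare(
        "INSERT INTO messages (user_id, message_id, body) VALUES (?, ?, ?)")

    def store_message(user_id, message_id, plaintext):
        ciphertext = fernet.encrypt(plaintext.encode("utf-8"))
        session.execute(insert, (user_id, message_id, ciphertext))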

 

We’ve built a number of layers of security around it and on top of it. It’s interesting: in some ways Cassandra avoids a lot of the common attacks, like SQL injection, because there is no direct equivalent to SQL.

 

Maybe in the future, we’ll see CQL injections arising.

It’s a little harder because the syntax is much more restricted; a lot of the ways you do SQL injection exist because SQL’s syntax is so rich. As CQL gets richer, it may become subject to that, but it isn’t right now.

 

That’s really fascinating, because people tend to equate relational databases with being very secure; you’ve kind of inverted that for me today. That’s great.

The security problems are rarely at the platform level; they’re in how you build things on top of the platform. So, in a relational database, if you forget one single place to check for injection attacks, you’ve opened up a hole.
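
On either platform, the standard defense is to keep user input out of the query text entirely; with Cassandra that means bound prepared statements. A minimal sketch with the DataStax Python driver (the users table is hypothetical):

    # Bound prepared statements keep user input out of the query text.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("cloze")

    # Unsafe pattern (string splicing), shown only as a contrast:
    #   session.execute("SELECT * FROM users WHERE email = '%s'" % email)
    query = session.prepare("SELECT * FROM users WHERE email = ?")
    rows = session.execute(query, ("dan@example.com",))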

 

Can you talk a little bit about how you decided to go with Cassandra?  Were there other technologies that you looked at along the way?

We looked at a combination of factors. The big ones for us were ease of use, ease of management, and ease of deployment. The other side was high availability; we needed the ability to keep our service up 24/7, even in the event of failures.

 

We looked at things like HBase, MongoDB, and traditional relational databases as well, like MySQL. The other big factor for us is that we have a lot of data. As I said, every user is about a gigabyte of data, and once you put a few hundred thousand users in a relational database, it will start to break down; even databases like MongoDB will start to break down. That kind of eliminates a lot of the different approaches for doing large-scale analysis.

 

We were left looking at things like Cassandra and HBase, the ones that could really scale and had reliability built in. Cassandra, versus HBase, is certainly a much simpler product to deploy and manage, especially for a small team.

 

Regarding your testing, did Cassandra work as advertised? Did it scale out as anticipated as you added data?

Yeah, it’s doing very well. We’re running on Amazon right now, with three-way replication across three different Amazon data centers, availability zones in Amazon-speak. We’re happy to say we’ve had no downtime, even when failures happen.
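
In CQL terms, that three-way placement is declared per keyspace. A sketch assuming the EC2 snitch, under which the region becomes the data center name and each availability zone a rack; the keyspace name is illustrative.

    # Three replicas, spread across availability-zone "racks" by
    # NetworkTopologyStrategy; "us-east" is the DC name the EC2 snitch reports.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS cloze
        WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3}
    """)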

 

So when an entire Amazon data center goes down, like Amazon East, you have still been available?

It’s funny, it’s almost the reverse problem for us. We don’t notice when they go down, so it’s like “Oh, one of our data centers is down. Maybe we should look to make sure nothing else is going wrong.” We had to add extra monitoring, because everything operates normally even when something like an entire data center goes down.
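
One minimal form that extra monitoring could take is polling the driver’s cluster metadata for downed hosts; this is a sketch, not Cloze’s actual tooling, and the alert hook is a placeholder.

    # Alert on down nodes even though the cluster keeps serving normally.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    def check_hosts(alert):
        for host in cluster.metadata.all_hosts():
            if not host.is_up:
                alert("Cassandra node down: %s" % host.address)

    check_hosts(print)  # in practice, wire `alert` to paging or metrics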

 

As you started out with Cassandra, what are some of the things you’ve learnt along the way that would be beneficial for someone starting out with Cassandra to think about?

You know, you need to design your data the right way, and it’s very different from designing for a relational database. In the relational world, you start off by saying “I’ll just put the data in and figure out how to get at it later.” You can’t do that with Cassandra; you have to think about how you’re going to use the data before you structure it.

 

In the relational world you can get away without doing that for a little while, but it bites you later. In Cassandra, you have to think about it upfront. We’ve also had certain challenges with Cassandra itself; some of the features didn’t work as advertised. Things like secondary indexes, which we had started using originally, either had bugs or didn’t really work like full-scale secondary indexes.
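
To make the query-first point concrete, here is a hypothetical table shaped around a single read, “the newest messages for a user,” so that no secondary index is needed; all names are illustrative.

    # Query-first modeling: the table is shaped for the one read it serves,
    # rather than normalized up front and indexed later.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("cloze")
    session.execute("""
        CREATE TABLE IF NOT EXISTS inbox (
            user_id    text,
            sent_at    timestamp,
            message_id text,
            sender     text,
            body       blob,
            PRIMARY KEY (user_id, sent_at)
        ) WITH CLUSTERING ORDER BY (sent_at DESC)
    """)
    # The newest-first read follows the clustering order directly:
    recent = session.execute(
        "SELECT * FROM inbox WHERE user_id = %s LIMIT 50", ("u123",))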

 

The other one we got bitten by was counter columns. We’d made heavy use of those initially, and then found out they had too many flaws for us to rely on them. So we had to remove both secondary indexes and counter columns from our architecture.

 

What’s the advice there? Don’t use the latest open source release? How would you have avoided that issue?

In many ways it’s like anything: stick to the stuff that’s tried and true. So, in Cassandra, stick to the tried-and-true aspects, not the features introduced in the latest release; let a new feature go through a few major releases before you actually start using it. The core of Cassandra has been very stable and very reliable for us.

 

Yeah, I think that’s good advice.  How has the community been for you as you’ve hit those issues? Have you been on the IRC, mailing list, or user groups?  What’s your experience there?

The community is very active, so that’s great. The great thing about open source is that we can fix problems in it too. We found and fixed a few in Cassandra and shared them with the community.

 

There are still other times when you run into the typical open source problem: the community only wants to move forward, but if you’re on an older release, you want some of the bug fixes back-ported; there’s always tension there.

 

Yes, you want long-tail support, but the community is moving forward fast; I totally understand. You talked about the average amount of data each user has; I’m wondering, how much total data are you storing in Cassandra now?

I don’t know the exact number of gigabytes off the top of my head. We’re running a cluster of 12 nodes right now, on Amazon XL instances. A lot of our data, once we do the initial processing, is actually offloaded to Amazon S3, partly because it’s the cheapest way to store lots of data.

 

We couldn’t afford to store all of our data in Cassandra, just because of the machine cost at Amazon. So we offload a ton of data into S3 as sort of our secondary store, and Cassandra keeps just each user’s most recent 30 days of data after we’ve done their initial pass. Even still, with hundreds of thousands of users, it adds up.
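
A sketch of what that tiering could look like: a 30-day hot window in Cassandra via TTL, with the durable copy pushed to S3 via boto3. The bucket, table, and write path are assumptions for illustration, not Cloze’s actual pipeline.

    # Tiered storage sketch: hot 30-day window in Cassandra (via TTL),
    # full history offloaded to S3 as the cheap long-term store.
    import boto3
    from cassandra.cluster import Cluster

    THIRTY_DAYS = 30 * 24 * 3600  # TTL in seconds

    s3 = boto3.client("s3")
    session = Cluster(["127.0.0.1"]).connect("cloze")
    insert = session.prepare(
        "INSERT INTO inbox (user_id, sent_at, message_id, sender, body) "
        "VALUES (?, ?, ?, ?, ?) USING TTL %d" % THIRTY_DAYS)

    def store(user_id, sent_at, message_id, sender, body):
        # Durable copy first, then the expiring hot copy.
        s3.put_object(Bucket="cloze-archive",
                      Key="%s/%s" % (user_id, message_id), Body=body)
        session.execute(insert, (user_id, sent_at, message_id, sender, body))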
