August 26th, 2013

Dave Angulo: CTO & Co-Founder at SpotRight

Brady Gentile: Community Manager at DataStax


TL;DR: SpotRight started using Cassandra at 0.6.1. When choosing a database, they evaluated Neo4j, HBase, Cassandra and many more. The deciding factor in choosing Cassandra was the graph fit: when you represent graph data structures in memory, they’re represented as key-value stores.


SpotRight has fairly tall nodes, varying from 20 – 80 depending on the architecture they’re running and workloads, with 4TB on each.  They’re running spinning disks, with commit logs on very fast RAID 0 SCSI drives and data on large, slower disks. SpotRight is currently experimenting with vnodes.


Hello Planet Cassandra users. Today we have joining us Dave Angulo, CTO and Co-founder at SpotRight. To start things out, could you tell us a little bit about what SpotRight does?

SpotRight’s GraphMassive product is the largest multi-network consumer social graph for data-driven and brand marketers.  We’ve linked social interest and relationships to individual consumers at scale, currently three hundred million consumers with eight billion social connections.  With all that data we provide valuable social insights for digital marketers including audiences and measurement based off those insights.


How are you using Apache Cassandra at SpotRight?

Cassandra is the core data store for all of the data we have on people and their connections.  We started out a long time ago (in Cassandra time), on 0.6.1.  It’s kind of grown with us, and effectively everything we have is in Cassandra. That’s the source of truth, and we use that source of truth to power other products.


It sounds like you guys were early adopters, starting on 0.6. What was your motivation for using Apache Cassandra when you first started out?  Were there other technologies that you looked at?

We did a pretty exhaustive look.  This was 2010 and we looked at a whole bunch of different things.  Our primary data asset, besides social profiles, is a graph data structure, so we looked at everything that was out there at that time (HBase, Neo4j, Cassandra, etc.).  What appealed to us most was the graph factor: when you represent graph data structures in memory, they’re represented as key-value stores.  That played really well to Cassandra and drove the choice.
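
To illustrate the point about graphs being represented as key-value stores, here is a minimal sketch in Python. It is not SpotRight's schema or code: the names `graph`, `add_connection` and `neighbors` are hypothetical, and the data is made up. Each key is a person and each value is the set of people they are connected to, which is roughly how an adjacency list could map onto a row key and its columns.

```python
from collections import defaultdict

# Hypothetical in-memory graph: key = person, value = set of connections.
# This is an adjacency list stored as a plain key-value mapping.
graph = defaultdict(set)


def add_connection(graph, a, b):
    """Record an undirected connection between a and b."""
    graph[a].add(b)
    graph[b].add(a)


def neighbors(graph, person):
    """Fetch all connections for one person with a single key lookup."""
    return sorted(graph.get(person, set()))


add_connection(graph, "alice", "bob")
add_connection(graph, "alice", "carol")

print(neighbors(graph, "alice"))
print(neighbors(graph, "dave"))
```

Looking up a person's connections is one read by key, which is the access pattern a key-value or wide-row store handles well.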


Could you share some insight into what your deployment looks like?

We pretty much tried everything under the sun and experimented with a number of different architectures.  We settled on a managed service environment with our own hardware.  Currently, we are running spinning disks, with commit logs on very fast RAID 0 SCSI drives and data on large, slower disks.  The problems that we solve are very large and extremely memory intensive, so we have fairly tall nodes, varying from 20 – 80 depending on the architecture they’re running and workloads, with 4TB each.  So, we’re not normal, I guess, in the Cassandra world, but it’s worked out well.  We are currently experimenting with a new architecture using the vnodes that came in with 1.2, and so far we are fairly happy with that.


Are there any features you would like to see coming out in future versions of Apache Cassandra?

We do a lot of our work as Hadoop batches and are getting more and more into streaming.  I see improved Cassandra and Hadoop integration being a great feature.


Predicate creation right now is “or” based, so if you say you want a bunch of columns out of your column family, you’ll “or” those all together and get a fairly large data set.  Our main feature request is that we have the option to treat those as “and”s.  So, when we construct a predicate, we can “and” the columns together and we only get rows returned that have all of the columns and not only one of them.
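
The difference between the two predicate behaviors can be sketched in Python over plain dictionaries. This only models the semantics described above; it does not use any Cassandra client API, and the row data and the function names `select_any` and `select_all` are hypothetical.

```python
# Hypothetical rows: row key -> {column name: value}.
rows = {
    "user1": {"likes_music": 1, "likes_sports": 1},
    "user2": {"likes_music": 1},
    "user3": {"likes_travel": 1},
}


def select_any(rows, wanted):
    """'Or' semantics: keep a row if it has at least one wanted column."""
    result = {}
    for key, cols in rows.items():
        found = {c: cols[c] for c in wanted if c in cols}
        if found:
            result[key] = found
    return result


def select_all(rows, wanted):
    """'And' semantics: keep a row only if it has every wanted column."""
    return {
        key: {c: cols[c] for c in wanted}
        for key, cols in rows.items()
        if all(c in cols for c in wanted)
    }


wanted = ["likes_music", "likes_sports"]
print(sorted(select_any(rows, wanted)))
print(sorted(select_all(rows, wanted)))
```

With "or" semantics the caller gets back every row that matches any requested column and has to filter afterwards; with "and" semantics that filtering would happen before the rows are returned.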


Do you have any experience with the Apache Cassandra community in regards to mailing lists, meetups or do you frequent

We definitely have used the mailing list with our problems in the past and it’s been extremely responsive.  We have contributed a bug fix or two here or there back to the community and it’s been received very nicely.  We haven’t made too many meetups because of the pace that we’re operating at but the community has been great. 


One of the major wins for Cassandra is the size of the brain-share right now; the install base at this point is large enough that it’s rare to hit a problem that nobody else has hit.  So, it’s nice to have access to everybody’s shared knowledge and experiences fairly quickly.


It sounds like you’ve had a positive experience and you guys are moving very fast, so that’s fantastic.  Dave thanks so much for joining us today and giving us some insight into how you’re using Cassandra.  Before we sign off, is there anything else you’d like to add?

Like I said, we’re really happy users.  Also, for over a year now we’ve internally been using Scala sugar for Cassandra, with the intention of open-sourcing it; we have decided to do this and it is now available on GitHub.