Nexgate is a cloud-based service that lets you discover, monitor, and protect your brand’s image across social networks such as Facebook, Twitter, Google+, LinkedIn, and YouTube. Nexgate addresses the ever-growing problem of security and compliance in the social media space; it protects customers from social hacks, malicious content, rogue employees, spam, and other threats, just as you would protect other communication tools like email and IM.
A major part of our solution, which utilizes Apache Cassandra, is harvesting huge swarms of data out of the social web; we then classify and action this data based on policies that customers can configure.
As a concrete example, when we work with a bank, the bank wants to make sure that employees who are working on the social web aren’t posting things that violate FINRA regulations. We read their posts and comments in as close to real time as possible and classify them with natural language processing and machine learning. We then action those items based on policy.
Our problem requires a solution that allows us to store a large amount of data coming out of the social web. This problem fits the Cassandra use case really well because it’s column data: each item is a set of named fields, and that set varies by platform. The metadata that comes out of Facebook is slightly different from the metadata that comes out of Twitter, LinkedIn, or YouTube, so we have a need for this really huge table of social content that is 50% alike across all these different platforms.
We need the ability to quickly add new columns, and be able to write code that operates on those new columns in a performant manner. Cassandra lets us basically build an endlessly scalable store for all this social data.
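To make the "50% alike" idea concrete, here is a minimal sketch of rows that share some columns and differ in others. It uses plain Python dictionaries as a stand-in for a wide-column table, not a real Cassandra client, and every column name and value in it is made up for illustration.

```python
# Stand-in for a wide-column table: row key -> {column name: value}.
# Plain dicts are used only for illustration; this is not a Cassandra client.
items = {}

def put_item(row_key, **columns):
    """Insert or update a row; any column name is accepted, with no schema change."""
    items.setdefault(row_key, {}).update(columns)

def column_values(column):
    """Read one column across all rows, skipping rows that do not have it."""
    return {key: row[column] for key, row in items.items() if column in row}

# Shared columns plus platform-specific ones (all names are hypothetical).
put_item("fb:1", platform="facebook", author="alice", text="Hello", like_count=3)
put_item("tw:1", platform="twitter", author="bob", text="Hi", retweet_count=7)

# Columns that both rows have in common.
shared = set(items["fb:1"]) & set(items["tw:1"])
```

Code that cares about a new column, such as the hypothetical `retweet_count`, can call `column_values("retweet_count")` and simply never sees the rows that lack it.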
Our data model is basically two huge tables. We have a table that we call “items”, which has a row for every single post, tweet, comment, or share that we’ve ever scanned; it includes all the content and all of the metadata around it, including the identity of the author, the application, and all kinds of other information.
Then we have a second table that holds all of our classifications and categorizations. Again, it’s very simple because we’re not using many of the sophisticated features around column families; the advantage for us is that we can continue to scale Cassandra horizontally by just adding nodes.
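As a rough sketch of that two-table shape, the snippet below keeps items and classifications in two in-memory dictionaries keyed by the same item ID. The field names, the label names, and the idea of joining on an item ID are assumptions made for illustration; the actual Nexgate schema is not described here.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    """One scanned post, tweet, comment, or share (fields are hypothetical)."""
    item_id: str
    content: str
    metadata: dict = field(default_factory=dict)  # author, application, etc.

# Two in-memory stand-ins for the two tables, both keyed by item ID.
items_table = {}            # item_id -> Item
classifications_table = {}  # item_id -> set of classification labels

def store_item(item):
    """Write one row into the items table."""
    items_table[item.item_id] = item

def classify(item_id, label):
    """Attach a classification label to an item."""
    classifications_table.setdefault(item_id, set()).add(label)

def labels_for(item_id):
    """Return the labels for an item in sorted order (empty if none)."""
    return sorted(classifications_table.get(item_id, set()))

# Example data: the IDs, text, and labels are invented.
store_item(Item("p1", "Guaranteed returns, call me", {"author": "bob"}))
classify("p1", "spam")
classify("p1", "finra_risk")
```

Keeping classifications in their own table means an item can be reclassified later without rewriting the stored content.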
One thing I should also say is that we have requirements that can only be satisfied with a summarization of this data. We actually use MySQL as a front end, with a small number of fixed fields that represent things like an ID, the date, and the time that the content was posted. Information that we need to summarize and report across is kept in MySQL, almost like an index into the huge set of data stored in Cassandra.
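The index-plus-store pattern can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in `sqlite3` stands in for MySQL, a dictionary stands in for Cassandra, and the table name, column names, and sample rows are all invented.

```python
import sqlite3

# Stand-in for the large store: item ID -> full record (sample data is invented).
full_items = {
    "a1": {"text": "Quarterly results look great", "author": "alice"},
    "a2": {"text": "Guaranteed returns, call me", "author": "bob"},
    "a3": {"text": "Office closed Monday", "author": "alice"},
}

# Stand-in for the relational front end: a few fixed fields per item.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE item_index ("
    "item_id TEXT PRIMARY KEY, posted_date TEXT, posted_time TEXT)"
)
db.executemany(
    "INSERT INTO item_index VALUES (?, ?, ?)",
    [
        ("a1", "2013-05-01", "09:15:00"),
        ("a2", "2013-05-01", "17:40:00"),
        ("a3", "2013-05-02", "08:05:00"),
    ],
)

def posts_per_day():
    """Summary report answered entirely from the small relational index."""
    cursor = db.execute(
        "SELECT posted_date, COUNT(*) FROM item_index "
        "GROUP BY posted_date ORDER BY posted_date"
    )
    return cursor.fetchall()

def items_on(date):
    """Use the index to find IDs, then fetch full records from the big store."""
    cursor = db.execute(
        "SELECT item_id FROM item_index WHERE posted_date = ? ORDER BY posted_time",
        (date,),
    )
    return [full_items[item_id] for (item_id,) in cursor]
```

Reports such as `posts_per_day()` never touch the large store; only a drill-down such as `items_on("2013-05-01")` follows the IDs back to the full records.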
As many other startups do in the very beginning, we had a MySQL database that we started to throw data into. Quickly, we realized that the whole model of a single instance wasn’t going to solve our long-term problems because we actually wanted to have access to all of this data that we were collecting; we didn’t want to make hard decisions around throwing it out.
The accuracy of any data classification is predicated on the scale of the corpus that you can test it against. We knew we wanted to save this data forever, and we quickly saw the limitations of a traditional relational database model. We started to look at NoSQL solutions because they clearly fit our use case.
We checked out MongoDB, but it was very important to me, operationally, to have a solution that was multi-master, where every node in the cluster was itself a master. I shouldn’t have to think about [or manage] the details of which node is a master and which is a slave or all the details underneath that in terms of replication. I wanted seamless replication based on how I define the data model, where every node in the system was a master.
This requirement simplified the solution for me operationally and from a software development standpoint as well. I ended up taking a deeper look at Cassandra and Riak. Those were the two that we put head to head, because our final requirement was integration with Solr, which both technologies offered.
When we ran our benchmarks, Apache Cassandra won hands down in terms of reliability, ease of use, and the speed at which you could scale horizontally. It just won technologically.