October 16th, 2014

Social Media Security

Social media has become the new frontier for spammers and cyber-attackers. Unlike email, which is a well-established medium with a mature security infrastructure, social media is ripe for attack by bad actors.

Not only are fewer guardrails in place on the typical social media platform, but the payoff for a spammer is also much greater. An email reaches only the recipients it is addressed to, whereas a single social media post can reach thousands: one comment posted on a high-profile Facebook page can receive thousands of views.

At Nexgate, we offer security solutions for social media that combat threats such as spam and malware posted to a social media account. We automate the discovery, monitoring, and protection of social media accounts. To monitor and protect an account, we run more than a hundred in-house content classifiers, and all social media content is classified in real time.



Social Media Data

Social media data is rough. A single item is roughly 1 KB, including content and metadata. The content includes the actual message text and links. Metadata includes the post time, the poster ID, and the account on which the item was posted. Metadata also varies by social platform and can include engagement activity such as Likes, Followers, or Subscribers.
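
As a rough illustration of the shape described above, a single item could be modeled like this. It is a minimal sketch, not our actual schema: the class name `SocialItem`, every field name, and the JSON encoding used to measure size are assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class SocialItem:
    # Content: the message text and any links it contains.
    text: str
    links: List[str]
    # Metadata present on every platform.
    post_time: float  # Unix timestamp in seconds (assumed representation)
    poster_id: str
    account_id: str   # the account the item was posted on
    # Platform-specific engagement counters, e.g. {"likes": 3}.
    engagement: Dict[str, int] = field(default_factory=dict)

    def size_bytes(self) -> int:
        """Size of the item when serialized as UTF-8 JSON (one possible encoding)."""
        return len(json.dumps(asdict(self)).encode("utf-8"))


item = SocialItem(
    text="Check this out http://example.test/deal",
    links=["http://example.test/deal"],
    post_time=1413417600.0,
    poster_id="user-42",
    account_id="brand-page-7",
    engagement={"likes": 3},
)
print(item.size_bytes())
```

Keeping the engagement counters in a free-form dictionary is one way to handle metadata that differs from platform to platform.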


Why We Chose Cassandra

When we first started building our product, we threw everything into MySQL, but quickly realized that we also needed a NoSQL solution. Data that was fixed length, non-null, heavily indexed, and required group access fit well in SQL. For data that is variable length, commonly null, lightly indexed, accessed one item at a time, and frequently searched by text, we needed a NoSQL solution.

All of this would be difficult without the right tools. As a startup, we were attracted to technologies that are easy to use and simple to deploy. For a distributed NoSQL store, we were interested in proven horizontal scalability and operational simplicity (decentralized system). Because the nature of our product depended on real-time monitoring and classification, that distributed store needed to be highly available and provide quick access.

Datastax Enterprise Cassandra was the perfect choice for us in all these respects. We were also excited to use Datastax to help integrate Solr and Hadoop (now we are thrilled about integrating with Spark).


Spam Detection using Cassandra

Social media spam can be a single link directing to a malware site:



Or it can be less obvious, and more personal. This is extremely common. Here, the same user has posted the same message across different social media accounts (screenshot taken from Nexgate):


These are some brief examples of social media spam. To learn more about the state of social media spam, please check out our white paper.

We could create spam signatures to catch this type of content, but that would happen “after the fact” and be too slow to catch spam in real time. Instead, we leverage the data model in Cassandra to catch these types of spam quickly and effortlessly.

Even though Cassandra is a schema-less NoSQL database, it is worth carefully defining the data model based on how you will query the data. A typical data model in Cassandra looks like this:


For our purposes, we want to detect spam content that has been posted multiple times, since spammers tend to post the same message repeatedly. Therefore, we use the following:

Row key: Hash of the social media content

Column Key: Unique ID of commenter (which is unique to each social media platform and strictly increases with time)

Column Value: Item ID and time of post

The “Item ID” as part of the column value is an internal variable that maps back to our SQL tables to help us determine other characteristics about the post.
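
The layout above can be sketched in plain Python as a dictionary of rows, each holding a dictionary of columns. This is only an in-memory illustration of the wide-row idea, not our production code and not a Cassandra client. The names `content_hash` and `DuplicateIndex`, the choice of SHA-256, and the lowercase/whitespace normalization are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def content_hash(text: str) -> str:
    """Row key: hash of the message content.

    SHA-256 and the lowercase/whitespace normalization are assumptions.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DuplicateIndex:
    """In-memory stand-in for the wide row: row key -> {column key -> column value}."""

    def __init__(self) -> None:
        # Maps content hash -> {commenter ID -> (item ID, time of post)}.
        self.rows = defaultdict(dict)

    def record(self, text: str, commenter_id: int, item_id: int, posted_at: float) -> int:
        """Store one post and return how many columns the row now holds."""
        row = self.rows[content_hash(text)]
        # Writing an existing column key overwrites it, so nothing is double counted.
        row[commenter_id] = (item_id, posted_at)
        return len(row)

    def count(self, text: str) -> int:
        """Number of columns stored under this content's row key."""
        return len(self.rows.get(content_hash(text), {}))


index = DuplicateIndex()
index.record("Win a FREE phone http://example.test/x", commenter_id=101, item_id=1, posted_at=1000.0)
index.record("win a free phone   http://example.test/x", commenter_id=102, item_id=2, posted_at=1005.0)
# Same column key as the first call: the column is updated, not added.
index.record("Win a FREE phone http://example.test/x", commenter_id=101, item_id=1, posted_at=1000.0)
print(index.count("Win a FREE phone http://example.test/x"))
```

In this example the third call repeats column key 101, so the row still holds two columns and the final line prints 2.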

We find this data model to be powerful because:

  • It is easy to determine how many times a same-content post is made – count the number of columns! We will never double count because writing an existing column key simply updates that column instead of adding a new one.

  • The content is indexed, allowing for quick reads and writes.

  • By reading the column value, we can extract useful time-series information for duplicated posts. This allows us to do analysis on posting activity to determine if there are patterns around posting activity, such as bursts or regular-interval periods.
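
The timing analysis mentioned in the last bullet might look like the sketch below. It assumes the post times have already been read out of the column values as Unix timestamps in seconds. The function names and the thresholds (`window`, `min_posts`, `tolerance`) are hypothetical and chosen only for illustration; they are not the checks we run in production.

```python
from statistics import mean, pstdev


def has_burst(times, window=60.0, min_posts=5):
    """True if at least `min_posts` timestamps fall within some `window`-second span."""
    ordered = sorted(times)
    start = 0
    for end, t in enumerate(ordered):
        # Slide the left edge forward until the span fits inside the window.
        while t - ordered[start] > window:
            start += 1
        if end - start + 1 >= min_posts:
            return True
    return False


def is_regular(times, tolerance=0.1):
    """True if the gaps between consecutive posts are nearly constant.

    Uses the coefficient of variation (standard deviation / mean) of the gaps.
    """
    ordered = sorted(times)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        return False  # not enough posts to talk about a pattern
    avg = mean(gaps)
    return avg > 0 and pstdev(gaps) / avg <= tolerance


burst_times = [0, 5, 10, 15, 20]          # five posts within 20 seconds
steady_times = [0, 100, 200, 300, 400]    # one post every 100 seconds
print(has_burst(burst_times), is_regular(steady_times))
```

A burst check of this kind flags many copies arriving close together, while the regular-interval check flags scheduled, bot-like posting even when the copies are spread out.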


After deploying our data model to production, we began to catch much more spam in real time. This data model allowed us to help our customers automate the removal of inappropriate spam messages. In one scenario, a customer received over 25,000 spam messages over the span of a few hours. It would be much too expensive and time-consuming for a person to sift through all of that content and delete it by hand.

This is only one of our many use cases for Cassandra, and we haven’t gone into detail about the others. However, we hope this simple example demonstrates the power and simplicity of using Cassandra as a distributed NoSQL store. With Datastax Enterprise support behind us, we feel extremely well-prepared and confident in our production instances.

If you are interested in social media security at a killer startup, we are hiring! Check out our open positions: http://nexgate.com/about/careers/

Nexgate is a social media security startup that helps automate the discovery, monitoring, and protection of social media accounts (find out how through http://nexgate.com/demo).

Interested in more talks from Cassandra Summit 2014? Check out the Cassandra Summit 2014 YouTube playlist and register for the Cassandra Summit Europe 2014 conference today.