"As the place to store all of our tweets and guarantee we would never lose anything, Cassandra is the best technology out there."
— Ken Anderson, Associate Professor and Co-Director at the University of Colorado’s Project EPIC

I’m an Associate Professor in the Department of Computer Science here at CU Boulder, and my research areas are software engineering, web engineering, and software architecture. I’m the co-director of Project EPIC. EPIC stands for Empowering the Public with Information in Crisis (http://epic.cs.colorado.edu); Project EPIC is one of the largest NSF grants, actually the largest, in the area of what we call crisis informatics. We look at how people use social media during times of mass emergency.

In particular, what we do is collect lots and lots of data from Twitter, and we do some automated analysis of that Twitter data. We also have ethnographers and the like who go in and look at the conversations in the tweet stream, tweet by tweet, over thousands of tweets, to understand what people are doing online during and after those events.

Monitoring Haiti – 2010

During the 2010 Haiti earthquake, for instance, we collected data, and one of the things we pulled out of our Twitter dataset was the birth of a group called Humanity Road. This was a purely digital volunteer group, consisting of people who came to the event remotely and wanted to help.

Through monitoring the Twitter stream, they found that there were people on the ground in Haiti who were reporting useful information, and the group of volunteers who eventually founded Humanity Road would go in and make sure that those people had minutes on their cell phones (so they could keep tweeting). They’d organize donation drives to purchase those minutes. They would grab information from the Red Cross and retweet it. They would also vet information: if they saw something go by that wasn’t confirmed, they’d try to confirm it, and then they’d post it far and wide.

What we do at Project EPIC is study people like that to understand how we could make them more effective in future emergency responses by designing and implementing improved software services for collaborating and coordinating during mass emergency events.

Storing Tweets in #Cassandra

At the start of an event (like the September 2013 Colorado floods), some of our team members will make use of services like TweetDeck to get a feel for the hashtags and keywords that people are using to discuss the event. We then make use of our Twitter data collection infrastructure to submit those keywords to the Twitter Streaming API. We take those tweets and put them into an in-memory queue which is processed by a set of filters to extract various pieces of information about the tweets to help us keep track of the collection. The last filter in the chain has the job of making the tweet persistent by storing it in Cassandra. We have a four-node Cassandra cluster running 24/7 waiting to store those tweets.
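
To make that pipeline concrete, here is a minimal sketch of the idea in Python with the DataStax driver. The queue, filter names, keyspace, and table are hypothetical stand-ins, not Project EPIC’s actual code:

    import json
    import queue

    from cassandra.cluster import Cluster  # DataStax Python driver

    # In-memory queue fed by the Twitter Streaming API collector (hypothetical).
    tweet_queue = queue.Queue()

    cluster = Cluster(["127.0.0.1"])   # contact point for the Cassandra cluster
    session = cluster.connect("epic")  # hypothetical keyspace

    insert = session.prepare(
        "INSERT INTO raw_tweets (event, tweet_id, body) VALUES (?, ?, ?)"
    )

    def keyword_filter(tweet):
        # An intermediate filter: tag the tweet with the keywords it matched.
        tweet["matched"] = [k for k in ("flood", "#boulderflood")
                            if k in tweet["text"].lower()]
        return tweet

    def persist_filter(tweet, event="2013_colorado_floods"):
        # Last filter in the chain: make the tweet durable in Cassandra.
        session.execute(insert, (event, tweet["id"], json.dumps(tweet)))
        return tweet

    while True:
        tweet = tweet_queue.get()  # blocks until the collector enqueues a tweet
        persist_filter(keyword_filter(tweet))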


We can create a new event whenever it’s needed, and as the tweets come streaming out of the streaming API they go straight into Cassandra, which then guarantees that we’re never going to lose a tweet. We’ll always be able to retrieve it later on. Depending on the event, we see everything from one tweet a second (for small events) up to 100 tweets per second for events like the 2011 Japan Earthquake or the 2012 London Olympics. 100 tweets per second works out to 8.64M tweets per day (100 tweets × 86,400 seconds in a day), and with Cassandra we never miss getting those tweets persisted for later analysis.

Moving from MySQL to Become Disaster Ready

Well, when we started on our Twitter data collection infrastructure back in January 2010, we made use of a traditional n-layer web application architecture, with separate layers for apps, services, persistence, and the like. We started with MySQL as the initial data store and used Spring’s integration with Hibernate to store and access data in it. It was familiar and easy to use, and with SQL you can, of course, run lots of arbitrary queries on the data long after storing it. But during the 2011 Japan earthquake, our MySQL-based infrastructure finally fell over.

In that event, we actually did have a case where we were receiving hundreds of tweets per second for sustained amounts of time, and MySQL could not clear the in-memory queue fast enough. We sat there watching the in-memory queue grow to several tens of thousands of tweets, worried that if we had a power outage, or if any other problem were to kick in, we’d lose all that data.

The Move to Cassandra

At that time, my grad student, Aaron Schram, was doing a deep dive into the NoSQL technology space, found (roughly in mid-2011) that there was a lot of energy around Cassandra and its community, and decided to give it a try. We particularly looked at the characteristics that we needed. We needed something that was always available: at any given moment, there had to be a way to write a tweet. We then wanted to make sure we’d never lose that tweet, so we needed replication once a tweet was successfully stored. And we needed something that was scalable… large-scale events generate tens of millions to hundreds of millions of tweets, and we never want to delete that data since there’s no way to access it again via the Twitter search API. The size of our data sets is going to do nothing but grow, so Cassandra’s horizontal scalability was exactly what we needed.

In addition, we needed flexibility to store our data without worrying about its structure. Twitter changes the metadata that is associated with a tweet frequently and we didn’t want to be in a situation where we had to deal with schema migration each time the structure of a tweet changed. With Cassandra, we can just store the whole tweet object and deal with the structural changes of those objects over time during the analysis stage.
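
A hedged sketch of that approach (keyspace and table names are illustrative, not Project EPIC’s): store each tweet as an opaque JSON blob, keyed by event, and defer all interpretation of its structure to analysis time.

    import json

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("epic")  # hypothetical keyspace

    # One partition per event; the tweet itself is stored verbatim as text.
    session.execute("""
        CREATE TABLE IF NOT EXISTS raw_tweets (
            event text,
            tweet_id bigint,
            body text,
            PRIMARY KEY (event, tweet_id)
        )
    """)

    def store(event, tweet):
        # No per-field schema: if Twitter adds, renames, or removes metadata
        # fields, this write path is unaffected.
        session.execute(
            "INSERT INTO raw_tweets (event, tweet_id, body) VALUES (%s, %s, %s)",
            (event, tweet["id"], json.dumps(tweet)),
        )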

After looking at all the different technologies out there, Cassandra bubbled to the top in terms of having exactly those characteristics that we could depend on. Now, we knew that it wasn’t as great a technology for arbitrary queries after the fact, so we have been looking at various ways to do analysis with some other tools, although we do want to look at using, for instance, DataStax Enterprise with its support for Solr and Hadoop as a way to apply arbitrary queries over the data we have stored in Cassandra. However, as the place to store all of our tweets and guarantee we would never lose anything, Cassandra is the best technology out there.

Perks of Cassandra

For us, it really was matching those characteristics of availability, replication, scalability, and flexibility, and finding the right tool for the job. We have hundreds of events stored consuming terabytes of data and we know that we’re never going to lose that data and, if we ever need more space, we can just add additional nodes to our cluster.

We also like that we have the ability to play with the structure of our cluster. Right now, our four nodes are physical machines sitting in a server rack here at CU. But, in the future, we may want to add additional servers by spinning up virtualized instances on Amazon’s EC2 and then use Cassandra’s data center capabilities to share and integrate those nodes into our existing cluster.
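
As a sketch of how that could look (the data center names here are made up), a keyspace defined with NetworkTopologyStrategy assigns a replica count per data center, so virtualized EC2 nodes could join as a second data center without disturbing the on-campus nodes:

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()

    # Hypothetical data centers: 'campus' for the physical rack at CU,
    # 'ec2' for virtualized instances. Each gets its own replica count.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS epic
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'campus': 3,
            'ec2': 2
        }
    """)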

We think Cassandra’s a great choice for storing and managing data, and it’s helped us reach a point where we can now focus almost exclusively on designing and developing the analytics infrastructure that we will layer on top of our existing Cassandra-based data collection infrastructure.

DevCenter 1.2 delivers support for Cassandra 2.1 and query tracing

October 16, 2014


We’re very pleased to announce the availability of DataStax DevCenter 1.2, which you can download now. We’re excited to see how DevCenter has already become the de facto query and development tool for those of you working with Cassandra and DataStax Enterprise, and now with version 1.2, we’ve added additional support and options to make your development work even easier.

Version 1.2 of DevCenter delivers full support for the many new features in Apache Cassandra 2.1, including user-defined types and tuples. DevCenter’s built-in validations, quick fix suggestions, updated code assistance engine, and new snippets can greatly simplify your work with all the new features of Cassandra 2.1.

[Screenshot: editing a user-defined type in DevCenter 1.2]
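
For readers new to these features, here is a minimal sketch of a user-defined type and a tuple column in CQL, executed here through the DataStax Python driver against a hypothetical keyspace (in Cassandra 2.1, UDT columns must be declared frozen):

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("demo")  # hypothetical keyspace

    # A user-defined type groups related fields under a single column.
    session.execute("""
        CREATE TYPE IF NOT EXISTS address (
            street text,
            city text,
            zip text
        )
    """)

    # Tuples hold fixed-length sequences of typed values; they are frozen implicitly.
    session.execute("""
        CREATE TABLE IF NOT EXISTS users (
            id uuid PRIMARY KEY,
            home frozen<address>,
            location tuple<float, float>
        )
    """)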

DevCenter 1.2 also has an updated Schema explorer, which displays the user-defined types contained within your keyspaces.

[Screenshot: the DevCenter 1.2 Schema explorer showing user-defined types]

Because query performance is so important, we’ve added query tracing capabilities that display the details of each step taken to complete a query. When a query or a longer CQL script is executed, the Results view presents both the query results and, on a new separate tracing tab, performance diagnostics for the query. We’ve added visual cues that simplify understanding and navigating the details of the performance trace, which helps you tune the query for better performance if needed.

[Screenshot: the query tracing tab in DevCenter 1.2]
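
The trace data DevCenter displays is the same tracing information Cassandra itself exposes, so it can also be fetched programmatically. A minimal sketch using recent versions of the DataStax Python driver (keyspace and table are hypothetical):

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("demo")  # hypothetical keyspace

    # Ask Cassandra to record a trace for this query, then walk its events.
    rows = session.execute("SELECT * FROM users LIMIT 10", trace=True)
    trace = rows.get_query_trace()

    print("coordinator:", trace.coordinator, "duration:", trace.duration)
    for event in trace.events:
        # Each event records what happened, on which node, and the elapsed time.
        print(event.source_elapsed, event.source, event.description)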

Finally, DevCenter 1.2 contains a series of visual interface improvements (e.g. a better wizard for creating new connections, an option to clone an existing connection, and opening a connection by double-clicking it) that make using DevCenter even easier and more enjoyable than before. A more complete list of changes can be found here.

Download DevCenter 1.2 now and be sure to send us your feedback on what more we can do to make DevCenter even better.

Cassandra at Nexgate for Social Media Security

October 16, 2014


Social Media Security

Social media has become the new frontier for spammers and cyber-attackers. Unlike email, which is a well-established medium with a mature security infrastructure, social media is ripe for attack by bad actors.

Not only are fewer guardrails in place on the typical social media platform, but the payoff for a spammer is also much greater. Whereas a bad actor can only send one email to each recipient, just a single social media post is needed to reach thousands. A single comment posted on a high-profile Facebook page can receive thousands of views.

At Nexgate, we offer security solutions for social media, which combat things like spam and malware posted to a social media account. We automate the discovery, monitoring, and protection of social media accounts. To monitor and protect a social media account, Nexgate has more than a hundred in-house content classifiers, and all social media content is classified in real time.

Social Media Data

Social media data is messy. A typical post is roughly 1 KB in size, including content and metadata. The content includes the actual message text and links. Metadata includes data such as post time, poster ID, and the account on which it was posted. Metadata varies by social platform and can include engagement activity such as Likes, Followers, or Subscribers.
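
Purely as an illustration (these field names are hypothetical, not Nexgate’s actual schema), a single post record of roughly 1 KB might look like:

    # A hypothetical ~1 KB post record: content plus platform metadata.
    sample_post = {
        # Content: the actual message text and any links.
        "text": "Win a free tablet! Click here: http://example.com/xyz",
        "links": ["http://example.com/xyz"],
        # Metadata: when, who, and on what account it was posted.
        "posted_at": "2014-10-16T12:34:56Z",
        "poster_id": 123456789,
        "account": "acme-brand-page",
        # Engagement metadata varies by platform.
        "engagement": {"likes": 42, "shares": 3},
    }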

 

Why We Chose Cassandra

When we first started building our product, we threw everything into MySQL, but quickly realized that we also needed a NoSQL solution. Data that was fixed-length, non-null, heavily indexed, and accessed in groups fit well in SQL. To accommodate data that is variable-length, commonly null, softly indexed, accessed individually, and frequently text-searched, we needed a NoSQL solution.

All of this would be difficult without the right tools. As a startup, we were attracted to technologies that are easy to use and simple to deploy. For a distributed NoSQL store, we were interested in proven horizontal scalability and operational simplicity (decentralized system). Because the nature of our product depended on real-time monitoring and classification, that distributed store needed to be highly available and provide quick access.

DataStax Enterprise was the perfect choice for us in all these respects. We were also excited to use DataStax to help integrate Solr and Hadoop (now we are thrilled about integrating with Spark).

Spam Detection using Cassandra

Social media spam can be a single link directing to a malware site:

[Screenshot: a spam post consisting of a bare link to a malware site]

Or it can be less obvious, and more personal. This is extremely common. Here, the same user has posted the same message across different social media accounts (screenshot taken from Nexgate):

[Screenshot: the same spam message posted by one user across several accounts]

These are some brief examples of social media spam. To learn more about the state of social media spam, please check out our white paper.

We can create spam signatures to catch this type of content, but it would be “after the fact” and too slow to catch spam in real time. We can leverage the data model in Cassandra to help us catch these types of spam quickly and effortlessly.

Even though Cassandra is a NoSQL, schema-less database, it is worth carefully defining the data model, basing it on how you will query the data. A typical data model in Cassandra maps a row key to a sorted collection of column key/value pairs:

[Diagram: a typical Cassandra row, mapping one row key to many column key/value pairs]

For our purposes, we want to determine spam content that has been posted duplicate times, since spammers tend to post same-content messages. Therefore, we use the following:

  • Row key: hash of the social media content

  • Column key: unique ID of the commenter (which is unique to each social media platform and strictly increases with time)

  • Column value: item ID and time of post

The “Item ID” as part of the column value is an internal variable that maps back to our SQL tables to help us determine other characteristics about the post.

We find this data model to be powerful because:

  • It is easy to determine how many times a same-content post has been made: count the number of columns! We will never double count, because a repeat post from the same commenter simply updates the existing column instead of adding a new one.

  • The content is indexed, allowing for quick reads and writes.

  • By reading the column value, we can extract useful time-series information for duplicated posts. This allows us to do analysis on posting activity to determine if there are patterns around posting activity, such as bursts or regular-interval periods.
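
In CQL terms, this model maps naturally onto a table with the content hash as the partition key and the commenter ID as a clustering column. A minimal sketch (keyspace and table names are ours, not Nexgate’s):

    import datetime
    import hashlib

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("demo")  # hypothetical keyspace

    # One partition per content hash; one row (column) per commenter.
    session.execute("""
        CREATE TABLE IF NOT EXISTS duplicate_posts (
            content_hash text,
            commenter_id bigint,
            item_id bigint,
            posted_at timestamp,
            PRIMARY KEY (content_hash, commenter_id)
        )
    """)

    content_hash = hashlib.sha1(b"Win a free tablet! http://example.com/xyz").hexdigest()

    # A repeat post of the same content by the same commenter overwrites the
    # existing row, so duplicates are never double counted.
    session.execute(
        """INSERT INTO duplicate_posts (content_hash, commenter_id, item_id, posted_at)
           VALUES (%s, %s, %s, %s)""",
        (content_hash, 123456789, 42, datetime.datetime.utcnow()),
    )

    # How many distinct commenters have posted this exact content?
    count = session.execute(
        "SELECT COUNT(*) FROM duplicate_posts WHERE content_hash = %s",
        (content_hash,),
    ).one()[0]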

 

After putting our data model into production, we began to catch much more spam in real time. This data model allowed us to help our customers automate the removal of inappropriate spam messages. In one scenario, a customer received over 25,000 spam messages in the span of a few hours; it would be much too expensive and time-consuming for a person to sift through all that content and delete it by hand.

This is only one of our many use cases for Cassandra, and we haven’t gone into detail about how else we use it. However, we hope this simple example demonstrates the power and simplicity of using Cassandra as a distributed NoSQL store. Along with DataStax Enterprise support, we feel extremely well-prepared and confident in our production instances.

If you are interested in social media security at a killer startup, we are hiring! Check out our open positions: http://nexgate.com/about/careers/

Nexgate is a social media security startup that helps automate the discovery, monitoring, and protection of social media accounts (find out how through http://nexgate.com/demo).

Interested in more talks from Cassandra Summit 2014? Check out the Cassandra Summit 2014 YouTube playlist and register for the Cassandra Summit Europe 2014 conference today.

"Implementing Cassandra cut our costs to the point where we were paying around a quarter of what we were paying before. Not only that, but it also freed us to just throw data at the cluster because it was much more scalable and we could add nodes whenever needed."
— Rick Branson, Infrastructure Software Engineer at Instagram

Instagram is a free photo sharing app to take photos, apply filters, and share them on social networks such as Facebook, Twitter, and the like. It allows its over 200 million users to capture and customize their photos and videos to share with the world.

Cutting costs with Cassandra

Initially our deployment was for storing auditing information related to security and site integrity purposes. To break down that concept, it means fighting spam, finding abusive users, and other things like that. It was really a sweet spot for the Cassandra offering.  

Originally, these features were built on Redis, but the data size was growing too rapidly, and keeping it in memory was not a productive way to go. It was a really high write rate and a really low read rate, a spot where Cassandra really pops and shines, so the switch ended up being a no-brainer for us to adopt Cassandra in that area. We started out with a 3-node cluster, and that use case has grown to a 12-node cluster. That was our path for our main application backend stuff.

Fraud detection, newsfeed, & inbox

For the first use case mentioned above for our backend, we moved off of a Redis master/slave replication setup; it was just too costly to have that. We moved from having everything in memory, with very large instances, to just putting everything on disks; when you really don’t need to read that often, it works fine having it on disks. Implementing Cassandra cut our costs to the point where we were paying around a quarter of what we were paying before. Not only that, but it also freed us to just throw data at the cluster because it was much more scalable and we could add nodes whenever needed.  When you’re going from an un-sharded setup to a sharded setup, it can be a pain; you basically get that for free with Cassandra, where you don’t have to go through the painful process of sharding your data.

Recently, we decided to port another use case that is much more critical. We spent time getting everyone on the team up-to-date with Cassandra, reading documentation, and learning how to operate it effectively. We chose to use Cassandra for what we call the “inboxes” or the newsfeed part of our app. Basically, it’s a feed of all the activity associated with a given user’s account; you can see if people like your photos or follow you, if your friends have joined Instagram, if you’ve received comments, etc. The reason we decided to move that to Cassandra was that it was previously in Redis and we were experiencing the same memory limitations.

For this “inbox” use case, the feed was already sharded; it was a 32-node cluster with 16 masters and 16 fail-over replicas and, of course, we had to manage all of that sharding ourselves. We noticed that we were running out of space on these machines, and they weren’t really consuming a lot of CPU (Redis can be incredibly efficient with CPU), but obviously when you run out of memory… you run out of memory.

It just ended up being more cost effective and easy to operate a Cassandra cluster for this use case, where you don’t need the kind of in-memory level performance. Durability was a big factor as well that Redis didn’t provide effectively; I touched on that in my Cassandra Summit 2013 presentation.

Deployment at Instagram

We’ve had a really good experience with the reliability and availability of Cassandra. It’s a much different workload: we’re running on SSDs with Cassandra version 1.2, and we’re able to get that latest version with all of the nice bells and whistles, including vnodes, leveled compaction, etc. It was a very successful project and it only took us a few days to convert everything over.
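
As a sketch of what opting into leveled compaction looks like (the table name is hypothetical, not Instagram’s schema): it is set per table, and it bounds the number of SSTables a read has to touch, which pairs well with SSDs at the cost of extra compaction I/O.

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("demo")  # hypothetical keyspace

    # Leveled compaction keeps each partition in a small number of SSTables,
    # bounding disk seeks per read; the trade-off is more compaction I/O.
    session.execute("""
        ALTER TABLE inbox_activity
        WITH compaction = {'class': 'LeveledCompactionStrategy'}
    """)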

Some details on our cluster: It’s a 12 node cluster of EC2 hi1.4xlarge instances; we store around 1.2TB of data across this cluster. At peak, we’re doing around 20,000 writes per second to that specific cluster and around 15,000 reads per second. We’ve been really impressed with how well Cassandra has been able to drop into that role. We also ended up reducing our footprint, so that’s been a really good experience for us. We learned a lot from that first implementation and we were able to apply that knowledge to our most recent implementation. Every time someone pulls up their Instagram now, they’re hitting that 12 node Cassandra cluster to pull their data from; it’s really exciting.

Dig into the docs

I would recommend digging into the system and reading all of the Cassandra documentation, especially the material on the DataStax website. The best part of the documentation, I’ve noticed, is that it has a lot of extra information about the internals; really understanding those is important. With any database or datastore you use, you’re really going to need to dig into the documentation in order to use it the way it’s intended. People often run into situations where they get themselves cornered by adopting a solution too quickly or incorrectly and not doing their homework. Specifically, with a datastore, it’s really important that it’s the most stable and reliable part of your stack.

And people can always find me on the IRC channel trolling away.

Cassandra Summit 2014: Cassandra at Instagram 2014

"Cassandra frees us from a world of slave-promotion scenarios and intricate manual partitioning / rebalancing that we had to consider in the RDBMS world."
— Jeff Jirsa, Co-founder and CTO at Bad Juju Games

Bad Juju Games is best known for our technology suite that we provide to video game developers, which collects a significant amount of data from video games, ingests it, and provides it back to developers for use in-game or on their websites. You may have seen our work floating around – sites like the World Tekken Federation for Namco (case study), or Call of Duty: Elite for Activision. I’m the co-founder and CTO.

Our main product GOOP (Gaming Optimized Online Platform) provides a wide range of cross-platform gaming features, everything from cloud-saved player profiles, thousands of permutations of leaderboards, and very granular real-time usage analytics to wagering and tournament systems for games on iOS, Android, PC, Xbox, PlayStation, Wii, etc. Cassandra is our primary datastore for almost all data flowing into the system, storing and aggregating thousands of data points for millions of players.

Leveling up with Cassandra

We actually started investigating Cassandra around early 2010, with our first production launch using Cassandra 0.6 to support Namco Bandai’s title “Ace Combat: Assault Horizon”. At that time, we investigated a number of tools, including sharded Redis implementations (which we continue to use in very specific situations), MongoDB (which we ended up abandoning due to data durability concerns, some of which have been addressed, but were REALLY significant in 2010), Riak, Cassandra, and of course, balanced all of the options against our existing partitioned, replicated MySQL. While we liked Riak as well, we eventually selected Cassandra due to community support and involvement, and the rapid pace with which it was adding features.

The bottom line is we were looking for a solution that scaled well – we were already using dozens of MySQL servers doing tens of thousands of queries per second each, with many terabytes of data, and we were looking at our upcoming growth curve, and we wanted an option that would handle that level of write load gracefully. We had been building fairly complex replicated and sharded RDBMS clusters for some time, and we were looking for a solution that would grow with us, yet handle significantly more write bandwidth and better tolerate single node failures. Cassandra was fairly young at that time, but it fit our needs quite well.

Deployment at Bad Juju

As previously mentioned, our very first production deployment was Cassandra 0.6, supporting Namco’s ACAH title. That cluster ended up at approximately 10 nodes supporting over a million Xbox360 and PS3 players, with about 600GB of data per node. Our typical Xbox/Playstation titles tend to fall in that range (12-24 dedicated nodes, 500GB+ data per node, using AWS i2 instances).

Since then we’ve had clusters on 1.0 and 1.2, and in early 2014 we announced the first version of our API available to the public, which runs on our first production 2.0 cluster (currently 2.0.10) and is our first cluster to use CQL tables rather than the classic Thrift column families.

The two biggest factors and benefits of Cassandra for us are being able to better handle node failures without impacting the games, and simply being able to handle very high write volume without having to overthink the database layer. Cassandra frees us from a world of slave-promotion scenarios and intricate manual partitioning / rebalancing that we had to consider in the RDBMS world. Beyond that, we really take advantage of cross-datacenter replication (it’s not uncommon for us to have clusters that live both at AWS and in our on-site DC).

Community

The community support is incredible. #Cassandra on Freenode is fairly active, and most people there tend to be more than willing to help out (it’s also quite helpful that DataStax has a few guys idling in there that can chime in for clarification when needed). The Cassandra meetups and Summits are active with a wide mix of experience levels, and it’s great that even the larger companies using Cassandra tend to have engineers willing to share their experiences with newcomers as they start using Cassandra.

Advice

In both developing for Cassandra and operating it, the concepts aren’t incredibly difficult, but they can be significantly different if you’re coming from the relational database world. Don’t rely simply on docs and intuition; read about a feature, and then ask someone about it. There are features that encourage poor design choices (like secondary indexes), features with limitations you should understand (counters, which we use extensively, but which have caveats you should know about), and a lot of operational considerations where others have valuable experience that will save you headaches later (it seemed like there were a few talks at Cassandra Summit 2014 where ops teams realized that running multi-datacenter over a VPN ended up not being as easy as they expected due to bandwidth usage). Beyond that, be sure to track as many metrics as you can, and keep an eye on them over time to make sure everything’s healthy as your cluster grows.
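
To make the counter caveat concrete, here is a minimal sketch (the keyspace and table names are made up): counters live in dedicated tables, and increments are not idempotent, so a retried timeout can over- or under-count.

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("demo")  # hypothetical keyspace

    # Counter tables may contain only primary key columns and counters.
    session.execute("""
        CREATE TABLE IF NOT EXISTS leaderboard_hits (
            board text PRIMARY KEY,
            views counter
        )
    """)

    # Not idempotent: if this write times out and the client retries it,
    # the counter may be incremented twice.
    session.execute(
        "UPDATE leaderboard_hits SET views = views + 1 WHERE board = %s",
        ("tekken_global",),
    )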

If you’re looking at various “NoSQL” options, take your time. It’s a fun time to be writing all this distributed “big data” backed software, because Cassandra / Riak / et al. enable a lot of things that were REALLY difficult before. But take the time to do the research; don’t just choose the one that’s easiest to set up, or the one that has the easiest-to-use interface for your programming language of choice – make sure whatever you choose is going to serve you well in production, not just in development.

Thanks for giving me an opportunity to talk about how we use Cassandra!
