Top Posts This Month
Upcoming Webinars
Global Posts

Cassandra Summit 2014: Fuzzy Matching at Scale

October 24, 2014


In the last few months I’ve given two different talks about scalable fuzzy matching.

The first was a Strata in San Jose, titled Similarity at Scale.  In that talk I focused mostly on techniques for doing fuzzy matching (or joins) between large data sets, primarily via Cascading workflows.

More recently I presented at Cassandra Summit 2014, on Fuzzy Entity Matching. This was a different take on the same issue, where the focus was ad hoc queries to match one target against a large corpus. The approach I covered in depth was to use Solr queries to create a reduced set of candidates, after which you could apply typical “match distance” heuristics to re-score/re-rank the results.

The video for this second talk is freely available (thanks, DataStax!) and you can watch me lead off with an “uhm” right here:

Fuzzy Matching at Scale” was created by Ken Krugler, President of Scale Unlimited.

Notes from Cassandra Summit: The Rise of Top Line IT

October 23, 2014


You can always tell a great tech event from two things: the Q&A in the sessions, and the side conversations about real-world uses. (A lousy tech event has fancy giveaways and gimmicks and same-as-last-year demos, but that’s another story.) By all the standards I know, Cassandra Summit was a great event.

A lot has changed since open source was just a way to make your company IT environment work better, with high-quality replacements for existing product categories. Today, open source projects such as Cassandra are changing the categories from the user side, something like what Doc Searls was on to back in 2004, when he talked about DIY IT. Instead of just making a more flexible, more cost-effective entry in an existing product category, today’s open source is carving out whole new ones. And the developers are coming from all kinds of companies,
not just cool startups.

And the big win from the new generation of open source? It’s transforming the IT function from just a cost center, the department that doesn’t matter, into a way to build new things. But moving from the old cost center IT to the new IT that moves the top line means we all have to be ruthless in eliminating the old busywork. The software is out there to let the machines do their own routine babysitting, so we had better use it. We were all impressed with how Cassandra handled a recent emergency cloud reboot, and that kind of resiliency is key to how we can make the new generation come together.

Q&A on OSv

We got a lot of great questions at our session on OSv and afterward. OSv is a new OS for VMs in the cloud, with none of the complexities of an old bare-metal OS. No local config files, no permissions, just the basics that you need to run a single Java or POSIX application at maximum speed. Building a complete VM? That’s easy to do on any platform, and only takes only nine seconds out of your day. And, yes, we can bring every JMX management object out to its own REST endpoint with Jolokia. This means that managing the whole virtual appliance–application, JVM, OS–is all in one place, no need to mess around with arcane OS-level tools.

All in all, an OSv virtual machine gives you more Cassandra throughput, along with the all-important benefit of freeing up your time as an IT professional to focus on creating new value, not just keeping the old stuff up. Cassandra is full of great computer science that makes it resilient and scalable. OSv is full of great computer science that gives it high throughput and low overhead on any cloud, public or private. Together, they take a lot of the old, time-consuming work off of your to-do list and free you up to build the new, valuable stuff. It’s never been a better time to be in IT, and I’m not just saying it because of the groovy Cassandra Summit backpack. See you next year.

Down with Tweaking! Removing Tunable Complexity for Cassandra Performance and Administrator

Company: Cloudius Systems

Abstract: The need for performance tuning of the JVM and OS is making administrators the bottleneck for Cassandra deployments–especially in virtual environments. Over the past two years, the OSv project has profiled tuning-sensitive applications with a special focus on Cassandra. Today, many of the important bottlenecks for NoSQL applications are tunable on a conventional OS, but do not require tuning in the OSv environment. OSv gives Cassandra a simpler environment, set up to run one application in a single address space. This talk will cover how to use OSv to improve performance in key areas such as JVM memory allocation and network throughput–without loading up your to-do list with difficult tuning tasks.

Greg Cooper Senior Software Engineer at Iovation
"Iovation picked Cassandra to be able to scale their services out and grow them linearly both in terms of scale and predictable cost growth... With Oracle we had stability issues once the data sets got really huge and that’s been great with Cassandra."
Greg Cooper Senior Software Engineer at Iovation

Iovation’s a service that other customers will incorporate into their business to help detect fraud and it works in real time trying to find fraudulent accounts. You probably use Iovation services indirectly through other customers’ websites. As the device reputation authority, we’re able to expose and predict trustworthiness from a consumer’s interactions across the broad online landscape, including retail, financial services, telecommunications, social networking, logistics, dating, gaming, and gambling.

After migrating to Apache Cassandra from Oracle in 2013, Iovation has increased traffic volume by 600% with faster processing times – all at an 8X cost savings.

Oracle pain, Cassandra gain

I’ve been at iovation about three years and just before I arrived they integrated Cassandra 0.6 into the mix to do some of the real time traffic for a lot of the device recognition applications.  Since, we’ve used it in a couple of our other services we have several clusters of Cassandra 1.1.6. Right now we’re using it primarily for our real time services and fast responses and that sort of thing and it’s been great.

Iovation was primarily an Oracle relational database shop before they got into Cassandra, prior to me joining.  Iovation picked Cassandra to be able to scale their services out and grow them linearly both in terms of scale and predictable cost growth.  In addition to that, one of the motivators for replacing Oracle was the ease of maintenance and growth. With Oracle we had stability issues once the data sets got really huge and that’s been great with Cassandra.

Multi-data center & vnodes

We have three of our own data centers so it’s really a private cloud kind of scenario so three data centers with data replicated in each place and we have three clusters in production, one of them is a 24-node cluster with eight in each data center for our reputation service and then we have a 12-node cluster with four nodes in each data center for one of our services and we have another one, a little smaller one for velocity based things.  It’s two in each data center. I think they’re all on Cassandra 1.1.6 and we’re pretty excited to dig into  1.2.  

In 1.2 one of the big things is the ability to grow the cluster without having to double it; right now we have this paradigm where if we want to grow a cluster the easiest thing is to double it just to get the tokens to keep from having to move around our data.  With virtual nodes, added in 1.2, and the token assignments we’re excited to be able to grow the cluster by single and piecemeal nodes and by a couple machines rather than double it.  That’s one very exciting thing.

Trust in Cassandra

It’s really nice to see the stability just gets better and better with each release and every time a new release comes out I’m really impressed with the stability and simplicity that’s introduced from the previous versions, it’s been great.


Read more in TechRepublic’s 

Big data success: iovation drops Oracle for Apache Cassandra, sees 600% volume growth

Escaping From Disco-Era Data Modeling

October 21, 2014


On StackOverflow, I have seen Cassandra used in a lot of strange ways – particularly when it comes to secondary indexes.  I believe much of the confusion that exists is due to the majority of new Cassandra users having their understanding of database indexing grounded in experience with relational databases.  They tend to approach a Cassandra data model with the goal of trying find the most efficient way to store the data, and then add secondary indexes in a vain attempt to satisfy their potential query patterns.


This approach usually manifests itself in questions like this:

Here is my schema:

Why doesn’t my SELECT query work?

Or (for the sake of argument) replace the “ORDER BY” clause with “ALLOW FILTERING” and ask:

Why is my SELECT query so slow?

This model is representative of what you would commonly see in a relational database.  It does a great job of storing the data in a logical way, keyed with a unique identifier (messageid), and clustered by a date/time column.  Great that is, until you want to query it by anything other than each row’s unique identifier.  So the next logical leap is made, and the next thing you know, three or four (or more) secondary indexes are created on this column family (table).

Secondary indexes in Cassandra are not meant to be used as a “magic bullet,” allowing a column family to be queried by multiple high-cardinality keys.  Indexes on high-cardinality columns (columns with a large number of potential values) perform poorly, as many seeks will be required to filter a large number of rows down to a small number of results.  Likewise, indexes on extremely low-cardinality columns (booleans or columns with a small number of values, such as gender) perform badly, due to the large number of rows that will be returned.  Also, secondary indexes do not perform well in large clusters, as result sets are assembled from multiple nodes (the more nodes you add, the more network time you introduce into the equation).

There are other important points concerning secondary index use, such as the conditions under which they are intended to be used.  These (and those mentioned above) are detailed in a DataStax document titled When To Use An Index.  When helping others with Cassandra data modeling, this is the document that I will most often refer people to.  Anyone considering using secondary indexes should read that document thoroughly.

Cassandra queries work best when given a specific, precise row key.  The use of additional, differently-keyed column families populated with the same (redundant) data will perform faster than a query which uses a secondary index.  Therefore, the proper way to solve this problem is to create a column family to match each query.  This is known as query-based modeling.  Here is one way to design a column family to suit the afore-mentioned query:

With this solution, the data will be partitioned on fromjid and tojid, and clustered by messagedatetime.  Messageid is added to the PRIMARY KEY definition to ensure uniqueness.  This will allow the user’s query to be keyed on fromjid and tojid, while still allowing an ORDER BY on the clustering column messagedatetime.


Of course, it’s hard to change the way people think within the bounds of a single conversation.  Here is a typical response to the above solution:

I know my original design somewhat violates the whole NoSQL concept, but it saves a lot of memory.

That is a very 1970′s way of thinking.  Relational database theory originated at a time when disk space was expensive.  In 1975, some vendors were selling disk space at a staggering eleven thousand dollars per megabyte (depending on the vendor and model).  Even in 1980, if you wanted to buy a gigabyte’s worth of storage space, you could still expect to spend around a million dollars.  Today (2014), you can buy a terabyte drive for sixty bucks.  Disk space is cheap; operation time is the expensive part.  And overuse of secondary indexes will increase your operation time.

Therefore, in Cassandra, you should take a query-based modeling approach.  Essentially, (Patel, 2014) model your column families according to how it makes sense to query your data.  This is a departure from relational data modeling, where tables are built according to how it makes sense to store the data.  Often, query-based modeling results in storage of redundant data (and sometimes data that is not dependent on its primary row key)…and that’s ok.

In summary, we have managed to free ourselves from bell-bottoms, disco, and shag carpeting.  Now it is also time that we free ourselves from the notion that our data models need be normalized to the max; as well as the assumption that all data storage technologies are engineered to work with normalized structures.  With Cassandra, it is essential to model your data to suit your query patterns.  If you do that properly, then you shouldn’t even need a secondary index…let alone three or four of them.



DataStax (2014).  When To Use An Index.  Retrieved from: on 09/29/2014.

McCallum, J. (2014).  Disk Drive Prices (1955-2014).  Retrieved from: on 09/29/2014.

Patel, J. (2012).  Cassandra Data Modeling Best Practices, Part 1.  Retrieved from: on 10/03/2014.


Cassandra Day Denver 2014 SlideShare/Video Presentations

October 20, 2014


Cassandra Day Denver 2014, sponsored by DataStax, Planet Cassandra & Pearson eCollege, was a huge hit with over 250 attendees, 2 speaking tracks (beginner and advanced) rotating across 4 rooms on the Pearson campus, and 13 technical sessions. Attendees had a chance to learn about how Cassandra is being used across a variety of companies and projects, such as Person’s eCollege platform or Project EPIC at the University of Boulder Colorado, as well as a full getting started track to take attendees from “zero to hero” with Apache Cassandra.

Found below are slides and videos from Cassandra Day Denver 2014 that we’re excited to share with the Apache Cassandra community.  We hope  you can attend the next Cassandra Day in person! Request us to come to your city by emailing with your cities name.


Beginner Track

Getting Started with Apache Cassandra — From Zero to Hero

Speaker: Jon Haddad, Technical Evangelist for Apache Cassandra at DataStax & Luke Tillman, Language Evangelist for Apache Cassandra at DataStax

This is a crash course introduction to Cassandra. You’ll step away understanding how it’s possible to to utilize this distributed database to achieve high availability across multiple data centers, scale out as your needs grow, and not be woken up at 3am just because a server failed. We’ll cover the basics of data modeling with CQL, and understand how that data is stored on disk. We’ll wrap things up by setting up Cassandra locally, so bring your laptops!


Python & Cassandra Best Friends

Speaker: Jon Haddad, Technical Evangelist for Apache Cassandra at DataStax

We’ll be doing a deep dive into working with Cassandra using Python. We’ll cover a wide range of tools, starting with the native driver to cqlengine. We’ll also explore alternatives to the cqlsh repl to quickly explore our databases.

Building Java Applications with Apache Cassandra
Speaker: Tim Berglund, Global Director of Training at DataStax

So you’re a JVM developer, you understand Cassandra’s architecture, and you’re on your way to knowing its data model well enough to build descriptive data models that perform well. What you need now is to know the Java Driver.

What seems like an inconsequential library that proxies your application’s queries to your Cassandra cluster is actually a sophisticated piece of code that solves a lot of problems for you that early Cassandra developers had to code by hand. Come to this session to see features you might be missing and examples of how to use the Java driver in real applications.


Advanced Track

Lessons Learned Implementing Cassandra at Pearson eCollege

Speaker: Codey Whitt — Senior Software Engineer, Pearson eCollege

This talk discusses things to consider when considering Cassandra through the purview of a Pearson’s team’s recent Cassandra adoption after coming from a .NET/SQL world. Topics covered include data model design, operationalization of a cluster, and other best practices along with what happens when they aren’t followed.


Cassandra Anti-Pattern Jeopardy
Speaker: Rachel Pedreschi, Lead Sales Engineer at DataStax
Don’t put your Cassandra application in “jeopardy’! Come learn from real examples from the field on how NOT to do Cassandra. Prizes might be involved!


Setting up a DataStax Enterprise Instance on Microsoft Azure

Speaker: Joey Filichia, BI Consultant at Filichiacom

There are many options for Cloud Providers, but according to the Gartner Magic Quadrant 2014 for IaaS Solutions, Amazon AWS and Microsoft Azure are both leaders and visionaries. DataStax provides instructions for provisioning an Amazon Machine Image. This discussion will provide guidance on setting up a single-node DataStax Enterprise cluster using an Ubuntu 14.04 Server and a Windows Azure Virtual Machine. Using the DataStax Enterprise production installation in text mode, we will install DSE end to end during the presentation.


Reading Cassandra SSTables Directly for Offline Data Analysis

Speaker: Ben Vanberg, Software Engineer at FullContact

Here at FullContact we have lots and lots of contact data. In particular we have more than a billion profiles over which we would like to perform ad hoc data analysis. Much of this data resides in Cassandra, and we have many analytics MapReduce jobs that require us to iterate across terabytes of Cassandra data. To solve this problem we’ve implemented our own splittable input format which allows us to quickly process large SSTables for downstream analytics.


Feelin’ the Flow: Analyzing Data with Spark and Cassandra

Speaker: Rich Beaudoin, Senior Software Engineer at Pearson eCollege

In the world of Big Data it’s crucial that your data is accessible. Cassandra provides us with a means to reliably store our data, but how can we keep it flowing? That’s where Spark steps up to provide a powerful one-two punch with Cassandra to get your data flowing in all the right directions.


A Cassandra Data Model for Serving up Cat Videos

Speaker: Luke Tillman, Language Evangelist for Apache Cassandra at DataStax

Keyboard Cat, Nyan Cat, and of course the world famous Grumpy Cat–it seems like the Internet can’t get enough cat videos. If you were building an application to let users share and consume their fill of videos, how would you go about it? In this talk, we’ll take a look at the data model for KillrVideo, a sample video sharing application similar to YouTube where users can share videos, comment, rate them, and more. You’ll learn get a practical introduction to Cassandra data modelling, querying with CQL, how the application drives the data model, and how to shift your thinking from the relational world you probably have experience with.


Transitioning to Cassandra for an Already Giant Product
Speaker: Andrew Kuttig, Director of Software Engineering at SpotXchange, Inc.
Data modeling, cluster sizing, and planning can be difficult when transitioning an existing product to Cassandra. Especially when the new Cassandra deployment needs to handle millions of operations per second on day one! In this talk I’ll discuss our strategy for data modeling, cluster sizing, and our novel approach to data replication across data centers.

Using Cassandra to Support Crisis Informatics Research
Speaker: Ken Anderson, Associate Professor, Department of Computer Science at The University of Colorado Boulder
Crisis Informatics is an area of research that investigates how members of the public make use of social media during times of crisis. The amount of social media data generated by a single event is significant: millions of tweets and status updates accompanied by gigabytes of photos and video. To investigate the types of digital behaviors that occur around these events requires a significant investment in designing, developing, and deploying large-scale software infrastructure for both data collection and analysis. Project EPIC at the University of Colorado has been making use of Cassandra since Spring 2012 to provide a solid foundation for Project EPIC’s data collection and analysis activities. Project EPIC has collected terabytes of social media data associated with hundreds of disaster events that must be stored, processed, analyzed, and visualized. This talk will cover how Project EPIC makes use of Cassandra and discuss some of the architectural, modeling, and analysis challenges encountered while developing the Project EPIC software infrastructure.

1 2 3 138