October 20th, 2014

In the last few months I’ve given two different talks about scalable fuzzy matching.

The first was at Strata in San Jose, titled Similarity at Scale. In that talk I focused mostly on techniques for doing fuzzy matching (or joins) between large data sets, primarily via Cascading workflows.

More recently I presented at Cassandra Summit 2014, on Fuzzy Entity Matching. This was a different take on the same issue, where the focus was ad hoc queries to match one target against a large corpus. The approach I covered in depth was to use Solr queries to create a reduced set of candidates, after which you could apply typical “match distance” heuristics to re-score/re-rank the results.
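To make the two-stage idea concrete, here is a minimal Python sketch. It is not the code from the talk: a small in-memory trigram index stands in for the Solr query, and `difflib.SequenceMatcher` stands in for whatever match distance heuristic you prefer. The function names, the trigram size, and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-normalized string."""
    s = " ".join(text.lower().split())
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def build_index(corpus):
    """Map each n-gram to the ids of the records containing it (stand-in for Solr)."""
    index = defaultdict(set)
    for doc_id, text in enumerate(corpus):
        for gram in ngrams(text):
            index[gram].add(doc_id)
    return index


def candidates(index, target, limit=10):
    """Cheap first pass: ids of the records sharing the most n-grams with the target."""
    hits = defaultdict(int)
    for gram in ngrams(target):
        for doc_id in index.get(gram, ()):
            hits[doc_id] += 1
    ranked = sorted(hits, key=lambda d: (-hits[d], d))
    return ranked[:limit]


def fuzzy_match(corpus, index, target, limit=10):
    """Second pass: re-score only the candidates with a costlier similarity measure."""
    scored = [
        (SequenceMatcher(None, target.lower(), corpus[d].lower()).ratio(), corpus[d])
        for d in candidates(index, target, limit)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical sample data, for illustration only.
corpus = ["Scale Unlimited Inc", "Scala Unlimited", "DataStax", "Unlimited Scaling LLC"]
index = build_index(corpus)
results = fuzzy_match(corpus, index, "Scale Unlimited")
for score, name in results:
    print(f"{score:.3f}  {name}")
```

The point of the split is that the expensive comparison only runs on the handful of records that survive the first pass, not on the whole corpus. In a real deployment the first pass would be a search engine query rather than a Python dictionary.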

The video for this second talk is freely available (thanks, DataStax!) and you can watch me lead off with an “uhm”:

“Fuzzy Matching at Scale” was created by Ken Krugler, President at Scale Unlimited. Be sure to check out all of the presentations from Cassandra Summit 2014 at the official YouTube playlist and register for the upcoming Cassandra Summit Europe 2014 to catch even more great presentations before the year’s end.