In the last few months I’ve given two different talks about scalable fuzzy matching.
The first was a Strata in San Jose, titled Similarity at Scale. In that talk I focused mostly on techniques for doing fuzzy matching (or joins) between large data sets, primarily via Cascading workflows.
More recently I presented at Cassandra Summit 2014, on Fuzzy Entity Matching. This was a different take on the same issue, where the focus was ad hoc queries to match one target against a large corpus. The approach I covered in depth was to use Solr queries to create a reduced set of candidates, after which you could apply typical “match distance” heuristics to re-score/re-rank the results.
The video for this second talk is freely available (thanks, DataStax!) and you can watch me lead off with an “uhm”: