Warm and Fuzzy Inside
He knows when you're asleep He knows when you're awake He knows when you've been bad or good So be good for goodness sake
Sounds pretty nefarious, right? Jolly old elf indeed. Just how does Nick know so much about everyone? Truth be told, he has many proprietary mechanisms of collecting data in house but as is so often the case with large endeavours, he subcontracts much of the work out to other parties. In this case, he purchases commercial data sets from market research firms which you would think would lighten the load on his data mining elves however most of his contractors aren't as rigorous about data cleaning and normalization as they are.
Fortunately, the North Pole Crew have some experience with imprecise matching and due to Santa's poor spelling. In fact, earlier this month we saw an article on how they were testing modules for calculating Levenshtein distance - the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other - to help work out how close the two strings are together. With this could the Elves splice their data and the commercial data sets together?
The elves did some testing and sadly it didn't look good. The problem the Elves were facing is that Levenshtein distance alone wasn't being smart enough for them. Consider the following corruptions:
In the first example the
an from Santa's name simply got swapped to become
na, whereas the second example requires two whole new characters replaced two old characters. However: these two examples have identical Levenshtein distance scores - how could the elves get their code to prioritize one over the other?
Text::Fuzzy is a fuzzy text matcher that can not only calculate Levenshtein distance, but is also able to calculate Damerau-Levenshtein distance - counting character swaps as just one step. This gives us this much better looking result.
As a bonus the interface for Text::Fuzzy is also surprisingly simple:
(Note the use of the
trans option there to enable Damerau-Levenshtein scores.)
Having Text::Fuzzy pick the one of the closest matches from a list is also really easy:
Putting It All Together
With this tool in in mind, the Elves knocked together a proof-of-concept for combining the two data sets. The code takes two disparate CSVs and combines them into a single CSV - fuzzily!
ID,FirstName,MiddleName,LastName,Street,City,Province,PostCode,Country,Match,FullName,Address,ExtraData 1,John,Jacob,Jingleheimer-Schmidt,1320 Needham Drive,Centreville,NV,89119,USA,89,Jon Jacob Jingleheimmer Schmitt,1320 Needham Dr, Centerville, NV 89119, USA,Tm90aGluZyB0byBzZWUgaGVyZS4= 2,Bartholomew,Jay,Simpson,1322 Evergreen Terrace,Springfield,,,USA,71 3,Marie Claire,,Dubois,62 Clos du Bois Rossia,Namur,,5017,BE,85,Mary Claire Dubois,62 Clos duBois Russia,Namur,,5017,BE,dGl0aSB0b3RvIHR1dHUgdGF0YQ== 3,Marie Claire,,Dubois,62 Clos du Bois Rossia,Namur,,5017,BE,85,Marie Claire du Bois,26 Clos du Bois Rossia,Namur,,5017,BE,SidhaW1lIGJpbmUgbGVzIGhpc3RvaXJlcyBkZSBHYXN0b24gTGFnYWZmZQ==