In a recent blog post, A Terrible Tale, Kristen Gregerson of Satori Software tells how the identities of two different real-world individuals were merged into one golden record, with the most horrible result you can imagine, associated with a recent special day related to the results of the other kind of matching going around.
As reported by Jim Harris some years ago in the post The Very True Fear of False Positives, the bad things that happen because of false positives in data matching are indeed a hindrance to doing data matching at all.
If we do data matching, we should be aware that false positives will happen, we should know the probability that they happen, and we should know how to avoid the resulting heartache.
Indeed, using a data matching tool is better than relying on simple database indexes, and indeed there are differences in how good various data matching tools are at doing the job, not least under different circumstances, as told in the post What is a best-in-class match engine?
Curious about how data matching tools work (differently)? There is an eLearning course available co-authored by yours truly. The course is called Data Parsing, Matching and De-duplication.
This is of course the centre of the argument against purely probabilistic matching techniques as used by most stack data matching technologies. While at first glance these statistical match approaches appear simple, because they claim not to require data standardisation, they result in many false positive matches and cannot easily be tuned to rule out false positives, as discussed here: http://dataqualitymatters.wordpress.com/2011/07/13/what-is-data-governance-data-quality-matching. Companies such as Trillium Software have designed their match approach to be granular (so that we can isolate very small cases where a false positive match must be ruled out) and easy to tune (so that we can improve the results). This is critical to avoid false positive matches and to have a result that can be tested and signed off (trusted).
Thanks for adding in, Gary. Indeed there are different approaches to data matching, and each approach has its pros and cons. What I have always found challenging, no matter which approach is taken, is the balance between getting the right true positives and not having false positives. Playing it safe is good, but it often leaves out too many records that actually match. Your ROI is about catching all the right matching records without having too many false ones. My experience is that many Gartner magic quadrant data quality tools just don't do the job well enough.
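The balancing act described above can be made concrete with a small sketch. It assumes a hypothetical sample of candidate pairs that some match engine has scored and a human reviewer has labelled; the scores, labels and the function name precision_recall are invented for illustration and do not come from any particular tool.

```python
# Hypothetical labelled sample: (match score, reviewer confirmed a true match).
scored_pairs = [
    (0.99, True), (0.97, True), (0.95, False), (0.93, True),
    (0.90, True), (0.88, False), (0.85, True), (0.80, False),
]


def precision_recall(pairs, threshold):
    """Return (precision, recall) if every pair scoring >= threshold is accepted."""
    true_pos = sum(1 for score, is_match in pairs if score >= threshold and is_match)
    false_pos = sum(1 for score, is_match in pairs if score >= threshold and not is_match)
    total_matches = sum(1 for _, is_match in pairs if is_match)
    accepted = true_pos + false_pos
    # With nothing accepted there are no false positives, so precision is taken as 1.0.
    precision = true_pos / accepted if accepted else 1.0
    recall = true_pos / total_matches if total_matches else 1.0
    return precision, recall


for threshold in (0.96, 0.90, 0.80):
    precision, recall = precision_recall(scored_pairs, threshold)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On this made-up sample, the strictest threshold accepts no false positives but finds only two of the five real matches, while the loosest finds all five and lets three false positives through.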
Henrik, you are quite right. This is why it is critical that your match rules are granular, i.e. each match must be based on a clearly defined use case. So we can agree that (for example) if we have a 100% match on given name and a 99% match on surname then this is a match, but if we have a small variation in the given name (99%) and a 100% match on surname then this is not.
This allows us to test the effectiveness of our match and to identify and eliminate cases that result in false positives. The alternative approach (if there is more than a 99% chance of a match then it is a match) cannot be tuned: both cases in the above example would pass, or both must fail.
Clearly, in the real world we would want more information to use in a match, such as address, telephone number, tax number or other relatively personal information.
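A minimal sketch of the difference, assuming per-field similarity scores between 0 and 1 are already available. The names FieldScores, threshold_match and granular_match, the averaging of the two scores and the rule label are all hypothetical; real match engines combine scores and define rules in their own ways.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldScores:
    given_name: float  # similarity in [0, 1]
    surname: float     # similarity in [0, 1]


def threshold_match(scores: FieldScores, cutoff: float = 0.99) -> bool:
    """Single overall score: here simply the average of the two field scores."""
    overall = (scores.given_name + scores.surname) / 2
    return overall >= cutoff


def granular_match(scores: FieldScores) -> Optional[str]:
    """Return the name of the rule that fired, or None when no rule applies."""
    # Rule from the example: exact given name with a near-exact surname.
    if scores.given_name == 1.0 and scores.surname >= 0.99:
        return "exact_given_near_surname"
    # No rule accepts a varied given name, so that case is not a match.
    return None


case_a = FieldScores(given_name=1.0, surname=0.99)
case_b = FieldScores(given_name=0.99, surname=1.0)

print(threshold_match(case_a), threshold_match(case_b))  # both cases get the same verdict
print(granular_match(case_a), granular_match(case_b))    # only case_a fires a rule
```

Because each accepted match carries the name of the rule that produced it, a rule that turns out to generate false positives can be tested and switched off on its own, without moving a global cutoff.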
So when looking at automated matching you need to understand whether individual match cases can be isolated, tested for effectiveness and, if necessary, tuned. One approach, taken by many commercial MDM platforms, is to generate manual exceptions for suspect possible matches. In practice, these exceptions may run into tens of thousands of records that must be manually resolved each month. Very few businesses have the operational capacity to handle these exceptions, so they are routinely ignored. Pick a match approach (and therefore tool) that will provide maximum confidence in the match result, test and tune your outputs, and hopefully you will be left with very small volumes of records requiring manual intervention.
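To get a feel for exception volumes before committing to an approach, one could route a sample of candidate pairs and count how many land in manual review. The sketch below is only an illustration: the route function, its three outcomes, the rules inside it and the sample data are all assumptions, not the behaviour of any MDM platform.

```python
from collections import Counter


def route(given_name: float, surname: float, same_address: bool) -> str:
    """Hypothetical routing of one candidate pair to one of three outcomes."""
    # Confident case: exact given name, near-exact surname, supporting address.
    if given_name == 1.0 and surname >= 0.99 and same_address:
        return "auto_match"
    # Names look very close but the evidence is incomplete: raise an exception.
    if given_name >= 0.99 and surname >= 0.99:
        return "manual_review"
    return "no_match"


# Made-up sample: (given name similarity, surname similarity, same address).
candidate_pairs = [
    (1.0, 1.0, True),
    (1.0, 0.99, True),
    (0.99, 1.0, True),
    (1.0, 1.0, False),
    (0.80, 1.0, True),
]

volumes = Counter(route(*pair) for pair in candidate_pairs)
review_share = volumes["manual_review"] / len(candidate_pairs)
print(dict(volumes), f"manual review share: {review_share:.0%}")
```

If the manual review share on a realistic sample is more than the team can resolve each month, that is a signal to refine the rules rather than to let the exceptions pile up.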