Beware of False Positives in Data Matching

In a recent blog post, Kristen Gregerson of Satori Software tells A Terrible Tale in which the identities of two different real-world individuals were merged into one golden record, with the most horrible result you can imagine, associated with a recent special day related to the results of the other kind of matching going around.


As reported by Jim Harris some years ago in the post The Very True Fear of False Positives, the bad things that happen because of false positives in data matching are indeed a hindrance to doing data matching.

If we do data matching, we should be aware that false positives will happen, we should know the probability that they happen, and we should know how to avoid the resulting heartache.
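One hedged way to put a number on that probability is to have a person review a random sample of the matches a tool proposed and count how many were wrong. The sketch below is only an illustration and is not taken from any of the posts mentioned here: the function name false_positive_share and the sample data are hypothetical, and the 95% Wilson score interval is just one common choice for expressing the uncertainty of a sampled proportion.

```python
import math


def false_positive_share(reviewed, z=1.96):
    """Estimate the share of proposed matches that are false positives.

    reviewed: list of booleans, one per manually reviewed match,
              True if the reviewer confirmed the match, False if not.
    z:        normal quantile for the interval (1.96 gives roughly 95%).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    n = len(reviewed)
    if n == 0:
        raise ValueError("need at least one reviewed match")

    false_positives = sum(1 for confirmed in reviewed if not confirmed)
    p = false_positives / n

    # Wilson score interval for a binomial proportion.
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical review: 100 sampled matches, 3 rejected by the reviewer.
sample = [True] * 97 + [False] * 3
share, low, high = false_positive_share(sample)
print(f"false positive share: {share:.3f} (interval {low:.3f} to {high:.3f})")
```

The estimate only describes the matches the tool actually made. It says nothing about the true matches it missed, which need a separate review.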

Indeed, using a data matching tool is better than relying on simple database indexes, and indeed there are differences in how good various data matching tools are at doing the job, not least under different circumstances, as told in the post What is a best-in-class match engine?

Curious about how data matching tools work (differently)? There is an eLearning course available co-authored by yours truly. The course is called Data Parsing, Matching and De-duplication.


3 thoughts on “Beware of False Positives in Data Matching”

  1. garymdm 24th February 2013 / 14:07

    This is of course the centre of the argument against purely probabilistic matching techniques as used by most stack data matching technologies. While at first glance these statistical match approaches appear simple, because they claim not to require data standardisation, they result in many false positive matches and cannot be easily tuned to rule out false positives, as discussed here: http://dataqualitymatters.wordpress.com/2011/07/13/what-is-data-governance-data-quality-matching. Companies such as Trillium Software have designed their match approach to be granular (so that we can isolate very small cases where a false positive match must be ruled out) and easy to tune (so that we can improve the results). This is critical to avoid false positive matches and to have a result that can be tested and signed off (trusted).

    • Henrik Liliendahl Sørensen 24th February 2013 / 22:47

      Thanks Gary for adding in. Indeed there are different approaches to data matching and each approach has its pros and cons. What I have always found challenging, no matter what approach is taken, is balancing getting the right true positives against having false positives. Being on the safe side is good, but it often leaves out too many actually matching records. Your ROI is about catching all the right matching records without having too many false ones. My experience is that many Gartner magic quadrant data quality tools just don’t do the job well enough.

      • garymdm 25th February 2013 / 06:17

        Henrik, you are quite right. This is why it is critical that your match rules are granular, i.e. each match must be based on a clearly defined use case. So we can agree that (for example) if we have a 100% given name match and a 99% surname match then this is a match, but if we have a small variation, a 99% given name match and a 100% surname match, then this is not.

        This allows us to test the effectiveness of our match and to identify and eliminate cases that result in false positives. The alternative approach (if there is more than a 99% chance of a match then it is a match) cannot be tuned, and both cases in the above example would pass, or both must fail.

        Clearly, in the real world we would want more information to use in a match, such as address, telephone number, tax number or other relatively personal information.

        So when looking at automated matching you need to understand whether individual match cases can be isolated, tested for effectiveness and, if necessary, tuned. One approach, taken by many commercial MDM platforms, is to generate manual exceptions for suspicious possible matches. In practice, these exceptions may run into tens of thousands of records that must be manually resolved each month. Very few businesses have the operational capacity to handle these exceptions, so they are routinely ignored. Pick a match approach (and therefore tool) that will provide maximum confidence in the match result, test and tune your outputs, and hopefully you will be left with very small volumes of records requiring manual intervention.
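The contrast drawn in the comments above, between granular match rules and a single overall threshold, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how Trillium Software or any other product works: the names FieldScores, threshold_match and granular_match are hypothetical, the overall score is assumed to be a plain average of the field similarities, and the 100% and 99% figures come from the example in the comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldScores:
    """Per-field similarity between two records, from 0.0 to 1.0 (hypothetical)."""
    given_name: float
    surname: float


def threshold_match(scores: FieldScores, threshold: float = 0.99) -> bool:
    """Single overall threshold: average the fields and compare once.

    Averaging is an assumption made for this sketch; real tools
    combine field evidence in their own ways.
    """
    overall = (scores.given_name + scores.surname) / 2
    return overall >= threshold


def granular_match(scores: FieldScores) -> bool:
    """One explicit rule per use case, so each can be tested and tuned alone."""
    # Rule: identical given name with a near-identical surname is a match.
    if scores.given_name == 1.0 and scores.surname >= 0.99:
        return True
    # A varying given name with an identical surname is deliberately
    # NOT accepted: no rule covers it, so it falls through as a non-match.
    return False


case_a = FieldScores(given_name=1.0, surname=0.99)
case_b = FieldScores(given_name=0.99, surname=1.0)

for label, case in (("A", case_a), ("B", case_b)):
    print(label, "threshold:", threshold_match(case), "granular:", granular_match(case))
```

Both cases have the same overall score, so the single threshold has to accept both or reject both. The granular rule tells them apart, and if it later proves too strict or too loose it can be changed without affecting any other rule.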
