When matching customer master data in order to find duplicates, or to find corresponding real-world entities in a business or consumer directory, you may use a data quality deduplication tool to do the hard work.
Depending on the capabilities of the tool and on the nature of and purpose for the data, the tool will typically find:
A: The positive automated matches. Ideally you will take samples for manual inspection.
C: The negative automated matches.
B: The dubious part selected for manual inspection.
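As a rough sketch, this three-way split is just a thresholding step over scored candidate pairs. The cut-off values below are invented for illustration; real tools tune them per data set:

```python
# Illustrative sketch: partition scored candidate pairs into the three pots.
# The thresholds are made-up example values, not from any particular tool.

UPPER = 0.90  # at or above: automated positive match (pot A)
LOWER = 0.60  # below: automated negative match (pot C)

def partition(pairs):
    """pairs: iterable of (record_a, record_b, similarity_score)."""
    pots = {"A": [], "B": [], "C": []}
    for a, b, score in pairs:
        if score >= UPPER:
            pots["A"].append((a, b, score))
        elif score < LOWER:
            pots["C"].append((a, b, score))
        else:  # the dubious middle band goes to manual inspection
            pots["B"].append((a, b, score))
    return pots

pairs = [("r1", "r2", 0.97), ("r1", "r3", 0.72), ("r2", "r4", 0.31)]
pots = partition(pairs)
```

Everything landing in pot B (plus a sample of pot A) is what the inspection interface described below has to support.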
Humans are costly resources. Therefore the manual inspection of the B pot (and the A sample) may be supported by a user interface that helps get the job done fast but accurately.
I have worked with the following features for such functionality:
- Random sampling for quality assurance – both from the A pot and from the manually settled matches from the B pot
- Check-out and check-in for multiuser environments
- Presenting a ranked range of computer selected candidates
- Color coding elements in matched candidates – like:
- green for (near) exact name,
- blue for a close name and
- red for a far from similar name
- Possibility for marking:
- as a manual positive match,
- as a manual negative match (with reason) or
- as questionable for later or supervisor inspection (with comments)
- Entering a match found by other methods
- Removing one or several members from a duplicate group
- Splitting a duplicate group into two groups
- Selecting survivorship
- Applying hierarchy linkage
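Two of the group-editing features above – removing members from a duplicate group and splitting a group in two – could be sketched like this; the data structures are my own assumptions, not any particular tool's:

```python
# Hypothetical sketch of two manual-inspection operations on a duplicate
# group (a group is modelled here simply as a list of record ids).

def remove_members(group, members_to_remove):
    """Return the group without the given members (e.g. false positives)."""
    removals = set(members_to_remove)
    return [m for m in group if m not in removals]

def split_group(group, second_group_members):
    """Split one duplicate group into two, as a reviewer might when the
    group really contains two distinct real-world entities."""
    second = set(second_group_members)
    return ([m for m in group if m not in second],
            [m for m in group if m in second])

group = ["cust-1", "cust-2", "cust-3", "cust-4"]
kept = remove_members(group, ["cust-4"])   # cust-4 was a false positive
g1, g2 = split_group(kept, ["cust-3"])     # cust-3 is a different entity
```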
Anyone else out there who has worked with making or using a man-machine dialogue for this?
Sadly, I have to raise my hand to that question. I say sadly because it takes hours upon hours, and it is tedious!
I like the color coding solution. The particular tool I use presents similarity scores, so I could drive the color coding from the similarity score.
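For instance, the color bands could be derived from the score with something as simple as this; the band boundaries are invented for illustration:

```python
# Sketch: map a per-field similarity score (0.0 - 1.0) to a display color,
# following the green/blue/red scheme mentioned in the post.
# Band boundaries are illustrative assumptions, not from a specific tool.

def color_for(score):
    if score >= 0.95:
        return "green"   # (near) exact name
    elif score >= 0.75:
        return "blue"    # close name
    else:
        return "red"     # far from similar name
```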
This is a good post on one of the most difficult pieces of the deduplication puzzle – one that no one likes to talk about, but someone has to do. It is great to see someone broach the topic!
Nice post, Henrik!
When I was with the phone company we implemented a lot of matching in our SVC platform. This platform had a web-based administrative GUI which let a (very small) team of people:
a) Review probable matches (threshold >=90%)
b) Investigate Possible Matches (threshold >=65% and <90%)
c) Review Unmatched (threshold 55%).
(note: It’s been a while so my memory of the exact thresholds isn’t 100% accurate)
Each of these was a separate list box on the GUI screen. For the unmatched review, selecting a record pulled back the potential matches found and allowed humans to manually double-check the computer's results. A similar capability existed for the Possibles.
Anything below the lower limit of probability wasn't even presented to a human for further review, as the chances of it being a false positive or negative were too remote.
While not colour coded, this had the same effect. Also, once the match key associations were made in the database to a single entity, it improved future matching against that record (more potential keys to match against, increasing probability scores).
I wonder how much better we could have made the process if we’d been able to get budget and IT support to make the incremental changes to the process that had originally been envisaged.
William and Daragh, thanks for commenting.
I’m pleased that you also see the opportunity in this side of data matching processes, which may be hidden behind all the fancy algorithms (no pun intended, I’m a geek there as well).
I Like this site your article is very nice , Thanks, very interesting article, keep up it coming 🙂
Actually WordPress did place this one in the spam folder – a good example of how computer says no without bothering you. Keep up it coming.
I like the article too 🙂
Sometimes data linkage is an easy job: you take your best-practice rules, change them slightly, and your work is done. In these cases the computer can really help you, but then you don’t need much computer assistance either.
On the other hand there are the other cases. E.g. in your example the OrgNumber is missing, and you have to come up with a complex blocking scheme so you won’t end up comparing every record against every other. In this type of data linkage you need to use your experience, run tests and measurements, and keep tweaking the rules. I usually end up representing the matching score with colours.
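A blocking scheme like the one mentioned can be sketched as grouping records by a cheap key, so that only records sharing a block are ever compared. The key used here (postcode plus first letter of the name) is just an example assumption:

```python
from collections import defaultdict
from itertools import combinations

# Sketch of a simple blocking scheme: instead of comparing every record to
# every other (O(n^2) pairs), only compare records sharing a blocking key.
# The key choice below is illustrative, not a recommendation.

def block_key(record):
    return (record["postcode"], record["name"][:1].upper())

def candidate_pairs(records):
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    for members in blocks.values():
        yield from combinations(members, 2)

records = [
    {"name": "Acme Ltd", "postcode": "2100"},
    {"name": "ACME Limited", "postcode": "2100"},
    {"name": "Beta A/S", "postcode": "8000"},
]
pairs = list(candidate_pairs(records))  # only the two Acme records pair up
```

The candidate pairs would then go through the detailed similarity scoring and the colour-coded review described above.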
But anyway, it would be great if there were something that found a proper data sample and all the matching rules – no matter the entity, country or customer industry, and no matter whether you work for the risk or the marketing department (we know their definitions of a single customer differ) – and you just pressed the button “YES” 🙂
Thanks for the comment. Surely, that big ”YES” button is worth striving for. In the meantime, it’s wise to have some effective inspection features 🙂
Excellent post Henrik,
The “may be” is a tough nut to crack. And the more “may be” results we hear from the computer, the more work there is for humans.
When we implement MDM, we put a lot of emphasis on the number of perfect matches, perfect non-matches and doubtful matches. Easier said than done, this takes several iterations and a lot of ‘attention to minute detail’ work.
Good news for me, however, is the help I get from the tools. The products I work on provide sampling of best matches and data stewardship functionality to analyze/inspect relationships. A user interface which shows two probable matches side by side is a great help in comparing data elements and figuring out the action to be taken.
My main focus will be on reducing the number of “may be” matches, so we can cut out as much manual work as possible. And not to forget the false positives and false negatives, which don’t let me have peaceful nights.
Prash, thanks for adding in. Building an MDM hub is exactly the hardest matching task, where we have multiple purposes of use for the party entities. This means that a match doesn’t necessarily have to be settled as a merge, but perhaps as a relation in a hierarchy.