One of the most frequently performed data quality improvement processes around is the deduplication of party master data.
A core capability of many data quality tools is finding duplicates in large datasets containing names, addresses and other party identification data.
When evaluating the result of such a process, we usually divide the found duplicates into two groups, as illustrated in the sketch after this list:
- False positives: automated match results that do not reflect real-world duplicates
- True positives: automated match results that do reflect the same real-world entity
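To make the distinction concrete, here is a minimal Python sketch of that evaluation step. It assumes each automated match result pairs two record IDs and that we have a manually verified set of real-world duplicate pairs to compare against; all names here are hypothetical, not from any particular data quality tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchResult:
    record_id_a: str
    record_id_b: str

    def as_pair(self) -> frozenset:
        # Order-insensitive key, so (A, B) and (B, A) count as the same match.
        return frozenset((self.record_id_a, self.record_id_b))


def evaluate_matches(automated: list[MatchResult],
                     verified_duplicates: set[frozenset]) -> tuple[list, list]:
    """Split automated match results into true and false positives."""
    true_positives, false_positives = [], []
    for match in automated:
        if match.as_pair() in verified_duplicates:
            true_positives.append(match)   # same real-world entity
        else:
            false_positives.append(match)  # automated match, not a real duplicate
    return true_positives, false_positives
```

In practice the verified set is typically a sampled, manually reviewed subset of the match results rather than a complete reference.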
The difficulties in reaching the above result aside, you would think the rest is easy: take the true positives, merge them into a golden record and purge the now redundant duplicate records from your database.
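In its naive form, that merge-and-purge step might look like the following sketch. It assumes a simple survivorship rule, keeping the most recently updated non-empty value per attribute; the field names are hypothetical.

```python
def build_golden_record(duplicates: list[dict]) -> dict:
    """Merge a group of true-positive duplicate records into one golden record.

    Each record is a dict of party attributes plus an 'updated_at' timestamp.
    """
    golden: dict = {}
    # Newest records first, so fresher non-empty values win per attribute.
    for record in sorted(duplicates, key=lambda r: r["updated_at"], reverse=True):
        for field, value in record.items():
            if field == "updated_at":
                continue
            if field not in golden and value not in (None, ""):
                golden[field] = value
    return golden


group = [
    {"name": "ACME Corp", "phone": "", "updated_at": "2023-01-01"},
    {"name": "ACME Corporation", "phone": "+1 555 0100", "updated_at": "2022-06-15"},
]
build_golden_record(group)
# -> {'name': 'ACME Corp', 'phone': '+1 555 0100'}
```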
Well, I have seen many well-executed deduplication jobs end right there, because there are plenty of reasons not to make the golden records.
Sure, a lot of duplicates “are bad” and should be eliminated.
But many duplicates “are good” and have actually been put into the databases for a good reason: they support different business processes where one view of a party is needed in one case and another view in another. Billing may, for example, need the registered legal entity while shipping needs the local delivery address.
Many operational applications, including very popular ERP and CRM systems, have data models too limited to reflect the complexity of the real world.
Only a handful of MDM (Master Data Management) solutions are able to do so, and even then the solution isn't easy, as most enterprises have an IT landscape full of applications whose other business-relevant functionality isn't replaced by an MDM solution.
What I like to do when working on getting business value from true positives is to build a so-called Hierarchical Single Source of Truth.
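One way such a hierarchy could be modelled is sketched below: the golden record sits at the top, while the original records survive underneath it, each tagged with the business context it serves. The structure and names are purely illustrative, not a reference implementation.

```python
from dataclasses import dataclass, field


@dataclass
class PartyRecord:
    record_id: str
    source_system: str   # e.g. the ERP or CRM holding this view
    purpose: str         # e.g. "billing", "shipping", "marketing"
    attributes: dict


@dataclass
class GoldenRecord:
    golden_id: str
    attributes: dict                       # merged, survivorship-resolved values
    members: list[PartyRecord] = field(default_factory=list)

    def view_for(self, purpose: str) -> dict:
        """Return the purpose-specific view when one exists, else the golden view."""
        for member in self.members:
            if member.purpose == purpose:
                return member.attributes
        return self.attributes
```

This way the “good” duplicates keep supporting the business processes they were created for, while the golden record provides the single view wherever one is needed.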