Eighteen years ago I cruised into the data quality realm by making my first deduplication tool. Back then it was an attempt to solve the business case of two companies that were considering merging and wanted to know the intersection of their customer bases. So far, so good.
Since then I have worked intensively with deduplication and other data matching tools and approaches, and have also co-authored a leading eLearning course on the matter, as seen here.
Deduplication capability is a core feature of many data quality tools, and indeed probably the most frequently mentioned data quality pain is lack of uniqueness, not least in party master data management.
However, in my experience most deduplication efforts don’t stick. Yes, we can process a file ready for direct marketing and purge the messages that might end up in the same offline or online inbox despite spelling differences (a minimal sketch of such fuzzy matching follows the list below). But taking it from there and using the techniques to achieve a single customer view is another story. Some obstacles are:
- There is a fear of getting it all wrong, as told in the post Beware of False Positives in Data Matching.
- For many good reasons business processes require deliberate duplicates, as reported in the post Entity Revolution vs Entity Evolution.
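To make the spelling-differences point concrete, here is a minimal sketch of the kind of fuzzy matching a deduplication tool performs, using only the Python standard library. The sample names, the normalization rules and the 0.85 threshold are illustrative assumptions, not a recommendation for production matching.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalize(name: str) -> str:
    """Crude normalization: lowercase, strip punctuation, collapse spaces."""
    cleaned = "".join(c for c in name.lower() if c.isalnum() or c.isspace())
    return " ".join(cleaned.split())

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] based on longest matching subsequences."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

customers = ["John Smith", "Jon Smith", "J. Smith & Co.", "Mary Jones"]

# Flag candidate duplicate pairs despite spelling differences.
for a, b in combinations(customers, 2):
    score = similarity(a, b)
    if score >= 0.85:  # assumed threshold; tune against known duplicates
        print(f"Possible duplicate: {a!r} ~ {b!r} (score {score:.2f})")
```

Real matching tools layer phonetic codes, blocking and probabilistic weights on top of comparisons like this, and that extra sophistication is exactly where the trust issues discussed in the comments below come in.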
In the comments to the latter post, now three years old, the intersection (and non-intersection) of Entity Resolution and Master Data Management (MDM) was discussed.
During my latest work I have become more and more convinced that achieving a single view of something is largely about entity resolution, as expressed in the post The Good, Better and Best Way of Avoiding Duplicates.
Henrik – I agree that trust is a critical element in getting matching to stick, as discussed in my comment here: http://dataqualitymatters.wordpress.com/2013/09/12/deterministic-matching-versus-probabilistic-matching/
Match approaches need to be easily understood by humans. We have successfully deployed matching by building test cases for each match rule, which allowed the business to build trust in our results.
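As an illustration, a test case for a match rule can be as simple as the following sketch; the rule and the sample records are assumed examples for this comment, not our actual production rules.

```python
import unittest

def match_id_or_name_and_postcode(a: dict, b: dict) -> bool:
    """Match if registration IDs agree, or if name and postcode both agree."""
    if a.get("id") and a.get("id") == b.get("id"):
        return True
    if not a.get("name") or not a.get("postcode"):
        return False  # not enough data to match safely
    return (a["name"].lower() == b.get("name", "").lower()
            and a["postcode"] == b.get("postcode"))

class TestMatchRule(unittest.TestCase):
    def test_same_id_matches(self):
        self.assertTrue(match_id_or_name_and_postcode(
            {"id": "1962/000738/06", "name": "Acme Ltd"},
            {"id": "1962/000738/06", "name": "Acme Limited"}))

    def test_name_and_postcode_match(self):
        self.assertTrue(match_id_or_name_and_postcode(
            {"name": "Acme Ltd", "postcode": "2196"},
            {"name": "ACME LTD", "postcode": "2196"}))

    def test_name_alone_is_not_enough(self):
        # Guards against false positives: a shared name alone must not match.
        self.assertFalse(match_id_or_name_and_postcode(
            {"name": "Acme Ltd", "postcode": "2196"},
            {"name": "Acme Ltd", "postcode": "8001"}))

if __name__ == "__main__":
    unittest.main()
```

Walking the business through green tests like these, rule by rule, is what built the trust.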
A reference list is not really an option in South Africa – there is no definitive, quality source for many common objects such as companies, nor is there a definitive national address database.
Thanks for commenting, Gary.
Lots of effort has indeed been put into match implementations, and as mentioned I have also worked a lot with that. And surely some match tools work better than others.
Using rich reference data is an easy way to take matching into entity resolution, and indeed the practical ways of doing that are, for example, very different between South Africa and Scandinavia.
I remember once learning that D&B had identified their B2B match results to be best in the Nordics and France. As the supplier of the Nordic tool I was of course proud. But after working with the D&B Worldbase I realized that the main reason probably was that public registration of companies worked best in, among other places, the Nordics and France.
Besides utilizing rich (or, as I call it, big) reference data, tagging into many big data sources seems to be a future way of applying entity resolution to achieve a single view of party master data.
Hi Henrik – there is definite potential in the big reference data approach.
To me it requires a hybrid approach.
Unless you are going to use a reference list as your single source of all truth, you will still need to do some kind of match between your data and the reference list. If you can trust this match, then you can use the reference list to enrich or improve your existing data sets.
And, as your Good, Better, Best post points out – internal content may already exist, so you don’t always need an external set. In South African banking, for example, we frequently see a customer with an existing account being asked to go through the entire KYC process (source documents etc.) because the bank is product centric and does not recognise the customer that it already has on record in another business unit.
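To illustrate the match-then-enrich step, here is a minimal sketch; the registry entry, the field names and the exact registration-number match are assumptions for illustration, and in practice the match itself would often be fuzzy and scored.

```python
# Assumed reference list keyed by company registration number.
reference_list = {
    "1962/000738/06": {"status": "Active",
                       "address": "1 Main Rd, Johannesburg"},
}

def enrich(record: dict) -> dict:
    """Enrich a record from the reference list if we trust the match."""
    match = reference_list.get(record.get("reg_no", ""))
    if match is None:
        return record  # no trusted match: leave the record untouched
    enriched = dict(record)
    for field, value in match.items():
        enriched.setdefault(field, value)  # fill gaps, keep existing values
    return enriched

print(enrich({"reg_no": "1962/000738/06", "name": "Acme Trading"}))
# -> {'reg_no': '1962/000738/06', 'name': 'Acme Trading',
#     'status': 'Active', 'address': '1 Main Rd, Johannesburg'}
```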
Very good points, Henrik. I would even add that “dedup” must always be part of a data governance process and ingrained in MDM solutions.
Stand-alone solutions are appealing but they will rarely support “do/undo/revert to previous versions” capabilities. Nor will they support workflows with different levels of validation to fix errors or supervise the quality of deduplication (the false positives you refer to, for instance).
People and processes are definitely part of the equation, and in my opinion including de-duplication in an overall MDM/governance process can make it stick.
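To illustrate the do/undo point, here is a minimal in-memory sketch of a merge log that lets you revert a bad merge; it is an assumed model for this comment, not any vendor’s API.

```python
import copy

class MergeLog:
    """Record every merge so a false positive can be undone later."""
    def __init__(self):
        self.history = []  # stack of (golden record before merge, merged id)

    def merge(self, golden: dict, duplicate: dict) -> dict:
        self.history.append((copy.deepcopy(golden), duplicate["id"]))
        for field, value in duplicate.items():
            golden.setdefault(field, value)  # naive survivorship: keep first
        return golden

    def undo(self) -> dict:
        """Revert the most recent merge, e.g. after spotting a false positive."""
        before, merged_id = self.history.pop()
        print(f"Reverted merge of record {merged_id}")
        return before

log = MergeLog()
golden = {"id": "C-1", "name": "Acme Ltd"}
golden = log.merge(golden, {"id": "C-2", "phone": "+27 11 555 0101"})
golden = log.undo()  # golden is back to its pre-merge state
```

A real MDM workflow would add validation levels and supervision around this, which is exactly why stand-alone deduplication rarely offers it.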
Thanks for commenting, Gauthier. I do agree and have several times used deduplication as a successful part of making data ready for MDM. But then you could ask the question: Why don’t MDM Implementations Stick?