Data matching is a sub discipline within data quality management. Data matching is about establishing a link between data elements and entities, that does not have the same value, but are referring to the same real-world construct.
The most common scenario for data matching is deduplication of customer data records held across an enterprise. In this case we often see a gap between what we technically try to do and the desired business outcome from deduplication. In my experience, this misalignment has something to do with real-world alignment.
What we technically do is basically to find a similarity between data records that typically has been pre-processed with some form of standardization. This is often not enough.
Deduplication and other forms of data matching with customer master data revolves around names and addresses.
Standardization and verification of addresses is very common element in data quality / data matching tools. Often such at tool will use a service either from its same brand or a third-party service. Unfortunately, no single service is often enough. This is because:
- Most services are biased towards a certain geography. They may for example be quite good for addresses in The United States but very poor compared to local services for other geographies. This is especially true for geographies with multiple languages in play as exemplified in the post The Art in Data Matching.
- There is much more to an address than the postal format. In deduplication it is for example useful to know if the address is a single-family house or a high-rise building, a nursing home, a campus or other building with lots of units.
- Timeliness of address reference data is underestimated. I recently heard from a leader in the Gartner Quadrant for Data Quality Tools that a quarterly refresh is fine. It is not, as told in the post Location Data Quality for MDM.
The overlaps and similarities between data matching and identity resolution was discussed in the post Deduplication vs Identity Resolution.
In summary, the capability to tell if two data records represent the same real-world entity will eventually involve identity resolution. And as this is very poorly supported by data quality tools around, we see that a lot of manual work will be involved if the business processes that relies on the data matching cannot tolerate too may, or in some cases any, false positives – or false negatives.
Even telling that a true positive match is true in all circumstances is hard. The predominant examples of this challenge are:
- Is a match between what seems to be an individual person and what seems to be the household where the person lives a true match?
- Is a match between what seems to be a person in a private role and what seems to be the same person in a business role a true match? This is especially tricky with sole proprietors working from home like farmers, dentists, free lance consultants and more.
- Is a match between two sister companies on the same address a true match? Or two departments within the same company?
We often realize that the answer to the questions are different depending on the business processes where the result of the data matching will be used.
The solution is not simple. The data matching functionality must, if we want automated and broadly usable results, be quite sophisticated in order to take advantage of what is available in the real-world. The data model where we hold the result of the data matching must be quite complex if we want to reflect the real-world.