One of the techniques in data matching I find most exciting is probabilistic learning with machine learning, where manually inspected results from previous automated match runs are used to make future automated matching better.
Let’s look at an example. Below we are comparing two data rows with legal entities from Argentina:
The names are a close match (colored blue) as we have two swapped words.
The street addresses are an exact match (colored green).
The places are a mismatch (colored red).
All in all we may have a dubious match to be forwarded for manual inspection. This inspection may, based on additional information or other means, end up confirming these two records as belonging to the same real-world legal entity.
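To make the mechanics concrete, here is a minimal sketch of how such a field-by-field comparison might be scored. It is not taken from any specific matching product; the field names, weights, thresholds and sample values are illustrative assumptions:

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Blend token overlap (which tolerates swapped words) with character order."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    token_sim = len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0
    order_sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return 0.7 * token_sim + 0.3 * order_sim

def exact_similarity(a, b):
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0

def score_pair(rec1, rec2):
    # Illustrative weights per field
    name = name_similarity(rec1["name"], rec2["name"])          # close match (swapped words)
    street = exact_similarity(rec1["street"], rec2["street"])   # exact match
    place = exact_similarity(rec1["place"], rec2["place"])      # mismatch
    return 0.4 * name + 0.3 * street + 0.3 * place

# Hypothetical records standing in for the two Argentine legal entities
rec1 = {"name": "Banco Ejemplo del Sur", "street": "Av. Corrientes 123", "place": "Buenos Aires"}
rec2 = {"name": "Ejemplo del Sur Banco", "street": "Av. Corrientes 123", "place": "Capital Federal"}

score = score_pair(rec1, rec2)
if score >= 0.9:
    decision = "confident automated match"
elif score >= 0.6:
    decision = "dubious match - forward for manual inspection"
else:
    decision = "no match"
print(round(score, 2), decision)
```

With these assumed weights the pair lands in the middle band, which is exactly the kind of dubious match that gets queued for a human to look at.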
Later we may encounter these two records:
The names are a close match (colored blue).
The street addresses are an exact match (colored green).
The places are basically a mismatch, but as we are learning that “Buenos Aires” and “Capital Federal” may be the same, it is now a close match (colored blue).
Again we may have a dubious match to be forwarded for manual inspection, and again the inspection may, based on additional information or other means, end up confirming these two records as belonging to the same real-world legal entity.
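In code, the learning step could be sketched like this: every time manual inspection confirms a pair as the same entity, differing place names are recorded as probable synonyms, and the place comparison consults that evidence in later runs. The structure, evidence counts and thresholds below are illustrative assumptions, not the actual implementation behind the example:

```python
# Evidence table: unordered pair of place names -> number of confirmations
place_synonym_evidence = {}

def record_confirmed_match(rec1, rec2):
    """Called when manual inspection confirms two records as the same entity."""
    p1, p2 = rec1["place"].strip().lower(), rec2["place"].strip().lower()
    if p1 != p2:
        key = frozenset((p1, p2))
        place_synonym_evidence[key] = place_synonym_evidence.get(key, 0) + 1

def place_similarity(a, b):
    a, b = a.strip().lower(), b.strip().lower()
    if a == b:
        return 1.0                                    # exact match
    evidence = place_synonym_evidence.get(frozenset((a, b)), 0)
    if evidence >= 3:
        return 1.0                                    # consistently confirmed synonym
    if evidence >= 1:
        return 0.8                                    # probable synonym - close match
    return 0.0                                        # mismatch

# The first dubious pair was confirmed manually, so we learn from it:
record_confirmed_match({"place": "Buenos Aires"}, {"place": "Capital Federal"})
print(place_similarity("Buenos Aires", "Capital Federal"))   # 0.8 - now a close match
```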
In a subsequent match run we may meet these two records:
The names are an exact match (colored green).
The street addresses are an exact match (colored green).
The places are basically a mismatch, but as we are consistently learning that “Buenos Aires” and “Capital Federal” may be the same, it is now an exact match (colored green).
We have a confident automated match with no need for costly manual inspection.
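Continuing the sketch above (and keeping the assumed threshold of three confirmations), once the synonym has been confirmed consistently the place comparison returns an exact match, and a pair like this can clear the auto-accept threshold without ever being queued for inspection:

```python
# Two more confirmed manual inspections of pairs with these place names
record_confirmed_match({"place": "Buenos Aires"}, {"place": "Capital Federal"})
record_confirmed_match({"place": "Buenos Aires"}, {"place": "Capital Federal"})

print(place_similarity("Buenos Aires", "Capital Federal"))   # 1.0 - treated as an exact match
```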
This example is one of many more you may learn about in the new eLearningCurve course called Data Parsing, Matching and De-Duplication.
That is why de-dupe solutions are, in a sense, learning over time.
Usually we maintain a city look-up table which gets updated after de-dupe is done by looking at the match groups where city names are not matching.
Thanks for commenting, Tirthankar. Gathering these synonyms can indeed be done in many ways. Address validation services, which know about these synonyms, may also be used in this case.
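A rough sketch of the look-up table approach you describe could be: after a de-dupe run, scan the match groups where city names differ and add those names to a synonym look-up table for the next run. The data structures and field names here are illustrative assumptions:

```python
from itertools import combinations

def update_city_lookup(match_groups, city_lookup):
    """match_groups: lists of records already grouped as the same real-world entity."""
    for group in match_groups:
        cities = {rec["city"].strip().lower() for rec in group}
        # Any two differing city names within a confirmed group are probable synonyms
        for a, b in combinations(sorted(cities), 2):
            city_lookup.setdefault(a, set()).add(b)
            city_lookup.setdefault(b, set()).add(a)
    return city_lookup

city_lookup = {}
groups = [[{"city": "Buenos Aires"}, {"city": "Capital Federal"}]]
print(update_city_lookup(groups, city_lookup))
# {'buenos aires': {'capital federal'}, 'capital federal': {'buenos aires'}}
```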
Hi Henrik,
I really like the componentized learning you describe here. It’s essential to achieving efficiency and sustained learning in an enterprise.
Now my question is, do you “force” a single view of this record in your subsequent decision to “match” or do you recognize a similarity and a distinction that allows you to affiliate them (as they describe the same physical location)?
Thanks for commenting Jeff. Good question.
The example about learning the probable link between “Buenos Aires” and “Capital Federal” is taken from a matching solution running at a large business directory provider. In this case the match is the basis for data enrichment services.
When working with master data hubs my recommendation is maintaining a golden record representing the real world entity with links to the original instances of records from various sources.
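As a rough illustration of that recommendation (the field names and identifiers are made up for the example), a golden record can carry the survived values for the real-world entity while keeping links back to the untouched source records:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SourceLink:
    source_system: str       # e.g. the CRM or directory feed the record came from
    source_record_id: str    # key of the original record in that source

@dataclass
class GoldenRecord:
    entity_id: str           # identifier for the real-world legal entity
    name: str
    street: str
    place: str
    source_links: List[SourceLink] = field(default_factory=list)

# Hypothetical golden record pointing back to two source records
golden = GoldenRecord(
    entity_id="AR-000123",
    name="Ejemplo del Sur S.A.",
    street="Av. Corrientes 123",
    place="Buenos Aires",
    source_links=[SourceLink("directory_feed", "7781"), SourceLink("crm", "C-4492")],
)
print(golden.source_links)
```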
Hi Henrik, this is what Google and other search engines do. I should mention that matching has a large overlap with search engines and shares the same background in information retrieval theory. These algorithms are called learning to rank; they are trained over time and can also incorporate user profiling.
Ondrej, thanks for joining. I agree, data matching and search engine technology are closely related. Right now I'm involved with a tool called iDQ (instant Data Quality), where we combine data matching and search engine features.
We are doing some similar work with probabilities and machine learning in identity resolution. Curious to follow what others are doing. http://matchbox.io/consumer-identification.html