One of the data matching techniques I find most exciting is probabilistic learning, a form of machine learning in which the results of manually inspecting earlier automated matches are fed back to make future automated matching better.
Let’s look at an example. Below we are comparing two data rows with legal entities from Argentina:
The names are a close match (colored blue), as two words are swapped.
The street addresses are an exact match (colored green).
The places are a mismatch (colored red).
All in all, we may have a dubious match that is forwarded for manual inspection. Based on additional information or other means, this inspection may end up confirming that the two records belong to the same real-world legal entity.
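The field-by-field comparison above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names, the 0.85 similarity cut-off, the scoring scheme and the sample values are all made up for the example and are not taken from any particular matching tool.

```python
from difflib import SequenceMatcher

# Assumed scoring: these weights are illustrative, not from a real product.
SCORES = {"exact": 2, "close": 1, "mismatch": 0}


def compare_field(a: str, b: str) -> str:
    """Return 'exact', 'close' or 'mismatch' for two field values."""
    words_a, words_b = a.casefold().split(), b.casefold().split()
    if words_a == words_b:
        return "exact"
    # The same words in a different order count as a close match.
    if sorted(words_a) == sorted(words_b):
        return "close"
    # Otherwise fall back to character similarity; 0.85 is an assumed cut-off.
    ratio = SequenceMatcher(None, " ".join(words_a), " ".join(words_b)).ratio()
    return "close" if ratio >= 0.85 else "mismatch"


def classify_pair(levels: list[str]) -> str:
    """Combine the field results into an overall decision."""
    total = sum(SCORES[level] for level in levels)
    if total == 2 * len(levels):   # every field is an exact match
        return "automatic match"
    if total >= len(levels):       # enough evidence to be worth a look
        return "manual inspection"
    return "no match"


# Hypothetical records: name, street address, place.
record_1 = ("Acme Argentina SA", "Av. Corrientes 1234", "Buenos Aires")
record_2 = ("Argentina Acme SA", "Av. Corrientes 1234", "Capital Federal")

levels = [compare_field(a, b) for a, b in zip(record_1, record_2)]
print(levels, "->", classify_pair(levels))
```

With a close name, an exact street address and a mismatching place, the pair lands in the manual inspection queue rather than being matched or rejected automatically.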
Later we may encounter these two records:
The names are a close match (colored blue).
The street addresses are an exact match (colored green).
The places are basically a mismatch, but as we are learning that “Buenos Aires” and “Capital Federal” may be the same, it is now a close match (colored blue).
All in all, we may again have a dubious match that is forwarded for manual inspection. Based on additional information or other means, this inspection may end up confirming that the two records belong to the same real-world legal entity.
In a next match run we may meet these two records:
The names are an exact match (colored green).
The street addresses are an exact match (colored green).
The places are basically a mismatch, but as we are consistently learning that “Buenos Aires” and “Capital Federal” may be the same, it is now an exact match (colored green).
We now have a confident automated match with no need for costly manual inspection.
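The learning step in this walkthrough can be sketched as a small Python class that counts how often manual inspection has confirmed a pair of differing place values. It is a minimal sketch under my own assumptions: the class name `PlaceSynonymLearner` is hypothetical, and the thresholds (one confirmation for a close match, two for an exact match) simply mirror the three match runs above. A real solution would probably weigh the evidence more carefully.

```python
from collections import Counter


class PlaceSynonymLearner:
    """Learns place synonyms from manually confirmed matches (illustrative only)."""

    def __init__(self, close_after: int = 1, exact_after: int = 2):
        # Assumed thresholds: confirmations needed before a pair is upgraded.
        self.close_after = close_after
        self.exact_after = exact_after
        self.confirmations = Counter()

    @staticmethod
    def _key(a: str, b: str) -> frozenset:
        # Order-independent, case-insensitive key for a pair of values.
        return frozenset((a.casefold().strip(), b.casefold().strip()))

    def confirm(self, a: str, b: str) -> None:
        """Record that manual inspection confirmed a match where a and b differed."""
        self.confirmations[self._key(a, b)] += 1

    def compare(self, a: str, b: str) -> str:
        """Return 'exact', 'close' or 'mismatch', using what has been learned."""
        if a.casefold().strip() == b.casefold().strip():
            return "exact"
        seen = self.confirmations[self._key(a, b)]
        if seen >= self.exact_after:
            return "exact"
        if seen >= self.close_after:
            return "close"
        return "mismatch"


learner = PlaceSynonymLearner()
print(learner.compare("Buenos Aires", "Capital Federal"))  # mismatch
learner.confirm("Buenos Aires", "Capital Federal")         # first manual confirmation
print(learner.compare("Buenos Aires", "Capital Federal"))  # close
learner.confirm("Buenos Aires", "Capital Federal")         # second manual confirmation
print(learner.compare("Capital Federal", "Buenos Aires"))  # exact
```

The point of the design is that every costly manual decision is stored and reused, so the same question does not have to be answered by a person over and over again.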
This example is one of many you may learn about in the new eLearningCurve course called Data Parsing, Matching and De-Duplication.











