One of the techniques in data matching I find most exciting is probabilistic learning with machine learning, where manually inspected results from previous automated match runs are used to make future automated matching better.
Let’s look at an example. Below we are comparing two data rows with legal entities from Argentina:
The names are a close match (colored blue) as we have two swapped words.
The street addresses are an exact match (colored green).
The places are a mismatch (colored red).
All in all we may have a dubious match to be forwarded for manual inspection. This inspection may, based on additional information or other means, end up confirming these two records as belonging to the same real-world legal entity.
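To make the mechanics concrete, here is a minimal sketch of how such a field-by-field comparison might be scored. It is not taken from any specific matching product; the field names, weights, thresholds and sample values are illustrative assumptions:

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Blend token overlap (which tolerates swapped words) with character order."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    token_sim = len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0
    order_sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return 0.7 * token_sim + 0.3 * order_sim

def exact_similarity(a, b):
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0

def score_pair(rec1, rec2):
    # Illustrative weights per field
    name = name_similarity(rec1["name"], rec2["name"])          # close match (swapped words)
    street = exact_similarity(rec1["street"], rec2["street"])   # exact match
    place = exact_similarity(rec1["place"], rec2["place"])      # mismatch
    return 0.4 * name + 0.3 * street + 0.3 * place

# Hypothetical records standing in for the two Argentine legal entities
rec1 = {"name": "Banco Ejemplo del Sur", "street": "Av. Corrientes 123", "place": "Buenos Aires"}
rec2 = {"name": "Ejemplo del Sur Banco", "street": "Av. Corrientes 123", "place": "Capital Federal"}

score = score_pair(rec1, rec2)
if score >= 0.9:
    decision = "confident automated match"
elif score >= 0.6:
    decision = "dubious match - forward for manual inspection"
else:
    decision = "no match"
print(round(score, 2), decision)
```

With these assumed weights the pair lands in the middle band, which is exactly the kind of dubious match that gets queued for a human to look at.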
Later we may encounter these two records:
The names are a close match (colored blue).
The street addresses are an exact match (colored green).
The places are basically a mismatch, but as we are learning that “Buenos Aires” and “Capital Federal” may be the same, it is now a close match (colored blue).
Again we may have a dubious match to be forwarded for manual inspection, and again the inspection may, based on additional information or other means, end up confirming these two records as belonging to the same real-world legal entity.
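In code, the learning step could be sketched like this: every time manual inspection confirms a pair as the same entity, differing place names are recorded as probable synonyms, and the place comparison consults that evidence in later runs. The structure, evidence counts and thresholds below are illustrative assumptions, not the actual implementation behind the example:

```python
# Evidence table: unordered pair of place names -> number of confirmations
place_synonym_evidence = {}

def record_confirmed_match(rec1, rec2):
    """Called when manual inspection confirms two records as the same entity."""
    p1, p2 = rec1["place"].strip().lower(), rec2["place"].strip().lower()
    if p1 != p2:
        key = frozenset((p1, p2))
        place_synonym_evidence[key] = place_synonym_evidence.get(key, 0) + 1

def place_similarity(a, b):
    a, b = a.strip().lower(), b.strip().lower()
    if a == b:
        return 1.0                                    # exact match
    evidence = place_synonym_evidence.get(frozenset((a, b)), 0)
    if evidence >= 3:
        return 1.0                                    # consistently confirmed synonym
    if evidence >= 1:
        return 0.8                                    # probable synonym - close match
    return 0.0                                        # mismatch

# The first dubious pair was confirmed manually, so we learn from it:
record_confirmed_match({"place": "Buenos Aires"}, {"place": "Capital Federal"})
print(place_similarity("Buenos Aires", "Capital Federal"))   # 0.8 - now a close match
```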
In a subsequent match run we may meet these two records:
The names are an exact match (colored green).
The street addresses are an exact match (colored green).
The places are basically a mismatch, but as we are consistently learning that “Buenos Aires” and “Capital Federal” may be the same, it is now an exact match (colored green).
We have a confident automated match with no need for costly manual inspection.
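Continuing the sketch above (and keeping the assumed threshold of three confirmations), once the synonym has been confirmed consistently the place comparison returns an exact match, and a pair like this can clear the auto-accept threshold without ever being queued for inspection:

```python
# Two more confirmed manual inspections of pairs with these place names
record_confirmed_match({"place": "Buenos Aires"}, {"place": "Capital Federal"})
record_confirmed_match({"place": "Buenos Aires"}, {"place": "Capital Federal"})

print(place_similarity("Buenos Aires", "Capital Federal"))   # 1.0 - treated as an exact match
```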
This example is one of many more you may learn about in the new eLearningCurve course called Data Parsing, Matching and De-Duplication.
That is why de-dupe solutions are, in a sense, learning over time.
Usually we maintain a city look-up table which gets updated after de-dupe is done by looking at the match groups where city names are not matching.
Thanks for commenting, Tirthankar. Gathering these synonyms can indeed be done in many ways. Address validation services, which know about these synonyms, may also be used in this case.
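A rough sketch of the look-up table approach you describe could be: after a de-dupe run, scan the match groups where city names differ and add those names to a synonym look-up table for the next run. The data structures and field names here are illustrative assumptions:

```python
from itertools import combinations

def update_city_lookup(match_groups, city_lookup):
    """match_groups: lists of records already grouped as the same real-world entity."""
    for group in match_groups:
        cities = {rec["city"].strip().lower() for rec in group}
        # Any two differing city names within a confirmed group are probable synonyms
        for a, b in combinations(sorted(cities), 2):
            city_lookup.setdefault(a, set()).add(b)
            city_lookup.setdefault(b, set()).add(a)
    return city_lookup

city_lookup = {}
groups = [[{"city": "Buenos Aires"}, {"city": "Capital Federal"}]]
print(update_city_lookup(groups, city_lookup))
# {'buenos aires': {'capital federal'}, 'capital federal': {'buenos aires'}}
```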
Hi Henrik,
I really like the componentized learning you describe here. It’s essential to achieving efficiency and sustained learning in an enterprise.
Now my question is, do you “force” a single view of this record in your subsequent decision to “match” or do you recognize a similarity and a distinction that allows you to affiliate them (as they describe the same physical location)?
Thanks for commenting Jeff. Good question.
The example about learning the probable link between “Buenos Aires” and “Capital Federal” is taken from a matching solution running at a large business directory provider. In this case the match is the basis for data enrichment services.
When working with master data hubs my recommendation is maintaining a golden record representing the real world entity with links to the original instances of records from various sources.
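As a rough illustration of that recommendation (the field names and identifiers are made up for the example), a golden record can carry the survived values for the real-world entity while keeping links back to the untouched source records:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SourceLink:
    source_system: str       # e.g. the CRM or directory feed the record came from
    source_record_id: str    # key of the original record in that source

@dataclass
class GoldenRecord:
    entity_id: str           # identifier for the real-world legal entity
    name: str
    street: str
    place: str
    source_links: List[SourceLink] = field(default_factory=list)

# Hypothetical golden record pointing back to two source records
golden = GoldenRecord(
    entity_id="AR-000123",
    name="Ejemplo del Sur S.A.",
    street="Av. Corrientes 123",
    place="Buenos Aires",
    source_links=[SourceLink("directory_feed", "7781"), SourceLink("crm", "C-4492")],
)
print(golden.source_links)
```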
Hi Henrik, this is what Google and other search engines do. I should mention that matching has a large overlap with search engines and shares the same background in information retrieval theory. These algorithms are called learning to rank; they are trained over time and can also incorporate user profiling.
Ondrej, thanks for joining. I agree, data matching and search engine technology are closely related. Right now I'm involved with a tool called iDQ (instant Data Quality), where we combine data matching and search engine features.
We are doing some similar work with probabilities and machine learning in identity resolution. Curious to follow what others are doing. http://matchbox.io/consumer-identification.html