Probabilistic Learning in Data Matching

One of the data matching techniques I find most exciting is probabilistic learning: a machine learning approach where manually inspected results from previous automated match runs are used to make the automated matching better in the future.

Let’s look at an example. Below we are comparing two data rows with legal entities from Argentina:

The names are a close match (colored blue) as we have two swapped words.

The street addresses are an exact match (colored green).

The places are a mismatch (colored red).

All in all we have a dubious match to be forwarded for manual inspection. Based on additional information or other means, this inspection may end up confirming that the two records belong to the same real-world legal entity.
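The comparison above can be sketched in a few lines of Python. This is only a minimal illustration, not the matching engine from the example: the function names, the 0.85 similarity threshold, the decision rule and the sample rows are all assumptions made for the sketch.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(value.lower().split())


def compare_field(a, b, close_threshold=0.85):
    """Return 'exact', 'close' or 'mismatch' for two field values."""
    a_norm, b_norm = normalize(a), normalize(b)
    if a_norm == b_norm:
        return "exact"
    # The same words in a different order (swapped words) count as close.
    if sorted(a_norm.split()) == sorted(b_norm.split()):
        return "close"
    # Otherwise fall back to a character-level similarity ratio.
    if SequenceMatcher(None, a_norm, b_norm).ratio() >= close_threshold:
        return "close"
    return "mismatch"


def classify_pair(rec_a, rec_b, fields=("name", "street", "place")):
    """Combine the field outcomes into 'match', 'review' or 'non-match'."""
    outcomes = {f: compare_field(rec_a[f], rec_b[f]) for f in fields}
    values = list(outcomes.values())
    if all(v == "exact" for v in values):
        decision = "match"        # confident automated match
    elif values.count("mismatch") >= 2:
        decision = "non-match"    # assumed rule: too little agreement
    else:
        decision = "review"       # dubious: forward for manual inspection
    return decision, outcomes


# Invented sample rows, not the records from the original example.
row_a = {"name": "Ejemplo Servicios SA",
         "street": "Av. Corrientes 1234",
         "place": "Buenos Aires"}
row_b = {"name": "Servicios Ejemplo SA",
         "street": "Av. Corrientes 1234",
         "place": "Capital Federal"}

decision, outcomes = classify_pair(row_a, row_b)
print(decision, outcomes)
```

With these invented rows the name is close, the street is exact and the place is a mismatch, so the pair ends up in the review queue.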

Later we may encounter the two records:

The names are a close match (colored blue).

The street addresses are an exact match (colored green).

The places are basically a mismatch, but as we are learning that “Buenos Aires” and “Capital Federal” may be the same, they now count as a close match (colored blue).

All in all we still have a dubious match to be forwarded for manual inspection. Based on additional information or other means, this inspection may again end up confirming that the two records belong to the same real-world legal entity.
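One simple way to capture what the reviewers have taught us is to count confirmed matches per pair of differing values. The sketch below assumes that one confirmation upgrades a pair to a close match and two confirmations to an exact match. The class name and both thresholds are hypothetical, chosen only to mirror the progression in this example.

```python
from collections import Counter


class SynonymLearner:
    """Counts how often reviewers confirmed a match despite differing values."""

    def __init__(self, close_after=1, exact_after=2):
        self.confirmations = Counter()
        self.close_after = close_after  # assumed threshold
        self.exact_after = exact_after  # assumed threshold

    @staticmethod
    def _key(a, b):
        # Order-independent, case-insensitive key for a pair of values.
        return frozenset((a.strip().lower(), b.strip().lower()))

    def confirm(self, a, b):
        """Record one manually confirmed match where the values differed."""
        if a.strip().lower() != b.strip().lower():
            self.confirmations[self._key(a, b)] += 1

    def compare(self, a, b):
        """Return 'exact', 'close' or 'mismatch' using what was learned so far."""
        if a.strip().lower() == b.strip().lower():
            return "exact"
        seen = self.confirmations[self._key(a, b)]
        if seen >= self.exact_after:
            return "exact"
        if seen >= self.close_after:
            return "close"
        return "mismatch"


learner = SynonymLearner()
print(learner.compare("Buenos Aires", "Capital Federal"))  # before any review
learner.confirm("Buenos Aires", "Capital Federal")         # first confirmed review
print(learner.compare("Buenos Aires", "Capital Federal"))
learner.confirm("Capital Federal", "Buenos Aires")         # second confirmed review
print(learner.compare("Buenos Aires", "Capital Federal"))
```

The three comparisons move from mismatch to close to exact as the confirmed reviews accumulate.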

In a next match run we may meet these two records:

The names are an exact match (colored green).

The street addresses are an exact match (colored green).

The places are basically a mismatch, but as we are consistently learning that “Buenos Aires” and “Capital Federal” may be the same, they now count as an exact match (colored green).

We have a confident automated match with no need of costly manual inspection.
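The word “consistently” matters here: a synonym should only be trusted when reviewers keep confirming it. One hedged way to express that, which is not necessarily how any particular matching product does it, is a smoothed proportion of confirming reviews. The function names, the add-one smoothing and the 0.6 and 0.75 cut-offs below are assumptions made for this sketch.

```python
def match_probability(confirmed, rejected, prior_confirmed=1, prior_rejected=1):
    """Smoothed share of manual reviews that confirmed two values as equivalent.

    The priors act as one imaginary confirmation and one imaginary rejection,
    so a pair with no reviews at all starts at 0.5.
    """
    total = confirmed + rejected + prior_confirmed + prior_rejected
    return (confirmed + prior_confirmed) / total


def learned_outcome(confirmed, rejected, close_at=0.6, exact_at=0.75):
    """Translate the learned probability into a field-level outcome."""
    p = match_probability(confirmed, rejected)
    if p >= exact_at:
        return "exact"
    if p >= close_at:
        return "close"
    return "mismatch"


# (confirmed, rejected) review counts for one pair of place names.
for counts in [(0, 0), (1, 0), (2, 0), (2, 2)]:
    print(counts, round(match_probability(*counts), 3), learned_outcome(*counts))
```

Under these assumptions two confirmations without any rejection are enough for an exact match, while mixed review results pull the pair back towards a mismatch.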

This example is one of many you may learn about in the new eLearningCurve course called Data Parsing, Matching and De-Duplication.


7 thoughts on “Probabilistic Learning in Data Matching”

  1. Tirthankar Ghosh 27th July 2012 / 06:04

    That is why de-dupe solutions sort of learn over time.
    Usually we maintain a city look-up table which gets updated after de-dupe is done by looking at the match groups where city names are not matching.

    • Henrik Liliendahl Sørensen 27th July 2012 / 08:40

      Thanks for commenting, Tirthankar. Gathering these synonyms can be done in many ways indeed. Address validation services, knowing about these synonyms, may in this case also be used.

  2. Jeff Jones 31st July 2012 / 15:47

    Hi Henrik,

    I really like the componentized learning you describe here. It’s essential to achieving efficiency and sustained learning in an enterprise.

    Now my question is, do you “force” a single view of this record in your subsequent decision to “match” or do you recognize a similarity and a distinction that allows you to affiliate them (as they describe the same physical location)?

    • Henrik Liliendahl Sørensen 31st July 2012 / 16:23

      Thanks for commenting Jeff. Good question.

      The example about learning the probable link between “Buenos Aires” and “Capital Federal” is taken from a matching solution running at a large business directory provider. In this case the match is the basis for data enrichment services.

      When working with master data hubs my recommendation is maintaining a golden record representing the real world entity with links to the original instances of records from various sources.

  3. Ondrej Rozinek 6th August 2012 / 21:23

    Hi Henrik, this is what Google and other search engines do. I should mention that matching has the most overlap with search engines and shares the same background in information retrieval theory. These algorithms are called learning to rank; they learn over time and can also use user profiling.

    • Henrik Liliendahl Sørensen 7th August 2012 / 07:49

      Ondrej. Thanks for joining. I agree, data matching and search engine technology are closely related. Right now I’m involved in a tool called iDQ (instant Data Quality) where we combine data matching and search engine features.
