The Art in Data Matching

1st February 2011 / 29th May 2012 / Henrik Gabs Liliendahl

I’ve just investigated a suspicious customer data match:

A Company on Kunstlaan no 99 in Brussel

was matched with high confidence with:

The Company on Avenue des Arts no 99 in Bruxelles

At first glance it perhaps didn’t look like a confident match, but I guess the computer is right.

The relevant facts are:

Brussels is the Belgian capital
Belgium has two main languages: French and Flemish (a variant of Dutch)
Some parts of the country are French-speaking, some parts are Flemish-speaking and the capital is both
Brussels is Bruxelles in French and Brussel in Flemish
Kunst is Flemish meaning Art (as in Dutch, German and Scandinavian too)
Laan is Flemish meaning Avenue (same origin as Lane I guess)
Avenue des Arts is French meaning Avenue of the Arts (French is easy)

Technically the computer in this case did as follows:

Compared the names “A Company” and “The Company” and found a small edit distance between the two.
Remembered from some earlier occasions that “Kunstlaan” and “Avenue des Arts” were accepted as a match.
Remembered from numerous earlier occasions that “Brussel” (or “Brüssel”) and “Bruxelles” were accepted as a match.
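
The three steps above can be sketched roughly as follows. This is only a minimal illustration, not the actual matching engine: the Vocabulary class, the is_match function, the record layout and the 0.7 name threshold are all assumptions made up for the example, and a plain Levenshtein distance stands in for whatever similarity algorithm the real program uses.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def name_similarity(a: str, b: str) -> float:
    """Edit distance scaled to 0..1, where 1.0 means identical."""
    a, b = a.casefold(), b.casefold()
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest


class Vocabulary:
    """Hypothetical store of term pairs accepted as equal in earlier matches."""

    def __init__(self):
        self._pairs = set()

    def learn(self, a: str, b: str) -> None:
        self._pairs.add(frozenset((a.casefold(), b.casefold())))

    def equivalent(self, a: str, b: str) -> bool:
        a, b = a.casefold(), b.casefold()
        return a == b or frozenset((a, b)) in self._pairs


def is_match(rec_a: dict, rec_b: dict, vocab: Vocabulary,
             name_threshold: float = 0.7) -> bool:
    """Names must be close; street and city must be equal or learned as equal."""
    return (name_similarity(rec_a["name"], rec_b["name"]) >= name_threshold
            and vocab.equivalent(rec_a["street"], rec_b["street"])
            and rec_a["number"] == rec_b["number"]
            and vocab.equivalent(rec_a["city"], rec_b["city"]))


if __name__ == "__main__":
    vocab = Vocabulary()
    vocab.learn("Kunstlaan", "Avenue des Arts")  # accepted on earlier occasions
    vocab.learn("Brussel", "Bruxelles")
    a = {"name": "A Company", "street": "Kunstlaan",
         "number": "99", "city": "Brussel"}
    b = {"name": "The Company", "street": "Avenue des Arts",
         "number": "99", "city": "Bruxelles"}
    print(is_match(a, b, vocab))
```

With an empty vocabulary the same two records would not match, which is the point: the edit distance handles the names, while the learned pairs handle terms that share no characters at all.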

It may also have been told beforehand that “Kunstlaan” and “Avenue des Arts” are two names of the same street in some Belgian address reference data, which I guess is a must when doing heavy data matching on the Belgian market.

In this case it was a global match environment not equipped with worldwide address reference data, so luckily the probabilistic learning element in the computer program saved the day.

Wayne Colless 1st February 2011 / 23:24

Hi Henrik
Your matching solution was obviously not dependent upon the degree of character alignment in the match keys and instead relied on other rules and perhaps reference sources to determine ‘equivalence’ of values. The same principle is often used in identity matching. Some name matching approaches allow ‘equivalent’ names to count as a match. For example: John, Ewan, Shawn, Johnny, Jack, Ian, Evan, Hans, Jens, Jan, Jon, Johan, Johannes, Giovanni, Gianni, Juan, Ivan and Ġwanni might all be deemed to be ‘equivalent’ and acceptable for matching.
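
As a rough illustration of such name equivalence, the sketch below maps every variant in the list above to one group key and compares the keys instead of the spellings. NAME_GROUPS, normalise, build_index and equivalent_names are hypothetical names for this example only, and a real solution would load its groups from a maintained reference source.

```python
import unicodedata

# The variants listed above, treated as a single equivalence group.
# Illustrative only: a real name table would hold many such groups.
NAME_GROUPS = {
    "john": ["John", "Ewan", "Shawn", "Johnny", "Jack", "Ian", "Evan",
             "Hans", "Jens", "Jan", "Jon", "Johan", "Johannes",
             "Giovanni", "Gianni", "Juan", "Ivan", "Ġwanni"],
}


def normalise(name: str) -> str:
    """Case-fold and strip diacritics so that 'Ġwanni' becomes 'gwanni'."""
    decomposed = unicodedata.normalize("NFKD", name.strip())
    return "".join(c for c in decomposed
                   if not unicodedata.combining(c)).casefold()


def build_index(groups: dict) -> dict:
    """Map every normalised variant to the key of its group."""
    return {normalise(variant): key
            for key, variants in groups.items()
            for variant in variants}


def equivalent_names(a: str, b: str, index: dict) -> bool:
    """Two names are equivalent when they resolve to the same group key."""
    key_a = index.get(normalise(a), normalise(a))
    key_b = index.get(normalise(b), normalise(b))
    return key_a == key_b


if __name__ == "__main__":
    index = build_index(NAME_GROUPS)
    print(equivalent_names("Giovanni", "Ġwanni", index))
    print(equivalent_names("John", "Peter", index))
```

Names that are in no group simply fall back to their normalised spelling, so they only match themselves.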

- Henrik Liliendahl Sørensen 2nd February 2011 / 07:48
  
  Hi Wayne, thanks for commenting. Yes, in this solution we use both traditional (and commercial) similarity algorithms along with a vocabulary. Some words in the vocabulary were put in beforehand and most others are learned along the way, so to speak.
  
Nigel Thomas 3rd February 2011 / 02:51

Hi Henrik,
For sure Bruxelles, Brussel, or any misspelt variation of it is an interesting challenge, just like any situation where the company name and address can be represented in multiple languages, legal vs tradestyle names, etc. Keeps us all occupied.

I was interested in the ‘learning’ element of your solution. Do you have any metrics regarding the quality/accuracy of the match, versus what a solution using global address reference data would deliver?
Regards, Nigel

- Henrik Liliendahl Sørensen 3rd February 2011 / 09:04
  
  Hi Nigel. I don’t have any solid numbers, and as with most things in data quality I guess it is going to be difficult to settle on such metrics. If you base your learned vocabulary on the terms captured within a given organization, you need a lot of processed data matching before it pays off. I do, though, believe we will see some open data in this field. Global address reference data differs a lot between countries in coverage, depth, timeliness, precision and accuracy.
  
Kalina Lipinska 4th February 2011 / 11:39

Hi Henrik,

It must be a good algorithm that catches that. Belgium is indeed quite complex because streets in Brussels officially have two names, Flemish and French, and both are typically included in global reference files. The problem lies in all the other cities and towns in either Flanders or Wallonia, which officially should have single-language street names, but where people use both. This typically is not reflected in the reference files, and it’s where machine learning algorithms can be of huge help. Good luck with that!

- Henrik Liliendahl Sørensen 4th February 2011 / 14:17
  
  Thanks a lot Kalina. Good information about the widespread use of double street names not included in official reference data.
  
Nicolas 4th February 2011 / 17:04

Hi Henrik,

So interesting for me! As I am working in Belgium, I know how difficult it is to manage both national languages (Dutch and French).

The solution we have in my company (a provider of data solutions for B2B and B2C referentials) is to separate the treatment of addresses (a multilingual street referential, updated daily) from the treatment of names!

Furthermore, we use a historical dataset, which also lets us match on an old name or even an old address for the company!

We can match as follows:

THE OLD FIRM – kunstlaan no 99 – Brussels

With

THE NEW ONE – rue de la loi no 21 – Bruxelles

=> assuming that the company has changed its name, has moved and the input is written in another language
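
A minimal sketch of this kind of identification, under heavy assumptions: the Company record, the identify function, the two alias tables and the company id are all invented for illustration and say nothing about how the real referential is built. The alias tables stand in for the multilingual street referential, and the former names and addresses stand in for the historical dataset.

```python
from dataclasses import dataclass, field

# Hypothetical alias tables standing in for a multilingual street referential.
STREET_ALIASES = {"kunstlaan": "avenue des arts", "wetstraat": "rue de la loi"}
CITY_ALIASES = {"brussel": "bruxelles", "brussels": "bruxelles"}


def canon(term: str, aliases: dict) -> str:
    """Case-fold, collapse whitespace and map known aliases to one form."""
    key = " ".join(term.casefold().split())
    return aliases.get(key, key)


def canon_address(street: str, number, city: str) -> tuple:
    return (canon(street, STREET_ALIASES),
            str(number).strip(),
            canon(city, CITY_ALIASES))


@dataclass
class Company:
    company_id: str
    current_name: str
    current_address: tuple  # (street, number, city)
    former_names: list = field(default_factory=list)
    former_addresses: list = field(default_factory=list)

    def known_names(self) -> set:
        return {n.casefold() for n in [self.current_name, *self.former_names]}

    def known_addresses(self) -> set:
        return {canon_address(*a)
                for a in [self.current_address, *self.former_addresses]}


def identify(name: str, address: tuple, referential: list):
    """Return the company whose current or former name and address fit."""
    wanted_name = name.casefold()
    wanted_address = canon_address(*address)
    for company in referential:
        if (wanted_name in company.known_names()
                and wanted_address in company.known_addresses()):
            return company
    return None


if __name__ == "__main__":
    referential = [Company("BE-0001", "THE NEW ONE",
                           ("rue de la loi", 21, "Bruxelles"),
                           former_names=["THE OLD FIRM"],
                           former_addresses=[("Avenue des Arts", 99, "Bruxelles")])]
    hit = identify("THE OLD FIRM", ("kunstlaan", 99, "Brussels"), referential)
    print(hit.current_name if hit else "not identified")
```

The names are compared exactly here to keep the sketch short; in practice the name treatment would be a fuzzy comparison of its own, kept separate from the address treatment.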

We speak of the “identification” of a company in our referential rather than of matching!

- Henrik Liliendahl Sørensen 4th February 2011 / 18:10
  
  Thanks for commenting Nicolas. I agree that local matching or identification most often takes it a bit further than generic worldwide approaches.
  
Tirthankar Ghosh 20th June 2011 / 09:15

Hi Henrik,

This is a very good example of using country-specific vocabulary during matching. A multilingual country is always interesting.

- Henrik Liliendahl Sørensen 20th June 2011 / 10:18
  
  Thanks for joining Tirthankar, indeed, multilingual issues are interesting aspects of the art of data matching.
  
Amri 13th December 2018 / 21:11

Hi Henrik,

Thanks for sharing your experience. We are facing the same problem, as we are working on customer data in Belgium. Keep going.


Liliendahl on Data Quality

A blog about Master Data Management, Product Information Management, Data Quality Management and more
