The Art in Data Matching

I’ve just investigated a suspicious customer data match:

A Company on Kunstlaan no 99 in Brussel

was matched with high confidence with:

The Company on Avenue des Arts no 99 in Bruxelles

At first glance it perhaps didn’t look like a confident match, but I guess the computer is right.

The relevant facts are:

  • Brussels is the Belgian capital
  • Belgium has two main languages: French and Flemish (a variant of Dutch)
  • Some parts of the country are French-speaking, some parts are Flemish-speaking and the capital is both
  • Brussels is Bruxelles in French and Brussel in Flemish
  • Kunst is Flemish for Art (as in Dutch, German and the Scandinavian languages too)
  • Laan is Flemish for Avenue (same origin as Lane I guess)
  • Avenue des Arts is French for Avenue of the Arts (French is easy)

Technically the computer in this case did as follows:

  • Compared the names “A Company” and “The Company” and found a small edit distance between the two.
  • Remembered from some earlier occasions that “Kunstlaan” and “Avenue des Arts” were accepted as a match.
  • Remembered from numerous earlier occasions that “Brussel” (or “Brüssel”) and “Bruxelles” were accepted as a match.
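
The steps above can be sketched roughly as follows. This is only an illustration, not the actual matching program: the function names, the contents of the vocabulary, the record layout and the 0.7 name threshold are all assumptions made for the example.

```python
# Hypothetical vocabulary: pairs of terms accepted as a match on earlier occasions.
LEARNED_EQUIVALENTS = {
    frozenset({"kunstlaan", "avenue des arts"}),
    frozenset({"brussel", "bruxelles"}),
    frozenset({"brüssel", "bruxelles"}),
}


def levenshtein(a: str, b: str) -> int:
    """Edit distance: number of single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Edit distance scaled to 0..1, where 1.0 means identical names."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def terms_equivalent(a: str, b: str) -> bool:
    """True if the terms are equal or were accepted as a match before."""
    a, b = a.lower(), b.lower()
    return a == b or frozenset({a, b}) in LEARNED_EQUIVALENTS


def is_match(rec_a: dict, rec_b: dict, name_threshold: float = 0.7) -> bool:
    """Assumed rule: similar names plus equivalent street, number and city."""
    return (name_similarity(rec_a["name"], rec_b["name"]) >= name_threshold
            and terms_equivalent(rec_a["street"], rec_b["street"])
            and rec_a["number"] == rec_b["number"]
            and terms_equivalent(rec_a["city"], rec_b["city"]))


record_a = {"name": "A Company", "street": "Kunstlaan", "number": "99", "city": "Brussel"}
record_b = {"name": "The Company", "street": "Avenue des Arts", "number": "99", "city": "Bruxelles"}
print(is_match(record_a, record_b))  # True
```

The names are compared character by character, while the street and city are compared through the vocabulary, since no edit distance would ever bring “Kunstlaan” close to “Avenue des Arts”.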

It may also have been told beforehand that “Kunstlaan” and “Avenue des Arts” are two names of the same street, using some Belgian address reference data, which I guess is a must when doing heavy data matching on the Belgian market.

In this case it was a global match environment not equipped with worldwide address reference data, so luckily the probabilistic learning element in the computer program saved the day.
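
One way such a learning element could work is to keep a count of how often a pair of terms has been accepted or rejected, and turn the counts into a confidence. The sketch below is an assumption about the general idea, not a description of the program used here: the class name `LearnedVocabulary`, the smoothing formula and the example counts are all made up for illustration.

```python
from collections import defaultdict


class LearnedVocabulary:
    """Hypothetical store of term pairs confirmed or rejected in earlier matching."""

    def __init__(self) -> None:
        self._accepted = defaultdict(int)
        self._rejected = defaultdict(int)

    @staticmethod
    def _key(a: str, b: str) -> frozenset:
        # Order and letter case of the two terms do not matter.
        return frozenset({a.strip().lower(), b.strip().lower()})

    def record(self, a: str, b: str, accepted: bool) -> None:
        """Remember one reviewed decision about the pair (a, b)."""
        counts = self._accepted if accepted else self._rejected
        counts[self._key(a, b)] += 1

    def confidence(self, a: str, b: str) -> float:
        """Smoothed share of acceptances; an unseen pair scores a neutral 0.5."""
        key = self._key(a, b)
        accepted = self._accepted[key]
        rejected = self._rejected[key]
        return (accepted + 1) / (accepted + rejected + 2)


vocabulary = LearnedVocabulary()
for _ in range(3):   # accepted on "some earlier occasions"
    vocabulary.record("Kunstlaan", "Avenue des Arts", accepted=True)
for _ in range(20):  # accepted on "numerous earlier occasions"
    vocabulary.record("Brussel", "Bruxelles", accepted=True)

print(vocabulary.confidence("Kunstlaan", "Avenue des Arts"))  # 0.8
print(vocabulary.confidence("Brussel", "Bruxelles"))
print(vocabulary.confidence("Brussel", "Antwerpen"))          # 0.5
```

With counts like these, a pair seen only a few times contributes less confidence than one confirmed many times, which is the behaviour you would want before trusting a learned street name as much as a learned city name.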

11 thoughts on “The Art in Data Matching”

  1. Wayne Colless 1st February 2011 / 23:24

    Hi Henrik
    Your matching solution was obviously not dependent upon the degree of character alignment in the match keys and instead relied on other rules and perhaps reference sources to determine ‘equivalence’ of values. The same principle is often used in identity matching. Some name matching approaches allow ‘equivalent’ names to count as a match. For example: John, Ewan, Shawn, Johnny, Jack, Ian, Evan, Hans, Jens, Jan, Jon, Johan, Johannes, Giovanni, Gianni, Juan, Ivan and Ġwanni might all be deemed to be ‘equivalent’ and acceptable for matching.

    • Henrik Liliendahl Sørensen 2nd February 2011 / 07:48

      Hi Wayne, thanks for commenting. Yes, in this solution we use traditional (and commercial) similarity algorithms along with a vocabulary. Some words in the vocabulary were put in beforehand and most others are learned along the way, so to speak.

  2. Nigel Thomas 3rd February 2011 / 02:51

    Hi Henrik,
    For sure Bruxelles, Brussel, or any misspelt variation of it is an interesting challenge, just like any situation where the company name and address can be represented in multiple languages, legal vs tradestyle names, etc. Keeps us all occupied.

    I was interested in the ‘learning’ element of your solution. Do you have any metrics regarding the quality/accuracy of the match, versus what a solution using global address reference data would deliver?
    Regards, Nigel

    • Henrik Liliendahl Sørensen 3rd February 2011 / 09:04

      Hi Nigel. I don’t have any solid numbers and, as with most things data quality, I guess it is going to be difficult to settle on such metrics. If you base your learned vocabulary only on the terms captured within a given organization, you need to process a lot of data matching before it pays off. I do though believe we will see some open data in this field. Global address reference data differs a lot between countries when measured by coverage, depth, timeliness, precision and accuracy.

  3. Kalina Lipinska 4th February 2011 / 11:39

    Hi Henrik,

    it must be a good algorithm that catches that. Belgium is indeed quite complex because streets in Brussels do officially have two names, Flemish and French, which are typically included in global reference files. The problem lies in all the other cities and towns in either Flanders or Wallonia, which officially should have street names in one language only, but where people use both. This is typically not reflected in the reference files, and that is where machine learning algorithms can be of huge help. Good luck with that!

    • Henrik Liliendahl Sørensen 4th February 2011 / 14:17

      Thanks a lot Kalina. Good information about the widespread use of double street names not included in official reference data.

  4. Nicolas 4th February 2011 / 17:04

    Hi Henrik,

    So interesting for me! As I am working in Belgium, we know how difficult it is to manage both national languages (Dutch and French).

    The solution we have in my company (a provider of data solutions related to BtB and BtC referentials) is to separate the treatment of the addresses (multilingual street referential, updated daily) from the treatment of the name!

    Furthermore, we use a historical dataset which also enables us to match with an old name or even an old address for the company!

    We can match as follows:

    THE OLD FIRM – kunstlaan no 99 – Brussels

    With

    THE NEW ONE – rue de la loi no 21 – Bruxelles

    => assuming that the company has changed its name, has moved and the input is written in another language

    We speak of the “identification” of a company in our referential rather than of a matching!

    • Henrik Liliendahl Sørensen 4th February 2011 / 18:10

      Thanks for commenting Nicolas. I agree that local matching or identification most often takes it a bit further than generic worldwide approaches.

  5. Tirthankar Ghosh 20th June 2011 / 09:15

    Hi Henrik,

    This is a very good example of using country-specific vocabulary during matching. A multilingual country is always interesting.

    • Henrik Liliendahl Sørensen 20th June 2011 / 10:18

      Thanks for joining Tirthankar, indeed, multilingual issues are interesting aspects of the art of data matching.
