I’ve just investigated a suspicious customer data match:
A Company on Kunstlaan no 99 in Brussel
was matched with high confidence with:
The Company on Avenue des Arts no 99 in Bruxelles
At first glance it perhaps didn’t look like a confident match, but I guess the computer is right.
The relevant facts are:
- Brussels is the Belgian capital
- Belgium has two main languages: French and Flemish (a variant of Dutch)
- Some parts of the country are French-speaking, some parts are Flemish-speaking and the capital is both
- Brussels is Bruxelles in French and Brussel in Flemish
- Kunst is Flemish for Art (as in Dutch, German and the Scandinavian languages too)
- Laan is Flemish for Avenue (same origin as Lane, I guess)
- Avenue des Arts is French for Avenue of the Arts (French is easy)
Technically the computer in this case did as follows:
- Compared the names “A Company” and “The Company” and found a small edit distance between the two.
- Remembered from some earlier occasions that “Kunstlaan” and “Avenue des Arts” were accepted as a match.
- Remembered from numerous earlier occasions that “Brussel” (or “Brüssel”) and “Bruxelles” were accepted as a match.
It could also have been told beforehand, from Belgian address reference data, that “Kunstlaan” and “Avenue des Arts” are two names for the same street. I guess such reference data is a must when doing heavy data matching on the Belgian market.
In this case it was a global match environment not equipped with worldwide address reference data, so luckily the probabilistic learning element in the computer program saved the day.
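The three steps above can be sketched roughly as follows. This is only an illustration, not the actual program: the record layout, the `ACCEPTED_PAIRS` vocabulary, the 0.7 name threshold and all function names are assumptions made up for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def name_similarity(a: str, b: str) -> float:
    """Edit distance scaled to 0..1, where 1 means identical."""
    a, b = a.casefold(), b.casefold()
    longest = max(len(a), len(b)) or 1
    return 1 - edit_distance(a, b) / longest


# Hypothetical vocabulary of pairs accepted as matches on earlier occasions.
ACCEPTED_PAIRS = {
    frozenset({"kunstlaan", "avenue des arts"}),
    frozenset({"brussel", "bruxelles"}),
    frozenset({"brüssel", "bruxelles"}),
}


def equivalent(a: str, b: str) -> bool:
    """True if the values are equal or a previously accepted pair."""
    a, b = a.casefold(), b.casefold()
    return a == b or frozenset({a, b}) in ACCEPTED_PAIRS


def is_match(rec_a: dict, rec_b: dict, name_threshold: float = 0.7) -> bool:
    """Names must be close; street and city may match via the vocabulary."""
    return (name_similarity(rec_a["name"], rec_b["name"]) >= name_threshold
            and equivalent(rec_a["street"], rec_b["street"])
            and rec_a["house_no"] == rec_b["house_no"]
            and equivalent(rec_a["city"], rec_b["city"]))
```

Note that the street and city are never compared character by character across languages; they only match because the pair is already in the vocabulary.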
Your matching solution was obviously not dependent upon the degree of character alignment in the match keys and instead relied on other rules and perhaps reference sources to determine ‘equivalence’ of values. The same principle is often used in identity matching. Some name matching approaches allow ‘equivalent’ names to count as a match. For example: John, Ewan, Shawn, Johnny, Jack, Ian, Evan, Hans, Jens, Jan, Jon, Johan, Johannes, Giovanni, Gianni, Juan, Ivan and Ġwanni might all be deemed to be ‘equivalent’ and acceptable for matching.
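One simple way to picture such ‘equivalent’ names is a lookup that maps every variant to a shared key. The sketch below is an assumption about how this could be done, not a description of any particular product; the key "john" and the function names are invented for the example.

```python
# Variants taken from the list above; the shared key "john" is arbitrary.
JOHN_VARIANTS = [
    "John", "Ewan", "Shawn", "Johnny", "Jack", "Ian", "Evan", "Hans", "Jens",
    "Jan", "Jon", "Johan", "Johannes", "Giovanni", "Gianni", "Juan", "Ivan",
    "Ġwanni",
]
EQUIVALENCE_KEY = {name.casefold(): "john" for name in JOHN_VARIANTS}


def canonical_given_name(name: str) -> str:
    """Map a given name to its equivalence key, or to itself if unknown."""
    key = name.strip().casefold()
    return EQUIVALENCE_KEY.get(key, key)


def given_names_equivalent(a: str, b: str) -> bool:
    """Two names are equivalent when they share the same key."""
    return canonical_given_name(a) == canonical_given_name(b)
```

With this table, "Giovanni" and "Jens" count as equivalent even though their edit distance is large, while "John" and "Peter" do not.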
Hi Wayne, thanks for commenting. Yes, in this solution we use traditional (and commercial) similarity algorithms along with a vocabulary. Some words in the vocabulary were put in beforehand and most others are learned along the way, so to speak.
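As a rough picture of a vocabulary that is partly put in beforehand and partly learned, consider the sketch below. The class, its method names and the rule of trusting a pair after two confirmations are all assumptions for illustration; the real solution may weigh evidence quite differently.

```python
from collections import Counter


class LearnedVocabulary:
    """Pairs of values treated as equivalent, either seeded or learned."""

    def __init__(self, seed_pairs=(), min_confirmations=2):
        self.min_confirmations = min_confirmations
        self.counts = Counter()  # how often each pair was accepted
        self.seeded = {self._key(a, b) for a, b in seed_pairs}

    @staticmethod
    def _key(a: str, b: str) -> frozenset:
        return frozenset({a.casefold(), b.casefold()})

    def record_accepted(self, a: str, b: str) -> None:
        """Call whenever a match between differing values is accepted."""
        if a.casefold() != b.casefold():
            self.counts[self._key(a, b)] += 1

    def is_equivalent(self, a: str, b: str) -> bool:
        if a.casefold() == b.casefold():
            return True
        key = self._key(a, b)
        return key in self.seeded or self.counts[key] >= self.min_confirmations
```

A pair such as “Brussel” and “Bruxelles” could be seeded, while “Kunstlaan” and “Avenue des Arts” would only become trusted after enough accepted matches have been recorded.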
For sure Bruxelles, Brussel, or any misspelt variation of it is an interesting challenge, just like any situation where the company name and address can be represented in multiple languages, legal vs tradestyle names, etc. Keeps us all occupied.
I was interested in the ‘learning’ element of your solution. Do you have any metrics regarding the quality/accuracy of the match, versus what a solution using global address reference data would deliver?
Hi Nigel. I don’t have any solid numbers, and as with most things in data quality I guess it is going to be difficult to settle on such metrics. If you base your learned vocabulary on the terms captured within a given organization, you need to process a lot of matches before it pays off. I do, though, believe we will see some open data in this field. Global address reference data differs greatly between countries in coverage, depth, timeliness, precision and accuracy.
It must be a good algorithm that catches that. Belgium is indeed quite complex because streets in Brussels officially have two names, one Flemish and one French, and both are typically included in global reference files. The problem lies in all the other cities and towns in either Flanders or Wallonia, which officially should have street names in one language only, but people use both. This typically is not reflected in the reference files, and that is where machine learning algorithms can be of huge help. Good luck with that!
Thanks a lot Kalina. Good information about the widespread use of double street names not included in official reference data.
So interesting for me! As I am working in Belgium, I know how difficult it is to manage both national languages (Dutch and French).
The solution we have in my company (a provider of data solutions related to BtB and BtC referentials) is to separate the treatment of the addresses (a multilingual street referential, updated daily) from the treatment of the name!
Furthermore, we use a historical dataset, which also enables us to match on an old name or even an old address for the company!
We can match as follows:
THE OLD FIRM – kunstlaan no 99 – Brussels
THE NEW ONE – rue de la loi no 21 – Bruxelles
=> assuming that the company has changed its name, has moved, and the input is written in another language
We speak rather of the “identification” of a company in our referential than a matching!
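A minimal sketch of this kind of identification against a referential with history could look as follows. The data layout, the `identify` function and the entity id "BE-0001" are hypothetical, invented only to illustrate the idea of matching an input against former names and former addresses in any known language.

```python
from dataclasses import dataclass, field


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so variants compare equal."""
    return " ".join(text.casefold().split())


@dataclass
class CompanyEntity:
    entity_id: str
    current_name: str
    # Current and former names, already normalised.
    names: set = field(default_factory=set)
    # Current and former (street, house number) pairs in every language.
    addresses: set = field(default_factory=set)


def identify(name: str, street: str, house_no: str, referential: list):
    """Return the first entity whose history holds both name and address."""
    wanted_name = normalise(name)
    wanted_address = (normalise(street), normalise(house_no))
    for entity in referential:
        if wanted_name in entity.names and wanted_address in entity.addresses:
            return entity
    return None


# Hypothetical referential entry for the example above.
referential = [
    CompanyEntity(
        entity_id="BE-0001",
        current_name="THE NEW ONE",
        names={"the new one", "the old firm"},
        addresses={("rue de la loi", "21"), ("wetstraat", "21"),
                   ("kunstlaan", "99"), ("avenue des arts", "99")},
    )
]
```

Here an input of THE OLD FIRM at Kunstlaan 99 resolves to the entity currently known as THE NEW ONE, because both the old name and the old address are kept in its history.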
Thanks for commenting Nicolas. I agree that local matching or identification most often takes it a bit further than generic worldwide approaches.
This is a very good example of using country specific vocabulary during matching. A multilingual country is always interesting.
Thanks for joining Tirthankar, indeed, multilingual issues are interesting aspects of the art of data matching.
Thanks for sharing your experience. We are facing the same problem, as we are working on customer data in Belgium. Keep going!