As Bill Shakespeare Wrote …

This post is a follow-up to the post Foreign Affairs and to the post Fuzzy Matching and Information Quality over at the Mastering Data Management blog.

The fuzzy post and its comments, including mine, circle around how the relation between “Bill” and “William” should be handled in data matching.

While “Bill” and “William” may be used interchangeably in modern Anglo-Saxon data, it may be a mistake in time (an anachronism) to use them interchangeably for the grand old playwright.

It may also be a mistake in place to use them interchangeably in other cultures.

For example, in my home country, Denmark, “Bill” and “William” are two different names. Globalization has been going on for a long time: far more people are baptized (or otherwise given the name) William than the original Danish form Wilhelm. Only 286 people have the name Wilhelm today, compared with 7,355 with the name William, including 800 new ones during the last year. And then there are 353 people with the name Bill.

But the same use of nicknames has not been localized here yet.

So with Danish data, matching “Bill Nielsen” with “William Nielsen” is almost certainly a false positive.
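To make the point concrete, here is a minimal sketch of a name comparison that only applies nickname equivalences known for the culture the data comes from. The table `NICKNAMES`, its entries, and the function names are my own illustrative assumptions, not taken from any matching product.

```python
# Hypothetical nickname tables keyed by country code.
# The entries are illustrative assumptions, not a reference list.
NICKNAMES = {
    "US": {"bill": "william", "bob": "robert"},
    "DK": {},  # assumption: Bill and William are distinct names in Danish data
}


def canonical_given_name(name: str, country: str) -> str:
    """Map a given name to its formal form, using only that country's table."""
    key = name.strip().lower()
    return NICKNAMES.get(country, {}).get(key, key)


def given_names_match(a: str, b: str, country: str) -> bool:
    """Compare two given names after country-specific nickname resolution."""
    return canonical_given_name(a, country) == canonical_given_name(b, country)
```

With these assumed tables, `given_names_match("Bill", "William", "US")` is true while the same call with `"DK"` is false, so the nickname rule stays where it belongs.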

It’s not a big problem; the risk of making the mistake is very low. The point is rather that the focus should be on other, more pressing issues: the specific challenges (and possibilities) related to data from each culture and country.

3 thoughts on “As Bill Shakespeare Wrote …”

  1. Lawrence Dubov 15th March 2011 / 19:56

    Even if the names are exactly the same, it doesn’t mean that it is a match. Most likely, a match on the name alone (even the full name) will be a false positive. This is why matching requires multiple attributes to be compared using probabilistic and fuzzy matching, e.g. full name including history, address including history, phone number including history, date of birth, and other attributes – as many as you can get.
    This is exactly what good matching algorithms do.

    • Henrik Liliendahl Sørensen 15th March 2011 / 20:57

      Larry, I agree. This is also why I say that there is a low risk for the mistake to occur. It is very unlikely that Bill and William in Denmark will have a similar address, phone or date of birth and therefore be in range of being considered as a match candidate.

    • Wayne Colless 15th March 2011 / 23:11

      Exactly, Lawrence. The use of additional attributes in the match key gives you the flexibility to be more ‘lenient’ with regard to the name and still maintain a high degree of certainty that you don’t have a false positive result.
