As Bill Shakespeare Wrote …

This post is a follow up on the post Foreign Affairs and the post Fuzzy Matching and Information Quality over at the Mastering Data Management blog.

The fuzzy matching post and its comments, including mine, circle around how the relation between “Bill” and “William” must be handled in data matching.

While “Bill” and “William” may be used interchangeably in modern Anglo-Saxon data, it may be a mistake in time (an anachronism) to use them interchangeably for the grand old playwright.

It may also be a mistake in place to use them interchangeably in other cultures.

For example, in my home country Denmark “Bill” and “William” are two different names. Globalization has been going on for a long time, as far more people are baptized (or otherwise given the name) William than the original Danish form Wilhelm. Today there are only 286 people with the name Wilhelm, as opposed to 7,355 with the name William, including 800 new ones during the last year. And then there are 353 different people with the name Bill.

But the same use of nicknames has not been localized here yet.

So with Danish data, matching “Bill Nielsen” and “William Nielsen” is almost certainly a false positive.
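A minimal Python sketch of what culture-dependent nickname handling could look like. The table `NICKNAMES`, the culture codes and the function names are assumptions made up for this illustration, not taken from any particular matching tool, and the entries are examples only.

```python
# Hypothetical nickname table keyed by culture code (illustrative entries only).
NICKNAMES = {
    "en": {"bill": "william", "bob": "robert"},
    "da": {},  # assumption: no Bill/William nickname convention in Danish data
}

def canonical_given_name(name: str, culture: str) -> str:
    """Map a given name to its formal form, but only within the given culture."""
    key = name.strip().lower()
    return NICKNAMES.get(culture, {}).get(key, key)

def same_given_name(a: str, b: str, culture: str) -> bool:
    """True when both names share a canonical form in that culture."""
    return canonical_given_name(a, culture) == canonical_given_name(b, culture)
```

With this table, `same_given_name("Bill", "William", "en")` is true, while the same call with `"da"` is false, which mirrors the Danish false positive described above.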

It’s not that it’s a big problem; the risk of making the mistake is very low. The problem is rather that the focus should be on different, more pressing issues with specific challenges (and possibilities) related to data from each culture and country.


Data Quality and Data Visualization

This is a self-centric blog post about data quality and data visualization.

The figure to the right is a statistic about who viewed my profile in a certain period on LinkedIn.

Looking at that makes me think about a couple of data quality and data visualization issues especially linked to visualization of data on a world map.

Hidden value

Fortunately there is both a map and some numbers below, because the map is too small to show from where I have the most views: My very small home country Denmark.

Misleading proportions

I have no views from the grey countries. So I should certainly concentrate on Greenland (the big grey land at the top of the map) to get more viewers, right?

Well, the Mercator projection makes areas close to the poles, like Greenland, look much bigger than they are in the real world. Greenland is a big island, but in fact it has less than 1/3 of the area of Australia (the almost as big light blue land in the down under right corner), and Greenland only has 1/400 of the population of Australia.
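The distortion can be quantified. On a spherical Mercator map the area at latitude φ is inflated by a factor of 1/cos²(φ) relative to the equator. The short Python sketch below applies that formula; the two representative latitudes (72°N for Greenland, 25°S for Australia) are my rough assumptions.

```python
import math

def mercator_area_inflation(latitude_deg: float) -> float:
    """Area exaggeration of a spherical Mercator map at a latitude, relative to the equator."""
    return 1.0 / math.cos(math.radians(latitude_deg)) ** 2

# Representative latitudes are rough assumptions, not exact centroids.
greenland = mercator_area_inflation(72.0)
australia = mercator_area_inflation(-25.0)

print(f"Greenland is drawn about {greenland:.1f} times too large")
print(f"Australia is drawn about {australia:.1f} times too large")
```

At these latitudes Greenland is inflated roughly ten times while Australia is inflated by only about a fifth, which is why the two look almost equally big on the map.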

Cultural dependency

My blogging and LinkedIn activities are in English due to the moderate population of Denmark. Therefore, and because LinkedIn’s reach is biased towards the English-speaking world, it’s no surprise that most viewers are from English-speaking countries.


Foreign Affairs

There is a famous poster called The New Yorker. This poster perfectly illustrates the centricity we often have about the town, region or country we live in.

The same phenomenon is often seen in data management.

I mentioned United States centricity as a minor criticism in my recent book review about the excellent book “Master Data Management and Data Governance”.  

An example from the book is this statement:

“It is important to differentiate between U.S. domestic addresses and international addresses. This distinction is important for U.S.-centric MDM solutions because U.S. domestic addresses are normally better defined and therefore can be processed in a more automatic fashion, while international addresses require more manual intervention.”

The same fact could be expressed by saying:

“It is important to differentiate between Danish domestic addresses and international addresses. This distinction is important for Danish-centric MDM solutions because Danish domestic addresses are normally better defined and therefore can be processed in a more automatic fashion, while international addresses require more manual intervention.”

Only, the better formatted address in the first case is the messy address in the last case, and the better formatted address in the last case is the messy address in the first case.

If your MDM scope is country-centric it is sensible to concentrate on automation related to that country.

If your MDM scope is international there are two options:

  • The easy way: The one size fits all option. This is a moderate investment, but also, it only yields moderate results in terms of automation and data quality.
  • The hard way: You have to implement specialized automation and investigate the best external reference data for each country. I made a Danish-centric post on that last year here.


Book Review: Berson and Dubov on MDM

A few days ago Julian Schwarzenbach over at the Data and Process Advantage Blog published a review of the book “Master Data Management and Data Governance” by Alex Berson and Larry Dubov. Link to Julian’s review here.

And hey, that’s the book I have been reading too during the last months. So why not write my own review too?

I agree very much with Julian’s positive review of the book. It is a very comprehensive book, and a thick and heavy one, as I have learned from bringing it with me on travel, which is where I usually read offline stuff. But master data management and related data governance is a big and heavy discipline with a lot of details that have to be dealt with.

I have probably annoyed fellow travellers in trains and airplanes while reading the book with exclamations such as: Yes, precisely, that’s what I have always said, good point and so on. That is because I agree very much with many of the issues described and the solutions discussed in the book.

For the mandatory bit of criticism that must be included in every book review I will bring on my pet bashing about United States and English language centricity. Well, it’s actually not that bad, as the book in many places does indicate that other angles and pains exist than those prominent in the United States and with the English language.

Oh, and I can bear with my surname being spelled “Sorensen” instead of “Sørensen” in the references, and with a related date being formatted like “11/22/2009”, which to me would be the 11th day in the 22nd month of the year 2009.


No Privacy Customer Onboarding

This post is a follow up on today’s #DataKnightsJam happening on twitter. Today’s subject was data quality and data privacy.

Diversity in data quality is a subject discussed a lot of times on this blog.

So I want to share a real-life example of a good upstream, “get it right the first time” data sharing approach that might compromise privacy thresholds in other places.

The image to the right is the data entry form from a Swedish webshop used for customer self-registration. The main flow is that:

  • You type your national ID (personnummer in Swedish)
  • You press the following button
  • The system fetches your name and address data from the public citizen hub
  • The webshop gets an accurate, complete single customer view  

The webshop www.jula.se sells tools for home improvement.


The Letter Ø

This blog is written in English. Therefore the letters used are normally restricted to A to Z.  

The English alphabet is one of many alphabets using Latin (or Roman) letters. Other alphabets, like the Russian, use Cyrillic letters. Then there are other script systems in the world which, besides alphabets, are abjads, abugidas, syllabic scripts and symbol scripts. Learn more about these in the post Script Systems.

My last surname is “Sørensen”. This word contains the lower case letter “ø” which is “Ø” in upper case. This letter is part of two alphabets: The Danish/Norwegian and the Faroese. Sometimes data has to be transformed into the English alphabet. Then the letter “ø” may be transformed to either “o” or “oe”. So my last surname will be either “Sorensen” or “Soerensen” in the English alphabet.
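As a small illustration of why this matters for matching, here is a Python sketch that generates both transliterations and checks a candidate against them. The function names are hypothetical, and only the letter “ø” is handled; a real solution would need a full, language-aware transliteration table.

```python
def ascii_variants(name: str) -> set:
    """Return the two common English-alphabet spellings of a name containing ø/Ø."""
    simple = name.replace("ø", "o").replace("Ø", "O")
    expanded = name.replace("ø", "oe").replace("Ø", "Oe")
    return {simple, expanded}

def matches_transliteration(native: str, candidate: str) -> bool:
    """True if candidate equals the native spelling or one of its variants, ignoring case."""
    accepted = {v.casefold() for v in ascii_variants(native)} | {native.casefold()}
    return candidate.casefold() in accepted
```

Here `ascii_variants("Sørensen")` yields “Sorensen” and “Soerensen”, so a record spelled either way can still be linked back to the native spelling.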

The town part of my address is “København”. The word “København” is what we call an endonym, which is the local word for a place or a person. The opposite of endonym is exonym. The English exonym for “København” is “Copenhagen”, which of course only has letters from the English alphabet. The Swedish exonym for “København” is “Köpenhamn”. Here we have a variant of “Ø” being “Ö”. The letter “Ö” exists in a lot of alphabets, such as Swedish, German, Hungarian and Turkish.

Usually “Ø” is transformed to “Ö” between Danish/Norwegian and these alphabets. The other way we usually accept the letter “Ö” in Danish/Norwegian master data.

These issues are of course a problem area in data quality, data matching and master data management. And given the complexity between alphabets using Latin characters alone, there is of course much more land to cover when including Cyrillic and Greek letters and then the other script systems with their hierarchical elements.


The Art in Data Matching

I’ve just investigated a suspicious customer data match:

A Company on Kunstlaan no 99 in Brussel

was matched with high confidence with:

The Company on Avenue des Arts no 99 in Bruxelles

At first glance it perhaps didn’t look like a confident match, but I guess the computer is right.

The diverse facts are:

  • Brussels is the Belgian capital
  • Belgium has two main languages: French and Flemish (a variant of Dutch), plus a small German-speaking community
  • Some parts of the country are French-speaking, some parts are Flemish-speaking and the capital is both
  • Brussels is Bruxelles in French and Brussel in Flemish
  • Kunst is Flemish meaning Art (as in Dutch, German and Scandinavian too)
  • Laan is Flemish meaning Avenue (same origin as Lane I guess)
  • Avenue des Arts is French meaning Avenue of Art (French is easy)

Technically the computer in this case did as follows:

  • Compared the names like “A Company” and “The Company” and found a close edit distance between the two names.
  • Remembered from some earlier occasions that “Kunstlaan” and “Avenue des Arts” were accepted as a match.
  • Remembered from numerous earlier occasions that “Brussel” (or “Brüssel”) and “Bruxelles” were accepted as a match.

It may also have been told beforehand that “Kunstlaan” and “Avenue des Arts” are two names of the same street in some Belgian address reference data, which I guess is a must when doing heavy data matching on the Belgian market.

In this case it was a global match environment not equipped with worldwide address reference data, so luckily the probabilistic learning element in the computer program saved the day.
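The steps above can be sketched in Python as follows. This is a simplified, deterministic stand-in for the probabilistic learning described, not the actual program: the `ACCEPTED_PAIRS` set plays the role of what the computer “remembered”, and the record layout, function names and distance threshold are assumptions for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

# Pairs accepted as matches on earlier occasions (hypothetical learned memory).
ACCEPTED_PAIRS = {
    frozenset({"kunstlaan", "avenue des arts"}),
    frozenset({"brussel", "bruxelles"}),
    frozenset({"brüssel", "bruxelles"}),
}

def equivalent(a: str, b: str) -> bool:
    """True if the values are equal or were previously accepted as a match."""
    a, b = a.lower(), b.lower()
    return a == b or frozenset({a, b}) in ACCEPTED_PAIRS

def is_match(rec1: dict, rec2: dict, max_name_distance: int = 3) -> bool:
    """Combine a fuzzy name comparison with remembered street and city equivalences."""
    names_close = edit_distance(rec1["name"].lower(), rec2["name"].lower()) <= max_name_distance
    return (names_close
            and equivalent(rec1["street"], rec2["street"])
            and rec1["house_no"] == rec2["house_no"]
            and equivalent(rec1["city"], rec2["city"]))

flemish = {"name": "A Company", "street": "Kunstlaan", "house_no": "99", "city": "Brussel"}
french = {"name": "The Company", "street": "Avenue des Arts", "house_no": "99", "city": "Bruxelles"}
print(is_match(flemish, french))
```

A real probabilistic engine would weigh each comparison and produce a confidence score instead of a yes/no answer; the sketch only shows how the three pieces of evidence combine.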


1/1/11

Date formats have always been a trouble maker.

1/1/11 is one format for expressing today’s date. 2011/01/01 is another one. 1st January 2011 is a third way. January 1, 2011 is a fourth way.

That is of course given that you use the Gregorian calendar and you don’t live far east of me, where it’s already a new day when I post this post.

1/1/11 is not one of those days where we have the usual confusion between the American way of expressing a date, using the sequence month/day/year, and the common, straightforward European sequence, day/month/year.

But in a few hours when it’s 2/1/11 in Europe and some hours later when it’s 1/2/11 in North America we are confused.

So, data quality folks, remember to put your dates in a unique format starting from tomorrow, the 2nd January 2011 or, if you like, January 2, 2011.
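To see the confusion in practice, here is a minimal Python sketch that reads the same string both ways using the standard `datetime` module. Writing the result in the ISO 8601 form (year-month-day) is my suggestion for an unambiguous format, not the only possible choice.

```python
from datetime import datetime

ambiguous = "2/1/11"

# The same text read with the European and the American convention.
european = datetime.strptime(ambiguous, "%d/%m/%y").date()
american = datetime.strptime(ambiguous, "%m/%d/%y").date()

print(european.isoformat())  # 2nd January 2011 in year-month-day form
print(american.isoformat())  # February 1, 2011 in year-month-day form
print(european == american)  # the two readings disagree
```

For “1/1/11” both readings give the same date, which is exactly why today is one of the few confusion-free days.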

Happy New Unique Year.


Diversity in Data Quality in 2010

Diversity in data quality is a favorite topic of mine and diversity has been my theme word in social media engagement this year.

Fortunately I’m not alone. Others have been writing about diversity in data quality in the past year. Here are some of the contributions I remember:

The Dutch data quality tool vendor Human Inference has a blog called Data Value Talk. Here several posts are about diversity in data quality including the post World Languages Day – Linguistic diversity rules in Switserland!

Another blog based in the Netherlands is from Graham Rhind. Graham (a Brit stranded in Amsterdam) is an expert in international issues with data quality and one of his blog posts this year is called Robert the Carrot.

The MDM Vendor IBM Initiate has a lively blog about Master Data Management and Data Quality. One of the posts this year was an introduction to a webinar. The post by Scott Schumacher (in which I’m proud to be mentioned) is called Join Us to Demystify Multi-Cultural Name Matching.

Rich Murnane posted a funny but educational video with Derek Sivers about Japanese addresses called What is the name of that block? (Again, thanks Rich for the mention).

In the eLearningCurve free webinar series there was a very educational session with Kathy Hunter called Overcoming the Challenges of Global Data.  There is also an interview with Kathy Hunter on the DataQualityPro site.

I also remember we debated the state of the art of data quality tools when it comes to international data in the post by Jim Harris called OOBE-DQ, Where Are You? As Jim mentions in his later post called Do you believe in Magic (Quadrants)?: “It must be noted that many vendors (including the “market leaders”) continue to struggle with their International OOBE-DQ”.

I guess that international capabilities in data quality tools and party master data management solutions will be on the agenda in 2011 as well.
