The Tower of Babel

Several old tales, including those in Genesis and the Qur’an, tell of a great tower built by mankind at a time when all people shared a single language. Since then mankind has been confused by having multiple languages. And indeed we still are.

Multi-cultural issues are one of the really big challenges in data quality improvement. They include not only language variations but also different character sets reflecting different alphabets and script systems, naming traditions, address formats, units of measure, privacy norms and government registration practices, to name the ones I have experienced.

As globalization moves forward these challenges become more and more important. Enterprises tend to standardize worldwide on tools and services, shared service centres take care of data covering many countries, and so on. When an employee works with data from another country, he often wrongly applies his local standards to these data and thereby challenges the data quality more than seen before.

Recently I updated this site with pages around “The art of Matching”. One topic is “Match Techniques”, and the comments posted here were very much about the need for methods that solve the problems arising from multi-cultural data. Have a look.

International and multi-cultural aspects of data quality improvement have been a favourite topic of mine for a long time.

Whether and when an organisation has to deal with international issues of course depends on whether and to what degree that organisation is domestic or active internationally. However, in countries with several official languages, like Switzerland and Belgium, the multi-cultural topic is mandatory even for domestic organisations. Typically, in large countries companies grow big before looking abroad, while in smaller countries, like my home country Denmark, even many fairly small companies must address international issues with data quality.

Some of the many observations I have made include the following:

  • Nicknames are a top issue in name matching in some cultures, but not of much importance in other cultures
  • Family names are a key element in identifying households in some cultures, but not very useful in other cultures
  • Address verification and correction is very useful in some countries but close to impossible in other countries
  • Business directories are complete, consistent and available in some countries, but not that good in other countries
  • Citizen information is available for private entities in some countries, but is a no go in other countries
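To make the nickname observation concrete, here is a minimal sketch of given-name matching that applies a nickname table only where one exists for the record's country. Everything in it is an assumption made for illustration: the `NICKNAMES` table, its entries, the country codes and the function names are hypothetical and not taken from any particular tool.

```python
# Hypothetical nickname tables keyed by country code.
# The entries are illustrative assumptions, not reference data.
NICKNAMES = {
    "US": {"bob": "robert", "bill": "william", "liz": "elizabeth"},
    "DK": {},  # assumption for the example: no nickname rules configured
}


def canonical_given_name(name: str, country: str) -> str:
    """Map a given name to its canonical form using the country's table."""
    key = name.strip().casefold()
    # Unknown countries and unknown names fall back to the cleaned name itself.
    return NICKNAMES.get(country, {}).get(key, key)


def given_names_match(a: str, b: str, country: str) -> bool:
    """Compare two given names after country-specific nickname resolution."""
    return canonical_given_name(a, country) == canonical_given_name(b, country)
```

With these assumed tables, `given_names_match("Bob", "Robert", "US")` is true while the same pair under `"DK"` is not, which is the point: the same rule is valuable for one data set and irrelevant for another, so it is better kept as per-country configuration than hard-coded into the matching logic.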

While working with data quality tools and services for many years I have found that many of them are very national. So you might discover that a tool or service will work wonders with data from one country, but be quite ordinary or in fact useless with data from another country.


The GlobalMatchBox

10 years ago I spent most of the summer delivering my first large project after becoming a sole proprietor. The client, or actually rather the partner, was Dun & Bradstreet’s Nordic operation, which needed an agile solution for matching customer files with their Nordic business reference data sets. The application was named MatchBox.

This solution has grown over the years, while D&B’s operation in the Nordics and other parts of Europe is now run by Bisnode.

Today matching is done with the entire WorldBase holding close to 150 million business entities from all over the world, with all the diversity you can imagine. On the technology side the application has been bundled with the indexing capabilities of www.softbool.com and the similarity cleverness of www.omikron.net (disclosure: today I work for Omikron), all built with the RAD tool www.magicsoftware.com. The application is now called GlobalMatchBox.

It has been a great but fearful pleasure for me to work on setting up and tuning such a data matching engine and environment. Everybody who has worked with data matching knows about the scars you get when avoiding false positives and false negatives. You know that it is just not good enough to say that you are only able to automatically match 40% of the records when it is supposed to be 100%.
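The trade-off between false positives and false negatives can be put into numbers with the usual precision and recall measures. The sketch below is only an illustration: the function name and the counts are hypothetical and do not come from the GlobalMatchBox project.

```python
from typing import Tuple


def match_quality(true_matches: int, false_positives: int,
                  false_negatives: int) -> Tuple[float, float]:
    """Return (precision, recall) for an automatic matching run."""
    found = true_matches + false_positives      # everything the engine matched
    expected = true_matches + false_negatives   # everything it should have matched
    precision = true_matches / found if found else 0.0
    recall = true_matches / expected if expected else 0.0
    return precision, recall


# Hypothetical run: 100 records that should all match, 40 matched
# automatically, and none of those matches wrong.
precision, recall = match_quality(true_matches=40, false_positives=0,
                                  false_negatives=60)
print(f"precision={precision:.0%} recall={recall:.0%}")
```

In this made-up run every automatic match is correct, yet 60 of the 100 expected matches are missed. Tuning is the work of raising recall without letting false positives creep in.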

So this project has been a very different experience compared to the occasional SMB (Small and Medium-sized Business) hit-and-run data quality improvement projects I also do, as described in my previous post. With D&B we are not talking about months but years of tuning, and I have been guilty of practicing excessive consultancy.
