Data Diversity

As part of my work I deal with data from different countries. In the figure below I have put together some examples of different presentations of the same data from the countries I meet most often: Denmark (DK), Germany (DE), France (FR), the United States (US) and the United Kingdom (GB):

 

I have some more information on the issues regarding the different attributes:


New Eyes on Iceland

This eighth Data Quality World Tour blog post is about Iceland.

Patronymics

Rather than using family names, the Icelanders use patronymics. This means that the first Icelandic President Sveinn Björnsson must have been the son of Björn, and I guess current Prime Minister Jóhanna Sigurðardóttir is the daughter of Sigurður. This must create some havoc for well-proven algorithms for finding households. (Add to that that the Prime Minister is in a same-sex marriage.)
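To illustrate the household point, here is a minimal sketch in Python. The family, the address and both function names are made up for the example, and real household algorithms are far more elaborate. It only shows that a rule of "same last name at the same address" splits a family where every member carries a different patronymic.

```python
from collections import defaultdict

# Hypothetical family: the children are named after their father Jón,
# the parents after their own fathers.
PEOPLE = [
    {"first_name": "Jón", "last_name": "Einarsson", "address": "Example Street 1"},
    {"first_name": "Guðrún", "last_name": "Pálsdóttir", "address": "Example Street 1"},
    {"first_name": "Ólafur", "last_name": "Jónsson", "address": "Example Street 1"},
    {"first_name": "Anna", "last_name": "Jónsdóttir", "address": "Example Street 1"},
]


def households_by_surname(people):
    """Naive rule: same last name and same address form one household."""
    groups = defaultdict(list)
    for person in people:
        key = (person["last_name"], person["address"])
        groups[key].append(person["first_name"])
    return dict(groups)


def households_by_address(people):
    """Alternative rule: the address alone forms the household."""
    groups = defaultdict(list)
    for person in people:
        groups[person["address"]].append(person["first_name"])
    return dict(groups)


surname_groups = households_by_surname(PEOPLE)
address_groups = households_by_address(PEOPLE)
print(len(surname_groups), "households by surname rule")
print(len(address_groups), "household by address rule")
```

Grouping by address alone is of course too loose for apartment buildings, so this is a picture of the problem rather than a solution.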

Volcanoes

In the good old days air traffic wasn’t concerned with the recurring volcanic eruptions in Iceland. Today they seem to be a repeated cause of travel havoc. It is a bit like poor data quality, which wasn’t taken seriously in the good old days, while today dirty data creates havoc in business intelligence implementations.


How long is a Marathon?

Many large cities around the world have a yearly marathon event. Today it’s Copenhagen (and possibly other cities too).

The marathon distance today is 42,195 kilometers (if I use a comma as decimal point), which corresponds to 26 miles and 385 yards or 26.22 miles (if I use a dot as decimal point).
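As a small sketch of what the two notations mean for software, the Python below parses the same text with an explicitly stated decimal separator and converts miles and yards to kilometers. The function names are my own for this example. The only outside fact used is that the international yard is defined as exactly 0.9144 meters.

```python
from decimal import Decimal

YARDS_PER_MILE = 1760
METERS_PER_YARD = Decimal("0.9144")  # exact by international definition


def parse_decimal(text: str, decimal_sep: str) -> Decimal:
    """Parse a number whose decimal separator is stated explicitly."""
    if decimal_sep not in {",", "."}:
        raise ValueError("decimal_sep must be ',' or '.'")
    # The other character is treated as a thousands separator and dropped.
    thousands_sep = "." if decimal_sep == "," else ","
    cleaned = text.replace(thousands_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


def miles_yards_to_km(miles: int, yards: int) -> Decimal:
    """Convert a distance given in miles and yards to kilometers."""
    total_yards = miles * YARDS_PER_MILE + yards
    return total_yards * METERS_PER_YARD / 1000


# Same characters, two readings depending on the convention.
km_comma = parse_decimal("42,195", decimal_sep=",")  # comma is the decimal point
km_dot = parse_decimal("42,195", decimal_sep=".")    # comma is a thousands separator
marathon_km = miles_yards_to_km(26, 385)

print(km_comma, km_dot, round(marathon_km, 3))
```

Read with the wrong convention, the marathon becomes a thousand times longer, which is why the separator is passed in rather than guessed.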

So even if we today agree about the distance we might represent that distance in various ways. The distance has however varied during history as seen in the table with the length of the Olympic marathons.

What about real world alignment?

Well, if the Greek runner called Pheidippides (sometimes spelled Phidippides or Philippides) took the long but flat Southern route from Marathon to Athens it would have been around 42 kilometers. If he took the shorter but steeper Northern route it would only have been around 35 kilometers.

What about me? Oh, I’ll go for 42,195 kilometers – on the bike.   


No NOT NULL

A basic way of ensuring data quality in a database is to define that a certain attribute must be filled. This is done by specifying that the value “null” isn’t allowed or, as said in SQL-ish, by setting the NOT NULL constraint.

A common data quality issue is that such constraints are almost always too rigid.

In my last post, called Notes about the North Pole, it was discussed that every place on earth has a latitude and a longitude, except that the North Pole – and the South Pole – have no longitude. So if you have a table with geocodes, you can’t set NOT NULL for the longitude if you (though it is very unlikely) should store the coordinates for the poles. Alternatively you could store 0 for longitude to make it complete – but then it would be very inaccurate. 360 degrees inaccurate, so to speak.
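Here is a minimal sketch of the geocode table, using Python's built-in sqlite3 module. The table and column names are assumptions made for the example, and the coordinates are rounded. The longitude is left nullable, and a table-level CHECK only accepts a missing longitude when the latitude is exactly plus or minus 90.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE geocode (
        place     TEXT NOT NULL,
        latitude  REAL NOT NULL CHECK (latitude BETWEEN -90 AND 90),
        longitude REAL CHECK (longitude BETWEEN -180 AND 180),  -- nullable on purpose
        CHECK (longitude IS NOT NULL OR abs(latitude) = 90)     -- NULL only at the poles
    )
    """
)


def try_insert(row) -> bool:
    """Insert one row; return False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO geocode VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


copenhagen_ok = try_insert(("Copenhagen", 55.68, 12.57))
north_pole_ok = try_insert(("North Pole", 90.0, None))   # no longitude, accepted
reykjavik_ok = try_insert(("Reykjavik", 64.15, None))    # no longitude, rejected

print(copenhagen_ok, north_pole_ok, reykjavik_ok)
```

The point of the design is that the rule moves from "always filled" to "filled unless the real world says otherwise", which a plain NOT NULL can't express.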

Another infrequent example from this blog is that every person in my country has a given (first) name and a family (last) name. But there are a few Royal Exceptions. So, no NOT NULL for the family name.

Related to people and places there are plenty of more frequent examples. If you only expect addresses from the United States, Australia or India, setting NOT NULL for the state attribute seems wise. But expect foolish values in here when you get addresses from most other parts of the world. So, no NOT NULL for the state.
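One way to keep the state optional in the database but still mandatory where it makes sense is to validate it per country. The sketch below is only an illustration: the set of countries is the three named above and not a complete list, and the class, field names and sample addresses are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: just the three countries mentioned in the text, not a full list.
COUNTRIES_REQUIRING_STATE = {"US", "AU", "IN"}


@dataclass
class Address:
    street: str
    city: str
    country_code: str              # two-letter country code, e.g. "DK"
    state: Optional[str] = None    # nullable by design


def validate(address: Address) -> List[str]:
    """Return a list of problems; an empty list means the address passed."""
    problems = []
    if address.country_code in COUNTRIES_REQUIRING_STATE and not address.state:
        problems.append("state is required for " + address.country_code)
    return problems


# Made-up sample addresses.
danish = Address("Example Street 1", "Copenhagen", "DK")
american = Address("1 Example Avenue", "Anchorage", "US")

print(validate(danish))    # no state needed for DK
print(validate(american))  # state missing for US
```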

A common variant of the mandatory state value is when you register for data quality webinars, white papers and so on. Most often you must select from a value list containing the United States of America – in some cases also mixed in with Canadian Provinces. The NULL option to be used by strangers may hide as “Not Applicable” way down the list among states beginning with N.

I usually select Alaska which is among the first states in the alphabetical order – which also brings me back close to the North Pole making my data close to 360 degree inaccuracy.     


Compound Words

When working with data quality, and not least data matching, an ever recurring issue is compound words. We even have the issue when talking about terms related to data quality: is it called “meta data” or “metadata”, and is it called “multi-domain MDM” or “multidomain MDM”? With MDM my spell checker likes the first option, but Gartner (the analyst firm) likes the last option.

In an international context the issue with compound words becomes much more frequent. In the Germanic languages other than English, compound words are used much more. For example, a street name such as “Main Street” will be “Hauptstrasse” in German and “Hovedgade” in Danish.
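For data matching, a common first step is to compare on a key that ignores how the compound is written. The sketch below is one simple way to do that, not a recommendation of a specific product's method, and the function name is my own. It folds case with Python's str.casefold (which also turns the German ß into ss) and removes spaces and hyphens.

```python
import re
import unicodedata


def match_key(text: str) -> str:
    """Build a comparison key that ignores case, spacing and hyphenation."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"[\s\-]+", "", folded)


pairs = [
    ("meta data", "metadata"),
    ("multi-domain MDM", "multidomain MDM"),
    ("Haupt Strasse", "Hauptstraße"),
    ("hair extensions", "hairextensions"),
]
for left, right in pairs:
    print(left, "|", right, "|", match_key(left) == match_key(right))
```

The price is that genuinely different phrases can collapse into the same key, so a key like this is better used for finding candidates than for declaring a match.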

If your first language has many compound words (like mine) you tend to use (and overuse) compound words even in English. I stumbled upon that when I was helping a family member look at search trends for “hair extensions”.

If you look at the regional interest in Google Insights, the interest in “hair extensions” (figure 1) is big mostly in countries with English as first language, while the interest in “hairextensions” (figure 2) is big mostly in countries having English as second or third language.


Japanese Jargon

This is the fifth post in a series of short blog posts focusing on data quality related to different countries around the world. I am not aiming at presenting a single version of the full truth but rather presenting a few random observations that I hope someone living in or with knowledge about the country is able to clarify in a comment.

Home of quality philosophy

Japan is the home and inspiration of quality thinking. Therefore we also have some Japanese words used when talking quality. For example kaizen is used for continuous quality improvement, muda is the waste we should avoid and gemba is the real place where things happen and things could be changed.

Streets with no names

When sending letters to Japan the way of addressing is different from how it is done in most other parts of the world. Street names are seldom used in Japanese postal addresses, but the numbers/names of the blocks between the streets are used.

Would you like Kanji, Hiragana, Katakana or Romaji?

No, this is not a selection from the à la carte menu at a Japanese restaurant but the different writing systems to choose from in Japan, covering three different kinds of script. Kanji is the old symbolic writing system similar to Chinese writing. Hiragana and Katakana are syllabic writing systems, while Romaji is the transcription of Japanese into Roman alphabetic letters.
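When profiling Japanese data it can help to know which of these writing systems a value is in. A rough sketch, assuming that the official Unicode character names are a good enough signal: the function names are my own, and half-width katakana and full-width Latin letters are deliberately left as "other" to keep it short.

```python
import unicodedata


def script_of(char: str) -> str:
    """Classify one character by the start of its Unicode character name."""
    name = unicodedata.name(char, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    if name.startswith("LATIN"):
        return "romaji"
    return "other"


def scripts_in(text: str) -> set:
    """Return the set of writing systems found in a text value."""
    found = {script_of(c) for c in text if not c.isspace()}
    return found - {"other"}


# The same city name written four ways.
for value in ["東京", "とうきょう", "トウキョウ", "Tokyo"]:
    print(value, scripts_in(value))
```

A value that comes back with more than one script is not necessarily wrong, since ordinary Japanese text mixes kanji and kana all the time.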



Finding Finland

This is the fourth post in a series of short blog posts focusing on data quality related to different countries around the world. I am not aiming at presenting a single version of the full truth but rather presenting a few random observations that I hope someone living in or with knowledge about the country is able to clarify in a comment.

Let’s start with Finnish

Finland is situated in the north-eastern corner of Europe. The Finnish language, together with Estonian and with Hungarian much further south in Europe, is totally different from the languages of the neighboring countries, which are Germanic or Slavic. Swedish is also an official language in Finland, and in some parts of Finland cities and streets have both (usually totally different) Finnish and Swedish names.

Galoshes

By far the largest company in Finland is the cell phone maker Nokia. Before the cell phone was invented Nokia made paper and galoshes – the old way of connecting people. From 2006 to 2008 Nokia also owned the data quality firm Identity Systems, which was then sold to Informatica. I guess Identity Systems together with the Gaelic Tiger firm Similarity Systems makes up the data matching capabilities at Informatica.

Syslore

One of the remaining (relatively) larger independent data matching firms in the world is Syslore. Syslore is hiding in Finland.



Questions about Quebec

This is the third post in a series of short blog posts focusing on data quality related to different countries around the world. I am not aiming at presenting a single version of the full truth but rather presenting a few random observations that I hope someone living in or with knowledge about the country is able to clarify in a comment.

Is Quebec a country?

No. Quebec is a province in Canada. But it was close on 30/10/1995, when a referendum on sovereignty ended with only a very slim majority against sovereignty for the only province in Canada where French is the only official language.

What’s that date: 30/10/1995?

Besides having a different language, Quebec also uses a different date format than elsewhere in North America. Where North Americans write month-day-year (like 10/30/1995), Quebeckers write day-month-year like in most other parts of the world. I learned that from this blog post comment here.
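A date like 30/10/1995 happens to be safe, because there is no month 30. Many other dates are valid in both conventions, and then only knowledge about the source can tell which reading is right. The sketch below (function and dictionary names are my own) tries both formats with Python's datetime.strptime and returns every reading that is a real date.

```python
from datetime import date, datetime

FORMATS = {
    "day-month-year": "%d/%m/%Y",
    "month-day-year": "%m/%d/%Y",
}


def possible_dates(text: str) -> dict:
    """Return every convention under which `text` is a valid date."""
    readings = {}
    for label, fmt in FORMATS.items():
        try:
            readings[label] = datetime.strptime(text, fmt).date()
        except ValueError:
            pass  # not a valid date under this convention
    return readings


for value in ["30/10/1995", "10/30/1995", "05/06/1995"]:
    print(value, possible_dates(value))
```

The third value is the troublesome kind: it is either 5 June or 6 May, and nothing in the value itself says which.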

The North American multi-cultural sandbox

A lot of software, including tools for data quality and master data management, comes from North America. When the international (and non-English) capabilities of the software and related stuff are questioned, a good answer is always: Well, we did something in Quebec. Like here.



Inside India

This is the second post in a series of short blog posts focusing on data quality related to different countries around the world. I am not aiming at presenting a single version of the full truth but rather presenting a few random observations that I hope someone living in or with knowledge about the country is able to clarify in a comment.

Cultural Diversity

India’s culture is marked by a high degree of syncretism and cultural pluralism. Every state and union territory has its own official languages, and the constitution also recognizes 21 languages.

National Identification Number for 1.2 Billion People

The government of India has initiated a program for assigning a unique citizen ID to the over 1.2 billion people living in India. The program, called Aadhaar, is the largest of its kind in the world.

A System Integration Superpower

Tata, Satyam, Infosys and Wipro are just some of the many mega system integrators within master data management and data quality with headquarters in India. Add to that that companies like Cognizant and many others have most of their professionals based in India.


Check out the Czech Republic

This is the first post in a planned series of short blog posts focusing on data quality related to different countries around the world. I am not aiming at presenting a single version of the full truth but rather presenting a few random observations that I hope someone living in or with knowledge about the country is able to clarify in a comment.

Companies all over

Last time I checked, the Czech Republic had the highest number of Duns Numbers (unique company IDs in the Dun & Bradstreet WorldBase) per capita in the world. I wonder if this is because of very effective public sector registration, some special rules for incorporation, or duplicates?

Exonyms, endonyms and beers

Many Czech cities are known by their English exonyms (the name in English) but of course have a local endonym (the name in Czech). The capital Prague is Praha in Czech. The town Pilsen is called Plzeň in Czech, but there are several towns around the world called Pilsen – and then of course there is a sort of beer called pilsener. (České) Budějovice is Czech for Budweis in German and English. We are certainly talking beer here also.

Ataccama

The data quality and master data management firm Ataccama was founded in the Czech Republic.
