Georgian Geography and History

This is the sixth post in a series of short blog posts focusing on data quality related to different countries around the world. I am not aiming to present a single version of the full truth but rather a few random observations that I hope someone living in or with knowledge about the country is able to clarify in a comment.

Georgia

Georgia is the English name for a sovereign state in the South Caucasus where Europe meets Asia. Georgia was a part of the Soviet Union under the English name Georgian SSR from 1922 to 1991. Back in the 4th century BC a unified kingdom of Georgia was established as an early example of an advanced state organization under one king and an aristocratic hierarchy.

Georgia

Georgia is a state located in the southeastern United States. Back in the 18th century the area was known as the Province of Georgia within the British colonies. Before the arrival of the Europeans some of current Georgia was part of the Cofitachequi paramount chiefdom.

Ambiguous place names and slowly changing dimensions

As with Georgia, there are lots of examples of place names belonging to more than one place on Earth. Besides that, location reference data like the two Georgias has slowly changing dimensions: what area is covered, where in a hierarchy the place belongs and what it is called at a certain time.
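The two issues above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference data model: the `PlaceVersion` structure, the `place_id` keys and the `lookup` helper are hypothetical, and the start year for the US state is an assumption made for the example. Each row is one version of a place, valid for a range of years.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PlaceVersion:
    place_id: str             # hypothetical key: one per real-world place
    name: str                 # what the place was called in this period
    parent: str               # where it sat in the hierarchy in this period
    valid_from: int           # first year this version applies
    valid_to: Optional[int]   # last year it applies; None means still current


# Illustrative rows only; year boundaries are simplified assumptions.
PLACES = [
    PlaceVersion("GE-CAUCASUS", "Georgian SSR", "Soviet Union", 1922, 1991),
    PlaceVersion("GE-CAUCASUS", "Georgia", "(sovereign state)", 1991, None),
    PlaceVersion("GE-USA", "Georgia", "United States", 1788, None),
]


def lookup(name: str, year: int) -> List[PlaceVersion]:
    """Return every place version carrying `name` in the given year."""
    return [
        p for p in PLACES
        if p.name == name
        and p.valid_from <= year
        and (p.valid_to is None or year <= p.valid_to)
    ]
```

Asking for "Georgia" in 2011 returns two candidates, which is the ambiguity; asking for a year also matters, because the same place answers to a different name and parent at different times.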

Finding Finland

This is the fourth post in a series of short blog posts focusing on data quality related to different countries around the world. I am not aiming to present a single version of the full truth but rather a few random observations that I hope someone living in or with knowledge about the country is able to clarify in a comment.

Let’s start with Finnish

Finland is situated in the north-eastern corner of Europe. The Finnish language, together with Estonian and with Hungarian much further south in Europe, is totally different from the languages of the neighboring countries, which are Germanic or Slavic. Swedish is also an official language in Finland, and in some parts of Finland cities and streets have both Finnish and Swedish names, which are usually totally different.

Galoshes

By far the largest company in Finland is the cell phone maker Nokia. Before the cell phone was invented Nokia made paper and galoshes – the old way of connecting people. From 2006 to 2008 Nokia also owned the data quality firm Identity Systems, which was then sold to Informatica. I guess Identity Systems, together with the Celtic Tiger firm Similarity Systems, makes up the data matching capabilities at Informatica.

Syslore

One of the remaining (relatively) larger independent data matching firms in the world is Syslore. Syslore is hiding in Finland.

Questions about Quebec

This is the third post in a series of short blog posts focusing on data quality related to different countries around the world. I am not aiming to present a single version of the full truth but rather a few random observations that I hope someone living in or with knowledge about the country is able to clarify in a comment.

Is Quebec a country?

No. Quebec is a province in Canada. But it was close on 30/10/1995, when a referendum on sovereignty produced only a very slim majority against sovereignty for the only province in Canada where French is the only official language.

What’s that date: 30/10/1995?

Besides having a different language, Quebec also uses a different date format than elsewhere in North America. Where other North Americans write month-day-year (like 10/30/1995), Quebeckers write day-month-year, as in most other parts of the world. I learned that from this blog post comment here.

The North American multi-cultural sandbox

A lot of software, including tools for data quality and master data management, comes from North America. When the international (and non-English) capabilities of the software and related stuff are questioned, a good answer is always: Well, we did something in Quebec. Like here.

Inside India

This is the second post in a series of short blog posts focusing on data quality related to different countries around the world. I am not aiming to present a single version of the full truth but rather a few random observations that I hope someone living in or with knowledge about the country is able to clarify in a comment.

Cultural Diversity

India’s culture is marked by a high degree of syncretism and cultural pluralism. Every state and union territory has its own official languages, and the constitution also recognizes 21 languages.

National Identification Number for 1.2 Billion People

The government of India has initiated a program for assigning a unique citizen ID to the more than 1.2 billion people living in India. The program, called Aadhaar, is the largest of its kind in the world.

A System Integration Superpower

Tata, Satyam, Infosys and Wipro are just some of the many mega system integrators within master data management and data quality with headquarters in India. Add to that the fact that companies like Cognizant and many others have most of their professionals based in India.

Check out the Czech Republic

This is the first post in a planned series of short blog posts focusing on data quality related to different countries around the world. I am not aiming to present a single version of the full truth but rather a few random observations that I hope someone living in or with knowledge about the country is able to clarify in a comment.

Companies all over

Last time I checked, the Czech Republic had the highest number of Duns Numbers (unique company IDs in the Dun & Bradstreet WorldBase) per capita in the world. I wonder if this is because of a very effective public sector registration, some special rules for incorporation, or duplicates?

Exonyms, endonyms and beers

Many Czech cities are known by their English exonyms (the name in English) but of course have a local endonym (the name in Czech). The capital Prague is Praha in Czech. The town Pilsen is called Plzeň in Czech, but there are several towns around the world called Pilsen – and then of course there is a sort of beer called pilsener. (České) Budějovice is the Czech name for the town called Budweis in German and English. We are certainly talking beer here also.

Ataccama

The data quality and master data management firm Ataccama was founded in the Czech Republic.

Data Quality and Data Visualization

This is a self-centric blog post about data quality and data visualization.

The figure to the right is a statistic about who viewed my profile in a certain period on LinkedIn.

Looking at that makes me think about a couple of data quality and data visualization issues especially linked to visualization of data on a world map.

Hidden value

Fortunately there is both a map and some numbers below, because the map is too small to show from where I have the most views: My very small home country Denmark.

Misleading proportions

I have no views from the grey countries. So I should certainly concentrate on Greenland (the big grey land at the top of the map) to get more viewers, right?

Well, the Mercator projection makes areas close to the poles, like Greenland, look much bigger than they are in the real world. Greenland is a big island, but in fact it is less than 1/3 the size of Australia (the almost as big light blue land in the down under right corner) – and Greenland only has 1/400 of the population of Australia.
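The size of the distortion can be estimated with a little arithmetic. On a Mercator map the linear scale grows with 1/cos(latitude), so areas are inflated by 1/cos²(latitude) relative to the equator. The sketch below applies that formula; the function name and the single representative latitudes chosen for Greenland (about 72° north) and Australia (about 25° south) are my own assumptions for illustration.

```python
import math


def mercator_area_inflation(latitude_deg: float) -> float:
    """Factor by which a Mercator map inflates areas at this latitude,
    relative to areas at the equator: 1 / cos^2(latitude)."""
    return 1.0 / math.cos(math.radians(latitude_deg)) ** 2


# Assumed representative latitudes, not exact centroids.
GREENLAND_LAT = 72.0
AUSTRALIA_LAT = -25.0

greenland_factor = mercator_area_inflation(GREENLAND_LAT)
australia_factor = mercator_area_inflation(AUSTRALIA_LAT)

# How much more Greenland is blown up than Australia on the same map.
relative_exaggeration = greenland_factor / australia_factor
```

With these assumed latitudes Greenland is drawn roughly ten times too large while Australia is barely inflated, which is why the two look almost equal on the map.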

Cultural dependency

My blogging and LinkedIn activities are in English due to the moderate population of Denmark. Therefore, and because the spread of LinkedIn is biased toward the English-speaking world, it’s no surprise that most viewers are from English-speaking countries.

The Letter Ø

This blog is written in English. Therefore the letters used are normally restricted to A to Z.  

The English alphabet is one of many alphabets using Latin (or Roman) letters. Other alphabets, like the Russian, use Cyrillic letters. Then there are other script systems in the world which, besides alphabets, are abjads, abugidas, syllabic scripts and symbol scripts. Learn more about these in the post Script Systems.

My surname is “Sørensen”. This word contains the lower case letter “ø” which is “Ø” in upper case. This letter is part of two alphabets: the Danish/Norwegian and the Faroese. Sometimes data has to be transformed into the English alphabet. Then the letter “ø” may be transformed to either “o” or “oe”. So my surname will be either “Sorensen” or “Soerensen” in the English alphabet.

The town part of my address is “København”. The word “København” is what we call an endonym, which is the local name for a place or a person. The opposite of endonym is exonym. The English exonym for “København” is “Copenhagen”, which of course only has letters from the English alphabet. The Swedish exonym for “København” is “Köpenhamn”. Here we have a variant of “Ø” being “Ö”. The letter “Ö” exists in a lot of alphabets, such as Swedish, German, Hungarian and Turkish.

Usually “Ø” is transformed to “Ö” between Danish/Norwegian and these alphabets. The other way we usually accept the letter “Ö” in Danish/Norwegian master data.
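The transformations described above can be sketched as a small comparison routine. This is only an illustration under my own assumptions: the function names are hypothetical, only the letters “ø” and “ö” are handled, and the Unicode normalization step assumes the input may arrive with a decomposed “ö”.

```python
import unicodedata
from typing import Set


def fold_o_variants(text: str) -> str:
    """Treat the Swedish/German/Hungarian/Turkish ö as the Danish/Norwegian ø."""
    composed = unicodedata.normalize("NFC", text)  # o + combining dots -> ö
    return composed.replace("ö", "ø").replace("Ö", "Ø")


def english_spellings(text: str) -> Set[str]:
    """Both common English-alphabet spellings of a name containing ø."""
    folded = fold_o_variants(text)
    simple = folded.replace("ø", "o").replace("Ø", "O")
    digraph = folded.replace("ø", "oe").replace("Ø", "Oe")
    return {simple.casefold(), digraph.casefold()}


def may_be_same_name(a: str, b: str) -> bool:
    """True when the two names share at least one English-alphabet spelling."""
    return bool(english_spellings(a) & english_spellings(b))
```

So “Sørensen” is accepted as a candidate match for “Sorensen”, “Soerensen” and “Sörensen”. Note the limitation: “Sorensen” and “Soerensen” compared directly with each other do not match here, because neither contains the original letter any more.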

These issues are of course a problem area in data quality, data matching and master data management. And given the complexity just between alphabets using Latin characters, there is of course much more land to cover when including Cyrillic and Greek letters and then the other script systems with their hierarchical elements.

1/1/11

Date formats have always been a trouble maker.

1/1/11 is one format for expressing today’s date. 2011/01/01 is another one. 1st January 2011 is a third way. January 1, 2011 is a fourth way.

That is of course given you use the Gregorian calendar and you don’t live far east of me, where it’s already a new day when I post this post.

1/1/11 is not one of those days where we have the usual confusion between the American way of expressing a date using the sequence month/day/year and the common, straightforward European sequence of day/month/year.

But in a few hours when it’s 2/1/11 in Europe and some hours later when it’s 1/2/11 in North America we are confused.

So, data quality folks, remember to put your dates in a unique format starting from tomorrow, the 2nd January 2011 or, if you like, January 2, 2011.
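The confusion can be made visible in code. The sketch below, with a helper name of my own choosing, tries both the day/month/year and the month/day/year reading of a short date and returns every calendar date the text could mean. It assumes two-digit years and the slash separator used in this post, and it relies on Python mapping the two-digit year 11 to 2011. Writing the result in the ISO 8601 form (year-month-day) is one way of getting a unique format.

```python
from datetime import date, datetime
from typing import Set


def possible_dates(text: str) -> Set[date]:
    """Every calendar date a d/m/yy or m/d/yy string could stand for."""
    candidates: Set[date] = set()
    for fmt in ("%d/%m/%y", "%m/%d/%y"):  # European reading, then American
        try:
            candidates.add(datetime.strptime(text, fmt).date())
        except ValueError:
            # This reading is impossible, e.g. month 30.
            pass
    return candidates


def is_ambiguous(text: str) -> bool:
    """True when the two readings give different dates."""
    return len(possible_dates(text)) > 1


def iso_forms(text: str) -> list:
    """The unambiguous ISO 8601 spellings of every possible reading."""
    return sorted(d.isoformat() for d in possible_dates(text))
```

Here 1/1/11 has one reading only, while 2/1/11 comes back with two: 2011-01-02 and 2011-02-01.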

Happy New Unique Year.

Matching Down Under

As a data matching geek I always love reading about how others have made the great but fearful journey into the data matching world.

This week Wayne Colless of the Australian Attorney-General’s Department kindly made a document about data matching public on the DataQualityPro site. The full title is “Improving the Integrity of Identity Data – Data Matching Better Practice Guidelines, 2009”. Link here.

As Wayne explains in a discussion in the LinkedIn Data Matching group: Australia has no national unique identifier for individuals (such as the US SSN or the number recorded on national ID cards used in many other countries) that can be used, so the matching has to involve only non-unique values such as name, address and dates of birth.

The document gives very thorough step-by-step guidance on matching individuals’ names, addresses and birthdays. As the document says, you may either build all the logic yourself or buy commercial software that does the same. Either way, you have to understand what the software does in order to tune the processes and set the thresholds that are meaningful to you.

As Australia is a nation mainly born through immigration, the challenges of adapting the ruling Anglo-Saxon naming conventions to the reality of name formats coming from all over the world are very apparent. I like that the diversity issues are given good thought in the document.
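To show what tuning thresholds means in practice, here is a toy scoring sketch. It is not the method from the guidelines and not how any particular product works: the weights, the two thresholds, the `Person` structure and the use of Python's generic `difflib.SequenceMatcher` for string similarity are all assumptions picked for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Person:
    name: str
    address: str
    date_of_birth: str  # kept as text, e.g. "1970-01-01"


# Assumed weights and thresholds; real ones must be tuned on your own data.
WEIGHTS = {"name": 0.5, "address": 0.3, "date_of_birth": 0.2}
MATCH_THRESHOLD = 0.85
REVIEW_THRESHOLD = 0.70


def similarity(a: str, b: str) -> float:
    """Generic string similarity between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.strip().casefold(), b.strip().casefold()).ratio()


def match_score(a: Person, b: Person) -> float:
    """Weighted sum of the per-field similarities."""
    return (
        WEIGHTS["name"] * similarity(a.name, b.name)
        + WEIGHTS["address"] * similarity(a.address, b.address)
        + WEIGHTS["date_of_birth"] * similarity(a.date_of_birth, b.date_of_birth)
    )


def classify(a: Person, b: Person) -> str:
    """Sort a pair into match, manual review, or no match."""
    score = match_score(a, b)
    if score >= MATCH_THRESHOLD:
        return "match"
    if score >= REVIEW_THRESHOLD:
        return "review"
    return "no match"
```

Moving either threshold changes how many pairs end up as automatic matches versus manual review, which is exactly the kind of decision you cannot leave to the software alone.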

I also like that the document addresses a subject not mentioned as often as it should be, namely the challenges with embracing historical values in settling a match as seen in this figure taken from the document:

Whether or not you think you already know the dos and don’ts in data matching (and I guess you never really know that), I find the document well worth reading.

Hell in Norway

Looking for inappropriate words in customer data is always a risky business. Most of the time there is a legitimate name or place somewhere with that word.

For example, you may see a city named “Hell”.

Outside the English-speaking parts of the world you will find “Hell” in Norway. It’s a village with its own postal code (NO-7517) situated in the Trondheim metropolitan area. Not least at this time of year, with winter in the Northern Hemisphere, it is surely considerably colder than the religious “Hell”.

But even in the English speaking world you will find a semi legitimate “Hell” in Michigan, United States.
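One hedged way of handling this is to let a list of known legitimate places override the list of suspicious words, and to send everything else to a human rather than reject it. The sketch below uses made-up names and tiny illustrative lists; the two allowed entries are simply the two places mentioned in this post, and allowing a whole country is deliberately coarse.

```python
# Assumed, deliberately tiny lists for illustration only.
SUSPICIOUS_WORDS = {"hell"}

# (city, country code) pairs known to be real places.
LEGITIMATE_PLACES = {
    ("hell", "NO"),  # the village near Trondheim, Norway
    ("hell", "US"),  # Hell, Michigan
}


def screen_city(city: str, country_code: str) -> str:
    """Return 'ok', 'ok (known place)' or 'review' for a city value."""
    key = city.strip().casefold()
    if key not in SUSPICIOUS_WORDS:
        return "ok"
    if (key, country_code.strip().upper()) in LEGITIMATE_PLACES:
        return "ok (known place)"
    # Suspicious and not a known place: ask a human, do not auto-reject.
    return "review"
```

A customer in “Hell” with country code NO passes, while the same city name on a Danish address is flagged for review instead of being thrown away.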
