Game, Set, Match

Tennis is one of the sports I practiced a lot when I was young and still like to play when possible.

As a consequence I guess I also like to follow world class tennis, not least now that we finally have a Dane competing for the big titles. I’m thinking of Caroline Wozniacki, who is seeded number one in the ongoing US Open Grand Slam tournament.

So, as an excuse to write a blog post about it I have come up with these connections between Caroline and Data Matching.

The name:

Wozniacki isn’t exactly a Nordic name, as she is the daughter of native-born Polish parents. In fact, if the Polish naming practice were followed, her surname would be Wozniacka, the female form of the name. But as practiced in Western countries she has inherited a genderless family name. Good for matching.

The bet:

Betting on sports events is like scoring in data matching. You are not 100 % sure but rely on probability. The odds for Caroline winning the US Open opening round matches are around 1.01 and 1.02, which corresponds to 98 to 99 % certainty, so pretty sure. But the odds get higher as the tournament proceeds to the final rounds, where it can go either way.
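The arithmetic behind those percentages can be sketched as follows. This is a minimal illustration, assuming the odds are decimal (European) odds and ignoring the bookmaker's margin; the function name `implied_probability` is made up for this example.

```python
def implied_probability(decimal_odds: float) -> float:
    """Convert decimal odds to the implied probability of the outcome."""
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds must be greater than 1.0")
    # A payout of `decimal_odds` per unit staked breaks even when the
    # outcome happens with probability 1 / decimal_odds.
    return 1.0 / decimal_odds


for odds in (1.01, 1.02, 2.00):
    print(f"odds {odds:.2f} -> {implied_probability(odds):.1%}")
```

A match score in data matching can be read the same way: a high score is a strong bet, not a certainty.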


Out-of-Africa

Besides being a memoir by Karen Blixen (or her literary double Isak Dinesen), Out-of-Africa is a hypothesis about the origin of modern humans (Homo sapiens). Of course there is a competing scientific hypothesis called Multiregional Origin of Modern Humans. Besides that there are of course religious beliefs.

The Out-of-Africa hypothesis suggests that modern humans emerged in Africa 150,000 years ago or so. A small group migrated to Eurasia about 60,000 years ago. Some made it across the Bering Strait to America maybe 40,000 years ago or maybe 15,000 years ago. The Vikings said hello to the Native Americans 1,000 years ago, but cross-Atlantic movement only gained pace about 500 years ago, when Columbus discovered America again.

Half a year ago (or so) I wrote a blog post called Create Table Homo_Sapiens. The follow-up comments added to the nerdish angle by discussing subjects such as mutating tables versus intelligent design and MAX(GEEK) counting.

But on the serious side the comments also touched on the intended subject of making data models reflect real world individuals.

Tables with persons are the most common entity type in databases around. As in the Out-of-Africa hypothesis, they could all have shared a single, simple structural origin. But that is not the way of the world. Some of the basic differences practiced in modeling the person entity are:

  • Cultural diversity: Names, addresses, national IDs and other basic attributes are formatted differently country by country and to some degree within countries. Most data models with a person entity are built on the format(s) of the country where they are designed.
  • Intended purpose of use: Person master data is often stored in tables made for specific purposes like a customer table, a subscriber table, a contact table and so on. Therefore the data identifying the individual is directly linked with attributes describing a specific role of that individual.
  • “Impersonal” use: Person data is often stored in the same table as other party master types such as business entities, projects, households et cetera.
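To make the second point concrete, here is a minimal sketch of the alternative: keeping the data that identifies the individual apart from the attributes of each role. The class names, field names and sample values are all assumptions made up for illustration, not a recommended schema.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    """The real-world individual, independent of any business purpose."""
    person_id: int
    given_name: str
    family_name: str
    country_code: str  # hints at which name and address formats apply


@dataclass
class Role:
    """One purpose-specific view of a person (customer, subscriber, contact ...)."""
    person_id: int
    role_type: str
    attributes: dict = field(default_factory=dict)


def roles_of(person: Person, roles: list[Role]) -> list[str]:
    """Return the role types held by the given person."""
    return [r.role_type for r in roles if r.person_id == person.person_id]


# Hypothetical sample data: one individual, two purposes.
jim = Person(1, "Jim", "Jensen", "DK")
roles = [
    Role(1, "customer", {"segment": "sport"}),
    Role(1, "subscriber", {"newsletter": True}),
]
print(roles_of(jim, roles))
```

With this split the same individual is stored once, however many purposes the data serves.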

Many, many data quality struggles around the world are caused by how we have modeled real-world individuals, old world and new world alike.


Feasible Names and Addresses

Most data quality technology was born in relation to the direct marketing industry back in the good old offline days. Main objectives have been deduplication of names and addresses and making names and addresses fit for mailing.

When working with data quality you have to embrace the full scope of business value in the data, here being the names and addresses.

Back in the 90’s I worked with an international fundraising organization. A main activity was sending direct mail with greeting cards for optional sale, with motifs related to seasonal feasts. Deduplication was a must regardless of the country (though the means were very different, but that’s for another day). Obviously the timing of the campaigns and the motifs on the cards were different between countries, but also within the countries based on the names and addresses.

Two examples:

German addresses

When selecting motifs for Christmas cards it’s important to observe that Protestantism is concentrated in the north and east of the country and Roman Catholicism is concentrated in the south and west. (If you think I’m out of season, well, such campaigns are planned in summertime.) So, in the north and east most people prefer Christmas cards with secular motifs such as a lovely winter landscape. In the south and west most people will like a motif with Madonna and Child. Having well organized addresses with a connection to demographics was important.

Malaysian names

Malaysia is a very multi-ethnic society. The two largest groups, the ethnic Malays and the Malaysians of Chinese descent, have different seasonal feasts. The best way of handling this in order to fulfill the business model was to assign the names and addresses to the different campaigns based on whether the name was an ethnic Malay name or a Chinese name. Surely an exercise on the edge of what I earlier described in the post What’s in a Given Name?


Did They Put a Man on the Moon?

Recently I have been reading some blog posts circling around having a national ID for citizens in the United States including a post from Steve Sarsfield and another post from Jeffrey Huth of Initiate.

In Denmark, where I live, we have had such a national ID for about half a century. So if you are a vendor with a great solution for data matching and master data management in healthcare and want to approach a Danish prospect in healthcare (which is mainly public sector here), they will tell you that the solution looks really nice, but they don’t have that problem. You can’t stay many seconds as a patient in a Danish hospital before you are asked to provide your national ID. And if you came in inside your mother you will be given an ID for life within seconds after you are born.

The same national ID is the basis when we have elections. Some weeks before, the authorities will push the button and every person with the right status and age gets a ballot. Therefore we are in disbelief when, every fourth year, we follow the United States electing a president and learn about all the mess in voter registration.

Is that happening in the nation that put a man on the moon in 1969? Or did they? Was it after all a studio recording?


The Many Worlds of Data Quality

This morning I had some fun reading the Wikipedia articles explaining Data Quality.

I tried to compare the texts available in:

I am afraid that the quality of the texts and some differences in how the subject is presented in the different languages show the immaturity of the data quality discipline and not least the lack of global embracement that is seen in literature, published articles and the technology available.

Three observations from the Wikipedia articles:

The French piece is in some parts a translation of the English text. However, the translation became very difficult in the History section, as the English text here has the well-known narrow United States scope.

The German text is completely different from the English text. Also the title is Information Quality. The references are largely from German authors.

The Japanese text seems to be a Google Translate of the (former) English text. This is strange as much of the quality inspiration originally came from Japan.


LinkedIn and the other Thing

I have a profile in two different business oriented social networking services: LinkedIn and XING.

I have far more connections in LinkedIn than in XING.

My connections in LinkedIn are mainly from English speaking countries (US, UK, IE, IN, AU) and from Scandinavia (DK, NO, SE), where I live and where English is widely spoken, not least by white-collar workers.

The connections I have with people in XING are almost only with people from Germany.

This picture matches very well how these two tools are positioned.

The US based LinkedIn is strong in “English speaking” countries with most profiles per capita in:

  • Denmark, Netherlands and USA followed by
  • Norway, Sweden, United Kingdom and Australia

(The figures are from last year, when LinkedIn passed 50 million profiles.)

XING is strong in Germany, where XING was founded, and through acquisitions also in Spain and Turkey.

Now, it’s not that you can’t operate LinkedIn in German and Spanish; you can. Also you can operate XING in English.

It’s about meeting your connections where they are.


What’s In a Given Name?

I use the term “given name” here for the part of a person name that in most western cultures is called a “first name”.

When working with automation of data quality, master data management and data matching you will encounter a lot of situations where you would like to mimic what we humans do when we look at a given name. And when you have done this a few times you also learn the risks of doing so.

Here are some of the lessons I have learned:

Gender

Most given names are either for males or for females. So most times you instinctively know if it is a male or a female when you look at a name. Probably you also know those given names in your culture that may be both. What often creates havoc is when you apply rules of one culture to data coming from a different culture.  The subject was discussed on DataQualityPro here.

Salutation

In some cultures salutation is paramount, not least in Germany. A correct salutation may depend on knowing the gender. The gender may be derived from the given name. But you should not use the given name itself in your greeting.

So writing to “Angela Merkel” will be “Sehr geehrte Frau Merkel”, which translates to “Very honored Mrs. Merkel”.

If you have a small mistake, such as the name being “Angelo Merkel”, this will create a big mistake when writing “Sehr geehrter Herr Merkel” (Very honored Mr. Merkel) to her.
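The rule can be sketched in a few lines. This is only an illustration: the two-entry lookup table stands in for a real, culture-specific name reference, and the gender-neutral fallback for unknown names is my own assumption about a sensible default.

```python
# Hypothetical, tiny lookup; a real one would be built per culture.
GENDER_BY_GIVEN_NAME = {"angela": "F", "angelo": "M"}


def german_salutation(given_name: str, family_name: str) -> str:
    """Derive the gender from the given name, but greet by family name only."""
    gender = GENDER_BY_GIVEN_NAME.get(given_name.strip().lower())
    if gender == "F":
        return f"Sehr geehrte Frau {family_name}"
    if gender == "M":
        return f"Sehr geehrter Herr {family_name}"
    # Unknown or ambiguous name: fall back to a neutral greeting
    # rather than guessing.
    return "Sehr geehrte Damen und Herren"


print(german_salutation("Angela", "Merkel"))
print(german_salutation("Angelo", "Merkel"))  # one wrong letter, wrong greeting
```

Note how a single-character error in the given name changes the whole greeting, which is exactly the risk described above.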

Age

In a recent post on the DataFlux Community of Experts Jim Harris wrote about how he received tons of direct mail assuming he was retired based on where he lives.

I have worked a bit with market segmentation and data (information) quality. I don’t know how it is with first names in the United States, but in Denmark you have a good chance of estimating a person’s age from the given name. The statistical bureau provides statistics for each name and birth year. So by combining that with location-based demographics you will get a better response rate in direct marketing.
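As a rough sketch of the idea, assume the statistics arrive as counts per given name and birth year. The layout of `counts_by_name` and the numbers in it are invented for illustration and are not real figures from any statistical bureau.

```python
from typing import Optional


def estimate_birth_year(given_name: str, counts_by_name: dict) -> Optional[float]:
    """Estimate a birth year as the count-weighted mean for the given name."""
    counts = counts_by_name.get(given_name)
    if not counts:
        return None  # no statistics for this name, so no estimate
    total = sum(counts.values())
    return sum(year * n for year, n in counts.items()) / total


# Made-up counts: {given name: {birth year: number of people}}
counts_by_name = {"Kirsten": {1945: 300, 1955: 500, 1965: 200}}

print(estimate_birth_year("Kirsten", counts_by_name))  # 1954.0
print(estimate_birth_year("Unknown", counts_by_name))  # None
```

A weighted mean is the simplest choice here; for names that were popular in two separate periods, the most frequent birth year may be a better estimate.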

Nicknames

Nicknames are used very differently in various cultures. In Denmark we don’t use them that much and definitely very seldom in business transactions. If you meet a Dane called Jim his name is actually Jim. If you have a clever piece of software correcting/standardizing the name to be James, well, that’s not very clever.
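One way to avoid that trap is to make nickname standardization depend on the country of the record. The sketch below assumes a per-country nickname table; the table contents and the function name are hypothetical.

```python
# Hypothetical nickname tables, keyed by country code.
# Denmark has no entry on purpose: a Dane called Jim is called Jim.
NICKNAMES_BY_COUNTRY = {
    "US": {"jim": "James", "bill": "William"},
}


def standardize_given_name(given_name: str, country_code: str) -> str:
    """Replace a nickname with its formal form only where that is customary."""
    table = NICKNAMES_BY_COUNTRY.get(country_code, {})
    return table.get(given_name.lower(), given_name)


print(standardize_given_name("Jim", "US"))  # James
print(standardize_given_name("Jim", "DK"))  # Jim
```

Whether the standardized form should overwrite the original or only serve as a match key is a separate choice; for matching purposes, keeping both is usually the safer option.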



Eurovisions

Diversity in data quality is a recurring subject of mine. I think the issues with data quality and diversity resemble a recurring event in Europe: the yearly Eurovision Song Contest. This year the contest was held in Oslo this past week.

Every participating country brings a song. The lyrics may be in any language, which in practice mostly means either English or the country’s local language(s). Some songs have an international sound while other songs have a strong recognizable local sound. This year I noticed:

  • The winning song from Germany was in the international category, performed in English.
  • As UK songs usually have an international sound and are performed in English, the British song handicapped itself with a 20+ year old sound, leading to a similar position in the final.
  • Netherlands had a winning strategy with a local sound performed in Dutch. Big hit in Holland I think, but it didn’t make it to the final.

The voting process was as usual criticized, as there is a tendency for neighboring countries to favor each other, as done by the Balkan countries and by the Viking nations.


Post no. 100

This is post number 100 on this blog. It is time to say thank you to those who have read this blog, those who have re-tweeted the posts and not least those who have commented on them. It is also time for a recapitulation of my opinions (based on my experiences and observations) about data quality.

Let me emphasize three points:

  • Fit for purpose versus real world alignment
  • Diversity in data quality
  • The role of technology in data quality improvement

Fit for purpose versus real world alignment

According to Wikipedia data may be of high quality in two alternative ways:

  • Either they are fit for their intended uses
  • Or they correctly represent the real-world construct to which they refer

My thesis is that, as more and more purposes are included, there is a breakeven point beyond which it will be less cumbersome to reflect the real-world object than to try to align all known purposes.

This theme is so far covered in 19 posts and pages including:

Diversity in data quality

International and multi-cultural aspects of data quality improvement have been a favorite topic of mine for a long time.

While working with data quality tools and services for many years I have found that many tools and services are very national. So you might discover that a tool or service will work wonders with data from one country, but be quite ordinary or in fact useless with data from another country.

I have made 15 posts on diversity in data quality so far including:

The role of technology in data quality improvement

You may become a Data Quality professional by coming from the business side or the technology side of practice. But more important in my eyes is the question of whether you have made serious attempts at, and succeeded in, understanding the side where you didn’t start. I have always strived to be a person with mixed skills. As I have tried single-handedly to build a data quality tool (or, to be more specific, a data matching tool) I do of course write a lot about data quality technology.

This blog includes 37 posts on data quality technology so far including:
