Yesterday I was looking for some quotations for a data quality presentation.
I stumbled upon these by Niels Bohr:
An expert is a person who has made all the mistakes which can be made in a very narrow field
I found that this quote is most often used this way:
“An expert is a man who has made all the mistakes which can be made in a very narrow field”.
I am pretty sure Bohr said person – not man. There are just as many female experts as male experts around.
And indeed: Learning from mistakes is the path to expertise in data quality.
There are two sorts of truth: Trivialities, where opposites are obviously absurd, and profound truths, recognized by the fact that the opposite is also a profound truth
Bohr was into quantum mechanics. I think data quality is very much like quantum mechanics. Sometimes there is a simple single version of the truth; sometimes there are several great versions of a complex truth.
Anyone who is not shocked by quantum theory has not understood it
Anyone who is not shocked by the actual quality of data has probably not measured it (yet).
Even though I’m not a royalist I’m afraid this will be the second hypocritical blog post within a year with a royal introduction. The first one was about Royal Exceptions.
The big news on all channels today in Denmark (and Australia) is that (Australian-born) Crown Princess Mary has given birth to twins: a boy and a girl, hence a prince and a princess, or as we say in blunt data quality language: a male and a female.
The gender of individuals has always been a prominent element in party master data management, not least in data matching.
Right now we are having a discussion in the LinkedIn Data Matching group concerning Data Quality of Gender / Sex Codes and the Impacts on Identity Data Matching.
So far we have covered issues such as:
- Trustworthiness for assigned gender codes
- Scoring mechanisms in matching including gender codes
- Diversity impact in assigning/verifying gender from names
- Using gender codes for salutation
Please join the discussion and if you are not already a member of the LinkedIn Data Matching group: Join the group here.
Tennis is one of the sports I practiced a lot when I was young and still like to play when possible.
As a consequence I guess I also like to follow world class tennis, not least now that we finally have a Dane competing for the big titles. I’m thinking about Caroline Wozniacki, who is seeded number one in the ongoing US Open Grand Slam tournament.
So, as an excuse to write a blog post about it I have come up with these connections between Caroline and Data Matching.
Wozniacki isn’t exactly a Nordic name, as she is the daughter of native-born Polish parents. In fact, if the Polish naming practice were followed, her surname would be Wozniacka, the female form of the name. But as practiced in Western countries she has inherited a genderless family name. Good for matching.
Betting on sports events is like scoring in data matching. You are not 100 % sure but rely on probability. The odds for Caroline winning her US Open opening round matches are around 1.01 and 1.02, which corresponds to roughly 98 to 99 % certainty: pretty sure. But odds get higher as the tournament proceeds to the final rounds, where it can go either way.
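The arithmetic behind that certainty figure can be sketched in a few lines. This assumes the odds quoted are decimal (European) odds and ignores the bookmaker's margin, so the result is only an approximation; the function name is my own.

```python
def implied_probability(decimal_odds: float) -> float:
    """Convert decimal odds to the implied probability of the outcome.

    Assumption: no bookmaker margin is removed, so this slightly
    overstates the "true" probability.
    """
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds must be greater than 1.0")
    return 1.0 / decimal_odds


for odds in (1.01, 1.02):
    print(f"odds {odds:.2f} -> {implied_probability(odds):.1%}")
```

A match score in data matching can be read the same way: a high score is a strong bet, not a guarantee.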
I use the term “given name” here for the part of a person name that in most western cultures is called a “first name”.
When working with automation of data quality, master data management and data matching you will encounter a lot of situations where you will want to mimic what we humans do when we look at a given name. And when you have done this a few times you also learn the risks of doing so.
Here are some of the lessons I have learned:
Most given names are either for males or for females. So most times you instinctively know if it is a male or a female when you look at a name. Probably you also know those given names in your culture that may be both. What often creates havoc is when you apply rules of one culture to data coming from a different culture. The subject was discussed on DataQualityPro here.
In some cultures salutation is paramount, not least in Germany. A correct salutation may depend on knowing the gender. The gender may be derived from the given name. But you should not use the given name itself in your greeting.
So writing to “Angela Merkel” will be “Sehr geehrte Frau Merkel” – translates to “Very honored Mrs. Merkel”.
If you have a small mistake, such as the name being “Angelo Merkel”, this will create a big mistake when writing “Sehr geehrter Herr Merkel” (Very honored Mr. Merkel) to her.
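One way to limit the damage is to fall back to a neutral greeting whenever the derived gender is not trustworthy enough. The following is only a sketch under my own assumptions: the gender codes "F" and "M", the confidence value and the 0.95 threshold are all hypothetical and would come from whatever name-to-gender lookup you use.

```python
from typing import Optional

# Standard gender-neutral German greeting, used when we are not sure.
NEUTRAL_GREETING = "Sehr geehrte Damen und Herren"


def german_salutation(
    surname: str,
    gender: Optional[str],
    confidence: float = 1.0,
    threshold: float = 0.95,  # hypothetical cut-off, tune for your own data
) -> str:
    """Build a formal German greeting from surname and derived gender.

    The given name is deliberately not part of the greeting.
    """
    if confidence >= threshold:
        if gender == "F":
            return f"Sehr geehrte Frau {surname}"
        if gender == "M":
            return f"Sehr geehrter Herr {surname}"
    return NEUTRAL_GREETING


print(german_salutation("Merkel", "F"))
print(german_salutation("Merkel", "M", confidence=0.6))
```

The neutral form is less personal, but it is a far smaller mistake than addressing someone with the wrong gender.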
In a recent post on the DataFlux Community of Experts Jim Harris wrote about how he received tons of direct mails assuming he was retired based on where he lives.
I have worked a bit with market segmentation and data (information) quality. I don’t know how it is with first names in the United States, but in Denmark you have a good chance of estimating a person’s age from the given name. The statistical bureau provides statistics for each name and birth year. So combining that with location-based demographics will get you a better response rate in direct marketing.
Nicknames are used very differently in various cultures. In Denmark we don’t use them that much and definitely very seldom in business transactions. If you meet a Dane called Jim his name is actually Jim. If you have a clever piece of software correcting/standardizing the name to be James, well, that’s not very clever.
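To show the idea of estimating age from name statistics, here is a minimal sketch. The counts below are invented toy numbers for an unnamed given name, not real figures from any statistical bureau, and the function names and the reference year are my own assumptions.

```python
from typing import Dict


def estimate_birth_year(counts_by_year: Dict[int, int]) -> float:
    """Return the count-weighted average birth year for one given name."""
    total = sum(counts_by_year.values())
    if total == 0:
        raise ValueError("no statistics available for this name")
    return sum(year * n for year, n in counts_by_year.items()) / total


def estimate_age(counts_by_year: Dict[int, int], reference_year: int) -> float:
    """Estimate the age in reference_year from the name statistics."""
    return reference_year - estimate_birth_year(counts_by_year)


# Invented toy data: number of people with the name, per birth year.
toy_counts = {1950: 100, 1960: 300, 1970: 100}

print(estimate_birth_year(toy_counts))
print(estimate_age(toy_counts, reference_year=2011))
```

A weighted average is the simplest choice here; for names that come back into fashion, a full distribution over birth years would be more honest than a single number.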
Getting data entry right at the root is important, and most (if not all) data quality professionals agree that this is superior to doing cleansing operations downstream.
The problem, however, is that most data erodes as time passes. What was right at the time of capture will at some point in time not be right anymore.
Therefore data entry ideally must not only be a snapshot of correct information but should also include raw data elements that make the data easily maintainable.
An obvious example: If I tell you that I am 49 years old, that may be just the piece of information you need to complete a business process. But if you ask for my birth date instead, you can calculate my age, you will know when I turn 50 (all too soon), and your organization will still know my age if we do business again later.
Birth dates are stable personal data. Gender pretty much is too. But most other data changes over time. Names change in many cultures in case of marriage and maybe divorce, and people may change names when discovering bad numerology. People move, or a street name may be changed.
There are many privacy concerns around identifying individual persons, and the norms differ between countries. In Scandinavia we are used to being identified by our unique citizen ID, though here too within debatable limits. But solutions are offered for maintaining raw data that will yield valid and timely B2C information at whatever precision is asked for, when needed.
By contrast, identifying a business entity is broadly accepted everywhere. Public sector registrations are a basic source of identifiers, with varying uniqueness and completeness around the world. Private providers have developed proprietary ID systems like the D-U-N-S Number from D&B. All in all, such solutions are good sources for ongoing maintenance of your B2B master data assets.
Addresses, whether belonging to business or consumer/citizen entities or just being addresses, are available as external reference data covering more and more spots on the Earth. Ongoing development in open government data helps with availability and completeness, and these data are often deployed in the cloud. Right now it is mostly about visual presentation on maps, but no doubt more services will follow.
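The "bit of calculation" is small enough to sketch. The birth date in the usage lines is a made-up example, and treating a 29 February birthday as 1 March in non-leap years is my own assumption; conventions differ.

```python
from datetime import date


def age_on(birth_date: date, on: date) -> int:
    """Return the age in whole years on the given date."""
    years = on.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred this year.
    if (on.month, on.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def date_turning(birth_date: date, age: int) -> date:
    """Return the date on which the person reaches the given age."""
    try:
        return birth_date.replace(year=birth_date.year + age)
    except ValueError:
        # 29 February in a non-leap year: assume 1 March.
        return date(birth_date.year + age, 3, 1)


born = date(1961, 5, 17)  # made-up example birth date
print(age_on(born, date(2011, 1, 10)))
print(date_turning(born, 50))
```

Storing the birth date once lets every later process derive the age it needs; storing the age forces you to ask again.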
Getting data right at entry and being able to maintain the real world alignment is the challenge if you don’t look at your data asset as a throw-away commodity.
Figure 1: one year old prime information
PS: If you forgot to maintain your data: Before dumping it, data cleansing might be a sustainable alternative.