I use the term “given name” here for the part of a person's name that in most western cultures is called a “first name”.
When working with automation of data quality, master data management and data matching, you will encounter a lot of situations where you want to mimic what we humans do when we look at a given name. And when you have done this a few times, you also learn the risks of doing so.
Here are some of the lessons I have learned:
Gender
Most given names are either for males or for females. So most times you instinctively know if it is a male or a female when you look at a name. Probably you also know those given names in your culture that may be both. What often creates havoc is when you apply rules of one culture to data coming from a different culture. The subject was discussed on DataQualityPro here.
Salutation
In some cultures salutation is paramount – not least in Germany. A correct salutation may depend on knowing the gender. The gender may be derived from the given name. But you should not use the given name itself in your greeting.
So writing to “Angela Merkel” will be “Sehr geehrte Frau Merkel” – translates to “Very honored Mrs. Merkel”.
If you have a small mistake, such as the name being recorded as “Angelo Merkel”, this will create a big mistake when you write “Sehr geehrter Herr Merkel” (Very honored Mr. Merkel) to her.
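A minimal sketch of deriving a German salutation from a given name could look like the following. The lookup table `GENDER_BY_CULTURE`, the culture codes and the function names are hypothetical and only stand in for real, culture-specific reference data. The neutral fallback greeting is my own assumption about how to handle names that cannot be resolved.

```python
# Hypothetical reference data: given name -> gender, keyed by culture code.
# A real system would load this from culture-specific external reference data.
GENDER_BY_CULTURE = {
    "de": {"angela": "F", "angelo": "M", "andrea": "F"},
    "it": {"angelo": "M", "andrea": "M"},
}


def derive_gender(given_name, culture):
    """Return 'F', 'M', or None when the name is unknown in that culture."""
    names = GENDER_BY_CULTURE.get(culture, {})
    return names.get(given_name.strip().lower())


def german_salutation(given_name, surname, culture="de"):
    """Build a German greeting; the given name is used only to find the gender."""
    gender = derive_gender(given_name, culture)
    if gender == "F":
        return f"Sehr geehrte Frau {surname}"
    if gender == "M":
        return f"Sehr geehrter Herr {surname}"
    # Unknown name: fall back to a neutral greeting rather than guess.
    return "Sehr geehrte Damen und Herren"
```

Note that the same name can resolve differently per culture in this table ("Andrea" is listed as female for "de" and male for "it"), which is exactly where applying one culture's rules to another culture's data goes wrong.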
Age
In a recent post on the DataFlux Community of Experts Jim Harris wrote about how he received tons of direct mails assuming he was retired based on where he lives.
I have worked a bit with market segmentation and data (information) quality. I don’t know how it is with first names in the United States, but in Denmark you have a good chance of estimating a person's age from the given name. The statistical bureau provides statistics for each name and birth year. So by combining that with location-based demographics you will get a better response rate in direct marketing.
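As a rough illustration, name-by-birth-year statistics can be turned into an age estimate by taking the weighted average birth year for a name. The counts below are invented for the example and are not real figures from any statistical bureau; the function names are hypothetical as well.

```python
# Hypothetical counts of people with a given name per birth year.
# Real figures would come from a national statistical bureau.
NAME_BIRTH_YEAR_COUNTS = {
    "jim": {1950: 10, 1960: 30, 1970: 60},
    "gertrud": {1920: 70, 1930: 25, 1940: 5},
}


def expected_birth_year(given_name):
    """Weighted average birth year for a name, or None if the name is unknown."""
    counts = NAME_BIRTH_YEAR_COUNTS.get(given_name.strip().lower())
    if not counts:
        return None
    total = sum(counts.values())
    return sum(year * n for year, n in counts.items()) / total


def estimated_age(given_name, current_year):
    """Estimated age in the given year, or None if the name is unknown."""
    birth_year = expected_birth_year(given_name)
    if birth_year is None:
        return None
    return current_year - birth_year
```

An average is a crude summary. For segmentation it may be more useful to keep the whole distribution per name and combine it with the location-based demographic.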
Nicknames
Nicknames are used very differently in various cultures. In Denmark we don’t use them that much and definitely very seldom in business transactions. If you meet a Dane called Jim, his name is actually Jim. If you have a clever piece of software correcting/standardizing the name to be James, well, that’s not very clever.
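One way to avoid this is to make nickname standardization depend on the culture of the record and to leave the original name untouched. The sketch below assumes a hypothetical per-culture nickname table and a hypothetical `given_name_match` field; the culture codes and table contents are illustrative only.

```python
# Hypothetical nickname reference data, keyed by culture code.
NICKNAMES_BY_CULTURE = {
    "en-US": {"jim": "james", "bill": "william"},
    "da-DK": {},  # assumption: no nickname expansion for Danish records
}


def match_key(given_name, culture):
    """Return a lower-cased matching key, expanding nicknames per culture."""
    name = given_name.strip().lower()
    return NICKNAMES_BY_CULTURE.get(culture, {}).get(name, name)


def standardize(record):
    """Return a copy of the record with a separate match field added.

    The original given name is never overwritten.
    """
    key = match_key(record["given_name"], record["culture"])
    return {**record, "given_name_match": key}
```

With this table an American Jim gets the matching key "james", while a Danish Jim keeps "jim", and both still show "Jim" as their given name.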
Sehr geehrter Herr Sørensen,
Excellent post about the common data quality challenge represented by the diversity of given names.
Unfortunately, in most of the implementations in the United States, assumptions are made based on the English language, and therefore “Jim” would always be standardized as “James” for matching purposes (although usually into a separate matching field to retain the original value for survivorship where if in five matching James records, Jim was the original given name on three or more of the records, then we would probably assume that the customer preferred to be called Jim).
I am not aware of any similar statistical tables combining given name and birth year available in the United States. However, it certainly would make sense. For example, I would assume that the 1960s had a disproportionately large distribution of Moonchild, Starflower, and Aquarius — however, probably for both genders.
Best Regards,
Non-Danish Jim, whose given name is James, not to be confused with Danish Jim, whose given name is Jim
🙂
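The survivorship rule described in the comment above (keep the original given name that appears on the majority of the matched records) could be sketched as follows. The function name is hypothetical, and returning `None` when no name has a majority is an assumption of this sketch, not something the comment specifies.

```python
from collections import Counter


def surviving_given_name(original_names):
    """Pick the original given name used on a strict majority of matched records.

    Returns None when there are no names or no name has a majority,
    so that the decision can be left to another rule or a data steward.
    """
    names = [name.strip() for name in original_names if name.strip()]
    if not names:
        return None
    name, count = Counter(names).most_common(1)[0]
    return name if count * 2 > len(names) else None
```

For five matched records where three carry "Jim" and two carry "James", this returns "Jim".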
Jezus, Jim, for a moment I thought my German CEO was commenting on a blog 🙂
A little late reply maybe but this is how SNL (Saturday Night Live) tackled this problem:
http://en.wikipedia.org/wiki/Pat_(Saturday_Night_Live)
🙂
Thanks for sharing Dario.
The Wiki article says: The central aspect of sketches featuring Pat was the inability of others to determine the character’s sex.
This reminds me of a metadata pet peeve of mine. I am in no way opposed to “sex” but I don’t like when a column is labeled “Sex”. I think “Gender” is better for data modeling.
Great post on one of the issues that plagues data quality tools at the moment. Each of the subcategories (nickname conversion, gender identification and salutation derivation) is a case for one of your favorite tools/plugins … reference data! External reference data is the key to overcoming each of these variables!
Thanks for the reminder of how important external reference data is to the data quality / data matching realm!
Thanks William. Yes, you need external reference data that is specific to each culture.
Great points all around. Have you worked with data from some of the Slavic countries, where surnames differ based on gender? I have a female Czech friend whose surname is Keleova; her father and brother use a surname of simply Kele. I believe Russian follows the same format. How does that affect DQ tools?
Crysta, indeed, many challenges arise when dealing with global data. Some of these are explained in the product sheet about the Omikron WorldMatch tool, like in Russian:
Михаил Горбачёв = Michail Gorbatschow
Раиса Горбачёва = Raissa Gorbatschowa
I would like to add to Henrik’s response. There are some tools on the market that are country and culture “aware”. This is a great feature. I would like to see it for other domains than names and addresses too but I guess that will be hard to develop?
Great article and great points raised, both in the content and in the comments above.