The British newspaper The Guardian has a feature on their website where you can get data about the Olympians. Link here: London 2012 Olympic athletes: the full list.
Browsing the list is a good reminder of the world-wide diversity we have with person names.
Here the names are formatted with the surname(s) followed by the given name(s). The surname is in upper case.
For the Chinese and other East Asian Olympians this is the name sequence they are used to, as opposed to Olympians from places where the first name is the given name and the last name is the surname.
Having the surname in upper case also shows where Olympians have two surnames, as is the custom in Spanish cultures.
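As a rough illustration of how the upper-case convention can be used, here is a minimal Python sketch that splits a listed name into surname(s) and given name(s). The function name and the Spanish-style example name are made up for illustration, and the rule (leading all-caps tokens are the surname) is an assumption about the sheet's format, not something the Guardian documents.

```python
def split_listed_name(listed):
    """Split 'SURNAME(S) Given' into (surname, given) using the upper-case cue."""
    surnames, given = [], []
    for token in listed.split():
        # Leading all-caps tokens count as surname; once a mixed-case token
        # appears, everything that follows is treated as given name.
        if token.isupper() and not given:
            surnames.append(token)
        else:
            given.append(token)
    return " ".join(surnames), " ".join(given)


print(split_listed_name("JIANG Wenwen"))
print(split_listed_name("GARCIA LOPEZ Maria"))  # hypothetical two-surname example
```

A rule like this will stumble on given names written as capital initials, so it is only a starting point.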
And oh yes. The South African guy has JIM as his surname.
Finally, from this screenshot there is a good question. Is JIANG Wenwen superb at both synchronized swimming and track cycling, or are these two different Olympians with the same name? Some names are very common in China. A little googling tells me they are two different persons. The synchronized swimmer is more related to her twin sister and swimming partner JIANG Tingting.
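A question like this can be spotted mechanically before any googling. The sketch below groups rows by name and country and reports names that appear under more than one sport. The three rows are typed in from the example above; the column layout (name, country, sport) is an assumption, not the actual structure of the Guardian sheet.

```python
from collections import defaultdict

# Illustrative rows only; assumed layout: (name, country, sport)
rows = [
    ("JIANG Wenwen", "China", "Synchronized swimming"),
    ("JIANG Wenwen", "China", "Track cycling"),
    ("JIANG Tingting", "China", "Synchronized swimming"),
]


def same_name_groups(rows):
    """Return {(name, country): [sports]} for names listed more than once."""
    groups = defaultdict(list)
    for name, country, sport in rows:
        groups[(name.casefold(), country)].append(sport)
    return {key: sports for key, sports in groups.items() if len(sports) > 1}


for (name, country), sports in same_name_groups(rows).items():
    print(name, country, sports)
```

The result only flags candidates. Whether two rows are the same person still has to be settled with other attributes, such as date of birth.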
Let’s check if there is more than one “John Smith”.
But it could be fun if “Kim Smith” and “Kimberley Smith” came from the same country.
Many Olympians actually don’t have the names reflected in this sheet as many have names in a different alphabet or script system.
The Danish cyclist “SORENSEN Nicki” actually shares my last name, as we know him as “Nicki Sørensen”. The Serbian, Ukrainian and Russian Olympians have their original names in the Cyrillic alphabet, but these have been transliterated to the English alphabet, and Olympians from countries with script systems other than an alphabet have had their names transcribed into the (English) alphabet.
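To show what probably happened to “Sørensen”, here is a small Python sketch of diacritic folding. It is an assumption about the kind of processing involved, not a description of how the list was actually produced. Note that Unicode normalization alone leaves “ø” untouched because it has no decomposition, so the sketch adds a small hand-made mapping; that table is illustrative and far from complete.

```python
import unicodedata

# Letters with no Unicode decomposition need an explicit (illustrative) mapping
EXTRA_FOLDS = {"ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE", "ß": "ss", "ł": "l", "Ł": "L"}


def fold_latin_diacritics(name):
    """Replace accented Latin letters with their closest unaccented form."""
    folded = []
    for ch in name:
        if ch in EXTRA_FOLDS:
            folded.append(EXTRA_FOLDS[ch])
            continue
        # NFKD splits e.g. 'é' into 'e' plus a combining accent, which is dropped
        decomposed = unicodedata.normalize("NFKD", ch)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(folded)


print(fold_latin_diacritics("Sørensen").upper())
```

Cyrillic or Chinese names pass through this function unchanged; they need real transliteration or transcription rules, which are language specific.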
So, is the list bad data quality?
Excellent example of human and data diversity.
The question of whether it is good or bad data quality would be better phrased as, “What is the data meant to represent and how well does it represent it?”
Contrary to what many people believe, Quality – even data Quality – is never subjective. It must be measured against the criteria that were defined for its creation. If it meets them it is good Quality; if it does not it is bad.
When considering Data Quality people forget that data is always created or transformed by a Function. The criteria for its creation and transformation are part of that Function. It is against these criteria that Data Quality must be measured.
An Olympian does not have to win gold to achieve a Quality event. If their defined criteria for taking part were “to compete at an international level and take 2 seconds off my personal best” and they do that, then it has been a Quality event for them.
Looking forward to more of your great international observations.
Thanks for adding in John.
By the way: Working with international names (and other data with diversity implications) is a big subject in the new eLearning course: Data Parsing, Matching and De-duplication.
Hi Henrik and John,
This is an interesting list and I would agree with John that one definition of quality is the quality the source believes it needs to achieve. But relative to an enterprise where systems may capture names in native languages in some cases, or split a name into its component parts like Given Name and Family Name, then we’d assess quality as it pertains to the data’s suitability for a target environment.
In this case, I’d say the data would not be “good quality”, for:
1. representing the TRUE external fact, the name of the person as they would write/recognize it themselves. The Romanization of a name is a transliteration of something that exists in another character script when written by the person it describes.
2. a global enterprise that would want to understand their customers as they understand themselves…say, for purposes of building a privacy profile, marketing program, shipping a product, etc.
In my experience, you’d even want to capture the language/character script of the Name data you’re representing so that it’s explicitly stated. Then, given the systems/teams that want to use the data, they can specify if the language/character script for Russia being Russian/Roman is acceptable, rather than Russian/Cyrillic. Systems and processes expecting Russian/Cyrillic may want to filter out Russian/Roman data from their process, rather than “break” or impact a customer experience with improperly formatted data.
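A minimal Python sketch of that idea: tag each name with the script(s) it is written in, then let a downstream process accept only the scripts it expects. Python's standard library has no direct script property, so the sketch uses the first word of each character's Unicode name as a heuristic. The function names and the sample Cyrillic name are made up for illustration.

```python
import unicodedata


def detect_scripts(text):
    """Return the set of scripts used by the letters in text (heuristic)."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # e.g. 'CYRILLIC CAPITAL LETTER I' -> 'CYRILLIC'
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def accepts(name, expected_scripts):
    """True if the name contains letters and all of them are in an expected script."""
    scripts = detect_scripts(name)
    return bool(scripts) and scripts <= set(expected_scripts)


print(detect_scripts("Nicki Sørensen"))
print(accepts("Иван", {"CYRILLIC"}))   # hypothetical Russian/Cyrillic name
print(accepts("Ivan", {"CYRILLIC"}))   # the Russian/Roman form is filtered out
```

Storing the detected script next to the name makes the Russian/Roman versus Russian/Cyrillic distinction explicit instead of leaving each consumer to guess.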
So, the discussion about quality has to take into account not just the model for capturing the data, but also the purpose and usefulness of the capturing exercise for all downstream functions supported by that process.
Maybe I’m reading too much into this, but this is a common problem in the industry, especially with vended and channel data.
Jeff, thanks for adding to the discussion. I agree with your views.
Often we have to fulfill multiple purposes of use which for example will force us to have (at least) two versions of a name:
• A global uniform standardized way as the example with the Olympians
• A local form in the local sequence and alphabet/script system for personalization reasons
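The two versions listed above could be kept side by side in one record. Here is a minimal Python sketch, assuming a simple structure with hypothetical field names and a hypothetical "personalization" purpose label; a real master data model would carry more attributes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonName:
    standardized: str   # global uniform form, e.g. as in the Olympians list
    local: str          # local sequence and alphabet/script
    local_script: str   # explicit script tag for the local form

    def for_purpose(self, purpose):
        """Pick the local form for personalization, else the standardized form."""
        return self.local if purpose == "personalization" else self.standardized


rider = PersonName(standardized="SORENSEN Nicki", local="Nicki Sørensen", local_script="LATIN")
print(rider.for_purpose("matching"))
print(rider.for_purpose("personalization"))
```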