7 thoughts on “When Bad Data Quality isn’t Bad Data Quality”

  1. Axel Troike (@AxelTroike) 24th April 2013 / 12:57

    Henrik,

    Another problem of this kind exists when names have to be transcribed from a different alphabet, e.g. from the Russian (Cyrillic) to the Latin alphabet. As an example, a well-known former Soviet leader may be found (and spelled) as Gorbachyov (the transcription according to Wikipedia), Gorbachev (used in English) or Gorbatschow (used in German). As with the pope, identifying famous persons may not pose serious problems.

    However, in the case of the suspected Boston bomber Tamerlan Tsarnaev (the “agreed” spelling in English), depending on the user’s language environment we also find his family name spelled as Tsarnajew, Tsarnajev, Tsarnaje or Zarnajev, which may well (not confirmed, but certainly possible) have caused identification problems during past investigations of this person and consequently may have affected the risk evaluation.

    Back to daily business: international organizations definitely need to consider the effect of non-unique transcription in their identification / name matching procedures when they have clients from geographic areas that use different alphabets.
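
    A minimal Python sketch of the kind of transliteration-aware matching Axel describes. The fold rules and the 0.85 similarity threshold are illustrative assumptions, not a vetted rule set; production matching engines use far richer transliteration tables and phonetic keys.

    ```python
    # Hedged sketch: collapse common German/English transliteration differences
    # before a fuzzy comparison. The fold list and threshold are assumptions.
    from difflib import SequenceMatcher

    TRANSLIT_FOLDS = [
        ("tsch", "ch"),  # Gorbatschow -> Gorbachow
        ("ts", "z"),     # Tsarnaev -> Zarnaev
        ("j", "y"),      # ...jew -> ...yew
        ("w", "v"),      # ...ow -> ...ov
        ("yo", "e"),     # Gorbachyov -> Gorbachev
    ]

    def fold(name: str) -> str:
        """Reduce a transliterated name to a crude canonical key."""
        key = name.lower()
        for src, dst in TRANSLIT_FOLDS:
            key = key.replace(src, dst)
        return key

    def same_surname(a: str, b: str, threshold: float = 0.85) -> bool:
        """Fold both spellings, then fuzzy-match what remains."""
        return SequenceMatcher(None, fold(a), fold(b)).ratio() >= threshold

    print(same_surname("Gorbachyov", "Gorbatschow"))  # True
    print(same_surname("Tsarnaev", "Zarnajev"))       # True
    print(same_surname("Tsarnaev", "Thompson"))       # False
    ```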

  2. Tonia Thompson 24th April 2013 / 15:46

    Great post and comment. In the US (and many other parts of the world), we also see this problem with nicknames vs legal names: Bob vs Robert, or Kat vs Katherine. Matching algorithms must account not only for alternate spellings in various regions of the world, but also for any nickname variations (see the sketch below).

    You also bring up a good point regarding the phrasing of the question. We must phrase survey questions and form field labels carefully so that we have a greater chance of capturing the intended data.
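
    A minimal sketch of the nickname normalisation Tonia mentions, assuming a small hand-maintained alias table; a real deployment would need far larger, curated nickname dictionaries and region-specific variants.

    ```python
    # Hedged sketch: map nicknames to an assumed legal form before matching.
    # The table below is a tiny illustrative sample, not a real reference list.
    NICKNAMES = {
        "bob": "robert",
        "bobby": "robert",
        "rob": "robert",
        "kat": "katherine",
        "kathy": "katherine",
        "kate": "katherine",
    }

    def canonical_first_name(name: str) -> str:
        """Return the presumed legal name for a nickname, else the name itself."""
        key = name.strip().lower()
        return NICKNAMES.get(key, key)

    def first_names_match(a: str, b: str) -> bool:
        return canonical_first_name(a) == canonical_first_name(b)

    print(first_names_match("Bob", "Robert"))     # True
    print(first_names_match("Kat", "Katherine"))  # True
    print(first_names_match("Kat", "Robert"))     # False
    ```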

  3. Dylan Jones 26th April 2013 / 07:20

    Great post, Henrik.

    So is this a case of poor Information Quality, or is it simply another dimension of Data Quality, e.g. Presentation Quality or Metadata Quality, that is lacking?

    Data becomes information when we relate an item of data to other items of data in context. In this case, if we wanted additional names we would relate our raw data to a mapping of related names, thus inferring information.

    This ties in nicely with the other discussion on here recently where I talked about Welsh names. I may call a friend David, his Welsh parents may call him Dafydd, but his grandparents may call him Dewi. All three names are true facts, but if you wanted to dedupe his records you would need to have all three names mapped to each other to align the context (see the sketch after this comment).

    Interesting question. Application design and model design have to be two of the biggest failure points for DQ; it’s why we get so much overloading of data, with multiple values stuffed into one attribute.

    – Dylan Jones
    http://dataqualitypro.com
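
    A minimal sketch of the mapping Dylan describes: if David, Dafydd and Dewi must all be treated as the same given name, a small union-find over assumed alias pairs lets records sharing any form fall into one dedup cluster. The alias pairs and record layout here are illustrative assumptions.

    ```python
    # Hedged sketch: cluster records whose given names are linked by alias pairs.
    parent: dict[str, str] = {}

    def find(name: str) -> str:
        """Union-find lookup with path halving."""
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    def union(a: str, b: str) -> None:
        parent[find(a)] = find(b)

    # Illustrative alias pairs linking the English and Welsh forms.
    for a, b in [("david", "dafydd"), ("dafydd", "dewi")]:
        union(a, b)

    records = [("David", "Jones"), ("Dafydd", "Jones"), ("Dewi", "Jones")]
    clusters: dict[tuple[str, str], list[tuple[str, str]]] = {}
    for given, family in records:
        key = (find(given.lower()), family.lower())
        clusters.setdefault(key, []).append((given, family))

    print(clusters)  # all three records land in a single cluster
    ```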

  4. Henrik Liliendahl Sørensen 26th April 2013 / 08:19

    Thanks a lot, Axel, Tonia and Dylan, for adding in.

  5. Vishu Rastogi (@decembers_child) 3rd May 2013 / 18:39

    Interesting post, Henrik, and a very interesting follow-up comment from Axel.
    Now the question is: although many DQ solutions provide transliteration, do global organizations really spend resources on this need? If I may play the devil’s advocate, is the percentage of such data they hold significant enough to justify spending that amount of resources? I mean, how many global corporations (and at what data volume) really have customers spread across, say, Russia, Greece and America to whom they would need to apply such dedup rules?

    Having said that, and considering the stakes involved, I feel there is a strong case for federal and state agencies across the major geographies to implement such solutions.

    -Vishu
