Right now my family and I are relocating from a house in a southern suburb of Copenhagen to a flat much closer to downtown. As there is a month in between where we don’t have a place of our own, we have rented a cottage (summerhouse) north of Copenhagen, not far from Kronborg Castle, which is the setting of Shakespeare’s famous play Hamlet.
Therefore a data quality blog post inspired by Hamlet seems timely.
Though the feigned madness of Hamlet might be a good subject related to data quality, I will instead take a closer data matching look at the name Hamlet.
Shakespeare’s Hamlet is inspired by an old Norse legend, but to me the name Hamlet doesn’t sound very Norse.
Nor does the similar-sounding name Amleth, found in the immediate source, Saxo Grammaticus.
If Saxo’s source was a written one, it may have come from Irish monks writing the name in the Gaelic alphabet as Amhlaoibh, where Amhl = owl, aoi = ay and bh = v, which sounds just like the good old Norse name Olav or Olaf.
So, there is a possible track from Hamlet to Olaf.
Also today, fellow data quality blogger Graham Rhind published a post called Robert the Carrot on the same issue. As Graham explains, we often see data changed as it passes through interfaces, and after passing through many interfaces it no longer looks at all like what was first entered. There may be a good explanation for each transformation, but the end-to-end similarity is hard to guess when comparing only the first and the last version.
I have often met that challenge in data matching. An example is when we have the following names registered at the same address:
- Pegy Smith
- Peggy Smith
- Margaret Smith
A synonym-based similarity (or standardization) will find that Margaret and Peggy are duplicates.
An edit distance similarity will find that Peggy and Pegy are duplicates.
A combined similarity algorithm will find that all three names belong to a single duplicate group.
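The Peggy example can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions, not how any particular matching tool works: the `NICKNAMES` table is a tiny made-up lookup, the edit distance threshold of 1 is arbitrary, and "combined" here simply means that two names are linked if either method matches them, with links followed transitively. The surname and address are assumed to be identical already, so only the given names are compared.

```python
from itertools import combinations

# Hypothetical nickname table for illustration; a real one would be far larger.
NICKNAMES = {"peggy": "margaret", "maggie": "margaret", "bob": "robert"}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def canonical(given_name: str) -> str:
    """Map a known nickname to its formal name; leave other names as they are."""
    key = given_name.lower()
    return NICKNAMES.get(key, key)


def synonym_match(a: str, b: str) -> bool:
    return canonical(a) == canonical(b)


def edit_match(a: str, b: str, max_distance: int = 1) -> bool:
    # Assumed threshold: at most one typo between the two spellings.
    return edit_distance(a.lower(), b.lower()) <= max_distance


def duplicate_groups(names, matchers):
    """Group names that are linked, directly or indirectly, by any matcher."""
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if any(matcher(a, b) for matcher in matchers):
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


given_names = ["Pegy", "Peggy", "Margaret"]
print(duplicate_groups(given_names, [synonym_match]))
print(duplicate_groups(given_names, [edit_match]))
print(duplicate_groups(given_names, [synonym_match, edit_match]))
```

Note that Pegy and Margaret never match each other directly under either method. They only end up in the same group because both are linked to Peggy, which is the same kind of chain as the one from Hamlet to Olaf.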