Data matching is about linking entities in databases that don’t have a common unique key and are not spelled exactly the same but are so similar, that we may consider them representing the same real world object.
When matching we may:
- Compare the original data rows using fuzzy logic techniques
- Standardize the data rows and then compare using traditional exact logic
As suggested in the title of this blog post a common problem with standardization is that this may have two (or more) outcomes just like this English word may be spelled in different ways depending on the culture.
Not at least when working with international data you feel this pain. In my recent social media engagement I had the pleasure of touching this subject (mostly in relation to party master data) on several occasions, including:
- In a comment to a recent post on this blog Graham Rhind says: Based just on the type of element and their positions in an address, there are at least 131 address formats covering the whole world, and around 40 personal name formats (I’m discovering more on an almost daily basis).
- Rich Murnane made a post with a fantastic video with Derek Sivers telling about that while we in many parts of the world have named streets with building number assigned according to sequential positions, in Japan you have named blocks between unnamed streets with building numbers assigned according to established sequence.
- In the Data Matching LinkedIn group Olga Maydanchik and I exchanged experiences on the problem that in American date format you write the month before the day in a date, while in European date format you write the day before the month.
In my work with international data I have often seen that determining what standard is used is depended on both:
- The culture of the real world entity that the data represents
- The culture of the person (organisation) that provided the data