Data matching is about linking entities in databases that don’t have a common unique key and are not spelled exactly the same, but are so similar that we may consider them to represent the same real-world object.
When matching we may:
- Compare the original data rows using fuzzy logic techniques
- Standardize the data rows and then compare using traditional exact logic
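To make the two options concrete, here is a minimal Python sketch. It assumes a tiny, made-up abbreviation table (`ABBREVIATIONS`) and uses the standard library's `difflib.SequenceMatcher` as a stand-in for whatever fuzzy technique you prefer; the function names are illustrative only.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}


def standardize(value: str) -> str:
    """Lower-case, drop dots and commas, and expand known abbreviations."""
    tokens = value.lower().replace(".", " ").replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def fuzzy_similarity(a: str, b: str) -> float:
    """Option 1: compare the original values; returns a ratio from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_after_standardizing(a: str, b: str) -> bool:
    """Option 2: standardize both values, then compare with exact logic."""
    return standardize(a) == standardize(b)


row_a = "12 Main St."
row_b = "12 Main Street"
print(fuzzy_similarity(row_a, row_b))           # a score below 1.0
print(exact_after_standardizing(row_a, row_b))  # True once "St." is expanded
```

The fuzzy route gives a score that needs a threshold; the standardized route gives a yes or no, but only for variations the table happens to know about.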
As the title of this blog post suggests, a common problem with standardization is that it may have two (or more) outcomes, just as the word itself may be spelled in different ways depending on the culture.
You feel this pain not least when working with international data. In my recent social media engagement I had the pleasure of touching on this subject (mostly in relation to party master data) on several occasions, including:
- In a comment to a recent post on this blog Graham Rhind says: Based just on the type of element and their positions in an address, there are at least 131 address formats covering the whole world, and around 40 personal name formats (I’m discovering more on an almost daily basis).
- Rich Murnane made a post with a fantastic video in which Derek Sivers explains that while many parts of the world have named streets with building numbers assigned by sequential position, in Japan you have named blocks between unnamed streets, with building numbers assigned in the order the buildings were established.
- In the Data Matching LinkedIn group Olga Maydanchik and I exchanged experiences on the problem that in American date format you write the month before the day in a date, while in European date format you write the day before the month.
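The date problem is easy to demonstrate. The sketch below, which assumes slash-separated dates with a four-digit year, tries both conventions using Python's standard `datetime.strptime` and returns every reading that is a valid date; the names `FORMATS` and `possible_dates` are made up for this example.

```python
from datetime import date, datetime

# The same written form under two conventions (standard strptime directives).
FORMATS = {"month_first": "%m/%d/%Y", "day_first": "%d/%m/%Y"}


def possible_dates(text: str) -> dict[str, date]:
    """Return every valid interpretation of a slash-separated date string."""
    readings = {}
    for name, fmt in FORMATS.items():
        try:
            readings[name] = datetime.strptime(text, fmt).date()
        except ValueError:
            pass  # not a valid date under this convention
    return readings


ambiguous = possible_dates("03/04/2010")    # valid under both conventions
unambiguous = possible_dates("25/12/2010")  # only valid as day first
print(ambiguous)
print(unambiguous)
```

When both readings survive, the value alone cannot tell you which standard was used; you need to know something about who entered it.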
In my work with international data I have often seen that determining which standard is used depends on both:
- The culture of the real world entity that the data represents
- The culture of the person (organisation) that provided the data
So the possible combinations of standards applied to a given data set depend on where the data is from, what elements it contains and who entered the data (which is often not recorded).
This is why I like to use both standardisation and standardization and fuzzy logic when selecting candidates and assigning similarity in data matching.
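One way to picture this combination, as a rough sketch rather than a description of any particular tool: standardize what you can, try an exact comparison, and fall back to a fuzzy score on the standardized values. The variant table and the 0.85 threshold are assumptions chosen for the example, and `difflib.SequenceMatcher` again stands in for a real fuzzy matcher.

```python
from difflib import SequenceMatcher

# Hypothetical table of known variants; real routines will never be complete.
KNOWN_VARIANTS = {"st": "street", "str": "street", "rd": "road"}


def standardize(value: str) -> str:
    """Lower-case, drop dots and commas, and map known variants."""
    tokens = value.lower().replace(".", " ").replace(",", " ").split()
    return " ".join(KNOWN_VARIANTS.get(token, token) for token in tokens)


def match(a: str, b: str, threshold: float = 0.85) -> tuple[str, float]:
    """Exact comparison on standardized values, with a fuzzy fallback."""
    std_a, std_b = standardize(a), standardize(b)
    if std_a == std_b:
        return ("exact", 1.0)
    score = SequenceMatcher(None, std_a, std_b).ratio()
    return ("fuzzy", score) if score >= threshold else ("no match", score)


print(match("12 Main St.", "12 Main Street"))   # variant is in the table
print(match("12 Main Strt", "12 Main Street"))  # variant is not in the table
print(match("45 Oak Rd", "12 Main Street"))     # different addresses
```

The second pair shows why the fuzzy step is still needed after standardization: "Strt" is not a variant the table knows.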
Comments from the Data Matching LinkedIn group:
Michael Ott says:
I believe the best way is a combination of both. Standardized data always provides higher-quality matches. Plus, you should capture and transform data at the original input so that it doesn’t continue to cause problems later. After the standardization step, you will still need to use logic that compares matches in both the straight and fuzzy categories, because the standardization routines may not contain all variations of a standard.
Sanjib Mallik says:
Let us not forget parsing. The more effective parsing is, the better standardization of data can be achieved.
I have been able to push further the value of parsing as a function by incorporating pattern recognition within parsing. By using the same types of logic that enable fuzzy logic matching, one can better identify name, address and other data elements and thus standardize this data better.
I have achieved a higher match rate in the back end just by adding more intelligence to parsing in the front end.
Jax (Jondarr) Gibb says:
Horses for courses. Standardising ‘known’ things makes perfect sense – street types, state abbreviations, etc. Fuzzy matching on unknown things is the only option – & fuzzy relative to the type of data, not a blanket use of phonix, for example. The fine line comes about when you try to cover more possibilities for ‘known’ things – & let the end user define them – & still allow for a fuzziness inherent in customer data to still be applied.
In some contexts, ‘Ct’ can be ‘Court’ or ‘Circuit’, or the US state of Connecticut (& I’m sure many other things).
Standardise in context.
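As an illustration of the ‘Ct’ example (a sketch added for clarity, not code from the commenter), a lookup keyed on both the element type and the token keeps the interpretations apart. The table contents and the function name are hypothetical, and the element type is assumed to come from an earlier parsing step.

```python
# Hypothetical lookup keyed by (element type, normalised token).
CONTEXT_TABLE = {
    ("street_type", "ct"): ["Court", "Circuit"],
    ("state", "ct"): ["Connecticut"],
}


def standardise_in_context(element_type: str, token: str) -> list[str]:
    """Return the candidate standard forms of a token for one element type.

    Unknown combinations are returned unchanged, so that a later fuzzy
    comparison can still work on the original value.
    """
    key = (element_type, token.strip().rstrip(".").lower())
    return CONTEXT_TABLE.get(key, [token])


print(standardise_in_context("state", "CT"))
print(standardise_in_context("street_type", "Ct."))
print(standardise_in_context("city", "Ct"))
```

Note that even within one element type the result may hold more than one candidate, which is exactly the case where standardization has two outcomes.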
I say:
Thanks, folks, for commenting. It seems we all agree that the trick is to combine standardization and parsing with fuzzy logic, and that it is not even a matter of first standardizing and parsing and then fuzzy matching: the best results come from a full combination.