Today is the first day of the new year – the Year of the Rooster, according to the lunar calendar observed in East Asia. One of the characteristics ascribed to the Year of the Rooster is that, in this year, people will tend to complicate things.
People usually like to keep things simple. The KISS principle – Keep It Simple, Stupid – has many fans. But not me. Not that I do not like to keep things simple. I do. But only as simple as it should be, as Einstein probably said. Sometimes KISS is the shortcut to getting it all wrong.
When working with data quality, I have come across the three examples below of striking the right balance between a bit of complication and too much simplicity:
One of the most frequent data quality issues around is duplicates in party master data: customer, supplier, patient, citizen, member and many other roles of legal entities and natural persons, where the same real-world entity is described more than once, with different values, in our databases.
In solving this challenge, we can use methods such as match codes and edit distance to detect duplicates. However, these methods, often called deterministic, are far too simple to truly automate the remedy. We can also use advanced probabilistic methods. These are better, but have the downside that the resulting matches are hard to explain, repeat and reuse in other contexts.
My best experience is to use something in between these approaches: not too simple and not too complicated.
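As a rough sketch of such a middle ground, one common pattern is to use a deterministic match code only for blocking (so we never compare every record against every other record), and then score the candidates within each block with a similarity measure. Everything below – the match code recipe, the field names and the threshold – is a made-up illustration, not a description of any particular product:

```python
from difflib import SequenceMatcher


def match_code(name, city):
    # Crude, hypothetical match code: name prefix plus city prefix.
    # Real match codes would normalize spelling, strip legal forms, etc.
    return (name[:4] + "|" + city[:3]).upper()


def similarity(a, b):
    # Simple string similarity in [0, 1]; stands in for edit distance.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_candidate_duplicates(records, threshold=0.85):
    # Step 1: deterministic blocking on the match code.
    blocks = {}
    for rec in records:
        blocks.setdefault(match_code(rec["name"], rec["city"]), []).append(rec)
    # Step 2: score pairs within each block and keep the likely duplicates.
    pairs = []
    for recs in blocks.values():
        for i in range(len(recs)):
            for j in range(i + 1, len(recs)):
                score = similarity(recs[i]["name"], recs[j]["name"])
                if score >= threshold:
                    pairs.append((recs[i]["id"], recs[j]["id"], round(score, 2)))
    return pairs
```

The blocking step keeps the approach explainable and repeatable (you can always say why two records were compared), while the scoring step catches spelling variations that a pure match-code comparison would miss.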
You can make a good algorithm to verify postal and visiting addresses in a database when the addresses come from one country. However, if you try the same algorithm on addresses from another country, it often fails miserably.
Making an algorithm for addresses from all over the world would be very complicated. I have not seen one yet that works.
My best experience is to accept the complication of having almost as many algorithms as there are countries on this planet.
Classification of products controls many of the data quality dimensions related to product master data. The most prominent example is completeness of product information. Whether your product information is complete depends on the classification of the product: some attributes will be mandatory for one product but make no sense at all for another product with a different classification.
If your product classification is too simple, your completeness measurement will not be realistic. An overly granular or otherwise complicated classification system is very hard to maintain and will probably seem like overkill for many purposes of product master data management.
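To make the completeness point concrete, here is a small sketch in which the set of required attributes depends on the product's classification. The class names and attributes are invented for illustration:

```python
# Hypothetical rules: which attributes are mandatory per classification.
REQUIRED_ATTRIBUTES = {
    "apparel":     {"size", "colour", "material"},
    "electronics": {"voltage", "wattage", "safety_marking"},
}


def completeness(product):
    # Measure completeness against the requirements of *this* product's
    # classification, not against one global attribute list.
    required = REQUIRED_ATTRIBUTES.get(product["classification"], set())
    if not required:
        return 1.0  # policy choice: unclassified products are not penalized
    filled = {k for k, v in product["attributes"].items() if v not in (None, "")}
    return len(required & filled) / len(required)
```

With one global attribute list instead, a shirt would be flagged as incomplete for missing a voltage, which is exactly the unrealistic measurement the text warns against.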
My best experience is that you have to maintain several classification systems and maintain the links between them, both inside your organization and with your trading partners.
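The linking between classification systems can start as nothing more than a crosswalk table. The codes below are made up; in practice the target side might be a standard such as UNSPSC or a partner's in-house scheme:

```python
# Hypothetical crosswalk from an internal classification to a
# trading partner's codes (all codes invented for illustration).
INTERNAL_TO_PARTNER = {
    "apparel/shirts":    "P-1001",
    "apparel/trousers":  "P-1002",
    "electronics/lamps": "P-2001",
}


def translate_class(internal_code):
    # Fail loudly on unmapped classes so gaps in the crosswalk surface
    # as data quality issues instead of silently wrong classifications.
    try:
        return INTERNAL_TO_PARTNER[internal_code]
    except KeyError:
        raise KeyError(f"No partner mapping for {internal_code!r}") from None
```

The deliberate design choice is that an unmapped class is an error, not a default: a silent fallback would hide exactly the maintenance gaps that make multi-system classification hard.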
Happy New Lunar Year
Another deduplication method I’m starting to see is social/third-party matching. Part of this is using the social handle as an identifier, very similar to the deterministic methods but with additional information. But in the right circumstances (health care comes immediately to mind), data can be sourced from a third party to determine whether Dr. X is the same as Dr. X’. Additionally, you can have your users vote on the duplicate outcome. Do the sales, marketing, billing and maintenance people in your organization think this is a duplicate? Do they think it is a duplicate, at least for their own purpose? Obviously, no one has the time to do this in a traditional B2C scenario. But in the right circumstances…
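The voting idea in the comment above can be sketched very simply: each business function records its own verdict on a candidate duplicate pair, and "is this a duplicate?" can then be answered per purpose or by majority. The function names and tie-breaking rule here are assumptions for illustration:

```python
from collections import Counter

def consensus(votes):
    # votes: e.g. {"sales": True, "billing": False}
    # Majority rules; on a tie, the pair is NOT treated as a duplicate
    # (an assumed, conservative policy for this sketch).
    tally = Counter(votes.values())
    return tally[True] > tally[False]


def is_duplicate_for(votes, purpose):
    # A function with its own vote keeps its own verdict;
    # everyone else falls back to the consensus.
    return votes.get(purpose, consensus(votes))
```

This captures the comment's point that the same pair can legitimately be a duplicate for marketing but two distinct records for billing.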