One of the ways to ensure data quality for customer – or rather party – master data when operating in a business-to-business (B2B) environment is to on-board new entries using an externally defined business entity identifier.
By doing that, you tackle some of the most challenging data quality dimensions, such as:
- Uniqueness, by checking if a business with that identifier already exists in your internal master data. This approach is superior to using data matching, as explained in the post The Good, Better and Best Way of Avoiding Duplicates.
- Accuracy, by having names, addresses and other information defaulted from a business directory, thus avoiding the spelling mistakes that are usually all over party master data.
- Conformity, by inheriting additional data, such as line-of-business codes and descriptions, from a business directory.
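The on-boarding flow described above can be sketched in a few lines. This is a minimal illustration, not a real master data hub: the in-memory `internal_master_data` store and the `directory` lookup are hypothetical stand-ins for your MDM platform and an external business directory service.

```python
# Hypothetical internal master data store, keyed by business entity identifier.
internal_master_data = {
    "123456789": {"name": "Acme Corp", "city": "Copenhagen"},
}

# Hypothetical external business directory (e.g. a directory provider's API).
directory = {
    "123456789": {"name": "Acme Corp", "city": "Copenhagen"},
    "987654321": {"name": "Globex Ltd", "city": "London"},
}

def onboard(identifier: str) -> dict:
    """On-board a party using an externally defined identifier."""
    # Uniqueness: refuse to create a second record for the same entity.
    if identifier in internal_master_data:
        raise ValueError(f"Entity {identifier} already exists")
    # Accuracy and conformity: default names, addresses and other
    # attributes from the directory instead of retyping them.
    record = directory.get(identifier)
    if record is None:
        raise KeyError(f"Identifier {identifier} not found in directory")
    internal_master_data[identifier] = dict(record)
    return internal_master_data[identifier]
```

The point of the sketch is the order of operations: the identifier gates uniqueness first, and only then are attributes defaulted from the directory.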
Having an external business identifier stored with your party master data helps a lot with maintaining data quality as pondered in the post Ongoing Data Maintenance.
When selecting an identifier there are different options, such as national IDs, the LEI, the DUNS Number and others, as explained in the post Business Entity Identifiers.
At the Product Data Lake service I am working on right now, we have decided to use an external business identifier from day one. I know this may be something a typical start-up would consider much later, if and when the party master data population has grown. But, besides being optimistic about our service, I think it will be a win not to have to fight data quality issues later at guaranteed increased costs.
As the identifier we have chosen the DUNS Number from Dun & Bradstreet. The reason is that this is currently the only business identifier with worldwide coverage. Also, Dun & Bradstreet offers some additional data that fits our business model, including consistent line-of-business information and worldwide company family trees.
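A DUNS Number is a nine-digit identifier, so a simple syntactic screen can catch obvious entry mistakes before a record is created. This is only a format check, a sketch of the idea: verifying that a number is actually assigned to the entity in question requires a lookup against Dun & Bradstreet.

```python
import re

def looks_like_duns(value: str) -> bool:
    """Syntactic check only: a DUNS Number is nine digits.
    Hyphens and spaces in common presentations such as 12-345-6789
    are stripped before checking. Whether the number is actually
    assigned can only be settled by a Dun & Bradstreet lookup."""
    digits = value.replace("-", "").replace(" ", "")
    return bool(re.fullmatch(r"\d{9}", digits))
```

A check like this belongs at the point of entry, so a mistyped identifier never reaches the master data store in the first place.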
The other day Joy Medved, aka @ParaDataGeek, posted this tweet:
Indeed, upstream prevention of bad data entering our databases is surely better than downstream data cleaning. Likewise, real-time enrichment is better than enriching long after the data has been put to work.
That said, there are situations where data cleaning has to be done. These reasons were examined in the post Top 5 Reasons for Downstream Cleansing. But I can’t think of many situations where a downstream cleaning and/or enrichment operation will be worth much if it isn’t followed up by an approach to getting it first time right in the future.
If we go a level deeper into the data quality challenges, different data quality dimensions have different importance to the various data domains, as explored in the post Multi-Domain MDM and Data Quality Dimensions.
With customer master data we most often have issues with uniqueness and location precision. While I have spent many happy years with data cleansing, data enrichment and data matching tools, I have during the last couple of years been focusing on a tool for getting it first time right.
Product master data are often marred by issues with completeness and (location) conformity. The situation here is that tools and platforms for mastering product data are focused on what goes on inside a given organization and not so much on what goes on between trading partners. Standardization seems to be the only hope. But that path is too long to wait for, and may in some ways contradict the end purpose, as discussed in the post Image Coming Soon.
So in order to have a first time right solution for product master data sharing, I have embarked on a journey with a service called the Product Data Lake. If you want to join, you are most welcome.
PS: The Product Data Lake also has the capability of catching up with the sins of the past.
The term evergreen is known from botany, where plants stay green all year, and from music, where some songs are not just a hit for a few months but capable of generating royalties for years and years.
Data should also stay evergreen. I am a believer in the “first time right” principle as explained in the post instant Single Customer View. However, you must also keep your data quality fresh as examined in the post Ongoing Data Maintenance.
If we look at customer, or rather party, Master Data Management (MDM), it is much about real-world alignment. In party master data management you describe entities, such as persons and legal entities, in the real world, and you should have descriptions that reflect the current state (and sometimes historical states) of these entities. Some reflections on this are in the post The Relocation Event. And as even evergreen trees go away, and “My Way” hopefully will go away someday, you must also be able to perform Undertaking in MDM.
With product MDM it is much about data being fit for multiple future purposes of use as reported in the post Customer Friendly Product Master Data.
One of the cleverest things ever said is, in my eyes, Parkinson’s Law, which states: “Work expands so as to fill the time available for its completion”.
There is even a variant for data that says: “Data expands to fill the space available for storage”. This is why we have big data today.
Another similar law that seems to be true is Murphy’s Law saying: “Anything that can go wrong will go wrong”. The sharper version of that is Finagle’s Law that warns: “Anything that can go wrong, will—at the worst possible moment”.
When I started working with data quality, the most common trigger for data quality improvement initiatives was a perfect storm encompassing these laws, as in: “The quality of data will decrease until everything goes wrong at the worst possible moment”.
Fortunately, more and more organizations are becoming proactive about data quality these days. In doing that, I recommend reversing Finagle, Murphy and Parkinson by doing this:
A recent infographic prepared by Trillium Software highlights a fact about data quality I personally have been preaching about a lot:
This number is (roughly) sourced from a study by Wayne W. Eckerson of The Data Warehouse Institute from 2002:
So, in the fight against bad data quality, a good place to start is helping data entry personnel do it right the first time.
One way of achieving that is to cut down on the data being entered. This may be done by picking the data from sources already available out there instead of retyping things and making those annoying flaws.
If we look at the two most prominent master data domains, some ideas will be:
- In the product domain I have seen my share of product descriptions and specifications being re-entered while flowing down the supply chain of manufacturers, distributors, re-sellers, retailers and end users. Better batch interfaces with data quality controls are one way of coping with that. Social collaboration is another, as told in the post Social PIM.
- In the customer, or rather party, domain we have seen an uptake of using address validation. That is good. However, it is not good enough as discussed in the post Beyond Address Validation.
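For the product domain, a batch interface with data quality controls can be as simple as refusing to pass incomplete records silently down the chain. The sketch below illustrates the idea; the required fields and the accept/reject rule are my own illustrative assumptions, not a specific standard.

```python
# Illustrative required attributes for an incoming product record.
# A real interface would take these from an agreed data contract.
REQUIRED_FIELDS = {"sku", "description", "unit_of_measure"}

def validate_batch(records):
    """Split an incoming batch into accepted records and rejects,
    instead of silently re-entering incomplete data downstream.
    Each reject is paired with the list of missing fields, so the
    sender can fix the data at the source."""
    accepted, rejected = [], []
    for record in records:
        # A field counts as missing when absent or empty.
        present = {k for k, v in record.items() if v}
        missing = REQUIRED_FIELDS - present
        if missing:
            rejected.append((record, sorted(missing)))
        else:
            accepted.append(record)
    return accepted, rejected
```

Reporting the missing fields back to the trading partner is the point: it pushes the correction upstream to where the data originates.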
Checking whether an eMail address will bounce is essential for executing and measuring campaigns, newsletter operations and other activities based on sending eMails, as explained on the site Don’t Bounce by BriteVerify.
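The cheapest layer of such a check can run right at data entry: a syntactic screen that catches obvious typos before the address is stored. To be clear, this sketch only tests plausibility; whether an address will actually bounce can only be settled by a verification service or a real send.

```python
import re

# Deliberately loose pattern: something before the @, something after,
# and a dot-separated top-level part of at least two letters.
# Stricter patterns tend to reject legitimate addresses.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[a-zA-Z]{2,}$")

def plausible_email(address: str) -> bool:
    """Return True if the address at least looks like an eMail address."""
    return bool(EMAIL_PATTERN.match(address.strip()))
```

A failed check here should trigger a prompt to the person entering the data, which is exactly the first-time-right moment.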
A good principle within data quality prevention and Master Data Management (MDM) is the first time right approach. There is a 1-10-100 rule saying:
“One dollar spent on prevention will save 10 dollars on correction and 100 dollars on failure costs”.
(Replace dollars with your favorite currency: Euros, pounds, rubles, rupees, whatever.)
This also applies to capturing an eMail address of a (prospective) customer or other business partner. Many business processes today require communication through eMails in order to save costs and speed up processes. If you register an invalid eMail address, or allow self-registration of an invalid eMail address, you have got yourself some costly scrap and rework, or maybe a lost opportunity.
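To make the 1-10-100 rule concrete, here is the arithmetic with made-up volumes. The rule itself only fixes the 1:10:100 ratio; the record count and error rate below are purely illustrative assumptions.

```python
# Assumed volumes, for illustration only.
records_per_month = 1_000
error_rate = 0.05  # assumed share of eMail addresses captured wrongly

bad_records = records_per_month * error_rate

prevention_cost = bad_records * 1    # caught at entry (first time right)
correction_cost = bad_records * 10   # fixed downstream
failure_cost = bad_records * 100     # never fixed; scrap, rework, lost deals
```

With these assumptions, 50 units spent on prevention stand against 500 on correction or 5,000 in failure costs, which is the whole argument for first time right in one line of arithmetic.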
As a natural consequence, the instant Data Quality MDM Edition, besides ensuring right names and correct postal addresses, also checks for valid eMail addresses.
As reported in the post Crap, Damned Crap, and Big Data there are data quality issues with big data.
The mentioned issue is about the use of quotes in social data: A famous person apparently said something apparently clever, and the one who posts an update with the quote gets an unusually large number of likes, retweets, +1s and other forms of recognition.
But many quotes weren’t actually said by that famous person. Maybe it was said by someone else, and in many cases there is no evidence that the famous person said it at all. Some quotes, like the Einstein quote in the Crap post, actually contradict what the person apparently also said.
As I have worked a lot with data entry functionality that checks data quality, such as whether a certain address actually exists, whether a typed-in phone number is valid, or whether an eMail address will bounce, I think it’s time to make a quote checker to be plugged in on LinkedIn, Twitter, Facebook, Google Plus and other social networks.
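The core of such a quote checker would be the same pattern as address or eMail validation: look the entered value up against an authoritative source. The sketch below is a toy version; the `verified_attributions` register is a hypothetical stand-in for a curated quote verification source, and its single entry is just the Parkinson quote from above.

```python
# Hypothetical register of verified quote attributions.
# A real plug-in would query a curated quote verification source.
verified_attributions = {
    "work expands so as to fill the time available for its completion":
        "C. Northcote Parkinson",
}

def check_quote(quote: str, claimed_author: str) -> str:
    """Classify a posted quote as confirmed, misattributed or unverified."""
    key = quote.strip().strip('"').lower()
    actual = verified_attributions.get(key)
    if actual is None:
        return "unverified"
    if actual == claimed_author:
        return "confirmed"
    return f"misattributed (actually {actual})"
```

The "unverified" outcome matters as much as the other two: a quote with no evidence behind it is exactly the case the Crap post complained about.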
So anyone else out there who wants to join the project – or has it already been said by someone else?