One of the ways to ensure data quality for customer – or rather party – master data when operating in a business-to-business (B2B) environment is to onboard new entries using an externally defined business entity identifier.
By doing that, you tackle some of the most challenging data quality dimensions, such as:
Accuracy, by having names, addresses and other information defaulted from a business directory, thus avoiding the spelling mistakes that are usually all over party master data.
Conformity, by inheriting additional data such as line-of-business codes and descriptions from a business directory.
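As a minimal sketch of that onboarding idea: a lookup keyed by the external identifier defaults the name, address and line-of-business fields instead of having them retyped. The directory dictionary and field names here are illustrative assumptions; a real implementation would call a business directory provider's API.

```python
# Hypothetical sketch: default party master data from a business directory
# keyed by an externally defined identifier such as a DUNS Number.
# SAMPLE_DIRECTORY stands in for a real provider lookup.

SAMPLE_DIRECTORY = {
    "123456789": {
        "name": "Example Trading Ltd",
        "address": "1 High Street, London",
        "line_of_business": "Wholesale of household goods",
    }
}

def onboard_party(duns_number: str) -> dict:
    """Create a party record with fields defaulted from the directory entry."""
    entry = SAMPLE_DIRECTORY.get(duns_number)
    if entry is None:
        raise ValueError(f"Unknown identifier: {duns_number}")
    return {"duns": duns_number, **entry}

record = onboard_party("123456789")
```

The point is that the user only supplies the identifier; accuracy and conformity come from the directory, not from keystrokes.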
Having an external business identifier stored with your party master data helps a lot with maintaining data quality as pondered in the post Ongoing Data Maintenance.
When selecting an identifier there are different options, such as national IDs, the LEI, the DUNS Number and others, as explained in the post Business Entity Identifiers.
At the Product Data Lake service I am working on right now, we have decided to use an external business identifier from day one. I know this may be something a typical start-up would consider much later, if and when the party master data population has grown. But, besides being optimistic about our service, I think it will be a win not to have to fight data quality issues later at guaranteed increased cost.
As the identifier we have chosen the DUNS Number from Dun & Bradstreet. The reason is that it is currently the only business identifier with worldwide coverage. Also, Dun & Bradstreet offers some additional data that fits our business model, including consistent line-of-business information and worldwide company family trees.
Indeed, upstream prevention of bad data entering our databases is surely better than downstream data cleaning. Likewise, real-time enrichment is better than enriching long after the data has been put to work.
That said, there are situations where data cleaning has to be done. The reasons were examined in the post Top 5 Reasons for Downstream Cleansing. But I can't think of many situations where a downstream cleaning and/or enrichment operation will be worth much if it isn't followed up by an approach to getting it first time right in the future.
If we go a level deeper into the data quality challenges, different data quality dimensions have different importance in the various data domains, as explored in the post Multi-Domain MDM and Data Quality Dimensions.
With customer master data we most often have issues with uniqueness and location precision. While I have spent many happy years with data cleansing, data enrichment and data matching tools, I have during the last couple of years been focusing on a tool for getting it first time right.
Product master data are often marred by issues with completeness and (location) conformity. The situation here is that tools and platforms for mastering product data are focused on what goes on inside a given organization and not so much on what goes on between trading partners. Standardization seems to be the only hope, but that path is too long to wait for and may in some ways contradict the end purpose, as discussed in the post Image Coming Soon.
So in order to have a first time right solution for product master data sharing, I have embarked on a journey with a service called the Product Data Lake. If you want to join, you are most welcome.
PS: The Product Data Lake also has the capability of catching up with the sins of the past.
If we look at customer, or rather party, Master Data Management (MDM), it is much about real world alignment. In party master data management you describe entities such as persons and legal entities in the real world, and you should have descriptions that reflect the current state (and sometimes historical states) of these entities. One such reflection is The Relocation Event. And as even evergreen trees go away, and “My Way” hopefully will go away someday, you must also be able to perform Undertaking in MDM.
One of the cleverest things ever said is, in my eyes, Parkinson's Law, which states: “Work expands so as to fill the time available for its completion”.
There is even a variant for data that says: “Data expands to fill the space available for storage”. This is why we have big data today.
Another similar law that seems to be true is Murphy’s Law saying: “Anything that can go wrong will go wrong”. The sharper version of that is Finagle’s Law that warns: “Anything that can go wrong, will—at the worst possible moment”.
When I started working with data quality, the most common trigger for a data quality improvement initiative was a perfect storm encompassing these laws, as if saying: “The quality of data will decrease until everything goes wrong at the worst possible moment”.
Fortunately, more and more organizations are becoming proactive about data quality these days. In doing that, I recommend reversing Finagle, Murphy and Parkinson by doing this:
Prevent bad data quality at the best possible moment, being at data capture, as told in the post instant Data Quality
Maintain optimal data quality whenever you can, as examined in the post Last Time Right
So, in the fight against bad data quality, a good place to start will be helping data entry personnel doing it right the first time.
One way of achieving that is to cut down on the data being entered. This may be done by picking the data from sources already available out there instead of retyping things and making those annoying flaws.
If we look at the two most prominent master data domains, some ideas will be:
In the product domain I have seen my share of product descriptions and specifications being reentered while flowing down the supply chain of manufacturers, distributors, resellers, retailers and end users. Better batch interfaces with data quality controls are one way of coping with that. Social collaboration is another, as told in the post Social PIM.
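A batch interface with a data quality control can be sketched very simply: reject incoming product records that fail a completeness check instead of letting them be retyped downstream. The field names and the rule are illustrative assumptions only.

```python
# Sketch of a batch interface control: incoming product records are
# checked for completeness before loading. Field names are illustrative.

REQUIRED_FIELDS = ("sku", "description", "unit_of_measure")

def validate_batch(records):
    """Split a batch into accepted records and rejected ones with reasons."""
    accepted, rejected = [], []
    for rec in records:
        missing = [f for f in REQUIRED_FIELDS if not rec.get(f)]
        if missing:
            rejected.append((rec, missing))   # kept for follow-up with the sender
        else:
            accepted.append(rec)
    return accepted, rejected

batch = [
    {"sku": "A-100", "description": "Hammer, 500 g", "unit_of_measure": "EA"},
    {"sku": "A-101", "description": ""},  # incomplete: fails the control
]
accepted, rejected = validate_batch(batch)
```

Keeping the rejects, with the reason attached, is what makes the control actionable for the upstream trading partner.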
In the customer, or rather party, domain we have seen an uptake of using address validation. That is good. However, it is not good enough as discussed in the post Beyond Address Validation.
Checking whether an eMail address will bounce is essential for executing and measuring campaigns, newsletter operations and other activities based on sending eMails, as explained on the site Don't Bounce by BriteVerify.
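Note that a true bounce check, like the services mentioned above perform, requires talking to the receiving mail system. What you can do cheaply at data entry is the syntactic part, sketched here with a deliberately simple pattern (real address grammar is far more permissive):

```python
# Syntax-level eMail check only: this catches obvious typos at data entry,
# but cannot tell whether the mailbox actually exists and accepts mail.
import re

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_like_email(address: str) -> bool:
    """Cheap local check: one @, no whitespace, a dot in the domain part."""
    return bool(EMAIL_PATTERN.match(address))

looks_like_email("jensen@example.com")  # True
looks_like_email("jensen@@example")     # False
```

Treat this as a first filter before the (paid, slower) verification step, not as a replacement for it.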
A good principle within data quality prevention and Master Data Management (MDM) is the first time right approach. There is a 1-10-100 rule saying:
“One dollar spent on prevention will save 10 dollars on correction and 100 dollars on failure costs”.
(Replace dollars with your favorite currency: Euros, pounds, rubles, rupees, whatever.)
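The arithmetic behind the rule is worth spelling out once. With an assumed (illustrative) error rate, preventing at capture beats paying failure costs by a wide margin:

```python
# The 1-10-100 rule as simple arithmetic. The batch size and error rate
# are illustrative assumptions, not measured figures.

records = 10_000
error_rate = 0.05  # assumed share of records captured with a flaw

prevention_cost = records * 1                  # 1 unit per record, upstream
correction_cost = records * error_rate * 10    # 10 units per bad record, later
failure_cost = records * error_rate * 100      # 100 units per bad record, downstream

print(prevention_cost, correction_cost, failure_cost)
```

Even with a modest error rate, the failure-cost line dwarfs the prevention line, which is the whole argument for first time right.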
This also applies to capturing the eMail address of a (prospective) customer or other business partner. Many business processes today require communication through eMails in order to save costs and speed up processes. If you register an invalid eMail address, or allow self-registration of an invalid eMail address, you have got yourself some costly scrap and rework, or maybe a lost opportunity.
The mentioned issue is about the use of quotes in social data: a famous person apparently said something apparently clever, and whoever makes an update with the quote gets an unusually large number of likes, retweets, +1s and other forms of recognition.
But many quotes weren't actually said by that famous person. Maybe it was said by someone else, and in many cases there is no evidence that the famous person said it at all. Some quotes, like the Einstein quote in the Crap post, actually contradict what the person apparently also said.
As I have worked a lot with data entry functionality checking for data quality – whether a certain address actually exists, whether a typed-in phone number is valid or whether an eMail address will bounce – I think it's time to make a quote checker that can be plugged into LinkedIn, Twitter, Facebook, Google Plus and other social networks.
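Such a quote checker would, just like address validation, boil down to a lookup against a trusted reference source. As a toy sketch, with an entirely hypothetical register of verified attributions:

```python
# Toy sketch of the proposed quote checker: compare a posted quote against
# a (hypothetical) register of verified attributions. A real plug-in would
# need a curated reference source behind it, which is the hard part.

VERIFIED_QUOTES = {
    "work expands so as to fill the time available for its completion":
        "C. Northcote Parkinson",
}

def check_attribution(quote: str, claimed_author: str) -> str:
    actual = VERIFIED_QUOTES.get(quote.strip().lower())
    if actual is None:
        return "no evidence either way"
    return "verified" if actual == claimed_author else f"actually {actual}"
```

As with party data, the checker is only as good as its reference data, which is exactly the problem with quotes in social data today.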
So anyone else out there who wants to join the project – or has it already been said by someone else?
Having duplicates in databases is the most prominent data quality issue around, and not least duplicates in party master data are often pain number one when assessing the impact of data quality flaws.
A duplicate in the data quality sense is two or more records that don't consist of exactly the same characters but refer to the same real world entity. I have worked with these three different approaches to when to fix the duplicate problem:
Downstream data matching
Real time duplicate check
Search and mash-up of internal and external data
Downstream Data Matching
The good old way of dealing with duplicates in databases is having data matching engines periodically scan through databases highlighting the possible duplicates in order to facilitate merge/purge processes.
Finding the duplicates after they have lived their own lives in databases and have already attached different kinds of transactions is indeed not optimal, but sometimes it's the only option, as explained in the post Top 5 Reasons for Downstream Cleansing.
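The core of such a periodic scan can be sketched with a pairwise fuzzy comparison. Real matching engines add standardization, blocking and multiple elements; the sample records and the 0.85 threshold here are illustrative assumptions.

```python
# Minimal sketch of a downstream matching scan: compare all record pairs
# with a fuzzy similarity and flag likely duplicates for merge/purge review.
from difflib import SequenceMatcher
from itertools import combinations

records = [
    (1, "Jensen Trading A/S, Copenhagen"),
    (2, "JENSEN TRADING AS, Copenhagen"),
    (3, "Sørensen Byg ApS, Aarhus"),
]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

duplicates = [
    (id_a, id_b)
    for (id_a, name_a), (id_b, name_b) in combinations(records, 2)
    if similarity(name_a, name_b) > 0.85
]
```

Note that this brute-force pairing grows quadratically; production engines use blocking keys to keep the candidate pairs manageable.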
Real Time Duplicate Check
The better way is to make the match at data entry where possible. This approach is often orchestrated as a data entry process where each element, or range of elements, is checked when entered. For example, the address may be checked against reference data, and a phone number may be checked for an adequate format for the country in question. Finally, when a properly standardized record is submitted, it is checked whether a possible duplicate exists in the database.
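That orchestration can be sketched as two small checks: an element-level format check during entry, and a fuzzy duplicate search on submit. The phone rule, the sample database and the threshold are illustrative assumptions for one country.

```python
# Sketch of the entry-time orchestration: element checks while typing,
# duplicate search on submit. Rules and threshold are illustrative.
from difflib import SequenceMatcher
import re

existing = ["Nielsen Consult ApS"]  # stands in for the party database

def valid_danish_phone(number: str) -> bool:
    # Illustrative rule only: 8 digits, optional +45 prefix.
    return bool(re.fullmatch(r"(\+45)?\d{8}", number.replace(" ", "")))

def possible_duplicates(name: str, threshold: float = 0.85):
    """Fuzzy search run when the standardized record is submitted."""
    return [e for e in existing
            if SequenceMatcher(None, name.lower(), e.lower()).ratio() >= threshold]

valid_danish_phone("+45 33 12 34 56")       # passes the format check
possible_duplicates("NIELSEN CONSULT APS")  # flags the existing record
```

The important design point is the ordering: standardize and validate the elements first, so the duplicate search runs on comparable values.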
Search and Mash-Up of Internal and External Data
The best way is, in my eyes, a process that avoids entering most of the data that is already in the internal databases, and that takes advantage of data already existing on the internet in external reference data sources.
The instant Data Quality concept I currently work with requires the user to enter as little data as possible, for example through rapid address entry, a Google-like search for a name, simply typing a national identification number or, in the worst case, combining some known facts. After that, the system makes a series of fuzzy searches in internal and external databases and presents the results as a compact mash-up.
The advantages are:
If the real world entity already exists, you avoid the duplicate and avoid entering data again. You may at the same time evaluate accuracy against external reference data.
If the real world entity doesn't exist in internal data, you may pick most of the data from external sources, thereby avoiding typing too much and at the same time ensuring accuracy.
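The search-and-mash-up step itself can be sketched as one fuzzy search over merged internal and external candidates, with internal hits taking precedence so the user sees existing records first. Both sources and the cutoff here are hypothetical.

```python
# Sketch of the mash-up: one minimal input drives a fuzzy search across
# internal and external sources; results are merged into one compact list.
from difflib import get_close_matches

internal = {"Jensen Trading A/S": "internal"}
external = {"Jensen Trading A/S": "external", "Jenson Tools Ltd": "external"}

def mash_up(search_term: str):
    # Merge sources; the internal entry wins where the same name exists
    # in both, so existing records are surfaced rather than re-created.
    candidates = {**external, **internal}
    names = get_close_matches(search_term, candidates, n=5, cutoff=0.6)
    return [(name, candidates[name]) for name in names]

mash_up("Jensen Trading")  # best match first, tagged with its source
```

Tagging each hit with its source is what lets the user decide between "pick the existing record" and "default a new record from external data".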
A variant of the saying “Know Your Customer” for a football club will be “Know Your Fan” and indeed fans are customers when they buy tickets. If they can.
FC Copenhagen cruised into stormy waters when they apparently cancelled all purchases for the upcoming Champions League (the paramount European club soccer tournament) clashes against Real Madrid, Juventus and Galatasaray if the purchasers didn't have a Danish-sounding name. The reason was to prevent mixing fans of the different clubs, but surely this poorly thought-out screening method wasn't received well among the FC Copenhagen fans not called Jensen, Nielsen or Sørensen.
Most data matching activities going on are related to matching customer, or rather party, master data.
In today's business world we see data matching related to party master data in these three different channel types:
Offline is the good old channel type where we have the mother of all business cases for data matching: avoiding unnecessary costs by sending the same material with the postman twice (or more) to the same recipient.
Online has been around for some time. While the cost of sending the same digital message to the same recipient may not be a big problem, there are still some other factors to be considered, like:
Duplicate digital messages to the same recipient look like spam (even if the recipient provided the different eMail addresses himself or herself).
You can't measure a true response rate.
Social is the new channel type for data matching. Most business cases for data matching related to social network profiles are probably based on multi-channel issues.
The concept of having a single customer view, or rather single party view, involves matching identities across offline, online and social channels, and the typical elements used for data matching are not entirely the same for those channels, as seen in the figure to the right.
Most data matching procedures are, in my experience, quite simple, with only a few data elements and no history track taken into consideration. However, we do see more sophisticated data matching environments, often referred to as identity resolution, where historical data, more data elements and even unstructured data are taken into consideration.
When doing multi-channel data matching you can't avoid going from the popular simple data matching environments to more identity-resolution-like environments.
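The channel differences can be made concrete as a map of which elements typically drive the match in each channel. The element lists below are illustrative assumptions, not a definitive scheme:

```python
# Illustrative map of typical matching elements per channel type. The thin
# overlap between channels is what pushes multi-channel matching towards
# identity resolution rather than simple element comparison.

MATCH_ELEMENTS = {
    "offline": ["name", "postal address"],
    "online":  ["name", "eMail address", "phone number"],
    "social":  ["name", "profile handle", "network URL"],
}

def shared_elements(channel_a: str, channel_b: str):
    return sorted(set(MATCH_ELEMENTS[channel_a]) & set(MATCH_ELEMENTS[channel_b]))

shared_elements("offline", "social")  # only the name overlaps
```

When the only shared element across two channels is a name, you need history and context to resolve identity, which is the point made above.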
Some advice for getting it right without too much complication:
Emphasize data capture by getting it right the first time. It helps a lot.
Get your data models right. Here reflecting the real world helps a lot.
Don’t reinvent the wheel. There are services for this out here. They help a lot.