A year ago I wrote a blog post about data matching published on the Informatica Perspective blog. The post was called Five Future Data Matching Trends.
One of the trends mentioned is hierarchical data matching.
The reason we need what may be called hierarchical data matching is that more and more organizations are looking into master data management, and then they realize that the classic name and address matching rules do not necessarily fit when party master data are going to be used for multiple purposes. What constitutes a duplicate in one context, like sending a direct mail, doesn’t necessarily make a duplicate in another business function, and vice versa. Duplicates come in hierarchies.
One example is a household. You probably don’t want to send two sets of the same material to a household, but you might want to engage in a 1-to-1 dialogue with the individual members. Another example is that you might do some very different kinds of business with the same legal entity. For financial risk management the counterpart is the same, but different sales or purchase processes may require very different views.
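The household example can be sketched as a simple grouping step that keeps the individual members intact. This is a minimal, hypothetical Python sketch: the record fields and the naive address normalization are assumptions, and a real solution would use proper postal address standardization before grouping:

```python
from collections import defaultdict

def household_key(record):
    # Naive normalization for illustration; real matching would use
    # standardized postal addresses, not raw strings.
    return (record["street"].strip().lower(), record["zip"])

def build_households(records):
    # Group individuals into households without merging them away,
    # so the hierarchy keeps both levels available.
    households = defaultdict(list)
    for record in records:
        households[household_key(record)].append(record)
    return households

customers = [
    {"name": "Ann Smith", "street": "1 Main St", "zip": "1000"},
    {"name": "Bob Smith", "street": "1 Main St ", "zip": "1000"},
    {"name": "Carl Jones", "street": "9 Oak Rd", "zip": "2000"},
]

households = build_households(customers)
```

A mailing process would address each household once, while a 1-to-1 dialogue process would still see the individual members inside each group.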
I usually divide a data matching process into three main steps:
- Candidate selection
- Match scoring
- Match destination
(More information on the page: The Art of Data Matching)
Hierarchical data matching is mostly about the last step where we apply survivorship rules and execute business rules on whether to purge, merge, split or link records.
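The three steps can be illustrated with a deliberately simplified Python sketch. The blocking key, the thresholds and the name comparator are all assumptions for illustration; real data matching tools use far more sophisticated comparators and rules:

```python
from difflib import SequenceMatcher

def select_candidates(records):
    # Step 1: candidate selection - block on zip code so only
    # plausible pairs go forward to scoring (hypothetical blocking key).
    return [(a, b)
            for i, a in enumerate(records)
            for b in records[i + 1:]
            if a["zip"] == b["zip"]]

def match_score(a, b):
    # Step 2: match scoring - a simple stand-in similarity; real tools
    # use specialized name and address comparators.
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def match_destination(score, upper=0.85, lower=0.6):
    # Step 3: match destination - the business rule deciding whether
    # to merge, link in a hierarchy, or keep the records apart.
    if score >= upper:
        return "merge"
    if score >= lower:
        return "link"
    return "keep"

records = [
    {"name": "John Smith", "zip": "1000"},
    {"name": "Jon Smith", "zip": "1000"},
    {"name": "John Smith", "zip": "9000"},
]
pairs = select_candidates(records)  # only the two records sharing zip 1000
```

The match destination step is where hierarchical rules come into play: a pair scoring in the middle band is linked rather than merged.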
In my experience there are a lot of data matching tools out there capable of handling candidate selection, match scoring, purging records and, to some degree, merging records. But solutions are sparse when it comes to more sophisticated things like splitting an original entity into two or more entities, for example by Splitting Names, or linking records in hierarchies in order to build a Hierarchical Single Source of Truth.
Right now I’m working on a cloud service called instant Data Quality (iDQ™).
It is basically a very advanced search engine that can be integrated into business processes in order to get data quality right the first time while reducing the time needed for looking up and entering contact data.
With iDQ™ you are able to look up what is known about a given address, company and individual person in external sources (I call these big reference data) and what is already known in internal master data.
From a data quality point of view this mashup helps with solving some of the core data quality issues almost every organization has to deal with, namely:
- Avoiding duplicates
- Getting data as complete as possible
- Ensuring maximal accuracy
The mashup is also a very good foundation for making real-time decisions about master data survivorship.
The iDQ™ service helps with getting data quality right the first time. However, you also need Ongoing Data Maintenance in order to keep data at a high quality. Therefore iDQ™ is built to tie into subscription services for external reference data.
At iDQ we are looking for partners worldwide who see the benefit of connecting such a cloud-based master data service to providing business-to-business (B2B) and/or business-to-consumer (B2C) data services, data quality services and master data management solutions.
Here’s the contact data: http://instantdq.com/contact/
Tonight the Eurovision Song Contest final will be watched by over 100 million people, despite the fact that most people agree that the songs aren’t that good.
The winner will be selected by summing up an equal number of votes from each country. Usually there are big differences in how countries vote. A trend is that some neighboring groups of countries like to vote for each other. Such groups include a “Balkan Block” and a “Viking Empire”.
It’s a bit like survivorship when merging matched data rows into a golden record in an enterprise master data hub. Maybe the winning data isn’t that good and several departments probably don’t like it at all.
So I see no reason why Denmark shouldn’t win tonight.
When working with data quality and master data management at the same time, you are constantly met with the challenge that data quality is most often defined as data being fit for the purpose of use, while master data management is about using the same data for multiple purposes at the same time.
Finding the right solution to such a challenge within an organization isn’t easy, because, despite all good intentions, it is difficult to find someone in the business with an overall answer to that kind of problem, as explained in the blog post by David Loshin called Communications Gap? Or is there a Gap between Chasms?
An often-used principle for overcoming these issues may be seen as “survival of the fittest”. You negotiate some survivorship rules between “competing” data providers and consumers, and then the data that is the fittest as measured by these rules wins. All other data gets the KISS of death. Most such survivorship rules are indeed simple, often based on a single dimension such as timeliness, completeness or provenance.
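Such single-dimension survivorship rules can be sketched like this (a hypothetical Python sketch; the record layout and the completeness measure are assumptions for illustration):

```python
from datetime import date

def completeness(record):
    # Share of attributes that are filled in.
    return sum(1 for value in record.values() if value) / len(record)

def pick_survivor(records, rule):
    # "Survival of the fittest": the record scoring highest on the
    # single agreed dimension wins; all other data gets the kiss of death.
    return max(records, key=rule)

a = {"name": "Ann Smith", "phone": "", "updated": date(2010, 1, 5)}
b = {"name": "Ann Smith", "phone": "555-0101", "updated": date(2009, 3, 2)}

by_completeness = pick_survivor([a, b], completeness)          # b wins
by_timeliness = pick_survivor([a, b], lambda r: r["updated"])  # a wins
```

Note that the two rules pick different winners from the same pair, which is exactly why negotiating the rule between data providers and consumers matters.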
Recently it has been suggested that the phrase “survival of the fittest” in evolutionary theory be changed to “survival of the fit enough”, because it seems that many times species haven’t competed but instead have found their way into empty alternative spaces.
It seems that master data management and related data quality are going that way too. Data that is fit enough will survive in the master data hub in alternative spaces where the single source of truth exists in perfect symbiosis with multiple realities.
I have just read two blog posts about the dangers of deleting data in the good cause of making data quality improvements.
In his post Why Merging is Evil Scott Schumacher of IBM Initiate describes the horrors of using survivorship rules for merging two (or more) database rows recognized to reflect the same real world entity.
Jim Harris describes the insane practices of getting rid of unwanted data in the post A Confederacy Of Data Defects.
On a personal note I have just had a related experience from outside the data management world. We have just relocated from a fairly large house to a modest-sized apartment. Due to the downsizing, and the good opportunity given by the migration, we threw away a lot of stuff in the process. Now we are in the process of buying replacements for the things we shouldn’t have thrown away.
As Scott describes in his post about merging, there is an alternative approach to merging, namely linking – with some computational inefficiency attached. Also, in the cases described by Jim, we often don’t dare to delete at the root, so instead we keep the original values and make a new cleansed copy without the supposedly unwanted data for the purpose at hand.
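The linking alternative can be sketched as a separate link table that leaves the source rows untouched (a minimal sketch; the identifiers and table layout are made up for illustration):

```python
# Each link row says: this source row reflects this real world entity.
links = []  # (entity_id, source_system, source_key)

def link(entity_id, source_system, source_key):
    links.append((entity_id, source_system, source_key))

# Two rows recognized as the same person are linked, not merged,
# so nothing is deleted and the link can be undone later.
link("E1", "crm_north", "1001")
link("E1", "crm_south", "A-77")

def members(entity_id):
    # The computational inefficiency: every consolidated view requires
    # scanning the links at read time instead of reading one golden row.
    return [(s, k) for e, s, k in links if e == entity_id]
```

This is the data management equivalent of the self-storage unit: everything is kept, at the cost of having to fetch it when needed.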
In my relocation project we could have rented a self-storage unit for all the supposedly not-so-needed stuff as well.
It’s a balance. As in all things data quality there isn’t a single right or wrong answer to what to do. And there will always be regrets. Now, where’s the undo button?
A frequent challenge when building a customer master data hub is dealing with incoming records from operational systems where the data in one record belongs to several real world entities.
One situation may be that a name contains two (or more) real world names. This situation was discussed in the post Splitting names.
Another situation may be that different data elements in the same record belong to different real world entities: the name may belong to entity X, the address to entity Y and the national ID to entity Z. Fortunately most cases only involve two different real world entities, like X and Y or Y and Z.
An example I have encountered often is when a company delivers a service through another organization. Then you may have:
- The name of the 3rd party organization in the name column(s)
- The address of the (private) end user in the address columns
Or, as I remember seeing once:
- The name of the (private) end user in the name column(s)
- The address of the (private) end user in the address columns
- The company national identification number of the 3rd party organization in the national ID column
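A downstream split of such a mixed record might look like the following sketch, where one incoming row becomes two related entities. The field names and the split logic are assumptions for illustration:

```python
def split_record(record):
    # The name columns (and national ID) belong to the 3rd party
    # organization, while the address columns belong to the end user.
    organization = {
        "type": "organization",
        "name": record["name"],
        "national_id": record.get("national_id"),
    }
    person = {
        "type": "person",
        "address": record["address"],
        # Keep the relation so the hierarchy can be rebuilt later.
        "delivered_via": record["name"],
    }
    return organization, person

incoming = {
    "name": "Acme Services Ltd",
    "address": "12 Main Street, 2100 Copenhagen",
    "national_id": "DK12345678",
}
organization, person = split_record(incoming)
```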
Of course the root cause solution to this will be a better (and perhaps more complex) way of gathering master data in the operational systems. But most companies have old systems running core business activities that are not easily changed, and swapping to new systems in a rush isn’t something just done either. Also, data gathering may take place outside your company, making the data governance much more political.
A solution downstream at the data matching gates of the master data hub may be to facilitate complex hierarchy building.
Oftentimes the outcome will be that the single customer view in the master data hub is challenged from the start, because the data, in some perception, is fit for its originally intended purpose of use.
One of the most frequent assignments I have had within data matching is merging customer databases after two companies have been merged.
This is one of the occasions where it doesn’t help to recite the usual data quality mantras, like:
- Prevention and root cause analysis are a better option
- Change management is a critical factor in ensuring long-term data quality success
- Tools are not important
It is often essential for the newly merged company to have a 360-degree view of business partners as soon as possible in order to maximize synergies from the merger. If the volumes are above just a few thousand entities, it is not possible to obtain that using human resources alone. Automated matching is the only realistic option.
The types of entities to be matched may be:
- Private customers – individuals and households (B2C)
- Business customers (B2B) on account level, enterprises, legal entities and branches
- Contacts for these accounts
I have developed a slightly extended version of this typification here.
One of the most common challenges in merging customer databases is that hierarchy management may have been done very differently in the past within the merging bodies. When aligning different perceptions I have found that a real world approach often reconciles the different reasonings.
The fuzziness needed for the matching basically depends on the common unique keys available in the two databases. These are keys such as citizen IDs (whatever they are labeled around the world) and public company IDs (the same applies). Matching both databases against an external source (per entity type) is an option. “DUNS Numbering” is probably the most commonly known type of such an approach. Maintaining a solution for assigning DUNS Numbers to customer files from the D&B WorldBase is, by the way, one of my other assignments, as described here.
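The principle of keying first and falling back to fuzziness can be sketched like this (hypothetical record fields; the comparator is a stand-in for illustration, not what any particular tool uses):

```python
from difflib import SequenceMatcher

def match_score(a, b):
    # An exact match on a shared unique key (citizen ID, company ID,
    # DUNS Number) needs no fuzziness at all.
    if a.get("company_id") and a.get("company_id") == b.get("company_id"):
        return 1.0
    # Without common keys, fall back to fuzzy comparison of names.
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

keyed_1 = {"name": "Acme A/S", "company_id": "DK12345678"}
keyed_2 = {"name": "Acme Denmark", "company_id": "DK12345678"}
unkeyed = {"name": "Acme Denmark", "company_id": None}
```

Matching both databases against an external source like the D&B WorldBase effectively gives all records such a shared key, which is what makes that approach attractive.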
The automated matching process may be divided into the same three steps as mentioned above:
- Candidate selection
- Match scoring
- Match destination
During my many years of practice in doing this I have found that the result from the automated process may vary considerably in quality and speed depending on the tools used.