Hierarchical Data Matching

A year ago I wrote a blog post about data matching published on the Informatica Perspective blog. The post was called Five Future Data Matching Trends.

One of the trends mentioned is hierarchical data matching.

The reason we need what may be called hierarchical data matching is that more and more organizations are looking into master data management, and then they realize that the classic name and address matching rules do not necessarily fit when party master data are going to be used for multiple purposes. What constitutes a duplicate in one context, like sending a direct mail, doesn’t necessarily constitute a duplicate in another business function, and vice versa. Duplicates come in hierarchies.

One example is a household. You probably don’t want to send two sets of the same material to a household, but you might want to engage in a 1-to-1 dialogue with the individual members. Another example is that you might do some very different kinds of business with the same legal entity. Financial risk management sees one and the same legal entity, but different sales or purchase processes may require very different views.

I usually divide a data matching process into three main steps:

  • Candidate selection
  • Match scoring
  • Match destination

(More information on the page: The Art of Data Matching)
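
As a rough illustration of how these three steps may hang together, here is a minimal Python sketch. The field names, the blocking key and the threshold are my own assumptions for illustration, not taken from any particular tool:

```python
from itertools import combinations
from difflib import SequenceMatcher

records = [
    {"id": 1, "name": "Margaret Smith", "postal_code": "1000"},
    {"id": 2, "name": "Margareth Smith", "postal_code": "1000"},
    {"id": 3, "name": "John Smith", "postal_code": "1000"},
]

# Step 1: Candidate selection. Only compare records sharing a blocking key
# (here simply the postal code) to avoid comparing everything with everything.
def candidates(recs):
    for a, b in combinations(recs, 2):
        if a["postal_code"] == b["postal_code"]:
            yield a, b

# Step 2: Match scoring. Assign a similarity score to each candidate pair.
def score(a, b):
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

# Step 3: Match destination. Decide what to do with each scored pair; in a
# hierarchical setup this is where the purge/merge/split/link rules would live.
for a, b in candidates(records):
    s = score(a, b)
    decision = "treat as duplicates" if s > 0.9 else "keep separate"
    print(a["id"], b["id"], round(s, 2), decision)
```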

Hierarchical data matching is mostly about the last step where we apply survivorship rules and execute business rules on whether to purge, merge, split or link records.

In my experience there are a lot of data matching tools out there capable of handling candidate selection, match scoring, purging records and to some degree merging records. But solutions are sparse when it comes to more sophisticated things like splitting an original entity into two or more entities, for example by Splitting Names, or linking records in hierarchies in order to build a Hierarchical Single Source of Truth.
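
As a very small sketch of the splitting part, this is roughly what pulling two persons out of one household style name could look like. The simple ‘&’ rule and the shared surname assumption are mine, for illustration only; real name parsing requires far more knowledge:

```python
def split_household_name(record):
    """Split e.g. 'Margaret & John Smith' into two person records that can
    later be linked to the same household. The '&' rule and the shared
    surname assumption are simplifications for illustration only."""
    name = record["name"]
    if "&" not in name:
        return [record]
    left, right = (part.strip() for part in name.split("&", 1))
    surname = right.split()[-1]
    if len(left.split()) == 1:  # left part carries no surname of its own
        left = f"{left} {surname}"
    return [{**record, "name": left}, {**record, "name": right}]

original = {"name": "Margaret & John Smith", "address": "1 Main Street, Anytown"}
for person in split_household_name(original):
    print(person)
```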


Mashing Up Big Reference Data and Internal Master Data

Right now I’m working on a cloud service called instant Data Quality (iDQ™).

It is basically a very advanced search engine that can be integrated into business processes in order to get data quality right the first time while at the same time reducing the time needed for looking up and entering contact data.

With iDQ™ you are able to look up what is known about a given address, company and individual person in external sources (I call these big reference data) and what is already known in internal master data.

From a data quality point of view this mashup helps with solving some of the core data quality issues almost every organization has to deal with, namely:

  • Avoiding duplicates
  • Getting data as complete as possible
  • Ensuring maximal accuracy

The mashup is also a very good foundation for taking real-time decisions about master data survivorship.
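
To illustrate the idea (and only the idea), here is a tiny sketch of such a mashup lookup. The two search functions are placeholders I have made up for this post; they are not the actual iDQ™ API:

```python
def search_big_reference_data(query):
    """Placeholder for a lookup in external sources such as address,
    business and consumer directories."""
    return [{"name": "Anytown Angels", "address": "1 Main Street, Anytown",
             "source": "business directory"}]

def search_internal_master_data(query):
    """Placeholder for a search in the organization's own master data."""
    return [{"name": "Local Charity", "address": "1 Main Str, Anytown",
             "source": "CRM"}]

def mashup_lookup(query):
    # Showing both result sets side by side at entry time is what supports
    # avoiding duplicates, completing the record and deciding survivorship.
    return {
        "external": search_big_reference_data(query),
        "internal": search_internal_master_data(query),
    }

print(mashup_lookup("charity, main street, anytown"))
```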

The iDQ™ service helps with getting data quality right the first time. However, you also need Ongoing Data Maintenance in order to keep data at a high quality. Therefore iDQ™ is built to tap into subscription services for external reference data.

At iDQ we are looking for partners world-wide who see the benefit of having such a cloud based master data service connected to providing business-to-business (B2B) and/or business-to-consumer (B2C) data services, data quality services and master data management solutions.

Here’s the contact data: http://instantdq.com/contact/


We All Hate To Watch It

Tonight the Eurovision Song Contest final will be watched by over 100 million people, despite the fact that most people agree that the songs aren’t that good.

The winner will be selected by summing up an equal number of votes from each country. Usually there are big differences in how countries vote. A trend is that some neighboring groups of countries like to vote for each other. Such groups include a “Balkan Block” and a “Viking Empire”.

It’s a bit like survivorship when merging matched data rows into a golden record in an enterprise master data hub. Maybe the winning data isn’t that good and several departments probably don’t like it at all.

So I see no reason why Denmark shouldn’t win tonight.


Survival of the Fit Enough

When working with data quality and master data management at the same time, you are constantly met with the challenge that data quality is most often defined as data being fit for the purpose of use, while master data management is about using the same data for multiple purposes at the same time.

Finding the right solution to such a challenge within an organization isn’t easy, because despite all good intentions it is difficult to find someone in the business with an overall answer to that kind of problem, as explained in the blog post by David Loshin called Communications Gap? Or is there a Gap between Chasms?

An often used principle for overcoming these issues may (based on Darwin) be seen as “survival of the fittest”. You negotiate some survivorship rules between “competing” data providers and consumers, and then the data that is the fittest measured by these rules wins. All other data gets the KISS of death. Most such survivorship rules are indeed simple, often based on a single dimension such as timeliness, completeness or provenance.
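
As a minimal sketch, a single-dimension rule based on timeliness could look like this. The sample rows and field names are made up for illustration:

```python
from datetime import date

rows = [
    {"phone": "555-1234", "updated": date(2010, 3, 1), "source": "CRM"},
    {"phone": "555-9876", "updated": date(2011, 4, 15), "source": "Billing"},
    {"phone": None,       "updated": date(2011, 5, 1), "source": "Web shop"},
]

def survivor_by_timeliness(rows, field):
    """Survival of the fittest on a single dimension: the most recently
    updated non-empty value wins and everything else gets the kiss of death."""
    filled = [r for r in rows if r[field]]
    return max(filled, key=lambda r: r["updated"])[field]

print(survivor_by_timeliness(rows, "phone"))  # 555-9876
```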

Recently it has been suggested that the phrase “survival of the fittest” in evolution theory be changed to “survival of the fit enough”, because it seems that many times specimens haven’t competed but instead found a way into empty alternate spaces.

It seems that master data management and related data quality are going that way too. Data that is fit enough will survive in the master data hub in alternate spaces where the single source of truth exists in perfect symbiosis with multiple realities.


Now, where’s the undo button?

I have just read two blog posts about the dangers of deleting data in the good cause of making data quality improvements.

In his post Why Merging is Evil Scott Schumacher of IBM Initiate describes the horrors of using survivorship rules for merging two (or more) database rows recognized to reflect the same real world entity.

Jim Harris describes the insane practices of getting rid of unwanted data in the post A Confederacy Of Data Defects.

On a personal note I have just had a related experience from outside the data management world. We have just relocated from a fairly large house to a modest sized apartment. Due to the downsizing and the good opportunity given by the migration, we threw away a lot of stuff in the process. Now we are in the process of buying replacements for the things we shouldn’t have thrown away.

As Scott describes in his post about merging, there is an alternative approach to merging, namely linking – with some computational inefficiency attached. Also, in the cases described by Jim we often don’t dare to delete at the root, so instead we keep the original values and make a new cleansed copy without the supposedly unwanted data for the purpose at hand.
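
A tiny sketch of the difference between the two approaches, using structures I have made up for illustration: merging collapses the rows and throws information away, while linking keeps every original row and only adds a shared cluster key.

```python
rows = [
    {"id": "CRM-17", "name": "Margaret Smith", "phone": "555-1234"},
    {"id": "SHOP-42", "name": "Peggy Smith", "phone": None},
]

# Merging: one surviving row, the losing values are gone. No undo button.
merged = {"name": "Margaret Smith", "phone": "555-1234"}

# Linking: every original row survives and a shared cluster id ties them
# together. Undoing the decision is just removing the link, at the price of
# having to resolve the cluster every time the data is consumed.
linked = [{**row, "cluster_id": "PARTY-0001"} for row in rows]

print(merged)
print(linked)
```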

In my relocation project we could have rented a self-storage unit for all the supposedly not so needed stuff as well.

It’s a balance. As in all things data quality there isn’t a single right or wrong answer to what to do. And there will always be regrets. Now, where’s the undo button?


Mixed Identities

A frequent challenge when building a customer master data hub is dealing with incoming records from operational systems where the data in one record belongs to several real world entities.

One situation may be that a name contains two (or more) real world names. This situation was discussed in the post Splitting names.

Another situation may be that:

  • The name belongs to real world entity X
  • The address belongs to real world entity Y
  • The national identification number belongs to real world entity Z

Fortunately, most cases involve only two different real world entities, like X and Y, or Y and Z.

An example I have encountered often is when a company delivers a service through another organization. Then you may have:

  • The name of the 3rd party organization in the name column(s)
  • The address of the (private) end user in the address columns

Or, as I remember seeing once:

  • The name of the (private) end user in the name column(s)
  • The address of the (private) end user in the address columns
  • The company national identification number of the 3rd party organization in the national ID column

Of course the root cause solution to this will be a better (and perhaps more complex) way of gathering master data in the operational systems. But most companies have old and not easily changeable systems running core business activities, and swapping to new systems in a rush isn’t something that is just done either. Also, data gathering may take place outside your company, making the data governance much more political.

A solution downstream at the data matching gates of the master data hub may be to facilitate complex hierarchy building.
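
A sketch of what such hierarchy building could produce from one mixed record. The element-to-entity assignments and identifiers below are assumptions for illustration:

```python
# One incoming row where the elements belong to different real world entities.
incoming = {
    "source_id": "ORDER-789",
    "name": "Anytown Angels",             # the 3rd party organization
    "address": "1 Main Street, Anytown",  # the private end user
    "national_id": "12345678",            # the organization again
}

# Instead of forcing the row into a single golden record, link each element
# to the master entity it actually describes.
links = [
    {"source_id": incoming["source_id"], "element": "name",
     "master_id": "ORG-001", "role": "service provider"},
    {"source_id": incoming["source_id"], "element": "national_id",
     "master_id": "ORG-001", "role": "service provider"},
    {"source_id": incoming["source_id"], "element": "address",
     "master_id": "PERSON-042", "role": "end user"},
]

for link in links:
    print(link)
```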

Oftentimes the result will be that the single customer view in the master data hub is challenged from the start, as the data is, in some perception, only fit for its originally intended purpose of use.


Merging Customer Master Data

One of the most frequent assignments I have had within data matching is merging customer databases after two companies have been merged.

This is one of the occasions where it doesn’t help to recite the usual data quality mantras like:

  • Prevention and root cause analysis is a better option
  • Change management is a critical factor in ensuring long-term data quality success
  • Tools are not important

It is often essential for the new merged company to have a 360 degree view of business partners as soon as possible in order to maximize synergies from the merger. If the volumes are above just a few thousand entities it is not possible to obtain that using human resources alone. Automated matching is the only realistic option.

The types of entities to be matched may be:

  • Private customers – individuals and households (B2C)
  • Business customers (B2B) on account level: enterprises, legal entities and branches
  • Contacts for these accounts

I have developed a slightly extended version of this typification here.

One of the most common challenges in merging customer databases is that hierarchy management may have been done very differently in the past within the merging bodies. When aligning different perceptions I have found that a real world approach often reconciles the different lines of reasoning.

The fuzziness needed for the matching basically depends on the common unique keys available in the two databases. These are keys such as citizen IDs (whatever they are labeled around the world) and public company IDs (the same applies). Matching both databases with an external source (per entity type) is an option. “Duns Numbering” is probably the most commonly known type of such an approach. Maintaining a solution for assigning Duns Numbers to customer files from the D&B WorldBase is by the way one of my other assignments, as described here.
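
A sketch of how the available keys may steer the amount of fuzziness needed. The field names and the threshold are my own assumptions:

```python
from difflib import SequenceMatcher

def is_match(a, b, threshold=0.88):
    """Match on a shared public company ID when both sides have one,
    otherwise fall back to fuzzy comparison of the names."""
    if a.get("company_id") and b.get("company_id"):
        return a["company_id"] == b["company_id"]
    similarity = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return similarity >= threshold

row_from_db_a = {"name": "Smith Trading Ltd", "company_id": "DK12345678"}
row_from_db_b = {"name": "Smith Trading Limited", "company_id": None}

print(is_match(row_from_db_a, row_from_db_b))  # True, via the fuzzy fallback
```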

The automated matching process may be divided into these three steps:

  • Candidate selection
  • Match scoring
  • Match destination

During my many years of practice in doing this I have found that the result of the automated process may vary considerably in quality and speed depending on the tools used.


When computer says maybe

When matching customer master data in order to find duplicates or to find the corresponding real world entities in a business directory or a consumer directory, you may use a data quality deduplication tool to do the hard work.

The tool will typically – depending on the capabilities of the tool and the nature of and purpose for the data – find:

  • A: The positive automated matches. Ideally you will take samples for manual inspection.
  • C: The negative automated matches.
  • B: The dubious part selected for manual inspection.
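
A sketch of how the split into the three pots may be driven by two thresholds. The threshold values are assumptions and would in practice depend on the tool, the data and the purpose:

```python
def triage(score, lower=0.60, upper=0.90):
    """Route a scored candidate pair to one of the three pots."""
    if score >= upper:
        return "A"  # positive automated match, sampled for inspection
    if score <= lower:
        return "C"  # negative automated match
    return "B"      # dubious, goes to manual inspection

for score in (0.95, 0.75, 0.40):
    print(score, "->", triage(score))
```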

Humans are costly resources. Therefore the manual inspection of the B pot (and the A sample) may be supported by a user interface that helps get the job done quickly but accurately.

I have worked with the following features for such functionality:

  • Random sampling for quality assurance – both from the A pot and from the manually settled matches in the B pot
  • Check-out and check-in for multiuser environments
  • Presenting a ranked range of computer selected candidates
  • Color coding elements in matched candidates (see the sketch after this list) – like:
    • green for (near) exact name,
    • blue for a close name and
    • red for a far from similar name
  • Possibility for marking:
    • as a manual positive match,
    • as a manual negative match (with reason) or
    • as questionable for later or supervisor inspection (with comments)
  • Entering a match found by other methods
  • Removing one or several members from a duplicate group
  • Splitting a duplicate group into two groups
  • Selecting survivorship
  • Applying hierarchy linkage
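
As mentioned in the color coding item above, here is a sketch of how the coloring may be driven by the underlying name similarity. The bands are illustrative assumptions:

```python
def name_color(similarity):
    """Map a name similarity score to the color shown for the name element
    in the manual inspection screen."""
    if similarity >= 0.95:
        return "green"  # (near) exact name
    if similarity >= 0.80:
        return "blue"   # close name
    return "red"        # far from similar name

for similarity in (0.98, 0.85, 0.30):
    print(similarity, name_color(similarity))
```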

Anyone else out there who has worked with making or using a man-machine dialogue for this?

Data Quality Tools Revealed

To be honest: Data Quality tools today only solve a very few of the data quality problems you have. On the other hand, the few problems they do solve may be solved very well and cannot be solved by any other line of products or in any practical way by humans in any quantity or quality.

Data Quality tools mainly support you with automation of:

  • Data Profiling
  • Data Matching

Data Profiling

Data profiling is the ability to generate statistical summaries and frequency distributions for the unique values and formats found within the fields of your data sources in order to measure data quality and find critical areas that may harm your business. For more on the subject I recommend reading the introduction provided by Jim Harris in his post “Getting Your Data Freq On”, which is followed up by the series of posts “Adventures in Data Profiling part 1 – 8”.

Saying that you can’t use other product lines for data profiling is actually only partly true. You may come a long way by using features in popular database managers, as demonstrated in Rich Murnane’s blog post “A very inexpensive way to profile a string field in Oracle”. But for full automation and a full set of out-of-the-box functionality a data profiling tool will be necessary.
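
To give an idea of what the core of such out-of-the-box functionality does, here is a minimal sketch of a value and format frequency profile. The sample data and the format mask are my own:

```python
from collections import Counter

phone_numbers = ["555-1234", "5551234", "555-9876", "", "555-1234", "N/A"]

def format_mask(value):
    """Reduce a value to a format pattern: digits become 9, letters become A."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)

value_frequencies = Counter(phone_numbers)
format_frequencies = Counter(format_mask(v) for v in phone_numbers)
completeness = sum(1 for v in phone_numbers if v.strip()) / len(phone_numbers)

print(value_frequencies.most_common())
print(format_frequencies.most_common())  # e.g. [('999-9999', 3), ('9999999', 1), ...]
print(f"completeness: {completeness:.0%}")
```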

The data profiling tool market landscape is – unlike that of data matching – also characterized by the existence of open source tools. Talend is the leading one of those; another is DataCleaner, created by my fellow countryman Kasper Sørensen.

I take the emergence of open source solutions in the realm of data profiling as a sign that this is the technically easiest part of data quality tool invention.

Data Matching

Data matching is the ability to compare records that are not exactly the same but are so similar that we may conclude that they represent the same real world object.

Also here some popular database managers today have some functionality like the fuzzy grouping and lookup in MS SQL. But in order to really automate data matching processes you need a dedicated tool equipped with advanced algorithms and comprehensive functionality for candidate selection, similarity assignment and survivorship settlement.

Data matching tools are essential for processing large numbers of data rows within a short timeframe for example when purging duplicates before marketing campaigns or merging duplicates in migration projects.

Matching technology is increasingly implemented as what is often described as a firewall, where possible new entries are compared to existing rows in databases as upstream prevention against duplication.

Besides handling duplicates, matching techniques are used for correcting postal addresses against official postal references and for matching data sets against reference databases like B2B and B2C party data directories, as well as for matching with product data systems, all in order to be able to enrich with and maintain more accurate and timely data.

Automation of matching is in no way straightforward, and solutions are constantly met with the challenge of producing a sufficient number of true positives without creating too many false positives.


Master Data Survivorship

A Master Data initiative is often described as making a “golden view” of all Master Data records held by an organization in various databases used by different applications serving a range of business units.

In doing that (either in the initial consolidation or the ongoing insertion and update) you will time and again encounter situations where two versions of the same element must be merged into one version of the truth.

In some MDM hub styles the decision is taken at consolidation time; in other styles the decision is postponed until the data (links) is consumed in a given context.

In the following I will talk about Party Master Data being the most common entity in Master Data initiatives.

This spring Jim Harris made a brilliant series of articles on DataQualityPro on the subject of identifying duplicate customers, ending with part number 5 dealing with survivorship. Here Jim describes all the basic considerations on how some data elements survive a merge/purge while others will be forgotten, and gives good examples with US consumers/citizens.

Taking it from there Master Data projects may have the following additional challenges and opportunities:

  • Global Data adds diversity to the rule set for consolidating data on record level as well as field level. You will have to compromise between simple global rules and complex optimized rules (and supporting knowledge data) for each country/culture.
  • Multiple types of Party Master Data must be handled when Business Partners include business entities having departments and employees, and not least when they are present together with consumers/citizens.
  • External Reference Data is becoming more and more common as part of MDM solutions, adding valid, accurate and complete information about Business Partners. Here you have to set rules (on field level) for whether external data overrides internal data, fills in the blanks or only supplements internal data (see the sketch after this list).
  • Hierarchy building is closely related to survivorship. Rules may be set for whether two entities go into two hierarchies with surviving parts from both or merge as one with survivorship. Even an original entity may be split into two hierarchies with surviving parts.
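
As referred to in the external reference data item above, here is a sketch of what such field level rules could look like. The rule names, fields and sample values are assumptions for illustration:

```python
internal = {"name": "Local Charity", "industry_code": None, "phone": "555-1234"}
external = {"name": "Anytown Angels", "industry_code": "8890", "phone": None}

# Per-field policy for how external reference data may contribute.
rules = {
    "name": "override",       # the external registry is authoritative
    "industry_code": "fill",  # only used when internal data is blank
    "phone": "supplement",    # kept alongside internal data, never replacing it
}

def apply_rules(internal, external, rules):
    golden = dict(internal)
    for field, rule in rules.items():
        value = external.get(field)
        if not value:
            continue
        if rule == "override":
            golden[field] = value
        elif rule == "fill" and not golden.get(field):
            golden[field] = value
        elif rule == "supplement":
            golden[f"{field}_external"] = value
    return golden

print(apply_rules(internal, external, rules))
# {'name': 'Anytown Angels', 'industry_code': '8890', 'phone': '555-1234'}
```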

What is essential in survivorship is not losing any valuable information while not creating information redundancy.

An example of complex survivorship processing may be this:

A membership database holds the following record (Name, Address, City):

  • Margaret & John Smith, 1 Main Street, Anytown

An eShop system has the following accounts (Name, Address, Place):

  • Mrs Margaret Smith, 1 Main Str, Anytown
  • Peggy Smith, 1 Main Street, Anytown
  • Local Charity c/o Margaret Smith, 1 Main Str, Anytown

A complex process of consolidation including survivorship may take place. As part of this example the company Local Charity is matched with an external source telling that it has a new name, Anytown Angels. The result may be this “golden view”:

ADDRESS in Anytown on Main Street no 1 having
• HOUSEHOLD having
– CONSUMER Mrs. Margaret Smith aka Peggy
– CONSUMER Mr. John Smith
• BUSINESS Anytown Angels having
– EMPLOYEE Mrs. Margaret Smith aka Peggy

Observe that everything survives in a globally applicable structure, in a fit hierarchy reflecting local rules, handling multiple types of party entities and using external reference data.
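
For illustration, such a golden view could be carried in a simple nested structure like the one below. The type names follow the example above, but the representation itself is my own assumption:

```python
golden_view = {
    "type": "ADDRESS", "value": "1 Main Street, Anytown",
    "children": [
        {"type": "HOUSEHOLD", "children": [
            {"type": "CONSUMER", "name": "Mrs. Margaret Smith", "aka": "Peggy"},
            {"type": "CONSUMER", "name": "Mr. John Smith"},
        ]},
        {"type": "BUSINESS", "name": "Anytown Angels", "children": [
            {"type": "EMPLOYEE", "name": "Mrs. Margaret Smith", "aka": "Peggy"},
        ]},
    ],
}

def print_tree(node, indent=0):
    label = node.get("name") or node.get("value", "")
    print("  " * indent + f'{node["type"]} {label}'.strip())
    for child in node.get("children", []):
        print_tree(child, indent + 1)

print_tree(golden_view)
```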

But OK, we didn’t have funny names, dirt, misplaced data…..
