Combining Data Matching and Multidomain MDM

Two of the most addressed data management topics on this blog are data matching and multidomain Master Data Management (MDM). In addition, I have founded two LinkedIn Groups for people interested in one or both of these topics.

The Data Matching Group has close to 2,000 members. In here we discuss nerdy stuff such as deduplication, identity resolution, deterministic matching using match codes, algorithms, pattern recognition, fuzzy logic, probabilistic learning, false negatives and false positives.

Check out the LinkedIn Data Matching Group here.

The Multi-Domain MDM Group has close to 2,500 members. In here we exchange knowledge on how to encompass more than a single master data domain in an MDM initiative. In that way the group also covers the evolution of MDM as the discipline – and solutions – have emerged from Customer Data Integration (CDI) and Product Information Management (PIM).

Check out the LinkedIn Multi-Domain MDM Group here.

The result of combining data matching and multi-domain MDM is golden records. The golden records are the foundation of having a 360-degree / single view of parties, locations, products and assets as examined in The Disruptive MDM / PIM / DQM List blog post Golden Records in Multidomain MDM.

Welcome Reifier on the Disruptive MDM / PIM List

The Disruptive MDM / PIM List is a list of solutions in the Master Data Management (MDM), Product Information Management (PIM) and Data Quality Management (DQM) space.

The list presents both larger solutions that are also included by the analyst firms in their market reports and smaller solutions that you do not hear so much about, but that may be exactly the solution that addresses the specific challenges you have.

The latest entry on the list, Reifier, is one of the latter ones.

Matching data records and identifying duplicates in order to achieve a 360-degree view of customers and other master data entities is the most frequently mentioned data quality issue. Reifier is an artificial intelligence (AI) driven solution that tackles that problem.

Read more about Reifier here.

Three Not So Easy Steps to a 360-Degree Customer View

Getting a 360-degree view (or single view) of your customers has been a quest in data management as long as I can remember.

This has been the (unfulfilled) promise of CRM applications since they emerged 25 years ago. Data quality tools have been very much about deduplication of customer records. Customer Data Integration (CDI) and the first Master Data Management (MDM) platforms were aimed at that conundrum. Now we see the notion of a Customer Data Platform (CDP) getting traction.

There are three basic steps in getting a 360-degree view of those parties that have a customer role within your organization – and these steps are not at all easy ones:

  • Step 1 is identifying those customer records that typically are scattered around in the multiple systems that make up your system landscape. You can do that (endlessly) by hand, using the very different deduplication functionality that comes with ERP, CRM and other applications, using a best-of-breed data quality tool or the data matching capabilities built into MDM platforms. Doing this with adequate results takes a lot, as pondered in the post Data Matching and Real-World Alignment.
  • Step 2 is finding out which data records and data elements survive as the single source of truth. This is something a data quality tool can help with, but it is best done within an MDM platform. The three main options for that are examined in the post Three Master Data Survivorship Approaches.
  • Step 3 is gathering all data besides the master data and relating those data to the master data entity that identifies and describes the real-world entity with a customer role. Today we see both CRM solution vendors and MDM solution vendors offering the technology to enable that, as told in the post CDP: Is that part of CRM or MDM?

The Trouble with Data Quality Dimensions

Data quality dimensions are some of the most used terms when explaining why data quality is important, what data quality issues can be and how you can measure data quality. Ironically, we sometimes use the same data quality dimension term for two different things or use two different data quality dimension terms for the same thing. Some of the troubling terms are:

Validity / Conformity – same same but different

Validity is most often used to describe whether data filled into a data field obeys a required format or is among a list of accepted values. Databases are usually good at doing this, for example by ensuring that an entered date has the day-month-year sequence asked for and is a date in the calendar, or by cross-checking data values against another table to see if the value exists there.

The problems arise when data is moved between databases with different rules and when data is captured in textual forms before being loaded into a database.

Conformity is often used to describe whether data adheres to a given standard, like an industry or international standard. Due to complexity and other circumstances, this standard may not, or may only partly, be implemented as database constraints or by other means. Therefore, a given piece of data may seem to be a valid database value but still not be in compliance with a given standard.

For example, the code value for a colour being “0,255,0” may be in the accepted format, and all elements are in the accepted range between 0 and 255 for an RGB colour code. But the standard for a given product colour may only allow the value “Green” and the other common colour names, and “0,255,0” will, when translated, end up as “Lime” or “High green”.
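The difference can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the short list of allowed colour names are assumptions made for the example and do not come from any particular standard.

```python
# Hypothetical list of colour names allowed by a product colour standard.
ALLOWED_COLOUR_NAMES = {"Red", "Green", "Blue", "Black", "White"}


def is_valid_rgb(value: str) -> bool:
    """Validity: three comma-separated integers, each between 0 and 255."""
    parts = [part.strip() for part in value.split(",")]
    if len(parts) != 3:
        return False
    return all(part.isdecimal() and 0 <= int(part) <= 255 for part in parts)


def conforms_to_standard(value: str) -> bool:
    """Conformity: the value must be one of the names the standard allows."""
    return value in ALLOWED_COLOUR_NAMES


# "0,255,0" passes the format check but is not an allowed colour name.
print(is_valid_rgb("0,255,0"), conforms_to_standard("0,255,0"))
# "Green" is an allowed colour name but is not an RGB code.
print(is_valid_rgb("Green"), conforms_to_standard("Green"))
```

A value can pass one check and fail the other, which is why the two terms are worth keeping apart.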

Accuracy / Precision – true, false or not sure

The difference between accuracy and precision is a well-known statistical subject.

In the data quality realm accuracy is most often used to describe if the data value corresponds correctly to a real-world entity. If we for example have a postal address of the person “Robert Smith” being “123 Main Street in Anytown” this data value may be accurate because this person (for the moment) lives at that address.

But if “123 Main Street in Anytown” has 3 different apartments each having its own mailbox, the value does not, for a given purpose, have the required precision.

If we work with geocoordinates we have the same challenge. A given accurate geocode may have sufficient precision to tell the direction to the nearest supermarket, but not be precise enough to tell in which apartment the out-of-milk smart refrigerator is.

Timeliness / Currency – when time matters

Timeliness is most often used to state if a given data value is present when it is needed. For example, you need the postal address of “Robert Smith” when you want to send a paper invoice or when you want to establish his demographic stereotype for a campaign.

Currency is most often used to state if the data value is accurate at a given time – for example if “123 Main Street in Anytown” is the current postal address of “Robert Smith”.

Uniqueness / Duplication – positive or negative

Uniqueness is the positive term where duplication is the negative term for the same issue.

We strive to have uniqueness by avoiding duplicates. In data quality lingo duplicates are two (or more) data values describing the same real-world entity. For example, we may assume that

  • “Robert Smith at 123 Main Street, Suite 2 in Anytown”

is the same person as

  • “Bob Smith at 123 Main Str in Anytown”
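As a rough sketch of how such an assumption could be tested automatically, the two values can be standardized and then compared with a string similarity measure. Everything here is illustrative: the nickname and abbreviation tables, the function names and the threshold of 0.85 are assumptions for the example, and `difflib.SequenceMatcher` from the Python standard library stands in for the more refined algorithms found in data matching tools.

```python
from difflib import SequenceMatcher

# Illustrative lookup tables; real tools rely on far larger reference data.
NICKNAMES = {"bob": "robert", "rob": "robert", "bill": "william"}
STREET_ABBREVIATIONS = {"str": "street", "st": "street", "rd": "road"}

# Assumed cut-off; in practice this is tuned against false positives and negatives.
MATCH_THRESHOLD = 0.85


def standardize(text: str) -> str:
    """Lower-case, drop punctuation, expand known nicknames and abbreviations."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    tokens = []
    for token in cleaned.split():
        token = NICKNAMES.get(token, token)
        token = STREET_ABBREVIATIONS.get(token, token)
        tokens.append(token)
    return " ".join(tokens)


def similarity(a: str, b: str) -> float:
    """Similarity between 0 and 1 of the standardized values."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()


record_a = "Robert Smith at 123 Main Street, Suite 2 in Anytown"
record_b = "Bob Smith at 123 Main Str in Anytown"

score = similarity(record_a, record_b)
print(f"{score:.2f}", "candidate duplicate" if score >= MATCH_THRESHOLD else "no match")
```

A score above the threshold only makes the pair a candidate duplicate. Whether “Suite 2” matters is still a question for the business process that uses the result.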

Completeness / Existence – to be, or not to be

Completeness is most often used to tell to what degree all required data elements are populated.

Existence can be used to tell if a given dataset has all the needed data elements for a given purpose defined.

So “Bob Smith at 123 Main Str in Anytown” is complete if we need name, street address and city, but only 75 % complete if we need name, street address, city and preferred colour and preferred colour is an existent data element in the dataset.
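The arithmetic behind that example can be written out as a small Python sketch. The field names and the two helper functions are assumptions made for illustration: completeness is taken as the share of required elements that are populated, and existence as the element being defined in the dataset at all.

```python
from typing import Optional


def completeness(record: dict[str, Optional[str]], required: list[str]) -> float:
    """Share of the required data elements that are populated in the record."""
    populated = sum(1 for field in required if record.get(field))
    return populated / len(required)


def exists(record: dict[str, Optional[str]], field: str) -> bool:
    """Existence: the data element is defined, whether or not it is populated."""
    return field in record


bob = {
    "name": "Bob Smith",
    "street_address": "123 Main Str",
    "city": "Anytown",
    "preferred_colour": None,  # defined in the dataset but not populated
}

print(completeness(bob, ["name", "street_address", "city"]))  # 1.0
print(completeness(bob, ["name", "street_address", "city", "preferred_colour"]))  # 0.75
print(exists(bob, "preferred_colour"))  # True
```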

If a country list is that hard, MDM is really hard

A Twitter post pointing to an article with the title Make the Right Choice Using the Right Criteria: A Checklist for Exploring MDM Solutions and Capabilities made me curious and got my click.

However, before reading too much I was prompted with an inescapable form asking for my details in a master data sharing tone.

Well, then I could as well explore the mandatory country list. No surprise. A master (or reference) data havoc. Two Bosnia (and) Herzegovina entries. Two Brunei entries. Two Brazil entries. Two Burma / Myanmar entries.

Country List Havoc by Stibo Systems

Data Matching and Real-World Alignment

Data matching is a subdiscipline within data quality management. Data matching is about establishing a link between data elements and entities that do not have the same value but refer to the same real-world construct.

The most common scenario for data matching is deduplication of customer data records held across an enterprise. In this case we often see a gap between what we technically try to do and the desired business outcome from deduplication. In my experience, this misalignment has something to do with real-world alignment.

What we technically do is basically to find a similarity between data records that typically have been pre-processed with some form of standardization. This is often not enough.

Location Intelligence

Deduplication and other forms of data matching with customer master data revolve around names and addresses.

Standardization and verification of addresses is a very common element in data quality / data matching tools. Often such a tool will use a service either from its own brand or from a third party. Unfortunately, a single service is often not enough. This is because:

  • Most services are biased towards a certain geography. They may for example be quite good for addresses in the United States but very poor compared to local services for other geographies. This is especially true for geographies with multiple languages in play, as exemplified in the post The Art in Data Matching.
  • There is much more to an address than the postal format. In deduplication it is for example useful to know if the address is a single-family house or a high-rise building, a nursing home, a campus or other building with lots of units.
  • Timeliness of address reference data is underestimated. I recently heard from a leader in the Gartner Quadrant for Data Quality Tools that a quarterly refresh is fine. It is not, as told in the post Location Data Quality for MDM.

Identity Resolution

The overlaps and similarities between data matching and identity resolution were discussed in the post Deduplication vs Identity Resolution.

In summary, the capability to tell if two data records represent the same real-world entity will eventually involve identity resolution. And as this is very poorly supported by the data quality tools around, we see that a lot of manual work will be involved if the business processes that rely on the data matching cannot tolerate too many, or in some cases any, false positives – or false negatives.

Hierarchy Management

Even telling that a true positive match is true in all circumstances is hard. The predominant examples of this challenge are:

  • Is a match between what seems to be an individual person and what seems to be the household where the person lives a true match?
  • Is a match between what seems to be a person in a private role and what seems to be the same person in a business role a true match? This is especially tricky with sole proprietors working from home like farmers, dentists, freelance consultants and more.
  • Is a match between two sister companies on the same address a true match? Or two departments within the same company?

We often realize that the answers to these questions differ depending on the business processes where the result of the data matching will be used.

The solution is not simple. The data matching functionality must, if we want automated and broadly usable results, be quite sophisticated in order to take advantage of what is available in the real world. The data model where we hold the result of the data matching must be quite complex if we want to reflect the real world.
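One way to picture such a data model is to store a match as a typed link between records instead of a plain yes or no, and let each business process decide which link types count as a match. The sketch below is a hypothetical illustration in Python: the link types follow the questions above, while the record identifiers, the process names and the policy table are invented for the example.

```python
from dataclasses import dataclass
from enum import Enum


class LinkType(Enum):
    SAME_PERSON = "same person"
    SAME_HOUSEHOLD = "person and household"
    PRIVATE_AND_BUSINESS_ROLE = "same person in private and business role"
    SAME_COMPANY_FAMILY = "sister companies or departments"


@dataclass(frozen=True)
class MatchLink:
    """Result of data matching: two record identifiers and how they relate."""
    record_a: str
    record_b: str
    link_type: LinkType


# Hypothetical policy: the link types each business process treats as a match.
ACCEPTED_LINKS = {
    "direct_mail": {LinkType.SAME_PERSON, LinkType.SAME_HOUSEHOLD},
    "credit_risk": {LinkType.SAME_PERSON, LinkType.PRIVATE_AND_BUSINESS_ROLE},
}


def is_match_for(link: MatchLink, business_process: str) -> bool:
    """A link is a true match only in the processes that accept its type."""
    return link.link_type in ACCEPTED_LINKS.get(business_process, set())


link = MatchLink("CRM-001", "ERP-042", LinkType.SAME_HOUSEHOLD)
print(is_match_for(link, "direct_mail"))  # True
print(is_match_for(link, "credit_risk"))  # False
```

Keeping the link type with the match result means the matching only has to be done once, while the interpretation can vary by process.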

Avoid Duplicates by Avoiding Peer-to-Peer Integrations

When working in Master Data Management (MDM) programs, duplicates are always among the main pain points on the list. As explained in the post Golden Records in Multi-Domain MDM, these may be duplicates in party master data (customer, supplier and other roles) as well as duplicates in product master data, assets, locations and more.

Most of the data quality technology available to solve these problems revolves around identifying duplicates. This is a very intriguing discipline where I have spent some of my best years. However, this is only a remedy for the symptoms of the problem and not a means to eliminate the root cause, as touched on in the post The Good, Better and Best Way of Avoiding Duplicates.

The root causes are plentiful and, as with all challenges, they involve technology, processes and people.

Having an IT landscape with multiple applications where master data are created, updated and consumed is a basic problem, and a remedy to that is the main reason for being for Master Data Management (MDM) solutions. The challenge is to implement MDM technology in a way that the MDM solution will not just become another silo of master data but instead be a solution for sharing master data within the enterprise – and ultimately in the digital ecosystem around the enterprise.

The main enemy from a technology perspective is, in my experience, peer-to-peer system integration solutions. If you have chosen application X to support a business objective and application Y to support another business objective and you learn that there is an integration solution between X and Y available, this is very bad news, because short-term cost and timing considerations will make that option look obvious. But in the long run it will cost you dearly if the master data involved are handled in other applications as well, because then you will have blind spots all over the place through which duplicates will enter.

The only sustainable solution is to build a master data hub through which master data are integrated and thus shared with all applications inside the enterprise and around the enterprise. This hub must encompass a shared master data model and related metadata.

 

The Cases for Data Matching in Multi-Domain MDM

Data matching has always been a substantial part of the capabilities in data quality technology and has become a common capability in Master Data Management (MDM) solutions.

We use the term data matching when talking about linking entities where we cannot just use exact keys in databases.

The most prominent example around is matching names and addresses related to parties, where these attributes can be spelled differently and formatted using different standards but do refer to the same real-world entity. The most common scenarios are deduplication, where we clean up databases for duplicate customer, vendor and other party role records, and reference matching, where we identify and enrich party data records with external directories.

A way to pre-process party data matching is matching the locations (addresses) with external references, which have become more and more available around the world, so you have a standardized address in order to reduce the fuzziness. In some geographies you can even make use of more extended location data, such as whether the location is a single-family house, a high-rise building, a nursing home or a campus. Geocodes can also be brought into the process.

Handling the location as a separate unique entity can also be used in many industries such as utility, telco, finance, transit and more.

For product data, achieving uniqueness is usually a lesser pain point, as told in the post Multi-Domain MDM and Data Quality Dimensions. But for sure requirements for matching products arise from time to time.

In the old days this was quite difficult as you often only had a product description that had to be parsed into discrete elements as examined in the post Matching Light Bulbs.

With the rise of Product Information Management (PIM) we now often do have the product attributes in a granular form. However, using traditional matching technology made for party master data will not do the trick as this is a different and more complex scenario. My thinking is that graph technology will help as touched in the post Three Ways of Finding a Product.

What Happened to CDI?

CDI is a Three Letter Acronym which in the data management world stands for Customer Data Integration.

Today CDI is usually wrapped into Master Data Management (MDM) as examined in the post CDI, PIM, MDM and Beyond. As mentioned in this post, a well-known analyst, Aaron Zornes, runs a business called the MDM Institute, which was originally called The Customer Data Integration Institute and still has this website: http://www.tcdii.com/.

Many Master Data Management (MDM) vendors today emphasize being multidomain, meaning their solutions can manage customer, supplier, employee and other party master data as well as product, asset, location and other core business entity types.

However, some vendors still focus on customer master data and the topic of integrating customer data by excelling in the special pain points here, not least identity resolution and sustainable merge/purge of duplicates. One example is Uniserv Smart Customer MDM.

In my recent little venture called The Disruptive Master Data Management Solution List the aim is to cover all kinds of MDM solutions: Small or big. New (start-up) or old. Multidomain MDM, Customer Data Integration (CDI), Product Information Management (PIM) or even Digital Asset Management (DAM). As a potential buyer, you can browse all these solutions and select your choice of one-stop-shopping candidates or combine best-of-breed solution candidates that match your requirements in your industry and geography.

First thing that must happen is that vendors register their solutions on the site here.
