Data Matching Efficiency

Data Matching is the discipline within data quality management that deals with probably the most frequent data quality issue you meet in almost every organization: duplicates in master data. This means duplicates in customer master data, duplicates in supplier master data, duplicates in combined / other business partner master data, duplicates in product master data and duplicates in other master data repositories.

A duplicate (or duplicate group) is where two (or more) records in a system or across multiple systems represent the same real-world entity.

Typically, you can use a tool to identify these duplicates. It can be as inexpensive as using Excel, it can be a module in a CRM or other application, it can be a capability in a Master Data Management (MDM) platform, or it can be a dedicated Data Quality Management (DQM) solution.

Over the years, numerous tools and embedded capabilities have been developed to tackle the data matching challenge. Some solutions focus on party (customer/supplier) master data and some focus on product master data. Within party master data, many solutions focus on person master data. Many solutions are optimized for a given geography or a few major geographies.

In my experience you can classify available tools / capabilities into the below 5 levels of efficiency:

The efficiency percentage here is an empirical measure of the share of actual duplicates the solution can identify automatically.

In more detail, the levels are:

1: Simple deterministic

Here you compare exact values between two duplicate candidate records, or use simply transformed values such as upper-case conversions or simple phonetic codes such as Soundex.
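
To illustrate, a minimal Python sketch of this level could look like the below. The simplified Soundex implementation is an illustrative assumption and skips some of the official rules (such as the special handling of h and w).

```python
# A minimal sketch of level 1: exact comparison on upper-cased values,
# optionally falling back to a phonetic code.

def soundex(name: str) -> str:
    """Return a simplified four-character Soundex code for a name."""
    mapping = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"),
                           ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            mapping[ch] = digit
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return "0000"
    code = name[0].upper()
    prev = mapping.get(name[0], "")
    for ch in name[1:]:
        digit = mapping.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit
    return (code + "000")[:4]

def deterministic_match(a: str, b: str, phonetic: bool = False) -> bool:
    """Exact match on upper-cased values, or on phonetic codes."""
    if phonetic:
        return soundex(a) == soundex(b)
    return a.strip().upper() == b.strip().upper()

print(deterministic_match("John Smith", "JOHN SMITH"))       # True
print(deterministic_match("Smith", "Smyth", phonetic=True))  # True (both S530)
```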

Don’t expect to catch every duplicate using this approach. With good, standardized master data, 50 % is achievable. With normal cleanliness, however, it will be lower.

Surprisingly many organizations still start here as a first step of reinventing the wheel in a Do-It-Yourself (DIY) approach.

2: Synonyms / standardization

In this more comprehensive approach you replace, substitute, or remove values, or words within values, based on synonym lists. Examples are replacing person nicknames with guessed formal names, replacing common abbreviations in street names with a standardized term, and removing legal forms from company names.

Enrichment / verification with external data can also be used, for example by standardizing addresses or classifying products.
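
A minimal sketch of this level could look like the below. The lookup tables are tiny illustrative stubs; real solutions maintain far larger, locale-specific lists.

```python
# A minimal sketch of level 2: synonym replacement and standardization.

NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}
STREET_TERMS = {"st": "street", "rd": "road", "ave": "avenue"}
LEGAL_FORMS = {"inc", "ltd", "llc", "gmbh", "a/s"}

def standardize_person(name: str) -> str:
    """Replace known nicknames with guessed formal names."""
    return " ".join(NICKNAMES.get(w, w) for w in name.lower().split())

def standardize_street(street: str) -> str:
    """Expand common street-term abbreviations."""
    words = street.lower().replace(".", "").split()
    return " ".join(STREET_TERMS.get(w, w) for w in words)

def standardize_company(name: str) -> str:
    """Remove legal forms from company names."""
    words = name.lower().replace(",", "").replace(".", "").split()
    return " ".join(w for w in words if w not in LEGAL_FORMS)

print(standardize_person("Bob Smith"))    # robert smith
print(standardize_street("Main St."))     # main street
print(standardize_company("Acme Inc."))   # acme
```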

3: Algorithms

Here you will use an algorithm as part of the comparison. Edit distance algorithms, as we know them from autocorrection, are popular here. A frequently used one is the Levenshtein distance. But there are plenty out there to choose from, each with their pros and cons.

Many data matching tools simply let you choose one of these algorithms for each scenario.
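
For illustration, here is a minimal sketch of the classic dynamic-programming Levenshtein distance, together with a normalized similarity that can be held against a threshold.

```python
# A minimal sketch of level 3: Levenshtein edit distance, computed
# row by row with dynamic programming.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """1.0 means identical; 0.0 means completely different."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("Jonathan", "Jonothan"))           # 1
print(round(similarity("Jonathan", "Jonothan"), 2))  # 0.88
```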

4: Combined traditional

If your DIY approach didn’t stop at encompassing more and more synonyms, it will probably be here that you realize that the further quest for raising efficiency involves combining several methodologies and applying algorithms dynamically in combination.

A small selection of commercial data matching tools and embedded capabilities can do that for you, so you avoid reinventing the wheel one more time.

This will yield high efficiency, but not perfection.
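
A minimal sketch of such combined utilization could look like the below, where several field comparisons are weighted into one score with decision bands. The field names, weights and thresholds are illustrative assumptions, not taken from any specific tool.

```python
# A minimal sketch of level 4: several comparison methods combined
# into one weighted score with decision bands.

from difflib import SequenceMatcher

WEIGHTS = {"name": 0.5, "street": 0.3, "city": 0.2}  # illustrative weights

def field_similarity(a: str, b: str) -> float:
    """Similarity of two field values in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def combined_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of per-field similarities."""
    return sum(w * field_similarity(rec_a[f], rec_b[f])
               for f, w in WEIGHTS.items())

def decide(score: float) -> str:
    """Map a score to a decision band; thresholds are illustrative."""
    if score >= 0.90:
        return "auto-match"
    if score >= 0.70:
        return "clerical review"
    return "non-match"

a = {"name": "Jonathan Smith", "street": "12 Main Street", "city": "Springfield"}
b = {"name": "Jonothan Smyth", "street": "12 Main St", "city": "Springfield"}
score = combined_score(a, b)
print(round(score, 2), decide(score))  # prints the score and decision band
```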

5: AI Enabled

Using Artificial Intelligence (AI) in data matching has been practiced for decades, as told in the post The Art in Data Matching. With the general rise of AI in recent years there is renewed interest, both at tool vendors and among users of data matching, in industrializing this.

The results out there are still sparse. With limited training of models, it can be less efficient than traditional methodology. However, it can certainly also narrow the gap between traditional efficiency and perfection.
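
As an illustration of the learned approach, the sketch below trains a simple classifier on similarity features computed for labelled record pairs. The toy training data and the feature set are illustrative assumptions, not a recipe from any specific vendor.

```python
# A minimal sketch of level 5: instead of hand-tuned rules, train a
# classifier on similarity features for labelled duplicate / non-duplicate pairs.

from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def features(a: dict, b: dict) -> list[float]:
    """Per-field similarity features for a record pair."""
    return [SequenceMatcher(None, a["name"], b["name"]).ratio(),
            SequenceMatcher(None, a["street"], b["street"]).ratio()]

# Labelled pairs: 1 = confirmed duplicate, 0 = confirmed non-duplicate.
training_pairs = [
    (({"name": "jonathan smith", "street": "12 main st"},
      {"name": "jonothan smyth", "street": "12 main street"}), 1),
    (({"name": "jonathan smith", "street": "12 main st"},
      {"name": "mary jones", "street": "98 high road"}), 0),
    (({"name": "acme corp", "street": "1 factory lane"},
      {"name": "acme corporation", "street": "1 factory lane"}), 1),
    (({"name": "acme corp", "street": "1 factory lane"},
      {"name": "globex", "street": "42 elm avenue"}), 0),
]

X = [features(a, b) for (a, b), _ in training_pairs]
y = [label for _, label in training_pairs]
model = LogisticRegression().fit(X, y)

candidate = ({"name": "jon smith", "street": "12 main st."},
             {"name": "jonathan smith", "street": "12 main street"})
print(model.predict_proba([features(*candidate)])[0][1])  # match probability
```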

More on Data Matching

There is of course much more to data matching than comparing duplicate candidates. Learn some more about The Art of Data Matching.

And what to do when a duplicate is identified is another story. This is examined in the post Three Master Data Survivorship Approaches.

MDM and Knowledge Graph

As examined in a previous post with the title Data Fabric and Master Data Management, the use of the knowledge graph approach is on the rise.

Utilizing a knowledge graph overlaps with Master Data Management (MDM).

If we go back 10 years, MDM and Data Quality Management had a small niche discipline called (among other things) entity resolution, as explored in the post Non-Obvious Entity Relationship Awareness. Its aim was the same as what today can be delivered at a much larger scale using knowledge graph technology.

During the past decade there have been examples of using graph technology for MDM, as mentioned in the post Takeaways from MDM Summit Europe 2016. However, most attempts to combine MDM and graph have been about visualizing the relationships in MDM using a graph presentation.

When utilizing knowledge graph approaches you will be able to detect many more relationships than those currently managed in MDM. This fact is the foundation for a successful co-existence between MDM and knowledge graph, with these synergies:

  • MDM hubs can enrich the knowledge graph with proven descriptions of the entities that are the nodes (vertices) in the knowledge graph.
  • Additional detected relationships (edges) and entities (nodes) from the knowledge graph that are of operational and/or general analytic interest enterprise-wide can be proven and managed in MDM.

In this way you can create new business benefits from both MDM and knowledge graph, as sketched below.
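
A minimal sketch of the two synergies, using the open source networkx library as a stand-in for a knowledge graph store, could look like this. The node and edge attributes are illustrative assumptions.

```python
# A minimal sketch of the MDM / knowledge graph synergies.

import networkx as nx

graph = nx.DiGraph()

# Synergy 1: enrich the graph with proven entity descriptions (golden
# records) from the MDM hub as nodes.
mdm_golden_records = {
    "party:42": {"name": "Acme Corporation", "country": "GB"},
    "party:43": {"name": "Globex Ltd", "country": "DK"},
}
for node_id, attributes in mdm_golden_records.items():
    graph.add_node(node_id, **attributes, source="mdm_hub")

# A relationship detected in the graph but not yet managed in MDM.
graph.add_edge("party:42", "party:43", relation="supplies",
               source="graph_detected")

# Synergy 2: promote detected relationships of enterprise-wide interest
# back to MDM for verification and ongoing stewardship.
candidates_for_mdm = [
    (u, v, data) for u, v, data in graph.edges(data=True)
    if data.get("source") == "graph_detected"
]
print(candidates_for_mdm)
```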

The Disruptive MDM/PIM/DQM List 2022: Datactics

A major rework of The Disruptive MDM/PIM/DQM List is in the making, as the number of visitors keeps increasing and so does the number of requests for individual solution lists.

It is good to see that some of the most innovative solution providers commit to being part of the list next year as well.

One of those is Datactics.

Datactics is a veteran data quality solution provider who is constantly innovating in this space. This year Datactics was one of the rare new entries in The Gartner Magic Quadrant for Data Quality Solutions 2021.

It will be exciting to follow the ongoing development at Datactics, who is operating under the slogan: “Democratising Data Quality”.

You can learn more about what their self-service data quality and matching solution looks like here.

Core Datactics capabilities

Deduplication as Part of MDM

A core intersection between Data Quality Management (DQM) and Master Data Management (MDM) is deduplication. The process here will basically involve:

  • Match master data records across the enterprise application landscape, where these records describe the same real-world entity, most frequently a person, organization, product or asset.
  • Link the master data records in the best fit / achievable way, for example as a golden record, as sketched below.
  • Apply the master data records / golden record to a hierarchy.
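
As a minimal illustration of the linking step, the sketch below groups pairwise match results into duplicate groups with a union-find structure. The record IDs and match pairs are illustrative assumptions, with the match results assumed to come from a preceding matching step.

```python
# A minimal sketch of linking: pairwise match results grouped into
# duplicate groups using union-find with path compression.

def find(parent: dict, x: str) -> str:
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def duplicate_groups(record_ids: list[str],
                     matched_pairs: list[tuple[str, str]]) -> list[list[str]]:
    parent = {r: r for r in record_ids}
    for a, b in matched_pairs:
        parent[find(parent, a)] = find(parent, b)  # union the two groups
    groups: dict[str, list[str]] = {}
    for r in record_ids:
        groups.setdefault(find(parent, r), []).append(r)
    return [g for g in groups.values() if len(g) > 1]

records = ["crm:1", "crm:2", "erp:7", "erp:8", "web:3"]
matches = [("crm:1", "erp:7"), ("erp:7", "web:3")]
print(duplicate_groups(records, matches))  # [['crm:1', 'erp:7', 'web:3']]
```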

Data Matching

The classic data matching quest is to identify data records that refer to the same person, being an existing and/or prospective customer. The first solutions for doing that emerged more than 40 years ago. Since then, the more difficult task of identifying the same organization, being a customer, prospective customer, vendor/supplier or other business partner, has been taken on, and solutions for identifying products as being the same have been deployed as well.

Besides using data matching to detect internal duplicates within an enterprise, data matching has also been used to match against external registries. Doing this serves as a means to enrich internal records, while it also helps in identifying internal duplicates.

Master Data Survivorship

When two or more data records have been confirmed as duplicates there are various ways to deal with the result.

In the registry MDM style, you only store the IDs of the linked records, so the linkage can be used for specific operational and analytic purposes in source and target applications.

Further, there are more advanced ways of using the linkage as described in the post Three Master Data Survivorship Approaches.

One relatively simple approach is to choose the best fit record as the survivor in the MDM hub and then keep the IDs of the records purged from the MDM hub as links back to the sourced application records.

Probably the most used approach is to form a golden record from the best fit data elements, store this compiled record in the MDM hub, and keep the IDs of the linked records from the sourced applications.

A third way is to keep the sourced records in the MDM hub and compile a golden view on the fly for a given purpose.
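
A minimal sketch of the golden record approach could look like the below. The survivorship rule used here (prefer the most recently updated non-empty value) is one illustrative choice among many, and the field names are assumptions.

```python
# A minimal sketch of survivorship: compile a golden record from best-fit
# data elements while keeping the IDs of the contributing source records.

from datetime import date

def compile_golden_record(records: list[dict]) -> dict:
    fields = {k for r in records for k in r if k not in ("id", "updated")}
    golden = {"source_ids": [r["id"] for r in records]}
    for field in sorted(fields):
        candidates = [r for r in records if r.get(field)]  # non-empty values
        if candidates:
            # Illustrative rule: the most recently updated value survives.
            best = max(candidates, key=lambda r: r["updated"])
            golden[field] = best[field]
    return golden

records = [
    {"id": "crm:1", "updated": date(2021, 3, 1),
     "name": "Jonathan Smith", "phone": ""},
    {"id": "erp:7", "updated": date(2021, 9, 15),
     "name": "Jon Smith", "phone": "+45 1234 5678"},
]
print(compile_golden_record(records))
# {'source_ids': ['crm:1', 'erp:7'], 'name': 'Jon Smith', 'phone': '+45 1234 5678'}
```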

Hierarchy Management

When you inspect records identified as duplicate candidates, you will often have to decide if they describe the same real-world entity or two real-world entities belonging to the same hierarchy.

Instead of throwing away the latter result, this link can be stored in the MDM hub as a relation in a hierarchy (or graph) and thus support a broader range of operational and analytic purposes.

The main hierarchies in play here are described in the post Are These Familiar Hierarchies in Your MDM / PIM / DQM Solution?

Family consumer citizen

With persons in private roles, a classic challenge is to distinguish between the individual person, a household with a shared economy, and people who happen to live at the same postal address. The location hierarchy plays a role in solving this case. This quest includes having precise addresses when identifying units in large buildings, and knowing the kind of building. The probability of two John Smith records being the same person differs between a single-family house address and the address of a nursing home.

Family company

Organizations can belong to a company family tree. A basic representation, for example used in the Dun & Bradstreet Worldbase, is having branches at a postal address. These branches belong to a legal entity with a headquarters at a given postal address, where there may be other individual branches too. Each legal entity in an enterprise may have a national ultimate mother. In multinational enterprises, there is also a global ultimate mother. Public organizations have similar, often very complex, trees.

Product hierarchy

Products also form hierarchies. The challenge is to identify if a given product record points to a certain level in the bottom part of a given product hierarchy. Products can have variants in size, colour and more. A product can be packed in different ways. The most prominent product identifier is the Global Trade Item Number (GTIN), which occurs in various representations, for example the Universal Product Code (UPC) popular in North America and the European (now International) Article Number (EAN) popular in Europe. These identifiers are applied by each producer (and in some cases distributor) at the product packaging variant level.
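
As a small worked example, the sketch below validates a GTIN (EAN/UPC) check digit using the standard modulo-10 scheme, where body digits are weighted 3 and 1 alternately, starting with 3 at the rightmost body digit.

```python
# A minimal sketch of GTIN check digit validation (works for GTIN-8,
# UPC-A/GTIN-12, EAN-13/GTIN-13 and GTIN-14).

def gtin_check_digit(body: str) -> int:
    """Check digit for the GTIN body (all digits except the last)."""
    total = sum(int(d) * (3 if i % 2 == 0 else 1)
                for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10

def is_valid_gtin(gtin: str) -> bool:
    """True when the last digit matches the computed check digit."""
    return gtin.isdigit() and gtin_check_digit(gtin[:-1]) == int(gtin[-1])

print(is_valid_gtin("4006381333931"))  # True (a valid EAN-13)
print(is_valid_gtin("4006381333932"))  # False
```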

Solutions Available

When looking for a solution to support you in this conundrum the best fit for you may be a best-of-breed Data Quality Management (DQM) tool and/or a capable Master Data Management (MDM) platform.

The Disruptive MDM / PIM / DQM List has the most innovative candidates here.

Welcome Winpure on The Disruptive MDM / PIM / DQM List

There is a new kid on the block on The Disruptive MDM / PIM / DQM List. Well, Winpure is not a new solution at all. It is a veteran tool in the data matching space.

Recently the folks at Winpure have embarked on a journey to take best-of-breed data matching into the contextual MDM world.

Data matching is often part of Master Data Management implementations, not least when the party domain (customers, suppliers, other business partners) is encompassed.

However, it is not always the best approach to utilize the data matching capabilities in MDM platforms. In some cases, these are not very effective. In other cases, the matching is needed before data is loaded into the MDM platform. And then many MDM initiatives do not include an MDM platform at all, but rely on capabilities in ERP and CRM applications.

Here, there is a need for a contextual MDM component with strong data matching capabilities, such as Winpure.

Learn more about Winpure here.

MDM / PIM / DQM Resources

Last week The Resource List went live on The Disruptive MDM / PIM / DQM List.

The rationale behind the scaling up of this site is explained in the article Preaching Beyond the Choir in the MDM / PIM / DQM Space.

These months are in general a yearly peak for conferences, written content and webinars from solution and service providers in the Master Data Management (MDM), Product Information Management (PIM) and Data Quality Management (DQM) space. With the COVID-19 crisis, conferences are postponed, and the content providers are therefore now ramping up the online channel.

For anyone who would like to read, listen to and/or watch relevant content, it is hard to follow the stream of content being pushed from the individual providers every day.

The Resource List on this site is a compilation of white papers, ebooks, reports, podcasts, webinars and other content from potentially all the registered tool vendors and service providers on The Solution List and the coming service list. The Resource List is divided into sections by topic. Here you can get a quick overview of the content available within the themes that matter to you right now.

The list of content and topics is growing.

Check out The Resource List here.

PS: The next feature on the site is planned to be The Case Study List. Stay tuned.

No One MDM Solution Can Fully Satisfy All Current and Future Use Cases

The title of this post is taken from the Gartner Critical Capabilities for Master Data Management Solutions.

One implication of this observation is that, when selecting your solution, you will not be able to rely on a generic analyst ranking of solutions, as examined in the post Generic Ranking of Vendors versus an Individual Selection Service.

Selection Model

This is the reason for being of The Disruptive MDM / PIM / DQM List.

Another implication is that even the best fit MDM solution will not necessarily cover all your needs.

One example is within data matching, where I have found that the embedded solutions in MDM tools often have only limited capabilities. To solve this, there are best-of-breed data matching solutions on the market able to supplement the MDM solutions.

Another example close to me is within multienterprise (business ecosystem wide) MDM, as MDM solutions are focused on each given organization. Here your interaction with a trading partner, and the interaction by the trading partner with you, can be streamlined with a solution like Product Data Lake.

What is a Golden Record?

The term golden record is a core concept within Master Data Management (MDM) and Data Quality Management (DQM). A golden record is a representation of a real-world entity that may be compiled from multiple different representations of that entity in a single database or in multiple databases within the enterprise system landscape.

A golden record is optimized towards meeting data quality dimensions such as:

  • Being a unique representation of the real-world entity described
  • Having a complete description of that entity covering all purposes of use in the enterprise
  • Holding the most current and accurate data values for the entity described
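
As a minimal illustration of these three dimensions, a golden record structure could carry one identity, a consolidated description, a crosswalk to the contributing source records and currency metadata. The structure below is an illustrative assumption, not a standard.

```python
# A minimal sketch of a golden record structure supporting the three
# data quality dimensions listed above.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class GoldenRecord:
    golden_id: str              # uniqueness: one ID per real-world entity
    attributes: dict            # completeness: consolidated description
    source_record_ids: list     # lineage back to the contributing records
    last_verified: datetime     # currency of the held data values

record = GoldenRecord(
    golden_id="party:42",
    attributes={"name": "Acme Corporation", "country": "GB"},
    source_record_ids=["crm:1", "erp:7"],
    last_verified=datetime(2021, 11, 30),
)
print(record.golden_id, record.attributes["name"])
```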

In multidomain MDM we work with a range of different entity types such as party (with customer, supplier, employee and other roles), location, product and asset. The golden record concept applies to all of these entity types, but in slightly different ways.

Party Golden Record

Having a golden record that facilitates a single view of the customer is probably the best-known example of using the golden record concept. Managing customer records and dealing with duplicates of those is the most frequent data quality issue around.

If you are not able to prevent duplicate records from entering your MDM world, which is the best approach, then you have to apply data matching capabilities. When identifying a duplicate you must be able to intelligently merge any conflicting views into a golden record as examined in the post Three Master Data Survivorship Approaches.

To a lesser degree we see the same challenges in getting a single view of suppliers and, which is one of my favourite subjects, you will ultimately want a single view of any business partner, also where the same real-world entity has both customer, supplier and other roles in relation to your organization.

Location Golden Record

Having the same location represented only once, in a golden record, and applying any party, product and asset record, and ultimately golden record, to that location record may be seen as quite academic. Nevertheless, striving for that concept will solve many data quality conundrums.

Location management has different meanings and importance for different industries. One example is that a brewery does business with the legal entity (party) that owns a bar, café or restaurant. However, even when the owner of that place changes, which happens a lot, the brewery is still interested in being the brand served at that place. The brewery also wants to keep records of the logistics around that place and the historic volumes delivered there. Utility and insurance are other examples of industries where the location golden record (should) matter a lot.

Knowing the properties of a location also supports the party deduplication process. For example, if you have two records with the name “John Smith” at the same address, the probability of them representing the same real-world entity depends on whether that location is a single-family house or a nursing home.

Product Golden Record

Product Information Management (PIM) solutions became popular with the rise of multi-channel, where having the same representation of a product in offline and online channels is essential. The self-service approach in online sales also drove the requirement to manage many more product attributes than seen before, which again points to handling the product entity centrally.

In large organizations with many business units around the world, you struggle with reconciling a local view and a global view of products. A given product may be a finished product to one unit but a raw material to another unit. Even a global SAP rollout will usually not clarify this – rather the contrary.

While third-party reference data helps a lot with handling golden records for party and location, this is less the case for product master data. Classification systems and data pools do exist, but they will certainly not take you all the way. With product master data we must, in my eyes, rely more on second-party master data, meaning sharing product master data within the business ecosystems where you operate.

Asset (or Thing) Golden Record

In asset master data management you also have different purposes where having a single view of a real-world asset helps a lot. There are notably financial purposes and logistic purposes that have to be aligned, but also many other purposes depending on the industry and the type of asset.

With the rise of the Internet of Things (IoT) we will have to manage many more assets (or things) than we usually have considered. When a thing (a machine, a vehicle, an appliance) becomes intelligent and starts producing big data, master data management, and indeed multidomain master data management, becomes imperative.

You will want to know a lot about the product model of the thing in order to make sense of the produced big data. For that, you need the product (model) golden record. You will want deep knowledge of the location of the thing over time. You cannot get that without location golden records. You will want to know the different party roles related to the thing over time: the owner, the operator, the maintainer. If you want to avoid chaos, you need party golden records.

Data Matching and Deduplication

The two terms data matching and deduplication are often used synonymously.

In the data quality world, deduplication is used to describe a process where two or more data records that describe the same real-world entity are merged into one golden record. This can be executed in different ways, as told in the post Three Master Data Survivorship Approaches.

Data matching can be seen as an overarching discipline to deduplication: it is used to identify the duplicate candidates in deduplication. Data matching can also be used to identify matching data records between internal and external data sources, as examined in the post Third-Party Data Enrichment in MDM and DQM.

As an end-user organization you can implement data matching / deduplication technology from either pure play Data Quality Management (DQM) solution providers or through data management suites and Master Data Management (MDM) solutions as reported in the post DQM Tools In and Around MDM Tools.

When matching internal data records against external sources one often used approach is utilizing the data matching capabilities at the third-party data provider. Such providers as Dun & Bradstreet (D&B), Experian and others offer this service in addition to offering the third-party data.

To close the circle, end-user organizations can use the external data matching result to improve internal deduplication and more. One example is to apply matched DUNS numbers from D&B to company records as a strong duplicate candidate selection criterion. In addition, such data matching results may often result not in a deduplication, but in building hierarchies of master data.
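
A minimal sketch of that candidate selection could look like the below, where records sharing the same matched DUNS number become duplicate candidates directly. The DUNS numbers and record IDs shown are illustrative.

```python
# A minimal sketch of candidate selection on an enriched identifier:
# records with the same matched DUNS number are grouped as duplicate candidates.

from collections import defaultdict

def candidates_by_duns(records: list[dict]) -> list[list[dict]]:
    buckets = defaultdict(list)
    for record in records:
        duns = record.get("duns_number")
        if duns:                      # skip records without an enrichment match
            buckets[duns].append(record)
    return [group for group in buckets.values() if len(group) > 1]

records = [
    {"id": "crm:1", "name": "Acme Corp", "duns_number": "150483782"},
    {"id": "erp:7", "name": "Acme Corporation Ltd", "duns_number": "150483782"},
    {"id": "crm:2", "name": "Globex", "duns_number": "804735132"},
]
for group in candidates_by_duns(records):
    print([r["id"] for r in group])   # ['crm:1', 'erp:7']
```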
