Deduplication as Part of MDM

A core intersection between Data Quality Management (DQM) and Master Data Management (MDM) is deduplication. The process here will basically involve:

  • Match master data records across the enterprise application landscape, where these records describe the same real-world entity, most frequently a person, organization, product or asset.
  • Link the master data records in the best fit / achievable way, for example as a golden record.
  • Apply the master data records / golden record to a hierarchy.

Data Matching

The classic data matching quest is to identify data records that refer to the same person, whether an existing or a prospective customer. The first solutions for doing that emerged more than 40 years ago. Since then, the more difficult task of identifying the same organization as a customer, prospective customer, vendor/supplier or other business partner has been implemented, and solutions for identifying products as being the same have been deployed as well.

Besides detecting internal duplicates within an enterprise, data matching has also been used to match against external registries. Doing this serves as a means to enrich internal records and also helps in identifying internal duplicates.

Master Data Survivorship

When two or more data records have been confirmed as duplicates there are various ways to deal with the result.

In the registry MDM style, you will only store the IDs between the linked records so the linkage can be used for specific operational and analytic purposes in source and target applications.

Further, there are more advanced ways of using the linkage as described in the post Three Master Data Survivorship Approaches.

One relatively simple approach is to choose the best fit record as the survivor in the MDM hub and then keep the IDs of the MDM purged records as a link back to the sourced application records.

Probably the most used approach is to form a golden record from the best fit data elements, store this compiled record in the MDM hub and keep the IDs of the linked records from the sourced applications.

A third way is to keep the sourced records in the MDM hub and on the fly compile a golden view for a given purpose.
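As a minimal sketch of the golden record approach, the snippet below compiles one record from the best fit data elements and keeps the IDs of the linked source records. The field names, the sample records and the "most recently updated non-empty value wins" rule are assumptions made for illustration; real survivorship rules are usually richer and defined per attribute.

```python
from datetime import date

# Hypothetical source records that have already been confirmed as duplicates.
records = [
    {"source": "CRM", "id": "C-17", "name": "John Smith", "phone": None,
     "email": "js@example.com", "updated": date(2021, 3, 1)},
    {"source": "ERP", "id": "E-204", "name": "J. Smith", "phone": "555-0100",
     "email": None, "updated": date(2023, 6, 12)},
]


def golden_record(duplicates, attributes):
    """Pick, per attribute, the most recently updated non-empty value."""
    newest_first = sorted(duplicates, key=lambda r: r["updated"], reverse=True)
    golden = {}
    for attr in attributes:
        golden[attr] = next((r[attr] for r in newest_first if r[attr]), None)
    # Keep the source IDs so the link back to the applications survives.
    golden["linked_ids"] = [(r["source"], r["id"]) for r in duplicates]
    return golden


golden = golden_record(records, ["name", "phone", "email"])
print(golden)
```

Running the same function on the fly, instead of storing its result, would correspond to the third approach of compiling a golden view for a given purpose.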

Hierarchy Management

When you inspect records identified as duplicate candidates, you will often have to decide if they describe the same real-world entity or if they describe two real-world entities belonging to the same hierarchy.

Instead of throwing away the latter result, this link can be stored in the MDM hub as well as a relation in a hierarchy (or graph) and thus support a broader range of operational and analytic purposes.

The main hierarchies in play here are described in the post Are These Familiar Hierarchies in Your MDM / PIM / DQM Solution?

Family consumer citizen

With persons in private roles a classic challenge is to distinguish between the individual person, a household with a shared economy and people who happen to live at the same postal address. The location hierarchy plays a role in solving this case. This quest includes having precise addresses when identifying units in large buildings and knowing the kind of building. The probability of two John Smith records being the same person differs if it is a single-family house address or the address of a nursing home.

Family company

Organizations can belong to a company family tree. A basic representation, used for example in the Dun & Bradstreet Worldbase, is having branches at a postal address. These branches belong to a legal entity with a headquarters at a given postal address, where there may be other individual branches too. Each legal entity in an enterprise may have a national ultimate mother. In multinational enterprises, there is a global ultimate mother. Public organizations have similar, often very complex, trees.

Product hierarchy

Products are also formed in hierarchies. The challenge is to identify if a given product record points to a certain level in the bottom part of a given product hierarchy. Products can have variants in size, colour and more. A product can be packed in different ways. The most prominent product identifier is the Global Trade Item Number (GTIN), which occurs in various representations, for example the Universal Product Code (UPC) popular in North America and the European (now International) Article Number (EAN) popular in Europe. These identifiers are applied by each producer (and in some cases distributor) at the product packing variant level.
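Because a UPC and an EAN are representations of the same GTIN, product matching often starts with bringing the identifiers into one form. The sketch below validates the standard GTIN check digit (alternating weights 3 and 1 from the right) and pads the code to 14 digits. The function names are my own, chosen for illustration.

```python
def gtin_check_digit(body: str) -> int:
    """Check digit for the digits that precede it (mod-10, weights 3 and 1)."""
    total = 0
    # Weights alternate 3, 1, 3, ... starting from the rightmost body digit.
    for i, ch in enumerate(reversed(body)):
        total += int(ch) * (3 if i % 2 == 0 else 1)
    return (10 - total % 10) % 10


def is_valid_gtin(code: str) -> bool:
    """True for an 8, 12, 13 or 14 digit code with a correct check digit."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13, 14):
        return False
    return gtin_check_digit(code[:-1]) == int(code[-1])


def to_gtin14(code: str) -> str:
    """Left-pad a valid UPC / EAN / GTIN with zeros to the 14 digit form."""
    if not is_valid_gtin(code):
        raise ValueError(f"not a valid GTIN: {code!r}")
    return code.zfill(14)


# A 12 digit UPC and the same number written as a 13 digit EAN meet in one key.
print(to_gtin14("036000291452"), to_gtin14("0036000291452"))
```

Padding with leading zeros does not change the check digit, which is why the two representations can be compared directly after normalization.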

Solutions Available

When looking for a solution to support you in this conundrum the best fit for you may be a best-of-breed Data Quality Management (DQM) tool and/or a capable Master Data Management (MDM) platform.

This Disruptive MDM / PIM / DQM List has the most innovative candidates here.

Data Matching and Deduplication

The two terms data matching and deduplication are often used synonymously.

In the data quality world, deduplication is used to describe a process where two or more data records that describe the same real-world entity are merged into one golden record. This can be executed in different ways as told in the post Three Master Data Survivorship Approaches.

Data matching can be seen as an overarching discipline to deduplication. Data matching is used to identify the duplicate candidates in deduplication. Data matching can also be used to identify matching data records between internal and external data sources as examined in the post Third-Party Data Enrichment in MDM and DQM.

As an end-user organization you can implement data matching / deduplication technology from either pure play Data Quality Management (DQM) solution providers or through data management suites and Master Data Management (MDM) solutions as reported in the post DQM Tools In and Around MDM Tools.

When matching internal data records against external sources one often used approach is utilizing the data matching capabilities at the third-party data provider. Such providers as Dun & Bradstreet (D&B), Experian and others offer this service in addition to offering the third-party data.

To close the circle, end-user organizations can use the external data matching result to improve the internal deduplication and more. One example is to apply matched D-U-N-S Numbers from D&B for company records as a strong deduplication candidate selection criterion. In addition, such data matching results may often result not in a deduplication, but in building hierarchies of master data.
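Using an externally matched identifier as a candidate selection criterion can be sketched as grouping records on that identifier and only proposing pairs within a group. The records and the nine digit numbers below are made up, and the field name `duns` is an assumption.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical company records after matching against a third-party directory.
records = [
    {"id": 1, "name": "Acme Corp", "duns": "123456789"},
    {"id": 2, "name": "ACME Corporation", "duns": "123456789"},
    {"id": 3, "name": "Acme Ltd", "duns": "987654321"},
    {"id": 4, "name": "Beta Inc", "duns": None},  # no external match found
]


def candidate_pairs_by_key(rows, key):
    """Propose duplicate candidate pairs among records sharing the same key."""
    blocks = defaultdict(list)
    for row in rows:
        if row.get(key):  # records without a matched key are left out
            blocks[row[key]].append(row["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


print(candidate_pairs_by_key(records, "duns"))
```

Records 1 and 3 carry different numbers here, so they are not proposed as duplicates, but they could still turn out to belong to the same company family.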


Are These Familiar Hierarchies in Your MDM / DQM / PIM Solution?

The term family is used in different contexts within Master Data Management (MDM), Data Quality Management (DQM) and Product Information Management (PIM) when working with hierarchy management and entity resolution.

Here are three frequent examples:

Consumer / citizen family

When handling party master data about consumers / citizens we can deal with the basic definition of a family, being a group consisting of two parents and their children living together as a unit.

This is used when the business scenario does not only target each individual person but also a household with a shared economy. When identifying a household, a common parameter is that the persons live at the same postal address (at the same time), while observing constellations such as:

  • Nuclear families consisting of a female and a male adult (and their children)
  • Rainbow families where the gender is not an issue
  • Extended families consisting of more than two generations
  • Persons who happen to live at the same postal address

There are multicultural aspects of these constellations, including the different family name constructions around the world and the varying frequency and acceptance of rainbow families as well as the frequency of extended families.

Company family tree

When handling party master data about companies / organizations, a valuable piece of information is how the companies / organizations are related, most commonly pictured as a company family tree with mothers and sisters. In theory, this tree can have an infinite number of levels. The basic levels are:

  • A global ultimate mother being the company that ultimately owns (fully or partly) a range of companies in several countries.
  • A national ultimate mother being the company that owns (fully or partly) a range of companies in a given country.
  • A legal entity being the basic registered company within a country having some form of a business entity identifier.
  • A branch owned by a legal entity and operating from a given postal / visiting address.

You can build your own company tree describing your customers, suppliers and other business partners. Alternatively, or as a supplement, you can rely on third-party business directories. It is worth noticing here that a national source will only go to the national ultimate mother level, while a global source can include the global ultimate mother and thus form larger families.
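The levels above can be sketched as a tree where each company points to its owner. The company IDs, names and the single `parent` link are assumptions for illustration; real ownership structures include partial and shared ownership, which a simple tree does not capture.

```python
# Hypothetical company family tree: each entry points to its owner.
companies = {
    "GU": {"name": "Global Holding Inc", "country": "US", "parent": None},
    "NU-DK": {"name": "Nordic Holding A/S", "country": "DK", "parent": "GU"},
    "LE-DK1": {"name": "Nordic Trading ApS", "country": "DK", "parent": "NU-DK"},
    "BR-1": {"name": "Nordic Trading, Aarhus", "country": "DK", "parent": "LE-DK1"},
}


def ancestors(company_id, tree):
    """Owners of a company from nearest to topmost, guarding against cycles."""
    chain, seen = [], {company_id}
    current = tree[company_id]["parent"]
    while current is not None:
        if current in seen:
            raise ValueError(f"ownership cycle at {current!r}")
        seen.add(current)
        chain.append(current)
        current = tree[current]["parent"]
    return chain


def global_ultimate(company_id, tree):
    """Topmost owner in the tree, or the company itself if it has no owner."""
    chain = ancestors(company_id, tree)
    return chain[-1] if chain else company_id


def national_ultimate(company_id, tree):
    """Topmost owner reached before the ownership chain leaves the country."""
    country = tree[company_id]["country"]
    result = company_id
    for owner in ancestors(company_id, tree):
        if tree[owner]["country"] != country:
            break
        result = owner
    return result


print(national_ultimate("BR-1", companies), global_ultimate("BR-1", companies))
```

A national source would stop at what `national_ultimate` returns, while a global source would also hold the link that `global_ultimate` follows.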

Having a company family view in your master data repository is a valuable information asset within credit risk, supply risk, discount opportunities, cross-selling and more.

Product family

The term “product family” is often used to define a level in a homegrown product classification / product grouping scheme. It is used to define a level that can have levels above and levels below with other terms as “product line”, “product category”, “product class”, “product group”, “product type” and more.

Sometimes it is also used as a term to define a product with a family of variants below, where variants are the same product produced and kept in stock in different colours, sizes and more.

Read more about Stock Keeping Units (SKUs), product variants, product identification and product classification in the post Five Product Information Management Core Aspects.

Top 15 MDM / PIM Requirements in RFPs

A Request for Proposal (RFP) process for a Master Data Management (MDM) and/or Product Information Management (PIM) solution has a hard fact side, and there are also The Soft Sides of MDM and PIM RFPs.

The hard fact side is the detailed requirements a potential vendor has to answer, in most cases in the Excel sheet the buying organization has prepared – often with extensive help from a consultancy.

Here are what I have seen as the most frequently included topics for the hard facts in such RFPs:

  • MDM and PIM: Does the solution have functionality for hierarchy management?
  • MDM and PIM: Does the solution have workflow management included?
  • MDM and PIM: Does the solution support versioning of master data / product information?
  • MDM and PIM: Does the solution allow tailoring the data model in a flexible way?
  • MDM and PIM: Does the solution handle master data / product information in multiple languages / character sets / script systems?
  • MDM and PIM: Does the solution have capabilities for (high speed) batch import / export and real-time integration (APIs)?
  • MDM and PIM: Does the solution have capabilities within data governance / data stewardship?
  • MDM and PIM: Does the solution integrate with “a specific application”? – most commonly SAP, MS CRM/ERPs, Salesforce?
  • MDM: Does the solution handle multiple domains, for example customer, vendor/supplier, employee, product and asset?
  • MDM: Does the solution provide data matching / deduplication functionality and formation of golden records?
  • MDM: Does the solution have integration with third-party data providers for example business directories (Dun & Bradstreet / National registries) and address verification services?
  • MDM: Does the solution underpin compliance rules as for example data privacy and data protection regulations as in GDPR / other regimes?
  • PIM: Does the solution support product classification and attribution standards as eClass, ETIM (or other industry specific / national standards)?
  • PIM: Does the solution support publishing to popular marketplaces (form of outgoing Product Data Syndication)?
  • PIM: Does the solution have a functionality to ease collection of product information from suppliers (incoming Product Data Syndication)?

Learn more about how I can help in the blog page about MDM / PIM Tool Selection Consultancy.


What Will you Complicate in the Year of the Rooster?

Today is the first day in the new year. The year of the rooster according to the Lunar Calendar observed in East Asia. One of the characteristics of the year of the rooster is that in this year, people will tend to complicate things.

People usually like to keep things simple. The KISS principle – Keep It Simple, Stupid – has many fans. But not me. Not that I do not like to keep things simple. I do. But only as simple as it should be, as Einstein probably said. Sometimes KISS is the shortcut to getting it all wrong.

When working with data quality I have come across the three below examples of striking the right balance in making things a bit complicated and not too simple:

Deduplication

One of the most frequent data quality issues around is duplicates in party master data. Customer, supplier, patient, citizen, member and many other roles of legal entities and natural persons, where the real-world entity is described more than once with different values in our databases.

In solving this challenge, we can use methods such as match codes and edit distance to detect duplicates. However, these methods, often called deterministic, are far too simple to really automate the remedy. We can also use advanced probabilistic methods. These methods are better, but have the downside that the matching done is hard to explain, repeat and reuse in other contexts.

My best experience is to use something in between these approaches. Not too simple and not too overcomplicated.
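To make the simple end of the spectrum concrete, here is a sketch of a match code and an edit distance, combined into one rule. The match code recipe (uppercase letters only, vowels dropped after the first letter) and the distance threshold of 2 are assumptions picked for illustration, not a recommended configuration.

```python
import re


def match_code(name: str) -> str:
    """Crude match code: uppercase letters only, vowels dropped after the first."""
    letters = re.sub(r"[^A-Z]", "", name.upper())
    return letters[:1] + re.sub(r"[AEIOUY]", "", letters[1:])


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


def is_duplicate_candidate(a: str, b: str, max_distance: int = 2) -> bool:
    """Equal match codes, or names within a small edit distance."""
    if match_code(a) == match_code(b):
        return True
    return edit_distance(a.upper(), b.upper()) <= max_distance


print(is_duplicate_candidate("Jon Smith", "John Smith"))
```

Both rules are easy to explain and repeat, which is their strength. They also miss swapped name parts, nicknames and moved addresses, which is where the need for something less simple starts.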

Address verification

You can make a good algorithm to perform verification of postal and visit addresses in a database for addresses coming from one country. However, if you try the same algorithm on addresses from another country, it often fails miserably.

Making an algorithm for addresses from all over the world will be very complicated. I have not seen one yet, that works.

My best experience is to accept the complication of having almost as many algorithms as there are countries on this planet.
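Accepting one algorithm per country can be sketched as a registry keyed by country code. The example below only checks the shape of a postal code for two countries, which is a tiny part of real address verification; the registry name and function name are assumptions for illustration.

```python
import re

# Simplified per-country rules; only the postal code shape is checked here.
POSTAL_CODE_PATTERNS = {
    "DK": re.compile(r"\d{4}"),           # four digits
    "US": re.compile(r"\d{5}(-\d{4})?"),  # ZIP or ZIP+4
}


def postal_code_looks_valid(country: str, postal_code: str) -> bool:
    """Apply the rule registered for the country; fail loudly if none exists."""
    pattern = POSTAL_CODE_PATTERNS.get(country)
    if pattern is None:
        raise LookupError(f"no rule registered for country {country!r}")
    return pattern.fullmatch(postal_code.strip()) is not None


print(postal_code_looks_valid("DK", "2100"), postal_code_looks_valid("US", "2100"))
```

The same value passes for one country and fails for another, which is the point of keeping the rules separate.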

Product classification

Classification of products controls a lot of the data quality dimensions related to product master data. The most prominent example is completeness of product information. Whether you have complete product information is dependent on the classification of the product. Some attributes will be mandatory for one product but make no sense at all for another product with a different classification.
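A minimal sketch of completeness measured against the classification could look like this. The class names and the mandatory attributes are invented for the example.

```python
# Hypothetical classification: mandatory attributes per product class.
MANDATORY = {
    "power_tool": ["voltage", "weight_kg", "length_mm"],
    "paint": ["colour", "volume_l"],
}


def completeness(product: dict) -> float:
    """Share of the attributes mandatory for the product's class that are filled."""
    required = MANDATORY[product["class"]]
    filled = sum(1 for attr in required if product.get(attr) not in (None, ""))
    return filled / len(required)


drill = {"class": "power_tool", "voltage": "18V", "weight_kg": 1.4}
paint = {"class": "paint", "colour": "white", "volume_l": 2.5}
print(completeness(drill), completeness(paint))
```

The paint record is complete even though it has no voltage, because voltage makes no sense for that class.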

If your product classification is too simple, your completeness measurement will not be realistic. An overly granular or otherwise complicated classification system is very hard to maintain and will probably seem like overkill for many purposes of product master data management.

My best experience is that you have to maintain several classification systems and have a linking between them, both inside your organization and between your trading partners.

Happy New Lunar Year

Visiting the Product Information Castle

Kronborg Castle

If you have ever visited some of the many castles around in Europe you may have noticed that there are many architectural similarities. You may also compare these basic structures of a castle with how we can imagine the data architecture related to Product Information Management (PIM).

In my vision of a product information castle there is a main building with five floors of product information. There is a basement for pricing information where we often will find the valuable things as the crown jewels and other treasures. The hierarchy tower combines the pricing information and the different levels of product information. Besides the main castle, we find the logistic stables.

Hierarchy, pricing and logistics are part of the whole picture

What we do not see on this figure is the product lifecycle management wall around the castle area.

Now, let us get back to the main building and examine what is on each of the floors in the building.

Ground PIM level: Basic product data

On the ground level, we find the basic product data that typically is the minimum required for creating a product in any system of record. Here we find the primary product identification number or code that is the internal key to all other product data structures and transactions related to the product. Then there usually is a short product description. This description helps internal employees identify a product and distinguish that product from other products. If an upstream trading partner produces the product, we may find the identification of that supplier here. If the product is part of internal production, we may have a material type telling whether it is a raw material, semi-finished product, finished good or packing material.

Except for semi-finished products, we may find more things on the next floor.

PIM level 2: Product trade data

This level has product data related to trading the product. We may have a unique Global Trade Item Number (GTIN) that may be in the form of an International Article Number (EAN) or a Universal Product Code (UPC). Here we have commodity codes and a lot of other product data that supports buying, receiving, selling and delivering the product.

Most castles were not built in one go. Many castles started modestly with maybe just two floors and a tiny tower. In the same way, our product information management solutions for finished and trading goods usually are built on top of an older ERP solution holding the basic and trading data.

PIM Level 3: Basic product recognition data

On the third level, we find the two grand ballrooms of product information. These ballrooms were introduced when eCommerce started to grow up.

The extended product description is needed because the usual short product description used internally has no meaning to an outsider as told in the post Customer Friendly Product Master Data. Some good practices for governing the extended product description are to have a common structure for how the description is written, not to use abbreviations and to have a strict vocabulary as reported in the post Toilet Seats and Data Quality.

Having a product image is pivotal if you want to sell something without showing the real product face-to-face with the customer or other end user. A missing product image is a sign of a broken business process for collecting product data as pondered in the post Image Coming Soon.

PIM Level 4: Self-service product data

On the fourth level, we have three main chambers: Product attributes, basic product relations and standard digital assets. This data is the foundation of customer self-service and should, unless you are the manufacturer, be collected from the manufacturer via supplier self-service.

Product attributes are also sometimes called product properties or product features. These are up to thousands of different data elements that describe a product. Some are very common for most products like height, length, weight and colour. Some are very specific to the product category. This challenge is actually the reason for being for dedicated Product Information Management (PIM) solutions as told in the post MDM Tools Revealed.

Basic product relations are the links between a product and other products, like a product that has several different accessories that go with it or a product being a successor of another, now decommissioned, product.

Standard digital assets are documents like installation guides, line drawings and data sheets as examined in the post Digital Assets and Product MDM.

PIM Level 5: Competitive product data

On the upper fifth floor we find elements like on the fourth floor but usually these are elements that you won’t necessarily apply to all products but only to your top products where you want to stand out from the crowd and distance yourself from your competitors.

Special content is descriptions of and stories about the product above the hard features. Here you tell about why the product is better than other products and in which circumstances the product can be used. A common aim with these descriptions is also Search Engine Optimization (SEO).

X-sell (cross-sell) and up-sell product relations apply to your particular mix of products and may be made subjective, for example by looking at up-sell from a profit margin point of view. X-sell and up-sell relations may be defined upstream by you or your upstream trading partners, but they may also drip down on the roof from the behaviour of your downstream trading partners / customers, as manifested in the classic webshop message: “Those who bought product A also bought / looked at product B”.

Advanced digital assets are broader and more lively material than the hard fact line drawings and other documents. Increasingly, newer digital media types such as video are used for this purpose.

All in all the rooftop takes us to the upper side of the cloud.

Hohenzollern Castle in Southern Germany


Copy and Paste versus Inheritance within MDM

A commonly seen user requirement for Master Data Management (MDM) solutions is an ability to copy the content of the attributes of an existing entity when creating a new entity. For example, when creating a new product you may find it nice to copy all the field values from an existing similar product to the new product and then just change what is different for the new product. Just like using copy and paste in Excel or other so-called productivity tools.

We all know the dangers of copy and paste and there are plenty of horror stories out there of the harsh consequences like when copying and pasting in a job application and forgetting to change the name of the targeted employer. You know: “I have always dreamed about working for IBM” when applying at Oracle.

The exact same bad things may happen when doing copy and paste when working with master data. You may forget to change exactly that one important piece of information because you miss guidance on the copied data within your system of entry.

Using an inheritance approach is a better way. For product master data, this approach is based on having mature hierarchy management in place. When creating a new product you place your product in the hierarchy, where it will inherit the attributes common to products on the same branch of the hierarchy, leaving it for you to fill in the exact attributes that are specific to the new product. If a new product requires a new branch in the hierarchy, you are forced to think through the common attributes for this branch.
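The inheritance approach can be sketched as walking from a branch up to the root of the product hierarchy and collecting the attribute values defined along the way. The hierarchy, the attribute names and the rule that lower levels override higher ones are assumptions made for this example.

```python
# Hypothetical product hierarchy with attribute values set per branch.
hierarchy = {
    "tools": {"parent": None, "attributes": {"warranty_months": 24}},
    "power_tools": {"parent": "tools", "attributes": {"voltage": "230V"}},
    "drills": {"parent": "power_tools", "attributes": {"chuck_type": "keyless"}},
}


def inherited_attributes(node, tree):
    """Merge attributes from the root down to the given branch."""
    path = []
    while node is not None:
        path.append(node)
        node = tree[node]["parent"]
    merged = {}
    for name in reversed(path):  # root first, so lower levels override
        merged.update(tree[name]["attributes"])
    return merged


def create_product(node, tree, **specific):
    """New product: inherited values plus the attributes specific to it."""
    return {**inherited_attributes(node, tree), **specific}


drill = create_product("drills", hierarchy, sku="DR-100", weight_kg=1.4)
print(drill)
```

Unlike copy and paste, nothing is carried over from a sibling product, so there is no stale value to forget to change.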

For party (customer, supplier and other business partner) master data you may inherit from the outside world, taking advantage of fetching what is already digitalized, which includes names, addresses and other contact data, and leaving it for you to fill in the party master data that is specific to your way of doing business.


Will Graph Databases become Common in MDM?

One of my pet peeves in data quality for CRM and ERP systems is the often used way of looking at entities, not least party entities, in a flat data model as told in the post A Place in Time.

Party master data, and related location master data, will eventually be modeled in very complex models and surely we see more and more examples of that. For example, I remember that a long time ago I worked with the ERP system that later became Microsoft Dynamics AX. Then I had issues with the simplistic and not role aware data model. While I’m currently working in a project using the AX 2012 Address Book it’s good to see that things have certainly developed.

This blog has quite a few posts on hierarchy management in Master Data Management (MDM) and even Hierarchical Data Matching. But I have to admit that even complex relational data models and hierarchical approaches in fact don’t align completely with the real world.

In a comment to the post Five Flavors of Big Data, Mike Ferguson asked about graph data quality. In my eyes, using graph databases in master data management will indeed bring us closer to the real world and thereby deliver better data quality for master data.

I remember at this year’s MDM Summit Europe that Aaron Zornes suggested that a graph database will be the best choice for reflecting the most basic reference dataset being The Country List. Oh yes, and in master data too you should think then, though I doubt that the relational database and hierarchy management will be out of fashion for a while.

So it could be good to know if you have seen or worked with graph databases in master data management beyond representing a static analysis result as a graph database.

Wikipedia article on graph database


Matching for Multiple Purposes

In a recent post on the InfoTrellis blog we have the good old question in data matching about Deterministic Matching versus Probabilistic Matching.

The post has a good walk through on the topic and reaches this conclusion:

“So, which is better, Deterministic Matching or Probabilistic Matching?  The question should actually be: ‘Which is better for you, for your specific needs?’  Your specific needs may even call for a combination of the two methodologies instead of going purely with one.”

On a side note, the author of the post is MARIANITORRALBA. I had to use my combined probabilistic and deterministic in-word parsing supported and social media connected data matching capability to match this concatenated name with the LinkedIn profile of an InfoTrellis employee called Marian Itorralba.

This little exercise brings me to an observation about data matching: matching party master data, not least when you do this for several purposes, ultimately is identity resolution as discussed in the post The New Year in Identity Resolution.

For that we need what could be called hierarchical data matching.

The reason we need hierarchical data matching is that more and more organizations are looking into master data management and then they realize that the classic name and address matching rules do not necessarily fit when party master data are going to be used for multiple purposes. What constitutes a duplicate in one context, like sending a direct mail, doesn’t necessarily make a duplicate in another business function and vice versa. Duplicates come in hierarchies.

One example is a household. You probably don’t want to send two sets of the same material to a household, but you might want to engage in a 1-to-1 dialogue with the individual members. Another example is that you might do some very different kinds of business with the same legal entity. Financial risk management is the same, but different sales or purchase processes may require very different views.
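The household example can be sketched by grouping the same records at two levels. The records are made up and the keys are deliberately naive: a shared address stands in for a household here, although people who merely live at the same address would need further checks.

```python
from collections import defaultdict

# Hypothetical, already standardized party records.
records = [
    {"id": 1, "person": "john smith", "address": "1 main st"},
    {"id": 2, "person": "john smith", "address": "1 main st"},
    {"id": 3, "person": "mary smith", "address": "1 main st"},
    {"id": 4, "person": "ann jones", "address": "9 oak ave"},
]

# Which fields must agree for two records to be "the same" at each level.
LEVEL_KEYS = {
    "individual": ("person", "address"),
    "household": ("address",),
}


def group_at_level(rows, level):
    """Group record IDs that count as duplicates at the given hierarchy level."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[field] for field in LEVEL_KEYS[level])
        groups[key].append(row["id"])
    return sorted(groups.values())


print(group_at_level(records, "individual"))
print(group_at_level(records, "household"))
```

A direct mail run would use the household grouping, while a 1-to-1 dialogue would use the individual grouping over the very same records.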

This matter is discussed in the post, and not least in the comments to the post, called Hierarchical Data Matching.


The Country List

It’s the second day of the MDM Summit Europe 2013 in London today.

The last session I attended today was an expert panel on Reference Data Management (RDM).

I guess the list of countries on this planet is the prime example of what is reference data and today’s session provided no exception from that.

Even though a list of countries is fairly small and there shouldn’t be everyday changes to the list, maintaining a country list isn’t as simple as you would think.

First of all, official sources for a country list aren’t in agreement. The range of countries given an ISO code isn’t the same as the range of countries where for example the Universal Postal Union (UPU) says you can make a delivery.

Another example I have had some challenges with is that the D&B WorldBase (a large worldwide business directory) has four country codes for what is generally regarded as the United Kingdom, as the D&B country reference data probably was defined by a soccer fan recognizing the distinct national soccer teams of England, Wales, Scotland and Northern Ireland.

The expert panel moderator, Aaron Zornes, went as far as suggesting that a graph database may be the best technology for reflecting the complexity in reference data. Oh yes, and in master data too you should think then, though I doubt that the relational database and hierarchy management will be out of fashion for a while.
