Data matching is a sub-discipline within data quality management. Data matching is about establishing a link between data elements and entities that do not have the same value but refer to the same real-world construct. The most common example is establishing a link between two different data records that probably describe the same person, as in:
Bob Smith at 1 Main Str in Anytown
Robert Smith at One Main Street in Any Town
Data matching can be applied to other master data entity types such as companies, locations, products and more.
In the data matching world there have always been attempts to apply machine learning (or artificial intelligence if you like). This is because deterministic approaches usually result in too many false negatives: actual matching entities that the computer fails to find. Probabilistic / fuzzy logic approaches usually work better, but often not well enough.
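To make the difference concrete, here is a minimal Python sketch using only the standard library; the normalization and the 0.65 threshold are illustrative choices, not a recommendation:

```python
from difflib import SequenceMatcher

def normalize(record: str) -> str:
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(record.lower().split())

a = normalize("Bob Smith at 1 Main Str in Anytown")
b = normalize("Robert Smith at One Main Street in Any Town")

# Deterministic approach: exact equality misses the match (a false negative).
print(a == b)  # False

# Fuzzy approach: a similarity score lets us flag the probable match.
score = SequenceMatcher(None, a, b).ratio()
print(round(score, 2))   # about 0.83 for these two records
print(score > 0.65)      # True with this illustrative threshold
```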
One of my own attempts with machine learning was made within a solution at Dun & Bradstreet Nordic called GlobalMatchBox. One happy result of the machine learning capability was described in the post The Art in Data Matching.
In recent years I have embraced product master data and product data quality within my business activities. The pain points in handling product information do in some cases include matching product entities, but even more they are about matching the different taxonomies in use for product data, not least between trading partners in business ecosystems.
In software architecture, publish–subscribe is a messaging pattern where senders of messages, called publishers, do not program the messages to be sent directly to specific receivers, called subscribers, but instead categorize published messages into classes without knowledge of which subscribers, if any, there may be. Similarly, subscribers express interest in one or more classes and only receive messages that are of interest, without knowledge of which publishers, if any, there are.
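A minimal in-process sketch of the pattern could look like this in Python; the topic name and message shape are illustrative:

```python
from collections import defaultdict
from typing import Callable

class Broker:
    """Minimal in-process publish-subscribe broker: publishers and
    subscribers only know message classes (topics), never each other."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # The publisher does not know which subscribers, if any, there are.
        for handler in self._subscribers[topic]:
            handler(message)

broker = Broker()
broker.subscribe("product.updated", lambda msg: print("received:", msg))
broker.publish("product.updated", {"sku": "ABC-123", "name": "Widget"})
```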
This kind of thinking is behind the service called Product Data Lake that I am working with now. Whereas a publish-subscribe service is usually something that goes on behind the firewall of an enterprise, Product Data Lake takes this theme into the business ecosystem that exists between trading partners, as told in the post Product Data Syndication Freedom.
Therefore, a modification to the publish-subscribe concept in this context is that we actually do make it possible for publishers of product information and subscribers of product information to care a little about who sends and who receives the messages, as exemplified in the post Using a Business Entity Identifier from Day One. However, the scheme for that is a modern one resembling a social network, where partnerships are requested and accepted/rejected.
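A sketch of such a partnership handshake could look like the following; the names and statuses are illustrative assumptions, not the actual Product Data Lake API:

```python
from enum import Enum

class PartnershipStatus(Enum):
    REQUESTED = "requested"
    ACCEPTED = "accepted"
    REJECTED = "rejected"

class Partnership:
    """A social-network-style link between a publishing partner and a
    subscribing partner. Product information only flows once the
    requested partnership has been accepted."""

    def __init__(self, publisher: str, subscriber: str):
        self.publisher = publisher
        self.subscriber = subscriber
        self.status = PartnershipStatus.REQUESTED

    def accept(self) -> None:
        self.status = PartnershipStatus.ACCEPTED

    def reject(self) -> None:
        self.status = PartnershipStatus.REJECTED

    def allows_flow(self) -> bool:
        return self.status is PartnershipStatus.ACCEPTED

p = Partnership("Manufacturer A", "Merchant B")
p.accept()
print(p.allows_flow())  # True: messages may now flow between the partners
```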
As messages between global trading partners can be highly asynchronous, and as the taxonomies in use will often differ, there is a storage part in between. How this is implemented is examined in the post Product Data Lake Behind the Scenes.
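Conceptually, the storage part acts as a staging area between the trading partners, along these lines (a simplified sketch with illustrative names; the German attribute name stands in for a differing taxonomy):

```python
class StagingStore:
    """Storage between trading partners: the publisher uploads in its
    own taxonomy; the subscriber pulls later, translated to its own."""

    def __init__(self):
        self._staged = []  # (publisher, payload) tuples awaiting pickup

    def push(self, publisher: str, payload: dict) -> None:
        self._staged.append((publisher, payload))

    def pull(self, translate):
        """Drain staged records, applying a taxonomy translation."""
        records, self._staged = self._staged, []
        return [translate(payload) for _, payload in records]

store = StagingStore()
store.push("manufacturer-a", {"Farbe": "rot"})  # uploaded in the sender's taxonomy
# ...hours or days later, the subscriber pulls in its own taxonomy...
print(store.pull(lambda p: {"colour": p.get("Farbe")}))  # [{'colour': 'rot'}]
```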
Four years ago, a post on this blog was called The Scary Data Lake. The post was about the fear that the then-new data lake concept would lead to data swamps with horrific data quality, data dumps no one would ever use, data cesspools with all the badly governed data and data sumps that would never be part of the business processes.
For sure, there have been mistakes with data lakes. But it seems that the data lake concept has matured and that the understanding of what a data lake is good for is increasing. The data lake concept has even grown out of the analytic world and into more operational cases, as told in the post Welcome to Another Data Lake for Data Sharing.
One of the things we have learned is to apply well-known data management principles to data lakes too. This encompasses metadata management, data lineage capabilities and data governance, as reported in the post Three Must Haves for your Data Lake.
A couple of weeks ago Microsoft, Adobe and SAP announced their Open Data Initiative. While this, as far as we know, is only a statement for now, it has of course attracted some interest because three giants in the IT industry have agreed on something – mostly interpreted as agreeing to oppose Salesforce.com.
Forming a business ecosystem among players in the market is not new. However, what we usually see is that a group of companies agrees on a standard and then each of them puts a product or service that adheres to that standard on the market. The standard then caters for the interoperability between the products and services.
In this case it seems to be something different. The product or service is operated by Microsoft on their Azure platform. There will be some form of a common data model. But it is a data lake, meaning that we should expect that data can be provided in any structure and format and that data can be consumed into any structure and format.
In all humbleness, this concept is the same as the one that is behind Product Data Lake.
The Open Data Initiative from Microsoft, Adobe and SAP focuses on customer data and seems to be about enterprise-wide customer data. While it could technically also support ecosystem-wide customer data, privacy concerns and compliance issues will restrict that scope in many cases.
At Product Data Lake, we do the same for product data. Only here, the scope is business ecosystem wide as the big pain with product data is the flow between trading partners as examined here.
The intersection between Artificial Intelligence (AI) and Master Data Management (MDM) – and the associated discipline Product Information Management (PIM) – is an emerging topic.
A use case close to me
In my work setting up a service called Product Data Lake, the inclusion of AI has become an important topic. The aim of this service is to translate between the different taxonomies in use at trading partners, for example when a manufacturer shares its product information with a merchant.
In some cases the manufacturer, the provider of product information, may use the same standard for product information as the merchant. This may be deep standards such as eCl@ss and ETIM or pure product classification standards such as UNSPSC. In this case we can apply deterministic matching of the classifications and the attributes (also called properties or features).
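In that situation the matching really is a key comparison, as in this small illustration (the code value is made up and only UNSPSC-like in shape):

```python
# When both partners classify with the same standard, matching the
# classifications is a deterministic key comparison.
manufacturer_item = {"sku": "M-1", "classification": "43211508"}
merchant_item = {"sku": "R-9", "classification": "43211508"}

if manufacturer_item["classification"] == merchant_item["classification"]:
    # The attributes defined for this class can then be mapped directly.
    print("same class:", manufacturer_item["classification"])
```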
However, most often there are uncovered areas even when two trading partners share the same standard. And then again, the most frequent situation is that the two trading partners are using different standards.
As always, applying too much human interaction is costly, time-consuming and error-prone. Therefore, we are eagerly training our machines to be able to do this work in a cost-effective way, within a much shorter time frame and with a repeatable and consistent outcome, to the benefit of the participating manufacturers, merchants and other enterprises involved in exchanging products and the related product information.
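As a baseline illustration of what we are training the machines to do, here is a plain string-similarity sketch; the attribute names are invented, and a learned model would replace the generic similarity scoring:

```python
from difflib import get_close_matches

# Illustrative attribute names only; real standards such as eCl@ss and
# ETIM use coded properties rather than free-text names.
manufacturer_attrs = ["Colour", "Net Weight", "Width (mm)", "Country of Origin"]
merchant_attrs = ["color", "weight_net", "width_mm", "origin_country"]

def canon(name: str) -> str:
    """Canonical form for comparison."""
    return name.lower().replace("(", " ").replace(")", " ").replace("_", " ").strip()

merchant_by_canon = {canon(m): m for m in merchant_attrs}

# A plain string-similarity baseline; a trained model would replace
# get_close_matches with learned scoring on names, definitions and values.
for attr in manufacturer_attrs:
    hits = get_close_matches(canon(attr), list(merchant_by_canon), n=1, cutoff=0.4)
    print(attr, "->", merchant_by_canon[hits[0]] if hits else "needs human review")
```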
Learning from others
This week I participated in a workshop around exchanging experiences and proofing use cases for AI and MDM. The above-mentioned use case was one of several use cases examined here. And for sure, there is a basis for applying AI with substantial benefits for the enterprises who get this. The workshop was arranged by Camelot Management Consultants within their Global Community for Artificial Intelligence in MDM.
Gartner, the analyst firm, has a hype cycle for Information Governance and Master Data Management.
Back in 2012 there was a hype cycle for just Master Data Management. It looked like this:
I have made a red circle around the two rightmost terms: “Data Quality Tools” and “Information Exchange and Global Data Synchronization”.
Now, 6 years later, the terms included in the cycle are these:
The two terms “Data Quality Tools” and “Information Exchange and Global Data Synchronization” are not mentioned here. I do not think it is because they ever fulfilled their purpose. I think they are being supplemented by something new. One of the terms that have emerged since 2012 is, in the red circle, Multienterprise MDM.
As touched upon in the post Product Data Quality, we have seen data quality tools in action for years when it comes to customer (or party) master data, but not that much when it comes to product master data.
Global Data Synchronization has revolved around the GS1 concept of GDSN (Global Data Synchronization Network) and the exchange of product data between trading partners. However, after 40 years in play this concept covers only a fraction of the products traded worldwide and only very basic product master data. Product data syndication between trading partners covering the bulk of product information and related digital assets must still be handled otherwise today.
In my eyes Multienterprise MDM comes to the rescue. This concept was examined in the post Ecosystem Wide MDM. You can gain business benefits from extending enterprise-wide product master data management to be multienterprise-wide. These benefits include (see the sketch after the list below):
Working with the same product classifications or being able to continuously map between different classifications used by trading partners
Utilizing the same attribute definitions (metadata around products) or being able to continuously map between different attribute taxonomies in use by trading partners
Sharing data on product relationships (available accessories, relevant spare parts, updated succession for products, cross-sell information and up-sell opportunities)
Having shared access to latest versions of digital assets (text, audio, video) associated with products.
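As a minimal sketch of what a multienterprise product record could carry to support these four capabilities, here is an illustrative Python data model; the field names and shapes are my assumptions, not the actual Product Data Lake model:

```python
from dataclasses import dataclass, field

@dataclass
class EcosystemProduct:
    """Illustrative multienterprise product record."""
    sku: str
    # Classification per standard, mappable between trading partners.
    classifications: dict = field(default_factory=dict)  # e.g. {"UNSPSC": "43211508"}
    # Attribute values keyed by shared or mapped metadata definitions.
    attributes: dict = field(default_factory=dict)       # e.g. {"colour": "red"}
    # Relationships: accessories, spare parts, successors, cross/up-sell.
    related: dict = field(default_factory=dict)          # e.g. {"accessory": ["SKU-2"]}
    # Latest versions of digital assets shared across the ecosystem.
    assets: dict = field(default_factory=dict)           # e.g. {"datasheet": "https://..."}
```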
This is what we work for at Product Data Lake – including Machine Learning Enabled Data Quality, Data Classification, Cloud MDM Hub Service and Multienterprise Metadata Management.
The Information Difference MDM Landscape Q2 2018 is out.
The report confirms the trend of increasing uptake of cloud Master Data Management solutions as examined in the recent post called The Rise of Cloud MDM.
According to the report, the coexistence of big data and master data is another trend, and more and more MDM vendors are embracing all master data domains, though as stated, “most vendors have their roots in either customer or product data, and their particular functionality and track record of deployment is usually deeper where the software had its roots”.
The plot of vendors looks like this:
You can read the full report here.
When working in Master Data Management (MDM) programs, some of the main pain points always on the list are duplicates. As explained in the post Golden Records in Multi-Domain MDM, these may be duplicates in party master data (customer, supplier and other roles) as well as duplicates in product master data, assets, locations and more.
Most of the data quality technology available to solve these problems revolves around identifying duplicates. This is a very intriguing discipline where I have spent some of my best years. However, it is only a remedy for the symptoms of the problem and not a means to eliminate the root cause, as touched upon in the post The Good, Better and Best Way of Avoiding Duplicates.
The root causes are plentiful and, as with all such challenges, they involve technology, processes and people.
Having an IT landscape with multiple applications where master data are created, updated and consumed is a basic problem, and a remedy for that is the main reason of being for Master Data Management (MDM) solutions. The challenge is to implement MDM technology in a way that the MDM solution does not just become another silo of master data but instead a solution for sharing master data within the enterprise – and ultimately in the digital ecosystem around the enterprise.
The main enemy from a technology perspective is, in my experience, peer-to-peer system integration solutions. If you have chosen application X to support one business objective and application Y to support another, and you learn that there is a ready-made integration between X and Y, that is very bad news, because short-term cost and timing considerations will make that option obvious. But in the long run it will cost you dearly if the master data involved are handled in other applications as well: with n applications, point-to-point integration tends towards n(n-1)/2 connections, and you will have blind spots all over the place through which duplicates will enter.
The only sustainable solution is to build a master data hub through which master data are integrated and thus shared with all applications inside the enterprise and around the enterprise. This hub must encompass a shared master data model and related metadata.
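To make the hub idea concrete, here is a minimal sketch in Python; the class name, the upsert flow and the simplistic duplicate check are all illustrative assumptions, but they show the point: checking against golden records on entry attacks the root cause instead of cleaning up duplicates afterwards:

```python
class MasterDataHub:
    """A hub through which applications share master data instead of
    integrating peer-to-peer. New records are checked against existing
    golden records on entry."""

    def __init__(self):
        self._records = {}  # golden records keyed by master id
        self._next_id = 1

    def upsert(self, record: dict, is_duplicate) -> int:
        # Check against existing golden records before creating a new one.
        for master_id, existing in self._records.items():
            if is_duplicate(existing, record):
                existing.update(record)  # survivorship kept deliberately simple
                return master_id
        master_id = self._next_id
        self._next_id += 1
        self._records[master_id] = record
        return master_id

hub = MasterDataHub()
same_name = lambda a, b: a.get("name", "").lower() == b.get("name", "").lower()
id1 = hub.upsert({"name": "Bob Smith"}, same_name)
id2 = hub.upsert({"name": "bob smith"}, same_name)
print(id1 == id2)  # True: the second entry updated the existing golden record
```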