Data Matching and Real-World Alignment

Data matching is a subdiscipline of data quality management. It is about establishing links between data elements and entities that do not have the same values but refer to the same real-world construct.

The most common scenario for data matching is deduplication of customer data records held across an enterprise. In this case we often see a gap between what we technically try to do and the desired business outcome from deduplication. In my experience, this misalignment usually comes down to real-world alignment.

What we technically do is basically to compute a similarity between data records that have typically been pre-processed with some form of standardization. This is often not enough.
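
As a minimal sketch of that standardize-then-compare approach, assuming only the Python standard library, something like the below is what most engines do at their core. The normalization rules and example values are illustrative assumptions, not settings from any particular tool.

```python
import re
from difflib import SequenceMatcher

# Illustrative abbreviation expansions; real tools ship large, curated rule sets.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue", "inc": "incorporated"}

def standardize(value: str) -> str:
    """Lowercase, strip punctuation and expand common abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", value.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

def similarity(a: str, b: str) -> float:
    """Similarity of the standardized values, between 0.0 and 1.0."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()

print(similarity("123 Main St.", "123 Main Street"))  # 1.0 - same record, different raw values
```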

Location Intelligence

Deduplication and other forms of data matching with customer master data revolve around names and addresses.

Standardization and verification of addresses is a very common element in data quality and data matching tools. Often such a tool will use a service from the same brand or a third-party service. Unfortunately, a single service is often not enough. This is because:

  • Most services are biased towards a certain geography. They may for example be quite good for addresses in the United States but very poor, compared to local services, for other geographies. This is especially true for geographies with multiple languages in play, as exemplified in the post The Art in Data Matching. A sketch of cascading local and global services follows this list.
  • There is much more to an address than the postal format. In deduplication it is for example useful to know if the address is a single-family house or a high-rise building, a nursing home, a campus or other building with lots of units.
  • Timeliness of address reference data is underestimated. I recently heard from a leader in the Gartner Magic Quadrant for Data Quality Tools that a quarterly refresh is fine. It is not, as told in the post Location Data Quality for MDM.
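
One way to mitigate the geography bias is to cascade services, preferring a strong local service per country before falling back to a global one. The sketch below uses hypothetical stand-in functions for the real provider APIs.

```python
from typing import Callable, Optional

# Hypothetical stand-ins for real verification services.
def verify_dk_local(address: str) -> Optional[dict]:
    return {"address": address, "verified_by": "dk-local"}

def verify_global(address: str) -> Optional[dict]:
    return {"address": address, "verified_by": "global"}

# Prefer a strong local service where one exists for the geography.
LOCAL_SERVICES: dict[str, Callable[[str], Optional[dict]]] = {
    "DK": verify_dk_local,
}

def verify_address(address: str, country: str) -> Optional[dict]:
    """Try the geography-specific service first, then fall back globally."""
    local = LOCAL_SERVICES.get(country)
    if local:
        result = local(address)
        if result:
            return result
    return verify_global(address)

print(verify_address("Vesterbrogade 1, 1620 København V", "DK"))
```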

Identity Resolution

The overlaps and similarities between data matching and identity resolution were discussed in the post Deduplication vs Identity Resolution.

In summary, the capability to tell if two data records represent the same real-world entity will eventually involve identity resolution. And as this is poorly supported by the data quality tools around, a lot of manual work will be involved if the business processes that rely on the data matching cannot tolerate too many, or in some cases any, false positives or false negatives.
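
A common way to contain that manual work is a two-threshold approach, where only the uncertain middle band of similarity scores is routed to a data steward. The thresholds below are illustrative assumptions; real values must be tuned against each process's tolerance for false positives and false negatives.

```python
# Illustrative thresholds; tune against each process's tolerance for
# false positives and false negatives.
AUTO_MATCH = 0.92
AUTO_REJECT = 0.70

def classify_pair(score: float) -> str:
    """Route a candidate pair based on its similarity score."""
    if score >= AUTO_MATCH:
        return "match"            # merge automatically
    if score < AUTO_REJECT:
        return "no-match"         # keep the records apart
    return "clerical-review"      # route to a data steward

print(classify_pair(0.95))  # match
print(classify_pair(0.80))  # clerical-review
```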

Hierarchy Management

Even telling whether a true positive match is true in all circumstances is hard. The predominant examples of this challenge are:

  • Is a match between what seems to be an individual person and what seems to be the household where the person lives a true match?
  • Is a match between what seems to be a person in a private role and what seems to be the same person in a business role a true match? This is especially tricky with sole proprietors working from home, like farmers, dentists, freelance consultants and more.
  • Is a match between two sister companies on the same address a true match? Or two departments within the same company?

We often realize that the answers to these questions differ depending on the business processes where the result of the data matching will be used.

The solution is not simple. The data matching functionality must, if we want automated and broadly usable results, be quite sophisticated in order to take advantage of what is available in the real world. And the data model where we hold the result of the data matching must be quite complex if we want to reflect the real world.
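
One consequence is that the match decision itself needs the consuming business process as an input. The sketch below illustrates that idea; the match types, process names and rules are illustrative assumptions, not a model from any particular tool.

```python
from enum import Enum, auto

class MatchType(Enum):
    PERSON_VS_HOUSEHOLD = auto()
    PRIVATE_VS_BUSINESS_ROLE = auto()
    SISTER_COMPANIES_SAME_ADDRESS = auto()

# Which match types count as "the same party" per business process.
# Processes and rules are illustrative assumptions.
ACCEPTED_PER_PROCESS = {
    "direct-mail": {MatchType.PERSON_VS_HOUSEHOLD},
    "credit-check": {MatchType.PRIVATE_VS_BUSINESS_ROLE},
    "spend-analysis": {MatchType.SISTER_COMPANIES_SAME_ADDRESS},
}

def is_true_match(match_type: MatchType, process: str) -> bool:
    """The same candidate pair can be true for one process, false for another."""
    return match_type in ACCEPTED_PER_PROCESS.get(process, set())

print(is_true_match(MatchType.PERSON_VS_HOUSEHOLD, "direct-mail"))   # True
print(is_true_match(MatchType.PERSON_VS_HOUSEHOLD, "credit-check"))  # False
```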

Tibco, Orchestra and Netrics

Today’s Master Data Management (MDM) news is that Tibco Software has bought Orchestra Networks. So, the 11 vendors in last year’s Gartner Magic Quadrant for Master Data Management Solutions are now down to 10.

If Gartner is still postponing this year’s MDM quadrant, they may even manage to reflect this change. We are of course also waiting to see if newcomers will make it into the quadrant and bring the crowd of vendors back above 10. Some of the candidates will be the likes of Reltio and Semarchy.

Anyway, back to the takeover of Orchestra Networks by Tibco: this is not the first time Tibco has bought something in the MDM and data quality realm. Back in 2010 Tibco bought the data quality tool and data matching front runner Netrics, as reported in the post What is a best-in-class match engine?

Back then Tibco didn’t defend Netrics’ position in the Gartner Magic Quadrant for Data Quality Tools. The latest Data Quality Tools quadrant is, like the MDM quadrant, from 2017 and was touched upon on this blog here.

So, it will be exciting to see how Tibco will defend both the joint Tibco MDM solution, which in 2017 was a sliding niche player at Gartner, and the Orchestra MDM solution, which in 2017 was a leader in the Gartner MDM quadrant.

MDM Hype Cycle, GDSN, Data Quality, Multienterprise MDM and Product Data Syndication

Gartner, the analyst firm, has a hype cycle for Information Governance and Master Data Management.

Back in 2012 there was a hype cycle for just Master Data Management. It looked like this:

[Figure: Gartner Hype Cycle for Master Data Management, 2012. Source: Gartner]

I have made a red circle around the two rightmost terms: “Data Quality Tools” and “Information Exchange and Global Data Synchronization”.

Now, six years later, these are the terms included in the cycle:

[Figure: Gartner Hype Cycle for Information Governance and Master Data Management, 2018. Source: Gartner]

The two terms “Data Quality Tools” and “Information Exchange and Global Data Synchronization” are not mentioned here. I do not think it is because they ever fulfilled their purpose. I think they are being superseded by something new. One of the terms that have emerged since 2012 is, in the red circle, Multienterprise MDM.

As touched upon in the post Product Data Quality, we have seen data quality tools in action for years when it comes to customer (or party) master data, but not that much when it comes to product master data.

Global Data Synchronization has revolved around the GS1 concept of GDSN (Global Data Synchronization Network) and the exchange of product data between trading partners. However, after 40 years in play this concept covers only a fraction of the products traded worldwide, and only very basic product master data. Product data syndication between trading partners covering richer product information and related digital assets must still be handled otherwise today.

In my eyes Multienterprise MDM comes to the rescue. This concept was examined in the post Ecosystem Wide MDM. You can gain business benefits from extending enterprise-wide product master data management to be multienterprise-wide. This includes:

  • Working with the same product classifications or being able to continuously map between different classifications used by trading partners (a mapping sketch follows this list)
  • Utilizing the same attribute definitions (metadata around products) or being able to continuously map between different attribute taxonomies in use by trading partners
  • Sharing data on product relationships (available accessories, relevant spare parts, updated succession for products, cross-sell information and up-sell opportunities)
  • Having shared access to latest versions of digital assets (text, audio, video) associated with products.
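
As a minimal sketch of the classification mapping mentioned above, assuming a simple lookup table between a provider's and a receiver's classification codes (the codes shown are UNSPSC-like and eClass-like examples, purely illustrative):

```python
# Provider's classification code -> receiver's classification code.
# Codes shown are illustrative assumptions.
CLASSIFICATION_MAP = {
    "43211508": "19-01-01-02",
    "43211503": "19-01-01-04",
}

def map_classification(provider_code: str) -> str:
    """Translate a trading partner's class code, or flag it for stewardship."""
    try:
        return CLASSIFICATION_MAP[provider_code]
    except KeyError:
        # Unmapped codes should be queued for a human mapping decision,
        # not silently dropped.
        raise LookupError(f"No mapping for provider code {provider_code!r}")

print(map_classification("43211508"))  # 19-01-01-02
```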

This is what we work for at Product Data Lake – including Machine Learning Enabled Data Quality, Data Classification, Cloud MDM Hub Service and Multienterprise Metadata Management.

Ecosystem Wide Product Information Management

The concept of doing Master Data Management (MDM) not only enterprise wide but ecosystem wide was examined in the post Ecosystem Wide MDM.

As mentioned, product master data is an obvious domain where business outcomes may first materialize when stretching your digital transformation to encompass business ecosystems.

The figure below shows the core delegates in the ecosystem wide Product Information Management (PIM) landscape we support at Product Data Lake:

[Figure: Ecosystem Wide PIM]

Your enterprise is in the centre. You may have or need an in-house PIM solution where you manipulate and make product information more competitive as elaborated in the post Using Internal and External Product Information to Win.

At Product Data Lake we collaborate with providers of Artificial Intelligence (AI) capabilities and similar technologies in order to improve data quality and analyse product information.

As shown at the top, there may be a relevant data pool with a consensus structure available for your industry, where you exchange some of your product information with trading partners. At Product Data Lake we embrace that scenario with our reservoir concept.

Otherwise, you will need to form partnerships with individual trading partners. At Product Data Lake we make that happen with a win-win approach. This means that providers can push their product information in a uniform way, with the structure and taxonomy they have, and receivers can pull the product information in a uniform way, with the structure and taxonomy they have. This product data syndication concept is outlined in the post Sell more. Reduce costs.
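
A hedged sketch of that push/pull idea, assuming a hypothetical attribute mapping between the two taxonomies (this is not the actual Product Data Lake data model):

```python
# Receiver-side attribute mapping: provider attribute -> receiver attribute.
# Names are illustrative assumptions.
PROVIDER_TO_RECEIVER = {
    "colour": "Color",
    "weight_kg": "NetWeight",
}

def push(provider_record: dict) -> dict:
    """Provider side: records arrive and are stored in the provider's own taxonomy."""
    return provider_record

def pull(stored_record: dict) -> dict:
    """Receiver side: translate to the receiver's taxonomy on the way out."""
    return {PROVIDER_TO_RECEIVER.get(k, k): v for k, v in stored_record.items()}

stored = push({"colour": "red", "weight_kg": 1.2})
print(pull(stored))  # {'Color': 'red', 'NetWeight': 1.2}
```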

Product Data Lake Behind the Scenes

Product Data Lake is a cloud service for exchanging product information (product data syndication) between manufacturers, distributors and merchants. When presenting the service I usually concentrate on the business benefits and how the service will make you sell more and reduce costs.

However, there will always be one or two persons in the audience who want to know about the technology behind it. And for sure, this is important too.

The service is built using some of the newest and best-of-breed technologies available for this purpose today. This includes Amazon Elastic Compute Cloud for hosting the public cloud version, MongoDB for storing data, RabbitMQ for handling data streams and Elasticsearch for finding data.
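
To illustrate how such a stack can fit together, here is a hedged sketch of a consumer that takes product data events off a RabbitMQ queue, persists them in MongoDB and indexes them in Elasticsearch. The queue, database and index names are illustrative assumptions, not the actual Product Data Lake configuration.

```python
import json

import pika                      # RabbitMQ client
from pymongo import MongoClient  # MongoDB client
from elasticsearch import Elasticsearch

mongo = MongoClient("mongodb://localhost:27017")
products = mongo["pdl"]["products"]          # illustrative database/collection
es = Elasticsearch("http://localhost:9200")

def handle_message(channel, method, properties, body):
    """Persist an incoming product record and index it for search."""
    record = json.loads(body)
    result = products.insert_one(dict(record))  # store the document in MongoDB
    es.index(index="products", id=str(result.inserted_id), document=record)
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="product-data")  # illustrative queue name
channel.basic_consume(queue="product-data", on_message_callback=handle_message)
channel.start_consuming()
```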

[Figure: Product Data Lake Architecture]

You can dive into the geeky parts in this PDF document: Product Data Lake Architecture.

Happiness vs Market Strength

When following analyst market reports, one thing that always strikes me is that the vendors who have charged the most for licenses (being to the right on the market strength axis) are seldom the same as those having the most satisfied customers.

The Data Quality Product Landscape 2018 from Information Difference has no surprises there either.

On the technology vertical axis, the vendors are pretty even, while they stretch out on the horizontal market strength axis.

[Figure: The Data Quality Landscape 2018. Source: Information Difference]

The report states: “The happiest customers based on this survey were those of Datactics followed by ActivePrime”. You will find those to the left.

(Innovative Systems, Experian and Syncsort were the best of the rest, it must be said.)

See the full report here.

The Good, the Better and the Best Kinds of Data Quality Technology

Looking back at my journey in data quality, you could say that I started out working with the good way of implementing data quality tools, then turned to some better ways and, for now at least, am working with the best way of implementing data quality technology.

That is not to say that the good old kind of tools are obsolete. They are just relieved of some of the repeated hard work of cleaning up dirty data.

The good (old) kind of tools are data cleansing and data matching tools. These tools are good at finding errors in postal addresses, duplicate party records and other nasty stuff in master data. The bad thing about finding the flaws long after the bad master data has entered the databases is that corrections are often very hard to make once transactions have been related to the master data, and that, if you do not fix the root cause, you will have to repeat the exercise periodically. However, there are still reasons to use these tools, as reported in the post Top 5 Reasons for Downstream Cleansing.

The better way is real-time validation and correction at data entry where possible. Here a single data element or a range of data elements is checked when entered. For example, the address may be checked against reference data, the phone number may be checked for an adequate format for the country in question, or product master data may be checked for the right format and against a value list. The hard thing with this is doing it at all entry points. A possible approach is discussed in the post Service Oriented MDM.
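
As a minimal sketch of such entry-time checks, assuming illustrative per-country phone patterns and an illustrative value list (real deployments would validate against reference data services):

```python
import re

# Illustrative per-country phone patterns; real rules are far richer.
PHONE_PATTERNS = {
    "US": re.compile(r"\+1\d{10}"),
    "DK": re.compile(r"\+45\d{8}"),
}
COLOUR_VALUES = {"red", "green", "blue"}  # illustrative value list

def validate_phone(phone: str, country: str) -> bool:
    """Check that the phone number fits the format for the given country."""
    pattern = PHONE_PATTERNS.get(country)
    return bool(pattern and pattern.fullmatch(phone))

def validate_colour(value: str) -> bool:
    """Check a product attribute against its value list."""
    return value.lower() in COLOUR_VALUES

print(validate_phone("+4512345678", "DK"))  # True
print(validate_colour("mauve"))             # False
```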

The best tools emphasize assisting data capture, thus preventing data quality issues while also making the data capture process more effective by connecting rather than collecting. Two such tools I have worked with are:

  • IDQ™, a tool for mashing up internal party master data and third-party big reference data sources, as explained further in the post instant Single Customer View.
  • Product Data Lake, a cloud service for sharing product data in the business ecosystems of manufacturers, distributors, merchants and end users of product information. This service is described in detail here.

When Excel is Stretched too Far

I guess we have all encountered examples of how Excel is used in an over-complicated way to solve business tasks that should have been solved with a tool much better suited for that kind of work.

My pet peeve is using Excel for exchanging product information between supply channel partners. This has been a main driver behind launching Product Data Lake.

What is your example of a use of Excel stretched too far?

[Image: Samsung 49 inch]

Product Data Quality

The data quality tool industry has always had a hard time offering capabilities for solving the data quality issues that relate to product data.

Customer data quality issues have always been the challenges addressed, as examined in the post The Future of Data Quality Tools, where the current positioning from the analyst firm Information Difference was discussed. Leaders such as Experian Data Quality, Informatica and Trillium (now part of Syncsort) always promote their data quality tools with use cases around customer data.

Some years back Oracle did have a go at product data quality with its Silver Creek Systems acquisition, as mentioned by Andrew White of Gartner in this post. The Silver Creek approach to product data quality can be seen in this MIT Information Quality Industry Symposium presentation from the year before. However, today Oracle is not even present in the industry report mentioned above.

[Figure: Multi-Domain MDM and Data Quality Dimensions]

While data quality as a discipline, with its methodology and surrounding data governance, may be very similar between customer data and product data, the capabilities needed in tools supporting data cleansing, data quality improvement and prevention of data quality issues are somewhat different.

Data profiling is different, as it must be very tightly connected to product classification: which attributes a product record must carry depends on its classification, as sketched below. Deduplication is useful, but far from to the same degree as with customer data. Data enrichment must rely much more on second-party data than on third-party data, the latter being most useful for customer and other party master data.
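
A minimal sketch of that classification-driven profiling, assuming illustrative class codes and required attributes:

```python
# Required attributes per product classification; codes and attributes
# are illustrative assumptions.
REQUIRED_ATTRIBUTES = {
    "power-tools": {"voltage", "wattage", "weight_kg"},
    "paint":       {"colour", "volume_l", "finish"},
}

def profile(record: dict) -> set:
    """Return the attributes missing for this record's classification."""
    required = REQUIRED_ATTRIBUTES.get(record.get("class"), set())
    return required - record.keys()

print(profile({"class": "paint", "colour": "red"}))  # e.g. {'volume_l', 'finish'}
```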

Regular readers of this blog will know that my suggestion for data quality tool vendors is to join Product Data Lake.