Product Data Quality

The data quality tool industry has always had a hard time offering capabilities for solving the data quality issues that relate to product data.

Customer data quality issues have always been the challenges addressed, as examined in the post The Future of Data Quality Tools, where the current positioning from the analyst firm Information Difference was discussed. Leaders such as Experian Data Quality, Informatica and Trillium (now part of Syncsort) always promote their data quality tools with use cases around customer data.

Some years back Oracle did have a go at product data quality with its Silver Creek Systems acquisition, as mentioned by Andrew White of Gartner in this post. The approach from Silver Creek to product data quality can be seen in this MIT Information Quality Industry Symposium presentation from the year before. However, today Oracle is not even present in the industry report mentioned above.

While data quality as a discipline, with its methodology and surrounding data governance, may be very similar for customer data and product data, the capabilities needed for tools supporting data cleansing, data quality improvement and prevention of data quality issues are somewhat different.

Data profiling is different, as it must be very tightly connected to product classification. Deduplication is useful, but far from to the same degree as with customer data. Data enrichment must rely much more on second party data than on third party data, the latter being most useful for customer and other party master data.

Regular readers of this blog will know that my suggestion for data quality tool vendors is to join Product Data Lake.

Golden Records in Multi-Domain MDM

The term golden record is a core concept within Master Data Management (MDM). A golden record is a representation of a real world entity that may be compiled from multiple different representations of that entity in a single or in multiple different databases within the enterprise system landscape.

In multi-domain MDM we work with a range of different entity types such as party (with customer, supplier, employee and other roles), location, product and asset. The golden record concept applies to all of these entity types, but in slightly different ways.

Party Golden Records

Having a golden record that facilitates a single view of customer is probably the best known example of using the golden record concept. Managing customer records and dealing with duplicates of those is the most frequent data quality issue around.

If you are not able to prevent duplicate records from entering your MDM world, which is the best approach, then you have to apply data matching capabilities. When identifying a duplicate you must be able to intelligently merge any conflicting views into a golden record.
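As a minimal sketch of what such a merge could look like, the Python snippet below applies one possible survivorship rule: for each field, the most recently updated non-empty value wins. The rule, the field names and the `updated` date are assumptions made for illustration, not a description of any particular MDM tool.

```python
from datetime import date


def merge_golden_record(records):
    """Merge duplicate records: for each field, the newest non-empty value wins."""
    # Newest first, so the first non-empty value seen for a field survives.
    ordered = sorted(records, key=lambda r: r["updated"], reverse=True)
    golden = {}
    for record in ordered:
        for field, value in record.items():
            if field == "updated":
                continue  # bookkeeping field, not part of the golden record
            if value not in (None, "") and field not in golden:
                golden[field] = value
    return golden


# Hypothetical duplicate records describing the same real world person.
duplicates = [
    {"name": "John Smith", "phone": "", "email": "john@example.com",
     "updated": date(2015, 3, 1)},
    {"name": "Jon Smith", "phone": "555-0100", "email": None,
     "updated": date(2016, 6, 1)},
]
golden = merge_golden_record(duplicates)
print(golden)
```

Real survivorship rules are usually set per field (for example trusting one source system for addresses and another for phone numbers), so "newest wins" is only a starting point.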

To a lesser degree we see the same challenges in getting a single view of suppliers and, which is one of my favourite subjects, you will ultimately want to have a single view of any business partner, also where the same real world entity has customer, supplier and other roles in relation to your organization.

Location Golden Records

Having the same location represented only once in a golden record, and linking any party, product and asset record (and ultimately golden record) to that location record, may be seen as quite academic. Nevertheless, striving for that concept will solve many data quality conundrums.

Location management has different meanings and importance for different industries. One example is that a brewery does business with the legal entity (party) that owns a bar, café or restaurant. However, even though the owner of that place changes, which happens a lot, the brewery is still interested in being the brand served at that place. Also, the brewery wants to keep records of logistics around that place and the historic volumes delivered to that place. Utility and insurance are other examples of industries where the location golden record (should) matter a lot.

Knowing the properties of a location also supports the party deduplication process. For example, if you have two records with the name “John Smith” at the same address, the probability of those being the same real world entity depends on whether that location is a single-family house or a nursing home.
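The idea can be sketched in a few lines of Python. The probabilities below are made-up placeholders chosen only to show the mechanism; real values would have to be estimated from your own data, and the record layout (`name`, `address_id`) is likewise an assumption.

```python
# Hypothetical, made-up priors for illustration only.
SAME_PERSON_PRIOR = {
    "single_family_house": 0.9,
    "nursing_home": 0.4,
}
DEFAULT_PRIOR = 0.6  # used when the location type is unknown


def same_person_probability(record_a, record_b, location_type):
    """Score two records that agree on name and address, using the location type."""
    same_name = record_a["name"].casefold() == record_b["name"].casefold()
    same_address = record_a["address_id"] == record_b["address_id"]
    if not (same_name and same_address):
        return None  # this sketch only scores exact name-and-address agreement
    return SAME_PERSON_PRIOR.get(location_type, DEFAULT_PRIOR)


a = {"name": "John Smith", "address_id": "LOC-1"}
b = {"name": "john smith", "address_id": "LOC-1"}
print(same_person_probability(a, b, "single_family_house"))
print(same_person_probability(a, b, "nursing_home"))
```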

Product Golden Records

Product Information Management (PIM) solutions became popular with the rise of multi-channel, where having the same representation of a product in offline and online channels is essential. The self-service approach in online sales also drove the requirement of managing a lot more product attributes than seen before, which again points to handling the product entity centrally.

In large organizations that have many business units around the world you struggle with having a local view and a global view of products. A given product may be a finished product to one unit but a raw material to another unit. Even a global SAP rollout will usually not clarify this – rather the contrary.

While third party reference data helps a lot with handling golden records for party and location, this is less the case for product master data. Classification systems and data pools do exist, but will certainly not take you all the way. With product master data we must, in my eyes, rely more on second party master data, meaning sharing product master data within the business ecosystems where you are present.

Asset (or Thing) Golden Records

In asset master data management you also have different purposes where having a single view of a real world asset helps a lot. Namely, there are financial purposes and logistic purposes that have to be aligned, but also a lot of other purposes depending on the industry and the type of asset.

With the rise of the Internet of Things (IoT) we will have to manage a lot more assets (or things) than we have usually considered. When a thing (a machine, a vehicle, an appliance) becomes intelligent and now produces big data, master data management and indeed multi-domain master data management become imperative.

You will want to know a lot about the product model of the thing in order to make sense of the produced big data. For that, you need the product (model) golden record. You will want to have deep knowledge of the location in time of the thing. You cannot do that without the location golden records. You will want to know the different party roles in time related to the thing. The owner, the operator, the maintainer. If you want to avoid chaos, you need party golden records.

What Will you Complicate in the Year of the Rooster?

Today is the first day of the new year: the year of the rooster according to the Lunar Calendar observed in East Asia. One of the characteristics of the year of the rooster is that in this year, people will tend to complicate things.

People usually like to keep things simple. The KISS principle – Keep It Simple, Stupid – has many fans. But not me. Not that I do not like to keep things simple. I do. But only as simple as it should be, as Einstein probably said. Sometimes KISS is the shortcut to getting it all wrong.

When working with data quality I have come across the three examples below of striking the right balance between making things a bit complicated and not too simple:

Deduplication

One of the most frequent data quality issues around is duplicates in party master data. Customer, supplier, patient, citizen, member and many other roles of legal entities and natural persons, where the real world entity is described more than once with different values in our databases.

In solving this challenge, we can use methods such as match codes and edit distance to detect duplicates. However, these methods, often called deterministic, are far too simple to really automate the remedy. We can also use advanced probabilistic methods. These methods are better, but have the downside that the matching done is hard to explain, repeat and reuse in other contexts.

My best experience is to use something in between these approaches. Not too simple and not overcomplicated.
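To make the two simple methods mentioned above concrete, here is a small Python sketch of a match code and an edit distance check. The match code rule (keep the first letter, drop later vowels and all non-letters) and the distance threshold of 2 are assumptions chosen for illustration; they are not the in-between approach recommended here, only the deterministic building blocks.

```python
import re


def match_code(name):
    """Crude match code: uppercase letters only, vowels dropped after the first letter."""
    letters = re.sub(r"[^A-Z]", "", name.upper())
    if not letters:
        return ""
    return letters[0] + re.sub(r"[AEIOU]", "", letters[1:])


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_candidate_duplicate(name_a, name_b, max_distance=2):
    """Flag a pair when the match codes agree or the names are within max_distance edits."""
    if match_code(name_a) == match_code(name_b):
        return True
    return edit_distance(name_a.lower(), name_b.lower()) <= max_distance


print(is_candidate_duplicate("John Smith", "Jon Smith"))
print(is_candidate_duplicate("John Smith", "Mary Jones"))
```

Note how easily such rules go wrong in both directions: "Jon Smith" and "Jan Smith" share a match code, which may well be a false positive.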

Address verification

You can make a good algorithm to perform verification of postal and visit addresses in a database for addresses coming from one country. However, if you try the same algorithm on addresses from another country, it often fails miserably.

Making an algorithm for addresses from all over the world will be very complicated. I have not yet seen one that works.

My best experience is to accept the complication of having almost as many algorithms as there are countries on this planet.
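Structurally, this amounts to a registry with one rule set per country. The Python sketch below shows that shape using postal code format checks only, which is a tiny part of real address verification; the three countries and the function name are picked for illustration.

```python
import re

# One rule per country; a real solution would hold far more than a postal code pattern.
POSTAL_CODE_RULES = {
    "DK": re.compile(r"\d{4}"),
    "US": re.compile(r"\d{5}(-\d{4})?"),
    "NL": re.compile(r"\d{4} ?[A-Z]{2}"),
}


def verify_postal_code(country_code, postal_code):
    """Return True/False for known countries, None when no rule exists yet."""
    rule = POSTAL_CODE_RULES.get(country_code)
    if rule is None:
        return None  # unknown country: better to say so than to guess
    return rule.fullmatch(postal_code.strip()) is not None


print(verify_postal_code("DK", "2100"))
print(verify_postal_code("US", "2100"))
print(verify_postal_code("FR", "75001"))
```

Returning `None` for a country without a rule keeps "not verified" apart from "verified as wrong", which matters when the same check runs over data from many countries.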

Product classification

Classification of products controls a lot of the data quality dimensions related to product master data. The most prominent example is completeness of product information. Whether you have complete product information depends on the classification of the product. Some attributes will be mandatory for one product but make no sense at all for another product with a different classification.
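A classification-dependent completeness measure could be sketched as follows in Python. The two classes and their mandatory attributes are invented for the example and do not come from any real classification system.

```python
# Hypothetical classes and mandatory attributes, for illustration only.
MANDATORY_ATTRIBUTES = {
    "power_tool": ["voltage", "weight_kg", "warranty_months"],
    "t_shirt": ["size", "colour", "material"],
}


def completeness(product):
    """Share of the attributes mandatory for the product's class that are filled in."""
    required = MANDATORY_ATTRIBUTES[product["classification"]]
    attributes = product.get("attributes", {})
    filled = [name for name in required if attributes.get(name) not in (None, "")]
    return len(filled) / len(required)


drill = {
    "classification": "power_tool",
    "attributes": {"voltage": "18V", "weight_kg": 1.6, "colour": "blue"},
}
print(completeness(drill))  # two of the three mandatory attributes are present
```

The `colour` value on the drill does not count either way: it is not mandatory for that class, which is exactly why the same attribute list cannot be used for every product.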

If your product classification is too simple, your completeness measurement will not be realistic. An overly granular or otherwise complicated classification system is very hard to maintain and will probably seem like overkill for many purposes of product master data management.

My best experience is that you have to maintain several classification systems and keep links between them, both inside your organization and with your trading partners.

Happy New Lunar Year

Using a Business Entity Identifier from Day One

One of the ways to ensure data quality for customer – or rather party – master data when operating in a business-to-business (B2B) environment is to on-board new entries using an externally defined business entity identifier.

By doing that, you tackle some of the most challenging data quality dimensions, such as:

  • Uniqueness, by checking if a business with that identifier already exists in your internal master data. This approach is superior to using data matching, as explained in the post The Good, Better and Best Way of Avoiding Duplicates.
  • Accuracy, by having names, addresses and other information defaulted from a business directory and thus avoiding those spelling mistakes that usually are all over party master data.
  • Conformity, by inheriting additional data such as line-of-business codes and descriptions from a business directory.
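The on-boarding flow behind these three points can be sketched in Python as below. The `PartyRegistry` class, the directory record layout and the identifier value are all hypothetical; in practice the directory record would come from a lookup against the business directory you have chosen.

```python
class PartyRegistry:
    """In-memory stand-in for party master data keyed by an external identifier."""

    def __init__(self):
        self._by_identifier = {}

    @staticmethod
    def _normalise(identifier):
        # Identifiers are often written with spaces or hyphens; compare the bare form.
        return identifier.replace("-", "").replace(" ", "").upper()

    def onboard(self, identifier, directory_record):
        """Return (party, created). An existing identifier never creates a duplicate."""
        key = self._normalise(identifier)
        if key in self._by_identifier:
            return self._by_identifier[key], False  # uniqueness check
        # Accuracy and conformity: default the values from the directory record.
        party = dict(directory_record, identifier=key)
        self._by_identifier[key] = party
        return party, True


registry = PartyRegistry()
record = {"name": "Example Trading Ltd", "line_of_business": "Wholesale"}  # made-up data
first, created_first = registry.onboard("12-345-6789", record)
second, created_second = registry.onboard("123456789", {"name": "Exampel Trading"})
print(created_first, created_second, second["name"])
```

The second attempt, with a misspelled name, is caught by the identifier alone and returns the existing party, so no data matching is needed at this point.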

Having an external business identifier stored with your party master data helps a lot with maintaining data quality, as pondered in the post Ongoing Data Maintenance.

When selecting an identifier there are different options, such as national IDs, LEI, DUNS Number and others, as explained in the post Business Entity Identifiers.

At the Product Data Lake service I am working on right now, we have decided to use an external business identifier from day one. I know this may be something a typical start-up will consider much later, if and when the party master data population has grown. But, besides being optimistic about our service, I think it will be a win not to have to fight data quality issues later with guaranteed increased costs.

For the identifier to use we have chosen the DUNS Number from Dun & Bradstreet. The reason is that this is currently the only business identifier with worldwide coverage. Also, Dun & Bradstreet offers some additional data that fits our business model. This includes consistent line-of-business information and worldwide company family trees.

MDM Tools Revealed

Every organization needs Master Data Management (MDM). But does every organization need an MDM tool?

In many ways the MDM tools we see on the market resemble common database tools. But there are some things the MDM tools do better than a common database management tool. The post called The Database versus the Hub outlines three such features:

  • Controlling hierarchical completeness
  • Achieving a Single Business Partner View
  • Exploiting Real World Awareness

Controlling hierarchical completeness and achieving a single business partner view are closely related to the two things data quality tools do better than common database systems, as explained in the post Data Quality Tools Revealed. These two features are:

  • Data profiling and
  • Data matching

Specialized data profiling tools are very good at providing out-of-the-box functionality for statistical summaries and frequency distributions for the unique values and formats found within the fields of your data sources, in order to measure data quality and find critical areas that may harm your business. These capabilities are often better and easier to use than what you find inside an MDM tool. However, in order to measure the improvement in a business context and fix the problems not just as a one-off, you need a solid MDM environment.
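For readers who have not used a profiling tool, the kind of summary meant here can be approximated in a few lines of Python. This is only a rough sketch of value and format frequencies for a single field; the function names and the sample postal codes are made up, and dedicated tools do far more.

```python
import re
from collections import Counter


def format_pattern(value):
    """Reduce a value to its format: letters become 'A', digits become '9'."""
    pattern = re.sub(r"[A-Za-z]", "A", value)
    return re.sub(r"[0-9]", "9", pattern)


def profile(values):
    """Basic profile of one field: counts, most frequent values and formats."""
    non_empty = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "missing": len(values) - len(non_empty),
        "distinct": len(set(non_empty)),
        "top_values": Counter(non_empty).most_common(3),
        "formats": Counter(format_pattern(v) for v in non_empty),
    }


postal_codes = ["2100", "2100", "DK-2200", "", None, "21 00"]  # made-up sample
result = profile(postal_codes)
print(result["formats"])
```

A format distribution like this quickly shows that the same field holds three different conventions, which is the sort of critical area a profiling run is meant to surface.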

When it comes to data matching we also still see specialized solutions that are more effective and easier to use than what is typically delivered inside MDM solutions. Besides that, we also see business scenarios where it is better to do the data matching outside the MDM platform as examined in the post The Place for Data Matching in and around MDM.

Looking at the single MDM domains we also see alternatives. Customer Relationship Management (CRM) systems are popular as a choice for managing customer master data. But as explained in the post CRM systems and Customer MDM: CRM systems are said to deliver a Single Customer View but usually they don’t. The way CRM systems are built, used and integrated is a sure path to creating duplicates. Some remedies for that are touched on in the post The Good, Better and Best Way of Avoiding Duplicates.

With product master data we also have Product Information Management (PIM) solutions. From what I have seen, PIM solutions have one key capability that sets them apart from a common database solution and from many MDM solutions that are built with party master data in mind. That is a flexible and superuser-oriented way of building hierarchies and assigning attributes to entities – in this case particularly products. If you offer customer self-service, like in eCommerce, with products that have varying attributes, you need PIM functionality. If you want to do this smartly, you need a collaboration environment for supplier self-service as well, as pondered in the post Chinese Whispers and Data Quality.

All in all the necessary components and combinations for a suitable MDM toolbox are plentiful and can be obtained by one-stop-shopping or by putting some best-of-breed solutions together.

IDQ vs iDQ™

The previous post on this blog was called Informatica without Data Quality? That post digs into the messaging around the recent takeover of Informatica and the future for the data quality components in the Informatica toolbox.

In the comments Julien Peltier and Richard Branch discuss the cloud emphasis in the messaging from the new Informatica owners and especially the future of Master Data Management (MDM) in the cloud.

My best experience with MDM in the cloud is with a service called iDQ™, a service that shares its TLA (Three Letter Acronym) with Informatica Data Quality, by the way. The former stands for instant Data Quality. This is a service that revolves around turning your MDM inside-out, as most recently touched on in this blog in the post The Pros and Cons of MDM 3.0.

iDQ™ specifically deals with customer (or rather party) master data, how to get this kind of master data right the first time and how to avoid duplicates as explored in the post The Good, Better and Best Way of Avoiding Duplicates.

The Data Matching Institute is Here

Within data management we already have “The MDM Institute”, “The Data Governance Institute” and “The Data Warehouse Institute (TDWI)” and now we also have “The Data Matching Institute”.

The founder of The Matching Institute is Alexandra Duplicado. Aleksandra says: “The reason I founded The Institute of Data Matching is that I am sick and tired of receiving duplicate letters with different spellings of my name and address”. Alex is also pleased that she has now found a nice office within edit distance of her home.

Before founding The Matching of Data Institute Alexander worked at the Universal Postal Union with responsibility for extra-terrestrial partners. When talking about the future of The Match Institute Sasha remarks: “It is a matter of not being too false positive. But it is a unique concept”.

One of the first activities for The Data-Matching Institute will be organizing a conference in Brussels. Many tool vendors such as Statistical Analysis System Inc., Dataflux and SAS Instiute will sponsor the Brüssel conference. “I hope to join many record linkage friends in Bruxelles”, says Alexandre.

The Institute of Matching of Data also plans to offer a yearly report on the capabilities of the tool vendors. Asked about when that is going to happen Aleksander says: “Without being too deterministic a probabilistic release date is the next 1st of April”.