Today is the first day of the new year, the Year of the Rooster according to the lunar calendar observed in East Asia. One characteristic of the Year of the Rooster is that people will tend to complicate things.
People usually like to keep things simple. The KISS principle (Keep It Simple, Stupid) has many fans. But not me. Not that I do not like to keep things simple. I do. But only as simple as they should be, as Einstein probably said. Sometimes KISS is the shortcut to getting it all wrong.
When working with data quality I have come across the three examples below of striking the right balance between making things a bit complicated and not too simple:
One of the most frequent data quality issues around is duplicates in party master data. Customer, supplier, patient, citizen, member and many other roles of legal entities and natural persons, where the real world entity is described more than once with different values in our databases.
In solving this challenge, we can use methods such as match codes and edit distance to detect duplicates. However, these methods, often called deterministic, are far too simple to really automate the remedy. We can also use advanced probabilistic methods. These methods are better, but have the downside that the matching done is hard to explain, repeat and reuse in other contexts.
My best experience is to use something in between these approaches. Not too simple and not too overcomplicated.
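As a minimal sketch of the two simple techniques named above, the following combines a match code (used to decide which records are compared at all) with an edit distance check on the name. The record fields, the consonant-based match code and the distance threshold are all assumptions made for illustration; this is not a description of any particular tool or of the in-between approach itself.

```python
import re


def match_code(name: str, postal_code: str) -> str:
    """Build a crude match code: first four consonants of the name plus postal code."""
    consonants = re.sub(r"[^bcdfghjklmnpqrstvwxz]", "", name.lower())
    return consonants[:4] + "|" + postal_code.replace(" ", "")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_candidate_duplicate(rec_a: dict, rec_b: dict, max_distance: int = 2) -> bool:
    """Compare only records sharing a match code, then check name similarity."""
    code_a = match_code(rec_a["name"], rec_a["postal_code"])
    code_b = match_code(rec_b["name"], rec_b["postal_code"])
    if code_a != code_b:
        return False
    return edit_distance(rec_a["name"].lower(), rec_b["name"].lower()) <= max_distance
```

With these assumptions, "Smith John" and "Smyth John" in the same postal code come out as candidate duplicates, while the same name in a different postal code does not. That second case hints at why such rules alone are too simple.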
You can make a good algorithm to perform verification of postal and visit addresses in a database for addresses coming from one country. However, if you try the same algorithm on addresses from another country, it often fails miserably.
Making an algorithm for addresses from all over the world will be very complicated. I have not yet seen one that works.
My best experience is to accept the complication of having almost as many algorithms as there are countries on this planet.
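One way to picture "an algorithm per country" is a simple registry that routes each address to a country-specific routine. The sketch below is only an illustration: the function names and the address fields are hypothetical, and the two postal code patterns are heavily simplified stand-ins for real address verification, which needs far more than a pattern check.

```python
import re
from typing import Callable, Dict


def validate_dk(address: dict) -> bool:
    """Simplified check for Denmark: a four-digit postal code."""
    return re.fullmatch(r"\d{4}", address.get("postal_code", "")) is not None


def validate_us(address: dict) -> bool:
    """Simplified check for the United States: ZIP or ZIP+4."""
    return re.fullmatch(r"\d{5}(-\d{4})?", address.get("postal_code", "")) is not None


# One entry per country; the registry grows as countries are added.
VALIDATORS: Dict[str, Callable[[dict], bool]] = {
    "DK": validate_dk,
    "US": validate_us,
}


def validate_address(address: dict) -> bool:
    """Dispatch to the routine registered for the address's country code."""
    country = address.get("country", "")
    validator = VALIDATORS.get(country)
    if validator is None:
        # No silent fallback: a rule made for one country should not be reused for another.
        raise LookupError(f"no address routine registered for country {country!r}")
    return validator(address)
```

Raising an error for an unregistered country, rather than falling back to a generic rule, reflects the point that an algorithm made for one country often fails on another.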
Classifications of products control a lot of the data quality dimensions related to product master data. The most prominent example is completeness of product information. Whether you have complete product information depends on the classification of the product. Some attributes will be mandatory for one product but make no sense at all for a product with a different classification.
If your product classification is too simple, your completeness measurement will not be realistic. An overly granular or otherwise complicated classification system is very hard to maintain and will probably seem like overkill for many purposes of product master data management.
My best experience is that you have to maintain several classification systems and have a linking between them, both inside your organization and between your trading partners.
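To make the dependency between classification and completeness concrete, here is a small sketch. The class names, the required attributes and the link to a partner's classification code are all invented for illustration and do not come from any standard.

```python
from typing import Optional

# Hypothetical rule set: which attributes are mandatory for which class.
REQUIRED_ATTRIBUTES = {
    "power_tool": {"voltage", "weight", "length"},
    "t_shirt": {"size", "colour", "material"},
}

# Hypothetical link table between classification systems:
# (source system, source code, target system) -> target code.
CLASS_LINKS = {
    ("internal", "power_tool", "partner"): "PT-01",
    ("internal", "t_shirt", "partner"): "AP-17",
}


def completeness(product: dict) -> float:
    """Share of the attributes required by the product's class that are filled."""
    required = REQUIRED_ATTRIBUTES[product["classification"]]
    filled = {name for name in required
              if product["attributes"].get(name) not in (None, "")}
    return len(filled) / len(required)


def translate_class(source_system: str, code: str, target_system: str) -> Optional[str]:
    """Look up the linked code in another classification system, if any."""
    return CLASS_LINKS.get((source_system, code, target_system))
```

A drill with voltage and weight but no length scores two thirds here, while the same two attributes would say nothing about the completeness of a t-shirt.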
If you have ever visited some of the many castles around in Europe you may have noticed that there are many architectural similarities. You may also compare these basic structures of a castle with how we can imagine the data architecture related to Product Information Management (PIM).
In my vision of a product information castle there is a main building with five floors of product information. There is a basement for pricing information where we often will find the valuable things as the crown jewels and other treasures. The hierarchy tower combines the pricing information and the different levels of product information. Besides the main castle, we find the logistic stables.
What we do not see on this figure is the product lifecycle management wall around the castle area.
Now, let us get back to the main building and examine what is on each of the floors in the building.
On the ground level, we find the basic product data that typically is the minimum required for creating a product in any system of record. Here we find the primary product identification number or code that is the internal key to all other product data structures and transactions related to the product. Then there usually is a short product description. This description helps internal employees identify a product and distinguish it from other products. If an upstream trading partner produces the product, we may find the identification of that supplier here. If the product is part of internal production, we may have a material type indicating whether it is a raw material, semi-finished product, finished good or packing material.
Except for semi-finished products, we may find more things on the next floor.
This level has product data related to trading the product. We may have a unique Global Trade Item Number (GTIN) that may be in the form of an International Article Number (EAN) or a Universal Product Code (UPC). Here we have commodity codes and a lot of other product data that supports buying, receiving, selling and delivering the product.
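A basic data quality check for this level is validating the GTIN check digit. The sketch below uses the standard GS1 scheme: counting from the right of the digits that precede the check digit, weights alternate 3, 1, 3, 1 and the check digit brings the weighted sum up to a multiple of ten. The function names are arbitrary, and the check says nothing about whether the number has actually been assigned to a product.

```python
def gtin_check_digit(payload: str) -> int:
    """Compute the check digit for the digits that precede it."""
    if not (payload.isascii() and payload.isdigit()):
        raise ValueError("payload must consist of digits only")
    total = 0
    for position, char in enumerate(reversed(payload)):
        weight = 3 if position % 2 == 0 else 1  # rightmost payload digit gets 3
        total += int(char) * weight
    return (10 - total % 10) % 10


def is_valid_gtin(code: str) -> bool:
    """Check the length (GTIN-8, -12, -13 or -14) and the check digit."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13, 14):
        return False
    return gtin_check_digit(code[:-1]) == int(code[-1])
```

The same routine covers both EAN-13 and UPC-A codes, since the weighting is defined from the right-hand side.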
Most castles were not built in one go. Many castles started modestly with maybe just two floors and a tiny tower. In the same way, our product information management solutions for finished and trading goods usually are built on top of an older ERP solution holding the basic and trading data.
On the third level, we find the two grand ballrooms of product information. These ballrooms were introduced when eCommerce started to grow up.
The extended product description is needed because the usual short product description used internally has no meaning to an outsider, as told in the post Customer Friendly Product Master Data. Some good practices for governing the extended product description are to have a common structure for how the description is written, not to use abbreviations and to have a strict vocabulary, as reported in the post Toilet Seats and Data Quality.
Having a product image is pivotal if you want to sell something without showing the real product face-to-face with the customer or other end user. A missing product image is a sign of a broken business process for collecting product data as pondered in the post Image Coming Soon.
On the fourth level, we have three main chambers: product attributes, basic product relations and standard digital assets. These data are the foundation of customer self-service and should, unless you are the manufacturer, be collected from the manufacturer via supplier self-service.
Product attributes are also sometimes called product properties or product features. There can be up to thousands of different data elements that describe a product. Some are common to most products, like height, length, weight and colour. Some are very specific to the product category. This challenge is actually the reason for being of dedicated Product Information Management (PIM) solutions, as told in the post MDM Tools Revealed.
Basic product relations are the links between a product and other products, like a product that has several different accessories that go with it, or a product being the successor of another, now decommissioned, product.
Standard digital assets are documents like installation guides, line drawings and data sheets as examined in the post Digital Assets and Product MDM.
On the upper fifth floor we find elements like those on the fourth floor, but usually these are elements that you won’t necessarily apply to all products but only to your top products, where you want to stand out from the crowd and distance yourself from your competitors.
Special content consists of descriptions of and stories about the product beyond the hard features. Here you tell why the product is better than other products and in which circumstances the product can be used. A common aim with these descriptions is also Search Engine Optimization (SEO).
X-sell (cross-sell) and up-sell product relations apply to your particular mix of products and may be made subjective, for example by looking at up-sell from a profit margin point of view. X-sell and up-sell relations may be defined from upstream by you or your upstream trading partners, but may also drip down on the roof from the behaviour of your downstream trading partners / customers, as manifested in the classic webshop message: “Those who bought product A also bought / looked at product B”.
Advanced digital assets are broader and more lively material than the hard fact line drawings and other documents. Increasingly, newer digital media types such as video are used for this purpose.
All in all the rooftop takes us to the upper side of the cloud.
A commonly seen user requirement for Master Data Management (MDM) solutions is the ability to copy the content of the attributes of an existing entity when creating a new entity. For example, when creating a new product you may find it nice to copy all the field values from an existing similar product to the new product and then just change what is different for the new product. Just like using copy and paste in Excel or other so-called productivity tools.
We all know the dangers of copy and paste and there are plenty of horror stories out there of the harsh consequences like when copying and pasting in a job application and forgetting to change the name of the targeted employer. You know: “I have always dreamed about working for IBM” when applying at Oracle.
The exact same bad things may happen when doing copy and paste when working with master data. You may forget to change exactly that one important piece of information because you miss guidance on the copied data within your system of entry.
Using an inheritance approach is a better way. For product master data, this approach is based on having mature hierarchy management in place. When creating a new product you place your product in the hierarchy, where it will inherit the attributes common to products on the same branch of the hierarchy, leaving it to you to fill in the exact attributes that are specific to the new product. If a new product requires a new branch in the hierarchy, you are forced to think through the common attributes for this branch.
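The inheritance idea can be sketched as a tree where each node only stores what is specific to it and resolves everything else from its ancestors. The class and attribute names below are hypothetical and only meant to show the mechanism.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class HierarchyNode:
    """A branch or a product in a product hierarchy."""
    name: str
    parent: Optional["HierarchyNode"] = None
    attributes: Dict[str, object] = field(default_factory=dict)

    def effective_attributes(self) -> Dict[str, object]:
        """Attributes inherited from all ancestors, overridden by local values."""
        inherited = self.parent.effective_attributes() if self.parent else {}
        return {**inherited, **self.attributes}


# Illustrative hierarchy: only the voltage is entered for the new product.
tools = HierarchyNode("Tools", attributes={"warranty_months": 24})
drills = HierarchyNode("Drills", parent=tools, attributes={"power_source": "battery"})
new_drill = HierarchyNode("Drill X", parent=drills, attributes={"voltage": 18})
```

Unlike copy and paste, nothing is duplicated: a change to the warranty on the "Tools" branch is picked up by every product below it, and a product can still override a value locally.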
For party (customer, supplier and other business partner) master data you may inherit from the outside world by fetching what is already digitalized, which includes names, addresses and other contact data, leaving it to you to fill in the party master data that is specific to your way of doing business.
One of my pet peeves in data quality for CRM and ERP systems is the often used way of looking at entities, not least party entities, in a flat data model, as told in the post A Place in Time.
Party master data, and related location master data, will eventually be modeled in very complex models, and surely we see more and more examples of that. For example, I remember that a long time ago I worked with the ERP system that later became Microsoft Dynamics AX. Back then I had issues with the simplistic and not role-aware data model. As I’m currently working on a project using the AX 2012 Address Book, it’s good to see that things have certainly developed.
This blog has quite a few posts on hierarchy management in Master Data Management (MDM) and even Hierarchical Data Matching. But I have to admit that even complex relational data models and hierarchical approaches in fact don’t align completely with the real world.
I remember at this year’s MDM Summit Europe that Aaron Zornes suggested that a graph database will be the best choice for reflecting the most basic reference dataset being The Country List. Oh yes, and in master data too you should think then, though I doubt that the relational database and hierarchy management will be out of fashion for a while.
So it could be good to know if you have seen or worked with graph databases in master data management beyond representing a static analysis result as a graph database.
The post has a good walk through on the topic and reaches this conclusion:
“So, which is better, Deterministic Matching or Probabilistic Matching? The question should actually be: ‘Which is better for you, for your specific needs?’ Your specific needs may even call for a combination of the two methodologies instead of going purely with one.”
This little exercise brings me to an observation about data matching: matching party master data, not least when you do this for several purposes, ultimately is identity resolution, as discussed in the post The New Year in Identity Resolution.
For that we need what could be called hierarchical data matching.
The reason we need hierarchical data matching is that more and more organizations are looking into master data management and then realize that the classic name and address matching rules do not necessarily fit when party master data are going to be used for multiple purposes. What constitutes a duplicate in one context, like sending a direct mail, doesn’t necessarily make a duplicate in another business function and vice versa. Duplicates come in hierarchies.
One example is a household. You probably don’t want to send two sets of the same material to a household, but you might want to engage in a 1-to-1 dialogue with the individual members. Another example is that you might do some very different kinds of business with the same legal entity. Financial risk management is the same, but different sales or purchase processes may require very different views.
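The household example can be sketched by grouping the same records at two levels. Exact keys on normalized fields are used here purely as a stand-in for real matching rules, and the field names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def household_key(record: dict) -> Tuple[str, ...]:
    """Level used for mailings: same address and family name."""
    return (record["address"].strip().lower(), record["last_name"].strip().lower())


def individual_key(record: dict) -> Tuple[str, ...]:
    """Level used for 1-to-1 dialogue: household plus first name."""
    return household_key(record) + (record["first_name"].strip().lower(),)


def group_records(records: List[dict], key_fn: Callable[[dict], tuple]) -> Dict[tuple, List[int]]:
    """Group record ids by the key of the chosen hierarchy level."""
    groups: Dict[tuple, List[int]] = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record["id"])
    return dict(groups)


records = [
    {"id": 1, "first_name": "Anna", "last_name": "Jensen", "address": "Main Street 1"},
    {"id": 2, "first_name": "Peter", "last_name": "Jensen", "address": "Main Street 1"},
    {"id": 3, "first_name": "anna", "last_name": "jensen", "address": "main street 1"},
]
households = group_records(records, household_key)    # one household with three records
individuals = group_records(records, individual_key)  # two individuals within it
```

Records 1 and 3 are duplicates at the individual level, while all three belong together at the household level. Which grouping counts as "the" duplicate depends on the business function asking.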
The last session I attended today was an expert panel on Reference Data Management (RDM).
I guess the list of countries on this planet is the prime example of what is reference data and today’s session provided no exception from that.
Even though a list of countries is fairly small and there shouldn’t be everyday changes to the list, maintaining a country list isn’t as simple as you might think.
First of all, official sources for a country list aren’t in agreement. The range of countries given an ISO code isn’t the same as the range of countries where, for example, the Universal Postal Union (UPU) says you can make a delivery.
Another example I have had some challenges with is that the D&B WorldBase (a large world-wide business directory) has four country codes for what is generally regarded as the United Kingdom, as the D&B country reference data was probably defined by a soccer fan recognizing the distinct national soccer teams of England, Wales, Scotland and Northern Ireland.
The expert panel moderator, Aaron Zornes, went as far as suggesting that a graph database may be the best technology for reflecting the complexity in reference data. Oh yes, and in master data too you should think then, though I doubt that the relational database and hierarchy management will be out of fashion for a while.
During my years working with data quality and master data management it has always struck me how different organizations are managing the party master data domain while in fact the issues are almost the same everywhere.
First of all party master data are describing real world entities being the same to everyone. Everyone is gathering data about the same individuals and the same companies being on the same addresses and having the same digital identities. The real world also comes in hierarchies as households, company families and contacts belonging to companies which are the same to everyone. We may call that the external hierarchy.
Based on that everyone has some kind of demand for intended duplicates as a given individual or company may have several accounts for specific purposes and roles. We may call that the internal hierarchy.
A party master data solution will optimally reflect the internal hierarchy while most of the business processes around are supported by CRM-systems, ERP-systems and special solutions for each industry.
Reflecting the external hierarchy will be the same for everyone, and there is no need for anyone to reinvent the wheel here. There are already plenty of data models, data services and data sources out there.
Right now I’m working on a service called instant Data Quality that is capable of embracing and mashing up external reference data sources for addresses, properties, companies and individuals from all over the world.
“Why did 85% of the 1700 CMOs interviewed say they use social media as a communications channel and yet only 14% of them measure the ROI?”
A traditional discipline in measuring ROI from a certain market activity is, as told in the post Matchback and Master Data Management, that you try to figure out from which activity a new (prospect) customer was triggered.
The problem is that the trigger may be in one channel but the customer shows up in another channel.
Measuring the Return on Investment (ROI) in doing social media communication and social CRM also requires matchback and in order to do this you will need social master data management where the old systems of records are linked to the new systems of engagement.
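A very reduced sketch of matchback across channels could look like the following. It assumes that a link table already exists between channel-specific identifiers (a social handle, a webshop e-mail address) and a common party identifier, which is the hard part in practice. All names and figures are made up, and crediting the full order amount to every campaign that reached the party is a deliberately naive attribution rule.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def matchback(orders: List[dict], touches: List[dict],
              identity_links: Dict[Tuple[str, str], str]) -> Dict[str, float]:
    """Sum order amounts per campaign that reached the same party in any channel."""
    campaigns_by_party = defaultdict(set)
    for touch in touches:
        party = identity_links.get((touch["channel"], touch["identifier"]))
        if party is not None:
            campaigns_by_party[party].add(touch["campaign"])

    revenue: Dict[str, float] = defaultdict(float)
    for order in orders:
        party = identity_links.get((order["channel"], order["identifier"]))
        # Orders from unlinked identities cannot be matched back to any campaign.
        for campaign in campaigns_by_party.get(party, ()):
            revenue[campaign] += order["amount"]
    return dict(revenue)


# Hypothetical link between a system of engagement and a system of record.
identity_links = {
    ("twitter", "@anna"): "P1",
    ("webshop", "anna@example.com"): "P1",
}
touches = [{"channel": "twitter", "identifier": "@anna", "campaign": "spring"}]
orders = [
    {"channel": "webshop", "identifier": "anna@example.com", "amount": 100.0},
    {"channel": "webshop", "identifier": "someone@example.com", "amount": 50.0},
]
result = matchback(orders, touches, identity_links)
```

The second order stays unattributed because its identity is not linked to anything, which illustrates why the measurement depends on the master data links rather than on the campaign data alone.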
The most frequent data quality improvement process done around is deduplication of party master data.
A core functionality of many data quality tools is the capability to find duplicates in large datasets with names, addresses and other party identification data.
When evaluating the result of such a process we usually divide the result of found duplicates into:
False positives being automated match results that actually do not reflect real world duplicates
True positives being automated match results reflecting the same real world entity
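Once a sample of the automated match results has been reviewed and sorted into these two groups, the share of correct matches (often called precision) is simple arithmetic. The counts used in the example call are invented for illustration.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of the automated match results that reflect real world duplicates."""
    found = true_positives + false_positives
    if found == 0:
        raise ValueError("no match results to evaluate")
    return true_positives / found


# Hypothetical review outcome: 90 correct matches and 10 wrong ones.
share_correct = precision(true_positives=90, false_positives=10)
```

Note that this figure says nothing about duplicates the process did not find at all.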
The difficulties in reaching the above result aside, you would think the rest is easy. Take the true positives, merge them into a golden record and purge the unneeded duplicate records in your database.
Well, I have seen so many well executed deduplication jobs ending just there, because there are a lot of reasons for not making the golden records.
Sure, a lot of duplicates “are bad” and should be eliminated.
But many duplicates “are good” and have actually been put into the databases for a good reason, supporting different kinds of business processes where one view is needed in one case and another view is needed in another case.
Many, many operational applications, including very popular ERP and CRM systems, do have inferior data models that are not able to reflect the complexity of the real world.
Only a handful of MDM (Master Data Management) solutions are able to do so, but even then the solutions aren’t easy, as most enterprises have an IT landscape with all kinds of applications with other business relevant functionality that isn’t replaced by an MDM solution.
Most data quality and master data management gurus, experts and practitioners agree that achieving a “single source of truth” is a nice term, but is not what data quality and master data management is really about as expressed by Michele Goetz in the post Master Data Management Does Not Equal The Single Source Of Truth.
Even among those people, including me, who think emphasis on real world alignment could help in getting better data and information quality, as opposed to focusing on fitness for multiple different purposes of use, there is acknowledgement that there is a “digital distance” between real world aligned data and the real world, as explained by Jim Harris in the post Plato’s Data. Also, different publicly available reference data sources that should reflect the real world for the same entity are often in disagreement.
When working with improvement of data quality in party master data, which is the most frequent and common master data domain with issues, you encounter the same issues over and over again, like:
Many organizations have a considerable overlap of real world entities that are a customer and a supplier at the same time. Expanding to other party roles, this intersection is even bigger. This calls for a 360° Business Partner View.
Most organizations divide activities into business-to-business (B2B) and business-to-consumer (B2C). But the great majority of businesses are small companies where business and private is a mixed case, as told in the post So, how about SOHO homes.
When doing B2C including membership administration in non-profit you often have a mix of single individuals and households in your core customer database as reported in the post Household Householding.
As examined in the post Happy Uniqueness, there are a lot of good fit-for-purpose reasons why customer and other party master data entities are deliberately duplicated within different applications.
Lately, social master data management (Social MDM) has emerged as the new leg in mastering data within multi-channel business. Embracing a wealth of digital identities will become yet another challenge in getting a single customer view and reaching for the impossible, and not always desirable, single source of truth.
A way of getting some kind of structure into this possible, and actually very common, mess is to strive for a hierarchical single source of truth where the concept of a golden record is implemented as a model with golden relations between real world aligned external reference data and internal fit for purpose of use master data.
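As a minimal sketch of what such golden relations could look like as a data structure: internal records are kept as they are, each for its own purpose, and are related to one external reference entity instead of being merged away. The class names, fields and identifiers below are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ExternalEntity:
    """A real world aligned entity taken from an external reference source."""
    reference_id: str
    name: str


@dataclass(frozen=True)
class InternalRecord:
    """A fit-for-purpose record in an operational application."""
    record_id: str
    source_system: str
    purpose: str
    reference_id: str  # the golden relation to the external entity


def records_by_entity(records: List[InternalRecord]) -> Dict[str, List[InternalRecord]]:
    """Collect all internal records related to the same external entity."""
    grouped: Dict[str, List[InternalRecord]] = defaultdict(list)
    for record in records:
        grouped[record.reference_id].append(record)
    return dict(grouped)


# Illustrative data: one company held as customer and as supplier.
company = ExternalEntity(reference_id="REF-001", name="Example Ltd")
internal = [
    InternalRecord("C-17", "CRM", "customer account", "REF-001"),
    InternalRecord("S-04", "ERP", "supplier account", "REF-001"),
    InternalRecord("C-99", "CRM", "customer account", "REF-002"),
]
view = records_by_entity(internal)
```

Here the customer and supplier records for the same company stay in place as intended duplicates, while the shared reference identifier gives the single view when it is needed.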