I just realized that this will be post number 1,000 published on this blog. So, rather than saying something new, let me recap what it has all been about during nearly 10 years of blogging about some nerdy stuff.
Data quality has been the main theme. When writing about data quality, you cannot avoid touching on Master Data Management (MDM). In fact, the most used category on this site, with 464 entries and counting, is Master Data.
The second most applied category on this blog is, with 219 entries, Data Architecture.
The most applied data quality activity around is data matching. As this is also where I started my data quality venture, there have been 192 posts about Data Matching.
The newest category relates to Product Information Management (PIM) and is, with 20 posts at the moment, about Product Data Syndication.
Even though data quality is a serious subject, you must not forget to have fun. 66 posts, including a yearly April Fools post, have been categorized as Supposed to be a Joke.
Thanks to all who are reading this blog and not least to all who from time to time take the time to comment, like and share.
When working with data management – and not least listening to and reading stuff about data management – there is in my experience too little work with the actual data going around out there.
I know this from my own work. Most often, presentations, studies and other decision support in the data management realm are based on random anecdotes about the data rather than a look at the data itself. And don't get me wrong. I know that data must be seen as information in context, that the processes around data are crucial, that the people working with data are key to achieving better data quality, and much other cleverness that is not about the data as is.
But time and again I realize that you get the best understanding of the data when getting your hands dirty working with data from various organizations. For me that has been when deduplicating party master data, when calibrating a data matching engine for party master data against third-party reference data, when grouping and linking product information held by trading partners, when relating other master data to location reference data, and all the other activities we do in order to raise data quality and get a grip on Master Data Management (MDM) and Product Information Management (PIM).
Well, perhaps it is just me, because I never liked real dirt and gardening.
In a comment to this post, Nadim observes that this Gartner quadrant mixes up pure MDM players and PIM players.
That is true. It has always been a discussion point whether one should combine or separate solutions for Master Data Management (MDM) and Product Information Management (PIM). This is a question to be asked by end user organizations, and it is certainly a question the vendors on the market(s) ask themselves.
If we look at the vendors included in the 2018 Magic Quadrant, the PIM part is represented in several different ways.
I would say that two of the newcomers, Viamedici and Contentserv (yellow dots in the figure below), are mostly PIM players today. This is also mentioned as a caution by Gartner and is a reason for their current bottom-left'ish placement in the quadrant. But both companies want to become more multidomain MDM'ish.
Eight years ago, I was engaged at Stibo Systems as part of their first steps on the route from PIM to multidomain MDM. Enterworks and Riversand (the orange dots in the figure above) are on the same road.
Informatica has taken a different path towards the same destination, as they back in 2012 bought the PIM player Heiler. Gartner has some cautions about how well the MDM and PIM components make up a whole in the Informatica offerings, and similar cautions were expressed around the Forrester PIM Wave, as seen in the comments to the post There is no PIM quadrant, but there is a PIM wave.
If Gartner is still postponing this year's MDM quadrant, they may even manage to reflect this change. We are of course also waiting to see if newcomers will make it into the quadrant and bring the number of vendors in there back above 10. Some of the candidates will be the likes of Reltio and Semarchy.
Anyway, back to the takeover of Orchestra by Tibco: this is not the first time Tibco has bought something in the MDM and data quality realm. Back in 2010 Tibco bought the data quality tool and data matching front runner Netrics, as reported in the post What is a best-in-class match engine?
Back then, Tibco did not defend Netrics' position in the Gartner Magic Quadrant for Data Quality Tools. The latest Data Quality Tools quadrant is, like the MDM quadrant, from 2017 and was touched upon on this blog here.
So, it will be exciting to see how Tibco will defend both the joint Tibco MDM solution, which in 2017 was a sliding niche player at Gartner, and the Orchestra MDM solution, which in 2017 was a leader in the Gartner MDM quadrant.
Data matching is a sub-discipline within data quality management. Data matching is about establishing a link between data elements and entities that do not have the same value but refer to the same real-world construct. The most common example is establishing a link between two different data records probably describing the same person, as for example:
Bob Smith at 1 Main Str in Anytown
Robert Smith at One Main Street in Any Town
Data matching can be applied to other master data entity types such as companies, locations, products and more.
In the data matching world there have always been attempts to apply machine learning (or artificial intelligence if you like). This is because deterministic approaches usually result in too many false negatives: actual matching entities not found by the computer. Probabilistic / fuzzy logic approaches usually work better, but often not well enough.
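As a toy illustration (not any specific vendor's engine), here is a minimal Python sketch of the difference between deterministic and fuzzy comparison, using the two example records above. The normalization table is my own crude assumption; real matching engines use far richer rules and reference data.

```python
from difflib import SequenceMatcher

# Assumed, toy normalization table - real engines use far richer rules.
VARIANTS = {"str": "street", "one": "1", "robert": "bob"}

def normalize(record: str) -> str:
    """Lowercase, strip punctuation and expand a few known variants."""
    tokens = record.lower().replace(".", "").replace(",", "").split()
    return " ".join(VARIANTS.get(t, t) for t in tokens)

def similarity(a: str, b: str) -> float:
    """Fuzzy similarity score between 0.0 and 1.0 on normalized records."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

a = "Bob Smith at 1 Main Str in Anytown"
b = "Robert Smith at One Main Street in Any Town"

print(a == b)                  # deterministic: False -> a false negative
print(similarity(a, b) > 0.8)  # fuzzy: True -> candidate match for review
```

The deterministic comparison misses the match entirely, while the fuzzy score flags the pair as a candidate for linking or human review.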
One of my own attempts with machine learning was made within a solution at Dun & Bradstreet Nordic called GlobalMatchBox. One happy result of the machine learning capability was described in the post The Art in Data Matching.
In recent years I have embraced product master data and product data quality within my business activities. The pain points in handling product information do in some cases include matching product entities, but even more they are about matching the different taxonomies in use for product data, not least between trading partners in business ecosystems.
The intersection between Artificial Intelligence (AI) and Master Data Management (MDM) – and the associated discipline Product Information Management (PIM) – is an emerging topic.
A use case close to me
In my work setting up a service called Product Data Lake, the inclusion of AI has become an important topic. The aim of this service is to translate between the different taxonomies in use at trading partners, for example when a manufacturer shares its product information with a merchant.
In some cases the manufacturer, the provider of product information, may use the same standard for product information as the merchant. This may be deep standards such as eCl@ss and ETIM or pure product classification standards such as UNSPSC. In this case we can apply deterministic matching of the classifications and the attributes (also called properties or features).
However, most often there are uncovered areas even when two trading partners share the same standard. And then again, the most frequent situation is that the two trading partners use different standards.
As always, applying too much human interaction is costly, time consuming and error prone. Therefore, we are eagerly training our machines to do this work in a cost-effective way, within a much shorter time frame and with a repeatable and consistent outcome, to the benefit of the participating manufacturers, merchants and other enterprises involved in exchanging products and the related product information.
Learning from others
This week I participated in a workshop around exchanging experiences and proving use cases for AI and MDM. The above-mentioned use case was one of several examined here. And for sure, there is a basis for applying AI with substantial benefits for the enterprises who get this. The workshop was arranged by Camelot Management Consultants within their Global Community for Artificial Intelligence in MDM.
One piece of news this week was that Maersk is, for the first time, taking a large container ship from East Asia to Europe via a northern route through Arctic waters, as told in this Financial Times article.
The purpose of this trip is to explore the possibility of avoiding the longer southern route, including shoehorning the sea traffic through the narrow Suez Canal. A similar opportunity exists around North America as an alternative to going through the Panama Canal.
Just as we find new routes for moving products, we may also explore new routes when it comes to moving information about products. Until now the possibilities, besides the cumbersome exchange of spreadsheets, have been to shoehorn product information from the manufacturer into a consensus-based data portal or data pool, from where the merchant can fetch the information in exactly the same shape as his competitors do.
The term data monetization is trending in the data management world.
Data monetization is about harvesting direct financial results from having access to data that is stored, maintained, categorized and made accessible in an optimal manner. Traditionally, data management and analytics have contributed indirectly to financial outcomes by aiming at keeping data fit for purpose in the various business processes that produce value for the business. Today the best performers are using data much more directly to create new services and business models.
In my view there are three flavors of data monetization:
Selling data: This is something that has been known in the data management world for years. Notable examples are the likes of Dun & Bradstreet, who sell business directory data as touched upon in the post What is a Business Directory? Another example is postal services around the world selling their address directories. This is the kind of data we know as third party data.
Wrapping data around products: If you have a product – or a service – you can add tremendous value and make it more sellable by wrapping data, potentially including third party data, around it. These data will thus become second party data as touched upon in the post Infonomics and Second Party Data.
Advanced analytics and decision making: You can combine third party data, second party data and first party data (your own data) to support advanced analytics and fast operational decision making in order to sell more, reduce costs and mitigate risks.
Please learn more about data monetization by downloading a recent webinar hosted by Information Builders, featuring their expert Rado Kotorov and yours truly, here.
Data matching has always been a substantial part of the capabilities in data quality technology and has become a common capability in Master Data Management (MDM) solutions.
We use the term data matching when talking about linking entities where we cannot just use exact keys in databases.
The most prominent example around is matching names and addresses related to parties, where these attributes can be spelled differently and formatted using different standards but still refer to the same real-world entity. The most common scenarios are deduplication, where we clean databases of duplicate customer, vendor and other party role records, and reference matching, where we identify and enrich party data records with external directories.
A way to pre-process party data matching is matching the locations (addresses) with external references, which have become more and more available around the world, so you have a standardized address and thereby reduce the fuzziness. In some geographies you can even make use of more extended location data, such as whether the location is a single-family house, a high-rise building, a nursing home or a campus. Geocodes can also be brought into the process.
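A toy sketch of that standardization step is shown below. The abbreviation table is my own assumption; real address reference services do far more, including validation against the actual address registry.

```python
import re

# Assumed abbreviation table - illustrative only.
ABBREVIATIONS = {"str": "street", "st": "street", "ave": "avenue", "rd": "road"}

def standardize_address(raw: str) -> str:
    """Lowercase, strip punctuation and expand common abbreviations so
    differently formatted addresses compare equal."""
    tokens = re.sub(r"[.,]", " ", raw.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

print(standardize_address("1 Main Str., Anytown"))
print(standardize_address("1 Main Street, Anytown"))
# both print: 1 main street anytown
```

After this step, the two formatting variants collapse into one canonical form, so the subsequent party matching has much less fuzziness to deal with.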
Handling the location as a separate unique entity can also be useful in many industries such as utilities, telco, finance, transit and more.
In the old days this was quite difficult, as you often only had a product description that had to be parsed into discrete elements, as examined in the post Matching Light Bulbs.
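A rough sketch of that parsing step could pull a couple of attributes out of a free-text bulb description with regular expressions. The patterns and attribute names below are my own assumptions for illustration, not taken from the post mentioned.

```python
import re

def parse_bulb(description: str) -> dict:
    """Extract discrete attributes from a free-text light bulb description."""
    attrs = {}
    # Wattage: a number followed by a W, e.g. "60W" or "60 W".
    wattage = re.search(r"(\d+)\s*W\b", description, re.IGNORECASE)
    if wattage:
        attrs["wattage_w"] = int(wattage.group(1))
    # Socket: one of a few common cap/base codes.
    socket = re.search(r"\b(E14|E27|GU10|B22)\b", description, re.IGNORECASE)
    if socket:
        attrs["socket"] = socket.group(1).upper()
    return attrs

print(parse_bulb("LED bulb 60W E27 warm white"))
# {'wattage_w': 60, 'socket': 'E27'}
```

Once the attributes are discrete like this, products can be compared attribute by attribute instead of description string against description string.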
With the rise of Product Information Management (PIM), we now often do have the product attributes in a granular form. However, using traditional matching technology made for party master data will not do the trick, as this is a different and more complex scenario. My thinking is that graph technology will help, as touched upon in the post Three Ways of Finding a Product.
During my engagements in selecting and working with the major data management tools on the market, I have from time to time experienced that they lack support for specialized data management needs in minor markets.
Two such areas I have been involved with as a Denmark based consultant are:
The authorities in Denmark offer free-of-charge access to very up-to-date, granular and accurate address data that, besides the envelope form of an address, also comes with a data-management-friendly key (usually referred to as KVHX) at the unit level for each residential and business address in the country. Besides the existence of the address, you also have access to the activity that takes place at the address, for example whether it is a single-family house, a nursing home or a campus, and other useful information for verification, matching and other data management activities.
If you want to verify addresses with the major international data management tools I have come across, much of these goodies are gone, as for example:
Address reference data are refreshed only once per quarter
The key and the access to more information are not available
A price tag for data has been introduced
In Denmark (and other Scandinavian countries) we have a national identification number (known as personnummer) that is used much more intensively than the national IDs known from most other countries, as told in the post Citizen ID within seconds.
The data masking capabilities in major data management solutions come with pre-built functions for national IDs – but only covering major markets, such as the United States Social Security Number, the United Kingdom NINO and the kinds of national ID in use in a few other large Western countries.
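When the tool lacks a pre-built function, you end up writing your own. A sketch of what a home-made masking function for the Danish personnummer could look like is shown below; the pattern is a simplification of the real DDMMYY-SSSS format and does no validity checking of the date or check digits.

```python
import re

# Simplified pattern for a Danish personnummer: DDMMYY-SSSS, hyphen optional.
PERSONNUMMER = re.compile(r"\b(\d{6})-?(\d{4})\b")

def mask_personnummer(text: str) -> str:
    """Keep the birth-date part, mask the four-digit sequence number."""
    return PERSONNUMMER.sub(r"\1-XXXX", text)

print(mask_personnummer("Case note: citizen 010190-1234 called support"))
# Case note: citizen 010190-XXXX called support
```

Masking only the sequence number keeps the record useful for age-based analytics while removing the part that identifies the individual.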
So, GDPR compliance is just a little bit harder here even when using a major tool.