Data matching is a subdiscipline within data quality management. Data matching is about establishing a link between data elements and entities that do not have the same value but refer to the same real-world construct.
The most common scenario for data matching is deduplication of customer data records held across an enterprise. Here we often see a gap between what we technically try to do and the desired business outcome of deduplication. In my experience, this misalignment comes down to insufficient alignment with the real world.
What we technically do is basically to find a similarity between data records that have typically been pre-processed with some form of standardization. This is often not enough.
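To make this concrete, here is a minimal sketch of that two-step approach – standardize first, then compare – using only Python's standard library. The abbreviation table and the similarity measure are illustrative assumptions, not how any particular tool works.

```python
from difflib import SequenceMatcher

# Illustrative standardization rules (assumed for this sketch)
ABBREVIATIONS = {"str": "street", "rd": "road", "ave": "avenue"}

def standardize(value: str) -> str:
    """Lowercase, strip simple punctuation and expand common abbreviations."""
    tokens = value.lower().replace(".", "").replace(",", "").split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

def similarity(a: str, b: str) -> float:
    """Similarity ratio (0..1) between two standardized values."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()

score = similarity("1 Main Str, Anytown", "1 Main Street Anytown")
# 1.0 here, since both values standardize to "1 main street anytown"
```

In practice a tool would compare against a tuned threshold (say 0.85) rather than require exact agreement – and, as argued above, even a perfect string score does not guarantee the records refer to the same real-world entity.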
Standardization and verification of addresses is a very common element in data quality / data matching tools. Often such a tool will use a service either from the same brand or from a third party. Unfortunately, a single service is often not enough. This is because:
Most services are biased towards a certain geography. They may for example be quite good for addresses in the United States but very poor compared to local services for other geographies. This is especially true for geographies with multiple languages in play, as exemplified in the post The Art in Data Matching.
There is much more to an address than the postal format. In deduplication it is for example useful to know if the address is a single-family house or a high-rise building, a nursing home, a campus or other building with lots of units.
In summary, the capability to tell if two data records represent the same real-world entity will eventually involve identity resolution. And as this is very poorly supported by the data quality tools around, a lot of manual work will be involved if the business processes that rely on the data matching cannot tolerate too many, or in some cases any, false positives – or false negatives.
Even telling that a true positive match is true in all circumstances is hard. The predominant examples of this challenge are:
Is a match between what seems to be an individual person and what seems to be the household where the person lives a true match?
Is a match between what seems to be a person in a private role and what seems to be the same person in a business role a true match? This is especially tricky with sole proprietors working from home, like farmers, dentists, freelance consultants and more.
Is a match between two sister companies on the same address a true match? Or two departments within the same company?
We often realize that the answers to these questions differ depending on the business processes where the result of the data matching will be used.
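One way to operationalize this is to keep the borderline match types in the result and let each consuming business process decide which of them count as true. The process names and match types below are purely hypothetical examples of such a policy table:

```python
# Hypothetical per-process policies: which borderline match types
# (person vs. household, private vs. business role, sister companies)
# count as a true match for each consuming business process.
POLICIES = {
    "direct_mail":  {"person_household": True,  "private_business_role": True},
    "credit_check": {"person_household": False, "private_business_role": False},
}

def is_true_match(process: str, match_type: str) -> bool:
    """Return whether this match type counts as true for this process.
    Unknown match types default to False (conservative)."""
    return POLICIES[process].get(match_type, False)
```

The point of the sketch is only that the same candidate match can legitimately be true for one process (sending one mail piece per household is fine) and false for another (a credit check must not mix household members).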
The solution is not simple. If we want automated and broadly usable results, the data matching functionality must be quite sophisticated in order to take advantage of what is available in the real world. And the data model where we hold the result of the data matching must be quite complex if we want to reflect the real world.
The Gartner Magic Quadrant for Data Quality Tools 2019 is out. It will take you 43 minutes to read through, so let me provide a short overview.
Gartner says that “data quality tools are vital for digital business transformation, especially now that many have emerging features like automation, machine learning, business-centric workflows and cloud deployment models.”
The data quality software tools market was at 1.61 billion USD in 2017, which was an increase of 11.6% compared to 2016.
Gartner sees that end-user demand is shifting toward having broader capabilities spanning data management and information governance. Therefore, the data quality tool market continues to interact closely with the markets for data integration tools and for Master Data Management (MDM) products.
Among the capabilities mentioned is multidomain support, meaning capabilities covering all the specific data subject areas, such as customer, product, asset and location. Interestingly, Gartner continues to focus on customer, though it is only one of several party data domains out there. In my experience, the same data quality challenges exist with vendor and other business partner data as well as with employee data.
According to Gartner, data quality tool vendors are competing to address shifting market requirements by introducing an array of new technologies, such as machine learning, interactive visualization and predictive/prescriptive analytics, all of which they are embedding in data quality tools. They are, according to Gartner, also offering new pricing models, based on open source and subscriptions.
The vendors included in the quadrant are positioned as seen below:
If you want a full copy of the report you can, against providing your personal data, get it from Information Builders here.
The Forrester report has this to say on that theme: “The internet of things has led to systems of automation and systems of design, which introduce new MDM usage scenarios to support co-design and the exchange of information on customers, products, and assets within ecosystems”.
Else, the report of course ranks the best-selling MDM solutions as seen below:
When it comes to mergers and acquisitions on the Master Data Management (MDM) solution market, not much had, until recently, been going on since 2012. Rather, we have seen people leave the established vendors and form or join new companies.
Then, on Valentine’s Day 2019, Symphony Technology Group acquired PIM and MDM provider EnterWorks with the aim of coupling its offerings with those from WinShuttle. WinShuttle has been more of a data management generalist company with focus on ERP data – not least in SAP. This merger ties into the trend of extending MDM platforms to other kinds of data than traditional master data. It will also make an alternative to SAP’s own MDM and data governance offering called MDG.
Fourteen days later there was a new coupling, as reported in the post MDM Market News: Informatica acquires AllSight. This must also be seen as a step in the trend of providing an extended MDM platform with Artificial Intelligence (AI) capabilities. Also, Informatica is here going up against the newer MDM solution provider Reltio, which has been successful in promoting its big data extended MDM platform.
The Gartner Magic Quadrant for Master Data Management (MDM) Solutions 2018 was published last month.
Some of the numbers revealed in the report were the number and distribution of MDM licenses from the included vendors. These covered their top-three master data domains and estimated license counts as well as the number of customers managing multiple domains:
One should of course be aware of the data quality issues related to comparing these numbers, as they to some degree are estimates based on different perceptions at the included vendors. So, let me just highlight these observations:
The overall number of MDM licenses and unique MDM customers (at the included vendors) is not high. Fewer than 10,000 organizations worldwide are running such a solution. The potential new market out there for the salesforce at the MDM vendors is huge.
If you find an existing MDM solution user organization, they probably have a solution from SAP or Informatica – or maybe IBM. To be complete, Oracle has been dropped from the MDM quadrant, as they practically do not promote their MDM solutions anymore, but there are still existing solutions operating out there.
The reign of Customer MDM is over. Product MDM is selling and multidomain is becoming the norm. Several MDM vendors are making their way into the quadrant from a Product Information Management (PIM) base as reported in the post The Road from PIM to Multidomain MDM.
PS: If you, as an end customer organization or a MDM and PIM vendor, want to work with me on the consequences for MDM solutions, here are some Popular Offerings for you.
Ultima Thule is a name for a distant place beyond the known world and the nickname of the most distant object in the solar system closely observed by a man-made object as of today, 1st January 2019. Before the flyby, scientists were unsure if it was two objects, a peanut-shaped object or another shape. The images showing what it is will be downloaded during the next couple of months.
In a comment to this post Nadim observes that this Gartner quadrant is mixing up pure MDM players and PIM players.
That is true. It has always been a discussion point if one should combine or separate solutions for Master Data Management (MDM) and Product Information Management (PIM). This is a question to be asked by end user organizations and it is certainly a question the vendors on the market(s) ask themselves.
If we look at the vendors included in the 2018 Magic Quadrant the PIM part is represented in some different ways.
I would say that two of the newcomers, Viamedici and Contentserv (yellow dots in the figure below), are mostly PIM players today. This is also mentioned as a caution by Gartner and is a reason for their current bottom-left’ish placement in the quadrant. But both companies want to be more multidomain MDM’ish.
Eight years ago, I was engaged at Stibo Systems as part of their first steps on the route from PIM to multidomain MDM. Enterworks and Riversand (the orange dots in the figure above) are on the same road.
Informatica has taken a different path towards the same destination, as they back in 2012 bought the PIM player Heiler. Gartner has some cautions about how well the MDM and PIM components make up a whole in the Informatica offerings, and similar cautions were expressed around the Forrester PIM Wave, as seen in the comments to the post There is no PIM quadrant, but there is a PIM wave.
But there was also a good deal of steadiness. Informatica still holds pole position in the race towards the top-right corner. Orchestra EBX, now disguised as Tibco EBX, is trailing them in the leaders quadrant. Old challengers such as IBM, SAP and Stibo are watching them among the newcomers in the challengers quadrant, and still as the only visionary – according to Gartner – we have Riversand.
In the niche players quadrant, we also still have Ataccama and Enterworks.
But there is still a lot of free space in the top-right corner. There is still room for disruption. Gartner mentions some traditional forces still on the move, being the good old 360 degree view on party data (customer, patient and the somewhat US-biased provider) as well as Product Information Management (PIM), maybe in new wrappings such as PCM or PXM.
If Gartner is still postponing this year’s MDM quadrant, they may even manage to reflect this change. We are of course also waiting to see if newcomers will make it into the quadrant and bring the crowd of vendors in there back above 10. Some of the candidates will be the likes of Reltio and Semarchy.
Else, back to the takeover of Orchestra by Tibco, this is not the first time Tibco buys something in the MDM and Data Quality realm. Back in 2010 Tibco bought the data quality tool and data matching front runner Netrics as reported in the post What is a best-in-class match engine?
Back then, Tibco didn’t defend Netrics’ position in the Gartner Magic Quadrant for Data Quality Tools. The latest Data Quality Tools quadrant is, like the MDM quadrant, from 2017 and was touched on this blog here.
So, it will be exciting to see how Tibco will defend the joint Tibco MDM solution, which in 2017 was a sliding niche player at Gartner, and the Orchestra MDM solution, which in 2017 was a leader in the Gartner MDM quadrant.
Data matching is a subdiscipline within data quality management. Data matching is about establishing a link between data elements and entities that do not have the same value but refer to the same real-world construct. The most common example is establishing a link between two different data records probably describing the same person, as for example:
Bob Smith at 1 Main Str in Anytown
Robert Smith at One Main Street in Any Town
Data matching can be applied to other master data entity types, such as companies, locations, products and more.
In the data matching world there have always been attempts to apply machine learning (or artificial intelligence if you like). This is because deterministic approaches usually result in too many false negatives, being actual matching entities not found by the computer. Probabilistic / fuzzy logic approaches usually work better, but often not well enough.
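A probabilistic / fuzzy approach can be sketched as a weighted combination of per-field similarities. The field names, weights and records below are illustrative assumptions; real engines use tuned match/non-match probabilities and name-variant dictionaries that would also link "Bob" to "Robert".

```python
from difflib import SequenceMatcher

def field_similarity(a: str, b: str) -> float:
    """Fuzzy similarity (0..1) between two field values, case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Illustrative weights: name evidence counts more than city evidence
WEIGHTS = {"name": 0.5, "street": 0.3, "city": 0.2}

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted average of per-field similarities (0..1)."""
    return sum(w * field_similarity(rec_a[f], rec_b[f])
               for f, w in WEIGHTS.items())

a = {"name": "Bob Smith", "street": "1 Main Street", "city": "Anytown"}
b = {"name": "Robert Smith", "street": "One Main Street", "city": "Any Town"}
score = match_score(a, b)  # somewhere between 0 and 1
```

The weighted score is then compared against a threshold, and the machine learning part typically consists of learning those weights and thresholds from examples of confirmed matches and non-matches instead of setting them by hand.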
One of my own attempts with machine learning was made within a solution at Dun & Bradstreet Nordic called GlobalMatchBox. One happy result of the machine learning capability was described in the post The Art in Data Matching.
In recent years I have embraced product master data and product data quality within my business activities. The pain points in handling product information do in some cases include matching product entities, but even more it is about matching the different taxonomies in use for product data, not least between trading partners in business ecosystems.