Data matching is a subdiscipline within data quality management. It is about establishing a link between data elements and entities that do not have the same value but refer to the same real-world construct. The most common example is establishing a link between two different data records that probably describe the same person, for example:
- Bob Smith at 1 Main Str in Anytown
- Robert Smith at One Main Street in Any Town
Data matching can be applied to other master data entity types such as companies, locations, products and more.
In the data matching world there have always been attempts to apply machine learning (or artificial intelligence if you like). This is because deterministic approaches usually result in too many false negatives, that is, actual matching entities not found by the computer. Probabilistic / fuzzy logic approaches usually work better, but often not well enough.
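The difference between the two approaches can be illustrated with a minimal Python sketch using the two example records above. It assumes exact string equality as a stand-in for a deterministic rule and the standard library's `difflib.SequenceMatcher` as a stand-in for a fuzzy comparison. Neither is the method of any particular matching product, and the function names are made up for illustration.

```python
from difflib import SequenceMatcher

record_a = "Bob Smith at 1 Main Str in Anytown"
record_b = "Robert Smith at One Main Street in Any Town"


def deterministic_match(a: str, b: str) -> bool:
    """Assumed deterministic rule: equal after trimming and lowercasing."""
    return a.strip().lower() == b.strip().lower()


def fuzzy_score(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0 based on shared character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


is_exact = deterministic_match(record_a, record_b)
score = fuzzy_score(record_a, record_b)

print(f"deterministic match: {is_exact}")
print(f"fuzzy similarity: {score:.2f}")
```

The deterministic rule rejects the pair, which is a false negative, while the fuzzy score is high enough to flag the pair as a candidate. Where to draw the line between match and no match is a tuning decision that depends on the data.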
One of my own attempts with machine learning was made within a solution at Dun & Bradstreet Nordic called GlobalMatchBox. One happy result of the machine learning capability was described in the post The Art in Data Matching.
In recent years I have embraced product master data and product data quality within my business activities. The pain points in handling product information do in some cases include matching product entities, but even more it is about matching the different taxonomies in use for product data, not least between trading partners in business ecosystems.
So, machine learning leading to artificial intelligence is on my agenda again in a quest for matching metadata as told in the post It is time to apply AI to MDM and PIM.
How about you? Do you see a future with machine learning in data matching? Have you seen any happy results?
I definitely see a future for machine learning in data matching. There are already some non-specialist implementations for generic data matching using machine learning, with solutions such as Reltio and H2O.ai. IMO, there is no single best way of matching, especially for contact data, which is our focus – the best solution is to combine different approaches. The typical problem with a machine learning approach is that it takes time and a lot of user feedback to train the software on the organization’s data, and therefore it is unlikely to deliver acceptable results during the evaluation process. I would expect an approach based on intelligent algorithms and standardization tables (built up as a result of experience on a lot of different customer datasets) to outperform a pure machine learning approach in the early days of evaluation/implementation. Following up the algorithm/table-based approach with machine learning to refine and improve results still further seems to promise better results than either approach in isolation.
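As a rough illustration of the table-based step described in this comment, here is a minimal Python sketch that standardizes the two example records from the post before comparing them. The table `STANDARD_FORMS` and the function names are hypothetical, and `difflib.SequenceMatcher` is only a stand-in for a real comparison algorithm. The machine learning refinement that would follow is not shown.

```python
from difflib import SequenceMatcher

# Hypothetical standardization table. Real tables are far larger and are
# built up from experience with many different datasets.
STANDARD_FORMS = {"bob": "robert", "1": "one", "str": "street"}


def standardize(record: str) -> str:
    """Lowercase the record and replace each token found in the table."""
    tokens = record.lower().split()
    return " ".join(STANDARD_FORMS.get(token, token) for token in tokens)


def similarity(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0 based on shared character blocks."""
    return SequenceMatcher(None, a, b).ratio()


record_a = "Bob Smith at 1 Main Str in Anytown"
record_b = "Robert Smith at One Main Street in Any Town"

raw_score = similarity(record_a.lower(), record_b.lower())
std_score = similarity(standardize(record_a), standardize(record_b))

print(f"without standardization: {raw_score:.2f}")
print(f"with standardization:    {std_score:.2f}")
```

For this pair the standardized records are nearly identical, so the score rises. The remaining difference ("anytown" versus "any town") is the kind of case a table alone does not resolve.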
Thanks a lot for commenting, Steve. I had the pleasure of looking into the Reltio roadmap for data matching earlier this year. It is promising, but still in the making when it comes to machine learning. I agree about the need for intensive training on the machine learning part. The solution I was involved with at Dun & Bradstreet benefited from a setup where D&B received various client party master data on a daily basis to be matched against the D&B Worldbase, often with inspection of the dubious results. That provided a lot of training data, with improvement in the matching process as a result. This solution was, exactly as you suggest, a mix of different approaches.
First of all, thank you for running this blog, which I’ve been reading for a while now and which is quite unique in its genre.
Reading this post made me want to write a few words.
In my experience, the undermatching issue, aka leaving too many false negatives in the results, doesn’t seem to plague deterministic matching specifically. In the past, I’ve seen cases of clear overmatching with this approach, given good standardization beforehand and a rich enough set of comparison algorithms; this could typically be fixed by decreasing the tolerance in the parameters and thresholds of those algorithms.
Is it implicitly assumed in this post that deterministic matching results in undermatching if one of the previously mentioned factors (good prior standardization and/or a rich set of comparison algorithms) is lacking? In this case, I’d argue that an ML approach would still be sensitive to good standardization or the lack thereof: while the decision process (match vs. no match) does indeed differ, the intrinsic nature of the matching task remains the clustering of records based on their similarity, which is deeply affected by the standardization process.
I’m curious to read your thoughts about this.
Hi Gani. Thanks for the kind words and for adding in. I remember when I was involved in putting a fuzzy-logic-based data matching solution on the Nordic market, we had trials where we competed against more deterministic solutions from the established data quality vendors. We did a much better job. I know standardization, typically around address data, is a way to improve results. But this is not straightforward. Sometimes you can get a false negative because the similarity between two records gets lower after standardization.
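One way this can happen, sketched in Python under made-up assumptions: the standardization table knows one abbreviation but not a misspelled variant of it, so only one of the two records is expanded and the pair drifts apart. The records, the table and the use of `difflib.SequenceMatcher` are illustrative only and are not taken from the solution mentioned above.

```python
from difflib import SequenceMatcher

# Hypothetical table that knows "str" but not the misspelling "strt".
STANDARD_FORMS = {"str": "street"}


def standardize(record: str) -> str:
    """Lowercase the record and replace each token found in the table."""
    tokens = record.lower().split()
    return " ".join(STANDARD_FORMS.get(token, token) for token in tokens)


def similarity(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0 based on shared character blocks."""
    return SequenceMatcher(None, a, b).ratio()


record_a = "1 Main Str"
record_b = "1 Main Strt"  # variant the table does not recognize

before = similarity(record_a.lower(), record_b.lower())
after = similarity(standardize(record_a), standardize(record_b))

print(f"before standardization: {before:.3f}")
print(f"after standardization:  {after:.3f}")
```

Only the first record is expanded to "1 main street", so the two strings share a smaller proportion of their characters afterwards. With a threshold sitting between the two scores, the pair would flip from match to false negative.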
First of all, these ideas are my own views. In my opinion, artificial intelligence is impossible and a peril for humanity. In the big picture it is all about “decision trees”.
Humans cannot create everything and humans cannot control everything. But humans are always imitating, acting like GOD, imitating NATURE. But it stays fake, it stays absurd.
Artificial intelligence is a dream that will cost millions of dollars. Humanity must think about hunger, poverty and climate change, or the right information, etc.
I can’t believe this news and I’m laughing at it. This is spending money and time for nothing.
Some of them
P.S.: I hope you remember me. I couldn’t visit your blog for a long time, but I never forgot it.
Thanks for commenting, Aysegül, and for being back as a reader of this blog. I share your concerns about hunger, poverty and climate change. I also follow your sentiment about what AI is and that many things we call AI are nothing but a bunch of decision trees no one can fully understand. However, I think we will get there one algorithm at a time.
Working with fuzzy logic has been a starting point for me. Instead of assuming that there is a yes or no answer to everything, as in decision trees, computers must work on the basis that there is a probability of a right answer to every question. Pretty much as in real life.
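In data matching terms, this can be sketched as turning a score into three outcomes instead of two, with a middle band left for human inspection. The thresholds and the function name below are hypothetical, and a similarity score is only a rough proxy for a true probability.

```python
def classify(score: float, lower: float = 0.7, upper: float = 0.9) -> str:
    """Map a score between 0.0 and 1.0 to a decision.

    The thresholds are illustrative defaults, not recommended values.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= upper:
        return "match"
    if score >= lower:
        return "review"  # dubious result, left for a human to inspect
    return "no match"


labels = [classify(score) for score in (0.95, 0.80, 0.40)]
print(labels)
```

The decisions made in the review band are also a natural source of training data for a machine learning step.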
Data matching through the use of ML is exactly a use case I’m investigating. The main reason for this is that when writing matching algorithms by hand, production data is ideally needed. However, this raises several concerns regarding GDPR, data privacy etc. So I wonder if using ML negates the risks around users having access to production data whilst doing development.
Thoughts? Use Cases?
It is a good question, Mick.
On the other hand, I have experienced that ML solutions for data matching are rarely shared between different organizations because they include real-world data. Traditional match algorithms can be shared.