8 thoughts on “Data Matching, Machine Learning and Artificial Intelligence”

  1. Steve Tootill (@stevetootill) 28th November 2018 / 16:52

    I definitely see a future for machine learning in data matching. There are already some non-specialist implementations for generic data matching using machine learning, with solutions such as Reltio and H2O.ai. IMO, there is no single best way of matching, especially for contact data, which is our focus; the best solution is to combine different approaches. The typical problem with a machine learning approach is that it takes time and a lot of user feedback to train the software on the organization’s data, and therefore it is unlikely to deliver acceptable results during the evaluation process. I would expect an approach based on intelligent algorithms and standardization tables (built up as a result of experience on a lot of different customer datasets) to outperform a pure machine learning approach in the early days of evaluation/implementation. Following up the algorithm/table-based approach with machine learning to refine the results further seems to promise better results than either approach in isolation.

    • Henrik Liliendahl 28th November 2018 / 17:46

      Thanks a lot for commenting, Steve. I had the pleasure of looking into the Reltio roadmap for data matching earlier this year. It is promising, but still in the making when it comes to machine learning. I agree about the need for intensive training on the machine learning part. The solution I was involved with at Dun & Bradstreet benefited from a setup in which D&B received various client party master data on a daily basis to be matched against the D&B Worldbase, often with inspection of the dubious results. That provided a lot of training data, with improvement in the matching process as a result. This solution was, exactly as you suggest, a mix of different approaches.

  2. Gani Hamiti 28th November 2018 / 17:02

    Hi Henrik,

    First of all, thank you for running this blog, which I’ve been reading for a while now and which is quite unique in its genre.

    Reading this post made me want to write a few words.
    In my experience, the undermatching issue, aka leaving too many false negatives in the results, doesn’t seem to plague deterministic matching specifically. In the past, I’ve seen cases of clear overmatching with this approach, given good standardization beforehand and a rich enough set of comparison algorithms; this could typically be fixed by decreasing the tolerance in the parameters and thresholds of those algorithms.
    Is it implicitly assumed in this post that deterministic matching results in undermatching if one of the previously mentioned factors (good prior standardization and/or rich set of comparison algorithms) is lacking? In this case, I’d argue that an ML approach would still be sensitive to a good standardization or lack thereof: while the decision process (match vs not match) does indeed differ, the intrinsic nature of the matching task remains the clustering of records based on their similarity, which is deeply affected by the standardization process.

    I’m curious to read your thoughts about this.

    Gani

    • Henrik Liliendahl 28th November 2018 / 17:58

      Hi Gani. Thanks for the kind words and for chiming in. I remember when I was involved in putting a fuzzy-logic-based data matching solution on the Nordic market, we had trials where we competed against more deterministic solutions from the established data quality vendors. We did a much better job. I know standardization, typically around address data, is a way to improve results. But this is not straightforward. Sometimes you can get a false negative because the similarity between two records gets lower after standardization.
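The point about standardization lowering similarity can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not taken from any of the solutions discussed here: the tiny `ABBREVIATIONS` table, the sample addresses, the 0.9 threshold, and the use of Python's standard-library `difflib.SequenceMatcher` as the similarity measure. One record carries the variant "Str", which the table does not know, so only the other record is expanded and the two strings drift apart.

```python
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny standardization table (assumption).
ABBREVIATIONS = {"st": "street", "rd": "road"}

# Hypothetical match threshold (assumption).
THRESHOLD = 0.9


def standardize(address: str) -> str:
    """Lower-case, drop periods, and expand abbreviations found in the table."""
    tokens = address.lower().replace(".", "").split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def similarity(a: str, b: str) -> float:
    """Character-based similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


record_a = "5 Main Str"  # "str" is not in the table, so it is left as-is
record_b = "5 Main St"   # "st" is expanded to "street"

raw_score = similarity(record_a, record_b)
std_score = similarity(standardize(record_a), standardize(record_b))

print(f"raw: {raw_score:.3f}  matched: {raw_score >= THRESHOLD}")
print(f"standardized: {std_score:.3f}  matched: {std_score >= THRESHOLD}")
```

With these assumed inputs the raw pair scores above the threshold and the standardized pair scores below it, which is the false negative described above. A richer table would fix this particular pair, but the same effect reappears for any variant the table does not cover.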

  3. Ayşegül Yüksel Pİ 314 6th December 2018 / 23:03

    Hi Henrik!
    First of all, these ideas are my own views. In my opinion, Artificial Intelligence is impossible and a peril for humanity. In the big picture, it is all about “decision trees”.
    Humans cannot create everything and humans cannot control everything. But humans are always imitating: acting like GOD, acting like NATURE. But it stays fake, it stays absurd.
    Artificial Intelligence is a dream that will cost millions of dollars. Humanity must think about hunger, poverty, climate change, the right information, etc.
    I can’t believe this news and I’m laughing at it. This is spending money and time for nothing.
    Some examples:
    https://www.washingtonpost.com/business/2018/12/05/dozens-amazon-workers-sickened-after-bear-repellent-accidentally-discharged-warehouse/?noredirect=on&utm_term=.99319ca468bd
    About Sophia
    http://www.africanews.com/2018/06/30/sophia-the-robot-misses-dinner-with-ethiopia-pm-after-losing-some-parts-at/
    P.S.: I hope you remember me. I couldn’t visit your blog for a long time, but I never forgot it.

    • Henrik Liliendahl 7th December 2018 / 08:28

      Thanks for commenting, Aysegül, and for being back as a reader of this blog. I share your concerns about hunger, poverty and climate change. I also follow your sentiment about what AI is, and that much of what we call AI is nothing but a bunch of decision trees no one can fully understand. However, I think we will get there one algorithm at a time.

      Working with fuzzy logic has been a starting point for me. Instead of assuming that there is a yes or no answer to everything, as in decision trees, computers must work on the basis that every question has a probability of a right answer. Pretty much as in real life.
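As a rough sketch of the difference between a yes/no rule and a graded answer, the snippet below compares two party records both ways. The field names, the `WEIGHTS`, the sample records and the two cut-off values are all hypothetical, and the weighted score is only a simple stand-in for a real match probability, using `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; they sum to 1 so the score stays in [0, 1].
WEIGHTS = {"name": 0.6, "city": 0.4}


def exact_match(a: dict, b: dict) -> bool:
    """Yes/no rule: every weighted field must be identical (ignoring case)."""
    return all(a[field].lower() == b[field].lower() for field in WEIGHTS)


def fuzzy_score(a: dict, b: dict) -> float:
    """Weighted average of per-field string similarities, between 0 and 1."""
    return sum(
        weight * SequenceMatcher(None, a[field].lower(), b[field].lower()).ratio()
        for field, weight in WEIGHTS.items()
    )


def classify(score: float, match_at: float = 0.9, review_at: float = 0.7) -> str:
    """Three outcomes instead of two: match, manual review, or no match."""
    if score >= match_at:
        return "match"
    if score >= review_at:
        return "review"
    return "no match"


record_a = {"name": "Jon Smith", "city": "Copenhagen"}
record_b = {"name": "John Smith", "city": "Copenhagen"}

score = fuzzy_score(record_a, record_b)
print("exact rule:", exact_match(record_a, record_b))
print(f"fuzzy score: {score:.3f} -> {classify(score)}")
```

The exact rule rejects the pair because of one missing letter, while the graded score keeps it as a likely match. The middle "review" band is where dubious results could be sent for human inspection.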

  4. Mick Rothwell 30th July 2021 / 14:45

    Data matching through the use of ML is exactly a use case I’m investigating. The main reason for this is that when writing matching algorithms by hand, production data is ideally needed. However, this raises several concerns regarding GDPR, data privacy, etc. So I wonder if using ML reduces the risks around users having access to production data whilst doing development.

    Thoughts? Use Cases?

    • Henrik Gabs Liliendahl 30th July 2021 / 14:54

      It is a good question, Mick.

      On the other hand, I have experienced that ML solutions for data matching are rarely shared between different organizations because they include real-world data. Traditional match algorithms can be shared.
