Testing a Data Matching Tool

Many technical magazines test a range of similar products, like comparing a range of CPUs or a selection of word processors in the IT world. The tests compare measurable things such as speed, the ability to actually perform a certain task and something as important as the price.

With enterprise software such as data quality tools we only have analyst reports, which evaluate the tools on far less measurable factors and often produce a result roughly equivalent to stating the vendor’s market strength. The analysts haven’t compared the actual speed; they have not tested the ability to do a certain task, nor taken the price into consideration.

A core feature in most data quality tools is data matching. This is the discipline where data quality tools are able to do something considerably better than more common technology such as database managers and spreadsheets, as told in the post about deduplicating with a spreadsheet.

In the LinkedIn data matching group we have on several occasions touched the subject of doing a once and for all benchmark of all data quality tools in the world.

My guess is that this is not going to happen. So, if you want to evaluate data quality tools where data matching is the prominent issue and you don’t just want a beauty contest, then you have to do as the queen in the fairy tale about The Princess and the Pea: make a test.

Some important differentiators in data matching effectiveness may narrow down the scope for your particular requirements, such as:

  • Are you doing B2C (private names and addresses), B2B (business names and addresses) or both?
  • Do you only have domestic data or do you have international data with diversity issues?
  • Will you only go for one entity type (like customer or product) or are you going for multi-entity matching?

Making a proper test is not trivial.

Often you start by looking at the positive matches provided by the tool, counting the true positives against the false positives. Depending on your purpose, you will want to see a very low number of false positives compared to true positives.

Harder, but at least as important, is looking at the negatives (the not-matched ones), as explained in the post 3 out of 10.
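This kind of counting can be sketched as a small evaluation routine, assuming you have a hand-labelled gold standard of true duplicate pairs to compare the tool’s output against (the function and variable names below are illustrative, not from any particular tool):

```python
def evaluate_matches(tool_pairs, gold_pairs):
    """Compare the pairs a matching tool produced against a
    hand-labelled gold standard of true duplicate pairs."""
    tool_pairs = set(tool_pairs)
    gold_pairs = set(gold_pairs)
    true_pos = tool_pairs & gold_pairs    # correctly matched
    false_pos = tool_pairs - gold_pairs   # wrongly matched
    false_neg = gold_pairs - tool_pairs   # missed duplicates (the negatives side)
    precision = len(true_pos) / len(tool_pairs) if tool_pairs else 0.0
    recall = len(true_pos) / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall

# Example: the tool found 3 pairs, 2 of them are real duplicates,
# and one real duplicate was missed entirely.
tool = [(1, 2), (3, 4), (5, 6)]
gold = [(1, 2), (3, 4), (7, 8)]
print(evaluate_matches(tool, gold))  # both precision and recall are 2/3 here
```

The precision figure reflects the false-positive concern above; the recall figure is exactly the harder question about the negatives.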

The next two features are essential:

  • To what degree are you able to tune the match rules, preferably in a user-friendly way not requiring too much IT expert involvement?
  • Are you able to evaluate dubious matches in a speedy and user-friendly way, as shown in the post called When computer says maybe?

A data matching effort often has two phases:

  • An initial match against all currently stored data, maybe supported by matching against external reference data. Here speed may be important too. Often you have to balance high speed against poor results. Try it.
  • Ongoing matching assisting data entry and keeping up with data coming from outside your jurisdiction. Here data quality tools acting as service oriented architecture components are a great plus, including reusing the rules from the initial match. This has to be tested too.

And oh yes, from my experience with plenty of data quality tool evaluation processes: price is an issue too. Make sure to count both the license costs for all the needed features and the consultancy needed, as experienced in your tests.


Algorithm Envy

The term “algorithm envy” was used by Aaron Zornes in his piece on MDM trends when talking about identity resolution.

In my experience there is surely a need for good data matching algorithms.

As I have built a data matching tool myself, I faced that need back in 2005. At that time my tool was merely based on some standardization and parsing, match codes, some probabilistic learning and a few lightweight algorithms like the Hamming distance (more descriptions of these techniques here).
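As a minimal illustration of such a lightweight algorithm, the Hamming distance simply counts the positions at which two equal-length strings differ:

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

print(hamming_distance("JENSEN", "JANSEN"))  # → 1
```

Its equal-length restriction is one reason heavier-weight similarity algorithms are needed for real-world names and addresses, where insertions and deletions are common.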

My tool was pretty national (like many other matching tools) as it was tuned for handling Danish names and addresses as well as Swedish, Norwegian, Finnish and German addresses, which are very similar.

The task ahead was to expand the match tool so it could be used to match business-to-business records against the D&B WorldBase. This database has business entities from all over the world. The names and addresses in there are only standardized to the extent provided by the public sector or other providers in each country.

The records to be matched came from Nordic companies operating globally. For such records you can’t assume that they are entered by people who know the name and address format for the country in question. So, all in all, standardization and parsing wasn’t the full solution. If you don’t trust me, there is more explanation here.

When dealing with international data, match codes become either too complex or too poor. This is also due to the lack of standardization in both of the records being compared.
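A match code is typically a compressed key built from selected parts of a name and address, so that candidate duplicates collide on the same key. The naive sketch below (illustrative only, not from any specific tool) shows why such keys are brittle internationally: any difference in spelling or standardization changes the key, and the records never collide:

```python
import re

def match_code(name: str, city: str) -> str:
    """Naive match code: first 4 consonants of the normalized name
    plus the first 3 consonants of the normalized city."""
    def normalize(s: str) -> str:
        s = re.sub(r"[^A-Za-z]", "", s).upper()   # drop accents, spaces, punctuation
        return re.sub(r"[AEIOUY]", "", s)         # drop vowels
    return normalize(name)[:4] + "-" + normalize(city)[:3]

# Same company, two country-specific spellings -> different keys,
# so a pure match-code approach never brings the records together:
print(match_code("Møller & Co.", "København"))
print(match_code("Moeller and Co", "Copenhagen"))
```

Note also that the crude accent stripping above silently mangles characters like ø, which is exactly the kind of problem that grows with international data.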

For the probabilistic learning my problem was that all the learned data gathered until then came only from Nordic data. It wouldn’t be any good for the rest of the world.

The solution was including an advanced data matching algorithm, in this case Omikron FACT.

Since then the Omikron FACT algorithm has been considerably improved and is now branded as WorldMatch®. Some of the new advantages are dealing with different character sets and script systems and having synonyms embedded directly in the matching logic, which is far superior to using synonyms in a prior standardization process.
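The difference can be illustrated with a toy token comparison that consults a synonym table during matching rather than rewriting the input beforehand. This is only a sketch of the general idea with made-up names and a made-up table; the actual WorldMatch® logic is of course proprietary:

```python
# Toy synonym table; real tools ship large curated, per-language tables.
SYNONYMS = {frozenset({"ltd", "limited"}), frozenset({"st", "street"})}

def tokens_match(a: str, b: str) -> bool:
    """Two tokens match if they are equal or are known synonyms."""
    a, b = a.lower(), b.lower()
    return a == b or frozenset({a, b}) in SYNONYMS

def name_similarity(name_a: str, name_b: str) -> float:
    """Fraction of tokens in the shorter name that find a match
    (exact or synonym) among the tokens of the other name."""
    ta, tb = name_a.split(), name_b.split()
    short, long_ = (ta, tb) if len(ta) <= len(tb) else (tb, ta)
    hits = sum(any(tokens_match(t, u) for u in long_) for t in short)
    return hits / len(short) if short else 0.0

print(name_similarity("Acme Ltd", "Acme Limited"))  # → 1.0
```

Because the synonym lookup happens inside the comparison, the original tokens are preserved and can still contribute to the score, instead of being irreversibly rewritten in a prior standardization step.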

For full disclosure I work for the vendor Omikron Data Quality today. But I am not praising the product because of that – I work for Omikron because of the product.


Grandpa’s Story

Now that I have become a grandfather, it’s time for a blog post about lessons learned in life.

One of my favourite authors as a young man was Cyril Northcote Parkinson, the father of the famous Parkinson’s Law, which says:

Work expands so as to fill the time available for its completion.

Early in my career I learned how true this is. My first experience was, like the statistics behind Parkinson’s Law, from within public administration, but later I learned that private enterprises are just the same.

My first real job after graduation was at the Danish Tax Authorities. After having worked there a few years I was assigned to a mission to assist the Faroe Islands Financial Authorities in developing a modernised tax collection solution.

The Faroe Islands

For those readers who hate old people not sticking to the subject – please continue to the next headline.

For those readers who don’t have a clue about where on earth the Faroe Islands are: Well. 1000 years ago the Vikings sailed out from Scandinavia and finally made it to say hello to the Native Americans – 500 years before Columbus. When doing that they used islands in the Northern Atlantic as stepping stones: first the British Isles, then the Faroe Islands, Iceland, Greenland and finally Newfoundland on the American coast.

Just like Columbus found America by mistake, as he was actually looking for India, the Vikings probably also found America and the stepping stones by mistake when getting lost on the ocean during storms.


Back on track. The mission for the Faroe Islands authorities that I joined in the early 1980s seemed impossible. As the Faroese population is only 1/100 of the population of continental Denmark, there were of course only 1/100 of the resources available for making a solution doing exactly the same as the solution built for continental Denmark.

But what I learned was that the solution actually was built using only those resources and in a surprisingly short time (and with minimal help from me and my colleagues).

During my career I have worked in both modest-sized organisations and large organisations, and I have noticed numerous examples of how exactly the same task may consume resources sized not by the nature of the task but by the size of the organisation.

People and technology

Maybe this observation is an explanation of the ever recurring question of whether people or technology is most important when doing projects like improving data quality. If the technology part is (close to) constant but the overall resource consumption grows with the size of the organisation in question, well, then the people part becomes more and more important with the size of the organisation.

Tool making

I have tried single-handedly to build a data quality tool – or to be more specific, a data matching tool. On several occasions it has been benchmarked against products residing as leaders in the Gartner Magic Quadrant for data quality tools, and it didn’t come out short. Some of the features included in the product, called SuperMatch, are described in the post “When computer says maybe”.

It’s my impression that if you look at tool vendors with many employees, it’s only a very small group of people who are actually working on the tool.

When computer says maybe

When matching customer master data in order to find duplicates or to find corresponding real world entities in a business directory or a consumer directory, you may use a data quality deduplication tool to do the hard work.

The tool will typically – depending on the capabilities of the tool and the nature of and purpose for the data – find:

A: The positive automated matches. Ideally you will take samples for manual inspection.

B: The dubious part, selected for manual inspection.

C: The negative automated matches.
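In score-based tools this three-way split is typically driven by two thresholds on the match score. A minimal sketch, with purely illustrative threshold values and names:

```python
def triage(scored_pairs, upper=0.90, lower=0.60):
    """Split scored candidate pairs into three pots:
    A = automatic match, B = manual inspection, C = automatic non-match."""
    pots = {"A": [], "B": [], "C": []}
    for pair, score in scored_pairs:
        if score >= upper:
            pots["A"].append(pair)
        elif score >= lower:
            pots["B"].append(pair)
        else:
            pots["C"].append(pair)
    return pots

scored = [(("rec1", "rec2"), 0.97),
          (("rec3", "rec4"), 0.72),
          (("rec5", "rec6"), 0.31)]
print(triage(scored))
```

Where you place the two thresholds decides how much lands in the B pot, and thereby how much costly human time the exercise will consume.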

Humans are costly resources. Therefore the manual inspection of the B pot (and the A sample) may be supported by a user interface that helps get the job done quickly but accurately.

I have worked with the following features for such functionality:

  • Random sampling for quality assurance – both from the A pot and from the manually settled part of the B pot
  • Check-out and check-in for multiuser environments
  • Presenting a ranked range of computer selected candidates
  • Color coding elements in matched candidates – like:
    • green for (near) exact name,
    • blue for a close name and
    • red for a far from similar name
  • Possibility for marking:
    • as a manual positive match,
    • as a manual negative match (with reason) or
    • as questionable for later or supervisor inspection (with comments)
  • Entering a match found by other methods
  • Removing one or several members from a duplicate group
  • Splitting a duplicate group into two groups
  • Selecting survivorship
  • Applying hierarchy linkage
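The color coding mentioned above can be driven by an element-level similarity score. A minimal mapping might look like the following (the threshold values are illustrative, not from SuperMatch):

```python
def color_for(similarity: float) -> str:
    """Map an element-level similarity (0..1) to a display color."""
    if similarity >= 0.95:
        return "green"   # (near) exact
    if similarity >= 0.75:
        return "blue"    # close
    return "red"         # far from similar

print([color_for(s) for s in (1.0, 0.8, 0.4)])  # → ['green', 'blue', 'red']
```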

Anyone else out there who has worked with making or using a man-machine dialogue for this?

The GlobalMatchBox

10 years ago I spent most of the summer delivering my first large project after starting my sole proprietorship. The client – or actually rather the partner – was Dun & Bradstreet’s Nordic operation, who needed an agile solution for matching customer files against their Nordic business reference data sets. The application was named MatchBox.

This solution has grown over the years, and D&B’s operation in the Nordics and other parts of Europe is now run by Bisnode.

Today matching is done against the entire WorldBase, holding close to 150 million business entities from all over the world – with all the diversity you can imagine. On the technology side the application has been bundled with the indexing capabilities of www.softbool.com and the similarity cleverness of www.omikron.net (disclosure: today I work for Omikron), all built with the RAD tool www.magicsoftware.com. The application is now called GlobalMatchBox.

It has been a great but fearful pleasure for me to work with setting up and tuning such a data matching engine and environment. Everybody who has worked with data matching knows about the scars you get when avoiding false positives and false negatives. You know that it is just not good enough to say that you are only able to automatically match 40% of the records when it is supposed to be 100%.

So this project has been a very different experience compared to the occasional SMB (Small and Medium size Business) hit-and-run data quality improvement projects I also do, as described in my previous post. With D&B we are not talking about months but years of tuning, and I have been guilty of practicing excessive consultancy.
