Testing a Data Matching Tool

Many technical magazines run tests of a range of similar products, for example, in the IT world, comparing a range of CPUs or a selection of word processors. The tests compare measurable things such as speed, the ability to actually perform a certain task and, importantly, the price.

With enterprise software such as data quality tools we only have analyst reports evaluating the tools on far less measurable factors, often giving a result that is little more than a statement of market strength. The analysts haven’t compared the actual speed; they have neither tested the ability to do a certain task nor taken the price into consideration.

A core feature in most data quality tools is data matching. This is the discipline where data quality tools are able to do something considerably better than more common technology such as database managers and spreadsheets, as told in the post about deduplicating with a spreadsheet.

In the LinkedIn data matching group we have on several occasions touched on the subject of doing a once-and-for-all benchmark of all the data quality tools in the world.

My guess is that this is not going to happen. So, if you want to evaluate data quality tools and data matching is the prominent issue and you don’t just want a beauty contest, then you have to do as the queen in the fairy tale about The Princess and the Pea: Make a test.

Some important differentiators in data matching effectiveness may narrow down the scope for your particular requirements, such as:

  • Are you doing B2C (private names and addresses), B2B (business names and addresses) or both?
  • Do you only have domestic data or do you have international data with diversity issues?
  • Will you only go for one entity type (like customer or product) or are you going for multi-entity matching?

Making a proper test is not trivial.

Often you start by looking at the positive matches provided by the tool, counting the true positives against the false positives. Depending on the purpose, you want to see a very low number of false positives relative to true positives.

Harder, but at least as important, is looking at the negatives (the records that were not matched), as explained in the post 3 out of 10.
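Counting positives and negatives in a test can be sketched in a few lines of Python. This is only a minimal illustration, assuming you have a hand-verified list of record pairs that really are duplicates; the function name `evaluate_matches` and the record numbers are made up for the example and do not come from any particular tool.

```python
def evaluate_matches(predicted_pairs, true_pairs):
    """Compare the pairs a tool reported against hand-verified pairs."""
    # Normalise each pair so (a, b) and (b, a) count as the same match.
    predicted = {frozenset(pair) for pair in predicted_pairs}
    truth = {frozenset(pair) for pair in true_pairs}

    true_positives = len(predicted & truth)   # reported and correct
    false_positives = len(predicted - truth)  # reported but wrong
    false_negatives = len(truth - predicted)  # correct but not reported

    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        # Share of reported matches that are correct.
        "precision": true_positives / len(predicted) if predicted else 0.0,
        # Share of real matches that the tool found.
        "recall": true_positives / len(truth) if truth else 0.0,
    }


# Hypothetical record ids: the tool reported three pairs, two of them correct,
# and it missed the real duplicate pair (7, 8).
result = evaluate_matches(
    predicted_pairs=[(1, 2), (3, 4), (5, 6)],
    true_pairs=[(2, 1), (3, 4), (7, 8)],
)
print(result)
```

The false negatives are the hard part in practice, because you only get them if someone has verified pairs the tool never suggested.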

Next, two features are essential:

  • To what degree are you able to tune the match rules, preferably in a user-friendly way that does not require too much IT expert involvement?
  • Are you able to evaluate dubious matches in a speedy and user-friendly way, as shown in the post called When computer says maybe?

A data matching effort often has two phases:

  • An initial match of all currently stored data, maybe supported by matching with external reference data. Here speed may be important too. Often you have to trade speed against the quality of the results. Try it.
  • Ongoing matching that assists in data entry and keeps up with data coming from outside your jurisdiction. Here, data quality tools acting as service-oriented architecture components are a great plus, including the ability to reuse the rules from the initial match. This has to be tested too.

And oh yes, from my experience with plenty of data quality tool evaluation processes: Price is an issue too. Make sure to count both the license costs for all the needed features and the consultancy your tests have shown you will need.

To be called Hamlet or Olaf – that is the question

Right now my family and I are relocating from a house in a southern suburb of Copenhagen to a flat much closer to downtown. As there is a month in between where we don’t have a place of our own, we have rented a cottage (summerhouse) north of Copenhagen not far from Kronborg Castle, which is the setting of Shakespeare’s famous play Hamlet.

Therefore a data quality blog post inspired by Hamlet seems timely.

Though the feigned madness of Hamlet may be a good subject related to data quality, I will instead take a closer data matching look at the name Hamlet.

Shakespeare’s Hamlet is inspired by an old Norse legend, but to me the name Hamlet doesn’t sound very Norse.

Nor does the similar-sounding name Amleth found in the immediate source, Saxo Grammaticus.

If Saxo’s source was a written source, it may have come from Irish monks writing in the Gaelic alphabet as Amhlaoibh, where Amhl=owl, aoi=ay and bh=v, sounding just like the good old Norse name Olav or Olaf.

So, there is a possible track from Hamlet to Olaf.

Also today a fellow data quality blogger, Graham Rhind, published a post called Robert the Carrot about the same issue. As Graham explains, we often see how data is changed through interfaces and, after passing through many interfaces, in the end doesn’t look at all like it did when first entered. There may be a good explanation for each transformation, but the end-to-end similarity is hard to guess when only comparing the first and the last version.

I have often met that challenge in data matching. An example would be the following names registered at the same address:

  • Pegy Smith
  • Peggy Smith
  • Margaret Smith

A synonym-based similarity (or standardization) will find that Margaret and Peggy are duplicates.

An edit distance similarity will find that Peggy and Pegy are duplicates.

A combined similarity algorithm will find that all three names belong to a single duplicate group.
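A minimal sketch of such a combination in Python follows. It assumes the surname and address already agree and only compares first names; the tiny `SYNONYMS` table, the threshold of one edit and the function names are all illustrative assumptions, not how any specific tool works.

```python
from itertools import combinations

# Hypothetical nickname table; a real tool would ship a far larger one.
SYNONYMS = {"peggy": "margaret", "maggie": "margaret"}


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_duplicate(a, b, max_edits=1):
    """Two names match if they share a synonym or are within max_edits."""
    a, b = a.lower(), b.lower()
    if SYNONYMS.get(a, a) == SYNONYMS.get(b, b):
        return True
    return edit_distance(a, b) <= max_edits


def duplicate_groups(names):
    """Merge pairwise matches into groups with a small union-find."""
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if is_duplicate(a, b):
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


groups = duplicate_groups(["Pegy", "Peggy", "Margaret"])
print(groups)
```

Pegy and Margaret never match each other directly; they end up in the same group only because each of them matches Peggy.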

The Sound of Soundex

Probably the oldest and most used error-tolerant algorithm in searching and data matching is a phonetic algorithm called Soundex. If you are not familiar with Soundex: Wikipedia to the rescue here.

In the LinkedIn group Data Matching we seem to have an ongoing discussion about the usefulness of Soundex. Link to the discussion here – if you are not already a member: Please join, spammers are dealt with, though it is OK to brag about your data matching superiority.

To sum up the discussion on Soundex I think we at this stage may conclude:

  • Soundex is of course very poor compared to the more advanced algorithms, but it may be better than nothing (which will be exact searching and matching)
  • Soundex (or a variant of Soundex) may be used for indexing in order to select candidates to be scored with better algorithms.
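To make the indexing idea concrete, here is a small Python sketch. It implements the common American Soundex rules as I understand them and uses the code as an index key to select candidates; the `build_index` helper and the sample surnames are assumptions made up for the illustration, and the scoring with a better algorithm is left out.

```python
from collections import defaultdict

# Letter-to-digit table of American Soundex; vowels, H, W and Y get no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the four-character Soundex code of a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = CODES.get(c, "")
        if digit and digit != previous:
            code += digit
        if c not in "HW":  # H and W do not separate equal codes
            previous = digit
    return (code + "000")[:4]


def build_index(names):
    """Group names by Soundex code so lookups only touch one bucket."""
    index = defaultdict(list)
    for name in names:
        index[soundex(name)].append(name)
    return index


table = ["Smith", "Smyth", "Schmidt", "Jones", "Johnson"]
index = build_index(table)

# Only names sharing the code are passed on as candidates for scoring.
candidates = index.get(soundex("Smithe"), [])
print(candidates)
```

The point of the index is that a better (and slower) algorithm then only has to score the handful of candidates in one bucket instead of the whole table.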

Let’s say you are going to match 100 rows with names and addresses against a table with 100 million rows with names and addresses, and let’s say that the real-world individual behind each of the 100 rows is in fact represented among the 100 million, but not necessarily spelled the same.

Your results may look like this:

  • If you use exact automated matching you may find 40 matching rows (40 %).
  • If you use automated matching with (a variant of) Soundex you may find 95 matching rows, but only 70 rows (70 %) are correct matches (true positives) as 25 rows (25 %) are incorrect matches (false positives).
  • If you use automated matching with (a variant of) Soundex indexing and an advanced algorithm for scoring you may find 75 matching rows where 70 rows (70 %) are correct matches (true positives) and 5 rows (5 %) are incorrect matches (false positives).
  • By tuning the advanced algorithm you may find 67 matching rows where 65 rows (65 %) are correct matches (true positives) and 2 rows (2 %) are incorrect matches (false positives).

So when using Soundex you will find more matching rows but you will also find more manual work in verifying the results. Adding an advanced algorithm may reduce the manual work or eliminate manual work at the cost of some not found matches (false negatives) and the risk of a few wrong matches (false positives).
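The trade-off can be worked through with the figures from the list above. The Python sketch below is only arithmetic on those example numbers; it assumes that exact matching produces no false positives, which the example implies but does not state, and the labels and field names are my own.

```python
TOTAL_ROWS = 100  # every input row has a real counterpart in the big table

# (true positives, false positives) per approach, from the example figures.
# Assumption: exact matching produces no false positives.
SCENARIOS = {
    "exact": (40, 0),
    "soundex": (70, 25),
    "soundex index + advanced scoring": (70, 5),
    "tuned advanced scoring": (65, 2),
}


def summarise(true_positives, false_positives, total=TOTAL_ROWS):
    found = true_positives + false_positives
    return {
        "found": found,
        # Share of the found matches that are correct.
        "precision": true_positives / found,
        # Share of the input rows that got their correct match.
        "recall": true_positives / total,
        # Wrong matches someone has to catch in manual verification.
        "to_reject": false_positives,
    }


summary = {name: summarise(tp, fp) for name, (tp, fp) in SCENARIOS.items()}

for name, stats in summary.items():
    print(f"{name}: found {stats['found']}, "
          f"precision {stats['precision']:.0%}, recall {stats['recall']:.0%}")
```

Read this way, plain Soundex and Soundex indexing with advanced scoring find the same number of correct matches, but the latter leaves far fewer wrong matches to weed out by hand.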

PS: I have a page about other Match Techniques including standardization, synonyms and probabilistic learning.

PPS: When googling to see whether the title of this blog post has been used before, I found this article by a fellow countryman.

Complicated Matters

A while ago I wrote a short blog post about a tweet from the Gartner analyst Ted Friedman saying that clients are disappointed with the ability to support wide deployment of complex business rules in popular data quality tools.

Speaking of popular data quality tools: on the DataFlux Community of Experts blog, Dylan Jones, founder of DataQualityPro, posted a piece this Friday asking: Are Your Data Quality Rules Complex Enough?

Dylan says: “Many people I speak to still rely primarily on basic data profiling as the backbone of their data quality efforts”.

The classic answers to the challenge of complex business rules are:

  • Relying on people to enforce complex business rules. Unfortunately people are not as consistent in enforcing complex rules as computer programs are.
  • Making less complex business rules. Unfortunately the complexity may be your competitive advantage.

In my eyes there is no doubt that data quality tool vendors have a great opportunity in the research and development of tools that are better at deploying complex business rules. In my current involvement in doing so we work with features such as:

  • Deployment as Service Oriented Architecture components. More on this topic here.
  • Integrating multiple external sources. Further explained here.
  • Combining the best algorithms. Example here.

3 out of 10

Just before I left for summer vacation I noticed a tweet by MDM guru Aaron Zornes saying:

This is a subject very close to me as I have worked a lot with business directory matching during the last 15 years, not least matching with the D&B WorldBase.

The problem is that if you match your B2B customers, suppliers and other business partners with a business directory like the D&B WorldBase you could naively expect a 100% match.

If your result is only a 30% hit rate the question is: How many among the remaining 70% are false negatives and how many are true negatives?

True negatives

There may be a lot of reasons for true negatives, namely:

  • Your business entity isn’t listed in the business directory. Some countries, like those of the old Czechoslovakia, some English-speaking countries in the Pacific, the Nordic countries and others, have a tight public registration of companies, while registration is less tight in North America, other European countries and the rest of the world.
  • Your supposed business entity isn’t a business entity. Many B2B customer/prospect tables hold a lot of entities that are not formal business entities but other types of party master data.
  • Uniqueness may be defined differently in the business directory and in your table to be matched. This includes the perception of hierarchies of legal entities and branches, where not least governmental and local authority bodies are a fuzzy crowd. The different roles of small business owners are also a challenge. The same is true of roles such as franchise takers and the use of trading styles.

False negatives

In business directory matching the false negatives are those records that should have been matched by an automated function, but aren’t.

The number of false negatives is a measure of the effectiveness of the automated matching tool(s) and rules applied. Big companies often use the magic quadrant leaders in data quality tools, but these aren’t necessarily the best tools for business directory matching.

Personally I have found that you need a very complex mix of tools and rules for getting a decent match rate in business directory matching, including combining both deterministic and probabilistic matching. Some different techniques are explained in more detail here.

Why do you watch it?

Statler and Waldorf are a pair of Muppet characters. They are two ornery, disagreeable old men. Despite constantly complaining about the show and how terrible some acts were, they would always be back the following week in the best seats in the house. At the end of one episode, they looked at the camera and asked: “Why do you watch it?”.

This is a bit like blogging about data quality, isn’t it? Always describing how bad data is everywhere. Bashing executives who don’t get it. Telling about all the hard obstacles ahead. Explaining you don’t have to boil the ocean but might get success by settling for warming up a nice little drop of water.

Despite really wanting to tell a lot of success stories, being the funny Fuzzy Bear on the stage, well, I am afraid I have also been spending most of my time on the balcony with Statler and Waldorf.

So, from this day forward: More success stories.

This is the start of a series of 1.3 blog posts…. No, just kidding.

Algorithm Envy

The term “algorithm envy” was used by Aaron Zornes in his piece on MDM trends when talking about identity resolution.

In my experience there is surely a need for good data matching algorithms.

As I have built a data matching tool myself, I faced that need back in 2005. At that time my tool was merely based on some standardization and parsing, match codes, some probabilistic learning and a few lightweight algorithms like the Hamming distance (more descriptions of these techniques here).
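For readers who haven’t met it, the Hamming distance simply counts the positions where two strings of equal length differ. The Python snippet below shows the textbook definition only; it is not the code from the tool mentioned above, and the surnames are just sample input.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(char_a != char_b for char_a, char_b in zip(a, b))


# One substituted letter gives a distance of 1.
distance = hamming_distance("JENSEN", "JANSEN")
print(distance)
```

It is cheap to compute, but because it cannot handle inserted or missing letters it only suits fixed-length values or strings that are already well aligned.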

My tool was pretty national (like many other matching tools) as it was tuned for handling Danish names and addresses as well as Swedish, Norwegian, Finnish and German addresses, which are very similar.

The task ahead was to expand the match tool so it could be used to match business-to-business records with the D&B WorldBase. This database has business entities from all over the world. The names and addresses in there are only standardized to the extent that is provided by the public sector or other providers for each country.

The records to be matched came from Nordic companies operating globally. For such records you can’t assume that these are entered by people who know the name and address format for the country in question. So, all in all, standardization and parsing weren’t the full solution. If you don’t trust me, there is more explanation here.

When dealing with international data, match codes become either too complex or too poor. This is also due to the lack of standardization in both sets of records to be compared.

For the probabilistic learning my problem was that everything learned until then had been gathered only from Nordic data. It wouldn’t be any good for the rest of the world.

The solution was including an advanced data matching algorithm, in this case Omikron FACT.

Since then the Omikron FACT algorithm has been considerably improved and is now branded as WorldMatch®. Some of the new advantages are dealing with different character sets and script systems and having synonyms embedded directly into the matching logic, which is far superior to using synonyms in a prior standardization process.

For full disclosure I work for the vendor Omikron Data Quality today. But I am not praising the product because of that – I work for Omikron because of the product.
