Some Deduplication Tactics

When deduplicating for data quality purposes, you will often have two kinds of data matching involved:

  • Data matching in order to find duplicates internally in your master data, most often your customer database
  • Data matching in order to align your master data with an external registry

As the latter activity also helps with finding the internal duplicates, a good question is in which order to do these two activities.

External identifiers

If we look, for example, at business-to-business (B2B) customer master data, it is possible to match against a business directory. Some choices are:

  • If you have mostly domestic data in a country with a public company registry, you can obtain a national ID by matching against a business directory based on that registry. An example is the French SIREN/SIRET identifiers, as mentioned in the post Single Company View.
  • Some registries cover a range of countries. An example is the EuroContactPool where each business entity is identified with a Site ID.
  • The Dun & Bradstreet WorldBase covers the whole world by identifying approximately 200 million active and dissolved business entities with a DUNS-number. The DUNS-number also serves as a privatized national ID for companies in the United States.

If you start by matching your B2B customers against such a registry, you will get a unique identifier that can be attached to your internal customer master data records. For the records that get an identifier, a subsequent internal deduplication becomes a no-brainer: records sharing the same identifier are duplicates.
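As a minimal sketch of that identifier-based deduplication, assuming each customer record is a plain dictionary and that the identifier from the directory match has been stored under a field called ext_id (the field name, the record layout and the sample values in the usage note are all hypothetical):

```python
from collections import defaultdict


def group_by_external_id(records, id_field="ext_id"):
    """Split records into duplicate groups keyed by external identifier.

    Returns (duplicates, unmatched):
      duplicates: identifier -> list of records sharing it (two or more)
      unmatched:  records that carry no external identifier at all
    """
    groups = defaultdict(list)
    unmatched = []
    for rec in records:
        ext_id = rec.get(id_field)
        if ext_id:
            groups[ext_id].append(rec)
        else:
            # No hit in the directory: these still need fuzzy matching.
            unmatched.append(rec)
    duplicates = {k: v for k, v in groups.items() if len(v) > 1}
    return duplicates, unmatched
```

Calling group_by_external_id on a list such as [{"name": "Acme Ltd", "ext_id": "X1"}, {"name": "ACME Limited", "ext_id": "X1"}, {"name": "Bolt GmbH", "ext_id": None}] would put the two Acme records in one duplicate group and leave the Bolt record in the unmatched list.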

Common matching issues

One problem, however, is that you seldom get a 100 % hit rate when matching against a business directory, and often not even close, as examined in the post 3 out of 10.

Another issue is the commercial implications. Business directory matching is often performed as an external service priced per record. Therefore you may save money by merging the duplicates before passing the records on to external matching. And even if everything is done internally, removing the duplicates before directory matching will reduce the processing load.

However, a common pitfall is that an internal deduplication may merge two similar records that actually represent two different entities in the business directory (and in the real world).

So, as with many things in data matching, the answer to the sequence question is often: Both.

A good process sequence may be this one:

  1. An internal deduplication with very tight settings
  2. A match against an external registry
  3. An internal deduplication that exploits the external identifiers and uses looser settings for similarities not involving an external identifier
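The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: records are dictionaries with name and city fields, name similarity comes from Python's difflib, the thresholds 0.95 and 0.80 are arbitrary stand-ins for "tight" and "loose" settings, and lookup is a hypothetical function standing in for whatever directory matching service you use.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive name similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(records, same_entity):
    """Greedy clustering: a record joins the first cluster whose
    first member it matches, otherwise it starts a new cluster."""
    clusters = []
    for rec in records:
        for group in clusters:
            if same_entity(group[0], rec):
                group.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters


def tight_match(a, b, threshold=0.95):
    # Step 1 rule: only near-identical names in the same city.
    return a["city"] == b["city"] and similarity(a["name"], b["name"]) >= threshold


def loose_match(a, b, threshold=0.80):
    # Step 3 rule: if both records carry an external identifier,
    # the identifier decides; otherwise fall back to a looser
    # name comparison.
    id_a, id_b = a.get("ext_id"), b.get("ext_id")
    if id_a and id_b:
        return id_a == id_b
    return a["city"] == b["city"] and similarity(a["name"], b["name"]) >= threshold


def sandwich_dedup(records, lookup):
    # Step 1: internal deduplication with very tight settings.
    survivors = [dict(group[0]) for group in cluster(records, tight_match)]
    # Step 2: match survivors against the external registry.
    # `lookup` (hypothetical) returns an identifier or None.
    for rec in survivors:
        rec["ext_id"] = lookup(rec)
    # Step 3: internal deduplication using identifiers plus looser settings.
    return cluster(survivors, loose_match)
```

Note that only the survivors of step 1 are sent to the lookup, which is where the per-record saving comes from. Because two records with different identifiers are never merged in step 3, the looser settings cannot override what the directory says about two similar-looking entities.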


2 thoughts on “Some Deduplication Tactics”

  1. Sophie @ 24th August 2011 / 11:46

    Process decisions are so rarely binary; the most common answers I hear are “Both” and “It depends”! A sandwiching sequence, as you describe here, is often the best option if your data quality tools don’t offer any integrated business directories, or don’t integrate with the one you wish to use.

    [Blatant self-promotion: Match2Lists recognises the DUNS number format, and supports matching & deduping on multiple weighted criteria to overcome the “3 out of 10” issue. Also, we’ll soon be announcing a deeper integration with Dun & Bradstreet’s WorldBase.]

    Thanks for this (as ever) thoughtful and informative post!

    • Henrik Liliendahl Sørensen 24th August 2011 / 13:01

      Thanks for joining Sophie.

      Blatant self-promotions are OK within the subject 🙂

      I think your planned business directory integration is a good move.
