When you deduplicate for data quality purposes, two kinds of data matching are often involved:
- Data matching in order to find duplicates internally in your master data, most often your customer database
- Data matching in order to align your master data with an external registry
As the latter activity also helps find the internal duplicates, a good question is in which order to do these two activities.
If we, for example, look at business-to-business (B2B) customer master data, it is possible to match against a business directory. Some choices are:
- If you have mostly domestic data in a country with a public company registry, you can obtain a national ID by matching against a business directory based on that registry. An example is the French SIREN/SIRET identifiers, as mentioned in the post Single Company View.
- Some registries cover a range of countries. An example is the EuroContactPool where each business entity is identified with a Site ID.
- The Dun & Bradstreet WorldBase covers the whole world by identifying approximately 200 million active and dissolved business entities with a DUNS-number. The DUNS-number also serves as a privatized national ID for companies in the United States.
If you start by matching your B2B customers against such a registry, you get a unique identifier that can be attached to your internal customer master data records. Records sharing the same identifier are duplicates, which makes a subsequent internal deduplication a no-brainer.
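As a minimal sketch of why the identifier makes this step easy, the following groups records by an attached external identifier. The field names `id`, `name` and `external_id` and the identifier values are hypothetical, chosen only for illustration; they do not reflect the format of any real registry.

```python
from collections import defaultdict


def group_by_external_id(records):
    """Return (duplicates, unmatched).

    duplicates maps each external identifier shared by more than one
    record to the list of internal record ids carrying it.
    unmatched lists the internal ids of records without an identifier.
    """
    groups = defaultdict(list)
    unmatched = []
    for rec in records:
        ext = rec.get("external_id")
        if ext:
            groups[ext].append(rec["id"])
        else:
            unmatched.append(rec["id"])
    duplicates = {ext: ids for ext, ids in groups.items() if len(ids) > 1}
    return duplicates, unmatched


# Hypothetical sample data: records 1 and 2 were matched to the same entity.
customers = [
    {"id": 1, "name": "Acme Corp", "external_id": "X-001"},
    {"id": 2, "name": "ACME Corporation", "external_id": "X-001"},
    {"id": 3, "name": "Bolt Ltd", "external_id": "X-002"},
    {"id": 4, "name": "Cog & Sons", "external_id": None},
]

duplicates, unmatched = group_by_external_id(customers)
```

Records that end up in `unmatched` received no identifier from the directory and still need similarity-based matching.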
Common matching issues
A problem, however, is that you seldom get a 100 % hit rate in a business directory matching, often not even close, as examined in the post 3 out of 10.
Another issue is the commercial implications. Business directory matching is often performed as an external service priced per record. Therefore, you may save money by merging the duplicates before passing the records on to external matching. And even if everything is done internally, removing the duplicates before directory matching will reduce the processing load.
However, a common pitfall is that an internal deduplication may merge two similar records that actually represent two different entities in the business directory (and in the real world).
So, as with many things in data matching, the answer to the sequence question is often: Both.
A good process sequence may be this one:
- An internal deduplication with very tight settings
- A match against an external registry
- An internal deduplication exploiting external identifiers and using looser settings for similarities not involving an external identifier
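The sequence above could be sketched roughly as follows. This is an illustration under stated assumptions, not a production matcher: name similarity uses Python's `difflib.SequenceMatcher`, the thresholds 0.95 and 0.80 are arbitrary, the registry is a hypothetical in-memory dictionary standing in for a real directory matching service, and the loose rule is applied whenever at least one of the two records lacks an identifier.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(records, same):
    """Greedy clustering: a record joins the first cluster whose
    first member it matches, otherwise it starts a new cluster."""
    clusters = []
    for rec in records:
        for members in clusters:
            if same(members[0], rec):
                members.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters


def run_sequence(records, registry, tight=0.95, loose=0.80):
    # Step 1: internal deduplication with very tight settings.
    # The first record of each cluster is kept as the survivor.
    first_pass = cluster(
        records, lambda a, b: similarity(a["name"], b["name"]) >= tight
    )
    survivors = [dict(members[0]) for members in first_pass]

    # Step 2: match against the external registry (hypothetical lookup
    # on the lowercased name; a real service would match fuzzily).
    for rec in survivors:
        rec["external_id"] = registry.get(rec["name"].lower())

    # Step 3: identifiers decide when both records have one;
    # otherwise fall back to the looser similarity threshold.
    def same(a, b):
        if a["external_id"] and b["external_id"]:
            return a["external_id"] == b["external_id"]
        return similarity(a["name"], b["name"]) >= loose

    return cluster(survivors, same)


# Hypothetical registry and customer data.
registry = {
    "acme corp": "R1",
    "acme corporation": "R1",
    "acme corp ltd": "R2",
}
customers = [
    {"id": 1, "name": "Acme Corp"},
    {"id": 2, "name": "ACME CORP"},
    {"id": 3, "name": "Acme Corporation"},
    {"id": 4, "name": "Acme Corp Ltd"},
    {"id": 5, "name": "Bolt Industries"},
    {"id": 6, "name": "Bolt Industry"},
]

result = run_sequence(customers, registry)
result_ids = [[rec["id"] for rec in members] for members in result]
```

In this sample, record 2 is merged into record 1 already in the tight first step, so only five records are sent to the registry. Records 1 and 3 are merged in the last step because they share an identifier, record 4 stays separate because its identifier differs even though its name is similar, and records 5 and 6 are merged by the looser similarity rule.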