One of the things that data quality tools do is data matching. Data matching is mostly related to the party master data domain. It is about comparing two or more data records that do not have exactly the same data but describe the same real-world entity.
A common approach is to compare data records in internal master data repositories within your organization. However, there are great advantages in bringing in external reference data sources to support the data matching.
Some of the ways to do that, which I have worked with, include these kinds of big reference data:
The business-to-business (B2B) world does not have privacy issues to the degree we see in the business-to-consumer (B2C) world. Therefore there are many business directories out there with a quite complete picture of which business entities exist in a given country, and even in regions and the whole world.
A common approach is to first match your internal B2B records against a business directory and obtain a unique key for each business entity. The next step, matching business entities that share the same unique key, is a no-brainer.
The problem, though, is that an automatic match between internal B2B records and a business directory most often does not yield a 100% hit rate. Not even close, as examined in the post 3 out of 10.
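The two-step approach above can be sketched roughly as follows. This is a minimal illustration and not how any particular data quality tool does it: the `DIRECTORY` contents, the `LEGAL_FORMS` list, the 0.85 threshold and the function names are all assumptions made up for the example, and the name comparison simply uses Python's standard `difflib`.

```python
from difflib import SequenceMatcher

# Hypothetical business directory: directory key -> (name, postal code)
DIRECTORY = {
    "D-001": ("Acme Industries Ltd", "2100"),
    "D-002": ("Nordic Data Services A/S", "8000"),
}

# Assumed list of legal-form tokens to ignore when comparing names
LEGAL_FORMS = {"ltd", "limited", "inc", "as", "a/s", "aps", "gmbh"}


def normalize(name):
    """Lowercase, strip simple punctuation and drop legal-form tokens."""
    tokens = name.lower().replace(".", "").replace(",", " ").split()
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)


def directory_key(name, postal_code, threshold=0.85):
    """Step 1: return the key of the best directory hit, or None."""
    best_key, best_score = None, 0.0
    for key, (dir_name, dir_postal) in DIRECTORY.items():
        if dir_postal != postal_code:
            continue  # only compare candidates in the same postal code
        score = SequenceMatcher(None, normalize(name), normalize(dir_name)).ratio()
        if score > best_score:
            best_key, best_score = key, score
    return best_key if best_score >= threshold else None


def group_by_key(records):
    """Step 2: internal records sharing a directory key are duplicates."""
    groups, unmatched = {}, []
    for rec_id, name, postal in records:
        key = directory_key(name, postal)
        if key is None:
            unmatched.append(rec_id)
        else:
            groups.setdefault(key, []).append(rec_id)
    return groups, unmatched


records = [
    (1, "ACME Industries Limited", "2100"),
    (2, "Acme Industries, Ltd.", "2100"),
    (3, "Nordic Data Services", "8000"),
    (4, "Unknown Trading Co", "1000"),
]
groups, unmatched = group_by_key(records)
hit_rate = (len(records) - len(unmatched)) / len(records)
```

The `unmatched` list is where the less-than-100% hit rate shows up: those records still need another matching pass or a manual review.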
Address directories are mostly used to standardize postal address data, so that two addresses in internal master data that standardize to exactly the same written form can be matched more reliably.
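As a rough sketch of the standardization idea, the snippet below rewrites two differently written addresses into the same form before comparing them. A real address directory does a lookup against known addresses; here a tiny, made-up `ABBREVIATIONS` table stands in for it, and the function name is likewise an assumption.

```python
import re

# Hypothetical abbreviation table standing in for a real address directory
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue", "apt": "apartment"}


def standardize(address):
    """Lowercase, drop periods and commas, and expand known abbreviations."""
    cleaned = re.sub(r"[.,]", " ", address.lower())
    tokens = [ABBREVIATIONS.get(token, token) for token in cleaned.split()]
    return " ".join(tokens)


same = standardize("12 Main St.") == standardize("12 main street")
```

A table like this is far too naive for production use ("St" can also mean "Saint"), which is exactly why a proper address directory is worth bringing in.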
A deeper use of address directories is to exploit related property data. The probability of two records with “John Smith” at the same address being a true positive match is much higher if the address is a single-family house as opposed to a high-rise building, nursing home or university campus.
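One way to act on property data is to demand a stronger name match when many unrelated people can live at the same address. The sketch below assumes the address directory supplies a property type for each address; the type labels and the threshold numbers are invented for illustration and are not measured values.

```python
# Hypothetical thresholds: invented for illustration, not measured values.
REQUIRED_NAME_SCORE = {
    "single_family_house": 0.80,
    "high_rise": 0.95,
    "nursing_home": 0.95,
    "university_campus": 0.95,
}
DEFAULT_REQUIRED_SCORE = 0.90  # used when the property type is unknown


def is_match(name_score, property_type):
    """Accept a same-address pair only if the name score clears the
    threshold required for that kind of property."""
    required = REQUIRED_NAME_SCORE.get(property_type, DEFAULT_REQUIRED_SCORE)
    return name_score >= required
```

With these assumed numbers, a name similarity of 0.85 would be accepted for a single-family house but sent to review for a high-rise building.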
A common cause of false negatives in data matching is comparing two records where one of the postal addresses is an old one.
Bringing in National Change of Address (NCOA) services for the countries in question will help a lot.
The optimal way of doing that (and utilizing business and address directories) is to make it a continuous element of Master Data Management (MDM) as explored in the post The Relocation Event.
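To illustrate how change-of-address data removes that kind of false negative, the sketch below resolves each record to its current address before comparing. The `MOVES` table is a made-up stand-in for a change-of-address feed; real NCOA services have their own formats and licensing terms, which are not modelled here.

```python
# Hypothetical change-of-address feed: (name, old address) -> new address
MOVES = {
    ("john smith", "12 main street"): "7 oak avenue",
}


def current_address(name, address):
    """Follow recorded moves until the latest known address is reached."""
    seen = set()  # guards against circular move records
    while (name, address) in MOVES and address not in seen:
        seen.add(address)
        address = MOVES[(name, address)]
    return address


def same_party(rec_a, rec_b):
    """Compare two (name, address) records on their current addresses."""
    name_a, addr_a = rec_a
    name_b, addr_b = rec_b
    return name_a == name_b and (
        current_address(name_a, addr_a) == current_address(name_b, addr_b)
    )
```

Without the move record, `("john smith", "12 main street")` and `("john smith", "7 oak avenue")` would be treated as two different parties.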
Right on Henrik!!! I am right in the middle of a project where I am matching against three commercial 3rd party data sets including the National Register of Businesses, the Postal Address File and the national telephone directories. We have already improved our ‘Completeness’ measure for Company Identifiers from 12% to 60% within 4 weeks. However, the greatest challenge in these projects is getting enrichment data into core systems and databases where people are always concerned with making wholesale changes to databases, particularly customer data. Therefore, it is critical to get engagement from both IT and Business before performing these types of activities to ensure all this great work sees the light of day!
In addition to the data matching sources mentioned, many of the GIS vendors offer a palette of data that, when “matched against”, offers added enrichment. There are also numerous other sources of data for product data (e.g. industry standard codes and conventions), but they are generally industry specific. Often in the world of matching, the match result doesn’t need to be adopted (or integrated); the data is simply flagged as having a possible problem, along with the suggested end result... Remember, matching is really an art that we are pushing towards a scientific result.
Although it is often domain specific, we could say that all data available within the scope of the Open Data movement are likely to be used in a data matching project (either to improve data quality or enrich your initial data set).
As described in your post regarding address directories, we use, for example, OpenStreetMap data (see http://www.openstreetmap.org) to match addresses and thereby improve our data quality.
Thanks Duane, John and Martin for adding in.