Using External Data in Data Matching

One of the things data quality tools do is data matching. Data matching mostly relates to the party master data domain. It is about comparing two or more data records that do not hold exactly the same data but describe the same real-world entity.

The common approach is to compare data records held in internal master data repositories within your organization. However, there are great advantages in bringing in external reference data sources to support the data matching.

Some of the ways I have worked with involve these kinds of big reference data:

Business directories:

The business-to-business (B2B) world does not have privacy issues to the degree we see in the business-to-consumer (B2C) world. Therefore there are many business directories out there with a quite complete picture of which business entities exist in a given country, a region or even the whole world.

A common approach is to first match your internal B2B records against a business directory and obtain a unique key for each business entity. The next step of matching business entities that share that unique key is a no-brainer.

The problem, though, is that an automatic match between internal B2B records and a business directory most often does not yield a 100 % hit rate. Not even close, as examined in the post 3 out of 10.
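The two-step approach can be sketched as follows. This is a minimal illustration, not any vendor's implementation: the directory lookup function, its field names and the shape of the unique key are hypothetical placeholders for whatever your chosen business directory actually provides.

```python
# Sketch: two-step B2B matching via an external business directory.
# `directory_lookup`, the record fields and "unique_key" are assumed
# placeholders for a real directory service.

def match_against_directory(record, directory_lookup):
    """Try to resolve an internal B2B record to a directory entry
    and return its unique key (e.g. a registry number), or None."""
    hit = directory_lookup(name=record["name"], address=record["address"])
    return hit["unique_key"] if hit else None

def deduplicate_by_key(records, directory_lookup):
    """Group records that resolve to the same directory key.
    Records with no directory hit (often around 3 out of 10) must
    fall back to ordinary fuzzy matching (not shown here)."""
    groups, unmatched = {}, []
    for rec in records:
        key = match_against_directory(rec, directory_lookup)
        if key is None:
            unmatched.append(rec)
        else:
            groups.setdefault(key, []).append(rec)
    return groups, unmatched
```

Once records carry the same directory key, deciding that they describe the same business entity is a simple equality check rather than a probabilistic comparison.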

Address directories:

Address directories are mostly used to standardize postal address data, so that two internal master data addresses that standardize to exactly the same written form can be matched more reliably.

A deeper use of address directories is to exploit related property data. The probability of two records with “John Smith” at the same address being a true positive match is much higher if the address is a single-family house as opposed to a high-rise building, nursing home or university campus.
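As an illustration of using property data as match evidence, here is a minimal sketch. The base score, the property-type codes and the weights are invented for illustration and are not taken from any specific tool or directory:

```python
# Sketch: weighting a same-name, same-address candidate match by
# property type. All scores and type codes are illustrative assumptions.

BASE_SCORE = 0.6  # evidence from matching name and standardized address

PROPERTY_WEIGHT = {
    "single_family_house": 0.35,  # few residents: strong extra evidence
    "high_rise": 0.05,            # many residents: weak extra evidence
    "nursing_home": 0.05,
    "university_campus": 0.0,
}

def match_confidence(property_type):
    """Combine the base name/address score with property evidence,
    capped at 1.0. Unknown property types get a neutral weight."""
    return min(1.0, BASE_SCORE + PROPERTY_WEIGHT.get(property_type, 0.1))
```

With such a scheme, the same name/address agreement yields a much higher confidence at a single-family house than at a high-rise, which is exactly the intuition described above.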

Relocation services:

A common cause of false negatives in data matching is that you have compared two records where one of the postal addresses is an old one.

Bringing in National Change of Address (NCOA) services for the countries in question will help a lot.

The optimal way of doing that (and of utilizing business and address directories) is to make it a continuous element of Master Data Management (MDM), as explored in the post The Relocation Event.


Where to put Master Data?

The core of most Master Data Management (MDM) solutions is a master data hub. MDM solutions such as those appearing in analyst reports revolve around a store for master data that is a new place, different from where master data usually live: for example, in CRM, SCM and ERP systems.

For large organizations with a complex IT landscape, having an MDM hub is usually the only sensible solution.

However, for many midsize and smaller organizations, and even for large organizations with a dominant ERP system, the choice is often to name one of the application databases as the main master data hub for a given master data domain, such as customer, supplier, product or whatever else is considered a master data entity.

In such cases you may apply capabilities such as data quality services, as described in the post Lean MDM, and other master data related services, as told in the post Service Oriented MDM.

There are arguments for and against both approaches. Probably the most used argument against the MDM hub approach is: why solve the issue of having X data silos by creating data silo X + 1? The argument against naming a given application as the place for master data is that an application is built for a specific purpose and is therefore not good for other purposes of master data use.

Where do you put your master data? Why?


Can you have data quality without data governance?

The question of whether you can successfully run a data quality program without doing data governance is a recurring subject in the data management realm. The question was again discussed by Rachel Haines in a recent article called Is the Data Governance Value Message Getting Lost?

I think we have used the term data quality much longer than we have used the term data governance. Before data governance became a popular term, organizations did run data quality programs without doing something called data governance. However, doing something about data quality is an act of data governance, just perhaps without some of the formalized elements we have only recently put under the umbrella called data governance.

As I remember, we have always worked with assigning responsibilities, understanding and documenting business rules and some of the other good stuff now seen as embraced by data governance. Doing data quality improvement without such considerations has always been pointless.

Today we have good frameworks available for data governance. Of course you should take advantage of the maturing data governance discipline to support achieving and sustaining better data quality in order to provide better business outcomes.


Data Quality Dimensions and Real World Alignment

Real-world alignment is often seen as a measure of data quality competing with the popular approach of seeing data quality as fitness for the purpose of use.

When we try to narrow down what constitutes quality of data, we may use data quality dimensions. So, what do data quality dimensions look like in the light of real-world alignment? Here are a few thoughts:

  • Uniqueness is probably the data quality dimension that most closely relates to real-world alignment, as the opposite of uniqueness is duplication, which in the data quality world means that two or more different data records describe the same real-world entity.
  • Accuracy is best measured as the degree to which data describes something in the real world.
  • Credibility was recently proposed as an important data quality dimension by Malcolm Chisholm on Information Management in the article called Data Credibility: A New Dimension of Data Quality? Here credibility is whether data is free of malicious manipulation performed to fulfill an evil purpose of use.
Some data quality dimensions
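As a small illustration of the uniqueness dimension, the sketch below measures uniqueness as the share of records that do not collide on a normalized identity key. The normalization is deliberately simplistic and purely an assumption for illustration; real deduplication uses far richer matching.

```python
# Sketch: measuring the uniqueness dimension as the share of distinct
# real-world identities among all records. The identity key below is a
# naive, assumed normalization of name and address.

def identity_key(record):
    """Normalize name and address into a crude identity key."""
    return (record["name"].lower().replace(".", "").strip(),
            record["address"].lower().strip())

def uniqueness_score(records):
    """Fraction of records that represent distinct real-world entities
    under the naive key; 1.0 means no detected duplication."""
    keys = [identity_key(r) for r in records]
    return len(set(keys)) / len(keys) if keys else 1.0
```

The complement of this score is a simple duplication measure: the lower the uniqueness, the more records describe the same real-world entity.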


Service Oriented MDM

Much of the talking and doing related to Master Data Management (MDM) today revolves around the master data repository being the central data store for information about customers, suppliers and other parties, products, locations, assets and whatever else is regarded as master data entities.

The difficulties in MDM implementations often arise because master data are born, maintained and consumed in a range of applications, such as ERP systems, CRM solutions and heaps of specialized applications.

It would be nice if these applications were MDM aware. But usually they are not.

As discussed in the post Service Oriented Data Quality, the concepts of Service Oriented Architecture (SOA) make a lot of sense for deploying data quality tool capabilities that go beyond the classic batch cleansing approach.

In the same way, we also need SOA thinking when we have to make the master data repository do useful work all over the scattered application landscape that most organizations live with today and probably will in the future.

MDM functionality deployed as SOA components has a lot to offer, for example:

  •  Reuse is one of the core principles of SOA. Having the same master data quality rules applied at every entry point for the same sort of master data will help with consistency.
  •  Interoperability will make it possible to deploy prevention of master data quality issues as close to the root cause as possible.
  •  Composability makes it possible to combine functionality with different advantages – e.g. combining internal master data lookup with external reference data lookup.
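The composability point can be sketched as a simple service combinator. The internal and external lookup services here are hypothetical stand-ins; the point is only that independently deployed lookups can be combined behind one interface.

```python
# Sketch: composing SOA-style lookup services. Each lookup is any
# callable taking a query and returning a result or None. The concrete
# services (internal master data, external reference data) are assumed.

def compose_lookups(*lookups):
    """Return one service that tries each lookup in order, illustrating
    the SOA composability principle."""
    def service(query):
        for lookup in lookups:
            result = lookup(query)
            if result is not None:
                return result
        return None
    return service
```

A party lookup service could then be composed from an internal master data lookup falling back to an external business directory lookup, without either component knowing about the other.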


Completeness is still bad, while uniqueness is improving

In a recent report called The State of Marketing Data, prepared by Netprospex, over 60 million B2B records were analyzed in order to assess the quality of the data measured as fitness for use for marketing purposes.

An interesting finding was that, out of a maximum score of 5.0, duplication, the dark side of uniqueness, was given an average score of 4.2, while completeness was given an average score of 2.7.
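As an illustration of how a completeness score on a 0 to 5 scale could be computed, here is a minimal sketch. The chosen required fields and the scoring formula are my assumptions for illustration, not the report's actual methodology:

```python
# Sketch: per-record field completeness scaled to 0-5. The required
# fields and the linear scaling are assumptions, not the report's method.

REQUIRED_FIELDS = ["name", "email", "phone", "title", "company"]

def completeness_score(record):
    """Score one record: 5.0 when all required fields are filled."""
    filled = sum(1 for f in REQUIRED_FIELDS if record.get(f))
    return 5.0 * filled / len(REQUIRED_FIELDS)

def average_completeness(records):
    """Average completeness across a data set, on the same 0-5 scale."""
    return sum(completeness_score(r) for r in records) / len(records)
```

An average around 2.7 on such a scale would mean that, on average, only a bit over half of the required marketing fields are filled in.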


This corresponds well with my experience. In the data quality realm we have worked very hard with deduplication tools using data matching approaches over the years, and the results are showing. We are certainly not there yet, but it seems that completeness, and in my experience also accuracy, are the data quality dimensions currently suffering more.

In my eyes, the remedy for improving completeness and accuracy goes hand in hand with even better uniqueness. It is about getting the basic data right the first time, as described in the post instant Single Customer View, and being able to keep up completeness and accuracy, as told in the post External Events, MDM and Data Stewardship.
