Adding 180 Degrees to MDM

Master Data Management (MDM) has traditionally been about getting better at utilizing and sharing internal records of our customers, suppliers, products, assets and other core business entities.

My latest work around master data management revolves around bringing in external data sources in order to make onboarding processes more efficient and provide more accurate, complete and timely master data.

So, it was good to see this approach gaining traction at the MDM Summit Europe 2013.

The old stuff

Andy Walker of BP presented how BP has built the management of party master data around aligning with the D&B WorldBase for business-to-business (B2B) customer and vendor master data.

Knowing which actual legal entities you are doing business with, and which external hierarchies they belong to, is crucial for BP both in daily operations and in reporting and analysis utilizing party master data.

Using business directories isn’t new at all; it has been around for ages, and from what I have seen, it works when you do it properly and consistently.

The new stuff

Big data was a hot topic at the conference. As reported in a post from the first day, embracing big data may lead to Double Trouble with Social MDM and Big Data.

However, digging into big data and doing social MDM may certainly also provide new opportunities, as these new sources may enable us to obtain (or close in on) a 360-degree view of various master data entity types. It is, as said and tweeted by Steve Jones of Capgemini, about looking outside-in.


The Country List

It’s the second day of the MDM Summit Europe 2013 in London today.

The last session I attended today was an expert panel on Reference Data Management (RDM).

I guess the list of countries on this planet is the prime example of reference data, and today’s session was no exception to that.

Even though a list of countries is fairly small and there shouldn’t be everyday changes to it, maintaining a country list isn’t as simple as you would think.

First of all, official sources for a country list aren’t in agreement. The range of countries with an ISO code isn’t the same as the range of countries where, for example, the Universal Postal Union (UPU) says you can make a delivery.

Another example I have had some challenges with: the D&B WorldBase (a large worldwide business directory) has four country codes for what is generally regarded as the United Kingdom, as if the D&B country reference data were defined by a soccer fan recognizing the distinct national soccer teams of England, Wales, Scotland and Northern Ireland.
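In practice this kind of discrepancy is often handled with a crosswalk table that maps directory-specific country codes onto ISO 3166-1 alpha-2 codes. A minimal sketch is below; the directory code values are assumptions for illustration, not the actual D&B codes.

```python
# Hypothetical crosswalk from directory-specific country codes to
# ISO 3166-1 alpha-2. The four UK entries illustrate the D&B situation
# described above; the code values are illustrative assumptions.
DIRECTORY_TO_ISO = {
    "ENGLAND": "GB",
    "SCOTLAND": "GB",
    "WALES": "GB",
    "NORTHERN IRELAND": "GB",
    "DENMARK": "DK",
}

def to_iso(directory_code: str) -> str:
    """Map a directory country code to ISO, failing loudly on unknown codes."""
    try:
        return DIRECTORY_TO_ISO[directory_code.strip().upper()]
    except KeyError:
        raise ValueError(f"No ISO mapping for country code: {directory_code!r}")

print(to_iso("Scotland"))  # GB
```

Failing loudly on unmapped codes is deliberate: silently passing unknown country codes through is exactly how reference data discrepancies leak into master data.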

The expert panel moderator, Aaron Zornes, went as far as suggesting that a graph database may be the best technology for reflecting the complexity in reference data. Oh yes, and in master data too, you might think, though I doubt that the relational database and hierarchy management will be out of fashion for a while.


Double Trouble with Social MDM and Big Data

Yesterday was the first day at the MDM Summit Europe 2013 in London.

One of the workshops I attended was called Master Data Governance for Cloud/Social MDM/Big Data. The workshop was led by Malcolm Chisholm, one of my favorite thought leaders within data management.

According to Malcolm Chisholm, and I totally agree, the rise of social networks and big data will have a tremendous impact on future MDM (Master Data Management) architecture. These new opportunities and challenges will not replace the old way of doing MDM; integration of social data and other big data will add new elements to the existing component landscape around MDM solutions.

Like it or not, things are going to be more complicated than before.

We will have different technologies and methodologies handling the old systems of record and the new systems of engagement at the same time, for example relational databases (as we know them today) for master data and columnar databases for big data.

Profiling results from analysis of big data will be added to the current identity resolution centric master data elements handled in current master data solutions. Furthermore, there will be new interfaces for social collaboration around master data maintenance on top of the current interfaces.

So, the question is whether taking on the double trouble is worth it. Doing nothing, in this case sticking to small data, is always a popular option. But will the organizations choosing that path exist in the next decade? Or will they be outsmarted by newcomers?



Names, Addresses and National Identification Numbers

When working with customer, or rather party, master data management and related data quality improvement and prevention for traditional offline and some online purposes, you will most often deal with names, addresses and national identification numbers.

While this may be tough enough for domestic data, doing this for international data is a daunting task.

Names

In reality there should be no difference between dealing with domestic data and international data when it comes to names, as people in today’s globalized world move between countries and bring their names with them.

Traditionally the emphasis in data quality related to names has been on dealing with the most frequent issues, be that heaps of nicknames in the United States and other places, having a “van” in bulks of names in the Netherlands or having loads of surname-like middle names in Denmark.
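The nickname issue is typically handled with a lookup table that folds nicknames onto a canonical given name before matching. A minimal sketch, keeping in mind that real nickname lists run to thousands of entries and the mapping is not always unambiguous:

```python
# A tiny, illustrative nickname table; a production list would be much
# larger and would need to handle ambiguous mappings.
NICKNAMES = {
    "bill": "william",
    "bob": "robert",
    "liz": "elizabeth",
    "dick": "richard",
}

def normalize_given_name(name: str) -> str:
    """Fold a nickname onto its canonical form (lowercased) for matching."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)

print(normalize_given_name("Bill"))  # william
```

With this in place, “Bill Smith” and “William Smith” at the same address can be compared on equal terms.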

With company names there are some differences to be considered like the inclusion of legal forms in company names as told in the post Legal Forms from Hell.
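For company names, a common preprocessing step is to strip the legal form before matching, so that “Acme Ltd” and “Acme Limited” compare equal. A sketch under the assumption of a small, illustrative legal form list; a real list would be country-specific and far longer:

```python
import re

# Illustrative subset of legal forms; a production list would be
# country-specific and far longer.
LEGAL_FORMS = ["ltd", "limited", "gmbh", "a/s", "aps", "inc", "llc", "bv"]

def strip_legal_form(name: str) -> str:
    """Remove a trailing legal form so variants of the same company compare equal."""
    cleaned = name.strip().lower()
    # Try longer forms first so "limited" wins over "ltd" etc.
    for form in sorted(LEGAL_FORMS, key=len, reverse=True):
        cleaned = re.sub(r"[\s,.]*" + re.escape(form) + r"\.?$", "", cleaned)
    return cleaned.strip(" ,.")

print(strip_legal_form("Acme Ltd"))      # acme
print(strip_legal_form("Acme Limited"))  # acme
```

Note that stripping is only safe for matching purposes; the legal form should of course be kept in the stored record, since it carries real information about the entity.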

Addresses

Address formats vary between countries. That’s one thing.

The availability of public sources for address reference data varies too. These variations relate to, for example:

  • Coverage: Is every part of the country included?
  • Depth: Is it street level, house number level or unit level?
  • Costs: Are reference data expensive or free of charge?

As told in the post Postal Code Musings the postal code system in a given country may be the key (or not) to how to deal with addresses and related data quality.
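A first line of defense is checking that a postal code at least matches the format of its country. The patterns below are illustrative; a syntactically valid code may still not exist, which is where the reference data sources discussed above come in.

```python
import re

# Illustrative postal code format patterns per ISO country code.
# Format checking is a weak test: a well-formed code may still not exist.
POSTAL_PATTERNS = {
    "DK": r"\d{4}",                              # Denmark: 4 digits
    "US": r"\d{5}(-\d{4})?",                     # USA: ZIP or ZIP+4
    "NL": r"\d{4}\s?[A-Z]{2}",                   # Netherlands: 1234 AB
    "GB": r"[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}",  # UK: e.g. SW1A 1AA
}

def postal_code_format_ok(country: str, code: str) -> bool:
    """Check a postal code against its country's format pattern."""
    pattern = POSTAL_PATTERNS.get(country)
    if pattern is None:
        # Unknown country: don't reject, flag for data stewardship instead.
        return True
    return re.fullmatch(pattern, code.strip().upper()) is not None
```

For countries without postal codes at all, the country entry would simply be absent, which is itself a piece of reference data worth modeling explicitly.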

National Identification Numbers

The post called Business Entity Identifiers includes how countries have different implementations of either all-purpose national identification numbers or single-purpose national identification numbers for companies.

The same way there are different administrative practices for individuals, for example:

  • As I understand it, it is forbidden by the constitution down under to have all-purpose identification numbers for individuals.
  • The United States Social Security Number (SSN) is often mentioned in articles about party data management. It’s an example of a single-purpose number that is in fact used for several purposes.
  • In Scandinavian countries all-purpose national identification numbers are in place as explained in the post Citizen ID within seconds.
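Where national identification numbers exist, they often carry a checksum that can be verified at data capture. As a sketch, here is the classic modulus-11 check for the Danish CPR number; note that Denmark stopped guaranteeing the check digit for numbers issued after 2007, so this can only serve as a soft signal, not a hard validation rule.

```python
# Classic Danish CPR modulus-11 check (DDMMYY-SSSS, 10 digits).
# Denmark no longer guarantees the check digit for newer numbers,
# so treat a failure as a flag for review, not a rejection.
CPR_WEIGHTS = [4, 3, 2, 7, 6, 5, 4, 3, 2, 1]

def cpr_modulus11_ok(cpr: str) -> bool:
    """Return True if the 10 digits satisfy the modulus-11 checksum."""
    digits = [int(c) for c in cpr if c.isdigit()]
    if len(digits) != 10:
        return False
    return sum(d * w for d, w in zip(digits, CPR_WEIGHTS)) % 11 == 0

print(cpr_modulus11_ok("070761-4285"))  # True (synthetic example)
```

Similar checksum schemes exist for many single-purpose business identifiers too, and wiring them into the onboarding flow catches typos at the cheapest possible point.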

Dealing with diversity

Managing party master data in the light of the above-mentioned differences around the world isn’t simple. You need comprehensive data governance policies and business rules, you need elaborate data models, and you need a quite well-equipped toolbox for data quality prevention and exploiting external reference data.


How important is big data quality?

Along with the rise of big data, the question about the quality of big data, and the importance of taking data quality into consideration when analyzing big data, is raised again and again.

We had a poll in the LinkedIn Big Data Quality group. The results are as shown below:

[Image: poll results from the LinkedIn Big Data Quality group]

So, some people consider data quality to be more important for big data than for small data (the data we have analyzed until the rise of big data), some consider it less important with big data, but the majority of those who voted (including yours truly) consider the quality of big data to be equally as important as it has been with small data.

As expressed in some comments, voting “the same” is often an aggregate of some things being more important and other things being less important.

Also, some people voted “mu” (wrong question) and explained in the comments that you really can’t compare small data with big data.

A repeated sentiment in the comments is that data quality for small data is going to be more important with the rise of big data as examined in the post Small Data with Big Impact.


Crap, Damned Crap, and Big Data

Lately Jim Harris wrote a thought-provoking post on the Mike2 blog. The post is called A Contrarian’s View of Unstructured Data.

Herein Jim wrote:

“My contrarian’s view of unstructured data is that it is, in large part, gigabytes of gossip and yottabytes of yada yada digitized, rumors and hearsay amplified by the illusion-of-truth effect and succumbing to the perception-is-reality effect until the noise amplifies so much that its static solidifies into a signal.”

Indeed, the sound of social data may be like that. Yesterday I wrote a post called Keep It Real, Stupid. Herein I mentioned an apparently fake quote attributed to Albert Einstein:

“If you can’t explain it simply, you don’t understand it well enough”.

Today I tried to see how the fake quote was doing on Twitter.

OMG: It’s going at more than one tweet per minute, along with some mutations of the quote saying:

“If you can’t explain it to a six-year-old, you don’t understand it yourself”.

“You do not really understand something unless you can explain it to your grandmother”.

OK folks: Sense-making of social data is not going to be simple. Not even relatively simple.

[Tweet screenshots: the “simply” quote and its mutations being retweeted]
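Grouping these mutations under one quote is itself a small data matching problem: exact string comparison fails, so you need a fuzzy similarity measure. A minimal sketch using the standard library; the similarity threshold you would pick in practice is an assumption that needs tuning against real data.

```python
from difflib import SequenceMatcher

# The fake quote and its two mutations, as observed on Twitter above.
quotes = [
    "If you can't explain it simply, you don't understand it well enough",
    "If you can't explain it to a six-year-old, you don't understand it yourself",
    "You do not really understand something unless you can explain it to your grandmother",
]

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1] via longest matching subsequences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for other in quotes[1:]:
    print(round(similarity(quotes[0], other), 2))
```

The first mutation scores much higher than the grandmother variant, which shares sentiment but little surface text: a reminder that sense-making of social data needs semantics, not just string distance.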


Keep It Real, Stupid

One of my pet peeves is the KISS principle: Keep It Simple, Stupid.

Don’t get me wrong: It’s worth striving for simplicity wherever possible. But some problems are not simple and don’t have simple solutions. Sometimes KISS is the shortcut to getting it all wrong.

Another take on simplicity is a quote floating around in social media these days:

[Image: the “explain it simply” quote attributed to Einstein]

Oh, so Einstein said that. So you can’t argue with that.

Well, he probably didn’t, as Wikiquote reports:

[Image: Wikiquote entry disputing the attribution]

So let’s stick to a real Einstein quote:

“Everything should be as simple as it can be, but not simpler”

A great quote related to data quality and master data management by the way.


Multi-Channel Data Matching

Most data matching activities going on are related to matching customer, or rather party, master data.

In today’s business world we see data matching related to party master data in three different channel types:

  • Offline is the good old channel type, where we have the mother of all business cases for data matching: avoiding unnecessary costs by sending the same material with the postman twice (or more) to the same recipient.
  • Online has been around for some time. While the cost of sending the same digital message to the same recipient may not be a big problem, there are still other factors to be considered, like:
    • Duplicate digital messages to the same recipient look like spam (even if the recipient provided different email addresses themselves).
    • You can’t measure a true response rate.
  • Social is the new channel type for data matching. Most business cases for data matching related to social network profiles are probably based on multi-channel issues.

The concept of having a single customer view, or rather a single party view, involves matching identities across offline, online and social channels, and the typical elements used for data matching are not entirely the same for those channels, as seen in the figure to the right.
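A data model supporting this needs to carry channel-specific match keys side by side. A sketch of one way to shape it; the assignment of elements to channels here (name and postal address offline, email online, profile handle social) is an assumption based on the figure described above.

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of a party record carrying channel-specific matching elements.
@dataclass
class PartyRecord:
    name: Optional[str] = None
    postal_address: Optional[str] = None   # offline channel
    email: Optional[str] = None            # online channel
    social_handle: Optional[str] = None    # social channel

def match_keys(record: PartyRecord) -> dict:
    """Collect the normalized keys available for matching, per channel."""
    keys = {}
    if record.name and record.postal_address:
        keys["offline"] = (record.name.lower(), record.postal_address.lower())
    if record.email:
        keys["online"] = record.email.lower()
    if record.social_handle:
        keys["social"] = record.social_handle.lstrip("@").lower()
    return keys
```

Two records then match on whichever channels they both have keys for, which is exactly why sparse records from a single channel are so much harder to consolidate.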

Most data matching procedures are, in my experience, quite simple, with only a few data elements and no history track taken into consideration. However, we do see more sophisticated data matching environments, often referred to as identity resolution, where historical data, more data elements and even unstructured data are taken into consideration.

When doing multi-channel data matching you can’t avoid going from the popular simple data matching environments to more identity-resolution-like environments.

Some advice for getting it right without too much complication:

  • Emphasize data capture by getting it right the first time. It helps a lot.
  • Get your data models right. Here, reflecting the real world helps a lot.
  • Don’t reinvent the wheel. There are services for this out there. They help a lot.

Read more about such a service in the post instant Single Customer View.


Big Data and Data Matching

Data matching has been an established discipline for many years; most data quality tools have more or less sophisticated features for data matching, and many MDM (Master Data Management) platforms have data matching capabilities as well.

[Image: The LinkedIn Big Data Quality group]

In a way the data matching realm has become slightly dull in recent years. People don’t get excited anymore over a discussion about whether deterministic matching or probabilistic matching is the right way. Soundex is old, edit distance has been around for ages and matchcodes may have outlived themselves.
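For readers who haven’t met it, the edit distance mentioned above is simple enough to sketch in a few lines: the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: minimum single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[len(b)]

print(levenshtein("Jonsen", "Jensen"))  # 1
```

A distance of 1 between two surnames at the same address is the bread and butter of deduplication, whether you then wrap it in a deterministic rule or feed it into a probabilistic score.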

So, it’s good to see a new beast turning up. Data matching with big data.

It may be about deduplicating (deduping) volumes that are bigger than traditional data matching can handle. You know: Dedoop’ing.

But it is also very much about matching big data with small data, first and foremost master data, and about having well matched master data. Kimmo Kontra wrote a good post about that recently. The post is called Big Grease, Big Data, and Big Apple – manholes and MDM.

The case presented by Kimmo holds many exciting implementations of data matching, for example proximity matching of locations.
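Proximity matching of locations typically boils down to a great-circle distance against a threshold. A sketch; the 50-metre threshold is an assumption that would depend on the precision of the coordinates involved.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # mean Earth radius in km

def proximity_match(loc1, loc2, threshold_km=0.05):
    """Treat two location records as the same place if within ~50 metres.
    The threshold is an illustrative assumption, not a recommended value."""
    return haversine_km(*loc1, *loc2) <= threshold_km
```

Combined with attribute matching (a manhole ID, a street name), proximity matching is how location master data gets linked to the big data streams recorded around it.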
