Standardization – Page 5 – Liliendahl on Data Quality

The Present Birthday

28th September 201128th September 2013Henrik Gabs Liliendahl4 Comments

Today (or maybe yesterday) Steve Jones of Capgemeni wrote a blog post called Same name, same birth date – how likely is it? The post examines the likelihood of that two records with the same name and birthday is representing same real world individual. The chance that a match is a false positive is of course mainly depending on the frequency of the name.

Another angle in this context I have observed over and over again is the chance of a false negative if the name and other data are the same, but the birthday is different. In this case you may miss matching two records that are actually reflecting the same real world individual.

One should think that a datum like a birthday usually should be pretty accurate. My practical experience is that it in many cases isn’t.

Some examples:

Running against the time

Every fourth year when we have Olympic Games there is always controversies about if a tiny female athlete really is as old as said.

I have noticed the same phenomenon when I had the chance to match data about contesters from several years of subscription data at a large city marathon in order to identify “returning customers”.

I’m always looking for false positives in data matching and was really surprised when I found several examples of same name and contact data but a birthday been raised one year for each appearance at the marathon.

That’s not my birthday, this is my birthday

Swedish driving license numbers includes the birthday of the holder as the driving license number is the same as the all-purpose national ID that starts with the birthday.

In a database with both a birthday field and a driving license number field there were heaps of records with mismatch between those two fields.

This wasn’t usually discovered because this rule only applies to Swedish driving license numbers and the database also had registrations for a lot of other nationalities.

When investigating the root cause of this there were as usual not a single explanation and the problem could be both that the birthday belonged to someone else and the driving license belonged to someone else.

Using both fields cut down the number of false negatives here.

Today’s date format is?

In the United States and a few other countries it’s custom to use the month-day-year format when typing a date. In most other places we have the correct sequence of either day-month-year or year-month-day. Once I matched data concerning foreign seamen working on ships in the Danish merchant fleet. When tuning the match process I found great numbers of good matches when twisting the date formats for birthdays, as the same seaman was registered on different ships with different captains and at different ports around the world.

When adding the fact that many birthdays was typed as 1^st January of the known year of birth or 1^st day in the known month of birth a lot of false positives was saved.

The question about occupation in the merchant fleet was actually a political hot potato at that time and until then the parliament had discussed the matter based on wrong statistics.

I have used birthday synonymously with “date of birth” which of course is a (meta) data quality problem.

Down the Street

3rd September 201114th July 2012Henrik Gabs Liliendahl9 Comments

Having an address consisting of a house number and a street name, or vice versa, is the usual way of addressing in most parts of the world. This construct is also featured in the presentation of the Universal Postal Union’s (UPU) international standard initiative (S42):

(Click on image to see the presentation)

Somehow I always end up living at a place with issues in relation to this construct.

Our current address is (without unit):

“Kenny Drews Vej 27” which would be “27 Kenny Drews Way” in an Anglo-phone country.

But our area has a new style of block buildings with canals between as we like to pretend that we live in Venice or Amsterdam:

This means that the house numbers aren’t sequenced down the street, but is spread round the block as if we were living in Japan. Google maps have the position exactly as it is:

Number 27 on Kenny Drews Vej is actually much closer to two other streets, which makes it very difficult when people are visiting us the first time and for some also the second time.

But that’s because I, and some of our visitors, are old fashioned. As Prashanta Chan says in his blog post Geocoding: Accurate Location Master Data: It will be much better to invite folks to your geocode.

The same thing applies to when you want some goods delivered to your premises or want a taxi as close to your front door as possible.

And regarding letters delivered by the good old postman: They will probably all be sent electronically before the UPU S42 addressing mapping standard is adapted by everyone.

Managing Client On-Boarding Data

19th July 201115th April 2012Henrik Gabs Liliendahl3 Comments

This year I will be joining FIMA: Europe’s Premier Financial Reference Data Management Conference for Data Management Professionals. The conference is held in London from 8^th to 10^th November.

I will present “Diversities In Using External Registries In A Globalised World” and take part in the panel discussion “Overcoming Key Challenges In Managing Client On-Boarding Data: Opportunities & Efficiency Ideas”.

As said in the panel discussion introduction: The industry clearly needs to normalise (or is it normalize?) regional differences and establish global standards.

The concept of using external reference data in order to improve data quality within master data management has been a favorite topic of mine for long.

I’m not saying that external reference data is a single source of truth. Clearly external reference data may have data quality issues as exemplified in my previous blog post called Troubled Bridge Over Water.

However I think there is a clear trend in encompassing external sources, increasingly found in the cloud, to make a shortcut in keeping up with data quality. I call this Data Quality 3.0.

The Achilles Heel though has always been how to smoothly integrate external data into data entry functionality and other data capture processes and not to forget, how to ensure ongoing maintenance in order to avoid else inevitable erosion of data quality.

Lately I have worked with a concept called instant Data Quality. The idea is to make simple yet powerful functionality that helps with hooking up with many external sources at the same time when on-boarding clients and making continuous maintenance possible.

One aspect of such a concept is how to exploit the different opportunities available in each country as public administrative practices and privacy norms varies a lot over the world.

I’m looking forward to present and discuss these challenges and getting a lot of feedback.

How long is a Marathon?

22nd May 201122nd May 2011Henrik Gabs LiliendahlLeave a comment

Many large cities around the world have a yearly marathon event. Today it’s Copenhagen (and possibly other cities too).

The marathon distance today is 42,195 kilometers (if I use comma as decimal point) which resembles 26 miles and 385 yards or 26.22 miles (if I use a dot as decimal point).

So even if we today agree about the distance we might represent that distance in various ways. The distance has however varied during history as seen in the table with the length of the Olympic marathons.

What about real world alignment?

Well, if the Greek runner called Pheidippides (sometimes spelled Phidippides or Philippides) took the long but flat Southern route from Marathon to Athens it would have been around 42 kilometers. If he took the shorter but steeper Northern route it would only have been around 35 kilometers.

What about me? Oh, I’ll go for 42,195 kilometers – on the bike.

Foreign Affairs

11th March 201111th August 2011Henrik Gabs Liliendahl2 Comments

There is a famous poster called The New Yorker. This poster perfectly illustrates the centricity we often have about the town, region or country we live in.

The same phenomenon is often seen in data management.

I mentioned United States centricity as a minor criticism in my recent book review about the excellent book “Master Data Management and Data Governance”.

An example from the book is this statement:

“It is important to differentiate between U.S. domestic addresses and international addresses. This distinction is important for U.S.-centric MDM solutions because U.S. domestic addresses are normally better defined and therefore can be processed in a more automatic fashion, while international addresses require more manual intervention.”

The same fact could be expressed by saying:

“It is important to differentiate between Danish domestic addresses and international addresses. This distinction is important for Danish-centric MDM solutions because Danish domestic addresses are normally better defined and therefore can be processed in a more automatic fashion, while international addresses require more manual intervention.”

Only, the better formatted address in the first case is the messy address in the last case, and the better formatted address in the last case is the messy address in the first case.

If your MDM scope is country-centric it is sensible to concentrate on automation related to that country.

If your MDM scope is international there are two options:

The easy way: The one size fits all option. This is a moderate investment, but also, it only yields moderate results in terms of automation and data quality.
The hard way: You have to implement specialized automation and investigate best external reference data for each country. I made a Danish-centric post on that last year here.

Legal Forms from Hell

27th October 20109th October 2014Henrik Gabs Liliendahl2 Comments

When doing data matching with company names a basic challenge is that a proper company name in most cultures in most cases have two elements:

The actual company name
The legal form

Some worldwide examples:

Informatica Corporation
Talend SA
SAP Deutschland AG & Co. KG
Sony Kabushiki Kaisha
LEGO A/S

There are hundreds of different legal forms in full and abbreviated forms. Wikipedia has a list here (here called types of business entity).

However, when typing in company names in databases the legal form is often omitted. And even where legal forms are present they may be represented differently in full or abbreviated forms, with varying spelling and punctuation and so on. As the actual company names also suffer from this fuzziness, the complexity is overwhelming.

A common way of handling this issue in data matching is to separate the legal form and then emphasize on comparing the remaining part being the actual company name. When doing that it has to be done country specific or else you may remove the entire name of a company like with a name of an Italian company called Société Anonyme, which is a French legal form.

While the practice of having legal forms in company names may serve well for the original purpose of knowing the risk of doing business with that entity, it is certainly not serving the purpose of having the uniqueness data quality dimension solved.

One should think that it is time for changing the bad (legal demanded) practice of mixing legal forms with company names and serve the original purpose in another more data quality friendly way.

To be called Hamlet or Olaf – that is the question

18th October 20105th November 2010Henrik Gabs LiliendahlLeave a comment

Right now my family and I are relocating from a house in a southern suburb of Copenhagen into a flat much closer to downtown. As there is a month in between where we haven’t a place of our own, we have rented a cottage (summerhouse) north of Copenhagen not far from Kronborg Castle, which is the scene of the famous Shakespeare play called Hamlet.

Therefore a data quality blog post inspired by Hamlet seems timely.

Though the feigned madness of Hamlet may be a good subject related to data quality, I will however instead take a closer data matching look at the name Hamlet.

Shakespeare’s Hamlet is inspired by an old Norse legend, but to me the name Hamlet doesn’t sound very Norse.

Nor does the same sounding name Amleth found in the immediate source being Saxo Grammaticus.

If Saxo’s source was a written source, it may have been from Irish monks in Gaelic alphabet as Amhlaoibh where Amhl=owl and aoi=ay and bh=v sounding just like the good old Norse name Olav or Olaf.

So, there is a possible track from Hamlet to Olaf.

Also today a fellow data quality blogger Graham Rhind posted a post called Robert the Carrot with the same issue. As Graham explains, we often see how data is changed through interfaces and in the end after passing through many interfaces doesn’t look at all as it was when first entered. There may be a good explanation for each transformation, but the end-to-end similarity is hard to guess when only comparing these two.

I have met that challenge in data matching often. An example will be if we have the following names living on the same address:

Pegy Smith
Peggy Smith
Margaret Smith

A synonym based similarity (or standardization) will find that Margaret and Peggy are duplicates.

An edit distance similarity will find that Peggy and Pegy are duplicates,

A combined similarity algorithm will find that all three names belong to a single duplicate group.

Business Directory Match: Global versus Local

6th October 201029th May 2012Henrik Gabs LiliendahlLeave a comment

When doing data quality improvement in business-to-business party master data an often used shortcut is matching your portfolio of business customers with a business directory and preferably picking new customers from the directory in the future.

If you are doing business in more than one country you will have some considerations about what business directory to use like engaging with a local business directory for each country or engaging with a single business directory covering all countries in question.

There are pro’s and con’s.

One subject is conformity. I have met this issue a couple of times. A business directory covering many countries will have a standardized way of formatting the different elements like a postal address, whereas a local (national) business directory will use best practice for the particular country.

An example from my home country Denmark:

The Dun & Bradstreet WorldBase is a business directory holding 170 million business entities from all over the world. A Danish street address is formatted like this:

Address Line 1 = Hovedgaden 12 A, 4. th

Observe that Denmark belongs to that half of the earth where house numbers are written after the street name.

In a local business directory (based on the public registry) you will be able to get this format:

Street name = Hovedgaden

Street code = 202 4321

House number = 012A

Floor = 04

Side/door = TH

Here you get an atomized address with metadata for the atomized elements and the unique address coding used in Denmark.

What are they doing?

19th August 201019th August 2010Henrik Gabs Liliendahl12 Comments

A core attribute in customer master data when dealing with business entities is assigning values for your customers/prospects industry vertical (or Line-of-Business or market segment or whatever metadata name you like).

When handling this particular data element you will come across many of the classic different options in data and information management.

Unstructured versus structured

Many early CRM (Customer Relationship Management) implementations offered a free text field for the industry vertical. While this approach may have been good for the free flow in data entry it of course has created havoc when business intelligence was applied to the CRM data. Countless cleansing projects have been done (and is going on) around in order to fix this basic mistake.

Most data entry forms today having an industry vertical value has a value list to choose from.

Your list versus an external standard

When having a value list it may be a list of your own creation or be based on an external standard list, for example SIC or NACE codes.

Having a list of your own tends to fulfill the data quality principle of fit for purpose of use while an external standard tends to fulfill the data quality principle of reflecting the real world construct.

The main weaknesses of a list of your own are that it requires continuous manual based maintenance and may cause conflicts. Deep down into a discussion on the Initiate MDM blog Julian Schwarzenbach offered a good example saying:

“I have also come across ‘flip-flop’ data – which is typically subjective data where two users cannot agree what the correct value is and it keeps getting changed between two values. This could be the classification of a customer by market sector where two different territories are reflecting different capabilities in their territories.” – Link here.

The main weaknesses of an external standard are that they seldom offer the granularity you need and for global data the different standards (SIC versions and different national NACE implementations and others) are a pain in the…

One versus several values

Many companies have more than one distinct activity. Catching only one (the primary) value for each company is keeping it simple, stupid. Having more than one value in relevant cases is adding complexity but may lead to better decisions.

Feasible Names and Addresses

17th July 201029th May 2012Henrik Gabs LiliendahlLeave a comment

Most data quality technology was born in relation to the direct marketing industry back in the good old offline days. Main objectives have been deduplication of names and addresses and making names and addresses fit for mailing.

When working with data quality you have to embrace the full scope of business value in the data, here being the names and addresses.

Back in the 90’s I worked with an international fund raising organization. A main activity was sending direct mails with greeting cards for optional sale with motives related to seasonal feasts. Deduplication was a must regardless of the country (though the means was very different, but that’s for another day). Obviously the timing of the campaigns and the motives on the cards was different between countries, but also within the countries based on the names and addresses.

Two examples:

German addresses

When selecting motives for Christmas cards it’s important to observe that Protestantism is concentrated in the north and east of the country and Roman Catholicism is concentrated in the south and west. (If you think I’m out of season, well, such campaigns are planned in summertime). So, in the North and East most people prefer Christmas cards with secular motives as a lovely winter landscape. In the South and West most people will like a motive with Madonna and Child. Having well organized addresses with a connection to demographic was important.

Malaysian names

Malaysia is a very multi-ethnic society. The two largest groups being the ethnic Malayans and the Malaysians of Chinese descent have different seasonal feasts. The best way of handling this in order to fulfill the business model was to assign the names and addresses to the different campaigns based on if the name was an ethnic Malayan name or a Chinese name. Surely an exercise on the edge of what I earlier described in the post What’s in a Given Name?

	Henrik Gabs Lilienda… on Balancing the Business Partner…
	Jeppe Thing Sørensen on Balancing the Business Partner…
	peolsolutions on MDM, Cloud, SaaS, PaaS, IaaS a…
	Henrik Gabs Lilienda… on Is the Holiday Season called C…
	Michael D. on Is the Holiday Season called C…
	Jay Ram on The Disruptive MDM List is…
	Henrik Gabs Lilienda… on The Intersection of Data Obser…
	Shanker on The Intersection of Data Obser…
	Bhavani Shanker on Data Matching Efficiency
	Henrik Gabs Lilienda… on Data Matching Efficiency
	Bhavani Shanker on Data Matching Efficiency
	Henrik Gabs Lilienda… on From Platforms to Ecosyst…
	Michael Fieg on From Platforms to Ecosyst…
	From Platforms to Ec… on What is Collaborative Product…
	From Platforms to Ec… on MDM and Knowledge Graph