The Present Birthday

Today (or maybe yesterday) Steve Jones of Capgemini wrote a blog post called Same name, same birth date – how likely is it? The post examines the likelihood that two records with the same name and birthday represent the same real-world individual. The chance that a match is a false positive of course depends mainly on the frequency of the name.

Another angle in this context, which I have observed over and over again, is the chance of a false negative if the name and other data are the same but the birthday is different. In this case you may miss matching two records that actually reflect the same real-world individual.

One would think that a datum like a birthday should usually be pretty accurate. My practical experience is that in many cases it isn’t.

Some examples:

Running against time

Every fourth year, when we have the Olympic Games, there are always controversies about whether a tiny female athlete really is as old as stated.

I noticed the same phenomenon when I had the chance to match data about contestants from several years of subscription data at a large city marathon in order to identify “returning customers”.

I’m always looking for false positives in data matching and was really surprised when I found several examples of the same name and contact data but with a birthday raised by one year for each appearance at the marathon.

That’s not my birthday, this is my birthday

Swedish driving license numbers include the birthday of the holder, as the driving license number is the same as the all-purpose national ID, which starts with the birthday.

In a database with both a birthday field and a driving license number field there were heaps of records with a mismatch between those two fields.

This usually wasn’t discovered because the rule only applies to Swedish driving license numbers and the database also had registrations for a lot of other nationalities.

When investigating the root cause there was, as usual, no single explanation: the problem could be either that the birthday belonged to someone else or that the driving license belonged to someone else.

Using both fields cut down the number of false negatives here.
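A cross-check of the two fields could look something like the minimal sketch below. It assumes the number is stored in either a 10-digit (YYMMDDNNNN) or a 12-digit (YYYYMMDDNNNN) layout, possibly with a separator, and that the caller only applies it to records known to hold a Swedish number. The function name and the sample values are made up for illustration.

```python
from datetime import date
from typing import Optional


def birthday_matches_swedish_id(birth_date: date, id_number: str) -> Optional[bool]:
    """Compare a birthday field with the date prefix of a Swedish ID number.

    Returns True or False for a 10- or 12-digit number, and None when the
    value is not in a layout this sketch understands.
    """
    digits = "".join(ch for ch in id_number if ch.isdigit())
    if len(digits) == 12:
        # Assumed layout: YYYYMMDD followed by four more digits.
        return digits[:8] == birth_date.strftime("%Y%m%d")
    if len(digits) == 10:
        # Assumed layout: YYMMDD followed by four more digits (no century).
        return digits[:6] == birth_date.strftime("%y%m%d")
    return None


# Illustrative values only.
print(birthday_matches_swedish_id(date(1964, 8, 23), "640823-3234"))
print(birthday_matches_swedish_id(date(1964, 8, 24), "19640823-3234"))
```

A False result only says the two fields disagree. As described above, it does not say which of them is wrong.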

Today’s date format is?

In the United States and a few other countries it’s customary to use the month-day-year format when typing a date. In most other places we have the correct sequence of either day-month-year or year-month-day.

Once I matched data concerning foreign seamen working on ships in the Danish merchant fleet. When tuning the match process I found great numbers of good matches when twisting the date formats for birthdays, as the same seaman was registered on different ships, by different captains and at different ports around the world.

When also taking into account that many birthdays were typed as 1st January of the known year of birth, or the 1st day of the known month of birth, a lot of false positives were avoided.
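The two observations above, swapped day and month and first-of-period placeholder dates, can be sketched as a tolerant birthday comparison. This is only an illustration under my own assumptions, not the rules used in the actual project: the function names are hypothetical, and a "compatible" result should be treated as a candidate that still needs support from the name and other data.

```python
from datetime import date


def birthday_readings(d: date) -> set:
    """Return the date itself plus its day/month-swapped reading, if valid."""
    readings = {d}
    if d.day <= 12:
        # The day can also be read as a month, so the swapped date exists.
        readings.add(date(d.year, d.day, d.month))
    return readings


def birthdays_compatible(a: date, b: date) -> bool:
    """True if a and b could plausibly describe the same date of birth."""
    # Same date, or same date once day and month are swapped.
    if birthday_readings(a) & birthday_readings(b):
        return True
    for rough, other in ((a, b), (b, a)):
        # 1st January may stand for "only the year of birth is known".
        if rough.day == 1 and rough.month == 1 and rough.year == other.year:
            return True
        # The 1st of a month may stand for "only year and month are known".
        if rough.day == 1 and (rough.year, rough.month) == (other.year, other.month):
            return True
    return False


print(birthdays_compatible(date(1960, 3, 4), date(1960, 4, 3)))
print(birthdays_compatible(date(1960, 1, 1), date(1960, 7, 19)))
print(birthdays_compatible(date(1960, 7, 2), date(1960, 7, 19)))
```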

The question about occupation in the merchant fleet was actually a political hot potato at that time and until then the parliament had discussed the matter based on wrong statistics.

PS

I have used birthday synonymously with “date of birth” which of course is a (meta) data quality problem.

Down the Street

Having an address consisting of a house number and a street name, or vice versa, is the usual way of addressing in most parts of the world. This construct is also featured in the presentation of the Universal Postal Union’s (UPU) international standard initiative (S42).

Somehow I always end up living at a place with issues in relation to this construct.

Our current address is (without unit):

“Kenny Drews Vej 27”, which would be “27 Kenny Drews Way” in an Anglophone country.

But our area has a new style of block buildings with canals in between, as we like to pretend that we live in Venice or Amsterdam.

This means that the house numbers aren’t sequenced down the street, but are spread around the block as if we were living in Japan. Google Maps shows the position exactly as it is.

Number 27 on Kenny Drews Vej is actually much closer to two other streets, which makes it very difficult when people are visiting us for the first time, and for some also the second time.

But that’s because I, and some of our visitors, are old-fashioned. As Prashanta Chan says in his blog post Geocoding: Accurate Location Master Data: It will be much better to invite folks to your geocode.

The same applies when you want some goods delivered to your premises or want a taxi as close to your front door as possible.

And regarding letters delivered by the good old postman: They will probably all be sent electronically before the UPU S42 addressing mapping standard is adopted by everyone.

Managing Client On-Boarding Data

This year I will be joining FIMA: Europe’s Premier Financial Reference Data Management Conference for Data Management Professionals. The conference is held in London from 8th to 10th November.

I will present “Diversities In Using External Registries In A Globalised World” and take part in the panel discussion “Overcoming Key Challenges In Managing Client On-Boarding Data: Opportunities & Efficiency Ideas”.

As said in the panel discussion introduction: The industry clearly needs to normalise (or is it normalize?) regional differences and establish global standards.

The concept of using external reference data in order to improve data quality within master data management has been a favorite topic of mine for a long time.

I’m not saying that external reference data is a single source of truth. Clearly external reference data may have data quality issues as exemplified in my previous blog post called Troubled Bridge Over Water.

However, I think there is a clear trend toward embracing external sources, increasingly found in the cloud, as a shortcut to keeping up data quality. I call this Data Quality 3.0.

The Achilles heel, though, has always been how to smoothly integrate external data into data entry functionality and other data capture processes and, not to forget, how to ensure ongoing maintenance in order to avoid the otherwise inevitable erosion of data quality.

Lately I have worked with a concept called instant Data Quality. The idea is to make simple yet powerful functionality that helps with hooking up to many external sources at the same time when on-boarding clients, and that makes continuous maintenance possible.

One aspect of such a concept is how to exploit the different opportunities available in each country, as public administrative practices and privacy norms vary a lot around the world.

I’m looking forward to presenting and discussing these challenges and to getting a lot of feedback.

How long is a Marathon?

Many large cities around the world have a yearly marathon event. Today it’s Copenhagen (and possibly other cities too).

The marathon distance today is 42,195 kilometers (if I use a comma as decimal point), which corresponds to 26 miles and 385 yards or 26.22 miles (if I use a dot as decimal point).
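The arithmetic behind those figures, and the ambiguity of the separator, can be checked with a small sketch. The helper names are my own, and the parser assumes that whichever of comma and dot is not the decimal separator is a thousands separator.

```python
from decimal import Decimal

METERS_PER_YARD = Decimal("0.9144")  # international yard
YARDS_PER_MILE = 1760


def parse_number(text: str, decimal_separator: str) -> Decimal:
    """Parse a number written with either a comma or a dot as decimal point."""
    thousands_separator = "," if decimal_separator == "." else "."
    cleaned = text.replace(thousands_separator, "").replace(decimal_separator, ".")
    return Decimal(cleaned)


def miles_yards_to_km(miles: int, yards: int) -> Decimal:
    """Convert a distance given in miles and yards to kilometers."""
    total_yards = miles * YARDS_PER_MILE + yards
    return total_yards * METERS_PER_YARD / 1000


print(parse_number("42,195", ","))         # read with comma as decimal point
print(parse_number("42,195", "."))         # same text read with dot as decimal point
print(round(miles_yards_to_km(26, 385), 3))
print(round(Decimal(26 * YARDS_PER_MILE + 385) / YARDS_PER_MILE, 2))
```

The same characters give a value a thousand times larger when read under the other convention, which is why the convention has to travel with the data.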

So even if we today agree about the distance, we might represent that distance in various ways. The distance has, however, varied throughout history, as seen in the table with the lengths of the Olympic marathons.

What about real world alignment?

Well, if the Greek runner called Pheidippides (sometimes spelled Phidippides or Philippides) took the long but flat Southern route from Marathon to Athens it would have been around 42 kilometers. If he took the shorter but steeper Northern route it would only have been around 35 kilometers.

What about me? Oh, I’ll go for 42,195 kilometers – on the bike.   

Foreign Affairs

There is a famous poster called The New Yorker. This poster perfectly illustrates the centricity we often have about the town, region or country we live in.

The same phenomenon is often seen in data management.

I mentioned United States centricity as a minor criticism in my recent review of the excellent book “Master Data Management and Data Governance”.

An example from the book is this statement:

“It is important to differentiate between U.S. domestic addresses and international addresses. This distinction is important for U.S.-centric MDM solutions because U.S. domestic addresses are normally better defined and therefore can be processed in a more automatic fashion, while international addresses require more manual intervention.”

The same fact could be expressed by saying:

“It is important to differentiate between Danish domestic addresses and international addresses. This distinction is important for Danish-centric MDM solutions because Danish domestic addresses are normally better defined and therefore can be processed in a more automatic fashion, while international addresses require more manual intervention.”

Only, the better formatted address in the first case is the messy address in the last case, and the better formatted address in the last case is the messy address in the first case.

If your MDM scope is country-centric it is sensible to concentrate on automation related to that country.

If your MDM scope is international there are two options:

  • The easy way: The one size fits all option. This is a moderate investment, but also, it only yields moderate results in terms of automation and data quality.
  • The hard way: You have to implement specialized automation and investigate the best external reference data for each country. I made a Danish-centric post on that last year here.

Legal Forms from Hell

When doing data matching with company names a basic challenge is that a proper company name in most cultures in most cases has two elements:

  • The actual company name
  • The legal form

Some worldwide examples:

  • Informatica Corporation
  • Talend SA
  • SAP Deutschland AG & Co. KG
  • Sony Kabushiki Kaisha
  • LEGO A/S

There are hundreds of different legal forms in full and abbreviated forms. Wikipedia has a list here (here called types of business entity).

However, when typing in company names in databases the legal form is often omitted. And even where legal forms are present they may be represented differently in full or abbreviated forms, with varying spelling and punctuation and so on. As the actual company names also suffer from this fuzziness, the complexity is overwhelming.

A common way of handling this issue in data matching is to separate the legal form and then concentrate on comparing the remaining part, the actual company name. This has to be done per country, or else you may remove the entire name of a company, as with an Italian company called Société Anonyme, which is a French legal form.
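A country-specific separation of the legal form might be sketched like this. The lookup table is a tiny hypothetical sample, nowhere near the hundreds of forms mentioned above, and the function name is made up. The sketch only strips a form that follows at least one other word, which is one possible guard against removing an entire name.

```python
from typing import Optional, Tuple

# Hypothetical, far from complete sample of legal forms per country code.
LEGAL_FORMS = {
    "FR": ["société anonyme", "sa", "sarl"],
    "DK": ["a/s", "aps"],
    "DE": ["ag & co. kg", "gmbh", "ag", "kg"],
    "US": ["corporation", "corp.", "inc.", "llc"],
    "IT": ["s.p.a.", "s.r.l."],
}


def split_legal_form(name: str, country: str) -> Tuple[str, Optional[str]]:
    """Split a company name into (actual name, legal form) for one country."""
    name = name.strip()
    lowered = name.lower()
    # Try the longest forms first so "AG & Co. KG" wins over "KG".
    for form in sorted(LEGAL_FORMS.get(country, []), key=len, reverse=True):
        # Require a preceding space: a name consisting only of a legal
        # form is left untouched.
        if lowered.endswith(" " + form):
            return name[: -len(form)].rstrip(" ,"), form
    return name, None


print(split_legal_form("LEGO A/S", "DK"))
print(split_legal_form("SAP Deutschland AG & Co. KG", "DE"))
print(split_legal_form("Société Anonyme", "IT"))
```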

While the practice of having legal forms in company names may serve well for the original purpose of knowing the risk of doing business with that entity, it is certainly not serving the purpose of having the uniqueness data quality dimension solved.

One would think that it is time to change the bad (legally demanded) practice of mixing legal forms with company names and to serve the original purpose in another, more data quality friendly way.

To be called Hamlet or Olaf – that is the question

Right now my family and I are relocating from a house in a southern suburb of Copenhagen into a flat much closer to downtown. As there is a month in between where we don’t have a place of our own, we have rented a cottage (summerhouse) north of Copenhagen not far from Kronborg Castle, which is the setting of the famous Shakespeare play Hamlet.

Therefore a data quality blog post inspired by Hamlet seems timely.

Though the feigned madness of Hamlet may be a good subject related to data quality, I will instead take a closer data matching look at the name Hamlet.

Shakespeare’s Hamlet is inspired by an old Norse legend, but to me the name Hamlet doesn’t sound very Norse.

Nor does the similar-sounding name Amleth, found in the immediate source, Saxo Grammaticus.

If Saxo’s source was a written source, it may have come from Irish monks writing in the Gaelic alphabet as Amhlaoibh, where Amhl=owl, aoi=ay and bh=v, sounding just like the good old Norse name Olav or Olaf.

So, there is a possible track from Hamlet to Olaf.

Also today, fellow data quality blogger Graham Rhind published a post called Robert the Carrot about the same issue. As Graham explains, we often see how data is changed through interfaces and, after passing through many interfaces, in the end doesn’t look at all like it did when first entered. There may be a good explanation for each transformation, but the end-to-end similarity is hard to guess when comparing only the first and the last form.

I have often met that challenge in data matching. An example would be the following names registered at the same address:

  • Pegy Smith
  • Peggy Smith
  • Margaret Smith

A synonym based similarity (or standardization) will find that Margaret and Peggy are duplicates.

An edit distance similarity will find that Peggy and Pegy are duplicates.

A combined similarity algorithm will find that all three names belong to a single duplicate group.
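The three outcomes above can be reproduced with a minimal sketch. It assumes a one-entry nickname table, a plain Levenshtein distance with a threshold of one edit, and a simple transitive grouping, and it compares lowercased given names only, since surname and address are the same. All names in the code are illustrative rather than taken from any particular matching tool.

```python
# Hypothetical synonym table: nickname -> formal given name.
NICKNAMES = {"peggy": "margaret"}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def synonym_match(a: str, b: str) -> bool:
    return NICKNAMES.get(a, a) == NICKNAMES.get(b, b)


def edit_match(a: str, b: str) -> bool:
    return edit_distance(a, b) <= 1


def combined_match(a: str, b: str) -> bool:
    return synonym_match(a, b) or edit_match(a, b)


def duplicate_groups(names, match):
    """Group names transitively: a name joins every group it matches."""
    groups = []
    for name in names:
        linked = [g for g in groups if any(match(name, other) for other in g)]
        merged = {name}.union(*linked)
        groups = [g for g in groups if g not in linked] + [merged]
    return groups


given_names = ["pegy", "peggy", "margaret"]
for matcher in (synonym_match, edit_match, combined_match):
    print(matcher.__name__, duplicate_groups(given_names, matcher))
```

Only the combined matcher links all three, because Pegy reaches Margaret through Peggy and not directly.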

Business Directory Match: Global versus Local

When doing data quality improvement in business-to-business party master data, an often-used shortcut is matching your portfolio of business customers with a business directory and preferably picking new customers from the directory in the future.

If you are doing business in more than one country you will have to consider which business directory to use: engaging with a local business directory for each country, or engaging with a single business directory covering all countries in question.

There are pros and cons.

One subject is conformity. I have met this issue a couple of times. A business directory covering many countries will have a standardized way of formatting the different elements like a postal address, whereas a local (national) business directory will use best practice for the particular country.

An example from my home country Denmark:

The Dun & Bradstreet WorldBase is a business directory holding 170 million business entities from all over the world. A Danish street address is formatted like this:

Address Line 1 = Hovedgaden 12 A, 4. th

Observe that Denmark belongs to that half of the earth where house numbers are written after the street name.

In a local business directory (based on the public registry) you will be able to get this format:

Street name = Hovedgaden
Street code = 202 4321
House number = 012A
Floor = 04
Side/door = TH

Here you get an atomized address with metadata for the atomized elements and the unique address coding used in Denmark.
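To compare records across the two directories, the single address line has to be brought into the atomized form. The sketch below handles only the pattern shown in the example (street name, house number with optional letter, optional numeric floor and side/door). The regular expression and field names are my own assumptions, and the street code is left out because it can only come from a lookup in the registry.

```python
import re

# Pattern for lines like "Hovedgaden 12 A, 4. th" (an assumption based on
# the single example above, not a full Danish address grammar).
ADDRESS_LINE = re.compile(
    r"^(?P<street>.+?)\s+(?P<number>\d+)\s*(?P<letter>[A-Za-z]?)"
    r"(?:,\s*(?P<floor>\d+)\.\s*(?P<side>\w+))?$"
)


def atomize_danish_address(line: str):
    """Split a one-line Danish street address into atomized elements."""
    match = ADDRESS_LINE.match(line.strip())
    if match is None:
        return None  # Not in a layout this sketch understands.
    return {
        "street_name": match["street"],
        # Zero-pad to three digits and append the letter, as in "012A".
        "house_number": f"{int(match['number']):03d}{match['letter'].upper()}",
        "floor": f"{int(match['floor']):02d}" if match["floor"] else None,
        "side_door": match["side"].upper() if match["side"] else None,
    }


print(atomize_danish_address("Hovedgaden 12 A, 4. th"))
```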

Feasible Names and Addresses

Most data quality technology was born in relation to the direct marketing industry back in the good old offline days. Main objectives have been deduplication of names and addresses and making names and addresses fit for mailing.

When working with data quality you have to embrace the full scope of business value in the data, here being the names and addresses.

Back in the 90’s I worked with an international fundraising organization. A main activity was sending direct mail with greeting cards for optional sale, with motifs related to seasonal feasts. Deduplication was a must regardless of the country (though the means were very different, but that’s for another day). Obviously the timing of the campaigns and the motifs on the cards differed between countries, but they also differed within countries, based on the names and addresses.

Two examples:

German addresses

When selecting motifs for Christmas cards it’s important to observe that Protestantism is concentrated in the north and east of the country and Roman Catholicism is concentrated in the south and west. (If you think I’m out of season, well, such campaigns are planned in summertime.) So, in the north and east most people prefer Christmas cards with secular motifs such as a lovely winter landscape. In the south and west most people will like a motif with Madonna and Child. Having well-organized addresses with a connection to demographics was important.

Malaysian names

Malaysia is a very multi-ethnic society. The two largest groups, the ethnic Malays and the Malaysians of Chinese descent, have different seasonal feasts. The best way of handling this in order to fulfill the business model was to assign the names and addresses to the different campaigns based on whether the name was an ethnic Malay name or a Chinese name. Surely an exercise on the edge of what I earlier described in the post What’s in a Given Name?
