Today (or maybe yesterday) Steve Jones of Capgemeni wrote a blog post called Same name, same birth date – how likely is it? The post examines the likelihood of that two records with the same name and birthday is representing same real world individual. The chance that a match is a false positive is of course mainly depending on the frequency of the name.
Another angle in this context I have observed over and over again is the chance of a false negative if the name and other data are the same, but the birthday is different. In this case you may miss matching two records that are actually reflecting the same real world individual.
One should think that a datum like a birthday usually should be pretty accurate. My practical experience is that it in many cases isn’t.
Some examples:
Running against the time
Every fourth year when we have Olympic Games there is always controversies about if a tiny female athlete really is as old as said.
I have noticed the same phenomenon when I had the chance to match data about contesters from several years of subscription data at a large city marathon in order to identify “returning customers”.
I’m always looking for false positives in data matching and was really surprised when I found several examples of same name and contact data but a birthday been raised one year for each appearance at the marathon.
That’s not my birthday, this is my birthday
Swedish driving license numbers includes the birthday of the holder as the driving license number is the same as the all-purpose national ID that starts with the birthday.
In a database with both a birthday field and a driving license number field there were heaps of records with mismatch between those two fields.
This wasn’t usually discovered because this rule only applies to Swedish driving license numbers and the database also had registrations for a lot of other nationalities.
When investigating the root cause of this there were as usual not a single explanation and the problem could be both that the birthday belonged to someone else and the driving license belonged to someone else.
Using both fields cut down the number of false negatives here.
Today’s date format is?
In the United States and a few other countries it’s custom to use the month-day-year format when typing a date. In most other places we have the correct sequence of either day-month-year or year-month-day. Once I matched data concerning foreign seamen working on ships in the Danish merchant fleet. When tuning the match process I found great numbers of good matches when twisting the date formats for birthdays, as the same seaman was registered on different ships with different captains and at different ports around the world.
When adding the fact that many birthdays was typed as 1st January of the known year of birth or 1st day in the known month of birth a lot of false positives was saved.
The question about occupation in the merchant fleet was actually a political hot potato at that time and until then the parliament had discussed the matter based on wrong statistics.
PS
I have used birthday synonymously with “date of birth” which of course is a (meta) data quality problem.












