Today (or maybe yesterday) Steve Jones of Capgemini wrote a blog post called Same name, same birth date – how likely is it? The post examines the likelihood that two records with the same name and birthday represent the same real-world individual. The chance that such a match is a false positive depends, of course, mainly on the frequency of the name.
Another angle I have observed over and over again in this context is the chance of a false negative when the name and other data are the same but the birthday is different. In this case you may miss matching two records that actually reflect the same real-world individual.
One would think that a datum like a birthday should usually be pretty accurate. My practical experience is that in many cases it isn’t.
Some examples:
Running against the time
Every fourth year, when we have the Olympic Games, there are always controversies about whether a tiny female athlete really is as old as stated.
I noticed the same phenomenon when I had the chance to match data about contestants from several years of subscription data at a large city marathon in order to identify “returning customers”.
I’m always looking for false positives in data matching and was really surprised when I found several examples of the same name and contact data but with the birthday raised by one year for each appearance at the marathon.
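A minimal sketch of how such a pattern could be flagged, assuming the records have already been grouped by name and contact data. The function name and the (event year, birth date) input shape are made up for illustration and are not taken from the marathon data:

```python
from datetime import date
from typing import List, Tuple


def birth_year_drifts(appearances: List[Tuple[int, date]]) -> bool:
    """Return True if the stated birth date moves forward in step with the event year.

    `appearances` is a hypothetical list of (event_year, stated_birth_date)
    pairs for records that already agree on name and contact data.
    """
    ordered = sorted(appearances)
    if len(ordered) < 2:
        return False
    for (year_1, born_1), (year_2, born_2) in zip(ordered, ordered[1:]):
        # Two records from the same event year say nothing about drift.
        if year_2 == year_1:
            return False
        # Day and month must stay the same; only the year is allowed to move.
        if (born_1.month, born_1.day) != (born_2.month, born_2.day):
            return False
        # The birth year must move by exactly the gap between the events.
        if born_2.year - born_1.year != year_2 - year_1:
            return False
    return True
```

A group flagged this way is a candidate for manual review rather than an automatic match, since the rule only says the stated age never changes.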
That’s not my birthday, this is my birthday
Swedish driving license numbers include the birthday of the holder, as the driving license number is the same as the all-purpose national ID, which starts with the birthday.
In a database with both a birthday field and a driving license number field there were heaps of records with a mismatch between those two fields.
This usually wasn’t discovered because the rule only applies to Swedish driving license numbers and the database also held registrations for a lot of other nationalities.
When investigating the root cause, there was as usual not a single explanation: the problem could be either that the birthday belonged to someone else or that the driving license belonged to someone else.
Using both fields cut down the number of false negatives here.
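The cross-check between the two fields can be sketched roughly as follows. This is only an illustration: it assumes the ID is stored in the 12-digit form (YYYYMMDD followed by four more digits, with or without a hyphen), and it treats anything else, including non-Swedish license numbers, as "cannot tell". The function names are hypothetical.

```python
from datetime import date
from typing import Optional


def birth_date_from_id(id_number: str) -> Optional[date]:
    """Derive a birth date from an ID assumed to start with YYYYMMDD."""
    digits = "".join(ch for ch in id_number if ch.isdigit())
    if len(digits) != 12:
        # Not in the assumed 12-digit form; the rule cannot be applied.
        return None
    try:
        return date(int(digits[:4]), int(digits[4:6]), int(digits[6:8]))
    except ValueError:
        # The leading digits do not form a valid calendar date.
        return None


def fields_agree(birth_date: date, id_number: str) -> Optional[bool]:
    """True or False when the ID yields a date, None when the rule does not apply."""
    derived = birth_date_from_id(id_number)
    if derived is None:
        return None
    return derived == birth_date
```

Returning None instead of False for IDs that do not fit the pattern keeps records of other nationalities from being reported as mismatches.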
Today’s date format is?
In the United States and a few other countries it’s customary to use the month-day-year format when typing a date. In most other places we have the correct sequence of either day-month-year or year-month-day. Once I matched data concerning foreign seamen working on ships in the Danish merchant fleet. When tuning the match process I found a great number of good matches when twisting the date formats for birthdays, as the same seaman was registered on different ships with different captains and at different ports around the world.
When I also took into account that many birthdays were typed as 1st January of the known year of birth, or as the 1st day of the known month of birth, a lot of false negatives were avoided.
The question about occupation in the merchant fleet was actually a political hot potato at that time and until then the parliament had discussed the matter based on wrong statistics.
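The date twisting and the placeholder rules described above could be sketched like this. It is a simplified illustration, not the actual match process used on the seamen data; the function names and the result labels are invented for the example.

```python
from datetime import date
from typing import Optional


def swapped(d: date) -> Optional[date]:
    """Return the date with day and month exchanged, or None if that is invalid."""
    try:
        return date(d.year, d.day, d.month)
    except ValueError:
        return None


def compare_birth_dates(a: date, b: date) -> str:
    """Classify how two stated birth dates relate to each other."""
    if a == b:
        return "exact"
    if swapped(a) == b:
        # Same digits read as day-month-year versus month-day-year.
        return "day_month_swapped"
    if a.year == b.year:
        for x, y in ((a, b), (b, a)):
            if x.month == 1 and x.day == 1:
                # 1st January is treated as "only the year was known".
                return "year_only_placeholder"
            if x.day == 1 and x.month == y.month:
                # 1st of the month is treated as "only year and month were known".
                return "month_only_placeholder"
    return "different"
```

Every result other than "exact" is weaker evidence, so in practice it would be combined with how well the name and other data agree before accepting a match.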
PS
I have used birthday synonymously with “date of birth” which of course is a (meta) data quality problem.
In India we have a different sort of problem relating to the date of birth. A large part of India’s population has no government record of their birth, as the birth took place in a remote area or at home. India has also been grappling with the problem of illegal immigrants from neighboring countries who have no proof of any sort and over time have become part of the population. Some of the older population have no proof of birth, as in those pre-independence days registration was hardly followed. Another factor to add to the mix is that the census in India was only started in the year 1931 and is held once every 10 years, and it was skipped twice after independence. When capturing the date of birth during the census for such people, a guesstimate of the age is made based on an event they might remember, like Gandhi’s Quit India Movement or the death of some big personality. In such cases relying on the date of birth could be full of issues.
When filling out my grandmother’s death registration recently we ran across a similar issue. Because my grandmother was born in rural Australia nearly 100 years ago, it took her father many days to get somewhere to register her birth. She was quite old before she found out that the day she celebrated was not the one that was officially recorded. Similarly, no one was entirely sure about the order of her middle names (she had two). Partly due to her old age (and the confusion that ensues), and partly due to poor record keeping, her name had been written down in official records with both orderings! I now work in the government department responsible for maintaining the registers of births and deaths in New Zealand, and situations like this (not to mention the lack of precision caused by grief-stricken people filling out confusing forms) cause significant issues in terms of data quality – as it becomes very hard to match records of registrations across people’s life events.
A friend of mine, at age 13, got a second middle name. IOW, he went from Joe Wilson Smith to Joe Wilson Harrison Smith. It has plagued him for 30+ years. I have often wondered how common these self-inflicted wounds are.
Final note: It used to be that you could ‘apply’ for a new social security number (SSN) and actually choose the number. And once the number was unused for 13 years they would issue it. So there was a brief fad of people changing their SSN. I read a story about a woman who changed her number to 000-00-0001. I can only imagine the confusion.
-XC
PS – Ok, final final point – women’s maiden names and hyphenation. Always a problem here.
Thanks Rohin, Doug and Cliff for sharing your experiences with date of birth and person names.