This year, too, I have made a New Year's resolution: I will try to avoid stupid mistakes that are actually easily avoidable.
Just before Christmas 2009 I made such a mistake in my professional work.
It’s not that I don’t have a lot of excuses. Sure I do.
The job was a very small assignment of a kind my colleagues and I have done many times before: an Excel sheet with names, addresses, phone numbers and e-mails was to be cleansed for duplicates. The client had been given a discount price. As usual, it had to be finished very quickly.
I was very busy before Christmas – but accepted this minor trivial assignment.
When the Excel sheet arrived it looked pretty straightforward: some names of healthcare organizations and of healthcare professionals working there. I processed the sheet in the Omikron Data Quality Center, scanned the result and found no false positives, exported it with the merge/purge candidates suppressed and delivered back (what I thought was) a clean sheet.
But the client came back. She had found at least 3 duplicates in the not so clean sheet. Embarrassing. It happened because I didn’t ask her (as I usually do) a few obvious questions about what constitutes a duplicate. I had even recently blogged about the challenge I missed, which I call “the echo problem”.
The problem is that many healthcare professionals hold several positions. Maybe they have a private clinic besides positions at one or several different hospitals. And for this particular purpose a given healthcare professional should appear only once.
Now, this wasn’t an MDM project where you have to build complex hierarchy structures, but one of those many downstream cleansing jobs. Yes, they exist, and I predict they will continue to do so in the decade beginning today. And sure, I could easily make a new process ending in a clean sheet fit for that particular purpose based on the data available.
Next time, this year, I will get the downstream data quality job done right the first time, so I have more time for implementing upstream data quality prevention in state-of-the-art MDM solutions.
Henrik! I partly agree. Only partly, because professionals and B2B contact names are difficult to dedupe when the only criterion you have is the name (I assume the private clinic or home address differs from the professional address in most cases). Deduping everyone named John Smith or Jens Hansen (in Denmark) would be overkill, I guess. But taking name frequency into consideration, I agree.
Hi Henrik. Thanks for the comment. Actually I also only partly agree with myself.
Many times these kinds of duplicates are infrequent and come with a lot of false positive issues, which is why the cost of the risk may be higher than the benefit.
In this case, however, it was possible to use combinations of same e-mail, phone similarity, address similarity, high person name similarity and fair organization name similarity to identify very probable representations of the same person, accounting for 3% of all original entities.
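To make the idea of combining signals a bit more concrete, here is a minimal sketch in Python. It is not how the Omikron Data Quality Center works and not the actual process used on the client's sheet. The field names, the thresholds and the rule itself (a high person name similarity plus at least one supporting signal) are assumptions chosen for illustration, and plain string similarity from the standard library stands in for proper name and address matching.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def digits(phone: str) -> str:
    """Keep only the digits of a phone number so formatting does not matter."""
    return "".join(ch for ch in phone if ch.isdigit())


def likely_same_person(a: dict, b: dict,
                       name_min: float = 0.9,      # assumed: "high" name similarity
                       support_min: float = 0.8,   # assumed: phone/address threshold
                       org_min: float = 0.6) -> bool:  # assumed: "fair" org similarity
    """Hypothetical rule: high name similarity AND at least one supporting signal."""
    if similarity(a["name"], b["name"]) < name_min:
        return False
    same_email = bool(a["email"].strip()) and \
        a["email"].strip().lower() == b["email"].strip().lower()
    phone_match = bool(digits(a["phone"])) and \
        similarity(digits(a["phone"]), digits(b["phone"])) >= support_min
    address_match = bool(a["address"].strip()) and \
        similarity(a["address"], b["address"]) >= support_min
    org_match = bool(a["org"].strip()) and \
        similarity(a["org"], b["org"]) >= org_min
    return same_email or phone_match or address_match or org_match


# Made-up records: the same professional at a hospital and at a private clinic,
# plus an unrelated person who happens to share a common name.
hospital = {"name": "Jens Hansen", "org": "City Hospital", "address": "Main Street 1",
            "phone": "+45 12 34 56 78", "email": "jens@example.com"}
clinic = {"name": "Jens Hansen", "org": "Hansen Private Clinic", "address": "Park Road 5",
          "phone": "87 65 43 21", "email": "jens@example.com"}
namesake = {"name": "Jens Hansen", "org": "Harbour Clinic", "address": "Quay 9",
            "phone": "22 33 44 55", "email": "jh@example.org"}

print(likely_same_person(hospital, clinic))    # True: same name, same e-mail
print(likely_same_person(hospital, namesake))  # False: same name, no supporting signal
```

Requiring a supporting signal on top of the name is what keeps two unrelated people called Jens Hansen apart, at the price of missing echoes where the name is the only thing in common.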