For this year too I have made this New Year's resolution: I will try to avoid stupid mistakes that are actually easily avoidable.
Just before Christmas 2009 I made such a mistake in my professional work.
It’s not that I don’t have a lot of excuses. Sure I have.
The job was a very small assignment of a kind my colleagues and I have done many times before: an Excel sheet with names, addresses, phone numbers and e-mails was to be cleansed of duplicates. The client had been given a discount price. As usual it had to be finished very quickly.
I was very busy before Christmas – but accepted this minor, trivial assignment.
When the Excel sheet arrived it looked pretty straightforward: some names of healthcare organizations and the healthcare professionals working there. I processed the sheet in the Omikron Data Quality Center, scanned the result and found no false positives, made the export while suppressing merge/purge candidates and delivered back (what I thought was) a clean sheet.
But the client came back to me. She had found at least 3 duplicates in the not so clean sheet. Embarrassing. Because I didn’t ask her (as I usually do) a few obvious questions about what would constitute a duplicate. I had even recently blogged about the challenge I missed, which I call “the echo problem”.
The problem is that many healthcare professionals have several job positions. Maybe they have a private clinic besides positions at one or several different hospitals. And for this particular purpose a given healthcare professional should only appear once.
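As a minimal sketch of that requirement, here is how one row per professional could be kept regardless of the organization listed. The column names (`name`, `organization`), the sample rows and the crude normalized-name key are all assumptions made up for illustration; this is not how the Omikron tool works, and a name-only key would wrongly merge two different people who share a name, so real matching needs more evidence.

```python
def person_key(row):
    """Crude match key (an assumption): lower-cased name without punctuation."""
    cleaned = "".join(
        ch for ch in row["name"].lower() if ch.isalnum() or ch.isspace()
    )
    return " ".join(cleaned.split())


def one_row_per_person(rows):
    """Keep the first row seen for each person, whatever organization it lists."""
    survivors = {}
    for row in rows:
        survivors.setdefault(person_key(row), row)
    return list(survivors.values())


# Made-up sample rows: the same professional at a hospital and a private clinic.
rows = [
    {"name": "Dr. Jane Doe", "organization": "Anytown Hospital"},
    {"name": "Dr Jane Doe", "organization": "Doe Private Clinic"},
    {"name": "John Roe", "organization": "Anytown Hospital"},
]
clean = one_row_per_person(rows)
```

Which of the rows should survive (hospital or clinic) is exactly the kind of question to ask the client up front; the sketch simply keeps the first one.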
Now, this wasn’t an MDM project where you have to build complex hierarchy structures but one of those many downstream cleansing jobs. Yes, they exist and I predict they will continue to do so in the decade beginning today. And sure, I could easily make a new process ending in a clean sheet fit for that particular purpose based on the data available.
Next time, this year, I will get the downstream data quality job done right the first time, so I have more time for implementing upstream data quality prevention in state-of-the-art MDM solutions.
The term “Mu” has several meanings, including being a lost continent. In this post I will use “mu” in the sense of the answer to a question that can’t be answered with a simple “yes” or “no” or even “unknown”, as explained on Wikipedia here.
When working with data quality you often encounter situations where the answer to a simple question must be “mu”.
Let’s say you are looking for duplicates in a customer file and have these two rows (Name, Address, City):
Margaret Smith, 1 Main Street, Anytown
Margaret & John Smith, 1 Main Street, Anytown
Is this a duplicate situation?
In a given context, like preparing a direct mailing, the answer could be “yes”. But in most other contexts the answer is “mu”. Here the question should be something like: How do you handle hierarchy management with these two rows? And the answer could be something like the process presented in my recent post here.
Similar considerations apply to this example (Name, Address, City):
One Truth Consultants att: John Smith, 3 Main Street, Anytown
One Truth Consultants Ltd, 3 Main Street, Anytown
And this (Contact, Company, Address, City):
John Smith, One Truth Consultants, 3 Main Street, Anytown
John Smith, One Truth Services, 3 Main Street, Anytown
The latter example is explained in more detail in this post.
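One way to picture the “mu” answer in code is a match function that returns three values instead of a boolean. The sketch below is only an illustration for the Margaret Smith rows: the `Match` enum, the context label `"direct_mail"` and the rule itself (same address, and one name's words contained in the other's) are assumptions, not a real matching algorithm.

```python
from enum import Enum


class Match(Enum):
    YES = "yes"
    NO = "no"
    MU = "mu"  # the yes/no question does not fit; hierarchy handling is needed


def name_words(name):
    """Split a name into a set of words, treating '&' as a separator."""
    return set(name.replace("&", " ").split())


def is_duplicate(row_a, row_b, context):
    """Hypothetical rule: same address, and one name contained in the other."""
    if (row_a["address"], row_a["city"]) != (row_b["address"], row_b["city"]):
        return Match.NO
    if row_a["name"] == row_b["name"]:
        return Match.YES
    words_a, words_b = name_words(row_a["name"]), name_words(row_b["name"])
    if words_a <= words_b or words_b <= words_a:
        # One household mailing is enough; other contexts need a hierarchy.
        return Match.YES if context == "direct_mail" else Match.MU
    return Match.NO


a = {"name": "Margaret Smith", "address": "1 Main Street", "city": "Anytown"}
b = {"name": "Margaret & John Smith", "address": "1 Main Street", "city": "Anytown"}
mail_answer = is_duplicate(a, b, "direct_mail")
other_answer = is_duplicate(a, b, "customer_master")
```

The point of the third value is that the caller is forced to decide what to do with such pairs, for instance routing them to a household or hierarchy step, rather than silently merging or keeping them.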
A basic structure of B2B (Business-to-Business) Party Master Data is that you have accounts, being business entities, each having one or several contacts, being employees in that business entity. These employees act in the roles of decision makers, gatekeepers, invoice receivers and so on. In data model language there is a parent-child relationship between accounts and contacts.
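The parent-child relationship can be sketched as two small record types. This is a minimal illustration only; the class and field names (`Account`, `Contact`, `role`, `add_contact`) are assumptions and not taken from any particular MDM product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Contact:
    """Child record: an employee acting in some role at an account."""
    name: str
    role: str  # e.g. "decision maker", "gatekeeper", "invoice receiver"


@dataclass
class Account:
    """Parent record: a business entity owning one or several contacts."""
    name: str
    address: str
    contacts: List[Contact] = field(default_factory=list)

    def add_contact(self, name, role):
        contact = Contact(name, role)
        self.contacts.append(contact)
        return contact


account = Account("One Truth Consultants Ltd", "3 Main Street, Anytown")
account.add_contact("John Smith", "decision maker")
```

Note that in this shape a contact exists only under one account, which is precisely the limitation that shows up when the same person turns up under two accounts.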
When doing deduplication with such data you aim to make a golden copy with unique business entities having unique contacts.
After achieving that you may gaze at the data and stumble over rows in the golden copy such as these (function, contact name, account name, address):
- HR, John Smith, Smashing Estates Ltd, Same Place in Anytown
- HR, John Smith, Smashing Solicitors Ltd, Same Place in Anytown
- IT, Tushnelda von Keine-Mustermann, The Old Treadmill Ltd, Anytown
- IT, Tushnelda von Keine-Mustermann, Brand New Brands Ltd, Anytown
Duplicates? They are probably the same real-world individuals.
John Smith is the ultimate common Anglo name, but if your favorite external business directory tells you that the two companies have the same parent and are modest-sized organizations, the probability of John Smith being the same person having the same role at the same time in two companies is very high.
Tushnelda has a very unusual name, so here there is a high probability that she has got a new job in a new company, which makes one of the entries inactive. If one is going to be selected as the active survivor, it may be chosen from the newest update, found in external reference data or investigated otherwise.
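A small sketch of that check: find contacts that appear under more than one account in the golden copy and, for the job-change case, pick the most recently updated row as the survivor. The `updated` field and its dates are invented for illustration, and the newest-update rule is only one of the options mentioned; it would be the wrong rule for a John Smith who really works for both companies at the same time.

```python
from collections import defaultdict
from datetime import date

# Rows from the golden copy; the "updated" dates are made up for this sketch.
golden_copy = [
    {"function": "IT", "contact": "Tushnelda von Keine-Mustermann",
     "account": "The Old Treadmill Ltd", "updated": date(2009, 3, 1)},
    {"function": "IT", "contact": "Tushnelda von Keine-Mustermann",
     "account": "Brand New Brands Ltd", "updated": date(2009, 11, 15)},
    {"function": "HR", "contact": "John Smith",
     "account": "Smashing Estates Ltd", "updated": date(2009, 6, 1)},
]


def cross_account_candidates(rows):
    """Group rows by (function, contact) and keep groups spanning several rows."""
    by_person = defaultdict(list)
    for row in rows:
        by_person[(row["function"], row["contact"])].append(row)
    return {key: group for key, group in by_person.items() if len(group) > 1}


def newest_survivor(group):
    """Job-change assumption: the most recently updated row is the active one."""
    return max(group, key=lambda row: row["updated"])


candidates = cross_account_candidates(golden_copy)
survivors = {
    key: newest_survivor(group)["account"] for key, group in candidates.items()
}
```

In practice the candidate groups would more likely go to a review step or be checked against external reference data before anything is marked inactive.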
B2B is often not only Business-to-Business but also E2E – Employee-to-Employee – as the relationship exists between employees in the selling and buying business entities, and it is not unusual that the relationship follows the employees when they change employer.
So striving for “one version of the truth” through a “360-degree view of the customer” is not a one-layer exercise. This fact must be modeled in the Master Data structure, supported by functionality and handled by feasible data quality prevention implementations.
It’s my plan to do some blog posts around hierarchies in Party Master Data and how they must be handled in data matching. The next post will be about B2C data.