Most data quality technology was born in the direct marketing industry back in the good old offline days. The main objectives were deduplication of names and addresses and making names and addresses fit for mailing.
When working with data quality you have to embrace the full scope of business value in the data, in this case the names and addresses.
Back in the 90s I worked with an international fundraising organization. A main activity was sending direct mail with greeting cards for optional sale, with motifs related to seasonal feasts. Deduplication was a must regardless of the country (though the means were very different, but that’s for another day). Obviously the timing of the campaigns and the motifs on the cards differed between countries, but they also differed within countries based on the names and addresses.
When selecting motifs for Christmas cards it’s important to observe that Protestantism is concentrated in the north and east of the country and Roman Catholicism is concentrated in the south and west. (If you think I’m out of season, well, such campaigns are planned in summertime.) So, in the north and east most people prefer Christmas cards with secular motifs such as a lovely winter landscape. In the south and west most people will like a motif with the Madonna and Child. Having well-organized addresses with a connection to demographics was important.
Malaysia is a very multi-ethnic society. The two largest groups, the ethnic Malays and the Malaysians of Chinese descent, have different seasonal feasts. The best way of handling this in order to fulfill the business model was to assign the names and addresses to the different campaigns based on whether the name was an ethnic Malay name or a Chinese name. Surely an exercise on the edge of what I earlier described in the post What’s in a Given Name?
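A crude sketch of that kind of assignment could look like the following; the name lists here are tiny, hypothetical stand-ins for the large, curated, culture-specific directories a real solution would rely on:

```python
# Hypothetical, tiny reference lists; a real solution would use large,
# curated, culture-specific name directories.
MALAY_GIVEN_NAMES = {"ahmad", "siti", "nur", "mohd"}
CHINESE_FAMILY_NAMES = {"tan", "lim", "lee", "wong", "ng"}

def assign_campaign(full_name: str) -> str:
    """Route a record to a campaign based on a crude name-origin guess."""
    tokens = [t.lower() for t in full_name.split()]
    if any(t in CHINESE_FAMILY_NAMES for t in tokens):
        return "chinese-new-year-campaign"
    if any(t in MALAY_GIVEN_NAMES for t in tokens):
        return "hari-raya-campaign"
    return "manual-review"  # no confident guess: a human decides

print(assign_campaign("Tan Ah Kow"))         # chinese-new-year-campaign
print(assign_campaign("Ahmad bin Abdullah"))  # hari-raya-campaign
print(assign_campaign("John Smith"))          # manual-review
```

The "manual-review" bucket matters: as noted, this is an exercise on the edge, and uncertain records should go to a human rather than the wrong campaign.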
As promised earlier today, here is the first post in an endless series of positive posts about success in data quality improvement.
This beautiful morning I finished yet another of those nice recurring jobs I do from time to time: deduplicating batches of files ready for direct marketing, making sure that only one, the whole one and nothing but one unique message reaches a given individual decision maker, be that in the online or the offline mailbox.
Most jobs are pretty similar and I have a fantastic tool that automates most of the work. I only have the pleasure of learning about the nature of the data and configuring the standardisation and matching process accordingly in a user-friendly interface. After the automated process I enjoy looking for any false positives and checking for false negatives. Sometimes I’m lucky enough to repeat the process with a slightly different configuration so we reach the best possible result.
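The skeleton of such a standardisation and matching pass might look like this toy sketch; it is not the actual tool, just stdlib string normalisation and fuzzy similarity with an assumed threshold:

```python
from difflib import SequenceMatcher

def standardise(record: str) -> str:
    """Crude standardisation: lowercase, strip punctuation, collapse spaces."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in record.lower())
    return " ".join(cleaned.split())

def find_duplicate_pairs(records, threshold=0.85):
    """Return index pairs whose standardised forms are near-identical."""
    std = [standardise(r) for r in records]
    pairs = []
    for i in range(len(std)):
        for j in range(i + 1, len(std)):
            if SequenceMatcher(None, std[i], std[j]).ratio() >= threshold:
                pairs.append((i, j))
    return pairs

records = ["John Smith, Main St. 1", "john smith main st 1", "Jane Doe, Oak Rd. 7"]
print(find_duplicate_pairs(records))  # [(0, 1)]
```

In a real job the matching engine would use phonetic codes, address parsing and country-specific rules rather than one similarity ratio, and the threshold is exactly the knob you tune when repeating the process.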
It’s a great feeling that this work reduces my clients’ mailing costs, makes them look smarter and more professional, and enables the correct measurement of response rates that is so essential in planning future, even better, direct marketing activities.
But that’s not all. I’m also delighted to have a continuing chat about how we may, over time, introduce data quality prevention upstream at the point of data entry, so we don’t have to do these recurring downstream cleansing activities anymore. It’s always fascinating going through all the different applications that many organisations are running, some of them so old that I never dreamed they still existed. Most times we are able to build a solution that will work in the given landscape, and anyway, soon the credit crunch will be totally gone and off we go.
I’ll be back again with more success from the data quality improvement frontier very soon.
I use the term “given name” here for the part of a person’s name that in most western cultures is called a “first name”.
When working with automation of data quality, master data management and data matching, you will encounter a lot of situations where you will want to mimic what we humans do when we look at a given name. And once you have done this a few times, you also learn the risks of doing so.
Here are some of the lessons I have learned:
Most given names are either for males or for females. So most of the time you instinctively know whether it is a male or a female when you look at a name. You probably also know those given names in your culture that may be both. What often creates havoc is applying the rules of one culture to data coming from a different culture. The subject was discussed on DataQualityPro here.
In some cultures salutation is paramount – not least in Germany. A correct salutation may depend on knowing the gender. The gender may be derived from the given name. But you should not use the given name itself in your greeting.
So writing to “Angela Merkel” will be “Sehr geehrte Frau Merkel” – translates to “Very honored Mrs. Merkel”.
If you have a small mistake, such as the name being “Angelo Merkel”, it will become a big mistake when you write “Sehr geehrter Herr Merkel” (Very honored Mr. Merkel) to her.
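A minimal sketch of deriving the salutation from the given name could look like this; the gender table is a tiny hypothetical stand-in for a proper name directory, and the fallback is a neutral greeting:

```python
# Hypothetical gender lookup table; real solutions use large,
# culture-specific given name directories.
GENDER_BY_GIVEN_NAME = {"angela": "F", "hans": "M", "angelo": "M"}

def german_salutation(given_name: str, family_name: str) -> str:
    """Build a formal German salutation from the gender derived from the given name."""
    gender = GENDER_BY_GIVEN_NAME.get(given_name.lower())
    if gender == "F":
        return f"Sehr geehrte Frau {family_name}"
    if gender == "M":
        return f"Sehr geehrter Herr {family_name}"
    return "Sehr geehrte Damen und Herren"  # safe fallback when gender is unknown

print(german_salutation("Angela", "Merkel"))  # Sehr geehrte Frau Merkel
```

Note that this is exactly where the Angelo/Angela mishap lives: a one-letter data error flips the lookup result, so a conservative fallback beats a confident wrong guess.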
In a recent post on the DataFlux Community of Experts Jim Harris wrote about how he received tons of direct mails assuming he was retired based on where he lives.
I have worked a bit with market segmentation and data (information) quality. I don’t know how it is with first names in the United States, but in Denmark you can estimate an age with good probability based on a given name. The national statistical bureau provides statistics for each name and birth year. So combining that with location-based demographics you will get a better response rate in direct marketing.
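As a worked example, the expected birth year can be computed as a count-weighted average over such statistics; the numbers below are made up for illustration, not real figures from the bureau:

```python
# Hypothetical counts of persons with a given name per birth decade, standing
# in for the per-name, per-birth-year statistics a national bureau publishes.
BIRTH_YEAR_COUNTS = {1950: 4200, 1960: 5100, 1970: 3300, 1980: 900, 1990: 250}

def estimated_birth_year(counts: dict) -> float:
    """Expected birth year: the count-weighted average over the statistics."""
    total = sum(counts.values())
    return sum(year * n for year, n in counts.items()) / total

print(round(estimated_birth_year(BIRTH_YEAR_COUNTS)))  # 1961
```

A segmentation job would rather use the full distribution (e.g. the probability that the person is over 60) than a single point estimate, but the principle is the same.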
Nicknames are used very differently in various cultures. In Denmark we don’t use them that much, and very seldom in business transactions. If you meet a Dane called Jim, his name is actually Jim. If you have a clever piece of software correcting/standardizing the name to James, well, that’s not very clever.
When cleansing party master data it is often necessary to typify the records in order to settle whether each one is a business entity, a private consumer, a department (or project) within a business, an employee at a business, a household, or some kind of dirt: a test, a comic name or some other illegible name and address.
Once I did such a cleansing job for a client in the farming sector. When I browsed the result looking for false positives in the illegible group, this name showed up:
- The Slurry Project (in Danish: Gylleprojektet)
Normally it could be that someone had called a really shitty project a bad name or provided dirty data for whatever reason. But in the context of the farming sector it is a good name for a project dealing with better exploitation of slurry in growing crops.
A good example of the need to be able to adjust the bad word lists according to the context when cleansing data.
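Such a context-aware check can be sketched as a generic bad word list combined with per-context exceptions; all the words below are hypothetical:

```python
# Generic bad-word list plus per-context exceptions; all words hypothetical.
BAD_WORDS = {"slurry", "test", "dummy", "asdf"}
CONTEXT_EXCEPTIONS = {"farming": {"slurry", "manure"}}

def is_suspect_name(name: str, context: str) -> bool:
    """Flag a name as illegible/dirty unless the word is normal in this context."""
    allowed = CONTEXT_EXCEPTIONS.get(context, set())
    words = {w.lower() for w in name.split()}
    return bool(words & (BAD_WORDS - allowed))

print(is_suspect_name("The Slurry Project", "farming"))  # False
print(is_suspect_name("The Slurry Project", "retail"))   # True
```

The same name is dirt in one sector and a perfectly good project name in another, which is why the exception list hangs off the context and not off the word.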
Let’s look at some statements:
• Business Intelligence and Data Mining are based on looking into historical data in order to make better decisions for the future.
• Some of the best results from Business Intelligence and Data Mining are achieved when looking at data in different ways than before.
• It’s a well-known fact that Business Intelligence and Data Mining are very much dependent on the quality of the (historical) data.
• We all agree that you should not start improving data quality (like anything else) without a solid business case.
• Upstream prevention of poor data quality is superior to downstream data cleansing.
Unfortunately the wise statements above have some serious interrelated timing issues:
• The business case can’t be established before we start to look at the data in a different way.
• Data is already stored downstream when that happens.
• Anyway, we don’t know precisely what data quality issues we have in that context before trying out possible new ways of looking at the data.
Solutions to these timing issues may be:
• Always try to have the data reflect the real world objects they represent as closely as possible – or at least include data elements that make enrichment from external sources possible.
• Accept that downstream data cleansing will be needed from time to time and be sure to have the necessary instruments for that.
This year too I have made a New Year’s resolution: I will try to avoid stupid mistakes that are actually easily avoidable.
Just before Christmas 2009 I made such a mistake in my professional work.
It’s not that I don’t have a lot of excuses. Sure I have.
The job was a very small assignment doing what my colleagues and I have done many times before: an Excel sheet with names, addresses, phone numbers and e-mails was to be cleansed of duplicates. The client had been given a discount price. As usual, it had to be finished very quickly.
I was very busy before Christmas – but accepted this minor trivial assignment.
When the Excel sheet arrived it looked pretty straightforward: some names of healthcare organizations and the healthcare professionals working there. I processed the sheet in the Omikron Data Quality Center, scanned the result and found no false positives, made the export with suppression of merge/purge candidates and delivered back (what I thought was) a clean sheet.
But the client got back to me. She had found at least 3 duplicates in the not-so-clean sheet. Embarrassing. Because I didn’t ask her (as I usually do) a few obvious questions about what would constitute a duplicate. I had even recently blogged about the very challenge I missed, the one I call “the echo problem”.
The problem is that many healthcare professionals have several job positions. Maybe they have a private clinic besides positions at one or several different hospitals. And for this particular purpose a given healthcare professional should only appear once.
Now, this wasn’t an MDM project where you have to build complex hierarchy structures, but one of those many downstream cleansing jobs. Yes, they exist, and I predict they will continue to do so in the decade beginning today. And sure, I could easily run a new process ending in a clean sheet fit for that particular purpose based on the data available.
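The fix for that echo problem can be sketched as deduplicating on a person-level key that deliberately ignores the workplace, so one professional with several positions collapses into one row; the key here is a naive name normalisation, just for illustration:

```python
def dedupe_by_person(rows):
    """Keep one row per person, even when the same person appears
    at several workplaces (the 'echo problem')."""
    def person_key(row):
        # Hypothetical key: normalised person name only, ignoring workplace.
        return " ".join(row["name"].lower().split())

    seen = {}
    for row in rows:
        seen.setdefault(person_key(row), row)  # first occurrence wins
    return list(seen.values())

rows = [
    {"name": "Anne Hansen", "workplace": "City Hospital"},
    {"name": "Anne Hansen", "workplace": "Private Clinic"},
    {"name": "Bo Jensen", "workplace": "City Hospital"},
]
print(len(dedupe_by_person(rows)))  # 2
```

The point of the story is that the choice of key, person versus person-at-workplace, is exactly the question I should have asked the client up front.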
Next time, this year, I will get the downstream data quality job done right the first time so I have more time for implementing upstream data quality prevention in state of the art MDM solutions.
Today this blog has been live for half a year, Christmas is just around the corner in countries with Christian cultural roots, and a new year – even a new decade – is closing in according to the Gregorian calendar.
It’s time for my 2010 predictions.
Over at the Informatica blog Chris Boorman and Joe McKendrick are discussing who’s going to win next year’s largest sporting event: the football (soccer) World Cup. I don’t think England, the USA, Germany (or my team Denmark) will make it. Brazil takes a co-favorite victory – and home team South Africa will go to the semi-finals.
Brazil and South Africa also had main roles in the recent Climate Summit in my hometown Copenhagen. Despite heavy executive buy-in a very weak deal with no operational Key Performance Indicators was reached here. Money was on the table – but assigned to reactive approaches.
Our hope for avoiding climate catastrophes is now related to national responsibility and technological improvements.
Reactive approach, lack of enterprise wide responsibility and reliance on technological improvements are also well known circumstances in the realm of data quality.
I think we will have to deal with this next year as well. We have to get better at working under these conditions. That means being able to perform reactive projects faster and better while also implementing prevention upstream. Aligning people, processes and technology is as key as ever in doing that.
Some areas where we will see improvements will, in my eyes, be:
- Exploiting rich external reference data
- International capabilities
- Service orientation
- Small business support
- Human like technology
The page Data Quality 2.0 has more content on these topics.
Merry Christmas and a Happy New Year.
There are plenty of data quality issues related to phone numbers in party master data. Even though a phone number should be far less fuzzy than names and addresses, I have spent lots of time having fun with these calling digits. The issues span the familiar quality dimensions:
- Completeness – Missing values
- Precision – Inclusion of country codes, area codes, extensions
- Reliability – Real world alignment, pseudo numbers: 1234.., 555…
- Timeliness – Outdated and converted numbers
- Conformity – Formatting of numbers
- Uniqueness – Handling shared numbers and multiple numbers per party entity
You may work with improving phone number quality with these approaches:
Profiling:
Here you establish some basic ideas about the quality of the current population of phone numbers. You may look at:
- Count of filled values
- Minimum and maximum lengths
- Represented formats – best inspected per country if international data
- Minimum and maximum values – highlighting invalid numbers
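The profiling measures above can be sketched in a few lines; the digit-mask trick is one simple way to surface the represented formats:

```python
from collections import Counter

def profile_phone_numbers(values):
    """Basic profiling measures for a column of phone numbers."""
    filled = [v.strip() for v in values if v and v.strip()]
    formats = Counter(
        "".join("9" if c.isdigit() else c for c in v) for v in filled
    )  # digit mask, e.g. '+45 12 34 56 78' becomes '+99 99 99 99 99'
    return {
        "filled": len(filled),
        "missing": len(values) - len(filled),
        "min_len": min((len(v) for v in filled), default=0),
        "max_len": max((len(v) for v in filled), default=0),
        "formats": formats,
    }

stats = profile_phone_numbers(["+45 12 34 56 78", "12345678", "", "5555-1234"])
print(stats["filled"], stats["missing"], stats["min_len"], stats["max_len"])  # 3 1 8 15
```

With international data you would run this per country, since the formats and lengths only make sense within one numbering plan.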
Validation:
National numbering plans can be used as the basis for a next-level check of reliability – both in batch cleansing of a current population and in upstream prevention for new entries. Here numbers not conforming to valid lengths and ranges can be flagged.
You may also classify a number as a fixed-net or a cell number – but the boundaries are not totally clear in many cases.
In many countries a fixed-net number includes an area code that tells something about the place.
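A sketch of such a number plan check; the plan fragment below is entirely hypothetical, real plans are published by the national regulators:

```python
import re

# Hypothetical fragment of a national numbering plan: per-prefix rules for
# total length and number type. Real plans come from the national regulator.
NUMBER_PLAN = {
    "2": {"length": 8, "type": "fixed"},
    "3": {"length": 8, "type": "fixed"},
    "4": {"length": 8, "type": "mobile"},
    "5": {"length": 8, "type": "mobile"},
}

def check_number(raw: str):
    """Validate a national number against the plan and classify it."""
    digits = re.sub(r"\D", "", raw)  # keep only the digits
    rule = NUMBER_PLAN.get(digits[:1])
    if rule is None or len(digits) != rule["length"]:
        return ("invalid", None)
    return ("valid", rule["type"])

print(check_number("45 67 89 01"))  # ('valid', 'mobile')
print(check_number("99 99"))        # ('invalid', None)
```

In practice the prefix rules are longer than one digit and overlap, and, as said, the fixed/mobile boundary is blurry, so the classification should be treated as a hint rather than a fact.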
Match and enrichment:
Names and addresses related to missing and invalid phone numbers may be matched with phone books and other directories containing phone numbers, thereby enriching your data and improving completeness.
Then you may of course call the number and confirm whether you are reaching the right person (or organization). I have, though, never been involved in such an activity, nor been called by someone only asking if I am who I am.
Getting the data entry right at the root is important, and it is agreed by most (if not all) data quality professionals that this approach is superior to doing cleansing operations downstream.
The problem, however, is that most data erodes as time passes. What was right at the time of capture will at some point no longer be right.
Therefore data entry should ideally not only be a snapshot of correct information but should also include raw data elements that make the data easily maintainable.
An obvious example: If I tell you that I am 49 years old, that may be just the piece of information you need for completing a business process. But if you ask for my birth date, you will have the age information too after a bit of calculation; plus, based on that raw data, you will know when I turn 50 (all too soon), and your organization will know my age if we should do business again later.
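The calculation from the raw birth date is straightforward, which is exactly why the birth date is the data element worth storing:

```python
from datetime import date

def age_on(birth_date: date, on: date) -> int:
    """Age in whole years on a given day, derived from the raw birth date."""
    had_birthday = (on.month, on.day) >= (birth_date.month, birth_date.day)
    return on.year - birth_date.year - (0 if had_birthday else 1)

birth = date(1960, 6, 15)  # hypothetical birth date
print(age_on(birth, date(2010, 1, 4)))   # 49
print(age_on(birth, date(2010, 6, 15)))  # 50
```

Storing the derived age instead would freeze a value that erodes the day it is captured; storing the birth date keeps the information maintainable forever.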
Birth dates are stable personal data. Gender is pretty stable too. But most other data changes over time. Names change in many cultures in case of marriage and perhaps divorce, and people may change names when discovering bad numerology. People move, or a street name may be changed.
There is a great deal of privacy concern around identifying individual persons, and the norms differ between countries. In Scandinavia we are used to being identified by our unique citizen IDs, though even here within debatable limitations. But solutions are offered for maintaining raw data that will yield valid and timely B2C information at whatever precision is asked for, when needed.
Identifying a business entity, on the other hand, is broadly accepted everywhere. Public sector registrations are a basic source of identifying IDs with varying uniqueness and completeness around the world. Private providers have developed proprietary ID systems like the DUNS Number from D&B. All in all, such solutions are good sources for ongoing maintenance of your B2B master data assets.
Addresses belonging to business or consumer/citizen entities – or just being addresses – are contained in external reference data covering more and more spots on the Earth. Ongoing development in open government data helps with availability and completeness, and these data are often deployed in the cloud. Right now it is mostly about visual presentation on maps, but no doubt more services will follow.
Getting data right at entry and being able to maintain the real world alignment is the challenge if you don’t look at your data asset as a throw-away commodity.
Figure 1: one year old prime information
PS: If you forgot to maintain your data: before dumping it, Data Cleansing might be a sustainable alternative.
When working through a list of names in order to do deduplication, consolidation or identity resolution, you will meet name fields populated like these:
- Margaret & John Smith
- Margaret Smith. John Smith
- Maria Dolores St. John Smith
- Johnson & Johnson Limited
- Johnson & Johnson Limited, John Smith
- Johnson Furniture Inc., Sales Dept
- Johnson, Johnson and Smith Sales Training
Some of the entities having these names must be split into two entities before we can do the proper processing.
When you as a human look at a name field, you mostly (given that you share the same culture) know what it is about.
Making a computer program that does the same is an exciting but fearful journey.
What I have been working with includes the following techniques:
- String manipulation
- Lookup in lists of words such as given names, family names, titles, “business words” and special characters. These are country/culture specific.
- Matching with address directories, used for checking if the address is a private residence or a business address.
- Matching with business directories, used for checking whether it is in fact a business name and which part of the name string is not included in the matched name.
- Matching with consumer/citizen directories, used for checking which names are known at an address.
- Probabilistic learning, storing and looking up previous human decisions.
As with other computer-supported data quality processes, I have found it useful to have the computer divide the names into 3 pots:
- A: The ones the computer may split automatically with an accepted failure rate of false positives
- B: The dubious ones, selected for human inspection
- C: The clean ones where the computer has found no reason to split (with an accepted failure rate of false negatives)
For the listed names, a suggestion for the golden single version of the truth could be:
- “Margaret & John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
- “Margaret Smith. John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
- “Maria Dolores St. John Smith” stays as CONSUMER “Maria Dolores St. John Smith”
- “Johnson & Johnson Limited” stays as BUSINESS “Johnson & Johnson Limited”
- “Johnson & Johnson Limited, John Smith” will be split into BUSINESS “Johnson & Johnson Limited” having EMPLOYEE “John Smith”
- “Johnson Furniture Inc., Sales Dept” will be split into BUSINESS “Johnson Furniture Inc.” having DEPARTMENT “Sales Dept”
- “Johnson, Johnson and Smith Sales Training” stays as BUSINESS “Johnson, Johnson and Smith Sales Training”
For further explanation of the Master Data Types BUSINESS, CONSUMER, DEPARTMENT, EMPLOYEE you may have a look here.
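To give a flavour of the techniques listed above, here is a heavily simplified splitting sketch; the word lists are tiny hypothetical stand-ins for real country-specific directories, and it only handles a couple of the patterns from the example list:

```python
# Hypothetical word lists; real solutions use large country-specific directories.
BUSINESS_WORDS = {"limited", "inc.", "inc", "ltd", "a/s", "gmbh"}
DEPARTMENT_WORDS = {"dept", "department"}

def split_name_field(field: str):
    """Return a list of (TYPE, name) parts for one raw name field."""
    def classify(part: str) -> str:
        words = {w.lower().strip(",") for w in part.split()}
        if words & BUSINESS_WORDS:
            return "BUSINESS"
        if words & DEPARTMENT_WORDS:
            return "DEPARTMENT"
        return "CONSUMER"

    # Pattern: "Business, Person-or-Dept" splits into two related entities.
    parts = [p.strip() for p in field.split(",") if p.strip()]
    if len(parts) == 2 and classify(parts[0]) == "BUSINESS":
        second = "DEPARTMENT" if classify(parts[1]) == "DEPARTMENT" else "EMPLOYEE"
        return [("BUSINESS", parts[0]), (second, parts[1])]

    # Pattern: "Margaret & John Smith" splits into two consumers sharing a family name.
    if "&" in field and classify(field) == "CONSUMER":
        a, b = [p.strip() for p in field.split("&", 1)]
        family = b.split()[-1]
        return [("CONSUMER", f"{a} {family}"), ("CONSUMER", b)]

    return [(classify(field), field)]

print(split_name_field("Margaret & John Smith"))
print(split_name_field("Johnson & Johnson Limited, John Smith"))
```

A real implementation would of course feed the dubious cases into pot B for human inspection instead of trusting string patterns alone; names like “Johnson, Johnson and Smith Sales Training” defeat any naive comma rule.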