Big Reference Data – Page 15 – Liliendahl on Data Quality

Phony Phones and Real Numbers

8th December 2009 | Henrik Gabs Liliendahl | 6 Comments

There are plenty of data quality issues related to phone numbers in party master data. Although a phone number should be far less fuzzy than a name or an address, I have spent lots of time having fun with these calling digits.

Challenges include:

Completeness – Missing values
Precision – Inclusion of country codes, area codes, extensions
Reliability – Real world alignment, pseudo numbers: 1234.., 555…
Timeliness – Outdated and converted numbers
Conformity – Formatting of numbers
Uniqueness – Handling shared numbers and multiple numbers per party entity

You may improve phone number quality with these approaches:

Profiling:

Here you establish some basic ideas about the quality of a current population of phone numbers. You may look at:

Count of filled values
Minimum and maximum lengths
Represented formats – best inspected per country if international data
Minimum and maximum values – highlighting invalid numbers
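The profiling figures listed above could be gathered roughly like this. It is a minimal sketch, not a finished profiler: the function name `profile_phone_numbers`, the sample values and the idea of writing formats as patterns where every digit becomes `9` are all assumptions made for illustration.

```python
import re
from collections import Counter


def profile_phone_numbers(values):
    """Return basic profiling figures for a list of raw phone number strings."""
    # Count of filled values: ignore None, empty and whitespace-only entries
    filled = [v.strip() for v in values if v and v.strip()]
    lengths = [len(v) for v in filled]
    # Represented formats: replace every digit with 9, keep all other characters
    formats = Counter(re.sub(r"\d", "9", v) for v in filled)
    # Minimum and maximum values: compare the digits-only versions numerically
    digits = [re.sub(r"\D", "", v) for v in filled]
    digits = [d for d in digits if d]
    return {
        "total": len(values),
        "filled": len(filled),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "formats": formats,
        "min_value": min(digits, key=int) if digits else None,
        "max_value": max(digits, key=int) if digits else None,
    }


# Made-up sample values, only for illustration
sample = ["+45 12 34 56 78", "12345678", "", None, "555-0100", "(555) 0100"]
report = profile_phone_numbers(sample)
```

With international data you would run this per country, so that the format counts are comparable.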

Validation:

National numbering plans can be used as the basis for a next-level reliability check, both in batch cleansing of a current population and in upstream prevention for new entries. Here numbers not conforming to valid lengths and ranges can be flagged.

You may also classify a number as a fixed-line number or a cell number, though the boundaries are not entirely clear in many cases.

In many countries a fixed-line number includes an area code that indicates the location.
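A check against a numbering plan might be sketched as follows. The plan below is entirely hypothetical (country code `XX`, its prefix ranges and lengths are invented); a real implementation would load the published plan for each country.

```python
# Hypothetical, simplified plan: NOT a real national numbering plan.
# Each entry: (first prefix, last prefix, national number length, classification)
NUMBER_PLAN = {
    "XX": [
        ("20", "29", 8, "cell"),
        ("30", "49", 8, "fixed"),
    ],
}


def validate(country, national_number):
    """Return (is_valid, classification) for a national number in a given country."""
    digits = "".join(ch for ch in national_number if ch.isdigit())
    for low, high, length, kind in NUMBER_PLAN.get(country, []):
        # Valid when the length fits and the leading digits fall inside the range
        if len(digits) == length and low <= digits[: len(low)] <= high:
            return True, kind
    return False, None
```

The same function can serve both uses mentioned above: run over a whole column in batch cleansing, or called from an entry form to stop invalid numbers upstream.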

Match and enrichment:

Names and addresses related to missing and invalid phone numbers may be matched with phone books and other directories having phone numbers and thereby enriching your data and improving completeness.

Reality check:

Then you may of course call the number and confirm whether you are reaching the right person (or organization). I have, though, never been involved in such an activity, nor been called by someone asking only whether I am who I am.

Ongoing Data Maintenance

17th November 2009 (updated 8th January 2011) | Henrik Gabs Liliendahl | 8 Comments

Getting the data entry right at the root is important, and most (if not all) data quality professionals agree that this is a superior approach compared to doing cleansing operations downstream.

The problem, however, is that most data erodes as time passes. What was right at the time of capture will at some point no longer be right.

Therefore data entry ideally must not only be a snapshot of correct information but should also include raw data elements that make the data easily maintainable.

An obvious example: If I tell you that I am 49 years old, that may be just the piece of information you need to complete a business process. But if you ask me for my birth date, you will also have the age after a bit of calculation. Based on that raw data you will also know when I turn 50 (all too soon), and your organization will still know my age if we do business again later.
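The "bit of calculation" can be sketched like this. The function names and the birth date used in the usage lines are made up for illustration; they are not taken from any real record.

```python
from datetime import date


def age_on(birth_date, on_date):
    """Age in whole years on a given date."""
    had_birthday = (on_date.month, on_date.day) >= (birth_date.month, birth_date.day)
    return on_date.year - birth_date.year - (0 if had_birthday else 1)


def date_turning(birth_date, age):
    """Date on which a person born on birth_date reaches the given age."""
    try:
        return birth_date.replace(year=birth_date.year + age)
    except ValueError:
        # Born on 29 February and the target year is not a leap year
        return date(birth_date.year + age, 3, 1)


# Hypothetical birth date, stored once as raw data
born = date(1960, 3, 15)
age_today = age_on(born, date(2009, 11, 17))
fiftieth = date_turning(born, 50)
```

Storing the birth date rather than the age means the derived value never goes stale; it is recalculated whenever it is needed.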

Birth dates are stable personal data. Gender pretty much is too. But most other data changes over time. Names change in many cultures upon marriage and perhaps divorce, and people may change names when discovering bad numerology. People move, or a street name may be changed.

There are a great many privacy concerns around identifying individual persons, and the norms differ between countries. In Scandinavia we are used to being identified by our unique citizen ID, though here too within debatable limits. But solutions are offered for maintaining raw data that will yield valid and timely B2C information in the precision asked for when needed.

By contrast, it is broadly accepted everywhere to identify a business entity. Public sector registrations are a basic source of identifying IDs, with varying uniqueness and completeness around the world. Private providers have developed proprietary ID systems like the D-U-N-S Number from D&B. All in all, such solutions are good sources for ongoing maintenance of your B2B master data assets.

Addresses belonging to business or consumer/citizen entities – or just being addresses – are covered by external reference data for more and more spots on the Earth. Ongoing development in open government data helps with availability and completeness, and these data are often deployed in the cloud. Right now it is much about visual presentation on maps, but no doubt more services will follow.

Getting data right at entry and being able to maintain the real world alignment is the challenge if you don’t look at your data asset as a throw-away commodity.

Figure 1: one year old prime information

PS: If you forgot to maintain your data: before dumping it, Data Cleansing might be a sustainable alternative.

Sharing data is key to a single version of the truth

12th November 2009 (updated 20th October 2010) | Henrik Gabs Liliendahl | 10 Comments

This post is involved in a good-natured contest (i.e., a blog-bout) with two additional bloggers: Charles Blyth and Jim Harris. Our contest is a Blogging Olympics of sorts, with Great Britain, the United States and Denmark competing for the Gold, Silver, and Bronze medals in an event we are calling “Three Single Versions of a Shared Version of the Truth.”

Please take the time to read all three posts and then vote for who you think has won the debate (see poll below). Thanks!

My take

According to Wikipedia data may be of high quality in two alternative ways:

Either they are fit for their intended uses
Or they correctly represent the real-world construct to which they refer

In my eyes the term “single version of the truth” relates best to the real-world way of data being of high quality while “shared version of the truth” relates best to the hard work of making data fit for multiple intended uses of shared data in the enterprise.

My thesis is that, as more and more purposes are included, there is a break-even point beyond which it is less cumbersome to reflect the real-world object than to try to align all known purposes.

The map analogy

In search of this truth we will go on a little journey around the world.

For a journey we need a map.

Traditionally we have the challenge that the real world, being the planet Earth, is round (3 dimensions), but a map shows a flat world (2 dimensions). If a map shows a limited part of the world, the difference doesn’t matter that much. This is similar to fitting the purpose of use in a single business unit.

If the map shows the whole world, we may have all kinds of different projections offering different kinds of views on the world, each having some advantages and disadvantages. A classic world map is the rectangle where Alaska, Canada, Greenland, Svalbard, Siberia and Antarctica are presented much larger than in the real world if compared to regions closer to the equator. This is similar to the problems in fulfilling multiple uses embracing all business units in an enterprise.

Today we have new technology coming to the rescue. If you go into Google Earth, the world indeed looks round, and you may have any high-altitude view of an apparently round world. If you go closer, the map tends to become more and more flat. My guess is that the solutions to the multiple-uses conundrum will be offered from the cloud.

Exploiting rich external reference data

But Google Earth offers more than powerful technology. The maps are connected with rich information on places, streets, companies and so on obtained from multiple sources – and also some crowdsourced photos not always placed with accuracy. Even if external reference data is not “the truth”, these data, if used by more and more users (one instance, multiple tenants), will tend to be closer to “the truth” than any data collected and maintained solely in a single enterprise.

Shared data makes fit for purpose information

You may divide the data held by an enterprise into 3 pots:

Global data that is not unique to operations in your enterprise but shared with other enterprises in the same industry (e.g. product reference data) and eventually the whole world (e.g. business partner data and location data). Here “shared data in the cloud” will make your “single version of the truth” easier and closer to the real world.
Bilateral data concerning business partner transactions and related master data. If you for example buy a spare part then also “share the describing data” making your “single version of the truth” easier and more accurate.
Private data that is unique to operations in your enterprise. This may be a “single version of the truth” that you find superior to what others have found, data supporting internal business rules that make your company more competitive and data referring to internal events.

While private and then bilateral data make up the largest amount of data held by an enterprise, it is often the data that could be global that has the most obvious data quality issues, such as duplicated, missing, incorrect and outdated party master data.

Here “a global or bilateral shared version of the truth” helps approaching “a single version of the truth” to be shared in your enterprise. This way accurate raw data may be consumed as valuable information in a given context at once when needed.

Call to action

If not done already, please take the time to read posts from fellow bloggers Charles Blyth and Jim Harris and then vote for who you think has won the debate. A link to the same poll is provided on all three blogs. Therefore, wherever you choose to cast your vote, you will be able to view an accurate tally of the current totals.

The poll will remain open for one week, closing at midnight on 19th November so that the “medal ceremony” can be conducted via Twitter on Friday, 20th November. Additionally, please share your thoughts and perspectives on this debate by posting a comment below. Your comment may be copied (with full attribution) into the comments section of all of the blogs involved in this debate.

Vote here.

Who is working where doing what?

8th November 2009 (updated 24th July 2010) | Henrik Gabs Liliendahl | 2 Comments

A classic core data model for Master Data in CRM databases and Master Data hubs when doing B2B is that you have:

Accounts being the BUSINESS entities who are your customers, prospects and all kind of other business partners
Contacts being the EMPLOYEEs working there and acting in the roles as decision makers, influencers, gate keepers, users and so on – and having some kind of job title

Establishing and maintaining optimal data quality with B2B records is often done by integrating with external reference data.

Available sources for the account layer have been in place for many years as business directories. The D&B Worldbase is one example, but there are plenty around with varying scopes. The directories offered by service providers often also cover the contact layer. But actuality has always been a problem, and depth (completeness) has been limited, not least with large business entities. So in most cases I have witnessed, only the account level has been integrated with external reference data, while the use of external contact layer data has been limited to new market campaigns (with varying results).

With the rise of social network sites, information about employees is made more or less available to anyone. Last time I checked on LinkedIn (mid-October), the ratio of profiles to population was:

Denmark had 435,628 profiles, population 5,519,441 giving a ratio of 7.89 %.
Netherlands had 1,278,927 profiles, population 16,500,156 giving a ratio of 7.75 %
USA had 23,089,079 profiles, population 307,698,000 giving a ratio of 7.50 %.

Other countries I checked had lower ratios but fast-increasing numbers. All in all a formidable source of reference data for the contact layer.

Of course there are data quality issues with social networking sites. Data are maintained by the persons themselves which most often means good actuality and validity – but sometimes also means exaggeration and deceit. And yes, there are duplicate profiles.

Doing Social CRM is already hot stuff. Social MDM – in the meaning of exploiting social network reference data – will follow.

Slowly Changing Hierarchies

4th November 2009 (updated 23rd June 2010) | Henrik Gabs Liliendahl | 4 Comments

The term “slowly changing dimensions” is known from building data warehouses and attempting to make sense of data with business intelligence using reference data.

The fact that the world is changing all the time is also present when we look at Master Data Management and the essential hierarchy building taking place when structuring these data.

Company family trees are a common hierarchy structure in Master Data. One source of information about company family trees is the D&B Worldbase – a database operated by Dun & Bradstreet holding over 150 million business entities from all over the world.

I used to have Dun & Bradstreet as a customer. I don’t have that anymore – but I’m still working on the very same project. Since I started this assignment, US-based Dun & Bradstreet handed over the operation in a range of European countries to the Swedish publishing group Bonnier, which later handed it over to the Swedish company Bisnode. I started the project when I worked for the Swedish consultancy group Sigma, continued in my Danish sole proprietorship and now serve Bisnode through the German data quality tool vendor Omikron. Slowly changing relationships indeed.

As with many other activities in the realm of data quality, establishing the “golden view”, “the single version of the truth”, is only the beginning. If that “golden view” is not put under ongoing maintenance, the shiny gold will fade – slowly but steadily.

Master Data Survivorship

28th October 2009 (updated 2nd July 2010) | Henrik Gabs Liliendahl | 1 Comment

A Master Data initiative is often described as making a “golden view” of all Master Data records held by an organization in various databases used by different applications serving a range of business units.

In doing that (either in the initial consolidation or the ongoing insertion and update) you will time and again encounter situations where two versions of the same element must be merged into one version of the truth.

In some MDM hub styles the decision is taken at consolidation time; in other styles the decision is postponed until the data (links) is consumed in a given context.

In the following I will talk about Party Master Data being the most common entity in Master Data initiatives.

This spring Jim Harris wrote a brilliant series of articles on DataQualityPro on the subject of identifying duplicate customers, ending with part 5 dealing with survivorship. Here Jim describes all the basic considerations on how some data elements survive a merge/purge while others are forgotten, and gives good examples with US consumers/citizens.

Taking it from there Master Data projects may have the following additional challenges and opportunities:

Global Data adds diversity to the rule set for consolidating data on record level as well as field level. You will have to compromise between simple global rules and complex optimized rules (and supporting knowledge data) for each country/culture.
Multiple types of Party Master Data must be handled when Business Partners include business entities having departments and employees, and not least when they are present together with consumers/citizens.
External Reference Data is becoming more and more common as part of MDM solutions, adding valid, accurate and complete information about Business Partners. Here you have to set rules (on field level) for whether it overrides internal data, fills in the blanks or only supplements internal data.
Hierarchy building is closely related to survivorship. Rules may be set for whether two entities go into two hierarchies with surviving parts from both or merge into one with survivorship. Even an original entity may be split into two hierarchies with surviving parts.
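The field-level rules for external reference data (override, fill in the blanks, supplement) could be expressed along these lines. This is only a sketch under assumptions: the rule names, the `survive` function and the record layout are invented for illustration, and the phone number is a made-up value.

```python
def survive(internal, external, rules):
    """Merge an internal record with an external reference record, field by field.

    rules maps a field name to "override", "fill" or "supplement".
    Returns the surviving record plus the values kept only as supplements.
    """
    golden = dict(internal)
    supplements = {}
    for field, rule in rules.items():
        ext_value = external.get(field)
        if not ext_value:
            continue  # nothing to contribute from the external source
        if rule == "override":
            golden[field] = ext_value
        elif rule == "fill":
            if not golden.get(field):
                golden[field] = ext_value
        elif rule == "supplement":
            supplements[field] = ext_value  # internal value survives untouched
        else:
            raise ValueError(f"unknown rule {rule!r}")
    return golden, supplements


internal = {"name": "Local Charity", "street": "1 Main Str", "phone": ""}
external = {"name": "Anytown Angels", "street": "1 Main Street", "phone": "555-0100"}
rules = {"name": "override", "street": "supplement", "phone": "fill"}
golden, supplements = survive(internal, external, rules)
```

Keeping the supplements next to the surviving record is one way of not losing information while avoiding two competing values in the same field.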

What is essential in survivorship is not losing any valuable information while not creating information redundancy.

An example of complex survivorship processing may be this:

A membership database holds the following record (Name, Address, City):

Margaret & John Smith, 1 Main Street, Anytown

An eShop system has the following accounts (Name, Address, Place):

Mrs Margaret Smith, 1 Main Str, Anytown
Peggy Smith, 1 Main Street, Anytown
Local Charity c/o Margaret Smith, 1 Main Str, Anytown

A complex process of consolidation including survivorship may take place. As part of this example the company Local Charity is matched with an external source telling it has a new name being Anytown Angels. The result may be this “golden view”:

ADDRESS in Anytown on Main Street no 1 having
• HOUSEHOLD having
– CONSUMER Mrs. Margaret Smith aka Peggy
– CONSUMER Mr. John Smith
• BUSINESS Anytown Angels having
– EMPLOYEE Mrs. Margaret Smith aka Peggy

Observe that everything survives in a globally applicable structure, in a fit hierarchy reflecting local rules, handling multiple types of party entities and using external reference data.

But OK, we didn’t have funny names, dirt, misplaced data…..

Splitting names

21st October 2009 (updated 5th July 2010) | Henrik Gabs Liliendahl | 11 Comments

When working through a list of names in order to do deduplication, consolidation or identity resolution, you will meet name fields populated like these:

Margaret & John Smith
Margaret Smith. John Smith
Maria Dolores St. John Smith
Johnson & Johnson Limited
Johnson & Johnson Limited, John Smith
Johnson Furniture Inc., Sales Dept
Johnson, Johnson and Smith Sales Training

Some of the entities having these names must be split into two entities before we can do the proper processing.

When you as a human look at a name field, you mostly (given that you share the same culture) know what it is about.

Making a computer program that does the same is an exciting but fearful journey.

What I have been working with includes the following techniques:

String manipulation
Lookup in lists of words such as given names, family names, titles, “business words” and special characters. These are country/culture specific.
Matching with address directories, used for checking if the address is a private residence or a business address.
Matching with business directories, used for checking if it is in fact a business name and which part of a name string is not included in the corresponding name.
Matching with consumer/citizen directories, used for checking which names are known on an address.
Probabilistic learning, storing and looking up previous human decisions.

As with other data quality computer supported processes I have found it useful having the computer dividing the names into 3 pots:

A: The ones the computer may split automatically with an accepted failure rate of false positives
B: The dubious ones, selected for human inspection
C: The clean ones where the computer has found no reason to split (with an accepted failure rate of false negatives)
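Dividing into the three pots can be sketched as a simple threshold rule. The sketch assumes that the techniques above have already produced a confidence between 0 and 1 that a name should be split; the function name `triage` and both threshold values are hypothetical and would be tuned to the accepted failure rates.

```python
def triage(split_confidence, auto_split_at=0.9, clean_below=0.1):
    """Assign a name to pot A, B or C from the confidence that it needs splitting."""
    if split_confidence >= auto_split_at:
        return "A"  # split automatically, accepting some false positives
    if split_confidence < clean_below:
        return "C"  # leave as is, accepting some false negatives
    return "B"      # dubious, forward to human inspection
```

Moving the two thresholds closer together shrinks pot B and the manual workload, at the price of more false positives and false negatives in pots A and C.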

For the listed names a suggestion for the golden single version of the truth could be:

“Margaret & John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
“Margaret Smith. John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
“Maria Dolores St. John Smith” stays as CONSUMER “Maria Dolores St. John Smith”
“Johnson & Johnson Limited” stays as BUSINESS “Johnson & Johnson Limited”
“Johnson & Johnson Limited, John Smith” will be split into BUSINESS “Johnson & Johnson Limited” having EMPLOYEE “John Smith”
“Johnson Furniture Inc., Sales Dept” will be split into BUSINESS “Johnson Furniture Inc.” having DEPARTMENT “Sales Dept”
“Johnson, Johnson and Smith Sales Training” stays as BUSINESS “Johnson, Johnson and Smith Sales Training”

For further explanation of the Master Data Types BUSINESS, CONSUMER, DEPARTMENT, EMPLOYEE you may have a look here.

Settling a Match

5th October 2009 (updated 19th June 2010) | Henrik Gabs Liliendahl | 4 Comments

In a recent post on this blog we went through how a process of consolidating master data could involve a match with a business directory.

Having more than a few B2B records often calls for an automated process to do that.

So, how do you do that?

Say you have a B2B record like this (Name, HouseNo, Street, City):

Smashing Estate, 1, Main Street, Anytown

The business directory has the following entries (ID, Name, HouseNo, Street, City):

1, Smashing Estates, , Central Square, Anytown
2, Smashing Holding, 1, Main Street, Anytown
3, Smashing East, 1, Main Street, Anytown
4, Real Consultants, 1, Main Street, Anytown

Several different kinds of functionality are used to settle the matter.

Here are some:

Exact match:

Here no candidates at all are found.

Match codes:

Say you make a match code on input and directory rows with:

4 first consonants in City
4 first consonants in Street
HouseNo as 4 digits with leading zeros
4 first consonants in Name

This makes:

Input: NTWN-MNST-0001-SMSH
Directory 1: NTWN-CNTR-0000-SMSH
Directory 2: NTWN-MNST-0001-SMSH
Directory 3: NTWN-MNST-0001-SMSH
Directory 4: NTWN-MNST-0001-RLCN

Here directory entries 2 and 3 will be considered equal hits. You may select a random automated match or forward to manual inspection.
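The match code above can be reproduced with a few lines of code. This is a sketch, with two assumptions that are not spelled out in the recipe: the letter Y is treated as a vowel (which is what gives NTWN for Anytown), and a missing house number becomes 0000. The function names are made up.

```python
VOWELS = set("AEIOUY")  # assumption: Y counts as a vowel


def consonants(text, n=4):
    """First n consonants of a text, upper-cased, ignoring spaces and punctuation."""
    letters = [ch for ch in text.upper() if ch.isalpha() and ch not in VOWELS]
    return "".join(letters[:n])


def match_code(name, house_no, street, city):
    """Build a City-Street-HouseNo-Name match code."""
    digits = "".join(ch for ch in str(house_no) if ch.isdigit())
    house = digits[-4:].zfill(4)  # 4 digits with leading zeros, 0000 if missing
    return "-".join([consonants(city), consonants(street), house, consonants(name)])


input_code = match_code("Smashing Estate", "1", "Main Street", "Anytown")
directory_codes = {
    1: match_code("Smashing Estates", "", "Central Square", "Anytown"),
    2: match_code("Smashing Holding", "1", "Main Street", "Anytown"),
    3: match_code("Smashing East", "1", "Main Street", "Anytown"),
    4: match_code("Real Consultants", "1", "Main Street", "Anytown"),
}
# Candidates are the directory rows whose code equals the input code
hits = [row_id for row_id, code in directory_codes.items() if code == input_code]
```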

Many other and more sophisticated match code assignments exist including phonetic match codes.

Scoring:

You may assign a similarity between each element and then calculate a total score of similarity between the input and each directory row.

Often you use a percentage-like measure here, where a similarity of 100 is exact, 90 is close, 75 is fair, and 50 and below is far away.

Selecting the best match candidate with this scoring will result in directory entry 3 as the winner given we accept automated matches with score 95 (and a gap of 5 points between this and next candidate).

The assignment of similarity and the calculation of a total score may be (and are) implemented in many ways in different solutions.
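One of those many ways is sketched here, purely as an illustration. It assumes the similarity measure from Python's standard `difflib` module, equal weights for all four elements, and a score of 0 for a missing value; none of these choices come from a specific matching product, and other measures will give other numbers.

```python
from difflib import SequenceMatcher

FIELDS = ("name", "house_no", "street", "city")


def similarity(a, b):
    """0-100 similarity between two element values; a missing value scores 0."""
    if not a or not b:
        return 0.0
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def total_score(record, candidate):
    """Unweighted average of the element similarities."""
    return sum(similarity(record[f], candidate[f]) for f in FIELDS) / len(FIELDS)


def best_match(record, directory, min_score=95.0, min_gap=5.0):
    """Return the id of the winning row, or None if no automated match is accepted."""
    ranked = sorted(
        ((total_score(record, row), row_id) for row_id, row in directory.items()),
        reverse=True,
    )
    if not ranked:
        return None
    top_score, top_id = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    if top_score >= min_score and top_score - runner_up >= min_gap:
        return top_id
    return None


record = {"name": "Smashing Estate", "house_no": "1", "street": "Main Street", "city": "Anytown"}
directory = {
    1: {"name": "Smashing Estates", "house_no": "", "street": "Central Square", "city": "Anytown"},
    2: {"name": "Smashing Holding", "house_no": "1", "street": "Main Street", "city": "Anytown"},
    3: {"name": "Smashing East", "house_no": "1", "street": "Main Street", "city": "Anytown"},
    4: {"name": "Real Consultants", "house_no": "1", "street": "Main Street", "city": "Anytown"},
}
winner = best_match(record, directory)
```

The `min_score` and `min_gap` parameters correspond to the acceptance threshold and the required distance to the next candidate.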

Also the selection of candidates plays a role. If you have to select from a directory with millions of rows you may use swapped match codes and other techniques like advanced searching.

Matrix:

The following example is based on a patented method by Dun & Bradstreet.

Based on an element similarity as above you assign a match grade with a character for each element as:

A being exact or very close e.g. scores above 90
B being close e.g. scores between 50 and 90
F being no match e.g. scores below 50
Z being missing values

Including Name, HouseNo, Street and City this will make the following match grades:

Directory 1: AZFA
Directory 2: BAAA
Directory 3: BAAA
Directory 4: FAAA

Based on the match grade you have a priority list of combinations giving a confidence code, e.g.:

AAAA = 10 (High)
BAAA = 9
AZAA = 8
…
A--A = 1 (Low)

Directory entries 3 and 2 will be winners with confidence code 9, remotely challenged by entry 1 with confidence code 1. Directory entry 4 is out of the game.
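The grading and lookup steps can be sketched as below. This is an illustration of the idea only, not the patented method itself. It assumes that the lowest-priority pattern means "A, any grade, any grade, A", and it contains only the combinations listed above; the elided middle of the priority list is left out rather than guessed.

```python
def grade(score):
    """Map a 0-100 element similarity (None for a missing value) to A, B, F or Z."""
    if score is None:
        return "Z"
    if score > 90:
        return "A"
    if score >= 50:
        return "B"
    return "F"


def match_grade(scores):
    """Scores in the order Name, HouseNo, Street, City."""
    return "".join(grade(s) for s in scores)


# Priority list, highest first; "-" is read as a wildcard (an assumption)
CONFIDENCE = [("AAAA", 10), ("BAAA", 9), ("AZAA", 8), ("A--A", 1)]


def confidence_code(grades):
    """Return the confidence code of the first matching pattern, or None."""
    for pattern, code in CONFIDENCE:
        if all(p in ("-", g) for p, g in zip(pattern, grades)):
            return code
    return None


codes = {row: confidence_code(g) for row, g in
         {1: "AZFA", 2: "BAAA", 3: "BAAA", 4: "FAAA"}.items()}
```

A row that matches no pattern gets no confidence code at all, which is how directory entry 4 drops out.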

Satisfied?

I am actually not convinced that the winner should be directory entry 3 (or 2). I think directory entry 1 could be the one if we have to select any one.

Adding additional elements:

While we may not have additional information in the input, we may derive more elements from the elements we have, not to mention that the business directory may hold many more useful elements, e.g.:

Geocoding may establish that there is a very short distance from “Central Square” to “1 Main Street” thus making directory 1 a better fit.
LOB code (e.g. SIC or NACE) may confirm that directory 2 is a holding entity which typically (but not always) is less desirable as match candidate.
Hierarchy code may tell that directory 3 is a branch entity which typically (but not always) is less desirable as match candidate.

Probabilistic learning:

Here you either don’t rely on the deterministic approaches shown above, or you supplement them with results from previously confirmed matches involving the same elements and the same combinations and patterns of elements.

This topic deserves a post of its own.

Process of consolidating Master Data

27th September 2009 (updated 6th July 2010) | Henrik Gabs Liliendahl | 4 Comments

In my previous blog post “Multi-Purpose Data Quality” we examined a business challenge where we have multiple purposes with party master data.

The comments suggested some form of consolidation should be done with the data.

How do we do that?

I have made a PowerPoint show “Example process of consolidating master data” with a suggested way of doing that.

The process uses the party master data types explained here.

The next questions in solving our business challenge will include:

Is it necessary to have master data in optimal shape real time – or is it OK to make periodic consolidation?
How do we design processes for maintaining the master data when:
- New members and customers are inserted?
- We update existing members and customers?
- External reference data changes?
What changes must be made with the existing applications handling the member database and the eShop?

Also the question of what style of Master Data Hub is suitable is indeed very common in these kinds of implementations.

Household Householding

13th August 2009 (updated 23rd June 2010) | Henrik Gabs Liliendahl | 11 Comments

When doing B2C (business-to-consumer) activities often you really want to do B2H (business-to-household). But sometimes you also actually want B2C, having a dialogue with the individual customer. So yet again we have a Party Master Data hierarchy, here households each consisting of one or several consumers (typically a nuclear family). In Data Model language there is a parent-child relationship between households and consumers.

The classic reason for wanting to identify households is that it’s a waste of money sending several printed catalogues and other offline mailings to the same household. But a lot of other good reasons based on a shared household budget exist too.

Data captured about consumers could look like this (name, address, city):

Margaret Smith, 1 Main Street, Anytown
Margaret & John Smith, 1 Main Str, Anytown
John Smith, 1 Main Street, Anytown
Peggy Smith, 1 Main Street, Anytown
Mr. J. Smith, 1 Main Street, Anytown

Here it seems fair to assume that we have:

A HOUSEHOLD being the Smith family consisting of
A CONSUMER being Margaret nicknamed Peggy
And a CONSUMER being John
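A naive way of arriving at that household is to group the rows on a normalized address plus the family name. The sketch below does just that. It assumes that the last word of the name field is the family name and uses a one-entry abbreviation list, both of which are illustrative simplifications; the function names are made up.

```python
import re
from collections import defaultdict

ABBREVIATIONS = {"str": "street"}  # illustrative; real lists are country specific


def normalize_address(address):
    """Lower-case, strip punctuation and expand known abbreviations."""
    words = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)


def household_key(name, address, city):
    """Family name (assumed to be the last word) + normalized address + city."""
    tokens = re.findall(r"[^\W\d_]+", name)
    family_name = tokens[-1].lower() if tokens else ""
    return (family_name, normalize_address(address), city.strip().lower())


def group_households(rows):
    households = defaultdict(list)
    for name, address, city in rows:
        households[household_key(name, address, city)].append(name)
    return dict(households)


rows = [
    ("Margaret Smith", "1 Main Street", "Anytown"),
    ("Margaret & John Smith", "1 Main Str", "Anytown"),
    ("John Smith", "1 Main Street", "Anytown"),
    ("Peggy Smith", "1 Main Street", "Anytown"),
    ("Mr. J. Smith", "1 Main Street", "Anytown"),
]
households = group_households(rows)
```

Grouping only finds the household. Telling that Peggy is Margaret, or that Mr. J. Smith is John, still takes nickname lists and fuzzy matching on the individual level.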

(About party master data entity types please have a look here.)

But this is an easy example compared to what you see when working with names and addresses. Among complications I have seen are:

Households consisting of individuals with separate family names
Multi adult generation households and other kinds of households
Addresses that are not unique may lead to forming households that do not exist
Some addresses are not for traditional households, but are nursing homes, campus residence halls and the like
The time dimension: asynchronous relocation capture, marriage (couples), divorce (split)

In other words: The real world is not that simple and the picture of how households are forming does change.

Available composable methods for maintaining household information are:

Ask your customers. An obvious choice but not easy to keep going – your ROI may not be positive.
Fuzzy Data Matching. The higher the percentage of all citizens in a given region you have in your database, the better your matching may be aligned with the real world.
Exploiting external reference data. Having knowledge about public address data helps a lot. Such data may tell you about uniqueness of addresses and the attributes of the buildings there. Availability differs around the world, but the trend in open government data may help.

This is the second post in a series around hierarchies in Party Master Data and how this must be handled in data matching. Previous post was about B2B (E2E) data. Next post planned is about SOHO’s.
