Man versus Computer. Special Edition.

Following up on my previous post on Man versus Computer, I am reminded most workday mornings of how man sucks.

Most workday mornings I leave home in my car heading into the following traffic:

  • A 4-lane motorway rolling in from southern Copenhagen, the rest of Denmark, Germany and ultimately the rest of Eurasia.
  • A 5th lane coming in from a local area.

These 5 lanes then split into:

  • 2 lanes heading for the Danish answer to Silicon Valley (called Ballerup)
  • 3 lanes leading to downtown Copenhagen or the main fair venue (called Bella Center), the airport, Sweden and the rest of Scandinavia.

Of course you would expect some mingling here. What happens every morning is rather a complete stop in traffic, and the cause is not the merging and splitting but the humans doing the driving:

  • Experienced local selfish drivers staying in the fastest lane until they suddenly want to switch lanes according to their onward route.
  • Inexperienced (in this area) foreign drivers coming up from crowded central Europe in search of tranquility deep in the Swedish forests, having no clue about where to position themselves at this intersection. The same goes for Swedes returning for the opposite reason.
  • Everyone else having fun blocking the lane switching of the selfish types and the foreigners, who should know better than to pass through in rush hour.

Some solutions to this problem might be:

  • Change management teaching people better driving habits.
  • An onboard computer in every car taking care of lane positioning. Splitting 5 lanes into 2 + 3 lanes should go smoothly.

Now I am waiting to see which solution will be implemented first.

Master Data Survivorship

A Master Data initiative is often described as making a “golden view” of all Master Data records held by an organization in various databases used by different applications serving a range of business units.

In doing that (either in the initial consolidation or the ongoing insertion and update) you will time and again encounter situations where two versions of the same element must be merged into one version of the truth.

In some MDM hub styles the decision is taken at consolidation time; in other styles the decision is postponed until the data (links) are consumed in a given context.

In the following I will talk about Party Master Data, the most common entity in Master Data initiatives.

This spring Jim Harris made a brilliant series of articles on DataQualityPro on the subject of identifying duplicate customers, ending with part number 5 dealing with survivorship. Here Jim describes all the basic considerations on how some data elements survive a merge/purge while others will be forgotten, and gives good examples with US consumers/citizens.

Taking it from there Master Data projects may have the following additional challenges and opportunities:

  • Global Data adds diversity to the rule set for consolidating data at record level as well as field level. You will have to weigh simple global rules against complex optimized rules (and supporting knowledge data) for each country/culture.
  • Multiple types of Party Master Data must be handled when Business Partners include business entities having departments and employees, and not least when they are present together with consumers/citizens.
  • External Reference Data is becoming more and more common as part of MDM solutions, adding valid, accurate and complete information about Business Partners. Here you have to set rules (at field level) for whether it overrides internal data, fills in the blanks or only supplements internal data.
  • Hierarchy building is closely related to survivorship. Rules may be set for whether two entities go into two hierarchies with surviving parts from both or merge as one with survivorship. Even an original entity may be split into two hierarchies with surviving parts.

What is essential in survivorship is not losing any valuable information while not creating information redundancy.
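The field-level rules mentioned above can be sketched in a few lines of code. This is a minimal sketch, not anyone's actual implementation: the field names, rule names and record values are hypothetical, and a real solution would add per-country rule sets.

```python
# Minimal sketch of field-level survivorship when merging an internal
# record with an external reference record. The rules per field are:
# "override" (external wins when present), "fill_blanks" (external only
# fills missing values) and "supplement" (keep both distinct values).

def survive(internal, external, rules):
    """Merge two records field by field according to a rule per field."""
    golden = {}
    for field, rule in rules.items():
        int_val = internal.get(field)
        ext_val = external.get(field)
        if rule == "override":
            golden[field] = ext_val if ext_val else int_val
        elif rule == "fill_blanks":
            golden[field] = int_val if int_val else ext_val
        elif rule == "supplement":
            values = [v for v in (int_val, ext_val) if v]
            golden[field] = values[0] if len(set(values)) == 1 else values
    return golden

internal = {"name": "Local Charity", "phone": "", "industry": "Charity"}
external = {"name": "Anytown Angels", "phone": "555-0100", "industry": "Non-profit"}
rules = {"name": "override", "phone": "fill_blanks", "industry": "supplement"}

print(survive(internal, external, rules))
# The name comes from the external source, the phone fills the blank,
# and industry keeps both values.
```

The point of the sketch is that survivorship is a per-field decision, not a per-record one.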

An example of complex survivorship processing may be this:

A membership database holds the following record (Name, Address, City):

  • Margaret & John Smith, 1 Main Street, Anytown

An eShop system has the following accounts (Name, Address, Place):

  • Mrs Margaret Smith, 1 Main Str, Anytown
  • Peggy Smith, 1 Main Street, Anytown
  • Local Charity c/o Margaret Smith, 1 Main Str, Anytown

A complex process of consolidation including survivorship may take place. As part of this example the company Local Charity is matched with an external source telling us it has a new name: Anytown Angels. The result may be this “golden view”:

ADDRESS in Anytown on Main Street no 1 having
• HOUSEHOLD having
– CONSUMER Mrs. Margaret Smith aka Peggy
– CONSUMER Mr. John Smith
• BUSINESS Anytown Angels having
– EMPLOYEE Mrs. Margaret Smith aka Peggy

Observe that everything survives in a globally applicable structure, in a fitting hierarchy reflecting local rules, handling multiple types of party entities and using external reference data.

But OK, we didn’t have funny names, dirt, misplaced data…..


Universal Pearls of Wisdom

When we are looking for what is really important and absolutely necessary to get data quality right, some sayings could be:

  • “Change management is a critical factor in ensuring long-term data quality success”.
  • “Focussing only on technology is doomed to fail”.
  • “You have to get buy-in from executive sponsors”.

These statements are in my eyes very true, and I guess everyone else will agree.

But I also notice that they are true for many other disciplines like MDM, BI, CRM, ERP, SOA, ITIL… you name it.

Also take the new SOA manifesto. I have tried to swap SOA (and the full words) with XYZ, and this is the result:

 XYZ Manifesto

XY is a paradigm that frames what you do. XYZ is a type of Z that results from applying XY. We have been applying XY to help organizations consistently deliver sustainable business value, with increased agility and cost effectiveness, in line with changing business needs. Through our work we have come to prioritize:

Business value over technical strategy

Strategic goals over project-specific benefits

Intrinsic interoperability over custom integration

Shared services over specific-purpose implementations

Flexibility over optimization

Evolutionary refinement over pursuit of initial perfection

That is, while we value the items on the right, we value the items on the left more.

I think a Data Quality manifesto, and several others, could be very close.

But what I am looking for in Data Quality is the specific pearls of wisdom related to Data and Information Quality – while I of course value being reminded about the universal ones.


Gorilla Data Quality

My previous blog post was titled “Guerrilla Data Quality”. In that post – and the excellent comments – we came to the conclusion that while we should have a 100% vision for data (or rather information) quality, most actual (and realistic) activity consists of minor steps compromising on:

  • Business unit versus enterprise wide scope
  • Single purpose versus multiple purpose capabilities
  • Reactive versus proactive approach

I think the reason why it is so is the widely used metaphor saying “Pick the low-hanging fruit first”. Such a metaphor is appealing to mankind since it relates to core activities made by our ancestors when gathering food – and still practiced by our cousins the gorillas.

Steve Sarsfield explained the logic of picking low-hanging fruit in his blog post Data Quality Project Selection by presenting the Project Selection Quadrant.

So what we are looking for now is the missing link between Gorilla / Guerrilla Data Quality and the teaching in available literature on how to get data (information) quality right.


Guerrilla Data Quality

Oh yes, in my crazy berserkergang of presenting stupid buzzword suggestions it’s time for “Guerrilla Data Quality”. And this time there are no previous hits on Google to point at as the original source.

But I noticed that “Guerrilla Data Governance” is in use, and as Data Governance and Data Quality are closely related disciplines, I think there could be something to “Guerrilla Data Quality”.

Also, recently an article called “How to set data quality goals any business can achieve” was published by Dylan Jones on DataQualityPro. Here the need for setting short term realistic goals is emphasised, in contrast to making a full size, enterprise wide, all domain massive initiative. This article focuses on the people and process side of what may be “Guerrilla Data Quality”.

Recently I wrote a blog post called “Driving Data Quality in 2 Lanes” focussing on the tool selection for what may be “Guerrilla Data Quality” and the enterprise wide follow up.

Actually I guess most Data Quality activity going on is in fact “Guerrilla Data Quality”. The problem then is that most literature and teaching on Data Quality is aimed at the massive enterprise wide implementations.

Any thoughts?

Splitting names

When working through a list of names in order to do deduplication, consolidation or identity resolution, you will meet name fields populated like these:

  • Margaret & John Smith
  • Margaret Smith. John Smith
  • Maria Dolores St. John Smith
  • Johnson & Johnson Limited
  • Johnson & Johnson Limited, John Smith
  • Johnson Furniture Inc., Sales Dept
  • Johnson, Johnson and Smith Sales Training

Some of the entities having these names must be split into two entities before we can do the proper processing.

When you as a human look at a name field, you mostly (given that you share the same culture) know what it is about.

Making a computer program that does the same is an exciting but fearful journey.

What I have been working with includes the following techniques:

  • String manipulation
  • Look up in lists of words such as given names, family names, titles, “business words” and special characters. These are country/culture specific.
  • Matching with address directories, used for checking if the address is a private residence or a business address.
  • Matching with business directories, used for checking if it is in fact a business name and which part of a name string is not included in the corresponding name.
  • Matching with consumer/citizen directories, used for checking which names are known on an address.
  • Probabilistic learning, storing and looking up previous human decisions.

As with other computer-supported data quality processes I have found it useful having the computer divide the names into 3 pots:

  • A: The ones the computer may split automatically with an accepted failure rate of false positives
  • B: The dubious ones, selected for human inspection
  • C: The clean ones where the computer has found no reason to split (with an accepted failure rate of false negatives)
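The three-pot triage can be sketched as follows. The scoring function here is a toy stand-in (a real one would combine the word lists, directory matches and learned decisions listed above), and the thresholds are illustrative assumptions:

```python
# Sketch of the three-pot triage. A hypothetical split_score function
# returns the computer's confidence (0-100) that a name field holds
# more than one entity; the thresholds are assumptions.

def triage(names, score_fn, auto_split=90, clean=20):
    """Divide names into pot A (split automatically), B (human
    inspection) and C (leave as-is) based on a confidence score."""
    pots = {"A": [], "B": [], "C": []}
    for name in names:
        score = score_fn(name)
        if score >= auto_split:
            pots["A"].append(name)   # accepted false-positive rate applies here
        elif score <= clean:
            pots["C"].append(name)   # accepted false-negative rate applies here
        else:
            pots["B"].append(name)   # dubious: forward to a human
    return pots

# Toy scoring: "&" between two given names is a strong split signal,
# while "&" inside a known business name is dubious at best.
def toy_score(name):
    if "&" in name:
        return 30 if "Limited" in name or "Inc" in name else 95
    return 10

result = triage(["Margaret & John Smith",
                 "Johnson & Johnson Limited",
                 "Maria Dolores St. John Smith"], toy_score)
print(result)
```

The thresholds are where the accepted failure rates are tuned: raising `auto_split` moves work from the computer to the humans.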

For the listed names a suggestion for the golden single version of the truth could be:

  • “Margaret & John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
  • “Margaret Smith. John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
  • “Maria Dolores St. John Smith” stays as CONSUMER “Maria Dolores St. John Smith”
  • “Johnson & Johnson Limited” stays as BUSINESS “Johnson & Johnson Limited”
  • “Johnson & Johnson Limited, John Smith” will be split into BUSINESS “Johnson & Johnson Limited” having EMPLOYEE “John Smith”
  • “Johnson Furniture Inc., Sales Dept” will be split into BUSINESS “Johnson Furniture Inc.” having DEPARTMENT “Sales Dept”
  • “Johnson, Johnson and Smith Sales Training” stays as BUSINESS “Johnson, Johnson and Smith Sales Training”

For further explanation of the Master Data Types BUSINESS, CONSUMER, DEPARTMENT, EMPLOYEE you may have a look here.


Man versus Computer

In a recent social network happening Jim Harris and Phil Simon discussed whether IT projects are like the board games Monopoly or Risk.

I notice that both these games are played with dice.

I remember back in the early 80’s I had some programming training by constructing a Yahtzee game on a computer. The following parts were at my disposal:

  • Platform: IBM 8100 minicomputer
  • Language: COBOL compiler
  • User Interface: Screen with 80 characters in 24 rows

As the user interface design options were limited, the exciting part became the one player mode where I had to teach (program) the computer which dice to save in a given situation – and base that logic on patterns rather than every possible combination.

While having some other people test the man versus computer contest in the one player mode, I found out that I could actually construct a compact program that in the long run won more rounds than (ordinary) people.

Now, what about games without dice? Here we know there has been a development, even in chess, where the computer is now better than any human.

So, what about data quality? Is it man or computer who is best at solving the matter? A blog post from Robert Barker called “Avoiding False Positives: Analytics or Humans?” weighs in on this.

Also, seen from a time and cost perspective, the computer does have some advantages compared to humans.

But still we need humans to select what game to be played. Throw the dice…


Business Rules and Duplicates

When finding or avoiding duplicates, or doing similar kinds of consolidation with party master data, you will encounter lots of situations where it is disputable what to do.

The “politically correct” answer is: it depends on your business rules.

Yeah, right. Easier said than done.

Often you face the following:

  • Business rules don’t exist. Decisions are based on common sense.
  • Business rules differ between data providers.

Let’s have an example.

We have these business rules (Owner, Brief):

Finance, No sales and deliveries to dissolved business entities
Logistics, Access to premises must be stated in Address2 if different from Address1
Sales, Every event must be registered with an active contact
Customer Service, In case of duplicate contacts the contact with the first event date wins

In a CRM system we have these 2 accounts (AccountID, CompanyName, Address1, Address2, City):

1, Restaurant San Remo, 2 Main Street, entrance thru no 4, Anytown
2, Ristorante San Remo, 2 Main Street, , Anytown

Also we have some contacts (AccountID, ContactID, JobTitle, ContactName, Status, StartYear, EventCount):

1, 1, Manager, Luigi Calda, Inactive, 2001, 2
1, 2, Chef de la Cusine, John Hothead, Active, 2002, 87
2, 1, Chef de la Cuisine, John Hothead, Duplicate, 2008, 2
2, 2, Owner, Gordon Testy, Active, 2008, 7

We are so lucky that a business directory is available now. Here we have (NationalID, Name, Address, City, Owner, Status):

3, Ristorante San Remo, 2 Main Street, Anytown, Luigi Calda, Dissolved
4, Ristorante San Remo, 2 Main Street, Anytown, Gordon Testy, Active
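As a minimal sketch of how one of these rules could be enforced in code, consider the Finance rule applied to accounts matched against the directory. The account-to-NationalID matching is assumed to be settled already, and the function name is hypothetical:

```python
# Sketch: apply the Finance rule ("no sales and deliveries to dissolved
# business entities") to CRM accounts matched against the business
# directory. Which account matches which NationalID is assumed known.

directory = {
    3: {"Name": "Ristorante San Remo", "Owner": "Luigi Calda", "Status": "Dissolved"},
    4: {"Name": "Ristorante San Remo", "Owner": "Gordon Testy", "Status": "Active"},
}

# AccountID -> matched NationalID (hypothetical match result)
matches = {1: 3, 2: 4}

def blocked_for_sales(account_id):
    """Finance rule: block accounts whose directory entry is dissolved."""
    entry = directory.get(matches.get(account_id))
    return entry is not None and entry["Status"] == "Dissolved"

print(blocked_for_sales(1))   # account 1 matched the dissolved entity
print(blocked_for_sales(2))
```

Even this toy version exposes the conflict: the rule gives opposite answers for two accounts that, on the surface, describe the same restaurant.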

So, I don’t think we will produce a golden view of this business relationship based on the data (structure) available and the business rules available.

Building and aligning business rules and data structures to solve this example, and a lot of other examples with different challenges, may seem difficult and is often omitted in the name of simplicity. But:

  • Master data – not least business partner data – is a valuable asset in the enterprise, so why treat it with simplicity while we do complex handling of a lot of other (transaction) data?
  • Common sense may help you a lot. Many of these questions are not specific to your business but are shared among most other enterprises in your industry and many others in the whole real world.
  • I guess the near future will bring an increased number of available services, with software and external data support, that help a lot in selecting common business rules and applying these in the master data processing landscape.


Mu

The term ”Mu” has several meanings, including being a lost continent. In this post I will use the meaning of “mu” as the answer to a question that can’t be answered with a simple “yes” or “no” or even “unknown”, as explained on Wikipedia here.

When working with data quality you often encounter situations where the answer to a simple question must be “mu”.

Let’s say you are looking for duplicates in a customer file and have these two rows (Name, Address, City):

Margaret Smith, 1 Main Street, Anytown
Margaret & John Smith, 1 Main Street, Anytown

Is this a duplicate situation?

In a given context like preparing for a direct mail the answer could be “yes”. But in most other contexts the answer is “mu”. Here the question should be something like: How do you handle hierarchy management with these two rows? And the answer could be something like the process presented in my recent post here.
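One way to make the “mu” answer explicit is to give the duplicate check three possible outcomes instead of two. This is a toy sketch with hypothetical context names and a deliberately naive comparison:

```python
# Sketch: duplicate decisions as a three-valued answer. Whether two
# rows are "the same" depends on the context in which the data is
# consumed; "mu" means the question cannot be answered as posed.

from enum import Enum

class Answer(Enum):
    YES = "yes"
    NO = "no"
    MU = "mu"

def is_duplicate(row_a, row_b, context):
    same_address = (row_a["Address"], row_a["City"]) == (row_b["Address"], row_b["City"])
    same_name = row_a["Name"] == row_b["Name"]
    if same_name and same_address:
        return Answer.YES
    if not same_address:
        return Answer.NO
    # Same address, differing names: for a direct mail one letter per
    # household is fine; otherwise hierarchy management is the real question.
    return Answer.YES if context == "direct_mail" else Answer.MU

a = {"Name": "Margaret Smith", "Address": "1 Main Street", "City": "Anytown"}
b = {"Name": "Margaret & John Smith", "Address": "1 Main Street", "City": "Anytown"}
print(is_duplicate(a, b, "direct_mail"))
print(is_duplicate(a, b, "billing"))
```

A “mu” result is a signal to change the question, not to force a yes/no.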

Similar considerations apply to this example (Name, Address, City):

One Truth Consultants att: John Smith, 3 Main Street, Anytown
One Truth Consultants Ltd, 3 Main Street, Anytown

And this (Contact, Company, Address, City):

John Smith, One Truth Consultants, 3 Main Street, Anytown
John Smith, One Truth Services, 3 Main Street, Anytown

The latter example is explained in more detail in this post.


Settling a Match

In a recent post on this blog we went through how a process of consolidating master data could involve a match with a business directory.

Having more than a few B2B records often calls for an automated process to do that.

So, how do you do that?

Say you have a B2B record as this (Name, HouseNo, Street, City):

  • Smashing Estate, 1, Main Street, Anytown

The business directory has the following entries (ID, Name, HouseNo, Street, City):

  • 1, Smashing Estates, , Central Square, Anytown
  • 2, Smashing Holding, 1, Main Street, Anytown
  • 3, Smashing East, 1, Main Street, Anytown
  • 4, Real Consultants, 1, Main Street, Anytown

Several different forms of functionality are used around to settle the matter.

Here are some:

Exact match:

Here no candidates at all are found.

Match codes:

Say you make a match code on input and directory rows with:

  • 4 first consonants in City
  • 4 first consonants in Street
  • 4 digit with leading zero of HouseNo
  • 4 first consonants in Name

This makes:

  • Input: NTWN-MNST-0001-SMSH
  • Directory 1: NTWN-CNTR-0000-SMSH
  • Directory 2: NTWN-MNST-0001-SMSH
  • Directory 3: NTWN-MNST-0001-SMSH
  • Directory 4: NTWN-MNST-0001-RLCN

Here directory entries 2 and 3 will be considered equal hits. You may select a random automated match or forward the case to manual inspection.

Many other and more sophisticated match code assignments exist including phonetic match codes.
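The scheme above can be sketched in a few lines. The example codes treat “y” as a vowel (“Anytown” becomes NTWN, not NTYW), so the sketch does the same; padding short values with “0” is an assumption:

```python
# Sketch of the consonant match-code scheme: first four consonants of
# City, Street and Name plus a zero-padded house number. "Y" is treated
# as a vowel to reproduce codes like NTWN for "Anytown".

def consonants(text, n=4):
    """First n consonant letters, upper-cased; short values are padded."""
    cons = [c for c in text.upper() if c.isalpha() and c not in "AEIOUY"]
    return "".join(cons[:n]).ljust(n, "0")   # padding rule is an assumption

def match_code(name, house_no, street, city):
    return "-".join([
        consonants(city),
        consonants(street),
        str(house_no or 0).zfill(4),         # missing house numbers become 0000
        consonants(name),
    ])

print(match_code("Smashing Estate", 1, "Main Street", "Anytown"))
# NTWN-MNST-0001-SMSH
print(match_code("Smashing Estates", None, "Central Square", "Anytown"))
# NTWN-CNTR-0000-SMSH
```

The appeal of match codes is that equality on the code is a cheap stand-in for fuzzy comparison; the cost is that entries 2 and 3 collide, as noted above.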

Scoring:

You may assign a similarity to each pair of elements and then calculate a total similarity score between the input and each directory row.

Often you use a percentage like measure here, where a similarity of 100 is exact, 90 is close, 75 is fair, and 50 and below is far away.


Selecting the best match candidate with this scoring will result in directory entry 3 as the winner, given that we accept automated matches with a score of 95 (and a gap of 5 points between this and the next candidate).

The assigning of similarity and calculating of total score may be (and are) implemented in many ways in different solutions.

Also the selection of candidates plays a role. If you have to select from a directory with millions of rows you may use swapped match codes and other techniques like advanced searching.
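A minimal sketch of the scoring approach, using a generic string similarity from Python’s standard library as a stand-in for the usually more sophisticated element similarity, and with hypothetical weights:

```python
# Sketch of weighted total scoring between the input record and each
# directory entry. difflib.SequenceMatcher stands in for a real element
# similarity function; the weights are assumptions.
import difflib

def similarity(a, b):
    """Percentage-like similarity: 100 is exact, 0 is missing or disjoint."""
    if not a or not b:
        return 0
    return round(100 * difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio())

def total_score(record, candidate, weights):
    return round(sum(similarity(record[f], candidate[f]) * w
                     for f, w in weights.items()) / sum(weights.values()))

record = {"Name": "Smashing Estate", "HouseNo": "1",
          "Street": "Main Street", "City": "Anytown"}
directory = [
    {"ID": 1, "Name": "Smashing Estates", "HouseNo": "",
     "Street": "Central Square", "City": "Anytown"},
    {"ID": 2, "Name": "Smashing Holding", "HouseNo": "1",
     "Street": "Main Street", "City": "Anytown"},
    {"ID": 3, "Name": "Smashing East", "HouseNo": "1",
     "Street": "Main Street", "City": "Anytown"},
    {"ID": 4, "Name": "Real Consultants", "HouseNo": "1",
     "Street": "Main Street", "City": "Anytown"},
]

weights = {"Name": 4, "HouseNo": 1, "Street": 2, "City": 1}  # name counts most
ranked = sorted(directory, key=lambda c: total_score(record, c, weights),
                reverse=True)
for c in ranked:
    print(c["ID"], total_score(record, c, weights))
```

With these toy weights the ranking happens to favour the same entry as in the example, but the absolute numbers depend entirely on the similarity function and weights chosen.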

Matrix:

The following example is based on a patented method by Dun & Bradstreet.

Based on an element similarity as above you assign a match grade with a character for each element as:

  • A being exact or very close e.g. scores above 90
  • B being close e.g. scores between 50 and 90
  • F being no match e.g. scores below 50
  • Z being missing values

Including Name, HouseNo, Street and City this will make the following match grades:

  • Directory 1: AZFA
  • Directory 2: BAAA
  • Directory 3: BAAA
  • Directory 4: FAAA

Based on the match grade you have a priority list of combinations giving a confidence code, e.g.:

  • AAAA = 10 (High)
  • BAAA = 9
  • AZAA = 8
  • A—A = 1 (Low)

Directory entries 3 and 2 will be winners with confidence code 9, remotely challenged by entry 1 with confidence code 1. Directory entry 4 is out of the game.
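The grading and lookup can be sketched like this, with the thresholds taken from the bullet list above and an abridged, illustrative priority table (the real D&B tables are far larger):

```python
# Sketch of the matrix approach: per-element similarities become letter
# grades, and the grade string is looked up in a priority table that
# yields a confidence code. The table here is abridged and illustrative.

def grade(score):
    if score is None:
        return "Z"          # missing value
    if score > 90:
        return "A"          # exact or very close
    if score >= 50:
        return "B"          # close
    return "F"              # no match

def match_grade(scores):
    """scores: similarities for Name, HouseNo, Street, City in order."""
    return "".join(grade(s) for s in scores)

CONFIDENCE = {"AAAA": 10, "BAAA": 9, "AZAA": 8}   # abridged priority list

def confidence_code(grade_string):
    return CONFIDENCE.get(grade_string, 1)        # anything else: low

# Directory entry 2 from the example: close name, exact address elements
print(match_grade([60, 100, 100, 100]), confidence_code("BAAA"))
# Directory entry 1: very close name, missing house number, street off
print(match_grade([95, None, 40, 95]), confidence_code("AZFA"))
```

The advantage over a single total score is that the grade string preserves which elements matched, so business rules can prefer, say, a missing house number over a wrong one.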

Satisfied?

I am actually not convinced that the winner should be directory entry 3 (or 2). I think directory entry 1 could be the one if we have to select anyone.

Adding additional elements:

While we may not have additional information in the input, we may derive more elements from the ones we have, not to mention that the business directory may hold many more useful elements, e.g.:

  • Geocoding may establish that there is a very short distance from “Central Square” to “1 Main Street” thus making directory 1 a better fit.
  • LOB code (e.g. SIC or NACE) may confirm that directory 2 is a holding entity which typically (but not always) is less desirable as match candidate.
  • Hierarchy code may tell that directory 3 is a branch entity which typically (but not always) is less desirable as match candidate.

Probabilistic learning:

Here you rely on, or supplement the deterministic approaches shown above with, results from confirmed matches with the same elements and the same combinations and patterns of elements.

This topic deserves a post of its own.