Guerrilla Data Quality

Oh yes, in my crazy berserkergang of presenting stupid buzzword suggestions, it’s time for “Guerrilla Data Quality”. And this time there are no previous hits on Google to point at as the original source.

But I noticed that “Guerrilla Data Governance” is in use, and as Data Governance and Data Quality are closely related disciplines, I think there could be something to “Guerrilla Data Quality”.

Also, an article called “How to set data quality goals any business can achieve” by Dylan Jones was recently published on DataQualityPro. It emphasises the need for setting short-term, realistic goals in contrast to launching a full-size, enterprise-wide, all-domain massive initiative. This article focuses on the people and process side of what may be “Guerrilla Data Quality”.

Recently I wrote a blog post called “Driving Data Quality in 2 Lanes” focusing on the tool selection for what may be “Guerrilla Data Quality” and the enterprise-wide follow-up.

Actually, I guess most Data Quality activity going on is in fact “Guerrilla Data Quality”. The problem then is that most literature and teaching on Data Quality is aimed at massive enterprise-wide implementations.

Any thoughts?

Splitting names

When working through a list of names in order to do deduplication, consolidation or identity resolution, you will meet name fields populated like these:

  • Margaret & John Smith
  • Margaret Smith. John Smith
  • Maria Dolores St. John Smith
  • Johnson & Johnson Limited
  • Johnson & Johnson Limited, John Smith
  • Johnson Furniture Inc., Sales Dept
  • Johnson, Johnson and Smith Sales Training

Some of the entities having these names must be split into two entities before we can do the proper processing.

When you as a human look at a name field, you mostly (given that you share the same culture) know what it is about.

Making a computer program that does the same is an exciting but daunting journey.

What I have been working with includes the following techniques:

  • String manipulation
  • Lookup in lists of words such as given names, family names, titles, “business words” and special characters. These lists are country/culture specific.
  • Matching with address directories, used for checking if the address is a private residence or a business address.
  • Matching with business directories, used for checking if it is in fact a business name and which part of a name string is not included in the corresponding name.
  • Matching with consumer/citizen directories, used for checking which names are known on an address.
  • Probabilistic learning, storing and looking up previous human decisions.

As with other computer supported data quality processes I have found it useful to have the computer divide the names into 3 pots:

  • A: The ones the computer may split automatically with an accepted failure rate of false positives
  • B: The dubious ones, selected for human inspection
  • C: The clean ones where the computer has found no reason to split (with an accepted failure rate of false negatives)
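A minimal sketch of this triage in Python, assuming a hypothetical `split_confidence` scorer returning 0–100; the thresholds and the toy scorer are assumptions for illustration:

```python
# A sketch of the 3-pot triage. split_confidence is a hypothetical scorer
# returning 0-100 for "this name should be split"; thresholds are assumptions.
def triage(name, split_confidence, auto_threshold=90, clean_threshold=10):
    """Route a name into pot A (auto split), B (human review) or C (clean)."""
    score = split_confidence(name)
    if score >= auto_threshold:
        return "A"  # split automatically, accepting some false positives
    if score <= clean_threshold:
        return "C"  # leave as is, accepting some false negatives
    return "B"      # dubious: queue for human inspection

# Toy scorer for illustration: a " & " separator hints at multiple parties.
def toy_confidence(name):
    return 95 if " & " in name else 5

print(triage("Margaret & John Smith", toy_confidence))         # A
print(triage("Maria Dolores St. John Smith", toy_confidence))  # C
```

The decisions made by humans on pot B could then feed the probabilistic learning technique listed above.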

For the listed names a suggestion for the golden single version of the truth could be:

  • “Margaret & John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
  • “Margaret Smith. John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
  • “Maria Dolores St. John Smith” stays as CONSUMER “Maria Dolores St. John Smith”
  • “Johnson & Johnson Limited” stays as BUSINESS “Johnson & Johnson Limited”
  • “Johnson & Johnson Limited, John Smith” will be split into BUSINESS “Johnson & Johnson Limited” having EMPLOYEE “John Smith”
  • “Johnson Furniture Inc., Sales Dept” will be split into BUSINESS “Johnson Furniture Inc.” having DEPARTMENT “Sales Dept”
  • “Johnson, Johnson and Smith Sales Training” stays as BUSINESS “Johnson, Johnson and Smith Sales Training”
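The first two splits above could be sketched with plain string manipulation. This is a simplified, hypothetical version: a real implementation would consult the word lists first, since this naive rule alone would wrongly split business names like “Johnson & Johnson Limited”:

```python
import re

# Hypothetical sketch: split "given & given family" style names into two
# CONSUMER names, assuming the family name is the last token of the second
# part. A real implementation must first check business-word lists, or it
# would wrongly split names like "Johnson & Johnson Limited".
def split_shared_family_name(name):
    parts = re.split(r"\s*&\s*", name)
    if len(parts) != 2:
        return [name]                            # nothing to split
    first, second = parts
    if len(first.split()) == 1:                  # "Margaret" lacks a family name
        first = f"{first} {second.split()[-1]}"  # borrow it from "John Smith"
    return [first, second]

print(split_shared_family_name("Margaret & John Smith"))
# ['Margaret Smith', 'John Smith']
```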

For further explanation of the Master Data Types BUSINESS, CONSUMER, DEPARTMENT, EMPLOYEE you may have a look here.


Man versus Computer

In a recent social network happening Jim Harris and Phil Simon discussed whether IT projects are like the board games Monopoly or Risk.

I notice that both these games are played with dice.

I remember back in the early 80’s I had some programming training by constructing a Yahtzee game on a computer. The following parts were at my disposal:

  • Platform: IBM 8100 minicomputer
  • Language: COBOL compiler
  • User Interface: Screen with 80 characters in 24 rows

As the user interface design options were limited, the exciting part became the one player mode, where I had to teach (program) the computer which dice to save in a given situation – and make that logic based on patterns rather than every possible combination.

After having some other people test man versus computer in the one player mode, I found that I could actually construct a compact program that in the long run won more rounds than (ordinary) people.

Now, what about games without dice? Here we know that there has been a development even around chess, where the computer is now better than any human.

So, what about data quality? Is it man or computer who is best at solving the matter? A blog post from Robert Barker called “Avoiding False Positives: Analytics or Humans?” has a take on this.

Also, seen from a time and cost perspective, the computer does have some advantages compared to humans.

But still we need humans to select what game to be played. Throw the dice…


Business Rules and Duplicates

When finding or avoiding duplicates, or doing similar kinds of consolidation with party master data, you will encounter lots of situations where it is disputable what to do.

The “politically correct” answer is: It depends on your business rules.

Yea right. Easier said than done.

Often you face the following:

  • Business rules don’t exist. Decisions are based on common sense.
  • Business rules differ between data providers.

Let’s have an example.

We have these business rules (Owner, Brief):

Finance, No sales and deliveries to dissolved business entities
Logistics, Access to premises must be stated in Address2 if different from Address1
Sales, Every event must be registered with an active contact
Customer Service, In case of duplicate contacts the contact with the first event date wins

In a CRM system we have these 2 accounts (AccountID, CompanyName, Address1, Address2, City):

1, Restaurant San Remo, 2 Main Street, entrance thru no 4, Anytown
2, Ristorante San Remo, 2 Main Street, , Anytown

Also we have some contacts (AccountID, ContactID, JobTitle, ContactName, Status, StartYear, EventCount):

1, 1, Manager, Luigi Calda, Inactive, 2001, 2
1, 2, Chef de la Cusine, John Hothead, Active, 2002, 87
2, 1, Chef de la Cuisine, John Hothead, Duplicate, 2008, 2
2, 2, Owner, Gordon Testy, Active, 2008, 7

We are so lucky that a business directory is available now. Here we have (NationalID, Name, Address, City, Owner, Status):

3, Ristorante San Remo, 2 Main Street, Anytown, Luigi Calda, Dissolved
4, Ristorante San Remo, 2 Main Street, Anytown, Gordon Testy, Active

So, I don’t think we will produce a golden view of this business relationship based on the data (structure) available and the business rules available.

Building and aligning business rules and data structures to solve this example – and a lot of other examples with different challenges – may seem difficult and is often omitted in the name of simplicity. But:

  • Master data – not least business partners – is a valuable asset in the enterprise, so why treat it with simplicity while we do complex handling of a lot of other (transaction) data?
  • Common sense may help you a lot. Many of these questions are not specific to your business but are shared among most other enterprises in your industry and many others in the whole real world.
  • I guess the near future will bring an increased number of available services with software and external data support, helping a lot in selecting common business rules and applying them in the master data processing landscape.


Mu

The term “Mu” has several meanings, including being a lost continent. In this post I will use the meaning of “mu” being the answer to a question that can’t be answered with a simple “yes” or “no” or even “unknown”, as explained on Wikipedia here.

When working with data quality you often encounter situations where the answer to a simple question must be “mu”.

Let’s say you are looking for duplicates in a customer file and have these two rows (Name, Address, City):

Margaret Smith, 1 Main Street, Anytown
Margaret & John Smith, 1 Main Street, Anytown

Is this a duplicate situation?

In a given context like preparing for a direct mail the answer could be “yes”. But in most other contexts the answer is “mu”. Here the question should be something like: How do you handle hierarchy management with these two rows? And the answer could be something like the process presented in my recent post here.

Similar considerations apply to this example (Name, Address, City):

One Truth Consultants att: John Smith, 3 Main Street, Anytown
One Truth Consultants Ltd, 3 Main Street, Anytown

And this (Contact, Company, Address, City):

John Smith, One Truth Consultants, 3 Main Street, Anytown
John Smith, One Truth Services, 3 Main Street, Anytown

The latter example is explained in more detail in this post.


Settling a Match

In a recent post on this blog we went through how a process of consolidating master data could involve a match with a business directory.

Having more than a few B2B records often calls for an automated process to do that.

So, how do you do that?

Say you have a B2B record as this (Name, HouseNo, Street, City):

  • Smashing Estate, 1, Main Street, Anytown

The business directory has the following entries (ID, Name, HouseNo, Street, City):

  • 1, Smashing Estates, , Central Square, Anytown
  • 2, Smashing Holding, 1, Main Street, Anytown
  • 3, Smashing East, 1, Main Street, Anytown
  • 4, Real Consultants, 1, Main Street, Anytown

Several different forms of functionality are in use to settle the matter.

Here are some:

Exact match:

Here no candidates at all are found.

Match codes:

Say you make a match code on input and directory rows with:

  • The first 4 consonants of City
  • The first 4 consonants of Street
  • HouseNo as 4 digits with leading zeroes
  • The first 4 consonants of Name

This makes:

  • Input: NTWN-MNST-0001-SMSH
  • Directory 1: NTWN-CNTR-0000-SMSH
  • Directory 2: NTWN-MNST-0001-SMSH
  • Directory 3: NTWN-MNST-0001-SMSH
  • Directory 4: NTWN-MNST-0001-RLCN

Here directory entries 2 and 3 will be considered equal hits. You may select a random automated match or forward the case to manual inspection.

Many other and more sophisticated match code assignments exist including phonetic match codes.
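The match code scheme above can be sketched as follows; treating Y as a vowel is an assumption made here to reproduce the example codes (ANYTOWN gives NTWN):

```python
# Sketch of the match code scheme above. Treating Y as a vowel is an
# assumption made to reproduce the example codes (ANYTOWN -> NTWN).
def consonants4(text):
    """First 4 consonant letters of the text, uppercased."""
    cons = [c for c in text.upper() if c.isalpha() and c not in "AEIOUY"]
    return "".join(cons[:4])

def match_code(name, house_no, street, city):
    """City-Street-HouseNo-Name match code."""
    no = f"{int(house_no):04d}" if house_no else "0000"
    return f"{consonants4(city)}-{consonants4(street)}-{no}-{consonants4(name)}"

print(match_code("Smashing Estate", "1", "Main Street", "Anytown"))
# NTWN-MNST-0001-SMSH
print(match_code("Smashing Estates", "", "Central Square", "Anytown"))
# NTWN-CNTR-0000-SMSH
```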

Scoring:

You may assign a similarity to each pair of elements and then calculate a total similarity score between the input and each directory row.

Often a percentage-like measure is used here, where similarity 100 is exact, 90 is close, 75 is fair, and 50 and below is far away.

[Figure: match scores per element for each directory entry]

Selecting the best match candidate with this scoring will result in directory entry 3 as the winner, given that we accept automated matches with a score of 95 (and a gap of 5 points between this and the next candidate).

The assignment of similarity and the calculation of the total score may be (and is) implemented in many ways in different solutions.

Also the selection of candidates plays a role. If you have to select from a directory with millions of rows you may use swapped match codes and other techniques like advanced searching.
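A simple sketch of element-wise scoring, using Python’s difflib.SequenceMatcher as a stand-in for whatever edit-distance or phonetic similarity a real tool would use; the element weights are assumptions:

```python
from difflib import SequenceMatcher

# Sketch of a total match score. SequenceMatcher stands in for whatever
# edit-distance or phonetic similarity a real tool would use, and the
# element weights are assumptions.
WEIGHTS = {"name": 0.4, "house_no": 0.1, "street": 0.3, "city": 0.2}

def similarity(a, b):
    """Element similarity on a 0-100 scale."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def total_score(record, candidate):
    return sum(w * similarity(record[f], candidate[f]) for f, w in WEIGHTS.items())

record = {"name": "Smashing Estate", "house_no": "1",
          "street": "Main Street", "city": "Anytown"}
candidate = {"name": "Smashing East", "house_no": "1",
             "street": "Main Street", "city": "Anytown"}
print(round(total_score(record, candidate)))  # high score: only Name differs
```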

Matrix:

The following example is based on a patented method by Dun & Bradstreet.

Based on element similarities as above, you assign a match grade with a character for each element:

  • A being exact or very close e.g. scores above 90
  • B being close e.g. scores between 50 and 90
  • F being no match e.g. scores below 50
  • Z being missing values

Including Name, HouseNo, Street and City this will make the following match grades:

  • Directory 1: AZFA
  • Directory 2: BAAA
  • Directory 3: BAAA
  • Directory 4: FAAA

Based on the match grade you have a priority list of combinations giving a confidence code, e.g.:

  • AAAA = 10 (High)
  • BAAA = 9
  • AZAA = 8
  • A—A = 1 (Low)

Directory entries 3 and 2 will be winners with confidence code 9, remotely challenged by entry 1 with confidence code 1. Directory entry 4 is out of the game.
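The match grade assignment could be sketched like this; the thresholds follow the bullets above, and the confidence table is just an excerpt of a longer priority list:

```python
# Sketch of the match grade assignment. Thresholds follow the bullets above;
# the confidence table is just an excerpt of a longer priority list.
def grade(score):
    if score is None:
        return "Z"   # missing value
    if score > 90:
        return "A"   # exact or very close
    if score >= 50:
        return "B"   # close
    return "F"       # no match

CONFIDENCE = {"AAAA": 10, "BAAA": 9, "AZAA": 8}

def match_grade(scores):
    """Build a Name-HouseNo-Street-City grade string like 'BAAA'."""
    return "".join(grade(s) for s in scores)

# Directory entry 2: Name close, HouseNo/Street/City exact or very close.
g = match_grade([70, 100, 100, 100])
print(g, CONFIDENCE.get(g, 1))  # BAAA 9
```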

Satisfied?

I am actually not convinced that the winner should be directory entry 3 (or 2). I think directory entry 1 could be the one, if we have to select any.

Adding additional elements:

While we may not have additional information in the input, we may derive more elements from the ones we have, not to mention that the business directory may hold many more useful elements, e.g.:

  • Geocoding may establish that there is a very short distance from “Central Square” to “1 Main Street” thus making directory 1 a better fit.
  • LOB code (e.g. SIC or NACE) may confirm that directory 2 is a holding entity which typically (but not always) is less desirable as match candidate.
  • Hierarchy code may tell that directory 3 is a branch entity which typically (but not always) is less desirable as match candidate.
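As a sketch of the geocoding idea, a haversine distance check could establish how close “Central Square” is to “1 Main Street”; the coordinates below are made up for illustration:

```python
from math import asin, cos, radians, sin, sqrt

# Sketch of the geocoding check: great-circle (haversine) distance between
# two geocoded addresses. The coordinates below are made up for illustration.
def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius approx. 6371 km

main_street = (55.6761, 12.5683)     # hypothetical geocode of "1 Main Street"
central_square = (55.6772, 12.5700)  # hypothetical geocode of "Central Square"
print(f"{haversine_km(*main_street, *central_square):.2f} km")  # a short walk
```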

Probabilistic learning:

Here you rely on, or supplement the deterministic approaches shown above with, results from confirmed matches with the same elements and combinations and patterns of elements.

This topic deserves a post of its own.

LinkedIn Group Statistics

I am currently a member of 40 LinkedIn groups, mostly targeted at Master Data Management, Data Quality and Data Matching.

As I have noticed that some groups cover the same topic, I wondered if they have the same members.

So I did a quick analysis.

Within Master Data Management the largest groups seem to be:

Using the LinkedIn Profile Organizer I found that 907 are members of both groups. This is not as many as I would have guessed.

Within Data Quality the largest groups seem to be:

Using the LinkedIn Profile Organizer I found that 189 are members of both groups. This is not as many as I would have guessed, despite the renaming of the last group.

As for Data Matching I have founded the Data Matching group. The group has 235 members where:

  • 77 are also members of the two large Master Data Management groups.
  • 80 are also members of the two large Data Quality groups.

Also this is not as many as I would have guessed.

You may find many other similar groups on my LinkedIn profile – among them:


Process of consolidating Master Data


In my previous blog post “Multi-Purpose Data Quality” we examined a business challenge where we have multiple purposes with party master data.

The comments suggested some form of consolidation should be done with the data.

How do we do that?

I have made a PowerPoint show “Example process of consolidating master data” with a suggested way of doing that.

The process uses the party master data types explained here.

The next questions in solving our business challenge will include:

  • Is it necessary to have master data in optimal shape in real time – or is it OK to make periodic consolidations?
  • How do we design processes for maintaining the master data when:
    • New members and customers are inserted?
    • We update existing members and customers?
    • External reference data changes?   
  • What changes must be made with the existing applications handling the member database and the eShop?

Also the question of what style of Master Data Hub is suitable is indeed very common in these kinds of implementations.


Multi-Purpose Data Quality

Say you are an organisation within charity fundraising. For many years you have had a membership database, and recently you also introduced an eShop with related accessories.

The membership database holds the following record (Name, Address, City, YearlyContribution):

  •  Margaret & John Smith, 1 Main Street, Anytown, 100 Euro

The eShop system has the following accounts (Name, Address, Place, PurchaseInAll):

  • Mrs Margaret Smith, 1 Main Str, Anytown, 12 Euro
  • Peggy Smith, 1 Main Street, Anytown, 218 Euro
  • Local Charity c/o Margaret Smith, 1 Main Str, Anytown, 334 Euro

Now the new management wants to double contributions from members and triple eShop turnover. Based on the recommendations from “The One Truth Consulting Company” you plan to do the following:

  • Establish a platform for 1-1 dialogue with your individual members and customers
  • Analyze member and customer behaviour and profiles in order to:
    • Support the 1-1 dialogue with existing members and customers
    • Find new members and customers who are like your best members and customers

As the new management wants to stay for many years ahead, the solution must not be a one-shot exercise but must be implemented as business process reengineering with a continuous focus on the best fit of data governance, master data management and data (information) quality.

So, what are you going to do with your data so they are fit for action with the old purposes and the new purposes?

Recently I wrote some posts related to these challenges:

Any other comments on how to address these issues are welcome.


Driving Data Quality in 2 Lanes

Yesterday I visited a client in order to participate in a workshop on having more users within that organisation use a Data Quality Desktop tool.

This organisation makes use of 2 different Data Quality tools from Omikron:

  • The Data Quality Server, a complete framework of SOA enabled Data Quality functionality where we need the IT-department to be a critical part of the implementation.
  • The Data Quality Desktop tool, a user-friendly piece of Windows software installable by any PC user, but with sophisticated cleansing and matching features.

During the few hours of this workshop we were able to link several different departmental data sources to the server based MDM hub, setting up and confirming the business rules for this and reporting the foreseeable outcome of this process if it were to be repeated.

Some of the scenarios exercised will continue to run as ad hoc departmental processes and others will be upgraded into services embraced by the enterprise wide server implementation.

As I – for some reason – went by car over a long distance to this event, I had time to compare the data quality progress made by different organisations with the traffic on the roads, where we have:

  • Large buses with persons and large lorries with products being the most sustainable way of transport – but slow going and not too dynamic. Like the enterprise-wide server implementations of Data Quality tools.
  • Private cars heading for different destinations at different but faster speeds. Like the desktop Data Quality tools.

 I noticed that:

  • One lane with buses or lorries works fine but slowly.
  • One lane with private cars is a bit of a mess with some hazardous driving.
  • One lane with buses, lorries and private cars tends to be mortal.
  • 2 (or more) lanes work nicely with good driving habits.

So, encouraged by the workshop and the ride, I feel comfortable with the idea of using both kinds of Data Quality tools to have coherent, user-involved agile processes backed by some tools and a sustainable enterprise-wide solution at the same time.
