Guerrilla Data Quality

Oh yes, in my crazy berserkergang of presenting stupid buzzword suggestions, it’s time for “Guerrilla Data Quality”. And this time there are no previous hits on Google to point at as the original source.

But I noticed that “Guerrilla Data Governance” is in use, and as Data Governance and Data Quality are closely related disciplines, I think there could be something to “Guerrilla Data Quality”.

Also, an article called “How to set data quality goals any business can achieve” by Dylan Jones was recently published on DataQualityPro. It emphasises the need for setting short-term, realistic goals in contrast to launching a full-size, enterprise-wide, all-domain massive initiative. This article focuses on the people and process side of what may be “Guerrilla Data Quality”.

Recently I wrote a blog post called “Driving Data Quality in 2 Lanes” focusing on the tool selection for what may be “Guerrilla Data Quality” and the enterprise-wide follow-up.

Actually, I guess most Data Quality activity going on is in fact “Guerrilla Data Quality”. The problem then is that most literature and teaching on Data Quality is aimed at massive enterprise-wide implementations.

Any thoughts?

Splitting names

When working through a list of names in order to do deduplication, consolidation or identity resolution, you will meet name fields populated like these:

  • Margaret & John Smith
  • Margaret Smith. John Smith
  • Maria Dolores St. John Smith
  • Johnson & Johnson Limited
  • Johnson & Johnson Limited, John Smith
  • Johnson Furniture Inc., Sales Dept
  • Johnson, Johnson and Smith Sales Training

Some of the entities having these names must be split into two entities before we can do the proper processing.

When you as a human look at a name field, you mostly (given that you share the same culture) know what it is about.

Making a computer program that does the same is an exciting but daunting journey.

What I have been working with includes the following techniques:

  • String manipulation
  • Lookup in lists of words such as given names, family names, titles, “business words” and special characters. These lists are country/culture specific.
  • Matching with address directories, used for checking if the address is a private residence or a business address.
  • Matching with business directories, used for checking if it is in fact a business name and which part of a name string is not included in the corresponding name.
  • Matching with consumer/citizen directories, used for checking which names are known on an address.
  • Probabilistic learning, storing and looking up previous human decisions.

As with other computer supported data quality processes I have found it useful to have the computer divide the names into 3 pots:

  • A: The ones the computer may split automatically with an accepted failure rate of false positives
  • B: The dubious ones, selected for human inspection
  • C: The clean ones where the computer has found no reason to split (with an accepted failure rate of false negatives)
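A minimal sketch of this triage in Python, assuming a hypothetical `split_confidence` scorer returning 0–100; the thresholds and the toy scorer are assumptions for illustration:

```python
# A sketch of the 3-pot triage. split_confidence is a hypothetical scorer
# returning 0-100 for "this name should be split"; thresholds are assumptions.
def triage(name, split_confidence, auto_threshold=90, clean_threshold=10):
    """Route a name into pot A (auto split), B (human review) or C (clean)."""
    score = split_confidence(name)
    if score >= auto_threshold:
        return "A"  # split automatically, accepting some false positives
    if score <= clean_threshold:
        return "C"  # leave as is, accepting some false negatives
    return "B"      # dubious: queue for human inspection

# Toy scorer for illustration: a " & " separator hints at multiple parties.
def toy_confidence(name):
    return 95 if " & " in name else 5

print(triage("Margaret & John Smith", toy_confidence))         # A
print(triage("Maria Dolores St. John Smith", toy_confidence))  # C
```

The decisions made by humans on pot B could then feed the probabilistic learning technique listed above.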

For the listed names a suggestion for the golden single version of the truth could be:

  • “Margaret & John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
  • “Margaret Smith. John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
  • “Maria Dolores St. John Smith” stays as CONSUMER “Maria Dolores St. John Smith”
  • “Johnson & Johnson Limited” stays as BUSINESS “Johnson & Johnson Limited”
  • “Johnson & Johnson Limited, John Smith” will be split into BUSINESS “Johnson & Johnson Limited” having EMPLOYEE “John Smith”
  • “Johnson Furniture Inc., Sales Dept” will be split into BUSINESS “Johnson Furniture Inc.” having DEPARTMENT “Sales Dept”
  • “Johnson, Johnson and Smith Sales Training” stays as BUSINESS “Johnson, Johnson and Smith Sales Training”
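The first two splits above could be sketched with plain string manipulation. This is a simplified, hypothetical version: a real implementation would consult the word lists first, since this naive rule alone would wrongly split business names like “Johnson & Johnson Limited”:

```python
import re

# Hypothetical sketch: split "given & given family" style names into two
# CONSUMER names, assuming the family name is the last token of the second
# part. A real implementation must first check business-word lists, or it
# would wrongly split names like "Johnson & Johnson Limited".
def split_shared_family_name(name):
    parts = re.split(r"\s*&\s*", name)
    if len(parts) != 2:
        return [name]                            # nothing to split
    first, second = parts
    if len(first.split()) == 1:                  # "Margaret" lacks a family name
        first = f"{first} {second.split()[-1]}"  # borrow it from "John Smith"
    return [first, second]

print(split_shared_family_name("Margaret & John Smith"))
# ['Margaret Smith', 'John Smith']
```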

For further explanation of the Master Data Types BUSINESS, CONSUMER, DEPARTMENT, EMPLOYEE you may have a look here.


Man versus Computer

In a recent social network happening Jim Harris and Phil Simon discussed whether IT projects are like the board games Monopoly or Risk.

I notice that both these games are played with dice.

I remember back in the early 80’s I had some programming training by constructing a Yahtzee game on a computer. The following parts were at my disposal:

  • Platform: IBM 8100 minicomputer
  • Language: COBOL compiler
  • User Interface: Screen with 80 characters in 24 rows

As the user interface design options were limited, the exciting part became the one player mode, where I had to teach (program) the computer which dice to save in a given situation – and make that logic based on patterns rather than every possible combination.

After having some other people test man versus computer in the one player mode, I found that I could actually construct a compact program that in the long run won more rounds than (ordinary) people.

Now, what about games without dice? Here we know that there has been a development even around chess, where the computer is now better than any human.

So, what about data quality? Is it man or computer who is best at solving the matter? A blog post from Robert Barker called “Avoiding False Positives: Analytics or Humans?” has a take on this.

Also, seen from a time and cost perspective, the computer does have some advantages compared to humans.

But still we need humans to select what game to be played. Throw the dice…


Business Rules and Duplicates

When finding or avoiding duplicates, or doing similar kinds of consolidation with party master data, you will encounter lots of situations where it is disputable what to do.

The “politically correct” answer is: It depends on your business rules.

Yea right. Easier said than done.

Often you face the following:

  • Business rules don’t exist. Decisions are based on common sense.
  • Business rules differ between data providers.

Let’s have an example.

We have these business rules (Owner, Brief):

Finance, No sales and deliveries to dissolved business entities
Logistics, Access to premises must be stated in Address2 if different from Address1
Sales, Every event must be registered with an active contact
Customer Service, In case of duplicate contacts the contact with the first event date wins

In a CRM system we have these 2 accounts (AccountID, CompanyName, Address1, Address2, City):

1, Restaurant San Remo, 2 Main Street, entrance thru no 4, Anytown
2, Ristorante San Remo, 2 Main Street, , Anytown

Also we have some contacts (AccountID, ContactID, JobTitle, ContactName, Status, StartYear, EventCount):

1, 1, Manager, Luigi Calda, Inactive, 2001, 2
1, 2, Chef de la Cusine, John Hothead, Active, 2002, 87
2, 1, Chef de la Cuisine, John Hothead, Duplicate, 2008, 2
2, 2, Owner, Gordon Testy, Active, 2008, 7

We are so lucky that a business directory is available now. Here we have (NationalID, Name, Address, City, Owner, Status):

3, Ristorante San Remo, 2 Main Street, Anytown, Luigi Calda, Dissolved
4, Ristorante San Remo, 2 Main Street, Anytown, Gordon Testy, Active

So, I don’t think we will produce a golden view of this business relationship based on the data (structure) available and the business rules available.

Building and aligning business rules and data structures to solve this example – and a lot of other examples with different challenges – may seem difficult and is often omitted in the name of simplicity. But:

  • Master data – not least business partners – is a valuable asset in the enterprise, so why treat it with simplicity while we do complex handling of a lot of other (transaction) data?
  • Common sense may help you a lot. Many of these questions are not specific to your business but are shared among most other enterprises in your industry and many others in the whole real world.
  • I guess the near future will bring an increased number of available services with software and external data support, helping a lot in selecting common business rules and applying them in the master data processing landscape.


Mu

The term “Mu” has several meanings, including being a lost continent. In this post I will use the meaning of “mu” being the answer to a question that can’t be answered with a simple “yes” or “no” or even “unknown”, as explained on Wikipedia here.

When working with data quality you often encounter situations where the answer to a simple question must be “mu”.

Let’s say you are looking for duplicates in a customer file and have these two rows (Name, Address, City):

Margaret Smith, 1 Main Street, Anytown
Margaret & John Smith, 1 Main Street, Anytown

Is this a duplicate situation?

In a given context like preparing for a direct mail the answer could be “yes”. But in most other contexts the answer is “mu”. Here the question should be something like: How do you handle hierarchy management with these two rows? And the answer could be something like the process presented in my recent post here.

Similar considerations apply to this example (Name, Address, City):

One Truth Consultants att: John Smith, 3 Main Street, Anytown
One Truth Consultants Ltd, 3 Main Street, Anytown

And this (Contact, Company, Address, City):

John Smith, One Truth Consultants, 3 Main Street, Anytown
John Smith, One Truth Services, 3 Main Street, Anytown

The latter example is explained in more detail in this post.


Settling a Match

In a recent post on this blog we went through how a process of consolidating master data could involve a match with a business directory.

Having more than a few B2B records often calls for an automated process to do that.

So, how do you do that?

Say you have a B2B record as this (Name, HouseNo, Street, City):

  • Smashing Estate, 1, Main Street, Anytown

The business directory has the following entries (ID, Name, HouseNo, Street, City):

  • 1, Smashing Estates, , Central Square, Anytown
  • 2, Smashing Holding, 1, Main Street, Anytown
  • 3, Smashing East, 1, Main Street, Anytown
  • 4, Real Consultants, 1, Main Street, Anytown

Several different forms of functionality are in use to settle the matter.

Here are some:

Exact match:

Here no candidates at all are found.

Match codes:

Say you make a match code on input and directory rows with:

  • The first 4 consonants of City
  • The first 4 consonants of Street
  • HouseNo as 4 digits with leading zeroes
  • The first 4 consonants of Name

This makes:

  • Input: NTWN-MNST-0001-SMSH
  • Directory 1: NTWN-CNTR-0000-SMSH
  • Directory 2: NTWN-MNST-0001-SMSH
  • Directory 3: NTWN-MNST-0001-SMSH
  • Directory 4: NTWN-MNST-0001-RLCN

Here directory entries 2 and 3 will be considered equal hits. You may select a random automated match or forward the case to manual inspection.

Many other and more sophisticated match code assignments exist including phonetic match codes.
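The match code scheme above can be sketched as follows; treating Y as a vowel is an assumption made here to reproduce the example codes (ANYTOWN gives NTWN):

```python
# Sketch of the match code scheme above. Treating Y as a vowel is an
# assumption made to reproduce the example codes (ANYTOWN -> NTWN).
def consonants4(text):
    """First 4 consonant letters of the text, uppercased."""
    cons = [c for c in text.upper() if c.isalpha() and c not in "AEIOUY"]
    return "".join(cons[:4])

def match_code(name, house_no, street, city):
    """City-Street-HouseNo-Name match code."""
    no = f"{int(house_no):04d}" if house_no else "0000"
    return f"{consonants4(city)}-{consonants4(street)}-{no}-{consonants4(name)}"

print(match_code("Smashing Estate", "1", "Main Street", "Anytown"))
# NTWN-MNST-0001-SMSH
print(match_code("Smashing Estates", "", "Central Square", "Anytown"))
# NTWN-CNTR-0000-SMSH
```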

Scoring:

You may assign a similarity to each pair of elements and then calculate a total similarity score between the input and each directory row.

Often a percentage-like measure is used here, where similarity 100 is exact, 90 is close, 75 is fair, and 50 and below is far away.

[Figure: match scores per element for each directory entry]

Selecting the best match candidate with this scoring will result in directory entry 3 as the winner, given that we accept automated matches with a score of 95 (and a gap of 5 points between this and the next candidate).

The assignment of similarity and the calculation of the total score may be (and is) implemented in many ways in different solutions.

Also the selection of candidates plays a role. If you have to select from a directory with millions of rows you may use swapped match codes and other techniques like advanced searching.
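A simple sketch of element-wise scoring, using Python’s difflib.SequenceMatcher as a stand-in for whatever edit-distance or phonetic similarity a real tool would use; the element weights are assumptions:

```python
from difflib import SequenceMatcher

# Sketch of a total match score. SequenceMatcher stands in for whatever
# edit-distance or phonetic similarity a real tool would use, and the
# element weights are assumptions.
WEIGHTS = {"name": 0.4, "house_no": 0.1, "street": 0.3, "city": 0.2}

def similarity(a, b):
    """Element similarity on a 0-100 scale."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def total_score(record, candidate):
    return sum(w * similarity(record[f], candidate[f]) for f, w in WEIGHTS.items())

record = {"name": "Smashing Estate", "house_no": "1",
          "street": "Main Street", "city": "Anytown"}
candidate = {"name": "Smashing East", "house_no": "1",
             "street": "Main Street", "city": "Anytown"}
print(round(total_score(record, candidate)))  # high score: only Name differs
```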

Matrix:

The following example is based on a patented method by Dun & Bradstreet.

Based on element similarities as above, you assign a match grade with a character for each element:

  • A being exact or very close e.g. scores above 90
  • B being close e.g. scores between 50 and 90
  • F being no match e.g. scores below 50
  • Z being missing values

Including Name, HouseNo, Street and City this will make the following match grades:

  • Directory 1: AZFA
  • Directory 2: BAAA
  • Directory 3: BAAA
  • Directory 4: FAAA

Based on the match grade you have a priority list of combinations giving a confidence code, e.g.:

  • AAAA = 10 (High)
  • BAAA = 9
  • AZAA = 8
  • A—A = 1 (Low)

Directory entries 3 and 2 will be winners with confidence code 9, remotely challenged by entry 1 with confidence code 1. Directory entry 4 is out of the game.
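The match grade assignment could be sketched like this; the thresholds follow the bullets above, and the confidence table is just an excerpt of a longer priority list:

```python
# Sketch of the match grade assignment. Thresholds follow the bullets above;
# the confidence table is just an excerpt of a longer priority list.
def grade(score):
    if score is None:
        return "Z"   # missing value
    if score > 90:
        return "A"   # exact or very close
    if score >= 50:
        return "B"   # close
    return "F"       # no match

CONFIDENCE = {"AAAA": 10, "BAAA": 9, "AZAA": 8}

def match_grade(scores):
    """Build a Name-HouseNo-Street-City grade string like 'BAAA'."""
    return "".join(grade(s) for s in scores)

# Directory entry 2: Name close, HouseNo/Street/City exact or very close.
g = match_grade([70, 100, 100, 100])
print(g, CONFIDENCE.get(g, 1))  # BAAA 9
```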

Satisfied?

I am actually not convinced that the winner should be directory entry 3 (or 2). I think directory entry 1 could be the one, if we have to select any.

Adding additional elements:

While we may not have additional information in the input, we may derive more elements from the ones we have, not to mention that the business directory may hold many more useful elements, e.g.:

  • Geocoding may establish that there is a very short distance from “Central Square” to “1 Main Street” thus making directory 1 a better fit.
  • LOB code (e.g. SIC or NACE) may confirm that directory 2 is a holding entity which typically (but not always) is less desirable as match candidate.
  • Hierarchy code may tell that directory 3 is a branch entity which typically (but not always) is less desirable as match candidate.
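As a sketch of the geocoding idea, a haversine distance check could establish how close “Central Square” is to “1 Main Street”; the coordinates below are made up for illustration:

```python
from math import asin, cos, radians, sin, sqrt

# Sketch of the geocoding check: great-circle (haversine) distance between
# two geocoded addresses. The coordinates below are made up for illustration.
def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius approx. 6371 km

main_street = (55.6761, 12.5683)     # hypothetical geocode of "1 Main Street"
central_square = (55.6772, 12.5700)  # hypothetical geocode of "Central Square"
print(f"{haversine_km(*main_street, *central_square):.2f} km")  # a short walk
```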

Probabilistic learning:

Here you rely on, or supplement the deterministic approaches shown above with, results from confirmed matches with the same elements and combinations and patterns of elements.

This topic deserves a post of its own.

LinkedIn Group Statistics

I am currently a member of 40 LinkedIn groups, mostly targeted at Master Data Management, Data Quality and Data Matching.

As I have noticed that some groups cover the same topic, I wondered if they have the same members.

So I did a quick analysis.

Within Master Data Management the largest groups seem to be:

Using the LinkedIn Profile Organizer I found that 907 are members of both groups. This is not as many as I would have guessed.

Within Data Quality the largest groups seem to be:

Using the LinkedIn Profile Organizer I found that 189 are members of both groups. This is not as many as I would have guessed, despite the renaming of the last group.

As for Data Matching I have founded the Data Matching group. The group has 235 members where:

  • 77 are also members of the two large Master Data Management groups.
  • 80 are also members of the two large Data Quality groups.

Also this is not as many as I would have guessed.

You may find many other similar groups on my LinkedIn profile – among them:


Process of consolidating Master Data


In my previous blog post “Multi-Purpose Data Quality” we examined a business challenge where we have multiple purposes with party master data.

The comments suggested some form of consolidation should be done with the data.

How do we do that?

I have made a PowerPoint show “Example process of consolidating master data” with a suggested way of doing that.

The process uses the party master data types explained here.

The next questions in solving our business challenge will include:

  • Is it necessary to have master data in optimal shape in real time – or is it OK to make periodic consolidations?
  • How do we design processes for maintaining the master data when:
    • New members and customers are inserted?
    • We update existing members and customers?
    • External reference data changes?   
  • What changes must be made with the existing applications handling the member database and the eShop?

Also the question of what style of Master Data Hub is suitable is indeed very common in these kinds of implementations.


Multi-Purpose Data Quality

Say you are an organisation within charity fundraising. For many years you have had a membership database, and recently you also introduced an eShop with related accessories.

The membership database holds the following record (Name, Address, City, YearlyContribution):

  •  Margaret & John Smith, 1 Main Street, Anytown, 100 Euro

The eShop system has the following accounts (Name, Address, Place, PurchaseInAll):

  • Mrs Margaret Smith, 1 Main Str, Anytown, 12 Euro
  • Peggy Smith, 1 Main Street, Anytown, 218 Euro
  • Local Charity c/o Margaret Smith, 1 Main Str, Anytown, 334 Euro

Now the new management wants to double contributions from members and triple eShop turnover. Based on the recommendations from “The One Truth Consulting Company” you plan to do the following:

  • Establish a platform for 1-1 dialogue with your individual members and customers
  • Analyze member and customer behaviour and profiles in order to:
    • Support the 1-1 dialogue with existing members and customers
    • Find new members and customers who are like your best members and customers

As the new management wants to stay for many years ahead, the solution must not be a one-shot exercise but must be implemented as business process reengineering with a continuous focus on the best fit of data governance, master data management and data (information) quality.

So, what are you going to do with your data so they are fit for action with the old purposes and the new purposes?

Recently I wrote some posts related to these challenges:

Any other comments on how to address these issues are welcome.


Driving Data Quality in 2 Lanes

Yesterday I visited a client in order to participate in a workshop on having more users within that organisation use a Data Quality Desktop tool.

This organisation makes use of 2 different Data Quality tools from Omikron:

  • The Data Quality Server, a complete framework of SOA enabled Data Quality functionality where we need the IT-department to be a critical part of the implementation.
  • The Data Quality Desktop tool, a user-friendly piece of Windows software installable by any PC user, but with sophisticated cleansing and matching features.

During the few hours of this workshop we were able to link several different departmental data sources to the server based MDM hub, setting up and confirming the business rules for this and reporting the foreseeable outcome of this process if it were to be repeated.

Some of the scenarios exercised will continue to run as ad hoc departmental processes and others will be upgraded into services embraced by the enterprise wide server implementation.

As I – for some reason – went by car over a long distance to this event, I had time to compare the data quality progress made by different organisations with the traffic on the roads, where we have:

  • Large buses with persons and large lorries with products being the most sustainable way of transport – but slow going and not too dynamic. Like the enterprise-wide server implementations of Data Quality tools.
  • Private cars heading for different destinations at different but faster speeds. Like the desktop Data Quality tools.

 I noticed that:

  • One lane with buses or lorries works fine but slowly.
  • One lane with private cars is a bit of a mess with some hazardous driving.
  • One lane with buses, lorries and private cars tends to be mortal.
  • 2 (or more) lanes work nicely with good driving habits.

So, encouraged by the workshop and the ride, I feel comfortable with the idea of using both kinds of Data Quality tools to have coherent, user-involved agile processes backed by some tools and a sustainable enterprise-wide solution at the same time.
