Man versus Computer. Special Edition.

Following up on my previous post on Man versus Computer, I am reminded most workday mornings of how man sucks.

Most workday mornings I leave home in my car heading into the following traffic:

  • A 4-lane motorway rolling in from southern Copenhagen, the rest of Denmark, Germany and ultimately the rest of Eurasia.
  • A 5th lane coming in from a local area.

These 5 lanes then split into:

  • 2 lanes heading for the Danish answer to Silicon Valley (called Ballerup)
  • 3 lanes leading to downtown Copenhagen or the main fair venue (called Bella Center), the airport, Sweden and the rest of Scandinavia.

Of course you would expect some mingling here. What happens every morning is rather a complete stop in traffic, and the cause is not the merging and splitting but the humans doing the driving:

  • Experienced local selfish drivers staying in the fastest lane until they suddenly want to switch lanes according to their onward route.
  • Inexperienced (in this area) foreign drivers coming up from crowded central Europe in search of tranquility deep in the Swedish forests, having no clue about where to position themselves at this intersection. The same goes for Swedes returning for the opposite reason.
  • Everyone else having fun blocking the lane switching of the selfish types and the foreigners, who should know better than to pass through in rush hour.

Some solutions to this problem might be:

  • Change management teaching people better driving habits.
  • An onboard computer in every car taking care of lane positioning. Splitting 5 lanes into 2 + 3 lanes should go smoothly.

Now I am waiting to see which solution will be implemented first.

Master Data Survivorship

A Master Data initiative is often described as making a “golden view” of all Master Data records held by an organization in various databases used by different applications serving a range of business units.

In doing that (either in the initial consolidation or the ongoing insertion and update) you will time and again encounter situations where two versions of the same element must be merged into one version of the truth.

In some MDM hub styles the decision is taken at consolidation time; in other styles the decision is postponed until the data (links) are consumed in a given context.

In the following I will talk about Party Master Data, the most common entity in Master Data initiatives.

This spring Jim Harris made a brilliant series of articles on DataQualityPro on the subject of identifying duplicate customers, ending with part number 5 dealing with survivorship. Here Jim describes all the basic considerations on how some data elements survive a merge/purge while others will be forgotten, and gives good examples with US consumers/citizens.

Taking it from there Master Data projects may have the following additional challenges and opportunities:

  • Global Data adds diversity to the rule set for consolidating data at record level as well as field level. You will have to weigh simple global rules against complex optimized rules (and supporting knowledge data) for each country/culture.
  • Multiple types of Party Master Data must be handled when Business Partners include business entities having departments and employees, and not least when they are present together with consumers/citizens.
  • External Reference Data is becoming more and more common as part of MDM solutions, adding valid, accurate and complete information about Business Partners. Here you have to set rules (at field level) for whether it overrides internal data, fills in the blanks or only supplements internal data.
  • Hierarchy building is closely related to survivorship. Rules may be set for whether two entities go into two hierarchies with surviving parts from both or merge as one with survivorship. Even an original entity may be split into two hierarchies with surviving parts.

What is essential in survivorship is not losing any valuable information while not creating information redundancy.
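The field-level rules mentioned above can be sketched in a few lines of code. This is a minimal sketch, not anyone's actual implementation: the field names, rule names and record values are hypothetical, and a real solution would add per-country rule sets.

```python
# Minimal sketch of field-level survivorship when merging an internal
# record with an external reference record. The rules per field are:
# "override" (external wins when present), "fill_blanks" (external only
# fills missing values) and "supplement" (keep both distinct values).

def survive(internal, external, rules):
    """Merge two records field by field according to a rule per field."""
    golden = {}
    for field, rule in rules.items():
        int_val = internal.get(field)
        ext_val = external.get(field)
        if rule == "override":
            golden[field] = ext_val if ext_val else int_val
        elif rule == "fill_blanks":
            golden[field] = int_val if int_val else ext_val
        elif rule == "supplement":
            values = [v for v in (int_val, ext_val) if v]
            golden[field] = values[0] if len(set(values)) == 1 else values
    return golden

internal = {"name": "Local Charity", "phone": "", "industry": "Charity"}
external = {"name": "Anytown Angels", "phone": "555-0100", "industry": "Non-profit"}
rules = {"name": "override", "phone": "fill_blanks", "industry": "supplement"}

print(survive(internal, external, rules))
# The name comes from the external source, the phone fills the blank,
# and industry keeps both values.
```

The point of the sketch is that survivorship is a per-field decision, not a per-record one.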

An example of complex survivorship processing may be this:

A membership database holds the following record (Name, Address, City):

  • Margaret & John Smith, 1 Main Street, Anytown

An eShop system has the following accounts (Name, Address, Place):

  • Mrs Margaret Smith, 1 Main Str, Anytown
  • Peggy Smith, 1 Main Street, Anytown
  • Local Charity c/o Margaret Smith, 1 Main Str, Anytown

A complex process of consolidation including survivorship may take place. As part of this example the company Local Charity is matched with an external source telling us it has a new name: Anytown Angels. The result may be this “golden view”:

ADDRESS in Anytown on Main Street no 1 having
• HOUSEHOLD having
– CONSUMER Mrs. Margaret Smith aka Peggy
– CONSUMER Mr. John Smith
• BUSINESS Anytown Angels having
– EMPLOYEE Mrs. Margaret Smith aka Peggy

Observe that everything survives in a globally applicable structure, in a fitting hierarchy reflecting local rules, handling multiple types of party entities and using external reference data.

But OK, we didn’t have funny names, dirt, misplaced data…..


Universal Pearls of Wisdom

When we are looking for what is really important and absolutely necessary to get data quality right, some sayings could be:

  • “Change management is a critical factor in ensuring long-term data quality success”.
  • “Focussing only on technology is doomed to fail”.
  • “You have to get buy-in from executive sponsors”.

These statements are in my eyes very true, and I guess everyone else will agree.

But I also notice that they are true for many other disciplines like MDM, BI, CRM, ERP, SOA, ITIL… you name it.

Also take the new SOA manifesto. I have tried to swap SOA (and the full words) with XYZ, and this is the result:

 XYZ Manifesto

XY is a paradigm that frames what you do. XYZ is a type of Z that results from applying XY. We have been applying XY to help organizations consistently deliver sustainable business value, with increased agility and cost effectiveness, in line with changing business needs. Through our work we have come to prioritize:

Business value over technical strategy

Strategic goals over project-specific benefits

Intrinsic interoperability over custom integration

Shared services over specific-purpose implementations

Flexibility over optimization

Evolutionary refinement over pursuit of initial perfection

That is, while we value the items on the right, we value the items on the left more.

I think a Data Quality manifesto, and several others, could be very close.

But what I am looking for in Data Quality is the specific pearls of wisdom related to Data and Information Quality – while I of course value being reminded about the universal ones.


Gorilla Data Quality

My previous blog post was titled “Guerrilla Data Quality”. In that post – and the excellent comments – we came to the conclusion that while we should have a 100% vision for data (or rather information) quality, most actual (and realistic) activity consists of minor steps compromising on:

  • Business unit versus enterprise wide scope
  • Single purpose versus multiple purpose capabilities
  • Reactive versus proactive approach

I think the reason why it is so is the widely used metaphor saying “Pick the low-hanging fruit first”. Such a metaphor is appealing to mankind since it relates to core activities made by our ancestors when gathering food – and still practiced by our cousins the gorillas.

Steve Sarsfield explained the logic of picking low-hanging fruit in his blog post Data Quality Project Selection by presenting the Project Selection Quadrant.

So what we are looking for now is the missing link between Gorilla / Guerrilla Data Quality and the teaching in available literature on how to get data (information) quality right.


Guerrilla Data Quality

Oh yes, in my crazy berserkergang of presenting stupid buzzword suggestions it’s time for “Guerrilla Data Quality”. And this time there are no previous hits on Google to point at as the original source.

But I noticed that “Guerrilla Data Governance” is in use, and as Data Governance and Data Quality are closely related disciplines, I think there could be something to “Guerrilla Data Quality”.

Also, recently an article called “How to set data quality goals any business can achieve” was published by Dylan Jones on DataQualityPro. Here the need for setting short term realistic goals is emphasised, in contrast to making a full size, enterprise wide, all domain massive initiative. This article focuses on the people and process side of what may be “Guerrilla Data Quality”.

Recently I wrote a blog post called “Driving Data Quality in 2 Lanes” focussing on the tool selection for what may be “Guerrilla Data Quality” and the enterprise wide follow up.

Actually I guess most Data Quality activity going on is in fact “Guerrilla Data Quality”. The problem then is that most literature and teaching on Data Quality is aimed at the massive enterprise wide implementations.

Any thoughts?

Splitting names

When working through a list of names in order to do deduplication, consolidation or identity resolution, you will meet name fields populated like these:

  • Margaret & John Smith
  • Margaret Smith. John Smith
  • Maria Dolores St. John Smith
  • Johnson & Johnson Limited
  • Johnson & Johnson Limited, John Smith
  • Johnson Furniture Inc., Sales Dept
  • Johnson, Johnson and Smith Sales Training

Some of the entities having these names must be split into two entities before we can do the proper processing.

When you as a human look at a name field, you mostly (given that you share the same culture) know what it is about.

Making a computer program that does the same is an exciting but fearful journey.

What I have been working with includes the following techniques:

  • String manipulation
  • Look up in lists of words such as given names, family names, titles, “business words” and special characters. These are country/culture specific.
  • Matching with address directories, used for checking if the address is a private residence or a business address.
  • Matching with business directories, used for checking if it is in fact a business name and which part of a name string is not included in the corresponding name.
  • Matching with consumer/citizen directories, used for checking which names are known on an address.
  • Probabilistic learning, storing and looking up previous human decisions.

As with other computer-supported data quality processes I have found it useful having the computer divide the names into 3 pots:

  • A: The ones the computer may split automatically with an accepted failure rate of false positives
  • B: The dubious ones, selected for human inspection
  • C: The clean ones where the computer has found no reason to split (with an accepted failure rate of false negatives)
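The three-pot triage can be sketched as follows. The scoring function here is a toy stand-in (a real one would combine the word lists, directory matches and learned decisions listed above), and the thresholds are illustrative assumptions:

```python
# Sketch of the three-pot triage. A hypothetical split_score function
# returns the computer's confidence (0-100) that a name field holds
# more than one entity; the thresholds are assumptions.

def triage(names, score_fn, auto_split=90, clean=20):
    """Divide names into pot A (split automatically), B (human
    inspection) and C (leave as-is) based on a confidence score."""
    pots = {"A": [], "B": [], "C": []}
    for name in names:
        score = score_fn(name)
        if score >= auto_split:
            pots["A"].append(name)   # accepted false-positive rate applies here
        elif score <= clean:
            pots["C"].append(name)   # accepted false-negative rate applies here
        else:
            pots["B"].append(name)   # dubious: forward to a human
    return pots

# Toy scoring: "&" between two given names is a strong split signal,
# while "&" inside a known business name is dubious at best.
def toy_score(name):
    if "&" in name:
        return 30 if "Limited" in name or "Inc" in name else 95
    return 10

result = triage(["Margaret & John Smith",
                 "Johnson & Johnson Limited",
                 "Maria Dolores St. John Smith"], toy_score)
print(result)
```

The thresholds are where the accepted failure rates are tuned: raising `auto_split` moves work from the computer to the humans.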

For the listed names a suggestion for the golden single version of the truth could be:

  • “Margaret & John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
  • “Margaret Smith. John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
  • “Maria Dolores St. John Smith” stays as CONSUMER “Maria Dolores St. John Smith”
  • “Johnson & Johnson Limited” stays as BUSINESS “Johnson & Johnson Limited”
  • “Johnson & Johnson Limited, John Smith” will be split into BUSINESS “Johnson & Johnson Limited” having EMPLOYEE “John Smith”
  • “Johnson Furniture Inc., Sales Dept” will be split into BUSINESS “Johnson Furniture Inc.” having DEPARTMENT “Sales Dept”
  • “Johnson, Johnson and Smith Sales Training” stays as BUSINESS “Johnson, Johnson and Smith Sales Training”

For further explanation of the Master Data Types BUSINESS, CONSUMER, DEPARTMENT, EMPLOYEE you may have a look here.


Man versus Computer

In a recent social network happening Jim Harris and Phil Simon discussed whether IT projects are like the board games Monopoly or Risk.

I notice that both these games are played with dice.

I remember back in the early 80’s I had some programming training by constructing a Yahtzee game on a computer. The following parts were at my disposal:

  • Platform: IBM 8100 minicomputer
  • Language: COBOL compiler
  • User Interface: Screen with 80 characters in 24 rows

As the user interface design options were limited, the exciting part became the one player mode where I had to teach (program) the computer which dice to save in a given situation – and base that logic on patterns rather than every possible combination.

While having some other people test the man versus computer contest in the one player mode, I found out that I could actually construct a compact program that in the long run won more rounds than (ordinary) people.

Now, what about games without dice? Here we know there has been a development, even in chess, where the computer is now better than any human.

So, what about data quality? Is it man or computer who is best at solving the matter? A blog post from Robert Barker called “Avoiding False Positives: Analytics or Humans?” weighs in on this.

Also, seen from a time and cost perspective, the computer does have some advantages compared to humans.

But still we need humans to select what game to be played. Throw the dice…


Business Rules and Duplicates

When finding or avoiding duplicates, or doing similar kinds of consolidation with party master data, you will encounter lots of situations where it is disputable what to do.

The “politically correct” answer is: it depends on your business rules.

Yeah, right. Easier said than done.

Often you face the following:

  • Business rules don’t exist. Decisions are based on common sense.
  • Business rules differ between data providers.

Let’s have an example.

We have these business rules (Owner, Brief):

Finance, No sales and deliveries to dissolved business entities
Logistics, Access to premises must be stated in Address2 if different from Address1
Sales, Every event must be registered with an active contact
Customer Service, In case of duplicate contacts the contact with the first event date wins

In a CRM system we have these 2 accounts (AccountID, CompanyName, Address1, Address2, City):

1, Restaurant San Remo, 2 Main Street, entrance thru no 4, Anytown
2, Ristorante San Remo, 2 Main Street, , Anytown

Also we have some contacts (AccountID, ContactID, JobTitle, ContactName, Status, StartYear, EventCount):

1, 1, Manager, Luigi Calda, Inactive, 2001, 2
1, 2, Chef de la Cusine, John Hothead, Active, 2002, 87
2, 1, Chef de la Cuisine, John Hothead, Duplicate, 2008, 2
2, 2, Owner, Gordon Testy, Active, 2008, 7

We are so lucky that a business directory is available now. Here we have (NationalID, Name, Address, City, Owner, Status):

3, Ristorante San Remo, 2 Main Street, Anytown, Luigi Calda, Dissolved
4, Ristorante San Remo, 2 Main Street, Anytown, Gordon Testy, Active
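As a minimal sketch of how one of these rules could be enforced in code, consider the Finance rule applied to accounts matched against the directory. The account-to-NationalID matching is assumed to be settled already, and the function name is hypothetical:

```python
# Sketch: apply the Finance rule ("no sales and deliveries to dissolved
# business entities") to CRM accounts matched against the business
# directory. Which account matches which NationalID is assumed known.

directory = {
    3: {"Name": "Ristorante San Remo", "Owner": "Luigi Calda", "Status": "Dissolved"},
    4: {"Name": "Ristorante San Remo", "Owner": "Gordon Testy", "Status": "Active"},
}

# AccountID -> matched NationalID (hypothetical match result)
matches = {1: 3, 2: 4}

def blocked_for_sales(account_id):
    """Finance rule: block accounts whose directory entry is dissolved."""
    entry = directory.get(matches.get(account_id))
    return entry is not None and entry["Status"] == "Dissolved"

print(blocked_for_sales(1))   # account 1 matched the dissolved entity
print(blocked_for_sales(2))
```

Even this toy version exposes the conflict: the rule gives opposite answers for two accounts that, on the surface, describe the same restaurant.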

So, I don’t think we will produce a golden view of this business relationship based on the data (structure) available and the business rules available.

Building and aligning business rules and data structures to solve this example, and a lot of other examples with different challenges, may seem difficult and is often omitted in the name of simplicity. But:

  • Master data – not least business partner data – is a valuable asset in the enterprise, so why treat it with simplicity while we do complex handling of a lot of other (transaction) data?
  • Common sense may help you a lot. Many of these questions are not specific to your business but are shared among most other enterprises in your industry and many others in the whole real world.
  • I guess the near future will bring an increased number of available services, with software and external data support, that help a lot in selecting common business rules and applying these in the master data processing landscape.


Mu

The term ”Mu” has several meanings, including being a lost continent. In this post I will use the meaning of “mu” as the answer to a question that can’t be answered with a simple “yes” or “no” or even “unknown”, as explained on Wikipedia here.

When working with data quality you often encounter situations where the answer to a simple question must be “mu”.

Let’s say you are looking for duplicates in a customer file and have these two rows (Name, Address, City):

Margaret Smith, 1 Main Street, Anytown
Margaret & John Smith, 1 Main Street, Anytown

Is this a duplicate situation?

In a given context like preparing for a direct mail the answer could be “yes”. But in most other contexts the answer is “mu”. Here the question should be something like: How do you handle hierarchy management with these two rows? And the answer could be something like the process presented in my recent post here.
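One way to make the “mu” answer explicit is to give the duplicate check three possible outcomes instead of two. This is a toy sketch with hypothetical context names and a deliberately naive comparison:

```python
# Sketch: duplicate decisions as a three-valued answer. Whether two
# rows are "the same" depends on the context in which the data is
# consumed; "mu" means the question cannot be answered as posed.

from enum import Enum

class Answer(Enum):
    YES = "yes"
    NO = "no"
    MU = "mu"

def is_duplicate(row_a, row_b, context):
    same_address = (row_a["Address"], row_a["City"]) == (row_b["Address"], row_b["City"])
    same_name = row_a["Name"] == row_b["Name"]
    if same_name and same_address:
        return Answer.YES
    if not same_address:
        return Answer.NO
    # Same address, differing names: for a direct mail one letter per
    # household is fine; otherwise hierarchy management is the real question.
    return Answer.YES if context == "direct_mail" else Answer.MU

a = {"Name": "Margaret Smith", "Address": "1 Main Street", "City": "Anytown"}
b = {"Name": "Margaret & John Smith", "Address": "1 Main Street", "City": "Anytown"}
print(is_duplicate(a, b, "direct_mail"))
print(is_duplicate(a, b, "billing"))
```

A “mu” result is a signal to change the question, not to force a yes/no.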

Similar considerations apply to this example (Name, Address, City):

One Truth Consultants att: John Smith, 3 Main Street, Anytown
One Truth Consultants Ltd, 3 Main Street, Anytown

And this (Contact, Company, Address, City):

John Smith, One Truth Consultants, 3 Main Street, Anytown
John Smith, One Truth Services, 3 Main Street, Anytown

The latter example is explained in more detail in this post.


Settling a Match

In a recent post on this blog we went through how a process of consolidating master data could involve a match with a business directory.

Having more than a few B2B records often calls for an automated process to do that.

So, how do you do that?

Say you have a B2B record as this (Name, HouseNo, Street, City):

  • Smashing Estate, 1, Main Street, Anytown

The business directory has the following entries (ID, Name, HouseNo, Street, City):

  • 1, Smashing Estates, , Central Square, Anytown
  • 2, Smashing Holding, 1, Main Street, Anytown
  • 3, Smashing East, 1, Main Street, Anytown
  • 4, Real Consultants, 1, Main Street, Anytown

Several different forms of functionality are used around to settle the matter.

Here are some:

Exact match:

Here no candidates at all are found.

Match codes:

Say you make a match code on input and directory rows with:

  • 4 first consonants in City
  • 4 first consonants in Street
  • 4 digit with leading zero of HouseNo
  • 4 first consonants in Name

This makes:

  • Input: NTWN-MNST-0001-SMSH
  • Directory 1: NTWN-CNTR-0000-SMSH
  • Directory 2: NTWN-MNST-0001-SMSH
  • Directory 3: NTWN-MNST-0001-SMSH
  • Directory 4: NTWN-MNST-0001-RLCN

Here directory entries 2 and 3 will be considered equal hits. You may select a random automated match or forward the case to manual inspection.

Many other and more sophisticated match code assignments exist including phonetic match codes.
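The scheme above can be sketched in a few lines. The example codes treat “y” as a vowel (“Anytown” becomes NTWN, not NTYW), so the sketch does the same; padding short values with “0” is an assumption:

```python
# Sketch of the consonant match-code scheme: first four consonants of
# City, Street and Name plus a zero-padded house number. "Y" is treated
# as a vowel to reproduce codes like NTWN for "Anytown".

def consonants(text, n=4):
    """First n consonant letters, upper-cased; short values are padded."""
    cons = [c for c in text.upper() if c.isalpha() and c not in "AEIOUY"]
    return "".join(cons[:n]).ljust(n, "0")   # padding rule is an assumption

def match_code(name, house_no, street, city):
    return "-".join([
        consonants(city),
        consonants(street),
        str(house_no or 0).zfill(4),         # missing house numbers become 0000
        consonants(name),
    ])

print(match_code("Smashing Estate", 1, "Main Street", "Anytown"))
# NTWN-MNST-0001-SMSH
print(match_code("Smashing Estates", None, "Central Square", "Anytown"))
# NTWN-CNTR-0000-SMSH
```

The appeal of match codes is that equality on the code is a cheap stand-in for fuzzy comparison; the cost is that entries 2 and 3 collide, as noted above.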

Scoring:

You may assign a similarity to each pair of elements and then calculate a total similarity score between the input and each directory row.

Often you use a percentage like measure here, where a similarity of 100 is exact, 90 is close, 75 is fair, and 50 and below is far away.


Selecting the best match candidate with this scoring will result in directory entry 3 as the winner, given that we accept automated matches with a score of 95 (and a gap of 5 points between this and the next candidate).

The assigning of similarity and calculating of total score may be (and are) implemented in many ways in different solutions.

Also the selection of candidates plays a role. If you have to select from a directory with millions of rows you may use swapped match codes and other techniques like advanced searching.
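A minimal sketch of the scoring approach, using a generic string similarity from Python’s standard library as a stand-in for the usually more sophisticated element similarity, and with hypothetical weights:

```python
# Sketch of weighted total scoring between the input record and each
# directory entry. difflib.SequenceMatcher stands in for a real element
# similarity function; the weights are assumptions.
import difflib

def similarity(a, b):
    """Percentage-like similarity: 100 is exact, 0 is missing or disjoint."""
    if not a or not b:
        return 0
    return round(100 * difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio())

def total_score(record, candidate, weights):
    return round(sum(similarity(record[f], candidate[f]) * w
                     for f, w in weights.items()) / sum(weights.values()))

record = {"Name": "Smashing Estate", "HouseNo": "1",
          "Street": "Main Street", "City": "Anytown"}
directory = [
    {"ID": 1, "Name": "Smashing Estates", "HouseNo": "",
     "Street": "Central Square", "City": "Anytown"},
    {"ID": 2, "Name": "Smashing Holding", "HouseNo": "1",
     "Street": "Main Street", "City": "Anytown"},
    {"ID": 3, "Name": "Smashing East", "HouseNo": "1",
     "Street": "Main Street", "City": "Anytown"},
    {"ID": 4, "Name": "Real Consultants", "HouseNo": "1",
     "Street": "Main Street", "City": "Anytown"},
]

weights = {"Name": 4, "HouseNo": 1, "Street": 2, "City": 1}  # name counts most
ranked = sorted(directory, key=lambda c: total_score(record, c, weights),
                reverse=True)
for c in ranked:
    print(c["ID"], total_score(record, c, weights))
```

With these toy weights the ranking happens to favour the same entry as in the example, but the absolute numbers depend entirely on the similarity function and weights chosen.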

Matrix:

The following example is based on a patented method by Dun & Bradstreet.

Based on an element similarity as above you assign a match grade with a character for each element as:

  • A being exact or very close e.g. scores above 90
  • B being close e.g. scores between 50 and 90
  • F being no match e.g. scores below 50
  • Z being missing values

Including Name, HouseNo, Street and City this will make the following match grades:

  • Directory 1: AZFA
  • Directory 2: BAAA
  • Directory 3: BAAA
  • Directory 4: FAAA

Based on the match grade you have a priority list of combinations giving a confidence code, e.g.:

  • AAAA = 10 (High)
  • BAAA = 9
  • AZAA = 8
  • A—A = 1 (Low)

Directory entries 3 and 2 will be winners with confidence code 9, remotely challenged by entry 1 with confidence code 1. Directory entry 4 is out of the game.
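The grading and lookup can be sketched like this, with the thresholds taken from the bullet list above and an abridged, illustrative priority table (the real D&B tables are far larger):

```python
# Sketch of the matrix approach: per-element similarities become letter
# grades, and the grade string is looked up in a priority table that
# yields a confidence code. The table here is abridged and illustrative.

def grade(score):
    if score is None:
        return "Z"          # missing value
    if score > 90:
        return "A"          # exact or very close
    if score >= 50:
        return "B"          # close
    return "F"              # no match

def match_grade(scores):
    """scores: similarities for Name, HouseNo, Street, City in order."""
    return "".join(grade(s) for s in scores)

CONFIDENCE = {"AAAA": 10, "BAAA": 9, "AZAA": 8}   # abridged priority list

def confidence_code(grade_string):
    return CONFIDENCE.get(grade_string, 1)        # anything else: low

# Directory entry 2 from the example: close name, exact address elements
print(match_grade([60, 100, 100, 100]), confidence_code("BAAA"))
# Directory entry 1: very close name, missing house number, street off
print(match_grade([95, None, 40, 95]), confidence_code("AZFA"))
```

The advantage over a single total score is that the grade string preserves which elements matched, so business rules can prefer, say, a missing house number over a wrong one.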

Satisfied?

I am actually not convinced that the winner should be directory entry 3 (or 2). I think directory entry 1 could be the one if we have to select anyone.

Adding additional elements:

While we may not have additional information in the input, we may derive more elements from the ones we have, not to mention that the business directory may hold many more useful elements, e.g.:

  • Geocoding may establish that there is a very short distance from “Central Square” to “1 Main Street” thus making directory 1 a better fit.
  • LOB code (e.g. SIC or NACE) may confirm that directory 2 is a holding entity which typically (but not always) is less desirable as match candidate.
  • Hierarchy code may tell that directory 3 is a branch entity which typically (but not always) is less desirable as match candidate.

Probabilistic learning:

Here you rely on, or supplement the deterministic approaches shown above with, results from confirmed matches with the same elements and the same combinations and patterns of elements.

This topic deserves a post of its own.