Data Matching – Page 21 – Liliendahl on Data Quality

Select Company_ID from External_Source where possible

27th January 20101st September 2010Henrik Gabs Liliendahl9 Comments

With the risk of having the comment area on this blog filled up with SQL statements I will follow the track and tone from the last post called Create Table Homo_Sapiens.

In the last post some challenges around modelling people in databases was discussed with focus on uniqueness. Now we will have a look at the same challenges with companies – the other big part of party master data.

Companies often act in the same role as individual people in business processes – not at least in the role as a customer. Companies also behave as persons in a lot of ways like being born (establish), change name, relocate, marry (mergers and acquisitions), divorce (split) and decease (dissolve).

All over the world a lot of people spend the days entering and updating the data held on business partners in numerous databases. The world wide sum of B2B connections between a customer and a vendor each entering and maintaining the data about the other resembles (though less aggressive) the grains on a chessboard story:

2 companies both exchanging goodies makes 1+1 customers and 1+1 vendors = 4 rows
3 companies all exchanging goodies makes 2+2+2 customers and 2+2+2 vendors = 12 rows
4 companies all exchanging goodies makes 3+3+3+3 customers and 3+3+3+3 vendors = 24 rows
5 companies all exchanging goodies makes 4+4+4+4+4 customers and 4+4+4+4+4 vendors = 40 rows
n companies all exchanging goodies makes n*(n-1) customers and n*(n-1) vendors = 2*n*(n-1) rows

Last time I checked the D&B WorldBase held more the 150 millions companies. Some are dissolved and fortunately? everyone doesn’t do business with everyone – but as said, the sum of B2B connections is huge and the work in entering and maintaining the master data seems overwhelming.

If we look at one single company and how it may be represented differently in databases around only taking basic data as name and address into account, there will be lots of variations. Even in the same table the same real world company often occupies several rows spelled differently.

One of the most effective methods for avoiding duplicates of company master data is plugging into a business directory. By using an external sourced company ID as a key in your master data you are able to hold a golden record of that real world entity. As a bonus you are offered updates and access to a lot of additional data you would never be able to collect yourself.

Create Table Homo_Sapiens

23rd January 201027th March 2012Henrik Gabs Liliendahl19 Comments

Create Table is a basic statement in the SQL language which is the most widespread computer language used when structuring data in databases.

The most common entity in databases around must be rows representing real world human beings (Homo Sapiens) and the different groups we form. Tables for that could have the name Homo_Sapiens but is usually called Customer, Member, Citizen, Patient, Contact and so on.

The most common data quality issues around is related to accuracy, validity, timeliness, completeness and not at least uniqueness with the data we hold about people.

In databases tables are supposed to have a unique primary key. There are two basic types of primary keys:

Surrogate keys are typically numbers with no relation (and binding) to the real world. They are made invisible to the users of the applications operating on the database.
Natural keys are derived from existing codes or other data identifying an entity in the real world or made for that purpose. They are visible to users and part of electronic, written and verbal communication.

As surrogate keys obviously don’t help with real world uniqueness and there are no common global natural key for all human beings on the earth we have a challenge in creating a good primary key for a Homo Sapiens table.

Inside a given country we have different forms of citizen ID’s (national identification number) with very varying terms of use between the countries. But even in Scandinavia where I live and we have widespread use of unique citizen ID’s most tables that could have the name Homo_Sapiens cannot use a Citizen ID as (unique) primary key for several reasons as well as that data is not present in a lot of situations.

Most often we name the tables holding data about human beings by the role people will act in within the purpose of use for the data we collect. For example Customer Table. A customer may be an individual but also a household or a business entity. A human being may be a private consumer but also an employee at a business making a purchase or a business owner making both private purchases and business purchases.

Every business activity always comes down to interacting with individual persons. But as our data is collected for the different roles that individual may have acted in, we have a need for viewing these data related to single human beings. The methods for facilitating this have different flavours as:

Deduplication is the classic term used for describing processes where records are linked, merged or purged in order to make a golden copy having only one (parent) database row for each individual person (and other legal entities). This is usually done by matching data elements in internal tables with names and addresses within a given organisation.
Identity Resolution is about the same but – if a distinction is considered to exist – uses a wider range of data, rules and functionality to relate collected data rows to real world entities. In my eyes exploiting external reference data will add considerable efficiency in the years to come within deduplication / identity resolution.
Master Data Hierarchy Management again have the same goal of establishing a golden copy of collected data by emphasising on reflecting the complex structure of relationships in the real world as well as the related history.

Next time I am involved in a data modelling exercise I will propose a Homo_Sapiens table. Wonder about the odds for buy in from other business and technical delegates.

Diversity in City Names

17th January 201017th August 2010Henrik Gabs Liliendahl8 Comments

The metro area I live in is called Copenhagen – in English. The local Danish name is København. When I go across the bridge to Sweden the road signs points back at the Swedish variant of the name being Köpenhamn. When the new bridge from Germany to east Denmark is finished the road signs on the German side will point at Kopenhagen. A flight from Paris has the destination Copenhague. From Rome it is Copenaghen. The Latin name is Hafnia.

These language variants of city (and other) names is a challenge in data matching.

If a human is doing the matching the match may be done because that person knows about the language variations. This is a strength in human processing. But it is also a weakness in human processing if another person don’t know about the variations and thereby the matching will be inconsistent by not repeating the same results.

Computerized match processing may handle the challenge in different ways, including:

The data model may reflect the real world by having places described by multiple names in given languages.
Some data matching solutions use synonym listing for this challenge.
Probabilistic learning is another way. The computer finds a similarity between two sets of data describing an entity but with a varying place name. A human may confirm the connection and the varying place names then will be included in the next automated match.

As globalization moves forward data matching solutions has to deal with diversity in data. A solution may have made wonders yesterday with domestic data but will be useless tomorrow with international data.

A New Year Resolution

1st January 20102nd July 2010Henrik Gabs Liliendahl2 Comments

Also for this year I have made this New Year resolution: I will try to avoid stupid mistakes that actually are easily avoidable.

Just before Christmas 2009 I made such a mistake in my professional work.

It’s not that I don’t have a lot of excuses. Sure I have.

The job was a very small assignment doing what my colleagues and I have done a lot of times before: An excel sheet with names, addresses, phone numbers and e-mails was to be cleansed for duplicates. The client had got a discount price. As usual it had to be finished very quickly.

I was very busy before Christmas – but accepted this minor trivial assignment.

When the excel sheet arrived it looked pretty straight forward. Some names of healthcare organizations and healthcare professionals working there. I processed the sheet in the Omikron Data Quality Center, scanned the result and found no false positives, made the export with suppressing merge/purge candidates and delivered back (what I thought was) a clean sheet.

But the client got back. She had found at least 3 duplicates in the not so clean sheet. Embarrassing. Because I didn’t ask her (as I use to do) a few obvious questions about what will constitute a duplicate. I have even recently blogged about the challenge that I call “the echo problem” I missed.

The problem is that many healthcare professionals have several job positions. Maybe they have a private clinic besides positions at one or several different hospitals. And for this particular purpose a given healthcare professional should only appear ones.

Now, this wasn’t a MDM project where you have to build complex hierarchy structures but one of those many downstream cleansing jobs. Yes, they exist and I predict they will continue to do in the decade beginning today. And sure, I could easily make a new process ending in a clean sheet fit for that particular purpose based on the data available.

Next time, this year, I will get the downstream data quality job done right the first time so I have more time for implementing upstream data quality prevention in state of the art MDM solutions.

Phony Phones and Real Numbers

8th December 20098th December 2009Henrik Gabs Liliendahl6 Comments

There are plenty of data quality issues related to phone numbers in party master data. Despite that a phone number should be far less fuzzy than names and addresses I have spend lots of time having fun with these calling digits.

Challenges includes:

Completeness – Missing values
Precision – Inclusion of country codes, area codes, extensions
Reliability – Real world alignment, pseudo numbers: 1234.., 555…
Timeliness – Outdated and converted numbers
Conformity – Formatting of numbers
Uniqueness – Handling shared numbers and multiple numbers per party entity

You may work with improving phone number quality with these approaches:

Profiling:

Here you establish some basic ideas about the quality of a current population of phone numbers. You may look at:

Count of filled values
Minimum and maximum lengths
Represented formats – best inspected per country if international data
Minimum and maximum values – highlighting invalid numbers

Validation:

National number plans can be used as a basis for next level check of reliability – both in batch cleansing of a current population and for an upstream prevention with new entries. Here numbers not conforming to valid lengths and ranges can be marked.

Also you may make some classification telling about if it is a fixed net number or cell number – but boundaries are not totally clear in many cases.

In many countries a fixed net number includes an area code telling about place.

Match and enrichment:

Names and addresses related to missing and invalid phone numbers may be matched with phone books and other directories having phone numbers and thereby enriching your data and improving completeness.

Reality check:

Then you of course may call the number and confirm whether you are reaching the right person (or organization). I have though never been involved in such an activity or been called by someone only asking if I am who I am.

What’s in an eMail Address?

28th November 200928th November 2010Henrik Gabs Liliendahl11 Comments

When you are deduping, consolidating or doing identity resolution with party master data the elements that may be used includes names, postal addresses and places, phone numbers, national ID’s and eMail addresses.

Types of eMail addresses

In this post I will look closer into eMail addresses based on a general list of types of party master data.

You may divide eMail addresses into these types:

CONSUMER/CITIZEN:

This is a private eMail address belonging to an individual person.

Typical formats are myname@hotmail.com and nickname@gmail.com and name123@anymail.com

You may change your eMail address as a private person as time goes or have several such addresses at a time depending on your favourite providers of eMail services and other reasons to split your personality.

HOUSEHOLD:

A household/family may choose to have a shared eMail Address for private use.

Typical format will be xyz-family@anymail.com where the word family of course could be in a lot of different languages like famiglia-italiano@email.it

A special usage is the GROUP where two (or more) names are included like mary-and-john@anymail.com

EMPLOYEE:

This is the eMail address you are assigned as an employee (including owner) at a company.

Common formats are abc@company.com and name.name@company.com

When you change employer you also change eMail address and you may have several employers or other assignments at the same time. Also different formats like initials and full name may point to the same inbox.

DEPARTMENT:

Here the eMail address is not pointed at a particular person but some sort of a team within a company.

Formats are like sales@company.com and salg@firma.dk and vertrieb@firma.de choosing the sales team in some different languages.

Some eMail are referring to a specific FUNCTION like webmaster@company.com

BUSINESS:

This is an eMail address for the entire company.

Most common formats are info@company.com and company@company.com

INVALID:

Often a field designed for an eMail address is populated with invalid values going from obvious wrong values like XXX to harder detectable syntax errors and not existing domains.

Real world duplication

Many online services are based on registration via an eMail address assuming that one eMail represents one real world entity which of course is not the case.

Even on a service like LinkedIn where you may attach several eMail addresses to one profile you do encounter persons with obvious duplicate profiles.

Multi-channel marketing and sales

An increasing number of organisations are doing both offline and online operations today and when building enterprise wide master data hubs the eMail address becomes an more and more important element in matching party master data.

In such matching activities the eMail address can not stand alone but must be combined with the other elements as names, postal addresses, phone numbers and national ID’s upon availability.

Success in automating such processes is based on advanced algorithms in flexible and configurable solutions.

Comment or eMail me

If you also have been battling here I will be glad to have your comments here or by mail. My mail is hlsgr@mail.tele.dk and hls@omikron.net and hls@locus.dk and hls@dmpartner.dk and nordic@omikron.net

Master Data Survivorship

28th October 20092nd July 2010Henrik Gabs Liliendahl1 Comment

A Master Data initiative is often described as making a “golden view” of all Master Data records held by an organization in various databases used by different applications serving a range of business units.

In doing that (either in the initial consolidation or the ongoing insertion and update) you will time and again encounter situations where two versions of the same element must be merged into one version of the truth.

In some MDM hub styles the decision is to be taken at consolidation time, in other styles the decision is prolonged until the data (links) is consumed in a given context.

In the following I will talk about Party Master Data being the most common entity in Master Data initiatives.

This spring Jim Harris made a brilliant series of articles on DataQualityPro on the subject of identifying duplicate customers ending with part number 5 dealing with survivorship. Here Jim describes all the basic considerations on how some data elements survives a merge/purge and others will be forgotten and gives good examples with US consumer/citizens.

Taking it from there Master Data projects may have the following additional challenges and opportunities:

Global Data adds diversity into the rule set of consolidation data on record level as well as field level. You will have to comprise on simple global rules versus complex optimized rules (and supporting knowledge data) for each country/culture.
Multiple types of Party Master Data must be handled when Business Partners includes business entities having departments and employees and not at least when they are present together with consumers/citizens.
External Reference Data is becoming more and more common as part of MDM solutions adding valid, accurate and complete information about Business Partners. Here you have to set rules (on field level) of whether they override internal data, fills in the blanks or only supplements internal data.
Hierarchy building is closely related to survivorship. Rules may be set for whether two entities goes into two hierarchies with surviving parts from both or merges as one with survivorship. Even an original entity may be split into two hierarchies with surviving parts.

What is essential in survivorship is not loosing any valuable information while not creating information redundancy.

An example of complex survivorship processing may be this:

A membership database holds the following record (Name, Address, City):

Margaret & John Smith, 1 Main Street, Anytown

An eShop system has the following accounts (Name, Address, Place):

Mrs Margaret Smith, 1 Main Str, Anytown
Peggy Smith, 1 Main Street, Anytown
Local Charity c/o Margaret Smith, 1 Main Str, Anytown

A complex process of consolidation including survivorship may take place. As part of this example the company Local Charity is matched with an external source telling it has a new name being Anytown Angels. The result may be this “golden view”:

ADDRESS in Anytown on Main Street no 1 having
• HOUSEHOLD having
– CONSUMER Mrs. Margaret Smith aka Peggy
– CONSUMER Mr. John Smith
• BUSINESS Anytown Angels having
– EMPLOYEE Mrs. Margaret Smith aka Peggy

Observe that everything survives in a global applicable structure in a fit hierarchy reflecting local rules handling multiple types of party entities using external reference data.

But OK, we didn’t have funny names, dirt, misplaced data…..

Splitting names

21st October 20095th July 2010Henrik Gabs Liliendahl11 Comments

When working through a list of names in order to make a deduplication, consolidation or identity resolution you will meet name fields populated as these:

Margaret & John Smith
Margaret Smith. John Smith
Maria Dolores St. John Smith
Johnson & Johnson Limited
Johnson & Johnson Limited, John Smith
Johnson Furniture Inc., Sales Dept
Johnson, Johnson and Smith Sales Training

Some of the entities having these names must be split into two entities before we can do the proper processing.

When you as a human look at a name field, you mostly (given that you share the same culture) know what it is about.

Making a computer program that does the same is an exiting but fearful journey.

What I have been working with includes the following techniques:

String manipulation
Look up in list of words as given names, family names, titles, “business words”, special characters. These are country/culture specific.
Matching with address directories, used for checking if the address is a private residence or a business address.
Matching with business directories, used for checking if it is in fact a business name and which part of a name string is not included in the corresponding name.
Matching with consumer/citizen directories, used for checking which names are known on an address.
Probabilistic learning, storing and looking up previous human decisions.

As with other data quality computer supported processes I have found it useful having the computer dividing the names into 3 pots:

A: The ones the computer may split automatically with an accepted failure rate of false positives
B: The dubious ones, selected for human inspection
C: The clean ones where the computer have found no reason to split (with an accepted failure rate of false negatives)

For the listed names a suggestion for the golden single version of the truth could be:

“Margaret & John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
“Margaret Smith. John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
“Maria Dolores St. John Smith” stays as CONSUMER “Maria Dolores St. John Smith”
“Johnson & Johnson Limited” stays as BUSINESS “Johnson & Johnson Limited”
“Johnson & Johnson Limited, John Smith” will be split into BUSINESS “Johnson & Johnson Limited” having EMPLOYEE “John Smith”
“Johnson Furniture Inc., Sales Dept” will be split into “BUSINESS “Johnson Furniture Inc.” having “DEPARTMENT “Sales Dept”
“Johnson, Johnson and Smith Sales Training” stays as BUSINESS “Johnson, Johnson and Smith Sales Training”

For further explanation of the Master Data Types BUSINESS, CONSUMER, DEPARTMENT, EMPLOYEE you may have a look here.

Mu

7th October 20095th January 2011Henrik Gabs LiliendahlLeave a comment

The term ”Mu” has several meanings including being a lost continent. In this post I will use the meaning of “mu” being the answer to a question that can’t be answered with a simple “yes” or “no” or even “unknown” as explained on Wikipedia here.

When working with data quality you often encounter situations where the answer to a simple question must be “mu”.

Let’s say you are looking for duplicates in a customer file and have these two rows (Name, Address, City):

Margaret Smith, 1 Main Street, Anytown

Margaret & John Smith, 1 Main Street, Anytown

Is this a duplicate situation?

In a given context like preparing for a direct mail the answer could be “yes”. But in most other contexts the answer is “mu”. Here the question should be something like: How do you handle hierarchy management with these two rows? And the answer could be something like the process presented in my recent post here.

Similar considerations apply to this example (Name, Address, City):

One Truth Consultants att: John Smith, 3 Main Street, Anytown

One Truth Consultants Ltd, 3 Main Street, Anytown

And this (Contact, Company, Address, City):

John Smith, One Truth Consultants, 3 Main Street, Anytown

John Smith, One Truth Services, 3 Main Street, Anytown

The latter example is explained in more details in this post.

Settling a Match

5th October 200919th June 2010Henrik Gabs Liliendahl4 Comments

In a recent post on this blog we went trough how a process of consolidating master data could involve a match with a business directory.

Having more than a few B2B records often calls for an automated process to do that.

So, how do you do that?

Say you have a B2B record as this (Name, HouseNo, Street, City):

Smashing Estate, 1, Main Street, Anytown

The business directory has the following entries (ID, Name, HouseNo, Street, City):

1, Smashing Estates, , Central Square, Anytown
2, Smashing Holding, 1, Main Street, Anytown
3, Smashing East, 1, Main Street, Anytown
4, Real Consultants, 1, Main Street, Anytown

Several different forms of functionality are used around to settle the matter.

Here are some:

Exact match:

Here no candidates at all are found.

Match codes:

Say you make a match code on input and directory rows with:

4 first consonants in City
4 first consonants in Street
4 digit with leading zero of HouseNo
4 first consonants in Name

This makes:

Input: NTWN-MNST-0001-SMSH
Directory 1: NTWN-CNTR-0000-SMSH
Directory 2: NTWN-MNST-0001-SMSH
Directory 3: NTWN-MNST-0001-SMSH
Directory 4: NTWN-MNST-0001-RLCN

Here directory entry 2 and 3 will be considered equal hits. You may select a random automated match or forward to manual inspection.

Many other and more sophisticated match code assignments exist including phonetic match codes.

Scoring:

You may assign a similarity between each element and then calculate a total score of similarity between the input and each directory row.

Often you use a percentage like measure here where similarity 100 is exact, 90 is close, 75 is fair, 50 and below is far away.

Selecting the best match candidate with this scoring will result in directory entry 3 as the winner given we accept automated matches with score 95 (and a gap of 5 points between this and next candidate).

The assigning of similarity and calculating of total score may be (and are) implemented in many ways in different solutions.

Also the selection of candidates plays a role. If you have to select from a directory with millions of rows you may use swapped match codes and other techniques like advanced searching.

Matrix:

The following example is based on a patented method by Dun & Bradstreet.

Based on an element similarity as above you assign a match grade with a character for each element as:

A being exact or very close e.g. scores above 90
B being close e.g. scores between 50 and 90
F being no match e.g. scores below 50
Z being missing values

Including Name, HouseNo, Street and City this will make the following match grades:

Directory 1: AZFA
Directory 2: BAAA
Directory 3: BAAA
Directory 4: FAAA

Based on the match grade you have a priority list of combinations giving a confidence code, e.g.:

AAAA = 10 (High)
BAAA = 9
AZAA = 8
…
A—A = 1 (Low)

Directory entry 3 and 2 will be winners with confident code 9 remotely challenged by entry 1 with confidence code 1. Directory entry 4 is out of the game.

Satisfied?

I am actually not convinced that the winner should be directory entry 3 (or 2). I think directory entry 1 could be the one if we have to select anyone.

Adding additional elements:

While we may not have additional information in the input we may derive more elements from these elements not to say that the business directory may hold many more useful elements, e.g.

Geocoding may establish that there is a very short distance from “Central Square” to “1 Main Street” thus making directory 1 a better fit.
LOB code (e.g. SIC or NACE) may confirm that directory 2 is a holding entity which typically (but not always) is less desirable as match candidate.
Hierarchy code may tell that directory 3 is a branch entity which typically (but not always) is less desirable as match candidate.

Probabilistic learning:

Here you don’t relay on or supplement the deterministic approaches shown above with results from confirmed matching with the same elements and combination and patterns of elements.

This topic deserves a post of its own.

	Henrik Gabs Lilienda… on Balancing the Business Partner…
	Jeppe Thing Sørensen on Balancing the Business Partner…
	peolsolutions on MDM, Cloud, SaaS, PaaS, IaaS a…
	Henrik Gabs Lilienda… on Is the Holiday Season called C…
	Michael D. on Is the Holiday Season called C…
	Jay Ram on The Disruptive MDM List is…
	Henrik Gabs Lilienda… on The Intersection of Data Obser…
	Shanker on The Intersection of Data Obser…
	Bhavani Shanker on Data Matching Efficiency
	Henrik Gabs Lilienda… on Data Matching Efficiency
	Bhavani Shanker on Data Matching Efficiency
	Henrik Gabs Lilienda… on From Platforms to Ecosyst…
	Michael Fieg on From Platforms to Ecosyst…
	From Platforms to Ec… on What is Collaborative Product…
	From Platforms to Ec… on MDM and Knowledge Graph