Data Architecture – Page 25 – Liliendahl on Data Quality

Having the right element to the left

27th February 2010 | 29th May 2012 | Henrik Gabs Liliendahl | 9 Comments

Name, address and place are core attributes in almost any database. You may atomize these attributes into smaller slices, but in doing that: Mind the sequence.

When working with data matching and party master data management, some of the frequently exposed issues are:

Person name

Often a person name is split into first name and last name, but even when assigning these labels you are on slippery ground. Examples:

In some cultures, as in East Asia, the family name is written first and the given name last.
Some notations indicate that the given name isn’t the first element:
- “DUPONT Michel” is a customary French way of signalling that the family name is the first element
- “Smith, John” is a universal way of signalling that the family name is the first element

Besides that, we have issues with middle names and other three-part naming, and with salutations, education and job titles being mixed up in name fields.
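The two notations mentioned above can be recognised mechanically. Here is a minimal Python sketch, assuming a hypothetical helper called `split_name` and assuming that a comma or an all-uppercase first token marks the family name. It ignores middle names, titles and most cultural conventions, so take it as an illustration only.

```python
def split_name(raw: str) -> dict:
    """Guess family and given name from a raw name string (illustrative only)."""
    raw = raw.strip()
    # "Smith, John": a comma signals that the family name comes first.
    if "," in raw:
        family, given = (part.strip() for part in raw.split(",", 1))
        return {"family": family, "given": given, "rule": "comma"}
    tokens = raw.split()
    # "DUPONT Michel": an all-uppercase first token (longer than an initial)
    # signals that the family name comes first.
    if (len(tokens) >= 2 and len(tokens[0]) > 1
            and tokens[0].isupper() and not tokens[-1].isupper()):
        return {"family": tokens[0].title(),
                "given": " ".join(tokens[1:]),
                "rule": "uppercase"}
    # Fallback assumption: given name first, family name last.
    if len(tokens) >= 2:
        return {"family": tokens[-1],
                "given": " ".join(tokens[:-1]),
                "rule": "default"}
    return {"family": raw, "given": "", "rule": "single"}


print(split_name("DUPONT Michel"))
print(split_name("Smith, John"))
```

The fallback rule is exactly where the slippery ground is: for a name written family name first without any marker, it will guess wrong.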

Street address

Most of the world is divided into two “street address” cultures:

In the Americas you write the house number in front of street name if you are north of Rio Grande being US and CA, but you write the house number after the street name if you are south of Rio Grande being MeXico, BRazil, ARgentina and almost any other country.
In Europe you write the house number in front of street name if you are on the British Isles or in France, but you write the house number after the street name if you are in almost any other country.
The rest of the world is also divided in writing street addresses.

Besides that we have other ways of writing addresses like the block style in Japan.
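The two main street address cultures can be sketched in a few lines of Python. The helper name `split_street_line` and the sample address "Hovedgaden 1" are made up for this illustration, and the patterns assume a purely numeric house number with an optional letter suffix. Block style addresses and many other formats are not covered.

```python
import re

# House number in front of the street name, e.g. "1 Main Street".
NUMBER_FIRST = re.compile(r"^(?P<number>\d+\w*)\s+(?P<street>.+)$")
# House number after the street name, e.g. "Hovedgaden 1".
STREET_FIRST = re.compile(r"^(?P<street>.+?)\s+(?P<number>\d+\w*)$")


def split_street_line(line: str) -> dict:
    """Split a street line into street name and house number (illustrative only)."""
    line = line.strip()
    for order, pattern in (("number_first", NUMBER_FIRST),
                           ("street_first", STREET_FIRST)):
        match = pattern.match(line)
        if match:
            return {"street": match.group("street"),
                    "number": match.group("number"),
                    "order": order}
    # No house number recognised: keep the whole line as the street.
    return {"street": line, "number": "", "order": "unknown"}


print(split_street_line("1 Main Street"))
print(split_street_line("Hovedgaden 1"))
```

Once both variants are reduced to the same two elements, the sequence no longer disturbs a comparison.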

Place

Most countries have a postal code system – even Ireland will have that soon.

Despite the fact that a city name in most cases can be obtained by looking up the postal code we often do store the city name anyway – for those cases that we can’t.

And if the postal code and the city name are in one string: Oh yes, in some cultures you write the city name in front of the postal code and in other cultures you do it the opposite way. And oh no: It doesn’t necessarily follow the sequence of the house number and street name.
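A small Python sketch of handling both sequences in one string follows. It assumes purely numeric postal codes (so alphanumeric systems like the UK one are out of scope), and the helper name `split_place` and the sample values are hypothetical.

```python
import re

# Numeric postal code (possibly with inner spaces) in front of the city name.
CODE_FIRST = re.compile(r"^(?P<code>\d[\d ]*\d)\s+(?P<city>\D.*)$")
# City name in front of a numeric postal code.
CITY_FIRST = re.compile(r"^(?P<city>\D.*?)\s+(?P<code>\d[\d ]*\d)$")


def split_place(line: str) -> dict:
    """Split a combined postal code / city string (illustrative only)."""
    line = line.strip()
    for order, pattern in (("code_first", CODE_FIRST),
                           ("city_first", CITY_FIRST)):
        match = pattern.match(line)
        if match:
            return {"postal_code": match.group("code"),
                    "city": match.group("city"),
                    "order": order}
    # No postal code recognised: treat everything as the city name.
    return {"postal_code": "", "city": line, "order": "unknown"}


print(split_place("12345 Anytown"))
print(split_place("Anytown 12345"))
```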

In a blog post written a while ago we also had a look into postal address hierarchy, granularity, precision and history.

Select Company_ID from External_Source where possible

27th January 2010 | 1st September 2010 | Henrik Gabs Liliendahl | 9 Comments

With the risk of having the comment area on this blog filled up with SQL statements I will follow the track and tone from the last post called Create Table Homo_Sapiens.

In the last post some challenges around modelling people in databases were discussed with a focus on uniqueness. Now we will have a look at the same challenges with companies – the other big part of party master data.

Companies often act in the same role as individual people in business processes – not least in the role as a customer. Companies also behave as persons in a lot of ways like being born (establish), change name, relocate, marry (mergers and acquisitions), divorce (split) and decease (dissolve).

All over the world a lot of people spend their days entering and updating the data held on business partners in numerous databases. The worldwide sum of B2B connections between a customer and a vendor, each entering and maintaining data about the other, resembles (though less aggressively) the grains on a chessboard story:

2 companies both exchanging goodies makes 1+1 customers and 1+1 vendors = 4 rows
3 companies all exchanging goodies makes 2+2+2 customers and 2+2+2 vendors = 12 rows
4 companies all exchanging goodies makes 3+3+3+3 customers and 3+3+3+3 vendors = 24 rows
5 companies all exchanging goodies makes 4+4+4+4+4 customers and 4+4+4+4+4 vendors = 40 rows
n companies all exchanging goodies makes n*(n-1) customers and n*(n-1) vendors = 2*n*(n-1) rows
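The arithmetic in the list can be checked with a few lines of Python. The function name `master_data_rows` is just a label chosen for this sketch.

```python
def master_data_rows(n: int) -> int:
    """Total rows held when n companies all trade with each other."""
    customers = n * (n - 1)  # each of the n companies records n-1 customers
    vendors = n * (n - 1)    # and each also records n-1 vendors
    return customers + vendors


# Rows for 2, 3, 4 and 5 companies: 4, 12, 24 and 40, as in the list above.
for n in (2, 3, 4, 5):
    print(n, master_data_rows(n))
```

The growth is quadratic rather than exponential, which is why it is less aggressive than the chessboard story.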

Last time I checked, the D&B WorldBase held more than 150 million companies. Some are dissolved and, fortunately (?), not everyone does business with everyone – but as said, the sum of B2B connections is huge and the work of entering and maintaining the master data seems overwhelming.

If we look at one single company and how it may be represented in databases around the world, there will be lots of variations even when only basic data such as name and address is taken into account. Even in the same table the same real world company often occupies several rows spelled differently.

One of the most effective methods for avoiding duplicates of company master data is plugging into a business directory. By using an externally sourced company ID as a key in your master data you are able to hold a golden record of that real world entity. As a bonus you are offered updates and access to a lot of additional data you would never be able to collect yourself.
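As a rough Python sketch of the idea, the rows below are grouped on an external company ID where one is available. The rows, the field name `external_id` and the ID values are all invented for illustration and do not come from any real directory.

```python
from collections import defaultdict

# Hypothetical internal rows; "external_id" stands for a key taken from
# a business directory. None means the row could not be resolved.
rows = [
    {"row": 1, "name": "Acme Ltd", "external_id": "X-1001"},
    {"row": 2, "name": "ACME Limited", "external_id": "X-1001"},
    {"row": 3, "name": "Acme Inc.", "external_id": None},
]


def group_by_external_id(rows):
    """Group rows sharing an external ID; keep unresolved rows apart."""
    golden = defaultdict(list)
    unresolved = []
    for row in rows:
        if row.get("external_id"):
            golden[row["external_id"]].append(row["row"])
        else:
            unresolved.append(row["row"])
    return dict(golden), unresolved


golden, unresolved = group_by_external_id(rows)
print(golden)       # rows 1 and 2 share one real world company
print(unresolved)   # row 3 still needs matching by name and address
```

The unresolved list is the "where possible" part: rows without an external ID still need classic matching.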

Create Table Homo_Sapiens

23rd January 2010 | 27th March 2012 | Henrik Gabs Liliendahl | 19 Comments

Create Table is a basic statement in the SQL language which is the most widespread computer language used when structuring data in databases.

The most common entity in databases must be rows representing real world human beings (Homo Sapiens) and the different groups we form. Tables for that could have the name Homo_Sapiens but are usually called Customer, Member, Citizen, Patient, Contact and so on.

The most common data quality issues are related to accuracy, validity, timeliness, completeness and not least uniqueness of the data we hold about people.

In databases tables are supposed to have a unique primary key. There are two basic types of primary keys:

Surrogate keys are typically numbers with no relation (and binding) to the real world. They are made invisible to the users of the applications operating on the database.
Natural keys are derived from existing codes or other data identifying an entity in the real world or made for that purpose. They are visible to users and part of electronic, written and verbal communication.

As surrogate keys obviously don’t help with real world uniqueness and there is no common global natural key for all human beings on earth, we have a challenge in creating a good primary key for a Homo Sapiens table.

Inside a given country we have different forms of citizen IDs (national identification numbers) with widely varying terms of use between countries. But even in Scandinavia, where I live and where we have widespread use of unique citizen IDs, most tables that could have the name Homo_Sapiens cannot use a citizen ID as (unique) primary key, for several reasons, one being that the data is not present in a lot of situations.

Most often we name the tables holding data about human beings by the role people will act in within the purpose of use for the data we collect. For example, a Customer table. A customer may be an individual but also a household or a business entity. A human being may be a private consumer but also an employee at a business making a purchase or a business owner making both private purchases and business purchases.

Every business activity always comes down to interacting with individual persons. But as our data is collected for the different roles that individual may have acted in, we have a need for viewing these data related to single human beings. The methods for facilitating this come in different flavours, such as:
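To make the difference concrete, here is a small sketch using Python's built-in `sqlite3` module. The column names and the citizen ID value are invented for the example. The table has a surrogate key and an optional natural key, and the surrogate key happily accepts what may be the same person twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Homo_Sapiens (
        person_id  INTEGER PRIMARY KEY,  -- surrogate key, no real world meaning
        citizen_id TEXT UNIQUE,          -- natural key, often not available
        name       TEXT NOT NULL
    )
""")
insert = "INSERT INTO Homo_Sapiens (citizen_id, name) VALUES (?, ?)"

# Without a natural key, nothing stops two rows for the same human being.
conn.execute(insert, (None, "Margaret Smith"))
conn.execute(insert, (None, "Peggy Smith"))

# With a natural key present, the database can enforce uniqueness.
conn.execute(insert, ("ID-0001", "John Smith"))
try:
    conn.execute(insert, ("ID-0001", "J. Smith"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM Homo_Sapiens").fetchone()[0]
print(row_count, duplicate_rejected)
```

Only the row carrying a citizen ID is protected against duplication. The two rows without one both get a fresh surrogate key.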

Deduplication is the classic term used for describing processes where records are linked, merged or purged in order to make a golden copy having only one (parent) database row for each individual person (and other legal entities). This is usually done by matching data elements in internal tables with names and addresses within a given organisation.
Identity Resolution is about the same but – if a distinction is considered to exist – uses a wider range of data, rules and functionality to relate collected data rows to real world entities. In my eyes exploiting external reference data will add considerable efficiency in the years to come within deduplication / identity resolution.
Master Data Hierarchy Management again has the same goal of establishing a golden copy of collected data, with an emphasis on reflecting the complex structure of relationships in the real world as well as the related history.

Next time I am involved in a data modelling exercise I will propose a Homo_Sapiens table. I wonder about the odds of buy-in from other business and technical delegates.

Diversity in City Names

17th January 2010 | 17th August 2010 | Henrik Gabs Liliendahl | 8 Comments

The metro area I live in is called Copenhagen – in English. The local Danish name is København. When I go across the bridge to Sweden the road signs point back at the Swedish variant of the name being Köpenhamn. When the new bridge from Germany to east Denmark is finished the road signs on the German side will point at Kopenhagen. A flight from Paris has the destination Copenhague. From Rome it is Copenaghen. The Latin name is Hafnia.

These language variants of city (and other) names are a challenge in data matching.

If a human is doing the matching, the match may be made because that person knows about the language variations. This is a strength of human processing. But it is also a weakness if another person doesn’t know about the variations, making the matching inconsistent because the same results are not repeated.

Computerized match processing may handle the challenge in different ways, including:

The data model may reflect the real world by having places described by multiple names in given languages.
Some data matching solutions use synonym listing for this challenge.
Probabilistic learning is another way. The computer finds a similarity between two sets of data describing an entity but with a varying place name. A human may confirm the connection and the varying place names then will be included in the next automated match.
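The synonym listing approach could look something like the Python sketch below, using the name variants mentioned in this post. The choice of København as the canonical form and the function names are assumptions made for the illustration.

```python
# Known language variants mapped to one canonical name (here the local one).
CITY_SYNONYMS = {
    "copenhagen": "København",
    "københavn": "København",
    "köpenhamn": "København",
    "kopenhagen": "København",
    "copenhague": "København",
    "copenaghen": "København",
    "hafnia": "København",
}


def canonical_city(name: str) -> str:
    """Return the canonical name if the variant is listed, else the input."""
    cleaned = name.strip()
    return CITY_SYNONYMS.get(cleaned.casefold(), cleaned)


def same_city(a: str, b: str) -> bool:
    """Compare two city names after synonym lookup, ignoring case."""
    return canonical_city(a).casefold() == canonical_city(b).casefold()


print(same_city("Copenhague", "Köpenhamn"))
print(same_city("Copenhagen", "Paris"))
```

Unlike the human matcher, the list gives the same answer every time, but only for variants somebody has entered.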

As globalization moves forward, data matching solutions have to deal with diversity in data. A solution may have worked wonders yesterday with domestic data but will be useless tomorrow with international data.

Postal Address Hierarchy, Granularity, Precision and History

15th November 2009 | 21st June 2010 | Henrik Gabs Liliendahl | 12 Comments

In my last blog post the term “single version of the truth” was discussed. Some prerequisites for having raw data stored in one version that meets all known purposes are that:

They are kept with the granularity needed for all purposes
They have the highest precision needed for any of the purposes
They reflect all time states asked for regarding all purposes

In the following I will go through some challenges with postal addresses. Don’t take this as an attempt to list all challenges in the world around this subject – it is only what I have been up to.

Countries

The country is the highest level in the address hierarchy. A source of truth may be a list of ISO 2 character country codes. But there are other lists, and between these lists there are different perceptions of the fact that even countries are internally in hierarchies. Some examples, related to the Olympic contest that my last blog post touched on, are:

York (the old one) is placed in England – or is it Great Britain – or is it United Kingdom?
Referring to United States of America may or may not include Puerto Rico, US Virgin Islands, Guam, American Samoa and Northern Mariana Islands.
The Kingdom of Denmark is not Denmark but Denmark, Faroe Islands and Greenland.

An example of a very slow changing dimension in here is that US Virgin Islands was part of the Kingdom of Denmark until 1917.

I had a great deal of fun with country codes and names when setting up a data matching solution around the D&B WorldBase and the world picture kept in there opposite to what is contained in other data samples.

States

Some countries have states, some countries have provinces and some other countries don’t have states or provinces. In some countries the state is a mandatory part of a postal address like in the US. In other countries having states the state is not a part of a printed address like in Germany, but you may have other purposes for storing the data anyway.

Postal codes and districts

Often local postal code systems are translated to the term ZIP-code – but ZIP code is actually the name of the US system.

The granularity of postal code systems differs a lot around the world. The UK postal codes are very specific while a postal code in other countries may refer to a large city. In most countries the postal code system is a hierarchy of numbers. The UK system is different. The Irish situation is very different – no postal codes until now.

In many countries companies are assigned a postal code of their own. The same goes for post office box addresses. In France the name of the referring district is followed by the word CEDEX for these addresses. So, be careful when matching or grouping city names in French addresses. Paris, not Cedex, is the centre of the universe in that country.
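A careful way of grouping French city names could start by stripping the CEDEX suffix, as in this Python sketch. It assumes the suffix is the word CEDEX at the end of the city line, optionally followed by a number. The helper name `strip_cedex` and the sample values are made up.

```python
import re

# "CEDEX" at the end of the city line, optionally followed by a number.
CEDEX_SUFFIX = re.compile(r"\s+CEDEX(\s+\d+)?\s*$", re.IGNORECASE)


def strip_cedex(city_line: str) -> str:
    """Remove a trailing CEDEX marker so city names can be grouped."""
    return CEDEX_SUFFIX.sub("", city_line.strip())


for line in ("PARIS CEDEX 08", "Paris Cedex", "Paris"):
    print(repr(strip_cedex(line)))
```

All three sample lines end up with the same city name once upper and lower case are ignored.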

Locations, streets, blocks, house names, whatever

A lot of different hierarchies at various levels exist around the world – and the customary sequence also varies. This is too complex and comprehensive a subject for a blog post. So I will only emphasise a few selected subjects:

Vanity addressing is a phenomenon, not least in the UK where keeping up appearances rules. Here you may have to include a lie in the single version of truth.
Coding rules in my home country Denmark as we have a way of assigning a unique code to every real world entity. It helps with automated taxation. So a main road in central Copenhagen may be known to people as “H.C. Andersens Boulevard” but is stored in any mature database as “1010148”.
When matching party entities don’t make a false negative with an entity having a visit (geographical) address versus an entity having a mail address.

Entrances

Entrance – most often referred to as house number – is where addressing meets geocoding. Here, by using geocodes, you can point to an exact value identifying an address. When comparing with other addresses you just have to make sure whether you are talking latitude/longitude in a round world, WGS84 x-y coordinates or another geographic coordinate system in a flat world, and whether the point is the centre of the building, the door, the spot where a public road is reachable or an interpolated value.

Units

Larger buildings, high-rise buildings and skyscrapers are usually not one address but an entrance having multiple family apartments and/or multiple business addresses. These may be presented in many formats and in many depths including floors, sides, door numbers, you name it.

Large business entities may occupy a range of entrances.

Some entrances may at first impression look like a single address occupied by a nuclear family, but are in fact a nursing home or a campus occupied by a number of named individuals living at the same address.

Data models

The postal (geographical and mailing) address elements are in many data models just some of the attributes in a party entity. By separating the postal address elements into a specific entity with granulated attributes you will be more aligned with the real world and thereby have a better chance of fulfilling all purposes with the raw data. One of the most obvious advantages will be history tracking, as businesses and consumers/citizens relocate from time to time.
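A minimal Python sketch of such a model is shown here, with the address as an entity of its own and a dated link from party to address. The class names, the attributes and the sample data are all assumptions chosen for illustration, not a recommended schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Address:
    country: str
    postal_code: str
    city: str
    street: str
    house_number: str
    unit: str = ""


@dataclass
class Residence:
    """Links a party to an address for a period of time."""
    party: str
    address: Address
    valid_from: date
    valid_to: Optional[date] = None  # None means still current


def address_on(history: List[Residence], party: str, day: date) -> Optional[Address]:
    """Return the address a party had on a given day, if known."""
    for r in history:
        if (r.party == party and r.valid_from <= day
                and (r.valid_to is None or day < r.valid_to)):
            return r.address
    return None


old = Address("XX", "12345", "Anytown", "Main Street", "1")
new = Address("XX", "12345", "Anytown", "Main Street", "4", unit="2nd floor")
history = [
    Residence("Margaret Smith", old, date(2001, 1, 1), date(2008, 6, 1)),
    Residence("Margaret Smith", new, date(2008, 6, 1)),
]
print(address_on(history, "Margaret Smith", date(2005, 3, 15)))
print(address_on(history, "Margaret Smith", date(2009, 11, 15)))
```

A relocation becomes a new link row instead of an overwrite, so the earlier state stays available for the purposes that need it.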

Who is working where doing what?

8th November 2009 | 24th July 2010 | Henrik Gabs Liliendahl | 2 Comments

A classic core data model for Master Data in CRM databases and Master Data hubs when doing B2B is that you have:

Accounts being the BUSINESS entities who are your customers, prospects and all kinds of other business partners
Contacts being the EMPLOYEEs working there and acting in the roles as decision makers, influencers, gate keepers, users and so on – and having some kind of job title

Establishing and maintaining optimal data quality in B2B records is often done by integrating with external reference data.

Available sources for the account layer have been in place for many years as business directories. The D&B Worldbase is one example but there are plenty around with varying scopes. Those directories offered by service providers often also cover the contact layer. But actuality has always been a problem and depth (completeness) has been limited, not least with large business entities. So in most cases I have witnessed, only the account level has been integrated with external reference data, while the use of external contact layer data has been limited to new market campaigns (with varying results).

With the rise of social network sites, information about employees is made more or less available to anyone. Last time I checked on LinkedIn (mid-October), the ratio of profiles to population was:

Denmark had 435,628 profiles, population 5,519,441 giving a ratio of 7.89 %.
Netherlands had 1,278,927 profiles, population 16,500,156 giving a ratio of 7.75 %.
USA had 23,089,079 profiles, population 307,698,000 giving a ratio of 7.50 %.

Other countries I checked had lower ratios but fast-increasing numbers. All in all a formidable source of reference data for the contact layer.
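For those who want to redo the arithmetic, this short Python snippet recomputes the three ratios from the profile and population figures quoted above.

```python
# (profiles, population) as quoted in the list above.
figures = {
    "Denmark": (435_628, 5_519_441),
    "Netherlands": (1_278_927, 16_500_156),
    "USA": (23_089_079, 307_698_000),
}

# Ratio of profiles to population in percent, rounded to two decimals.
ratios = {
    country: round(100 * profiles / population, 2)
    for country, (profiles, population) in figures.items()
}

for country, ratio in ratios.items():
    print(f"{country}: {ratio:.2f} %")
```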

Of course there are data quality issues with social networking sites. Data are maintained by the persons themselves which most often means good actuality and validity – but sometimes also means exaggeration and deceit. And yes, there are duplicate profiles.

Doing Social CRM is already hot stuff. Social MDM – in the meaning of exploiting social network reference data – will follow.

Slowly Changing Hierarchies

4th November 2009 | 23rd June 2010 | Henrik Gabs Liliendahl | 4 Comments

The term “slowly changing dimensions” is known from building data warehouses and attempting to make sense of data with business intelligence using reference data.

The fact that the world is changing all the time is also present when we look at Master Data Management and the essential hierarchy building taking place when structuring these data.

Company family trees are a common hierarchy structure in Master Data. One source of information about company family trees is the D&B Worldbase – a database operated by Dun & Bradstreet holding over 150 million business entities from all over the world.

I used to have Dun & Bradstreet as a customer. I don’t have that anymore – but I’m still working with the very same project. Because since I started this assignment US based Dun & Bradstreet handed over the operation in a range of European countries to the Swedish publishing group Bonnier. They later handed it over to Swedish company Bisnode. I started the project when I worked for Swedish consultancy group Sigma, continued in my Danish sole proprietorship and now serve Bisnode through German data quality tool vendor Omikron. Slowly changing relationships indeed.

As with many other activities in the realm of data quality, establishing the “golden view”, “the single version of the truth”, is only the beginning. If that “golden view” is not put under ongoing maintenance the shiny gold will fade – slowly but steadily.

360° Business Partner View

1st November 2009 | 6th July 2010 | Henrik Gabs Liliendahl | 2 Comments

Having a 360° customer view is a well established term in CRM and Master Data Management. It is typically defined as “providing everyone in the organization with a consistent view of the customer.”

Then some organizations don’t use the term customer but other words like:

Citizen is the common term in public sector organizations when dealing with private persons
Patient is used in healthcare and the customer/citizen balance is different between countries around the world
Member is used in membership organizations like fundraising and those organizing employers and employees

The concept of a 360° customer view is in my eyes easily swapped with a 360° citizen / patient / member view.

Also related to the position in the pipeline we have words such as:

Prospect being an entity with whom we have a 1-1 dialogue about becoming a customer
Lead being an entity we want to engage in such a dialogue

I think embracing prospects and leads is a must for a 360° customer view. Having the same real world object acting as a customer and a prospect/lead at the same time doesn’t make sense.

Hierarchy is of course important here, as the customer and the prospect or lead may belong to the same hierarchy but at a different level or only seen at a higher level. This is true for:

Households in B2C operations
Company family trees in B2B operations
Multiple employee engagements in B2B operations
Small business owners in B2B and B2C coexisting environments

Organizations also have suppliers. In a B2B organization the intersection of business partners being customers / prospects / leads and also suppliers may be surprisingly large. Typically the intersection is not that large seen at branch level but higher if we take a look at the ultimate global mother level.

From my point of view a 360° customer view should be made on consolidated customer and supplier hierarchies in B2B. Even in B2C a private customer may be a business owner or key employee at a supplier.

Employees are another master data entity that may have an intersection with customers and suppliers. Having an employee being a (or spouse of a) business owner at a small business supplier is a classic cause of trouble. I have seen situations where a 360° customer view could include employee entities.

Other Business Partner entities exist, depending on industry and specific business operations, where a 360° customer view would benefit from catching up on other real world party entities.

I think Data Matching and/or upstream prevention by error tolerant search have a busy near future.

Master Data Survivorship

28th October 2009 | 2nd July 2010 | Henrik Gabs Liliendahl | 1 Comment

A Master Data initiative is often described as making a “golden view” of all Master Data records held by an organization in various databases used by different applications serving a range of business units.

In doing that (either in the initial consolidation or the ongoing insertion and update) you will time and again encounter situations where two versions of the same element must be merged into one version of the truth.

In some MDM hub styles the decision is to be taken at consolidation time, in other styles the decision is postponed until the data (links) are consumed in a given context.

In the following I will talk about Party Master Data being the most common entity in Master Data initiatives.

This spring Jim Harris made a brilliant series of articles on DataQualityPro on the subject of identifying duplicate customers, ending with part number 5 dealing with survivorship. Here Jim describes all the basic considerations on how some data elements survive a merge/purge and others will be forgotten, and gives good examples with US consumers/citizens.

Taking it from there Master Data projects may have the following additional challenges and opportunities:

Global Data adds diversity into the rule set for consolidating data on record level as well as field level. You will have to compromise between simple global rules and complex optimized rules (and supporting knowledge data) for each country/culture.
Multiple types of Party Master Data must be handled when Business Partners include business entities having departments and employees, and not least when they are present together with consumers/citizens.
External Reference Data is becoming more and more common as part of MDM solutions, adding valid, accurate and complete information about Business Partners. Here you have to set rules (on field level) for whether they override internal data, fill in the blanks or only supplement internal data.
Hierarchy building is closely related to survivorship. Rules may be set for whether two entities go into two hierarchies with surviving parts from both or merge as one with survivorship. Even an original entity may be split into two hierarchies with surviving parts.

What is essential in survivorship is not losing any valuable information while not creating information redundancy.
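The three field level options for external reference data (override, fill in the blanks, supplement) can be sketched in Python as below. The rule names, the `survive` function, the `_external` suffix and the phone number are hypothetical. The company names are borrowed from the example later in this post.

```python
def survive(internal: dict, external: dict, rules: dict) -> dict:
    """Merge external reference data into an internal record, field by field."""
    result = dict(internal)
    for field, ext_value in external.items():
        if not ext_value:
            continue  # nothing to contribute for this field
        rule = rules.get(field, "supplement")
        if rule == "override":
            # External value always wins.
            result[field] = ext_value
        elif rule == "fill_blank":
            # External value is used only where internal data is missing.
            if not internal.get(field):
                result[field] = ext_value
        else:
            # "supplement": keep the internal value, store external alongside.
            result[field + "_external"] = ext_value
    return result


internal = {"name": "Local Charity", "phone": "", "address": "1 Main Str"}
external = {"name": "Anytown Angels", "phone": "555-0100", "address": "1 Main Street"}
rules = {"name": "override", "phone": "fill_blank", "address": "supplement"}

print(survive(internal, external, rules))
```

Keeping the supplemented value next to the internal one loses nothing, at the price of some redundancy that later rules have to sort out.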

An example of complex survivorship processing may be this:

A membership database holds the following record (Name, Address, City):

Margaret & John Smith, 1 Main Street, Anytown

An eShop system has the following accounts (Name, Address, Place):

Mrs Margaret Smith, 1 Main Str, Anytown
Peggy Smith, 1 Main Street, Anytown
Local Charity c/o Margaret Smith, 1 Main Str, Anytown

A complex process of consolidation including survivorship may take place. As part of this example the company Local Charity is matched with an external source telling it has a new name being Anytown Angels. The result may be this “golden view”:

ADDRESS in Anytown on Main Street no 1 having
• HOUSEHOLD having
– CONSUMER Mrs. Margaret Smith aka Peggy
– CONSUMER Mr. John Smith
• BUSINESS Anytown Angels having
– EMPLOYEE Mrs. Margaret Smith aka Peggy

Observe that everything survives in a globally applicable structure, in a fit hierarchy reflecting local rules, handling multiple types of party entities and using external reference data.

But OK, we didn’t have funny names, dirt, misplaced data…..

Business Rules and Duplicates

10th October 2009 | 10th October 2010 | Henrik Gabs Liliendahl | 2 Comments

When finding or avoiding duplicates, or doing similar kinds of consolidation with party master data, you will encounter lots of situations where it is disputable what to do.

The “politically correct” answer is: Depends on your business rules.

Yeah, right. Easier said than done.

Often you face the following:

Business rules don’t exist. Decisions are based on common sense.
Business rules differ between data providers.

Let’s have an example.

We have these business rules (Owner, Brief):

Finance, No sales and deliveries to dissolved business entities

Logistics, Access to premises must be stated in Address2 if different from Address1

Sales, Every event must be registered with an active contact

Customer Service, In case of duplicate contacts the contact with the first event date wins

In a CRM system we have these 2 accounts (AccountID, CompanyName, Address1, Address2, City):

1, Restaurant San Remo, 2 Main Street, entrance thru no 4, Anytown

2, Ristorante San Remo, 2 Main Street, , Anytown

Also we have some contacts (AccountID, ContactID, JobTitle, ContactName, Status, StartYear, EventCount):

1, 1, Manager, Luigi Calda, Inactive, 2001, 2

1, 2, Chef de la Cusine, John Hothead, Active, 2002, 87

2, 1, Chef de la Cuisine, John Hothead, Duplicate, 2008, 2

2, 2, Owner, Gordon Testy, Active, 2008, 7

We are so lucky that a business directory is available now. Here we have (NationalID, Name, Address, City, Owner, Status):

3, Ristorante San Remo, 2 Main Street, Anytown, Luigi Calda, Dissolved

4, Ristorante San Remo, 2 Main Street, Anytown, Gordon Testy, Active

So, I don’t think we will produce a golden view of this business relationship based on the data (structure) available and the business rules available.
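One way to see the deadlock is to try a naive lookup of the two CRM accounts in the business directory. The Python sketch below uses the rows from the example. Matching on address and city only is an assumption made for the illustration, as is the function name `directory_candidates`.

```python
crm_accounts = [
    {"AccountID": 1, "CompanyName": "Restaurant San Remo",
     "Address1": "2 Main Street", "City": "Anytown"},
    {"AccountID": 2, "CompanyName": "Ristorante San Remo",
     "Address1": "2 Main Street", "City": "Anytown"},
]

directory = [
    {"NationalID": 3, "Name": "Ristorante San Remo", "Address": "2 Main Street",
     "City": "Anytown", "Owner": "Luigi Calda", "Status": "Dissolved"},
    {"NationalID": 4, "Name": "Ristorante San Remo", "Address": "2 Main Street",
     "City": "Anytown", "Owner": "Gordon Testy", "Status": "Active"},
]


def directory_candidates(account: dict, directory: list) -> list:
    """Return the NationalIDs of directory entries at the account's address."""
    return [entry["NationalID"] for entry in directory
            if entry["Address"] == account["Address1"]
            and entry["City"] == account["City"]]


for account in crm_accounts:
    print(account["AccountID"], directory_candidates(account, directory))
```

Both accounts come back with the dissolved and the active entity as candidates, so the Finance rule about dissolved business entities cannot be applied from these data alone.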

Building and aligning business rules and data structures to solve this example – and a lot of other examples with different challenges – may seem difficult and is often omitted in the name of simplicity. But:

Master data – not least business partners – is a valuable asset in the enterprise, so why treat it with simplicity while we do complex handling with a lot of other (transaction) data?
Common sense may help you a lot. Many of these questions are not specific to your business but are shared among most other enterprises in your industry and many others in the whole real world.
I guess the near future will bring an increased number of available services, with software and external data support, that help a lot in selecting common business rules and applying these in the master data processing landscape.
