Liliendahl on Data Quality

Postal Address Hierarchy, Granularity, Precision and History

15th November 200921st June 2010Henrik Gabs Liliendahl12 Comments

In my last blog post the term “single version of the truth” was discussed. Some prerequisites for having raw data stored in one version that meets all known purposes are that:

They are kept with the granularity needed for all purposes
They have the most advanced precisions with all purposes
They reflect all time states asked for regarding all purposes

In the following I will go through some challenges with postal addresses. Don’t take this as an attempt to list all challenges in the world around this subject – it is only what I have been up to.

Countries

The country is the highest level in the address hierarchy. A source of truth may be a list of ISO 2 character country codes. But there are other lists and between these lists there a different perceptions of the fact that even countries are internally in hierarchies. Some examples related to the Olympic contest as my last blog post was part of are:

York (the old one) is placed in England – or is it Great Britain – or is it United Kingdom?
Referring to United States of America may or may not include Puerto Rico, US Virgin Islands, Guam, Samoa and Northern Mariana Islands.
The Kingdom of Denmark is not Denmark but Denmark, Faroe Islands and Greenland.

An example of a very slow changing dimension in here is that US Virgin Islands was part of the Kingdom of Denmark until 1917.

I had a great deal of fun with country codes and names when setting up a data matching solution around the D&B WorldBase and the world picture kept in there opposite to what is contained in other data samples.

States

Some countries have states, some countries have provinces and some other countries don’t have states or provinces. In some countries the state is a mandatory part of a postal address like in the US. In other countries having states the state is not a part of a printed address like in Germany, but you may have other purposes for storing the data anyway.

Postal codes and districts

Often local postal code systems are translated to the term ZIP-code – but ZIP code is actually the name of the US system.

The granularity of postal code systems differs a lot around the world. The UK postal codes are very specific while a postal code in other countries may refer to a large city. In most countries the postal code system is a hierarchy of numbers. The UK system is different. The Irish is very different – no postal codes until now.

In many countries companies are assigned a postal code of their own. The same goes for post office box addresses. In France the name of the referring district is followed by the word CEDEX for these addresses. So, be careful when matching or grouping city names in French addresses. Paris not Cedex is the centre of the universe in that country.

Locations, streets, blocks, house names, whatever

A lot of different hierarchies in various levels exist around the world – and the custom sequence also varies. This is a too complex and comprehensive subject for a blog post. So I will only emphasis a few selected subjects:

Vanity addressing is a phenonemen not at least in the UK where keeping up appearances rules. Here you may have to include a lie in the single version of truth.
Coding rules in my home country Denmark as we have a way of assigning a unique code to every real world entity. It helps with automated taxation. So a main road in central Copenhagen may be known to people as “H.C. Andersens Boulevard” but is stored in any mature database as “1010148”.
When matching party entities don’t make a false negative with an entity having a visit (geographical) address versus an entity having a mail address.

Entrances

Entrance – most often referred to as house number – is where addressing meets geocoding. Here you by using geocodes can point to an exact value identifying an address. When comparing with other addresses you just have to make sure whether you are talking latitude/longitude in a round world or WGS84 x-y coordinates or other geographic coordinate systems in a flat world and whether we are pointing at the centre of the building, at the door, at the spot where a public road is reachable or it is interpolated values.

Units

Larger buildings, high rising buildings and skyscrapers are usually not one address but is an entrance having multiple family apartments and/or multiple business addresses. These may be presented in many formats and in many depths including floors, sides, door numbers, you name it.

Large business entities may occupy a range of entrances.

Some entrances may in first impression look like a single address occupied by a nuclear family, but are in fact a nursing home or a campus occupied by a number of named individuals living on the same address.

Data models

The postal (geographical and mailing) address elements are in many data models just some of the attributes in a party entity. By separating the postal address elements in a specific entity with granulated attributes you will be more aligned with the real world and thereby have a better chance of fulfilling all purposes with the raw data. One of the most obvious advantages will be history tracking as business’ and consumers/citizens relocates from time to time.

Sharing data is key to a single version of the truth

12th November 200920th October 2010Henrik Gabs Liliendahl10 Comments

This post is involved in a good-natured contest (i.e., a blog-bout) with two additional bloggers: Charles Blyth and Jim Harris. Our contest is a Blogging Olympics of sorts, with the Great Britain, United States and Denmark competing for the Gold, Silver, and Bronze medals in an event we are calling “Three Single Versions of a Shared Version of the Truth.”

Please take the time to read all three posts and then vote for who you think has won the debate (see poll below). Thanks!

My take

According to Wikipedia data may be of high quality in two alternative ways:

Either they are fit for their intended uses
Or they correctly represent the real-world construct to which they refer

In my eyes the term “single version of the truth” relates best to the real-world way of data being of high quality while “shared version of the truth” relates best to the hard work of making data fit for multiple intended uses of shared data in the enterprise.

My thesis is that there is a break even point when including more and more purposes where it will be less cumbersome to reflect the real world object rather than trying to align all known purposes.

The map analogy

In search for this truth we will go on a little journey around the world.

For a journey we need a map.

Traditionally we have the challenge that the real-world being the planet Earth is round (3 dimensions) but a map shows a flat world (2 dimensions). If a map shows a limited part of the world the difference doesn’t matter that much. This is similar to fitting the purpose of use in a single business unit.

If the map shows the whole world we may have all kind of different projections offering different kind of views on the world having some advantages and disadvantages. A classic world map is the rectangle where Alaska, Canada, Greenland, Svalbard, Siberia and Antarctica are presented much larger than in the real-world if compared to regions closer to equator. This is similar to the problems in fulfilling multiple uses embracing all business units in an enterprise.

Today we have new technology coming to the rescue. If you go into Google Earth the world indeed looks round and you may have any high altitude view of a apparently round world. If you go closer the map tends to be more and more flat. My guess is that the solutions to fit the multiple uses conondrum will be offered from the cloud.

Exploiting rich external reference data

But Google Earth offers more than powerfull technolgy. The maps are connected with rich information on places, streets, companies and so on obtained from multiple sources – and also some crowdsourced photos not always placed with accuracy. Even if external reference data is not “the truth” these data, if used by more and more users (one instance, multiple tenants), will tend to be closer to “the truth” than any data collected and maintained solely in a single enterprise.

Shared data makes fit for pupose information

You may divide the data held by an enterprise into 3 pots:

Global data that is not unique to operations in your enterprise but shared with other enterprises in the same industry (e.g. product reference data) and eventually the whole world (e.g. business partner data and location data). Here “shared data in the cloud” will make your “single version of the truth” easier and closer to the real world.
Bilateral data concerning business partner transactions and related master data. If you for example buy a spare part then also “share the describing data” making your “single version of the truth” easier and more accurate.
Private data that is unique to operations in your enterprise. This may be a “single version of the truth” that you find superior to what others have found, data supporting internal business rules that make your company more competitive and data referring to internal events.

While private and then next bilateral data makes up the largest amount of data held by an enterprise it is often seen that it is data that could be global that have the most obvious data quality issues like duplicated, missing, incorrect and outdated party master data information.

Here “a global or bilateral shared version of the truth” helps approaching “a single version of the truth” to be shared in your enterprise. This way accurate raw data may be consumed as valuable information in a given context at once when needed.

Call to action

If not done already, please take the time to read posts from fellow bloggers Charles Blyth and Jim Harris and then vote for who you think has won the debate. A link to the same poll is provided on all three blogs. Therefore, wherever you choose to cast your vote, you will be able to view an accurate tally of the current totals.

The poll will remain open for one week, closing at midnight on 19^th November so that the “medal ceremony” can be conducted via Twitter on Friday, 20^th November. Additionally, please share your thoughts and perspectives on this debate by posting a comment below. Your comment may be copied (with full attribution) into the comments section of all of the blogs involved in this debate.

Vote here.

Who is working where doing what?

8th November 200924th July 2010Henrik Gabs Liliendahl2 Comments

A classic core data model for Master Data in CRM databases and Master Data hubs when doing B2B is that you have:

Accounts being the BUSINESS entities who are your customers, prospects and all kind of other business partners
Contacts being the EMPLOYEEs working there and acting in the roles as decision makers, influencers, gate keepers, users and so on – and having some kind of job title

Establishing and maintaining an optimal data quality with B2B records are often done by integrating with external reference data.

Available sources for the account layer have been in place for many years as business directories. The D&B Worldbase is one example but there are plenty around with varying scopes. Those directories offered by service providers often also covers the contact layer. But actuality has always been a problem and depth (completeness) have been limited not at least with large business entities. So in most cases I have witnessed only the account level has been integrated with external reference data while the use of external contact layer data have been limited to new market campaigns (with varying results).

With the rise of social network sites information about employees are made more or less available to anyone. Last time (mid-October) I checked on LinkedIn the rate of profiles compared to population was:

Denmark had 435,628 profiles, population 5,519,441 giving a ratio of 7.89 %.
Netherlands had 1,278,927 profiles, population 16,500,156 giving a ratio of 7.75 %
USA had 23,089,079 profiles, population 307,698,000 giving a ratio of 7.50 %.

Other countries I checked had lesser ratios but fast increasing numbers. All in all a formidable source of reference data for the contact layer.

Of course there are data quality issues with social networking sites. Data are maintained by the persons themselves which most often means good actuality and validity – but sometimes also means exaggeration and deceit. And yes, there are duplicate profiles.

Doing Social CRM is already hot stuff. Social MDM – in the meaning of exploiting social network reference data – will follow.

Data Quality and Climate Politics

6th November 200924th June 2010Henrik Gabs Liliendahl8 Comments

In 1 month and 1 day the United Nations Climate Change Conference commence in my hometown Copenhagen. Here the people of the Earth will decide if we want to save the planet now or we will wait a while and see what happens.

The Data Quality issue might seem of little importance compared to the climate issue. Nevertheless I have been thinking about some similarities between Data Governance/ Data Quality and climate politics.

It goes like this:

CEO buy-in

It’s often said that CEO’s don’t buy-in on data quality improvements because it’s a loser’s game. In climate politics the CEO’s are the heads of states. It’s still a question how many heads of state who will attend the Copenhagen conference. There is a great deal of attention around whether United States president Barack Obama will attend. His last visit to Copenhagen in early October didn’t turn out as a success as his recommendation for Chicago as Olympic host city was fruitless. I guess he will only come again if success is very likely.

Personal agendas

On the other hand British Prime Minister Gordon Brown has urged all world leaders to come to Copenhagen. While I think this is great for the conference being a success I also have a personal reason to think, that it’s a very bad idea. Having all the world heads of states driving around in the Copenhagen streets surrounded by a horde of police bikes will make traffic jams interfering with my daily work and more seriously my Christmas shopping.

It’s no secret that much of the climate problem is caused by us as individuals not being more careful about our energy consumption in daily routines. Data Quality is all the same about individuals not thinking ahead but focusing on having daily work done as quickly and comfortable as possible.

The business perspective

My fellow countryman Bjørn Lomborg is a prominent proponent of the view of focusing more on battling starvation, diseases and other evils because the resources will be spent more effective here than the marginal effects the same resources will have on fighting changing climate.

Data Quality improvement is often omitted from Business Process Reengineering when the scope of these initiatives is undergoing prioritizing focusing on worthy measurable short term wins.

Final words

My hope for my planet – and my profession – is that we are able to look ahead and do what is best for the future while we take personal responsibility and care in our daily work and life.

Slowly Changing Hierarchies

4th November 200923rd June 2010Henrik Gabs Liliendahl4 Comments

The term “slowly changing dimensions” is known from building data warehouses and attempting to make sense of data with business intelligence using reference data.

The fact that the world is changing all the time is also present when we look at Master Data Management and the essential hierarchy building taking place when structuring these data.

Company family trees are a common hierarchy structure in Master Data. One source of information about company family trees is the D&B Worldbase – a database operated by Dun & Bradstreet holding over 150 million business entities from all over the world.

I used to have Dun & Bradstreet as a customer. I don’t have that anymore – but I’m still working with the very same project. Because since I started this assignment US based Dun & Bradstreet handed over the operation in a range of European countries to the Swedish publishing group Bonnier. They later handed it over to Swedish company Bisnode. I started the project when I worked for Swedish consultancy group Sigma, continued in my Danish sole proprietorship and now serve Bisnode through German data quality tool vendor Omikron. Slowly changing relationships indeed.

As with many other activities in the realm of data quality establishing the “golden view”, “the single version of the truth” is only the beginning. If that “golden view” is not put into an ongoing maintenance the shiny gold will fade – slowly but steady.

360° Business Partner View

1st November 20096th July 2010Henrik Gabs Liliendahl2 Comments

Having a 360° customer view is a well established term in CRM and Master Data Management. It is typically defined as “providing everyone in the organization with a consistent view of the customer.”

Then some organizations don’t use the term customer but other words like:

Citizen is the common term in public sector organizations when dealing with private persons
Patient is used in healthcare and the customer/citizen balance is different between countries around the world
Member is used in membership organizations like fundraising and those organizing employers and employees

The concept of a 360° customer view is in my eyes easily swapped with 360° citizen / patient/ member view.

Also related to the position in the pipeline we have words as:

Prospect being an entity with whom we have a 1-1 dialogue about becoming a customer
Lead being an entity we want to engage in such a dialogue

I think embracing prospects and leads is a must for a 360° customer view. Having the same real world object acting as a customer and a prospect/lead at the same time doesn’t make sense.

Hierarchy is of course important here, as the customer and the prospect or lead may belong to the same hierarchy but at a different level or only seen at a higher level. This is true for:

Households in B2C operations
Company family trees in B2B operations
Multiple employee engagements in B2B operations
Small business owners in B2B and B2C coexisting environments

Organizations also have suppliers. In a B2B organization the intersection of business partners being customers / prospects / leads and also suppliers may be surprisingly large. Typically the intersection is not that large seen at branch level but higher if we take a look at the ultimate global mother level.

From my point of view a 360° customer view should be made on consolidated customer and supplier hierarchies in B2B. Even in B2C a private customer may be a business owner or key employee at a supplier.

Employees are another master data entity that may have an intersection with customers and suppliers. Having an employee being a (or spouse of a) business owner at a small business supplier is a classic cause of trouble. I have seen situations where a 360° customer view could include employee entities.

Other Business Partner entities exists depending on industry and specific business operations where a 360° customer view would benefit from catching up on other real world party entities.

I think Data Matching and/or upstream prevention by error tolerant search has a busy near future.

Master Data Survivorship

28th October 20092nd July 2010Henrik Gabs Liliendahl1 Comment

A Master Data initiative is often described as making a “golden view” of all Master Data records held by an organization in various databases used by different applications serving a range of business units.

In doing that (either in the initial consolidation or the ongoing insertion and update) you will time and again encounter situations where two versions of the same element must be merged into one version of the truth.

In some MDM hub styles the decision is to be taken at consolidation time, in other styles the decision is prolonged until the data (links) is consumed in a given context.

In the following I will talk about Party Master Data being the most common entity in Master Data initiatives.

This spring Jim Harris made a brilliant series of articles on DataQualityPro on the subject of identifying duplicate customers ending with part number 5 dealing with survivorship. Here Jim describes all the basic considerations on how some data elements survives a merge/purge and others will be forgotten and gives good examples with US consumer/citizens.

Taking it from there Master Data projects may have the following additional challenges and opportunities:

Global Data adds diversity into the rule set of consolidation data on record level as well as field level. You will have to comprise on simple global rules versus complex optimized rules (and supporting knowledge data) for each country/culture.
Multiple types of Party Master Data must be handled when Business Partners includes business entities having departments and employees and not at least when they are present together with consumers/citizens.
External Reference Data is becoming more and more common as part of MDM solutions adding valid, accurate and complete information about Business Partners. Here you have to set rules (on field level) of whether they override internal data, fills in the blanks or only supplements internal data.
Hierarchy building is closely related to survivorship. Rules may be set for whether two entities goes into two hierarchies with surviving parts from both or merges as one with survivorship. Even an original entity may be split into two hierarchies with surviving parts.

What is essential in survivorship is not loosing any valuable information while not creating information redundancy.

An example of complex survivorship processing may be this:

A membership database holds the following record (Name, Address, City):

Margaret & John Smith, 1 Main Street, Anytown

An eShop system has the following accounts (Name, Address, Place):

Mrs Margaret Smith, 1 Main Str, Anytown
Peggy Smith, 1 Main Street, Anytown
Local Charity c/o Margaret Smith, 1 Main Str, Anytown

A complex process of consolidation including survivorship may take place. As part of this example the company Local Charity is matched with an external source telling it has a new name being Anytown Angels. The result may be this “golden view”:

ADDRESS in Anytown on Main Street no 1 having
• HOUSEHOLD having
– CONSUMER Mrs. Margaret Smith aka Peggy
– CONSUMER Mr. John Smith
• BUSINESS Anytown Angels having
– EMPLOYEE Mrs. Margaret Smith aka Peggy

Observe that everything survives in a global applicable structure in a fit hierarchy reflecting local rules handling multiple types of party entities using external reference data.

But OK, we didn’t have funny names, dirt, misplaced data…..

Gorilla Data Quality

24th October 200921st June 2010Henrik Gabs Liliendahl2 Comments

My previous blog post was titled “Guerrilla Data Quality”. In that post – and the excellent comments – we came around that while we should have a 100% vision for data (or rather information) quality most actual (and realistic) activity is minor steps compromising on:

Business unit versus enterprise wide scope
Single purpose versus multiple purpose capabilities
Reactive versus proactive approach

I think the reason why it is so is the widely used metaphor saying “Pick the low-hanging fruit first”. Such a metaphor is appealing to mankind since it relates to core activities made by our ancestors when gathering food – and still practiced by our cousins the gorillas.

Steve Sarsfield explained the logic of picking low hanging fruits in his blog post Data Quality Project Selection by presenting the Project Selection Quadrant.

So what we are looking for now is the missing link between Gorilla / Guerrilla Data Quality and the teaching in available literature on how to get data (information) quality right.

	Henrik Gabs Lilienda… on Balancing the Business Partner…
	Jeppe Thing Sørensen on Balancing the Business Partner…
	peolsolutions on MDM, Cloud, SaaS, PaaS, IaaS a…
	Henrik Gabs Lilienda… on Is the Holiday Season called C…
	Michael D. on Is the Holiday Season called C…
	Jay Ram on The Disruptive MDM List is…
	Henrik Gabs Lilienda… on The Intersection of Data Obser…
	Shanker on The Intersection of Data Obser…
	Bhavani Shanker on Data Matching Efficiency
	Henrik Gabs Lilienda… on Data Matching Efficiency
	Bhavani Shanker on Data Matching Efficiency
	Henrik Gabs Lilienda… on From Platforms to Ecosyst…
	Michael Fieg on From Platforms to Ecosyst…
	From Platforms to Ec… on What is Collaborative Product…
	From Platforms to Ec… on MDM and Knowledge Graph