Master Data – Page 57 – Liliendahl on Data Quality

Dealing with annoying customers

21st March 2010 | 23rd June 2010 | Henrik Gabs Liliendahl | 2 Comments

No, this is not a blog post about how to handle customers who unjustly complain about everything.

This is a blog post about how to maintain high quality data in customer databases.

When doing that, there are some types of party entities that are more difficult to handle than others. In general B2B (business) entities are more complex than B2C (consumer/citizen) entities. Some of the B2B types I have spent more time with than others are the following:

Restaurants are some of the more demanding guests in our databases:

They change owner more often than most other business entities, which makes them a new legal entity each time. That matters in some business contexts like credit risk.
On the other hand it’s the same address despite a new owner, which makes it the same entity in the eyes of other business contexts like logistics.
In many cases you may have one name (trade style) for the restaurant and another official name for the business – a variant of this is when the restaurant is franchised.

Public sector bodies can’t be sliced the same way as private entities:

Often it is hard to state whether a business partner belongs to a narrowly defined or a more broadly defined unit within a governmental or local authority.
Public sector bodies tend to have long names that may be used with different inclusion of words, sequence of words and abbreviations of words.

Global enterprises may be seen as one or as thousands of customers:

The need for hierarchy management is obvious when it comes to handling data about business partners that belong to a global enterprise – risk management, 1-1 marketing, sales force automation and so on will use the same data in many different ways.
Company family trees are useful but treacherous. A mother and a daughter may be very closely connected with lots of shared services, or it may be strictly a matter of ownership with no operational ties at all.

These are some of the facts of life that make it fun and not trivial when you are conducting data matching and other activities in order to achieve and maintain high quality of customer master data.

Who is Responsible for Data Quality?

12th March 2010 | 1st July 2010 | Henrik Gabs Liliendahl | 7 Comments

No, I am not going to continue some of the recent fine debates on who within a given company is data owner, accountable and responsible for data quality.

My point today is that many views on data ownership, the importance of upstream prevention and fitness for purpose of use in a business context are based on an assumption that the data in a given company is entered by that company, maintained by that company and consumed by that company.

This is in the business world today not true in many cases.

Examples:

Direct marketing campaigns

Making a direct marketing campaign and sending out catalogues is often an eye opener for the quality of data in your customer and prospect master files. But such things are very often outsourced.

Your company extracts a file with say 100.000 names and addresses from your databases and you pay a professional service provider a fee for each row for doing the rest of the job.

Now the service provider could do you the kind favour of carefully deduplicating the file, eliminating the 5.000 purge candidates and bringing you the pleasant message that the bill will be reduced by 5 %.

Yes I know, some service providers actually include deduplication in their offerings. And yes, I know, they are not always that interested in using an advanced solution for that.

I see the business context here – but unfortunately it’s not your business.

Factoring

Sending out invoices is often a good test of how well customer master data is entered and maintained. But again, using an outsourced service for that, such as factoring, is becoming more common.

Your company hands over the name and address, receives most of the money, and the data is out of sight.

Now the factoring service provider has a pretty good interest in assuring the quality of the data and aligning the data with a real world entity.

Unfortunately this cannot be done upstream. It’s a downstream batch process, probably with no signalling back to the source.

Customer self service

Today data entry clerks are rapidly being replaced as the customer is doing all the work themselves on the internet. Maybe the form is provided by you, maybe – as often with hotel reservations – the form is provided by a service provider.

So here you basically either have to extend your data governance all the way to your customer’s living room or office or to some degree (fortunately?) accept that the customer owns the data.

Standardise this, standardize that

7th March 2010 | 7th March 2011 | Henrik Gabs Liliendahl | 1 Comment

Data matching is about linking entities in databases that don’t have a common unique key and are not spelled exactly the same, but are so similar that we may consider them to represent the same real world object.

When matching we may:

Compare the original data rows using fuzzy logic techniques
Standardize the data rows and then compare using traditional exact logic

As suggested in the title of this blog post, a common problem with standardization is that it may have two (or more) outcomes, just as this English word may be spelled in different ways depending on the culture.

Not least when working with international data you feel this pain. In my recent social media engagement I had the pleasure of touching this subject (mostly in relation to party master data) on several occasions, including:

In a comment to a recent post on this blog Graham Rhind says: Based just on the type of element and their positions in an address, there are at least 131 address formats covering the whole world, and around 40 personal name formats (I’m discovering more on an almost daily basis).
Rich Murnane made a post with a fantastic video in which Derek Sivers explains that while many parts of the world have named streets with building numbers assigned according to sequential position, Japan has named blocks between unnamed streets, with building numbers assigned according to the sequence in which the buildings were established.
In the Data Matching LinkedIn group Olga Maydanchik and I exchanged experiences on the problem that in American date format you write the month before the day in a date, while in European date format you write the day before the month.
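The date example can be made concrete with a few lines of code. The following is only a minimal Python sketch: the function names and the two format labels are assumptions made for illustration. It reads the same string under both conventions and reports when the readings disagree.

```python
from datetime import datetime

# The two conventions named above. The labels are illustrative only.
FORMATS = {
    "US (month first)": "%m/%d/%Y",
    "European (day first)": "%d/%m/%Y",
}

def possible_dates(text):
    """Return every reading of `text` that is a valid calendar date."""
    readings = {}
    for label, fmt in FORMATS.items():
        try:
            readings[label] = datetime.strptime(text, fmt).date()
        except ValueError:
            pass  # not a valid date under this convention
    return readings

def is_ambiguous(text):
    """True when the two conventions yield different valid dates."""
    return len(set(possible_dates(text).values())) > 1
```

A string such as 03/04/2010 is ambiguous, while 23/06/2010 can only be read day first. Knowing who entered the data is what resolves the ambiguous cases.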

In my work with international data I have often seen that determining what standard is used depends on both:

The culture of the real world entity that the data represents
The culture of the person (organisation) that provided the data

So, the possible combination of standards applied to a given data set depends on where the data is from, what elements are contained and who entered the data (which is often not carried on).

This is why I like to use both standardisation and standardization and fuzzy logic when selecting candidates and assigning similarity in data matching.
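As a rough illustration of combining the two approaches, here is a minimal Python sketch. The synonym table and function names are assumptions for illustration, and the standard library's difflib stands in for whatever fuzzy logic a real matching tool would use.

```python
import re
from difflib import SequenceMatcher

# Illustrative synonym table (an assumption): spelling variants mapped to one form.
SYNONYMS = {
    "organisation": "organization",
    "centre": "center",
    "ltd": "limited",
}

def standardize(text):
    """Lower-case, strip punctuation and map known variants to one spelling."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(SYNONYMS.get(token, token) for token in tokens)

def similarity(a, b):
    """Fuzzy similarity (0.0 to 1.0) computed on the standardized strings."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()
```

Standardization removes the differences we know about in advance, and the fuzzy comparison catches what is left over, such as a misspelling that no synonym table anticipates.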

Having the right element to the left

27th February 2010 | 29th May 2012 | Henrik Gabs Liliendahl | 9 Comments

Name, address and place are core attributes in almost any database. You may atomize these attributes into smaller slices, but in doing that: Mind the sequence.

When working with data matching and party master data management some of the frequently exposed issues are:

Person name

Often a person name is split into first name and last name, but even when assigning these labels you are on slippery ground. Examples:

In some cultures, such as in East Asia, the family name is written first and the given name is written last.
Some notations indicate that the given name isn’t the first element:
- “DUPONT Michel” is a customary French way of signalling that the family name is the first element
- “Smith, John” is a universal way of signalling that the family name is the first element

Besides that we have issues with middle names and other three part naming and having salutation, education and job titles mixed up in name fields.
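The two notations above can be detected mechanically. This is a minimal Python sketch under stated assumptions: the function name is hypothetical, and the fallback (given name first) is just a default, since family-first cultures cannot be recognised from the string alone.

```python
def split_person_name(raw):
    """Guess (given_name, family_name) from a raw name string."""
    raw = raw.strip()
    # "Smith, John": the comma marks the family name as the first element.
    if "," in raw:
        family, given = (part.strip() for part in raw.split(",", 1))
        return given, family
    tokens = raw.split()
    # "DUPONT Michel": an upper-case first token marks the family name.
    if len(tokens) >= 2 and tokens[0].isupper() and not tokens[-1].isupper():
        return " ".join(tokens[1:]), tokens[0]
    # Fallback assumption: given name(s) first, family name last.
    if len(tokens) >= 2:
        return " ".join(tokens[:-1]), tokens[-1]
    return raw, ""
```

Middle names, titles and salutations are not handled here. They would need their own rules or reference lists.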

Street address

Most of the world is divided into two “street address” cultures:

In the Americas you write the house number in front of street name if you are north of Rio Grande being US and CA, but you write the house number after the street name if you are south of Rio Grande being MeXico, BRazil, ARgentina and almost any other country.
In Europe you write the house number in front of street name if you are on the British Isles or in France, but you write the house number after the street name if you are in almost any other country.
The rest of the world is also divided in writing street addresses.

Besides that we have other ways of writing addresses like the block style in Japan.
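A parser that minds the sequence has to accept the house number on either side. Here is a minimal Python sketch, with pattern and function names that are purely illustrative. It handles only the two simple layouts described above.

```python
import re

# House number in front of the street name, e.g. "10 Downing Street".
NUMBER_FIRST = re.compile(r"^(?P<number>\d+\w*)\s+(?P<street>.+)$")
# House number after the street name, e.g. "Hauptstrasse 12a".
NUMBER_LAST = re.compile(r"^(?P<street>.+?)\s+(?P<number>\d+\w*)$")

def split_street_address(line):
    """Return (street, house_number); house_number is None if not found."""
    line = line.strip()
    for pattern in (NUMBER_FIRST, NUMBER_LAST):
        match = pattern.match(line)
        if match:
            return match.group("street"), match.group("number")
    return line, None
```

This is deliberately naive: a street whose name starts with a number would be misread, and block style addresses need an entirely different model.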

Place

Most countries have a postal code system – even Ireland will have that soon.

Despite the fact that a city name in most cases can be obtained by looking up the postal code we often do store the city name anyway – for those cases that we can’t.

And if the postal code and the city name are in one string: Oh yes, in some cultures you write the city name in front of the postal code and in other cultures you do it the opposite way. And oh no: It doesn’t necessarily follow the sequence of the house number and street name.
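The same two-sided approach works for a combined postal code and city string. The sketch below is a minimal Python illustration with hypothetical names. It assumes purely numeric postal codes of at least three digits, so alphanumeric codes are out of scope.

```python
import re

# Postal code in front of the city name, e.g. "2100 København".
CODE_FIRST = re.compile(r"^(?P<code>\d{3,})\s+(?P<city>\D.*)$")
# City name in front of the postal code, e.g. "New York, NY 10001".
CITY_FIRST = re.compile(r"^(?P<city>\D.*?)\s+(?P<code>\d{3,})$")

def split_place(text):
    """Return (postal_code, city); postal_code is None if not found."""
    text = text.strip()
    for pattern in (CODE_FIRST, CITY_FIRST):
        match = pattern.match(text)
        if match:
            return match.group("code"), match.group("city")
    return None, text
```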

In a blog post written a while ago we also had a look into postal address hierarchy, granularity, precision and history.

Under new Master Data Management

24th February 2010 | 9th July 2010 | Henrik Gabs Liliendahl | 4 Comments

“Under new management” is a common sign in the window of a restaurant. The purpose of the sign is to tell: Yes, we know: Really bad food was served in a really bad way here. But from now on we have a new management dedicated to serving really good food in a really good way.

By the way: Restaurants are one of the more challenging business entities to handle in Party Master Data Management:

They change owner more often than most other business entities, which makes them a new legal entity each time. That matters in some business contexts like credit risk.
On the other hand it’s the same address despite a new owner, which makes it the same entity in the eyes of other business contexts like logistics.
In many cases you may have one name (trade style) for the restaurant and another official name for the business – a variant of this is when the restaurant is franchised.

Master Data Management is not trivial – serving restaurants or not.

Improving Master Data Management starts with the sign in the window: Yes, we know: Really bad information was served here in a really bad way. But from now on we have a new master data management dedicated to serving really good information in a really good way.

Then you may have a look at the menu. Do we have the right mix of menu items for the guests we like to serve? How are we going to govern a steady flow of fresh raw data that’s going to be prepared and selected from the menu and end up at the tables?

What about the waiters’ attitude? Serving is much more fun if you are proud of the dishes coming from the kitchen. It’s pleasant to bring compliments from guests back to the kitchen – not least when given along with great tips.

The information chef has to be very much concerned about the raw data quality and the tools available for what may be similar to rinsing, slicing, mixing and boiling food.

Bon appetit.

Deploying Data Matching

18th February 2010 | 2nd July 2010 | Henrik Gabs Liliendahl | 2 Comments

As discussed in my last post a core part of many Data Quality tools is Data Matching. Data Matching is about linking entities in or between databases, where these entities are not already linked with unique keys.

Data Matching may be deployed in several different ways. I have been involved in the following ones:

External Service Provider

Here your organization sends extracted data sets to an external service provider where the data are compared and also in many cases related to other reference sources all through matching technology. The provider sends back a “golden copy” ready for uploading in your databases.

Some service providers use a Data Matching tool from the market and others have developed their own solutions. Many solutions grown at the providers are country specific, equipped with a lot of tips and tricks learned from handling data from that country over the years.

The big advantage here is that you gain from the experience – and the reference data collection – at these providers.

Internal Processing

You may implement a data quality tool from the market and use it for comparing your own data often from disparate internal sources in order to grow the “golden copy” at home.

Many MDM (Master Data Management) products have some matching capabilities built in.

Also many leading Business Intelligence tool providers supplement the offering with an (integrated) Data Quality tool with matching capabilities as an answer to the fact that Business Intelligence on top of duplicated data doesn’t make sense.

Embedded Technology

Many data quality tool vendors provide plug-ins to popular ERP, CRM and SCM solutions so that data are matched with existing records at the point of entry. For the most popular solutions, such as SAP and MS CRM, there are multiple such plug-ins from different Data Quality technology providers. Then again many implementation houses have a favorite combination – so in that way you select the matching tool by selecting an implementation house.

SOA Components

The embedded technology is of course not optimal where you operate with several databases and the commercial bundling may also not be the actual best solution for you.

Here Service Oriented Architecture thinking helps, so that matching services are available as SOA components at any point in your IT landscape based on centralized rule setting.

Cloud Computing

Cloud computing services offered by external service providers take the best from these two worlds into one offering.

Here the SOA component resides at the external service provider – in the best case combining an advanced matching tool, rich external reference data and the tips and tricks for the particular country and industry in question.

Select Company_ID from External_Source where possible

27th January 2010 | 1st September 2010 | Henrik Gabs Liliendahl | 9 Comments

With the risk of having the comment area on this blog filled up with SQL statements I will follow the track and tone from the last post called Create Table Homo_Sapiens.

In the last post some challenges around modelling people in databases were discussed with focus on uniqueness. Now we will have a look at the same challenges with companies – the other big part of party master data.

Companies often act in the same role as individual people in business processes – not least in the role as a customer. Companies also behave as persons in a lot of ways like being born (establish), change name, relocate, marry (mergers and acquisitions), divorce (split) and decease (dissolve).

All over the world a lot of people spend their days entering and updating the data held on business partners in numerous databases. The world wide sum of B2B connections between a customer and a vendor, each entering and maintaining the data about the other, resembles (though less aggressively) the grains on a chessboard story:

2 companies both exchanging goodies makes 1+1 customers and 1+1 vendors = 4 rows
3 companies all exchanging goodies makes 2+2+2 customers and 2+2+2 vendors = 12 rows
4 companies all exchanging goodies makes 3+3+3+3 customers and 3+3+3+3 vendors = 24 rows
5 companies all exchanging goodies makes 4+4+4+4+4 customers and 4+4+4+4+4 vendors = 40 rows
n companies all exchanging goodies makes n*(n-1) customers and n*(n-1) vendors = 2*n*(n-1) rows
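The general formula in the last line can be checked against the smaller cases with a few lines of Python. This is only a sketch of the arithmetic, and the function name is an assumption for illustration.

```python
def master_data_rows(n):
    """Rows held when n companies all trade with each other.

    Each company keeps n - 1 customer rows and n - 1 vendor rows.
    """
    customers = n * (n - 1)
    vendors = n * (n - 1)
    return customers + vendors

# Reproduces the 4, 12, 24 and 40 rows listed above.
row_counts = [master_data_rows(n) for n in (2, 3, 4, 5)]
```

The growth is quadratic rather than exponential, which is why the chessboard comparison is the less aggressive of the two.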

Last time I checked, the D&B WorldBase held more than 150 million companies. Some are dissolved and (fortunately?) not everyone does business with everyone – but as said, the sum of B2B connections is huge and the work in entering and maintaining the master data seems overwhelming.

If we look at one single company and how it may be represented differently in databases around, only taking basic data such as name and address into account, there will be lots of variations. Even in the same table the same real world company often occupies several rows spelled differently.

One of the most effective methods for avoiding duplicates of company master data is plugging into a business directory. By using an externally sourced company ID as a key in your master data you are able to hold a golden record of that real world entity. As a bonus you are offered updates and access to a lot of additional data you would never be able to collect yourself.
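In code, the idea amounts to grouping rows on the external key instead of on spelling. The following minimal Python sketch uses a hypothetical field name (company_id) and made-up ID values, and does not reflect any particular directory's format.

```python
from collections import defaultdict

def group_by_company_id(rows):
    """Group rows by external company ID; rows without an ID stay unresolved."""
    groups = defaultdict(list)
    unresolved = []
    for row in rows:
        company_id = row.get("company_id")
        if company_id:
            groups[company_id].append(row)
        else:
            unresolved.append(row)
    return dict(groups), unresolved

# Made-up sample rows: two spellings of one company plus one row with no ID.
rows = [
    {"name": "Acme Trading Ltd", "company_id": "EXT-1001"},
    {"name": "ACME Trading Limited", "company_id": "EXT-1001"},
    {"name": "Acme Tradng", "company_id": None},
]
golden, unresolved = group_by_company_id(rows)
```

Rows that share an ID collapse into one golden record regardless of spelling. The unresolved rows are the ones that still need data matching, hence the "where possible" in the title.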

Create Table Homo_Sapiens

23rd January 2010 | 27th March 2012 | Henrik Gabs Liliendahl | 19 Comments

Create Table is a basic statement in the SQL language which is the most widespread computer language used when structuring data in databases.

The most common entity in databases around must be rows representing real world human beings (Homo Sapiens) and the different groups we form. Tables for that could have the name Homo_Sapiens but are usually called Customer, Member, Citizen, Patient, Contact and so on.

The most common data quality issues around are related to accuracy, validity, timeliness, completeness and not least uniqueness of the data we hold about people.

In databases tables are supposed to have a unique primary key. There are two basic types of primary keys:

Surrogate keys are typically numbers with no relation (and binding) to the real world. They are made invisible to the users of the applications operating on the database.
Natural keys are derived from existing codes or other data identifying an entity in the real world or made for that purpose. They are visible to users and part of electronic, written and verbal communication.

As surrogate keys obviously don’t help with real world uniqueness and there is no common global natural key for all human beings on the earth, we have a challenge in creating a good primary key for a Homo Sapiens table.

Inside a given country we have different forms of citizen IDs (national identification numbers) with very varying terms of use between the countries. But even in Scandinavia, where I live and where we have widespread use of unique citizen IDs, most tables that could have the name Homo_Sapiens cannot use a citizen ID as (unique) primary key, for several reasons including that the data is not present in a lot of situations.
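To make the key discussion tangible, here is a minimal sketch using Python's built-in sqlite3 module. The column names and sample values are assumptions for illustration. The table has a surrogate primary key and an optional natural key that must be unique when present.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Homo_Sapiens (
        person_id  INTEGER PRIMARY KEY,  -- surrogate key, no real world meaning
        citizen_id TEXT UNIQUE,          -- natural key, often not available
        full_name  TEXT NOT NULL
    )
""")

def try_insert(citizen_id, full_name):
    """Insert a row; return False if the natural key is already taken."""
    try:
        conn.execute(
            "INSERT INTO Homo_Sapiens (citizen_id, full_name) VALUES (?, ?)",
            (citizen_id, full_name),
        )
        return True
    except sqlite3.IntegrityError:
        return False

try_insert("ID-001", "John Smith")   # made-up citizen ID
try_insert(None, "John Smith")       # accepted: no natural key to check
try_insert(None, "Smith, John")      # accepted: same person, third row
```

The natural key stops duplicates only when it is filled in. The rows without it slip through with fresh surrogate keys, which is exactly where deduplication has to take over.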

Most often we name the tables holding data about human beings by the role people will act in within the purpose of use for the data we collect. For example Customer Table. A customer may be an individual but also a household or a business entity. A human being may be a private consumer but also an employee at a business making a purchase or a business owner making both private purchases and business purchases.

Every business activity always comes down to interacting with individual persons. But as our data is collected for the different roles that individual may have acted in, we have a need for viewing these data related to single human beings. The methods for facilitating this come in different flavours, such as:

Deduplication is the classic term used for describing processes where records are linked, merged or purged in order to make a golden copy having only one (parent) database row for each individual person (and other legal entities). This is usually done by matching data elements in internal tables with names and addresses within a given organisation.
Identity Resolution is about the same but – if a distinction is considered to exist – uses a wider range of data, rules and functionality to relate collected data rows to real world entities. In my eyes exploiting external reference data will add considerable efficiency in the years to come within deduplication / identity resolution.
Master Data Hierarchy Management again has the same goal of establishing a golden copy of collected data, but emphasises reflecting the complex structure of relationships in the real world as well as the related history.

Next time I am involved in a data modelling exercise I will propose a Homo_Sapiens table. Wonder about the odds for buy in from other business and technical delegates.

A New Year Resolution

1st January 2010 | 2nd July 2010 | Henrik Gabs Liliendahl | 2 Comments

Also for this year I have made this New Year resolution: I will try to avoid stupid mistakes that actually are easily avoidable.

Just before Christmas 2009 I made such a mistake in my professional work.

It’s not that I don’t have a lot of excuses. Sure I have.

The job was a very small assignment doing what my colleagues and I have done a lot of times before: An Excel sheet with names, addresses, phone numbers and e-mails was to be cleansed for duplicates. The client had got a discount price. As usual it had to be finished very quickly.

I was very busy before Christmas – but accepted this minor trivial assignment.

When the Excel sheet arrived it looked pretty straightforward. Some names of healthcare organizations and healthcare professionals working there. I processed the sheet in the Omikron Data Quality Center, scanned the result and found no false positives, made the export while suppressing merge/purge candidates and delivered back (what I thought was) a clean sheet.

But the client came back to me. She had found at least 3 duplicates in the not so clean sheet. Embarrassing. Because I didn’t ask her (as I usually do) a few obvious questions about what will constitute a duplicate. I have even recently blogged about the challenge I missed, which I call “the echo problem”.

The problem is that many healthcare professionals have several job positions. Maybe they have a private clinic besides positions at one or several different hospitals. And for this particular purpose a given healthcare professional should only appear once.
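What counts as a duplicate therefore depends on which fields make up the match key. The sketch below is a minimal Python illustration with made-up rows and hypothetical field names. It uses exact comparison on normalised values, not the matching logic of any particular tool.

```python
def dedupe(rows, key_fields):
    """Keep the first row seen for each combination of key_fields."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[field].strip().lower() for field in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

# Made-up sample: one professional with positions at two organisations.
rows = [
    {"name": "Dr. Anna Berg", "organisation": "City Hospital"},
    {"name": "Dr. Anna Berg", "organisation": "Berg Private Clinic"},
    {"name": "dr. anna berg ", "organisation": "City Hospital"},
]
per_position = dedupe(rows, ("name", "organisation"))  # one row per job position
per_person = dedupe(rows, ("name",))                   # one row per professional
```

The same sheet gives two rows under the first definition and one under the second, so the question to the client has to come before the processing.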

Now, this wasn’t an MDM project where you have to build complex hierarchy structures but one of those many downstream cleansing jobs. Yes, they exist and I predict they will continue to do so in the decade beginning today. And sure, I could easily make a new process ending in a clean sheet fit for that particular purpose based on the data available.

Next time, this year, I will get the downstream data quality job done right the first time so I have more time for implementing upstream data quality prevention in state of the art MDM solutions.

Phony Phones and Real Numbers

8th December 2009 | 8th December 2009 | Henrik Gabs Liliendahl | 6 Comments

There are plenty of data quality issues related to phone numbers in party master data. Despite the fact that a phone number should be far less fuzzy than names and addresses, I have spent lots of time having fun with these calling digits.

Challenges include:

Completeness – Missing values
Precision – Inclusion of country codes, area codes, extensions
Reliability – Real world alignment, pseudo numbers: 1234.., 555…
Timeliness – Outdated and converted numbers
Conformity – Formatting of numbers
Uniqueness – Handling shared numbers and multiple numbers per party entity

You may work with improving phone number quality with these approaches:

Profiling:

Here you establish some basic ideas about the quality of a current population of phone numbers. You may look at:

Count of filled values
Minimum and maximum lengths
Represented formats – best inspected per country if international data
Minimum and maximum values – highlighting invalid numbers
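A first profiling pass along these lines needs very little code. This is a minimal Python sketch with hypothetical function and key names. It covers fill count, digit lengths and format patterns, where every digit is replaced by 9 so that differently formatted numbers fall into separate groups.

```python
import re
from collections import Counter

def profile_phones(values):
    """Basic profile of a list of raw phone number strings (None allowed)."""
    filled = [v.strip() for v in values if v and v.strip()]
    digit_strings = [re.sub(r"\D", "", v) for v in filled]
    lengths = [len(d) for d in digit_strings if d]
    # Replace each digit by 9 to expose the formatting pattern.
    formats = Counter(re.sub(r"\d", "9", v) for v in filled)
    return {
        "total": len(values),
        "filled": len(filled),
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
        "formats": formats,
    }
```

With international data the same function would be run per country, as the expected lengths and formats differ.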

Validation:

National number plans can be used as a basis for next level check of reliability – both in batch cleansing of a current population and for an upstream prevention with new entries. Here numbers not conforming to valid lengths and ranges can be marked.

You may also classify a number as either a fixed net number or a cell number – but the boundaries are not totally clear in many cases.

In many countries a fixed net number includes an area code indicating the place.
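A validation step based on a number plan could look like the following minimal Python sketch. The plan shown is entirely hypothetical (an imaginary country "XX" with invented lengths and prefixes). Real rules would have to come from the national number plan in question.

```python
import re

# Hypothetical, simplified number plan. NOT real national data.
NUMBER_PLAN = {
    "XX": {"length": 8, "fixed": ("3", "4"), "mobile": ("2", "5")},
}

def check_phone(raw, country):
    """Classify a raw number against the (hypothetical) plan for a country."""
    digits = re.sub(r"\D", "", raw or "")
    plan = NUMBER_PLAN.get(country)
    if plan is None:
        return "unknown plan"
    if len(digits) != plan["length"]:
        return "invalid length"
    # Pseudo numbers: one repeated digit or a plain ascending run.
    if len(set(digits)) == 1 or digits in "0123456789012345678":
        return "pseudo number"
    if digits.startswith(plan["fixed"]):
        return "fixed"
    if digits.startswith(plan["mobile"]):
        return "mobile"
    return "invalid range"
```

The same function can serve both uses mentioned above: run over a whole column in a batch cleansing job, or called at the point of entry as upstream prevention.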

Match and enrichment:

Names and addresses related to missing and invalid phone numbers may be matched with phone books and other directories having phone numbers and thereby enriching your data and improving completeness.

Reality check:

Then of course you may call the number and confirm whether you are reaching the right person (or organization). I have, though, never been involved in such an activity, nor been called by someone only asking if I am who I am.
