Linked Data Quality

In my eyes, the concept of linked data within the semantic web is a huge opportunity for getting data and information quality improvement done.

The premises for that are described on the page Data Quality 3.0.

Until now data quality has been largely defined as: Fit for purpose of use.

The problem however is that most data – not least master data – have multiple uses.

My thesis is that, as more and more purposes are included, there is a break-even point where it becomes less cumbersome to reflect the real world object than to align fitness for all known purposes.

If we look at the different types of master data and what possibilities may arise from linked data, this is what initially comes to my mind:

Location master data

Location data is among the data types already used the most on the web. Linking a hotel, a company, a house for sale and so on to a map is an immediate visual feature appealing to most people. Many databases, however, have poor location data, for example inadequate postal addresses. The demand for making these data “mappable” will become near unavoidable, but fortunately the services for doing so with linked data will help.

Hopefully increased open government data will help solve the data supply issue here.

Party master data

Linking party master data to external data sources is not new at all, but unfortunately not as widespread as it could be. The main obstacle until now has been smooth integration into business processes.

Having linked data describing real world entities on the web will make this game a whole lot easier.

Actually I’m working on implementations in this field right now.

Product master data

Traditionally the external data sources available for describing product master data have been few – and hard to find. But surely, a lot of data is already out there waiting to be found, categorized, matched and linked.


Data Quality is an Ingredient, not an Entrée

Fortunately it is more and more recognized that you don’t get success with Business Intelligence, Customer Relationship Management, Master Data Management, Service Oriented Architecture and many other disciplines without starting with improving your data quality.

But it would be a big mistake to see Data Quality improvement as an entrée served before the main course of BI, CRM, MDM, SOA or whatever is on the menu. You need ongoing prevention against having your data polluted again over time.

Improving and maintaining data quality involves people, processes and technology. Now, I am not neglecting the people and process side, but as my expertise is in the technology part, I would like to mention some of the technological ingredients that help keep data quality at a tasty level in your IT implementations.

Mashups

Many data quality flaws are (not surprisingly) introduced at data entry. Enterprise data mashups with external reference data may help during data entry, for example (a small sketch of the first case follows this list):

  • An address may be suggested from an external source.
  • A business entity may be picked from an external business directory.
  • Various rules exist in different countries for using consumer/citizen directories – why not use the best available where you do business?
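
To make the first bullet concrete: below is a minimal sketch of an address suggestion at data entry. The reference addresses are simulated with a small in-memory list; in a real mashup that list would be replaced by a call to an external address or business directory service.

```python
# Minimal sketch of address suggestion at data entry.
# The in-memory list stands in for an external reference source;
# a real mashup would call an address or business directory service here.
import difflib

REFERENCE_ADDRESSES = [
    "Main Street 1, 1000 Copenhagen",
    "Main Street 10, 1000 Copenhagen",
    "Harbour Road 5, 8000 Aarhus",
]

def suggest_address(entered, max_suggestions=3):
    """Return the closest reference addresses for what the user typed."""
    return difflib.get_close_matches(entered, REFERENCE_ADDRESSES,
                                     n=max_suggestions, cutoff=0.6)

if __name__ == "__main__":
    # A typo at data entry is caught before it reaches the database.
    print(suggest_address("Main Streat 1, 1000 Copenhagen"))
```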

External IDs

Getting data entry right at the root is important, and most (if not all) data quality professionals agree that this is a superior approach compared to doing cleansing operations downstream.

The problem, however, is that most data erodes as time passes. What was right at the time of capture will at some point no longer be right.

Therefore data entry should ideally not only be a snapshot of correct information but also include raw data elements that make the data easily maintainable.
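
As a rough illustration of the point, a party record could keep the raw elements and external keys next to the snapshot values, so the snapshot can be re-aligned later. The record layout and the directory below are purely illustrative.

```python
# Sketch of a party record keeping raw, maintainable elements
# (birth date, external directory key) next to the derived snapshot values.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class PartyRecord:
    name: str
    address: str                 # snapshot, may erode over time
    birth_date: Optional[date]   # raw element: age can always be derived from it
    external_id: Optional[str]   # e.g. a directory key usable for later refresh

def refresh_address(record: PartyRecord, directory: dict) -> PartyRecord:
    """Re-align the address snapshot with the external source, if we kept the key."""
    if record.external_id and record.external_id in directory:
        record.address = directory[record.external_id]
    return record

# The directory stands in for an external reference source.
record = PartyRecord("Example Corp", "Old Street 1", None, "REG-123")
print(refresh_address(record, {"REG-123": "New Street 2"}).address)
```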

Error tolerant search

A common workflow when in-house personnel enter new customers, suppliers, purchased products and other master data is that you first search the database for a match. If the entity is not found, you create a new entity. When the search fails to find an actual match, we have a classic and frequent cause of introduced duplicates.

An error tolerant search is able to find matches despite spelling differences, differently arranged words, various concatenations and many other challenges we face when searching for names, addresses and descriptions.
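
A minimal sketch of the idea is shown below, using simple word normalization and similarity scoring; a real matching engine is of course far more sophisticated.

```python
# Sketch of an error tolerant search: rank existing records by similarity
# to the search term instead of requiring an exact match.
import difflib

def normalize(text):
    # Lowercase and sort the words so "Smith John" matches "John Smith".
    return " ".join(sorted(text.lower().split()))

def error_tolerant_search(term, records, threshold=0.75):
    """Return (record, score) pairs whose similarity exceeds the threshold."""
    hits = []
    for record in records:
        score = difflib.SequenceMatcher(None, normalize(term), normalize(record)).ratio()
        if score >= threshold:
            hits.append((record, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

if __name__ == "__main__":
    customers = ["John Smith Ltd", "Jon Smith Limited", "Johnson & Smith"]
    print(error_tolerant_search("Smith John Ltd", customers))
```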


A Really Bad Address

Many years ago I worked in a midsize insurance company. At that time IT enabled a huge change in insurance pricing, since it was now possible to differentiate prices based on a lot of factors known to the databases.

The CEO decided that our company should also make some new pricing models based on where the customer lived, since it was perceived that you were more exposed to having your car stolen and your house broken into if you live in a big city as opposed to a quiet countryside home. But then the question: what exactly should the prices be, and where are the borderlines?

We, the data people, eagerly ran to the keyboard and fired up the newly purchased executive decision tool from SAS Institute. And yes, there was a different story based on postal code series, and especially downtown Copenhagen was really bad (I am from Denmark, where Copenhagen is the capital and largest city).

Curiously we examined smaller areas in downtown Copenhagen. The result: it wasn’t the crime-exposed red light district that was bad; it was addresses in the business part that hurt the most. OK, more expensive cars and belongings there, we guessed.

Narrowing down further, we were shocked. It was the street of our own company that was really, really bad. And finally: it was a customer with the very same house number as the company that had a lot of damage attached.

Investigating a bit more, the case was solved. All payments made to specialists doing damage reporting all over the country were attached to a fictitious customer at the company address.

After cleansing the data the picture wasn’t that bad. Downtown Copenhagen is worse than the countryside, but not that bad. But surprisingly the CEO didn’t use our data; he merely adopted the pricing model from the leading competitors.

I’m still wondering how these companies did their analysis. They all had headquarters addresses in the same business area.



A user experience

As a data quality professional it is a learning experience when you are the user.

During the last years I have worked for a data quality tool vendor with headquarters in Germany. As part of my role serving partners, prospects and customers in Scandinavia I have been a CRM system user. As a tool vendor we have taken our own medicine, which includes intelligent real time duplicate check, postal address correction, fuzzy search and other goodies built into the CRM system.

Sounds perfect? Sure, if it weren’t for a few diversity glitches.

The address doesn’t exist

Postal correction is only activated for Germany. This actually makes some sense since most activity is in Germany, and postal correction is not that important in Scandinavia, where company (and citizen) information is more available and then usually a better choice. Due to a less fortunate setup during the first years, my routine when inserting a new account was to pick correct data from a business directory, paste it into the CRM system and then angrily override the warning that the address doesn’t exist (in Germany).

Dear worshipful Mr Doctor Oetker

In Germany salutation is paramount. In Scandinavia it is not common to use a prefixed salutation anymore – and if you do, you are regarded as very old fashioned. So having the salutation field for a contact as mandatory is an annoyance and setting up an automated salutation generation mechanism is a complete waste of time.


Data Quality from the Cloud

One of my favorite data quality bloggers, Jim Harris, wrote a blog post this weekend called “Data, data everywhere, but where is data quality?”

I believe that data quality will be found in the cloud (not the current ash cloud, but to put it plainer: on the internet). Many of the data quality issues I encounter in my daily work with clients and partners are caused by adequate information not being available at data entry – or not being exploited. But the information needed will in most cases already exist somewhere in the cloud. The challenge ahead is how to integrate available information in the cloud into business processes.

Use of external reference data to ensure data quality is not new. Especially in Scandinavia, where I live, this has been in use for a long time because of the tradition of the public sector recording data about addresses, citizens, companies and so on far more intensely than in the rest of the world. The Achilles heel though has always been how to smoothly integrate external data into data entry functionality and other data capture processes and, not to forget, how to ensure ongoing maintenance in order to avoid the otherwise inevitable erosion of data quality.

The drivers for increased exploitation of external data are mainly:

  • Accessibility, which is where the fast growing (semantic) information store in the cloud helps – not least backed up by the worldwide tendency of governments releasing public sector data
  • Interoperability, where an increased supply of Service Oriented Architecture (SOA) components will pave the way
  • Cost; the more subscribers to a certain source, the lower the price – plus many sources will simply be free

As said, smooth integration into business processes is key – or sometimes even better, orchestrating business processes in a new way so that available and affordable information (from the cloud) is pulled into these business processes using only a minimum of costly on-premise human resources.
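
As a rough sketch of such a pull, an enrichment step keyed on a company registration number could look like the code below. The registry endpoint and the response fields are hypothetical placeholders, not a real service.

```python
# Sketch of pulling cloud reference data into a business process:
# enrich a newly entered company record from an external registry service.
# The endpoint URL and response fields are hypothetical placeholders.
import json
from urllib.request import urlopen

REGISTRY_URL = "https://example.org/company-registry/{reg_no}"  # placeholder

def enrich_company(record):
    """Fill in missing attributes from the cloud registry, keyed by registration number."""
    reg_no = record.get("registration_number")
    if not reg_no:
        return record  # nothing to key the lookup on
    with urlopen(REGISTRY_URL.format(reg_no=reg_no), timeout=5) as response:
        reference = json.load(response)
    for field in ("legal_name", "address", "industry_code"):
        record.setdefault(field, reference.get(field))
    return record
```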


Breaking through an open door

This is perhaps a road I have been down before, for example lately in the post The Myth about a Myth.

But it is a pet peeve of mine.

Why are some people always reminding us that this and that must be seen in a business context?

Of course everything we do in our professional life within data quality, master data management, business intelligence and so on must be seen in a business context. Then again, I have never seen anyone taking the opposite stance.

I am aware that playing the “business context” card is a friendly reminder when, say, some people become too excited about a tool. But remember, every tool was originally made by people to solve a business challenge, and if the tool continues to exist it has probably done that several times.

It may be that tools are overexposed in our business issue discussions simply because some people are doing their job:

  • Vendors are naturally pushing their tools – it’s a business issue
  • Analysts talk about tools and vendors – it’s a business issue
  • Conference organizers invite vendors to make sponsorships and tool exhibitions – it’s a business issue

But I don’t think you are breaking through anything when reminding anyone about the business context. Everyone knows that already. Take it to the next level.

Who is Responsible for Data Quality?

No, I am not going to continue some of the recent fine debates on who within a given company is data owner, accountable and responsible for data quality.

My point today is that many views on data ownership, the importance of upstream prevention and fitness for purpose of use in a business context are based on an assumption that the data in a given company is entered by that company, maintained by that company and consumed by that company.

In today’s business world this is not true in many cases.

Examples:

Direct marketing campaigns

Making a direct marketing campaign and sending out catalogues is often an eye-opener regarding the quality of data in your customer and prospect master files. But such things are very often outsourced.

Your company extracts a file with say 100,000 names and addresses from your databases and you pay a professional service provider a fee per row for doing the rest of the job.

Now the service provider could do you the kind favour of carefully deduplicating the file, eliminating the 5,000 purge candidates and bringing you the pleasant message that the bill will be reduced by 5%.

Yes, I know, some service providers actually include deduplication in their offerings. And yes, I know, they are not always that interested in using an advanced solution for that.
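
For illustration only, a very basic deduplication pass could look like the sketch below; the exact-key matching is deliberately naive compared to an advanced matching solution.

```python
# Rough sketch of a basic deduplication pass over a mailing file.
# A normalized name+address key collapses near-identical rows;
# advanced solutions use far more sophisticated matching than this.
def dedupe(rows):
    seen = set()
    survivors = []
    for row in rows:
        key = (row["name"].lower().strip(), row["address"].lower().strip())
        if key not in seen:
            seen.add(key)
            survivors.append(row)
    return survivors

mailing_file = [
    {"name": "John Smith", "address": "Main Street 1"},
    {"name": "john smith ", "address": "main street 1"},  # purge candidate
]
print(len(dedupe(mailing_file)))  # 1 row survives, the duplicate is purged
```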

I see the business context here – but unfortunately it’s not your business.

Factoring

Sending out invoices is often a good test of how well customer master data is entered and maintained. But again, using an outsourced service for that, like factoring, is becoming more common.

Your company hands over the name and address, receives most of the money, and the data is out of sight.

Now the factoring service provider has a pretty good interest in assuring the quality of the data and aligning the data with a real world entity.

Unfortunately this cannot be done upstream; it’s a downstream batch process, probably with no signalling back to the source.

Customer self service

Today data entry clerks are rapidly being replaced as customers do all the work themselves on the internet. Maybe the form is provided by you, maybe – as often with hotel reservations – the form is provided by a service provider.

So here you basically either have to extend your data governance all the way to your customer’s living room or office, or to some degree (fortunately?) accept that the customer owns the data.


Ongoing Data Maintenance

Getting data entry right at the root is important, and most (if not all) data quality professionals agree that this is a superior approach compared to doing cleansing operations downstream.

The problem, however, is that most data erodes as time passes. What was right at the time of capture will at some point no longer be right.

Therefore data entry should ideally not only be a snapshot of correct information but also include raw data elements that make the data easily maintainable.

An obvious example: if I tell you that I am 49 years old, that may be just the piece of information you needed for completing a business process. But if you ask me for my birth date, you will have the age information too after a bit of calculation, plus, based on that raw data, you will know when I turn 50 (all too soon), and your organization will know my age if we should do business again later.
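
A small sketch of that bit of calculation, deriving the age from the raw birth date (the birth date used here is just an example):

```python
# Deriving age from the raw birth date instead of storing a static age.
from datetime import date

def age_on(birth_date, on_date):
    """Age in whole years on a given date."""
    years = on_date.year - birth_date.year
    # Subtract one if the birthday hasn't occurred yet that year.
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

birth = date(1960, 6, 15)                              # illustrative birth date
print(age_on(birth, date.today()))                     # current age, always up to date
print(date(birth.year + 50, birth.month, birth.day))   # the date the person turns 50
```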

Birth dates are stable personal data. Gender pretty much is too. But most other data changes over time. Names change in many cultures in case of marriage and maybe divorce, and people may change names when discovering bad numerology. People move, or a street name may be changed.

There are a great deal of privacy concerns around identifying individual persons, and the norms differ between countries. In Scandinavia we are used to being identified by our unique citizen ID, but also here within debatable limitations. Still, you are offered solutions for maintaining raw data that will produce valid and timely B2C information at whatever precision is asked for, when needed.

Identifying a business entity, on the other hand, is broadly accepted everywhere. Public sector registrations are a basic source of identifying IDs, with varying uniqueness and completeness around the world. Private providers have developed proprietary ID systems like the DUNS Number from D&B. All in all, such solutions are good sources for ongoing maintenance of your B2B master data assets.

Addresses belonging to business or consumer/citizen entities – or just being addresses – are available as external reference data covering more and more spots on the Earth. Ongoing development in open government data helps with availability and completeness, and these data are often deployed in the cloud. Right now it is much about visual presentation on maps, but no doubt more services will follow.

Getting data right at entry and being able to maintain the real world alignment is the challenge if you don’t look at your data asset as a throw-away commodity.

Figure 1: one year old prime information

PS: If you forgot to maintain your data: before dumping it, Data Cleansing might be a sustainable alternative.


Data Quality and Climate Politics

In 1 month and 1 day the United Nations Climate Change Conference commences in my hometown Copenhagen. Here the people of the Earth will decide if we want to save the planet now or wait a while and see what happens.

The Data Quality issue might seem of little importance compared to the climate issue. Nevertheless I have been thinking about some similarities between Data Governance/Data Quality and climate politics.

It goes like this:

CEO buy-in

It’s often said that CEOs don’t buy in on data quality improvements because it’s a loser’s game. In climate politics the CEOs are the heads of state. It’s still a question how many heads of state will attend the Copenhagen conference. There is a great deal of attention around whether United States president Barack Obama will attend. His last visit to Copenhagen in early October didn’t turn out to be a success, as his recommendation of Chicago as Olympic host city was fruitless. I guess he will only come again if success is very likely.

Personal agendas  

On the other hand, British Prime Minister Gordon Brown has urged all world leaders to come to Copenhagen. While I think this is great for the conference being a success, I also have a personal reason to think it’s a very bad idea. Having all the world’s heads of state driving around the Copenhagen streets surrounded by a horde of police bikes will cause traffic jams interfering with my daily work and, more seriously, my Christmas shopping.

It’s no secret that much of the climate problem is caused by us as individuals not being more careful about our energy consumption in daily routines. Data Quality is much the same: individuals not thinking ahead but focusing on getting daily work done as quickly and comfortably as possible.

The business perspective

My fellow countryman Bjørn Lomborg is a prominent proponent of the view that we should focus more on battling starvation, diseases and other evils, because resources are spent more effectively there than on the marginal effects the same resources would have on fighting climate change.

Data Quality improvement is often omitted from Business Process Reengineering when the scope of these initiatives is prioritized with a focus on worthy, measurable short term wins.

Final words

My hope for my planet – and my profession – is that we are able to look ahead and do what is best for the future while we take personal responsibility and care in our daily work and life.


Process of consolidating Master Data


In my previous blog post “Multi-Purpose Data Quality” we examined a business challenge where we have multiple purposes with party master data.

The comments suggested some form of consolidation should be done with the data.

How do we do that?

I have made a PowerPoint show “Example process of consolidating master data” with a suggested way of doing that.

The process uses the party master data types explained here.
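
As a highly simplified sketch – not the actual process from the slides – one consolidation step could group records from the member database and the eShop on a match key and pick a surviving record per group. The match key and survivorship rule below are illustrative only.

```python
# Highly simplified sketch of one consolidation step: group party records
# from two sources on a match key and pick a surviving record per group.
from collections import defaultdict

def match_key(record):
    # Illustrative key: normalized name + postal code.
    return (record["name"].lower().strip(), record.get("postal_code", ""))

def consolidate(members, customers):
    groups = defaultdict(list)
    for record in members + customers:
        groups[match_key(record)].append(record)
    golden_records = []
    for records in groups.values():
        # Survivorship rule (illustrative): prefer the most recently updated record.
        golden_records.append(max(records, key=lambda r: r.get("updated", "")))
    return golden_records
```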

The next questions in solving our business challenge will include:

  • Is it necessary to have master data in optimal shape in real time – or is it OK to do periodic consolidation?
  • How do we design processes for maintaining the master data when:
    • New members and customers are inserted?
    • We update existing members and customers?
    • External reference data changes?   
  • What changes must be made with the existing applications handling the member database and the eShop?

Also the question of what style of Master Data Hub is suitable is indeed very common in these kinds of implementations.
