Diversity in Data Quality in 2010

Diversity in data quality is a favorite topic of mine and diversity has been my theme word in social media engagement this year.

Fortunately I’m not alone. Others have been writing about diversity in data quality in the past year. Here are some of the contributions I remember:

The Dutch data quality tool vendor Human Inference has a blog called Data Value Talk. Several of the posts there are about diversity in data quality, including World Languages Day – Linguistic diversity rules in Switserland!

Another blog based in the Netherlands is from Graham Rhind. Graham (a Brit stranded in Amsterdam) is an expert in international issues with data quality and one of his blog posts this year is called Robert the Carrot.

The MDM vendor IBM Initiate has a lively blog about Master Data Management and Data Quality. One of the posts this year was an introduction to a webinar. The post by Scott Schumacher (in which I’m proud to be mentioned) is called Join Us to Demystify Multi-Cultural Name Matching.

Rich Murnane posted a funny yet educational video with Derek Sivers about Japanese addresses called What is the name of that block? (Again, thanks Rich for the mention).

In the eLearningCurve free webinar series there was a very educational session with Kathy Hunter called Overcoming the Challenges of Global Data.  There is also an interview with Kathy Hunter on the DataQualityPro site.

I also remember we debated the state of the art of data quality tools when it comes to international data in the post by Jim Harris called OOBE-DQ, Where Are You? As Jim mentions in his later post called Do you believe in Magic (Quadrants)?: “It must be noted that many vendors (including the “market leaders”) continue to struggle with their International OOBE-DQ”.

I guess that international capabilities in data quality tools and party master data management solutions will be on the agenda in 2011 as well.


Matching Light Bulbs

This morning I noticed this lightbulb joke in a tweet from @mortensax:

Besides finding it amusing I also related to it since I have used an example with light bulbs in a webinar about data matching as seen here:

The use of synonyms in Search Engine Optimization (SEO) is very similar to the techniques we use in data matching.

Here the problem is that, for example, these two product descriptions may have a fairly high edit distance (they are very different character by character) but still describe the same product:

  • Light bulb, A 19, 130 Volt long life, 60 W
  • Incandescent lamp, 60 Watt, A19, 130V

while these two product descriptions have an edit distance of only one character substitution but are not the same product (though they are in the same category), as the sketch after the list illustrates:

  • Light bulb, 60 Watt, A 19, 130 Volt long life
  • Light bulb, 40 Watt, A 19, 130 Volt long life
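
To make the contrast concrete, here is a minimal sketch in plain Python. The synonym list and the unit normalization rules are illustrative assumptions, not taken from any particular matching tool: raw edit distance is misleading for both pairs, while normalizing synonyms and units exposes what actually matches and what conflicts.

import re

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Illustrative synonym rules (assumptions, not a real tool's rule set).
SYNONYMS = {"light bulb": "lamp", "incandescent lamp": "lamp"}

def normalize(description: str) -> set:
    """Lowercase, glue units onto their numbers, apply synonyms, tokenize."""
    text = description.lower()
    text = re.sub(r"(\d+)\s*(?:v|volts?)\b", r"\1volt", text)  # 130 Volt / 130V
    text = re.sub(r"(\d+)\s*(?:w|watts?)\b", r"\1watt", text)  # 60 W / 60 Watt
    text = re.sub(r"\ba\s*19\b", "a19", text)                  # A 19 / A19
    for variant, canonical in SYNONYMS.items():
        text = text.replace(variant, canonical)
    return set(re.findall(r"[a-z0-9]+", text))

same = ("Light bulb, A 19, 130 Volt long life, 60 W",
        "Incandescent lamp, 60 Watt, A19, 130V")
different = ("Light bulb, 60 Watt, A 19, 130 Volt long life",
             "Light bulb, 40 Watt, A 19, 130 Volt long life")

print(levenshtein(*same))                                 # high distance, same product
print(normalize(same[0]) & normalize(same[1]))            # {'lamp', '60watt', 'a19', '130volt'}
print(levenshtein(*different))                            # distance 1, different products
print(normalize(different[0]) ^ normalize(different[1]))  # conflict: {'60watt', '40watt'}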

Working with product data matching is indeed very enlightening.


My 2011 To Do List

These days are the classic time for predicting something about next year in a blog post. This year I will make some egocentric predictions about what I am going to do next year. Fortunately I think these activities are pretty representative of the trends in the data quality realm.

My three most important challenges in working with data and information quality improvement and master data management will be:

Multi-Domain Master Data Quality

There are several different disciplines and product offerings around, such as:

  • Data Quality tools
  • Customer Data Integration (CDI) solutions
  • Product Information Management (PIM) platforms

These disciplines and the related software packages used to solve the challenges are constantly maturing and expanding to embrace the problem as a whole.

Find more about the subject in my posts on Multi-Domain MDM.

Exploiting rich external reference data sources in the cloud

Working with external reference sources as a means of improving data quality has been a focus area of mine for many years.

Recent developments in governments releasing rich sources of data will help with availability here, but new challenges will also arise, like achieving conformity across data sources coming from many different countries in many different formats.

Much of the activity here will happen in the cloud.

See my take on the subject on the page Data Quality 3.0 and read about a concrete implementation in instant Data Quality.

Downstream data cleansing

Despite constant improvements in data quality tools and master data management solutions moving us from downstream batch cleansing to upstream prevention, there will still be lots of reasons for doing downstream cleansing projects.

Here are the top 5 reasons.

I expect to be involved in at least one of each type next year.


The Overlooked MDM Feature

When engaging in the social media community dealing with master data management, a frequently seen subject is lists of important capabilities for the technical side of master data management. I have on some occasions commented on such posts by adding a feature I often see omitted from these lists, namely: error tolerant search functionality. Examples are from the DataFlux CoE blog here and the LinkedIn Master Data Management Interest Group here.

Error tolerant search (also called fuzzy search) technology is closely related to data matching technology. But where data matching is basically non-interactive, error tolerant search is highly interactive.

Most people know error tolerant search from googling. You enter something with a typo and Google prompts you back with: Did you mean…? When looking for entities in master data management hubs you certainly need something similar. Spelling names, addresses, product descriptions and so on is not easy – not least in a globalized world.

As in data matching, error tolerant search may use lists of synonyms as the basic technology. The use of algorithms is also common, ranging from an oldie like the Soundex phonetic algorithm to more sophisticated algorithms.
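
As a minimal sketch of that oldie, here is the classic American Soundex algorithm in plain Python. It is a generic textbook implementation, not the code of any particular tool: misspelled name variants collapse to the same code, which is what lets a hub find “Smith” when a user searches for “Smyth”.

# Classic American Soundex: retain the first letter, encode the rest as
# digits, collapse adjacent duplicate codes (h and w do not break a run),
# drop vowels, pad with zeros to a four-character code.
CODES = {ch: digit
         for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                ("l", "4"), ("mn", "5"), ("r", "6"))
         for ch in letters}

def soundex(name: str) -> str:
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    result = name[0].upper()
    prev = CODES.get(name[0], "")
    for ch in name[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "hw":            # h and w keep the previous code alive
            prev = digit
    return (result + "000")[:4]

# Misspelled or variant names collapse to the same code:
for name in ("Smith", "Smyth", "Schmidt"):
    print(name, soundex(name))        # S530, S530, S530

Indexing party names by such a code in the master data hub is one way to make an interactive “Did you mean…?” lookup cheap: the search retrieves all names sharing the code and then ranks them by a finer similarity measure.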

The business benefits from having error tolerant search as a capability in your master data management solution are plenty, including:

  • Better data quality by upstream prevention against duplicate entries as explained in this post.
  • More efficiency by bringing down the time users spend searching for information about entities in the master data hub.
  • Higher employee satisfaction by eliminating the frustration that otherwise comes from not finding what you know must already be inside the hub.

Error tolerant search has been one of the core features in the master data management implementations where I have been involved. What about you?


Sell-side vs Buy-side Master Data Quality

The two most prominent domains in master data management and related data quality improvement are:

  • Party master data and
  • Product master data

Party Master Data

Most of the talk about party master data is about customer master data (including prospect master data). This discipline is often called Customer Data Integration (CDI). Customer data is the sell-side of party master data. The organizations with the biggest pains in this area are mostly organizations with many customers (and prospects). The largest volumes of customer data are related to business-to-consumer (B2C) activities, but certainly we also see many grown customer databases in the business-to-business (B2B) realm.

The buy-side of party master data is supplier data. Fewer organizations have grown large supplier databases, but surely big firms with many different departments and subsidiaries have supplier master data issues like the ones we see on the sell-side.

Also, many organizations have a surprisingly large intersection of parties that are on both the sell-side and the buy-side. I have touched on that subject in the post: 360° Business Partner View.

Product Master Data

Product Information Management (PIM) also has a sell-side and a buy-side. Here too, the pains grow with the numbers. In contrast to party master data, high sell-side numbers are rarer than high buy-side numbers in product master data.

We often see a high sell-side number of products at retailers, where the same products are also on the buy-side at the same time, but where we may not have the same requirements for entity resolution. Most organizations don’t have big issues (like problems with uniqueness) with products they produce themselves.

Otherwise, a high number of buy-side products is not so much related to buying raw materials as to buying things like spare parts and all kinds of small equipment and assets (with software licenses being the closest thing to herding cats, I guess).

Multi-Domain Master Data Management

With multi-domain master data management there is of course a connection between sell-side party master data and sell-side product master data, with opportunities in analyzing to whom we sell what, discovering cross-selling openings and so on.

On the buy-side there is great potential in looking into where we buy similar things from, examining discount possibilities and so on.

Same same but different

A while ago I wrote a blog post about similarities and differences between party master data quality and product master data quality called Same Same But Different.

Besides the differences between party master data and product master data, I also find we have differences between sell-side and buy-side, making four different but somewhat similar and connected disciplines in master data management and data quality improvement.


Storing a Single Version of the Truth

An ever-recurring subject in the data quality and master data management (MDM) realms is whether we can establish a single version of the truth.

The most prominent example is whether an enterprise can implement and maintain a single version of the truth about business partners being customers, prospects, suppliers and so on.

In the quest for establishing that (fully reachable or not) single version of the truth, we use identity resolution techniques such as data matching, and we exploit ever-increasing sources of external reference data.

However, whatever is possible in aiming for that (fully reachable or not) single version of the truth, I am often limited by the practical possibilities for storing it.

In storing party master data (and other kinds of data) we may consider these three different ways:

Flat files

This “Keep It Simple, Stupid” way of storing data has been in ongoing retreat but is still common, and new inventions of big flat file data structures are emerging.

Also, many external sources of reference data are still flat-file-like, and the overwhelming choice for exchanging reference and master data is doing it by flat files.

Despite lots of workaround solutions for storing the complex links of the real world in flat files, we basically end up using very simplified representations of the real world (and the truth derived from it) in those flat files.

Relational databases

Most Customer Relationship Management (CRM) systems are based on a relational data model, but mostly a quite basic one regarding master data structures, making it not straightforward to reflect the most common hierarchical structures of the real world, such as company family trees, contacts working for several accounts and individuals forming a household.

Master Data Management hubs are of course built for storing exactly these hierarchical kinds of structures. Common challenges here are that there is often no point in doing so as long as the surrounding applications can’t follow, and that you may restrict yourself to a simplified model anyway, like an industry model.
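
To illustrate the kind of structures in question, here is a minimal sketch in Python. All names and roles are made up for illustration; the point is simply that a company family tree is a parent link on the party, while contacts working for several accounts need a separate many-to-many relationship table that basic CRM account/contact models rarely offer.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Party:
    party_id: str
    name: str
    party_type: str                   # "organization" or "person"
    parent_id: Optional[str] = None   # company family tree: org -> parent org

@dataclass
class Relationship:                   # many-to-many links a flat model lacks
    from_id: str
    to_id: str
    role: str                         # e.g. "contact_at", "household_member"

parties = [
    Party("P1", "Acme Group", "organization"),
    Party("P2", "Acme Rentals", "organization", parent_id="P1"),
    Party("P3", "Acme Leasing", "organization", parent_id="P1"),
    Party("P4", "Jane Doe", "person"),
]
relationships = [
    Relationship("P4", "P2", "contact_at"),   # one person working
    Relationship("P4", "P3", "contact_at"),   # for two accounts
]

def family_tree(root_id: str) -> list:
    """All descendants of an organization in the company family tree."""
    names = []
    for p in parties:
        if p.parent_id == root_id:
            names.append(p.name)
            names.extend(family_tree(p.party_id))
    return names

print(family_tree("P1"))   # ['Acme Rentals', 'Acme Leasing']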

Neural networks

The relations between parties in the real world are in fact not truly hierarchical. That is why we look for inspiration in networks of biological neurons.

Doing so is an option I have heard about for many years, but I have yet to meet it as a concrete choice when delivering a single version of the truth.


Bilateral Master Data Management

There is an issue I have come across over and over again when creating a master data hub, making a golden copy, establishing a single version of the truth or whatever we like to call it. The issue is the scope of data sources.

Basically you take (practically) all the master data sources from within your organization and consolidate that data. Often you match against external sources such as business directories. But what you often miss is the master data operated by your partners. These are partners like:

  • Your suppliers of products, be that raw materials or finished products for resale
  • Your sales agents and distributors
  • Your service providers, such as direct marketing agencies and factoring partners

These partners are part of your business processes, and they often create and consume master data which is only shared with you in a limited way via some form of interface.

I know that even handling master data from within most organizations is a complex issue. Integrating with external reference data doesn’t add simplicity. But without embracing the master data life at your partners, the hub isn’t complete; the copy is only made of plated gold and the single version of the truth isn’t the only truth.

My guess is that many master data programs in the future will extend to embrace internal (private) data, as well as external (public) data and bilateral data as described on the page about Data Quality 3.0.


The Magic Numbers

An often raised question, and the subject of a lot of blog posts in the data quality realm, is whether data quality challenges should be solved by people or by technology.

As in all things data quality, I don’t think there is a single right answer to that.

Now, in this blog post I will not tell what I think the answer(s) to the question may be, but simply tell what I have seen chosen as solutions to the question, which have been both people-centric solutions and technology-centric solutions.

If I look at the situations where people-centric solutions have been chosen versus the situations where technology-centric solutions have been chosen, the first differentiator seems to be numbers:

  • If you have only a small number of customers entered through a single channel, the better solution for optimal data quality and uniqueness seems to be a people-centric one.
  • If you have millions of customers entered through multiple channels, the only practical solution for optimal data quality and uniqueness seems to be a technology-centric one.
  • If you have only a small number of products entered through a single channel, the only sensible solution for optimal data quality and uniqueness seems to be a people-centric one.
  • If you have thousands of products coming from multiple channels, the most reliable solution for optimal data quality and uniqueness seems to be a technology-centric one.

So, based on common sense, the answer to the people or technology question is that it magically depends on the numbers.


Golden Copy Musings

In a recent blog post by Jim Harris called Data Quality is not an Act, it is a Habit the term “golden mean” was mentioned.   

As I commented, mentioning the “golden mean” made me think about the terms “golden copy” and “golden record” which are often used terms in data quality improvement and master data management.

In using these terms I think we are mostly aiming at achieving extreme uniqueness. But we should rather go for symmetry, proportion, and harmony.

The golden copy subject is very timely for me, as this weekend I am overseeing the execution of the automated processes that create a baseline for a golden copy of party master data at a franchise operator for a major car rental brand.

In car rental you are dealing with many different party types. You have companies as customers and prospects, and you have individuals who are contacts at the companies, employees using the cars rented by the companies, and individuals who are private renters. A real-world person may have several of these roles. Besides that, we have cases of mixed identities.

During a series of workshops we have worked on defining the rules for merge and survivorship in the golden copy. Though we may be able to go for extreme uniqueness in identifying real-world companies and persons, this may not necessarily serve the business needs and, like it or not, the result must be capable of being related back to the core systems used in daily business.
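
As a hypothetical sketch of what merge and survivorship rules can look like, here is a small Python example. The field names, source ranking and rules are purely illustrative, not the actual rules from this project: each field of the golden record is taken from the most trusted source, with recency as the tiebreaker.

from datetime import date

# Hypothetical source ranking: which system wins when values conflict.
SOURCE_RANK = {"business_directory": 3, "booking_system": 2, "web_signup": 1}

def survive(records: list) -> dict:
    """Build a golden record field by field from a group of matched duplicates."""
    golden = {}
    fields = {f for r in records for f in r if f not in ("source", "updated")}
    for f in fields:
        candidates = [r for r in records if r.get(f)]   # skip empty values
        if not candidates:
            continue
        # Rule 1: prefer the most trusted source. Rule 2: break ties on recency.
        best = max(candidates,
                   key=lambda r: (SOURCE_RANK.get(r["source"], 0), r["updated"]))
        golden[f] = best[f]
    return golden

duplicates = [
    {"source": "web_signup", "updated": date(2010, 11, 2),
     "name": "J. Doe", "phone": "+45 1234 5678"},
    {"source": "business_directory", "updated": date(2010, 6, 1),
     "name": "Jane Doe", "phone": None},
]
print(survive(duplicates))   # name from the directory, phone from the web signup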

Therefore this golden copy is based on a beautiful golden mean exposing symmetry, proportion, and harmony.


Magic Quadrant Diversity

The Magic Quadrants from Gartner Inc. rank the tool vendors within a lot of different IT disciplines. Related to my work, the quadrants for data quality tools and master data management are the most interesting ones.

However, the quadrants examine the vendors on a global scope. But how are the vendors doing in my country?

I tried to look up a few of the vendors in a local business directory for Denmark provided (free to use on the web) by the local Experian branch.

DataFlux

First up is DataFlux, the (according to Gartner) leading data quality tool vendor.

Result: No hits.

Knowing that DataFlux is owned by SAS Institute will however, with a bit of patience, eventually bring you to information about the DataFlux product deep down on the local SAS website.

PS: Though SAS is better known here as the main airline (Scandinavian Airlines System), SAS Institute is actually very successful in Denmark, having a much larger share of the Business Intelligence market here than in most other places.

Informatica

Next up is Informatica, a well-positioned company in both the quadrant for data quality tools and the quadrant for customer master data management.

Result: No hits.

Here you have to know that Informatica is represented in the Nordic area by a company called Affecto. You will find information about the Informatica products deep down on the Affecto website – along with the competing product FirstLogic, owned by Business Objects (owned by SAP), which has also historically been represented by Affecto.

Stibo Systems

Stibo Systems may not be as well known as the two above, but it is tailing the mega vendors in the quadrant for Product Master Data Management, as mentioned recently in a blog post by Dan Power.

Result: Hit:

They are here with over 500 employees – at least in the legal entity called Stibo, where Stibo Systems is an alternate name and brand. And I’m not kidding; I visited them last month at the impressive headquarters near Århus (the second largest city in Denmark).
