Christmas Tree Options

Today, the last Sunday before Christmas, seems to be a good day for selecting a Christmas tree.

We are considering two different options:

  • As most times before, we will find a tree as wide and tall as the room allows, so it may be decorated with as many of the different things we have collected over the years as well as some of the precious pieces passed down from previous generations. It will be cut above the root, but that’s not a problem since we will throw it away after Christmastide.
  • Another option is a smaller tree, still with its root on, planted in a pot. We will then have to select the decorations carefully. The advantage is that it can be reused on the terrace during the year and then, a little taller, as the Christmas tree again next year.

Well, this is not that different from the considerations about data quality, data warehouse and business intelligence projects and programs from my workdays.


Matching Down Under

As a data matching geek I always love reading about how others have made the great but fearful journey into the data matching world.

This week Wayne Colless of the Australian Attorney-General’s Department kindly made a document about data matching public on the DataQualityPro site. The full title is “Improving the Integrity of Identity Data – Data Matching Better Practice Guidelines, 2009”. Link here.

As Wayne explains in a discussion in the LinkedIn Data Matching group: Australia has no national unique identifier for individuals (such as the US SSN or the number recorded on national ID cards used in many other countries) that can be used, so the matching has to rely solely on non-unique values such as name, address and date of birth.

The document gives very thorough step-by-step guidance on matching individuals’ names, addresses and birth dates. As the document says, you may either build all the logic yourself or buy commercial software that does the same. But either way you have to understand what the software does in order to tune the processes and set thresholds that are meaningful to you.
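To make the idea of combining non-unique attributes concrete, here is a minimal sketch of weighted match scoring. The field weights, the threshold and the sample records are my own illustrative assumptions, not taken from the guidelines document:

```python
# Hedged sketch: score a candidate match across several non-unique
# attributes and compare the combined score against a tunable threshold.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Combine per-field similarities into one weighted score.
    Weights are illustrative; in practice you tune them to your data."""
    weights = {"name": 0.4, "address": 0.3, "birth_date": 0.3}
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in weights.items())

a = {"name": "Mary Johnson", "address": "10 Main Street, Brisbane",
     "birth_date": "1970-06-01"}
b = {"name": "Marie Jonson", "address": "10 Main St, Brisbane",
     "birth_date": "1970-06-01"}

THRESHOLD = 0.85   # set it to what is meaningful for your use case
print(match_score(a, b) > THRESHOLD)   # True: likely the same person
```

The point of the threshold is exactly what the document stresses: whether you build or buy, you must understand the scoring well enough to decide where "same person" begins.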

As Australia is a nation mainly built through immigration, the challenges of adapting the prevailing Anglo-Saxon naming conventions to the reality of name formats coming from all over the world are very apparent. I like that the diversity issues are given serious thought in the document.

I also like that the document addresses a subject not mentioned as often as it should be, namely the challenges of embracing historical values when settling a match, as seen in this figure taken from the document:

Whether or not you think you already know the dos and don’ts of data matching (and I guess you never fully do), I really find the document worth reading.


Matching Light Bulbs

This morning I noticed this lightbulb joke in a tweet from @mortensax:

Besides finding it amusing, I also related to it, since I have used an example with light bulbs in a webinar about data matching, as seen here:

The use of synonyms in Search Engine Optimization (SEO) is very similar to the techniques we use in data matching.

Here the problem is that, for example, these two product descriptions may have a fairly high edit distance (they are very different character by character) but describe the same product:

  • Light bulb, A 19, 130 Volt long life, 60 W
  • Incandescent lamp, 60 Watt, A19, 130V

while these two product descriptions have an edit distance of only one character substitution but are not the same product (though they are in the same category):

  • Light bulb, 60 Watt, A 19, 130 Volt long life
  • Light bulb, 40 Watt, A 19, 130 Volt long life
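The contrast above can be reproduced with a plain Levenshtein edit distance, shown here as a minimal sketch (the implementation is the textbook dynamic-programming version, not any particular tool's):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance: the minimum number of
    # single-character insertions, deletions and substitutions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

same_product = ("Light bulb, A 19, 130 Volt long life, 60 W",
                "Incandescent lamp, 60 Watt, A19, 130V")
same_category = ("Light bulb, 60 Watt, A 19, 130 Volt long life",
                 "Light bulb, 40 Watt, A 19, 130 Volt long life")

print(levenshtein(*same_product))    # a high distance, yet the same product
print(levenshtein(*same_category))   # 1: a single substitution, yet different products
```

This is exactly why product matching needs synonym lists ("Light bulb" = "Incandescent lamp", "W" = "Watt") on top of raw string distance.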

Working with product data matching is indeed very enlightening.


Now, where’s the undo button?

I have just read two blog posts about the dangers of deleting data in the good cause of making data quality improvements.

In his post Why Merging is Evil Scott Schumacher of IBM Initiate describes the horrors of using survivorship rules for merging two (or more) database rows recognized to reflect the same real world entity.

Jim Harris describes the insane practices of getting rid of unwanted data in the post A Confederacy Of Data Defects.

On a personal note, I have just had a related experience from outside the data management world. We have just relocated from a fairly large house to a modest-sized apartment. Due to the downsizing and the good opportunity offered by the move, we threw away a lot of stuff in the process. Now we are in the process of buying replacements for the things we shouldn’t have thrown away.

As Scott describes in his post about merging, there is an alternative approach to merging, namely linking – with some computational inefficiency attached. Also, in the cases described by Jim, we often don’t dare to delete at the root, so instead we keep the original values and make a new cleansed copy without the supposedly unwanted data for the purpose at hand.
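The difference between the two approaches can be sketched in a few lines. The survivorship rule here (most recent non-null value wins) and the record layout are illustrative assumptions of mine, not Scott's or any vendor's actual logic:

```python
# Hedged sketch: destructive merging vs. non-destructive linking of two
# database rows recognized to reflect the same real-world person.

records = {
    1: {"name": "John Doe", "phone": None,       "updated": "2010-01-05"},
    2: {"name": "J. Doe",   "phone": "555-0100", "updated": "2010-11-20"},
}

def merge_with_survivorship(a: dict, b: dict) -> dict:
    """Destructive merge: keep the most recent non-null value per field.
    The losing values are gone -- there is no undo button."""
    newest, oldest = sorted((a, b), key=lambda r: r["updated"], reverse=True)
    return {k: newest[k] if newest[k] is not None else oldest[k]
            for k in newest}

golden = merge_with_survivorship(records[1], records[2])
print(golden["name"], golden["phone"])   # J. Doe 555-0100

# The linking alternative: keep both original rows untouched and record
# the match separately, at the cost of resolving the link at query time.
links = [(1, 2, 0.93)]   # (record_id, record_id, match confidence)
```

With linking, both original rows survive in `records`, so a wrong match can simply be unlinked later; with merging, the discarded values are as gone as the stuff we threw away in our move.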

In my relocation project we could have rented a self-storage unit for all the supposedly not-so-needed stuff as well.

It’s a balance. As in all things data quality there isn’t a single right or wrong answer to what to do. And there will always be regrets. Now, where’s the undo button?


My 2011 To Do List

These days are classic times for predicting something about next year in a blog post. This year I will make some egocentric predictions about what I am going to do next year. Fortunately, I think these activities are pretty representative of the trends in the data quality realm.

My three most important challenges in working with data and information quality improvement and master data management will be:

Multi-Domain Master Data Quality

There are some different disciplines and product offerings around, such as:

  • Data Quality tools
  • Customer Data Integration (CDI) solutions
  • Product Information Management (PIM) platforms

These disciplines and the related software packages used to solve the challenges are constantly maturing and expanding to embrace the problems as a whole.

Find more about the subject in my posts on Multi-Domain MDM.

Exploiting rich external reference data sources in the cloud

Working with external reference sources as a means to improve data quality has been a focus area of mine for many years.

Recent developments in governments releasing rich sources of data will help with availability here, but new challenges will also arise, like achieving conformity across data sources coming from many different countries in many different formats.

Much of the activity here will happen in the cloud.

See my take on the subject on the page Data Quality 3.0 and read about a concrete implementation in instant Data Quality.

Downstream data cleansing

Despite constant improvements in data quality tools and master data management solutions moving us from downstream batch cleansing to upstream prevention, there will still be lots of reasons for doing downstream cleansing projects.

Here are the top 5 reasons.

I expect to be involved in at least one of each type next year.


The Snow Queen

During the existence of this blog I have come to use two tags several times, namely the fairy tale author Hans Christian Andersen, as an inspiration for data quality related subjects, and the tag happy databases, as a counterweight to the risk that we talk too much about all the bad data quality around.

Embracing both of these tags, the fairy tale The Snow Queen also starts at the very bad end.

An evil troll makes a magic mirror that has the power to distort the appearance of things reflected in it. It fails to reflect all the good and beautiful aspects of people and things while it magnifies all the bad and ugly aspects so that they look even worse than they really are; for example, it makes the loveliest landscapes look like “boiled spinach.” I think every child understands that metaphor.

We tend to do the same in the data quality realm. In order to make a case for data and information quality improvement we like to tell about train wrecks like those on the site edited by the IAIDQ. And for the record, I am as guilty as anyone else in reading, laughing and contributing to the mobbing when someone else makes a mistake within data management.


Christmas at the old Bookstore

Once upon a time (let’s say 15 years ago) there was a nice old bookstore on a lovely street in a pretty town. The bookstore was a good shopping place that cared about its customers. The business had grown over the years. Neighboring shops had been bought and added to the premises, along with the apartments above the original shop.

Also, the number of employees had increased. The old business processes didn’t fit into the new reality, so the wise old business owner launched a business process reengineering project in order to have the shop ready for a new record-selling Christmas season. All the employees were more or less involved, from brainstorming ideas to the final implementation. All suggestions were prioritized according to their business value in supporting the way of doing business: handing books over the fine old cash desk in the middle of the bookstore.

Even some new technology adoptions were considered during the process. But not too much. As the wise old business owner said again and again: Technology doesn’t sell books. Ho ho ho.

Unfortunately something terrible happened somewhere else. I don’t remember if it was on the other side of the street, on the other side of the river or on the other side of the ocean. But someone opened an internet bookstore. During the next years the market for selling books changed drastically due to orchestrating a business process based on new technology.

The wise old business owner at the nice old bookstore was shocked. He actually had read the best management books on the shelf in the bookstore, telling him to improve his business processes based on the way of doing business today, rely on changing the attitude of the good people working for him, and then maybe use technology as an enabler in doing that. Ho ho ho.

Now, what about a happy ending? Oh yes. Actually, some people like to buy some books on the internet and like to buy some other books in a nice old bookstore. Some other people like to buy most books in a nice old bookstore but may want to buy a few other books on the internet. So the wise old business owner went into multi-channel book selling. In order to keep track of who is buying what and where, he used a state-of-the-art data matching tool. Ho ho ho. Besides that, he of course relied on the good people still working for him. Ho ho ho.


The Overlooked MDM Feature

When engaging in the social media community dealing with master data management, an often seen subject is creating a list of important capabilities for the technical side of master data management. I have on some occasions commented on such posts by adding a feature I often see omitted from these lists, namely: error tolerant search functionality. Examples from the DataFlux CoE blog here and the LinkedIn Master Data Management Interest Group here.

Error tolerant search (also called fuzzy search) technology is closely related to data matching technology. But where data matching is basically non-interactive, error tolerant search is highly interactive.

Most people know error tolerant search from googling. You enter something with a typo and Google prompts you back with: Did you mean…? When looking for entities in master data management hubs you certainly need something similar. Spelling names, addresses, product descriptions and so on is not easy – not least in a globalized world.

As in data matching, error tolerant search may use lists of synonyms as the basic technology. But the use of algorithms is also common, ranging from an oldie like the Soundex phonetic algorithm to more sophisticated algorithms.
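To show what the oldie looks like, here is a minimal sketch of the classic American Soundex: keep the first letter, encode the remaining consonants as digits, and pad to four characters, so names that sound alike get the same code:

```python
def soundex(name: str) -> str:
    """Classic American Soundex: one letter plus three digits."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    result = name[0]
    prev = codes.get(name[0], "")
    for c in name[1:]:
        code = codes.get(c, "")
        if code and code != prev:      # skip repeated adjacent codes
            result += code
        if c not in "HW":              # H and W do not separate codes
            prev = code
    return (result + "000")[:4]

# A typo or spelling variant still finds the intended entity:
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
print(soundex("Smith"), soundex("Smyth"))     # S530 S530
```

In a master data hub you would index the Soundex code of names alongside the names themselves, so a search for "Smyth" also returns "Smith" – the interactive counterpart of the non-interactive matching described above.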

The business benefits of having error tolerant search as a capability in your master data management solution are plenty, including:

  • Better data quality through upstream prevention of duplicate entries, as explained in this post.
  • More efficiency by reducing the time users spend searching for information about entities in the master data hub.
  • Higher employee satisfaction by eliminating a lot of the frustration that otherwise comes from not finding what you know must already be inside the hub.

Error tolerant search has been one of the core features in the master data management implementations where I have been involved. What about you?


Snowman Data Quality

Right now it is winter in the Northern Hemisphere and this year winter has come earlier than usual to Northern Europe where I live. We have already had a lot of snow.

One of the good things about snow is that you are able to build a snowman. Snowmen are beautiful pieces of art but very vulnerable. Wind and, not least, rising temperatures make the snowman ugly until it finally, sooner or later, goes away.

Snowmen share this unfortunate fate with many data quality initiatives.

Many articles, blog posts and so on in the data quality realm focus on this fate in relation to technology based initiatives. The common practice of executing downstream cleansing of data using data quality tools is often criticized. As a practitioner in this field I have to admit: yes, I am often practicing the art of building snowman data quality.

An often stated alternative to using data quality tools is improving data quality through change management, including relying on changing the attitude of the people entering and maintaining data. Though it’s not my area of expertise, I have seen such initiatives too. And I am afraid that such initiatives unfortunately also, sooner or later, suffer the same fate as the snowman.

As said, I’m not the expert here. I am only the little child watching how this snowman is exposed to the changing winds in many business environments and how it finally disappears when the business climate varies over time.

Now, this is supposed to be a cheerful blog about happy databases. I am ready to get into some warm clothes and build a beautiful snowman of any kind.


Sell-side vs Buy-side Master Data Quality

The two most prominent domains in master data management and related data quality improvement are:

  • Party master data and
  • Product master data

Party Master Data

Most of the talk about party master data is about customer master data (including prospect master data). This discipline is often called Customer Data Integration (CDI). Customer data is the sell-side of party master data. The organizations with the biggest pains in this area are mostly organizations with many customers (and prospects). The largest volumes of customer data are related to business-to-consumer (B2C) activities, but we certainly also see many grown customer databases in the business-to-business (B2B) realm.

The buy-side of party master data is supplier data. Fewer organizations have grown supplier databases, but surely big firms with many different departments and subsidiaries have supplier master data issues like the ones we see on the sell-side.

Also many organizations have a surprisingly large intersection of the same parties being both on the sell-side and on the buy-side. I have touched that subject in the post: 360° Business Partner View.

Product Master Data

Product Information Management (PIM) also has a sell-side and a buy-side. Here too, the pains grow with the numbers. Contrary to party master data, high sell-side numbers are rarer than high buy-side numbers in product master data.

We often see high sell-side numbers of products at retailers, where the same product is also buy-side at the same time, but where we perhaps don’t have the same requirements for entity resolution. Most organizations don’t have such big issues (like problems with uniqueness) with products of their own making.

Otherwise, high numbers of buy-side products are not so much related to buying raw materials as to buying things like spare parts and all kinds of small equipment and assets (with software licenses being the closest to herding cats, I guess).

Multi-Domain Master Data Management

With multi-domain master data management there is of course a connection between sell-side party master data and sell-side product master data, with opportunities in analyzing to whom we sell what, discovering cross-selling openings and so on.

On the buy-side there is great potential in looking into where we buy similar things from, discount possibilities and so on.

Same same but different

A while ago I wrote a blog post about similarities and differences between party master data quality and product master data quality called Same Same But Different.

Besides the differences between party master data and product master data, I also find that we have differences between sell-side and buy-side, making it four different but somewhat similar and connected disciplines in master data management and data quality improvement.
