Diversity in Data Quality in 2010

Diversity in data quality is a favorite topic of mine, and diversity has been my theme word in social media engagement this year.

Fortunately I’m not alone. Others have been writing about diversity in data quality in the past year. Here are some of the contributions I remember:

The Dutch data quality tool vendor Human Inference has a blog called Data Value Talk. Several posts there are about diversity in data quality, including the post World Languages Day – Linguistic diversity rules in Switzerland!

Another blog based in the Netherlands is from Graham Rhind. Graham (a Brit stranded in Amsterdam) is an expert in international data quality issues, and one of his blog posts this year is called Robert the Carrot.

The MDM vendor IBM Initiate has a lively blog about Master Data Management and Data Quality. One of the posts this year was an introduction to a webinar. The post, by Scott Schumacher (in which I’m proud to be mentioned), is called Join Us to Demystify Multi-Cultural Name Matching.

Rich Murnane posted a funny but instructive video with Derek Sivers about Japanese addresses called What is the name of that block? (Again, thanks Rich for the mention).

In the eLearningCurve free webinar series there was a very educational session with Kathy Hunter called Overcoming the Challenges of Global Data. There is also an interview with Kathy Hunter on the DataQualityPro site.

I also remember we debated the state of the art of data quality tools when it comes to international data in the post by Jim Harris called OOBE-DQ, Where Are You? As Jim mentions in his later post called Do you believe in Magic (Quadrants)?: “It must be noted that many vendors (including the “market leaders”) continue to struggle with their International OOBE-DQ”.

I guess that international capabilities in data quality tools and party master data management solutions will be on the agenda in 2011 as well.

Automation

The article on Wikipedia about automation begins like this:

“Automation is the use of control systems and information technologies to reduce the need for human work in the production of goods and services. In the scope of industrialization, automation is a step beyond mechanization. Whereas mechanization provided human operators with machinery to assist them with the muscular requirements of work, automation greatly decreases the need for human sensory and mental requirements as well. Automation plays an increasingly important role in the world economy.

Automation has had a notable impact in a wide range of industries beyond manufacturing (where it began). Once-ubiquitous telephone operators have been replaced largely by automated telephone switchboards and answering machines.”

Often we discuss the role of technology in solving data and information quality issues. Viewpoints differ between:

  • Technology may be part of the problem, but should not be part of the solution
  • Tools may solve a certain part of the problems by automating otherwise time-consuming processes

I am deliberately not including the extreme viewpoint that tools (or a certain tool) will solve everything, as I have never actually seen or heard that viewpoint expressed, as mentioned in the post Data Quality Tool Exaggerations.

So, given that range, my viewpoint is the second of the two mentioned above.

If, surprisingly, you should have a more extreme viewpoint, you may go to the OCDQ Blog post called What Does Data Quality Technology Want? and vote for the second option there.

My 2011 To Do List

These days are classic times for predicting something about next year in a blog post. This year I will make some egocentric predictions about what I am going to do next year. Fortunately I think these activities are pretty representative of the trends in the data quality realm.

My three most important challenges in working with data and information quality improvement and master data management will be:

Multi-Domain Master Data Quality

There are several different disciplines and product offerings around, such as:

  • Data Quality tools
  • Customer Data Integration (CDI) solutions
  • Product Information Management (PIM) platforms

These disciplines and the related software packages used to solve the challenges are constantly maturing and expanding to embrace the problems as a whole.

Find more about the subject in my posts on Multi-Domain MDM.

Exploiting rich external reference data sources in the cloud

Working with external reference sources as a means to improve data quality has been a focus area of mine for many years.

Recent developments in governments releasing rich sources of data will help with availability here, but new challenges will also arise, like achieving conformity across data sources coming from many different countries in many different formats.

Much of the activity here will happen in the cloud.

See my take on the subject on the page Data Quality 3.0 and read about a concrete implementation in instant Data Quality.

Downstream data cleansing

Despite constant improvements in data quality tools and master data management solutions moving us from downstream batch cleansing to upstream prevention, there will still be lots of reasons for doing downstream cleansing projects.

Here are the top 5 reasons.

I expect to be involved in at least one of each type next year.

Christmas at the old Bookstore

Once upon a time (let’s say 15 years ago) there was a nice old bookstore on a lovely street in a pretty town. The bookstore was a good shopping place caring about its customers. The business had grown over the years. Neighboring shops had been bought and added to the premises along with the apartments above the original shop.

The number of employees had also increased. The old business processes didn’t fit into the new reality, so the wise old business owner launched a business process reengineering project in order to have the shop ready for a new record-selling Christmas season. All the employees were more or less involved, from brainstorming ideas to the final implementation. All suggestions were prioritized according to business value in supporting the way of doing business: Handing books over the fine old cash desk in the middle of the bookstore.

Even some new technology adoptions were considered during the process. But not too many. As the wise old business owner said again and again: Technology doesn’t sell books. Ho ho ho.

Unfortunately something terrible happened somewhere else. I don’t remember if it was on the other side of the street, on the other side of the river or on the other side of the ocean. But someone opened an internet bookstore. During the next years the market for selling books changed drastically due to a business process orchestrated around new technology.

The wise old business owner at the nice old bookstore was shocked. He had actually read the best management books on the shelf in the bookstore, telling him to improve his business processes based on the way of doing business today, rely on changing the attitude of the good people working for him, and then maybe use technology as an enabler in doing that. Ho ho ho.

Now, what about a happy ending? Oh yes. Actually some people like to buy some books on the internet and like to buy some other books in a nice old bookstore. Some other people like to buy most books in a nice old bookstore but may want to buy a few other books on the internet. So the wise old business owner went into multi-channel book selling. In order to keep track of who is buying what and where, he used a state-of-the-art data matching tool. Ho ho ho. Besides that he of course relied on the good people still working for him. Ho ho ho.

The Overlooked MDM Feature

When engaging in the social media community dealing with master data management, a frequently seen subject is creating a list of important capabilities for the technical side of master data management. I have on some occasions commented on such posts by adding a feature I often see omitted from these lists, namely: Error tolerant search functionality. Examples from the DataFlux CoE blog here and the LinkedIn Master Data Management Interest Group here.

Error tolerant search (also called fuzzy search) technology is closely related to data matching technology. But where data matching is basically non-interactive, error tolerant search is highly interactive.

Most people know error tolerant search from googling. You enter something with a typo and Google prompts you back with: Did you mean…? When looking for entities in master data management hubs you certainly need something similar. Spelling of names, addresses, product descriptions and so on is not easy – not least in a globalized world.

As in data matching, error tolerant search may use lists of synonyms as the basic technology. But the use of algorithms is also common, ranging from an oldie like the soundex phonetic algorithm to more sophisticated ones.
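
To make that concrete, here is a minimal sketch in Python of what soundex-based error tolerant search could look like. The soundex implementation and the sample customer names are purely illustrative, not any particular vendor’s approach:

    def soundex(name: str) -> str:
        """Encode a name with the classic soundex phonetic algorithm."""
        codes = {"b": "1", "f": "1", "p": "1", "v": "1",
                 "c": "2", "g": "2", "j": "2", "k": "2",
                 "q": "2", "s": "2", "x": "2", "z": "2",
                 "d": "3", "t": "3", "l": "4",
                 "m": "5", "n": "5", "r": "6"}
        name = "".join(ch for ch in name.lower() if ch.isalpha())
        if not name:
            return ""
        # Keep the first letter, encode the rest, skipping vowels and
        # collapsing adjacent duplicate codes (h and w do not break a run).
        encoded = name[0].upper()
        prev = codes.get(name[0], "")
        for ch in name[1:]:
            code = codes.get(ch, "")
            if code and code != prev:
                encoded += code
            if ch not in "hw":
                prev = code
        return (encoded + "000")[:4]

    def fuzzy_search(query: str, names: list[str]) -> list[str]:
        """Return the stored names that sound like the query."""
        target = soundex(query)
        return [n for n in names if soundex(n) == target]

    customers = ["Jensen", "Johnson", "Jansen", "Smith", "Schmidt"]
    print(fuzzy_search("Jonson", customers))  # ['Jensen', 'Johnson', 'Jansen']

A real master data hub would of course combine several algorithms, synonym lists and ranking, but even this oldie gives you the Did you mean…? experience.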

The business benefits from having error tolerant search as a capability in your master data management solution are plenty, including:

  • Better data quality through upstream prevention of duplicate entries as explained in this post.
  • More efficiency by bringing down the time users spend searching for information about entities in the master data hub.
  • Higher employee satisfaction by eliminating a lot of the frustration that otherwise comes from not finding what you know must already be inside the hub.

Error tolerant search has been one of the core features in the master data management implementations where I have been involved. What about you?

Snowman Data Quality

Right now it is winter in the Northern Hemisphere, and this year winter has come earlier than usual to Northern Europe, where I live. We have already had a lot of snow.

One of the good things about snow is that you are able to build a snowman. Snowmen are beautiful pieces of art but very vulnerable. Wind and, not least, rising temperatures make the snowman ugly and sooner or later make it go away.

Snowmen have this unfortunate fate in common with many data quality initiatives.

Many articles, blog posts and so on in the data quality realm focus on this fate in relation to technology-based initiatives. The common practice of executing downstream cleansing of data using data quality tools is often criticized. As a practitioner in this field I have to admit it: Yes, I am often practicing the art of building snowman data quality.

An often stated alternative to using data quality tools is improving data quality through change management, including relying on changing the attitude of the people entering and maintaining data. Though it’s not my area of expertise, I have seen such initiatives too. And I am afraid that such initiatives, unfortunately, also sooner or later suffer the same fate as the snowman.

As said, I’m not the expert here. I am only the little child watching how this snowman is exposed to the changing winds in many business environments and how it finally disappears as the business climate varies over time.

Now, this is supposed to be a cheerful blog about happy databases. I am ready to get into some warm clothes and build a beautiful snowman of any kind.

Testing a Data Matching Tool

Many technical magazines run tests of a range of similar products, like in the IT world comparing a range of CPUs or a selection of word processors. The tests compare measurable things such as speed, the ability to actually perform a certain task and, importantly, the price.

With enterprise software such as data quality tools we only have analyst reports evaluating the tools on far less measurable factors, often giving a result pretty much equivalent to stating market strength. The analysts haven’t compared actual speed; they have not tested the ability to do a certain task, nor taken the price into consideration.

A core feature in most data quality tools is data matching. This is the discipline where data quality tools are able to do something considerably better than more common technology such as database managers and spreadsheets, as told in the post about deduplicating with a spreadsheet.

In the LinkedIn data matching group we have on several occasions touched the subject of doing a once-and-for-all benchmark of all data quality tools in the world.

My guess is that this is not going to happen. So, if you want to evaluate data quality tools, and data matching is the prominent issue, and you don’t just want a beauty contest, then you have to do as the queen in the fairy tale about The Princess and the Pea: Make a test.

Some important differentiators in data matching effectiveness may narrow down the scope for your particular requirements, like:

  • Are you doing B2C (private names and addresses), B2B (business names and addresses) or both?
  • Do you only have domestic data or do you have international data with diversity issues?
  • Will you only go for one entity type (like customer or product) or are you going for multi-entity matching?

Making a proper test is not trivial.

Often you start by looking at the positive matches provided by the tool, counting the true positives compared to the false positives. Depending on the purpose, you want to see a very low figure for false positives against true positives.

Harder, but at least as important, is looking at the negatives (the not-matched ones) as explained in the post 3 out of 10.
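
As a minimal sketch of the arithmetic, assuming you have a hand-labeled sample of record pairs where the true answer is known (the labels below are purely illustrative):

    # Each entry pairs the tool's verdict with a human reviewer's verdict.
    labeled_pairs = [
        {"tool_says_match": True,  "truly_match": True},   # true positive
        {"tool_says_match": True,  "truly_match": False},  # false positive
        {"tool_says_match": False, "truly_match": True},   # false negative
        {"tool_says_match": False, "truly_match": False},  # true negative
        {"tool_says_match": True,  "truly_match": True},   # true positive
    ]

    tp = sum(p["tool_says_match"] and p["truly_match"] for p in labeled_pairs)
    fp = sum(p["tool_says_match"] and not p["truly_match"] for p in labeled_pairs)
    fn = sum(not p["tool_says_match"] and p["truly_match"] for p in labeled_pairs)

    # Precision: of the matches the tool made, how many are real?
    precision = tp / (tp + fp)
    # Recall: of the real matches out there, how many did the tool find?
    recall = tp / (tp + fn)
    print(f"precision = {precision:.2f}, recall = {recall:.2f}")  # 0.67, 0.67

A low false positive figure shows up as high precision, while the harder question about the negatives is exactly what recall measures.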

The next two features are essential:

  • To what degree are you able to tune the match rules, preferably in a user-friendly way not requiring too much IT expert involvement?
  • Are you able to evaluate dubious matches in a speedy and user-friendly way as shown in the post called When computer says maybe?

A data matching effort often has two phases:

  • An initial match of all currently stored data, maybe supported by matching with external reference data. Here speed may be important too. Often you have to balance high speed against poor results. Try it.
  • Ongoing matching, assisting in data entry and keeping up with data coming from outside your jurisdiction. Here data quality tools acting as service-oriented architecture components are a great plus, including reusing the rules from the initial match, as sketched below. This has to be tested too.
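
As a minimal sketch of what reusing the rules across both phases could look like (the normalization rule and the sample records are hypothetical, not any specific tool’s API):

    import re

    def normalize(name: str) -> str:
        """One shared match rule: lowercase, strip punctuation and legal forms."""
        name = re.sub(r"[^a-z0-9 ]", "", name.lower())
        return re.sub(r"\b(inc|ltd|gmbh|aps)\b", "", name).strip()

    # Phase 1: the initial batch match over all currently stored records.
    stored = {1: "Acme Inc.", 2: "ACME", 3: "Beta Ltd"}
    groups = {}
    for rec_id, name in stored.items():
        groups.setdefault(normalize(name), []).append(rec_id)
    print({k: v for k, v in groups.items() if len(v) > 1})  # {'acme': [1, 2]}

    # Phase 2: ongoing matching at data entry, exposing the very same
    # rule as a callable service component.
    def match_service(candidate: str) -> list:
        """Return ids of stored records a new entry would collide with."""
        return groups.get(normalize(candidate), [])

    print(match_service("Acme GmbH"))  # [1, 2] - warn before creating a duplicate

Testing both modes with the same rule set reveals whether the tool really lets you reuse what you tuned in the initial match.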

And oh yes, from my experience with plenty of data quality tool evaluation processes: Price is an issue too. Make sure to count both the license costs for all the needed features and the consultancy needed, as experienced in your tests.

The Princess and the Pea

I have earlier used the fairy tales of Hans Christian Andersen on this blog. This time it is the story about the princess and the pea.

The story tells of a prince who wants to marry a princess, but is having difficulty finding a suitable wife. Something is always wrong with those he meets, and he cannot be certain they are real princesses. One stormy night (always a harbinger of either a life-threatening situation or the opportunity for a romantic alliance in Andersen’s stories), a young woman drenched with rain seeks shelter in the prince’s castle. She claims to be a princess, so the prince’s mother decides to test their unexpected guest by placing a pea in the bed she is offered for the night, covered by 20 mattresses and 20 featherbeds. In the morning the guest tells her hosts—in a speech colored with double entendres—that she endured a sleepless night, kept awake by something hard in the bed; which she is certain has bruised her. The prince rejoices. Only a real princess would have the sensitivity to feel a pea through such a quantity of bedding. The two are married, and the pea is placed in the Royal Museum.

Buying a data quality tool is just as hard as it was for a prince to find a real princess in the good old days. How can you be certain that the tool is able to help you find the difficult, not-so-obvious flaws hidden in your already stored data or in the data streams coming in?

I think performing a test like the queen did in Andersen’s story is a must, and, like the queen, don’t tell the vendor about the pea. Wait and see if the tool gets black and blue all over from the pea.

Entity Revolution vs Entity Evolution

Entity resolution is the discipline of uniquely identifying your master data records, typically those holding data about customers, products and locations. Entity resolution is closely related to the concept of a single version of the truth.

Questions to be asked during entity resolution include:

  • Is a given customer master data record representing a real world person or organization?
  • Is a person acting as a private customer and a small business owner going to be seen as the same?
  • Is a product coming from supplier A going to be identified as the same as the same product coming from supplier B?
  • Is the geocode for the center of a parcel the same place as the geocode of where the parcel is bordering a public road?

We may come a long way in automating entity resolution by using advanced data matching and exploiting rich sources of external reference data, and we may be able to handle the complex structures of the real world by using sophisticated hierarchy management, hereby making an entity revolution in our databases.
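
As a tiny illustration of the external reference data part, here is a sketch in Python where records sharing a verified registration number resolve to one real-world entity. The registry and the records are made up for the example:

    # Hypothetical external reference source: a public business registry.
    business_registry = {
        "DK12345678": {"name": "Nordic Trading ApS", "city": "Copenhagen"},
    }

    # Two customer master data records captured in different systems.
    records = [
        {"id": "A-17", "name": "Nordic Trading", "reg_no": "DK12345678"},
        {"id": "B-02", "name": "NORDIC TRADING APS", "reg_no": "DK12345678"},
    ]

    # Records sharing a registry-verified identifier are resolved to the
    # same real-world organization.
    entities = {}
    for rec in records:
        if rec["reg_no"] in business_registry:
            entities.setdefault(rec["reg_no"], []).append(rec["id"])

    print(entities)  # {'DK12345678': ['A-17', 'B-02']} - one real-world entity

The hard part in practice is of course all the records where no such shared identifier exists, which is where the advanced data matching comes in.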

But I am often faced with the fact that most organizations don’t want an entity revolution. There are always plenty of good reasons why different frequent business processes don’t require full entity resolution and will only be complicated by having it (unless drastically reengineered). The tangible immediate negative business impact of an entity revolution trumps the softer positive improvement in business insight from such a revolution.

Therefore we are mostly making entity evolutions, balancing current business requirements with the distant ideal of a single version of the truth.

Donkey Business

When I started focusing on data quality technology 15 years ago, I had great expectations about the spread of data quality tools, including the humble one I was fabricating myself.

Even if you tell me that tools haven’t spread because people are more important than technology, I think most people in the data and information quality realm agree that the data and information quality cause hasn’t spread as much as it deserves.

Fortunately it seems that the interest in solving data quality issues is getting traction these days. I have noticed two main drivers for that. If we compare with the traditional means of getting a donkey to move forward, one encouragement is like the carrot and the other is like the stick:

  • The carrot is business intelligence
  • The stick is compliance

With business intelligence, a lot has been said and written about how business intelligence doesn’t deliver unless the intelligence is built on a solid, valid data foundation. As a result I have noticed I’m increasingly being involved in data quality improvement initiatives aimed at providing a foundation for business decisions. One of my favorite data quality bloggers, Jim Harris, has dangled that carrot a lot on his blog: Obsessive Compulsive Data Quality.

Another favorite data quality blogger, Ken O’Connor, has written about the stick being compliance work on his blog, where you will find a lot of good points that Ken has learned from his extensive involvement in regulatory requirement issues.

These are interesting times with a lot of requirements for solving data quality issues. As we all know, the stereotypical donkey is not easily driven forward, and we must be careful not to make the burden too heavy.
