My 2011 To Do List

These days are the classic time for predicting something about the coming year in a blog post. This year I will make some egocentric predictions about what I am going to do next year. Fortunately, I think these activities are fairly representative of the trends in the data quality realm.

My three most important challenges in working with data and information quality improvement and master data management will be:

Multi-Domain Master Data Quality

There are several different disciplines and product offerings around, such as:

  • Data Quality tools
  • Customer Data Integration (CDI) solutions
  • Product Information Management (PIM) platforms

These disciplines and the related software packages used to solve the challenges are constantly maturing and expanding to embrace the problems as a whole.

Find more about the subject in my posts on Multi-Domain MDM.

Exploiting rich external reference data sources in the cloud

Working with external reference sources as a means to improve data quality has been a focus area of mine for many years.

Recent developments in governments releasing rich sources of data will help with availability here, but new challenges will also arise, like achieving conformity across data sources coming from many different countries in many different ways.

Much of the activity here will happen in the cloud.

See my take on the subject on the page Data Quality 3.0 and read about a concrete implementation in instant Data Quality.

Downstream data cleansing

Despite constant improvements in data quality tools and master data management solutions moving us from downstream batch cleansing to upstream prevention, there will still be plenty of reasons for doing downstream cleansing projects.

Here are the top 5 reasons.

I expect to be involved in at least one of each type next year.

Snowman Data Quality

Right now it is winter in the Northern Hemisphere and this year winter has come earlier than usual to Northern Europe where I live. We have already had a lot of snow.

One of the good things about snow is that you are able to build a snowman. Snowmen are beautiful pieces of art but very vulnerable. Wind and, not least, rising temperatures make the snowman ugly and, sooner or later, make it go away.

Snowmen have this unfortunate fate in common with many data quality initiatives.

Many articles, blog posts and so on in the data quality realm focus on this fate in relation to technology based initiatives. The common practice of executing downstream cleansing of data using data quality tools is often criticized. As a practitioner in this field I have to admit: Yes, I am often practicing the art of building snowman data quality.

An often stated alternative to using data quality tools is improving data quality through change management, including relying on changing the attitude of the people entering and maintaining data. Though it's not my area of expertise, I have seen such initiatives too. And I am afraid that such initiatives unfortunately also, sooner or later, suffer the same fate as the snowman.

As said, I’m not the expert here. I am only the little child watching how this snowman is exposed to the changing winds in many business environments and how it finally disappears when the business climate varies over time.

Now, this is supposed to be a cheerful blog about happy databases. I am ready to get into some warm clothes and build a beautiful snowman of any kind.

Testing a Data Matching Tool

Many technical magazines test a range of similar products, like in the IT world comparing a range of CPUs or a selection of word processors. These tests compare measurable things such as speed, the ability to actually perform a certain task and, importantly, the price.

With enterprise software such as data quality tools we only have analyst reports evaluating the tools on far less measurable factors, often giving a result roughly equivalent to stating the market strength of each vendor. The analysts haven't compared the actual speed; they have not tested the ability to do a certain task, nor taken the price into consideration.

A core feature in most data quality tools is data matching. This is the discipline where data quality tools are able to do something considerably better than more common technology such as database managers and spreadsheets, as told in the post about deduplicating with a spreadsheet.
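
To illustrate the point, here is a minimal sketch in Python, my own illustration and not taken from any particular tool, of the difference between the exact comparison a spreadsheet formula gives you and a similarity score that can flag near-duplicates for review (the example records are made up):

    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        """Return a 0..1 similarity score for two normalized strings."""
        return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

    pairs = [
        ("Acme Corporation", "ACME Corp."),                            # near-duplicate
        ("Jens Hansen, Strandvejen 10", "Jens Hansen, Strandvej 10"),  # near-duplicate
        ("Acme Corporation", "Globex Inc."),                           # genuinely different
    ]

    for a, b in pairs:
        # A spreadsheet style check (a == b) would return FALSE for all three pairs.
        print(f"{similarity(a, b):.2f}  {a!r} vs {b!r}")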

In the LinkedIn data matching group we have on several occasions touched the subject of doing a once and for all benchmark of all data quality tools in the world.

My guess is that this is not going to happen. So, if you want to evaluate data quality tools, data matching is the prominent issue, and you don't just want a beauty contest, then you have to do as the queen in the fairy tale The Princess and the Pea: make a test.

Some important differentiators in data matching effectiveness may narrow down the scope for your particular requirements like:

  • Are you doing B2C (private names and addresses), B2B (business names and addresses) or both?
  • Do you only have domestic data or do you have international data with diversity issues?
  • Will you only go for one entity type (like customer or product) or are you going for multi-entity matching?

Making a proper test is not trivial.

Often you start by looking at the positive matches provided by the tool, counting the true positives compared to the false positives. Depending on the purpose, you want to see a very low figure for false positives relative to true positives.

Harder, but at least as important, is looking at the negatives (the not matched ones) as explained in the post 3 out of 10.  
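
To make the counting concrete, here is a minimal sketch in Python, my own illustration rather than anything from the posts mentioned, of how a test could tally reviewed matches and reviewed non-matches: the confirmed and rejected positives give you precision, while the reviewed non-matches reveal the duplicates the tool missed (the figures are made up):

    def match_quality(reviewed_matches, reviewed_non_matches):
        """Each argument is a list of booleans: True means a reviewer confirmed a real duplicate."""
        true_pos = sum(reviewed_matches)
        false_pos = len(reviewed_matches) - true_pos
        false_neg = sum(reviewed_non_matches)  # real duplicates the tool failed to match
        precision = true_pos / len(reviewed_matches) if reviewed_matches else 0.0
        return true_pos, false_pos, false_neg, precision

    # Example: 100 reviewed matched pairs (95 confirmed) and 50 reviewed unmatched pairs (3 were misses).
    tp, fp, fn, precision = match_quality([True] * 95 + [False] * 5, [True] * 3 + [False] * 47)
    print(f"true positives={tp}, false positives={fp}, missed duplicates={fn}, precision={precision:.2f}")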

The next two features are essential:

  • To what degree are you able to tune the match rules, preferably in a user friendly way not requiring too much IT expert involvement?
  • Are you able to evaluate dubious matches in a speedy and user friendly way, as shown in the post called When computer says maybe?

A data matching effort often has two phases:

  • An initial match with all currently stored data, maybe supported by matching with external reference data. Here speed may be important too. Often you have to balance high speed against poorer results. Try it.
  • Ongoing matching assisting in data entry and keeping up with data coming from outside your jurisdiction. Here data quality tools acting as service oriented architecture components are a great plus, including reusing the rules from the initial match. This has to be tested too; see the sketch after this list.
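
As a sketch of what reusing the rules across the two phases could look like, here is a simplified, assumed example in Python, not how any particular tool works, where the same match rule drives both the initial batch run and a service style check at data entry (the threshold is an assumption you would tune during your test):

    from difflib import SequenceMatcher

    THRESHOLD = 0.85  # assumed cut-off; tuning it is part of the test

    def is_match(a: str, b: str) -> bool:
        """The shared match rule: a normalized fuzzy comparison against a threshold."""
        return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio() >= THRESHOLD

    def initial_batch_match(records):
        """Phase 1: pairwise match over the stored data (a real tool adds blocking/indexing for speed)."""
        return [(i, j) for i in range(len(records))
                       for j in range(i + 1, len(records))
                       if is_match(records[i], records[j])]

    def check_new_record(new_record, records):
        """Phase 2: the same rule exposed as a service call while data is being entered."""
        return [r for r in records if is_match(new_record, r)]

    existing = ["Acme Corporation", "Acme Corporation A/S", "Globex Inc"]
    print(initial_batch_match(existing))                    # [(0, 1)]
    print(check_new_record("ACME Corporation", existing))   # both Acme records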

And oh yes, from my experience with plenty of data quality tool evaluation processes: Price is an issue too. Make sure to count both the license costs for all the needed features and the consultancy needed, as experienced in your tests.

The Princess and the Pea

I have earlier used the fairy tales of Hans Christian Andersen on this blog. This time it is the story about The Princess and the Pea.

The story tells of a prince who wants to marry a princess, but is having difficulty finding a suitable wife. Something is always wrong with those he meets, and he cannot be certain they are real princesses. One stormy night (always a harbinger of either a life-threatening situation or the opportunity for a romantic alliance in Andersen’s stories), a young woman drenched with rain seeks shelter in the prince’s castle. She claims to be a princess, so the prince’s mother decides to test their unexpected guest by placing a pea in the bed she is offered for the night, covered by 20 mattresses and 20 featherbeds. In the morning the guest tells her hosts—in a speech colored with double entendres—that she endured a sleepless night, kept awake by something hard in the bed; which she is certain has bruised her. The prince rejoices. Only a real princess would have the sensitivity to feel a pea through such a quantity of bedding. The two are married, and the pea is placed in the Royal Museum.

Buying a data quality tool is just as hard as it was for a prince to find a real princess in the good old days. How can you be certain that the tool is able to help you find the difficult, not obvious flaws hidden in your already stored data or in the data streams coming in?

I think performing a test like the queen did in Andersen's story is a must, and, like the queen, don't tell the vendor about the pea. Wait and see if the tool gets black and blue all over from the pea.

Donkey Business

When I started focusing on data quality technology 15 years ago I had great expectations about the spread of data quality tools including the humble one I was fabricating myself.

Even if you tell me that tools haven't spread because people are more important than technology, I think most people in the data and information quality realm think that the data and information quality cause hasn't spread as much as it deserves.

Fortunately it seems that the interest in solving data quality issues is gaining traction these days. I have noticed two main drivers for that. If we compare with the traditional means of getting a donkey to move forward, one encouragement is like the carrot and the other is like the stick:

  • The carrot is business intelligence
  • The stick is compliance

With business intelligence a lot has been said and written about how business intelligence doesn't deliver unless the intelligence is built on a solid, valid data foundation. As a result I have noticed I'm being involved in data quality improvement initiatives aimed at providing a foundation for delivering business decisions. One of my favorite data quality bloggers, Jim Harris, has written a lot about that carrot on his blog: Obsessive Compulsive Data Quality.

Another favorite data quality blogger, Ken O'Connor, has written about the stick being compliance work on his blog, where you will find a lot of good points that Ken has learned from his extensive involvement in regulatory requirement issues.

These times are interesting times with a lot of requirements for solving data quality issues. As we all know, the stereotypical donkey is not easily driven forward, and we must be careful not to make the burden too heavy.

Data Quality Tool Exaggerations

When following articles and blogs about information and data quality you often meet a sentiment like this:

“Data Quality tool vendors describe their products as if they will solve every possible data quality challenge around once and for all”.

Some years ago I was involved in making the English text for a description of a data quality vendor and our products. Here is the text:

“With activities in Germany, Denmark, Norway, Sweden, Austria, Switzerland, Italy, Spain, and France, [our company] is one of the leading data quality experts in Europe. We provide ready-made solutions, products, and services that increase your profits by protecting and improving your company’s customer, address, supplier and product data.

[Our company] offers state-of-the-art solutions for all of the following tasks:

  • Find, match, and eliminate duplicates
  • Restructure customer, supplier, and product databases
  • Compare with major reference data suppliers in order to correct incorrect data records
  • Enrich existing data with missing information
  • Find customers when searching within CRM and ERP systems
  • Integrate Data Quality components in SOA environments
  • Create a Master Data Hub”

Now, I don’t think we promised to boil the ocean here.

Have you stumbled upon a description on websites, in white papers, product sheets or the like where the vendor tells you that every data quality problem will be eliminated when you buy the tool?

Show me.

The Magic Numbers

An often raised question and a subject for a lot of blog posts in the data quality realm is whether data quality challenges should be solved by people or technology.

As in all things data quality I don’t think there is a single right answer for that.

Now, in this blog post I will not tell you what I think the answer(s) to the question might be, but simply tell you what I have seen chosen as the solution, which has been both people centric solutions and technology centric solutions.

If I look at the situations where people centric solutions have been chosen versus the situations where technology centric solutions have been chosen, the first differentiator seems to be numbers:

  • If you have only a small number of customers and a single channel where they are entered, the better solution for optimal data quality and uniqueness seems to be a people centric solution.
  • If you have millions of customers and multiple channels where they are entered, the only practical solution for optimal data quality and uniqueness seems to be a technology centric solution.
  • If you have only a small number of products and a single channel where they are entered, the only sensible solution for optimal data quality and uniqueness seems to be a people centric solution.
  • If you have thousands of products coming from multiple channels, the most reliable solution for optimal data quality and uniqueness seems to be a technology centric solution.

So, based on common sense the answer to the people or technology question is that it magically depends on the numbers.

The Art of Programming

Beginner's All-purpose Symbolic Instruction Code, or simply BASIC, is one of the oldest programming languages around and also the first programming language I learned in school back in the 70's. Later I came across a dialect of BASIC called COMAL, learned and forgot all about ASSEMBLER, made my first business code in COBOL (plus a Yahtzee game), created applications with SPEED and PACE, worked with PowerBuilder, wrote some SQL and made my own data quality tool using MAGIC.

Independent of the particular language being used, when programming there are two different basic measures of quality:

  1. Good code may refer to whether the code is well structured, readable by others including being properly documented, is reusable and is set up to use the computer resources in the best way possible.
  2. Good code (delivered as an application) may refer to whether it helps solve the business (or gaming) issue addressed through the best possible user experience.

Looking at good code in these two ways resembles the two ways we also measure whether our data is good:

  1. Good data may refer to whether the data is well structured, readable by others including being properly documented, is reusable and reflects the real world in the best way possible.
  2. Good data (delivered as information) may refer to whether it supports solving the business issue addressed through the best possible user experience.

Application (and information) users' concern is point 2.

As a programmer (and data quality professional) you have to consider point 1 in order to achieve point 2. You may get along with a quick and dirty workaround in the short term, but in the long run you have to make it technically right.

Magic Quadrant Diversity

The Magic Quadrants from Gartner Inc. rank the tool vendors within a lot of different IT disciplines. Related to my work, the quadrants for data quality tools and master data management are the most interesting ones.

However, the quadrants examine the vendors in a global scope. But, how are the vendors doing in my country?

I tried to look up a few of the vendors in a local business directory for Denmark provided (free to use on the web) by the local Experian branch.

DataFlux

First up is DataFlux, the (according to Gartner) leading data quality tool vendor.

Result: No hits.

Knowing that DataFlux is owned by SAS Institute will however, with a bit of patience, finally bring you to information about the DataFlux product deep down on the local SAS website.

PS: Though SAS is better known here as the main airline (Scandinavian Airlines System), SAS Institute is actually very successful in Denmark, having a much larger share of the Business Intelligence market here than in most other places.

Informatica

Next up is Informatica, a well positioned company in the quadrants for both data quality tools and customer master data management.

Result: No hits.

Here you have to know that Informatica is represented in the Nordic area by a company called Affecto. You will find information about the Informatica products deep down on the Affecto website – along with the competing product FirstLogic owned by Business Objects (owned by SAP) also historically represented by Affecto.

Stibo Systems

Stibo Systems may not be as well known as the two above, but is tailing the mega vendors in the quadrant for Product Master Data Management, as mentioned recently in a blog post by Dan Power.

Result: Hit:

They are here with over 500 employees – at least in the legal entity called Stibo, where Stibo Systems is an alternate name and brand. And it's no kidding; I visited them last month at the impressive headquarters near Århus (the second largest city in Denmark).

The Little Match Girl

The short story (or fairy tale) The Little Match Girl (or The Little Match Seller) by Hans Christian Andersen is a sad story with a bad ending, so it shouldn't really belong here on this blog, where I try to tell success stories about data quality improvement resulting in happy databases.

However, if I look at the industry of making data matching tools (and data matching technology is a large part of data quality tools) I wonder if the future has ever been that bright.

There are many tools for data matching out there.

Some tool vendors have been acquired by big players in the data management realm, such as:

  • IBM acquired Ascential Software
  • SAS Institute acquired DataFlux
  • Informatica acquired Similarity Systems and Identity Systems
  • Microsoft acquired Zoomix
  • SAP acquired Fuzzy Informatik and Business Objects that acquired FirstLogic
  • Experian acquired QAS
  • Tibco acquired Netrics

(the list may not be complete, just what immediately comes to my mind).

The rest of the pack is struggling with selling matches in the cold economic winter.

There is another fairy tale similar to The Little Match Girl called The Star Money, collected by the Brothers Grimm. This story has a happy ending. Here the little girl gives her remaining possessions away for free and is rewarded with money falling down from above. Perhaps this is like The Coming of Age of Open Source, as told in a recent Talend blog post?

Well, open source is first expected to break the ice in the Frozen Quadrant in 2012.
