Data Quality Tool Exaggerations

When following articles and blogs about information and data quality you often meet a sentiment like this:

“Data Quality tool vendors describe their products as if they will solve every possible data quality challenge around once and for all”.

Some years ago I was involved in writing the English text describing a data quality vendor and our products. Here is the text:

“With activities in Germany, Denmark, Norway, Sweden, Austria, Switzerland, Italy, Spain, and France, [our company] is one of the leading data quality experts in Europe. We provide ready-made solutions, products, and services that increase your profits by protecting and improving your company’s customer, address, supplier and product data.

[Our company] offers state-of-the-art solutions for all of the following tasks:

  • Find, match, and eliminate duplicates
  • Restructure customer, supplier, and product databases
  • Compare with major reference data suppliers in order to correct incorrect data records
  • Enrich existing data with missing information
  • Find customers when searching within CRM and ERP systems
  • Integrate Data Quality components in SOA environments
  • Create a Master Data Hub”

Now, I don’t think we promised to boil the ocean here.

Have you stumbled upon a description on websites, in white papers, product sheets or the like where the vendor tells you that every data quality problem will be eliminated when you buy the tool?

Show me.


The Magic Numbers

An often raised question and a subject for a lot of blog posts in the data quality realm is whether data quality challenges should be solved by people or technology.

As with most things in data quality, I don’t think there is a single right answer to that.

Now, in this blog post I will not tell you what I think the answer(s) to the question might be, but simply describe what I have seen chosen as the solution, which has been both people centric solutions and technology centric solutions.

If I look at the situations where people centric solutions have been chosen versus the situations where technology centric solutions have been chosen, the first differentiator seems to be numbers:

  • If you have only a small number of customers and a single channel where data is entered, the better solution to optimal data quality and uniqueness seems to be a people centric solution.
  • If you have millions of customers and multiple channels where data is entered, the only practical solution to optimal data quality and uniqueness seems to be a technology centric solution.
  • If you have only a small number of products and a single channel where data is entered, the only sensible solution to optimal data quality and uniqueness seems to be a people centric solution.
  • If you have thousands of products coming from multiple channels, the most reliable solution to optimal data quality and uniqueness seems to be a technology centric solution.

So, based on common sense the answer to the people or technology question is that it magically depends on the numbers.
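
A deliberately over-simplified sketch of that rule of thumb in Python – the threshold below is my own illustrative assumption, not a number from any study:

```python
def suggested_approach(record_count: int, channel_count: int) -> str:
    """Rule of thumb only: few records entered through a single channel can be
    curated by people; large volumes arriving through multiple channels call
    for technology. The 10,000 record threshold is an illustrative assumption."""
    if record_count < 10_000 and channel_count == 1:
        return "people centric"
    return "technology centric"

# Millions of customers entered via web, call centre and point of sale:
print(suggested_approach(2_000_000, 3))  # -> technology centric
```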


The Art of Programming

Beginner’s All-purpose Symbolic Instruction Code, or simply BASIC, is one of the oldest programming languages around and also the first programming language I learned in school back in the 70s. Later I came across a dialect of BASIC called COMAL, learned and forgot all about ASSEMBLER, made my first business code in COBOL (plus a Yahtzee game), created applications with SPEED and PACE, worked with PowerBuilder, wrote some SQL and made my own data quality tool using MAGIC.

Independent of the language being used, there are two basic measures of quality when programming:

  1. Good code may refer to code that is well structured, readable by others (including being adequately documented), reusable and set up to use the computer resources in the best way possible.
  2. Good code (delivered as an application) may refer to code that helps solve the business (or gaming) issue at hand through the best possible user experience.

Looking at good code in these two ways resembles the two ways we also measure whether our data is good:

  1. Good data may refer to data that is well structured, readable by others (including being adequately documented), reusable and reflects the real world in the best way possible.
  2. Good data (delivered as information) may refer to data that supports solving the business issue at hand through the best possible user experience.

Application (and information) users’ concern is point 2.

As a programmer (and data quality professional) you have to consider point 1 in order to achieve point 2. You may get by with a quick and dirty workaround in the short term, but in the long run you have to make it technically right.
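
As a small illustration of the difference – in Python rather than any of the languages mentioned above, and with a made-up VAT calculation as the business issue:

```python
# Quick and dirty: it works today, but it is hard to read, reuse or change.
def d(l):
    return sum(x[2] for x in l if x[1] == "DK") * 0.25

# Technically right: structured, documented and reusable - the point 1 qualities
# that keep the point 2 result (a correct answer for the business user) sustainable.
VAT_RATES = {"DK": 0.25}  # illustrative rate table, not a complete one

def vat_for_country(order_lines: list[tuple[str, str, float]], country: str) -> float:
    """Return the VAT amount for the order lines belonging to the given country.

    Each order line is (product_id, country_code, net_amount).
    """
    net_total = sum(amount for _, line_country, amount in order_lines
                    if line_country == country)
    return net_total * VAT_RATES[country]
```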


Magic Quadrant Diversity

The Magic Quadrants from Gartner Inc. rank the tool vendors within a lot of different IT disciplines. Related to my work, the quadrants for data quality tools and master data management are the most interesting ones.

However, the quadrants examine the vendors on a global scale. But how are the vendors doing in my country?

I tried to look up a few of the vendors in a local business directory for Denmark provided (free to use on the web) by the local Experian branch.

DataFlux

First up is DataFlux, the (according to Gartner) leading data quality tool vendor.

Result: No hits.

Knowing that DataFlux is owned by SAS Institute will, however, with a bit of patience, eventually bring you to information about the DataFlux product deep down on the local SAS website.

PS: Though SAS is better known here as the main airline (Scandinavian Airlines System), SAS Institute is actually very successful in Denmark, having a much larger share of the Business Intelligence market here than in most other places.

Informatica

Next up is Informatica, a well positioned company in both the quadrant for data quality tools and customer master data management.

Result: No hits.

Here you have to know that Informatica is represented in the Nordic area by a company called Affecto. You will find information about the Informatica products deep down on the Affecto website – along with the competing product FirstLogic, owned by Business Objects (owned by SAP), which has also historically been represented by Affecto.

Stibo Systems

Stibo Systems may not be as well known as the two above, but is tailing the mega vendors in the quadrant for Product Master Data Management, as mentioned recently in a blog post by Dan Power.

Result: Hit:

They are here with over 500 employees – at least in the legal entity called Stibo, where Stibo Systems is an alternate name and brand. And that’s no kidding; I visited them last month at the impressive headquarters near Århus (the second largest city in Denmark).


Deduplicating with a Spreadsheet

Say you have a table with a lot of names, postal addresses, phone numbers and e-mail addresses, and you want to remove duplicate rows in this table. Duplicates may be spelled exactly the same, but may also be spelled somewhat differently while still describing the same real world individual or company.

You can do the deduplicating with a spreadsheet.

In the old days some spreadsheets had a limit on the number of rows that could be processed, like the 65,536 row limit in older versions of Excel, but today spreadsheets can process a lot of rows.

In this case you may have the following columns:

  • Name (could be given name and surname or a company name)
  • House number
  • Street name
  • Postal code
  • City name
  • Phone number
  • E-mail address

What you do first is sort the sheet by name, then postal code and then street name.

Then you browse down all the rows, focusing on one row at a time, and from there look up and down to see whether the rows before or after seem to be duplicates. If so, you delete all but one of the rows describing the same real world entity.

When finished with all the rows sorted by name, postal code and street name you make an alternate sort, because some possible duplicates may not begin with the same letters in the name field.

So what you do is sort the sheet by postal code, then street name and then house number.

Then you browse down all the rows, focusing on one row at a time, and from there look up and down to see whether the rows before or after seem to be duplicates. If so, you delete all but one of the rows describing the same real world entity.

When finished with all the rows sorted by postal code, street name and house number you make an alternate sort, because some possible duplicates may not have the proper postal code assigned or the street name may not start with the same letters.

So what you do is sort the sheet by city name, then house number and then name.

Then you browse down all the rows, focusing on one row at a time, and from there look up and down to see whether the rows before or after seem to be duplicates. If so, you delete all but one of the rows describing the same real world entity.

When finished with all the rows sorted by city name, house number and name you make an alternate sort, because some duplicates may have moved or may have different addresses for other reasons.

So what you do is sort the sheet by phone number, then by name and then by postal code.

Then you browse down all the rows, focusing on one row at a time, and from there look up and down to see whether the rows before or after seem to be duplicates. If so, you delete all but one of the rows describing the same real world entity.

When finished with all the rows sorted by phone number, name and then by postal code you make an alternate sort, because some duplicates may not have a phone number or may have different phone numbers.

So what you do is sort the sheet by e-mail address, then by name and then by postal code.

Then you browse down all the rows, focusing on one row at a time, and from there look up and down to see whether the rows before or after seem to be duplicates. If so, you delete all but one of the rows describing the same real world entity.
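
If you would rather let a script do the tedious sorting and neighbour checking, here is a minimal sketch in Python with pandas. It assumes the column names from the list above and a hypothetical file called contacts.csv, and it only flags neighbouring rows for review instead of deleting anything:

```python
import pandas as pd
from difflib import SequenceMatcher

df = pd.read_csv("contacts.csv")  # hypothetical file with the columns listed above

# The same passes as in the manual routine: each pass sorts by a different key
# combination so that different kinds of possible duplicates end up as neighbours.
sort_passes = [
    ["Name", "Postal code", "Street name"],
    ["Postal code", "Street name", "House number"],
    ["City name", "House number", "Name"],
    ["Phone number", "Name", "Postal code"],
    ["E-mail address", "Name", "Postal code"],
]

def looks_alike(a, b, threshold=0.85):
    """Crude similarity on the name field; real matching tools do far more."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio() >= threshold

candidate_pairs = set()
for keys in sort_passes:
    ordered = df.sort_values(keys).reset_index()  # keep the original row number
    for i in range(len(ordered) - 1):
        if looks_alike(ordered.loc[i, "Name"], ordered.loc[i + 1, "Name"]):
            candidate_pairs.add(frozenset({ordered.loc[i, "index"], ordered.loc[i + 1, "index"]}))

print(f"{len(candidate_pairs)} neighbouring row pairs to review as possible duplicates")
```

A dedicated data matching tool would of course compare far more cleverly than adjacent rows and a crude name similarity, but the sketch shows why even the automated version of this routine scales better than browsing a spreadsheet.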

You may:

  • If you only have a few rows do this process within a few hours and possibly find all the duplicates
  • If you have a lot of rows do this process within a few years and possibly find some of the duplicates

PS: The better option is of course avoiding duplicates in the first place. Unfortunately that doesn’t happen in many situations – here are The Top 5 Reasons for Downstream Cleansing.


The Little Match Girl

The short story (or fairy tale) The Little Match Girl (or The Little Match Seller) by Hans Christian Andersen is a sad story with a bad ending, so it shouldn’t really belong here on this blog, where I try to tell success stories about data quality improvement resulting in happy databases.

However, if I look at the industry of making data matching tools (and data matching technology is a large part of data quality tools) I wonder if the future has ever been that bright.

There are many tools for data matching out there.

Some tool vendors have been acquired by big players in the data management realm, such as:

  • IBM acquired Ascential Software
  • SAS Institute acquired DataFlux
  • Informatica acquired Similarity Systems and Identity Systems
  • Microsoft acquired Zoomix
  • SAP acquired Fuzzy Informatik and Business Objects that acquired FirstLogic
  • Experian acquired QAS
  • Tibco acquired Netrics

(the list may not be complete, just what immediately comes to my mind).

The rest of the pack is struggling with selling matches in the cold economic winter.

There is another fairy tale similar to The Little Match Girl called The Star Money, collected by the Brothers Grimm. This story has a happy ending. Here the little girl gives her remaining belongings away for free and is rewarded with money falling down from above. Perhaps this is like The Coming of Age of Open Source as told in a recent Talend blog post?

Well, open source is first expected to break the ice in the Frozen Quadrant in 2012.


Quality Data Integration

As late as yesterday I was involved in yet another data quality issue that wasn’t caused by the truth not being known, but by the truth not being known in all the different databases within the enterprise – and of course (thanks, Murphy) precisely not in the application that needed that information due to a new requirement. Yep, the column was there all right, but it wasn’t updated, because until yesterday it didn’t need to be.

The data architecture in most enterprises isn’t perfect at all. Through the information technology history of an enterprise many different systems have been deployed, ranging from core operational applications to data warehouses and lately also web frontends.

It’s not that we don’t know how master data management can help, how service oriented architecture (principles) is a must and how important it is to document the data flows within the enterprise. But gee, even for a modest sized organization this is a huge effort, and even if we strive to do it right, by the time we succeed the real world will have moved on.

Well, back to business. What do we do? I think we will:

  • Make a quick fix that solves the business problem to the delight of the business users
  • Perhaps raise the priority of that sustainable technical solution we planned a while ago

Have a nice day everyone. I think it is going to be just fine.


instant Data Quality

My last blog post was all about how data quality issues in most cases are being solved by doing data cleansing downstream in the data flow within an enterprise and the reasons for doing that.

However, solving the issues upstream wherever possible is of course the better option. Therefore I am very optimistic about a project I’m involved in called instant Data Quality.

The project is about helping system users doing data entry by adding some easy to use technology that explores the cloud for data relevant to the entry being made. Doing that has two main purposes:

  • Data entry becomes more effective. Less cumbersome investigation and fewer keystrokes.
  • Data quality is safeguarded by better real world alignment.
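
As a rough, hedged sketch of the idea – the endpoint, parameters and field names below are made up for illustration and are not the actual instant Data Quality service:

```python
import requests

LOOKUP_URL = "https://example.com/reference/search"  # placeholder endpoint, not a real service

def suggest_entities(partial_entry: str, max_hits: int = 5) -> list[dict]:
    """Ask an external reference source (e.g. an address or business registry in
    the cloud) for real world entities matching what the user has typed so far."""
    response = requests.get(
        LOOKUP_URL,
        params={"q": partial_entry, "limit": max_hits},
        timeout=2,  # suggestions must be fast, or they will slow down data entry
    )
    response.raise_for_status()
    return response.json().get("results", [])

# The entry form would call this on each pause in typing and let the user pick a
# suggestion, so the stored record is aligned with the real world from the start.
```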

The combination of a more effective business process that also results in better data quality seems like a good thing – like a sugar-coated vitamin pill. By the way, the vitamin pill metaphor also works in another sense, as vitamin pills should be supplemented by a healthy lifestyle. It’s the same with data management.

Implementing improved data quality through better real world alignment may go beyond the usual goal for data quality of meeting the requirements for the intended purpose of use. This means that you are instantly getting more by doing less.


Top 5 Reasons for Downstream Cleansing

I guess every data and information quality professional agrees that when fighting bad data, upstream prevention is better than downstream cleansing.

Nevertheless, most work in fighting bad data quality is done as downstream cleansing, and not least the deployment of data quality tools is made downstream, where tools outperform manual work in heavy duty data profiling and data matching, as explained in the post Data Quality Tools Revealed.

In my experience the top 5 reasons for doing downstream cleansing are:

1) Upstream prevention wasn’t done

This is an obvious one. At the time you decide to do something about bad data quality the right way – by finding the root causes, improving business processes, affecting people’s attitudes, building a data quality firewall and all that jazz – you still have to do something about the bad data already in the databases.

2) New purposes show up

Data quality is said to be about data being fit for purpose and meeting the business requirements. But new purposes will show up and new requirements have to be met in an ever changing business environment.  Therefore you will have to deal with Unpredictable Inaccuracy.

3) Dealing with externally born data

Upstream isn’t necessarily in your company, as data in many cases is entered Outside Your Jurisdiction.

4) A merger/acquisition strikes

When data from two organizations that have had different requirements and different data governance maturity is to be merged, something has to be done. Some of the challenges are explained in the post Merging Customer Master Data.

5) Migration happens

Moving data from an old system to a new system is a good chance to do something about poor data quality and start all over the right way, and oftentimes you can’t even migrate some data without improving the data quality. You only have to figure out when to cleanse in data migration.


Outside Your Jurisdiction

About half a year ago I wrote a blog post called Who is Responsible for Data Quality, aimed at issues with data coming from another corporation and going to another corporation.

My point was that many views on data governance, data ownership, the importance of upstream prevention and fitness for the purpose of use in a business context are based on an assumption that the data in a given company is entered by that company, maintained by that company and consumed by that company. But in today’s business world this is not true in many cases.

Actually, a majority of the data quality issues I have been around since then have had exactly these ingredients:

  • When data was born it was under an outside data governance jurisdiction
  • The initial data owners, stewards and custodians were in another company
  • Upstream wasn’t in the company where the current requirements are formulated

At the point of data transfer between the two jurisdictional areas the data is already digitalized, and often it is a high volume of data supposed to be processed in a short time frame, so the willingness and the practical possibilities for implementing manual intervention are very limited.

This means that one case that calls for technology centric solutions is when data is born outside your jurisdiction. Also, you tend to deal with concrete data quality rather than fluffy information quality in this scenario. That’s a pity, as I like information quality very much – but OK, data quality technology is quite interesting too.
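
A hedged illustration of what such a technology centric solution at the point of transfer could look like – the field names and rules below are assumptions made up for the example, not a description of any particular product:

```python
import re

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_record(record: dict) -> list[str]:
    """Return the rule violations for one incoming, externally born record."""
    problems = []
    if not record.get("name", "").strip():
        problems.append("missing name")
    if not record.get("postal_code", "").strip():
        problems.append("missing postal code")
    email = record.get("email", "")
    if email and not EMAIL_PATTERN.match(email):
        problems.append("malformed e-mail address")
    return problems

def quality_gate(incoming: list[dict]) -> tuple[list[dict], list[tuple[dict, list[str]]]]:
    """Split a high volume batch into accepted records and records routed to
    exception handling, so manual intervention is only needed for the exceptions."""
    accepted, exceptions = [], []
    for record in incoming:
        problems = validate_record(record)
        if problems:
            exceptions.append((record, problems))
        else:
            accepted.append(record)
    return accepted, exceptions
```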
