Top 5 Reasons for Downstream Cleansing

I guess every data and information quality professional agrees that when fighting bad data, upstream prevention is better than downstream cleansing.

Nevertheless, most of the work in fighting bad data quality is done as downstream cleansing. Not least, the deployment of data quality tools happens downstream, where tools outperform manual work in heavy duty data profiling and data matching, as explained in the post Data Quality Tools Revealed.

In my experience the top 5 reasons for doing downstream cleansing are:

1) Upstream prevention wasn’t done

This is an obvious one. At the time you decide to do something about bad data quality the right way (finding the root causes, improving business processes, affecting people’s attitudes, building a data quality firewall and all that jazz), you still have to do something about the bad data already in the databases.

2) New purposes show up

Data quality is said to be about data being fit for purpose and meeting the business requirements. But new purposes will show up and new requirements will have to be met in an ever-changing business environment. Therefore you will have to deal with Unpredictable Inaccuracy.

3) Dealing with external born data

Upstream isn’t necessarily in your company, as data in many cases is entered Outside Your Jurisdiction.

4) A merger/acquisition strikes

When data from two organizations with different requirements and different levels of data governance maturity are to be merged, something has to be done. Some of the challenges are explained in the post Merging Customer Master Data.

5) Migration happens

Moving data from an old system to a new system is a good chance to do something about poor data quality and start over the right way. Oftentimes you can’t even migrate some data without improving the data quality first. You only have to figure out when to cleanse in data migration.

Outside Your Jurisdiction

About half a year ago I wrote a blog post called Who is Responsible for Data Quality aimed at the issues of having your data come from another corporation and go to another corporation.

My point was that many views on data governance, data ownership, the importance of upstream prevention and fitness for the purpose of use in a business context are based on the assumption that the data in a given company is entered by that company, maintained by that company and consumed by that company. But in today’s business world this is not true in many cases.

Actually, a majority of the data quality issues I have been around since then have had exactly these ingredients:

  • When data was born it was under an outside data governance jurisdiction
  • The initial data owners, stewards and custodians were in another company
  • Upstream wasn’t in the company where the current requirements are formulated

At the point of data transfer between the two jurisdictional areas the data is already digitalized, and often it is a high volume of data supposed to be processed in a short time frame, so the willingness and the practical possibilities for implementing manual intervention are very limited.

This means that one case for looking to technology-centric solutions is when data is born outside your jurisdiction. Also, you tend to deal with concrete data quality rather than fluffy information quality in this scenario. That’s a pity, as I like information quality very much. But OK, data quality technology is quite interesting too.

A Data Quality Appliance?

Today it was announced that IBM is to acquire Netezza, a data warehouse appliance vendor.

5 years ago the interest in data warehouse appliances was, I guess, very sparse. I guess this because I attended a session held by Netezza at the 2005 London Information Management conference. We were 3 people in the room: the presenter, a truly interested delegate and me. I was basically there because I was the next speaker in that room and wanted to see how things worked out. For the record: it was a good session, and I learned a lot about appliances.

Probably that is why I noticed a piece from 2007 where Philip Howard of Bloor wrote about The scope for appliances. In this article Philip Howard also suggested other types of appliances, for example a data quality (data matching) appliance.

I have been around some implementations where we could use the power of an appliance when we have to match a lot of rows. The Achilles’ heel in data matching is candidate selection: comparing every row against every other row quickly becomes infeasible, so often you have to restrict your methods in order to maintain reasonable performance.

But I wonder if I will ever see an on-premise data quality (data matching) appliance, or if it will be placed in the cloud. Or maybe there already is one out there? If so, please tell me about it.

My Secret

Yesterday I followed a webinar on DataQualityPro with ECCMA ISO 8000 project leader Peter Benson.

Peter had a lot of good sayings, and fortunately Jim Harris, as a result of his live tweeting, has documented a sample of good quotes here.

My favorite:

“Quality data does NOT guarantee quality information, but quality information is impossible without quality data.”

I have personally conducted an experiment that supports that hypothesis. It goes like this:

First, I found a data file on my computer. Lots of data in there: numbers and letters. And sure, what is interesting is the information I can derive from it for different purposes.

Then I deleted the data file and tried to see how much information was left behind.

Guess what? Not a bit.

I first published that experiment as a comment to one of Jim’s blog posts: Data Quality and the Cupertino Effect.

As documented in the comments on this blog, the subject of data (quality) versus information (quality) is ever-recurring and almost always guarantees a fierce discussion among data/information management professionals.

So, I’ll just tell you this secret: My work in achieving quality information is done by fixing data quality.

And guess what? I have disabled comments on this blog post.

The Sound of Soundex

Probably the oldest and most used error-tolerant algorithm in searching and data matching is a phonetic algorithm called Soundex. If you are not familiar with Soundex: Wikipedia to the rescue here.
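
For a feel of the mechanics, here is a minimal sketch of the classic American Soundex in Python. This is my own illustration of the algorithm as described on the Wikipedia page, not any particular tool’s implementation:

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter plus three digits, e.g. 'Robert' -> 'R163'."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    letters = "".join(ch for ch in name.lower() if ch.isalpha())
    if not letters:
        return ""
    first, digits, prev = letters[0].upper(), [], codes.get(letters[0], "")
    for ch in letters[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":  # 'h' and 'w' do not separate letters with equal codes
            prev = code
    return (first + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # both give 'R163'
```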

In the LinkedIn group Data Matching we seem to have an ongoing discussion about the usefulness of Soundex. Link to the discussion here. If you are not already a member: please join; spammers are dealt with, though it is OK to brag about your data matching superiority.

To sum up the discussion on Soundex, I think we may at this stage conclude:

  • Soundex is of course very poor compared to the more advanced algorithms, but it may be better than nothing (which would be exact searching and matching)
  • Soundex (or a variant of Soundex) may be used for indexing in order to select candidates to be scored with better algorithms.
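
As a minimal sketch of that indexing idea, the snippet below reuses the soundex function from above and uses Python’s standard difflib as a stand-in for a more advanced scoring algorithm. The row layout, the names and the threshold are made up for illustration:

```python
from collections import defaultdict
from difflib import SequenceMatcher

def build_index(rows):
    # Blocking: group the big table by Soundex key so that only rows
    # sharing a key are ever compared.
    index = defaultdict(list)
    for row in rows:
        index[soundex(row["name"])].append(row)
    return index

def match(queries, index, threshold=0.85):
    # Score each query against its Soundex candidates with a finer
    # string similarity and keep only the pairs above the threshold.
    results = []
    for q in queries:
        for cand in index.get(soundex(q["name"]), []):
            score = SequenceMatcher(None, q["name"].lower(), cand["name"].lower()).ratio()
            if score >= threshold:
                results.append((q["name"], cand["name"], round(score, 2)))
    return results

big_table = [{"name": "Robert Smith"}, {"name": "Rupert Smith"}, {"name": "Mary Jones"}]
# The misspelled query shares its Soundex key with both Smiths,
# but only 'Robert Smith' scores above the threshold.
print(match([{"name": "Robbert Smith"}], build_index(big_table)))
```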

Let’s say you are going to match 100 rows with names and addresses against a table with 100 million rows with names and addresses, and let’s say that the real world individuals behind the 100 rows are in fact represented among the 100 million, but not necessarily spelled the same way.

Your results may look like this:

  • If you use exact automated matching you may find 40 matching rows (40 %).
  • If you use automated matching with (a variant of) Soundex you may find 95 matching rows, but only 70 rows (70 %) are correct matches (true positives) as 25 rows (25 %) are incorrect matches (false positives).
  • If you use automated matching with (a variant of) Soundex indexing and advanced algorithm for scoring you may find 75 matching rows where 70 rows (70 %) are correct matches (true positives) and 5 rows (5 %) are incorrect matches (false positives).
  • By tuning the advanced algorithm you may find 67 matching rows where 65 rows (65 %) are correct matches (true positives) and 2 rows (2 %) are incorrect matches (false positives).

So when using Soundex you will find more matching rows, but you will also face more manual work in verifying the results. Adding an advanced algorithm may reduce or even eliminate the manual work at the cost of some not-found matches (false negatives) and the risk of a few wrong matches (false positives).
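
In classification terms the trade-off is between precision (the share of found matches that are correct) and recall (the share of the 100 real matches that are found). A quick calculation on the example numbers above:

```python
scenarios = [  # (label, true positives, false positives)
    ("Exact matching",     40,  0),
    ("Soundex",            70, 25),
    ("Soundex + advanced", 70,  5),
    ("Tuned advanced",     65,  2),
]
for label, tp, fp in scenarios:
    precision = tp / (tp + fp)  # correct share of what was found
    recall = tp / 100           # 100 real matches exist in the example
    print(f"{label}: precision {precision:.0%}, recall {recall:.0%}")
```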

PS: I have a page about other Match Techniques including standardization, synonyms and probabilistic learning.

PPS: When googling to see if the title of this blog post has been used before, I found this article from a fellow countryman.

Data Quality Tools: The Cygnets in Information Quality

Since engaging in the social media community around data and information quality, I have noticed quite a lot of mobbing aimed at data quality tools. The sentiment seems to be that data quality tools are no good and will play only a very small role, if any, in solving the data and information quality conundrum.

I like to think of data quality tools as being like the cygnet (the young swan) in the fairy tale “The Ugly Duckling” by Hans Christian Andersen: an immature, clumsy flapper in the barnyard. And sure, until now tools have generally not been ready to fly, but have mostly been situated in the downstream corner of the landscape.

Since last September I have been involved in making a new data quality tool. The tool is based on the principles described in the post Data Quality from the Cloud.

We have now seen the first test flights in the real world and I am absolutely thrilled about the testimonials. Examples:

  • “It (the tool) is lean”. I like that, since lean is a production practice that considers the expenditure of resources for any goal other than the creation of value for the end customer to be wasteful.
  • “It is gold”. I like to consider that as a calculated positive business case.
  • “It is the best thing happened in my period of employment”. I think happy people are essential to data quality.

Paraphrasing Andersen: I never dreamed there could be so much happiness when I was working with ugly ducklings.

The Ugly Duckling

The title of the fairy tale “The Ugly Duckling” by Hans Christian Andersen was originally supposed to be the more positive “The Young Swan” (or “The Cygnet”), but as Andersen did not want to spoil the element of surprise in the protagonist’s transformation, he discarded it for “The Ugly Duckling”.

In a blog post called “Why Isn’t Our Data Quality Worse?”, posted today (or last night local Iowa time), Jim Harris examines the psychology term “negativity bias”, which explains how bad evokes a stronger reaction than good in the human mind.

Surely, data quality improvement evangelism is most often based on the strong force of badness: always describing how bad data is everywhere, bashing executives who don’t get it. Only as a nice positive surprise at the end do we tell how our product/consultancy will transform the ugly duckling into a beautiful swan.

My latest blog post with a truly positive angle, called “What a Lovely Day”, is almost 2 months old. So I promise myself that the next post will have the title “The Young Swan” (or “The Cygnet”) and will be extremely positive about data quality improvement.

Complicated Matters

A while ago I wrote a short blog post about a tweet from the Gartner analyst Ted Friedman saying that clients are disappointed with the ability of popular data quality tools to support wide deployment of complex business rules.

Speaking about popular data quality tools: on the DataFlux Community of Experts blog, DataQualityPro founder Dylan Jones posted a piece this Friday asking: Are Your Data Quality Rules Complex Enough?

Dylan says: “Many people I speak to still rely primarily on basic data profiling as the backbone of their data quality efforts”.

The classic answers to the challenge of complex business rules are:

  • Relying on people to enforce complex business rules. Unfortunately people are not as consistent in enforcing complex rules as computer programs are.
  • Making less complex business rules. Unfortunately the complexity may be your competitive advantage.

In my eyes there is no doubt that data quality tool vendors have a great opportunity in researching and developing tools that are better at deploying complex business rules. In my current involvement in doing so we work with features such as:

  • Deployment as Service Oriented Architecture components. More on this topic here.
  • Integrating multiple external sources. Further explained here.
  • Combining the best algorithms. Example here.
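
To give a feel for what a complex business rule means compared to basic data profiling, here is a hypothetical sketch in Python. The field names, the reference list and the rules themselves are made up for illustration; the point is that a function like this spans several fields plus external reference data, and could be exposed as a service oriented component:

```python
from datetime import date

EU_COUNTRIES = {"DK", "DE", "FR", "SE"}  # assumed reference data, trimmed

def validate_order(order: dict) -> list[str]:
    """Return the business rule violations for one order record."""
    issues = []
    # Cross-field rule: a VAT number is mandatory for EU business customers.
    if (order.get("customer_type") == "B2B"
            and order.get("country") in EU_COUNTRIES
            and not order.get("vat_number")):
        issues.append("EU business customer is missing a VAT number")
    # Cross-field rule: delivery cannot precede the order date.
    if order.get("delivery_date") and order["delivery_date"] < order["order_date"]:
        issues.append("Delivery date precedes order date")
    return issues

print(validate_order({"customer_type": "B2B", "country": "DK",
                      "order_date": date(2010, 9, 1),
                      "delivery_date": date(2010, 8, 30)}))
```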

Can Anybody Hear Me?

Blogging and evangelizing about data quality is a fairly lonely trade.

Hopefully it is not because it is not a good cause; I don’t think so. Also this week I followed another good cause not getting much attention.

After the disaster with the oil spill in the Gulf of Mexico Greenpeace launched an operation aimed at getting attention to the probably even more dangerous deep water drilling in the fragile Arctic environment.

The ship Esperanza sailed to Baffin Bay and launched inflatables with 4 climbers, who hung underneath an oil rig for 40 hours in the blistering cold wind while practically no one cared.

Oh yes, there was live tweeting from the ship on the @gp_espy account, followed by 1,500 tweeps world-wide, including yours truly.

Surely, a few articles were written by the press, mainly in Britain, where the drilling company Cairn Energy belongs, and in Denmark, because the waters belong to Greenland/the Kingdom of Denmark.

But I guess Greenpeace must be pretty disappointed with the overall attention. I guess they chose the wrong place (platform, you might say). Not much press in Baffin Bay.

And hey, I guess I chose the wrong time for publishing this post (based on my reader demographics as I know them). No one is online in the Pacific now, it’s early Saturday morning in Europe and it’s the night before a 3-day weekend in the United States.
