I guess every data and information quality professional agrees that, when fighting bad data, upstream prevention is better than downstream cleansing.
Nevertheless, most work in fighting bad data quality is done as downstream cleansing, and not least the deployment of data quality tools happens downstream, where tools outperform manual work in heavy-duty data profiling and data matching, as explained in the post Data Quality Tools Revealed.
In my experience the top 5 reasons for doing downstream cleansing are:
1) Upstream prevention wasn’t done
This is an obvious one. At the time you decide to do something about bad data quality the right way, by finding the root causes, improving business processes, affecting people's attitudes, building a data quality firewall and all that jazz, you still have to do something about the bad data already in the databases.
2) New purposes show up
Data quality is said to be about data being fit for purpose and meeting the business requirements. But new purposes will show up, and new requirements will have to be met in an ever-changing business environment. Therefore you will have to deal with Unpredictable Inaccuracy.
3) Dealing with external born data
Upstream isn't necessarily in your company, as data in many cases is entered Outside Your Jurisdiction.
4) A merger/acquisition strikes
When data from two organizations that have had different requirements and data governance maturity is to be merged, something has to be done. Some of the challenges are explained in the post Merging Customer Master Data.
5) Migration happens
Moving data from an old system to a new system is a good chance to do something about poor data quality and start all over the right way, and oftentimes you can't even migrate some data without improving the data quality. You only have to figure out when to cleanse in data migration.
Interesting post, Henrik.
Probably most IDQ folks would feel better if you had titled today's blog "Top 5 Excuses for Downstream Correction". The bad thing about downstream cleansing is that it makes the organization feel like something was done about the DQ problem, when actually money was wasted (to quote Larry English).
One of Gwen Thomas's medical metaphors, kidney dialysis, comes to mind. Dialysis patients feel better after their weekly trip to have their blood cleaned, but it does not resolve the issue. To paraphrase Gwen: "My friend who goes to regular dialysis treatments would gladly go back in time, if she could, to correct the root cause before it cost her her kidney."
Gordon, thanks a lot for commenting.
I also agree that downstream cleansing should be replaced by upstream prevention wherever possible and as soon as possible. However, as in the kidney metaphor, we can't go back in time. Therefore there are business cases for downstream cleansing, where upstream prevention is too late and doing nothing is life-threatening.
Excellent post, Henrik!
Although it’s impossible to prevent every error, defect prevention controls in operational source systems can help greatly improve enterprise data quality.
However, performing data cleansing downstream from where the operational data originated is often an unavoidable reality, and you have done a great job listing the top five reasons why.
I think that one of the greatest challenges for enterprise data management is the synchronization between the downstream data cleansing processes and the operational source systems where the data was not only created, but also continues to be managed (i.e., updated, possibly duplicated, or sometimes deleted).
Cleansed — and possibly otherwise transformed and enriched — data is often passed further downstream without being used to update the upstream source systems.
Without diligent attention to this crucial aspect of enterprise data management, critical, and possibly irreparable, disconnects can occur over time between the upstream and downstream systems.
However, I do NOT believe that this means downstream data cleansing is a dangerous waste of money, because "all data quality has to be defect prevention where the data originates" is a VAST oversimplification of the true complexity of enterprise data management.
In my opinion, data quality is not exclusively about either the sources or the destinations of data (although both are, of course, very important), but instead the primary focus of data quality has to be on the many (and often unpredictable) journeys that data takes throughout the enterprise, acknowledging both the objective and subjective aspects of its quality, as it is used in data-driven solutions for business problems within what is always, as you said, an ever changing business environment.
Thanks for weighing in Jim.
I agree with your observations about not updating the upstream source systems. Sometimes this is not done because it is too difficult (and that's a pity), and sometimes it isn't done because the same requirements don't apply to the source systems.
I agree with Henrik and Jim on this one. We have all read Mr. English's books, and when I met him personally I took the chance to ask him what he thought about data quality tools and downstream cleansing. His philosophy is that all data quality problems can be prevented in the processes and that no downstream cleansing should be needed.
Although Larry's idea of preventing defects at the root is great in theory, in reality very, very few companies are mature enough to plug all data quality related process holes in an organization. They often lack the organization, governance, processes and tools necessary to manage data strategically. My opinion is that traditional quality management and data quality management have a lot in common, but some things are not the same.
A bolt in a car assembly line will have the same structure and quality throughout the manufacturing process. The "data bolt" will morph during its travel through the "data assembly line": it will be of high quality in one station and low quality in another, it will be removed and replaced, and finally dissected and melted down together with other "data bolts" in the end. That is why some of the more traditional quality principles don't apply to data quality management, and that is why downstream data quality management is needed in the real world.
A shorter analogy: Fire prevention is great but fires still occur and we need firemen to put them out.
I wholeheartedly agree with you.
One of my greatest pet peeves about the data quality profession is that some of its greatest advocates (such as Larry English, whose books I have read and found very useful) want us all to believe (as they do) that Manufacturing Quality and Data Quality are 100% the same.
They are not.
The difference between physical objects being assembled into a product within a factory and the virtual objects being assembled into information within a computer is a significant one.
As Thomas Redman explains, data and information are NOT consumed with use. One physical part can only be assembled into one physical product. One piece of data can be replicated, then altered and customized in endless variations within numerous distinct information “products.”
As you said, this is why some of the traditional quality principles don’t apply to data quality management, and that is why downstream data quality management is needed in the real world.
On this point strong parallels have been drawn between the concept of a planning and information system and that of a manufacturing system. The 'Manufacturing' or 'Factory' analogy is a useful model in that it takes a conceptual overview of both generic manufacturing and information systems to identify ways in which established quality principles may be applied to the input and process elements, ensuring that information products in the form of outputs conform to the requirements of their relevant customers.
Within this context, however, one needs to be aware that the end products from manufacturing and information systems have differing implications, with the information production process viewed as potentially a more complex process than its physical equivalent. The outputs from a factory are unique one-off products which can be consumed only once, whether they are finished goods or components requiring further work. The overall effects of poor manufacturing are somewhat limited, normally requiring a scrap and re-work operation. Some longer-term detrimental implications may occur, including customer dissatisfaction or product contamination, but even these will normally be relatively localised and time-constrained. Output in the form of data or information products can be consumed in an infinite number of ways and be re-cycled continually. Poor data can act like a virus infiltrating all aspects of an enterprise's operations, re-occurring again and again, or lying hidden undetected within sub-systems in perpetuity. Data may also be used in ways for which it was not created or intended, causing potential misalignment, errors or misinterpretations, resulting in potentially dangerous or catastrophic decision making.
Thanks Dario, Jim and Tony. I really like your musings around similarities and differences between improving quality in manufacturing and improving data quality.
Maybe it is time for a blog post about this topic? Production quality vs. Data quality, Root cause solutions vs. Downstream cleansing, Larry vs. Redman…
A nice debate no doubt.
You are right Dario. “versus” is always a good word in a blog post title 🙂
We could even set up a blog-bout?
Nice one, although I don't think it's that black & white. I'm sure I recall Larry saying something along the lines of "Clean just once"… and Tom, in the great 1995 Sloan Management Review article 'Improve Data Quality for Competitive Advantage', talks about his analogy of a lake which is horribly polluted: in order to clean this lake one must first ensure that the feeder streams were cleaned, the very sources of this pollution. He compared the lake to a database, insisting that the streams (processes) must be treated as an asset, applying the necessary cleaning processes (data quality systems), if one is to have clean water (quality data).
Luckily it is not black & white but it sure does spark interesting discussions! 🙂
Reading Tom's analogy about the lake got me thinking about a customer meeting where we discussed just this difference between the "data quality philosophies". The customer had read Larry's books and was very pro-root-cause-fix-only. In the end we came to the conclusion that there are so many factors that it is impossible to go one path or the other: factors like time, maturity, organisation, particular processes, etc. In the end the "old" fire analogy is true even for data quality… fire prevention is a must, but so are fire fighters!
Yes, I agree a dual approach is required, but with an emphasis placed upon some form of 'upfront prevention'. Now we are into cultural and leadership issues and 'change management'.
One real bugbear of mine is that one reads white papers on data quality and they seem to imply that all one needs to do is buy tools and you are there.
I've just completed a doctoral programme in sustaining data quality in ERP systems (a DBA), combining theory and practice, and realise we have to come at it from all sides: people, processes and data.
I think the emphasis should be on solving the data quality issues as efficiently as possible according to each specific business case and the business needs. In some cases upfront prevention may be the best way, and in other cases the only solution is downstream data processing, at least in theory, but for most a dual approach makes the most sense, I guess. My opinion, however, after having met hundreds of customers, is that solutions based on upfront prevention alone are still rare and will be rare for many years to come, although it feels better in the gut to say it!
Regarding the white papers I agree to a certain degree. There are some vendors that do imply that their tools will magically solve all data quality problems but the more serious vendors have a very professional attitude towards data quality management in their contact with the customers.
Furthermore, in a dual approach tools will be absolutely necessary to be able to identify, correct, match and monitor data in the downstream part of the solution. Some tools also offer data quality validation functionality at the point of entry, for example directly in an ERP system, which supports upfront prevention too.
Tools are also great for raising awareness for data quality in the first place. Today many organizations are overly optimistic about their data quality and need to be “woken up” by showing them actual reports on how bad their data really is, before tying the problems to a business case.
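To make the "wake-up report" idea concrete, here is a minimal sketch in Python of the kind of statistics a profiling tool would surface, completeness per field and near-duplicate records. The field names, the sample data and the crude lowercase-and-trim matching rule are all invented for the example; real tools use far more sophisticated matching.

```python
# Minimal data profiling sketch: count missing values per field and
# near-duplicate records in a list of customer rows.
from collections import Counter

def profile(rows, key_fields=("name", "email")):
    """Return simple completeness and duplicate statistics for rows (list of dicts)."""
    missing = Counter()
    keys = Counter()
    for row in rows:
        for field, value in row.items():
            if value in (None, ""):
                missing[field] += 1
        # Crude match key: lowercase, trimmed values of the key fields.
        keys[tuple(str(row.get(f, "")).strip().lower() for f in key_fields)] += 1
    duplicates = sum(count - 1 for count in keys.values() if count > 1)
    return {
        "rows": len(rows),
        "missing_per_field": dict(missing),
        "duplicate_rows": duplicates,
    }

# Illustrative sample data with one missing phone, one missing email
# and one near-duplicate (same name/email, different casing).
customers = [
    {"name": "Ann Lee", "email": "ann@example.com", "phone": ""},
    {"name": "ann lee", "email": "Ann@Example.com", "phone": "555-0101"},
    {"name": "Bo Chen", "email": None, "phone": "555-0102"},
]
report = profile(customers)
```

Even a toy report like this ("3 rows, 1 duplicate, 1 missing email, 1 missing phone") is often enough to start the awareness conversation before tying the findings to a business case.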
You are absolutely correct about the combination of people, process and data! Looking forward to reading your paper/work!
The dual approach is, at the present time, unavoidable in most enterprises.
Should that always be the case? No!
Will that always be the case? Unless we can get a major shift in many people's perceptions then, sadly, yes.
Nobody in the Data Quality world (well, I hope not) would argue with the concept of Kaizen, or continuous improvement.
However, even this concept is, even with the very best of intentions, misapplied in many enterprises as, "We are committed to continuously rectifying ever more of our data defects on an ongoing basis". An expensive and doomed approach as, amazingly, as fast as one defect is rectified another takes its place.
All that is required is the simple, yet critical, change of emphasis that gets enterprises to state:
"We are committed to a) continuously reducing the number of data errors that are created in our enterprise by removing the root cause and b) for as long as they exist, finding and rectifying all existing data errors."
Even the most sceptical of DQ practitioners will have to admit that a continuously reducing number of data errors will eventually converge to zero!
Thanks again Tony and Dario and thanks John for joining, it’s a pleasure to host such debates.