The other day Joy Medved aka @ParaDataGeek made this tweet:
Indeed, upstream prevention of bad data entering our databases is surely better than downstream data cleaning. Likewise, real-time enrichment is better than enriching data long after it has been put to work.
That said, there are situations where data cleaning has to be done. Those reasons were examined in the post Top 5 Reasons for Downstream Cleansing. But I can’t think of many situations where a downstream cleaning and/or enrichment operation will be of much worth unless it is followed up by an approach to getting it first time right in the future.
If we go a level deeper into data quality challenges, we find that different data quality dimensions carry different importance in different data domains, as explored in the post Multi-Domain MDM and Data Quality Dimensions.
With customer master data we most often have issues with uniqueness and location precision. While I have spent many happy years with data cleansing, data enrichment and data matching tools, during the last couple of years I have been focusing on a tool for getting that first time right.
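To make the uniqueness challenge concrete, here is a minimal sketch – assuming nothing more than a handful of made-up customer records and a hypothetical similarity threshold, not any specific tool – of the kind of duplicate check a data matching tool automates:

```python
# Minimal sketch: flag potential duplicate customer records by
# fuzzy-matching the name and comparing the city. The records and
# the threshold are illustrative assumptions only.
from difflib import SequenceMatcher
from itertools import combinations

customers = [
    {"id": 1, "name": "Acme Corp", "city": "Copenhagen"},
    {"id": 2, "name": "ACME Corp.", "city": "Copenhagen"},
    {"id": 3, "name": "Beta Trading", "city": "Aarhus"},
]

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.8  # hypothetical cut-off; real tools tune this per attribute

for left, right in combinations(customers, 2):
    score = similarity(left["name"], right["name"])
    if score >= THRESHOLD and left["city"] == right["city"]:
        print(f"Possible duplicate: id {left['id']} vs id {right['id']} "
              f"(name similarity {score:.2f})")
```

A real matching engine would of course use survivorship rules, address standardization and probabilistic scoring rather than a single string ratio, but the principle of scoring candidate pairs against a threshold is the same.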
Product master data are often marred by issues with completeness and (location) conformity. The situation here is that tools and platforms for mastering product data are focussed on what goes on inside a given organization and not so much on what goes on between trading partners. Standardization seems to be the only hope. But that path is too long to wait for and may in some ways contradict the end purpose, as discussed in the post Image Coming Soon.
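To illustrate what a completeness issue looks like in practice, here is a minimal sketch – with assumed attribute names and records rather than any real standard – of checking product records against a list of required attributes:

```python
# Minimal sketch: report product records that are incomplete with
# respect to a required attribute list. Attribute names and data
# are illustrative assumptions only.
REQUIRED_ATTRIBUTES = ["sku", "description", "weight_kg", "country_of_origin"]

products = [
    {"sku": "P-100", "description": "Copper pipe 15 mm", "weight_kg": 0.4},
    {"sku": "P-200", "description": "", "weight_kg": 1.2, "country_of_origin": "DE"},
]

def missing_attributes(record: dict) -> list[str]:
    """Return the required attributes that are absent or empty."""
    return [attr for attr in REQUIRED_ATTRIBUTES
            if not str(record.get(attr, "")).strip()]

for product in products:
    gaps = missing_attributes(product)
    if gaps:
        print(f"{product['sku']}: incomplete, missing {gaps}")
```

In a trading-partner setting the hard part is not running such a check, but agreeing on which attributes are required and who supplies them – which is exactly the gap between organizations that internal tools do not cover.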
So in order to have a first time right solution for product master data sharing, I have embarked on a journey with a service called the Product Data Lake. If you want to join, you are most welcome.
PS: The Product Data Lake also has the capability of catching up with the sins of the past.
Henrik, you have raised very valid points. Safeguarding the investment of time and other resources in a data cleansing initiative with a data-quality-at-source process is a no-brainer.
Otherwise, the end users don’t trust the data set anyway. A business that does not trust the data does not use it to drive decisions. An example a customer shared with me was how, before their data cleansing project, it was common practice across that mining company for plant managers to hold additional safety stock, tying up millions of dollars in working capital. They called it sleeping stock – stock of critical items that allowed them to sleep at night!
Self-service, source-system-agnostic tools are the answer. One of the largest state-owned oil and gas companies is using Verdantis Harmonize to support a supplier self-service environment. I see the Product Data Lake as a valid extension of that idea, and a solution that the industry needs.
The key challenge to be solved remains organizational maturity around data processes.
Thanks a lot for commenting, Abhinav. Data processes, and being mature about them, are indeed key to getting data sharing right. When it comes to sharing product data within ecosystems of manufacturers, distributors and retailers, I have seen many data portals. What we need is to cover the data processes between trading partners as well. This is among the important things I try to achieve with the Product Data Lake concept.