Product Data Quality

The data quality tool industry has always had a hard time offering capabilities for solving the data quality issues that relate to product data.

Customer data quality issues have always been the challenge addressed, as examined in the post The Future of Data Quality Tools, where the current positioning from the analyst firm Information Difference was discussed. Leaders such as Experian Data Quality, Informatica and Trillium (now part of Syncsort) always promote their data quality tools with use cases around customer data.

Some years back Oracle did have a go at product data quality with their Silver Creek Systems acquisition, as mentioned by Andrew White of Gartner in this post. The Silver Creek approach to product data quality can be seen in this MIT Information Quality Industry Symposium presentation from the year before. However, today Oracle is not even present in the industry report mentioned above.

[Figure: Multi-Domain MDM and Data Quality Dimensions]

While data quality as a discipline, with its methodology and surrounding data governance, may be very similar for customer data and product data, the capabilities needed in tools supporting data cleansing, data quality improvement and prevention of data quality issues are somewhat different.

Data profiling is different, as it must be very tightly connected to product classification. Deduplication is useful, but far from to the same degree as with customer data. Data enrichment must rely much more on second party data than on third party data, the latter being most useful for customer and other party master data.
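
To make the classification point concrete, here is a minimal sketch of classification-aware product data profiling, where the attributes to check depend on the product's classification. The classification codes, attribute names and sample records are made up for illustration; a real setup would use a standard such as UNSPSC or eClass and far richer attribute requirements:

```python
# A sketch of classification-aware data profiling for product data.
# Classification codes, attribute names and sample records are made up.

# Which attributes are required depends on the product classification.
REQUIRED_ATTRIBUTES = {
    "power_tools": ["voltage", "wattage", "weight_kg"],
    "fasteners": ["diameter_mm", "length_mm", "material"],
}

products = [
    {"id": 1, "class": "power_tools", "voltage": "230V", "wattage": "700W"},
    {"id": 2, "class": "fasteners", "diameter_mm": "6", "length_mm": "40",
     "material": "stainless steel"},
]

# Count, per classification, how often each required attribute is filled.
for cls, required in REQUIRED_ATTRIBUTES.items():
    rows = [p for p in products if p["class"] == cls]
    for attribute in required:
        filled = sum(1 for p in rows if p.get(attribute))
        print(f"{cls}.{attribute}: {filled}/{len(rows)} filled")
```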

Regular readers of this blog will know that my suggestion for data quality tool vendors is to join Product Data Lake.

Data Quality Tools Revealed

To be honest: data quality tools today only solve a few of the data quality problems you have. On the other hand, the few problems they do solve may be solved very well, and they cannot be solved by any other line of products, nor in any practical way by humans in any quantity or quality.

Data Quality tools mainly support you with automation of:

• Data Profiling
• Data Matching

Data Profiling

Data profiling is the ability to generate statistical summaries and frequency distributions for the unique values and formats found within the fields of your data sources, in order to measure data quality and find critical areas that may harm your business. For more on the subject I recommend the introduction provided by Jim Harris in his post “Getting Your Data Freq On”, which is followed up by the series of posts “Adventures in Data Profiling part 1 – 8”.
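
As a rough illustration of what those frequency distributions look like, here is a minimal sketch for a single field. The sample phone numbers are made up, and the format mask convention (digits become 9, letters become A) is an assumption for illustration:

```python
from collections import Counter
import re

# A sketch of value and format frequency profiling for one field.
# Sample values are made up; the format mask turns digits into 9 and
# letters into A, so "20 31 32 33" becomes "99 99 99 99".
values = ["20 31 32 33", "2031 3233", "20313233", None, "20 31 32 33"]

def format_mask(value):
    masked = re.sub(r"\d", "9", value)
    return re.sub(r"[A-Za-z]", "A", masked)

filled = [v for v in values if v is not None]
print("Value frequencies:", Counter(filled).most_common())
print("Format frequencies:", Counter(format_mask(v) for v in filled).most_common())
print("NULLs:", len(values) - len(filled), "of", len(values))
```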

Saying that you can’t use other product lines for data profiling is actually only partly true. You may come a long way using features in popular database managers, as demonstrated in Rich Murnane’s blog post “A very inexpensive way to profile a string field in Oracle”. But for full automation and a full set of out-of-the-box functionality, a data profiling tool will be necessary.
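
A sketch of that database-manager approach, with SQLite standing in for Oracle and made-up table and column names: a simple GROUP BY over a derived property such as string length already gives a crude profile of a field:

```python
import sqlite3

# A sketch of profiling with plain database features; SQLite stands in
# for Oracle here, and the table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (postal_code TEXT)")
conn.executemany("INSERT INTO customer VALUES (?)",
                 [("2100",), ("2100",), ("DK-2100",), ("",)])

# Frequency distribution of field lengths as a crude format profile.
query = ("SELECT LENGTH(postal_code) AS len, COUNT(*) AS freq "
         "FROM customer GROUP BY len ORDER BY freq DESC")
for length, freq in conn.execute(query):
    print(f"length {length}: {freq} rows")
```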

The data profiling tool landscape, unlike that of data matching, is also characterized by the existence of open source tools. Talend is the leading one; another is DataCleaner, created by my fellow countryman Kasper Sørensen.

I take the emergence of open source solutions in the realm of data profiling as a sign that this is the technically easiest part of data quality tool invention.

Data Matching

Data matching is the ability to compare records that are not exactly the same but are so similar that we may conclude that they represent the same real-world object.
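
A minimal sketch of such a comparison, using the Python standard library’s SequenceMatcher as the similarity measure; dedicated matching tools use far more advanced algorithms, and the record values and the 0.65 threshold here are arbitrary:

```python
from difflib import SequenceMatcher

# A sketch of scoring the similarity of two record values.
# Values and threshold are made up for illustration.
def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

score = similarity("ACME Corp.", "ACME Corporation")
print(f"similarity {score:.2f}:",
      "possible match" if score >= 0.65 else "no match")
```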

Here too, some popular database managers offer related functionality, such as fuzzy grouping and fuzzy lookup in MS SQL Server. But in order to really automate data matching processes you need a dedicated tool equipped with advanced algorithms and comprehensive functionality for candidate selection, similarity assignment and survivorship settlement.
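
Those three steps can be sketched as below, on made-up party records: candidate selection via a blocking key avoids comparing every pair, similarity assignment scores each candidate pair, and survivorship decides which record wins. All names, keys and thresholds are assumptions for illustration:

```python
from difflib import SequenceMatcher
from itertools import combinations

# A sketch of candidate selection, similarity assignment and
# survivorship settlement on made-up party records.
records = [
    {"id": 1, "name": "ACME Corp.", "city": "Copenhagen", "updated": 2017},
    {"id": 2, "name": "ACME Corporation", "city": "Copenhagen", "updated": 2019},
    {"id": 3, "name": "Beta Ltd.", "city": "Aarhus", "updated": 2018},
]

# 1) Candidate selection: only compare records sharing a blocking key
#    (here simply the city), instead of comparing all pairs.
blocks = {}
for r in records:
    blocks.setdefault(r["city"], []).append(r)

# 2) Similarity assignment on each candidate pair (threshold is arbitrary).
def is_match(a, b, threshold=0.65):
    return SequenceMatcher(None, a["name"].lower(),
                           b["name"].lower()).ratio() >= threshold

# 3) Survivorship settlement: the most recently updated record wins.
for block in blocks.values():
    for a, b in combinations(block, 2):
        if is_match(a, b):
            survivor = max((a, b), key=lambda r: r["updated"])
            print(f"merge records {a['id']} and {b['id']}; "
                  f"survivor is record {survivor['id']}")
```

Blocking is the design choice that makes batch matching feasible at all: without it, the number of comparisons grows quadratically with the number of rows.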

Data matching tools are essential for processing large numbers of data rows within a short timeframe, for example when purging duplicates before marketing campaigns or merging duplicates in migration projects.

Matching technology is increasingly implemented as what is often described as a firewall, where possible new entries are compared to existing rows in the database as upstream prevention against duplication.
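
A minimal sketch of that firewall pattern, assuming an in-memory list stands in for the database and reusing the same crude similarity measure as above:

```python
from difflib import SequenceMatcher

# A sketch of the matching "firewall": compare a possible new entry to
# existing rows before inserting it. Names and threshold are made up.
existing = ["ACME Corporation", "Beta Ltd."]

def admit(new_name, threshold=0.65):
    for name in existing:
        if SequenceMatcher(None, new_name.lower(),
                           name.lower()).ratio() >= threshold:
            return f"rejected: possible duplicate of '{name}'"
    existing.append(new_name)
    return "inserted"

print(admit("ACME Corp."))   # rejected as a possible duplicate
print(admit("Gamma GmbH"))   # inserted as a new entry
```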

Besides handling duplicates, matching techniques are used for correcting postal addresses against official postal references and for matching data sets against reference databases such as B2B and B2C party data directories, as well as against product data systems, all in order to enrich with and maintain more accurate and timely data.

Automation of matching is in no way straightforward, and solutions are constantly met with the challenge of balancing a sufficient number of true positives against too many false positives.
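
In practice this balancing act is often handled with two thresholds rather than one, splitting candidate pairs into automatic matches, a manual review queue and non-matches. A sketch, with arbitrary threshold values that would be tuned per data set:

```python
# A sketch of the two-threshold approach to the true/false positive
# trade-off. Threshold values are arbitrary.
AUTO_MATCH = 0.90      # above this: accept (risking some false positives)
POSSIBLE_MATCH = 0.65  # above this: route to a human data steward

def classify(score):
    if score >= AUTO_MATCH:
        return "match"
    if score >= POSSIBLE_MATCH:
        return "review"
    return "non-match"

for score in (0.95, 0.72, 0.40):
    print(score, "->", classify(score))
```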
