What is Data Quality anyway?

The above question might seem a bit belated given that I have been blogging about Data Quality for 9 months now. But from time to time I ask myself questions like:

Is Data Quality an independent discipline? If it is, will it continue to be that?

Data Quality actually is (or should be) a part of a lot of other disciplines.

Data Governance as a discipline is probably the best place to include general data quality skills and methodology, that is, all the people and process sides of data quality practice. Data Governance is an emerging discipline with an evolving definition, says Wikipedia. I think there is a pretty good chance that data quality management as a discipline will increasingly be regarded as a core component of data governance.

Master Data Management is a lot about Data Quality, but MDM could be dead already. Just like SOA. In short: I think MDM and SOA will survive, getting new life from the semantic web and all the data resources in the cloud. For that, MDM and SOA need Data Quality components. Data Quality 3.0 it is.

You may then replace MDM with CRM, SCM, ERP and so on, and thereby extend the use of Data Quality components from dealing only with master data to dealing with transaction data as well.

Next questions: Are Data Quality tools an independent technology? If they are, will they continue to be that?

It’s clear that Data Quality technology is moving from stand-alone batch processing environments, via embedded modules, to, oh yes, SOA components.

If we look at what data quality tools today actually do, they in fact mostly support you with automation of data profiling and data matching, which is probably only some of the data quality challenges you have.
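To make the two terms concrete, here is a minimal sketch of what profiling and matching amount to at their simplest. It uses only the Python standard library; the sample records, the function names and the 0.6 threshold are assumptions made up for illustration, not taken from any particular data quality tool.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical customer records, invented for illustration only.
records = [
    {"name": "Acme Corporation", "city": "Copenhagen"},
    {"name": "ACME Corp.", "city": "Copenhagen"},
    {"name": "Nordic Foods", "city": ""},
]


def profile(rows, column):
    """Basic column profile: row count, empty values, distinct values."""
    values = [row.get(column, "") for row in rows]
    filled = [v for v in values if v.strip()]
    return {
        "rows": len(values),
        "empty": len(values) - len(filled),
        "distinct": len(Counter(filled)),
    }


def similarity(a, b):
    """Similarity ratio between 0 and 1 for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_candidates(rows, column, threshold=0.6):
    """Return index pairs whose values in `column` look alike."""
    pairs = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if similarity(rows[i][column], rows[j][column]) >= threshold:
                pairs.append((i, j))
    return pairs
```

Calling `profile(records, "city")` counts the empty city, and `match_candidates(records, "name")` proposes the two Acme rows as a possible duplicate. Real tools add parsing, standardization and far better similarity measures, but the division of labour is the same.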

In recent years there has been a lot of consolidation in the market around Data Integration, Master Data Management and Data Quality, which certainly tells us that the market needs Data Quality technology as components in a bigger scheme along with other capabilities.

But some new pure Data Quality players are also being established – and I think I often see some old folks from the acquired entities at these new challengers. So independent Data Quality technology is not dead and doesn’t seem to want to be.


21 thoughts on “What is Data Quality anyway?”

  1. Dylan Jones 17th March 2010 / 14:07

    Great, thought-provoking post as ever Henrik.

    I think it’s important to distinguish Data Quality – the process or methodology, and data quality – the technology market.

    I think this causes a lot of confusion depending on who you are dealing with in the organisation.

    Your point “…look at what data quality tools today actually do, they in fact mostly support you with automation of data profiling and data matching, which is probably only some of the data quality challenges you have…” – that is absolutely, 100% bang on, there are a myriad of other activities required in a data quality programme, technology plays a part but often a very minor part.

    I also agree that data governance is increasingly seen as the “umbrella” programme with DQ sitting underneath.

    I’ve been considering the terminology we use a lot lately and this has helped me to shape some of those thoughts, nice one.

  2. michaelbaylon 17th March 2010 / 14:50

    Interesting post and comment.

    I have found that DAMA’s functional framework provides a useful viewpoint.

    It has data governance at the core – with 9 other areas including data quality centred around it – this seems to align with your point and the previous comment.

    It has been a useful tool when discussing ‘data’ with the rest of the business and explaining how different areas such as data governance, architecture, quality etc are all interconnected.

  3. Henrik Liliendahl Sørensen 17th March 2010 / 21:09

    Comments from the Data Quality Pro.com LinkedIn group:

    John Ladley says:

    independent of what? It can be done as a stand alone program, but ideally is a part of overall EIM.

    Michael Antonellis says:

    Data Quality should be a stand-alone discipline so you have the flexibility to integrate into multiple applications and stacks. DQ is not a commodity, and should not be sold as simply a check box in a stack solution. Every company has multiple applications and stacks, but if you have DQ in one stack, what happens in the other systems and stacks? By obtaining separate best-of-breed DQ tools, you will ensure that DQ can be integrated into any process, database, application or stack and not be confined to one stack provider that has very little DQ expertise.

    Milan Kučera says:

    Data quality management is an individual discipline based on use of relevant quality management methodology (TdQM, TIQM, or others). Data quality tools (profiling, cleansing) were standalone applications integrated into one package a few years ago. It was possible to integrate these into the applications like CRM before and after this integration process.

    I think it is no problem whether data quality tools used for inspection and for scrap and rework continue to be standalone or not. The best use of a data quality tool is for prevention (at the place where information is created), but then it must be closely integrated with the specific process.

    If integrated as a part of ETL processes, then the data quality tool becomes part of massive inspection and/or correction. This is opposite to sound principles of quality management. All costs associated with data quality tool is nothing more or less than failure costs.

    A business process is effective only if it works with accurate information, and that is something a data quality tool (cleansing) is unable to ensure.

    In general you are right: data quality tools (inspection and cleansing) must provide wide integration features.

    And the last comment. An information quality assessment that focuses on the stability of the information architecture requires comparing results between systems in the direction of the information flows. This is a feature that should primarily be part of an inspection tool. I do not know all profiling tools, but a year ago I did not find any with this feature.

  4. William Sharp 17th March 2010 / 21:13

    I’d have to agree with you that data quality and data governance are co-dependent business initiatives.

    I’m curious why you think that MDM and SOA are “dead”?

    I always enjoy your data quality focused blog postings! Looking forward to tomorrow’s post!

  5. Henrik Liliendahl Sørensen 18th March 2010 / 09:36

    Thank you all for the comments.

    Dylan, excited to see what’s up from your side. I remember that DataQualityPro was born out of an active part of DataMigrationPro.

    Michael B., thanks for sharing the DAMA reference. I know what you mean when saying “the rest of the business” considering your recent blog post Is IT part of the business?. Not least when talking Data Governance and Data Quality, the view of business and IT as two separate things is a killer.

    John, agree, a Data Quality program is eventually a component in Enterprise Information Management. Now, how exactly do EIM and Data Governance relate? Is Data Governance a key component of EIM? So (Enterprise) Data Quality (Management) is a component of (Enterprise) Data Governance, which is a component of (Enterprise) Information Management?

    Michael A., agree, it is not necessarily very good if you buy your Data Quality tools as one-stop-shopping along with CRM / ETL / BI solutions. I think the sentiment is expressed very well in the article on DataQualityPro called Does your project suffer from data quality product myopia.

    Milan, my expectation is that data quality tools will increasingly embrace access to external data resources in order to ensure accuracy, completeness, timeliness, uniqueness….

    William, the phrase “MDM (as we know it) is dead” was recently said by, among others, Andrew White of Gartner. As analysts do, he said it in the context of technology vendors and their products. “SOA is dead” was said by Anne Thomas Manes of Burton Group about a year ago – less often repeated along with the other half of the title: Long Live Services. May I sum this up:

    MDM vendors are among endangered species. SOAsaurus is dead. Long live MDM DQ SOA components

  6. kenoconnordataconsultant 18th March 2010 / 12:03

    Henrik,

    Thought provoking as ever.

    I agree with Milan Kučera:
    “All costs associated with data quality tool is nothing more or less than failure costs.”

    As we know from experience, these are unavoidable “data cleansing” costs, but they remain “failure costs” – failure to prevent the problem occurring in the first place.

    We should learn from our mistakes… Every instance of “Data Cleansing” requires a business rule to be defined, to implement the data cleansing. The same business rule should then be deployed “upstream” at each data capture point.

    In the case of data migration from legacy to new system, the business rules should be incorporated into all data entry processes of the new system.
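The idea of reusing a cleansing rule upstream can be sketched as one rule definition called from two places. This is only an illustration in Python: the four-digit postal code rule and the function names `cleanse` and `capture` are hypothetical, chosen to keep the example short.

```python
import re

# Hypothetical business rule, invented for illustration:
# a postal code is exactly four digits.
POSTAL_CODE = re.compile(r"[0-9]{4}")


def is_valid_postal_code(value):
    """The single rule definition shared by cleansing and data capture."""
    return bool(POSTAL_CODE.fullmatch(value.strip()))


def cleanse(rows):
    """Downstream use: split existing rows into accepted rows and rework."""
    accepted, rework = [], []
    for row in rows:
        target = accepted if is_valid_postal_code(row["postal_code"]) else rework
        target.append(row)
    return accepted, rework


def capture(form_input):
    """Upstream use: apply the same rule at the point of data entry."""
    if not is_valid_postal_code(form_input["postal_code"]):
        raise ValueError("postal code must be four digits")
    return {**form_input, "postal_code": form_input["postal_code"].strip()}
```

Because both functions call `is_valid_postal_code`, a rule discovered during a cleansing exercise is not written twice, and a later change to the rule reaches the entry screen and the batch job together.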

    Rgds Ken

  7. Henrik Liliendahl Sørensen 18th March 2010 / 12:36

    Thanks Ken

    My expectation about the future of data quality tools is certainly that they will move upstream.

    This may include applying business rules for data entry but also minimizing the data entry that has to be checked (and governed). Why enter data that you can pick from the cloud? Why maintain data if you can link to maintained data in the cloud? Why bother about whether your data are complete if you can enrich them through your links to the cloud?

  8. Phil Simon 18th March 2010 / 17:48

    Until the day that most people recognize the need for DQ, I don’t see how it’s dead. Also, with regard to SOA and MDM, this might just be Gartner being controversial for the sake of being controversial. $20 says that, if an MDM vendor paid to be in one of their “magic quadrants”, they’d be hyping it like it’s social networking.

  9. Ted Friedman 18th March 2010 / 18:44

    That might be true, if vendors could pay to be in magic quadrants. Which, as has been discussed numerous times in various forums, is not the case.

  10. Henrik Liliendahl Sørensen 22nd March 2010 / 10:32

    Thanks Phil and Ted for the brief exchange of fire on the Magic Quadrant.

    On the “MDM – Master Data Management” LinkedIn group there are some comments:

    Sravan Kasarla says:

    In my opinion Data Quality is an integral part of the Information management. Data Quality is key to enterprise data management especially managing key business data (master data). Business value delivered by data quality initiatives depends on the change in culture within an enterprise and fundamental change in the Information Delivery Lifecycle. As a technology it may continue to be an independent engine but increasingly it is being wrapped in to the integrated Master Data Management (MDM) platform offerings.

    Kjell Wittmaack says:

    As with many things, it depends on what you believe a data quality discipline covers and what you are aiming at achieving. If a data quality discipline is about the skills needed to harmonize, standardize, cleanse, and de-duplicate data, then yes I believe that discipline is here to stay – as long as we have data we will need such skills.

    If the aim is to improve the quality levels of a company’s data in a sustainable way, such skills will not be enough and you need to include other disciplines (or skill sets) like data governance and the management of common definitions to reach your goal. In my view you can only measure (and improve) the quality of data against a defined standard. Hence the establishment of that standard and the management of it have to become part of the picture to make results achieved by changing the quality of data sustainable. This will definitely take us into the realm of Information Management (managing information as a corporate asset). So our “disciplines” are converging and working in tandem to deliver higher goals as we and our organizations become more ambitious.

    As for DQ technology, convergence is already happening – Data Quality technology at its root is specialized (unique) functionality, and that is most definitely starting to appear in technology other than specialized DQ technology.

  11. Milan Kučera 22nd March 2010 / 12:08

    I am afraid I cannot agree with the understanding of data quality as data profiling and data cleansing technology. Data quality is presented to the market as applying massive inspection and massive correction of data – and that is not correct; it is wholly opposite to quality management principles.

    A process, especially in the clerical world, depends on the accuracy of information. Information is a collection of a) data; b) definition; c) presentation. All we need is the ability to categorize information into groups like: the most critical, critical, nice to have with a high density of fulfillment, etc. Information quality measures are applied to each of the categories. I think this is one criterion for deciding whether data is master data or not. If the most critical information is of low quality, then high costs of poor quality are “generated”. And that can be another criterion for identifying what data can be classed as master data.

    Management is about managing. If we focus on prevention at the information level, then we will protect the data related to critical information.

    Technology like data cleansing helps reduce incorrectness (mistyping, patterns, etc.). But how do data quality tools work with an overloaded field? That is an important issue, as are so-called default values like 999-999-999, etc. Cleansing tools are unable to ensure information accuracy to reality – the most important measure.
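As a rough illustration of the default-value problem, a profiling step can at least flag values that look like placeholders. The sketch below is Python, and the heuristic (three or more digits, all identical) is an assumption chosen for brevity. It only spots suspicious patterns; it says nothing about whether a plausible-looking value is accurate to reality, which is the harder problem raised here.

```python
def is_placeholder(value):
    """Flag values whose digits are all the same, such as 999-999-999."""
    digits = [ch for ch in value if ch.isdigit()]
    return len(digits) >= 3 and len(set(digits)) == 1


def placeholder_share(values):
    """Share of values in a column that look like placeholders."""
    if not values:
        return 0.0
    return sum(is_placeholder(v) for v in values) / len(values)
```

A high `placeholder_share` for a mandatory field suggests that people are working around the entry screen, which points back to the capture process rather than to the data itself.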

    What do you think?

  12. Henrik Liliendahl Sørensen 22nd March 2010 / 12:41

    Thanks again Milan.

    I actually tried to express the same sentiment as I see in your comment. As Dylan said in his comment:

    It’s important to distinguish Data Quality – the process or methodology, and data quality – the technology market.

    For the latter one it has been much about profiling and cleansing including matching. But as you rightfully say, cleansing tools are (often) not able to ensure information accuracy to reality.

    What I see is an increasing trend towards using matching technology in aligning (master) data with reality. Until now this has mostly been done by, for example, matching with business directories like the D&B WorldBase, making address corrections with available postal sources and linking product master data with product classification systems. With the emerging semantic web the sources available will grow rapidly, not least considering the trend that governments around the world are beginning to release a lot of data these days.

    I think it’s going to be exciting to see what happens with skills around data quality in the future and how cloud computing will influence these disciplines, as it will a lot of other disciplines.

  13. Henrik Liliendahl Sørensen 22nd March 2010 / 17:40

    On the LinkedIn group “Data Quality Pro.com” Olga Maydanchik says:

    Hi Henrik, you might be interested in our new course Information Management Fundamentals (available at http://www.elearningcureve.com free for a limited time). You will find a very interesting answer to this very question. IM includes 14 different disciplines (DQ is one of them). Dave Wells entertains a very interesting concept of the discipline dependencies. He puts all 14 disciplines in a stack. Disciplines at the top of the stack (such as predictive analytics or data mining) have increased dependencies upon other disciplines. Disciplines at the bottom (such as data modeling or metadata management) provide a foundation for others. DQ is somewhere in the middle, so it depends on some and provides a foundation for others. I found this view quite fascinating. Check it out; it is very well worth your time. By the way, I used the term stack with a very different meaning than Michael did in his answer.

    Here is the link
    http://ecm.elearningcurve.com/Information_Management_Fundamentals_p/imf-01-a.htm

  14. Milan K. 23rd March 2010 / 10:28

    Henrik,

    you are right in the case of technology. One of the great business issues is the ability to identify the same customer across the different systems used by a company. A matching algorithm, even though it works with fuzzy logic, can help significantly in this effort.

    You are also right about the use of matching technology for data enhancement with external information. You have mentioned registers like D&B and similar. As I wrote, the most important information quality measure is accuracy to reality. I do not have relevant information about how reference tables are maintained by governments in other countries. If I use real experience from my country, then I see a few issues with their accuracy. Let’s take a look at our national address register. There are two registers maintained by two official authorities. These are updated weekly. It means that newly approved streets will be available to companies only after a week, so if you are doing profiling you will identify an address as one that does not exist (but the reality is different).

    I tried to point out the accuracy of the external reference table you plan to use. It is necessary to ask about the processes around every reference table, because its maintenance has an impact on the company’s processes.

    I know it is very hard to ensure information accuracy. But business processes should take it into account and implement preventive steps.

    Thanks, you have opened an excellent discussion.

  15. Duane Morrison Smith 25th March 2010 / 10:33

    I will not start by providing any generic definitions of data quality, as there are many versions of the truth, as we all know. I will also not cover the broader concepts of Enterprise Information Management or Data Governance in my response, as there are already valuable resources out there.

    As to your question, maybe the real term that requires definition is the use of the word ‘independent’. Are we saying ‘Standalone’ or ‘Separate from’?

    I would suggest that some of the failures in data quality have their origin in either systems or humans working in this way, in isolation without any form of training, checking or cross-referencing and in silos. It is right to say that data quality requires ‘discipline’ and sometimes doing the right thing can be the ‘hard road’ when it is so much easier to enter only some of the information into a record, let alone verify its source or validity.

    The world may have changed with the introduction of all types of pervasive technology and we may use many more sources of data than previously; however, the practices of data capture in the main remain unchanged: a human being entering data into a computer. That data only becomes useful information when it can be trusted or relied upon. Hence the more sources used to reference this data and verify its truth the better. While there may be generic data such as locality tables that you could find on a cloud somewhere, you will still have to review what data sources are reliable or trustworthy when it comes to using any data source, internal or external. This data will also have to be rationalised along with ‘known’ data you already have.

    Unfortunately, many are always looking for the ‘panacea’ or the ‘magical solution’ when data quality, for the most part, is about hard work, commitment and, more importantly, teamwork, collaboration and integration. I agree that technology plays an important part here, but I agree with Dylan that it should not be the most important part or steal the show. Technology, like most IT departments, should be the facilitator for effective business processes that fully support the way an organisation does business. If data quality is being considered independent, it may be that the systems in place or the culture in an organisation is missing the ‘Q’ factor, i.e. ‘quality’. It may be that data quality is a ‘nice to have’ rather than an essential element of doing business with customers, as in real people!

    In regard to data quality tools being independent of other technologies, it seems to be a loaded question. As a data quality vendor or practitioner it might be concerning to see data quality technology become part of some other technology or stack. However, I would suggest that the more systems that embrace data quality technology including rules, knowledgebases, parsing and matching algorithms the better. Integration is the technology key here, just as collaboration is the people key. Already many of the pure play data quality vendors offer batch (back-end) and real-time (front-end) solutions to resolve both legacy and current data quality issues respectively. Whether the integration happens through vendors acquiring each other or within your own in-house software development team, it does not really matter as long as the result is the same, quality improvement. Anyone who knows data quality, knows that data quality is not something that can be easily understood or achieved. It is also a subject matter that many organisations have taken for granted at their peril. Therefore, I see specialised data quality skills and technology being part of the landscape for the foreseeable future as the mainstream are still learning how things should be done in an ideal world and more basically how to enter data correctly into a form on a screen.

  16. Henrik Liliendahl Sørensen 25th March 2010 / 11:40

    Thanks Milan for summing it up nicely.

    Duane, thanks a lot for the very comprehensive answer. I really like your points, not least:

    “Data only becomes useful information when it can be trusted or relied upon. Hence the more sources used to reference this data and verify its truth the better”.

    “If data quality is being considered independent, it may be that the systems in place or the culture in an organisation is missing the ‘Q’ factor”.

  17. Theodora 8th April 2010 / 10:57

    You may want to look at Dun & Bradstreet’s data quality management information solutions for trusted and relied upon data.

  18. Henrik Liliendahl Sørensen 8th April 2010 / 11:02

    Theodora, I am doing that at this very moment, as I am working with the D&B WorldBase match solutions in Europe.

  19. John Owens 17th September 2010 / 08:43

    There are three separate questions embedded in the question, “What is Data Quality anyway?”.

    The first would be better asked in the form, “What is quality data?”. The answer is, “Data that is fully capable of supporting all of the core business activities (Business Functions) of an enterprise”.

    The second question is, “What are ‘Data Quality’ activities?”. The answer to this is, “That set of activities carried out in an enterprise to correct errors in data caused by faults in business functions, processes, procedures and practices.” As Ken O’Connor says, these are “failure costs”.

    The third part of the question is, “What is the ‘Data Quality Industry’?”. There are probably two parts to the answer here.

    The first is, “A set of disciplines and professionals whose purpose and goal it is to help enterprises sort out errors in their existing data and to put in place structures and practices with the Business Functions (activities) of the enterprise to prevent the errors happening again.”

    The second is, “A set of organizations exploiting the fact that so many enterprises have bad data as a means of pushing consultancy and automated tools, both purporting to ‘fix’ the problem, which in effect perpetuate the situation, thus ensuring a continuing and expanding market for their products and services.”

    Sadly, the DAMA Functional Framework, referred to in a reply above, is typical of a structure that will perpetuate the current data problems. It makes the mistake of supposing that data has an intrinsic value and can be addressed in abstract isolation! Data has no intrinsic value! It only exists to support the activities of the enterprise.

    The only way to ensure the quality of data in an enterprise is to ensure that the business rules required to correctly create and transform data are an integral part of the business functions doing the creation or transformation.

    Regards
    John

  20. Henrik Liliendahl Sørensen 17th September 2010 / 09:31

    Thanks John for the comment.

    Yes, “data quality” as a term is indeed used with different meanings. Yesterday I noticed a blog post from Winston Chen of Kalido called Is Data Quality Dead?. Many different possible answers to that question may be true.
