Data Quality from the Cloud

One of my favorite data quality bloggers, Jim Harris, wrote a blog post this weekend called “Data, data everywhere, but where is data quality?”

I believe that data quality will be found in the cloud (not the current ash cloud, but to put it more plainly: on the internet). Many of the data quality issues I encounter in my daily work with clients and partners are caused by adequate information not being available at data entry – or not being exploited. But the information needed will in most cases already exist somewhere in the cloud. The challenge ahead is how to integrate the information available in the cloud into business processes.

Using external reference data to ensure data quality is not new. Especially in Scandinavia, where I live, this has long been in use because of the tradition of the public sector recording data about addresses, citizens, companies and so on far more intensively than in the rest of the world. The Achilles heel, though, has always been how to smoothly integrate external data into data entry functionality and other data capture processes and, not to forget, how to ensure ongoing maintenance in order to avoid the otherwise inevitable erosion of data quality.

The drivers for increased exploitation of external data are mainly:

  • Accessibility, which is where the fast-growing (semantic) information store in the cloud helps – not least backed up by the worldwide tendency of governments to release public sector data
  • Interoperability, where an increased supply of Service Oriented Architecture (SOA) components will pave the way
  • Cost; the more subscribers to a certain source, the lower the price – plus many sources will simply be free

As said, smooth integration into business processes is key – or sometimes even better, orchestrating business processes in a new way so that available and affordable information (from the cloud) is pulled into these business processes using only a minimum of costly on-premise human resources.


11 thoughts on “Data Quality from the Cloud”

  1. william sharp 19th April 2010 / 13:18

    I believe the movement toward data quality in the cloud will also increase the need to move ETL to the cloud. Bottom line: get your head in the clouds!
    Great topic, Henrik! Looking forward to more cloud based conversations!

  2. Jim Harris 19th April 2010 / 15:03

    Great post Henrik,

    I definitely agree that the future of data quality is cloudy.

    On a side note, perhaps we need a new name for it that could make our future seem brighter. Perhaps we should call it the Solar Web instead of the Semantic Web? Then we could say that the outlook for data quality is always sunny when you use the Solar Web.

    Best Regards,


    P.S. Thanks for the link and the kind words 🙂

  3. Henrik Liliendahl Sørensen 19th April 2010 / 15:28

    Thanks William and Jim.

    I agree about ETL and oh yes, it’s always sunny above the clouds – I remember that from way back when it was possible to go by airplanes here in Northern Europe.

  4. kenoconnordataconsultant 20th April 2010 / 12:30


    I’m with you all the way – “to infinity and beyond”!

    Seriously though, as Data Quality professionals we need to be aware of, and to guide our clients about, the “move to the cloud” (e.g. Cloud based CRM) and the availability of quality external reference data in the cloud.

    As William points out, moving data into a cloud based CRM will pose the same ETL challenges as a traditional “land based” migration.

    Great post, looking forward to lots more debate,

    Rgds Ken

  5. Garnie Bolling 20th April 2010 / 23:53

    Another great post Henrik…

    Just as Ken and William state, it does not matter where the data sits; what matters is how it is leveraged, consumed, created and managed…

    As exciting as it is to see Cloud Computing, I am just as excited to see where DQ & MDM will grow to be part of the cloud…

  6. Henrik Liliendahl Sørensen 21st April 2010 / 06:10

    Thanks Ken and Garnie for commenting.

    I think we will see an evolution with ETL, MDM and DQ as enterprises continue to embrace the cloud. Things will get more real-time and real-world.

  7. Mark Baran 21st April 2010 / 23:35

    We have been evangelizing and building upstream (Cloud) data quality solutions for years. It’s nice to see IT and marketing professionals begin to take it seriously.

  8. Henrik Liliendahl Sørensen 22nd April 2010 / 05:41

    Thanks Mark. I like your website at Ikhana – both the information and the fish.

    Also thanks for the mention in your blog section.

  9. Pugazendhi Asaimuthu 23rd April 2010 / 00:36

    Cloud computing will certainly lead to greater web services driven data quality assurance. However, whether that will lead to better data quality at OLTP data entry will only be decided by the efficiency and reliability of web services facilitated data quality. Are we there yet? Until then data cleansing and standardization by ETL in the DW and BI spaces will be the lifeline for quality information.

  10. Henrik Liliendahl Sørensen 23rd April 2010 / 06:17

    Thanks Pugazendhi. I agree, the reality today is very much about cleansing during ETL when loading data from operational applications into our DWs, and in other batch processes such as migration.

    We are certainly not there yet, as I see it. We have only just begun.

  11. Mark Baran 23rd April 2010 / 17:20

    I’m inclined to agree with Pugazendhi’s take on the current state of our industry. Traditional ETL tools still provide the best tools for data quality initiatives overall.

    But traditional ETL tools usually do not provide the methods to deal with data that is used in real time. Many of our customers use very sophisticated data quality tools. From an IT perspective, these tools are employed in a manner that historically “works.” From the standpoint of the business owner of the data, they do not “work” well.

    For example, one of our customers utilizes a Trillium instance to apply data rule based transformations to records coming in from various web sites. These records are stored in a temporary database, usually for a period of days, before they are moved into the marketing database that the business uses for campaign management. By then the data is “old” as far as the business owner is concerned.

    Many of the data transformations can be handled upstream, at the datasource. This is what our solutions are designed to accomplish. Can we do everything a traditional ETL DQ solution can do? Obviously not, nor would we claim it. But what we can do, in milliseconds not days, is transform and enhance the data; make it more useable in real time; help the receiving application’s matching process work much more efficiently; and, most importantly, give the business owner a process that they feel meets their needs.

    As Henrik has noted: we have to start somewhere.
