Data Quality in the Cloud

In my previous post I argued that Data Quality tools will, in the near future, exploit the huge data resources in the cloud to achieve high data quality, meaning data that correctly reflect the real-world construct to which they refer.

I am well aware that this rests on the assumption that data in the cloud are accurate, timely and so on, which is of course not always the case – at least not now. That will only come when a given data source has a number of subscribers who require a certain level of data quality and perhaps contribute to correcting flaws.

I tried that out right before writing this post when I installed Google Earth on a new laptop. It was a journey where I shifted between being very impressed and a bit disappointed.

First, the site from which to install guessed – either from my position or my OS language – that I am not English speaking. Unfortunately it switched to Dutch – and not Danish. Well, most Dutch words are like either German or English, or at least urban slang. I got through. Inside the application most text had now changed to Danish – with only a few Dutch and English labels.

Knowing that the application hadn’t learned anything about me yet, I started by typing just my street address, which is only 8 characters but globally unique: “Lerås 13” (remember: the house number comes after the street name in my part of the world). The application answered promptly with my full address as the first candidate, and when I clicked on it, it took me from high above the earth right down to where I live. Impressive.

Well, the pointer was actually 40 meters NNE of the nearest corner of our premises – and in front of our garage I could recognize the grey car I had 2 years ago. Disappointing.

5 thoughts on “Data Quality in the Cloud”

  1. Dylan Jones 19th March 2010 / 09:50

    I think your post actually shows a growing issue with information quality: we demand so much.

    If we had had access to this technology 10 years ago, this functionality would have been startling; now we (quite rightly) expect perfection.

    I think this is a wake-up call for organisations that have any data-driven interaction with their consumers: we now expect so much more than before.

    Great post.

  2. Henrik Liliendahl Sørensen 20th March 2010 / 10:14

    Thanks Dylan.

    An aspect of exploiting external data (be that in the cloud or delivered in more old-fashioned ways) is of course also to understand the intended purpose of the external data.

    The thing with the position of our house is that on our street there is one row of houses on the even side and two rows of houses on the odd side. We are in the second row and are reached by a small byway.

    The house numbers used by Google Earth seem to be provided by Tele Atlas. Here the house numbers are distributed along a straight line for the street. This is perfect for a feature like a route planner.

    The apparent 40 meter NNE inaccuracy is actually the distance between the first and second row of premises, at a 90 degree angle to the street, which runs WNW.

    So this is tricky: in order to obtain real-world representation for data quality – as opposed to fitness for the intended purpose – you exploit external data, but then you have to consider the intended purpose of the external data.

    Also you have to consider dimensions such as the timeliness of the external data – the satellite photo used by Google Earth may, for example, be 2 years old.

    So your remark about information quality is certainly also a wake-up call that high information quality is high data quality explained.

  3. Satesh 24th March 2010 / 05:42

    Excellent one Henrik!!!

    The post throws light on the timeliness of data, specifically when accessed over the cloud. We faced similar problems (timeliness of data) while designing a mash-up process that combined data from Google (Earth) and a client’s internal data. In the end we had to make do with the internal data alone, with some manual web research to fill in for the cloud processing (which had originally been planned to be automated).

    Technology has come a long way, from Excel sheets through legacy systems, ERPs and SaaS to cloud computing, but Data Quality issues seem to be far from getting addressed 😦

    MDMAnswers.com

  4. Jonathan Stigant 24th March 2010 / 16:41

    I think it is a good idea to define what we mean by Data Quality. For me it is whether it is ‘correct’ spatially in relation to the real world (one might describe this as ‘absolute’ quality). The alternative seems to be what is discussed here, that is to say ‘completeness’ and ‘consistency’. These are relative quality issues. The difference is that completeness and consistency can place things in the same place every time and provide a full dataset, yet still provide bogus information relative to the ‘real world’ or ‘absolute’ location.

  5. Henrik Liliendahl Sørensen 24th March 2010 / 22:08

    Thanks Satesh and Jonathan.

    Data Quality may indeed be valued along many different dimensions.

    For consistency, I would say this may mean that a geocode from a given source is always assigned by the same principle, and is not, say, sometimes the nearest spot on a public street and other times the centre of a premises. In the latter case the derived information becomes accurate in some contexts but inaccurate in others.
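
The two geocoding principles discussed in the comments (a point interpolated along a straight street line versus the centre of the premises) can be sketched in a few lines of Python. This is only an illustration and not how Tele Atlas or Google Earth actually work: the street length, the house-number range 1 to 25, the local metre coordinates and both helper functions are invented for the example. Only the 40 meter gap, the WNW street direction and the right angle come from the discussion above.

```python
import math

def interpolate_on_street(house_no, first_no, last_no, start, end):
    """Place a house number on the straight line between the street's end points."""
    fraction = (house_no - first_no) / (last_no - first_no)
    return (start[0] + fraction * (end[0] - start[0]),
            start[1] + fraction * (end[1] - start[1]))

def move(point, bearing_deg, distance_m):
    """Shift an (east, north) point by a distance along a compass bearing."""
    rad = math.radians(bearing_deg)
    return (point[0] + distance_m * math.sin(rad),
            point[1] + distance_m * math.cos(rad))

# Hypothetical street in local (east, north) metre coordinates:
# numbers 1 to 25, running WNW (compass bearing 292.5 degrees) for 240 m.
STREET_BEARING = 292.5
street_start = (0.0, 0.0)
street_end = move(street_start, STREET_BEARING, 240.0)

# Principle A: a spot on the street line (fine for a route planner).
on_street = interpolate_on_street(13, 1, 25, street_start, street_end)

# Principle B: the premises themselves, assumed here to lie 40 m from the
# street line at a right angle to it. 292.5 + 90 wraps around to 22.5
# degrees, which is NNE on a 16-point compass.
perpendicular = (STREET_BEARING + 90) % 360
premises = move(on_street, perpendicular, 40.0)

gap = math.dist(on_street, premises)
print(f"perpendicular bearing: {perpendicular} degrees")
print(f"gap between the two geocodes: {gap:.1f} m")
```

Both points are consistent answers to "where is number 13?"; they just answer it by different principles, which is why a source that mixes the two would be accurate in some contexts and inaccurate in others.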
