Going Upstream in the Circle

One of the big trends in data quality improvement is moving from downstream cleansing to upstream prevention. So let’s talk about Amazon. No, not the online (book)store, but the river. Frankly, I am a bit tired of the fact that almost any mention of innovative IT revolves around that eShop.

A map of the Amazon River drainage basin reveals what may become a huge challenge in going upstream and solving data quality issues at the source: there may be a lot of sources. Granted, the Amazon is the world’s largest river (it carries more water to the sea than any other river), so this may be a picture of the data streams in a very large organization. But even more modest organizations have many sources of data, just as more modest rivers also have several sources.

By the way: the Amazon River also shares a source with the Orinoco River through the natural Casiquiare Canal, just as many organizations share sources of data.

Some sources are not so easy to reach. The most distant source of the Amazon is a glacial stream on a snowcapped 5,597 m (18,363 ft) peak called Nevado Mismi in the Peruvian Andes.

Now, as I have promised that this blog should be about positivity and success in data quality improvement, I will not dwell on the amount of work involved in going upstream and preventing dirty data at every source.

I say: go to the clouds. The clouds are the sources of the water in the river. Likewise, I think cloud services will make improving data quality a lot easier, as explained in a recent post called Data Quality from the Cloud.

Finally, the clouds over the Amazon’s sources are formed from water evaporated from the Amazon itself and many other waters as part of the water cycle. In the same way, data has a cycle: it is derived as information and then created in new forms as a result of the actions taken from using that information.

I think data quality work in the future will embrace the full data cycle: Downstream cleansing, upstream prevention and linking in the cloud.


Data Quality from the Cloud

One of my favorite data quality bloggers, Jim Harris, wrote a blog post this weekend called “Data, data everywhere, but where is data quality?”

I believe that data quality will be found in the cloud (not the current ash cloud, but to put it plainly: on the internet). Many of the data quality issues I encounter in my daily work with clients and partners are caused by adequate information not being available at data entry – or not being exploited. But the information needed will in most cases already exist somewhere in the cloud. The challenge ahead is how to integrate the information available in the cloud into business processes.

Using external reference data to ensure data quality is not new. Especially in Scandinavia, where I live, this has long been common practice because the public sector has a tradition of recording data about addresses, citizens, companies and so on far more intensely than in the rest of the world. The Achilles heel, though, has always been how to smoothly integrate external data into data entry functionality and other data capture processes – and, not to forget, how to ensure ongoing maintenance in order to avoid the otherwise inevitable erosion of data quality.

The drivers for increased exploitation of external data are mainly:

  • Accessibility, which is where the fast growing (semantic) information store in the cloud helps – not least backed by the worldwide tendency of governments releasing public sector data
  • Interoperability, where an increased supply of Service Oriented Architecture (SOA) components will pave the way
  • Cost: the more subscribers to a certain source, the lower the price – plus many sources will simply be free

As said, smooth integration into business processes is key – or, sometimes even better, orchestrating business processes in a new way so that available and affordable information (from the cloud) is pulled into these processes using only a minimum of costly on-premise human resources.
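As a rough illustration of that kind of integration, the sketch below pulls reference data from a cloud service while a record is being entered, so the form can be pre-filled with verified values instead of manual typing. The endpoint URL, parameters and response fields are hypothetical placeholders for illustration, not any real service.

```python
import requests  # third-party HTTP client

# Hypothetical cloud reference data endpoint - the URL and the response
# fields below are assumptions for illustration, not a real service.
REFERENCE_SERVICE_URL = "https://example.com/api/business-lookup"

def enrich_at_entry(company_number: str, country: str) -> dict:
    """Pull reference data from the cloud while the record is being entered,
    so it is verified and enriched before it is saved."""
    response = requests.get(
        REFERENCE_SERVICE_URL,
        params={"number": company_number, "country": country},
        timeout=5,
    )
    response.raise_for_status()
    reference = response.json()
    # Return the externally verified values; the data entry form can
    # pre-fill these fields instead of relying on manual typing.
    return {
        "legal_name": reference.get("name"),
        "address": reference.get("address"),
        "status": reference.get("status", "unknown"),
    }

# Example: verify a new customer before the record is created.
# print(enrich_at_entry("12345678", "DK"))
```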


Beyond Home Improvement

During my many years in customer master data quality improvement I have worked with a lot of clients who have data from several countries. In almost every case the data has been prioritized in two pots:

  • Master Data referring to domestic customers
  • Master Data referring to foreign customers

Even though the enterprise defines itself as an international organization, the term domestic is in many cases still assigned to the country where the headquarters is situated and where the organization was born.

Signs of this include:

  • Data formats are designed to fit domestic customers
  • Internal reference data are richer for domestic locations
  • External reference data services are limited to domestic customers

The high prioritization of domestic data is of course natural for historical reasons: domestic customers are almost certainly the largest group, and the domestic rules are familiar to most participants in a data quality program.

If we accept that improving data quality will be reflected in an improved bottom line, there is still a margin to gain by not stopping once you have optimal procedures for domestic data.

One easy way of dealing with this is to apply general formats, services and rules that work for data from all over the world, and this approach may in some cases be the best when weighing costs against benefits.

But I have no doubt that the best quality for customer master data is achieved by exploiting the specific opportunities that exist for each country and culture.

Examples are:

  • The completeness and depth of address (location) data available in each country vary widely – as do the rules of the postal services operating there
  • Public sector registration practices for companies and citizens also differ, which is why the quality of external reference data varies – and so do the rules for accessing that data
  • Using local character sets, script systems, naming conventions and addressing formats besides (or instead of) those that apply at the headquarters helps data quality through real world alignment
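A minimal sketch of how the generic and the country specific approaches can live together: country specific postal code rules where we have them, and a loose generic rule as fallback. The patterns shown are a small, illustrative subset rather than a complete rule set.

```python
import re

# Per-country postal code patterns (illustrative subset) plus a generic
# fallback rule for countries without a specific check.
POSTAL_CODE_RULES = {
    "DK": re.compile(r"^\d{4}$"),                             # Denmark: 4 digits
    "GB": re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$"),  # UK outward + inward code
    "US": re.compile(r"^\d{5}(-\d{4})?$"),                    # US ZIP or ZIP+4
}
GENERIC_RULE = re.compile(r"^[A-Za-z0-9 -]{2,10}$")           # loose catch-all

def validate_postal_code(code: str, country: str) -> bool:
    """Use the country-specific rule when we have one, otherwise
    fall back to the generic worldwide format."""
    rule = POSTAL_CODE_RULES.get(country.upper(), GENERIC_RULE)
    return bool(rule.match(code.strip()))

print(validate_postal_code("2750", "DK"))      # True
print(validate_postal_code("SW1A 1AA", "GB"))  # True
print(validate_postal_code("ABC", "DK"))       # False
```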

My guess is that in the near future we will see services in the cloud helping us make the global village come true for master data quality as well.


Data Quality in the Cloud

In my previous post I advocated that Data Quality tools in the near future will exploit the huge data resources in the cloud in order to achieve data of high quality that correctly reflects the real world constructs to which they refer.

I am well aware that this is based on the assumption that data in the cloud are accurate, timely and so on, which is of course not always the case – yet. This will only come when a given data source has a number of subscribers who require a certain level of data quality and perhaps contribute to correcting flaws.

I tried this out right before writing this post when I installed Google Earth on a new laptop – a journey where I shifted between being very impressed and a bit disappointed.

First, the download site – going by either my position or my OS language – guessed that I am not English speaking. Unfortunately it switched to Dutch, not Danish. Well, most Dutch words resemble either German or English, or at least urban slang, so I got through. Inside the application most of the text had now changed to Danish – with only a few Dutch and English labels remaining.

Knowing that the application hadn’t learned anything about me yet, I started by typing just my street address, which is only 8 characters but globally unique: “Lerås 13” (remember: house number after street name in my part of the world). The application promptly answered with my full address as the first candidate, and when I clicked on it, it took me from high above the earth right down to where I live. Impressive.

Well, the pointer was actually 40 meters NNE of the nearest corner of our premises – and in front of our garage I could recognize the grey car I had two years ago. Disappointing.

What is Data Quality anyway?

The above question might seem a bit belated after I have been blogging about it for nine months now. But from time to time I ask myself questions like:

Is Data Quality an independent discipline? If it is, will it continue to be that?

Data Quality actually is (or should be) a part of a lot of other disciplines.

Data Governance as a discipline is probably the best place to include general data quality skills and methodology – that is, all the people and process sides of data quality practice. Data Governance is an emerging discipline with an evolving definition, says Wikipedia. I think there is a pretty good chance that data quality management as a discipline will increasingly be regarded as a core component of data governance.

Master Data Management is largely about Data Quality, but MDM could be dead already. Just like SOA. In short: I think MDM and SOA will survive, getting new life from the semantic web and all the data resources in the cloud. For that, MDM and SOA need Data Quality components. Data Quality 3.0 it is.

You may then replace MDM with CRM, SCM, ERP and so on, and thereby extend the use of Data Quality components from dealing only with master data to also covering transaction data.

Next questions: Are Data Quality tools an independent technology? If they are, will they continue to be?

It’s clear that Data Quality technology is moving from standalone batch processing environments, via embedded modules, to – oh yes – SOA components.

If we look at what data quality tools actually do today, they mostly support you with automation of data profiling and data matching, which probably covers only some of the data quality challenges you have.
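To make the data profiling part concrete, here is a minimal sketch of the kind of automation such tools start from: measuring completeness, cardinality and the most frequent values for a column. It is purely illustrative and not modelled on any particular vendor’s tool.

```python
from collections import Counter

def profile_column(values):
    """Basic column profiling: completeness, cardinality and the most
    frequent values - the kind of automation most tools start with."""
    total = len(values)
    filled = [v for v in values if v not in (None, "")]
    counts = Counter(filled)
    return {
        "rows": total,
        "completeness": len(filled) / total if total else 0.0,
        "distinct": len(counts),
        "top_values": counts.most_common(3),
    }

phone_numbers = ["12345678", "", "12345678", None, "87654321"]
print(profile_column(phone_numbers))
# {'rows': 5, 'completeness': 0.6, 'distinct': 2,
#  'top_values': [('12345678', 2), ('87654321', 1)]}
```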

In recent years there has been a lot of consolidation in the market around Data Integration, Master Data Management and Data Quality, which certainly tells us that the market needs Data Quality technology as components in a bigger scheme along with other capabilities.

But some new pure Data Quality players are also being established – and I think I often spot old folks from the acquired entities at these new challengers. So independent Data Quality technology is not dead and doesn’t seem to want to be.


Deploying Data Matching

As discussed in my last post, a core part of many Data Quality tools is Data Matching. Data Matching is about linking entities within or between databases where these entities are not already linked by unique keys.
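Before looking at the deployment options, here is a minimal sketch of what matching does at its core: comparing records on fuzzy similarity instead of exact keys. Real matching engines use far more elaborate parsing, standardization and weighting; the threshold below is an arbitrary illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude string similarity on normalized values."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def find_duplicate_pairs(records, threshold=0.8):
    """Compare every pair of records on the name field and report the
    pairs that look like the same real-world entity."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i]["name"], records[j]["name"]) >= threshold:
                pairs.append((records[i]["id"], records[j]["id"]))
    return pairs

customers = [
    {"id": 1, "name": "Acme Corporation"},
    {"id": 2, "name": "ACME Corporation Ltd"},
    {"id": 3, "name": "Nordic Widgets A/S"},
]
print(find_duplicate_pairs(customers))  # [(1, 2)]
```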

Data Matching may be deployed in several different ways; I have been involved in the following ones:

External Service Provider

Here your organization sends extracted data sets to an external service provider, where the data are compared – and in many cases also related to other reference sources – all through matching technology. The provider sends back a “golden copy” ready for uploading into your databases.

Some service providers use a Data Matching tool from the market and others have developed their own solutions. Many solutions grown at the providers are country specific, equipped with a lot of tips and tricks learned from handling data from that country over the years.

The big advantage here is that you gain from the experience – and the reference data collection – at these providers.

Internal Processing

You may implement a data quality tool from the market and use it to compare your own data, often from disparate internal sources, in order to grow the “golden copy” at home.

Many MDM (Master Data Management) products have some matching capabilities built in.

Also, many leading Business Intelligence tool providers supplement their offering with an (integrated) Data Quality tool with matching capabilities, as an answer to the fact that Business Intelligence on top of duplicated data doesn’t make sense.

Embedded Technology

Many data quality tool vendors provide plug-ins for popular ERP, CRM and SCM solutions so that data are matched against existing records at the point of entry. For the most popular solutions, such as SAP and MS CRM, there are multiple such plug-ins from different Data Quality technology providers. Then again, many implementation houses have a favorite combination – so in that way you select the matching tool by selecting an implementation house.

SOA Components

Embedded technology is of course not optimal when you operate several databases, and the commercial bundling may also not be the best solution for you.

Here Service Oriented Architecture thinking helps, so that matching services are available as SOA components at any point in your IT landscape, based on centralized rule setting.
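As a rough sketch of what centralized rule setting can look like: one matching component owns the comparison fields and threshold, and every application in the landscape calls that same component instead of keeping its own rules. The class, rule names and values are assumptions for illustration; a real deployment would expose this behind a service endpoint.

```python
from difflib import SequenceMatcher

# Centralized rule setting: one place defines how matching behaves, and
# every caller in the IT landscape uses the same component. The rule
# values here are illustrative assumptions.
MATCH_RULES = {"fields": ["name", "address"], "threshold": 0.75}

class MatchingService:
    """A minimal 'SOA-style' matching component. In a real landscape this
    would sit behind a web service endpoint; here it is just a class."""

    def __init__(self, rules: dict):
        self.rules = rules

    def score(self, rec_a: dict, rec_b: dict) -> float:
        scores = [
            SequenceMatcher(None, rec_a[f].lower(), rec_b[f].lower()).ratio()
            for f in self.rules["fields"]
        ]
        return sum(scores) / len(scores)

    def is_match(self, rec_a: dict, rec_b: dict) -> bool:
        return self.score(rec_a, rec_b) >= self.rules["threshold"]

# Both the CRM data entry screen and the nightly ERP batch job can call
# the same service, so the rules are maintained in one place only.
service = MatchingService(MATCH_RULES)
print(service.is_match(
    {"name": "Acme Corporation", "address": "Main Street 1, Springfield"},
    {"name": "ACME Corp.", "address": "Main St. 1, Springfield"},
))
```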

Cloud Computing

Cloud computing services offered by external service providers take the best of these two worlds and combine it into one offering.

Here the SOA component resides at the external service provider – in the best case combining an advanced matching tool, rich external reference data, and the tips and tricks for your particular country and industry.


Master Data Quality: The When Dimension

Often we use the terms who, what and where when defining master data as opposed to transaction data, saying for example:

  • Transaction data accurately identifies who, what, where and when
  • Master data accurately describes who, what and where

Who is easily related to our business partners, what to the products we sell, buy and use – and where to the locations of the events.

In some industries when is also easily related to master data entities – in public transportation, for example, a timetable is valid for a given period. A fiscal year in financial reporting also belongs to the when side of things.

But when is also a factor in improving data quality and preventing data quality issues related to our business partners, products, locations and assigned categories, because the descriptions of these entities do change over time.

This fact is known as “slowly changing dimensions” when building data warehouses and attempting to make sense of data with business intelligence.

The “when” dimension also matters in matching, deduplication and identity resolution. Having the most current data doesn’t necessarily lead to a good match, as you may be comparing against data that is not equally current. Here history tracking is a solution: storing former names, addresses, phone numbers, e-mail addresses, descriptions, roles and relations.
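A minimal sketch of such history tracking, assuming a simple in-memory model: each attribute keeps its validity period, so a matching process can ask for the value that was current at the time the other record was captured.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class AttributeVersion:
    """One historical value of an attribute with its validity period."""
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None = still current

@dataclass
class Party:
    party_id: int
    names: List[AttributeVersion]

    def name_at(self, as_of: date) -> Optional[str]:
        """Return the name that was valid on a given date, so a record
        from 2007 can be matched against the 2007 name, not today's."""
        for version in self.names:
            ends = version.valid_to or date.max
            if version.valid_from <= as_of <= ends:
                return version.value
        return None

customer = Party(
    party_id=42,
    names=[
        AttributeVersion("Jensen Trading", date(2001, 1, 1), date(2008, 6, 30)),
        AttributeVersion("Jensen Group A/S", date(2008, 7, 1)),
    ],
)
print(customer.name_at(date(2007, 3, 1)))  # Jensen Trading
print(customer.name_at(date(2010, 3, 1)))  # Jensen Group A/S
```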

Such complexity is often not handled in the master data containers around – and even less so in matching environments.

My guess is that the future will bring publicly accessible reference data in the cloud describing our master data entities with rich complexity, including the when – the time – dimension, along with matching environments capable of handling it.


Data Quality Milestones

I have a page on this blog with the heading “Data Quality 2.0”. The page is about what I think the near future will bring in the data quality industry. In recent days there have been some comments on the topic. My current summing up of the subject is this:

The Data Quality X.X labels are merely maturity milestones, where:

Data Quality 0.0 may be seen as a Laissez-faire state where nothing is done.

Data Quality 1.0 may be seen as projects for improving downstream data quality, typically using batch cleansing with nationally oriented techniques in order to make data fit for purpose.

Data Quality 2.0 may be seen as agile implementation of upstream data quality prevention across enterprises large and small, using combined multi-cultural techniques and exploiting cloud based reference data in order to maintain data fit for multiple purposes.

Government says so

External reference data are going to play an increasing role in data quality improvement, and a recent trend around the world helps a lot: governments are unlocking their data stores.

Some initiatives available in English are the US data.gov and the UK’s “Show us a better way”.

Today I attended a “Workshop on the use of public data in the private sector” arranged by the Danish National IT and Telecom Agency as part of a similar initiative in my home country.

The initiatives around the world differ a bit in focus areas and in which data are released, depending on administrative traditions and local privacy policies.

As an organisation you may integrate with such public reference data either directly or through services from private vendors who add value by reformatting, merging, enriching and bundling them with other services. One add-on service on the international scene will be supplying consistency – as far as possible – between the datasets from each country.

One way or another, public reference data will become part of the data architecture in most organisations. Applications in the cloud will probably be (actually already are) the first movers in this field.

Public reference data will bring operational databases and data warehouses closer to that “one version of the truth” that we talk so much about but have so much trouble achieving – and even defining. Now some of that trouble can be solved by: government says so.
