Boiling Data Silos

Yesterday there were some blog posts dealing with data silos.

Graham Rhind posted: Data silos – learn to live with them.

Rob Karel posted: Stop trying to put a monetary value on data – it’s the wrong path. Though data silos were not the main subject, there was a remark saying: “Attempting to boil the ocean and trying to solve Customer, Product, or Financial data for all processes and decisions across the whole organization is too big an effort destined to fail before it starts”.

Mark Montgomery made a comment on Rob’s post saying: “I also have trouble with the boil the ocean metaphor, which is used too often these days to justify all kinds of protectionist policies in the enterprise. You can’t have it both ways in the enterprise– either you have data silos or you don’t, and I argue that increasingly the world cannot afford them, albeit in highly secure formats in most situations”.

I guess we have to go for the golden mean on this one as well. We shouldn’t accept data silos, but we must expect them. We could aim to eliminate them, probably not in one big bang, but slice by slice as we climb the levels of an information maturity model.

I would definitely expect to see fewer and smaller data silos at the top level of an information maturity model than at the bottom level of a data quality immaturity model.

Holistic Accuracy

In community economics there are two terms:

  • Partitive accuracy
  • Holistic accuracy

In short, partitive accuracy is the accuracy of a single measure that is part of a model, while holistic accuracy is the accuracy of the model structure and its use. More information here.

I find these terms very useful in data quality and master data management as well.

The distinction between partitive accuracy and holistic accuracy resembles the distinction between data quality and information quality.

One problem with the term information quality is that it implies a certain context of use, which makes it hard to prepare data to have high quality for multiple uses beyond assuring the accuracy of the single data elements – similar to the term partitive accuracy.

One clue to assuring better information quality is looking at the model structure of data – similar to the term holistic accuracy. Here I am thinking beyond traditional data modeling, which is anchored in the technical world, and into how end users of master data hubs are able to build structures of data (with partitive accuracy) that fit the daily business use.

Examples of such holistic information capabilities in master data management are building flexible product hierarchies and hierarchies of party master data that at the same time reflect real world hierarchies, such as households and company family trees, as well as hierarchies of related accounts and addresses used within the enterprise.

While a single data element, such as an address component like a postal code, may be partitively accurate, holistic accuracy is about how the data elements contribute, as parts of a data structure, to fitting multiple purposes of use.
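
To make the idea a bit more concrete, here is a minimal sketch in Python of how partitively accurate party records could be linked into the kind of holistic structures mentioned above. The class, names and hierarchies are made-up examples for illustration, not a reference to any particular master data hub:

  from dataclasses import dataclass, field

  @dataclass
  class PartyNode:
      """A partitively accurate party record placed in a holistic structure."""
      name: str
      children: list["PartyNode"] = field(default_factory=list)

  # The same kind of nodes can carry both a company family tree and a household.
  # The holistic accuracy lies in whether these structures mirror the real world
  # and fit the multiple purposes they are used for.
  company_tree = PartyNode("Example Holding", [
      PartyNode("Example Trading Ltd"),
      PartyNode("Example Services GmbH"),
  ])
  household = PartyNode("The Andersen household", [
      PartyNode("Hans Andersen"),
      PartyNode("Anne Andersen"),
  ])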

All that glisters is not gold

As William (not Bill) Shakespeare wrote in the play The Merchant of Venice:

All that glisters is not gold;
Often have you heard that told

I was reminded of that phrase when commenting on a comment from John Owens in my recent post called Non-Obvious Entity Relationship Awareness.

Loraine Lawson wrote a piece on IT Business Edge yesterday called Adding Common Sense to Data Quality. That post relates to a post by Phil Simon on Mike 2.0 called Data Error Inequality. That post relates to a post on this blog called Pick Any Two.

Anyway, one lesson from all this glistering relationship fuzz is that when looking for return on investment (gold) in data quality improvement and master data management perfection, I agree with adding some common sense.

One of the first posts on this blog actually was Data Quality and Common Sense.  

Data Quality and Data Visualization

This is a self-centric blog post about data quality and data visualization.

The figure to the right shows statistics about who viewed my profile on LinkedIn in a certain period.

Looking at that makes me think about a couple of data quality and data visualization issues, especially linked to visualizing data on a world map.

Hidden value

Fortunately there is both a map and some numbers below, because the map is too small to show where I have the most views from: my very small home country, Denmark.

Misleading proportions

I have no views from the grey countries. So I should certainly concentrate on Greenland (the big grey land at the top of the map) to get more viewers, right?

Well, the Mercator projection makes areas close to the poles, like Greenland, look much bigger than in the real world. Greenland is a big island, but in fact less than 1/3 the size of Australia (the almost as big light blue land in the down under right corner) – and Greenland only has around 1/400 of the population of Australia.
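
A quick back-of-the-envelope check illustrates the proportions. The figures below are rounded, approximate numbers (areas in square kilometres, populations around 2011) used only for illustration:

  # Approximate figures: areas in square kilometres, populations around 2011.
  greenland_area, australia_area = 2_166_000, 7_692_000
  greenland_population, australia_population = 57_000, 22_300_000

  print(greenland_area / australia_area)              # ~0.28, i.e. less than 1/3
  print(australia_population / greenland_population)  # ~391, i.e. roughly 400 times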

Cultural dependency

My blogging and LinkedIn activities are in English due to the moderate population of Denmark. Therefore, and because the spread of LinkedIn is biased towards the English speaking world, it’s no surprise that most viewers are from English speaking countries.

Electronic Data Processing

A comment on my last blog post took me back to the days when I started working with Information Technology (IT). At that time our métier actually wasn’t called IT but EDP (Electronic Data Processing) – at least that was the case in my home country Denmark, where we used the local TLA, EDB (Elektronisk Data Behandling).

I have earlier touched on the long standing discussion about whether “data quality” should be rebranded as “information quality”, for example in the post called new blog name, as this would also require a new name for this blog.

The words data and information are indeed used quite interchangeably. In MDM (Master Data Management) we have two main domains: Customer Data Integration (CDI) and Product Information Management (PIM). I wonder if customer data is old school and product information is new school?

1/1/11

Date formats have always been a troublemaker.

1/1/11 is one format for expressing today’s date. 2011/01/01 is another one. 1st January 2011 is a third way. January 1, 2011 is a fourth way.

That is of course given that you use the Gregorian calendar and don’t live far to the east of me, where it’s already a new day as I post this.

1/1/11 is not one of those days where we have the usual confusion between the American way of expressing a date, using the sequence month/day/year, and the common straightforward European sequence, day/month/year.

But in a few hours, when it’s 2/1/11 in Europe, and some hours later, when it’s 1/2/11 in North America, we are confused.
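
A minimal sketch in Python shows the ambiguity, and how an ISO 8601 style format avoids it (the format strings are just one way of doing it):

  from datetime import datetime

  # The same string read with a European (day first) and an American
  # (month first) pattern gives two different dates.
  ambiguous = "2/1/11"
  european = datetime.strptime(ambiguous, "%d/%m/%y")   # 2nd January 2011
  american = datetime.strptime(ambiguous, "%m/%d/%y")   # February 1, 2011
  print(european.date(), american.date())  # 2011-01-02 2011-02-01

  # An ISO 8601 representation (YYYY-MM-DD) removes the ambiguity.
  print(european.strftime("%Y-%m-%d"))  # 2011-01-02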

So, data quality folks, remember to put your dates in a unique, unambiguous format starting from tomorrow, the 2nd of January 2011 or, if you like, January 2, 2011.

Happy New Unique Year.

Superb Bad Data

When working with data and information quality we often use words such as rubbish, poor, bad and other negative terms when describing data that needs to be enhanced in order to achieve better data quality. However, what is bad may have been good in the context where a particular set of data originated.

Right now I am having some fun with author names.

An example of good and bad could be with an author I have used several times on this blog, namely the late fairy tale writer whose full name is:

Hans Christian Andersen

When gazing through data you will meet his name represented this way:

Andersen, Hans Christian

This representation is fit for the purpose of use, for example when looking for a book by this author at a library, where the fictional books are sorted by the surname of the author.

The question is then: Do you want to have the one representation, the other representation or both?

You may also meet his name in another form, in a field other than the name field. For example, there is a main street in Copenhagen called:

H. C. Andersens Boulevard

This is the representation of the real world name of the street, holding a common form of the author’s name with initials only.
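
As a small illustration, here is a sketch in Python of moving between these representations. The function names, and the assumption that a comma separates the surname from the given names, are made up for this example:

  def to_display_name(sorted_name):
      """Turn 'Surname, Given names' into 'Given names Surname'."""
      surname, _, given = sorted_name.partition(",")
      return f"{given.strip()} {surname.strip()}"

  def to_initials_form(display_name):
      """Turn 'Given names Surname' into initials plus surname."""
      *given, surname = display_name.split()
      initials = " ".join(f"{name[0]}." for name in given)
      return f"{initials} {surname}"

  print(to_display_name("Andersen, Hans Christian"))  # Hans Christian Andersen
  print(to_initials_form("Hans Christian Andersen"))  # H. C. Andersen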

The Art of Programming

Beginner’s All-purpose Symbolic Instruction Code, or simply BASIC, is one of the oldest programming languages around and also the first programming language I learned in school back in the 70s. Later I came across a dialect of BASIC called COMAL, learned and forgot all about ASSEMBLER, made my first business code in COBOL (plus a Yahtzee game), created applications with SPEED and PACE, worked with PowerBuilder, wrote some SQL and made my own data quality tool using MAGIC.

Independent of all the different languages being used, there are two basic measures of quality when programming:

  1. Good code may refer to whether the code is well structured, readable by others (including being feasibly documented), reusable and set up to use computer resources in the best way possible.
  2. Good code (delivered as an application) may refer to whether it helps solve the business (or gaming) issue addressed, through the best possible user experience.

Looking at good code in these two ways resembles the two ways we also measure whether our data is good:

  1. Good data may refer to whether the data is well structured, readable by others (including being feasibly documented), reusable and reflects the real world in the best way possible.
  2. Good data (delivered as information) may refer to whether it supports solving the business issue addressed, through the best possible user experience.

The concern of application (and information) users is point 2.

As a programmer (and data quality professional) you have to consider point 1 in order to achieve point 2. You may get along with a quick and dirty workaround in the short term, but in the long run you have to make it technically right.

Valuable Inaccuracy

These days I’m involved in an activity in which you may say that, by creating data with questionable quality, we are making better information quality.

The business case is within public transit. In this particular solution passengers use chip cards when boarding buses, but do not use the cards when alighting. This is a cheaper and smoother solution than the alternative in electronic ticketing, where you have both check-in and check-out. But a major drawback is the missing information about where passengers alighted, which is very useful in business intelligence.

So what we do is, where possible, assume where the passenger alighted. If the passenger (seen as a chip card) within a given timeframe boarded another bus at a stop point which is on, or near, a succeeding stop point on the previous route, then we assume that the passenger alighted at that stop point, even though it was not recorded.

Two real life examples of this are when the passenger makes an interchange and when the passenger later in the day goes back from work, school or another regular activity.
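
As a rough sketch of the rule described above (the names, data shapes and the one hour timeframe are assumptions made for this illustration, not the actual solution):

  from datetime import timedelta

  MAX_TRANSFER_TIME = timedelta(hours=1)  # assumed timeframe

  def infer_alighting(previous_boarding_time, succeeding_stops,
                      next_boarding_stop, next_boarding_time, is_near):
      """Infer the alighting stop of the previous trip from the next boarding.

      succeeding_stops: stop points on the previous route after the boarding stop.
      is_near: function telling whether two stop points are on or near each other.
      Returns the inferred stop point, or None if nothing matches in time or place.
      """
      if next_boarding_time - previous_boarding_time > MAX_TRANSFER_TIME:
          return None
      for stop in succeeding_stops:
          if is_near(stop, next_boarding_stop):
              return stop  # assume the passenger alighted here, though not recorded
      return None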

An important prerequisite, however, is that we have good data quality regarding stop point locations, route assignments and other master data and their relations.

Outside Your Jurisdiction

About half a year ago I wrote a blog post called Who is Responsible for Data Quality, aimed at the issues of having your data coming from another corporation and going to yet another corporation.

My point was that many views on data governance, data ownership, the importance of upstream prevention and fitness for purpose of use in a business context are based on an assumption that the data in a given company is entered by that company, maintained by that company and consumed by that company. But in the business world of today this is not true in many cases.

Actually, a majority of the data quality issues I have been around since then have had exactly these ingredients:

  • When data was born it was under an outside data governance jurisdiction
  • The initial data owners, stewards and custodians were in another company
  • Upstream wasn’t in the company where the current requirements are formulated

At the point of data transfer between the two jurisdictional areas the data is already digitalized, and often it is a high volume of data supposed to be processed in a short time frame, so the willingness and the practical possibility of implementing manual intervention are very limited.

This means that one case for looking for technology centric solutions is when data is born outside your jurisdiction. Also, you tend to deal with concrete data quality rather than fluffy information quality in this scenario. That’s a pity, as I like information quality very much – but OK, data quality technology is quite interesting too.
