The Many Worlds of Data Quality

This morning I had some fun reading the Wikipedia articles about data quality.

I tried to compare the texts available in English, French, German and Japanese.

I am afraid that the quality of the texts, and the differences in how the subject is presented across languages, reflects the immaturity of the data quality discipline and, not least, the lack of global consensus seen in the literature, published articles and the available technology.

Three observations from the Wikipedia articles:

The French piece is in some parts a translation of the English text. However, the translation apparently became difficult in the History section, as the English text there has a well-known, narrowly United States scope.

The German text is completely different from the English text. Even the title is different: Information Quality. The references are largely from German authors.

The Japanese text seems to be a Google Translate rendering of the (former) English text. This is strange, as much of the quality inspiration originally came from Japan.


LinkedIn and the other Thing

I have a profile in two different business oriented social networking services: LinkedIn and XING.

I have far more connections in LinkedIn than in XING.

My connections on LinkedIn are mainly from English-speaking countries (US, UK, IE, IN, AU) and from Scandinavia (DK, NO, SE), where I live and where English is widely spoken, not least by white-collar workers.

My connections on XING are almost exclusively with people from Germany.

This picture matches very well with how the two services are positioned.

The US-based LinkedIn is strong in “English speaking” countries, with the most profiles per capita in:

  • Denmark, Netherlands and USA followed by
  • Norway, Sweden, United Kingdom and Australia

(I have some figures from last year when LinkedIn passed 50 million profiles).

XING is strong in Germany, where XING was founded, and through acquisitions also in Spain and Turkey.

Now, it’s not that you can’t operate LinkedIn in German and Spanish; you can. You can also operate XING in English.

It’s about meeting your connections where they are.


Birthday Party

Today this blog has been online one year. It’s time for a birthday party.

The economy around a birthday party usually goes like this:

  • You, the guest, spend some money on a nice birthday present
  • I, the host, spend some money on fine food and beverage

Now, a blog is a virtual thing, and I reckon that most of my readers live far, far away from the Copenhagen South Coast. So it’s going to be a remote birthday party, and, as with most other things happening in the social media realm, no money will actually change hands.

Anyway, here is what I would have liked to serve in the real world:

Paella

The dish I have prepared the most times when we have guests is the Spanish paella. I love paella very much and so do all our polite guests.

I am also a shrimp addict, so I usually like to add two or three different kinds of shrimp, from the small but extremely tasty Greenlandic shrimp to delicious giant Thai tiger prawns.

Steak

My second favorite meal is a steak. You probably don’t get a better steak than one from cattle grazing on the Argentine pampas.

As I live in the Northern Hemisphere it’s summertime now and perfect weather for preparing the steak outside on the grill.

Wine

There is so much good wine coming from many places around the world. I like Californian wine, wine from Chile, South African wine, Australian wine, French wine and last but not least Italian wine including the unbeatable Amarone.

Beer

As I am a native Dane, you will probably expect me to propose a Carlsberg. Don’t get me wrong: Carlsberg is probably a good beer. But there are many other good beers around. When I am in England I like the ultimate mainstream beer: a John Smith’s (now owned by Dutch Heineken). The best mainstream beer, in my opinion, is the Belgian Leffe.

Cheers

Thanks to everyone who has read this blog, subscribed, re-tweeted and, not least, commented.


Picture This

How do people find their way to your blog? I use Twitter and LinkedIn to say: Hey, I made a new post. And then I pretty much rely on people finding my blog when searching with terms such as:

  • Data Quality
  • Master Data Survivorship
  • Fit for purpose

But honestly, the search terms that hit my blog many times more often than the above are the little captions I add to the images I include in every post. And I am pretty sure those people were not looking for data quality and master data management.

The top term is pearls, including the same word in Russian (жемчуг), Turkish (inci) and Arabic (لآلئ). This word was the caption of the image in the post “Universal Pearls of Wisdom”, where I wrote about the new SOA manifesto and how this manifesto might as well be about data quality and a lot of other disciplines and concepts. Probably not very interesting for someone trying to buy pearls. But maybe one or two of the 2,000+ pearl seekers were caught in the data quality net.

The second most used term is Gorilla. This was used as the caption for the image in the post “Gorilla Data Quality”. Personally I like this gorilla picture, and it seems that approximately 1,600 other people do too. Whether they also like the philosophical ideas around “Gorilla Data Quality” and “Guerilla Data Quality”, I am not so sure.

Other terms hitting big are Brueghel and Tower of Babel, used in a post about international challenges in data quality called “The Tower of Babel”, as it was illustrated by a painting by Brueghel. Also, Penny Black, used in a post about “Postal Address Hierarchy, Granularity, Precision and History”, raised the page view counter.

But it doesn’t seem that every little common word will do. Once I used the word traffic, but it didn’t generate any traffic at all.



The Next Level

A quote about data quality from Thomas Redman says:

“It is a waste of effort to improve the quality (accuracy) of data no one ever uses.”

I learned the quote from Jim Harris, who mentioned it most recently in his post: DQ-Tip: “There is no point in monitoring data quality…”

In a comment Phil Simon said: “I love that. I’m jealous that I didn’t think of something so smart.”

I’m guessing Phil was being a bit ironic. If so, I can see why. The statement seems pretty obvious, and at first glance you can’t imagine anyone taking the opposite stance: let’s cleanse some data no one ever uses.

I also think it was meant to be obvious in Redman’s book Data Driven.

Well, taking it to the next level I can think of the following elaboration:

  1. If you find some data that no one ever uses, you should not only avoid improving the quality of that data, you should actually delete the data and make sure that no one spends time and resources entering or importing the same data in the future.
  2. That is, unless the reason no one ever uses the data is that its quality is poor. Then you must compare the benefits of improving the data against the costs of doing so. If the costs are greater, proceed with point 1. If the benefits are greater, go to point 3.
  3. It is not a waste of effort to improve the quality of some data no one ever uses.
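The elaboration above is really a small decision procedure. Here is a minimal sketch of it in Python; the cost and benefit figures are hypothetical estimates you would have to supply yourself, not anything Redman provides:

```python
def unused_data_action(is_used: bool, cleansing_cost: float, expected_benefit: float) -> str:
    """Decide what to do with data, following the three points above.

    cleansing_cost and expected_benefit are hypothetical estimates of
    what improving the data would cost and what it would yield.
    """
    if is_used:
        # Not the case Redman's quote describes: keep maintaining it.
        return "keep and maintain"
    if expected_benefit <= cleansing_cost:
        # Point 1: delete it and stop collecting it.
        return "delete and stop collecting"
    # Point 3: improvement is not a waste of effort after all.
    return "improve quality"

print(unused_data_action(False, cleansing_cost=100, expected_benefit=500))
```

The interesting branch is the middle one: the comparison in point 2 is what turns the seemingly obvious rule into something worth monitoring over time.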


Information and Data Quality Blog Carnival, February 2010


El Festival del IDQ Bloggers is another name for the monthly recurring roundup of submitted blog posts on information and data quality, started last year by the IAIDQ.

This is the February 2010 edition covering posts published in December 2009 and January 2010.

I will go straight to the point:

Daragh O Brien shared the story about a leading Irish hospital that has come under scrutiny for retaining data without any clear need. This highlights an important relationship between Data Protection/Privacy and Information Quality. Daragh’s post explores some of this relationship through the “Information Quality Lens”. Here’s the story: Personal Data – an Asset we hold on Trust.

Former Publicity Director of the IAIDQ, Daragh has over a decade of coal-face experience in Information Quality Management at the tactical and strategic levels from the Business perspective. He is the Taoiseach (Irish for chieftain) of Castlebridge Associates. Since 2006 he has been writing and presenting about legal issues in Information Quality amongst other topics.

Jim Harris is an independent consultant, speaker, writer and blogger with over 15 years of professional services and application development experience in data quality. Obsessive-Compulsive Data Quality is an independent blog offering a vendor-neutral perspective on data quality.

If you are a data quality professional, know the entire works of Shakespeare by heart and are able to wake up at night and promptly explain Einstein’s theories, you probably know Jim’s blogging already. On the other hand, if you don’t know Shakespeare and don’t understand Einstein, then: Jim to the rescue. Read The Dumb and Dumber Guide to Data Quality.

In another post Jim discusses the out-of-box-experience (OOBE) provided by data quality (DQ) software under the title: OOBE-DQ, Where Are You? Jim also posted part 8 of Adventures in Data Profiling – a great series of knowledge sharing on this important discipline within data quality improvement.

Phil Wright is a consultant based in London, UK who specialises in Business Intelligence and Data Quality Management. With 10 years’ experience in the Telecommunications and Financial Services industries, Phil has implemented data quality management programs, led data cleansing exercises and enabled organisations to realise their data management strategy.

The Data Factotum blog is a newcomer in the Data Quality blogosphere, but Phil has kick-started it with 9 great posts during the first month. A balanced approach to scoring data quality is the start of a series on using the balanced scorecard concept to measure data quality.

Jan Erik Ingvaldsen is a colleague and good friend of mine. In a recent competition scam, cheap flight tickets from Norwegian Air Shuttle were booked by employees of competitor Cimber Sterling using all kinds of funny names. As usual, Jan Erik not only has a nose for a good story but is also able to propose solutions, as seen in Detecting Scam and Fraud.

In his position as Nordic Sales Manager at Omikron Data Quality, Jan Erik is actually a frequent flyer with Norwegian Air Shuttle. Now he is waiting to see whether he will end up on their vendor list or on the no-fly list.

William Sharp is a writer on technology focused blogs with an emphasis on data quality and identity resolution.

Informatica Data Quality Workbench Matching Algorithms is part of a series of postings where William details the various algorithms available in Informatica Data Quality (IDQ) Workbench. In this post William starts by giving a quick overview of the available algorithms and some typical uses for each. The subsequent postings get more detailed, outlining the math behind each algorithm, and the series will be finished up with some baseline comparisons using a single set of data.
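To give a taste of the kind of math such a series covers, here is a minimal sketch of one classic matching algorithm, Levenshtein edit distance, turned into a normalized similarity score. This is a generic textbook implementation for illustration, not Informatica’s actual code:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalized score in [0, 1]; a threshold on this drives match/no-match."""
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

print(similarity("Jon Smith", "John Smith"))  # → 0.9
```

In a data quality tool the interesting part is exactly the baseline comparison William promises: which algorithm, at which threshold, catches your real duplicates without flagging distinct parties.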

Personally I really like this kind of ready-made industrial espionage.

IQTrainwrecks hosted the previous blog carnival edition. From this source we also have a couple of postings.

The first was submitted by Grant Robinson, the IAIDQ’s Director of Operations. He shares an amusing but thought provoking story about the accuracy of GPS systems and on-line maps based on his experiences working in Environmental Sciences. Take a dive in the ocean…

Also it is hard to avoid including the hapless Slovak border police and their accidental transportation of high explosives to Dublin due to a breakdown in communication and a reliance on inaccurate contact information. Read all about it.

And finally, we have the post about the return of the Y2K bug, as systems failed to properly handle the move into a new decade – highlighting the need for tactical solutions to information quality problems to be kept under review in a continuous improvement culture, in case the problem reoccurs in a different way. Why 2K?

If you missed them, here’s a full list of previous carnival posts:

April 2009 on Obsessive-Compulsive Data Quality by Jim Harris

May 2009 on The DOBlog by Daragh O Brien

June 2009 on Data Governance and Data Quality Insider by Steve Sarsfield

July 2009 on AndrewBrooks.co.uk by Andrew Brooks

August 2009 on The DQ Chronicle by William E Sharp

September 2009 on Data Quality Edge by Daniel Gent

October 2009 on Tooling around in the IBM Infosphere by Vincent McBurney

November 2009 on IQTrainwrecks.com by IAIDQ


2010 predictions

Today this blog has been live for half a year, Christmas is just around the corner in countries with Christian cultural roots, and a new year – even a new decade – is closing in according to the Gregorian calendar.

It’s time for my 2010 predictions.

Football

Over at the Informatica blog, Chris Boorman and Joe McKendrick are discussing who’s going to win next year’s largest sporting event: the football (soccer) World Cup. I don’t think England, the USA, Germany (or my team, Denmark) will make it. Brazil takes a co-favorite victory – and home team South Africa will go to the semi-finals.

Climate

Brazil and South Africa also had main roles in the recent Climate Summit in my hometown, Copenhagen. Despite heavy executive buy-in, a very weak deal with no operational Key Performance Indicators was reached. Money was on the table – but assigned to reactive approaches.

Our hope for avoiding climate catastrophes now rests on national responsibility and technological improvements.

Data Quality

A reactive approach, lack of enterprise-wide responsibility and reliance on technological improvements are also well-known circumstances in the realm of data quality.

I think we will have to deal with this next year too. We have to become better at working under these conditions. That means being able to perform reactive projects faster and better while also implementing prevention upstream. Aligning people, processes and technology is as key as ever in doing that.

Some areas where we will see improvements are, in my eyes:

  • Exploiting rich external reference data
  • International capabilities
  • Service orientation
  • Small business support
  • Human like technology

The page Data Quality 2.0 has more content on these topics.

Merry Christmas and a Happy New Year.


Sharing data is key to a single version of the truth

This post is involved in a good-natured contest (i.e., a blog-bout) with two additional bloggers: Charles Blyth and Jim Harris. Our contest is a Blogging Olympics of sorts, with Great Britain, the United States and Denmark competing for the Gold, Silver, and Bronze medals in an event we are calling “Three Single Versions of a Shared Version of the Truth.”

Please take the time to read all three posts and then vote for who you think has won the debate (see poll below). Thanks!

My take

According to Wikipedia data may be of high quality in two alternative ways:

  • Either they are fit for their intended uses
  • Or they correctly represent the real-world construct to which they refer

In my eyes the term “single version of the truth” relates best to the real-world way of data being of high quality while “shared version of the truth” relates best to the hard work of making data fit for multiple intended uses of shared data in the enterprise.

My thesis is that, as you include more and more purposes, there is a break-even point where it becomes less cumbersome to reflect the real-world object than to try to align all the known purposes.

The map analogy

In search for this truth we will go on a little journey around the world.

For a journey we need a map.

Traditionally we have the challenge that the real world, being the planet Earth, is round (3 dimensions), while a map shows a flat world (2 dimensions). If a map shows a limited part of the world, the difference doesn’t matter that much. This is similar to fitting the purpose of use in a single business unit.

If the map shows the whole world, we may have all kinds of different projections offering different views of the world, each with its advantages and disadvantages. A classic world map is the Mercator rectangle, where Alaska, Canada, Greenland, Svalbard, Siberia and Antarctica are presented much larger than in the real world compared to regions closer to the equator. This is similar to the problems in fulfilling multiple uses embracing all business units in an enterprise.
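To put a rough number on that distortion: on a Mercator map the local area scale grows as 1/cos²(latitude) relative to the equator, so a back-of-the-envelope sketch like this shows land at Greenland’s latitudes inflated roughly tenfold:

```python
import math

def mercator_area_inflation(latitude_deg: float) -> float:
    """Area scale factor of the Mercator projection relative to the equator."""
    return 1 / math.cos(math.radians(latitude_deg)) ** 2

# Illustrative latitudes; the exact figures depend on where you measure.
for place, lat in [("Equator", 0.0), ("Copenhagen", 55.7), ("Central Greenland", 72.0)]:
    print(f"{place}: x{mercator_area_inflation(lat):.1f}")
```

No single number fixes the whole map, just as no single cleansing rule fits every business unit; you pick the projection, or the purpose, you can live with.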

Today we have new technology coming to the rescue. If you go into Google Earth, the world indeed looks round, and you may take any high-altitude view of an apparently round world. As you go closer, the map becomes more and more flat. My guess is that the solutions to the multiple-uses conundrum will be offered from the cloud.

Exploiting rich external reference data

But Google Earth offers more than powerful technology. The maps are connected with rich information on places, streets, companies and so on obtained from multiple sources – and also some crowdsourced photos, not always placed with accuracy. Even if external reference data is not “the truth”, these data, when used by more and more users (one instance, multiple tenants), will tend to be closer to “the truth” than any data collected and maintained solely within a single enterprise.

Shared data makes fit-for-purpose information

You may divide the data held by an enterprise into 3 pots:

  • Global data that is not unique to operations in your enterprise but shared with other enterprises in the same industry (e.g. product reference data) or eventually the whole world (e.g. business partner data and location data). Here “shared data in the cloud” will make your “single version of the truth” easier and closer to the real world.
  • Bilateral data concerning business partner transactions and related master data. If, for example, you buy a spare part, then also “share the describing data”, making your “single version of the truth” easier and more accurate.
  • Private data that is unique to operations in your enterprise. This may be a “single version of the truth” that you find superior to what others have found, data supporting internal business rules that make your company more competitive, and data referring to internal events.

While private data, followed by bilateral data, makes up the largest share of the data held by an enterprise, it is often the data that could be global that has the most obvious data quality issues, such as duplicated, missing, incorrect and outdated party master data.

Here “a globally or bilaterally shared version of the truth” helps in approaching “a single version of the truth” to be shared in your enterprise. This way accurate raw data may be consumed as valuable information in a given context the moment it is needed.

Call to action

If not done already, please take the time to read posts from fellow bloggers Charles Blyth and Jim Harris and then vote for who you think has won the debate. A link to the same poll is provided on all three blogs. Therefore, wherever you choose to cast your vote, you will be able to view an accurate tally of the current totals.

The poll will remain open for one week, closing at midnight on 19th November so that the “medal ceremony” can be conducted via Twitter on Friday, 20th November. Additionally, please share your thoughts and perspectives on this debate by posting a comment below.  Your comment may be copied (with full attribution) into the comments section of all of the blogs involved in this debate.

Vote here.


Data Quality and Climate Politics

In 1 month and 1 day the United Nations Climate Change Conference commences in my hometown, Copenhagen. Here the people of the Earth will decide whether we want to save the planet now or wait a while and see what happens.

The Data Quality issue might seem of little importance compared to the climate issue. Nevertheless, I have been thinking about some similarities between Data Governance/Data Quality and climate politics.

It goes like this:

CEO buy-in

It’s often said that CEOs don’t buy in on data quality improvements because it’s a loser’s game. In climate politics the CEOs are the heads of state. It’s still an open question how many heads of state will attend the Copenhagen conference. There is a great deal of attention around whether United States president Barack Obama will attend. His last visit to Copenhagen in early October didn’t turn out to be a success, as his recommendation of Chicago as Olympic host city was fruitless. I guess he will only come again if success is very likely.

Personal agendas  

On the other hand, British Prime Minister Gordon Brown has urged all world leaders to come to Copenhagen. While I think this is great for making the conference a success, I also have a personal reason to think it’s a very bad idea. Having all the world’s heads of state driving around the Copenhagen streets surrounded by hordes of police motorcycles will cause traffic jams, interfering with my daily work and, more seriously, my Christmas shopping.

It’s no secret that much of the climate problem is caused by us as individuals not being more careful about our energy consumption in daily routines. Data quality suffers in just the same way from individuals not thinking ahead but focusing on getting daily work done as quickly and comfortably as possible.

The business perspective

My fellow countryman Bjørn Lomborg is a prominent proponent of focusing more on battling starvation, disease and other evils, because resources will be spent more effectively there than on the marginal effects the same resources would have in fighting climate change.

Data quality improvement is often omitted from Business Process Reengineering when the scope of such initiatives is prioritized toward worthy, measurable short-term wins.

Final words

My hope for my planet – and my profession – is that we are able to look ahead and do what is best for the future while we take personal responsibility and care in our daily work and life.


Man versus Computer. Special Edition.

Following up on my previous post on Man versus Computer, I am reminded most workday mornings about how man sucks.

Most workday mornings I leave home in my car heading into the following traffic:

  • A 4-lane motorway rolling in from southern Copenhagen, the rest of Denmark, Germany and ultimately the rest of Eurasia.
  • A 5th lane coming in from a local area.

These 5 lanes then split into:

  • 2 lanes heading for the Danish answer to Silicon Valley (called Ballerup)
  • 3 lanes leading to downtown Copenhagen or the main fairground (called Bella Center), the airport, Sweden and the rest of Scandinavia.

Of course you would expect some mingling here. What happens every morning is rather a complete stop in traffic, and the cause is not the merging and splitting but the humans doing the driving:

  • Experienced local selfish drivers staying in the fastest lane until they suddenly want to switch lanes according to their route.
  • Inexperienced (in this area) foreign drivers coming up from crowded central Europe in search of tranquility deep in the Swedish forests, having no clue where to position themselves in this interchange. The same goes for Swedes returning for the opposite reason.
  • Everyone else having fun blocking the switching attempts of the selfish types and the foreigners, who should know better than to pass through in rush hour.

Some solutions to this problem might be:

  • Change Management, teaching people better driving habits.
  • An onboard computer in every car taking care of lane positioning. Splitting 5 lanes into 2 + 3 should then go smoothly.

Now I am waiting to see which solution will be implemented first.