Automate or Obliterate, That is the Question

Back in 1990 Michael Hammer wrote a famous article called Reengineering Work: Don’t Automate, Obliterate.

Indeed, while automation is a much-wanted outcome of Master Data Management (MDM) implementations and many other IT-enabled initiatives, you should always consider the alternative: eliminating (or simplifying) the process. This often means thinking outside the box.

As an example, today I stumbled upon the Wikipedia explanation of Business Process Mapping. The example used is how to make breakfast (the food part).


You could think about different Business Process Re-engineering opportunities for that process. But you could also realize that this is an English / American breakfast. What about making a French breakfast instead? It will be as simple as:

Input money > Buy croissant > Fait accompli

PS: From the data quality and MDM world one example of making French breakfast instead of English / American breakfast is examined in the post The Good, Better and Best Way of Avoiding Duplicates.


Free and Open Public Sector Master Data

Yesterday the Danish Ministry of Finance announced an agreement between local authorities and the central government to improve and link public registers of basic data and to make data available to the private sector.

Once the public authorities have tidied up, merged the data and put a stop to parallel registration, annual savings in public administration could amount to 35 million EUR in 2020.

Basic open data includes private addresses, companies’ business registration numbers, cadastral numbers of real properties and more. These master data are used for multiple purposes by public sector bodies.

Private companies and other organizations can look forward to large savings when they no longer have to buy their basic data from the public authorities.

In my eyes this is a very clever move by the authorities, precisely because of the two main opportunities mentioned:

  • The public sector will see savings and related synergies from a centralized master data management approach
  • The private sector will gain a competitive advantage from better and affordable reference data accessibility and thereby achieve better master data quality.

Denmark has, along with the other Nordic countries, always had a more mature public sector master data approach than we see in most other countries around the world.

I remember I worked with the committee that prepared a single registry for companies in Denmark back in the 80’s as mentioned in the post Single Company View.

Today I work with a solution called iDQ (instant Data Quality), which is about mashing up internal master data and a range of external reference data from social networks and, not least, public sector sources. In that realm there is certainly not something rotten in Denmark. Rather there is a good answer to the question about to be free and open or not to be.


All that glisters is not gold

As William (not Bill) Shakespeare wrote in the play The Merchant of Venice:

All that glisters is not gold;
Often have you heard that told

I was reminded of that phrase when commenting on a comment from John Owens in my recent post called Non-Obvious Entity Relationship Awareness.

Loraine Lawson wrote a piece on IT Business Edge yesterday called Adding Common Sense to Data Quality. That post relates to a post by Phil Simon on Mike 2.0 called Data Error Inequality. That post relates to a post on this blog called Pick Any Two.

Anyway, one lesson from all this glistering relationship fuzz is that, when looking for return on investment (gold) in data quality improvement and master data management perfection, I agree with adding some common sense.

One of the first posts on this blog actually was Data Quality and Common Sense.  


As Bill Shakespeare Wrote …

This post is a follow up on the post Foreign Affairs and the post Fuzzy Matching and Information Quality over at the Mastering Data Management blog.

The fuzzy post and its comments, including mine, circle around how the relation between “Bill” and “William” must be handled in data matching.

While “Bill” and “William” may be used interchangeably in modern Anglo-Saxon data, it may be a mistake in time (an anachronism) to use them interchangeably in relation to the grand old playwright.

It may also be a mistake in place to use them interchangeably in other cultures.

For example, in my home country Denmark “Bill” and “William” are two different names. Globalization has been going on for a long time, as far more people are baptized (or otherwise given the name) William than the original Danish form Wilhelm. There are only 286 people with the name Wilhelm today, as opposed to 7,355 with the name William, including 800 new ones during the last year. And then there are 353 different people with the name Bill.

But the same use of nicknames has not been localized here yet.

So with Danish data, matching “Bill Nielsen” and “William Nielsen” is almost certainly a false positive.

It’s not that it’s a big problem; the risk of making the mistake is very low. The problem is rather that focus should be on other, more pressing issues, with the specific challenges (and possibilities) related to data from each culture and country.
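The idea that a nickname rule valid in one culture should not be applied in another can be sketched as a nickname table keyed by country. This is only an illustration: the table contents, the country codes used as keys, and the function names are assumptions made for the example, not part of any particular matching product.

```python
# Hypothetical nickname tables keyed by country code.
# The entries are illustrative assumptions, not a real reference data set.
NICKNAMES_BY_COUNTRY = {
    "GB": {"bill": "william", "peggy": "margaret"},
    "US": {"bill": "william", "peggy": "margaret"},
    "DK": {},  # in Danish data, Bill and William are treated as distinct names
}


def canonical_given_name(name: str, country: str) -> str:
    """Map a given name to its formal form, using only that country's table."""
    table = NICKNAMES_BY_COUNTRY.get(country, {})
    key = name.strip().lower()
    return table.get(key, key)


def same_given_name(a: str, b: str, country: str) -> bool:
    """True if both names resolve to the same formal name in that country."""
    return canonical_given_name(a, country) == canonical_given_name(b, country)


print(same_given_name("Bill", "William", "US"))  # nickname rule applies
print(same_given_name("Bill", "William", "DK"))  # no such rule for Danish data
```

Keeping the tables separate per country means that a rule which is safe for Anglo-Saxon data never leaks into the matching of Danish data.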


To be called Hamlet or Olaf – that is the question

Right now my family and I are relocating from a house in a southern suburb of Copenhagen to a flat much closer to downtown. As there is a month in between where we don’t have a place of our own, we have rented a cottage (summerhouse) north of Copenhagen, not far from Kronborg Castle, which is the setting of the famous Shakespeare play Hamlet.

Therefore a data quality blog post inspired by Hamlet seems timely.

Though the feigned madness of Hamlet may be a good subject related to data quality, I will instead take a closer data matching look at the name Hamlet.

Shakespeare’s Hamlet is inspired by an old Norse legend, but to me the name Hamlet doesn’t sound very Norse.

Nor does the similar-sounding name Amleth, found in the immediate source, Saxo Grammaticus.

If Saxo’s source was a written one, it may have come from Irish monks writing in the Gaelic alphabet as Amhlaoibh, where Amhl=owl and aoi=ay and bh=v, sounding just like the good old Norse name Olav or Olaf.

So, there is a possible track from Hamlet to Olaf.

Also today a fellow data quality blogger, Graham Rhind, published a post called Robert the Carrot about the same issue. As Graham explains, we often see how data is changed through interfaces and in the end, after passing through many interfaces, doesn’t look at all like it did when first entered. There may be a good explanation for each transformation, but the end-to-end similarity is hard to guess when only comparing the first and the last form.

I have often met that challenge in data matching. An example would be if we have the following names at the same address:

  • Pegy Smith
  • Peggy Smith
  • Margaret Smith

A synonym based similarity (or standardization) will find that Margaret and Peggy are duplicates.

An edit distance similarity will find that Peggy and Pegy are duplicates.

A combined similarity algorithm will find that all three names belong to a single duplicate group.
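The three approaches above can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of any specific tool: the synonym table, the maximum edit distance of 1, and all function names are made up for the example, and only first names are compared because surname and address are assumed to agree already.

```python
from itertools import combinations

# Hypothetical synonym table: nickname -> formal name (illustrative only).
SYNONYMS = {"peggy": "margaret"}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def synonym_match(a: str, b: str) -> bool:
    """Names match if they resolve to the same formal name."""
    a, b = a.lower(), b.lower()
    return SYNONYMS.get(a, a) == SYNONYMS.get(b, b)


def edit_match(a: str, b: str, max_distance: int = 1) -> bool:
    """Names match if they are within max_distance single-character edits."""
    return edit_distance(a.lower(), b.lower()) <= max_distance


def combined_match(a: str, b: str) -> bool:
    """Names match if either the synonym rule or the edit rule fires."""
    return synonym_match(a, b) or edit_match(a, b)


def duplicate_groups(names, match):
    """Group names so that every chain of pairwise matches ends up together."""
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if match(names[i], names[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return sorted(sorted(group) for group in groups.values())


names = ["Pegy", "Peggy", "Margaret"]
print(duplicate_groups(names, synonym_match))
print(duplicate_groups(names, edit_match))
print(duplicate_groups(names, combined_match))
```

Note that Pegy and Margaret never match directly. They only end up in the same group because each of them matches Peggy, which is why the grouping step follows chains of matches rather than comparing pairs in isolation.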


Double Falshood

Always remember to include Shakespeare in a blog, right?

Now, it is actually disputable if Shakespeare has anything to do with the title of this blog post. Double Falshood is the (first part of the) title of a play claimed to be based on a lost play by Shakespeare (and someone else). The only fact that seems to be true in this story is that the plot of the play(s) is based on an episode in Don Quixote by Cervantes.  “The Ingenious Hidalgo Don Quixote of La Mancha”, which is the full name of the novel, is probably best known for the attack on the windmills by don Quijote (the Spanish version of the name).

All this confusion about sorting out who, what, when and where, and the feeling of tilting at windmills, seem familiar from the daily work of trying to fix master data quality.

And indeed “double falsehood” may be a good term for the classic challenge in the data quality kind of deduplication, which is to avoid false positives and false negatives at the same time.
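The tension between the two kinds of error can be sketched with a match score and a threshold. The scores and labels below are made-up toy values chosen purely for illustration, and the function name is hypothetical; the point is only that moving the threshold trades one falsehood for the other.

```python
def count_errors(scored_pairs, threshold):
    """Return (false_positives, false_negatives) for a given match threshold.

    scored_pairs is a list of (score, is_true_duplicate) tuples; a pair is
    declared a duplicate when its score is at or above the threshold.
    """
    false_positives = sum(
        1 for score, is_dup in scored_pairs if score >= threshold and not is_dup
    )
    false_negatives = sum(
        1 for score, is_dup in scored_pairs if score < threshold and is_dup
    )
    return false_positives, false_negatives


# Made-up candidate pairs: (similarity score, whether they really are duplicates).
PAIRS = [
    (0.95, True),
    (0.85, True),
    (0.80, False),
    (0.70, True),
    (0.60, False),
    (0.40, False),
]

for threshold in (0.5, 0.75, 0.9):
    fp, fn = count_errors(PAIRS, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

With this toy data a low threshold yields only false positives, a high threshold yields only false negatives, and the middle setting yields one of each.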

Now, back to work.

Information and Data Quality Blog Carnival, February 2010

El Festival del IDQ Bloggers is another name for the monthly recurring post of selected (actually rather submitted) blog posts on information and data quality started last year by the IAIDQ.

This is the February 2010 edition covering posts published in December 2009 and January 2010.

I will go straight to the point:

Daragh O Brien shared the story about a leading Irish Hospital that has come under scrutiny for retaining data without any clear need. This highlights an important relationship between Data Protection/Privacy and Information Quality. Daragh’s post explores some of this relationship through the “Information Quality Lense”. Here’s the story: Personal Data – an Asset we hold on Trust.

Former Publicity Director of the IAIDQ, Daragh has over a decade of coal-face experience in Information Quality Management at the tactical and strategic levels from the Business perspective. He is the Taoiseach (Irish for chieftain) of Castlebridge Associates. Since 2006 he has been writing and presenting about legal issues in Information Quality amongst other topics.

Jim Harris is an independent consultant, speaker, writer and blogger with over 15 years of professional services and application development experience in data quality. Obsessive-Compulsive Data Quality is an independent blog offering a vendor-neutral perspective on data quality.

If you are a data quality professional, know the entire works by Shakespeare by heart and are able to wake up at night and promptly explain the theories of Einstein you probably know Jim’s blogging. On the other hand: If you don’t know Shakespeare, don’t understand Einstein, then: Jim to the rescue. Read The Dumb and Dumber Guide to Data Quality.

In another post Jim discusses the out-of-box-experience (OOBE) provided by data quality (DQ) software under the title: OOBE-DQ, Where Are You? Jim also posted part 8 of Adventures in Data Profiling – a great series of knowledge sharing on this important discipline within data quality improvement.

Phil Wright is a consultant based in London, UK who specialises in Business Intelligence and Data Quality Management. With 10 years’ experience within the Telecommunications and Financial Services Industries, Phil has implemented data quality management programs, led data cleansing exercises and enabled organisations to realise their data management strategy.

The Data Factotum blog is a new blog in the Data Quality blogosphere, but Phil has kick-started it with 9 great posts during the first month. A balanced approach to scoring data quality is the start of a series on the topic of using the balanced scorecard concept in measuring data quality.

Jan Erik Ingvaldsen is a colleague and good friend of mine. In a recent market competition scam, cheap flight tickets from Norwegian Air Shuttle were booked by employees of competitor Cimber Sterling using all kinds of funny names. As usual, Jan Erik not only has a nose for a good story but is also able to propose solutions, as seen here in Detecting Scam and Fraud.

In his position as Nordic Sales Manager at Omikron Data Quality, Jan Erik actually is a frequent flyer with Norwegian Air Shuttle. Now he is waiting to see whether he will be included on their vendor list or on the no-fly list.

William Sharp is a writer on technology focused blogs with an emphasis on data quality and identity resolution.

Informatica Data Quality Workbench Matching Algorithms is part of a series of postings where William details the various algorithms available in Informatica Data Quality (IDQ) Workbench. In this post William starts by giving a quick overview of the algorithms available and some typical uses for each. The subsequent postings get more detailed and outline the math behind each algorithm, and the series will finish with some baseline comparisons using a single set of data.

Personally I really like this kind of ready-made industrial espionage.

IQTrainwrecks hosted the previous blog carnival edition. From this source we also have a couple of postings.

The first was submitted by Grant Robinson, the IAIDQ’s Director of Operations. He shares an amusing but thought-provoking story about the accuracy of GPS systems and on-line maps, based on his experiences working in Environmental Sciences. Take a dive in the ocean…

Also it is hard to avoid including the hapless Slovak border police and their accidental transportation of high explosives to Dublin due to a breakdown in communication and a reliance on inaccurate contact information. Read all about it.

And finally, we have the post about the return of the Y2k Bug as systems failed to properly handle the move into a new decade, highlighting the need for tactical solutions to information quality problems to be kept under review in a continuous improvement culture in case the problem reoccurs in a different way. Why 2K?

If you missed them, here’s a full list of previous carnival posts:

April 2009 on Obsessive-Compulsive Data Quality by Jim Harris

May 2009 on The DOBlog by Daragh O Brien

June 2009 on Data Governance and Data Quality Insider by Steve Sarsfield

July 2009 on by Andrew Brooks

August 2009 on The DQ Chronicle by William E Sharp

September 2009 on Data Quality Edge by Daniel Gent

October 2009 on Tooling around in the IBM Infosphere by Vincent McBurney

November 2009 on by IAIDQ
