Automate or Obliterate, That is the Question

Back in 1990 Michael Hammer wrote a famous article called Reengineering Work: Don’t Automate, Obliterate.

Indeed, while automation is a much-wanted outcome of Master Data Management (MDM) implementations and many other IT-enabled initiatives, you should always consider the alternative: eliminating (or simplifying). This often means thinking outside the box.

As an example, today I stumbled upon the Wikipedia explanation of Business Process Mapping. The example used is how to make breakfast (the food part):

[Figure: the “Make breakfast” business process map from Wikipedia]

You could think about different Business Process Re-engineering opportunities for that process. But you could also realize that this is an English/American breakfast. What about making a French breakfast instead? It will be as simple as:

Input money > Buy croissant > Fait accompli

PS: From the data quality and MDM world, one example of making a French breakfast instead of an English/American one is examined in the post The Good, Better and Best Way of Avoiding Duplicates.


Free and Open Public Sector Master Data

Yesterday the Danish Ministry of Finance announced an agreement between local authorities and the central government to improve and link public registers of basic data and to make data available to the private sector.

Once the public authorities have tidied up, merged the data and put a stop to parallel registration, annual savings in public administration could amount to 35 million EUR in 2020.

Basic open data includes private addresses, companies’ business registration numbers, cadastral numbers of real properties and more. These master data are used for multiple purposes by public sector bodies.

Private companies and other organizations can look forward to large savings when they no longer have to buy their basic data from the public authorities.

In my eyes this is a very clever move by the authorities, precisely because of the two main opportunities mentioned:

  • The public sector will see savings and related synergies from a centralized master data management approach
  • The private sector will gain a competitive advantage from better and affordable reference data accessibility and thereby achieve better master data quality.

Denmark has, along with the other Nordic countries, always had a more mature public sector master data approach than we see in most other countries around the world.

I remember working with the committee that prepared a single registry for companies in Denmark back in the 80’s, as mentioned in the post Single Company View.

Today I work with a solution called iDQ (instant Data Quality), which is about mashing up internal master data with a range of external reference data from social networks and, not least, public sector sources. In that realm there is certainly nothing rotten in the state of Denmark. Rather, there is a good answer to the question of whether to be free and open or not to be.


All that glisters is not gold

As William (not Bill) Shakespeare wrote in the play The Merchant of Venice:

All that glisters is not gold;
Often have you heard that told

I was reminded of that phrase when replying to a comment from John Owens on my recent post called Non-Obvious Entity Relationship Awareness.

Loraine Lawson wrote a piece on IT Business Edge yesterday called Adding Common Sense to Data Quality. That post relates to a post by Phil Simon on Mike 2.0 called Data Error Inequality. That post relates to a post on this blog called Pick Any Two.

Anyway, one lesson from all this glistering relationship fuzz is that when looking for return on investment (gold) in data quality improvement and master data management perfection, I agree with adding some common sense.

One of the first posts on this blog was actually Data Quality and Common Sense.


As Bill Shakespeare Wrote …

This post is a follow-up to the post Foreign Affairs and the post Fuzzy Matching and Information Quality over at the Mastering Data Management blog.

The fuzzy post and its comments, including mine, circle around how the relation between “Bill” and “William” must be handled in data matching.

While “Bill” and “William” may be used interchangeably in modern Anglo-Saxon data, it may be a mistake in time (an anachronism) to use them interchangeably in relation to the grand old playwright.

It may also be a mistake in place to use them interchangeably in other cultures.

For example, in my home country Denmark “Bill” and “William” are two different names. Globalization has been going on for a long time, as far more people are now baptized (or otherwise given the name) William than the original Danish form Wilhelm. There are only 286 people with the name Wilhelm today, as opposed to 7,355 with the name William, including 800 new ones during the last year. And then there are 353 different people with the name Bill.

But the same use of nicknames has not been localized here yet.

So with Danish data, matching “Bill Nielsen” and “William Nielsen” is almost certainly a false positive.
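One hedged way to handle this in a matching routine is to keep nickname synonyms per locale, so that “Bill” only maps to “William” where that convention actually applies. The small Python sketch below is purely illustrative; the table contents and function names are my own assumptions, not taken from any particular matching tool.

    # Illustrative sketch: nickname synonyms are looked up per locale, so that
    # "Bill" ~ "William" only fires where that naming convention applies.
    NICKNAME_SYNONYMS = {
        "en": {"bill": "william", "peggy": "margaret"},
        "da": {},  # assumption: no such nickname convention for Danish data
    }

    def same_given_name(name_a: str, name_b: str, locale: str) -> bool:
        """Return True if two given names should be treated as the same name."""
        synonyms = NICKNAME_SYNONYMS.get(locale, {})
        a = synonyms.get(name_a.lower(), name_a.lower())
        b = synonyms.get(name_b.lower(), name_b.lower())
        return a == b

    print(same_given_name("Bill", "William", locale="en"))  # True
    print(same_given_name("Bill", "William", locale="da"))  # False

With English data the two records become duplicate candidates, while with Danish data they stay apart, which is exactly the behaviour argued for above.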

It’s not that it’s a big problem; the risk of making the mistake is very low. The problem is rather that focus should be on other, more pressing issues with the specific challenges (and possibilities) related to data from each culture and country.


To be called Hamlet or Olaf – that is the question

Right now my family and I are relocating from a house in a southern suburb of Copenhagen to a flat much closer to downtown. As there is a month in between where we don’t have a place of our own, we have rented a cottage (summerhouse) north of Copenhagen not far from Kronborg Castle, which is the setting of the famous Shakespeare play Hamlet.

Therefore a data quality blog post inspired by Hamlet seems timely.

Though the feigned madness of Hamlet may be a good subject in relation to data quality, I will instead take a closer data matching look at the name Hamlet.

Shakespeare’s Hamlet is inspired by an old Norse legend, but to me the name Hamlet doesn’t sound very Norse.

Nor does the similar-sounding name Amleth found in the immediate source, Saxo Grammaticus.

If Saxo’s source was a written one, it may have come from Irish monks writing in the Gaelic alphabet as Amhlaoibh, where Amhl = owl, aoi = ay and bh = v, sounding just like the good old Norse name Olav or Olaf.

So, there is a possible track from Hamlet to Olaf.

Also today, fellow data quality blogger Graham Rhind published a post called Robert the Carrot about the same issue. As Graham explains, we often see how data is changed through interfaces and, after passing through many interfaces, in the end doesn’t look at all like it did when first entered. There may be a good explanation for each transformation, but the end-to-end similarity is hard to guess when only comparing the first and last values.

I have often met that challenge in data matching. An example would be if we have the following names living at the same address:

  • Pegy Smith
  • Peggy Smith
  • Margaret Smith

A synonym-based similarity (or standardization) will find that Margaret and Peggy are duplicates.

An edit distance similarity will find that Peggy and Pegy are duplicates.

A combined similarity algorithm will find that all three names belong to a single duplicate group.
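To make the mechanics concrete, here is a minimal Python sketch of how such a combined approach could work; the synonym table, the plain Levenshtein implementation and the distance threshold are illustrative assumptions, not taken from any particular matching product.

    # Sketch: group name variants at the same address into one duplicate group
    # by combining a synonym lookup with an edit-distance check.
    SYNONYMS = {"peggy": "margaret"}  # illustrative nickname table

    def canonical(name: str) -> str:
        """Map a name to its synonym-standardized form."""
        return SYNONYMS.get(name.lower(), name.lower())

    def edit_distance(a: str, b: str) -> int:
        """Plain Levenshtein distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[-1]

    def is_match(a: str, b: str, max_distance: int = 1) -> bool:
        """Match on synonym-standardized equality or on a small edit distance."""
        if canonical(a) == canonical(b):
            return True
        return edit_distance(a.lower(), b.lower()) <= max_distance

    names = ["Pegy", "Peggy", "Margaret"]
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:] if is_match(a, b)]
    print(pairs)  # [('Pegy', 'Peggy'), ('Peggy', 'Margaret')]

Pegy matches Peggy on edit distance and Peggy matches Margaret on the synonym, so transitively all three names end up in a single duplicate group.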


Double Falshood

Always remember to include Shakespeare in a blog, right?

Now, it is actually disputable whether Shakespeare has anything to do with the title of this blog post. Double Falshood is the (first part of the) title of a play claimed to be based on a lost play by Shakespeare (and someone else). The only fact that seems to be true in this story is that the plot of the play(s) is based on an episode in Don Quixote by Cervantes. “The Ingenious Hidalgo Don Quixote of La Mancha”, which is the full name of the novel, is probably best known for the attack on the windmills by Don Quijote (the Spanish version of the name).

All this confusion about sorting out who, what, when and where, and the feeling of tilting at windmills, seems familiar from the daily work of trying to fix master data quality.

And indeed “double falsehood” may be a good term for the classic challenge in data quality deduplication, which is to avoid false positives and false negatives at the same time.
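As a hedged illustration of that double trap, the small Python sketch below scores a handful of invented name pairs with a plain string similarity from the standard library and applies a few different match thresholds; the pairs, the similarity measure and the threshold values are illustrative assumptions only.

    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        """Simple string similarity score between 0 and 1."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # Invented labelled pairs: True = same real-world person, False = different people.
    pairs = [
        ("Peggy Smith", "Pegy Smith", True),
        ("Margaret Smith", "Peggy Smith", True),
        ("Bill Nielsen", "William Nielsen", False),  # distinct persons in Danish data
    ]

    for threshold in (0.6, 0.8, 0.95):
        false_positives = sum(1 for a, b, same in pairs
                              if not same and similarity(a, b) >= threshold)
        false_negatives = sum(1 for a, b, same in pairs
                              if same and similarity(a, b) < threshold)
        print(f"threshold {threshold}: {false_positives} false positives, "
              f"{false_negatives} false negatives")

With these invented pairs no threshold gets both error counts to zero: Margaret and Peggy never look similar enough as raw strings, while Bill Nielsen and William Nielsen look deceptively similar, which is precisely the double falsehood the wordplay points at.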

Now, back to work.

Information and Data Quality Blog Carnival, February 2010


El Festival del IDQ Bloggers is another name for the monthly recurring post of selected (or rather, submitted) blog posts on information and data quality, started last year by the IAIDQ.

This is the February 2010 edition covering posts published in December 2009 and January 2010.

I will go straight to the point:

Daragh O Brien shared the story about a leading Irish hospital that has come under scrutiny for retaining data without any clear need. This highlights an important relationship between Data Protection/Privacy and Information Quality. Daragh’s post explores some of this relationship through the “Information Quality Lens”. Here’s the story: Personal Data – an Asset we hold on Trust.

Former Publicity Director of the IAIDQ, Daragh has over a decade of coal-face experience in Information Quality Management at the tactical and strategic levels from the Business perspective. He is the Taoiseach (Irish for chieftain) of Castlebridge Associates. Since 2006 he has been writing and presenting about legal issues in Information Quality amongst other topics.

Jim Harris is an independent consultant, speaker, writer and blogger with over 15 years of professional services and application development experience in data quality. Obsessive-Compulsive Data Quality is an independent blog offering a vendor-neutral perspective on data quality.

If you are a data quality professional, know the entire works of Shakespeare by heart and are able to wake up at night and promptly explain the theories of Einstein, you probably know Jim’s blogging. On the other hand, if you don’t know Shakespeare and don’t understand Einstein, then Jim comes to the rescue. Read The Dumb and Dumber Guide to Data Quality.

In another post Jim discusses the out-of-the-box experience (OOBE) provided by data quality (DQ) software under the title OOBE-DQ, Where Are You? Jim also posted part 8 of Adventures in Data Profiling – a great series of knowledge sharing on this important discipline within data quality improvement.

Phil Wright is a consultant based in London, UK, who specialises in Business Intelligence and Data Quality Management. With 10 years’ experience within the Telecommunications and Financial Services industries, Phil has implemented data quality management programs, led data cleansing exercises and enabled organisations to realise their data management strategy.

The Data Factotum blog is a new blog in the Data Quality blogosphere, but Phil has kick-started it with 9 great posts during the first month. A balanced approach to scoring data quality is the start of a series on using the balanced scorecard concept in measuring data quality.

Jan Erik Ingvaldsen is a colleague and good friend of mine. In a recent market competition scam, cheap flight tickets from Norwegian Air Shuttle were booked by employees from competitor Cimber Sterling using all kinds of funny names. As usual, Jan Erik not only has a nose for a good story but is also able to propose solutions, as seen here in Detecting Scam and Fraud.

In his position as Nordic Sales Manager at Omikron Data Quality, Jan Erik is actually a frequent flyer with Norwegian Air Shuttle. Now he is waiting to see whether he will be included on their vendor list or on the no-fly list.

William Sharp is a writer of technology-focused blogs with an emphasis on data quality and identity resolution.

Informatica Data Quality Workbench Matching Algorithms is part of a series of postings where William details the various algorithms available in Informatica Data Quality (IDQ) Workbench. In this post William starts by giving a quick overview of the algorithms available and some typical uses for each. The subsequent postings get more detailed, outlining the math behind each algorithm, and will finally be finished up with some baseline comparisons using a single set of data.

Personally, I really like this kind of ready-made industrial espionage.

IQTrainwrecks hosted the previous blog carnival edition. From this source we also have a couple of postings.

The first was submitted by Grant Robinson, the IAIDQ’s Director of Operations. He shares an amusing but thought-provoking story about the accuracy of GPS systems and online maps, based on his experiences working in Environmental Sciences. Take a dive in the ocean…

It is also hard to avoid including the hapless Slovak border police and their accidental transportation of high explosives to Dublin due to a breakdown in communication and a reliance on inaccurate contact information. Read all about it.

And finally, we have the post about the return of the Y2K bug as systems failed to properly handle the move into a new decade, highlighting the need for tactical solutions to information quality problems to be kept under review in a continuous improvement culture in case the problem recurs in a different way. Why 2K?

If you missed them, here’s a full list of previous carnival posts:

April 2009 on Obsessive-Compulsive Data Quality by Jim Harris

May 2009 on The DOBlog by Daragh O Brien

June 2009 on Data Governance and Data Quality Insider by Steve Sarsfield

July 2009 on AndrewBrooks.co.uk by Andrew Brooks

August 2009 on The DQ Chronicle by William E Sharp

September 2009 on Data Quality Edge by Daniel Gent

October 2009 on Tooling around in the IBM Infosphere by Vincent McBurney

November 2009 on IQTrainwrecks.com by IAIDQ
