The Start of the History of Data and Information Quality Management

I am sad to hear that Larry English has passed away as I learned from this LinkedIn update by C. Lwanga Yonke.

As said there: “When the story of Information Quality Management is written, the first sentence of the first paragraph will include the name Larry English”.

Larry pioneered the data quality – or information quality, as he preferred to call it – discipline.

He was an inspiration to many data and information quality practitioners back in the 90’s and 00’s, including me, and he paved the way for bringing this topic to the level of awareness that it has today.

In his teaching Larry emphasized the simple but powerful concepts that form the foundation of data quality and information quality methodologies:

  • Quantify the costs and lost opportunities of bad information quality
  • Always look for the root cause of bad information quality
  • Observe the plan-do-check-act cycle when solving information quality issues

Let us roll up our sleeves and continue what Larry started.

Human Errors and Data Quality

Every time there is a survey about what causes poor data quality the most ticked answer is human error. This is also the case in the Profisee 2019 State of Data Management Report where 58% of the respondents said that human error is among the most prevalent causes of poor data quality within their organization.

This topic was also examined some years ago in the post called The Internet of Things and the Fat-Finger Syndrome.

Even the Romans knew this, as Seneca the Younger said: “errare humanum est”, which translates to “to err is human”. He also added: “but to persist in error is diabolical”.

So, how can we not persist in having human errors in data then? Here are three main approaches:

  • Better humans: There is a whip called Data Governance. In a data governance regime you define data policies and data standards. You build an organizational structure with a data governance council (or any better name), have data stewards and data custodians (or any better title). You set up a business glossary. And then you carry on with a data governance framework.
  • Machines: Robotic Process Automation (RPA) has, besides operational efficiency, the advantage that machines, unlike humans, do not make mistakes when they are tired and bored.
  • Data Sharing: Human errors typically occur when typing in data. However, most data are already typed in somewhere. Instead of retyping data, and thereby potentially introducing your misspelling or other mistake, you can connect to data that is already digitized and validated. This is especially doable for master data as examined in the article about Master Data Share.
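The data sharing approach can be sketched in a few lines of code. The idea is to pull party master data from an already digitized and validated source by its official identifier, rather than letting a human retype it. The registry below is a hypothetical stand-in for an external business directory, not a real API.

```python
# Instead of retyping master data (and risking fat-finger errors),
# look it up in an already digitized and validated reference source.
# BUSINESS_REGISTRY is a hypothetical stand-in for an external directory.

BUSINESS_REGISTRY = {
    "DK12345678": {"name": "Nordic Trading ApS", "city": "Copenhagen"},
}

def onboard_customer(registration_number: str) -> dict:
    """Pull validated master data by its official identifier."""
    record = BUSINESS_REGISTRY.get(registration_number)
    if record is None:
        raise KeyError(f"Unknown registration number: {registration_number}")
    # Copy the validated attributes instead of letting a human retype them
    return {"reg_no": registration_number, **record}

customer = onboard_customer("DK12345678")
print(customer["name"])  # the name as registered, not as (mis)typed
```

The point is that the human only supplies the identifier; every other attribute arrives exactly as validated at the source.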

Modern Data Management, Paella, Herodotus, Darwin and Einstein

Reltio has a blog series with the tag #moderndatamasters. The posts are interviews with people in the data management world. The other day it was my turn to share my story.

Kate Tickner from Reltio took me through some serious questions such as:

  • How would you define “modern” data management and what does it /should it mean for organisations that adopt it?
  • What are your top 3 tips or resources to share for aspiring modern data masters?
  • Can you tell us a little more about the concepts behind Product Data Lake and your vision for how it could be used in the future?
  • What trends or changes do you predict to the data management arena in the next few years?

You can read the interview here on the Reltio blog.

At the end we touched on:

  • What do you like to do outside of work?
  • Which 3 people – living or dead, real or fictional – would you invite to a dinner party and why?
  • What are you cooking?

For the dinner party I would make paella and, based on my interest in history, I picked three historical figures who have also been featured on this blog.


Marathon, Spartathlon and Data Quality

Tomorrow there is a Marathon race in my home city, Copenhagen. Eight years ago, a post on this blog revolved around some data quality issues connected with the Marathon race. The post was called How long is a Marathon?

Pheidippides at the end of his Marathon race in a classic painting

However, another information quality issue is whether there ever was a first Marathon run by Pheidippides. Historians today do not think so. It has something to do with data lineage. The written mention of the 42.195 (or so) kilometre effort from Marathon to Athens by Pheidippides comes from Plutarch, whose record was made some 500 years after the events. The first written source about the Battle of Marathon is from Herodotus, written (in a historian’s perspective) only 40 years after the events. He did not mention the Marathon run. However, he wrote that Pheidippides ran from Athens to Sparta. That is 245 kilometres.

By the way: His mission in Sparta was to get help. But the Spartans did not have time. They were in the middle of an SAP roll-out (or something similarly festive).

Some people run the 245-kilometre course in what is called a Spartathlon. In a data and information quality context this reminds me that improving data quality, and thereby information quality, is not a sprint. Not even a Marathon. It is a Spartathlon.


When You Know that Statement is Wrong

Oftentimes it still takes a human eye to establish if a number, year, term or other piece of information is wrong.

I had that experience today at Harvard Square in Cambridge (Boston) when looking at the sign in front of our lunch restaurant. Established 1271, it says. Hmmmm. North American natives were not known for establishing restaurants, and the Vikings did not stay that long or go that far south in North America.

The restaurant website actually admits the sign is wrong and that it is a printing flaw (it should have been 1971) that they have chosen to keep – maybe also in order to test the clever people hanging around Harvard.

Anyway, without attempting to turn this into a foodie blog, the food is OK but the waiting time for being served does resemble spans of centuries.

A Product Information Management (PIM) Solar System

Hundreds of years ago the geocentric model was replaced by heliocentrism, meaning that we recognize that the earth travels around the sun and not the other way around.

When it comes to Product Information Management (PIM), we also need a Copernican Revolution, meaning that it is good to manage product information consistently inside a given company, but it is better to manage product information in the light of the business ecosystem where we participate.

Exchanging product information in the business ecosystems of manufacturers, distributors and merchants cannot work properly by asking all your trading partners to use your version of a spreadsheet – if they don’t get to you first with their version. Nor will self-centered supplier / customer product data portals work as examined in the post PIM Supplier Portals: Are They Good or Bad?

Your company is not a lonely planet. You are part of a business ecosystem, where you may be:

  • Upstream as the maker of goods and services. For that you need to buy raw materials and indirect goods from the parties being your vendors. In a data driven world you also need to receive product information for these items. You need to sell your finished products to the midstream and downstream parties being your B2B customers. For that you need to provide product information to those parties.
  • Midstream as a distributor (wholesaler) of products. You need to receive product information from upstream parties being your vendors, perhaps enrich and adapt the product information and provide this information to the parties being your downstream B2B customers.
  • Downstream as a retailer/etailer or large end user of product information. You need to receive product information from upstream parties being your vendors and enrich and adapt the product information so you will be the preferred seller to the parties being your B2B customers and/or B2C customers.

At Product Data Lake we support business ecosystems in Product Information Management (PIM). And this is not just a nice model. There are concrete business benefits too: 5 for you and 5 for your trading partner. Check our 10 business benefits.


No plan of operations extends with any certainty beyond the first contact with the full load of data

There is a famous saying from the military world stating that “No plan survives contact with the enemy.” At least one blogger has used the paraphrase “No plan survives contact with the data.” A good read by the way.

Helmuth von Moltke the Elder

Like most famous sayings, this phrase too is simplified from the original version. The military observation made by Helmuth von Moltke the Elder is, in full length: “No plan of operations extends with any certainty beyond the first contact with the main hostile force.”

Translating the extended military lesson into data management makes a lot of sense too. You may plan data management activities using selected examples, and you may test those using nice little samples – like skirmishes before the real battle in warfare. But when your data management solution goes live on the full load of data for the first time, there is most often news for you.

From my data matching days I remember this clearly as explained in the post Seeing is Believing.

The mitigation is to test with a full load of data before going live. In data management we actually have a realistic way of overcoming the observation made by Field Marshal Helmuth Karl Bernhard Graf von Moltke and revisiting our plan of operations before the second and serious contact with the full load of data.
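The skirmish-versus-battle point can be illustrated with a minimal sketch: a data quality rule that holds on a nice little sample can still fail on the full load. The rows and the completeness rule below are illustrative assumptions, not data from any real solution.

```python
# A rule that survives the skirmish (the sample) can still lose
# the battle (the full load). Data and rule are illustrative.

def check_completeness(rows):
    """Return the rows that violate a simple not-null rule on 'email'."""
    return [r for r in rows if not r.get("email")]

full_load = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": ""},            # lurks beyond the sample
    {"id": 4, "email": None},
]

sample = full_load[:2]                 # the nice little test sample

print(len(check_completeness(sample)))     # 0 - the plan survives the skirmish
print(len(check_completeness(full_load)))  # 2 - first contact with the full load
```

Running the same check over the full load before go-live is exactly the revisit of the plan of operations described above.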


Who Discovered the Americas?

Today I read a strange story about who discovered the Americas. Turkish President Recep Tayyip Erdogan said that Muslims, not Columbus, discovered the Americas. The assumed discovery would have happened in the year 1178 in the Gregorian calendar.

Well, in my history book it goes like this:

1st the indigenous peoples of the Americas, sometimes called Indians (as opposed to cowboys), found that land by crossing the Bering Strait thousands of years ago.

2nd there is much speculation about whether someone else crossed the oceans. The only archaeological evidence (so far) is that the Vikings were on Newfoundland off the coast of Canada at a place today called L’Anse aux Meadows. That happened around the year 1000 in the Gregorian calendar. (By the way, they came from Greenland, which geographically is a part of the Americas.)

3rd Christopher Columbus and his crew arrived in the Americas in the year 1492 in the Gregorian calendar.

That is the data quality part of the story. The rest is information quality.


Anachronism and Data Quality

The term anachronism is used for something misplaced in time. An example is classical paintings where a biblical event is shown with people in clothes from the time when the painting was done.

In data quality lingo such a flaw will be categorized as lack of timeliness.

The most frequent example of lack of timeliness, or should we say example of anachronism, in data management today is having an old postal address attached to a party master data entity. A remedy for avoiding this kind of anachronism is explained in the post The Relocation Event.
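A minimal sketch of catching this kind of anachronism: if the address on file was last verified before the party's most recent known relocation event, treat it as suspect. The function and field names are assumptions for illustration, not from any particular product.

```python
# Flag potentially stale (anachronistic) postal addresses: an address
# last verified before a known relocation event is suspect.
# Function and field names are illustrative assumptions.

from datetime import date

def address_is_stale(last_verified: date, relocated_on: date) -> bool:
    """True when the address on file predates a known relocation."""
    return last_verified < relocated_on

print(address_is_stale(date(2023, 1, 5), date(2024, 3, 1)))  # True: old address
print(address_is_stale(date(2024, 6, 1), date(2024, 3, 1)))  # False: verified after the move
```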

In a recent blog post called 3-2-1 Start Measuring Data Quality by Janani Dumbleton of Experian QAS, the timeliness dimension in data quality is examined along with five other important dimensions of data quality. As said therein, an impact of anachronism could be:

“Not being aware of a change in address could result in confidential information being delivered to the wrong recipient.”

Hope you got it.


Famous False Positives

You should Beware of False Positives in Data Matching. A false positive in the data quality realm is a match of two (or more) identities that actually are not the same real-world entity.

Throughout history and within art we have seen some false positives too. Here are my three favorites:

The Piltdown Man

In 1912 a British amateur archeologist apparently found a fossil claimed to be the missing link between apes and man: the so-called Piltdown Man. Backed by the British Museum, it was treated as a true discovery until 1953, when it was finally revealed as a hoax. It had been disputed throughout the years but was defended by the British establishment, maybe out of envy of the French, where Cro-Magnon man was first found, and the Germans, who had a name-giving true discovery in Neandertal.

Eventually the Piltdown Man was exposed as a combination of a medieval human upper skull, an orangutan jawbone and chimpanzee teeth.

Jimmy Bond in Casino Royale

James and Jimmy Bond

As told in the post My Name is Bond. Jimmy Bond: James Bond is with British intelligence and Jimmy Bond is an American agent. It is always a question whether two identities residing in different countries are the same, as discussed (about me) in the post Hello Leading MDM Vendor.

Dupond et Dupont

In English they are known as Thomson and Thompson. In the original Belgian/French (and in my childhood Danish) comics about the adventures of Tintin they are known as Dupond et Dupont. They are two incompetent detectives who look alike and have names with a low edit distance and the same phonetic sound. For their twin names in a lot of other languages, check the Wikipedia article here.
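The low edit distance and identical phonetic sound can be demonstrated in code. The sketch below uses a classic Levenshtein distance and the American Soundex code: the two names are one edit apart and share the same Soundex code, so a matching engine would happily pair them, even though they are two different (fictional) persons - a textbook false positive.

```python
# Why matching engines pair Dupond and Dupont: one edit apart
# and phonetically identical under American Soundex.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def soundex(name: str) -> str:
    """American Soundex: first letter plus three digit codes."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    result, last = name[0], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            result += code
        if ch not in "HW":
            last = code
    return (result + "000")[:4]

print(levenshtein("Dupond", "Dupont"))       # 1
print(soundex("Dupond"), soundex("Dupont"))  # D153 D153
```

Fuzzy matching rules like these are exactly why survivorship logic needs a human eye, or extra evidence, before merging two such records.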

And hey, today I am going to Belgium, the home country of these two guys’ creator, to be at the Belgian Data Quality Association congress tomorrow.
