The Trouble with Data Quality Dimensions

Data Quality Dimensions

Data quality dimensions are among the most used terms when explaining why data quality is important, what data quality issues may look like and how data quality can be measured. Ironically, we sometimes use the same data quality dimension term for two different things, or two different data quality dimension terms for the same thing. Some of the troubling terms are:

Validity / Conformity – same same but different

Validity is most often used to describe whether the data filled into a data field obeys a required format or is among a list of accepted values. Databases are usually good at enforcing this, for example ensuring that an entered date follows the required day-month-year sequence and is a real calendar date, or cross-checking data values against another table to see if the value exists there.

The problems arise when data is moved between databases with different rules, and when data is captured in textual form before being loaded into a database.

Conformity is often used to describe whether data adheres to a given standard, like an industry or international standard. Due to complexity and other circumstances, that standard may not, or only partly, be implemented as database constraints or by other means. Therefore, a given piece of data may seem to be a valid database value while still not being in compliance with the standard.

For example, the code value for a colour being “0,255,0” may be in the accepted format, with all elements in the accepted range between 0 and 255 for an RGB colour code. But the standard for a given product colour may only allow the value “Green” and the other common colour names, and “0,255,0” will, when translated, end up as “Lime” or “High green”.
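
To make the distinction concrete, here is a minimal sketch in Python. The RGB pattern check stands in for a database-level validity rule, while the list of allowed colour names stands in for a product colour standard; both the pattern and the ALLOWED_COLOURS list are illustrative assumptions, not taken from any actual standard.

```python
import re

# Validity: the value matches the technically accepted "r,g,b" format
# with each element in the range 0-255.
RGB_PATTERN = re.compile(r"^(\d{1,3}),(\d{1,3}),(\d{1,3})$")

def is_valid_rgb(value: str) -> bool:
    match = RGB_PATTERN.match(value)
    return bool(match) and all(0 <= int(part) <= 255 for part in match.groups())

# Conformity: the (hypothetical) product colour standard only allows
# a fixed set of colour names, not raw RGB codes.
ALLOWED_COLOURS = {"Green", "Red", "Blue", "Yellow", "Black", "White"}

def conforms_to_colour_standard(value: str) -> bool:
    return value in ALLOWED_COLOURS

print(is_valid_rgb("0,255,0"))                 # True  - a valid RGB value
print(conforms_to_colour_standard("0,255,0"))  # False - not allowed by the standard
print(conforms_to_colour_standard("Green"))    # True
```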

Accuracy / Precision – true, false or not sure

The difference between accuracy and precision is a well-known statistical subject.

In the data quality realm, accuracy is most often used to describe whether the data value corresponds correctly to a real-world entity. If we for example have a postal address for the person “Robert Smith” being “123 Main Street in Anytown”, this data value may be accurate because this person (for the moment) lives at that address.

But if “123 Main Street in Anytown” has three different apartments, each with its own mailbox, the value does not, for a given purpose, have the required precision.

If we work with geocoordinates we have the same challenge. A given accurate geocode may have sufficient precision to tell what the direction to the nearest supermarket is, but not be precise enough to know in which apartment the out-of-milk smart refrigerator is.
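
A rough way to reason about geocode precision is the number of decimal places in the coordinates: each extra decimal narrows the position by roughly a factor of ten, from around 11 kilometres at one decimal down to around a metre at five decimals. The sketch below only illustrates that rule of thumb; the ~111 km per degree figure is an approximation and the example coordinates are made up.

```python
def approx_resolution_metres(coordinate: str) -> float:
    """Approximate positional resolution implied by the number of decimal
    places in a latitude/longitude string (rule of thumb: one degree of
    latitude is roughly 111 km)."""
    decimals = len(coordinate.split(".")[1]) if "." in coordinate else 0
    return 111_000 / (10 ** decimals)

# "55.7" is accurate enough to point towards the nearest supermarket (~11 km),
# but far too coarse to single out an apartment (where ~1 m would be needed).
print(approx_resolution_metres("55.7"))      # 11100.0
print(approx_resolution_metres("55.67611"))  # 1.11
```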

Timeliness / Currency – when time matters

Timeliness is most often used to state whether a given data value is available when it is needed. For example, you need the postal address of “Robert Smith” when you want to send a paper invoice or when you want to establish his demographic stereotype for a campaign.

Currency is most often used to state whether the data value is accurate at a given point in time – for example, whether “123 Main Street in Anytown” is the current postal address of “Robert Smith”.
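
The two can be expressed as two different checks on the same record. In the sketch below, timeliness asks whether the address is there at the moment we need it, while currency asks whether the value is still the accurate one according to its validity period. The record layout and field names are assumptions made for the illustration.

```python
from datetime import date

# A hypothetical customer record with a validity period on the address.
record = {
    "name": "Robert Smith",
    "postal_address": "123 Main Street, Anytown",
    "address_valid_from": date(2015, 3, 1),
    "address_valid_to": None,  # None = no known end of validity
}

def is_timely(rec: dict, needed_field: str) -> bool:
    """Timeliness: is the value present when we need it, e.g. for an invoice run?"""
    return rec.get(needed_field) not in (None, "")

def is_current(rec: dict, as_of: date) -> bool:
    """Currency: is the address the accurate value at the given point in time?"""
    starts_ok = rec["address_valid_from"] <= as_of
    ends_ok = rec["address_valid_to"] is None or as_of <= rec["address_valid_to"]
    return starts_ok and ends_ok

print(is_timely(record, "postal_address"))  # True
print(is_current(record, date.today()))     # True, given the assumed validity period
```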

Uniqueness / Duplication – positive or negative

Uniqueness is the positive term, while duplication is the negative term, for the same issue.

We strive for uniqueness by avoiding duplicates. In data quality lingo, duplicates are two (or more) data records describing the same real-world entity, as sketched in the matching example below. For example, we may assume that

  • “Robert Smith at 123 Main Street, Suite 2 in Anytown”

is the same person as

  • “Bob Smith at 123 Main Str in Anytown”
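
A very small sketch of how such a pair might be flagged as a candidate duplicate: normalise the obvious variations (nicknames, street abbreviations, filler words) and compare the results. The nickname and abbreviation tables are tiny illustrative assumptions; real matching engines use far richer reference data and fuzzy similarity scoring.

```python
NICKNAMES = {"bob": "robert", "rob": "robert"}  # assumed nickname table
ABBREVIATIONS = {"str": "street"}               # assumed abbreviation table

def normalise(record: str) -> str:
    """Lower-case, expand nicknames and abbreviations, drop filler words."""
    tokens = record.lower().replace(",", " ").split()
    tokens = [NICKNAMES.get(t, ABBREVIATIONS.get(t, t)) for t in tokens]
    return " ".join(t for t in tokens if t not in {"at", "in", "suite"})

a = normalise("Robert Smith at 123 Main Street, Suite 2 in Anytown")
b = normalise("Bob Smith at 123 Main Str in Anytown")

# Compare as token sets so the extra suite number in one record
# does not block the match.
print(set(b.split()) <= set(a.split()))  # True - candidate duplicate
```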

Completeness / Existence – to be, or not to be

Completeness is most often used to tell to what degree all required data elements are populated.

Existence can be used to tell whether a given dataset has all the data elements needed for a given purpose defined at all.

So “Bob Smith at 123 Main Str in Anytown” is complete if we need name, street address and city, but only 75 % complete if we need name, street address, city and preferred colour, and preferred colour exists as a data element in the dataset.
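
The 75 % figure can be computed directly as the share of required data elements that are populated, provided those elements exist in the dataset at all. A minimal sketch, with an assumed record layout:

```python
# Hypothetical record as captured, with preferred colour defined but not filled in.
record = {
    "name": "Bob Smith",
    "street_address": "123 Main Str",
    "city": "Anytown",
    "preferred_colour": None,
}

def completeness(rec: dict, required: list) -> float:
    """Share of required data elements that are populated.
    Fails loudly if a required element does not even exist in the dataset."""
    undefined = [f for f in required if f not in rec]
    if undefined:
        raise KeyError(f"Data elements not defined in dataset: {undefined}")
    populated = sum(1 for f in required if rec[f] not in (None, ""))
    return populated / len(required)

print(completeness(record, ["name", "street_address", "city"]))                      # 1.0
print(completeness(record, ["name", "street_address", "city", "preferred_colour"]))  # 0.75
```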


Solving GDPR Issues Using a Data Lake Approach

Some of the hot topics on the agenda today are the EU General Data Protection Regulation (GDPR) and the data lake concept. These are also hot topics for me, as GDPR is high on the agenda in doing MDM (and currently TDM – Test Data Management) consultancy, and the data lake approach is the basic concept in my Product Data Lake venture.

In my eyes the data lake concept can be used for a lot of business challenges. One of them was highlighted in a CIO article called Informatica brings AI to GDPR compliance, data governance. In it, Informatica CEO Anil Chakravarthy tells how a new tool, Informatica’s Compliance Data Lake, can help organisations get a grasp on where the data elements relevant to GDPR compliance reside in the IT landscape. This is a task very close to me in a current engagement.

The Informatica compliance tool is built on Informatica’s Intelligent Data Lake, which was touched upon in the post Multi-Domain MDM 360 and an Intelligent Data Lake.

Spectre vs James Bond and the Unique Product Identifier

The latest James Bond movie is out. It is called Spectre. Spectre is the name of a criminal organization.

In the movie, “Bond, James Bond” alias 007 – and in this case Mickey Mouse – sneaks into a Spectre meeting. At that meeting the Spectre folks report how they maliciously earn money. One way is selling falsified medicine.

Of course Bond hits Spectre hard during the movie. And where Bond doesn’t hit all the villains, data management will do so when it comes to falsified medicine.

The method is using a unique product identifier.

Usually in master data management we describe a product down to the level of unique characteristics, also called a Stock Keeping Unit (SKU). In the pharmaceutical world that will typically be a brand name, a concentration of active substances, a dosage type, a pack size and possibly a destination country.

From the electronics and machinery sectors, we know the approach of assigning each physical instance of the product a serial number. The same approach is becoming mandatory for medicine in more and more countries. The pharmaceutical manufacturers will assign a unique number to every package (and sometimes also shipping boxes) and report those to the health care authorities around the world. At the point of delivery, it is checked that the identifier matches an original product instance.

The identifier is formed by a product identifier – a Global Trade Item Number (GTIN) or a National Drug Code (NDC) – plus a randomly assigned serial number, making it hard to guess the serial number part.
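
In simplified terms, the serialised identifier is the product code plus an unpredictable serial, and verification at the point of delivery is a lookup against the identifiers the manufacturer reported. The sketch below only illustrates that idea; it does not follow the exact encoding rules of any serialisation standard, and the GTIN shown is made up.

```python
import secrets

def serialised_identifier(gtin: str, serial_length: int = 12) -> str:
    """Combine a product-level GTIN with a randomly assigned serial number.
    The randomness makes unused serial numbers hard to guess."""
    serial = "".join(secrets.choice("0123456789") for _ in range(serial_length))
    return f"{gtin}.{serial}"

# Manufacturer side: assign identifiers to each package and report them.
gtin = "05712345678900"  # illustrative number, not a real product code
reported = {serialised_identifier(gtin) for _ in range(1000)}

# Point of delivery: the scanned identifier must match a reported, original instance.
scanned = next(iter(reported))
print(scanned in reported)                 # True  - genuine package
print(f"{gtin}.000000000001" in reported)  # almost certainly False - suspect package
```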

The Taxman: Data Quality’s Best Friend

Collection of taxes has always been a main driver for having registries and means of identifying people, companies and properties.

5,000 years ago the Egyptians made the first known census in order to effectively collect taxes.

As reported on the Data Value Talk blog, the Netherlands have had 200 years of family names thanks to Napoleon and the higher cause of collecting taxes.

Today the taxman goes cross-border and wants to help with international data quality, as examined in the post Know Your Foreign Customer. The US FATCA regulation is about collecting taxes from activities abroad, and as said on the Trillium blog: Data Quality is The Core Enabler for FATCA Compliance.

My guess is that this is only the beginning of a tax-based opportunity for having better data quality in relation to international data.

In a tax agenda for the European Union it is said: “As more citizens and companies today work and operate across the EU’s borders, cooperation on taxation has become increasingly important.”.

The EU has a program called FISCALIS in the making. Soon we will not only have to identify Americans doing something abroad, but practically everyone taking part in globalization.

For that we all need comprehensive accessibility to the wealth of global reference data through “cutting-edge IT systems” (a FISCALIS choice of wording).

I am working on that right now.


Real World Identity

How far do you have to go when checking your customer’s identity?

This morning I read an article in the Danish Computerworld about a ferry line now dropping a solution for checking whether the passenger using an access card is in fact the paying customer, by using a lightweight fingerprint stored on the card. The reason for dropping it was, by the way, the cost of upgrading the solution compared to the future business value, and not any renewed privacy concerns.

I have been involved in some balancing of real-world alignment versus fitness for use and privacy in public transport as well, as described in the post Real World Alignment. There the question was about using a national identification number when registering customers in public transportation.

As citizens of the world we are today used to sometimes having our iris scanned when flying, as our passports hold our unique identification that way. Some of the considerations around using biometrics in general public registration were discussed in the post Citizen ID and Biometrics.

In my eyes, or should we say iris, there is no doubt that we will meet an increasing demand for confirming and registering our identity. Doing that in the fight against terrorism has been around for long. Regulatory compliance will add to that trend, as told in the post Know Your Foreign Customer, which mentions the consequences of the FATCA regulation and other regulations.

When talking about identity resolution in the data quality realm, we usually deal with strings of text such as names, addresses, phone numbers and national identification numbers – things that reflect the real world, but aren’t the real world.

We will however probably adopt more facial recognition, as examined in the post The New Face of Data Matching. We do have access to pictures in the cloud, as you may find your B2C customer’s picture on Facebook and your B2B customer contact’s picture on LinkedIn or other similar services. It’s still not the real world itself, but a bit closer than a text string. And of course the picture could be false or outdated and thus more suitable for traction on a dating site.

Fingerprints are maybe a bit old fashioned, but as said, more and more biometric passports are issued, and the technology for iris and retinal scanning is used for access control, even on mobile devices.

In the story starting this post, the business value of reinvesting in a biometric solution wasn’t deemed positive. But looking from the prints on my fingers down to my hand lines, I foresee more identity resolution going beyond name and address strings into things closer to the real world, such as facial recognition and biometrics.


Know Your Foreign Customer

I’m not saying that Customer Master Data Management is easy. But if we compare the capabilities within most companies for handling domestic customer records, they are often stellar compared to the capabilities for handling foreign customer records.

It’s not that the knowledge, services and tools don’t exist. If you for example are headquartered in the USA, you will typically use the best practices and services available there for domestic records. If you are headquartered in France, you will use the best practices and services available there for domestic records. Using the best practices and services for foreign (seen from where you are) records is more seldom, and if done, it is often done outside enterprise-wide data management.

This situation can’t, and will not, continue to exist. With globalization running at full speed and more and more enterprise-wide data management programs being launched, we will need best practices and services embracing worldwide customer records.

New regulatory compliance will also add to this trend. Taking effect next year, the US Foreign Account Tax Compliance Act (FATCA) will urge both US companies and foreign financial institutions to better know their foreign customers and other business partners.

In doing that, you have to know about addresses, business directories and consumer/citizen hubs for an often large range of countries as described in the post The Big ABC of Reference Data.

It may seem a daunting task for each enterprise to be able to embrace big reference data for all the countries where you have customers and other business partners.

My guess – well, actually plan – is that there will be cloud-based services helping with that, as indicated in the post Partnerships for the Cloud.


Donkey Business

When I started focusing on data quality technology 15 years ago, I had great expectations about the spread of data quality tools, including the humble one I was fabricating myself.

Even if you tell me that tools haven’t spread because people are more important than technology, I think most people in the data and information quality realm think that the data and information quality cause hasn’t spread as much as it deserves.

Fortunately, it seems that the interest in solving data quality issues is getting traction these days. I have noticed two main drivers for that. If we compare with the traditional means of getting a donkey to move forward, one encouragement is like the carrot and the other is like the stick:

  • The carrot is business intelligence
  • The stick is compliance

With business intelligence, a lot has been said and written about how business intelligence doesn’t deliver unless the intelligence is built on a solid, valid data foundation. As a result, I have noticed that I’m being involved in data quality improvement initiatives aimed at providing a foundation for delivering business decisions. One of my favorite data quality bloggers, Jim Harris, has turned that carrot a lot on his blog: Obsessive Compulsive Data Quality.

Another favorite data quality blogger, Ken O’Conner, has written about the stick, being compliance work, on his blog, where you will find a lot of good points that Ken has learned from his extensive involvement in regulatory requirement issues.

These are interesting times with a lot of requirements for solving data quality issues. As we all know, the stereotypical donkey is not easily driven forward, and we must be careful not to make the burden too heavy.


Bon Appetit

If I enjoy a restaurant meal, it is basically unimportant to me which raw ingredients from where were used and which tools the chef used while preparing the meal. My concerns are whether the taste meets my expectations, the plate looks delicious in my eyes, the waiter seems nice and so on.

This is comparable to when we talk about information quality. The raw data quality and the tools available for exposing the data as tasty information in a given context are basically not important to the information consumer.

But in our daily work, you and I may be the information chef. In that position we have to be very much concerned about the raw data quality and the tools available for what may be similar to rinsing, slicing, mixing and boiling food.

Let’s look at some analogies.

Best before

Fresh raw ingredients are similar to up-to-date raw data. Raw data also has a best-before date, depending on the nature of the data. Raw data older than that date may be spiced up, but will eventually make bad-tasting information.
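
Carrying the analogy into code: a simple freshness check compares the age of a record against a best-before period chosen for that kind of data. The periods below are illustrative assumptions; sensible values depend entirely on the nature of the data.

```python
from datetime import date, timedelta
from typing import Optional

# Assumed best-before periods per kind of raw data.
BEST_BEFORE = {
    "stock_level": timedelta(days=1),
    "postal_address": timedelta(days=365),
}

def is_fresh(kind: str, captured_on: date, today: Optional[date] = None) -> bool:
    """True if the data is still within its best-before period."""
    today = today or date.today()
    return today - captured_on <= BEST_BEFORE[kind]

print(is_fresh("stock_level", date.today() - timedelta(days=3)))      # False - stale
print(is_fresh("postal_address", date.today() - timedelta(days=30)))  # True
```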

One-stop-shopping

Buying all your raw ingredients and tools for preparing food – or taking the shortcut with ready-made, cookie-cutter stuff – from a huge supermarket is fast and easy (never mind that the basket usually also ends up filled with a lot of other products not on the shopping list).

A good chef always selects the raw ingredients from the best specialized suppliers and uses what he considers the most professional tools in the preparation process.

Making information from raw data has the same options.

Compliance

Governments around the world have for a long time implemented regulations and inspections regarding food, mainly focused on receiving, handling and storing raw ingredients.

The same is now going on regarding data. Regulations and inspections will naturally be directed at data as it is originated, stored and handled.

Diversity

Have you ever tried to prepare your favorite national meal in a foreign country?

Many times this is not straightforward. Some raw ingredients are simply not available and even some tools may not be among the kitchen equipment.

When making information from raw data under varying international conditions you often face the same kind of challenges.

Master Data Audit

In the recent cycling highlight of the year, “Le Tour de France”, one of the leading teams was “Team Saxo Bank”. The name should actually have been “Team Saxo Bank IT Factory”. But “IT Factory” is gone.

IT Factory was in recent years a comet in the Danish IT industry, with fast-increasing turnover and revenues verified by leading auditors. Only a few people, led by a (now) well-known blogger, asked about the customer base. On 1st December 2008 it all blew up, and it was revealed that 99% of the turnover was a fairy tale. More details on wiki.

If the auditors had spent 10 minutes (or so) on the Master Data besides looking at the Transaction Data making up the Financial Statements, they would have found the mismatch between the customer base (and linked products) and the real world – and several banks and others would not have lost a lot of money.

Master Data Management is first of all a benefit to the organisation holding the data. But as shown in the example above, it is, like with financial statements, also of interest to the surrounding world that the Master Data has reasonable data quality and alignment with the real world. Often financial statements are followed by market and other assessments built on the Master Data of the organisation.

Without comparing it to the IT Factory case, I remember another case from Denmark this year where Telia, a leading telco in the Nordics, in addition to the Financial Statement said that they had 44,000 more customers in the database during the year. Asked how this was counted, the answer revealed that it actually was 44,000 more active SIM cards. So it could have been 1 new customer with 1 SIM card and 43,999 existing customers getting more SIM cards. Link in Danish here.
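
The counting difference is easy to show: the number of active SIM cards can grow without a single new customer being added. A minimal sketch, with assumed field names:

```python
# Hypothetical subscription records: one row per active SIM card.
subscriptions = [
    {"customer_id": "C001", "sim_card": "SIM-1001"},
    {"customer_id": "C001", "sim_card": "SIM-1002"},  # same customer, extra SIM
    {"customer_id": "C002", "sim_card": "SIM-2001"},
]

active_sim_cards = len(subscriptions)
distinct_customers = len({row["customer_id"] for row in subscriptions})

print(active_sim_cards)    # 3 - what was actually counted
print(distinct_customers)  # 2 - what "customers" should mean
```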

We already know SOX and EuroSOX as compliance approaches to financial statements, and Basel II also affects the data quality and real-world alignment of Master Data in banking. My guess is that we will see more focus on the data quality and real-world alignment of Master Data from outside the organisation, adding to the ongoing awareness of the subject already existing inside many organisations.
