Standardization – Liliendahl on Data Quality

The Trouble with Data Quality Dimensions

8th August 2019 | 11th August 2019 | Henrik Gabs Liliendahl

Data Quality Dimensions

Data quality dimensions are some of the most used terms when explaining why data quality is important, what data quality issues can be and how you can measure data quality. Ironically, we sometimes use the same data quality dimension term for two different things or use two different data quality dimension terms for the same thing. Some of the troubling terms are:

Validity / Conformity – same same but different

Validity is most often used to describe whether data entered in a data field obeys a required format or is among a list of accepted values. Databases are usually good at this, for example by ensuring that an entered date has the required day-month-year sequence and is a real calendar date, or by cross-checking a data value against another table to see if the value exists there.

The problems arise when data is moved between databases with different rules and when data is captured in textual forms before being loaded into a database.

Conformity is often used to describe whether data adheres to a given standard, like an industry or international standard. Due to complexity and other circumstances, this standard may not be implemented, or may only partly be implemented, as database constraints or by other means. Therefore, a given piece of data may seem to be a valid database value without being in compliance with a given standard.

For example, the code value “0,255,0” for a colour may be in the accepted format, with all elements in the accepted range between 0 and 255 for an RGB colour code. But the standard for a given product colour may only allow the value “Green” and other common colour names, and “0,255,0” will, when translated, end up as “Lime” or “High green”.
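As a minimal sketch of that difference, the two checks can be written as two separate functions. The allow-list of colour names below is a made-up stand-in for a product colour standard, not taken from any real one, and the function names are only illustrative.

```python
import re

# Hypothetical allow-list standing in for a product colour standard.
ALLOWED_COLOUR_NAMES = {"Red", "Green", "Blue", "Black", "White", "Yellow"}

# Three groups of one to three digits separated by commas, e.g. "0,255,0".
RGB_PATTERN = re.compile(r"^(\d{1,3}),(\d{1,3}),(\d{1,3})$")


def is_valid_rgb(value: str) -> bool:
    """Validity: the value has the R,G,B format and each part is 0-255."""
    match = RGB_PATTERN.match(value)
    if match is None:
        return False
    return all(0 <= int(part) <= 255 for part in match.groups())


def conforms_to_colour_standard(value: str) -> bool:
    """Conformity: the value is one of the names the standard allows."""
    return value in ALLOWED_COLOUR_NAMES


for value in ("0,255,0", "Green", "0,300,0"):
    print(value, is_valid_rgb(value), conforms_to_colour_standard(value))
```

Here “0,255,0” passes the validity check but fails the conformity check, while “Green” does the opposite, which is exactly why the two terms should not be used interchangeably.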

Accuracy / Precision – true, false or not sure

The difference between accuracy and precision is a well-known statistical subject.

In the data quality realm accuracy is most often used to describe if the data value corresponds correctly to a real-world entity. If we for example have a postal address of the person “Robert Smith” being “123 Main Street in Anytown” this data value may be accurate because this person (for the moment) lives at that address.

But if “123 Main Street in Anytown” has 3 different apartments each having its own mailbox, the value does not, for a given purpose, have the required precision.

If we work with geocoordinates we have the same challenge. A given accurate geocode may be precise enough to give directions to the nearest supermarket, but not precise enough to tell in which apartment the out-of-milk smart refrigerator is.

Timeliness / Currency – when time matters

Timeliness is most often used to state if a given data value is present when it is needed. For example, you need the postal address of “Robert Smith” when you want to send a paper invoice or when you want to establish his demographic stereotype for a campaign.

Currency is most often used to state if the data value is accurate at a given time – for example if “123 Main Street in Anytown” is the current postal address of “Robert Smith”.

Uniqueness / Duplication – positive or negative

Uniqueness is the positive term and duplication is the negative term for the same issue.

We strive to have uniqueness by avoiding duplicates. In data quality lingo duplicates are two (or more) data values describing the same real-world entity. For example, we may assume that

“Robert Smith at 123 Main Street, Suite 2 in Anytown”

is the same person as

“Bob Smith at 123 Main Str in Anytown”
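A rough sketch of how such an assumption could be made by a program is shown here. The nickname table, the street abbreviation table and the decision to ignore the suite number are all illustrative assumptions; real data matching uses far larger reference tables and fuzzier comparison.

```python
import re

# Illustrative lookup tables; a real matching engine would use far larger ones.
NICKNAMES = {"bob": "robert"}
STREET_ABBREVIATIONS = {"str": "street"}
NOISE_WORDS = {"at", "in"}
# Assumption: unit designators are ignored when comparing two records.
UNIT_PATTERN = re.compile(r"\b(suite|apt|unit)\s+\w+", re.IGNORECASE)


def normalise(record: str) -> frozenset[str]:
    """Reduce a free-text name and address to a set of comparable tokens."""
    text = UNIT_PATTERN.sub(" ", record.lower())
    cleaned = []
    for token in re.findall(r"[a-z0-9]+", text):
        if token in NOISE_WORDS:
            continue
        token = NICKNAMES.get(token, token)
        token = STREET_ABBREVIATIONS.get(token, token)
        cleaned.append(token)
    return frozenset(cleaned)


def probable_duplicates(a: str, b: str) -> bool:
    """True when both records normalise to the same token set."""
    return normalise(a) == normalise(b)


print(probable_duplicates(
    "Robert Smith at 123 Main Street, Suite 2 in Anytown",
    "Bob Smith at 123 Main Str in Anytown",
))
```

Note that dropping the suite number is a trade-off against precision: it helps find the duplicate, but it would also merge two different Robert Smiths living in different suites.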

Completeness / Existence – to be, or not to be

Completeness is most often used to tell to what degree all required data elements are populated.

Existence can be used to tell if a given dataset has all the needed data elements for a given purpose defined.

So “Bob Smith at 123 Main Str in Anytown” is complete if we need name, street address and city, but only 75 % complete if we need name, street address, city and preferred colour and preferred colour is an existent data element in the dataset.
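The arithmetic behind that example can be sketched in a few lines. The field names are made up for the illustration, and an element counts as populated when it is neither missing nor empty.

```python
def exists(record: dict, element: str) -> bool:
    """Existence: the data element is defined in the record at all."""
    return element in record


def completeness(record: dict, required: list[str]) -> float:
    """Completeness: share of required elements that are populated."""
    populated = sum(1 for element in required if record.get(element) not in (None, ""))
    return populated / len(required)


bob = {
    "name": "Bob Smith",
    "street_address": "123 Main Str",
    "city": "Anytown",
    "preferred_colour": None,  # the element exists but holds no value
}

print(completeness(bob, ["name", "street_address", "city"]))
print(completeness(bob, ["name", "street_address", "city", "preferred_colour"]))
print(exists(bob, "preferred_colour"))
```

The first call gives 1.0 and the second gives 0.75, matching the 75 % in the example, while the existence check is true in both cases.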

More on data quality dimensions:

Learn which dimensions are trending in the post Data Quality Dimensions in Motion.
Read about which dimensions are the top ones for product data in the post 5 Vital Product Data Quality Dimensions.
Explore the relationship between dimensions and real-world entities in the post Data Quality Dimensions and Real World Alignment.

Embracing Standards versus Imposing Standards

8th March 2018 | 11th September 2018 | Henrik Gabs Liliendahl

When working with Product Information Management (PIM) and the recurring challenges in exchanging product information between trading partners, the idea of everyone adhering to the same standard is tempting.

This idea is also governing the many product data pools around. However, there are some serious considerations against this idea, namely:

Being on the same standard, not to mention the same version, within your business ecosystem is quite utopian (achieving that within your own organization is hard enough).
It is not desirable to have the same product information as your competitors if you are going to compete on other factors than price.

In my eyes it is a better idea to forget about imposing a rigid standard on everyone and instead embrace the many available standards for product information, where your organization utilizes those that are best for you at a given time and your various trading partners utilize those that are best for them at a given time.

The solution for that is Product Data Lake.

How to Combine eClass and ETIM

21st November 2017 | Henrik Gabs Liliendahl

eClass and ETIM are two different standards for product information.

eCl@ss is a cross-industry product data standard for classification and description of products and services, with emphasis on being an ISO/IEC-compliant industry standard nationally and internationally. The classification guides the eCl@ss standard for product attributes (in eClass called properties) that are needed for a product with a given classification.

ETIM develops and manages a worldwide uniform classification for technical products. This classification guides the ETIM standard for product attributes (in ETIM called features) that are needed for a product with a given classification.

It is worth noting that these two standards are much more elaborate than, for example, the well-known classification system UNSPSC, as UNSPSC only classifies products but does not tell which attributes (and with what standards) you need to specify a product in detail.

There is a cooperation between eClass and ETIM, which means that you can map between the two standards. However, it will not usually make sense for one organization to try to use both standards at the same time.

What does make sense is combining the two standards when there are two trading partners where one uses one of these standards and the other uses the other. The place to make the combination is within Product Data Lake, the new service for exchanging product information between manufacturers and merchants. Here, trading partners can make a:

Product Data Push with one standard, and a
Product Data Pull with the other standard

What is in a business directory?

20th July 2017 | Henrik Gabs Liliendahl

When working with Party Master Data Management one approach to ensure accuracy, completeness and other data quality dimensions is to onboard new business-to-business (B2B) entities and enrich such current entities via a business directory.

While this could seem to be a straightforward mechanism, unfortunately it is usually not that easy peasy.

Let us take an example featuring the most widely used business directory around the world: The Dun & Bradstreet Worldbase. And let us take my latest registered company: Product Data Lake.

PDL at DnB

On this screen showing the basic data elements, there are a few obstacles:

The address is not formatted well
The country code system is not a widely used one
The industry sector code system shown is one among others

Address Formatting

In our address D&B has put the word “sal”, which is Danish for floor. This is not incorrect, but addresses in Denmark are usually not written with that word, as the number following a house number in the addressing standard is the floor.

Country Codes

D&B has their own 3-digit country code. You may convert to the more widely used ISO 2-character country code. I do however remember a lot of fun from my data matching days when dealing with United Kingdom where D&B uses 4 different codes for England, Wales, Scotland and Northern Ireland as well as mapping back and forth with United States and Puerto Rico. Had to be made very despacito.

Industry Sector Codes

The screen shows a SIC code: 7374 = Computer Processing and Data Preparation and Processing Services

This must have been converted from the NACE code by which the company has been registered: 63.11:(00) = Data processing, hosting and related activities.

The two codes do by the way correspond to the NAICS Code 518210 = Data processing, hosting and related activities.
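To illustrate what converting between such code systems involves, here is a small sketch of a crosswalk lookup. It contains only the three codes quoted above; the table layout and the `convert` function are assumptions made for the illustration, and real concordance tables are much larger and often many-to-many.

```python
from typing import Optional

# Crosswalk built only from the three codes quoted in this post.
CROSSWALK = [
    {"SIC": "7374", "NACE": "63.11", "NAICS": "518210"},
]


def convert(code: str, source: str, target: str) -> Optional[str]:
    """Return the target-system code for a source-system code, if known."""
    for row in CROSSWALK:
        if row[source] == code:
            return row[target]
    return None  # no mapping known for this code


print(convert("63.11", "NACE", "SIC"))
print(convert("7374", "SIC", "NAICS"))
print(convert("9999", "SIC", "NACE"))
```

Returning `None` for an unknown code, rather than guessing, keeps the gaps in the reference data visible to whoever consumes the result.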

The challenges in embracing the many standards for reference data were examined in the post The World of Reference Data.

The Problem with English

14th June 2017 | Henrik Gabs Liliendahl

– and many other languages

This blog is in English. However, as a citizen in a country where English is not the first language, I have a problem with English. Which flavour or flavor of English should I use? US English? British English? Or any of the many other kinds of English?

It is, in that context, more a theoretical question than a practical one. Despite what Grammar Nazis might think, I guess everyone understands the meaning in my blend of English variants and occasional other spelling mistakes.

The variants of English, spiced up with other cultural and administrative differences, do however create real data quality issues as told in the post Cultured Freshwater Pearls of Wisdom.

When working with Product Data Lake, a service for sharing product information between trading partners, we also need to embrace languages. In doing that we cannot just pick English. We must make it possible to pick any combination of English and country where English is (one of) the official language(s). The same goes for Spanish, German, French, Portuguese, Russian and many other languages to the extent that products can be named and described with different spelling (in a given alphabet or script type).

You must always choose between standardization and standardisation.

Product Information Sharing Issue No 2: No Viable Standard

8th June 2017 | Henrik Gabs Liliendahl

A current poll on sharing product information with trading partners running on this blog has this question: As a manufacturer: What is Your Toughest Product Information Sharing Issue?

Some votes in the current standings have gone to this answer:

There is no viable industry standard for our kind of products

Indeed, having a standard that all your trading partners use too would be Utopia.

This is however not the situation for most participants in supply chains. There are many standards out there, but each is applicable to a certain group of products, geography or purpose, as explained in the post Five Product Classification Standards.

At Product Data Lake we embrace all these standards. If you use the same standard in the same version as your trading partner, linking and transformation are easy. If you do not, you can use Product Data Lake to link and transform from your way to the way your trading partners handle product information. Learn more at Product Data Lake Documentation and Data Governance.

Attribute Types — The tagging scheme used in Product Data Lake attributes (metadata)

Five Product Classification Standards

27th April 2017 | 30th April 2017 | Henrik Gabs Liliendahl

When working with Product Master Data Management (MDM) and Product Information Management (PIM) one important facet is classification of products. You can use your own internal classification(s), being product grouping and hierarchy management, within your organization and/or you can use one or several external classification standards.

Five External Standards

Some of the external standards I have come across are:

UNSPSC

The United Nations Standard Products and Services Code® (UNSPSC®), managed by GS1 US™ for the UN Development Programme (UNDP), is an open, global, multi-sector standard for classification of products and services. This standard is often used in public tenders and at some marketplaces.

GPC

GS1 has created a separate standard classification named GPC (Global Product Classification) within its network synchronization called the Global Data Synchronization Network (GDSN).

Commodity Codes / Harmonized System (HS) Codes

Commodity codes, lately being worldwide harmonized and harmonised, represent the key classifier in international trade. They determine customs duties, import and export rules and restrictions as well as documentation requirements. National statistical bureaus may require these codes from businesses doing foreign trade.

eClass

ETIM

The Competition and The Neutral Hub

If you click on the links to some of these standards you may notice that they are actually competing against each other in the way they represent themselves.

At Product Data Lake we are the neutral hub in the middle of everyone. We cover everything from your internal grouping and tagging to any external standard. Our roadmap includes closer integration with the various external standards, embracing both product classification and product attribute requirements in multiple languages where provided. We do that with the aim of letting you exchange product information with your trading partners, who probably do the classification differently from you.

Data Born Companies and the Rest of Us

18th January 2017 | 28th January 2017 | Henrik Gabs Liliendahl

This post is a new feature here on this blog, being guest blogging by data management professionals from all over the world. First up is Harri Juntunen, Partner at Twinspark Consulting in Finland:

Data and clever use of data in business has had and will have significant impact on value creation in the next decade. That is beyond reasonable doubt. What is less clear is how this is going to happen. Before we answer the question, I think it is meaningful to make a conceptual distinction between data born companies and the rest of us.

Data born companies are companies that were conceived from data. Their business models are based on monetising clever use of data. They have organised everything from their customer service to operations to be capable of harnessing data to the maximum. Data and the capability to use data to create value is their core competency. These companies are the giants of data business: Google, Facebook, Amazon, Uber, Airbnb. The standard small talk topics in data professionals’ discussions.

However, most companies are not data born. Most companies were originally established to serve a different purpose. They were founded to serve physical needs and to actually maintain physical things, be it food, spare parts or factories. Obviously, all of these companies in e.g. manufacturing and maintenance of physical things need data to operate. Yet, these companies are not organised around the principles of data born companies and capabilities to harness data as the driving force of their businesses.

We hear a lot of stories and successful examples about how data born companies apply augmented intelligence and other latest technology achievements. Surely, technologies built around data are important. The key question to me is: what, in practice, is our capability to harness all of these opportunities in companies that are not data born?

In my daily practice I see Excel sheets floating around within and between companies. A lot of manual work is caused by unstandardised data, poor governance and bad data quality. Manual data work simply prevents companies from harnessing the capabilities created by data born companies. Yet, most companies follow the data born track without sufficient reflection. They adopt the latest technologies used by the data born companies. They repeat the same slogans: automation, advanced analytics, cognitive computing etc. And yet, they are not addressing the fundamental and mundane issues in their own capabilities to be able to do business and create value with data. Humans are doing the machine’s job.

Why? Many things relate to this, but data quality and standardization are still pressing problems in everyday practice in many companies. Let alone between companies. We can change this. The rest of us can be reborn from data just by taking a good look at our mundane data practices instead of aspiring to go for the next big thing.

P.S. The Google Brain team had an AMA on Reddit a while ago and they were asked “what do you think is underrated?”

The answer:

“Focus on getting high-quality data. “Quality” can translate to many things, e.g. thoughtfully chosen variables or reducing noise in measurements. Simple algorithms using higher-quality data will generally outperform the latest and greatest algorithms using lower-quality data.”

https://www.reddit.com/r/MachineLearning/comments/4w6tsv/ama_we_are_the_google_brain_team_wed_love_to/

About Harri Juntunen:

Harri is a seasoned data provocateur and ardent advocate of getting the basics right. Harri says: People and data first, technology will follow.

You can contact Harri here:

+358 50 306 9296

harri.juntunen@twinspark.fi

www.twinspark.fi

Approaches to Sharing Product Information in Business Ecosystems

17th October 2016 | 28th December 2016 | Henrik Gabs Liliendahl

One of the most promising aspects of digitalization is sharing information in business ecosystems. In the Master Data Management (MDM) realm, we will in my eyes see a dramatic increase in sharing product information between trading partners, as touched on in the post Data Quality 3.0 as a stepping-stone on the path to Industry 4.0.

Standardization (or standardisation)

A challenge in doing that is how we link the different ways of handling product information within each organization in business ecosystems. While everyone agrees that a common standard is the best answer, we must on the other hand accept that using a common standard for every kind of product and every piece of information needed is quite utopian. We do not even have a common, uniquely spelled term for it in English.

Also, we must foresee that one organization will mature at a different pace than another organisation in the same business ecosystem.

Product Data Lake

These observations are the reasons behind the launch of Product Data Lake. In Product Data Lake we encompass the use of (in prioritized order):

The same standard in the same version
The same standard in different versions
Different standards
No standards
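The four cases above can be sketched as a small decision function. The class and function names are hypothetical and not part of Product Data Lake, the version labels are made up, and treating a pair where only one partner uses a standard as "no standards" is my own simplifying assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StandardUse:
    """Which standard (if any) a trading partner tags its product data with."""
    name: Optional[str]      # e.g. "ETIM"; None means no standard is used
    version: Optional[str]   # version label; the values used below are made up


def linking_scenario(upstream: StandardUse, downstream: StandardUse) -> str:
    """Classify a pair of trading partners into one of the four cases."""
    if upstream.name is None or downstream.name is None:
        return "no standards"
    if upstream.name != downstream.name:
        return "different standards"
    if upstream.version != downstream.version:
        return "same standard, different versions"
    return "same standard, same version"


manufacturer = StandardUse("ETIM", "A")
merchant = StandardUse("eClass", "B")
print(linking_scenario(manufacturer, merchant))
```

The checks run from the hardest case to the easiest, so the last line is only reached when both the standard and the version are shared.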

In order to link the product information and the formats and structures at two trading partners, we support the following approaches:

Automation based on product information tagged with a standard as explained in the post Connecting Product Information.
Ambassadorship, which is a role taken by a product information professional, who collaborates with the upstream and downstream trading partner in linking the product information. Read more about becoming a Product Data Lake ambassador here.
Upstream responsibility. Here the upstream trading partner makes the linking in Product Data Lake.
Downstream responsibility. Here the downstream trading partner makes the linking in Product Data Lake.

Data Governance

Regardless of the mix of the above approaches, you will need a cross company data governance framework to control the standards used and the rules that apply to the exchange of product information with your trading partners. Product Data Lake has established a partnership with one of the most recommended authorities in data governance: Nicola Askham, the Data Governance Coach.

For a quick overview please have a look at the Cross Company Data Governance Framework.

Please request more information here.

Sharing Metadata

7th October 2016 | 8th October 2016 | Henrik Gabs Liliendahl

In short, metadata is data about data. Handling metadata is an important facet of data management, including data governance, data quality management and Master Data Management (MDM). When it comes to the new trends in data management, such as big data and handling data in data lakes, the importance of metadata management will in my eyes become even more obvious.

In a current venture (Product Data Lake) we are working on building in metadata management for business ecosystems, meaning that trading partners can share product information either using the same metadata or linking their different metadata.

Using international, national and industry standards for product information will be the perfect solution for sharing metadata within business ecosystems, and indeed this is the preferred option we support. However, there are many competing standards for product information and they come in evolving versions, so having everyone on the same page at the same time is quite utopian.

Add to that the fact that not everyone speaks English, and those who do may not even speak the same variant of English. Metadata originates and should exist in the languages that are used in trading partnerships.

In Product Data Lake we have started out with these principles:

Product attributes can be tagged with an attribute type telling about what standard (if any) in terms of product identification, product classification or product feature it adheres to. More about that in the post Connecting Product Information.
Attribute short and long descriptions can be represented in different languages.
Trading partners can link their product attributes and have visibility in the Product Data Lake of the standards and descriptions used in the different languages in which they exist.
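As a rough sketch of how these principles could be represented as a data structure, consider the classes below. They are hypothetical and do not reflect the actual Product Data Lake data model; the attribute keys, the language tags and the fallback rule are assumptions made for the illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProductAttribute:
    key: str
    # Attribute type: the standard (if any) the attribute adheres to.
    attribute_type: Optional[str] = None
    # Short descriptions keyed by language tag, e.g. "en-GB" or "da-DK".
    short_descriptions: dict[str, str] = field(default_factory=dict)

    def describe(self, language: str, fallback: str = "en-GB") -> Optional[str]:
        """Return the description in the requested language, else the fallback."""
        return self.short_descriptions.get(
            language, self.short_descriptions.get(fallback)
        )


@dataclass
class AttributeLink:
    """A link between one partner's attribute and the other partner's."""
    upstream: ProductAttribute
    downstream: ProductAttribute


colour = ProductAttribute(
    key="colour",
    attribute_type=None,  # not tagged with any standard in this example
    short_descriptions={"en-GB": "Colour", "en-US": "Color", "da-DK": "Farve"},
)
color_name = ProductAttribute(key="color_name", short_descriptions={"en-US": "Color name"})
link = AttributeLink(upstream=colour, downstream=color_name)

print(link.upstream.describe("en-US"), "->", link.downstream.describe("en-US"))
```

Keying descriptions by a combined language and country tag is what lets the same attribute carry both “Colour” and “Color” without one being treated as a spelling mistake.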

I will very much welcome your input to this quest and if you want to be involved please do not hesitate to be in touch with me here or on Xing, Viadeo or LinkedIn.
