The Good, the Better and the Best Kinds of Data Quality Technology

Looking back at my journey in data quality, you could say that I started out working with the good way of implementing data quality tools, then turned to some better ways and, until now at least, am working with the best way of implementing data quality technology.

That is not to say that the good old kind of tools are obsolete. They are just relieved of some of the repetitive hard work in cleaning up dirty data.

The good (old) kind of tools are data cleansing and data matching tools. These tools are good at finding errors in postal addresses, duplicate party records and other nasty stuff in master data. The bad thing about finding the flaws long after the bad master data has entered the databases is that corrections are often very hard to make once transactions have been related to these master data, and that, if you do not fix the root cause, you will have to repeat the exercise periodically. However, there are still reasons to use these tools, as reported in the post Top 5 Reasons for Downstream Cleansing.
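To make the data matching idea concrete, here is a minimal, illustrative sketch of how a duplicate party check might work, using simple fuzzy comparison of names and addresses. The records, weights and threshold are made up for the example; real matching tools use far more sophisticated parsing and standardization.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized similarity between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def find_duplicate_candidates(parties, threshold=0.75):
    """Compare every pair of party records and flag likely duplicates."""
    candidates = []
    for i in range(len(parties)):
        for j in range(i + 1, len(parties)):
            name_score = similarity(parties[i]["name"], parties[j]["name"])
            addr_score = similarity(parties[i]["address"], parties[j]["address"])
            score = 0.6 * name_score + 0.4 * addr_score  # weights are illustrative
            if score >= threshold:
                candidates.append((parties[i]["id"], parties[j]["id"], round(score, 2)))
    return candidates

parties = [
    {"id": 1, "name": "Acme Corporation", "address": "1 Main Street, Springfield"},
    {"id": 2, "name": "ACME Corp.", "address": "1 Main St, Springfield"},
    {"id": 3, "name": "Globex Inc.", "address": "42 Elm Road, Shelbyville"},
]
print(find_duplicate_candidates(parties))  # records 1 and 2 should be flagged as a pair
```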

The better way is real-time validation and correction at data entry where possible. Here a single data element or a range of data elements is checked when entered. For example, the address may be checked against reference data, the phone number may be checked for an adequate format for the country in question, or product master data may be checked for the right format and against a value list. The hard thing is to do it at all entry points. A possible approach is discussed in the post Service Oriented MDM.
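As a minimal sketch of what such entry-time checks could look like, assuming made-up format rules and a hypothetical value list for a product attribute:

```python
import re

# Illustrative per-country phone formats; a real setup would use maintained reference data
PHONE_PATTERNS = {
    "DK": re.compile(r"^\+45\d{8}$"),     # Denmark: +45 followed by 8 digits
    "GB": re.compile(r"^\+44\d{9,10}$"),  # United Kingdom
}

# Hypothetical value list for a product attribute
ALLOWED_COLOURS = {"red", "green", "blue", "black"}

def validate_entry(record: dict) -> list:
    """Return a list of issues found at entry time; an empty list means the record passes."""
    issues = []
    pattern = PHONE_PATTERNS.get(record.get("country", ""))
    if pattern is None:
        issues.append("No phone format rule for country: " + record.get("country", "<missing>"))
    elif not pattern.match(record.get("phone", "")):
        issues.append("Phone number does not match the format for " + record["country"])
    if record.get("colour", "").lower() not in ALLOWED_COLOURS:
        issues.append("Colour is not in the agreed value list")
    return issues

print(validate_entry({"country": "DK", "phone": "+4512345678", "colour": "Red"}))  # []
print(validate_entry({"country": "DK", "phone": "12345678", "colour": "pink"}))    # two issues
```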

The best tools emphasize assisting data capture and thus preventing data quality issues, while also making the data capture process more effective by connecting rather than collecting. Two such tools I have worked with are:

• IDQ™, a tool for mashing up internal party master data and 3rd party big reference data sources, as explained further in the post instant Single Customer View.

• Product Data Lake, a cloud service for sharing product data in the business ecosystems of manufacturers, distributors, merchants and end users of product information. This service is described in detail here.


Product Information Sharing Issue No 2: No Viable Standard

A current poll running on this blog about sharing product information with trading partners has this question: As a manufacturer: What is Your Toughest Product Information Sharing Issue?

Some votes in the current standing have gone to this answer:

There is no viable industry standard for our kind of products

Indeed, having a standard that all your trading partners use too would be Utopia.

This is, however, not the situation for most participants in supply chains. There are many standards out there, but each is applicable to a certain group of products, geography or purpose, as explained in the post Five Product Classification Standards.

At Product Data Lake we embrace all these standards. If you use the same standard in the same version as your trading partner, linking and transformation is easy. If you do not, you can use Product Data Lake to link and transform from your way to the way your trading partners handle product information. Learn more at Product Data Lake Documentation and Data Governance.
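As a minimal sketch of what such linking and transformation between two attribute conventions could look like (all attribute names, units and factors below are made up for illustration and are not the actual Product Data Lake schema):

```python
# Mapping from "our" attribute convention to the trading partner's convention
ATTRIBUTE_MAP = {
    # our attribute  ->  (partner attribute, conversion factor or None)
    "net_weight_kg":     ("NetWeightGram", 1000),
    "width_cm":          ("WidthMillimetre", 10),
    "colour":            ("Color", None),
}

def transform(record: dict) -> dict:
    """Translate a product record from our attribute names and units to the partner's."""
    out = {}
    for src_attr, (dst_attr, factor) in ATTRIBUTE_MAP.items():
        if src_attr in record:
            value = record[src_attr]
            out[dst_attr] = value * factor if factor is not None else value
    return out

print(transform({"net_weight_kg": 2.5, "width_cm": 45, "colour": "blue"}))
# {'NetWeightGram': 2500.0, 'WidthMillimetre': 450, 'Color': 'blue'}
```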

Attribute Types: the tagging scheme used in Product Data Lake attributes (metadata)

The Panama Papers and MDM

A keynote at the Master Data Management Summit Europe 2017 was presented by Mar Cabra, Editor, Data & Research Unit, International Consortium of Investigative Journalists (ICIJ). In her presentation Mar told how journalists explored huge amounts of data in order to reveal how rich and famous people used tax havens to avoid paying tax in their home countries.

Some of the technologies used in doing that are the same emerging technologies we are considering right now, and in some cases already using, within Master Data Management, for example graph databases.
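As a small illustration of why a graph fits this kind of exploration, here is a sketch that walks ownership and officer relations from a starting person to every reachable entity. The entities and relations are invented; the ICIJ work of course ran on far larger data and dedicated graph technology.

```python
from collections import defaultdict, deque

# Invented sample relations of the kind explored in the leak: who is an officer or
# shareholder of which entity, and where an entity is registered.
edges = [
    ("Person A", "officer_of", "Shell Co 1"),
    ("Shell Co 1", "shareholder_of", "Shell Co 2"),
    ("Shell Co 2", "registered_at", "Agent X"),
    ("Person B", "officer_of", "Shell Co 2"),
]

graph = defaultdict(list)
for source, relation, target in edges:
    graph[source].append((relation, target))

def reachable_entities(start):
    """Breadth-first walk returning every entity reachable from the starting node."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for _relation, target in graph[node]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen - {start}

print(reachable_entities("Person A"))  # {'Shell Co 1', 'Shell Co 2', 'Agent X'}
```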

What struck me the most, however, was the sharing approach that probably made all the difference in the impact achieved from revealing the core data in the Panama Papers. The journalists from Süddeutsche Zeitung who were originally offered the data did not keep the data to themselves but shared it with the community of journalists in the news media ecosystem around the world.

We can also make better use of master data, and gain better business impact, if we share feasible master data in wider business ecosystems, as explained in the piece on Master Data Share.

I am not saying all enterprises should share everything with their business ecosystems. For party master data there are indeed many limitations, a topic very much trending these days with the upcoming GDPR regulation. With product master data you should also consider what is best managed as your own story and what is best for your business in a sharing approach. These considerations are elaborated in the post Using Internal and External Product Information to Win.

Big Data Fitness

A man with one watch knows what time it is, but a man with two watches is never quite sure. This old saying could be modernized to: a person with one smart device knows the truth, but a person with two smart devices is never quite sure.

An example from my own life is measuring my daily steps in order to motivate me to be more fit. Currently I have two data streams coming in. One is managed by the app Google Fit and one is managed by the app S Health (from Samsung).

This morning, a snapshot taken at the same time looked like this:

Google Fit: (screenshot)

S Health: (screenshot)

So, how many steps did I take this morning? 2,047 or 2,413?

The steps are presented on the same device: a smartphone. They are, though, measured on two different devices. Google Fit data is measured on the smartphone itself, while S Health data is measured on a connected smartwatch. Therefore, I might not be wearing these devices in exactly the same way. For example, I am the kind of Luddite who does not bring the phone to the loo.

With the rise of the Internet of Things (IoT) and the expected intensive use of the big data streams coming from all kinds of smart devices, we will face heaps of similar cases, where we have two or more sets of data telling the same story in a different way.

A key to utilizing these data in the best fit way is to understand what and where these data come from. Knowing that is achieved through modern Master Data Management (MDM).
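A small sketch of what that could mean in practice: each reading carries provenance about the device that produced it, and master data about the devices drives a simple survivorship rule. The field names, the illustrative numbers and the rule itself are assumptions for the example.

```python
# Two step-count readings for the same morning, each tagged with its source device
readings = [
    {"steps": 2000, "source": "smartphone", "app": "Google Fit"},   # illustrative numbers
    {"steps": 2400, "source": "smartwatch", "app": "S Health"},
]

# Master data about the measuring devices gives the readings their context
device_master = {
    "smartphone": {"worn_continuously": False},  # left on the desk now and then
    "smartwatch": {"worn_continuously": True},
}

def pick_best(readings, device_master):
    """Naive survivorship rule: prefer the reading from the device worn continuously."""
    return max(readings, key=lambda r: device_master[r["source"]]["worn_continuously"])

print(pick_best(readings, device_master))  # the smartwatch reading wins under this rule
```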

At Product Data Lake we, in all humbleness, support that by sharing data about the product models for smart devices and, in the future, by sharing data about each device, as told in the post Adding Things to Product Data Lake.

1st Party, 2nd Party and 3rd Party Master Data

Until now, much of the methodology and technology in the Master Data Management (MDM) world has been about how to optimize the use of what can be called first party master data. This is master data already collected within your organization, and the approaches to MDM and the MDM solutions offered have revolved around federating internal silos and obtaining a single source of truth within the corporate walls.

Besides that, third party data has been around for many years, as described in the post Third-Party Data and MDM. Use of third party data in MDM has mainly been about enriching customer and supplier master data from business directories and, to some degree, utilizing standardized pools of product data in various solutions.

Using third party data for customer and supplier master data seems to be a very good idea, as exemplified in the post Using a Business Entity Identifier from Day One. This is because customer and supplier master data looks pretty much the same to every organization. With product master data this is not the case, and that is why third party sources for product master data may not be fully effective.

Second party data is data you get directly from the external source itself. With customer and supplier master data we see that approach in self-registration services. My recommendation is to combine self-registration and third party data in customer and supplier on-boarding processes. With product master data, I think leaning mostly on second party connections in business ecosystems seems like the best way forward. There is more on that in a discussion in the LinkedIn MDM – Master Data Management Group.
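As a minimal sketch of blending second party (self-registered) and third party (business directory) data at on-boarding, assuming made-up field names and a simple precedence rule:

```python
self_registered = {      # second party: what the supplier typed in themselves
    "legal_name": "Acme Ltd",
    "address": "1 Main Str., Springfield",
    "contact_email": "orders@acme.example",
}

directory_record = {     # third party: what a business directory returns for the entity
    "legal_name": "ACME Limited",
    "address": "1 Main Street, Springfield",
    "line_of_business": "Manufacturing",
}

# Precedence rule: trust the directory for legal facts, the counterpart for contact details
PREFER_DIRECTORY = {"legal_name", "address", "line_of_business"}

def merge(self_reg: dict, directory: dict) -> dict:
    merged = dict(self_reg)
    for field, value in directory.items():
        if field in PREFER_DIRECTORY or field not in merged:
            merged[field] = value
    return merged

print(merge(self_registered, directory_record))
# keeps the self-registered contact e-mail; takes name, address and line of business from the directory
```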


Using a Business Entity Identifier from Day One

One of the ways to ensure data quality for customer – or rather party – master data when operating in a business-to-business (B2B) environment is to on-board new entries using an externally defined business entity identifier.

By doing that, you tackle some of the most challenging data quality dimensions, such as:

  • Uniqueness, by checking if a business with that identifier already exists in your internal master data. This approach is superior to using data matching, as explained in the post The Good, Better and Best Way of Avoiding Duplicates (a minimal sketch of such a check is shown after this list).
  • Accuracy, by having names, addresses and other information defaulted from a business directory, thus avoiding the spelling mistakes that usually are all over party master data.
  • Conformity, by inheriting additional data such as line-of-business codes and descriptions from a business directory.
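Here is the sketch referred to above: a stubbed directory lookup and an identifier-based uniqueness check at on-boarding. The identifier values and the directory response are invented, and a real implementation would call the directory service behind the chosen identifier.

```python
# Identifier -> internal record key for parties already on-boarded (invented values)
existing_parties = {"EXAMPLE-ID-0001": "Party 4711"}

def directory_lookup(identifier: str) -> dict:
    """Stubbed directory call; a real one would query the service behind the identifier."""
    return {
        "name": "Example Trading A/S",
        "address": "Example Street 1, Copenhagen",
        "line_of_business": "Wholesale",
    }

def onboard(identifier: str) -> dict:
    # Uniqueness: refuse to create a second record for a known identifier
    if identifier in existing_parties:
        raise ValueError("Party already exists as " + existing_parties[identifier])
    # Accuracy and conformity: default name, address and codes from the directory
    return {"identifier": identifier, **directory_lookup(identifier)}

print(onboard("EXAMPLE-ID-0002"))        # creates a new, directory-enriched record
# onboard("EXAMPLE-ID-0001")             # would raise: the party is already known
```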

Having an external business identifier stored with your party master data helps a lot with maintaining data quality as pondered in the post Ongoing Data Maintenance.

When selecting an identifier there are different options, such as national IDs, the LEI, the DUNS Number and others, as explained in the post Business Entity Identifiers.

At the Product Data Lake service I am working on right now, we have decided to use an external business identifier from day one. I know this may be something a typical start-up would consider much later, if and when the party master data population has grown. But, besides being optimistic about our service, I think it will be a win not to have to fight data quality issues later at guaranteed increased costs.

For the identifier we have chosen the DUNS Number from Dun & Bradstreet. The reason is that this is currently the only business identifier with worldwide coverage. Also, Dun & Bradstreet offers some additional data that fits our business model, including consistent line-of-business information and worldwide company family trees.


Starting up at the age of 56

It is never too late to start up, I have heard. So even though I usually brag about having 35+ years of experience in the intersection of business and IT and a huge been-done list in Data Quality and Master Data Management (MDM), which can get me nice consultancy engagements, a certain need in the market has been buzzing in my head for some time.

Before that, when someone asked me what to do in the MDM space, I told them to create something around sharing master data between organizations. Most MDM solutions are sold to a given organization to cover the internal processes there. There are not many solutions out there that cover what is going on between organizations.

But why not do that myself, with the help of some younger people?

You may have noticed that I have been writing about something called the Product Data Lake during the last year. Until recently this has mostly just been a business concept that could be presented on PowerPoint slides, so-called slideware. But now it is becoming real software being deployed in the cloud.

Right now a gifted team in Vietnam, where I am this week as well, is building the solution. We aim to have it ready for the first trial subscribers in August 2016. We will also be exhibiting the solution in London in late September, where we will be in the Start-up Alley at the combined Customer Contact, eCommerce and Technology for Marketing exhibition.

At home in Denmark, some young people are working on our solution too, as well as on the related launch activities and social media build-up. This includes a LinkedIn company page. For continuous stories about our start-up, please follow the Product Data Lake page on LinkedIn here.


Did You Mean Potato or Potahto?

As told in the post Where the Streets have Two Names, one aspect of address validation is the fact that, in some parts of the world, a given postal address can be presented in more than one language.

I experienced that today when using Google Maps for directions to a Master Data Management (MDM) conference in Helsinki, Finland. When typing in the address I got this message:

(Google Maps screenshot)

The case is that the two addresses proposed by Google Maps are exactly the same address, just spelled in Swedish and Finnish, the two official languages used in this region.
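As a small sketch of how an address management solution might treat two language variants as one canonical address (the synonym table below is only illustrative; a locally tailored service would rely on official reference data):

```python
# Swedish street name variant -> Finnish canonical form (for illustration only)
STREET_SYNONYMS = {
    "mannerheimvägen": "mannerheimintie",
    "alexandersgatan": "aleksanterinkatu",
}

def canonical_street(street: str) -> str:
    key = street.lower().strip()
    return STREET_SYNONYMS.get(key, key)

def same_address(a: dict, b: dict) -> bool:
    """Treat two addresses as identical once street names are mapped to one language."""
    return (canonical_street(a["street"]) == canonical_street(b["street"])
            and a["number"] == b["number"]
            and a["city"].lower() == b["city"].lower())

addr_fi = {"street": "Mannerheimintie", "number": "10", "city": "Helsinki"}
addr_sv = {"street": "Mannerheimvägen", "number": "10", "city": "Helsinki"}
print(same_address(addr_fi, addr_sv))  # True
```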

I think Google Maps is an example of a splendid worldwide service. But even the best worldwide services sometimes don't match locally tailored services. In my experience this is the case when it comes to address management solutions, such as address validation and assistance, whether they come as an integrated part of a Master Data Management (MDM) solution, a stand-alone data quality tool or a general service like Google Maps.

Using a Data Lake for Reference Data

TechTarget has recently published a definition of the term data lake.

In the explanation it is mentioned that the term data lake is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried. The explanation also states that: “While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.”
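As a minimal sketch of the schema-on-read idea, assuming a hypothetical folder of raw JSON files: nothing is modelled up front, and the structure is only interpreted when a question is asked.

```python
import json
from pathlib import Path

def query_lake(lake_dir: str, predicate):
    """Scan raw JSON files in a flat folder and impose structure only at query time."""
    for path in Path(lake_dir).glob("*.json"):
        with open(path, encoding="utf-8") as handle:
            record = json.load(handle)
        # No up-front schema: tolerate missing fields when the question is asked
        if predicate(record):
            yield record

# Example question: which raw records mention Denmark, whatever else they contain?
# danish = list(query_lake("/data/lake", lambda r: r.get("country") == "DK"))
```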

A data lake is an approach to overcoming the well-known big data characteristics of volume, velocity and variety, where variety is probably the most difficult one to handle with a traditional data warehouse approach.

If we look at traditional ways of using data warehouses, these have revolved around storing internal transaction data linked to internal master data. With the rise of big data there will be a shift towards encompassing more and more external data. One kind of external data is reference data, being data that typically is born outside a given organization and that has many different purposes of use.

Sharing data with the outside must be a part of your big data approach. This goes for including traditional flavours of big data, such as social data and sensor data, as well as what we may call big reference data, being pools of global data and bilateral data, as explained on this blog on the page called Data Quality 3.0. The data lake approach may very well work for big reference data as it may for other flavours of big data.

The BrightTalk community on Big Data and Data Management has a formidable collection of webinars and videos on big data and data management topics. I am looking forward to contributing there on 25 June 2015 with a webinar about Big Reference Data.


Is big data all about analytics?

My answer to the question in the title of this blog post is NO. In my eyes big data is not just data warehouse 3.0. It is also data quality 3.0.

The concept of the data lake is growing in popularity in the big data world, and so are the warnings about your data lake becoming a data swamp, a data marsh or a data cesspool. Doing analytic work on a nice data lake sounds great. Doing it in a huge swamp, a large marsh or a giant cesspool does not sound so nice.

In nature a lake stays fresh by having a good upstream supply of water as well as a downstream system. In kind of the same way, your data lake should not be a closed system or a dump within your organization.

Sharing data with the outside must be a part of your big data approach. This goes for including traditional flavours of big data, such as social data and sensor data, as well as what we may call big reference data, being pools of global data and bilateral data, as explained on this blog on the page called Data Quality 3.0.

The BrightTalk community on Big Data and Data Management has a formidable collection of webinars and videos on big data and data management topics. I am looking forward to contributing there on 25 June 2015 with a webinar about Big Reference Data.
