The World of Reference Data

Reference Data Management (RDM) is an evolving discipline within data management. As organizations mature in the reference data management realm, we often see a shift from relying on internally defined reference data to relying on externally defined reference data. This follows the good old saying about not reinventing the wheel, and also the observation that externally defined reference data are usually better at fulfilling multiple purposes of use, whereas internally defined reference data tend to cater only for the most important purpose of use within your organization.

Which standard to use, then, tends to be a matter of where in the world you are. Let’s look at three examples from the location domain, the party domain and the product domain.

Location reference data

If you read articles in English about reference data and ensuring accuracy and other data quality dimensions for location data, you often meet remarks such as “be sure to check validity against the US Postal Service” or “make sure to check against the Royal Mail PAF File”. This is all great if all your addresses are in the United States or the United Kingdom. If all your addresses are in another country, there will in many cases be similar services for that country. If your addresses are spread around the world, you have to look further.

There are some Data-as-a-Service offerings for international addresses out there. When it comes to having your own copy of location reference data, the Universal Postal Union has an offering called the Universal POST*CODE® DataBase. You may also look into open data solutions such as GeoNames.
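To make this concrete, here is a minimal sketch of what validating an address element against your own copy of location reference data could look like. The sample rows are illustrative stand-ins, not an actual GeoNames or POST*CODE® DataBase extract:

```python
# Illustrative local extract of location reference data, keyed by
# (country code, postal code). Sample rows only, not real reference data.
reference = {
    ("US", "10001"): "New York",
    ("GB", "SW1A 1AA"): "London",
    ("DK", "2800"): "Kongens Lyngby",
}

def validate_address(country, postal_code, city):
    """Return True if the city matches the reference entry for the postal code."""
    expected = reference.get((country, postal_code))
    return expected is not None and expected.lower() == city.lower()
```

The same lookup pattern works whether the reference data behind it is a national file, the Universal POST*CODE® DataBase or an open data set.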

Party reference data

Within party master data management for Business-to-Business (B2B) activities, you want to classify your customers, prospects, suppliers and other business partners according to what they do. For that there are some frequently used coding systems in areas where I have been:

  • Standard Industrial Classification (SIC) codes, the four-digit numerical codes assigned by the U.S. government to business establishments.
  • The North American Industry Classification System (NAICS).
  • NACE (Nomenclature of Economic Activities), the European statistical classification of economic activities.

As important economic activities change over time, these systems change to reflect the real world. As an example, my Danish company registration has changed NACE code three times since 1998 while I have been doing the same thing.
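Because the classifications themselves get revised, industry-code reference data is best stored per revision rather than as a single flat list. A sketch of that idea, with illustrative example codes rather than authoritative NACE mappings:

```python
# Industry-code reference data stored per classification revision: the same
# economic activity can carry different codes in different revisions.
# The codes below are illustrative examples, not authoritative mappings.
nace_by_revision = {
    "NACE Rev. 1.1": {"data processing": "72.30"},
    "NACE Rev. 2":   {"data processing": "63.11"},
}

def classify(activity, revision):
    """Look up the code for an activity in a given revision of the system."""
    return nace_by_revision[revision].get(activity)
```

Keeping the revision explicit is what lets you explain why the same unchanged business has carried three different codes over the years.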

This does not make conversion services between these systems any easier.

Product reference data

There is also a good choice of standardized classification systems for product data out there. To name a few:

  • The United Nations Standard Products and Services Code® (UNSPSC®), managed by GS1 US™ for the UN Development Programme (UNDP).
  • eCl@ss, which presents itself as: “THE cross-industry product data standard for classification and clear description of products and services that has established itself as the only ISO/IEC compliant industry standard nationally and internationally”. eCl@ss has its main support in Germany (the home of the Mercedes E-Class).

In addition to cross-industry standards there are heaps of industry specific international, regional and national standards for product classification.


Using a Data Lake for Reference Data

TechTarget has recently published a definition of the term data lake.

In the explanation it is mentioned that the term data lake is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried. The explanation also states that: “While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.”

A data lake is an approach to overcoming the well-known big data characteristics of volume, velocity and variety, where the last one, variety, is probably the most difficult to overcome with a traditional data warehouse approach.
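The variety point is exactly what schema-on-read addresses: records land in the lake as-is, and the schema is only applied when the data is queried. A minimal sketch of that pattern, with made-up records:

```python
import json

# Schema-on-read sketch: raw records land in the lake untyped and unvalidated;
# the field selection (the "schema") is applied only at query time.
# The records are made-up examples.
raw_lake = [
    '{"id": 1, "name": "Acme", "employees": "120"}',
    '{"id": 2, "name": "Gadget Ltd"}',  # no employee count captured yet
]

def query(raw_records, fields):
    """Parse each raw record and project the requested fields at read time."""
    for line in raw_records:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

results = list(query(raw_lake, ["name", "employees"]))
```

Notice that the second record simply yields a missing value instead of being rejected on ingest, which is what makes the approach tolerant of variety.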

If we look at traditional ways of using data warehouses, these have revolved around storing internal transaction data linked to internal master data. With the rise of big data there will be a shift towards encompassing more and more external data. One kind of external data is reference data: data that typically is born outside a given organization and that has many different purposes of use.

Sharing data with the outside must be a part of your big data approach. This goes for traditional flavours of big data such as social data and sensor data, as well as what we may call big reference data: pools of global data and bilateral data, as explained on this blog on the page called Data Quality 3.0. The data lake approach may very well work for big reference data as it may for other flavours of big data.

The BrightTalk community on Big Data and Data Management has a formidable collection of webinars and videos on big data and data management topics. I am looking forward to contributing there on 25th June 2015 with a webinar about Big Reference Data.


Data Quality: The Union of First Time Right and Data Cleansing

The other day Joy Medved aka @ParaDataGeek made this tweet:

https://twitter.com/ParaDataGeek

Indeed, upstream prevention of bad data from entering our databases is surely the better way compared to downstream data cleansing. Also, real-time enrichment is better than enriching long after data has been put to work.

That said, there are situations where data cleansing has to be done. These reasons were examined in the post Top 5 Reasons for Downstream Cleansing. But I can’t think of many situations where a downstream cleansing and/or enrichment operation will be of much worth if it isn’t followed up by an approach to getting it first time right in the future.

If we go a level deeper into data quality challenges, the different data quality dimensions have different importance to the various data domains, as explored in the post Multi-Domain MDM and Data Quality Dimensions.

With customer master data we most often have issues with uniqueness and location precision. While I have spent many happy years with data cleansing, data enrichment and data matching tools, I have during the last couple of years been focusing on a tool for getting that first time right.

Product master data are often marred by issues with completeness and (location) conformity. The situation here is that tools and platforms for mastering product data are focused on what goes on inside a given organization and not so much on what goes on between trading partners. Standardization seems to be the only hope. But that path is too long to wait for and may in some ways contradict the end purpose, as discussed in the post Image Coming Soon.

So in order to have a first time right solution for product master data sharing, I have embarked on a journey with a service called the Product Data Lake. If you want to join, you are most welcome.

PS: The product data lake also has the capability of catching up with the sins of the past.


Making a Firmographic Analysis

What demographics are to people, firmographics are to organizations.

I am currently working on starting up a Business-to-Business (B2B) service. In order to assess the market, I had to know something about how many companies out there could possibly be in need of such a service.

The service will work worldwide, but adhering to the sayings about thinking globally/big and starting locally/small, I have started with assessing the Danish market. Also, there is easy and inexpensive access to business directories for Denmark.

My first filter was selecting companies with at least 50 employees.

As the service is suitable for companies within ecosystems of manufacturers, distributors and retailers, I selected the equivalent range of industry codes. In this case it was NACE codes, which resemble SIC codes and other Line-of-Business classifications used in other geographies.

There were circa 2,500 companies in my selection. However, some belong to the same company family tree. By doing a merge/purge with the largest company in a company family tree as the survivor, the list was down to circa 2,000 companies.
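A merge/purge with a "largest member survives" rule can be sketched in a few lines. The field names and sample companies below are illustrative, not rows from the actual directory:

```python
# Merge/purge sketch: collapse each company family tree to a single survivor,
# keeping the member with the most employees. Sample rows are made up.
companies = [
    {"name": "Alpha A/S",        "family": "alpha", "employees": 800},
    {"name": "Alpha Retail A/S", "family": "alpha", "employees": 150},
    {"name": "Beta ApS",         "family": "beta",  "employees": 60},
]

def merge_purge(rows):
    """Keep one row per family tree: the one with the highest employee count."""
    survivors = {}
    for row in rows:
        current = survivors.get(row["family"])
        if current is None or row["employees"] > current["employees"]:
            survivors[row["family"]] = row
    return list(survivors.values())
```

In practice the family key would come from a business directory's corporate linkage data rather than a ready-made column, but the survivorship rule is the same.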

For this particular service, there are some other possibly competing approaches that are stronger for some kinds of goods than for others. For that purpose, I made a bespoke categorization:

  • Priority A: Building materials, furniture, houseware, machinery and vehicles.
  • Priority B: Electronics, books and clothes.
  • Priority C: Pharmaceuticals, food, beverage and tobacco.

Retailers that span several priorities were placed in priority B. Else, for this high level analysis, I only used the primary Line-Of-Business.
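Mapping each company's primary industry code to a priority bucket is a simple lookup. The code prefixes below are illustrative stand-ins, not the actual NACE ranges used in the analysis:

```python
# Bespoke priority categorization sketch: map a primary industry code to
# Priority A/B/C. The prefix-to-priority pairs are illustrative examples only.
priority_by_prefix = {
    "16": "A",  # e.g. building materials
    "26": "B",  # e.g. electronics
    "10": "C",  # e.g. food
}

def priority(industry_code):
    """Return the priority for a code like '26.40' based on its division prefix."""
    return priority_by_prefix.get(industry_code.split(".")[0], "unclassified")
```

Running every company through such a lookup and counting per bucket yields the breakdown behind the figure.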

The result was as shown below:

[Figure: Firmographic breakdown by priority category]

So, from my firmographic analysis I know the rough size of the target market in one locality. I can assume that other markets look more or less the same, or I can do specific firmographics on other geographies. Also, I can apply the first results of dialogues with entities in the breakdown model and see if the model needs modification.


Image Coming Soon

End customer self-service has grown dramatically during the last decades due to the increasing adoption of ecommerce. When customers shop online they need a lot of information about the product they intend to buy. One piece of information they need is an image of the product. The image helps customers confirm that the product they are about to buy is the intended one, and helps them quickly differentiate among a range of products.

Unfortunately, the most common image around on web shops is the “image coming soon” placeholder.

[Image: “Image coming soon” placeholder]

Completeness is a huge problem in Product Information Management (PIM) as examined in my previous post called Multi-Domain MDM and Data Quality Dimensions. A missing product image is a classic completeness issue for product master data.

As a web shop you can collect a product image in several ways, namely:

  • Take the image yourself
  • Get it from the manufacturer

The former approach is cumbersome and usually only used for selected products for a special purpose. The latter is by far the most common. When you deal with many products and constant onboarding of new products, you want a uniform and automated approach to collecting images along with all the other product information needed for the specific product category.

A clumsy variant of the latter is scraping images from your manufacturer’s website or even your competitor’s website. Or having someone far away do that for you.

The better way is to start sharing product data and digital assets, including product images, within the ecosystems of manufacturers, distributors, retailers and end users. Stay tuned. A service for that is coming soon :-)


Multi-Domain MDM and Data Quality Dimensions

The most frequently mentioned domains within Master Data Management (MDM) are customer, product and location. Data quality is a core discipline when working with MDM. In data quality we talk about different dimensions such as uniqueness, relevance, completeness, timeliness, precision, conformity and consistency.

While these data quality dimensions apply to all domains of MDM, some dimensions apply a bit more to one of the domains or to the intersections of the domains.

Below is a figure with an attempt to illustrate where the dimensions belong the most:

[Figure: Data quality dimensions mapped to the customer, product and location MDM domains]

Uniqueness is the most addressed data quality dimension when it comes to customer master data. Customer master data are often marred by duplicates, meaning two or more database rows describing the same real world entity. There are several remedies around to cure that pain. These remedies are explored in the post The Good, Better and Best Way of Avoiding Duplicates.
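One common remedy is a match key: normalise the identifying attributes of each row into a key, and treat rows sharing a key as duplicate candidates. A minimal sketch of that idea, with illustrative normalisation rules:

```python
import re

# Match-key sketch for spotting duplicate customer rows: normalise name and
# postal code into a comparable key. The normalisation rules (lowercase,
# strip punctuation, truncate the name token) are illustrative choices.
def match_key(name, postal_code):
    """Build a crude duplicate-candidate key from a name and postal code."""
    token = re.sub(r"[^a-z0-9]", "", name.lower())
    return f"{token[:8]}|{postal_code.replace(' ', '')}"
```

Real-world matching engines add fuzzy comparison and survivorship rules on top, but a shared key like this is often the first pass that groups the candidates.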

With product master data, uniqueness is a less frequent issue. However, completeness is often a big pain. One reason is that completeness means different requirements for different categories of products as explained in the post Hierarchical Completeness within Product Information Management.
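Because requirements differ per category, a completeness measure has to be category-aware. A sketch of that, with made-up categories and attribute sets:

```python
# Category-aware completeness sketch: each product category has its own set of
# required attributes. Categories and attributes are illustrative examples.
required_by_category = {
    "clothing":    {"name", "image", "size", "colour"},
    "electronics": {"name", "image", "voltage"},
}

def completeness(product):
    """Share of the category's required attributes that have a non-empty value."""
    required = required_by_category[product["category"]]
    present = {k for k, v in product.items() if v not in (None, "")}
    return len(required & present) / len(required)
```

A shirt with no colour filled in scores 3 out of 4, while the same gap would not count against an electronics item at all.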

When working with location master data, consistency can be a challenge. Addressing, so to speak, the different postal address formats around the world is certainly not a walkover. Even Google Maps does not have all the right answers, as told in the post Sometimes Big Brother is Confused.

In the intersection between the location domain and the customer domain the data quality dimension called precision can be hard to manage as reported in the post A Universal Challenge. What is relevant to know about your customers and what is relevant to tell about your products are essential questions in the intersection of the customer and product master data domains.

Conformity of product data is related to locations. Take units of measurement: in the United States the length of a small thing will be in inches; in most of the rest of the world it will be in centimetres. In the UK you can never know.
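One way to keep product data conformant across markets is to normalise incoming measurements to a single canonical unit and convert outward per locale. A minimal sketch:

```python
# Unit conformity sketch: normalise length values to a canonical unit (cm)
# regardless of the unit they arrive in from a trading partner.
CM_PER_UNIT = {"cm": 1.0, "mm": 0.1, "in": 2.54}

def to_cm(value, unit):
    """Convert a length to centimetres using a fixed conversion table."""
    return value * CM_PER_UNIT[unit]
```

Storing one canonical value and rendering inches or centimetres per market sidesteps the "which unit did they mean" problem entirely.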

Timeliness is the everlasting data quality dimension all over.


Chinese Whispers and Data Quality

There is a game called Chinese Whispers or Broken Telephone or various other names. In that game, one person whispers a message to another person. The message is passed through a line of people until the last player announces the message to the entire group. At that point the message is often quite different, or much shortened. The reasons for that are human unreliability, including how we put our own perceptions and filters into a message.

When working with data quality you often see the same phenomenon when data is passed through a chain. One area I have observed in recent years is within Product Information Management (PIM). Here the chain is not just the data chain within a given company but the whole data chain in ecosystems of manufacturers, distributors, retailers and end users.

While Product Information Management (PIM) solutions and Product Master Data Management (Product MDM) solutions – if there is a difference – address the issues within a given company, we haven’t seen adequate solutions for solving the problem in the exchange zones between trading partners.

[Figure: A broken data supply chain]

From what I have seen, the solutions that upstream providers of product data work with and the solutions that downstream receivers of product data work with do not go well together.

Consequently, I am right now working on a solution to end Chinese whispers in product data supply chains. Check out the Product Data Lake.
