Big Data Quality, Santa Style

In previous years, the posts on this blog close to Christmas have been about Multi-Domain MDM, Santa Style and Data Governance, Santa Style.

So this year it may be time to have a closer look at big data quality, Santa style, meaning how we can imagine Santa Claus joining the rise of big data while observing that exploiting data, big or small, only adds real value if you believe in data quality. Ho ho ho.

At the Santa Claus organization they have figured out that there is a close connection between excellence in working with big data and excellence in multi-domain Master Data Management (MDM) and data governance.

Here are some of the findings in the big data paper that the Chief Data Elf just signed off:

  • The feasibility of the new algorithms for naughty or nice marking, using social media listening combined with our historical records, is heavily dependent on unique, accurate and timely boys' and girls' master data. The party data governance elf gathering will be accountable for any nasty and noisy issues.
  • Implementation of the automated present buying service, based on fuzzy matching between our supplier self-service based multi-lingual product catalogue and the wish list data lake, must be done in a phased schedule. The product data governance elf committee is responsible for avoiding any false positives (wrong present incidents) and decreasing the number of false negatives (someone not getting what could be purchased within the budget).
  • Last year we had a 12.25 % overspend on reindeer due to incorrect and missing chimney positions. This year the reliance on crowdsourced positions will be better balanced with utilizing open government property data where possible. The location data governance elves will consult with the elves living on the roof at each head of state in order to make them release more and better quality data (the Gangnam Project).
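As a toy illustration of the trade-off the elf committee worries about, here is a minimal fuzzy matching sketch using the Python standard library's difflib; the catalogue entries, wishes and the 0.8 threshold are of course made up for the example:

```python
from difflib import SequenceMatcher

def fuzzy_match(wish: str, catalogue: list[str], threshold: float = 0.8):
    """Return the best matching catalogue entry for a wish, or None."""
    best, best_score = None, 0.0
    for product in catalogue:
        score = SequenceMatcher(None, wish.lower(), product.lower()).ratio()
        if score > best_score:
            best, best_score = product, score
    # A high threshold reduces false positives (wrong present incidents)
    # at the cost of more false negatives (no present suggested at all).
    return best if best_score >= threshold else None

catalogue = ["Wooden Train Set", "Teddy Bear", "Building Blocks"]
print(fuzzy_match("wooden train", catalogue))   # close enough to match
print(fuzzy_match("space rocket", catalogue))   # below threshold -> None
```

Tightening or loosening the threshold is exactly the phased tuning exercise the committee would be responsible for.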

Tear Down These Walls

Over at The Data Roundtable there is some good thinking going on. Recently Dylan Jones blogged: Want to improve data quality? Start by re-imagining your data boundaries.

In his blog post Dylan explains how data journeys are costly and risky. There are huge opportunities, not least for data quality, in simplifying the sharing of data by breaking down the data boundaries.

The Berlin Wall. Fortunately it is not there anymore.

Data boundaries exist within organisations and between organisations. As the way of doing business today involves businesses working together, we see more and more data being sent between businesses. Unfortunately, this is often done using spreadsheets, as told in the post Excellence vs Excel.

We definitely need better ways to share data within organisations and between organisations. Furthermore, as Dylan points out, the data exchange needs to go in both directions. The ability to share data in an intelligent way relies on data being identified and described by commonly shared reference and master data.

In my experience, the ability to collaborate between businesses by sharing reference and master data, and utilize available public sources, will be crucial in the quest for re-imagining data boundaries. This is indeed the future of data quality and The Future of Master Data Management.

Excellence vs Excel

We all use Excel though we know it is bad. It is a user-friendly and powerful tool, but there are plenty of stories out there where Excel has caused serious trouble, like this one from Computerworld in 2008 when the credit crunch struck.

I guess everyone who works in data management curses Excel. Data kept in Excel is a pain – you know where – as it is hard to share, you never know if you have the latest version, nice informative colouring disappears when transforming, narrow columns turn into rubbish, different formatting usually makes it practically impossible to combine two sheets, and there are heaps of other not so nice behaviours.

Even so, Excel is still the most used tool for many crucial data management purposes as for example reported in the post The True Leader in Product MDM.

Excel is still a very frequently used option when it comes to exchanging data, as touched upon by Monica McDonnell of Informatica in a recent blog post on Four Technology Approaches for IDMP Data Management.

Probably, the use of Excel as a means to exchange data between organizations is the field where it will be most difficult to eliminate the dangerous use of Excel. The problem is that the alternative usually is far too rigid. The task of achieving consensus between many organizations on naming, formatting and all the other tedious stuff makes us turn to Excel.


When working with data quality within data management we may wrongly strive for perfection. We should rather strive for excellence, which is something better than the ordinary. In this case Excel is the ordinary. As Harriet Braiker said: “Striving for excellence motivates you; striving for perfection is demoralizing.”

In order to be excellent, though not perfect, in data sharing, we must develop solutions that are better than Excel without being too rigid. Right now, I am working on a solution of that kind for sharing product data. The service is called the Product Data Lake.

The Future of Master Data Management

Back in 2011 Gartner, the analyst firm, predicted that these three things would shape the Master Data Management (MDM) market:

  • Multi-Domain MDM
  • MDM in the Cloud
  • MDM and Social Networks

The third point was, after the rise of big data, rephrased in 2012 to MDM and Big Data as reported in the post called The Big MDM Trend.

In my experience all these three themes are still valid, with slow but steady uptake.

But have any new trends shown up in the past years?

In a 2015 post called “Master Data Management Merger Tardis and The Future of MDM” Ramon Chen of Reltio puts forward some new possibilities to be discussed, among those Machine Learning & Cognitive Computing. I agree with Ramon on this theme, though these topics have been around in general for decades without really breaking through. But we need more of this in MDM for sure.

My own favourite MDM trend is a shift from focussing on internally captured master data to collaboration with external business partners as explained in the post MDM 3.0 Musings.

In that quest, I am looking forward to my next speaking session, which will be in Helsinki, Finland on the 8th December. There is an interview on that with yours truly available on the Talentum Master Data Management 2015 site.

It is Magic Quadrant Week

Earlier this week this blog featured the Magic Quadrant for Customer MDM and the Magic Quadrant for Product MDM. Today it is time to have a look at the just published Magic Quadrant for Data Quality Tools.

Last year I wondered if we would finally see data quality tools focus on other pain points than duplicates in party data and postal address precision, as discussed in the post The Multi-Domain Data Quality Tool Magic Quadrant 2014 is out.

Well, apparently there still isn’t a market for that as the Gartner report states: “Party data (that is, data about existing customers, prospective customers, citizens or patients) remains the top priority for most organizations: Almost nine in 10 (89%) of the reference customers surveyed for this Magic Quadrant consider it a priority, up from 86% in the previous year’s survey.”

From my own experience of working predominantly with product master data during the last couple of years, there are issues and big pain points with product data too. They are just different from the main pain points with party master data, as examined in the post Multi-Domain MDM and Data Quality Dimensions.

I sincerely believe that there are opportunities in providing services to solve the specific data quality challenges for product master data, which, according to Gartner, “is one of the most important information assets an organization has; second-only, perhaps, to customer master data”. In all humbleness, my own venture is called the Product Data Lake.

Anyway, as ever, Informatica is our friend when it comes to free copies of a data management quadrant. Get a free copy of the 2015 Magic Quadrant for Data Quality Tools here.

The Perhaps Second Most Important MDM Quadrant 2015 is Out

This year the Gartner Magic Quadrant for Master Data Management of Product Data Solutions is published very shortly after the Gartner Magic Quadrant for Master Data Management of Customer Data Solutions. Now only one day in between. I hope this is a sign that the two MDM quadrants will eventually melt into a (Multi-Domain) MDM Quadrant, as touched upon yesterday in my post about the Customer MDM Quadrant.

This is not the quadrant, just some vendor names

The product MDM quadrant states: “Product master data is one of the most important information assets an organization has; second-only, perhaps, to customer master data”. In my humble opinion, you can refine that statement. It depends on the number of customers (or other party roles) versus the number of products you deal with. The highest number names the most important domain to start with in your organization.

As usual Informatica seems to be the fastest MDM vendor measured on providing a free copy of the Gartner quadrants. Find the 2015 Product MDM Quadrant here from Informatica.

Two Ways of Exploiting Big Data with MDM

This is not the quadrant, just some vendor names

The Gartner 2015 Magic Quadrant for Master Data Management of Customer Data Solutions is out. One way of getting the report without being a Gartner customer is through this link on the Informatica site.

Successful providers of Master Data Management (MDM) solutions will sooner or later need to offer ways of connecting MDM with big data.

In the Customer MDM quadrant Gartner, without mentioning if this relates to customer MDM only or multi-domain MDM in general, mentions two ways of connecting MDM with big data:

  • Capabilities to perform MDM functions directly against copies of big data sources such as social network data copied into a Hadoop environment. Gartner have found that there have been very few successful attempts (from a business value perspective) to implement this use case, mostly as a result of an inability to perform governance on the big datasets in question.
  • Capabilities to link traditionally structured master data against those sources. Gartner have found that this use case is also sparse, but more common and more readily able to prove value. This use case is also gaining some traction with other types of unstructured data, such as content, audio and video.
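The second approach, linking traditionally structured master data against big data sources, can be pictured with a small Python sketch that joins customer master records to records copied from a hypothetical social data source via a cleansed email key; all names and data here are illustrative:

```python
# Link structured customer master data to records copied from a big data
# source (e.g. social network data landed in a data lake). The join key is
# a normalised email address; data and field names are purely illustrative.

master = [
    {"customer_id": "C1", "name": "Ann Smith", "email": "Ann.Smith@example.com"},
    {"customer_id": "C2", "name": "Bo Jensen", "email": "bo@example.com"},
]
social = [
    {"handle": "@annsmith", "email": "ann.smith@EXAMPLE.com"},
    {"handle": "@unknown", "email": "stranger@example.com"},
]

def normalise(email: str) -> str:
    # Cleansing the key before the join is itself a data quality step.
    return email.strip().lower()

index = {normalise(rec["email"]): rec for rec in master}

links = []
for post in social:
    match = index.get(normalise(post["email"]))
    if match:
        links.append((match["customer_id"], post["handle"]))

print(links)  # [('C1', '@annsmith')]
```

Even this tiny example shows why governance matters: without an agreed, cleansed key, the big data records simply cannot be linked back to the master data.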

My take is that these ways apply to the other MDM domains (supplier, product, location, asset …) as well – just as I think Gartner will sooner or later need to make only one MDM quadrant, as pondered in the post called The second part of the Multi-Domain MDM Magic Quadrant is out.

Also, I think the ability to perform governance on big datasets is key. In fact, in my eyes master data will tend to be more externally generated and maintained, just like big data usually is. This will change our ways of doing information governance, as discussed in my previous post on this blog, which was by the way inspired by the Gartner product MDM person. The post is called MDM and SCM: Inside and outside the corporate walls.

MDM and SCM: Inside and outside the corporate walls

In my journey through the Master Data Management (MDM) landscape, I am currently working from a Supply Chain Management (SCM) perspective. SCM is very exciting as it connects the buy-side and the sell-side of a company. In that connection we will be able to understand some basic features of multi-domain MDM, as touched upon in a recent post about the MDM ancestors, Customer Data Integration (CDI) and Product Information Management (PIM). The post is called CDI, PIM, MDM and Beyond.

MDM and SCM 1.0: Inside the corporate walls

Traditional Supply Chain Management deals with what goes on from when a product is received from a supplier, or vendor if you like, until it ends up at the customer.

In the distribution and retail world, the product usually stays physically the same, but from a data management perspective we struggle with having buying views and selling views on the data.

In the manufacturing world, we see the products we are going to sell transforming from raw materials via semi-finished products to finished goods. One challenge here arises when companies grow through acquisitions: a given real-world product might be seen as a raw material in one plant but as a finished good in another plant.

Regardless of the position of our company in the ecosystem, we also have to deal with the buy-side of products such as machinery, spare parts, supplies and other goods, which stay in the company.

MDM and SCM 2.0: Outside the corporate walls

SCM 2.0 is often used to describe handling the extended supply chain that is a reality for many businesses today due to business process outsourcing and other ways of collaboration within ecosystems of manufacturers, distributors, retailers, end users and service providers.

From a master data management perspective the ways of handling supplier/vendor master data and customer master data here melts into handling business-partner master data or simply party master data.

For product master data there are huge opportunities in sharing most of these master data within the ecosystems. Usually you will do that in the cloud.

In such environments, we have to rethink our approach to data / information governance. This challenge was, with a starting point in cloud computing, examined by Andrew White of Gartner (the analyst firm) in a blog post called “Thoughts on The Gathering Storm: Information Governance in the Cloud”.

The World of Reference Data

Reference Data Management (RDM) is an evolving discipline within data management. When organizations mature in the reference data management realm, we often see a shift from relying on internally defined reference data to relying on externally defined reference data. This is based on the good old saying of not reinventing the wheel, and also on the fact that externally defined reference data are usually better at fulfilling multiple purposes of use, where internally defined reference data tend to only cater for the most important purpose of use within your organization.

Then, what standard to use tends to be a matter of where in the world you are. Let's look at three examples from the location domain, the party domain and the product domain.

Location reference data

If you read articles in English about reference data and ensuring accuracy and other data quality dimensions for location data, you often meet remarks such as “be sure to check validity against US Postal Services” or “make sure to check against the Royal Mail PAF File”. This is all great if all your addresses are in the United States or the United Kingdom. If all your addresses are in another country, there will in many cases be similar services for the given country. If your addresses are spread around the world, you have to look further.

There are some Data-as-a-Service offerings for international addresses out there. When it comes to having your own copy of location reference data, the Universal Postal Union has an offering called the Universal POST*CODE® DataBase. You may also look into open data solutions such as GeoNames.
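To illustrate the country dependence, here is a minimal Python sketch of a plausibility check for postal codes. The patterns are simplified and deliberately incomplete; a real implementation would validate against a service or a reference data set such as the UPU database or GeoNames:

```python
import re

# Illustrative, simplified postal code patterns per ISO country code.
POSTAL_PATTERNS = {
    "US": r"^\d{5}(-\d{4})?$",                    # ZIP / ZIP+4
    "GB": r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$",  # simplified UK outward+inward
    "DK": r"^\d{4}$",                             # Danish four-digit codes
}

def postal_code_plausible(country: str, code: str) -> bool:
    """Format-level plausibility only; says nothing about real existence."""
    pattern = POSTAL_PATTERNS.get(country.upper())
    if pattern is None:
        # Unknown country: do not reject, flag for manual review instead.
        return True
    return re.match(pattern, code.strip().upper()) is not None

print(postal_code_plausible("US", "90210"))     # True
print(postal_code_plausible("GB", "SW1A 1AA"))  # True
print(postal_code_plausible("US", "ABC"))       # False
```

Note that a format check like this only catches the crudest errors; whether the code actually exists and matches the rest of the address is exactly what the country-specific services are for.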

Party reference data

Within party master data management for Business-to-Business (B2B) activities, you want to classify your customers, prospects, suppliers and other business partners according to what they do. For that, there are some frequently used coding systems in areas where I have been:

  • Standard Industrial Classification (SIC) codes, the four-digit numerical codes assigned by the U.S. government to business establishments.
  • The North American Industry Classification System (NAICS).
  • NACE (Nomenclature of Economic Activities), the European statistical classification of economic activities.

As important economic activities change over time, these systems change to reflect the real world. As an example, my Danish company registration has changed NACE code three times since 1998 while I have been doing the same thing.

This doesn’t make conversion services between these systems any easier.
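A tiny Python sketch shows why such a crosswalk is awkward: mappings are often one-to-many and must be reviewed manually, and every revision of the underlying systems invalidates part of the table. The code pairs below are only illustrative fragments, not an authoritative mapping:

```python
# A hypothetical crosswalk fragment from NACE to NAICS codes.
# Real mappings are many-to-many and change with each revision of either
# system; the entries below are illustrative only.

NACE_TO_NAICS = {
    "62.01": ["541511"],            # computer programming activities
    "47.91": ["454110", "454111"],  # retail via mail order / internet
}

def convert_nace_to_naics(nace_code: str) -> list[str]:
    """Return candidate NAICS codes; empty list means manual mapping needed."""
    return NACE_TO_NAICS.get(nace_code, [])

print(convert_nace_to_naics("62.01"))  # ['541511'] - one clean candidate
print(convert_nace_to_naics("47.91"))  # two candidates -> manual review
print(convert_nace_to_naics("99.99"))  # [] -> no mapping known
```

When the source system recodes an activity, as my own company registration has experienced three times, every downstream crosswalk entry for that code has to be revisited as well.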

Product reference data

There is also a good choice of standardised classification systems for product data out there. To name a few:

  • The United Nations Standard Products and Services Code® (UNSPSC®), managed by GS1 US™ for the UN Development Programme (UNDP).
  • eCl@ss, who presents themselves as: “THE cross-industry product data standard for classification and clear description of products and services that has established itself as the only ISO/IEC compliant industry standard nationally and internationally”. eCl@ss has its main support in Germany (the home of the Mercedes E-Class).

In addition to cross-industry standards there are heaps of industry specific international, regional and national standards for product classification.


Image Coming Soon

End customer self-service has grown dramatically during the last decades due to the increasing adoption of ecommerce. When customers shop online, they need a lot of information about the product they intend to buy. One of the pieces of information they need is an image of the product. The image helps customers confirm that it is the intended product they are going to buy, and helps with quickly differentiating among a range of products.

Unfortunately, the most common image around on web shops is the “image coming soon”.

Image coming soon

Completeness is a huge problem in Product Information Management (PIM) as examined in my previous post called Multi-Domain MDM and Data Quality Dimensions. A missing product image is a classic completeness issue for product master data.
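As a small illustration, completeness can be measured per product record. The following Python sketch assumes hypothetical field names and counts an absent or empty image reference as the classic gap:

```python
# Measure the completeness data quality dimension for product records.
# Field names are illustrative; an empty or missing "image_url" is the
# classic gap that ends up as "image coming soon" in the web shop.

REQUIRED_FIELDS = ["sku", "name", "description", "image_url"]

def completeness(record: dict) -> float:
    """Share of required fields that are present and non-empty."""
    filled = sum(1 for field in REQUIRED_FIELDS if record.get(field))
    return filled / len(REQUIRED_FIELDS)

products = [
    {"sku": "A1", "name": "Lamp", "description": "Desk lamp", "image_url": "lamp.jpg"},
    {"sku": "B2", "name": "Chair", "description": "", "image_url": None},
]

for product in products:
    print(product["sku"], completeness(product))
# A1 scores 1.0; B2 scores 0.5 and would show "image coming soon".
```

Tracking such a score per category makes it easy to see where the onboarding process, rather than individual records, is the real problem.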

As a web shop you can collect a product image in several ways, namely:

  • Take the image yourself
  • Get it from the manufacturer

The former approach is cumbersome and usually only used for selected products for a special purpose of use. The latter is by far the most common. When you deal with many products and constant onboarding of new products, you want a uniform and automated approach to collect images along with all the other product information needed for the specific product category.

A clumsy variant of the latter is scraping it from your manufacturer’s website or even your competitor’s website. Or having someone far away doing that for you.

The better way is to start sharing product data and digital assets, including product images, within the ecosystems of manufacturers, distributors, retailers and end users. Stay tuned. A service for that is coming soon 🙂
