Data Quality and Data Visualization

This is a self-centric blog post about data quality and data visualization.

The figure to the right shows statistics on who viewed my LinkedIn profile over a certain period.

Looking at it makes me think about a couple of data quality and data visualization issues, especially those linked to visualizing data on a world map.

Hidden value

Fortunately there are both a map and some numbers below it, because the map is too small to show where most of my views come from: my very small home country, Denmark.

Misleading proportions

I have no views from the grey countries. So I should certainly concentrate on Greenland (the big grey land at the top of the map) to get more viewers, right?

Well, the Mercator projection makes areas close to the poles, like Greenland, look much bigger than they are in the real world. Greenland is a big island, but its area is in fact less than 1/3 of Australia’s (the almost as big light blue land in the lower right corner, down under), and Greenland has only about 1/400 of Australia’s population.
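To put a number on the distortion, here is a minimal sketch. It relies on the fact that the Mercator projection stretches a map by 1/cos(latitude) in both directions, so areas are inflated by the square of that factor. The two latitudes are rough central values I picked for illustration, not measured figures.

```python
import math


def mercator_area_inflation(latitude_deg: float) -> float:
    """Factor by which the Mercator projection inflates areas at a latitude.

    The projection stretches both map directions by 1 / cos(latitude),
    so areas grow by 1 / cos(latitude) squared.
    """
    return 1.0 / math.cos(math.radians(latitude_deg)) ** 2


# Rough central latitudes, assumed for illustration only.
greenland = mercator_area_inflation(72.0)
australia = mercator_area_inflation(-25.0)

print(f"Greenland: areas drawn about {greenland:.1f} times too large")
print(f"Australia: areas drawn about {australia:.1f} times too large")
print(f"Relative exaggeration: about {greenland / australia:.1f} times")
```

With these assumed latitudes Greenland is drawn roughly ten times too large, while Australia is only inflated by about a fifth, which is why the two look almost the same size on the map.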

Cultural dependency

My blogging and LinkedIn activities are in English due to the moderate population of Denmark. Therefore, and because the spread of LinkedIn is biased towards the English speaking world, it’s no surprise that most viewers are from English speaking countries.

Foreign Affairs

There is a famous poster based on a cover of The New Yorker, Saul Steinberg’s “View of the World from 9th Avenue”. This poster perfectly illustrates the centricity we often have about the town, region or country we live in.

The same phenomenon is often seen in data management.

I mentioned United States centricity as a minor criticism in my recent review of the excellent book “Master Data Management and Data Governance”.

An example from the book is this statement:

“It is important to differentiate between U.S. domestic addresses and international addresses. This distinction is important for U.S.-centric MDM solutions because U.S. domestic addresses are normally better defined and therefore can be processed in a more automatic fashion, while international addresses require more manual intervention.”

The same fact could be expressed by saying:

“It is important to differentiate between Danish domestic addresses and international addresses. This distinction is important for Danish-centric MDM solutions because Danish domestic addresses are normally better defined and therefore can be processed in a more automatic fashion, while international addresses require more manual intervention.”

The catch is that the well formatted address in the first case is the messy address in the second case, and the well formatted address in the second case is the messy address in the first.

If your MDM scope is country-centric, it is sensible to concentrate on automation related to that country.

If your MDM scope is international there are two options:

  • The easy way: The one-size-fits-all option. This is a moderate investment, but it also yields only moderate results in terms of automation and data quality.
  • The hard way: You have to implement specialized automation and investigate the best external reference data for each country. I made a Danish-centric post on that last year here.

Book Review: Berson and Dubov on MDM

A few days ago Julian Schwarzenbach over at the Data and Process Advantage Blog published a review of the book “Master Data Management and Data Governance” by Alex Berson and Larry Dubov. Link to Julian’s review here.

And hey, that’s the book I too have been reading during the last months. So why not write my own review as well?

I agree very much with Julian’s positive review of the book. It is a very comprehensive book, and a thick and heavy one, as I have learned from bringing it with me when travelling, which is where I usually read offline stuff. But master data management and the related data governance is a big and heavy discipline with a lot of details that have to be dealt with.

I have probably annoyed fellow travellers on trains and airplanes while reading the book with exclamations such as: Yes, precisely, that’s what I have always said, good point and so on. That is because I agree very much with many of the issues described and the solutions discussed in the book.

For the mandatory bit of criticism that must be included in every book review, I will bring up my pet peeve: United States and English language centricity. Well, it’s actually not that bad, as the book in many places does indicate that other angles and pains exist than those prominent in the United States and in the English language.

Oh, and I can live with my surname being spelled “Sorensen” instead of “Sørensen” in the references, and with a related date being formatted as “11/22/2009”, which to me would be the 11th day of the 22nd month of the year 2009.
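The date issue is easy to reproduce. The sketch below uses only Python’s standard library and shows that the string from the book parses under the month-first convention but is rejected under the day-first convention. The second date string is my own example of the more dangerous case, where both readings succeed and silently disagree.

```python
from datetime import datetime

raw = "11/22/2009"

# Month-first reading, as commonly used in the United States.
us_date = datetime.strptime(raw, "%m/%d/%Y").date()
print(us_date.isoformat())  # 2009-11-22

# Day-first reading, as used in Denmark: there is no 22nd month.
try:
    datetime.strptime(raw, "%d/%m/%Y")
    day_first_ok = True
except ValueError:
    day_first_ok = False
print(day_first_ok)  # False

# A date such as 03/04/2009 parses both ways, giving two different days.
ambiguous = "03/04/2009"
month_first = datetime.strptime(ambiguous, "%m/%d/%Y").date()
day_first = datetime.strptime(ambiguous, "%d/%m/%Y").date()
print(month_first.isoformat(), day_first.isoformat())
```

An unambiguous exchange format such as ISO 8601 (2009-11-22) avoids the problem altogether.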

No Privacy Customer Onboarding

This post is a follow-up on today’s #DataKnightsJam happening on Twitter. Today’s subject was data quality and data privacy.

Diversity in data quality is a subject that has been discussed many times on this blog.

So I want to share a real life example of a good upstream, get-it-right-first-time data sharing approach that might exceed privacy thresholds in other places.

The image to the right is the data entry form from a Swedish webshop used for customer self-registration. The main flow is that:

  • You type your national ID (personnummer in Swedish)
  • You press the following button
  • The system fetches your name and address data from the public citizen hub
  • The webshop gets an accurate, complete single customer view  

The webshop www.jula.se sells tools for home improvement.

What is Identity Resolution?

We continuously struggle with defining what it is we are doing, with questions like: What is data quality? What is Master Data? Lately I’ve been involved in discussions around: What is Identity Resolution? A current discussion on this topic is running in the Data Matching LinkedIn group.

This discussion has roots in one of my blog posts called Entity Revolution vs Entity Evolution. Jeffrey Huth of IBM Initiate followed up with the post Entity Resolution & MDM: Interchangeable? In January Phillip Howard of Bloor made a post called There’s identity resolution and then there’s identity resolution (followed up by a correction post the other day called My bad).

It is a “same same but different” discussion. Traditional data matching (or record linkage), as seen in data quality tools and master data management solutions, is the bright view: it is about finding duplicates and making a “single business partner view” (or “single party view” or “single customer view”). Identity resolution is the dark view: preventing fraud and catching criminals, terrorists and other villains.

The Gartner Hype Cycle describes the dark view as “Entity Resolution and Analysis”. This discipline is approaching the expectation peak and will, according to Gartner, be absorbed by other disciplines, as no one can tell the difference, I guess.

Certainly there are poles. In an article from 2006 called Identity Resolution and Data Integration David Loshin said: “There is a big difference between trying to determine if the same person is being mailed two catalogs instead of one and determining if the individual boarding the plane is on the terrorist list.”

But there is also a grey zone.

From a business perspective, preventing misuse of a restricted campaign offer, for example, is a bit of both. Here you want to prevent an existing customer from using an offer meant only for new customers. How does that apply to members of the same household or the same company family tree? Or you want to prevent someone from using an introductory offer twice by typing her name and address a bit differently.
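The last case can be sketched with a simple fuzzy comparison. This is only an illustration using Python’s standard difflib module, not the technique of any particular data quality tool. The sample records, the normalization rules and the 0.85 threshold are all assumptions of mine.

```python
import re
from difflib import SequenceMatcher

# Assumed cut-off; a real solution would tune this against known cases.
THRESHOLD = 0.85


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Hypothetical registrations for a new-customer offer.
existing = "Mary Smith, 1 Main Street, Anytown"
attempt = "Mary Smyth, 1 Main Str., Anytown"
stranger = "Peter Jones, 42 Harbour Road, Otherville"

for candidate in (attempt, stranger):
    score = similarity(existing, candidate)
    verdict = "possible repeat use" if score >= THRESHOLD else "looks new"
    print(f"{candidate!r}: {score:.2f} ({verdict})")
```

A plain string comparison like this says nothing about households or company family trees; it only catches the slightly different spelling of the same name and address.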

From a technical perspective I have an example from working with a newspaper on a big fraud scam, described in the post Big Time ROI in Identity Resolution. Here I had no trouble using a traditional deduplication tool to discover non-obvious relationships. Also, the relationships discovered in traditional data matching end up quite nicely in hierarchy management as part of master data management, as described in the post Fuzzy Hierarchy Management.

And then there is the use of the words identity (resolution) versus entity (resolution).

My feeling is that we could use identity resolution to describe all kinds of matching and linking with party master data, while entity resolution could be used to describe all kinds of matching and linking with all master data entity types, as seen in multi-domain master data management. But that’s just my words.

Multi-Commerce Data Quality

A month ago I wrote about Multi-Channel Data Quality. Multi-Commerce and the related data quality is pretty much another term covering the same challenges: although we talk a lot today about eCommerce, that is, doing business online, we still have a lot of business going on offline. So we have challenges with online data quality, offline data quality and, not least, a single view of online and offline data quality.

According to the Gartner Hype Cycle there is such a thing as Multicommerce Master Data Management. This discipline has just passed the expectation peak but will, according to Gartner, be absorbed by Multidomain Master Data Management on the descent before climbing up again towards enlightenment and productivity.

As data quality and master data management are best friends I find it very likely that Multi-Commerce Data Quality will be all about Multi-Domain Master Data Management, including:

  • Having a single business partner view (that includes single customer view) encompassing all online and offline activities
  • Having a unified way of maintaining and exposing product data online and offline
  • Having the means for doing content management (that includes unstructured data) embracing online presentation as well as offline distribution.    

I also see Multi-Domain Master Data Management as not only doing master data management for several data domains at the same time (with the same software brand), but also exploring the intersections between the different domains.

If, for example, you look at a customer/product matrix, you may add a channel as a third dimension and examine the relations between a customer type, a product type/attribute and a given channel, thus getting a 3D picture of doing business in a multi-commerce environment.
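As a minimal sketch of such a 3D picture, the cube can be held as counts keyed by customer type, product type and channel. The order lines and category names below are made up for illustration.

```python
from collections import Counter

# Hypothetical order lines: (customer type, product type, channel).
orders = [
    ("consumer", "power tools", "online"),
    ("consumer", "power tools", "store"),
    ("consumer", "garden", "store"),
    ("business", "power tools", "online"),
    ("business", "power tools", "online"),
    ("business", "garden", "phone"),
]

# The "cube": number of orders per customer/product/channel combination.
cube = Counter(orders)


def slice_by_channel(cube, channel):
    """Return the customer/product matrix for a single channel."""
    return {
        (customer, product): count
        for (customer, product, ch), count in cube.items()
        if ch == channel
    }


online = slice_by_channel(cube, "online")
for (customer, product), count in sorted(online.items()):
    print(f"{customer:10} {product:12} {count}")
```

Slicing on another dimension, for example one customer type across all channels, follows the same pattern.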

If you are interested in Multi-Domain Master Data Management including how Multi-Commerce Master Data Management and related data quality are developing right now, then please join the LinkedIn group for Multi-Domain MDM by clicking on the puzzle.

Fuzzy Hierarchy Management

When evaluating results from automated data matching, your goal is typically to find false positives and false negatives: entities that are matched but shouldn’t be (false positives) and entities that are not matched but should have been (false negatives).
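As a minimal sketch of that evaluation, assume the matching process outputs pairs of record IDs and that a manually verified set of true pairs exists to compare against. The IDs below are hypothetical.

```python
def evaluate_matches(predicted, truth):
    """Compare predicted match pairs with verified pairs.

    Pairs are treated as unordered, so (1, 2) and (2, 1) are the same pair.
    """
    predicted = {frozenset(pair) for pair in predicted}
    truth = {frozenset(pair) for pair in truth}
    return {
        "true_positives": predicted & truth,
        "false_positives": predicted - truth,  # matched, but shouldn't be
        "false_negatives": truth - predicted,  # not matched, but should be
    }


# Hypothetical record IDs.
predicted_pairs = [(1, 2), (3, 4), (5, 6)]
verified_pairs = [(2, 1), (3, 4), (7, 8)]

result = evaluate_matches(predicted_pairs, verified_pairs)
tp = len(result["true_positives"])
precision = tp / (tp + len(result["false_positives"]))
recall = tp / (tp + len(result["false_negatives"]))

print(f"false positives: {[sorted(p) for p in result['false_positives']]}")
print(f"false negatives: {[sorted(p) for p in result['false_negatives']]}")
print(f"precision {precision:.2f}, recall {recall:.2f}")
```

This treats every pair as either right or wrong, which is exactly the assumption that starts to break down with dubious results.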

However, the fuzziness often used in the data matching process also applies to the evaluation of the results. For many dubious results the question isn’t whether the matched database rows reflect the same real world entity, but rather whether the matched (or not matched) database rows reflect different members of a real world hierarchy.

Example 1:

John Smith on 1 Main Street in Anytown
Mary & John Smith on 1 Main Str in Anytown

Example 2:

Anytown Municipality, Technical Dept
Municipality of Anytown

Example 3:

Acme Corporation, Anytown
Acme Corporation, Anywhere

All three examples above may be considered a false positive if matched and a false negative if not matched.

You may say that it depends on the purpose of use, which is true.

But if we are talking master data management, we will probably have to encompass multiple requirements where we simultaneously need the match and don’t want the match. That is why we need to be able to resolve and store the results from fuzzy data matching in hierarchies.
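One way to picture this is to store each match result as a typed link instead of a yes/no answer, and let each purpose decide which link types count as a match. The sketch below uses the three examples above; the relation names, the purposes and the rules are my own illustrative assumptions, not a reference to any specific MDM product.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    SAME_ENTITY = "same real world entity"
    SAME_HOUSEHOLD = "members of the same household"
    PART_OF = "unit within the same organisation"
    SAME_COMPANY_FAMILY = "sites in the same company family tree"


@dataclass(frozen=True)
class Link:
    record_a: str
    record_b: str
    relation: Relation


# The three examples, stored as typed links (relation types are assumed).
links = [
    Link("John Smith, 1 Main Street, Anytown",
         "Mary & John Smith, 1 Main Str, Anytown",
         Relation.SAME_HOUSEHOLD),
    Link("Anytown Municipality, Technical Dept",
         "Municipality of Anytown",
         Relation.PART_OF),
    Link("Acme Corporation, Anytown",
         "Acme Corporation, Anywhere",
         Relation.SAME_COMPANY_FAMILY),
]

# Hypothetical purposes and the relation types each one treats as a match.
PURPOSE_RULES = {
    "one catalogue per household": {
        Relation.SAME_ENTITY, Relation.SAME_HOUSEHOLD},
    "credit limit per company family": {
        Relation.SAME_ENTITY, Relation.PART_OF, Relation.SAME_COMPANY_FAMILY},
    "invoice delivery address": {Relation.SAME_ENTITY},
}


def is_match(link: Link, purpose: str) -> bool:
    """A link counts as a match only for purposes that accept its relation."""
    return link.relation in PURPOSE_RULES[purpose]


for purpose in PURPOSE_RULES:
    matched = [link.record_a for link in links if is_match(link, purpose)]
    print(f"{purpose}: {matched}")
```

The same stored link then serves the requirement that needs the match and the one that doesn’t, without running the matching twice.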
