Typos in the Cloud

On 1 January this year the second largest city in Denmark changed its name. It was only a minor change from “Århus” to “Aarhus”, replacing the Scandinavian letter Å with a double A, which is the normal conversion to the English alphabet.

Data quality would be a lot easier if people, companies and cities stopped changing names. It always goes wrong. First of all, a lot of data will be out of sync. And then the change itself may go wrong.

That is what happened at Google Maps. They introduced a typo, so the name of the city on the map is now “Aahrus”, with the r and the h in the middle of the name swapped.

For those out there not sure where on earth Århus/Aarhus/Aahrus is, it is the red dot in the upper right corner of the map below, where you have London and Paris in the lower left corner.


Boiling Data Silos

Yesterday there were some blog posts dealing with data silos.

Graham Rhind posted: Data silos – learn to live with them.

Rob Karel posted: Stop trying to put a monetary value on data – it’s the wrong path. Though it was not the main subject, there was a remark saying: “Attempting to boil the ocean and trying to solve Customer, Product, or Financial data for all processes and decisions across the whole organization is too big an effort destined to fail before it starts”.

Mark Montgomery made a comment on Rob’s post saying: “I also have trouble with the boil the ocean metaphor, which is used too often these days to justify all kinds of protectionist policies in the enterprise. You can’t have it both ways in the enterprise– either you have data silos or you don’t, and I argue that increasingly the world cannot afford them, albeit in highly secure formats in most situations”.

I guess we have to go for the golden mean on this one too. We shouldn’t accept data silos, but we must expect them. We could go for eliminating them, probably not in one big bang but slice by slice, as we climb up the levels in an information maturity model.

I would definitely expect to see fewer and smaller data silos at the top level of an information maturity model than on a bottom level of a data quality immaturity model.


Holistic Accuracy

In community economics you have two terms called

  • Partitive accuracy and
  • Holistic accuracy

In short, partitive accuracy is the accuracy of a single measure being part of a model while holistic accuracy is the accuracy of the model structure and its use. More information here.

I find these terms very useful in data quality and master data management as well.

The distinction between partitive accuracy and holistic accuracy resembles the distinction between data quality and information quality.

One problem with the term information quality is that it implies a certain context of use. That makes it hard to prepare data for multiple uses in any other way than by assuring the accuracy of the single data elements, which is similar to partitive accuracy.

One clue for assuring better information quality is looking at the model structure of data, which is similar to holistic accuracy. Here I am thinking beyond traditional data modeling, which is anchored in the technical world, and into how end users of master data hubs are able to build structures of data (with partitive accuracy) that fit the daily business use.

Examples of such holistic information capabilities in master data management are building flexible product hierarchies and hierarchies of party master data. These hierarchies should at the same time reflect hierarchies in the real world, such as households and company family trees, and hierarchies of related accounts and addresses used within the enterprise.

While a single data element, for example an address component like a postal code, may be partitively accurate, holistic accuracy is about how data elements contribute as parts of a data structure that fits multiple purposes of use.


Happy Uniqueness

When making the baseline for customer data in a new master data management hub, you often involve heavy data matching in order to de-duplicate the current stock of customer master data, so that you, so to speak, start with a cleansed, duplicate-free set of data.

I have been involved in such a process many times, and the result has never been free of duplicates. For two reasons:

  • Even with the best data matching tool and the best external reference data available, you obviously can’t settle all real world alignments with the confidence needed, and manual verification is costly and slow.
  • In order to make data fit for the business purposes, duplicates are required for a lot of good reasons.

Being able to store the full story from the result of the data matching efforts is what makes me, and the database, most happy.

A “golden record” is in fact often not a single record but a hierarchical structure that reflects both the real world entity, as far as we can get, and the instances of this real world entity in forms that are suitable for different business processes.
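The hierarchical structure described above could be sketched roughly as follows. This is only a minimal illustration; the class names, field names and sample values (`GoldenRecord`, `SourceRecord`, `purpose`, the system names) are assumptions made for the example and are not taken from any particular master data hub.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceRecord:
    """One instance of the entity, kept because a business process needs it."""
    source_system: str  # hypothetical system name, e.g. "billing"
    source_key: str     # key of the record in that system
    name: str           # the name as that system holds it
    purpose: str        # hypothetical label for why the instance is kept


@dataclass
class GoldenRecord:
    """Best available description of the real world entity."""
    golden_id: str
    name: str
    instances: List[SourceRecord] = field(default_factory=list)

    def instance_for(self, purpose: str) -> Optional[SourceRecord]:
        """Return the first instance kept for the given purpose, if any."""
        for record in self.instances:
            if record.purpose == purpose:
                return record
        return None


# Two "duplicates" survive on purpose, both linked to one golden record.
acme = GoldenRecord("G1", "Acme Corp")
acme.instances.append(SourceRecord("billing", "B-17", "ACME Corporation", "invoicing"))
acme.instances.append(SourceRecord("webshop", "W-903", "Acme", "shipping"))
```

The point of the design is that the match result is stored rather than thrown away: the duplicates stay, but each one knows which real world entity it belongs to.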

Some of the tricky constructions that exist in the real world and are usual suspects for multiple instances of the same real world entity are described in the blog posts:

The reasons for having business rules leading to multiple versions of the truth are discussed in the posts:

I’m looking forward to yet another party master data hub migration next week under the above conditions.


Hierarchical Completeness

A common technique used when assessing data quality is data profiling. For example, you may compute measures such as the number of fields in a table that have null or blank values, the distribution of filled length of a certain field, average values, highest values, lowest values and so on.

If we look at the most prominent entity types in master data management, customers and products, you may certainly also profile your customer tables and product tables, and indeed many data profiling tutorials use these common sorts of tables as examples.
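A few of these measures can be sketched for a single column in plain Python. This is a minimal illustration, not a profiling tool; the function name `profile_column` and the sample postal codes are made up for the example.

```python
from collections import Counter


def profile_column(values):
    """Return simple profiling measures for one column of raw values."""
    nulls = sum(1 for v in values if v is None)
    # A blank is a value that is present but empty after trimming whitespace.
    blanks = sum(1 for v in values if v is not None and str(v).strip() == "")
    filled = [str(v).strip() for v in values
              if v is not None and str(v).strip() != ""]
    return {
        "rows": len(values),
        "nulls": nulls,
        "blanks": blanks,
        # How many filled values have each trimmed length.
        "length_distribution": dict(Counter(len(v) for v in filled)),
    }


# Hypothetical sample data
postal_codes = ["8000", None, "", "8200", "DK-8000", "  "]
profile = profile_column(postal_codes)
```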

However, in real life profiling an entire customer table or product table will often be quite meaningless. You need to dig into the hierarchies in these data domains to get meaningful measures for your data quality assessment.

Customer master data

In profiling customer master data you must consider the different types of party master data, such as business entities, department entities, consumer entities and contact entities, as the demands for completeness will be different for each type. If your raw data don’t have a solid categorization in place, a prerequisite for data profiling will often be to make such a categorization before going any further.

If your customer data model isn’t too simple, as explained in the post A Place in Time, your location data (like shipping addresses, billing addresses, visiting addresses) will be separated from your customer naming and identification data. This hierarchical structure must be considered in your data profiling.
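Measuring completeness per party type rather than for the whole table could look like the sketch below. The field names (`type`, `vat_number`, `birth_date`) and the records are hypothetical. Records without a category are grouped as "uncategorized", which makes the missing categorization visible.

```python
from collections import defaultdict


def completeness_by_type(records, type_field, fields):
    """Share of filled values per field, computed separately per party type."""
    counts = defaultdict(int)
    filled = defaultdict(lambda: defaultdict(int))
    for rec in records:
        party_type = rec.get(type_field) or "uncategorized"
        counts[party_type] += 1
        for f in fields:
            value = rec.get(f)
            if value is not None and str(value).strip():
                filled[party_type][f] += 1
    return {
        t: {f: filled[t][f] / n for f in fields}
        for t, n in counts.items()
    }


# Hypothetical sample data
parties = [
    {"type": "business", "name": "Acme", "vat_number": "DK12345678", "birth_date": None},
    {"type": "business", "name": "Bolt", "vat_number": "", "birth_date": None},
    {"type": "consumer", "name": "Ann", "vat_number": None, "birth_date": "1980-02-01"},
]
result = completeness_by_type(parties, "type", ["vat_number", "birth_date"])
```

Across the whole sample table the VAT number is filled in one of three rows, which says little. Per type it is filled for half of the businesses, which is the figure that matters, and it is not expected for consumers at all.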

For international customer data there will also be different demands and possibilities for completeness of customer data elements.    

Depending on your industry and way of doing business, there may also be different demands for customer data related to different industry verticals, demographic groups and data sourced in different channels. However, this may be slippery ground, as current and not least future requirements for multiple uses of the same master data may change the picture.

Product master data

For most businesses the requirements for completeness and other data profiling measures will be very different depending on the product type.

Some requirements will only apply to a small range of products; other requirements apply to a broader range of products.

All in all, the data profiling requirements are an integrated part of hierarchy management for product master data, which makes a very strong case for having data profiling capabilities implemented as part of a product information management (PIM) solution.

Multi-Domain Master Data Management

For master data management solutions embracing both customer data integration (CDI) and product information management (PIM), integrated capabilities for profiling customer master data, location master data and product master data as part of hierarchy management make a lot of sense.

Just as improving data quality isn’t a one-off activity but a continuous program, so is measuring the completeness of your master data of any kind.


All that glisters is not gold

As William (not Bill) Shakespeare wrote in the play The Merchant of Venice:

All that glisters is not gold;
Often have you heard that told

I was reminded of that phrase when commenting on a comment from John Owens on my recent post called Non-Obvious Entity Relationship Awareness.

Loraine Lawson wrote a piece on IT Business Edge yesterday called Adding Common Sense to Data Quality. That post relates to a post by Phil Simon on Mike 2.0 called Data Error Inequality. That post relates to a post on this blog called Pick Any Two.

Anyway, one lesson from all this glistering relationship fuzz is that, when looking for return on investment (Gold) in data quality improvement and master data management perfection, I agree with adding some common sense.

One of the first posts on this blog actually was Data Quality and Common Sense.  


#MDM is dead, long live #XXX

When tweeting about Master Data Management (MDM) it has been customary to use the #MDM hashtag.

However, you have sometimes seen other subjects tagged with #MDM, often in languages other than English, for example “Matin de Merde”.

But now #MDM has been completely taken over by the Tourism Queensland (Australia) Million Dollar Memo campaign.

So, Master Data Management tweeps: Do we have to find a new hashtag?

Is #MasterDataManagement too long?

Other suggestions?


The Worst Best Sale

One of my big disappointments from my data quality tool selling days was being involved in a great license sale.

It was a new way of doing business. The initial contact was made through social media by getting into conversation with a key employee at one of the not so small players in world-wide multi-channel fashion selling.

From there it was also the good old way of doing business. We spent more than a year on proof of concept and price bargaining until finally standing head to head with one other competitor: The Data Quality quadrant leader owned by a company with the same name as our local airline.

Done deal, and then a few days later much of the business in question was outsourced. I’m actually not aware whether the outsourcing partner had some homemade data quality techniques or couldn’t care less about data quality.

But there is an unopened box with a data quality tool somewhere out there.


Non-Obvious Entity Relationship Awareness

In a recent post here on this blog it was discussed: What is Identity Resolution?

One angle was the interchangeable use of the terms “Identity Resolution” and “Entity Resolution”. There are three ways to see this: the terms are truly interchangeable; “Identity Resolution” is more advanced than “Entity Resolution”; or (my suggestion) “Identity Resolution” relates only to party master data, while “Entity Resolution” can be about all master data domains such as parties, locations and products.

Another term sometimes used in this realm is “Non-Obvious Relationship Awareness”. This term also relates only to finding relationships between parties, for example individuals at a casino who seem to do better than the croupiers. Here’s a link to a (rather old) O’Reilly Radar post on Non-Obvious Relationship Awareness.

Going Multi-Domain

So “Non-Obvious Entity Relationship Awareness” could be about finding these hidden relationships in a multi-domain master data scope.

An example could be non-obvious relationships in a customer/product matrix.

The data supporting this discovery will actually not be found in the master data itself, but in transaction data, probably held in an Enterprise Data Warehouse (EDW). But a multi-domain master data management platform will be needed to support the complex hierarchies and categorizations needed to make the discovery.

One technical aspect of discovering such non-obvious relationships is how chains of keys are stored in the multi-domain master data hub.

Customer Master Data

The transactions, or aggregates of them, in the data warehouse will have keys referencing customer accounts. These accounts can be stored in staging areas in the master data hub with references to a golden record for each individual or company in the real world. Depending on the identity resolution available, the golden records will have golden relations to each other, forming hierarchies of households, company family trees, contacts within companies, their movements between companies and so on.

My guess, as described in the post Who is working where doing what?, is that this will increasingly include social media data.

Product Master Data

Some of the same transactions, or aggregates of them, in the data warehouse will have keys referencing products. These products will exist in the master data hub as members of various hierarchies with different categorizations.

My guess is that future developments in this field will further embrace not just your own products but also competitor products and market data available in the cloud, all attached to your hierarchies and categorizations.
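The chains of keys on both the customer and the product side can be illustrated with a small sketch. Everything here is hypothetical: the account, party, household and product identifiers, the category names and the dictionary-based storage are assumptions chosen only to show how transaction keys might be followed up through the hierarchies into a customer/product matrix.

```python
from collections import Counter

# Hypothetical key chains as they might be held in a master data hub.
account_to_golden = {"A1": "P1", "A2": "P1", "A3": "P2", "A4": "P3"}  # account -> golden party record
golden_to_household = {"P1": "H1", "P2": "H1", "P3": "H2"}            # golden record -> household
product_to_category = {"SKU1": "baby care", "SKU2": "baby care", "SKU3": "garden"}

# Transactions from the data warehouse carry only account and product keys.
transactions = [("A1", "SKU1"), ("A2", "SKU3"), ("A3", "SKU2"), ("A4", "SKU3")]


def household_category_matrix(transactions):
    """Count transactions per (household, product category) pair."""
    matrix = Counter()
    for account, product in transactions:
        # Follow the customer chain: account -> golden record -> household.
        household = golden_to_household[account_to_golden[account]]
        # Follow the product chain: product -> category.
        category = product_to_category[product]
        matrix[(household, category)] += 1
    return matrix


matrix = household_category_matrix(transactions)
```

In this sample, accounts A1 and A3 belong to different persons and bought different products, so nothing connects them at the account level. Only after following both chains does it show that one household made two purchases in the same category.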


As Bill Shakespeare Wrote …

This post is a follow-up on the post Foreign Affairs and the post Fuzzy Matching and Information Quality over at the Mastering Data Management blog.

The fuzzy post and its comments, including mine, circle around how the relation between “Bill” and “William” must be handled in data matching.

While “Bill” and “William” may be used interchangeably in modern Anglo-Saxon data, it may be a mistake in time (anachronism) to use them interchangeably in relation to the grand old playwright.

Also, it may be a mistake in place to use them interchangeably in other cultures.

For example, in my home country Denmark “Bill” and “William” are two different names. Globalization has been going on for a long time, as far more people are baptized (or otherwise given the name) William than the original Danish form Wilhelm. There are only 286 people with the name Wilhelm today, as opposed to 7,355 with the name William, including 800 new ones during the last year. And then there are 353 different people with the name Bill.

But the same use of nicknames has not been localized here yet.

So with Danish data, matching “Bill Nielsen” with “William Nielsen” is almost certainly a false positive.

It’s not that it’s a big problem; the risk of making the mistake is very low. The problem is rather that the focus should be on other, more pressing issues, with the specific challenges (and possibilities) related to data from each culture and country.
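The culture-dependent handling of nicknames discussed above could be sketched like this. The table contents, the country codes and the function names are assumptions for illustration only; a real matching tool would use far richer reference data.

```python
# Illustrative nickname tables keyed by country code; contents are assumptions.
NICKNAMES = {
    "US": {"bill": "william", "bob": "robert"},
    "DK": {},  # per the post: Bill and William are distinct names in Denmark
}


def canonical_given_name(name, country):
    """Map a given name to its canonical form for the given country, if any."""
    key = name.strip().lower()
    return NICKNAMES.get(country, {}).get(key, key)


def same_given_name(a, b, country):
    """True if the two given names count as the same name in that country."""
    return canonical_given_name(a, country) == canonical_given_name(b, country)
```

With these tables, `same_given_name("Bill", "William", "US")` is true while `same_given_name("Bill", "William", "DK")` is false, so the Danish “Bill Nielsen” and “William Nielsen” would not be merged on the given name alone.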
