A Little Bit of Truth vs A Big Load of Trust

The soul of Master Data Management (MDM) is often explained as the search for a single version of the truth. It has always puzzled me that in many cases this search has been about finding the truth as the best data within the different data silos inside a given organization.

Big data, including how MDM and big data can be a good match, has been a well-covered subject lately. As discussed in the post Adding 180 Degrees to MDM, this has shed light on how external data may help us get better master data by looking at data from the outside in.

Gartner, the analyst firm, has phrased that movement as a shift from truth to trust, for example in the post by Andrew White called From MDM to Big Data – From truth to trust.

Don’t get me (and master data) wrong. The truth isn’t out there in a single silver bullet shot. You have to mash up your internal master data with some of the most trustworthy external big reference data. These sources include commercial directory offerings, open data possibilities, public sector data (made available for private entities) and social networks.

Indeed, there are potholes on that path. Timeliness of directories, completeness of open data, the consistency, availability and price tags of public sector data, and the validity of social network data are common challenges.


Building an instant Data Quality Service for Quotes

In yesterday’s post called Introducing the Famous Person Quote Checker, the issue with all the quotes floating around in social media about things apparently said by famous persons was touched upon.

The bumblebee can’t fly faster than the speed of light – Albert Einstein

If you were to build a service that could avoid postings with disputable quotes, what considerations would you have then? Well, I guess pretty much the same considerations as with any other data quality prevention service.

Here are three things to consider:

Getting the reference data right

Finding the right sources for say reference data for world-wide postal addresses was discussed in the post A Universal Challenge.

The same way, so to speak, it will be hard to find a single source of truth about what famous persons actually said. It will be a daunting task to make a registry of confirmed quotes.

Embracing diversity

Staying with postal addresses this blog has a post called Where the Streets have one Name but Two Spellings.

In the same way, so to speak again, quotes are translated, transliterated and transcribed from the original language and writing system. So every quote may have many true versions.

Where to put the check?

As examined in the post The Good, Better and Best Way of Avoiding Duplicates there are three options:

1) A good and simple option could be to periodically scan through postings in social media and, when a disputable quote is found, send an eMail to the culprit who did the posting. However, it’s probably too late: even if you, for example, delete your tweet, the 250 retweets will still be out there. But it’s a reasonable way to start marking up all the disputable quotes out there.

2) A better option could be a real-time check. You type in a quote on a social media site and the service prompts you: “Hey Dude, that person didn’t say that”. The weak point is that you already did all the typing, and now you have to find a new quote. But it will work when people try to share disputable quotes.

3) The best option would be that you start typing “If you can’t explain it simply…” and the service prompts a likely quote such as: “Everything should be as simple as it can be, but not simpler – Albert Einstein”.
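
To make the three options a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the tiny `CONFIRMED` registry (filled with the example quote from above, not a verified source), the function names and the similarity threshold. A real service would need the curated, multi-version reference data discussed earlier.

```python
from difflib import SequenceMatcher

# Hypothetical registry: author -> accepted versions of quotes.
# The entry is only the example used in this post, not verified reference data.
CONFIRMED = {
    "Albert Einstein": [
        "Everything should be as simple as it can be, but not simpler",
    ],
}


def normalize(text):
    """Lower-case and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Similarity ratio between 0 and 1 for two quote strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_disputable(quote, author, threshold=0.85):
    """Option 2, real-time check: no confirmed version is close enough.

    An author missing from the registry makes every quote disputable.
    """
    versions = CONFIRMED.get(author, [])
    return all(similarity(quote, v) < threshold for v in versions)


def scan(postings):
    """Option 1, periodic scan: return the postings with disputable quotes."""
    return [p for p in postings if is_disputable(p["quote"], p["author"])]


def suggest(typed, author):
    """Option 3, type-ahead: confirmed quotes for the author, best match first."""
    versions = CONFIRMED.get(author, [])
    return sorted(versions, key=lambda v: similarity(typed, v), reverse=True)
```

The same matching function serves all three options; only the moment at which it is called differs, which is the point of the list above.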


On Maps, Data Quality and MDM

Maps are great but sometimes you’ll have some trouble with data quality issues on maps as told in the post Troubled Bridge over Water.

When it comes to political borders on maps, things may get really nasty, as happened lately for Huawei when a congratulation to Pakistan on its independence day showed a map with borders not in line with the Pakistani version of the truth. The story is told here.

There are plenty of disputes about borders in the world, stretching from the serious situations in the Himalaya region to, for example, the close to comical case between Canada and Denmark/Greenland over Hans Island.

In these situations you can’t settle on a single version of the truth.

However, even if we don’t have disputes on what is right or wrong we may have very different views on how to look at various entities as examined in the post The Greenland Problem in MDM.


Hierarchical Data Matching

A year ago I wrote a blog post about data matching published on the Informatica Perspective blog. The post was called Five Future Data Matching Trends.

One of the trends mentioned is hierarchical data matching.

The reason we need what may be called hierarchical data matching is that more and more organizations are looking into master data management. They then realize that the classic name and address matching rules do not necessarily fit when party master data are going to be used for multiple purposes. What constitutes a duplicate in one context, like sending a direct mail, doesn’t necessarily make a duplicate in another business function and vice versa. Duplicates come in hierarchies.

One example is a household. You probably don’t want to send two sets of the same material to a household, but you might want to engage in a 1-to-1 dialogue with the individual members. Another example is that you might do some very different kinds of business with the same legal entity. Financial risk management is the same, but different sales or purchase processes may require very different views.

I usually divide a data matching process into three main steps:

  • Candidate selection
  • Match scoring
  • Match destination

(More information on the page: The Art of Data Matching)

Hierarchical data matching is mostly about the last step where we apply survivorship rules and execute business rules on whether to purge, merge, split or link records.
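
As a rough illustration of the three steps, here is a minimal Python sketch. The sample records, the postcode blocking key, the thresholds and the destination labels are all assumptions made up for the example; real tools use far richer rules.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Made-up party records; the postcodes are placeholders.
records = [
    {"id": 1, "name": "John Smith", "address": "1 Main Street", "postcode": "A1"},
    {"id": 2, "name": "Jon Smith", "address": "1 Main Street", "postcode": "A1"},
    {"id": 3, "name": "Mary Smith", "address": "1 Main Street", "postcode": "A1"},
    {"id": 4, "name": "John Smith", "address": "9 Other Road", "postcode": "B2"},
]


def sim(a, b):
    """Case-insensitive string similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(recs):
    """Step 1, candidate selection: only compare records sharing a postcode."""
    blocks = defaultdict(list)
    for r in recs:
        blocks[r["postcode"]].append(r)
    for block in blocks.values():
        yield from combinations(block, 2)


def score(a, b):
    """Step 2, match scoring: one score per attribute."""
    return {
        "name": sim(a["name"], b["name"]),
        "address": sim(a["address"], b["address"]),
    }


def destination(s, name_t=0.85, addr_t=0.9):
    """Step 3, match destination: decide what to do with the pair."""
    if s["address"] >= addr_t and s["name"] >= name_t:
        return "merge"            # same individual
    if s["address"] >= addr_t:
        return "link_household"   # different individuals, same household
    return "keep_separate"


results = {
    (a["id"], b["id"]): destination(score(a, b))
    for a, b in candidate_pairs(records)
}
```

With these sample records, the two spellings of John Smith at the same address end up as a merge, while Mary Smith is linked to them at household level. Record 4 never becomes a candidate because it sits in another postcode block.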

In my experience there are a lot of data matching tools out there capable of handling candidate selection, match scoring, purging records and, to some degree, merging records. But solutions are sparse when it comes to more sophisticated things like splitting an original entity into two or more entities, for example by Splitting Names, or linking records in hierarchies in order to build a Hierarchical Single Source of Truth.


180 Degree Prospective Customer View isn’t Unusual

My eMail inbox is collecting received mails from several eMail accounts and therefore it’s not unusual to have duplicate messages in there.

This morning I had two eMails coming in to two different eMail accounts probably part of the same campaign but with different messages:

180 degree

Apparently I have landed in two different segments with two different eMail accounts: One technology oriented and one sales and marketing oriented.

Record linking of sparse subscription profiles isn’t easy, and even Informatica, a big player in Master Data Management and Data Quality solutions, has ground to cover in this game.


Hear ye, hear ye, hear ye


A certain birth in London the other day was widely visualized by the announcement by a royal crier in front of St. Mary’s Hospital.

However, as reported by International Business Times here, the crier in fact just crashed the party, as he wasn’t invited by any Royal party. But the cries and included facts were true right enough.

So, this time everything was OK. But in general it’s amazing how we confuse great visualization with trustworthiness.


Where the Streets have one Name but Two Spellings

Last week’s post called Where The Streets have Two Names caught a lot of comments both on this blog and in LinkedIn groups, as here on Data Quality Professionals and on The Data Quality Association, with a lot of examples from around the world on how this challenge actually exists more or less everywhere.

Recently I had the pleasure of experiencing a variant of the challenge when driving around in a rented car in the Saint Petersburg area in Russia. Here the streets usually have only one name, but that name may be presented in two different alphabets: the local Cyrillic, or the Latin alphabet I’m used to, which was also included in the reference data on the Sat Nav. So while it was nice for me to type destinations in Latin letters, it was also nice to have directions in Cyrillic in order to follow the progress on road signs.

So here standardization (or standardisation) to one preferred language, alphabet or script system isn’t the best solution. Best of breed solutions for handling addresses must be able to handle several right spellings for the same address.

Street sign in Cyrillic with Latin subtitle
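
A small Python sketch of what handling several right spellings could look like: one street entity, several valid spellings keyed by script, and a lookup that accepts any of them. The identifier `spb-nevsky`, the structure and the function names are assumptions for illustration, with Nevsky Prospekt in Saint Petersburg as the sample street.

```python
# One street, several equally right spellings, keyed by script code.
# "spb-nevsky" is a made-up identifier for this example.
STREETS = {
    "spb-nevsky": {"Latn": "Nevsky Prospekt", "Cyrl": "Невский проспект"},
}

# Reverse index: any accepted spelling points to the same street entity.
INDEX = {
    spelling.casefold(): street_id
    for street_id, names in STREETS.items()
    for spelling in names.values()
}


def resolve(text):
    """Return the street id for any known spelling, or None if unknown."""
    return INDEX.get(text.strip().casefold())


def display(street_id, script):
    """Return the spelling of a street in the requested script."""
    return STREETS[street_id][script]
```

So a destination typed in Latin letters can be resolved once and then shown in Cyrillic for reading the road signs, without declaring either spelling the standard one.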


The Country List

It’s the second day of the MDM Summit Europe 2013 in London today.

The last session I attended today was an expert panel on Reference Data Management (RDM).

I guess the list of countries on this planet is the prime example of what reference data is, and today’s session was no exception.

Even though a list of countries is fairly small and there shouldn’t be everyday changes to the list, maintaining a country list isn’t as simple as you might think.

First of all, official sources for a country list aren’t in agreement. The range of countries given an ISO code isn’t the same as the range of countries where, for example, the Universal Postal Union (UPU) says you can make a delivery.

Another example I have had some challenges with is the D&B WorldBase (a large worldwide business directory), which has four country codes for what is generally regarded as the United Kingdom. The D&B country reference data was probably defined by a soccer fan recognizing the distinct national soccer teams of England, Wales, Scotland and Northern Ireland.
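
One common way to live with disagreeing sources is a crosswalk table between them. The Python sketch below maps directory-specific country codes to ISO 3166-1 alpha-2 codes. The `DIR-…` codes are invented placeholders, not the actual D&B codes; only the ISO codes `GB` and `DK` are real.

```python
# Crosswalk from a directory's own country codes to ISO 3166-1 alpha-2.
# The "DIR-..." codes are hypothetical placeholders for this example.
DIRECTORY_TO_ISO = {
    "DIR-ENG": "GB",  # England
    "DIR-WLS": "GB",  # Wales
    "DIR-SCT": "GB",  # Scotland
    "DIR-NIR": "GB",  # Northern Ireland
    "DIR-DK": "DK",   # Denmark
}


def to_iso(directory_code):
    """Translate a directory country code to its ISO code."""
    try:
        return DIRECTORY_TO_ISO[directory_code]
    except KeyError:
        raise ValueError(f"No ISO mapping for {directory_code!r}") from None


def directory_codes_for(iso_code):
    """All directory codes that roll up to one ISO country."""
    return sorted(c for c, iso in DIRECTORY_TO_ISO.items() if iso == iso_code)
```

The mapping is many-to-one in one direction and one-to-many in the other, which is exactly why a flat country list is not enough as reference data.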

The expert panel moderator, Aaron Zornes, went as far as suggesting that a graph database may be the best technology for reflecting the complexity in reference data. Oh yes, and in master data too, you might think, though I doubt that the relational database and hierarchy management will go out of fashion anytime soon.
