Storing a Single Version of the Truth

An ever-recurring subject in the data quality and master data management (MDM) realms is whether we can establish a single version of the truth.

The most prominent example is whether an enterprise can implement and maintain a single version of the truth about business partners such as customers, prospects, suppliers and so on.

In the quest for establishing that (fully reachable or not) single version of the truth we use identity resolution techniques such as data matching, and we exploit an ever-increasing range of external reference data sources.

However, whatever is possible in aiming for that (fully reachable or not) single version of the truth, I am often limited by the practical possibilities for storing it.

In storing party master data (and other kinds of data) we may consider these three different approaches:

Flat files

This “Keep It Simple, Stupid” way of storing data has long been in retreat. It is still common, though, and new incarnations of big flat file data structures keep emerging.

Also, many external sources of reference data are still flat-file-like, and the overwhelming choice for exchanging reference and master data is still flat files.

Despite lots of work on solutions for storing the complex links of the real world in flat files, we basically end up with very simplified representations of the real world (and the truth derived from it) in those flat files.
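
A minimal sketch, assuming hypothetical columns and rows, of how a flat file squeezes real-world links into a couple of fixed columns:

```python
# Minimal sketch: a flat file forces real-world links into fixed,
# simplified columns. All columns and rows here are hypothetical.

import csv, io

flat = io.StringIO(
    "name,parent_company,household_head\n"
    "Acme Nordic,Acme Global,\n"   # one parent slot - no deeper family tree
    "Anna Berg,,Jens Berg\n"       # one household link - no wider network
)

for row in csv.DictReader(flat):
    print(row)
```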

Relational databases

Most Customer Relationship Management (CRM) systems are based on a relational data model, but usually a quite basic one as far as master data structures go, making it far from straightforward to reflect the most common hierarchical structures of the real world: company family trees, contacts working for several accounts and individuals forming a household.

Master Data Management hubs are of course built for storing exactly these kinds of hierarchical structures. Common challenges here are that there is often no point in doing so as long as the surrounding applications can't follow, and that you may end up restricting yourself to a simplified model anyway, like an industry model.
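
As a minimal sketch, assuming a simple parent-pointer design (the table and rows are hypothetical), a relational model can hold a company family tree like this:

```python
# Minimal sketch: a relational party table with a parent pointer
# holding a company family tree. Table and rows are hypothetical.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE party (id INTEGER PRIMARY KEY, name TEXT, "
           "parent_id INTEGER REFERENCES party(id))")
db.executemany("INSERT INTO party VALUES (?, ?, ?)",
               [(1, "Acme Global", None),
                (2, "Acme Nordic", 1),
                (3, "Acme Denmark", 2)])

# A recursive query walks the family tree from the top down.
for name, depth in db.execute("""
    WITH RECURSIVE tree(id, name, depth) AS (
        SELECT id, name, 0 FROM party WHERE parent_id IS NULL
        UNION ALL
        SELECT p.id, p.name, t.depth + 1
        FROM party p JOIN tree t ON p.parent_id = t.id)
    SELECT name, depth FROM tree ORDER BY depth"""):
    print("  " * depth + name)
```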

Neural networks

The relations between parties in the real world are in fact not truly hierarchical. That is why we look for inspiration in the networks of biological neurons.

Doing so is an option I have heard about for many years, but I have yet to meet it as a concrete choice when delivering a single version of the truth.
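
An actual neuron-inspired store is beyond a short sketch; the closest concrete analogue I can offer is a graph of typed relations, which drops the hierarchy assumption. All parties and relation types below are hypothetical:

```python
# Minimal sketch: party relations stored as a graph (adjacency list),
# since the real world isn't truly hierarchical. A contact can work
# for several accounts at once. All parties here are hypothetical.

relations = {
    ("Anna Berg", "works_for"): ["Acme Nordic", "Berg Consulting"],
    ("Anna Berg", "member_of"): ["Berg household"],
    ("Acme Nordic", "subsidiary_of"): ["Acme Global"],
}

def related(party: str, relation: str) -> list[str]:
    """Follow one typed edge from a party node."""
    return relations.get((party, relation), [])

print(related("Anna Berg", "works_for"))  # two accounts - no single parent
```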


Hell in Norway

Looking for inappropriate words in customer data is always a risky business. Most of the time there is a legitimate name or a place somewhere containing that word.

Like when you see a city named “Hell”.

Outside the English-speaking parts of the world you will find “Hell” in Norway. It’s a village with its own postal code (NO-7517) situated in the Trondheim metropolitan area. Not least at this time of year, with winter in the Northern Hemisphere, it is surely considerably colder than the religious “Hell”.

But even in the English-speaking world you will find a semi-legitimate “Hell” in Michigan, United States.
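
To make the risk concrete, here is a minimal sketch, with a tiny hypothetical word list and whitelist, of why such a check needs a gazetteer of legitimate place names:

```python
# Minimal sketch: a naive "inappropriate word" check produces false
# positives on legitimate place names. The word list and whitelist
# below are tiny hypothetical samples, not from any real product.

SUSPICIOUS_WORDS = {"hell"}

# Legitimate places that happen to contain a suspicious word.
PLACE_WHITELIST = {
    ("hell", "NO"),  # Hell, Norway (postal code NO-7517)
    ("hell", "US"),  # Hell, Michigan, United States
}

def flag_city(city: str, country: str) -> bool:
    """Return True only if the city looks suspicious AND is not a known place."""
    word = city.strip().lower()
    if word not in SUSPICIOUS_WORDS:
        return False
    return (word, country.upper()) not in PLACE_WHITELIST

print(flag_city("Hell", "NO"))   # False - legitimate Norwegian village
print(flag_city("Hell", "XX"))   # True - worth a manual review
```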


Despite Best Intentions

Sometimes you have the best intentions of improving things, data quality among a lot of other things, but somewhere along the way you failed to see the big picture, and now it is too late to correct.

From the sports world, this apparently happened to the Singapore water polo team at the current Asian Games.

They have newly designed speedos honoring the nation’s flag.

But now some ministry tells them that the swimsuit is inappropriate. And you can’t change outfits during the games.

By the way: I also work at a company with this logo:

Fortunately we haven’t got company speedos.


Testing a Data Matching Tool

Many technical magazines test a range of similar products, like in the IT world comparing a range of CPUs or a selection of word processors. The tests compare measurable things such as speed, the ability to actually perform a certain task and, importantly, the price.

With enterprise software such as data quality tools, we only have analyst reports evaluating the tools on far less measurable factors, often giving a result roughly equivalent to stating market strength. The analysts haven’t compared actual speed; they have not tested the ability to do a certain task, nor taken the price into consideration.

A core feature in most data quality tools is data matching. This is the discipline where data quality tools are able to do something considerably better than more common technology such as database managers and spreadsheets, as told in the post about deduplicating with a spreadsheet.

In the LinkedIn data matching group we have on several occasions touched the subject of doing a once-and-for-all benchmark of all data quality tools in the world.

My guess is that this is not going to happen. So, if you want to evaluate data quality tools, data matching is the prominent issue and you don’t just want a beauty contest, then you have to do as the queen in the fairy tale The Princess and the Pea did: make a test.

Some important differentiators in data matching effectiveness may narrow down the scope to your particular requirements, like:

  • Are you doing B2C (private names and addresses), B2B (business names and addresses) or both? (see the sketch after this list)
  • Do you only have domestic data or do you have international data with diversity issues?
  • Will you only go for one entity type (like customer or product) or are you going for multi-entity matching?
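
As one illustration of the B2B versus B2C point, here is a minimal sketch, with a tiny hypothetical sample of legal forms, of normalizing business names before comparison:

```python
# Minimal sketch: B2B name matching usually needs legal-form
# normalization that B2C person matching does not. The legal-form
# list below is a tiny hypothetical sample, not a complete registry.

import re

LEGAL_FORMS = {"ltd", "limited", "gmbh", "inc", "a/s", "aps", "sarl"}

def normalize_business_name(name: str) -> str:
    """Lowercase, strip punctuation and drop legal-form tokens."""
    tokens = re.sub(r"[^\w/ ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)

# Two spellings of the same (fictitious) company now compare equal:
print(normalize_business_name("Example Trading Ltd."))    # example trading
print(normalize_business_name("EXAMPLE TRADING LIMITED")) # example trading
```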

Making a proper test is not trivial.

Often you start by looking at the positive matches provided by the tool, counting the true positives against the false positives. Depending on the purpose, you want to see a very low figure for false positives relative to true positives.

Harder, but at least as important, is looking at the negatives (the records not matched), as explained in the post 3 out of 10.
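
A minimal sketch of that counting, assuming you have a hand-labeled gold standard of record pairs (building that gold standard is the hard part; the pair IDs are hypothetical):

```python
# Minimal sketch: scoring a matching tool's output against a
# hand-labeled gold standard of record pairs. The pair IDs here
# are hypothetical.

gold_matches = {("r1", "r2"), ("r3", "r4"), ("r5", "r6")}   # true duplicates
tool_matches = {("r1", "r2"), ("r3", "r4"), ("r7", "r8")}   # tool's output

true_positives = tool_matches & gold_matches
false_positives = tool_matches - gold_matches
false_negatives = gold_matches - tool_matches   # the missed ones

precision = len(true_positives) / len(tool_matches)
recall = len(true_positives) / len(gold_matches)

print(f"precision: {precision:.2f}")  # low false positives -> high precision
print(f"recall:    {recall:.2f}")     # few missed matches  -> high recall
```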

Next, two features are essential:

  • To what degree are you able to tune the match rules, preferably in a user-friendly way not requiring too much IT expert involvement?
  • Are you able to evaluate dubious matches in a speedy and user-friendly way, as shown in the post called When computer says maybe?

A data matching effort often has two phases:

  • An initial match of all currently stored data, maybe supported by matching with external reference data. Here speed may be important too; often you have to balance high speed against poorer results. Try it.
  • Ongoing matching assisting data entry and keeping up with data coming from outside your jurisdiction. Here data quality tools acting as service-oriented architecture components are a great plus, including reusing the rules from the initial match (see the sketch after this list). This has to be tested too.
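
Here is a minimal sketch of that reuse, with one shared match rule serving both phases. The rule and records are simplified hypotheticals, not any vendor's API:

```python
# Minimal sketch: the same match rule serves both the initial batch
# run and an ongoing entry-time service. Everything here (the rule,
# the records, the threshold) is a simplified hypothetical.

from difflib import SequenceMatcher

def is_match(a: dict, b: dict, threshold: float = 0.75) -> bool:
    """One shared rule: fuzzy name similarity plus exact postal code."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return name_sim >= threshold and a["postal"] == b["postal"]

def initial_batch_match(records: list[dict]) -> list[tuple[int, int]]:
    """Phase 1: compare all stored records pairwise (fine for a test set)."""
    return [(i, j) for i in range(len(records)) for j in range(i + 1, len(records))
            if is_match(records[i], records[j])]

def entry_service(new_record: dict, records: list[dict]) -> list[int]:
    """Phase 2: check one incoming record against the store, reusing the rule."""
    return [i for i, r in enumerate(records) if is_match(new_record, r)]

store = [{"name": "Hans C. Andersen", "postal": "5000"},
         {"name": "Hans Christian Andersen", "postal": "5000"}]
print(initial_batch_match(store))                                    # [(0, 1)]
print(entry_service({"name": "Hans C Andersen", "postal": "5000"}, store))
```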

And oh yes, from my experience with plenty of data quality tool evaluation processes: price is an issue too. Make sure to count both the license costs for all the needed features and the consultancy needed, as experienced in your tests.


The Princess and the Pea

I have earlier used the fairy tales of Hans Christian Andersen on this blog. This time it is the story about the princess and the pea.

The story tells of a prince who wants to marry a princess, but is having difficulty finding a suitable wife. Something is always wrong with those he meets, and he cannot be certain they are real princesses. One stormy night (always a harbinger of either a life-threatening situation or the opportunity for a romantic alliance in Andersen’s stories), a young woman drenched with rain seeks shelter in the prince’s castle. She claims to be a princess, so the prince’s mother decides to test their unexpected guest by placing a pea in the bed she is offered for the night, covered by 20 mattresses and 20 featherbeds. In the morning the guest tells her hosts—in a speech colored with double entendres—that she endured a sleepless night, kept awake by something hard in the bed; which she is certain has bruised her. The prince rejoices. Only a real princess would have the sensitivity to feel a pea through such a quantity of bedding. The two are married, and the pea is placed in the Royal Museum.

Buying a data quality tool is just as hard as it was for a prince to find a real princess in the good old days. How can you be certain that the tool is able to help you find the difficult, not-so-obvious flaws hidden in your already stored data or in the data streams coming in?

I think performing a test like the queen did in Andersen’s story is a must, and, like the queen, don’t tell the vendor about the pea. Wait and see if the tool gets black and blue all over from the pea.
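
A minimal sketch of planting such peas: seed the test data with known flaws and check afterwards which ones the tool caught. All records are hypothetical:

```python
# Minimal sketch: seed known flaws ("peas") into the test data and,
# after the tool has run, check which ones it actually found. All
# records here are hypothetical.

import random

clean = [
    {"id": 1, "name": "Grete Jensen", "city": "Randers"},
    {"id": 2, "name": "Ole Hansen", "city": "Odense"},
]

# The peas: planted duplicates and typos the vendor isn't told about.
peas = [
    {"id": 101, "name": "Grethe Jensen", "city": "Randers"},  # near-duplicate of 1
    {"id": 102, "name": "Ole Hansen", "city": "Odnese"},      # city typo of 2
]

test_file = clean + peas
random.shuffle(test_file)  # so the planted flaws aren't clustered at the end

# After the run: the flawed record IDs the tool reported (hypothetical).
tool_found_ids = {101}

for pea in peas:
    verdict = "felt" if pea["id"] in tool_found_ids else "slept right through"
    print(f"pea {pea['id']}: the tool {verdict}")
```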


Entity Revolution vs Entity Evolution

Entity resolution is the discipline of uniquely identifying your master data records, typically those holding data about customers, products and locations. Entity resolution is closely related to the concept of a single version of the truth.

Questions to be asked during entity resolution include:

  • Does a given customer master data record represent a real-world person or organization?
  • Is a person acting both as a private customer and as a small business owner going to be seen as the same party? (see the sketch after this list)
  • Is a product coming from supplier A going to be identified as the same as the identical product coming from supplier B?
  • Is the geocode for the center of a parcel the same place as the geocode for where the parcel borders a public road?
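
As a minimal sketch of the second question, with a deliberately naive, hypothetical rule for deciding that two role-specific records resolve to one party:

```python
# Minimal sketch: deciding whether a private-customer record and a
# business-owner record resolve to the same real-world person. The
# records and the rule are hypothetical and deliberately naive.

b2c = {"source": "webshop", "name": "Anna Berg", "birthdate": "1970-03-01",
       "address": "Storgata 1, Oslo"}
b2b = {"source": "crm", "name": "Anna Berg", "company": "Berg Consulting",
       "address": "Storgata 1, Oslo"}

def same_party(a: dict, b: dict) -> bool:
    """Hypothetical rule: same name at the same address -> one party."""
    return a["name"] == b["name"] and a["address"] == b["address"]

if same_party(b2c, b2b):
    # One golden record with two roles, instead of two unrelated records.
    golden = {"name": b2c["name"], "address": b2c["address"],
              "roles": ["private customer", "business owner"]}
    print(golden)
```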

We may come a long way in automating entity resolution by using advanced data matching and exploiting rich sources of external reference data, and we may be able to handle the complex structures of the real world by using sophisticated hierarchy management, thereby making an entity revolution in our databases.

But I am often faced with the fact that most organizations don’t want an entity revolution. There are always plenty of good reasons why frequent business processes don’t require full entity resolution and will only be complicated by having it (unless drastically reengineered). The tangible, immediate negative business impact of an entity revolution trumps the softer, positive improvement in business insight from such a revolution.

Therefore we are mostly making entity evolutions, balancing current business requirements against the distant ideal of a single version of the truth.


Donkey Business

When I started focusing on data quality technology 15 years ago I had great expectations about the spread of data quality tools, including the humble one I was building myself.

Even if you tell me that tools haven’t spread because people are more important than technology, I think most people in the data and information quality realm agree that the data and information quality cause hasn’t spread as much as it deserves.

Fortunately, it seems that interest in solving data quality issues is gaining traction these days. I have noticed two main drivers for that. If we compare with the traditional means of getting a donkey to move forward, one encouragement is like the carrot and the other is like the stick:

  • The carrot is business intelligence
  • The stick is compliance

Regarding business intelligence, a lot has been said and written about how business intelligence doesn’t deliver unless the intelligence is built on a solid, valid data foundation. As a result, I have noticed I’m increasingly involved in data quality improvement initiatives aimed at providing a foundation for business decisions. One of my favorite data quality bloggers, Jim Harris, has turned that carrot over a lot on his blog: Obsessive Compulsive Data Quality.

Another favorite data quality blogger, Ken O’Conner, has written about the stick, compliance, on his blog, where you will find a lot of good points Ken has learned from his extensive involvement in regulatory requirement issues.

These are interesting times, with a lot of requirements for solving data quality issues. As we all know, the stereotypical donkey is not easily driven forward, and we must be careful not to make the burden too heavy.


Bilateral Master Data Management

There is an issue I have come across over and over again when creating a master data hub, making a golden copy, establishing a single version of the truth or whatever we like to call it. The issue is the scope of data sources.

Basically you take (practically) all the master data sources from within your organization and consolidate those data. Often you match with external sources such as business directories and so on. But what you often miss is the master data operated by your partners. These are partners like:

  • Your suppliers of products, be that raw materials or finished products for resale
  • Your sales agents and distributors
  • Your service providers, such as direct marketing agencies and factoring partners

These partners are part of your business processes, and they often create and consume master data that is only shared with you in a limited way via some form of interface.

I know that even handling master data from within most organizations is a complex issue, and integrating with external reference data doesn’t add simplicity. But without embracing the master data living at your partners, the hub isn’t complete, the copy is only made of plated gold and the single version of the truth isn’t the only truth.
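
As a minimal sketch of what embracing a partner feed might look like (all source names, records and the survivorship rule are hypothetical):

```python
# Minimal sketch: consolidating internal sources AND a partner feed
# into one golden record per party. All sources, records and the
# survivorship rule here are hypothetical.

internal_crm = [{"name": "Nordic Foods A/S", "postal": "8900"}]
internal_erp = [{"name": "Nordic Foods", "postal": "8900"}]
partner_feed = [{"name": "Nordic Foods A/S", "postal": "8900",
                 "distributor_account": "D-4711"}]  # lives only at the partner

def party_key(rec: dict) -> tuple:
    """Hypothetical matching key: simplified name plus postal code."""
    return (rec["name"].lower().replace(" a/s", ""), rec["postal"])

golden: dict[tuple, dict] = {}
for source in (internal_crm, internal_erp, partner_feed):
    for rec in source:
        merged = golden.setdefault(party_key(rec), {})
        merged.update(rec)  # last non-missing attribute wins (simplistic)

# Without the partner feed, the golden record would lack the
# distributor account - the "plated gold" problem.
print(list(golden.values()))
```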

My guess is that many master data programs in the future will extend to embrace internal (private) data, as well as external (public) data and bilateral data as described on the page about Data Quality 3.0.


Data Quality Tool Exaggerations

When following articles and blogs about information and data quality you often meet a sentiment like this:

“Data Quality tool vendors describe their products as if they will solve every possible data quality challenge around once and for all”.

Some years ago I was involved in making the English text for a description of a data quality vendor and our products. Here is the text:

“With activities in Germany, Denmark, Norway, Sweden, Austria, Switzerland, Italy, Spain, and France, [our company] is one of the leading data quality experts in Europe. We provide ready-made solutions, products, and services that increase your profits by protecting and improving your company’s customer, address, supplier and product data.

[Our company] offers state-of-the-art solutions for all of the following tasks:

  • Find, match, and eliminate duplicates
  • Restructure customer, supplier, and product databases
  • Compare with major reference data suppliers in order to correct incorrect data records
  • Enrich existing data with missing information
  • Find customers when searching within CRM and ERP systems
  • Integrate Data Quality components in SOA environments
  • Create a Master Data Hub”

Now, I don’t think we promised to boil the ocean here.

Have you stumbled upon a description on websites, in white papers, on product sheets or the like where the vendor tells you that every data quality problem will be eliminated when you buy the tool?

Show me.


Movable Types

A big boost to knowledge sharing in man’s history came around the year 1450 (in the Gregorian calendar) when Johannes Gutenberg of Germany invented the use of movable type in printing. However, movable type was actually invented 400 years earlier in China, there using porcelain, and 200 years earlier in Korea, where metal was also used. But the East Asian inventions did not spread very well due to the script systems used, where you have thousands of different types representing each syllable or word, as opposed to an alphabet.

Anyway, the invention of movable type in printing is regarded as maybe the most important invention since someone invented the wheel (for the first time).

Data quality flaws also got a big boost with the sudden increase in printed work made possible by this invention. I remember my grandmother was a proofreader at a local newspaper; she always complained about journalists with poor spelling skills, and she was very upset when people’s names were spelled wrong in articles. I guess her reference file of correctly spelled names was in her head, as she knew every known person in the town (my town of birth: Randers).

The use of computers (including the internet) has brought the next big boost in knowledge sharing and in data quality flaws, including the introduction of the term data quality. Before that, poor data quality was just called sloppiness, I guess. The problem, however, stays the same: putting the right characters in the right order. The first time.
