Single Company View

27th April 2011

Getting a single customer view in business-to-business (B2B) operations isn’t straight forward. Besides all the fuzz about agreeing on a common definition of a customer within each enterprise usually revolving around fitting multiple purposes of use, we also have complexities in real world alignment.

One Number Utopia

Back in the 80’s I worked as a secretary for the committee that prepared a single registry for companies in Denmark. This practice has been live for many years now.

But in most other countries there are several different public registries for companies resulting in multiple numbering systems.

Within the European Union there is a common registry embracing VAT numbers from all member states. The standard format is the two letter ISO country code followed by the different formatted VAT number in each country – some with both digits and letters.

The DUNS-number used by Dun & Bradstreet is the closest we get to a world-wide unique company numbering system.  

2-Tier Reality

The common structure of a company is that you have a legal entity occupying one or several addresses.

The French company numbering system is a good example of how this is modeled. You have two numbers:

  • SIREN is a 9-digit number for each legal entity (on the head quarter address).
  • SIRET is a 14-digit (9 + 5) number for each business location.

This model is good for companies with several locations but strange for single location companies.

Treacherous Family Trees (and Restaurants)

The need for hierarchy management is obvious when it comes to handling data about customers that belongs to a global enterprise.

Company family trees are useful but treacherous. A mother and a daughter may be very close connected with lots of shared services or it may be a strictly matter of ownership with no operational ties at all.

Take McDonald’s as a not perfectly simple (nor simply perfect) example. A McDonald’s restaurant is operated by a franchisee, an affiliate, or the corporation itself. I’m lovin’ modeling it.

Bookmark and Share


What is Identity Resolution?

8th March 2011

We are continuously struggling with defining what it is we are doing like defining: What is data quality? What is Master Data? Lately I’ve been involved in discussions around: What is Identity Resolution? A current discussion on this topic is rolling in the Data Matching LinkedIn group.

This discussion has roots in one of my blog posts called Entity Revolution vs Entity Evolution. Jeffrey Huth of IBM Initiate followed up with the post Entity Resolution & MDM: Interchangeable? In January Phillip Howard of Bloor made a post called There’s identity resolution and then there’s identity resolution (followed up by a correction post the other day called My bad).

It is a “same same but different” discussion. Traditional data matching (or record linkage) as seen in a data quality tool and master data management solution is the bright view: Being about finding duplicates and making a “single business partner view” (or “single party view” or “single customer view”). Identity resolution is the dark view: Preventing fraud and catching criminals, terrorists and other villains.

The Gartner Hype Cycle describes the dark view as ”Entity Resolution and Analysis”. This discipline is approaching the expectation peak and will, according to Gartner, be absorbed by other disciplines as no one can tell the difference I guess.  

Certainly there are poles. In an article from 2006 called Identity Resolution and Data Integration David Loshin said: There is a big difference between trying to determine if the same person is being mailed two catalogs instead of one and determining if the individual boarding the plane is on the terrorist list.

But there is also a grey zone.

From a business perspective for example the prevention of misuse of a restricted campaign offer is a bit of both sides. Here you want to avoid that an existing customer is using an offer only meant for new customers. How does that apply to members of the same household or the same company family tree? Or you want to avoid someone using an introduction offer twice by typing her name and address a bit different.

From a technical perspective I have an example from working with a newspaper in a big fraud scam described in the post Big Time ROI in Identity Resolution. Here I had no trouble using a traditional deduplication tool in discovering non-obvious relationships. Also the relationships discovered in traditional data matching ends up quite nicely in hierarchy management as part of master data management as described in the post Fuzzy Hierarchy Management.

And then there is the use of the words identity (resolution) versus entity (resolution).

My feeling is that we could use identity resolution for describing all kind of matching and linking with party master data and entity resolution could be used for describing all kind of matching and linking with all master data entity types as seen in multi-domain master data management. But that’s just my words.

Bookmark and Share


Fuzzy Hierarchy Management

2nd March 2011

When evaluating results from automated data matching your goal is typically to find false positives and false negatives being entities that are matched, but shouldn’t be (false positives) and entities that are not matched, but should have been (false negatives).

However the fuzziness often used in the data matching process also apply to the evaluation of the results as many dubious results isn’t a question about if the matched database rows are reflecting the same real world entity but more a question about if the matched (or not matched) database rows are reflecting different members of a real world hierarchy.

Example 1:

John Smith on 1 Main Street in Anytown
Mary & John Smith on 1 Main Str in Anytown

Example 2:

Anytown Municipality, Technical Dept
Municipality of Anytown

Example 3:

Acme Corporation, Anytown
Acme Corporation, Anywhere

All three examples above may be considered a false positive if matched and a false negative if not matched.

You may say that it depends on the purpose of use, which is true.

But if we are talking master data management we may probably encompass multiple requirements where we simultaneously need the match and don’t want the match, which is why we need to be able to resolve and store the results from fuzzy data matching into hierarchies.

Bookmark and Share


Citizen ID and Biometrics

9th January 2011

As I have stated earlier on this blog: The solution to the single most frequent data quality problem being party master data duplicates is actually very simple: Every person (and every legal entity) gets a unique identifier which is used everywhere by everyone.

Some countries, like Denmark where I live, has a unique Citizen ID (National identification number). Some countries are on the way like India with the Aadhaar project. But some of the countries with the largest economies in the world like United Kingdom, Germany and United States don’t seem to getting it in the near future.

I think United Kingdom was close lately, but as I understand it the project was cancelled. As seen in a tweet from a discussion on twitter today the main obstacles were privacy considerations and costs:

A considerable cost in the suggested project in United Kingdom, and also as I have seen in discussions for a US project, may be that an implementation today should also include biometric technology.

The question is however if that is necessary.  

If we look at the systems in force today for example in Scandinavia they were implemented +40 years ago, and the Swedish citizen ID was actually implemented without digitalization in 1947. There are discussions going on about biometrics also as this is inevitable for issuing passports anyway. In the mean time the systems however continues to make a lot of data quality prevention and party master data management a lot easier than else around the world without having biometrics as a component.

No doubt about that biometrics will solve some problems related to fraud and so. But these are rare exceptions. So the cost/benefit analysis for enhancing an existing system with biometrics seems to be negative.     

I guess the alleged need for biometric may have something to do with privacy considerations in a strange way: Privacy considerations are often overruled by the requirements for fighting terrorism – and here you need biometrics in identity resolution.    

Bookmark and Share


Storing a Single Version of the Truth

30th November 2010

An ever recurring subject in the data quality and master data management (MDM) realms is whether we can establish a single version of the truth.

The most prominent example is whether an enterprise can implement and maintain a single version of the truth about business partners being customers, prospects, suppliers and so on.

In the quest for establishing that (fully reachable or not) single version of the truth we use identity resolution techniques as data matching and we are exploiting ever increasing sources of external reference data.

However I am often met with the challenge that despite what is possible in aiming for that (fully reachable or not) single version of the truth, I am often limited by the practical possibilities for storing it.

In storing party master data (and other kind of data) we may consider these three different ways:

Flat files

This “Keep It Simple, Stupid” way of storing data has been on an ongoing retreat – however still common, as well as new inventions of big flat file structures of data are emerging.

Also many external sources of reference data is still flat file like and the overwhelming choice of exchanging reference and master data is doing it by flat files.

Despite lots of work around solutions for storing the complex links of the real world in flat files we basically ends up with using very simplified representations of the real world (and the truth derived) in those flat files.  

Relational databases

Most Customer Relationship Management (CRM) systems are based on a relational data model, however mostly quite basic regarding master data structures making it not straight forward to reflect the most common hierarchical structures of the real world as company family trees, contacts working for several accounts and individuals forming a household.  

Master Data Management hubs are of course built for storing exactly these hierarchical kinds of structures. Common challenges here are that there often is no point in doing that as long as the surrounding applications can’t follow and that you often may restrict your use to a simplified model anyway like an industry model.   

Neural networks

The relations between parties in the real world are in fact not truly hierarchical. That is why we look into the inspiration from the network of biological neurons.

Doing that has been an option I have heard about for many years but still waits to meet as a concrete choice when delivering a single version of the truth.   

Bookmark and Share


Entity Revolution vs Entity Evolution

18th November 2010

Entity resolution is the discipline of uniquely identifying your master data records, typically being those holding data about customers, products and locations. Entity resolution is closely related to the concept of a single version of the truth.

Questions to be asked during entity resolution are like these ones:

  • Is a given customer master data record representing a real world person or organization?
  • Is a person acting as a private customer and a small business owner going to be seen as the same?
  • Is a product coming from supplier A going to identified as the same as the same product coming from supplier B?
  • Is the geocode for the center of a parcel the same place as the geocode of where the parcel is bordering a public road?

We may come a long way in automating entity resolution by using advanced data matching and exploiting rich sources of external reference data and we may be able to handle the complex structures of the real world by using sophisticated hierarchy management and hereby make an entity revolution in our databases.

But I am often faced with the fact that most organizations don’t want an entity revolution. There are always plenty of good reasons why different frequent business processes don’t require full entity resolution and will only be complicated by having it (unless drastic reengineered). The tangible immediate negative business impact of an entity revolution trumps the softer positive improvement in business insight from such a revolution.

Therefore we are mostly making entity evolutions balancing the current business requirements with the distant ideal of a single version of the truth.

Bookmark and Share


Golden Copy Musings

22nd October 2010

In a recent blog post by Jim Harris called Data Quality is not an Act, it is a Habit the term “golden mean” was mentioned.   

As I commented, mentioning the “golden mean” made me think about the terms “golden copy” and “golden record” which are often used terms in data quality improvement and master data management.

In using these terms I think we mostly are aiming on achieving extreme uniqueness. But we should rather go for symmetry, proportion, and harmony.

The golden copy subject is very timely for me as I this weekend is overseeing the execution of the automated processes that create a baseline for a golden copy of party master data at a franchise operator for a major brand in car rental.

In car rental you are dealing with many different party types. You have companies as customers and prospects and you have individuals being contacts at the companies, employees using the cars rented by the companies and individuals being private renters. A real world person may have several of these roles. Besides that we have cases of mixed identities.

During a series of workshops we have worked with defining the rules for merge and survivorship in the golden copy. Though we may be able to go for extreme uniqueness in identifying real world companies and persons this may not necessary serve the business needs and, like it or not, be capable of being related back into the core systems used in daily business.

Therefore this golden copy is based on a beautiful golden mean exposing symmetry, proportion, and harmony.

Bookmark and Share


To be called Hamlet or Olaf – that is the question

18th October 2010

Right now my family and I are relocating from a house in a southern suburb of Copenhagen into a flat much closer to downtown. As there is a month in between where we haven’t a place of our own, we have rented a cottage (summerhouse) north of Copenhagen not far from Kronborg Castle, which is the scene of the famous Shakespeare play called Hamlet.

Therefore a data quality blog post inspired by Hamlet seems timely.

Though the feigned madness of Hamlet may be a good subject related to data quality, I will however instead take a closer data matching look at the name Hamlet.

Shakespeare’s Hamlet is inspired by an old Norse legend, but to me the name Hamlet doesn’t sound very Norse.

Nor does the same sounding name Amleth found in the immediate source being Saxo Grammaticus.

If Saxo’s source was a written source, it may have been from Irish monks in Gaelic alphabet as Amhlaoibh where Amhl=owl and aoi=ay and bh=v sounding just like the good old Norse name Olav or Olaf.

So, there is a possible track from Hamlet to Olaf.

Also today a fellow data quality blogger Graham Rhind posted a post called Robert the Carrot with the same issue. As Graham explains, we often see how data is changed through interfaces and in the end after passing through many interfaces doesn’t look at all as it was when first entered. There may be a good explanation for each transformation, but the end-to-end similarity is hard to guess when only comparing these two.

I have met that challenge in data matching often. An example will be if we have the following names living on the same address:

  • Pegy Smith
  • Peggy Smith
  • Margaret Smith

A synonym based similarity (or standardization) will find that Margaret and Peggy are duplicates.

An edit distance similarity will find that Peggy and Pegy are duplicates,

A combined similarity algorithm will find that all three names belong to a single duplicate group.

Bookmark and Share


Big Trouble with Big Names

10th October 2010

An often seen issue in party master data management is handling information about your most active customers, suppliers and other roles of interest. These are often big companies with many faces.

I remember meeting that problem way back in the 80’s when I was designing a solution for the Danish Maritime Authorities.  

In relation to a ship there are three different main roles:

  • The owner of the ship, who has some legal rights and obligations
  • The operator of ship, who has responsibilities regarding the seaworthiness of the ship
  • The employer, who has responsibilities regarding the seamen onboard the ship

Sometimes these roles don’t belong to the same company (or person) for a given ship. That real world reality was modeled all right. But even if it practically is the same company, then the roles are materialized very different for each role. I remember this was certainly the case with the biggest ship-owner in Denmark (and also by far the biggest company in Denmark) being the A.P. Moller – Maersk Group.

We really didn’t make a golden record for that golden company in my time on the project.

Bookmark and Share


Out-of-Africa

30th August 2010

Besides being a memoir by Karen Blixen (or the literary double Isak Dinesen) Out-of-Africa is a hypothesis about the origin of the modern human (Homo Sapiens). Of course there is a competing scientific hypothesis called Multiregional Origin of Modern Humans. Besides that there is of course religious beliefs.

The Out-of-Africa hypothesis suggests that modern humans emerged in Africa 150,000 years ago or so. A small group migrated to Eurasia about 60,000 years ago. Some made it across the Bering Strait to America maybe 40,000 years ago or maybe 15,000 years ago. The Vikings said hello to the Native Americans 1,000 years ago, but cross Atlantic movement first gained pace from 500 years ago, when Columbus discovered America again again.

½ year ago (or so) I wrote a blog post called Create Table Homo_Sapiens. The comment follow up added to the nerdish angle with discussing subjects as mutating tables versus intelligent design and MAX(GEEK) counting.

But on the serious side comments also touched the intended subject about making data models reflect real world individuals.

Tables with persons are the most common entity type in databases around. As in the Out-of-Africa hypothesis it could have been as a simple global common same structural origin. But that is not the way of the world. Some of the basic differences practiced in modeling the person entity are:

  • Cultural diversity: Names, addresses, national ID’s and other basic attributes are formatted differently country by country and in some degree within countries. Most data models with a person entity are build on the format(s) of the country where it is designed.
  • Intended purpose of use: Person master data are often stored in tables made for specific purposes like a customer table, a subscriber table a contact table and so on. Therefore the data identifying the individual is directly linked with attributes describing a specific role of that individual.
  • “Impersonal” use: Person data is often stored in the same table as other party master types as business entities, projects, households et cetera.

Many, many data quality struggles around the world is caused by how we have modeled real world – old world and new world – individuals.

Bookmark and Share


Follow

Get every new post delivered to your Inbox.

Join 125 other followers