Aadhar (or Aadhaar)

The solution to the single most frequent data quality problem, party master data duplicates, is actually very simple: every person (and every legal entity) gets a unique identifier which is used everywhere by everyone.

Now India jumps on the bandwagon and starts assigning a unique ID to the 1.2 billion people living in India. As I understand it, the project has just been named Aadhar (or Aadhaar). Google Translate tells me this word (आधार) means base or root – please correct me if anyone knows better.

In Denmark we have had such identifiers (one for citizens and one for companies) for many years. They are not used by everyone everywhere – so you are still able to make money as a data quality professional specializing in data matching.

The main reason that the unique citizen identifier is not used all over is of course privacy considerations. As for the unique company identifier, the reason is that data quality is often defined as fit for the immediate purpose of use.

Data Quality and World Food

I have touched on the analogy between food (quality) and data (quality) several times before, for example in the posts “Bon Appétit” and “Under New Master Data Management”.

Why not continue down that road?

Let’s have a look at some local food that has become popular around the world.

寿司

Imagine you go to a restaurant where you order a fish dish. When you start eating, you realize that the fish hasn’t been boiled, fried or in any other way exposed to heat. Then I guess it is perfectly normal to shout out: THE FISH IS RAW – and demand apologies from the chef, the head waiter, Gordon Ramsay or anyone else in charge. Unless, of course, you are in a sushi restaurant, where the famous Japanese dish that may include raw fish is prepared.

Köttbullar

Köttbullar is the Swedish word for meatballs. This would rightfully have stayed a fact known only to Swedes if it weren’t for the cheap furniture sold around the world by IKEA. For reasons still unclear to me, IKEA has chosen to serve Köttbullar in the store cafeterias and even sell the stuff along with the particle board furniture on their e-commerce sites.

Pizza

A dish of Italian origin, usually brought to you by someone on a bike or, in extreme cases, in a very old car.

McChicken

Selling food of different kinds in the form of a burger works in the United States – and, for reasons that I can’t explain, even in France.

Data Quality analogies

Well, let’s just say that data quality tools and services:

  • May be regarded very differently around the world,
  • Usually are sold along with tools and services made for something completely different,
  • Are brought to you in various ways by local vendors and
  • For reasons I can’t explain often are made for use in the United States (no other pun intended but pure admiration of execution).

Bon appétit.

Beyond Home Improvement

During my many years in customer master data quality improvement I have worked with a lot of clients having data from several countries. In almost every case the data has been divided, and prioritized, in two pots:

  • Master Data referring to domestic customers
  • Master Data referring to foreign customers

Even though the enterprise defines itself as an international organization, the term domestic is in a lot of cases still readily assigned to the country where the headquarters is situated and where the organization was born.

Signs of this include:

  • Data formats are designed to fit domestic customers
  • Internal reference data are richer for domestic locations
  • External reference data services are limited to domestic customers

The high priority given to domestic data is of course natural: for historical reasons, because domestic customers almost certainly are the largest group, and because the rules are well known to most delegates in a data quality program.

If we accept the fact that improving data quality will be reflected in an improved bottom line, there is still a margin to gain by not stopping once you have optimal procedures for domestic data.

One easy way of dealing with this is to apply general formats, services and rules that may work for data from all over the world, and this approach may in some cases be the best considering costs and benefits.

But I have no doubt that achieving the best data quality with customer master data is done by exploiting the specific opportunities that exist for each country / culture.

Examples are:

  • The completeness and depth of the address (location) data available in each country is very different – so are the rules of the postal services operating there
  • Public sector practices for company and citizen registration also differ, which is why the quality of external reference data differs, and so do the rules of access to the data
  • Using local character sets, script systems, naming conventions and addressing formats in addition to (or instead of) those that apply at the headquarters helps with data quality through real world alignment

My guess is that in the near future we will see services in the cloud helping us make the global village come true for master data quality as well.

What is a best-in-class match engine?

Most recently, in connection with TIBCO acquiring the data matching vendor Netrics, the term best-in-class match engine has been attached to the Netrics product.

First: I have no doubt that the Netrics product is a capable match engine – I know that from discussions in the LinkedIn Data Matching group and here on this blog.

Next: I don’t think anyone knows what product is the best match engine, because I don’t think that all match engines have been benchmarked with a representative set of data.

On top of that there are, of course, the matching capabilities with different entity types to consider. Here party master data (like customer data) are covered by most products, whereas capabilities with other entity types (be that considered same same or not) are far less exposed.

As match engine products are acquired and integrated in suites, the core matching capabilities somehow become mixed up with a lot of other capabilities, making it hard to compare the match engine alone.

Some independent match engines work stand alone and some may be embedded into other applications.

These may then be the classes to be best in:

  • Match engines in suites
  • Embedded match engines (for say SAP, MS CRM and so on)
  • Stand alone match engines

Many match engines I have seen are tuned to deal with data from the country (culture) where they were born and had their first triumphs. As the US market is still by far the largest for match engines, the nomination of a best match engine resembles a team becoming World Champions in American football. International/multi-cultural capabilities will become more and more important in data matching. But indeed we may define a class for each country (culture).

In the old days I heard that one match engine was best for marketing data and another match engine was best for credit risk management. I think these days are over too. With Master Data Management you have to embrace all data purposes.

Some match engines are more successful in one industry. The biggest differentiator in match effectiveness is with B2C and/or B2B data. B2C is the easiest, B2B is more complex and embracing both is in my eyes a must for being considered best-in-class – unless we define separate classes for B2C, B2B and both.

As some matching techniques are deterministic and some are probabilistic, the evaluation of the latter will depend on the data already processed in a given instance, since the matching gets better and better as the self-learning element warms up.

So, yes, I have reopened an endless, religious-like discussion here.

Data Quality in the Cloud

In my previous post I advocated that Data Quality tools in the near future will exploit the huge data resources in the cloud in order to achieve high quality data that correctly reflects the real world construct to which it refers.

I am well aware that this is based on an assumption that data in the cloud are accurate, timely and so on, which is of course not always the case – now. This will only come when a certain data source has a number of subscribers that require a certain level of data quality and perhaps contribute to correcting flaws.

I tried that out right before writing this post when I installed Google Earth on a new laptop. It was a journey where I shifted between being very impressed and a bit disappointed.

First, the installation site guessed – either from my position or my OS language – that I am not English speaking. Unfortunately it changed to Dutch – and not Danish. Well, most Dutch words are either like German or English or at least urban slang. I went through with it. Inside the application most text had now changed to Danish – with only a few Dutch and English labels.

Knowing that the application hadn’t learned anything about me yet, I started to type just my street address, which is only 8 characters but globally unique: “Lerås 13” (remember: house number after street name in my part of the world). The application answered promptly with my full address as first candidate, and when I clicked on that it took me from high above the earth right down to where I live. Impressive.

Well, the pointer was actually 40 meters NNE from the nearest corner of our premises – and in front of our garage I could recognize the grey car I had 2 years ago. Disappointing.

Grandpa’s Story

Now that I have become a grandfather it’s time for a blog post about lessons learned in life.

One of my favourite authors as a young man was Cyril Northcote Parkinson, the grand father of the famous Parkinson’s Law saying:

Work expands so as to fill the time available for its completion.

Early in my career I learned how true this is. My first experience was, like the statistics behind Parkinson’s Law, from within public administration, but later I learned that private enterprises are just the same.

My first real job after graduation was at the Danish Tax Authorities. After having worked there a few years I was assigned to a mission to assist the Faroe Islands Financial Authorities in developing a modernised tax collection solution.

The Faroe Islands

For those readers that hate old people not sticking to the subject – please continue to the next headline.

For those readers who don’t have a clue about where on earth the Faroe Islands are: Well. 1000 years ago the Vikings sailed out from Scandinavia and finally made it to say hello to the Native Americans – 500 years before Columbus. When doing that they used islands in the Northern Atlantic as stepping stones. First the British Isles, then the Faroe Islands, Iceland, Greenland and finally Newfoundland at the American coast.

Just like Columbus found America by mistake, as he was actually looking for India, the Vikings probably also found America and the stepping stones by mistake when getting lost on the ocean during storms.

1/100

Back on track. The mission for the Faroe Islands Authorities I joined in the early 1980s seemed impossible. As the Faroese population is only 1/100 of the population of continental Denmark, there were of course only 1/100 of the resources available for making a solution doing exactly the same as the solution built for continental Denmark.

But what I learned was that the solution actually was built using only those resources and in surprisingly short time (and with minimal help from me and my colleagues).

Having worked in both modest sized and large organisations during my career, I have noticed numerous examples of how exactly the same task may consume resources sized not by the nature of the task but by the size of the organisation.

People and technology

Maybe this observation is an explanation for the ever recurring question of whether people or technology is most important when doing projects like improving data quality. If the technology part is (close to) constant but the overall resource consumption grows with the size of the organisation in question, well, then the people part becomes more and more important with the size of the organisation.

Tool making

I have tried single-handedly to build a data quality tool – or to be more specific a data matching tool. On several occasions it has been benchmarked against products residing as leaders in the Gartner Magic Quadrant for data quality tools, and it didn’t come up short. Some of the features included in the product called SuperMatch are described in the post “When computer says maybe”.

It’s my impression that if you look at tool vendors with many employees, it’s only a very small group of people that is actually working on the tool.

Standardise this, standardize that

Data matching is about linking entities in databases that don’t have a common unique key and are not spelled exactly the same, but are so similar that we may consider them to represent the same real world object.

When matching we may:

  • Compare the original data rows using fuzzy logic techniques
  • Standardize the data rows and then compare using traditional exact logic

As suggested in the title of this blog post, a common problem with standardization is that it may have two (or more) outcomes, just like this English word may be spelled in different ways depending on the culture.

Not least when working with international data do you feel this pain. In my recent social media engagement I had the pleasure of touching on this subject (mostly in relation to party master data) on several occasions, including:

  • In a comment to a recent post on this blog Graham Rhind says: Based just on the type of element and their positions in an address, there are at least 131 address formats covering the whole world, and around 40 personal name formats (I’m discovering more on an almost daily basis).
  • Rich Murnane made a post with a fantastic video with Derek Sivers explaining that while we in many parts of the world have named streets with building numbers assigned according to sequential position, in Japan you have named blocks between unnamed streets with building numbers assigned according to the sequence in which they were established.
  • In the Data Matching LinkedIn group Olga Maydanchik and I exchanged experiences on the problem that in American date format you write the month before the day in a date, while in European date format you write the day before the month.
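
The date format problem is easy to demonstrate. The following Python sketch (function and label names are illustrative assumptions) tries both readings of a slash-separated date and returns every reading that is a valid calendar date:

```python
from datetime import date, datetime

# The two conventions mentioned above, expressed as strptime patterns.
FORMATS = {
    "day-first (European)": "%d/%m/%Y",
    "month-first (American)": "%m/%d/%Y",
}


def possible_dates(text: str) -> dict:
    """Return every reading of the text that is a valid calendar date."""
    readings = {}
    for label, pattern in FORMATS.items():
        try:
            readings[label] = datetime.strptime(text, pattern).date()
        except ValueError:
            pass  # this reading does not give a valid date
    return readings
```

A value like "03/04/2010" comes back with two different dates (3 April and 4 March), so the data alone cannot tell you which standard was used. Only values such as "25/12/2010", where one element is above 12, resolve themselves.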

In my work with international data I have often seen that determining what standard is used depends on both:

  • The culture of the real world entity that the data represents
  • The culture of the person (organisation) that provided the data

So, the possible combination of standards applied to a given data set is determined by where the data is from, what elements are contained and who entered the data (information that is often not carried along).

This is why I like to use both standardisation and standardization and fuzzy logic when selecting candidates and assigning similarity in data matching.

Having the right element to the left

Name, address and place are core attributes in almost any database. You may atomize these attributes into smaller slices, but in doing that: Mind the sequence.

When working with data matching and party master data management some of the frequently exposed issues are:

Person name

Often a person name is split into first name and last name, but even when assigning these labels you are on slippery ground. Examples:

  • In some cultures, like in East Asia, the family name is written first and the given name is written last.
  • Some notations indicate that the given name isn’t the first element:
    • “DUPONT Michel” is a customary French way of indicating that the family name is the first element
    • “Smith, John” is a universal way of indicating that the family name is the first element

Besides that, we have issues with middle names and other three part naming, and with salutations, education and job titles being mixed up in name fields.
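
The two notation hints above can be turned into a small rule set. The Python sketch below is an assumption-laden illustration (the names `ParsedName` and `parse_two_part_name` are invented here): it only handles two-part names, and when no hint is present it falls back to given name first, which will be wrong for cultures that write the family name first without any marker.

```python
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    given: str
    family: str
    rule: str  # which notation hint decided the order


def parse_two_part_name(raw: str) -> Optional[ParsedName]:
    text = " ".join(raw.split())  # collapse repeated whitespace
    if "," in text:
        # "Smith, John": the comma marks the family name as first element.
        family, given = (part.strip() for part in text.split(",", 1))
        return ParsedName(given, family, "comma")
    parts = text.split(" ")
    if len(parts) != 2:
        return None  # middle names, titles and so on are out of scope here
    first, last = parts
    if first.isupper() and not last.isupper():
        # "DUPONT Michel": an upper-case first element marks the family name.
        return ParsedName(last, first, "upper-case family name")
    # No hint in the notation: this order is an assumption, not a fact.
    return ParsedName(first, last, "assumed given-family order")
```

Keeping the `rule` alongside the result makes it possible to treat the assumed cases with less confidence than the hinted ones later in the matching process.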

Street address

Most of the world is divided into two “street address” cultures:

  • In the Americas you write the house number in front of the street name if you are north of the Rio Grande, being US and CA, but you write the house number after the street name if you are south of the Rio Grande, being MeXico, BRazil, ARgentina and almost any other country.
  • In Europe you write the house number in front of the street name if you are on the British Isles or in France, but you write the house number after the street name if you are in almost any other country.
  • The rest of the world is also divided in writing street addresses.

Besides that we have other ways of writing addresses like the block style in Japan.
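
As a rough illustration of minding the sequence, here is a Python sketch that splits a street line into street name and house number depending on the country. The country set is an assumption covering only the examples mentioned above, and the patterns ignore units, ranges and block style addresses.

```python
import re

# Assumed, simplified table: countries from the examples above that write
# the house number in front of the street name.
NUMBER_FIRST = {"US", "CA", "GB", "IE", "FR"}


def split_street_line(line: str, country: str):
    """Return (street, house_number); house_number is None if not found."""
    line = " ".join(line.split())
    if country.upper() in NUMBER_FIRST:
        match = re.fullmatch(r"(\d+\w*)\s+(.+)", line)
        if match:
            return match.group(2), match.group(1)
    else:
        match = re.fullmatch(r"(.+?)\s+(\d+\w*)", line)
        if match:
            return match.group(1), match.group(2)
    return line, None
```

For example, `split_street_line("13 Main Street", "US")` and `split_street_line("Storgatan 13", "SE")` both yield the house number "13", even though it sits at opposite ends of the string.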

Place

Most countries have a postal code system – even Ireland will have that soon.

Despite the fact that a city name in most cases can be obtained by looking up the postal code we often do store the city name anyway – for those cases that we can’t.

And if the postal code and the city name are in one string: Oh yes, in some cultures you write the city name in front of the postal code and in other cultures you do it the opposite way. And oh no: It doesn’t necessarily follow the sequence of the house number and street name.
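
A small Python sketch of handling both sequences in one string follows. It rests on a loud assumption: every token of the postal code contains a digit and no token of the city name does, which holds for many but certainly not all countries. The function name is invented for this example.

```python
def _has_digit(token: str) -> bool:
    return any(ch.isdigit() for ch in token)


def split_postal_and_city(text: str):
    """Return (postal_code, city, order) or None if the string is unclear."""
    tokens = text.split()
    flags = [_has_digit(token) for token in tokens]
    if not tokens or all(flags) or not any(flags):
        return None  # nothing to split on
    if flags[0]:
        cut = flags.index(False)
        code, city, order = tokens[:cut], tokens[cut:], "code-first"
    else:
        cut = flags.index(True)
        city, code, order = tokens[:cut], tokens[cut:], "city-first"
    if any(map(_has_digit, city)) or not all(map(_has_digit, code)):
        return None  # digits on both sides: too ambiguous for this sketch
    return " ".join(code), " ".join(city), order
```

Here "2800 Kongens Lyngby" is read as code-first and "London SW1A 1AA" as city-first, without the caller having to know the country up front.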

In a blog post written a while ago we also had a look into postal address hierarchy, granularity, precision and history.

Bon Appetit

If I enjoy a restaurant meal it is basically unimportant to me which raw ingredients were used, where they came from and which tools the chef used while preparing the meal. My concerns are whether the taste meets my expectations, the plate looks delicious in my eyes, the waiter seems nice and so on.

This is comparable to when we talk about information quality. The raw data quality and the tools available for exposing the data as tasty information in a given context are basically not important to the information consumer.

But in the daily work you and I may be the information chef. In that position we have to be very much concerned about the raw data quality and the tools available for what may be similar to rinsing, slicing, mixing and boiling food.

Let’s look at some analogies.

Best before

Fresh raw ingredients are similar to up-to-date raw data. Raw data also has a best before date depending on the nature of the data. Raw data older than that date may be spiced up but will eventually make bad tasting information.

One-stop-shopping

Buying all your raw ingredients and tools for preparing food – or taking the shortcut with ready made cookie cutting stuff – from a huge supermarket is fast and easy (never mind that the basket usually also gets filled with a lot of other products not on the shopping list).

A good chef always selects the raw ingredients from the best specialized suppliers and uses what he considers the most professional tools in the preparation process.

Making information from raw data has the same options.

Compliance

Governments around the world have for a long time implemented regulations and inspections regarding food, mainly focused on receiving, handling and storing raw ingredients.

The same is now going on regarding data. Regulations and inspections will naturally be directed at data as it is originated, stored and handled.

Diversity

Have you ever tried to prepare your favorite national meal in a foreign country?

Many times this is not straightforward. Some raw ingredients are simply not available and even some tools may not be among the kitchen equipment.

When making information from raw data under varying international conditions you often face the same kind of challenges.