Big Reference Data – Page 16 – Liliendahl on Data Quality

Echoes in the Database

11th August 2009 | 30th June 2010 | Henrik Gabs Liliendahl | 3 Comments

The basic structure of B2B (Business-to-Business) Party Master Data is that you have accounts, which are business entities, each with one or several contacts, who are employees of that business entity. These employees act in roles such as decision makers, gatekeepers, invoice receivers and so on. In data modelling language there is a parent-child relationship between accounts and contacts.

When doing deduplication with such data you aim to make a golden copy with unique business entities having unique contacts.

After achieving that you may gaze at the data and stumble over rows in the golden copy like these (function, contact name, account name, address):

HR, John Smith, Smashing Estates Ltd, Same Place in Anytown
HR, John Smith, Smashing Solicitors Ltd, Same Place in Anytown
…
IT, Tushnelda von Keine-Mustermann, The Old Treadmill Ltd, Anytown
IT, Tushnelda von Keine-Mustermann, Brand New Brands Ltd, Anytown

Duplicates? Probably these are the same real-world individuals.

John Smith is the ultimate Anglo common name, but if your favorite external business directory tells you that the two companies have the same parent company and are modest-sized organizations, the probability of John Smith being the same person holding the same role at the same time in both companies is very high.

Tushnelda has a very unusual name, so here there is a high probability that she has got a new job in a new company, which makes one of the entries inactive. If one entry is to be selected as the active survivor, it may be chosen as the most recently updated one, confirmed against external reference data or investigated in some other way.
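Echoes like these can be surfaced with a simple grouping pass over the golden copy. The following is a minimal sketch in Python, assuming the four-column row layout from the example above. The function names `normalise` and `find_echoes`, the crude normalisation rule and the extra "Jane Doe" control row are assumptions made for illustration, not a description of any particular tool.

```python
from collections import defaultdict

# Rows as in the example above: (function, contact name, account name, address).
golden_copy = [
    ("HR", "John Smith", "Smashing Estates Ltd", "Same Place in Anytown"),
    ("HR", "John Smith", "Smashing Solicitors Ltd", "Same Place in Anytown"),
    ("IT", "Tushnelda von Keine-Mustermann", "The Old Treadmill Ltd", "Anytown"),
    ("IT", "Tushnelda von Keine-Mustermann", "Brand New Brands Ltd", "Anytown"),
    # Hypothetical control row (not in the original example): no echo expected.
    ("IT", "Jane Doe", "Brand New Brands Ltd", "Anytown"),
]


def normalise(text):
    """Lower-case and collapse whitespace; a deliberately crude assumption."""
    return " ".join(text.lower().split())


def find_echoes(rows):
    """Return contacts that appear with the same function under more than one account."""
    accounts_by_contact = defaultdict(set)
    for function, contact, account, _address in rows:
        accounts_by_contact[(normalise(function), normalise(contact))].add(account)
    return {
        key: sorted(accounts)
        for key, accounts in accounts_by_contact.items()
        if len(accounts) > 1
    }


for (function, contact), accounts in find_echoes(golden_copy).items():
    print(f"{contact} ({function}) appears under: {', '.join(accounts)}")
```

The sketch only flags candidates. Whether an echo is one person holding two roles at once or a person who has changed jobs still has to be settled with reference data or manual investigation.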

B2B is often not only Business-to-Business but also E2E (Employee-to-Employee), as the relationship exists between employees in the selling and buying business entities, and it is not unusual for the relationship to follow the employees when they change employer.

So striving for “one version of the truth” through a “360 degree view of the customer” is not a single-layer exercise. This fact must be modeled in the Master Data structure, supported by functionality and handled by feasible data quality prevention.

I plan to write some blog posts about hierarchies in Party Master Data and how they must be handled in data matching. The next post will be about B2C data.

Sweden meets United States

5th August 2009 | 19th June 2010 | Henrik Gabs Liliendahl | 2 Comments

Finding duplicate customers may be a very different task depending on which country you are from and which country the data originates from.

Besides the various character sets, naming traditions and address formats, the differing possibilities with external reference data make some things easy and other things very hard.

Most of the technology, descriptions and examples available are from the United States.

But say you are a Swedish company with Swedish persons in your database, among them these two rows (name, address, postal code and city):

Oluf Palme, Sveagatan 67, 10001 Stockholm
Oluf Palme, Savegatan 76, 10001 Stockholm

What you do is plug into the government-provided citizen master data hub and ask for a match. The outcome can be:

The same citizen ID is returned because the person has relocated. It’s a duplicate.
Two different citizen IDs are returned. It’s not a duplicate.
Either only one or no citizen ID is returned. Leave it or do fuzzy matching.
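The three outcomes above can be expressed as a small decision function. This is a sketch in Python under stated assumptions: the lookup against the citizen hub is not shown, since its interface is not described here, and the function name `classify_pair` and the ID values used are hypothetical placeholders.

```python
from typing import Optional


def classify_pair(id_a: Optional[str], id_b: Optional[str]) -> str:
    """Map the citizen IDs returned for two rows to one of the three outcomes.

    None stands for "no citizen ID was returned for this row".
    """
    if id_a is None or id_b is None:
        # Only one or no ID returned: leave it or fall back to fuzzy matching.
        return "unresolved"
    if id_a == id_b:
        # Same citizen, for example because the person has relocated.
        return "duplicate"
    return "not duplicate"


# Hypothetical placeholder IDs, purely for illustration.
print(classify_pair("ID-1", "ID-1"))
print(classify_pair("ID-1", "ID-2"))
print(classify_pair("ID-1", None))
```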

If you go for fuzzy matching then you had better be good, because all the easy ones have been handled and you are left with the ones where false positives and false negatives are most likely. Often you will only do fuzzy matching if you have phone numbers, email addresses or other data to support the match.
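The idea of accepting a fuzzy match only when other data supports it can be sketched as follows. This is an illustration only: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure, and the field names, the phone number and the threshold are assumptions, not recommended settings.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity between 0.0 and 1.0, ignoring case (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_probable_duplicate(row_a, row_b, threshold=0.8):
    """Accept a fuzzy name/address match only when supporting data agrees."""
    supported = any(
        row_a.get(key) and row_a.get(key) == row_b.get(key)
        for key in ("phone", "email")
    )
    if not supported:
        return False
    score = min(
        similarity(row_a["name"], row_b["name"]),
        similarity(row_a["address"], row_b["address"]),
    )
    return score >= threshold


# The two example rows, with a hypothetical shared phone number added.
row_1 = {"name": "Oluf Palme", "address": "Sveagatan 67", "phone": "000-000 00"}
row_2 = {"name": "Oluf Palme", "address": "Savegatan 76", "phone": "000-000 00"}
print(is_probable_duplicate(row_1, row_2))
```

Requiring supporting data trades recall for precision: fewer pairs are accepted, but those that are carry less risk of being false positives.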

Another angle is that almost only Swedish companies use this service with the government-provided reference data, although anyone holding Swedish data may use it upon approval.

Data quality solutions for party master data are not only about fuzzy matching but also about integrating with external reference data, exploiting the various possibilities worldwide and supporting the logic and logistics of doing so. We also know that upstream prevention, as close to the root as possible, is better than downstream cleansing.

Deployment of such features as composable SOA components is described in a previous post here.

Master Data Quality: The When Dimension

28th July 2009 | 1st July 2010 | Henrik Gabs Liliendahl | 6 Comments

Often we use the terms who, what and where when defining master data as opposed to transaction data, saying for example:

Transaction data accurately identifies who, what, where and when and…
Master data accurately describes who, what and where

Who is easily related to our business partners, what to the products we sell, buy and use, and where to the locations of the events.

In some industries when is also easily related to master data entities, such as a timetable valid for a given period in public transportation. A fiscal year in financial reporting also belongs to the when side of things.

But when is also a factor in data quality improvement and prevention related to our business partners, products, locations and assigned categories, because the descriptions of these entities do change over time.

This fact is known as “slowly changing dimensions” when building data warehouses and attempting to make sense of data with business intelligence.

But the “when” dimension also matters in matching, deduplication and identity resolution. Having data with the finest actuality doesn’t necessarily lead to a good match, as you may be comparing with data that does not have the same actuality. Here history tracking is a solution: storing former names, addresses, phones, e-mail addresses, descriptions, roles and relations.
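History tracking for matching can be sketched as storing each value with a validity period and comparing incoming data against the values that were valid at the time the incoming data was captured. The following Python sketch is an assumption about how this could be modelled. The class names, field names and the example company are hypothetical, and exact comparison is used where a real matching environment would use fuzzy comparison.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class HistoricValue:
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the value is still current


@dataclass
class PartyRecord:
    party_id: str
    names: List[HistoricValue] = field(default_factory=list)
    addresses: List[HistoricValue] = field(default_factory=list)


def values_as_of(history, as_of):
    """Return the lower-cased values that were valid on the given date."""
    return {
        h.value.lower()
        for h in history
        if h.valid_from <= as_of and (h.valid_to is None or as_of < h.valid_to)
    }


def matches_as_of(record, name, address, as_of):
    """Exact match of name and address against the values valid on as_of."""
    return (
        name.lower() in values_as_of(record.names, as_of)
        and address.lower() in values_as_of(record.addresses, as_of)
    )


# Hypothetical party that moved on 1 January 2008.
record = PartyRecord(
    party_id="P-1",
    names=[HistoricValue("Example Company Ltd", date(2000, 1, 1))],
    addresses=[
        HistoricValue("Old Street 1", date(2000, 1, 1), date(2008, 1, 1)),
        HistoricValue("New Street 2", date(2008, 1, 1)),
    ],
)

# A customer row captured in 2007 still carries the old address.
print(matches_as_of(record, "Example Company Ltd", "Old Street 1", date(2007, 6, 1)))
```

Without the stored former address, the row captured in 2007 would fail to match the current state of the reference record.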

Such complexity is often not handled in the master data containers available, and even less in matching environments.

My guess is that the future will bring publicly accessible reference data in the cloud describing our master data entities with a rich complexity, including the when (time) dimension, with capable matching environments around it.

The art of Business Directory Matching

22nd July 2009 | 1st September 2010 | Henrik Gabs Liliendahl

A business directory is a list of companies in a given area and perhaps a given industry. One very useful type of directory for data quality is a list of all companies in a given country. In many countries the authorities maintain such a list; elsewhere it is a matter of assembling local lists or other forms of data capture. Many private service providers offer such lists, often with added information value of different kinds.

If you take the customer/prospect master table from an enterprise doing B2B in a given country, one would expect the rows in that table to match 100% to the business directory of that country. I am not saying that all data must be spelled exactly as in the directory, “only” that the same real-world object is reflected.

During many years of providing and tuning solutions for business directory matching, as well as handling such match services from colleagues in the business, I have very, very seldom seen a 100% match; even 90% matches are very rare.

Why is that so? Some of the reasons I have stumbled over, related to the classic data quality dimensions, have been:

Completeness of business directories varies from country to country and between the lists provided by vendors. Some countries, such as those of the former Czechoslovakia, some English-speaking countries in the Pacific, the Nordics and others, have tight registration, while it is less tight in North America, other European countries and the rest of the world.

Actuality in business directories also differs a lot. It also matters whether the business directory covers dissolved entities and includes history tracking such as former names and addresses. Then take the actuality of the customer/prospect table to be matched, and once again the time dimension has a lot to say.

Validity, accuracy and consistency, in both the directory and the table to be matched, are a natural cause of mismatch. Also, many B2B customer/prospect tables hold a lot of entities that are not formal business entities but other types of party master data.

Uniqueness may be defined differently in the directory and the table to be matched. This includes the perception of hierarchies of legal entities and branches; not least governmental and local authority bodies are a fuzzy crowd. Different roles, such as those of a small business owner, also pose challenges. The same is true of roles such as franchisees and the use of trading styles.

Then of course the automated match technique applied and the human interaction carried out are factors in the resulting match rate and in the quality of the match, measured as the frequency of false positives.
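The two measures mentioned here can be computed as simple ratios. The sketch below assumes that the frequency of false positives is estimated from a manually reviewed sample of the automated matches. The function names and all numbers are hypothetical and only illustrate the arithmetic.

```python
def match_rate(matched_rows, total_rows):
    """Share of rows in the customer/prospect table matched to the directory."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return matched_rows / total_rows


def false_positive_frequency(reviewed_matches):
    """Share of reviewed matches that a human reviewer rejected.

    reviewed_matches is a list of booleans: True when the reviewer confirmed
    that both records reflect the same real-world object, False otherwise.
    """
    if not reviewed_matches:
        raise ValueError("at least one reviewed match is required")
    return reviewed_matches.count(False) / len(reviewed_matches)


# Hypothetical figures: 850 of 1000 rows matched, 1 of 10 sampled matches wrong.
print(f"match rate: {match_rate(850, 1000):.0%}")
print(f"false positives: {false_positive_frequency([True] * 9 + [False]):.0%}")
```

The two numbers pull against each other: loosening the match technique usually raises the match rate and the frequency of false positives together.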

The Tower of Babel

15th July 2009 | 5th November 2010 | Henrik Gabs Liliendahl | 2 Comments

Several old texts, including Genesis and the Qur’an, have stories about a great tower built by mankind at a time when all people shared a single language. Since then mankind has been confused by having multiple languages. And indeed we still are.

Multi-cultural issues are one of the really big challenges in data quality improvement. This includes not only language variations but also different character sets reflecting different alphabets and script systems, naming traditions, address formats, units of measure, privacy norms and government registration practices, to name the ones I have experienced.

As globalization moves forward these challenges become more and more important. Enterprises tend to standardize worldwide on tools and services, shared service centres take care of data covering many countries and so on. When an employee works with data from another country he often wrongly applies his local standards to these data and thereby challenges the data quality more than seen before.

Recently I updated this site with pages about “The art of Matching”. One topic is “Match Techniques”, and the comments posted here were very much about the need for methods that solve the problems arising from multi-cultural data. Have a look.

International and multi-cultural aspects of data quality improvement have been a favourite topic of mine for a long time.

Whether and when an organisation has to deal with international issues of course depends on whether, and to what degree, that organisation is domestic or active internationally. Even so, in some countries with several official languages, like Switzerland and Belgium, the multi-cultural topic is mandatory. Typically companies in large countries grow big before looking abroad, while in smaller countries, like my home country Denmark, even many fairly small companies must address international issues with data quality.

Some of the many different observations I have made include the following:

Nicknames are a top issue in name matching in some cultures, but not of much importance in other cultures
Family names are a key element in identifying households in some cultures, but not very useful in other cultures
Address verification and correction is very useful in some countries but close to impossible in other countries
Business directories are complete, consistent and available in some countries, but not that good in other countries
Citizen information is available for private entities in some countries, but is a no go in other countries

While working with data quality tools and services for many years I have found that many of them are very national. So you might discover that a tool or service will work wonders with data from one country, but be quite ordinary or in fact useless with data from another country.

The GlobalMatchBox

11th July 2009 | 1st September 2010 | Henrik Gabs Liliendahl

Ten years ago I spent most of the summer delivering my first large project after becoming a sole proprietor. The client, or actually rather the partner, was Dun & Bradstreet’s Nordic operation, which needed an agile solution for matching customer files with their Nordic business reference data sets. The application was named MatchBox.

This solution has grown over the years, and D&B’s operation in the Nordics and other parts of Europe is now run by Bisnode.

Today matching is done with the entire WorldBase holding close to 150 million business entities from all over the world – with all the diversity you can imagine. On the technology side the application has been bundled with the indexing capacities of www.softbool.com and the similarity cleverness of www.omikron.net (disclosure: today I work for Omikron) all built with the RAD tool www.magicsoftware.com. The application is now called GlobalMatchBox.

It has been a great but fearful pleasure for me to work with setting up and tuning such a data matching engine and environment. Everybody who has worked with data matching knows about the scars you get when avoiding false positives and false negatives. You know that it is just not good enough to say that you are only able to automatically match 40% of the records when it is supposed to be 100%.

So this project has been a very different experience from the occasional SMB (Small and Medium-sized Business) hit-and-run data quality improvement projects I also do, as described in my previous post. With D&B we are not talking about months but years of tuning, and I have been guilty of practicing excessive consultancy.

Fit for what purpose?

5th July 2009 | 22nd June 2010 | Henrik Gabs Liliendahl

The goal of data quality improvement is often set as “fit for purpose”. The first purpose addressed will almost naturally be within the domain where the data in question are captured. Then you address other domains where the same data may also be used, but probably for other purposes, leading to additional or varying measures of fitness.

If an organisation identifies several domains where the same data are used, the normal approach will be to gather all purposes and then start to align all the needs, find the highest common denominators and so on. This may be a very cumbersome process, as you need to consider all the different dimensions of data quality: uniqueness, completeness, timeliness, validity, accuracy, consistency.

Another way will be to assume that if you gather many purposes the total needs will almost certainly tend to be a reflection of the real world objects to which the data refer.

So my thesis is that there is a break-even point when including more and more purposes, beyond which it is less cumbersome to reflect the real-world object than to try to align all known purposes.

Master Data are often used in many different functions in an organisation, and party data (names and addresses) in particular are known to be a focus area for data quality improvement. Here it is very obvious that real-world objects exist and that they are basically the same to every organisation.

Earlier this year I wrote an entry on dataqualitypro about possibilities with external party reference data: http://www.dataqualitypro.com/data-quality-home/external-reference-data-an-overview.html

In my previous post on this blog I noted that governments around the world are releasing data stores that surely add traction to the real-world approach to data quality improvement.

I will certainly touch on this subject in forthcoming posts on this blog.

Government says so

24th June 2009 | 20th October 2010 | Henrik Gabs Liliendahl | 3 Comments

External reference data are going to play an increasing role in data quality improvement and a recent trend around the world helps a lot: Governments are unlocking their data stores.

Some available initiatives in English are the US data.gov and the UK “show us a better way”.

Today I attended a “Workshop on the use of public data in the private sector” arranged by the Danish National IT and Telecom Agency as part of the similar initiative in my home country.

The initiatives around the world differ somewhat in focus areas and in which data are released, depending on the administrative traditions and local privacy policies.

As an organisation you may integrate with such public reference data either directly or through services from private vendors who add value by reformatting, merging, enriching and bundling with other services. One add-on service on the international scene will be supplying consistency, as far as possible, between the datasets from each country.

One way or another, public reference data will become a part of the data architecture in most organisations. Applications in the cloud will probably be (actually are) first movers in this field.

Public reference data will bring operational databases and data warehouses closer to that “one version of the truth” that we talk so much about but have so much trouble achieving and even defining. Now some of the trouble can be solved by: Government says so.
