The Big ABC of Reference Data

Reference Data is a term often used either instead of Master Data or as related to Master Data. Reference data are data defined and (initially) maintained outside a single organisation. Examples from the party master data realm are a country list, a list of states in a given country and postal code tables for countries around the world.

The trend is that organisations seek to benefit from having reference data in more depth than the often modestly populated lists mentioned above.

In the party master data realm such reference data may be core data about:

  • Addresses: every single valid address, typically within a given country.
  • Business entities: every single business entity occupying an address in a given country.
  • Consumers (or Citizens): every single person living at an address in a given country.

There is often no single source of truth for such data. Some of the challenges I have met for each type of data are:

Addresses

The depth (or precision if you like) of an address is a common problem. If the depth of address data is at the level of building numbers on streets (thoroughfares) or blocks, you have issues as described in the blog post called Multi-Occupancy.

Address reference data of course have issues with the common data quality dimensions, such as:

  • Timeliness, because for example new addresses will exist in the real world but not yet in a given address directory.
  • Accuracy, as you are always amazed when comparing two official sources which should have the same elements, but don't.

Business Entities

Business directories have been accessible for many years and are often used when handling business-to-business (B2B) customer master data and supplier master data management. Some hurdles in doing this are:

  • Uniqueness, as your view of what a given business entity is occasionally doesn't match the view in the business directory, as discussed in the post 3 out of 10.
  • Conformity, because for example an apparently simple exercise like assigning an industry vertical can be a complex matter, as mentioned in the post What are they doing?

Consumers (or Citizens)

In business-to-consumer (B2C) or other activities involving citizens, a huge challenge is identifying the individuals living on this planet, as pondered in the post Create Table Homo Sapiens. Some troubles are:

  • Consistency isn’t easy, as governments around the world have found 240 (or so) different solutions to balancing privacy concerns and administrative effectiveness.
  • Completeness, as the rules and traditions not only between countries, but also within different industries, certain activities and various channels, are different.

Big Reference Data as a Service

Even though I have emphasized some data quality dimensions for each type of data, all dimensions apply to all types of data.

For organisations operating multinationally and/or across multiple channels, exploiting the wealth and diversity of external reference data is a daunting task.

This is why I see reference data as a service embracing many sources as a good opportunity for getting data quality right the first time. There is more on this subject in the post Reference Data at Work in the Cloud.


Multi-Occupancy

The fact that many people don't live in a single-family house but in a flat, sharing the same building number on a street with people living in other flats in the same building, is a common challenge in data quality and data matching.

The same challenge applies to companies sharing the same building number with other companies, not to mention when companies and households are in the same building. So this is a common party master data issue.
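
The effect can be shown with a minimal sketch (the records and field names are hypothetical): matching party records on street and building number alone conflates all occupants of a multi-occupancy building, so the unit must be part of the match key.

```python
# Minimal sketch of the multi-occupancy problem.
# Records and field names are hypothetical.

records = [
    {"name": "J. Smith Ltd", "street": "High Street", "number": "27", "unit": "Flat 1"},
    {"name": "A. Jones",     "street": "High Street", "number": "27", "unit": "Flat 2"},
]

def building_key(record):
    # Building-level key: conflates everyone sharing the building number.
    return (record["street"], record["number"])

def unit_key(record):
    # Unit-level key: keeps flats (and co-located companies) apart.
    return (record["street"], record["number"], record["unit"])

print(len({building_key(r) for r in records}))  # one "address" at building level
print(len({unit_key(r) for r in records}))      # two distinct occupancies
```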

Address verification and geocoding are seen as important methods for improving data quality related to the top data quality pain of all: the quality of party master data, aiming at a single customer view.

Multi-occupancy is a pain in the (you know) getting there.

My pain

I have had some personal experiences living at multi-occupancy addresses lately.

One and a half years ago I was living a painless life in a single-family house in a Copenhagen suburb.

Then I moved closer to downtown Copenhagen, into a flat, as mentioned in the post Down the Street.

The tradition in Denmark is to send letters, make deliveries and register master data with a common format for units within a building, and to have separate mailboxes with a flat ID and names for each flat. I have received most of my post since then and got all the deliveries I'm aware of.
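
As an illustration, the Danish unit convention (floor plus door side, e.g. "2. tv" for 2nd floor, left door) can be parsed with a simplified pattern; the regular expression below is a rough sketch, not a complete implementation of the official format.

```python
import re

# Rough sketch of parsing the Danish unit convention, e.g.
# "Kenny Drews Vej 27, 2. tv" = street, building number, 2nd floor, left door.
# The pattern is simplified and hypothetical, not the official standard.
DANISH_ADDRESS = re.compile(
    r"^(?P<street>.+?)\s+(?P<number>\d+[A-Za-z]?)"
    r"(?:,\s*(?P<floor>st|\d+)\.?\s*(?P<side>tv|th|mf)?\.?)?$"
)

def parse_danish_address(line):
    match = DANISH_ADDRESS.match(line.strip())
    return match.groupdict() if match else None

print(parse_danish_address("Kenny Drews Vej 27, 2. tv"))
```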

Then I moved to London, into a flat. Here the flats in my building have numbers, but the postman delivers the letters in one batch at the street door, and there are no names on the doorbells at the front door.

So now I sense I don't get many letters, and today I had to order the same stuff thrice from amazon.co.uk, because I haven't received the first two packages despite their state-of-the-art online package tracking systems telling me that delivery was successful.

Master data pains unresolved

Address reference data at building number level and related geocodes are becoming commonly available in many places these days.

But having reference data and real-world-aligned location and related party master data at the unit level is still a challenge in most places. Therefore we are still struggling with using address verification and geocoding for a single customer view where a given building number has more than a single occupancy.


Reference Data at Work in the Cloud

One of the product development programs I'm involved in is about exploiting rich external reference data and using these data to get data quality right the first time and to maintain optimal data quality over time.

The product is called instant Data Quality (abbreviated as iDQ™). I have briefly described the concept in an earlier post called instant Data Quality.

iDQ™ combines two concepts:

  • Software as a Service
  • Data as a Service

While most similar solutions are bundled with one specific data provider, the iDQ™ concept embraces a range of data sources. The current scope is customer master data, where iDQ™ may include business-to-business (B2B) directories, business-to-consumer (B2C) directories, real estate directories, postal address files and even social media network data from external sources, as well as internal master data, all presented at the same time in a compact mash-up.
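
Conceptually, such a mash-up queries several sources at once and presents the combined hits. The sketch below is illustrative only, with stubbed source functions standing in for real directories; it is not the iDQ™ API.

```python
# Illustrative sketch of a reference data mash-up: query several external
# directories plus internal master data at once and present one combined view.
# The source functions are stubs, not a real API.

def search_b2b_directory(name):
    return [{"source": "B2B directory", "name": name, "directory_id": "B-1"}]

def search_postal_address_file(name):
    return [{"source": "Postal Address File", "name": name, "address": "27 High Street"}]

def search_internal_master_data(name):
    return [{"source": "Internal MDM", "name": name, "customer_no": 4711}]

SOURCES = [search_b2b_directory, search_postal_address_file, search_internal_master_data]

def mash_up(name):
    # One search fans out to all configured sources; hits are combined.
    hits = []
    for source in SOURCES:
        hits.extend(source(name))
    return hits

for hit in mash_up("Acme Ltd"):
    print(hit["source"], "->", hit)
```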

The product has already gained substantial success in my home country Denmark, leading to the formation of a company working solely with the development and sales of iDQ™.

The results iDQ™ customers gain may seem simple, but they are the core advantages of better data quality most enterprises are looking for, as stated by one of Denmark's largest companies:

“For DONG Energy iDQ™ is a simple and easy solution when searching for master data on individual customers. We have 1,000,000 individual customers. They typically relocate a few times during the time they are customers of us. We use iDQ™ to find these customers so we can send the final accounts to the new address. iDQ™ also provides better master data because here we have an opportunity to get names and addresses correctly spelled.

iDQ™ saves time because we can search many databases at a time. Earlier we had to search several different databases before we found the right master data on the customer.”

Please find more testimonials here.

I hope to be able to link to testimonials in more languages in the future.


What to do in 2012

The time between Christmas and New Year is a good time to think about whether you are going to do the right things next year. In doing so, you will have to look back at the current year and see how you can develop from there.

In my professional life as a data quality and master data management practitioner my 2011 to do list included these three main activities:

  • Working with Multi-Domain Master Data Quality
  • Exploiting rich external reference data sources in the cloud
  • Doing downstream data cleansing

In a press release from May 2011, Gartner (the analyst firm) Highlights Three Trends That Will Shape the Master Data Management Market. These are:

  • Growing Demand for Multidomain MDM Software
  • Rising Adoption of MDM in the Cloud
  • Increasing Links Between MDM and Social Networks

It looks like I was working in the right space with the first two things, but stayed in the past with the third activity, downstream data cleansing.

The third thing to embrace in the future, which we may call social MDM, has been an area of interest for me for the last couple of years, and some downstream data cleansing projects have actually touched on making master data useful for including social media networks in the loop.

I’m not sure if 2012 will be a breakthrough for social MDM, but I think there will be some exciting opportunities out there for paving the road for social MDM.


Nationally International

I am right now in the process of moving most of my business from the Kingdom of Denmark to the United Kingdom.

During that process I have become a regular customer at the Gatwick Express, the (sometimes) fast train going from London’s second largest airport to central London.

When buying tickets online they require you to enter a billing address. Here you can choose between entering a UK address or an international address.

If you enter a UK address, the site takes advantage of the UK postal code system: you just enter a postcode, which is very granular in the UK, and a house number, and the system will know your address.
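
The lookup pattern can be sketched as below, with a tiny hypothetical extract standing in for a licensed postal address file; a real implementation would query a postcode lookup service rather than an in-memory dictionary.

```python
# Sketch of the postcode plus house number lookup, against a tiny
# hypothetical extract; a real implementation would use a licensed
# postal address file, not an in-memory dictionary.

PAF_EXTRACT = {
    ("EC1A 1BB", "27"): "27 Example Street, London EC1A 1BB",  # hypothetical row
    ("SW19 2AB", "4"):  "4 Sample Road, London SW19 2AB",      # hypothetical row
}

def resolve_uk_address(postcode, house_number):
    # Normalise the postcode a little before the lookup.
    return PAF_EXTRACT.get((postcode.strip().upper(), house_number.strip()))

print(resolve_uk_address(" ec1a 1bb ", "27"))
```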

Alternatively you can choose to enter an international address. In that case you will get a form with more fields to enter. But, in order not to be too international, the form still has the UK way of formatting an address.

Also, the default country is United Kingdom, which I guess is the only value that should not be applicable for this form.


Down the Street

Having an address consisting of a house number and a street name, or vice versa, is the usual way of addressing in most parts of the world. This construct is also featured in the presentation of the Universal Postal Union’s (UPU) international standard initiative (S42):


Somehow I always end up living at a place with issues in relation to this construct.

Our current address is (without unit):

“Kenny Drews Vej 27” which would be “27 Kenny Drews Way” in an Anglo-phone country.

But our area has a new style of block buildings with canals between them, as we like to pretend that we live in Venice or Amsterdam.

This means that the house numbers aren't sequenced down the street, but are spread around the block as if we were living in Japan. Google Maps has the position exactly as it is.

Number 27 on Kenny Drews Vej is actually much closer to two other streets, which makes it very difficult for people visiting us for the first time, and for some also the second time.

But that's because I, and some of our visitors, are old fashioned. As Prashanta Chan says in his blog post Geocoding: Accurate Location Master Data: "It will be much better to invite folks to your geocode."

The same thing applies when you want goods delivered to your premises or want a taxi as close to your front door as possible.

And regarding letters delivered by the good old postman: they will probably all be sent electronically before the UPU S42 addressing standard is adopted by everyone.


Some Deduplication Tactics

When doing the data quality kind of deduplication you will often have two kinds of data matching involved:

  • Data matching in order to find duplicates internally in your master data, most often your customer database
  • Data matching in order to align your master data with an external registry

As the latter activity also helps with finding the internal duplicates, a good question is in which order to do these two activities.

External identifiers

If we, for example, look at business-to-business (B2B) customer master data, it is possible to match against a business directory. Some choices are:

  • If you have mostly domestic data in a country with a public company registration, you can obtain a national ID from matching with a business directory based on such a registry. An example would be the French SIREN/SIRET identifiers, as mentioned in the post Single Company View.
  • Some registries cover a range of countries. An example is the EuroContactPool where each business entity is identified with a Site ID.
  • The Dun & Bradstreet WorldBase covers the whole world by identifying approximately 200 million active and dissolved business entities with a DUNS-number. The DUNS-number also serves as a privatized national ID for companies in the United States.

If you start with matching your B2B customers against such a registry, you will get a unique identifier that can be attached to your internal customer master data records, which will make a succeeding internal deduplication a no-brainer.
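
A minimal sketch (with hypothetical DUNS-style numbers) of why the succeeding internal deduplication becomes a no-brainer: once every record carries the external identifier, finding duplicates reduces to a group-by.

```python
# Sketch: with a hypothetical DUNS-style identifier attached to each
# customer record, internal deduplication reduces to a group-by.
from collections import defaultdict

customers = [
    {"record_id": 1, "name": "Acme Ltd",     "duns": "150483782"},
    {"record_id": 2, "name": "ACME Limited", "duns": "150483782"},  # same real-world entity
    {"record_id": 3, "name": "Globex Corp",  "duns": "804735132"},
]

groups = defaultdict(list)
for customer in customers:
    groups[customer["duns"]].append(customer["record_id"])

# Any identifier with more than one record is an internal duplicate cluster.
duplicates = {duns: ids for duns, ids in groups.items() if len(ids) > 1}
print(duplicates)  # {'150483782': [1, 2]}
```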

Common matching issues

A problem however is that you seldom get a 100% hit rate in business directory matching, often not even close, as examined in the post 3 out of 10.

Another issue is the commercial implications. Business directory matching is often performed as an external service priced per record. Therefore you may save money by merging the duplicates before passing on to external matching. And even if everything is done internally, removing the duplicates before directory matching will save process load.

However, a common pitfall is that an internal deduplication may merge two similar records that are actually represented by two different entities in the business directory (and the real world).

So, as with many things in data matching, the answer to the sequence question is often: both.

A good process sequence may be this one:

  1. An internal deduplication with very tight settings
  2. A match against an external registry
  3. An internal deduplication exploiting external identifiers, with looser settings for similarities not involving an external identifier
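
The three-step sequence might be sketched like this, with hypothetical similarity thresholds and a stubbed registry lookup standing in for a real directory match:

```python
# Sketch of the three-pass sequence with hypothetical thresholds; the
# registry lookup is a stub standing in for a real directory match.
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def dedup_pass(records, is_duplicate):
    survivors = []
    for record in records:  # keep the first record of each duplicate cluster
        if not any(is_duplicate(record, s) for s in survivors):
            survivors.append(record)
    return survivors

def registry_lookup(name):
    # Stub: a real match against e.g. a business directory goes here.
    return {"acme ltd": "ID-1", "acme ltd.": "ID-1", "acme limited": "ID-1"}.get(name.lower())

records = [{"name": "Acme Ltd"}, {"name": "Acme Ltd."}, {"name": "Acme Limited"}]

# 1. Internal deduplication with very tight settings
step1 = dedup_pass(records, lambda a, b: similarity(a, b) > 0.95)
# 2. Match against an external registry
for record in step1:
    record["ext_id"] = registry_lookup(record["name"])
# 3. Looser pass exploiting the external identifiers
step3 = dedup_pass(step1, lambda a, b:
                   (a["ext_id"] is not None and a["ext_id"] == b["ext_id"])
                   or similarity(a, b) > 0.85)
print([record["name"] for record in step3])
```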


Big Master Data

Right now I am overseeing the processing of yet another master data file with millions of records. In this case it is product master data, also with customer-master-data-like attributes, as we are working with a big pile of author names and related book titles.

The Big Buzz

Having such high numbers of master data records isn't new at all, and compared to the size of the data collections we usually talk about when using the trendy buzzword big data, it's nothing.

Data collections that qualify as big will usually be files with transactions.

However master data collections are increasing in volume and most transactions have keys referencing descriptions of the master entities involved in the transactions.
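
In other words, transactions carry only keys, while the descriptions live in the master data collections; a minimal sketch (the tables and keys are hypothetical):

```python
# Minimal sketch: transactions reference master data by key only; the
# descriptions live in the master data collections. Tables are hypothetical.

customer_master = {4711: {"name": "Acme Ltd", "country": "GB"}}
product_master = {"P-1": {"title": "Widget"}}

transactions = [
    {"customer_key": 4711, "product_key": "P-1", "quantity": 3},
]

for transaction in transactions:
    # Resolve the keys to master data descriptions at read time.
    customer = customer_master[transaction["customer_key"]]
    product = product_master[transaction["product_key"]]
    print(customer["name"], "bought", transaction["quantity"], "x", product["title"])
```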

The growth of master data collections is also seen in collections of external reference data.

For example the Dun & Bradstreet WorldBase, holding business entities from around the world, has lately grown quickly from 100 million entities to nearly 200 million entities. Most of the growth has been due to better coverage outside North America and Western Europe, with the BRIC countries coming in fast. A smaller world resulting in bigger data.

Also one of the BRIC countries, India, is on the way with a huge project for uniquely identifying and holding information about every citizen – that's over a billion people. The project is called Aadhaar.

When we extend such external registries to social networking services by doing social MDM, we are dealing with very fast growing numbers of profiles on Facebook, LinkedIn and other services.

Extreme Master Data

Gartner, the analyst firm, has a concept called “extreme data” that rightly points out that this “big data” thing is not only about volume; it is also about velocity and variety.

This is certainly true also for master data management (MDM) challenges.

Master data are exchanged between organizations more and more often and in higher and higher volumes. Data quality focus and maturity will probably not be the same within the exchanging parties. The velocity and volume make it hard to rely on people-centric solutions in these situations.

Add to that increasing variety in master data. The variety may be international, as the world gets smaller and we have collections of master data embracing many languages and cultures. We also add more and more attributes each day, as for example governments release more data along with the open data trend, and we generally include more attributes in order to make better and more informed decisions.

Variety is also an aspect of Multi-Domain MDM, a subject that according to Gartner (the analyst firm once again) is one of the Three Trends That Will Shape the Master Data Management Market.


Managing Client On-Boarding Data

This year I will be joining FIMA: Europe’s Premier Financial Reference Data Management Conference for Data Management Professionals. The conference is held in London from 8th to 10th November.

I will present “Diversities In Using External Registries In A Globalised World” and take part in the panel discussion “Overcoming Key Challenges In Managing Client On-Boarding Data: Opportunities & Efficiency Ideas”.

As said in the panel discussion introduction: The industry clearly needs to normalise (or is it normalize?) regional differences and establish global standards.

The concept of using external reference data to improve data quality within master data management has been a favorite topic of mine for a long time.

I’m not saying that external reference data is a single source of truth. Clearly external reference data may have data quality issues as exemplified in my previous blog post called Troubled Bridge Over Water.

However I think there is a clear trend of encompassing external sources, increasingly found in the cloud, to make a shortcut in keeping up with data quality. I call this Data Quality 3.0.

The Achilles heel, though, has always been how to smoothly integrate external data into data entry functionality and other data capture processes and, not to forget, how to ensure ongoing maintenance in order to avoid the otherwise inevitable erosion of data quality.

Lately I have worked with a concept called instant Data Quality. The idea is to make simple yet powerful functionality that helps with hooking up to many external sources at the same time when on-boarding clients, and makes continuous maintenance possible.

One aspect of such a concept is how to exploit the different opportunities available in each country as public administrative practices and privacy norms varies a lot over the world.

I'm looking forward to presenting and discussing these challenges and getting a lot of feedback.
