Technology – Page 24 – Liliendahl on Data Quality

Probabilistic Learning in Data Matching

26th July 2012 | 22nd February 2013 | Henrik Gabs Liliendahl | 7 Comments

One of the techniques in data matching I have found most exciting is probabilistic learning: a machine learning approach where manually inspected results of previous automated matching runs are used to make future automated matching results better.

Let’s look at an example. Below we are comparing two data rows with legal entities from Argentina:

The names are a close match (colored blue) as we have two swapped words.

The street addresses are an exact match (colored green).

The places are a mismatch (colored red).

All in all we may have a dubious match to be forwarded for manual inspection. This inspection may, based on additional information or other means, end up confirming these two records as belonging to the same real-world legal entity.

Later we may encounter the two records:

The names are a close match (colored blue).

The street addresses are an exact match (colored green).

The places are basically a mismatch, but as we are learning that “Buenos Aires” and “Capital Federal” may be the same, it is now a close match (colored blue).

All in all we may have a dubious match to be forwarded for manual inspection. This inspection may, based on additional information or other means, end up confirming these two records as belonging to the same real-world legal entity.

In a next match run we may meet these two records:

The names are an exact match (colored green).

The street addresses are an exact match (colored green).

The places are basically a mismatch, but as we are consistently learning that “Buenos Aires” and “Capital Federal” may be the same, it is now an exact match (colored green).

We have a confident automated match with no need of costly manual inspection.
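The three match runs above can be sketched in a few lines of Python. This is only a minimal illustration and not the method of any particular tool: the thresholds, the scoring, the function names and the sample records are all assumptions made up for the sketch, and it simply assumes the reviewer confirms every dubious match.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical thresholds: confirmations needed before a learned place pair
# is treated as a close match or as an exact match.
CLOSE_AFTER = 1
EXACT_AFTER = 2
POINTS = {"exact": 2, "close": 1, "mismatch": 0}

# Place pairs confirmed as equivalent by manual inspection, with counts.
confirmed_places = Counter()


def normalize(text):
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def compare_text(a, b):
    """Return 'exact', 'close' or 'mismatch' for two strings."""
    a, b = normalize(a), normalize(b)
    if a == b:
        return "exact"
    # The same words in a different order (swapped words) count as close.
    if sorted(a.split()) == sorted(b.split()):
        return "close"
    if SequenceMatcher(None, a, b).ratio() >= 0.85:
        return "close"
    return "mismatch"


def compare_place(a, b):
    """Compare places, upgrading the level using learned equivalences."""
    level = compare_text(a, b)
    if level == "exact":
        return level
    seen = confirmed_places[frozenset((normalize(a), normalize(b)))]
    if seen >= EXACT_AFTER:
        return "exact"
    if seen >= CLOSE_AFTER:
        return "close"
    return level


def classify(r1, r2):
    """Score name, street and place and decide how to handle the pair."""
    levels = [
        compare_text(r1["name"], r2["name"]),
        compare_text(r1["street"], r2["street"]),
        compare_place(r1["place"], r2["place"]),
    ]
    score = sum(POINTS[level] for level in levels)
    if score == 6:
        return "automatic match"
    if score >= 3:
        return "manual review"
    return "no match"


def confirm(r1, r2):
    """Record that a reviewer confirmed the two records as the same entity."""
    a, b = normalize(r1["place"]), normalize(r2["place"])
    if a != b:
        confirmed_places[frozenset((a, b))] += 1


# Made-up sample records, one pair per match run.
pairs = [
    ({"name": "Acme Servicios SA", "street": "Av. Ejemplo 123", "place": "Buenos Aires"},
     {"name": "Servicios Acme SA", "street": "Av. Ejemplo 123", "place": "Capital Federal"}),
    ({"name": "Lopez Hermanos SRL", "street": "Calle Muestra 45", "place": "Buenos Aires"},
     {"name": "Hermanos Lopez SRL", "street": "Calle Muestra 45", "place": "Capital Federal"}),
    ({"name": "Panaderia Sol SA", "street": "Calle Prueba 7", "place": "Buenos Aires"},
     {"name": "Panaderia Sol SA", "street": "Calle Prueba 7", "place": "Capital Federal"}),
]

decisions = []
for r1, r2 in pairs:
    decision = classify(r1, r2)
    decisions.append(decision)
    if decision == "manual review":
        confirm(r1, r2)  # assumption: the reviewer confirms the match

print(decisions)
```

In a real setting you would probably want more evidence than two confirmations before trusting a learned equivalence, and a way to unlearn it when a reviewer rejects a match.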

This example is one of many more you may learn about in the new eLearningCurve course called Data Parsing, Matching and De-Duplication.

Return on Investment in Big Reference Data

19th July 2012 | 26th July 2012 | Henrik Gabs Liliendahl | Leave a comment

Currently I’m working with a cloud based service where we are exploiting available data about addresses, business entities and consumers/citizens from all over the world.

The cost of such data varies a lot around the world.

In Denmark, where the product was born, the costs of such data are relatively low. The joys of the welfare state also apply to access to open public sector data as reported in the post The Value of Free Address Data. Also you are able to check the identity of an individual in the citizen hub. Doing it online on a green screen you will be charged (what resembles) 50 cents, but doing it with cloud service brokerage, like in iDQ™, it will only cost you 5 cents.

In the United Kingdom the prices for public sector data about addresses, business entities and citizens are still relatively high. The Royal Mail has a license tag on the PAF file even for government bodies. Ordnance Survey is giving the rest of AddressBase free to the public sector, but there is a big price tag for the rest of society. The electoral roll has a price tag too, even if the data quality isn’t considered for uses other than the intended immediate purpose, as told in the post Inaccurately Accurate.

At the moment I’m looking into similar services for the United States and a lot of other countries. Generally speaking you can get your hands on most data for a price, and the prices have come down since I last checked. Also there is a tendency of lowering or abandoning the price for the most basic data such as names, addresses and other identification data.

As poor data quality in contact data is a big cost for most enterprises around the world, the news of decreasing prices for big reference data is good news.

However, if you are doing business internationally it is a daunting task to keep up with where to find the best and most cost effective big reference data sources for contact data and, not least, how to use the sources in business processes.

Wednesday the 25th July I’m giving a presentation, in the cloud, on how iDQ™ comes to the rescue. More information on DataQualityPro.

Beyond Address Validation

11th July 2012 | 23rd July 2012 | Henrik Gabs Liliendahl | Leave a comment

The quality of contact master data is the number one data quality issue around.

Lately there has been a lot of momentum among data quality tool providers in offering services for getting at least the postal address in contact data right. The new services are improved by:

Being cloud based offering validation services that are implemented at data entry and based on fresh reference data.
Being international and thus providing address validation for customer and other party data embracing a globalized world.

Capturing an address that is aligned with the real world may have a significant effect on business outcomes as reported by the tool vendor WorldAddresses in a recent blog post.

However, a valid address based on address reference data only tells you if the address is valid, not if the addressee is (still) at the address, and you are not sure if the name and other master data elements are accurate and complete. Therefore you often need to combine address reference data with other big reference data sources such as business directories and consumer/citizen reference sources.

Using business directories is not new at all. Big reference sources such as the D&B WorldBase and many other directories have been around for many years and have been a core element in many data quality initiatives with customer data in business-to-business (B2B) environments and with supplier master data.

Combining address reference data and business entity reference data makes things even better, also because business directories don’t always come with a valid address.

Using publicly available reference data when registering private consumers, employees and other citizen roles has until now been practiced in some industries and for special reasons. Therefore the big reference data and the services are out there and being used today in some business processes.

Mashing up address reference data, business entity reference data and consumer/citizen reference data is a big opportunity for many organizations in the quest for high quality contact master data, as most organizations actually interact with both companies and private persons if we look at the total mix of business processes.

The next big source is going to be social network profiles. As told in the post Social Master Data Management, social media will be an additional source of knowledge about our business partners. Again, you won’t find the full truth here either. You have to mash up all the sources.

Olympic Moments

9th July 2012 | 9th July 2012 | Henrik Gabs Liliendahl | 2 Comments

The London 2012 Olympic Games are approaching. You feel that very well in London. For example, my usual walking path through Hyde Park is closed because of the upcoming sporting event.

I’m sure these games are going to produce some great moments. Some of the moments I’m remembering from previous games have a touch of data quality technology learning attached.

The Fosbury Flop

In 1968 the American athlete Dick Fosbury introduced a better way of doing the high jump. What I find interesting about the Fosbury Flop is that this technique hasn’t always been possible. In the old days the jumpers landed in a sandpit. If you did the flop then, it would certainly be a flop most probably getting you injured after the first attempt. But after deep foam matting was put in place, the flop has been a good choice.

It’s the same with data quality technology. Some techniques for improvement you have found to be a flop previously may, because of new circumstances, be a good choice today. The highly esteemed scissors jump didn’t prevail forever.

Eddie the Eagle

In 1988, at the winter event, a Brit made a lot of headlines by being totally bad at ski jumping. Eddie the Eagle finished, not surprisingly, far behind natural born Finnish, Norwegian and Czech ski jumpers coming from countries where the first sign of the white fluffy stuff from above isn’t considered a severe weather condition. But Eddie set a new British record.

It’s the same with data quality technology. Some tools and services are leading in some countries, but have a hard time when challenged internationally.

Sailing under Wrong Flag

In the 2008 games something spectacular happened in the sailing competitions. The Danish 49er boat was in first place but broke the mast when leaving the harbor for the last race. The Croatian team offered their boat. The Danes sailed into the race long after the other boats had started, but managed to get a result just good enough to secure the gold. The other teams might have been confused by the wrong flag.

As told in the post Most Times the Home Team Wins flags are important – in sports, in data quality and other data management disciplines too.

2012

What do you guess will make a difference in this year’s Olympic Games? – And in Data Quality improvement?

Sharing Bigger Data

6th July 2012 | Henrik Gabs Liliendahl | 1 Comment

Yesterday I attended an event called Big Data Forum 2012 held in London.

Big data seems to be yet another buzzing term with many definitions. Anyway, surely it is about datasets that are bigger (and more complex) than before.

The Olympics is Going to be Bigger

One session at the Big Data Forum was about how the BBC will use big data in covering the upcoming London Olympics on the BBC website.

James Howard, whom I know as speckled_jim on Twitter, explained that the bulk of the content on the BBC Sports website is not produced by the BBC. The data is sourced from external data providers, and the structure of the content is actually also based on the external sources.

So for the Olympics there will be rich content about all the 10,000 athletes coming from all over the world. The BBC editorial stuff will be linked to this content, of course emphasizing the British athletes.

I guess that other broadcasting bodies and sports websites from all over the world will base the bulk of their content on the same sources and then more or less link targeted own produced content in the same way and with their look and feel.

There are some data quality issues related to sourcing such data, Jim said. For example, you may have your own guideline for how to spell names from other script systems.

I have noticed exactly that issue in the news from major broadcasters. For example BBC spells the new Egyptian president Mursi while CNN says his name is Morsi.
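A minimal sketch of how such a spelling guideline could be applied to externally sourced names is shown here. The list of house spellings, the similarity threshold and the function names are assumptions made for illustration; a real guideline would need transliteration rules rather than plain string similarity.

```python
from difflib import SequenceMatcher

# Hypothetical house style: the spellings this organization prefers.
HOUSE_SPELLINGS = ["Mursi"]
SIMILARITY_THRESHOLD = 0.75  # assumed cut-off, to be tuned on real data


def similarity(a, b):
    """Case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def to_house_spelling(external_name):
    """Return the preferred spelling if a close variant is known, else the input."""
    best = max(HOUSE_SPELLINGS, key=lambda s: similarity(s, external_name))
    if similarity(best, external_name) >= SIMILARITY_THRESHOLD:
        return best
    return external_name


print(to_house_spelling("Morsi"))
print(to_house_spelling("Mubarak"))
```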

Bigger Data in Party Master Data Management

The postal validation firm Postcode Anywhere recently had a blog post called Big Data – What’s the Big Deal?

The post has the well known sentiment that you may use your resources better by addressing data quality in “small data” rather than fighting with big data and that getting valid addresses in your party master data is a very good place to start.

I can’t agree more about getting valid addresses.

However I also see some opportunities in sharing bigger datasets for valid addresses. For example:

The reference dataset for UK addresses typically based on the Royal Mail Postal Address File (PAF) is not that big. But the reference dataset for addresses from all over the world is bigger and more complex. And along with increasing globalization we need valid addresses from all over the world.
Rich address reference data will be more and more available. The UK PAF file is not that big. The AddressBase from Ordnance Survey in the UK is bigger and more complex. So are similar location reference data with more information than basic postal attributes from all over the world, not least when addressed together.
A valid address based on address reference data only tells you if the address is valid, not if the addressee is (still) at the address. Therefore you often need to combine address reference data with business directories and consumer/citizen reference sources. That means bigger and more complex data as well.

Similar to how the BBC is covering the Olympics, my guess is that organizations will increasingly share bigger public address, business entity and consumer/citizen reference data and link private master data that you find more accurate (like the spelling example) along with essential data elements that better support your way of doing business and make you more competitive.

My recent post Mashing Up Big Reference Data and Internal Master Data describes a solution for linking bigger data within business processes in order to get a valid address and beyond.

Big Data and Multi-Domain Master Data Management

27th June 2012 | 23rd March 2013 | Henrik Gabs Liliendahl | Leave a comment

The possible connection between the hot buzz within IT today being “big data” and the good old topic of master data management has been discussed a lot lately. An example from CIO UK today is this article called Big data without master data management is a problem.

As said in the article there is a connection through big master data (and big reference data) to big transaction data. Big transaction data is what we usually would call big data, because these are the really big ones.

The two most mentioned kinds of big transaction data are:

Social data and
Sensor data

I also have seen a lot of connections between these big data and master data in multiple domains.

Social Data

Connecting social data to Master Data Management (MDM) is an ongoing discussion I have been involved in for the last three years, lately through the new LinkedIn group called Social MDM.

The customer master data domain is in focus here, as the immediate connection here is how to relate traditional systems of record holding customer master data and the systems of engagement where the big social data are waiting to be analyzed and eventually be a part of day-to-day customer centric business processes.

However being able to analyze, monitor and take action on what is being said about specific products in social data is another option and eventually that has to be linked to product master data. In product master data management the focus has traditionally been on your own (resell) products. Effectively listening to social data will mean that you also have to manage data about competing products.

Attaching location to social data has been around for long. Connecting social data to your master data will also require that your location master data are well aligned with the real world.

Sensor Data

During the past many years I have been involved in data management within public transportation where we have big data coming in from sensors of different kinds.

The big problem has for sure been being able to connect these transactions correctly to master data. The challenges here are described in the post Multi-Entity Master Data Quality.

The biggest problem is that all the different equipment generating the sensor data in practice can’t be at the same stage at the same time, and this will eventually create data that, if related without care, will show very wrong information about who the passenger(s) were, what kind of trip it was, where the journey happened and under which timetable.

State of this Data Quality Blog

21st June 2012 | Henrik Gabs Liliendahl | 6 Comments

Today is a big day on this blog as it has been live for 3 years.

Success versus Failure

The first entry called Qualities of Data Architecture was a promise to talk about data quality success stories. The reason for emphasizing success stories related to data quality is a feeling that data quality improvement is too often promoted by horror stories telling about how bad your business may go if you don’t pay attention to data quality.

The problem is that stories about failure usually aren’t taken too seriously. Jim Harris recently had a very good take on that in the post Data Quality and Chicken Little Syndrome.

So, I plan to tell even more success stories along with the inevitable stories about failure that so easily and obviously could have been avoided.

Getting Social

Using social networks to promote your blogging is quite natural.

At the same time social networks have emerged as a new source in doing master data management (I call this Social MDM).

Exploring this new discipline over the hype peak, down through the valley of disappointment and up to the plateau of productivity will for sure be a recurring subject on this blog.

People, Processes and Technology

Sometimes you see a statement like “Data Quality is not about technology, it’s all about people”.

Well, most things we can’t solve easily are not just about one thing. In my eyes the old cliché about addressing people, processes and technology surely also relates to getting data quality right.

There are many good blogs around about people and processes. On this blog I’ll try to tell about my comfort zone being technology without forgetting people and processes.

The Hidden Agenda

Most people blogging are doing this to promote our (employers’) expertise, services and tools, and I am no different.

Lately I have written a lot about a second to none cloud based service for upstream data quality prevention. The wonder is called instant Data Quality.

While upstream prevention is the best approach to data quality, a lot of work must still be done every day in downstream cleansing as told in the post Top 5 Reasons for Downstream Cleansing.

As I’m also working with a new stellar cloud based platform for data quality improvement productivity I will for sure share some props for that in the near future.

Data Driven Data Quality

19th June 2012 | 20th June 2012 | Henrik Gabs Liliendahl | Leave a comment

In a recent article Loraine Lawson examines how a vast majority of executives describe their business as “data driven” and how the changing world of data must change our approach to data quality.

As said in the article the world has changed since many data quality tools were created. One aspect is that “there’s a growing business hunger for external, third-party data, which can be used to improve data quality”.

Embedding third-party data into data quality improvement especially in the party master data domain has been a big part of my data quality work for many years.

Some of the interesting new scenarios are:

Ongoing Data Maintenance from Many Sources

As explained in the Wikipedia article about data quality, services such as the US National Change of Address (NCOA) service and similar services around the world have been around for many years as a basic use of external data for data quality improvement.

Using updates from business directories like the Dun & Bradstreet WorldBase and other national or industry specific directories is another example.

In the post Business Contact Reference Data I have a prediction saying that professional social networks may be a new source of ongoing data maintenance in the business-to-business (B2B) realm.

Using social data in business-to-consumer (B2C) activities is another option though also haunted with complex privacy considerations.

Near-Real-Time Data Enrichment

Besides updating changes of basic master data from business directories, these directories typically also contain a lot of other data of value for business processes and analytics.

Address directories may also hold further information like demographic stereotype profiles, geo codes and property data elements.

Appending phone numbers from phone books and checking national suppression lists for mailing and phoning preferences are other forms of data enrichment used a lot related to direct marketing.

Traditionally these services have been implemented by sending database extracts to a service provider and receiving enriched files for uploading back from the service provider.

Lately I have worked with a new breed of self service data enrichment tools placed in the cloud making it possible for end users to easily configure what to enrich from a palette of address, business entity and consumer/citizen related third-party data and executing the request as close to real-time as the volume makes it possible.

Such services also include the good old duplicate check now much better informed by including third-party reference data.
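As a rough sketch of what a duplicate check informed by reference data could look like: two internal records that resolve to the same entry in a reference directory are flagged as duplicates, even when their names are written differently. The in-memory directory, the identifiers and the sample records are made up for the illustration; a real service would query external sources and use fuzzy matching instead of exact lookups.

```python
from collections import defaultdict

# Hypothetical reference directory: reference id -> known names and address.
REFERENCE_DIRECTORY = {
    "REF-001": {
        "names": {"acme ltd", "acme holdings ltd"},
        "address": "1 example street, london",
    },
    "REF-002": {
        "names": {"other co"},
        "address": "2 sample road, leeds",
    },
}


def normalize(text):
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def resolve(record):
    """Return the reference id the record points to, or None if unknown."""
    name = normalize(record["name"])
    address = normalize(record["address"])
    for ref_id, entry in REFERENCE_DIRECTORY.items():
        if name in entry["names"] and address == entry["address"]:
            return ref_id
    return None


def find_duplicates(records):
    """Group record ids by reference id and keep groups with more than one id."""
    by_ref = defaultdict(list)
    for record in records:
        ref_id = resolve(record)
        if ref_id is not None:
            by_ref[ref_id].append(record["id"])
    return {ref_id: ids for ref_id, ids in by_ref.items() if len(ids) > 1}


records = [
    {"id": 1, "name": "Acme Ltd", "address": "1 Example Street, London"},
    {"id": 2, "name": "Other Co", "address": "2 Sample Road, Leeds"},
    {"id": 3, "name": "ACME Holdings Ltd", "address": "1 Example Street,  London"},
]

print(find_duplicates(records))
```

Records 1 and 3 would hardly be caught by comparing names alone, which is the point of bringing the reference data into the check.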

Instant Data Quality in Data Entry

As discussed in the post Avoiding Contact Data Entry Flaws, third-party reference data such as address directories, business directories and consumer/citizen directories placed in the cloud may be used very efficiently in data entry functionality in order to get data quality right the first time and at the same time reduce the time spent in data entry work.

Not least in a globalized world where names of people reflect the diversity of almost any nation today, where business names become more and more creative and data entry is done at shared service centers manned with people from cultures with other address formatting rules, there is an increased need for data entry assistance based on external reference data.

When mashing up advanced search in third-party data and internal master data when doing data entry you will solve most of the common data quality issues around avoiding duplicates and getting data as complete and timely as needed from day one.

Pulling Data Quality from the Cloud

7th June 2012 | 15th January 2013 | Henrik Gabs Liliendahl | 1 Comment

In a recent post here on the blog the benefits of instant data enrichment were discussed.

In the contact data capture context these are some examples:

Getting a standardized address at contact data entry makes it possible for you to easily link to sources with geo codes, property information and other location data.
Obtaining a company registration number or other legal entity identifier (LEI) at data entry makes it possible to enrich with a wealth of available data held in public and commercial sources.
Having a person’s name spelled according to available sources for the country in question helps a lot with typical data quality issues such as uniqueness and consistency.

However, if you are doing business in many countries it is a daunting task to connect with the best of breed sources of big reference data. Add to that, that many enterprises are doing both business-to-business (B2B) and business-to-consumer (B2C) activities including interacting with small business owners. This means you have to link to the best sources available for addresses, companies and individuals.

A solution to this challenge is using Cloud Service Brokerage (CSB).

An example of a Cloud Service Brokerage suite for contact data quality is the instant Data Quality (iDQ™) service I’m working with right now.

This service can connect to big reference data cloud services from all over the world. Some services are open data services in the contact data realm, some are international commercial directories, some are the wealth of national reference data services for addresses, companies and individuals and even social network profiles are on the radar.
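The brokerage idea can be sketched as a small dispatcher that routes a lookup to whichever provider is registered for a given country and entity type. This is not the iDQ™ API; the class, the provider stubs and the return values are hypothetical and only illustrate the routing principle.

```python
from typing import Callable, Dict, Optional, Tuple

# A provider takes a query string and returns a result dict or None.
Provider = Callable[[str], Optional[dict]]


class ReferenceDataBroker:
    """Routes lookups to the provider registered per country and entity type."""

    def __init__(self) -> None:
        self._providers: Dict[Tuple[str, str], Provider] = {}

    def register(self, country: str, kind: str, provider: Provider) -> None:
        self._providers[(country.upper(), kind)] = provider

    def lookup(self, country: str, kind: str, query: str) -> Optional[dict]:
        provider = self._providers.get((country.upper(), kind))
        if provider is None:
            return None  # no source connected for this country and kind
        return provider(query)


# Stub providers standing in for real national or commercial services.
def danish_address_stub(query: str) -> Optional[dict]:
    return {"source": "dk-address-stub", "query": query}


def uk_company_stub(query: str) -> Optional[dict]:
    return {"source": "gb-company-stub", "query": query}


broker = ReferenceDataBroker()
broker.register("DK", "address", danish_address_stub)
broker.register("GB", "company", uk_company_stub)

print(broker.lookup("dk", "address", "Eksempelvej 1"))
print(broker.lookup("us", "company", "Example Inc"))
```

The calling business process only talks to the broker, so sources can be added or swapped per country without changing the data entry functionality.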
