How to Avoid True Positives in Data Matching

Now, this blog post title might sound silly, as we generally consider true positives to be the cream of data matching: a confirmed match between two data records that reflect the same real world entity, which lets us eliminate a harmful and costly duplicate in our records.

Still, this is not an optimal situation, because the duplicate should never have entered our data store in the first place. Avoiding duplicates up front is by far the best option.

So, how do you do that?

You may aim for low latency duplicate prevention by catching duplicates in (near) real time, with duplicate checks running after records have been captured but before they are committed to whatever data store holds the entities in question. Even so, this is still about finding true positives while staying aware of false positives.
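
To make this concrete, here is a minimal sketch of what such a pre-commit duplicate check could look like. The field names, the crude string similarity and the 0.7 cut-off are illustrative assumptions, not taken from any particular tool:

```python
# Minimal sketch of a pre-commit duplicate check (illustrative only).
# Field names, the similarity measure and the 0.7 cut-off are assumptions.
from difflib import SequenceMatcher

existing_records = [
    {"name": "Acme Corporation", "address": "12 Main Street", "city": "Springfield"},
]

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; real matching tools use far more advanced techniques."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def looks_like_duplicate(candidate: dict, threshold: float = 0.7) -> bool:
    """Flag the candidate if it resembles a record already in the store."""
    for record in existing_records:
        score = (similarity(candidate["name"], record["name"])
                 + similarity(candidate["address"], record["address"])) / 2
        if score >= threshold:
            return True
    return False

new_record = {"name": "ACME Corp.", "address": "12 Main St.", "city": "Springfield"}
if looks_like_duplicate(new_record):
    print("Possible duplicate - route to review before committing")
else:
    print("No similar record found - safe to commit")
```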

Killing Keystrokes

The best way is to aim for instant data quality. That is, instead of typing in data for the (supposedly) new records, you pick the data from data stores that are already available, presumably in the cloud, through an error tolerant search covering external data as well as the records already in the internal data store.
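
As a rough illustration of the "pick rather than type" idea, the sketch below runs an error tolerant lookup across an internal master data store and an external reference source in one go. Both data sources and the 0.5 cut-off are made-up assumptions:

```python
# Illustrative sketch of an error tolerant "pick, don't type" lookup that searches
# internal master data and an external reference directory at the same time.
# Both data sources and the 0.5 cut-off are made-up assumptions.
from difflib import get_close_matches

internal_master_data = {"Acme Corporation": "internal:1001"}
external_reference_data = {"Acme Corp Ltd": "ext:DK-778899", "Apex Components": "ext:DK-112233"}

def instant_lookup(query: str, max_hits: int = 5):
    """Return candidate records from both stores, tolerating misspellings in the query."""
    candidates = {**internal_master_data, **external_reference_data}
    hits = get_close_matches(query, candidates.keys(), n=max_hits, cutoff=0.5)
    return [(name, candidates[name]) for name in hits]

# A misspelled query still surfaces records the user can pick instead of typing:
print(instant_lookup("Acme Coporation"))
```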

This is exactly the kind of solution I’m working with right now. And oh yes, it is indeed called instant Data Quality.


Beware of False Positives in Data Matching

In a recent blog post by Kristen Gregerson of Satori Software you may learn A Terrible Tale, where the identities of two different real world individuals were merged into one golden record, with the most horrible result you may imagine associated with a recent special day related to the results of the other kind of matching going around.

Join the Data Matching Group on LinkedIn

As reported by Jim Harris some years ago in the post The Very True Fear of False Positives, the bad things that happen because of false positives are indeed a hindrance to doing data matching.

If we do data matching we should be aware that false positives will happen, we should know the probability of them happening, and we should know how to avoid the resulting heartache.

Indeed, using a data matching tool is better than relying on simple database indexes, and there are differences in how good various data matching tools are at doing the job, not least under different circumstances, as told in the post What is a best-in-class match engine?

Curious about how data matching tools work (differently)? There is an eLearning course available co-authored by yours truly. The course is called Data Parsing, Matching and De-duplication.


Hurray! I am on LinkedIn

These days LinkedIn is celebrating passing 200 million profiles.

This is done by sending us members a mail telling us about our part in the success.

The mail message is easily sharable on LinkedIn, Twitter and Facebook. What I’ve seen is that you can be among the 1 % most viewed (including yours truly), the 5 % most viewed, the 10 % most viewed and among the first 500,000 members in a given country.

The latter category includes, for example, being among the first 500,000 members in Malta.

Malta LinkedIn

I guess that will include every member in Malta, as Malta has a population of around 450,000, unless of course the Maltese are world champions in creating duplicate profiles.


Beyond True Positives in Deduplication

The most frequently performed data quality improvement process is deduplication of party master data.

A core functionality of many data quality tools is the capability to find duplicates in large datasets with names, addresses and other party identification data.

When evaluating the result of such a process we usually divide the found duplicates into (a small sketch of turning this split into a precision figure follows the list):

  • False positives being automated match results that do not actually reflect real world duplicates
  • True positives being automated match results reflecting the same real world entity
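
As a small aside, this split is also what gives you a precision figure for an automated match run. A sketch with made-up counts:

```python
# A small sketch of how the split above translates into a precision figure for an
# automated match run. The counts are made up for illustration.
true_positives = 940    # automated matches confirmed as real world duplicates
false_positives = 60    # automated matches that turned out not to be duplicates

precision = true_positives / (true_positives + false_positives)
print(f"Match precision: {precision:.1%}")   # -> Match precision: 94.0%
```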

The difficulties in reaching the above result aside, you would think the rest is easy: take the true positives, merge them into a golden record and purge the unneeded duplicate records from your database.

Well, I have seen many well executed deduplication jobs end just there, because there are a lot of reasons for not creating the golden records.

Sure, a lot of duplicates “are bad” and should be eliminated.

But many duplicates “are good” and have actually been put into the databases for a good reason, supporting different kinds of business processes where one view is needed in one case and another view in another case.

Many, many operational applications, including very popular ERP and CRM systems, do have inferior data models that are not able to reflect the complexity of the real world.

Only a handful of MDM (Master Data Management) solutions are able to do so, but even then things aren’t easy, as most enterprises have an IT landscape with all kinds of applications providing other business relevant functionality that isn’t replaced by an MDM solution.

What I like to do when working with getting business value from true positives is to build a so-called Hierarchical Single Source of Truth.


Hierarchical Single Source of Truth

Most data quality and master data management gurus, experts and practitioners agree that a “single source of truth” is a nice term, but that achieving one is not what data quality and master data management is really about, as expressed by Michele Goetz in the post Master Data Management Does Not Equal The Single Source Of Truth.

Even among those people, including me, who think that emphasis on real world alignment could help to get better data and information quality, as opposed to focusing on fitness for multiple different purposes of use, there is acknowledgement that there is a “digital distance” between real world aligned data and the real world, as explained by Jim Harris in the post Plato’s Data. Also, different publicly available reference data sources that should reflect the real world for the same entity are often in disagreement.

When working with improving data quality in party master data, which is the master data domain most frequently plagued by issues, you encounter the same issues over and over again, like:

  • Many organizations have a considerable overlap of real world entities that are customers and suppliers at the same time. Expanding to other party roles, this intersection is even bigger. This calls for a 360° Business Partner View.
  • Most organizations divide activities into business-to-business (B2B) and business-to-consumer (B2C). But the great majority of businesses are small companies where business and private are a mixed case, as told in the post So, how about SOHO homes.
  • When doing B2C, including membership administration in non-profits, you often have a mix of single individuals and households in your core customer database, as reported in the post Household Householding.
  • As examined in the post Happy Uniqueness, there are a lot of good fit-for-purpose-of-use reasons why customer and other party master data entities are deliberately duplicated within different applications.
  • Lately, social master data management (Social MDM) has emerged as the new leg in mastering data within multi-channel business. Embracing a wealth of digital identities will become yet another challenge in getting a single customer view and reaching for the impossible, and not always desirable, single source of truth.

A way of getting some kind of structure into this possible, and actually very common, mess is to strive for a hierarchical single source of truth, where the concept of a golden record is implemented as a model with golden relations between real world aligned external reference data and internal fit-for-purpose-of-use master data.
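
As an illustration of the idea, and not of any particular product, the sketch below models a golden record as golden relations: one real world aligned external reference entity linked to several internal fit-for-purpose records. All names and keys are assumptions:

```python
# Sketch of a hierarchical single source of truth: instead of collapsing records
# into one merged golden row, internal fit-for-purpose records keep "golden
# relations" to a shared, real world aligned external reference entity.
# All names and keys below are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExternalReferenceEntity:
    """Real world aligned entity from an external reference source (e.g. a business directory)."""
    reference_key: str
    legal_name: str
    address: str

@dataclass
class InternalMasterRecord:
    """Fit-for-purpose record as it lives in an ERP, CRM or other application."""
    application: str
    record_id: str
    name_as_used: str

@dataclass
class GoldenRelation:
    """The 'golden record' implemented as relations rather than a merged row."""
    reference: ExternalReferenceEntity
    internal_records: List[InternalMasterRecord] = field(default_factory=list)

acme = ExternalReferenceEntity("DK-778899", "Acme Corporation A/S", "12 Main Street, Springfield")
relation = GoldenRelation(acme, [
    InternalMasterRecord("CRM", "C-4711", "Acme Corp (key account)"),
    InternalMasterRecord("ERP", "V-0815", "Acme Corporation A/S (supplier)"),
])
print(f"{relation.reference.legal_name} is known in "
      f"{len(relation.internal_records)} internal applications")
```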

Right now I’m having an exciting time doing just that as described in the post Doing MDM in the Cloud.


Data Quality along the Timeline

When working with data quality improvement it is crucial to be able to monitor how your various ways of getting better data quality are actually working. Are things improving? What measures are improving and how fast? Are there things going in the wrong direction?

Recently I had a demonstration by Kasper Sørensen, the founder of the open source data quality tool called DataCleaner. The new version 3.0 of the tool has comprehensive support of monitoring how data quality key performance indicators develop over time.

What you do is take classic data quality assessment features such as data profiling measurements of completeness and duplication counting. The results from periodically executing these features are then attached to a timeline. You can then visually assess what is improving, at what speed, and whether anything is not developing so well.
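
The concept, illustrated with a sketch that is deliberately not DataCleaner’s API: each periodic profiling run stores its completeness and duplication measurements against the run date, and the resulting series is what gets plotted as the timeline. The measures and field names are assumptions:

```python
# Sketch of the timeline idea (not DataCleaner's API): each periodic profiling run
# stores completeness and duplication measurements against its run date.
from datetime import date

timeline = []  # one entry per periodic profiling run

def record_measurement(run_date: date, records: list):
    """Profile a snapshot of the records and attach the results to the timeline."""
    emails = [r.get("email") for r in records if r.get("email")]
    timeline.append({
        "date": run_date,
        "email_completeness": len(emails) / len(records),
        "duplicate_emails": len(emails) - len(set(emails)),
    })

record_measurement(date(2012, 10, 1), [{"email": "a@x.com"}, {"email": None}])
record_measurement(date(2012, 11, 1), [{"email": "a@x.com"}, {"email": "a@x.com"}])
for point in timeline:
    print(point)
```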

Continuously monitoring how data quality key performance indicators are developing is especially interesting in relation to using concepts of getting data quality right the first time and following up with ongoing data maintenance through enrichment from external sources.

In a traditional downstream data cleansing project you will typically measure completeness and uniqueness twice: once before and once after the execution.

With upstream data quality prevention and automatic ongoing data maintenance you have to make sure everything is running well all the time. Having a timeline of data quality key performance indicators is a great feature for doing just that.


Probabilistic Learning in Data Matching

One of the techniques in data matching I have found most exciting is using machine learning techniques such as probabilistic learning, where manually inspected results of previous automated matching runs are used to make the automated matching results better in the future.

Let’s look at an example. Below we are comparing two data rows with legal entities from Argentina:

The names are a close match (colored blue) as we have two swapped words.

The street addresses are an exact match (colored green).

The places are a mismatch (colored red).

All in all we may have a dubious match to be forwarded for manual inspection. This inspection may, based on additional information or other means, end up confirming these two records as belonging to the same real world legal entity.

Later we may encounter the two records:

The names are a close match (colored blue).

The street addresses are an exact match (colored green).

The places are basically a mismatch, but as we are learning that “Buenos Aires” and “Capital Federal” may be the same, it is now a close match (colored blue).

All in all we may have a dubious match to be forwarded for manual inspection. This inspection may, based on additional information or other means, end up confirming these two records as belonging to the same real world legal entity.

In a next match run we may meet these two records:

The names are an exact match (colored green).

The street addresses are an exact match (colored green).

The places are basically a mismatch, but as we are consistently learning that “Buenos Aires” and “Capital Federal” may be the same, it is now an exact match (colored green).

We have a confident automated match with no need of costly manual inspection.
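
The sketch below illustrates the learning mechanism walked through above: each manually confirmed match where the place names differed teaches the matcher that the two names may denote the same place, and a consistently learned pair is later scored (almost) as an exact match. The thresholds and scores are illustrative assumptions, not from any specific tool:

```python
# Sketch of probabilistic learning for place names: confirmed manual inspections
# teach the matcher that two differing place names may denote the same place.
# Thresholds and scores are illustrative assumptions.
from collections import Counter

learned_place_pairs = Counter()   # how often two place names co-occurred in confirmed matches

def normalise_pair(a: str, b: str) -> tuple:
    return tuple(sorted((a.strip().lower(), b.strip().lower())))

def learn_from_confirmed_match(place_a: str, place_b: str):
    """Called when manual inspection confirms a match despite differing place names."""
    if place_a.strip().lower() != place_b.strip().lower():
        learned_place_pairs[normalise_pair(place_a, place_b)] += 1

def place_score(place_a: str, place_b: str) -> float:
    """1.0 exact, 0.9 if consistently learned, 0.6 if learned once, else 0.0."""
    if place_a.strip().lower() == place_b.strip().lower():
        return 1.0
    seen = learned_place_pairs[normalise_pair(place_a, place_b)]
    if seen >= 2:
        return 0.9
    if seen == 1:
        return 0.6
    return 0.0

learn_from_confirmed_match("Buenos Aires", "Capital Federal")   # first confirmed inspection
learn_from_confirmed_match("Buenos Aires", "Capital Federal")   # second confirmed inspection
print(place_score("Buenos Aires", "Capital Federal"))           # -> 0.9, treated as a match
```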

This example is one of many more you may learn about in the new eLearningCurve course called Data Parsing, Matching and De-Duplication.


Fake, Snoopy, Kitty and Duplicate Social Media Profiles

As a data quality practitioner I have never been in doubt that when it is said that Facebook has 900 million profiles, it doesn’t mean that 900 million people have a Facebook profile.

Some people have more than one profile. Some people who had a profile are not among us anymore. As reported by BBC in the article Facebook ‘likes’ and adverts’ value doubted, some profiles are fake, resulting in Facebook earning real money that should have been fake money.

Some profiles are not even really fake but serve other purposes, like a snoopbook account created to reveal fraud.

And then some profiles belong to (the owners of) real cats, as reported by James Standen in a comment to my post called Out of Facebook.

On another social media platform, Twitter, I am guilty of having 5 profiles. Besides my real account hlsdk I have created hldsk, hsldk and hlsdq, so I have been able to thank people mentioning me with a misspelled handle. And then there is my female side: MissDqPiggy.


Mashing Up Big Reference Data and Internal Master Data

Right now I’m working on a cloud service called instant Data Quality (iDQ™).

It is basically a very advanced search engine capable of being integrated into business processes in order to get data quality right the first time and at the same time reduce the time needed for looking up and entering contact data.

With iDQ™ you are able to look up what is known about a given address, company and individual person in external sources (I call these big reference data) and what is already known in internal master data.

From a data quality point of view this mashup helps with solving some of the core data quality issues almost every organization has to deal with:

  • Avoiding duplicates
  • Getting data as complete as possible
  • Ensuring maximal accuracy

The mashup is also a very good foundation for making real-time decisions about master data survivorship.
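
To illustrate, here is a minimal sketch of a survivorship rule over such a mashup: for each attribute, keep the value from the most trusted source that actually has one. The source ranking and field names are assumptions, not iDQ™ functionality:

```python
# Minimal sketch of a survivorship rule over a mashup of external reference data
# and internal master data. Source ranking and field names are assumptions.
SOURCE_PRIORITY = ["external_directory", "crm", "self_registered"]  # most trusted first

def survive(records_by_source: dict, attributes: list) -> dict:
    """For each attribute, keep the value from the most trusted source that has one."""
    golden = {}
    for attribute in attributes:
        for source in SOURCE_PRIORITY:
            value = records_by_source.get(source, {}).get(attribute)
            if value:
                golden[attribute] = value
                break
    return golden

mashup = {
    "external_directory": {"legal_name": "Acme Corporation A/S", "address": "12 Main Street"},
    "crm": {"legal_name": "Acme Corp", "phone": "+45 12 34 56 78"},
    "self_registered": {"email": "info@acme.example"},
}
print(survive(mashup, ["legal_name", "address", "phone", "email"]))
```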

The iDQ™ service helps with getting data quality right the first time. However, you also need Ongoing Data Maintenance in order to keep data at a high quality. Therefore iDQ™ is built for hooking into subscription services for external reference data.

At iDQ we are looking for partners worldwide who see the benefit of having such a cloud based master data service connected to providing business-to-business (B2B) and/or business-to-consumer (B2C) data services, data quality services and master data management solutions.

Here’s the contact data: http://instantdq.com/contact/


Avoiding Contact Data Entry Flaws

Contact data is the data domain most often mentioned when talking about data quality. Names, addresses and other identification data are constantly spelled wrongly, or just differently, by the employees responsible for entering party master data.

Cleansing data a long time after it has been captured is a common way of dealing with this huge problem. However, preventing typos, mishearings and multi-cultural misunderstandings at data entry is a much better option wherever applicable.

I have worked with two different approaches to ensure the best data quality for contact data entered by employees. These approaches are:

  • Correction and
  • Assistance

Correction

With correction the data entry clerk, sales representative, customer service professional or whoever is entering the data will enter the name, address and other data into a form.

After submitting the form, or in some cases leaving each field on the form, the application will check the content against business rules and available reference data and return a warning or error message and perhaps a correction to the entered data.

As duplicated data is a very common data quality issue in contact data, a frequent example of such a prompt is a warning that a similar contact record already exists in the system.
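
A minimal sketch of the correction pattern, with made-up business rules and reference data: the submitted form is checked, and warnings, errors or suggested corrections come back, including the duplicate warning mentioned above:

```python
# Sketch of the correction approach: submitted form data is checked against simple
# business rules and reference data, returning warnings, errors or corrections.
# The rules and reference data below are illustrative assumptions.
postal_reference = {"2100": "Copenhagen", "8000": "Aarhus"}
existing_names = {"acme corporation"}

def validate_contact_form(form: dict) -> list:
    messages = []
    if not form.get("name"):
        messages.append("Error: name is required")
    elif form["name"].strip().lower() in existing_names:
        messages.append("Warning: a similar contact record already exists")
    expected_city = postal_reference.get(form.get("postal_code", ""))
    if expected_city and form.get("city") != expected_city:
        messages.append(f"Correction: postal code {form['postal_code']} belongs to {expected_city}")
    return messages

print(validate_contact_form(
    {"name": "Acme Corporation", "postal_code": "2100", "city": "Kopenhagen"}))
```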

Assistance

With assistance we try to minimize the needed number of key strokes and interactively help with searching in available reference data.

For example, when entering address data, assistance based data entry will start at the highest geographical level:

  • If we are dealing with international data, the country will set the context and determine whether a state or province is needed.
  • Where postal codes (like ZIP) exist, this is the fast path to the city.
  • In some countries the postal code only covers one street (thoroughfare), so that’s settled by the postal code. In other situations we will usually have a limited number of streets that can be picked from a list or narrowed down with the first characters typed.

(I guess many people know this approach from navigation devices for cars.)
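
A minimal sketch of the drill-down, with a made-up reference dataset: the country sets the context, the postal code resolves the city, and a couple of typed characters narrow the street list:

```python
# Sketch of assistance based address entry: country -> postal code -> city -> street.
# The reference data below is a made-up assumption for illustration.
reference = {
    "DK": {"2100": {"city": "Copenhagen",
                    "streets": ["Østerbrogade", "Øster Allé", "Jagtvej"]}},
}

def assist(country: str, postal_code: str, street_typed: str) -> dict:
    """Resolve the city from the postal code and suggest streets matching what was typed."""
    area = reference[country][postal_code]
    suggestions = [s for s in area["streets"]
                   if s.lower().startswith(street_typed.lower())]
    return {"city": area["city"], "street_suggestions": suggestions}

# Two keystrokes on the street field are enough to narrow the list:
print(assist("DK", "2100", "Øs"))
```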

When the valid address is known, you may catch companies from business directories located at that address and, depending on the country in question, you may know the citizens living there from phone directories, other sources and of course the internal party master data, thus avoiding entering what is already known about names and other data.

When catching business entities, a search for a name in a business directory often leads to being able to pick a range of identification data and other valuable data, not least a reference key for future data updates.

Lately I have worked intensively with an assistance based cloud service for business processes embracing contact data entry. We have some great testimonials about the advantages of such an approach here: instant Data Quality Testimonials.
