How to Avoid True Positives in Data Matching

Now, this blog post title might sound silly, as we generally consider true positives to be the cream of data matching as it means that we have found a match between two data records that reflects the same real world entity and it has been confirmed, that this is true and based on that we can eliminate a harmful and costly duplicate in our records.

Why this isn’t still an optimal situation is that the duplicate shouldn’t have entered our data store in the first place. Avoiding duplicates up front is by far the best option.

So, how do you do that?

You may aim for low latency duplicate prevention by catching the duplicates in (near) real-time by having duplicate checks after records have been captured but before they are committed in whatever is the data store for the entities in question. But still, this is actually also about finding true positives and at the same time to be aware of false positives.

Killing Keystrokes
Killing Keystrokes

The best way is to aim for instant data quality. That is, instead of entering data for the (supposed) new records, you are able to pick the data from data stores already available presumably in the cloud through an error tolerant search that covers external data as well as data records already in the internal data store.

This is exactly such a solution I’m working with right now. And oh yes, it is exactly called instant Data Quality.

Bookmark and Share

Tomorrow’s Data Quality Tool

In a blog post called JUDGEMENT DAY FOR DATA QUALITY published yesterday Forrester analyst Michele Goetz writes about the future of data quality tools.

Michele says:

“Data quality tools need to expand and support data management beyond the data warehouse, ETL, and point of capture cleansing.”

and continues:

“The real test will be how data quality tools can do what they do best regardless of the data management landscape.”

As described in the post Data Quality Tools Revealed there are two things data quality tools do better than other tools:

  • Data profiling and
  • Data matching

Some of these new challenges I have worked with within designing tomorrow’s data quality tools are:

  • open-doorPoint of capture profiling
  • Searching using data matching techniques
  • Embracing social networks

Point of capture profiling:

The sweet thing about profiling your data while you are entering your data is that analysis and cleansing becomes part of the on-boarding business process. The emphasis moves from correction to assistance as explained in the post Avoiding Contact Data Entry Flaws. Exploiting big external reference data sources within point of capture is a core element in getting it right before judgment day.

Searching using data matching techniques:

Error tolerant searching is often the forgotten capability when core features of Master Data Management solutions and data quality tools are outlined. Applying error tolerant search to big reference data sources is, as examined in the post The Big Search Opportunity, a necessity to getting it right before judgment day.

Embracing social networks:

The growth of social networks during the recent years has been almost unbelievable. Traditionally data matching has been about comparing names and addresses. As told in the post Addressing Digital Identity it will be a must to be able to link the new systems of engagement with the old systems of record in order to getting it right before judgment day.

How have you prepared for judgment day?

Bookmark and Share

Addressing Digital Identity

A physical address has traditionally been a core element of doing identity resolution. Stating a name and an address is the most widespread way of telling with which person or which company we are (aiming at) having a business and other form of relationship.

However, during the last 25 years a lot of things have moved from the physical world to the online world. Not at least a lot of things start in the online world while in many cases ends up in the physical world. Today selling, the smart way, starts in social media. Final delivery may be digital or may be sending a package or a consultant to a physical address. A thing like dating most often starts in the online world today but surely the aim is a physical encounter.

This new way of life has a tremendous affect on data quality and master data management. Within quality of contact data, the most frequent domain for data quality issues, we have traditionally dealt with verifying names and addresses and deduplicating names and addresses.

As the best way of preventing data quality issues is looking at the root we must address that onboarding of contact data often starts with a digital identity where a physical address isn’t present in the first place but often will be updated at a later stage.

As described in the post Social MDM and Systems of Engagement a new trend in master data management is to establish a link between the new systems of engagement and the old systems of record.

In the same way data quality prevention and improvement will have to cover establishing a link between a new discipline being digital identity resolution and the good old address verification stuff.

Bookmark and Share

Data Quality along the Timeline

When working with data quality improvement it is crucial to be able to monitor how your various ways of getting better data quality is actually working. Are things improving? What measures are improving and how fast? Are there things going in the wrong direction?

Recently I had a demonstration by Kasper Sørensen, the founder of the open source data quality tool called DataCleaner. The new version 3.0 of the tool has comprehensive support of monitoring how data quality key performance indicators develop over time.

What you do is that you take classic data quality assessment features as data profiling measurements of completeness and duplication counting. The results from periodic executing of these features are then attached to a timeline. You can then visually asses what is improving, at what speed and eventually if anything is not developing so well.

Continuously monitoring how data quality key performance indicators are developing is especially interesting in relation to using concepts of getting data quality right the first time and follow up by ongoing data maintenance through enrichment from external sources.

In a traditional downstream data cleansing project you will typically measure completeness and uniqueness two times: Once before and once after the executing.

With upstream data quality prevention and automatic ongoing data maintenance you have to make sure everything is running well all the time. Having a timeline of data quality key performance indicators is a great feature for doing just that.

Bookmark and Share

instant Data Quality at Work

DONG Energy is one of the leading energy groups in Northern Europe with approximately 6,400 employees and EUR 7.6 billion in revenue in 2011.

The other day I sat down with Ole Andres, project manager at DONG Energy, and talked about how they have utilized a new tool called iDQ™ (instant Data Quality) in order to keep up with data quality around customer master data.

iDQ™ is basically a very advanced search engine capable of being integrated into business processes in order to get data quality for contact data right the first time and at the same time reduce the time needed for looking up and entering contact data.

Fit for multiple business processes

Customer master data is used within many different business processes. Dong Energy has successfully implemented iDQ™ within several business processes, namely:

  • Assigning new customers and ending old customers on installation addresses
  • Handling returned mail
  • Debt collection

Managing customer master data in the utility sector has many challenges as there are different kinds of addresses to manage such as installation addresses, billing addresses and correspondence addresses as well as different approaches to private customers and business customers including considering the grey zone between who is a private account and who is a business account.

New technology requires change management

Implementing new technology into a large organization doesn’t just go by itself. Old routines tend to stick around for a while. DONG Energy has put a lot of energy, so to say, into training the staff in reengineering business processes around customer master data on-boarding and maintenance including utilizing the capabilities of the iDQ™ tool.

Acceptance of new tools comes with building up trust in the benefits of doing things in a new way.

Benefits in upstream data quality 

A tool like iDQ™ helps a lot with safeguarding the quality of contact data where data is born and when something happens in the customer data lifecycle. A side effect, which is at least as important stresses Ole Andres, is that data collection is going much faster.

Right now DONG Energy is looking into further utilizing the rich variety of reference data sources that can be found in the iDQ™ framework.

Bookmark and Share

We Need Better Search

Often we have all the information we need. What we don’t have is the right means to search in and make sense of all the information.

It’s now been a little more than a year since the terrible terrorist attacks in Norway carried out by a right-wing extremist.

Since then an investigation have been done in order to find out if the tragic incident could have been avoided. A report is due for tomorrow, but bits and pieces are already flowing in the press now.

Today the Norwegian newspaper Aftenposten has an article telling about the inadequate searching features available to the Norwegian Police Intelligence. Article in Norwegian here.

As I understand it the Police Intelligence did have a few registrations about suspicious activities by the terrorist. Probably not enough to act upon before the tragedy. But even if they had more information they wouldn’t have been able to match it with the technology available and prevent the attacks.

It’s a shame.

Bookmark and Share

Instant Data Enrichment

Data enrichment is one of the core activities within data quality improvement. Data enrichment is about updating your data in order to be more real world aligned by correcting and completing with data from external reference data sources.

Traditionally data enrichment has been a follow up activity to data matching and doing data matching as a prerequisite for data enrichment has been a good part of my data quality endeavor during the recent 15 years as reported in the post The GlobalMatchBox.

During the last couple of years I have tried to be part of the quest for doing something about poor data quality by moving the activities upstream. Upstream data quality prevention is better than downstream data cleansing wherever applicable. Doing the data enrichment at data capture is the fast track to improve data quality for example by avoiding contact data entry flaws.

It’s not that you have to enrich with all the possible data available from external sources at once. What is the most important thing is that you are able to link back to external sources without having to do (too much) fuzzy data matching later. Some examples:

  • Getting a standardized address at contact data entry makes it possible for you to easily link to sources with geo codes, property information and other location data at a later point.
  • Obtaining a company registration number or other legal entity identifier (LEI) at data entry makes it possible to enrich with a wealth of available data held in public and commercial sources.
  • Having a person’s name spelled according to available sources for the country in question helps a lot when you later have to match with other sources.

In that way your data will be fit for current and future multiple purposes.

Bookmark and Share

Avoiding Contact Data Entry Flaws

Contact data is the data domain most often mentioned when talking about data quality. Names and addresses and other identification data are constantly spelled wrong, or just different, by the employees responsible of entering party master data.

Cleansing data long time after it has been captured is a common way of dealing with this huge problem. However, preventing typos, wrong hearings and multi-cultural misunderstandings at data entry is a much better option wherever applicable.

I have worked with two different approaches to ensure the best data quality for contact data entered by employees. These approaches are:

  • Correction and
  • Assistance


With correction the data entry clerk, sales representative, customer service professional or whoever is entering the data will enter the name, address and other data into a form.

After submitting the form, or in some cases leaving each field on the form, the application will check the content against business rules and available reference data and return a warning or error message and perhaps a correction to the entered data.

As duplicated data is a very common data quality issue in contact data, a frequent example of such a prompt is a warning about that a similar contact record already exists in the system.


With assistance we try to minimize the needed number of key strokes and interactively help with searching in available reference data.

For example when entering address data assistance based data entry will start with the highest geographical level:

  • If we are dealing with international data the country will set the context and know about if a state or province is needed.
  • Where postal codes (like ZIP) exists, this is the fast path to the city.
  • In some countries the postal code only covers one street (thoroughfare), so that’s settled by the postal code. In other situations we will usually have a limited number of streets that can be picked from a list or settled with the first characters.

(I guess many people know this approach from navigation devices for cars.)

When the valid address is known you may catch companies from business directories being on that address and, depending on the country in question, you may know citizens living there from phone directories and other sources and of course the internal party master data, thus avoiding entering what is already known about names and other data.

When catching business entities a search for a name in a business directory often leads to being able to pick a range of identification data and other valuable data and not at least a reference key to future data updates.

Lately I have worked intensively with an assistance based cloud service for business processes embracing contact data entry. We have some great testimonials about the advantages of such an approach here: instant Data Quality Testimonials.

Bookmark and Share

How to Avoid Losing 5 Billion Euros

Two years ago I made a blog post about how 5 billion Euros were lost due to bad identity resolution at European authorities. The post was called Big Time ROI in Identity Resolution.

In the carbon trade scam criminals were able to trick authorities with fraudulent names and addresses.

One way of possible discovery of the fraudster’s pattern of interrelated names and physical and digital locations was, as explained in the post, to have used an “off the shelf” data matching tool in order to achieve what is sometimes called non-obvious relationship awareness. When examining the data I used the Omikron Data Quality Center.

Another and more proactive way would have been upstream prevention by screening identity at data capture.

Identity checking may be a lot of work you don’t want to include in business processes with high volume of master data capture, and not at least screening the identity of companies and individuals on foreign addresses seems a daunting task.

One way to help with overcoming the time used on identity screening covering many countries is using a service that embraces many data sources from many countries at the same time. A core technology in doing so is cloud service brokerage. Here your IT department only has to deal with one interface opposite to having to find, test and maintain hundreds of different cloud services for getting the right data available in business processes.

Right now I’m working with such a solution called instant Data Quality (iDQ).

Really hope there’s more organisations and organizations out there wanting to avoid losing 5 billion Euros, Pounds, Dollars, Rupees, Whatever or even a little bit less.

Bookmark and Share

Data Quality at Terminal Velocity

Recently the investment bank Saxo Bank made a marketing gimmick with a video showing a BASE jumper trading foreign currency with the banks mobile app at terminal velocity (e.g. the maximum speed when free falling).

Today business decisions have to be taken faster and faster in the quest for staying ahead of competition.

When making business decisions you rely on data quality.

Traditionally data quality improvement has been made by downstream cleansing, meaning that data has been corrected long time after data capture. There may be some good reasons for that as explained in the post Top 5 Reasons for Downstream Cleansing.

But most data quality practitioners will say that data quality prevention upstream, at data capture, is better.

I agree; it is better.  Also, it is faster. And it supports faster decision making.

The most prominent domain for data quality improvement has always been data quality related to customer and other party master data. Also in this quest we need instant data quality as explained in the post Reference Data at Work in the Cloud.

Bookmark and Share