The Good, Better and Best Way of Avoiding Duplicates

Duplicates in databases are the most prominent data quality issue around, and duplicates in party master data in particular are often pain number one when assessing the impact of data quality flaws.

A duplicate in the data quality sense is two or more records that are not character-for-character identical but refer to the same real-world entity. I have worked with these three different approaches to when the duplicate problem is fixed:

  • Downstream data matching
  • Real time duplicate check
  • Search and mash-up of internal and external data

Downstream Data Matching

The good old way of dealing with duplicates in databases is having data matching engines periodically scan through databases, highlighting possible duplicates in order to facilitate merge/purge processes.

Finding the duplicates after they have lived their own lives in databases and already have different kinds of transactions attached is indeed not optimal, but sometimes it’s the only option, as explained in the post Top 5 Reasons for Downstream Cleansing.
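A minimal sketch of such a batch matching pass, using only the Python standard library, might look like the following. The records, field names and similarity threshold are illustrative assumptions, not part of any particular matching engine; real engines use far more sophisticated standardization and blocking before comparing pairs.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Character-level similarity between two lowercased strings (0.0–1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_candidate_duplicates(records, threshold=0.85):
    """Scan all record pairs and flag those whose combined name and
    address similarity is at or above the threshold."""
    candidates = []
    for (i, r1), (j, r2) in combinations(enumerate(records), 2):
        score = similarity(r1["name"] + " " + r1["address"],
                           r2["name"] + " " + r2["address"])
        if score >= threshold:
            candidates.append((i, j, round(score, 2)))
    return candidates

# Illustrative party master data with one likely duplicate pair.
customers = [
    {"name": "John Smith", "address": "12 Main Street"},
    {"name": "Jon Smith",  "address": "12 Main St"},
    {"name": "Mary Jones", "address": "7 Oak Avenue"},
]
print(find_candidate_duplicates(customers))
```

The flagged pairs would then feed a merge/purge process where a data steward (or a survivorship rule) decides which record wins.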

Real Time Duplicate Check

The better way is to make the match at data entry where possible. This approach is often orchestrated as a data entry process where each element, or a range of elements, is checked when entered. For example, the address may be checked against reference data and a phone number may be checked for an adequate format for the country in question. Finally, when a properly standardized record is submitted, it is checked whether a possible duplicate exists in the database.
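In code, such an entry-time gate could be sketched roughly as below. The per-country phone patterns, field names and threshold are assumptions made up for the example; a production system would use proper reference data and a real matching service.

```python
import re
from difflib import SequenceMatcher

# Hypothetical per-country phone formats; real systems use richer
# reference data covering many more countries and number plans.
PHONE_PATTERNS = {
    "DK": re.compile(r"^\+45\d{8}$"),
    "US": re.compile(r"^\+1\d{10}$"),
}

def check_entry(entry, existing, threshold=0.85):
    """Validate a submitted record element by element, then look for a
    likely duplicate before allowing the insert."""
    errors = []
    pattern = PHONE_PATTERNS.get(entry["country"])
    if pattern is None or not pattern.match(entry["phone"]):
        errors.append("phone format invalid for country")
    for record in existing:
        score = SequenceMatcher(None, entry["name"].lower(),
                                record["name"].lower()).ratio()
        if score >= threshold:
            errors.append(f"possible duplicate of '{record['name']}'")
            break
    return errors

existing = [{"name": "Henrik Sorensen"}]
entry = {"name": "Henrik Sörensen", "phone": "+4512345678", "country": "DK"}
print(check_entry(entry, existing))
```

The point is the ordering: format and reference checks first, so the duplicate check runs against a standardized record rather than raw keystrokes.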

Search and Mash-Up of Internal and External Data

The best way is, in my eyes, a process that avoids entering most of the data that already exists in the internal databases and takes advantage of data that already exists on the internet as external reference data sources.

[Image: instant Data Quality (iDQ) mash-up]

The instant Data Quality concept I currently work with requires the user to enter as little data as possible, for example through rapid address entry, a Google-like search for a name, simply typing a national identification number or, in the worst case, combining some known facts. The system then makes a series of fuzzy searches in internal and external databases and presents the results as a compact mash-up.
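The flow above could be sketched as follows. The internal and external sources, record contents and scoring are all illustrative stand-ins; in practice the external side would be a live reference source such as a business registry or address directory, not an in-memory list.

```python
from difflib import SequenceMatcher

# Illustrative data only: INTERNAL mimics a CRM table, EXTERNAL mimics
# an external reference source returning richer, authoritative fields.
INTERNAL = [
    {"source": "CRM", "name": "Acme Trading Ltd", "city": "London"},
]
EXTERNAL = [
    {"source": "registry", "name": "Acme Trading Limited",
     "city": "London", "reg_no": "12345678"},
]

def fuzzy_search(query, records, threshold=0.6):
    """Return records whose name resembles the query, best match first."""
    scored = []
    for record in records:
        score = SequenceMatcher(None, query.lower(),
                                record["name"].lower()).ratio()
        if score >= threshold:
            scored.append((score, record))
    return [r for _, r in sorted(scored, key=lambda pair: -pair[0])]

def mash_up(query):
    """Search internal and external sources from one minimal input and
    merge the hits into a single compact view."""
    return fuzzy_search(query, INTERNAL) + fuzzy_search(query, EXTERNAL)

for hit in mash_up("Acme Trading"):
    print(hit["source"], hit["name"])
```

From the merged view the user either recognises the existing internal record (avoiding the duplicate) or picks the external hit to prefill the new record.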

The advantages are:

  • If the real-world entity already exists, you avoid the duplicate and avoid entering data again. At the same time you may evaluate accuracy against external reference data.
  • If the real-world entity doesn’t exist in internal data, you may pick most of the data from external sources, thereby avoiding excessive typing while ensuring accuracy.


One thought on “The Good, Better and Best Way of Avoiding Duplicates”

  1. mckennamike 12th September 2014 / 16:24

    Hi Henrik

    Check out Nimble CRM (www.nimble.com), which takes exactly this approach to data entry. Enter basic information about a contact (name and email address) and Nimble will search the leading social networks (Facebook, Twitter, LinkedIn, Google+, Foursquare, Instagram and AngelList) to find the contact’s profiles there and pull them automatically into the CRM database, building a relevant profile of the contact from the available social information and minimising data entry.
