Duplicates in databases are among the most prominent data quality issues around, and not least duplicates in party master data are often pain number one when assessing the impact of data quality flaws.
A duplicate in the data quality sense is two or more records that don’t consist of exactly the same characters but refer to the same real world entity. I have worked with three different approaches to when the duplicate problem is fixed:
- Downstream data matching
- Real time duplicate check
- Search and mash-up of internal and external data
Downstream Data Matching
The good old way of dealing with duplicates in databases is having data matching engines periodically scan through databases highlighting the possible duplicates in order to facilitate merge/purge processes.
Finding the duplicates after they have lived their own lives in databases and have already attached different kinds of transactions is indeed not optimal, but sometimes it’s the only option, as explained in the post Top 5 Reasons for Downstream Cleansing.
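As a minimal sketch of such a downstream matching pass, the snippet below scans all record pairs with a crude fuzzy score built on Python's standard-library `difflib`; the records, fields and threshold are illustrative assumptions, not any particular matching engine.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical party master data records; values are made up for illustration.
records = [
    {"id": 1, "name": "John Smith", "city": "London"},
    {"id": 2, "name": "Jon Smith",  "city": "London"},
    {"id": 3, "name": "Mary Jones", "city": "Leeds"},
]

def similarity(a, b):
    """Crude fuzzy score: average of name and city similarity ratios."""
    name = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    city = SequenceMatcher(None, a["city"].lower(), b["city"].lower()).ratio()
    return (name + city) / 2

def find_candidate_duplicates(recs, threshold=0.85):
    """Scan every record pair and flag those scoring above the threshold
    as candidates for a merge/purge review."""
    return [(a["id"], b["id"])
            for a, b in combinations(recs, 2)
            if similarity(a, b) >= threshold]

print(find_candidate_duplicates(records))  # flags the John/Jon Smith pair
```

Real matching engines use far more elaborate scoring and blocking strategies, but the pairwise scan above is the basic shape of a periodic batch run.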
Real Time Duplicate Check
The better way is to make the match at data entry where possible. This approach is often orchestrated as a data entry process where a single element or a range of elements is checked as it is entered. For example, the address may be checked against reference data, and a phone number may be checked for an adequate format for the country in question. Finally, when a properly standardized record is submitted, it is checked whether a possible duplicate exists in the database.
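The entry-time flow above could be sketched roughly as follows: standardize the elements first, then run a fuzzy check of the finished record against the existing store. The field names, the in-memory "database" and the phone normalization are all simplifying assumptions.

```python
import re
from difflib import SequenceMatcher

# Stand-in for the existing database; the record is invented for illustration.
existing = [{"name": "Acme Ltd", "phone": "+44 20 7946 0000"}]

def standardize_phone(raw):
    """Minimal normalization: keep only digits and a leading '+'."""
    return re.sub(r"[^\d+]", "", raw)

def possible_duplicates(new_record, store, threshold=0.75):
    """After the elements are standardized, fuzzy-match the submitted
    record's name against every stored record."""
    hits = []
    for rec in store:
        score = SequenceMatcher(None, new_record["name"].lower(),
                                rec["name"].lower()).ratio()
        if score >= threshold:
            hits.append((rec, score))
    return hits

candidate = {"name": "ACME Limited",
             "phone": standardize_phone("+44 (0)20 7946 0000")}
print(possible_duplicates(candidate, existing))  # surfaces Acme Ltd for review
```

The point is the ordering: element-level checks happen while typing, and the duplicate check fires once on the clean, submitted record.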
Search and Mash-Up of Internal and External Data
The best way is, in my eyes, a process that avoids entering most of the data that is already in the internal databases and takes advantage of data that already exists on the internet as external reference data sources.
The instant Data Quality concept I currently work with requires the user to enter as little data as possible, for example through rapid address entry, a Google-like search for a name, simply typing a national identification number or, in the worst case, combining some known facts. After that the system makes a series of fuzzy searches in internal and external databases and presents the results as a compact mash-up.
The advantages are:
- If the real world entity already exists, you avoid the duplicate and avoid entering the data again. You may at the same time evaluate accuracy against external reference data.
- If the real world entity doesn’t exist in internal data, you may pick most of the data from external sources, thereby avoiding typing too much while at the same time ensuring accuracy.
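The search-and-mash-up flow described above could be sketched like this: one search key, a fuzzy lookup against internal data first, then a fallback to an external reference source. Both "databases" here are plain dicts standing in for real services, and every name in them is made up.

```python
from difflib import get_close_matches

# Hypothetical stores: internal master data and an external business registry.
internal = {"acme corp a/s": {"name": "Acme Corp A/S",
                              "source": "internal"}}
external = {"globex industries ltd": {"name": "Globex Industries Ltd",
                                      "source": "external registry"}}

def search_mashup(query, cutoff=0.6):
    """Fuzzy-search internal data first; fall back to external reference data.
    Returns the best-matching record, or None if nothing matches anywhere."""
    q = query.lower()
    for store in (internal, external):
        hit = get_close_matches(q, list(store.keys()), n=1, cutoff=cutoff)
        if hit:
            return store[hit[0]]
    return None  # no match: the record has to be keyed in manually

print(search_mashup("Acme Corp"))  # found internally, so no duplicate is created
```

An internal hit means the duplicate is avoided entirely; an external hit means most elements can be picked from the reference source rather than typed.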