Fuzzy matching techniques were originally developed for batch processing, in order to find duplicates and consolidate database rows that have no unique identifier tying them to the real-world entity they describe.
These processes have traditionally been implemented for downstream data cleansing.
Since upstream prevention is much more effective than tidying up downstream, real-time data entry checking is becoming more common.
But we can go even further upstream by introducing error-tolerant search capabilities.
A common workflow when in-house personnel enter new customers, suppliers, purchased products and other master data is to search the database for a match first. If the entity is not found, you create a new entity. When the search fails to find a match that actually exists, we have a classic and frequent cause of either introducing duplicates or putting the real-time checking to the test.
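To make the search-before-create workflow concrete, here is a minimal sketch in Python. It assumes the master data is just a list of names held in memory, and it uses the standard library's `difflib.get_close_matches` as a stand-in for a real matching engine. The function name `search_before_create` and the cutoff of 0.85 are my own illustrative choices, not taken from any particular product.

```python
from difflib import get_close_matches


def search_before_create(name, existing, cutoff=0.85):
    """Search the existing names first; only create when nothing similar exists.

    `existing` is a plain list of names (an assumption for this sketch).
    Returns ("review", candidates) when similar names are found,
    otherwise appends the new name and returns ("created", [name]).
    """
    # Compare case-insensitively, but hand back the names as stored.
    lookup = {entry.lower(): entry for entry in existing}
    hits = get_close_matches(name.lower(), list(lookup), n=3, cutoff=cutoff)
    if hits:
        # Possible matches: let the user decide instead of creating a duplicate.
        return ("review", [lookup[h] for h in hits])
    existing.append(name)
    return ("created", [name])


customers = ["Acme Corporation", "Northwind Traders"]
print(search_before_create("Acme Corperation", customers))
print(search_before_create("Zebra Holdings", customers))
```

An exact-match search would have missed the misspelled "Acme Corperation" and a duplicate would have been created. Here the user is shown the existing record for review instead.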
An error-tolerant search is able to find matches despite spelling differences, differently arranged words, various concatenations and the many other challenges we face when searching for names, addresses and descriptions.
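As a rough illustration of how those three challenges can be handled, here is a small Python sketch. It is not how any specific tool works: it simply compares the query and each record in three ways (as typed, with the words sorted, and with the spaces removed) using the standard library's `difflib.SequenceMatcher`, and keeps the best score. The names `normalize`, `similarity` and `tolerant_search`, and the threshold of 0.8, are assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, turn punctuation into spaces and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def similarity(query, candidate):
    """Best of three comparisons, each targeting one kind of difference."""
    q, c = normalize(query), normalize(candidate)
    # 1. As typed: tolerates spelling differences.
    plain = SequenceMatcher(None, q, c).ratio()
    # 2. Words sorted: tolerates differently arranged words.
    q_sorted = " ".join(sorted(q.split()))
    c_sorted = " ".join(sorted(c.split()))
    reordered = SequenceMatcher(None, q_sorted, c_sorted).ratio()
    # 3. Spaces removed: tolerates concatenated or split words.
    joined = SequenceMatcher(None, q.replace(" ", ""), c.replace(" ", "")).ratio()
    return max(plain, reordered, joined)


def tolerant_search(query, records, threshold=0.8):
    """Return the records scoring at or above the threshold, best first."""
    scored = [(similarity(query, record), record) for record in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for score, record in scored if score >= threshold]


records = ["Smith & Jones Ltd", "Northwind Traders", "Acme Corporation"]
print(tolerant_search("Jones Smith Ltd", records))     # rearranged words
print(tolerant_search("North Wind Traders", records))  # split word
print(tolerant_search("Acme Corperation", records))    # spelling difference
```

A production search would need proper indexing, since this sketch scores the query against every record, and would usually add phonetic or domain-specific rules for names and addresses.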
Such features may be implemented as embedded functionality in CRM and ERP systems or, to use my favourite term, as SOA components. So besides the classic data quality elements for monitoring and checking, we can add error-tolerant search to the component catalogue needed for a good MDM solution.
Checking for duplicates upstream is good, but if there are multiple source systems (each having its own data capture system), de-duplication needs to be done at the time of data consolidation/integration.
Thanks for commenting Tirthankar. That is true – unless you are able to make a mash-up of the source systems. I’m working with such a solution right now, where we also include several external registries in the upfront search.
I am interested to know the solution architecture in brief after your assignment is over.
Henrik, Really good post. The solution you mention is interesting to me too. My company has a contact data matching product that connects to multiple data sources at data capture – currently limited to Windows web services, SQL Server and Oracle databases, so I would be interested to know the parameters of the solution you’re working with.
Hi Steve, thanks for commenting. I have written a post about it called Reference Data at Work in the Cloud