Finding duplicate customers can be a very different task depending on which country you are from and which country the data originates from.
Besides the various character sets, naming traditions and address formats, the differing availability of external reference data makes some things easy and other things very hard.
Most of the technology, descriptions and examples around come from the United States.
But say you are a Swedish company with Swedish persons in your database, and among those are these two rows (name, address, postal code and city):
- Oluf Palme, Sveagatan 67, 10001 Stockholm
- Oluf Palme, Savegatan 76, 10001 Stockholm
What you do is plug into the government-provided citizen master data hub and ask for a match. The outcome can be:
- The same citizen ID is returned because the person has relocated. It’s a duplicate.
- Two different citizen IDs are returned. It’s not a duplicate.
- Either only one or no citizen ID is returned. Leave it or do fuzzy matching.
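The three outcomes above can be sketched as a small decision function. This is only an illustration: the lookup function, the dictionary standing in for the hub and the IDs in it are all made up here, and a real integration with the government service will look different.

```python
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    DUPLICATE = "duplicate"
    NOT_DUPLICATE = "not a duplicate"
    UNDECIDED = "leave it or do fuzzy matching"


# Hypothetical lookup: takes a customer row, returns a citizen ID or None.
Lookup = Callable[[dict], Optional[str]]


def classify_pair(row_a: dict, row_b: dict, lookup: Lookup) -> Verdict:
    """Apply the three outcomes to a pair of customer rows."""
    id_a = lookup(row_a)
    id_b = lookup(row_b)
    if id_a is None or id_b is None:
        # Only one or no citizen ID came back.
        return Verdict.UNDECIDED
    if id_a == id_b:
        # Same citizen, for example after a relocation.
        return Verdict.DUPLICATE
    return Verdict.NOT_DUPLICATE


# Stand-in for the reference data hub, with invented IDs (assumption).
FAKE_HUB = {
    ("Oluf Palme", "Sveagatan 67"): "ID-0001",
    ("Oluf Palme", "Savegatan 76"): "ID-0001",
}


def fake_lookup(row: dict) -> Optional[str]:
    return FAKE_HUB.get((row["name"], row["address"]))


row_a = {"name": "Oluf Palme", "address": "Sveagatan 67"}
row_b = {"name": "Oluf Palme", "address": "Savegatan 76"}
print(classify_pair(row_a, row_b, fake_lookup).value)
```

The point of keeping the lookup as a parameter is that the same logic can sit in front of whatever reference data source is available for the country in question.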
If you go for fuzzy matching then you had better be good, because all the easy ones are already handled and you are left with the ones where false positives and false negatives are most likely. Often you will only do fuzzy matching if you have phone numbers, email addresses or other data to support the match.
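A minimal sketch of that last rule, assuming rows are plain dictionaries with optional phone and email fields: the fuzzy comparison is only attempted when some supporting data agrees. The field names and the 0.8 threshold are arbitrary choices for illustration, and the standard library's difflib is used here as a simple stand-in for a proper matching engine.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_duplicate(row_a: dict, row_b: dict, threshold: float = 0.8) -> bool:
    """Flag a pair as a likely duplicate only when supporting data agrees."""
    # Supporting evidence: a shared, non-empty phone number or email address.
    supported = any(
        row_a.get(key) and row_a.get(key) == row_b.get(key)
        for key in ("phone", "email")
    )
    if not supported:
        return False  # too risky to decide on name and address alone
    name_score = similarity(row_a["name"], row_b["name"])
    address_score = similarity(row_a["address"], row_b["address"])
    return name_score >= threshold and address_score >= threshold


# The email address is invented for the example.
with_email_a = {"name": "Oluf Palme", "address": "Sveagatan 67",
                "email": "oluf@example.com"}
with_email_b = {"name": "Oluf Palme", "address": "Savegatan 76",
                "email": "oluf@example.com"}
print(fuzzy_duplicate(with_email_a, with_email_b))
```

Without the shared email address the function refuses to decide, which mirrors the caution above about false positives among the hard cases.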
Another angle is that almost only Swedish companies use this service with the government-provided reference data, even though anyone holding Swedish data may use it upon approval.
Data quality solutions for party master data are not only about fuzzy matching but also about integrating with external reference data, exploiting all the various possibilities available worldwide and supporting the logic and logistics of doing so. We also know that upstream prevention, as close to the root as possible, is better than downstream cleansing.
Deployment of such features as composable SOA components is described in a previous post here.