When a recruiter or a hiring manager fills a job position, it is basically done by getting in a number of candidates and then choosing the best fit among them. This of course doesn't account for the possibility that someone better suited exists among all the people who never became candidates.
We have the same problem in data matching when we are deduplicating, consolidating or matching for other purposes.
Let's look at the following example. We have two names and addresses:
Banca di Toscana Società per azioni
IT 51234 Firenze
Vanca di Toscana SpA
12, Via Niccolò Machiavelli
A human, or a mature computerized matching engine, will be able to decide that this is the same real world entity, with more or less confidence depending on taking knowledge like the following into consideration:
- The ISO country code for Italy is IT
- Florence is the English name for the city called Firenze in Italian
- In Italian (as in Spanish, Germanic and Slavic cultures) the house number is written after the street name (as opposed to English and French cultures)
- In Italian you sometimes omit "Via" (Italian for way) and the first name in a street named after a person
- "Società per azioni", abbreviated SpA or S.p.A., is an Italian legal form
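Knowledge of this kind is typically encoded as reference data. A minimal sketch in Python, with deliberately tiny illustrative tables (real engines use far larger ones), showing how the legal form point above makes the two names compare equal:

```python
import unicodedata

# Illustrative reference table: spelled-out legal form -> acronym
LEGAL_FORM_SYNONYMS = {
    "societa per azioni": "spa",
    "s.p.a.": "spa",
    "s.p.a": "spa",
}

def strip_accents(text):
    # Decompose accented characters and drop the combining marks
    return "".join(c for c in unicodedata.normalize("NFD", text)
                   if unicodedata.category(c) != "Mn")

def normalize_company_name(name):
    """Lowercase, drop accents and replace a spelled-out legal form
    with its acronym, so 'Società per azioni' and 'SpA' compare equal."""
    cleaned = strip_accents(name).lower()
    for spelled, acronym in LEGAL_FORM_SYNONYMS.items():
        cleaned = cleaned.replace(spelled, acronym)
    return " ".join(cleaned.split())

print(normalize_company_name("Banca di Toscana Società per azioni"))
# banca di toscana spa
print(normalize_company_name("Banca di Toscana SpA"))
# banca di toscana spa
```

The same table-driven approach carries the country code and exonym knowledge (IT for Italy, Florence for Firenze).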
But another point is whether the two records will even be compared at all. Due to the diversity issues mentioned above and the typo in the first letter of the name in the last record, no ordinary sorting mechanism on the original data will bring the two records into the same range.
If the one record is in a table with 1,000,000 rows and the other record is in another table with 1,000,000 rows, the option of comparing every row with every row makes a Cartesian product of 1,000,000,000,000 similarity assignments, which is not practical. A real-time check against 1,000,000 rows for every new entry is not a practical option either.
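The usual way around the Cartesian product is to generate candidate pairs only within groups that share a blocking key, such as a postal code. A minimal sketch in Python (record layout and field names are illustrative); note that this only works if the blocking key itself has been standardized first:

```python
from collections import defaultdict

def blocked_pairs(left, right, key):
    """Yield candidate pairs that share a blocking key,
    instead of the full Cartesian product of both tables."""
    buckets = defaultdict(list)
    for r in right:
        buckets[key(r)].append(r)
    for l in left:
        for r in buckets.get(key(l), []):
            yield l, r

# Toy records; field names are illustrative
left = [{"name": "Banca di Toscana SpA", "postal": "51234"}]
right = [{"name": "Vanca di Toscana SpA", "postal": "51234"},
         {"name": "Altra Banca Srl", "postal": "00100"}]

pairs = list(blocked_pairs(left, right, key=lambda r: r["postal"]))
# only the one pair sharing postal code 51234 is generated
```

With a reasonably selective key, each new entry is compared against one small bucket rather than all 1,000,000 rows.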
I have worked with the following techniques for overcoming this challenge:
Parsing and standardization
The address part of the example data may be parsed and standardized (including using geographical reference data) so it is put into the same format, like:
IT, 51234, Via Niccolo Machiavelli, 12
Then you are able to compare rows within a certain geographical depth, like all rows at the same entrance, on the same street or within the same postal code.
This technique is, however, heavily dependent on accurate and precise original addresses and works best when applied separately for each culture.
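The reordering of house number and street name in the example could be sketched like this in Python. The patterns are simplified assumptions for the Italian case; production parsers rely on country-specific reference data rather than two regular expressions:

```python
import re
import unicodedata

def strip_accents(text):
    # Decompose accented characters and drop the combining marks
    return "".join(c for c in unicodedata.normalize("NFD", text)
                   if unicodedata.category(c) != "Mn")

def standardize_street(line):
    """Parse '12, Via Niccolò Machiavelli' or 'Via Niccolò Machiavelli 12'
    into a (street, house number) pair in one canonical form."""
    line = strip_accents(line).strip()
    m = re.match(r"^(\d+)\s*,\s*(.+)$", line)   # house number written first
    if m:
        return m.group(2), m.group(1)
    m = re.match(r"^(.+?)\s+(\d+)$", line)       # house number written last
    if m:
        return m.group(1), m.group(2)
    return line, None                            # could not parse

print(standardize_street("12, Via Niccolò Machiavelli"))
# ('Via Niccolo Machiavelli', '12')
print(standardize_street("Via Niccolò Machiavelli 12"))
# ('Via Niccolo Machiavelli', '12')
```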
Fuzzy search
Here you make use of the same fuzzy techniques used in similarity assignment when searching for candidates.
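A simple similarity ratio, for example, tolerates the Banca/Vanca typo that defeats plain sorting. A sketch using the Python standard library's difflib (real engines typically use more refined measures such as Jaro-Winkler or weighted edit distance):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Rough similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

score = similarity("Banca di Toscana SpA", "Vanca di Toscana SpA")
# one substituted letter out of twenty keeps the ratio well above
# a typical candidate threshold such as 0.9
```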
Recorded variations
If some variations of the same name or address have earlier been accepted as being the same, these variations may be recorded and used in future searching.
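A minimal sketch of such a lookup of accepted variations; the structure is illustrative, as a real implementation would persist this in the matching engine's repository:

```python
# variant name (lowercased) -> canonical name it was accepted as matching
known_variations = {}

def record_match(variant, canonical):
    """Store a variation once a human or the engine accepts the match."""
    known_variations[variant.lower()] = canonical

def resolve(name):
    """Resolve a previously accepted variation before any fuzzy search."""
    return known_variations.get(name.lower())

record_match("Vanca di Toscana SpA", "Banca di Toscana SpA")
print(resolve("VANCA DI TOSCANA SPA"))
# Banca di Toscana SpA
```

The benefit is that a costly fuzzy search for a given variation only has to succeed once.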
As always in data quality automation, combining all the different techniques in a given implementation improves your margins.