The metro area I live in is called Copenhagen in English. The local Danish name is København. When I go across the bridge to Sweden, the road signs point back to the Swedish variant of the name, Köpenhamn. When the new bridge from Germany to east Denmark is finished, the road signs on the German side will point to Kopenhagen. A flight from Paris has the destination Copenhague. From Rome it is Copenaghen. The Latin name is Hafnia.
These language variants of city (and other) names are a challenge in data matching.
If a human is doing the matching, the match may be made because that person knows about the language variations. This is a strength of human processing. But it is also a weakness: another person may not know about the variations, so the same records will not always give the same result and the matching becomes inconsistent.
Computerized match processing may handle the challenge in different ways, including:
- The data model may reflect the real world by describing each place with multiple names, one per language.
- Some data matching solutions use synonym lists that map each name variant to the same place.
- Probabilistic learning is another way. The computer finds a similarity between two sets of data describing an entity but with a varying place name. A human may confirm the connection, and the varying place names will then be included in the next automated match.
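To make the three approaches above a bit more concrete, here is a minimal Python sketch. It is only an illustration, not how any particular matching product works: the names `Place`, `PlaceIndex`, `learn` and the identifier `"dk-copenhagen"` are made up for this example. The place carries one name per language, the index acts as a synonym list, and `learn` stands in for the step where a human confirms a new variant so that the next automated run picks it up.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import Dict, Optional


def normalize(name: str) -> str:
    """Unicode-normalize, trim and casefold so lookups ignore case and encoding form."""
    return unicodedata.normalize("NFC", name).strip().casefold()


@dataclass
class Place:
    place_id: str  # hypothetical internal identifier
    names: Dict[str, str] = field(default_factory=dict)  # language code -> name


class PlaceIndex:
    """A simple synonym list: every known name variant points to one place id."""

    def __init__(self) -> None:
        self._by_name: Dict[str, str] = {}

    def add(self, place: Place) -> None:
        for name in place.names.values():
            self._by_name[normalize(name)] = place.place_id

    def learn(self, variant: str, place_id: str) -> None:
        """Store a human-confirmed variant for use in later automated matching."""
        self._by_name[normalize(variant)] = place_id

    def resolve(self, name: str) -> Optional[str]:
        return self._by_name.get(normalize(name))

    def same_place(self, a: str, b: str) -> bool:
        place_a, place_b = self.resolve(a), self.resolve(b)
        return place_a is not None and place_a == place_b


# The name variants mentioned at the top of this post.
copenhagen = Place(
    place_id="dk-copenhagen",
    names={
        "en": "Copenhagen",
        "da": "København",
        "sv": "Köpenhamn",
        "de": "Kopenhagen",
        "fr": "Copenhague",
        "it": "Copenaghen",
        "la": "Hafnia",
    },
)

index = PlaceIndex()
index.add(copenhagen)
```

With this in place, `index.same_place("København", "Copenhague")` treats the two names as the same city. A variant the index has never seen, such as the Polish Kopenhaga, is not resolved until someone confirms it with `index.learn("Kopenhaga", "dk-copenhagen")`. A real solution would of course also have to decide how a candidate such as Kopenhaga gets proposed for confirmation in the first place, which this sketch leaves out.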
As globalization moves forward, data matching solutions have to deal with diversity in data. A solution that worked wonders yesterday with domestic data may be useless tomorrow with international data.