Diversity in City Names

The metro area I live in is called Copenhagen – in English. The local Danish name is København. When I cross the bridge to Sweden, the road signs point back to the Swedish variant of the name, Köpenhamn. When the new bridge from Germany to east Denmark is finished, the road signs on the German side will point to Kopenhagen. A flight from Paris has the destination Copenhague. From Rome it is Copenaghen. The Latin name is Hafnia.

These language variants of city (and other) names are a challenge in data matching.

If a human is doing the matching, the match may be made because that person knows about the language variations. This is a strength of human processing. But it is also a weakness: if another person doesn’t know about the variations, the matching becomes inconsistent because the same results are not repeated.

Computerized match processing may handle the challenge in different ways, including:

  • The data model may reflect the real world by having places described by multiple names in given languages.
  • Some data matching solutions use synonym listing for this challenge.
  • Probabilistic learning is another way. The computer finds a similarity between two sets of data describing an entity but with a varying place name. A human may confirm the connection, and the varying place names will then be included in the next automated match.
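
The first two options, plus the confirmation step of the third, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how any particular matching product works: the names `Place`, `PlaceIndex`, `normalize`, `add_variant` and `same_place`, as well as the `"copenhagen"` identifier, are made up for the example. The similarity detection itself is left out; only the step where a human-confirmed variant is fed back into the list is shown.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import Dict, Set


def normalize(name: str) -> str:
    """Unicode-normalize, trim and case-fold a name before comparing it."""
    return unicodedata.normalize("NFC", name).strip().casefold()


@dataclass
class Place:
    place_id: str  # hypothetical identifier for the real-world place
    names: Dict[str, str] = field(default_factory=dict)  # language code -> name


class PlaceIndex:
    """Maps every known name variant to the place(s) it can refer to."""

    def __init__(self) -> None:
        self._ids_by_name: Dict[str, Set[str]] = {}

    def add_place(self, place: Place) -> None:
        for name in place.names.values():
            self.add_variant(place.place_id, name)

    def add_variant(self, place_id: str, name: str) -> None:
        """Register one more name, e.g. after a human confirmed a suggested match."""
        self._ids_by_name.setdefault(normalize(name), set()).add(place_id)

    def lookup(self, name: str) -> Set[str]:
        """Return the ids of all places known under this name (may be empty)."""
        return set(self._ids_by_name.get(normalize(name), set()))

    def same_place(self, name_a: str, name_b: str) -> bool:
        """True if the two names share at least one known place."""
        return bool(self.lookup(name_a) & self.lookup(name_b))


index = PlaceIndex()
index.add_place(Place("copenhagen", {
    "en": "Copenhagen", "da": "København", "sv": "Köpenhamn", "de": "Kopenhagen",
    "fr": "Copenhague", "it": "Copenaghen", "la": "Hafnia",
}))

print(index.same_place("København", "Copenhague"))  # True
print(index.same_place("Hafnia", "Portland"))       # False: "Portland" is unknown here

# A reviewer confirms that an unaccented spelling means the same city,
# so the next automated run will match it too.
index.add_variant("copenhagen", "Kobenhavn")
print(index.same_place("Kobenhavn", "Köpenhamn"))   # True
```

The lookup returns a set of identifiers rather than a single one because the same name can refer to more than one place, so a synonym list alone cannot decide which one is meant.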

As globalization moves forward, data matching solutions have to deal with diversity in data. A solution may have worked wonders yesterday with domestic data but will be useless tomorrow with international data.

8 thoughts on “Diversity in City Names”

  1. Jim Harris 17th January 2010 / 17:44

    As a data quality “expert” based in the United States, I have to admit that I have spent most of my career working with local data, especially for postal addresses, which despite a few European language variations, is mostly in American English.

    Copenhagen, Copenhague, Copenaghen, and maybe Kopenhagen – I might pick up on, but København, Köpenhamn, and definitely Hafnia (is that near Narnia?) – I would not.

    Of the computerized matching options you listed, I have most often had to rely on synonym listings, where during pre-matching preparation, “standardization” created a “Match City Name” field, which either contained the original value (if not found on the synonym list) or the converted value from the synonym list – which for me, was usually in English, and therefore in your example: Copenhagen.

    Of course, my challenge was finding a person who knew the language variations and who could therefore properly populate the synonym list.

    Most often, I had to rely on resources such as Graham Rhind’s Global Source-Book for Address Data Management:
    http://www.grcdi.nl/book2.htm

    Excellent post Henrik – Best Regards, Jim

  2. Satesh 17th January 2010 / 18:36

    Great Post Henrik!!!

    At the outset, DQ challenges may look simple but the business implications are manifold!!! In your Copenhagen case, multiple definitions of the same place could result in misleading reports (customer/sales).

    As you have rightly stated about the computerized data matching algorithms, one must agree that even the best breed of AI software would require human intelligence and a synonym database.

    To summarize, it needs human intelligence (SME) to educate the software for more accurate pattern matching.

    Satesh

  3. John O'Gorman 17th January 2010 / 19:02

    Like many of Henrik’s other excellent posts on the topic, this one has much broader implications for DQ in a general sense. The place-name equivalent challenge can be extended to the person name equivalent, the asset or thing name equivalent, the process name equivalent and so on. The flip side of the issue – with respect to de-duplication efforts – is also a problem: the same string (say, ‘Portland’) refers to different entities in the real world.

    These two issues alone are why a different ‘geometry’ if you will is required to make information management really sing. The two-dimensional world of bits and bytes sees Copenhagen, København, Köpenhamn, Kopenhagen, Copenhague, Copenaghen and Hafnia as distinct and Portland, Portland and Portland as identical.

    This next statement may seem obvious or heretical (or both) to data management professionals, but the digital representations of place names are not the actual places themselves. The new geometry requires the addition of this ‘real’ dimension to the otherwise flatscape of data to help unify or disambiguate the labels.

    Think in terms of cartography: a map has a single point for all seven digital references to the city in Denmark, and twenty-seven different points that reference the single ‘Portland’.

    All other types of equivalences (and homographs) work exactly the same way.

  4. Henrik Liliendahl Sørensen 17th January 2010 / 20:50

    Thanks a lot for comments and kind words Jim, Satesh and John.

    Jim, I agree, Graham’s source-book is a great resource and we might as well advertise Graham’s IAIDQ webinar on Global Customer Data on the 27th January 2010.

    Satesh, it’s so true, even artificial intelligence is in fact an interactive and ongoing development involving humans and computers.

    John, I really like your comment and introduction to the flip side of the issue – which also tells us that using synonyms is not that easy.

  5. Per Olsson 17th January 2010 / 21:28

    I think the human mind needs to be involved to figure out the needed variants in a city’s name. The problem is to find that person:)

    interesting post!

  6. Graham Rhind 18th January 2010 / 08:11

    Place name diversity is a fascinating problem, and one I spend a lot of time on. Apart from the language differences (endonyms/exonyms: Copenhagen, Kopenhagen, Köpenhamn etc.) and the shared place names problem (Portland, Portland, Portland – one of the reasons why postal codes were invented), there are syntactic issues (New York/NY, St. Petersburg/Sankt Petersburg), and the usual issues of data quality and data entry errors (Lodnon, Pariss).

    One of my rules is not to reinvent the wheel – there’s no need to employ somebody to identify these synonyms when somebody else spends most of their time researching it already. If I may do some shameless self-promotion to add to the kind words from other contributors: my Global Sourcebook for Address Data Management (http://www.grcdi.nl/book2.htm) does cover synonyms for many cities, but I have also built a synonym table (based on postal code and place name) with almost 30 million entries (equating to some 80 million place name synonym/postal code combinations) which is used to standardise and correct place names to overcome precisely the issues Henrik describes – see http://www.grcdi.nl/settlements.htm for details.

    End of plug 🙂

  7. Daragh O Brien 19th January 2010 / 16:06

    Great post Henrik.

    From an Irish perspective, I’d add to the mix the problem of address master data often not matching the spelling/format that locals might use.

    For example, not too far from where I live is a village called Murrintown. It has been spelled Murrintown for a few hundred years. 5 years ago the road signs all changed to “Murntown” as that was decreed to be the correct spelling by the relevant government agency.

    And don’t get me started on the blind assumptions made about countries having postcodes…..

  8. Henrik Liliendahl Sørensen 19th January 2010 / 22:31

    Thanks Per, Graham and Daragh

    I’m actually still in a light Irish mode after visiting Dublin (Baile Átha Cliath) last week and looking at road signs with both English and Irish (Gaeilge) names.
