This blog is written in English. Therefore the letters used are normally restricted to A to Z.
The English alphabet is one of many alphabets using Latin (or Roman) letters. Other alphabets, like the Russian, use Cyrillic letters. Besides alphabets, the world's script systems also include abjads, abugidas, syllabic scripts and symbol scripts. Learn more about these in the post Script Systems.
Æ, which in lower case is æ, was part of the Old English alphabet. For example, an old English king was called Æthelred the Unready.
The letter Æ is a combination of A and E and is pronounced in English like the first letter of Edmund and Edward.
Today Æ exists in a few alphabets: the Danish/Norwegian, the Faroese and the Icelandic. People and places from the corresponding Viking territories may have the letter Æ/æ as part of their name. For example, the home of Microsoft Dynamics AX and NAV is the town Vedbæk north of Copenhagen. When represented in the English alphabet, the town name will be Vedbaek.
So Vedbæk and Vedbaek should be a 100% match when doing data matching. And so should Vedbæk and Vedb%C3%A6k, for when systems are as bad at handling æ as Æthelred the Unready was at handling the Vikings.
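As a rough sketch of how those three spellings could end up as the same match key: the snippet below assumes the mangled value is percent-encoded UTF-8 (where æ becomes %C3%A6). The function name match_key and the small folding table are made up for illustration and are not taken from any particular data matching tool.

```python
from urllib.parse import unquote

# Illustrative folding table (an assumption); extend it for other letters.
LATIN_FOLD = {"æ": "ae", "Æ": "AE"}


def match_key(name: str) -> str:
    """Build a comparison key: undo percent-encoding, fold æ to ae, ignore case."""
    decoded = unquote(name)  # decodes %XX escapes as UTF-8 by default
    folded = "".join(LATIN_FOLD.get(ch, ch) for ch in decoded)
    return folded.casefold()


names = ["Vedbæk", "Vedbaek", "Vedb%C3%A6k"]
keys = {match_key(n) for n in names}  # a single shared key means a 100% match
```

If all names collapse to one key, the three representations are treated as the same town.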
And oh, Æthelred wasn’t actually unready. He was unræd meaning bad-counseled.
Hi,
This is interesting.
In fact, not just Æ but all letters with diacritical marks (such as é, è, ç, ê etc.) need proper replacement during the cleansing phase before matching takes place.
Regards,
Tirthankar Ghosh
Thanks for commenting, Tirthankar.
I have worked with two different approaches to this.
The first one is transliteration. Here you replace, for example, æ with ae and é with e before matching.
The second one is embedding, where the possible correspondence between æ and ae, and between é and e, is taken into consideration within the matching itself.
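To make the difference between the two approaches concrete, here is a minimal sketch. It is only one possible reading of each idea: the EQUIV table, the function names and the equiv_cost of 0.1 are assumptions chosen for the example. The embedded variant is a plain edit distance with one extra rule, namely that a letter and its two-letter spelling (or an accented letter and its base letter) count as nearly the same.

```python
import unicodedata

# Assumed one-letter to two-letter correspondences (illustrative only).
EQUIV = {"æ": "ae"}


def strip_accent(ch: str) -> str:
    """Drop combining marks, so é becomes e."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def transliterate(text: str) -> str:
    """Approach 1: rewrite the string before any matching happens."""
    return "".join(EQUIV.get(ch, strip_accent(ch)) for ch in text.casefold())


def embedded_distance(a: str, b: str, equiv_cost: float = 0.1) -> float:
    """Approach 2: edit distance that knows about æ/ae and é/e while comparing."""
    a, b = a.casefold(), b.casefold()
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            x, y = a[i - 1], b[j - 1]
            if x == y:
                sub = 0.0
            elif strip_accent(x) == strip_accent(y):
                sub = equiv_cost  # é versus e
            else:
                sub = 1.0
            best = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
            # One letter in a against its two-letter spelling in b, and vice versa.
            if x in EQUIV and j >= 2 and b[j - 2:j] == EQUIV[x]:
                best = min(best, d[i - 1][j - 2] + equiv_cost)
            if y in EQUIV and i >= 2 and a[i - 2:i] == EQUIV[y]:
                best = min(best, d[i - 2][j - 1] + equiv_cost)
            d[i][j] = best
    return d[n][m]
```

With transliteration, Vedbæk and Vedbaek become identical strings before comparison. With the embedded distance the original strings are kept and the pair simply scores very close; setting equiv_cost to 0 would make it an exact match.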
The same goes for transcription (transforming from one script system to another; for example Arabic to Roman). An alternative to transcription before matching is embedding in matching. This way you avoid mismatches caused by the many possible transcriptions. For example, the transcription from Arabic to Roman of the name of the former Libyan dictator could be Gaddafi, Gadhafi, Kadafi and many more.
The same actually goes for handling nicknames. If you standardize Peggy to Margaret before matching you miss the match between Peggy and the typo Pegy.
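The Peggy and Pegy case can be sketched like this. The nickname table and both function names are hypothetical, and difflib's SequenceMatcher from the Python standard library merely stands in for whatever similarity measure a real matching engine would use.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table for illustration.
NICKNAMES = {"peggy": "margaret"}


def similarity(a: str, b: str) -> float:
    """Similarity between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def standardize_then_match(a: str, b: str) -> float:
    """Replace known nicknames first, then compare what is left."""
    a, b = a.casefold(), b.casefold()
    return similarity(NICKNAMES.get(a, a), NICKNAMES.get(b, b))


def embedded_match(a: str, b: str) -> float:
    """Compare every known form of each name and keep the best score."""
    def forms(name: str) -> set:
        name = name.casefold()
        return {name, NICKNAMES.get(name, name)}

    return max(similarity(x, y) for x in forms(a) for y in forms(b))


lost = standardize_then_match("Peggy", "Pegy")  # compares margaret with pegy
kept = embedded_match("Peggy", "Pegy")          # still sees peggy next to pegy
```

Standardizing first turns Peggy into Margaret, so the typo Pegy has nothing similar left to match. The embedded variant keeps both forms, so it still pairs Peggy with Margaret and also catches Pegy.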
You are right. Transliteration will not cover everything, though it may be a simple solution.
I like the example of “Peggy” and “Pegy”.
Henrik, I’m not a linguist, so I may be on dodgy ground here, but æ in English is not quite extinct, though it’s not in the alphabet, being a digraph/diphthong/ligature (make your choice!). Those of us of a certain age might still write encyclopædia, for example, though computer hardware has made this rather more troublesome than it used to be when we could still use pens!
Thanks for adding that, Graham. I also always thought that the new source of truth should be called wikipædia.