Whether you are celebrating Christmas or not, whether you say Merry Christmas, Feliz Navidad, Frohe Weihnachten, Joyeux Noël, God Jul or plenty of other greetings from around the world: May these days be a wonderful time for you and yours and thanks for reading this blog.
Diversity
Matching Down Under
As a data matching geek I always love reading about how others have made the great but fearful journey into the data matching world.
This week Wayne Colless of the Australian Attorney-General’s Department kindly made a document about data matching public on the DataQualityPro site. The full title is “Improving the Integrity of Identity Data – Data Matching Better Practice Guidelines, 2009”. Link here.
As Wayne explains in a discussion in the LinkedIn Data Matching group: Australia has no national unique identifier for individuals (such as the US SSN or the number recorded on national ID cards used in many other countries) that can be used, so the matching has to involve only non-unique values such as name, address and dates of birth.
The document gives very thorough step-by-step guidance on matching individuals’ names, addresses and dates of birth. As the document says, you may either build all the logic yourself or buy commercial software that does the same. Either way, you have to understand what the software does in order to tune the processes and set thresholds that are meaningful to you.
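To make the point about thresholds a bit more concrete, here is a minimal sketch in Python. It is not taken from the guidelines: the field weights, the 0.85 threshold and the function names are assumptions chosen purely for illustration, and the standard library's generic string similarity stands in for whatever comparison logic a real tool would use.

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold; real values must be tuned on your own data.
WEIGHTS = {"name": 0.5, "address": 0.3, "date_of_birth": 0.2}
MATCH_THRESHOLD = 0.85


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity of two strings, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of field similarities across the non-unique attributes."""
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())


def is_match(rec_a: dict, rec_b: dict, threshold: float = MATCH_THRESHOLD) -> bool:
    """Decide a match by comparing the weighted score with the threshold."""
    return match_score(rec_a, rec_b) >= threshold
```

Raising the threshold gives fewer false positives but more missed matches, and lowering it does the opposite. That trade-off is exactly what you have to understand and tune, whether the logic is home-built or bought.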
As Australia is a nation mainly born through immigration, the challenges of adapting the ruling Anglo-Saxon naming conventions to the reality of name formats coming from all over the world are very apparent. I like that the diversity issues are given good thought in the document.
I also like that the document addresses a subject not mentioned as often as it should be, namely the challenges with embracing historical values in settling a match as seen in this figure taken from the document:
Whether or not you think you already know the dos and don’ts of data matching (and I guess you never really do), I find the document well worth reading.
Hell in Norway
Looking for inappropriate words in customer data is always a risky business. Most of the time there is a legitimate name or place somewhere containing that word.
Take, for example, a city called “Hell”.
Outside the English-speaking parts of the world you will find “Hell” in Norway. It’s a village with its own postal code (NO-7517) situated in the Trondheim metropolitan area. Not least at this time of year, with winter in the Northern Hemisphere, it is surely considerably colder than the religious “Hell”.
But even in the English-speaking world you will find a semi-legitimate “Hell” in Michigan, United States.
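One way to lower that risk is to check a suspicious word against a list of known legitimate places before flagging it. The sketch below is only an illustration: both lists and the function name are hypothetical, and a real list of places would come from proper reference data.

```python
# Hypothetical lists for illustration only.
SUSPICIOUS_WORDS = {"hell"}
# Known legitimate places, keyed by (country code, lower-cased city name).
LEGITIMATE_PLACES = {("NO", "hell"), ("US", "hell")}


def needs_review(city: str, country_code: str) -> bool:
    """Flag a city for manual review only if it is suspicious and not a known place."""
    key = city.strip().lower()
    if key not in SUSPICIOUS_WORDS:
        return False
    return (country_code.upper(), key) not in LEGITIMATE_PLACES
```

With these lists, “Hell” in Norway or the United States passes, while “Hell” given as a city in Denmark is sent to review.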
Despite Best Intentions
Sometimes you have the best intentions of improving things such as data quality and a lot of other things, but somewhere you failed to see the big picture and it is too late to correct it.
In the sports world, this apparently happened to the Singapore water polo team at the current Asian Games.
They have newly designed speedos honoring the nation’s flag.
But now some ministry tells them that the swimsuit is inappropriate, and you can’t change outfits during the games.
By the way: I also work at a company with this logo:
Fortunately we haven’t got company speedos.
Legal Forms from Hell
When doing data matching with company names, a basic challenge is that a proper company name in most cultures, in most cases, has two elements:
- The actual company name
- The legal form
Some worldwide examples:
- Informatica Corporation
- Talend SA
- SAP Deutschland AG & Co. KG
- Sony Kabushiki Kaisha
- LEGO A/S
There are hundreds of different legal forms, in full and abbreviated forms. Wikipedia has a list here (where they are called types of business entity).
However, when typing in company names in databases the legal form is often omitted. And even where legal forms are present they may be represented differently in full or abbreviated forms, with varying spelling and punctuation and so on. As the actual company names also suffer from this fuzziness, the complexity is overwhelming.
A common way of handling this issue in data matching is to separate the legal form and then focus on comparing the remaining part, the actual company name. This has to be done in a country-specific way, or else you may remove the entire name of a company, as with an Italian company called Société Anonyme, which is a French legal form.
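A minimal sketch of such a country-specific separation could look like the following. The lookup table is a deliberately tiny, hypothetical stand-in for a real list of legal forms, and the function name is made up for this example. Requiring whitespace before the legal form is one simple guard against stripping a name that consists of nothing but a legal form.

```python
import re
from typing import Optional, Tuple

# Hypothetical, deliberately tiny lookup of legal forms per country code.
LEGAL_FORMS = {
    "US": ["Corporation", "Corp", "Inc"],
    "FR": ["Société Anonyme", "SA"],
    "DK": ["A/S", "ApS"],
    "IT": ["S.p.A.", "SpA", "S.r.l."],
}


def split_legal_form(name: str, country_code: str) -> Tuple[str, Optional[str]]:
    """Split a trailing legal form off a company name using the country's own list."""
    # Try the longest forms first so "Corporation" wins over "Corp".
    for form in sorted(LEGAL_FORMS.get(country_code, []), key=len, reverse=True):
        # The leading \s+ means the form must follow some other name part.
        pattern = r"\s+" + re.escape(form) + r"\s*$"
        match = re.search(pattern, name, flags=re.IGNORECASE)
        if match:
            return name[: match.start()].strip(), form
    return name.strip(), None
```

Here `split_legal_form("LEGO A/S", "DK")` gives `("LEGO", "A/S")`, while `split_legal_form("Société Anonyme", "IT")` leaves the Italian company name untouched and returns `("Société Anonyme", None)`.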
While the practice of having legal forms in company names may serve the original purpose well, namely knowing the risk of doing business with that entity, it certainly does not help with the uniqueness dimension of data quality.
One should think it is time to change the bad (legally mandated) practice of mixing legal forms with company names and to serve the original purpose in another, more data quality friendly way.
Free and Open Sources of Reference Data
This Monday I mingled in a tweetjam organized by the open source data integration vendor Talend.
One of the questions discussed was: Are free and open sources of reference data becoming more important in your projects?
When talking “free and open”, not least in the open source realm, we can’t avoid talking about “free for a fee”. Some sources of open data, like Geonames, are free as in “free beer”. Other data comes with a fee. In my home country, Denmark, we have had some discussions about the reasoning behind the government putting a fee on mandatorily collected data, and I have observed similar considerations in our close neighbor country Sweden. (By the way: the picture of a bridge that Talend uses a lot, such as at the top of the home page here, looks like the bridge between Denmark and Sweden.)
One challenge I have met over and over again in using free (maybe for a fee) and open data in data integration and data quality improvement is the cost of conformity. When using open government data there may, apart from the pricing, be a lot of differences between the countries in formats, coverage and so on. I think there is a great potential in delivering conformed data from many different sources for specific purposes.
Magic Quadrant Diversity
The Magic Quadrants from Gartner Inc. rank the tool vendors within a lot of different IT disciplines. For my work, the quadrants for data quality tools and master data management are the most interesting ones.
However, the quadrants examine the vendors in a global scope. So how are the vendors doing in my country?
I tried to look up a few of the vendors in a local business directory for Denmark provided (free to use on the web) by the local Experian branch.
DataFlux
First up is DataFlux, the (according to Gartner) leading data quality tool vendor.
Result: No hits.
Knowing that DataFlux is owned by SAS Institute will, however, with a bit of patience finally bring you to information about the DataFlux product deep down on the local SAS website.
PS: Though SAS is better known here as the main airline (Scandinavian Airlines System), SAS Institute is actually very successful in Denmark, having a much larger part of the Business Intelligence market here than in most other places.
Informatica
Next up is Informatica, a well-positioned company in the quadrants for both data quality tools and customer master data management.
Result: No hits.
Here you have to know that Informatica is represented in the Nordic area by a company called Affecto. You will find information about the Informatica products deep down on the Affecto website, along with the competing product FirstLogic, owned by Business Objects (in turn owned by SAP) and also historically represented by Affecto.
Stibo Systems
Stibo Systems may not be as well known as the two above, but is tailing the mega vendors in the quadrant for Product Master Data Management, as mentioned recently in a blog post by Dan Power.
They are here with over 500 employees, at least in the legal entity called Stibo, where Stibo Systems is an alternate name and brand. And I’m not kidding; I visited them last month at the impressive headquarters near Århus (the second largest city in Denmark).
Script Systems
This Friday my blog post was called Follow Friday diversity. In my hope of reaching a more balanced worldwide interaction, I wonder whether writing in English with Roman (Latin) characters is enough.
Take a look at the diversity in script systems around the world:
Alphabets
In an alphabet, each letter corresponds to a sound. These are also referred to as phonographic scripts. Examples of Alphabets: Roman (Latin); Cyrillic; Greek
Abjads
Abjads consist exclusively of consonants. Vowels are omitted from most words, because they are obvious for native speakers, and are simply inserted when speaking. In addition, Abjads are normally written from right to left. Examples of Abjads: Hebrew; Arabic
Abugidas
Abugidas are characteristic for scripts in India and Ethiopia. In this style, only the consonants are normally written, and standard vowels are assumed. If a different vowel is required, it is indicated with a special mark. Abugidas form an intermediate level between alphabetic and syllabic scripts. Examples of Abugidas: Hindi (Devanagari); Singhalese
Syllabic Scripts
Like alphabets, syllabic scripts are another type of phonographic script. In a syllabic script, each character stands for a syllable. Examples of Syllabic Scripts: Japanese (Hiragana, Katakana); Cherokee
Symbol Scripts
In symbolic scripts, each character is an ideogram standing for a complete word. Compound terms or concepts are composed of multiple symbols. Symbolic scripts are also called logographic scripts. Examples of Symbolic Scripts: Chinese; Japanese (Kanji)
Source: Worldmatch® Comparing International Data by Omikron Data Quality – full version here.
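As a rough illustration of how this classification might be applied to incoming data, the sketch below guesses the script type of a text from the Unicode names of its characters. The mapping table is my own assumption and far from complete, and the function name is hypothetical. It relies only on the first word of each character's Unicode name, which is a crude shortcut rather than a proper script lookup.

```python
import unicodedata

# Hypothetical mapping from the first word of a Unicode character name
# to the script types described above; far from complete.
SCRIPT_TYPES = {
    "LATIN": "alphabet", "CYRILLIC": "alphabet", "GREEK": "alphabet",
    "HEBREW": "abjad", "ARABIC": "abjad",
    "DEVANAGARI": "abugida", "SINHALA": "abugida", "ETHIOPIC": "abugida",
    "HIRAGANA": "syllabic", "KATAKANA": "syllabic", "CHEROKEE": "syllabic",
    "CJK": "symbolic",
}


def script_types(text: str) -> set:
    """Return the set of script types found among the letters of the text."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation and combining marks
        prefix = unicodedata.name(ch, "").split(" ")[0]
        found.add(SCRIPT_TYPES.get(prefix, "unknown"))
    return found
```

A name field that comes back with more than one script type, or with "unknown", is a good candidate for special handling before any matching takes place.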
Follow Friday Diversity
Every Friday on Twitter people are recommending other tweeps to follow using the #FollowFriday (or simply #FF) hashtag.
So do I.
Below please find my follow Friday recommendations grouped by global region:
Canada: @carrni @datamartist @sheezaredhead @andrewsinfotech @aniagl @DQamateur @bivcons @projmgr @DQStudent @datachick; United States: @GarnieBolling @stevesarsfield @UtopiaInc @bbreidenbach @fionamacd @RobertsPaige @BIMarcom @IDResolution @FirstSanFranMDM @dan_power @merv @NISSSAMSI @jilldyche @howarddresner @GartnerTedF @RobPaller @marc_hurst @dcervo @datamentors @VishAgashe @IBMInitiate @RamonChen @JackieMRoberts @philsimon @Nick_Giuliano @DataInfoCom @juliebhunt @Futureratti @dqchronicle @jonrcrowell @elc @Experian_QAS @paulboal @im4infomgt @WinstonChen @ocdqblog @KeithMesser @murnane @BrendaSomich @alanmstein @JGoldfed @jaimefitzgerald @tedlouie @bslarkin
Venezuela: @pigbar
Ireland: @daraghobrien @KenOConnorData @MapMyBusiness; United Kingdom: @SteveTuck @VeeMediaFactory @mktginsightguy @Daryl70 @Teresacottam @AnishRaivadera @ExperianQAS_UK @DataQualityPro @SarahBurnett @faropress @jschwa1 @mikeferguson1 @jtonline @Master_OBASHI @Nicola_Askham; France: @DataChannel @mydatanews @jmichel_franco @ydemontcheuil; Switzerland: @alexej_freund @openmethodology; Austria: @omathurin; Germany: @stiebke @dwhp @dakoller @marketingBOERSE; Belgium: @guypardon; Netherlands: @harri00413 @GrahamRhind; Denmark: @jeric40 @eobjects @StiboSystems; Norway: @Orvei; Sweden: @MrPerOlsson @DarioBezzina; Finland: @JoukoSalonen; Lithuania: @googlea; Italy: @Stray__Cat
Algeria: @aboussaidi; South Africa: @MarkGStacey
Pakistan: @monisiqbal; India: @MDMAnswers @twitrvenky @ashwinmaslekar; Indonesia: @VaiaTweets
Australia: @emx5 @vmcburney; New Zealand: @JohnIMM @Intelligentform
It’s my hope that in the future I will be able to interact even more diversely.
Military Intelligence
Many data quality issues may be prevented by having some intelligent (error tolerant) search going on. I wrote a post about it called Upstream prevention by error tolerant search.
Intelligent search may have a lot of other advantages too.
A scandal related to the Danish military has been going on for a while. The short story is:
A member of the Special Forces wrote a book about combat actions in Afghanistan. The military tried to stop it because it could help the enemy. In that process they for some reason made an Arabic translation and by some mistake leaked it to the press. The key person in the military involved in doing that has the surname “Sønderskov”.
Police “experts” were assigned to find the leak. For a month they searched unsuccessfully for an e-mail address including “Sønderskov”, only to realize: Oh, e-mail addresses can’t have the national character “ø”. It must be either “oe” or “o” instead, as in “Soenderskov” or “Sonderskov”.
The story (in Danish) here from the online computer media Version2.
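A search that expands national characters into their usual ASCII spellings would have found the address on day one. The sketch below shows the idea. The transcription table is an assumed, partial one for Danish and Norwegian letters, and the function name is hypothetical.

```python
# Common ASCII transcriptions of Danish/Norwegian letters (an assumed, partial table).
TRANSCRIPTIONS = {
    "ø": ["oe", "o"],
    "æ": ["ae"],
    "å": ["aa", "a"],
}


def ascii_variants(name: str) -> set:
    """Return every ASCII spelling obtained by transcribing national characters."""
    variants = {""}
    for ch in name.lower():
        options = TRANSCRIPTIONS.get(ch, [ch])
        # Extend every spelling built so far with every option for this character.
        variants = {prefix + option for prefix in variants for option in options}
    return variants
```

For “Sønderskov” this yields “soenderskov” and “sonderskov”, which are the two spellings worth searching for in e-mail addresses.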






