Matching Down Under

As a data matching geek I always love reading about how others have made the great but fearful journey into the data matching world.

This week Wayne Colless of the Australian Attorney-General’s Department kindly made a document about data matching public on the DataQualityPro site. The full title is “Improving the Integrity of Identity Data – Data Matching Better Practice Guidelines, 2009”. Link here.

As Wayne explains in a discussion in the LinkedIn Data Matching group: Australia has no national unique identifier for individuals (such as the US SSN or the number recorded on national ID cards used in many other countries) that can be used, so the matching has to involve only non-unique values such as name, address and dates of birth.

The document gives very thorough step-by-step guidance on matching individuals’ names, addresses and dates of birth. As the document says, you may either build all the logic yourself or buy commercial software that does the same. Either way, you have to understand what the software does in order to tune the processes and set thresholds that are meaningful to you.
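To make the point about thresholds a bit more concrete, here is a minimal sketch of a weighted match score over name, address and date of birth. It is not taken from the guidelines: the field names, the weights, the threshold value and the use of Python's standard `difflib` similarity are all my own assumptions, chosen only for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold; these are illustrative assumptions
# and would need tuning against your own data.
WEIGHTS = {"name": 0.5, "address": 0.3, "dob": 0.2}
MATCH_THRESHOLD = 0.85


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of the per-field similarities."""
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())


def is_match(rec_a: dict, rec_b: dict, threshold: float = MATCH_THRESHOLD) -> bool:
    """Treat two records as the same person when the score reaches the threshold."""
    return match_score(rec_a, rec_b) >= threshold
```

Raising the threshold gives fewer false positives but more missed matches, which is exactly the trade-off you have to understand whether you build or buy.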

As Australia is a nation mainly born through immigration, the challenges of adapting the ruling Anglo-Saxon naming conventions to the reality of name formats coming from all over the world are very apparent. I like that the diversity issues are given good thought in the document.

I also like that the document addresses a subject not mentioned as often as it should be, namely the challenges with embracing historical values in settling a match as seen in this figure taken from the document:

Even if you think you already know the dos and don’ts of data matching (and I guess you never really do), I find the document well worth reading.


Hell in Norway

Looking for inappropriate words in customer data is always a risky business. Most of the time there is a legitimate name or place somewhere with that word.

Take a city called “Hell”, for example.

Outside the English-speaking parts of the world you will find “Hell” in Norway. It’s a village with its own postal code (NO-7517) situated in the Trondheim metropolitan area. Not least at this time of year, with winter in the Northern Hemisphere, it is surely considerably colder than the religious “Hell”.

But even in the English-speaking world you will find a semi-legitimate “Hell” in Michigan, United States.
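One way to live with this is to pair the list of flagged words with an allow-list of known legitimate uses. The sketch below is only an illustration under my own assumptions: the word list, the allow-list entries and the function name are hypothetical, and a real check would need far richer reference data.

```python
# Hypothetical list of words that trigger a closer look.
FLAGGED_WORDS = {"hell"}

# Hypothetical allow-list of (country code, city) pairs where the
# flagged word is a real place name.
LEGITIMATE_PLACES = {("NO", "hell"), ("US", "hell")}


def needs_review(city: str, country: str) -> bool:
    """Flag a city name only if it is a flagged word and not a known place."""
    word = city.strip().lower()
    return word in FLAGGED_WORDS and (country.upper(), word) not in LEGITIMATE_PLACES
```

With these assumed lists, “Hell” in Norway passes quietly, while the same value on a Danish address would still be sent for review.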


Despite Best Intentions

Sometimes you have the best intentions of improving things, such as data quality and a lot of other things, but somewhere you fail to see the big picture, and then it is too late to correct.

From the sports world this apparently happened to the Singapore water polo team at the current Asian Games.

They have newly designed speedos honoring the nation’s flag.

But now some ministry tells them that the swimsuit is inappropriate. And you can’t change outfits during the games.

By the way: I also work at a company with this logo:

Fortunately we haven’t got company speedos.


Legal Forms from Hell

When doing data matching with company names, a basic challenge is that a proper company name in most cultures, in most cases, has two elements:

  • The actual company name
  • The legal form

Some worldwide examples:

  • Informatica Corporation
  • Talend SA
  • SAP Deutschland AG & Co. KG
  • Sony Kabushiki Kaisha
  • LEGO A/S

There are hundreds of different legal forms in full and abbreviated forms. Wikipedia has a list here (here called types of business entity).

However, when typing in company names in databases the legal form is often omitted. And even where legal forms are present they may be represented differently in full or abbreviated forms, with varying spelling and punctuation and so on. As the actual company names also suffer from this fuzziness, the complexity is overwhelming.

A common way of handling this issue in data matching is to separate the legal form and then put the emphasis on comparing the remaining part, the actual company name. This has to be done country by country, or else you may remove the entire name of a company, as with an Italian company called Société Anonyme, whose name is a French legal form.
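A minimal sketch of such a country-specific separation could look like the one below. The table of legal forms is a tiny, hypothetical sample I made up for illustration (real lists run into the hundreds), and the function names are my own.

```python
import re

# Illustrative and far from complete: legal forms per country code.
# This table is an assumption for the example, not a reference list.
LEGAL_FORMS = {
    "US": ["corporation", "corp", "inc", "llc"],
    "FR": ["société anonyme", "sa", "sarl"],
    "DE": ["ag & co kg", "gmbh", "ag", "kg"],
    "JP": ["kabushiki kaisha", "kk"],
    "DK": ["a/s", "aps"],
}


def normalize(text: str) -> str:
    """Lower-case, drop full stops and collapse repeated whitespace."""
    text = text.lower().replace(".", "")
    return re.sub(r"\s+", " ", text).strip()


def split_legal_form(company_name: str, country: str):
    """Return (actual name, legal form or None) using only that country's forms."""
    name = normalize(company_name)
    # Try the longest forms first so "ag & co kg" wins over "kg".
    for form in sorted(LEGAL_FORMS.get(country, []), key=len, reverse=True):
        suffix = " " + form
        # Requiring a leading space means the whole name is never stripped.
        if name.endswith(suffix):
            return name[: -len(suffix)], form
    return name, None
```

For example, `split_legal_form("SAP Deutschland AG & Co. KG", "DE")` separates the name from the legal form, while `split_legal_form("Société Anonyme", "IT")` leaves the name untouched because no Italian form matches.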

While the practice of having legal forms in company names may serve well for the original purpose of knowing the risk of doing business with that entity, it is certainly not serving the purpose of having the uniqueness data quality dimension solved.

One would think it is time to change the bad (legally demanded) practice of mixing legal forms with company names and to serve the original purpose in another, more data quality friendly way.


Free and Open Sources of Reference Data

This Monday I mingled in a tweetjam organized by the open source data integration vendor Talend.

One of the questions discussed was: Are free and open sources of reference data becoming more important in your projects?

When talking “free and open”, not least in the open source realm, we can’t avoid talking about “free for a fee”. Some sources of open data like Geonames are free as in “free beer”. Other data comes with a fee. In my home country Denmark we have had some discussions about the reasoning behind the government putting a fee on mandatorily collected data, and I have observed similar considerations in our close neighbor Sweden. (By the way: the picture of a bridge that Talend uses a lot, like on top of the home page here, looks like the bridge between Denmark and Sweden.)

One challenge I have met over and over again in using free (maybe for a fee) and open data in data integration and data quality improvement is the cost of conformity. When using open government data there may, apart from the pricing, be a lot of differences between the countries in formats, coverage and so on. I think there is a great potential in delivering conformed data from many different sources for specific purposes.


Magic Quadrant Diversity

The Magic Quadrants from Gartner Inc. rank the tool vendors within a lot of different IT disciplines. Related to my work, the quadrants for data quality tools and master data management are the most interesting ones.

However, the quadrants examine the vendors on a global scale. So how are the vendors doing in my country?

I tried to look up a few of the vendors in a local business directory for Denmark provided (free to use on the web) by the local Experian branch.

DataFlux

First up is DataFlux, the (according to Gartner) leading data quality tool vendor.

Result: No hits.

Knowing that DataFlux is owned by SAS Institute will, however, with a bit of patience eventually bring you to information about the DataFlux product deep down on the local SAS website.

PS: Though SAS is better known here as the main airline (Scandinavian Airlines System), SAS Institute is actually very successful in Denmark, with a much larger share of the Business Intelligence market here than in most other places.

Informatica

Next up is Informatica, a well positioned company in both the quadrant for data quality tools and customer master data management.

Result: No Hits.

Here you have to know that Informatica is represented in the Nordic area by a company called Affecto. You will find information about the Informatica products deep down on the Affecto website – along with the competing product FirstLogic owned by Business Objects (owned by SAP) also historically represented by Affecto.

Stibo Systems

Stibo Systems may not be as well known as the two above, but is tailing the mega vendors in the quadrant for Product Master Data Management, as mentioned recently in a blog post by Dan Power.

Result: Hit:

They are here with over 500 employees, at least in the legal entity called Stibo, where Stibo Systems is an alternate name and brand. And I’m not kidding; I visited them last month at the impressive headquarters near Århus (the second largest city in Denmark).


Script Systems

This Friday my blog post was called Follow Friday diversity. In my hope of reaching a more equal worldwide interaction, I wonder whether writing in English with Roman (Latin) characters is enough.

Take a look at the diversity in script systems around the world:

Alphabets

In an alphabet, each letter corresponds to a sound. These are also referred to as phonographic scripts. Examples of Alphabets: Roman (Latin); Cyrillic; Greek

Abjads

Abjads consist exclusively of consonants. Vowels are omitted from most words, because they are obvious for native speakers, and are simply inserted when speaking. In addition, Abjads are normally written from right to left. Examples of Abjads: Hebrew; Arabic

Abugidas

Abugidas are characteristic for scripts in India and Ethiopia. In this style, only the consonants are normally written, and standard vowels are assumed. If a different vowel is required, it is indicated with a special mark. Abugidas form an intermediate level between alphabetic and syllabic scripts. Examples of Abugidas: Hindi (Devanagari); Singhalese

Syllabic Scripts

Like alphabets, syllabic scripts are another type of phonographic script. In a syllabic script, each character stands for a syllable. Examples of Syllabic Scripts: Japanese (Hiragana, Katakana); Cherokee

Symbol Scripts

In symbolic scripts, each character is an ideogram standing for a complete word. Compound terms or concepts are composed of multiple symbols. Symbolic scripts are also called logographic scripts. Examples of Symbolic Scripts: Chinese; Japanese (Kanji)

Source: Worldmatch® Comparing International Data by Omikron Data Quality – full version here.


Follow Friday Diversity

Every Friday on Twitter people are recommending other tweeps to follow using the #FollowFriday (or simply #FF) hashtag.

So do I.

Below please find my follow Friday recommendations grouped by global region:

 

Canada: @carrni @datamartist @sheezaredhead @andrewsinfotech @aniagl @DQamateur @bivcons @projmgr @DQStudent @datachick; United States: @GarnieBolling @stevesarsfield @UtopiaInc @bbreidenbach @fionamacd @RobertsPaige @BIMarcom @IDResolution @FirstSanFranMDM @dan_power @merv @NISSSAMSI @jilldyche @howarddresner @GartnerTedF @RobPaller @marc_hurst @dcervo @datamentors @VishAgashe @IBMInitiate @RamonChen @JackieMRoberts @philsimon @Nick_Giuliano @DataInfoCom @juliebhunt @Futureratti @dqchronicle @jonrcrowell @elc @Experian_QAS @paulboal @im4infomgt @WinstonChen @ocdqblog @KeithMesser @murnane @BrendaSomich @alanmstein @JGoldfed @jaimefitzgerald @tedlouie @bslarkin

Venezuela: @pigbar

Ireland: @daraghobrien @KenOConnorData @MapMyBusiness; United Kingdom: @SteveTuck @VeeMediaFactory @mktginsightguy @Daryl70 @Teresacottam @AnishRaivadera @ExperianQAS_UK @DataQualityPro @SarahBurnett @faropress @jschwa1 @mikeferguson1 @jtonline @Master_OBASHI @Nicola_Askham; France: @DataChannel @mydatanews @jmichel_franco @ydemontcheuil; Switzerland: @alexej_freund @openmethodology; Austria: @omathurin; Germany: @stiebke @dwhp @dakoller @marketingBOERSE; Belgium: @guypardon; Netherlands: @harri00413 @GrahamRhind; Denmark: @jeric40 @eobjects @StiboSystems; Norway: @Orvei; Sweden: @MrPerOlsson @DarioBezzina; Finland: @JoukoSalonen; Lithuania: @googlea; Italy: @Stray__Cat

Algeria: @aboussaidi; South Africa: @MarkGStacey

Pakistan: @monisiqbal; India: @MDMAnswers @twitrvenky @ashwinmaslekar; Indonesia: @VaiaTweets

Australia: @emx5 @vmcburney; New Zealand: @JohnIMM @Intelligentform

It’s my hope that I will be able to interact even more diversely in the future.


Military Intelligence

Many data quality issues may be prevented by having some intelligent (error tolerant) search going on. I wrote a post about it called Upstream prevention by error tolerant search.

Intelligent search may have a lot of other advantages too.

A scandal related to the Danish military has been going on for a while. The short story is:

A member of the Special Forces wrote a book about combat actions in Afghanistan. The military tried to stop it, because it could help the enemy. In that process they for some reason made an Arabic translation and by mistake leaked it to the press. The key person at the military behind this has the surname “Sønderskov”.

Police “experts” were assigned to find the leak. For a month they searched unsuccessfully for an e-mail address including “Sønderskov”, only to realize: Oh, e-mail addresses can’t have the national character “ø”. It must be written as “oe” or “o” instead, as in “Soenderskov” or “Sonderskov”.
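A little error tolerance would have saved that month. As a minimal sketch, the search could expand the surname into its plausible ASCII spellings before looking through the addresses. The transliteration table and function names below are my own assumptions, covering only the three Danish national characters.

```python
from itertools import product

# Assumed transliterations of Danish national characters into ASCII.
TRANSLITERATIONS = {
    "ø": ["oe", "o"],
    "æ": ["ae"],
    "å": ["aa", "a"],
}


def ascii_variants(name: str) -> set:
    """Return every ASCII spelling of a name given the table above."""
    options = [TRANSLITERATIONS.get(ch, [ch]) for ch in name.lower()]
    return {"".join(combo) for combo in product(*options)}


def mentions_name(email: str, name: str) -> bool:
    """True if any ASCII variant of the name occurs in the e-mail address."""
    address = email.lower()
    return any(variant in address for variant in ascii_variants(name))
```

Here `ascii_variants("Sønderskov")` gives both “soenderskov” and “sonderskov”, so a search using `mentions_name` would catch either spelling in one pass.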

The story (in Danish) here from the online computer media Version2.
