Script Systems

This Friday my blog post was called Follow Friday Diversity. In my hope of reaching a more balanced worldwide interaction, I wonder whether writing in English with Roman (Latin) characters is enough.

Take a look at the diversity in script systems around the world:

Alphabets

In an alphabet, each letter corresponds to a sound. These are also referred to as phonographic scripts. Examples of Alphabets: Roman (Latin); Cyrillic; Greek

Abjads

Abjads consist exclusively of consonants. Vowels are omitted from most words, because they are obvious for native speakers, and are simply inserted when speaking. In addition, Abjads are normally written from right to left. Examples of Abjads: Hebrew; Arabic

Abugidas

Abugidas are characteristic of scripts in India and Ethiopia. In this style, only the consonants are normally written, and standard vowels are assumed. If a different vowel is required, it is indicated with a special mark. Abugidas form an intermediate level between alphabetic and syllabic scripts. Examples of Abugidas: Hindi (Devanagari); Sinhalese

Syllabic Scripts

Like alphabets, syllabic scripts are another type of phonographic script. In a syllabic script, each character stands for a syllable. Examples of Syllabic Scripts: Japanese (Hiragana, Katakana); Cherokee

Symbol Scripts

In symbolic scripts, each character is an ideogram standing for a complete word. Compound terms or concepts are composed of multiple symbols. Symbolic scripts are also called logographic scripts. Examples of Symbolic Scripts: Chinese; Japanese (Kanji)
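For anyone handling data in several of these script systems, a first practical step is often just to find out which kind of script a value is written in. The following is a minimal sketch of that idea, not a complete classifier: it relies on Python's standard `unicodedata` module, and the `SCRIPT_FAMILY` mapping and the function names are my own assumptions, covering only the examples mentioned above.

```python
import unicodedata
from collections import Counter

# Assumed mapping from the first word of a Unicode character name to the
# script families listed above. It covers only the examples in this post.
SCRIPT_FAMILY = {
    "LATIN": "alphabet",
    "CYRILLIC": "alphabet",
    "GREEK": "alphabet",
    "HEBREW": "abjad",
    "ARABIC": "abjad",
    "DEVANAGARI": "abugida",
    "SINHALA": "abugida",
    "ETHIOPIC": "abugida",
    "HIRAGANA": "syllabic",
    "KATAKANA": "syllabic",
    "CHEROKEE": "syllabic",
    "CJK": "logographic",
}


def script_families(text):
    """Count letters per script family, ignoring digits, spaces, marks and punctuation."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        prefix = name.split(" ")[0] if name else ""
        counts[SCRIPT_FAMILY.get(prefix, "unknown")] += 1
    return counts


def dominant_family(text):
    """Return the most frequent script family in text, or None if it has no letters."""
    counts = script_families(text)
    return counts.most_common(1)[0][0] if counts else None
```

Calling `dominant_family` on a name or address field gives a rough hint about which parsing and matching rules might apply. Mixed-script values, such as Japanese text combining Kanji and Hiragana, would need the full counts from `script_families` instead.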

Source: Worldmatch® Comparing International Data by Omikron Data Quality – full version here.

Follow Friday Diversity

Every Friday on Twitter, people recommend other tweeps to follow using the #FollowFriday (or simply #FF) hashtag.

So do I.

Below please find my follow Friday recommendations grouped by global region:

Canada: @carrni @datamartist @sheezaredhead @andrewsinfotech @aniagl @DQamateur @bivcons @projmgr @DQStudent @datachick; United States: @GarnieBolling @stevesarsfield @UtopiaInc @bbreidenbach @fionamacd @RobertsPaige @BIMarcom @IDResolution @FirstSanFranMDM @dan_power @merv @NISSSAMSI @jilldyche @howarddresner @GartnerTedF @RobPaller @marc_hurst @dcervo @datamentors @VishAgashe @IBMInitiate @RamonChen @JackieMRoberts @philsimon @Nick_Giuliano @DataInfoCom @juliebhunt @Futureratti @dqchronicle @jonrcrowell @elc @Experian_QAS @paulboal @im4infomgt @WinstonChen @ocdqblog @KeithMesser @murnane @BrendaSomich @alanmstein @JGoldfed @jaimefitzgerald @tedlouie @bslarkin

Venezuela: @pigbar

Ireland: @daraghobrien @KenOConnorData @MapMyBusiness; United Kingdom: @SteveTuck @VeeMediaFactory @mktginsightguy @Daryl70 @Teresacottam @AnishRaivadera @ExperianQAS_UK @DataQualityPro @SarahBurnett @faropress @jschwa1 @mikeferguson1 @jtonline @Master_OBASHI @Nicola_Askham; France: @DataChannel @mydatanews @jmichel_franco @ydemontcheuil; Switzerland: @alexej_freund @openmethodology; Austria: @omathurin; Germany: @stiebke @dwhp @dakoller @marketingBOERSE; Belgium: @guypardon; Netherlands: @harri00413 @GrahamRhind; Denmark: @jeric40 @eobjects @StiboSystems; Norway: @Orvei; Sweden: @MrPerOlsson @DarioBezzina; Finland: @JoukoSalonen; Lithuania: @googlea; Italy: @Stray__Cat

Algeria: @aboussaidi; South Africa: @MarkGStacey

Pakistan: @monisiqbal; India: @MDMAnswers @twitrvenky @ashwinmaslekar; Indonesia: @VaiaTweets

Australia: @emx5 @vmcburney; New Zealand: @JohnIMM @Intelligentform

It’s my hope that in the future I will be able to interact with an even more diverse group.

Can Anybody Hear Me?

Blogging and evangelizing about data quality is a fairly lonely trade.

Hopefully that is not because it isn’t a good cause, and I don’t think so. This week I also followed another good cause that did not get much attention.

After the disaster with the oil spill in the Gulf of Mexico, Greenpeace launched an operation aimed at drawing attention to the probably even more dangerous deep-water drilling in the fragile Arctic environment.

The ship Esperanza sailed to Baffin Bay and launched inflatables with four climbers, who hung under an oil rig for 40 hours in the bitterly cold wind while practically no one cared.

Oh yes, there was live tweeting from the ship on the @gp_espy account – followed by 1,500 tweeps worldwide, including yours truly.

Surely, a few articles were written by the press – mainly in Britain, where the drilling company Cairn Energy belongs, and in Denmark, because the waters belong to Greenland/Kingdom of Denmark.

But I guess Greenpeace must be pretty disappointed with the overall attention. I guess they chose the wrong place (platform, you might say). There is not much press in Baffin Bay.

And hey, I guess I chose the wrong time for publishing this post (based on my reader demographics as I know them). No one is online in the Pacific region now, it’s early Saturday morning in Europe and it’s the night before a three-day weekend in the United States.

Out-of-Africa

Besides being a memoir by Karen Blixen (or her pen name, Isak Dinesen), Out-of-Africa is a hypothesis about the origin of the modern human (Homo sapiens). Of course there is a competing scientific hypothesis called Multiregional Origin of Modern Humans. Besides that, there are of course religious beliefs.

The Out-of-Africa hypothesis suggests that modern humans emerged in Africa 150,000 years ago or so. A small group migrated to Eurasia about 60,000 years ago. Some made it across the Bering Strait to America maybe 40,000 years ago or maybe 15,000 years ago. The Vikings said hello to the Native Americans 1,000 years ago, but transatlantic movement only gained pace some 500 years ago, when Columbus discovered America yet again.

Half a year ago (or so) I wrote a blog post called Create Table Homo_Sapiens. The follow-up comments added to the nerdy angle by discussing subjects such as mutating tables versus intelligent design and MAX(GEEK) counting.

But on the serious side, the comments also touched on the intended subject: making data models reflect real-world individuals.

Tables holding persons are the most common entity type in databases. As in the Out-of-Africa hypothesis, they could all have shared one simple, global structural origin. But that is not the way of the world. Some of the basic differences practiced in modeling the person entity are:

  • Cultural diversity: Names, addresses, national IDs and other basic attributes are formatted differently country by country and to some degree within countries. Most data models with a person entity are built on the format(s) of the country where they were designed.
  • Intended purpose of use: Person master data is often stored in tables made for specific purposes, like a customer table, a subscriber table, a contact table and so on. Therefore the data identifying the individual is directly linked with attributes describing a specific role of that individual.
  • “Impersonal” use: Person data is often stored in the same table as other party master types, such as business entities, projects, households et cetera.

Many, many data quality struggles around the world are caused by how we have modeled real-world – old world and new world – individuals.
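To make the first two points a little more concrete, here is a small sketch of a person model that stores the individual once and attaches purpose-specific roles to it, with name formatting driven by country. It is only one possible shape under simplifying assumptions: the class names, the `FAMILY_NAME_FIRST` set and the `display_name` helper are hypothetical, and real-world name rules are far more varied than a single ordering flag.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    """A purpose-specific view of a person (customer, subscriber, contact...)."""
    kind: str
    attributes: dict


@dataclass
class Person:
    """The individual, stored once, independent of any role."""
    person_id: int
    name_parts: dict   # e.g. {"given": ..., "family": ...}; keys may vary by culture
    country: str       # ISO 3166-1 alpha-2 code, used here to pick a name format
    roles: list = field(default_factory=list)


# Hypothetical rule set: countries where the family name is usually written first.
FAMILY_NAME_FIRST = {"HU", "JP", "CN"}


def display_name(person):
    """Format a name in the order assumed for the person's country."""
    given = person.name_parts.get("given", "")
    family = person.name_parts.get("family", "")
    if person.country in FAMILY_NAME_FIRST:
        ordered = (family, given)
    else:
        ordered = (given, family)
    return " ".join(part for part in ordered if part)
```

With this shape, the same `Person` can carry both a customer role and a subscriber role without the identifying data being duplicated in two purpose-built tables.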

Consultants

Having just arrived home from summer vacation, I have been thinking a bit about how we consultants act at work. On our vacation we used local guides in some places. These guides were our consultants in places they knew very well and we didn’t know at all. But I also noticed they had some habits that may be considered common weaknesses in practicing consultancy.

Different language

Francisco Caballero has lived all his long life in the beautiful town of Ronda in Southern Spain. He shared his great knowledge about the town with us in his distinguished blend of English and Spanish, spiced up with some Russian, German and probably also Dutch words. I think we understood most of it, though we did find some differences when we compared our perceptions afterwards.

Personal opinions

Besides telling us about the town and its history, Señor Caballero also shared his views on politics. He told us about problems with young people today and increasing crime. He remembered that things were much better when Generalissimo Franco was in charge. He admitted, though, that today there are no “bandidos” in the mountains as in the old days, but as he put it: “Today all bandidos in Madrid”. I guess he was referring to recent governments.

Assessing risk

Robert is a fifth-generation Gibraltarian of British descent. Gibraltar is the small British enclave around the marvelous rock on the Southern tip of Spain, facing Africa across the narrow strait. I remember the opening scene of the James Bond film The Living Daylights is a hazardous car ride down the rock. Robert took us in his taxi on the very same narrow roads, practicing pretty much the same style of driving while explaining that, as we had to get in and out of the car all the time at the different sights, there was really no point in using the seat belts.

Personal commercial agenda

Salam seemed to know everyone and everything in Tangier, the Moroccan city on the Northern tip of Africa on the other side of the Strait of Gibraltar. Salam offered us a guided tour where we would go everywhere we wanted and look at everything we fancied, taking as much time as we pleased. Only, once we were going around, he strongly urged us to go to exactly that spice shop he knew, and strongly recommended not sitting at that café we spotted but proceeding to a much better one. As infidels we couldn’t of course go into a mosque, unless (of course) we gave a few extra euros.

Going Upstream in the Circle

One of the big trends in data quality improvement is going from downstream cleansing to upstream prevention. So let’s talk about Amazon. No, not the online (book)store, but the river. Besides, I am a bit tired of almost every mention of innovative IT being about that eShop.

A map showing the Amazon River drainage basin reveals what may turn out to be a huge challenge in going upstream and solving the data quality issues at the source: There may be a lot of sources. Okay, the Amazon is the world’s largest river (because it carries more water to the sea than any other river), so this may be a picture of the data streams in a very large organization. But even more modest organizations have many sources of data, just as more modest rivers also have several sources.

By the way: The Amazon River also shares a source with the Orinoco River through the natural Casiquiare Canal, just as many organizations also share sources of data.

Some sources are not so easy to reach, such as the most distant source of the Amazon: a glacial stream on a snowcapped 5,597 m (18,363 ft) peak called Nevado Mismi in the Peruvian Andes.

Now, as I promised that the trend on this blog should be about positivity and success in data quality improvement, I will not dwell on the amount of work involved in going upstream and preventing dirty data at every source.

I say: Go to the clouds. The clouds are the sources of the water in the river. Also, I think that cloud services will help a lot in improving data quality more easily, as explained in a recent post called Data Quality from the Cloud.

Finally, the clouds over the Amazon River sources are made from water evaporated from the Amazon and a lot of other waters as part of the water cycle. In the same way, data has a cycle of being derived as information and created in a new form as a result of the actions taken when using the information.

I think data quality work in the future will embrace the full data cycle: Downstream cleansing, upstream prevention and linking in the cloud.

Feasible Names and Addresses

Most data quality technology was born in relation to the direct marketing industry back in the good old offline days. Main objectives have been deduplication of names and addresses and making names and addresses fit for mailing.

When working with data quality you have to embrace the full scope of business value in the data, here being the names and addresses.

Back in the 90’s I worked with an international fundraising organization. A main activity was sending direct mail with greeting cards for optional sale, with motifs related to seasonal feasts. Deduplication was a must regardless of the country (though the means were very different, but that’s for another day). Obviously the timing of the campaigns and the motifs on the cards differed between countries, but also within countries, based on the names and addresses.

Two examples:

German addresses

When selecting motifs for Christmas cards it’s important to observe that Protestantism is concentrated in the north and east of the country and Roman Catholicism is concentrated in the south and west. (If you think I’m out of season, well, such campaigns are planned in summertime.) So, in the North and East most people prefer Christmas cards with secular motifs such as a lovely winter landscape. In the South and West most people will like a motif with Madonna and Child. Having well-organized addresses with a connection to demographics was important.

Malaysian names

Malaysia is a very multi-ethnic society. The two largest groups, the ethnic Malays and the Malaysians of Chinese descent, have different seasonal feasts. The best way of handling this in order to fulfill the business model was to assign the names and addresses to the different campaigns based on whether the name was an ethnic Malay name or a Chinese name. Surely an exercise on the edge of what I earlier described in the post What’s in a Given Name?

Did They Put a Man on the Moon?

Recently I have been reading some blog posts about having a national ID for citizens in the United States, including a post from Steve Sarsfield and another from Jeffrey Huth of Initiate.

In Denmark, where I live, we have had such a national ID for about half a century. So if you are a vendor with a great solution for data matching and master data management in healthcare and want to approach a Danish prospect in healthcare (which is mainly public sector here), they will tell you that the solution looks really nice, but they don’t have that problem. You can’t stay many seconds as a patient in a Danish hospital before you are asked to provide your national ID. And if you came in inside your mother, you will be given an ID for life within seconds after you are born.

The same national ID is the basis for our elections. Some weeks before an election, the authorities push the button and every person with the right status and age gets a ballot. Therefore we watch in disbelief every fourth year when the United States elects a president and we learn about all the mess in voter registration.

Is that happening in the nation that put a man on the moon in 1969? Or did they? Was it after all a studio recording?

Birthday Party

Today this blog has been online for one year. It’s time for a birthday party.

The economy around a birthday party usually goes like this:

  • You, the guest, spend some money on a nice birthday present
  • I, the host, spend some money on fine food and beverage

Now, a blog is a virtual thing and I reckon that most of my readers live far, far away from the Copenhagen South Coast. So it’s going to be a remote birthday party and, as with most other things happening in the social media realm, no money is actually going to be exchanged.

Anyway, here is what I would have liked to serve in the real world:

Paella

The dish I have prepared the most times when we have guests is the Spanish paella. I love paella very much and so do all our polite guests.

Also I am a shrimp addict, so I usually like to add two or three different kinds of shrimp, from the smaller but extremely tasty Greenlandic shrimp to delicious giant Thai tiger prawns.

Steak

My second favorite meal is a steak. You probably don’t get a better steak than one from cattle grazing on the Argentine pampas.

As I live in the Northern Hemisphere it’s summertime now and perfect weather for preparing the steak outside on the grill.

Wine

There is so much good wine coming from many places around the world. I like Californian wine, wine from Chile, South African wine, Australian wine, French wine and last but not least Italian wine including the unbeatable Amarone.

Beer

As I am a native Dane you will probably expect me to propose a Carlsberg. Don’t get me wrong: Carlsberg is probably a good beer. But there are many other good beers around. When I am in England I like the ultimate mainstream beer: A John Smith (now owned by Dutch Heineken). The best mainstream beer in my opinion is the Belgian Leffe.

Cheers

Thanks to everyone who has read this blog, subscribed, retweeted and, not least, those who have commented.

Algorithm Envy

The term “algorithm envy” was used by Aaron Zornes in his piece on MDM trends when talking about identity resolution.

In my experience there is surely a need for good data matching algorithms.

As I have built a data matching tool myself, I faced that need back in 2005. At that time my tool was merely based on some standardization and parsing, match codes, some probabilistic learning and a few lightweight algorithms like the Hamming distance (more descriptions of these techniques here).
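For readers unfamiliar with these lightweight techniques, here is a minimal sketch of two of them. The Hamming distance is the standard definition; the `match_code` function is a hypothetical scheme invented for this illustration (strip accents, keep the first letter, drop later vowels, pad to a fixed length) and is not how my tool or any particular product builds its match codes.

```python
import re
import unicodedata


def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def match_code(name, length=6):
    """Hypothetical match code: a short, fixed-length key derived from a name."""
    # Split accented letters into base letter + combining mark (e.g. ü -> u + ¨).
    decomposed = unicodedata.normalize("NFKD", name)
    # Keep only A-Z; combining marks and non-decomposable letters like Ø are dropped.
    letters = re.sub(r"[^A-Z]", "", decomposed.upper())
    if not letters:
        return ""
    # Keep the first letter, then drop vowels from the rest.
    code = letters[0] + re.sub(r"[AEIOUY]", "", letters[1:])
    # Truncate or pad with zeros so all codes have the same length.
    return code[:length].ljust(length, "0")
```

Because the codes have a fixed length, they can be compared directly with `hamming_distance`: spelling variants such as "Jensen" and "Jenson" collapse to the same code, while "Jenssen" ends up two positions away. A scheme this crude also shows why such codes are tied to one language area, since it only understands Latin letters.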

My tool was pretty national (like many other matching tools) as it was tuned for handling Danish names and addresses as well as Swedish, Norwegian, Finnish and German addresses, which are very similar.

The task ahead was to expand the match tool so it could be used to match business-to-business records with the D&B WorldBase. This database has business entities from all over the world. The names and addresses in there are only standardized to the extent that is provided by the public sector or other providers for each country.

The records to be matched came from Nordic companies operating globally. For such records you can’t assume that they were entered by people who know the name and address format for the country in question. So, all in all, standardization and parsing wasn’t the full solution. If you don’t trust me, there is more explanation here.

When dealing with international data, match codes become either too complex or too poor. This is also due to the lack of standardization in both sets of records to be compared.

For the probabilistic learning, my problem was that everything learned until then had been gathered only from Nordic data. It wouldn’t be any good for the rest of the world.

The solution was including an advanced data matching algorithm, in this case Omikron FACT.

Since then the Omikron FACT algorithm has been considerably improved and is now branded as WorldMatch®. Some of the new advantages are dealing with different character sets and script systems and having synonyms embedded directly into the matching logic, which is far superior to using synonyms in a prior standardization process.

For full disclosure I work for the vendor Omikron Data Quality today. But I am not praising the product because of that – I work for Omikron because of the product.

