Bluetooth and Me

Bluetooth is a wonderful technology that allows me to browse the phone book stored on my cell phone and activate a call from my car cockpit even though my mobile is in my jacket pocket and my hands are on the wheel.

By the way: Bluetooth is a communication standard named after king Harald Bluetooth of Denmark because he united all the Danish tribes a thousand years ago, even though his father, Gorm the Sleepy, probably did most of it. Harald however was the guy who abandoned the Old Norse gods in favor of Christianity leaving Odin and Thor the humiliating fate of becoming American comic and film characters.

History aside, Bluetooth doesn’t do anything more than exposing my party master data hub on the cell phone made up from numerous SIM card changes and outlook synchronizations with all the resulting duplicates and outdated numbers even a data quality geek and master data management enthusiast is wearing.

Bookmark and Share

Marco Polo and Data Provenance

Besides being a data geek I am also interested in pre-modern history. So it’s always nice when I’m able to combine data management and history.

A recurring subject in historian circles is a suspicion saying that Explorer Marco Polo never actually went to China.

As said in the linked article from The Telegraph: “It is more likely that the Venetian merchant adventurer picked up second-hand stories of China, Japan and the Mongol Empire from Persian merchants whom he met on the shores of the Black Sea – thousands of miles short of the Orient”.

When dealing with data and ramping up data quality a frequent challenge is that some data wasn’t captured by the data consumer – not even by the organization using the data. Some of the data stored in company databases are second-hand data and in some cases the overwhelming part of data is captured outside the organization.

As with the book telling about Marco Polo’s (alleged) travels called “Description of the World” this doesn’t mean that you can’t trust anything. But maybe some data are mixed up a bit and maybe some obvious data are missing.

I have earlier touched this subject in the post Outside Your Jurisdiction and identified second-hand data as one of the Top 5 Reasons for Downstream Cleansing.

Bookmark and Share

History of Data Quality

When did the first data quality issue occur? Wikipedia says in the data quality article section titled history that it began with the mainframe computer in the United States of America.

Fellow data quality blogger Steve Sarsfield made a blog post a few years ago called A Brief History of Data Quality where it is said “Believe it or not, the concept of data quality has been touted as important since the beginning of the relational database”.

However, a predominant sentiment in the data quality realm is that data quality is not about technology. It is about people. People are the sinners of data quality flaws and as the main part of the problem people should also be the overwhelming part, if not the only part, of the solution.

So I guess data quality challenges were introduced when people showed up in the real world. How and when that happened is a matter of discussion as discussed in the blog post Out of Africa.

As explained in the post Movable Types the invention of movable types in printing some hundreds of years ago (the most important invention since someone invented the wheel for the first time) made a big boost in knowledge sharing among people – and also a big boost in data and information quality issues.

But I think the saying “To err is human, but to really foul things up you need a computer” is valid. Consequently I also think you may need a computer to help with cleaning up the mess and to prevent the mess from happening again. End of (hi)story.    

Bookmark and Share

How long is a Marathon?

Many large cities around the world have a yearly marathon event. Today it’s Copenhagen (and possibly other cities too).

The marathon distance today is 42,195 kilometers (if I use comma as decimal point) which resembles 26 miles and 385 yards or 26.22 miles (if I use a dot as decimal point).

So even if we today agree about the distance we might represent that distance in various ways. The distance has however varied during history as seen in the table with the length of the Olympic marathons.

What about real world alignment?

Well, if the Greek runner called Pheidippides (sometimes spelled Phidippides or Philippides) took the long but flat Southern route from Marathon to Athens it would have been around 42 kilometers. If he took the shorter but steeper Northern route it would only have been around 35 kilometers.

What about me? Oh, I’ll go for 42,195 kilometers – on the bike.   

Bookmark and Share

As Bill Shakespeare Wrote …

This post is a follow up on the post Foreign Affairs and the post Fuzzy Matching and Information Quality over at the Mastering Data Management blog.

The fuzzy post and comments including mine circles around how the relation between “Bill” and “William” must be handled in data matching.

While “Bill” and “William” may be used interchangeable in modern Anglo-Saxon data, it may be a mistake in time (anachronism) to use them interchangeable related to the grand old playwright.

Also it may be a mistake in place to use them interchangeable in other cultures.

For example in my home country Denmark “Bill” and “William” are two different names. Globalization has been going on for a long time as far more people are baptized (or given the name otherwise) William than the original Danish form Wilhelm. There are only 286 people with the name Wilhelm today opposite to 7,355 with the name William including 800 new during the last year. And then there are 353 different people with the name Bill.

But the same use of nicknames has not been localized here yet.

So with Danish data matching “Bill Nielsen” and “William Nielsen” is almost certainly a false positive.

It’s not that it’s a big problem; the risk of making the mistake is very low. The problem is rather that focus should be on different more pressing issues with specific challenges (and possibilities) related to data from each culture and country.

Bookmark and Share

So I’m not a Capricorn?

Yesterday was my birthday. Being born the 14th January makes me a Capricorn according to astrology.

Only there is a slight problem. As told in an article on Huffingtonpost an astronomer has kindly remarked that the assignment of signs with the calendar was made thousands of years ago. In the mean time the earth’s orbit has changed, so we should have completely new signs (and personalities?) today.     

I guess astrology qualifies as a data and information quality trainwreck by forgetting one of the most common pitfalls in data quality: Things change.  

Bookmark and Share

Matching Down Under

As a data matching geek I always love reading about how others have made the great but fearful journey into the data matching world.

This week Wayne Colless of the Australian Attorney-General’s Department kindly made a document about data matching public on the DataQualityPro site. The full title is “Improving the Integrity of Identity Data – Data Matching Better Practice Guidelines, 2009”. Link here.

As Wayne explains in a discussion in the LinkedIn Data Matching group: Australia has no national unique identifier for individuals (such as the US SSN or the number recorded on national ID cards used in many other countries) that can be used, so the matching has to involve only non-unique values such as name, address and dates of birth.

The document gives a very thorough step by step guidance into matching individual’s names, addresses and birthdays. As the document says you may either build all the logic yourself or you may buy commercial software that does the same. But anyway you have to understand what the software does in order to tune the processes and set the thresholds meaningful to you.

As Australia is a nation mainly born through immigration the challenges with adapting the ruling Anglo-Saxon naming conventions to the reality of name formats coming from all over the world is very apparent. I like that the diversity issues is given a good thought in the document.

I also like that the document addresses a subject not mentioned as often as it should be, namely the challenges with embracing historical values in settling a match as seen in this figure taken from the document:

Whether you think you already know the dos and don’ts in data matching (and I guess you never know that) I really find the document worth reading.   

Bookmark and Share

Movable Types

A big boost in knowledge sharing in mans history was made around year 1450 (in the Gregorian calendar) when Johannes Gutenberg of Germany invented the use of movable types in printing. However the movable types was actually invented 400 years before in China here using porcelain and 200 years before in Korea where metal also was used. But the East Asian inventions did not spread very well do to the Script Systems used, where you have thousands of different tablets representing each syllable or word opposite to when using an alphabet.

Anyway the invention of movable types in printing is regarded as maybe the most important invention since someone invented the wheel (for the first time).

Data quality flaws also got a big boost with the sudden increase in printed work made possible by this invention. I remember my grandmother was a text reviewer at a local newspaper, and she always complained about journalists with poor spelling capabilities and she was very upset when the names of people was spelled wrong in articles. I guess her reference file for correct spelled names was in her head as she knew every known person in the town (being my town of birth: Randers).

The use of computers (including the internet) has made the next big boost in knowledge sharing and data quality flaws including the introduction of the term data quality. Before poor data quality was called sloppiness I guess. The problem however stays the same: Putting the right characters in the right order. The first time.

Bookmark and Share

To be called Hamlet or Olaf – that is the question

Right now my family and I are relocating from a house in a southern suburb of Copenhagen into a flat much closer to downtown. As there is a month in between where we haven’t a place of our own, we have rented a cottage (summerhouse) north of Copenhagen not far from Kronborg Castle, which is the scene of the famous Shakespeare play called Hamlet.

Therefore a data quality blog post inspired by Hamlet seems timely.

Though the feigned madness of Hamlet may be a good subject related to data quality, I will however instead take a closer data matching look at the name Hamlet.

Shakespeare’s Hamlet is inspired by an old Norse legend, but to me the name Hamlet doesn’t sound very Norse.

Nor does the same sounding name Amleth found in the immediate source being Saxo Grammaticus.

If Saxo’s source was a written source, it may have been from Irish monks in Gaelic alphabet as Amhlaoibh where Amhl=owl and aoi=ay and bh=v sounding just like the good old Norse name Olav or Olaf.

So, there is a possible track from Hamlet to Olaf.

Also today a fellow data quality blogger Graham Rhind posted a post called Robert the Carrot with the same issue. As Graham explains, we often see how data is changed through interfaces and in the end after passing through many interfaces doesn’t look at all as it was when first entered. There may be a good explanation for each transformation, but the end-to-end similarity is hard to guess when only comparing these two.

I have met that challenge in data matching often. An example will be if we have the following names living on the same address:

  • Pegy Smith
  • Peggy Smith
  • Margaret Smith

A synonym based similarity (or standardization) will find that Margaret and Peggy are duplicates.

An edit distance similarity will find that Peggy and Pegy are duplicates,

A combined similarity algorithm will find that all three names belong to a single duplicate group.

Bookmark and Share

Out-of-Africa

Besides being a memoir by Karen Blixen (or the literary double Isak Dinesen) Out-of-Africa is a hypothesis about the origin of the modern human (Homo Sapiens). Of course there is a competing scientific hypothesis called Multiregional Origin of Modern Humans. Besides that there is of course religious beliefs.

The Out-of-Africa hypothesis suggests that modern humans emerged in Africa 150,000 years ago or so. A small group migrated to Eurasia about 60,000 years ago. Some made it across the Bering Strait to America maybe 40,000 years ago or maybe 15,000 years ago. The Vikings said hello to the Native Americans 1,000 years ago, but cross Atlantic movement first gained pace from 500 years ago, when Columbus discovered America again again.

½ year ago (or so) I wrote a blog post called Create Table Homo_Sapiens. The comment follow up added to the nerdish angle with discussing subjects as mutating tables versus intelligent design and MAX(GEEK) counting.

But on the serious side comments also touched the intended subject about making data models reflect real world individuals.

Tables with persons are the most common entity type in databases around. As in the Out-of-Africa hypothesis it could have been as a simple global common same structural origin. But that is not the way of the world. Some of the basic differences practiced in modeling the person entity are:

  • Cultural diversity: Names, addresses, national ID’s and other basic attributes are formatted differently country by country and in some degree within countries. Most data models with a person entity are build on the format(s) of the country where it is designed.
  • Intended purpose of use: Person master data are often stored in tables made for specific purposes like a customer table, a subscriber table a contact table and so on. Therefore the data identifying the individual is directly linked with attributes describing a specific role of that individual.
  • “Impersonal” use: Person data is often stored in the same table as other party master types as business entities, projects, households et cetera.

Many, many data quality struggles around the world is caused by how we have modeled real world – old world and new world – individuals.

Bookmark and Share