My View

This post is inspired by the view from our roof terrace, where I’m sitting with the laptop right now.

One of the buildings I can see in the skyline is the spectacular new Hotel Bella Sky that will open tonight.

The new hotel is situated by Copenhagen’s main trade fair venue, the Bella Center, the scene of the recent disastrous climate change summit where Wen, Obama and Singh couldn’t agree about anything.

The Bella Sky isn’t the only new high-rise hotel in the nearby skyline. Actually, there is currently an overcapacity of hotel rooms in Copenhagen. But as the saying goes, the new hotels were planned before the credit crunch and couldn’t be stopped.

Planning several years in advance has always been difficult. Within information technology it is also a well-known fact that projects set to deliver some years ahead almost always fail to meet the actual business needs by the time that date is reached.

On the one hand we need some more agile hotel projects – and agile information technology projects – including agile master data management and data quality programs.

On the other hand, I like it when I see some nice hotel architecture and some good data architecture.


Quotes not originally about Data Quality

Yesterday I was looking for some quotations for a data quality presentation.

I stumbled upon these ones by Niels Bohr:

An expert is a person who has made all the mistakes which can be made in a very narrow field

I found that this quote is most often used this way:

“An expert is a man who has made all the mistakes which can be made in a very narrow field”.

I am pretty sure Bohr said person – not man. There are just as many female experts as male experts around.

And indeed: Learning from mistakes is the path to expertise in data quality.

There are two sorts of truth: Trivialities, where opposites are obviously absurd, and profound truths, recognized by the fact that the opposite is also a profound truth

Bohr was into quantum mechanics. I think data quality is very much like quantum mechanics. Sometimes there is a simple single version of the truth; sometimes there are several great versions of a complex truth.

Anyone who is not shocked by quantum theory has not understood it

Anyone who is not shocked by the actual quality of data has probably not measured it (yet).


The Letter Ø

This blog is written in English. Therefore the letters used are normally restricted to A to Z.  

The English alphabet is one of many alphabets using Latin (or Roman) letters. Other alphabets, like the Russian one, use Cyrillic letters. Then there are other script systems in the world, which besides alphabets include abjads, abugidas, syllabic scripts and symbol scripts. Learn more about these in the post Script Systems.

My last surname is “Sørensen”. This word contains the lower case letter “ø” which is “Ø” in upper case. This letter is part of two alphabets: The Danish/Norwegian and the Faroese. Sometimes data has to be transformed into the English alphabet. Then the letter “ø” may be transformed to either “o” or “oe”. So my last surname will be either “Sorensen” or “Soerensen” in the English alphabet.
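To make that concrete, here is a minimal Python sketch of the two transliterations mentioned above. It only handles “ø”/“Ø”, and the function names are my own illustration rather than any standard API.

```python
# Minimal sketch: transliterate the Danish/Norwegian letter "ø" into the
# English alphabet in the two common ways mentioned above.

def to_english_single(text: str) -> str:
    """Transliterate "ø" to "o", e.g. Sørensen -> Sorensen."""
    return text.replace("ø", "o").replace("Ø", "O")

def to_english_digraph(text: str) -> str:
    """Transliterate "ø" to "oe", e.g. Sørensen -> Soerensen."""
    return text.replace("ø", "oe").replace("Ø", "Oe")

print(to_english_single("Sørensen"))   # Sorensen
print(to_english_digraph("Sørensen"))  # Soerensen
```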

The town part of my address is “København”. The word “København” is what we call an endonym, which is the local name for a place or a person. The opposite of an endonym is an exonym. The English exonym for “København” is “Copenhagen”, which of course only has letters from the English alphabet. The Swedish exonym for “København” is “Köpenhamn”. Here we have a variant of “Ø” being “Ö”. The letter “Ö” exists in a lot of alphabets, such as the Swedish, German, Hungarian and Turkish ones.

Usually “Ø” is transformed to “Ö” when going from Danish/Norwegian to these alphabets. Going the other way, we usually accept the letter “Ö” in Danish/Norwegian master data.
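For matching purposes, a simple character-level normalization can follow the same convention. The sketch below assumes that treating “Ö” as equivalent to “Ø” is acceptable for the data at hand; the function name is illustrative only.

```python
# Minimal sketch: treat "Ö"/"ö" as equivalent to "Ø"/"ø" (and ignore case)
# when comparing Danish/Norwegian master data with e.g. Swedish or German data.
O_VARIANTS = str.maketrans({"Ö": "Ø", "ö": "ø"})

def normalize_for_matching(text: str) -> str:
    return text.translate(O_VARIANTS).casefold()

print(normalize_for_matching("Sörensen") == normalize_for_matching("Sørensen"))  # True
```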

These issues are of course a problem area in data quality, data matching and master data management. And if there is this much complexity just between alphabets using Latin characters, there is of course much more ground to cover when including Cyrillic and Greek letters and then the other script systems with their hierarchical elements.


Superb Bad Data

When working with data and information quality we often use words such as rubbish, poor and bad when describing data that needs to be enhanced in order to achieve better data quality. However, what is bad may have been good in the context where a particular set of data originated.

Right now I am having some fun with author names.

An example of good and bad could be with an author I have used several times on this blog, namely the late fairy tale writer whose full name is:

Hans Christian Andersen

When browsing through data you will meet his name represented this way:

Andersen, Hans Christian

This representation is fit for the purpose of use when, for example, looking for a book by this author at a library, where fiction is sorted by the surname of the author.

The question is then: Do you want to have the one representation, the other representation or both?

You may also meet his name in another form in another field than the name field. For example there is a main street in Copenhagen called:

H. C. Andersens Boulevard

This is the real-world name of the street, holding a common form of the author’s name with initials only.
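As a rough illustration of how these representations relate, here is a minimal Python sketch. It assumes the simple pattern of given names followed by a single surname, which real-world names frequently break, so treat it as illustration only.

```python
# Minimal sketch: converting between the name representations discussed above.

def to_sorted_form(full_name: str) -> str:
    """'Hans Christian Andersen' -> 'Andersen, Hans Christian'."""
    *given, surname = full_name.split()
    return f"{surname}, {' '.join(given)}"

def to_natural_form(sorted_name: str) -> str:
    """'Andersen, Hans Christian' -> 'Hans Christian Andersen'."""
    surname, _, given = sorted_name.partition(", ")
    return f"{given} {surname}".strip()

def to_initials_form(full_name: str) -> str:
    """'Hans Christian Andersen' -> 'H. C. Andersen'."""
    *given, surname = full_name.split()
    return " ".join(f"{name[0]}." for name in given) + f" {surname}"

print(to_sorted_form("Hans Christian Andersen"))    # Andersen, Hans Christian
print(to_natural_form("Andersen, Hans Christian"))  # Hans Christian Andersen
print(to_initials_form("Hans Christian Andersen"))  # H. C. Andersen
```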


To be called Hamlet or Olaf – that is the question

Right now my family and I are relocating from a house in a southern suburb of Copenhagen to a flat much closer to downtown. As there is a month in between where we don’t have a place of our own, we have rented a cottage (summerhouse) north of Copenhagen not far from Kronborg Castle, which is the scene of the famous Shakespeare play Hamlet.

Therefore a data quality blog post inspired by Hamlet seems timely.

Though the feigned madness of Hamlet may be a good subject related to data quality, I will instead take a closer data matching look at the name Hamlet.

Shakespeare’s Hamlet is inspired by an old Norse legend, but to me the name Hamlet doesn’t sound very Norse.

Nor does the similar-sounding name Amleth found in the immediate source, Saxo Grammaticus.

If Saxo’s source was a written source, it may have come from Irish monks writing in the Gaelic alphabet as Amhlaoibh, where Amhl = owl, aoi = ay and bh = v, sounding just like the good old Norse name Olav or Olaf.

So, there is a possible track from Hamlet to Olaf.

Also today, fellow data quality blogger Graham Rhind published a post called Robert the Carrot about the same issue. As Graham explains, we often see how data is changed through interfaces, and after passing through many interfaces it doesn’t look at all like it did when it was first entered. There may be a good explanation for each transformation, but the end-to-end similarity is hard to guess when only comparing the first and the last version.

I have often met that challenge in data matching. An example would be if we have the following names registered at the same address:

  • Pegy Smith
  • Peggy Smith
  • Margaret Smith

A synonym-based similarity (or standardization) will find that Margaret and Peggy are duplicates.

An edit distance similarity will find that Peggy and Pegy are duplicates.

A combined similarity algorithm will find that all three names belong to a single duplicate group.
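Here is a minimal Python sketch of that idea, assuming a tiny nickname table and a similarity threshold of 0.8 that are purely illustrative; a real data matching tool would use far richer synonym lists and tuned algorithms.

```python
# Minimal sketch: combine a synonym (standardization) lookup with an
# edit-distance style similarity to group duplicate given names.
from difflib import SequenceMatcher

NICKNAMES = {"peggy": "margaret", "meg": "margaret"}  # tiny illustrative synonym list

def canonical(given_name: str) -> str:
    name = given_name.lower()
    return NICKNAMES.get(name, name)

def is_duplicate(name_a: str, name_b: str, threshold: float = 0.8) -> bool:
    a, b = canonical(name_a), canonical(name_b)
    if a == b:                                                # synonym-based match
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold   # edit-distance style match

names = ["Pegy", "Peggy", "Margaret"]
for i, x in enumerate(names):
    for y in names[i + 1:]:
        print(x, y, is_duplicate(x, y))
# Pegy Peggy True, Pegy Margaret False, Peggy Margaret True
# -> all three end up in one duplicate group via the transitive link through Peggy.
```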


A Really Bad Address

Many years ago I worked in a midsize insurance company. At that time IT made a huge change in insurance pricing, since it was now possible to differentiate prices based on a lot of factors known to the databases.

The CEO decided that our company should also make some new pricing models based on where the customer lived, since it was perceived that you were more exposed to having your car stolen and your house broken into if you lived in a big city as opposed to a quiet countryside home. But then the questions: what exactly should the prices be, and where should the borderlines go?

We, the data people, eagerly ran to the keyboard and fired up the newly purchased executive decision tool from SAS Institute. And yes, there was a different story based on postal code series, and especially downtown Copenhagen was really bad (I am from Denmark, where Copenhagen is the capital and largest city).

Curious, we examined smaller areas in downtown Copenhagen. The result: it wasn’t the crime-exposed red light district that was bad; it was addresses in the business part that hurt the most. OK, more expensive cars and belongings there, we guessed.

Narrowing down further, we were shocked. It was the street of the company itself that was really, really bad. And finally: it was a customer with the very same house number as the company that had a lot of damage claims attached.

Investigating a bit more, the case was solved. All payments made to specialists doing damage reporting all over the country were attached to a fictitious customer at the company address.

After cleansing the data the picture wasn’t that bad. Downtown Copenhagen is worse than the countryside, but not that much worse. But surprisingly the CEO didn’t use our data; he merely adopted the pricing model from the leading competitors.

I’m still wondering how those companies did their analysis. They all had their headquarters in the same business area.



Diversity in City Names

The metro area I live in is called Copenhagen – in English. The local Danish name is København. When I go across the bridge to Sweden, the road signs point back at the Swedish variant of the name, Köpenhamn. When the new bridge from Germany to eastern Denmark is finished, the road signs on the German side will point at Kopenhagen. A flight from Paris has the destination Copenhague. From Rome it is Copenaghen. The Latin name is Hafnia.

These language variants of city (and other) names are a challenge in data matching.

If a human is doing the matching, the match may be made because that person knows about the language variations. This is a strength in human processing. But it is also a weakness in human processing if another person doesn’t know about the variations, because the matching then becomes inconsistent by not repeating the same results.

Computerized match processing may handle the challenge in different ways, including:

  • The data model may reflect the real world by having places described by multiple names in given languages.
  • Some data matching solutions use synonym listing for this challenge (see the sketch after this list).
  • Probabilistic learning is another way. The computer finds a similarity between two sets of data describing an entity but with a varying place name. A human may confirm the connection, and the varying place names will then be included in the next automated match.
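Here is a minimal Python sketch of the synonym-list approach, including how a human-confirmed match could feed back into the list. The dictionary content and function names are illustrative assumptions, not taken from any particular tool.

```python
# Minimal sketch: match place names across language variants via a synonym list.
CITY_SYNONYMS = {
    "københavn": "Copenhagen",
    "copenhagen": "Copenhagen",
    "köpenhamn": "Copenhagen",
    "kopenhagen": "Copenhagen",
    "copenhague": "Copenhagen",
    "copenaghen": "Copenhagen",
    "hafnia": "Copenhagen",
}

def canonical_city(name: str) -> str:
    """Return the canonical (English) name if known, otherwise the input unchanged."""
    return CITY_SYNONYMS.get(name.strip().lower(), name.strip())

def cities_match(a: str, b: str) -> bool:
    return canonical_city(a).lower() == canonical_city(b).lower()

def confirm_match(variant: str, canonical: str) -> None:
    """A human-confirmed pairing is added, so the next automated run matches it too."""
    CITY_SYNONYMS[variant.strip().lower()] = canonical

print(cities_match("København", "Copenhague"))  # True
```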

As globalization moves forward, data matching solutions have to deal with diversity in data. A solution may have worked wonders yesterday with domestic data but will be useless tomorrow with international data.


2010 predictions

Today this blog has been live for half a year, Christmas is just around the corner in countries with Christian cultural roots, and a new year – even a new decade – is closing in according to the Gregorian calendar.

It’s time for my 2010 predictions.

Football

Over at the Informatica blog, Chris Boorman and Joe McKendrick are discussing who’s going to win next year’s largest sporting event: the football (soccer) World Cup. I don’t think England, the USA, Germany (or my team Denmark) will make it. Brazil takes a co-favorite victory – and home team South Africa will go to the semi-finals.

Climate

Brazil and South Africa also had main roles in the recent Climate Summit in my hometown Copenhagen. Despite heavy executive buy-in, a very weak deal with no operational Key Performance Indicators was reached here. Money was on the table – but assigned to reactive approaches.

Our hope for avoiding climate catastrophes is now related to national responsibility and technological improvements.

Data Quality

Reactive approach, lack of enterprise wide responsibility and reliance on technological improvements are also well known circumstances in the realm of data quality.

I think we will have to deal with this next year as well. We have to get better at working under these conditions. That means being able to perform reactive projects faster and better while also implementing prevention upstream. Aligning people, processes and technology is as key as ever in doing that.

Some areas where we will see improvements will in my eyes be:

  • Exploiting rich external reference data
  • International capabilities
  • Service orientation
  • Small business support
  • Human like technology

The page Data Quality 2.0 has more content on these topics.

Merry Christmas and a Happy New Year.


Data Quality and Climate Change Management

A month ago I made a blog post titled “Data Quality and climate politics”. In that post I highlighted some similarities between data governance / data quality and climate politics, mainly focusing on why sometimes nothing is done.

Today, one day before the United Nations climate change summit commences in my hometown Copenhagen, it seems that executive buy-in has come through. Over 100 heads of state and government will attend the conference, among them key stakeholders such as Indian prime minister Singh and US president Obama.

The plan for how to manage climate change seems at this moment to have some ingredients with similarities to how to manage data quality change.  

The bill

Related to my previous post Eugene Desyatnik commented on LinkedIn:

In both cases, everyone in their heart agrees it’s a noble cause, and sees how they can benefit — but in both cases, everyone also hopes someone else will pay for most of it.

Progress in fighting climate change seems to be closely related to whether the rich countries can agree on paying a fair share.

With enterprise data quality you also can’t rely on one business unit paying for solving all enterprise-wide data quality issues related to common data domains.

Key Performance Indicators

Reductions in greenhouse gas emissions are key performance indicators and goals in fighting climate change – measuring temperatures is more like looking at the final outcome.

For data quality we also know that the business outcome is related to information in context, but in order to track improvement we have to measure (raw) data quality at the root.

Using technology

This article from the BBC, “Tackling climate change with technology”, points at a wealth of different technologies that may help fight global warming while we still get the power we need. There are pros and cons for each. Some technologies work in some geographies but not elsewhere. Some technologies are mature now and some will be in the future. There is no silver bullet but a range of different possibilities.

Very similar to data quality technology.