The Data Quality Cuisine

Analogies between making and serving good food and improving data and information quality are a recurring topic on this blog. Just as good food is a subjective matter, so is good information, though anyone tasked with preparing either knows that fresh and clean raw materials / data are a must, as explained in the post Bon Appetit.

Food preferences and data and information preferences differ around the world. Highly esteemed local dishes from one country may not have the same traction in other parts of the world. As discussed in the post Data Quality and World Food, the same is true for data and information quality.

The post Metadata Meatballs examines how the same diversity applies to metadata.

Sometimes you can’t trust data even if the data is captured correctly. If you, for example, ask people about their food consumption habits, they tend to give answers with some distance from reality. That calls for a Survey Data Laundering.

Estimating the return on investment for improving data quality has always been hard. The post Miracle Food for Thought is about how that resembles the way following “good” advice about what you should eat and drink isn’t as simple as often stated.

Anyway, we all know that better food and better service in a restaurant do create more business, and sometimes we have to put the restaurant and the information bistro Under new Master Data Management.

And finally, tomorrow this blog turns two years old. That calls for a Birthday Party in the cloud.


What’s best: Safe or sorry?

As I have now moved much closer to downtown, I have also changed my car accordingly: two months ago I squeezed myself into a brand new city car, the Fiat Nuova Cinquecento.

(Un)fortunately the car dealer’s service department called the other day and said a part of the motor had to be replaced because there could be a problem with that part. The manufacturer must have calculated that it’s cheaper (and maybe a better customer experience) to be proactive rather than reactive and deal with the problem if it should occur with my car later.

(Un)fortunately that’s not the way we usually do it with possible data problems. So, back to work again. Someone’s direct marketing data just crashed in the middle of a campaign.    


B2C versus B2B Data Quality

The data quality issues in doing business with private consumers (business-to-consumer = B2C) and doing business with other businesses (business-to-business = B2B) share a lot of challenges but also differ in a lot of ways.

Some of my experiences (and thoughts) related to different master data domains are:

Customer master data

In B2C the number of customers, prospects and leads is usually high and characterized by relatively few interactions with each entity.  In B2B you usually have a relatively small number of customers with a high number of interactions.

One of the most automated activities in data quality improvement is matching master data records with information about customers. Many of the examples we see in marketing material, research documents, blog posts and so on are about matching in the B2C realm. This is natural, since the high number of records, typically with a low attached value, calls for automation.

Data matching in the B2B realm is indeed more complex due to numerous challenges, like less standardized company names and typically more options in what constitutes a single customer. The high value attached to each customer also makes the risk of mistakes a showstopper for too much automation.
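A tiny sketch of one of those challenges: the same company may be recorded with or without its legal form. The legal-form list below and the similarity helper are illustrative assumptions, not a production matcher:

```python
import difflib
import re

# Hypothetical list of legal forms to strip before comparison (illustrative only).
LEGAL_FORMS = {"inc", "incorporated", "ltd", "limited", "llc", "gmbh", "a/s", "corp"}

def normalize(name: str) -> str:
    # Lowercase, tokenize, and drop legal-form tokens.
    tokens = re.findall(r"[a-z0-9/]+", name.lower())
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)

def similarity(a: str, b: str) -> float:
    # Fuzzy ratio between the normalized names (0.0 to 1.0).
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

print(similarity("ACME Inc.", "Acme Incorporated"))  # 1.0 after normalization
print(similarity("ACME Inc.", "Acme Holdings Ltd"))  # well below 1.0
```

Even this toy version shows why B2B matching resists full automation: whether "Acme Inc." and "Acme Holdings Ltd" are the same customer is a business decision, not just a string comparison.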

So in B2B we see an increasing adoption of workflows that ensure data quality during data capture, often by exploiting external reference data, which in general is also more available for business entities.

Location master data

The location of B2C customers means a lot. Accurate and timely delivery addresses for everything from direct mail to bringing goods to the premises are essential. Location data are used to recognize household relations, assign demographic stereotypes and in many cases calculate fees of different kinds. I had a near disaster experience with a really bad address early in my career.

Even though location data for B2B activities is theoretically just as important, I have often seen that a little less precision is fit for purpose, or at least is given lower priority than more pressing issues.

Product master data

Theoretically there should be no difference between B2C and B2B here, but I guess there is in practice?

The most interesting aspect is probably the multi-domain aspect examining the relations between customers and products.   

I had some experience some years ago in the B2B realm, as described in the post What is Multi-Domain MDM?: 1,000 B2B customers buying 1,000 different finished products can be a quite complicated data quality operation.

Within the B2C realm the most predominant multi-domain data quality issues I have met are related to analytics. As discussed in the post Customer/Product Matrix Management, it is about typifying your customers correctly and categorizing your products adequately at the same time.


A geek about Greek

This ninth Data Quality World Tour blog post is about Greece, a favorite travel destination of mine and the place of origin of so many terms and thoughts in today’s civilization.

Super senior citizens

Today Greece has a problem with keeping records of its citizens. A recent data profiling activity has exposed that over 9,000 Greeks receiving pensions are over 100 years old. It is assumed that relatives have missed reporting the deaths of these people and therefore are taking care of the continuing stream of euros. News link here.

Diverse dimensions

I found this good advice for you when going to Greece today:

Timeliness: When coming to dinner, arriving 30 minutes late is considered punctual.

Accuracy:  Under no circumstances should you publicly question someone’s statements.

Uniqueness: Meetings are often interrupted. Several people may speak at the same time.

(We all have some Greek in us I guess).

Previous Data Quality World Tour blog posts:

Don’t confuse me with facts of life

As humans we like to know simple facts. As with weather forecasts, we like to know exactly what temperature it’s going to be, if the sun will be shining or if it’s going to rain, and sometimes also the wind speed and direction for a given place and time in the future.

Meteorologists have struggled for ages to tell us about that. A traditional weather forecast will tell us the best guess for these few key indicators.

Many people today, including me, don’t really rely on the weather to do our work. But we may plan when to work, how to get to work and what to do besides work depending on the weather forecast.

So I usually study the weather forecast. Lately I have noticed that the Danish Meteorological Institute has experimented with how to visualize to the common people that the weather forecast is a best guess. So for example, instead of having single colored blue piles indicating how much rain to expect, they now have the option of blue piles in lighter or darker shades indicating the risk (or chance if you like) of rain.

Better data quality? I think so. Less confusing? I think not. It could rain anytime. But it probably won’t.



Data Diversity

As part of my work I deal with data from different countries. In the figure below I have put some examples of different presentations of the same data from the countries I meet the most: Denmark (DK), Germany (DE), France (FR), the United States (US) and the United Kingdom (GB):

 
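The figure itself isn’t reproduced here, but the date attribute alone illustrates the point. These format strings are my assumptions about typical national conventions, not a copy of the figure:

```python
from datetime import date

d = date(2011, 6, 5)

# Assumed typical short-date conventions per country (illustrative only).
FORMATS = {
    "DK": "%d-%m-%Y",   # 05-06-2011
    "DE": "%d.%m.%Y",   # 05.06.2011
    "FR": "%d/%m/%Y",   # 05/06/2011
    "US": "%m/%d/%Y",   # 06/05/2011
    "GB": "%d/%m/%Y",   # 05/06/2011
}

for country, fmt in FORMATS.items():
    print(country, d.strftime(fmt))
```

Note that the US and GB renderings use the same separator but swap day and month, so "05/06/2011" is June 5th in London and May 6th in New York — the same string, two different real-world dates.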

I have some more information on the issues regarding the different attributes:


New Eyes on Iceland

This eighth Data Quality World Tour blog post is about Iceland.

Patronymics

Rather than using family names, Icelanders use patronymics. This means that the first Icelandic President, Sveinn Björnsson, must have been the son of Björn, and I guess current Prime Minister Jóhanna Sigurðardóttir is the daughter of Sigurður. This must create some havoc for well proven algorithms for finding households. (Add to that that the Prime Minister is in a same-sex marriage.)

Volcanoes

In the good old days air traffic wasn’t concerned with the recurring volcanic eruptions on Iceland. Today they seem to be a repeating cause of travel havoc. A bit like poor data quality wasn’t taken seriously in the good old days, but today dirty data creates havoc in business intelligence implementations.

Previous Data Quality World Tour blog posts:

How long is a Marathon?

Many large cities around the world have a yearly marathon event. Today it’s Copenhagen (and possibly other cities too).

The marathon distance today is 42,195 kilometers (if I use a comma as decimal separator), which corresponds to 26 miles and 385 yards, or 26.22 miles (if I use a dot as decimal separator).

So even if we agree about the distance today, we might represent that distance in various ways. The distance has however varied during history, as seen in the table with the lengths of the Olympic marathons.
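The representation problem above can be sketched in a few lines. The `fmt` helper is a hypothetical illustration of decimal separator conventions, not a real localization library:

```python
KM = 42.195
MILES = KM / 1.609344  # kilometers per mile; gives roughly 26.22

def fmt(value: float, decimal_sep: str) -> str:
    # Render with three decimals, then swap in the local decimal separator.
    return f"{value:.3f}".replace(".", decimal_sep)

print(fmt(KM, ","))      # "42,195" - comma as decimal separator (e.g. Danish)
print(fmt(KM, "."))      # "42.195" - dot as decimal separator (e.g. US/UK)
print(round(MILES, 2))   # 26.22
```

The danger is obvious: "42,195" parsed with US conventions reads as forty-two thousand kilometers, not forty-two, which is why decimal separator metadata matters when data crosses borders.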

What about real world alignment?

Well, if the Greek runner called Pheidippides (sometimes spelled Phidippides or Philippides) took the long but flat Southern route from Marathon to Athens it would have been around 42 kilometers. If he took the shorter but steeper Northern route it would only have been around 35 kilometers.

What about me? Oh, I’ll go for 42,195 kilometers – on the bike.   


No NOT NULL

A basic way of ensuring data quality in a database is to define that a certain attribute must be filled in. This is done by specifying that the value “null” isn’t allowed, or as said in SQL’ish: setting the NOT NULL constraint.

A common data quality issue is that such constraints almost always are too rigid.

In my last post, called Notes about the North Pole, it was discussed that every place on earth has a latitude and a longitude, except that the North Pole – and the South Pole – doesn’t have a longitude. So if you have a table with geocodes, you can’t set NOT NULL for the longitude if you (though very unlikely) should store the coordinates for the poles. Alternatively you could store 0 for the longitude to make it complete – but then it would be very inaccurate. 360 degrees inaccurate, so to speak.
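A minimal sketch of the schema decision, using an in-memory SQLite database (the table and place names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE geocode (
        place     TEXT NOT NULL,
        latitude  REAL NOT NULL,
        longitude REAL          -- deliberately nullable: undefined at the poles
    )
""")
# The North Pole can be stored honestly, with no longitude at all,
# instead of a misleadingly precise 0.
conn.execute("INSERT INTO geocode VALUES ('North Pole', 90.0, NULL)")
conn.execute("INSERT INTO geocode VALUES ('Copenhagen', 55.676, 12.568)")

for row in conn.execute("SELECT place, latitude, longitude FROM geocode"):
    print(row)
```

The trade-off is that every consumer of the table now has to handle a null longitude, so the relaxation should be documented rather than silently assumed.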

Another infrequent example from this blog is that every person in my country has a given (first) name and a family (last) name. But there are a few Royal Exceptions. So, no NOT NULL for the family name.

Related to people and places there are plenty of more frequent examples. If you only expect addresses from the United States, Australia or India, setting NOT NULL for the state attribute seems wise. But expect foolish values in there when you get addresses from most other parts of the world. So, no NOT NULL for the state.

A common variant of the mandatory state value is when you register for data quality webinars, white papers and so on. Most often you must select from a value list containing the United States of America – in some cases also mixed in with Canadian provinces. The NULL option to be used by foreigners may hide as “Not Applicable” way down the list among states beginning with N.

I usually select Alaska, which is among the first states in alphabetical order – which also brings me back close to the North Pole, making my data close to 360 degrees inaccurate.


Notes about the North Pole

This is the seventh post in a series of short blog posts focusing on data quality related to different countries around the world. However, today we will be at a place not belonging to any country (so far) and only reachable on foot because it is in the middle of an ocean covered by ice (so far).

Who lives on the North Pole?

Obviously no one – except that according to tradition in some Western countries the North Pole is described as the residence of Santa Claus. Actually Canada Post has assigned the postal code “H0H 0H0” to the North Pole. So it’s a good data quality question whether “H0H 0H0” is a valid Canadian postal code.
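At the format level at least, the question can be checked. To my knowledge Canadian postal codes follow a letter-digit-letter space digit-letter-digit pattern, where the letters D, F, I, O, Q and U are never used and W and Z don’t appear in the first position; this sketch validates only the format, not whether a code is actually assigned:

```python
import re

# Format-level pattern for Canadian postal codes (assumed rules, see lead-in).
PATTERN = re.compile(r"^[ABCEGHJ-NPRSTVXY]\d[ABCEGHJ-NPRSTV-Z] \d[ABCEGHJ-NPRSTV-Z]\d$")

def looks_like_canadian_postal_code(code: str) -> bool:
    return PATTERN.match(code.upper()) is not None

print(looks_like_canadian_postal_code("H0H 0H0"))  # True - format-valid
print(looks_like_canadian_postal_code("D0H 0H0"))  # False - D is never used
print(looks_like_canadian_postal_code("90210"))    # False - US ZIP shape
```

So "H0H 0H0" passes the format test, which neatly separates the two data quality questions: valid format versus valid real-world assignment.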

Also, Santa Claus may have several other residences, as the Finns claim the correct address is “Santa Claus Village, FIN-96930 Arctic Circle, Finland” and in Denmark we believe the correct address of Santa Claus to be “Box 1615, DK-3900 Nuuk, Greenland”.

If you are interested in identity resolution covering multiple countries, there is a discussion going on in the LinkedIn Data Matching Group.

Where is the North Pole?

The latitude is 90° – but there is no longitude. So if you don’t accept null in the longitude attribute of your geocodes, you might get a data quality issue when Santa Claus becomes a customer and you believe Canada Post is the only single version of the truth.

Previous Data Quality World Tour blog posts: