Postal Code Musings

When working with master data management and data quality, including data matching, one of the most frequent pieces of information you work with is a postal code.

Wikipedia has a good article about postal codes.

Some of the data quality issues related to the datum postal code are:

Metadata

Over the world different words are used for a postal code:

  • ZIP code, the United States implementation of a postal code, is often used synonymously for a postal code in many databases and user interfaces. This is not seriously wrong, but not right either.
  • In India a postal code (in English) is called a PIN Code (Postal Index Number). This could definitely trick me.

Format

There are basically two different formats of postal codes around:

  • Numeric postal codes are the most common ones. The number of digits does, however, differ between countries, and there may be some additional considerations:
    • For example, the 9-digit United States ZIP code is split into the original 5 digits and the additional 4 digits implemented later.
    • Postal codes may begin with 0, which may create formatting errors when they are treated as numeric.
  • Some countries, for example the United Kingdom, the Netherlands, Canada and Argentina, have alphanumeric postal codes.
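The leading zero issue can be sketched in a few lines of Python. This is only a minimal illustration, assuming that the expected width per country is known from elsewhere; the function names `normalize_numeric_postal_code` and `split_zip_plus_four` are made up for this example.

```python
def normalize_numeric_postal_code(value, width):
    """Return a numeric postal code as a zero-padded string.

    `width` is the number of digits the country uses; it is assumed
    to be known from reference data and is not looked up here.
    """
    digits = str(value).strip()
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a numeric postal code: {value!r}")
    return digits.zfill(width)


def split_zip_plus_four(zip_code):
    """Split a US ZIP code into the original 5 digits and the optional +4 part."""
    digits = "".join(ch for ch in zip_code if ch.isdigit())
    if len(digits) == 5:
        return digits, None
    if len(digits) == 9:
        return digits[:5], digits[5:]
    raise ValueError(f"unexpected number of digits in {zip_code!r}")


# A code stored as the number 2110 has lost its leading zero.
print(normalize_numeric_postal_code(2110, 5))
print(split_zip_plus_four("02110-1234"))
```

The safer design choice is of course never to store a postal code as a number in the first place, so that padding is only needed when repairing data that has already been damaged.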

Embedded Information

Numeric postal codes usually form some kind of hierarchy in which you can guess the geographical position within the country and make ranges representing smaller or larger geographical areas. But you never know.

This also goes for Dutch (you know, the ones in the Netherlands) postal codes as the first 4 characters are numeric.

The UK postal codes usually start with a mnemonic of the main city in the area, except in a lot of cases.

Precision

Some postal code systems have postal codes covering larger areas with many streets and some postal code systems are very granular where each street, or part of a street, has a distinct postal code.

The UK postal code system is very granular, which has paved the way for using rapid addressing, as told in a recent article in the UK Database Marketing Magazine.

Coverage

Utilizing rapid addressing requires that reference data for postal codes practically covers every spot in the country and updates are available on a near real time basis.

Some countries have postal code systems that don’t cover every corner, and some countries don’t have a postal code system at all.

Uniqueness

The main reason for implementing postal code systems is that a town or city name in many cases isn’t unique within a country.

But that doesn’t mean that uniqueness works the other way as well. A postal code may in many countries cover several town names. France is an example.
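A simple way to check uniqueness in both directions is to group the pairs of postal code and town name. The sketch below uses made-up placeholder rows, not real reference data, and the helper name `one_to_many` is just an illustration.

```python
from collections import defaultdict


def one_to_many(pairs):
    """Return the keys that are paired with more than one distinct value."""
    grouped = defaultdict(set)
    for key, value in pairs:
        grouped[key].add(value)
    return {key: sorted(values) for key, values in grouped.items() if len(values) > 1}


# Placeholder (postal code, town name) rows, invented for the example.
rows = [
    ("1000", "Town A"),
    ("1000", "Town B"),
    ("2000", "Town C"),
    ("3000", "Town C"),
]

codes_with_many_towns = one_to_many(rows)
towns_with_many_codes = one_to_many((town, code) for code, town in rows)

print(codes_with_many_towns)
print(towns_with_many_codes)
```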

Consistency

While we basically have granular and not so granular postal code systems we of course also have hybrids.

In Denmark, for example, there is a granular system in the capital Copenhagen with a postal code for each street, named after the street, and a system in the rest of the country with a postal code for an area named after the suburb or town.

Fit for purpose

A postal code is a hierarchical element in a postal address. We basically have two forms of postal addresses:

  • A geographical address, where the postal address including the postal code points to a place you can also visit to meet the people receiving the things sent there
  • A post-office box, which may have more or less geographical connection to where the people receiving the things sent there are

Penetration of post-office boxes differs around the world. In Namibia it is mandatory. In Sweden most companies have a post-office box address.

Trying to compare data with these different concepts is like comparing apples and oranges, which often goes bananas.

Is the Holiday Season called Christmas Time or Yuletide?

In English we have these two different terms for the coming holiday season: Christmas Time or yuletide. Christmas Time has a religious touch while yuletide is old English and resembles the term juletid still used in Scandinavia. Also notice that Christmas Time is two words (unless written as Christmastime) while yuletide is a compound word, as is common in Germanic languages. And oh, Christmas Time must be written with upper case first letters while yuletide doesn’t have to be (unless maybe in a blog post title). I still struggle a lot with English grammar.

The holiday season may be seen as a religious celebration or, which I think has become prevailing, a special occasion for business. Yuletide is high activity in Business-to-Consumer (B2C) both for brick and mortar shops and for eCommerce, while Christmas Time is almost a standstill for Business-to-Business (B2B) as no one is able to make any decisions because it is the holiday season.

By the way: The only thing I wish for xmas is that people start to standardize on the terms used for the same concept. Not least at Christmastide it is so disturbing when we don’t have any form of standardisation.

Star Bucks

Occasionally there are stories in the press about how multinational companies don’t pay taxes according to where they earn their money.

Lately there has been a row in the UK over Starbucks which, despite being very successful, is officially losing money in the UK and therefore doesn’t pay taxes in the UK. The Guardian’s latest entry on that is here.

The Guardian article quotes a call for more international co-operation.

I wonder if that will be done, as we can’t even agree on such simple concepts as:

  • Having the same format for a date across the globe: Today is 13/12/2012 in most parts of the world but 12/13/2012 in the United States.
  • Using comma or period as decimal mark. I have said that 1,731 times in the UK and 1.731 times when I lived in Denmark.
  • Agreeing on whether a house number comes before or after the street name.

And many, many more fundamental things about presenting data.
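The date example can be made concrete with a small Python sketch. It simply tries both conventions on the same text; the labels "day-first" and "month-first" and the function name `possible_dates` are only chosen for this illustration.

```python
from datetime import date, datetime

# The two competing conventions for writing a date with slashes.
FORMATS = {"day-first": "%d/%m/%Y", "month-first": "%m/%d/%Y"}


def possible_dates(text):
    """Return every valid reading of a slash-separated date."""
    readings = {}
    for label, fmt in FORMATS.items():
        try:
            readings[label] = datetime.strptime(text, fmt).date()
        except ValueError:
            # Not a valid date under this convention, e.g. month 13.
            pass
    return readings


print(possible_dates("13/12/2012"))  # only one valid reading
print(possible_dates("05/06/2012"))  # two valid readings, two different days
```

A date with a day number above 12 gives itself away. The real trouble is with dates like 05/06/2012, which are valid under both conventions, so the format has to be known from the context of the data.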

Some Kinds of Reference Data

The term “reference data” and the related Reference Data Management (RDM) are commonly used in the data quality and Master Data Management (MDM) realm.

As with most terms it may be used with slightly different meanings. Usually, but not necessarily always, reference data are core data entities defined outside a given organization.

I have come across the kinds of reference data discussed below:

Reference Data in Investment Banking

The term “reference data” is well established in investment banking. Reference data are core master data entities such as counterparties, securities and currencies. These are the things you deal with in investment banking. They are not made up for a given bank or other single financial institution but are shared across the whole market and should optimally be the same for every institution at exactly the same point in time.

Small Reference Data

In Master Data Management in general we usually see reference data as value lists that help describe and standardize internal master data.

One example is a country list. A list of countries should be the same for every organization in the world. However, the available lists do differ, though most variations usually don’t have any business impact, such as the academic question of whether Antarctica should be in the list or not.

A list of codes describing to which industry a given company belongs is another example of reference data. As examined in the post What are they doing? you may choose to standardize on SIC codes or standardise on NACE codes or develop your own set of codes for that purpose.

Big Reference Data

In geography a country list is in the top levels of defining locations. Deeper down we may have postal code systems within each country, such as ZIP codes in the United States, PLZ codes in Germany and PIN codes in India. Yet deeper down we have every single valid postal address, eventually all over the world. This is what I call big reference data.

A way of sourcing industry codes for your customers, suppliers and other business partners is picking from or enriching from a business directory, for example the D&B WorldBase or any other of the many business directories around. Such directories may also be seen as big reference data.

With the dramatic increase in the use of social media, the related social network profiles have emerged as a new kind of big reference data serving as links to our internal master data.

Hotel Rating Data Quality

Whether you are traveling for business or pleasure you like to stay in a hotel that suits your expectations.

What is good and what is bad differs between us individuals. But we may all fit some kind of stereotype depending on where in the world we are from. For example, if I walk into an even modestly rated American-run (managed) hotel anywhere in the world, I am pretty sure that there will be a bed much larger than I actually need. In a locally run hotel I’m not so sure.

The most commonly used hotel rating methodology is the one to five star rating system. However, the classification criteria are not universal. They differ from country to country. Some countries have a publicly regulated system, in some countries the industry sets the standards and in some countries there are competing systems.

So, I can’t be sure that three stars in one country means the same as three stars in another country. One of my foremost personal requirements is that WiFi is available. In the Swiss criteria that will be only 2 out of 863 possible points. So I couldn’t be sure even at a five star hotel. Using the English criteria I will have to go for a four star hotel to be sure.

Besides official ratings, social ratings have become more and more popular. Typically guests rate the hotels on the portal where they booked using a scale from 1 to 10, and you may add verbal descriptions of the appealing things and, even more popular, the appalling things.

Naming the Olympians

The British newspaper The Guardian has a feature on their website where you can get data about the Olympians. Link here: London 2012 Olympic athletes: the full list.

Browsing the list is a good reminder of the world-wide diversity we have with person names.

The names are here formatted with the surname(s) followed by the given name(s). The surname is in upper case.

For the Chinese and other East Asian Olympians this sequence of names is the one they are used to, as opposed to Olympians from places where the first name is the given name and the last name is the surname.

Having the surname in upper case also shows where Olympians have two surnames, as is the custom in Spanish cultures.

And oh yes. The South African guy has JIM as his surname.

Finally, from this screen shot there is a good question. Is JIANG Wenwen superb at both synchronized swimming and track cycling, or is it two different Olympians with the same name? Some names are very common in China. A little googling tells me it is two different persons. The synchronized swimmer is more related to her twin sister and swimming partner JIANG Tingting.

Let’s check if there is more than one “John Smith”.

Nope.

But it could be fun if “Kim Smith” and “Kimberley Smith” came from the same country.

Many Olympians’ names are actually not written the way they appear in this sheet, as many have names in a different alphabet or script system.

The Danish cycling rider “SORENSEN Nicki” actually shares my last name, as we know him as “Nicki Sørensen”. The Serbian, Ukrainian and Russian Olympians have their original names in the Cyrillic alphabet, but they have been transliterated to the English alphabet, and Olympians from countries with script systems other than an alphabet have had their names go through a transcription to the (English) alphabet.
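How “Sørensen” ends up as “SORENSEN” can be sketched in Python. Removing accents by Unicode decomposition works for a letter like ü, but Ø has no decomposition, so it needs an explicit mapping. The mapping table below is a small, assumed and far from complete example, and the function name `to_basic_latin` is made up for the illustration.

```python
import unicodedata

# Letters that Unicode does not decompose into base letter + accent.
# A deliberately tiny, illustrative table; real rules differ by country.
EXTRA_MAPPINGS = {"Ø": "O", "ø": "o", "Æ": "AE", "æ": "ae"}


def to_basic_latin(name):
    """Reduce a name in the Latin alphabet to unaccented English letters."""
    mapped = "".join(EXTRA_MAPPINGS.get(ch, ch) for ch in name)
    decomposed = unicodedata.normalize("NFKD", mapped)
    # Drop the combining marks (accents) left behind by the decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(to_basic_latin("Nicki Sørensen"))
print(to_basic_latin("Jürgen"))
```

Note that this only goes one way. Once the name is stored as “Sorensen” there is no rule that tells you whether the original was “Sørensen” or “Sorensen”.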

So, is the list bad data quality?

The Big Tower of Babel

Three years ago one of the first blog posts on this blog was called The Tower of Babel.

This post was the first of many posts about multi-cultural challenges in data quality improvement. These challenges include not only language variations but also different character sets reflecting different alphabets and script systems, naming traditions, address formats, measure units, privacy norms and government registration practice, to name some of the ones I have experienced.

When organizations are working internationally it may be tempting to build a new Tower of Babel imposing the same language for metadata (probably English) and the same standards for names, addresses and other master data (probably the ones of the country where the headquarters is).

However, building such a high tower may end up the same way as the Tower of Babel known from the old religious tales.

Alternatively, a mapping approach may be technically a bit more complex but much easier when it comes to change management.

The mapping approach is used in the Universal Postal Union’s (UPU) attempt to make a “standard” for worldwide addresses. The UPU S42 standard is mentioned in the post Down the Street. The S42 standard does not impose the same way of writing on envelopes all over the world, but facilitates mapping the existing ways into a common tagging mapped to a common structure.

Building such a mapping based “standard” for addresses, and other master data with international diversity, in your organization may be a very good way to balance the need for standardization against the risks in change management, including having trusted and actionable master data.

The principle of embracing and mapping international diversity is a core element in the service I’m currently working with. It’s not that the instant Data Quality service doesn’t stretch into the clouds. Certainly it is a cloud service pulling data quality from the cloud. It’s not that it isn’t big. Certainly it is based on big reference data.

The Cases for UPPER CASE in Data Management

I remember some years ago when I started SMS’ing I had an old mobile phone that defaulted the text to upper case. After a while my son answered back: “Why are you always yelling at me in SMSes?”

So I learned that you can use lower case in SMSes as well, and that using only all caps in SMSes, as in any other writing, usually means that YOU ARE YELLING.

Examining a text for upper case use can, together with polarity classifiers and all that jazz, be used today in sentiment analysis for example within social media data.

In data parsing, words in upper case in person names may tell you something too. Especially in France it is common to indicate a surname with only upper case characters, so for example in the name “AUGUST Michel” the first name is the surname and the last name is the given name.

When matching company names a word in upper case may indicate an abbreviation. So “THE Ltd” and “The Happy Entrepreneur Ltd” may be a good match despite a horrible edit distance.
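The abbreviation idea can be sketched as a check that runs alongside the edit distance. This is a minimal illustration and not a full matching routine; the list of legal forms is an assumed, incomplete sample and the function names are made up for the example.

```python
# Assumed, incomplete sample of legal forms to ignore when comparing names.
LEGAL_FORMS = {"ltd", "inc", "plc", "llc"}


def significant_words(name):
    """Split a company name into words, leaving out the legal form."""
    words = name.replace(".", " ").replace(",", " ").split()
    return [w for w in words if w.lower() not in LEGAL_FORMS]


def may_be_abbreviation_of(short_name, long_name):
    """True if the single all-caps word in short_name spells the initials of long_name."""
    short_words = significant_words(short_name)
    long_words = significant_words(long_name)
    if len(short_words) != 1 or len(long_words) < 2:
        return False
    candidate = short_words[0]
    if not candidate.isupper():
        # Only a word in upper case is taken as a hint of an abbreviation.
        return False
    initials = "".join(w[0].upper() for w in long_words)
    return candidate == initials


print(may_be_abbreviation_of("THE Ltd", "The Happy Entrepreneur Ltd"))
print(may_be_abbreviation_of("The Ltd", "The Happy Entrepreneur Ltd"))
```

A positive result here should only raise the match score, as plenty of unrelated companies share the same initials.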

In data migration, when handling names from older systems where all caps have been used, it is common to try to make better looking names. “JOHN SMITH” will be “John Smith” and “SAM MCCLOUD” should be “Sam McCloud”. In environments with alphabets other than English, national characters may be reintroduced as well. For example in a German context “JURGEN VON LOW” may come out as “Jürgen von Löw”.
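A rough Python sketch of such a clean-up could look like this. The particle list and the lookup of known spellings are small, assumed samples for the illustration; in practice the lookup would come from reference data for the country in question, and the function names are made up.

```python
# Assumed sample of name particles that stay in lower case.
LOWERCASE_PARTICLES = {"von", "van", "de", "der"}

# Hypothetical lookup for a German context; restores national characters.
KNOWN_SPELLINGS = {"jurgen": "Jürgen", "low": "Löw"}


def prettify_word(word):
    """Convert one all-caps name part to a better looking form."""
    lower = word.lower()
    if lower in KNOWN_SPELLINGS:
        return KNOWN_SPELLINGS[lower]
    if lower in LOWERCASE_PARTICLES:
        return lower
    if lower.startswith("mc") and len(lower) > 2:
        # MCCLOUD -> McCloud
        return "Mc" + lower[2:].capitalize()
    return lower.capitalize()


def prettify_name(name):
    """Apply prettify_word to every part of a full name."""
    return " ".join(prettify_word(w) for w in name.split())


print(prettify_name("SAM MCCLOUD"))
print(prettify_name("JURGEN VON LOW"))
```

The lookup is the risky part. Reintroducing ü and ö is a guess unless the context really is German, so it should depend on the country of the record.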

What about you? Have you stumbled upon some fun with upper case in data management?

Finding Me

Many people have many names and addresses. So do I.

A search for me within Danish reference sources in the iDQ tool gives the following result:

Green T is positive in the Danish Telephone Books. Red C is negative in the Danish Citizen hub. Green C is positive in the Danish Citizen Hub.

Even though I have left Denmark I’m still registered with some phone subscriptions there. And my phone company hasn’t fully achieved single customer view yet, as I’m registered there with two slightly different middle (sur)names.

Following me to the United Kingdom I’m registered here with more different names.

It’s not that I’m attempting some kind of fraud, but as my surname contains The Letter Ø, and that letter isn’t part of the English alphabet, my National Insurance Number (kind of similar to the Social Security Number in the US) is registered by the name “Henrik Liliendahl Sorensen”.

But as the United Kingdom doesn’t have a single citizen view, I am separately registered at the National Health Service with the name “Henrik Sorensen”. This is due to a sloppy realtor, who omitted my middle (sur)name on a flat rental contract. That name was taken further by British Gas onto my electricity bill. That document is (surprisingly for me) my most important identity paper in the UK, and it was used as proof of address when registering for health service.

How about you, do you also have several identities?

Your Point, My Comma

Spam mails can be great food for thought.

This morning I had this one in one of my many mailboxes:

So, the amount in question was:

It’s interesting to see how the spammer used points and commas in the large amount of money he wanted to trick me with. I don’t know if he was sloppy or had the problem of showing an amount to an unsegmented worldwide audience of people who are:

  • Using point as decimal mark and comma as thousand separator
  • Using comma as decimal mark and point as thousand separator

The use of a sign for decimal mark and thousand separators is indeed divided across the globe as seen on this map:

The blue countries are using point as decimal mark and comma as thousand separator and the green countries are doing the opposite.
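Reading an amount correctly therefore requires knowing which convention the writer used. A minimal Python sketch, where the decimal mark is an explicit assumption passed in by the caller and the function name `parse_amount` is made up for the example:

```python
from decimal import Decimal


def parse_amount(text, decimal_mark):
    """Parse an amount written with either point or comma as decimal mark.

    The other of the two signs is taken to be the thousand separator.
    """
    if decimal_mark not in (".", ","):
        raise ValueError("decimal_mark must be '.' or ','")
    thousand_separator = "," if decimal_mark == "." else "."
    cleaned = text.replace(thousand_separator, "").replace(decimal_mark, ".")
    return Decimal(cleaned)


# The same text read with the two opposite conventions.
print(parse_amount("1,731", decimal_mark="."))
print(parse_amount("1,731", decimal_mark=","))
```

The text “1,731” is a valid amount under both conventions, which is exactly why the convention can’t be guessed from the value alone.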

Then there may be diversity within a country: in Canada there are always questions about Quebec, where they follow the French custom. India also has its own numbering system, with digits grouped by hundreds, alongside the English heritage.

The pattern of approximately one half of the world using one standard and approximately the other half using the opposite standard is also seen in other notations, such as arranging person names and writing street addresses, place names and postal codes, as told in the post Having the Right Element to the Left.
