Where is the Spot?

One of things we often struggle with in data quality improvement and master data management is postal addresses. Postal addresses have different formats around the world, names of streets are spelled alternatively and postal codes may be wrong, too short or suffer from other flaws.

An alternative way of identifying a place is a geocode and sometimes we may think: Hurray, geocodes are much better in uniquely identifying a place.

Well, unfortunately not necessarily so.

First of all geocodes may be expressed in different systems. The most used ones are:

  • Latitude and longitude: Even though the globe is not completely round, this system for most purposes is good for aligning positions with the real world.
  • UTM: When the world is reflected on a paper or on a computer screen it becomes flat. UTM reflects the world on a flat surface very well aligned with the metric system making distance calculations straight forward.
  • WGS: This is the system in use in many GPS devices and also the one behind Google Maps.

Next, where is the address exactly placed?

I have met at least three different approaches:

  • It could be where the building actually is and then if the precision is deep and/or the building is big on different places around the building.
  • It could be where the ground meets a public road. This is actually most often the case, as route planning is a very common use case for geocodes. The spot is fit for the purpose of use so to say.
  • It could, as reported in the post  Some Times Big Brother is Confused, be any place on (and beside) the street as many reference data sources interpolates numbers equally along the street or in other ways gets it wrong by keeping it simple.

Bookmark and Share

The Location Domain

When talking master data management we usually divide the discipline into domains, where the two most prominent domains are:

  • Customer, or rather party, master data management
  • Product, sometimes also named “things”, master data management

One the most frequent mentioned additional domains are locations.

But despite that locations are all around we seldom see a business initiative aimed at enterprise wide location data management under a slogan of having a 360 degree view of locations. Most often locations are seen as a subset of either the party master data or in some cases the product master data.  

Industry diversity

The need for having locations as focus area varies between industries.

In some industries like public transit, where I have been working a lot, locations are implicit in the delivered services. Travel and hospitality is another example of a tight connection between the product and a location. Also some insurance products have a location element. And do I have to mention real estate: Location, Location, Location.

In other industries the location has a more moderate relation to the product domain. There may be some considerations around plant and warehouse locations, but that’s usually not high volume and complex stuff.  

Locations as a main factor in exploiting demographic stereotypes are important in retailing and other business-to-consumer (B2C) activities. When doing B2C you often want to see your customer as the household where the location is a main, but treacherous, factor in doing so. We had a discussion on the house-holding dilemma in the LinkedIn Data Matching group recently.

Whenever you, or a partner of yours, are delivering physical goods or a physical letter of any kind to a customer, it’s crucial to have high quality location master data. The impact of not having that is of course dependent on the volume of deliveries.   

Globalization

If you ask me about London, I will instinctively think about the London in England. But there is a pretty big London in Canada too, that would be top of mind to other people. And there are other smaller Londons around the world.

Master data with location attributes does increasingly come in populations covering more than one country. It’s not that ambiguous place names don’t exist in single country sets. Ambiguous place names were the main driver behind that many countries have a postal code system. However the British, and the Canadians, invented a system including letters opposite to most other systems only having numbers typically with an embedded geographic hierarchy.

Apart from the different standards used around the possibilities for exploiting external reference data is very different concerning data quality dimensions as timeliness, consistency, completeness, conformity – and price.

Handling location data from many countries at the same time ruins many best practices of handling location data that have worked for handling location for a single country.

Geocoding

Instead of identifying locations in a textual way by having country codes, state/province abbreviations, postal codes and/or city names, street names and types or blocks and house numbers and names it has become increasingly popular to use geocoding as supplement or even alternative.

There are different types of geocodes out there suitable for different purposes. Examples are:

  • Latitude and longitude picturing a round world,
  • UTM X,Y coordinates picturing peels of the world
  • WGS84 X, Y coordinates picturing a world as flat as your computer screen.

While geocoding has a lot to offer in identifying and global standardization we of course has a gap between geocodes and everyday language. If you want to learn more then come and visit me at N55’’38’47, E12’’32’58.

Bookmark and Share

Geocoding from 100 Feet Under

I stumbled upon this image posted by Ellie K. on Google+

The title is World map of Flickr and Twitter locations and the legend is that red dots are locations of Flickr pictures, blue dots are locations of Twitter tweets and white dots are locations that have been posted to both.

You may be able to see your city following this link.

For example Copenhagen looks like this:

Here you have Copenhagen in Denmark to the left and Malmoe in Sweden to the right.

The strip between is the fixed link known as the Øresund Bridge.

However the connection isn’t entirely a bridge. If you look at a flyover picture you may think that there wasn’t money enough to finish the connection. Fortunately there was. The part closest to Copenhagen Airport is a 4 kilometer (2.5 miles) undersea tunnel.

So what puzzles me is the dots apparently representing Flickr uploads and tweets made from the tunnel. Are you able to upload to Flickr from down there? How are the tweets geocoded with that precision? My GPS never works when passing the tunnel.

(PS: I know you may geotag when back to surface)

Bookmark and Share

Down the Street

Having an address consisting of a house number and a street name, or vice versa, is the usual way of addressing in most parts of the world. This construct is also featured in the presentation of the Universal Postal Union’s (UPU) international standard initiative (S42):

(Click on image to see the presentation)

Somehow I always end up living at a place with issues in relation to this construct.

Our current address is (without unit):

“Kenny Drews Vej 27” which would be “27 Kenny Drews Way” in an Anglo-phone country.

But our area has a new style of block buildings with canals between as we like to pretend that we live in Venice or Amsterdam:

This means that the house numbers aren’t sequenced down the street, but is spread round the block as if we were living in Japan. Google maps have the position exactly as it is:

Number 27 on Kenny Drews Vej is actually much closer to two other streets, which makes it very difficult when people are visiting us the first time and for some also the second time.

But that’s because I, and some of our visitors, are old fashioned. As Prashanta Chan says in his blog post Geocoding: Accurate Location Master Data: It will be much better to invite folks to your geocode.

The same thing applies to when you want some goods delivered to your premises or want a taxi as close to your front door as possible.

And regarding letters delivered by the good old postman: They will probably all be sent electronically before the UPU S42 addressing mapping standard is adapted by everyone.

Bookmark and Share

No NOT NULL

A basic way of ensuring data quality in a database is to define that a certain attribute must be filled. This is done by specifying that the value “null” isn’t allowed or as said in SQL’ish: Setting the NOT NULL constraint.

A common data quality issue is that such constraints almost always are too rigid.

In my last post called Notes about the North Pole it was discussed that every place on earth has a latitude and a longitude except that the North Pole – and the South Pole – hasn’t a longitude. So if you have a table with geocodes you can’t set NOT NULL for the longitude if you (though very unlikely) should store the coordinates for the poles. Alternatively you could store 0 for longitude to make it complete – but then it would be very inaccurate. 360 degree inaccurate so to speak.

Another infrequent example from this blog is that every person in my country has a given (first) name and a family (last) name. But there are a few Royal Exceptions. So, no NOT NULL for the family name.

Related to people and places there are plenty of more frequent examples. If you only expect addresses form United States, Australia or India setting the NOT NULL for the state attribute seems wise. But expect foolish values in here when you get addresses from most other parts of the world. So, no NOT NULL for the state.  

A common variant of the mandatory state value is when you register for data quality webinars, white papers and so on. Most often you must select from a value list containing the United States of America – in some cases also mixed in with Canadian Provinces. The NULL option to be used by strangers may hide as “Not Applicable” way down the list among states beginning with N.

I usually select Alaska which is among the first states in the alphabetical order – which also brings me back close to the North Pole making my data close to 360 degree inaccuracy.     

Bookmark and Share

Valuable Inaccuracy

These days I’m involved in an activity in which you may say that we by creating data with questionable quality are making better information quality.

The business case is within public transit. In this particular solution passengers are using chip cards when boarding busses, but are not using the cards when alighting. This is a cheaper and smoother solution than the alternative in electronic ticketing, where you have both check-in and check-out. But a major drawback is the missing information about where passengers alighted, which is very useful information in business intelligence.

So what we do is that we where possible assume where the passenger alighted. If the passenger (seen as a chip card) within a given timeframe boarded another bus at a stop point which is on or near a succeeding stop point on the previous route, then we assume alighting was at that stop point though not recorded.

Two real life examples of doing so is where the passenger makes an interchange or where the passenger later on a day goes back from work, school or other regular activity.

An important prerequisite however is that we have good data quality regarding stop point locations, route assignments and other master data and their relations.    

Bookmark and Share

Location, Location, Location

Now, I am not going to write about the importance of location when selling real estates, but I am going to provide three examples about knowing about the location when you are doing data matching like trying to find duplicates in names and addresses.

Location uniqueness

Let’s say we have these two records:

  • Stefani Germanotta, Main Street, Anytown
  • Stefani Germanotta, Main Street, Anytown

The data is character by character exactly the same. But:

  • There is only a very high probability that it is the same real world individual if there is only one address on Main Street in Anytown.
  • If there are only a few addresses on Main Street in Anytown, you will still have a fair probability that this is the same individual.
  • But if there are hundreds of addresses on Main Street in Anytown, the probability that this is the same individual will be below threshold for many matching purposes.

Of course, if you are sending a direct marketing letter it is pointless sending both letters, as:

  • Either they will be delivered in the same mailbox.
  • Or both will be returned by postal service.

So this example highlights a major point in data quality. If you are matching for a single purpose of use like direct marketing you may apply simple processing. But if you are matching for multiple purposes of use like building a master data hub, you don’t avoid some kind of complexity.

Location enrichment

Let’s say we have these two records:

  • Alejandro Germanotta, 123 Main Street, Anytown
  • Alejandro Germanotta, 123 Main Street, Anytown

If you know that 123 Main Street in Anytown is a single family house there is a high probability that this is the same real world individual.

But if you know that 123 Main Street in Anytown is a building used as a nursing home, a campus or that this entrance has many apartments or other kind of units, then it is not so certain that these records represents the same real world individual (not at least if the name is John Smith).

So this example highlights the importance of using external reference data in data matching.

Location geocoding

Let’s say we have these two records:

  • Gaga Real Estate, 1 Main Street, Anytown
  • L.  Gaga Real Estate, Central Square, Anytown

If you match using the street address, the match is not that close.

But if you assigned a geocode for the two addresses, then the two addresses may be very close (just around the corner) and your match will then be pretty confident.

Assigning geocodes usually serve other purposes than data matching. So this example highlights how enhancing your data may have several positive impacts.

Bookmark and Share

Real World Alignment

I am currently involved in a data management program dealing with multi-entity (multi-domain) master data management described here.

Besides covering several different data domains as business partners, products, locations and timetables the data also serves multiple purposes of use. The client is within public transit so the subject areas are called terms as production planning (scheduling), operation monitoring, fare collection and use of service.

A key principle is that the same data should only be stored once, but in a way that makes it serve as high quality information in the different contexts. Doing that is often balancing between the two ways data may be of high quality:

  • Either they are fit for their intended uses
  • Or they correctly represent the real-world construct to which they refer

Some of the balancing has been:

Customer Identification

For some intended uses you don’t have to know the precise identity of a passenger. For some other intended uses you must know the identity. The latter cases at my client include giving discounts based on age and transport need like when attending educational activity. Also when fighting fraud it helps knowing the identity. So the data governance policy (and a business rule) is that customers for most products must provide a national identification number.

Like it or not: Having the ID makes a lot of things easier. Uniqueness isn’t a big challenge like in many other master data programs. It is also a straight forward process when you like to enrich your data. An example here is accurately geocoding where your customer live, which is rather essential when you provide transportation services.

What geocode?

You may use a range of different coordinate systems to express a position as explained here on Wikipedia. Some systems refers to a round globe (and yes, the real world, the earth, is round), but it is a lot easier to use a system like the one called UTM where you easily may calculate the distance between two points directly in meters assuming the real world is as flat as your computer screen.


Bookmark and Share

Data Quality in the Cloud

In my previous post I advocated that Data Quality tools in the near future will exploit the huge data resources in the cloud in order to achieve having data of high quality by correctly reflecting the real world construct to which they refer.

I am well aware that this is based on an assumption that data in the cloud are accurate, timely and so on, which is of course not always the case – now. This will only come when a certain data source has a number of subscribers that require a certain level of data quality and perhaps contributes to correcting flaws.

I tried that out right before writing this post when I installed Google Earth on a new laptop. A journey where I shifted between being very impressed and then a bit disappointed.

First the site from where to install – either by position or my OS language – guessed that I am not English speaking. Unfortunately it changed to Dutch – and not Danish. Well, most Dutch words are either like German or English or at least urban slang. I went through. Inside the application most text has now changed to Danish – only with a few Dutch and English labels.

Knowing that the application hasn’t learned anything about me yet I started to type just my street address which is only 8 characters but global unique: “Lerås 13” (remember: house number after street name in my part of the world). The application answered promptly with my full address as first candidate and when clicking on that it took me from high above the earth right down to where I live. Impressing.

Well, the pointer was actually 40 meters NNE from the nearest corner of our premise – and in front of our garage I could recognize the grey car I had 2 years ago. Disappointing.

Postal Address Hierarchy, Granularity, Precision and History

Penny_blackIn my last blog post the term “single version of the truth” was discussed. Some prerequisites for having raw data stored in one version that meets all known purposes are that:

  • They are kept with the granularity needed for all purposes
  • They have the most advanced precisions with all purposes
  • They reflect all time states asked for regarding all purposes

In the following I will go through some challenges with postal addresses. Don’t take this as an attempt to list all challenges in the world around this subject – it is only what I have been up to.

Countries

The country is the highest level in the address hierarchy. A source of truth may be a list of ISO 2 character country codes. But there are other lists and between these lists there a different perceptions of the fact that even countries are internally in hierarchies. Some examples related to the Olympic contest as my last blog post was part of are:

  • York (the old one) is placed in England – or is it Great Britain – or is it United Kingdom?
  • Referring to United States of America may or may not include Puerto Rico, US Virgin Islands, Guam, Samoa and Northern Mariana Islands.
  • The Kingdom of Denmark is not Denmark but Denmark, Faroe Islands and Greenland.

An example of a very slow changing dimension in here is that US Virgin Islands was part of the Kingdom of Denmark until 1917.

I had a great deal of fun with country codes and names when setting up a data matching solution around the D&B WorldBase and the world picture kept in there opposite to what is contained in other data samples.

States

Some countries have states, some countries have provinces and some other countries don’t have states or provinces. In some countries the state is a mandatory part of a postal address like in the US. In other countries having states the state is not a part of a printed address like in Germany, but you may have other purposes for storing the data anyway.

Postal codes and districts

Often local postal code systems are translated to the term ZIP-code – but ZIP code is actually the name of the US system.

The granularity of postal code systems differs a lot around the world. The UK postal codes are very specific while a postal code in other countries may refer to a large city. In most countries the postal code system is a hierarchy of numbers. The UK system is different. The Irish is very different – no postal codes until now.

In many countries companies are assigned a postal code of their own. The same goes for post office box addresses. In France the name of the referring district is followed by the word CEDEX for these addresses. So, be careful when matching or grouping city names in French addresses. Paris not Cedex is the centre of the universe in that country.

Locations, streets, blocks, house names, whatever

A lot of different hierarchies in various levels exist around the world – and the custom sequence also varies. This is a too complex and comprehensive subject for a blog post. So I will only emphasis a few selected subjects:

  • Vanity addressing is a phenonemen not at least in the UK where keeping up appearances rules. Here you may have to include a lie in the single version of truth.
  • Coding rules in my home country Denmark as we have a way of assigning a unique code to every real world entity. It helps with automated taxation. So a main road in central Copenhagen may be known to people as “H.C. Andersens Boulevard” but is stored in any mature database as “1010148”.
  • When matching party entities don’t make a false negative with an entity having a visit (geographical) address versus an entity having a mail address.

Entrances

Entrance – most often referred to as house number – is where addressing meets geocoding. Here you by using geocodes can point to an exact value identifying an address. When comparing with other addresses you just have to make sure whether you are talking latitude/longitude in a round world or WGS84 x-y coordinates or other geographic coordinate systems in a flat world and whether we are pointing at the centre of the building, at the door, at the spot where a public road is reachable or it is interpolated values.

Units

Larger buildings, high rising buildings and skyscrapers are usually not one address but is an entrance having multiple family apartments and/or multiple business addresses. These may be presented in many formats and in many depths including floors, sides, door numbers, you name it.

Large business entities may occupy a range of entrances.

Some entrances may in first impression look like a single address occupied by a nuclear family, but are in fact a nursing home or a campus occupied by a number of named individuals living on the same address.

Data models

The postal (geographical and mailing) address elements are in many data models just some of the attributes in a party entity. By separating the postal address elements in a specific entity with granulated attributes you will be more aligned with the real world and thereby have a better chance of fulfilling all purposes with the raw data. One of the most obvious advantages will be history tracking as business’ and consumers/citizens relocates from time to time.

Bookmark and Share