Feasible Names and Addresses

Most data quality technology was born in relation to the direct marketing industry back in the good old offline days. The main objectives were deduplication of names and addresses and making names and addresses fit for mailing.

When working with data quality you have to embrace the full scope of business value in the data, here being the names and addresses.

Back in the 90’s I worked with an international fundraising organization. A main activity was sending direct mails with greeting cards for optional sale with motifs related to seasonal feasts. Deduplication was a must regardless of the country (though the means were very different, but that’s for another day). Obviously the timing of the campaigns and the motifs on the cards were different between countries, but also within the countries based on the names and addresses.

Two examples:

German addresses

When selecting motifs for Christmas cards it’s important to observe that Protestantism is concentrated in the north and east of the country and Roman Catholicism is concentrated in the south and west. (If you think I’m out of season, well, such campaigns are planned in summertime). So, in the North and East most people prefer Christmas cards with secular motifs such as a lovely winter landscape. In the South and West most people will like a motif with Madonna and Child. Having well organized addresses with a connection to demographics was important.

Malaysian names

Malaysia is a very multi-ethnic society. The two largest groups, the ethnic Malays and the Malaysians of Chinese descent, have different seasonal feasts. The best way of handling this in order to fulfill the business model was to assign the names and addresses to the different campaigns based on whether the name was an ethnic Malay name or a Chinese name. Surely an exercise on the edge of what I earlier described in the post What’s in a Given Name?

Location, Location, Location

Now, I am not going to write about the importance of location when selling real estate, but I am going to provide three examples of why knowing the location matters when you are doing data matching, like trying to find duplicates in names and addresses.

Location uniqueness

Let’s say we have these two records:

  • Stefani Germanotta, Main Street, Anytown
  • Stefani Germanotta, Main Street, Anytown

The data is character by character exactly the same. But:

  • There is a very high probability that it is the same real world individual only if there is just one address on Main Street in Anytown.
  • If there are only a few addresses on Main Street in Anytown, you will still have a fair probability that this is the same individual.
  • But if there are hundreds of addresses on Main Street in Anytown, the probability that this is the same individual will be below threshold for many matching purposes.
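The three cases above can be sketched as a small rule. This is only an illustration: the function name, the labels and the cut-off for "a few" addresses are my assumptions, and the count of addresses on the street would have to come from some address reference file.

```python
def street_match_confidence(addresses_on_street: int, few: int = 10) -> str:
    """Grade an exact name + street match that lacks a house number.

    `few` is a hypothetical cut-off for "only a few addresses";
    tune it to the matching purpose at hand.
    """
    if addresses_on_street < 1:
        raise ValueError("a street must have at least one address")
    if addresses_on_street == 1:
        return "very high"   # only one place the person can live
    if addresses_on_street <= few:
        return "fair"        # still probably the same individual
    return "uncertain"       # too many candidate addresses to be sure


print(street_match_confidence(1))
print(street_match_confidence(4))
print(street_match_confidence(300))
```

A record pair graded "uncertain" may still be good enough for a single mailing, which is the point made next.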

Of course, if you are sending a direct marketing letter it is pointless sending both letters, as:

  • Either they will be delivered in the same mailbox.
  • Or both will be returned by postal service.

So this example highlights a major point in data quality. If you are matching for a single purpose of use like direct marketing you may apply simple processing. But if you are matching for multiple purposes of use like building a master data hub, you cannot avoid some kind of complexity.

Location enrichment

Let’s say we have these two records:

  • Alejandro Germanotta, 123 Main Street, Anytown
  • Alejandro Germanotta, 123 Main Street, Anytown

If you know that 123 Main Street in Anytown is a single family house there is a high probability that this is the same real world individual.

But if you know that 123 Main Street in Anytown is a building used as a nursing home or a campus, or that this entrance has many apartments or other kinds of units, then it is not so certain that these records represent the same real world individual (least of all if the name is as common as John Smith).

So this example highlights the importance of using external reference data in data matching.
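A minimal sketch of such a lookup could be as follows. The reference table, the property type labels and the function name are all hypothetical; in real life the data would come from an external property or address register.

```python
# Hypothetical external reference data: (street address, city) -> property type.
PROPERTY_TYPES = {
    ("123 Main Street", "Anytown"): "single_family_house",
    ("125 Main Street", "Anytown"): "nursing_home",
}

# Property types where many unrelated individuals share one address.
MULTI_OCCUPANCY = {"nursing_home", "campus", "apartment_building"}


def needs_manual_review(street_address: str, city: str) -> bool:
    """Return True when an exact name + address match is not safe on its own."""
    kind = PROPERTY_TYPES.get((street_address, city))
    # Unknown addresses are treated with the same caution as shared ones.
    return kind is None or kind in MULTI_OCCUPANCY


print(needs_manual_review("123 Main Street", "Anytown"))
print(needs_manual_review("125 Main Street", "Anytown"))
```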

Location geocoding

Let’s say we have these two records:

  • Gaga Real Estate, 1 Main Street, Anytown
  • L. Gaga Real Estate, Central Square, Anytown

If you match using the street address, the match is not that close.

But if you assign a geocode to each of the two addresses, then the two addresses may turn out to be very close (just around the corner) and your match will then be pretty confident.
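As a rough sketch of the "just around the corner" test, the great-circle distance between two geocodes can be computed with the haversine formula. The 100 metre threshold and the function names are assumptions made for the example, and the coordinates are made up.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_around_the_corner(a: tuple, b: tuple, max_m: float = 100.0) -> bool:
    """True when two (lat, lon) geocodes are within max_m metres (assumed threshold)."""
    return haversine_m(a[0], a[1], b[0], b[1]) <= max_m


# Made-up geocodes for "1 Main Street" and "Central Square".
main_street = (55.6761, 12.5683)
central_square = (55.6765, 12.5688)
print(is_around_the_corner(main_street, central_square))
```

The distance should only add to the match score. Two different companies may well sit around the corner from each other.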

Assigning geocodes usually serves other purposes than data matching. So this example highlights how enhancing your data may have several positive impacts.

A Really Bad Address

Many years ago I worked in a midsize insurance company. At that time IT made a huge change in insurance pricing since it now was possible to differentiate prices based on a lot of factors known to the databases.

The CEO decided that our company should also make some new pricing models based on where the customer lived, since it was perceived that you were more exposed to having your car stolen and your house burgled if you live in a big city as opposed to living in a quiet countryside home. But then the question: What exactly should the prices be, and where are the borderlines?

We, the data people, eagerly ran to the keyboard and fired up the newly purchased executive decision tool from SAS Institute. And yes, there was a different story based on postal code series, and especially downtown Copenhagen was really bad (I am from Denmark where Copenhagen is the capital and largest city).

Curiously we examined smaller areas in downtown Copenhagen. The result: It wasn’t the crime-exposed red light district that was bad; it was addresses in the business part that hurt the most. OK, more expensive cars and belongings there we guessed.

Narrowing down more we were shocked. It was the street of the company that was really really bad. And last: It was a customer having the very same house number as the company that had a lot of damage attached.

After investigating a bit more the case was solved. All payments made to specialists doing damage reporting all over the country were attached to a fictitious customer on the company address.

After cleansing the data the picture wasn’t that bad. Downtown Copenhagen is worse than the countryside, but not that bad. But surprisingly the CEO didn’t use our data; he merely adopted the pricing model from the leading competitors.

I’m still wondering how these companies did the analysis. They all had headquarters addresses in the same business area.


Sticky Data Quality Flaws

Fighting against data quality flaws is often most successfully done at data entry. When incorrect information has been entered into the system it most often seems nearly impossible to eliminate the falsehood.

A hilarious example is told in an article from telegraph.co.uk. A local council sent a letter to a woman’s pet pig (named Blossom Grant) offering the animal the chance to register to vote in last week’s UK election. This is only the culmination of a lot of letters – including tons of direct marketing – addressed to the pigsty. The pigsty was according to the article wrongly registered as a residence some years ago after a renovation. Since then the owner (named Pauline Grant) of the pig has tried to get the error corrected over and over again – but with no success.

Having the right element to the left

Name, address and place are core attributes in almost any database. You may atomize these attributes into smaller slices, but in doing that: Mind the sequence.

When working with data matching and party master data management some of the frequently exposed issues are:

Person name

Often a person name is split into first name and last name, but even when assigning these labels you are on slippery ground. Examples:

  • In some cultures, like in East Asia, the family name is written first and the given name is written last.
  • Some notations indicate that the given name isn’t the first element:
    • “DUPONT Michel” is a customary French way of indicating that the family name is the first element
    • “Smith, John” is a universal way of indicating that the family name is the first element

Besides that we have issues with middle names and other three part naming and having salutation, education and job titles mixed up in name fields.
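The two notation hints mentioned above (a comma, or an all-capitals first element) can be sketched like this. The function name and the fallback rule are my assumptions. The fallback "given name first" is exactly the slippery ground described above and will be wrong for names written in East Asian order, and titles or salutations in the field are not handled at all.

```python
def split_person_name(raw: str) -> tuple[str, str]:
    """Return (given_name, family_name) using notation hints only."""
    raw = raw.strip()
    if "," in raw:
        # "Smith, John": the comma says the family name comes first.
        family, given = (part.strip() for part in raw.split(",", 1))
        return given, family
    parts = raw.split()
    if (len(parts) >= 2 and len(parts[0]) > 1
            and parts[0].isupper() and not parts[-1].isupper()):
        # "DUPONT Michel": an all-capitals first element marks the family name.
        return " ".join(parts[1:]), parts[0]
    if len(parts) >= 2:
        # Assumed default: given name(s) first, family name last.
        return " ".join(parts[:-1]), parts[-1]
    return raw, ""


print(split_person_name("DUPONT Michel"))
print(split_person_name("Smith, John"))
```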

Street address

Most of the world is divided into two “street address” cultures:

  • In the Americas you write the house number in front of the street name if you are north of the Rio Grande, being US and CA, but you write the house number after the street name if you are south of the Rio Grande, being MeXico, BRazil, ARgentina and almost any other country.
  • In Europe you write the house number in front of street name if you are on the British Isles or in France, but you write the house number after the street name if you are in almost any other country.
  • The rest of the world is also divided in writing street addresses.

Besides that we have other ways of writing addresses like the block style in Japan.
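For the two main cultures, a very simple sketch could try "number first" and then "number last". The function name and the patterns are assumptions. Block style addresses, ranges of numbers and street names that themselves start with a digit (like "5th Avenue 10") are not handled.

```python
import re


def split_street_address(line: str) -> tuple[str, str]:
    """Return (street_name, house_number) from a one-line street address."""
    line = line.strip()
    # House number in front, e.g. "123 Main Street".
    m = re.match(r"^(\d+\w*)\s+(.+)$", line)
    if m:
        return m.group(2), m.group(1)
    # House number after the street name, e.g. "Hauptstrasse 12a".
    m = re.match(r"^(.+?)\s+(\d+\w*)$", line)
    if m:
        return m.group(1), m.group(2)
    # No recognisable house number.
    return line, ""


print(split_street_address("123 Main Street"))
print(split_street_address("Hauptstrasse 12a"))
```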

Place

Most countries have a postal code system – even Ireland will have that soon.

Despite the fact that a city name in most cases can be obtained by looking up the postal code we often do store the city name anyway – for those cases that we can’t.

And if the postal code and the city name are in one string: Oh yes, in some cultures you write the city name in front of the postal code and in other cultures you do it the opposite way. And oh no: It doesn’t necessarily follow the sequence of the house number and street name.
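A small sketch of splitting such a string, whichever way around it is written, might look like this. It assumes purely numeric postal codes of four or five digits, which is far from true everywhere (the UK being an obvious exception), and the function name is made up.

```python
import re


def split_postal_code_and_city(text: str) -> tuple[str, str]:
    """Return (postal_code, city) from a combined string, in either order."""
    text = text.strip()
    # Postal code first, e.g. "2100 Copenhagen".
    m = re.match(r"^(\d{4,5})\s+(.+)$", text)
    if m:
        return m.group(1), m.group(2)
    # City first, e.g. "Anytown 12345".
    m = re.match(r"^(.+?)\s+(\d{4,5})$", text)
    if m:
        return m.group(2), m.group(1)
    # No postal code found: keep the whole string as the city name.
    return "", text


print(split_postal_code_and_city("2100 Copenhagen"))
print(split_postal_code_and_city("Anytown 12345"))
```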

In a blog post written a while ago we also had a look into postal address hierarchy, granularity, precision and history.

Postal Address Hierarchy, Granularity, Precision and History

In my last blog post the term “single version of the truth” was discussed. Some prerequisites for having raw data stored in one version that meets all known purposes are that:

  • They are kept with the granularity needed for all purposes
  • They have the highest precision needed for any of the purposes
  • They reflect all time states asked for regarding all purposes

In the following I will go through some challenges with postal addresses. Don’t take this as an attempt to list all challenges in the world around this subject – it is only what I have been up to.

Countries

The country is the highest level in the address hierarchy. A source of truth may be a list of ISO 2 character country codes. But there are other lists, and between these lists there are different perceptions of the fact that even countries are internally in hierarchies. Some examples, related to the Olympic contest that my last blog post was part of, are:

  • York (the old one) is placed in England – or is it Great Britain – or is it United Kingdom?
  • Referring to United States of America may or may not include Puerto Rico, US Virgin Islands, Guam, American Samoa and Northern Mariana Islands.
  • The Kingdom of Denmark is not Denmark but Denmark, Faroe Islands and Greenland.

An example of a very slow changing dimension in here is that the US Virgin Islands were part of the Kingdom of Denmark until 1917.

I had a great deal of fun with country codes and names when setting up a data matching solution around the D&B WorldBase and the world picture kept in there as opposed to what is contained in other data samples.

States

Some countries have states, some countries have provinces and some other countries don’t have states or provinces. In some countries the state is a mandatory part of a postal address like in the US. In other countries having states the state is not a part of a printed address like in Germany, but you may have other purposes for storing the data anyway.

Postal codes and districts

Often local postal code systems are translated to the term ZIP-code – but ZIP code is actually the name of the US system.

The granularity of postal code systems differs a lot around the world. The UK postal codes are very specific while a postal code in other countries may refer to a large city. In most countries the postal code system is a hierarchy of numbers. The UK system is different. The Irish is very different – no postal codes until now.

In many countries companies are assigned a postal code of their own. The same goes for post office box addresses. In France the name of the referring district is followed by the word CEDEX for these addresses. So, be careful when matching or grouping city names in French addresses. Paris, not Cedex, is the centre of the universe in that country.
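One way of being careful is to strip the CEDEX marker before comparing or grouping city names. The sketch below assumes the marker comes last in the city field, optionally followed by a number, as in “PARIS CEDEX 08”. The function name is made up for the example.

```python
import re


def strip_cedex(city: str) -> str:
    """Remove a trailing CEDEX marker (and optional number) from a French city name."""
    return re.sub(r"\s+CEDEX(\s+\d+)?\s*$", "", city.strip(), flags=re.IGNORECASE)


print(strip_cedex("PARIS CEDEX 08"))
print(strip_cedex("Lyon Cedex"))
```

It is probably wise to keep the original value too, as the CEDEX marker tells you that this is a mail address and not necessarily a visit address.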

Locations, streets, blocks, house names, whatever

A lot of different hierarchies in various levels exist around the world – and the custom sequence also varies. This is too complex and comprehensive a subject for a blog post. So I will only emphasize a few selected subjects:

  • Vanity addressing is a phenomenon, not least in the UK where keeping up appearances rules. Here you may have to include a lie in the single version of truth.
  • Coding rules, as in my home country Denmark, where we have a way of assigning a unique code to every real world entity. It helps with automated taxation. So a main road in central Copenhagen may be known to people as “H.C. Andersens Boulevard” but is stored in any mature database as “1010148”.
  • When matching party entities don’t make a false negative with an entity having a visit (geographical) address versus an entity having a mail address.

Entrances

Entrance – most often referred to as house number – is where addressing meets geocoding. Here, by using geocodes, you can point to an exact value identifying an address. When comparing with other addresses you just have to make sure whether you are talking latitude/longitude in a round world (such as WGS84) or projected x-y coordinates (such as UTM) in a flat world, and whether the point is the centre of the building, the door, the spot where a public road is reachable or an interpolated value.

Units

Larger buildings, high-rise buildings and skyscrapers are usually not one address but an entrance having multiple family apartments and/or multiple business addresses. These may be presented in many formats and in many depths including floors, sides, door numbers, you name it.

Large business entities may occupy a range of entrances.

Some entrances may at first impression look like a single address occupied by a nuclear family, but are in fact a nursing home or a campus occupied by a number of named individuals living at the same address.

Data models

The postal (geographical and mailing) address elements are in many data models just some of the attributes in a party entity. By separating the postal address elements into a specific entity with granulated attributes you will be more aligned with the real world and thereby have a better chance of fulfilling all purposes with the raw data. One of the most obvious advantages will be history tracking, as businesses and consumers/citizens relocate from time to time.
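A minimal sketch of such a model, with the address as its own entity and a dated link to the party, could look like this. The class names, the attribute names and the choice of attributes are assumptions for illustration only, not a complete address model.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PostalAddress:
    """Address kept as its own entity with granulated attributes."""
    country_code: str
    postal_code: str
    city: str
    street: str
    house_number: str
    unit: str = ""
    state: str = ""


@dataclass
class AddressAssignment:
    """Links a party to an address for a period of time."""
    address: PostalAddress
    valid_from: date
    valid_to: Optional[date] = None  # None means "still current"


@dataclass
class Party:
    name: str
    addresses: list = field(default_factory=list)

    def relocate(self, new_address: PostalAddress, moved_on: date) -> None:
        """Close the current assignment and open a new one."""
        for assignment in self.addresses:
            if assignment.valid_to is None:
                assignment.valid_to = moved_on
        self.addresses.append(AddressAssignment(new_address, moved_on))

    def address_on(self, day: date) -> Optional[PostalAddress]:
        """Return the address that was valid on a given day, if any."""
        for a in self.addresses:
            if a.valid_from <= day and (a.valid_to is None or day < a.valid_to):
                return a.address
        return None


old = PostalAddress("US", "12345", "Anytown", "Main Street", "123")
new = PostalAddress("US", "12345", "Anytown", "Central Square", "1")
party = Party("Stefani Germanotta")
party.relocate(old, date(2005, 1, 1))
party.relocate(new, date(2010, 6, 1))
print(party.address_on(date(2008, 3, 15)).street)
```

With the address as a separate entity, two parties living at the same entrance can also point at the very same address record.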
