What’s In a Given Name?

I use the term ”given name” here for the part of a person name that in most western cultures is called a ”first name”.

When working with automation of data quality, master data management and data matching you will encounter a lot of situations where you will like to mimic what we humans do, when we look at a given name.  And when you have done this a few times you also learn the risks of doing so.

Here is some of the learning I have been through:

Gender

Most given names are either for males or for females. So most times you instinctively know if it is a male or a female when you look at a name. Probably you also know those given names in your culture that may be both. What often creates havoc is when you apply rules of one culture to data coming from a different culture.  The subject was discussed on DataQualityPro here.

Salutation

In some cultures salutation is paramount – not at least in Germany. A correct salutation may depend on knowing the gender. The gender may be derived from the given name. But you should not use the given name itself in your greeting.

So writing to “Angela Merkel” will be “Sehr geehrte Frau Merkel” – translates to “Very honored Mrs. Merkel”.

If you have a small mistake as the name being “Angelo Merkel”, this will create a big mistake when writing “Sehr geehrter Herr Merkel” (Very honored Mr. Merkel) to her.

Age

In a recent post on the DataFlux Community of Experts Jim Harris wrote about how he received tons of direct mails assuming he was retired based on where he lives.

I have worked a bit with market segmentation and data (information) quality. I don’t know how it is with first names in the United States, but in Denmark you may have a good probability with estimating an age based on your given name. The statistical bureau provides statistics for each name and birth year. So combining that with the location based demographic you will get a better response rate in direct marketing.

Nicknames

Nicknames are used very different in various cultures. In Denmark we don’t use them that much and definitely very seldom in business transactions. If you meet a Dane called Jim his name is actually Jim. If you have a clever piece of software correcting/standardizing the name to be James, well, that’s not very clever.


Bookmark and Share

Real World Alignment

I am currently involved in a data management program dealing with multi-entity (multi-domain) master data management described here.

Besides covering several different data domains as business partners, products, locations and timetables the data also serves multiple purposes of use. The client is within public transit so the subject areas are called terms as production planning (scheduling), operation monitoring, fare collection and use of service.

A key principle is that the same data should only be stored once, but in a way that makes it serve as high quality information in the different contexts. Doing that is often balancing between the two ways data may be of high quality:

  • Either they are fit for their intended uses
  • Or they correctly represent the real-world construct to which they refer

Some of the balancing has been:

Customer Identification

For some intended uses you don’t have to know the precise identity of a passenger. For some other intended uses you must know the identity. The latter cases at my client include giving discounts based on age and transport need like when attending educational activity. Also when fighting fraud it helps knowing the identity. So the data governance policy (and a business rule) is that customers for most products must provide a national identification number.

Like it or not: Having the ID makes a lot of things easier. Uniqueness isn’t a big challenge like in many other master data programs. It is also a straight forward process when you like to enrich your data. An example here is accurately geocoding where your customer live, which is rather essential when you provide transportation services.

What geocode?

You may use a range of different coordinate systems to express a position as explained here on Wikipedia. Some systems refers to a round globe (and yes, the real world, the earth, is round), but it is a lot easier to use a system like the one called UTM where you easily may calculate the distance between two points directly in meters assuming the real world is as flat as your computer screen.


Bookmark and Share

The Slurry Project

When cleansing party master data it is often necessary to typify the records in order to settle if it is a business entity, a private consumer, a department (or project) in a business, an employee at a business, a household or some kind of dirt, test, comic name or other illegible name and address.

Once I made such a cleansing job for a client in the farming sector. When I browsed the result looking for false positives in the illegible group this name showed up:

  • The Slurry Project (in Danish: Gylleprojektet)

So, normally it could be that someone called a really shitty project a bad name or provided dirty data for whatever reason. But in the context of the farming sector it makes a good name for a project dealing with better exploitation of slurry in growing crops.

A good example of the need for having the capability to adjust the bad word lists according to the context when cleansing data.


Bookmark and Share

Algorithm Envy

The term “algorithm envy” was used by Aaron Zornes in his piece on MDM trends when talking about identity resolution.

In my experience there is surely a need for good data matching algorithms.

As I have a built a data matching tool myself I faced that need back in 2005. At that time my tool was merely based on some standardization and parsing, match codes, some probabilistic learning and a few light weight algorithms like the hamming distance (more descriptions of these techniques here).

My tool was pretty national (like many other matching tools) as it was tuned for handling Danish names and addresses as well as Swedish, Norwegian, Finish and German addresses which are very similar.

The task ahead was to expand the match tool so it could be used to match business-to-business records with the D&B worldbase. This database has business entities from all over the world. The names and addresses in there are only standardized to the extent that is provided by the public sector or other providers for each country.

The records to be matched came from Nordic companies operating globally. For such records you can’t assume that these are entered by people who know the name and address format for the country in question. So, all in all, standardization and parsing wasn’t the full solution. If you don’t trust me, there is more explanation here.

When dealing with international data match codes becomes either too complex or too bad. This is also due to lack of standardization in both the records to be compared.

For the probabilistic learning my problem was that all learned data until then was only gathered from Nordic data. They wouldn’t be any good for the rest of the world.

The solution was including an advanced data matching algorithm, in this case Omikron FACT.

Since then the Omikron FACT algorithm has been considerable improved and is now branded as WorldMatch®. Some of the new advantages is dealing with different character sets and script systems and having synonyms embedded directly into the matching logic, which is far superior to using synonyms in a prior standardization process.

For full disclosure I work for the vendor Omikron Data Quality today. But I am not praising the product because of that – I work for Omikron because of the product.


Bookmark and Share

Royal Exceptions

I am not a royalist, but anyway: Today 16th April 2010 is the 70 years birthday of Queen Margrethe II of Denmark. Congratulations Your Majesty.

Having a queen (or king) and a royal family is a good example of that there are always exceptions. As a matter related to data quality: I would say that every person in our country has a first (given) name and a last (family) name. But the royal family hasn’t a last name – only they have some first names like those of Her Majesty being Margrethe Alexandrine Þórhildur Ingrid. (By the way: The third name is actually Icelandic; I guess that explains the ash cloud sent as a greeting from there.)

There are always exceptions. We may define data quality validation rules from here to doomsday – there will always be exceptions. We may write down business rules from now to eternity – tomorrow you will encounter the first exception. Data quality (and democracy) is never perfect – but it’s worth striving for.

Beyond Home Improvement

During my many years in customer master data quality improvement I have worked with a lot of clients having data from several countries. In almost every case the data has been prioritized in two pots:

  • Master Data referring to domestic customers
  • Master Data referring to foreign customers

Even though the enterprise defines itself as an international organization, the term domestic still in a lot of cases is easily assigned to the country where a headquarter is situated and where the organization was born.

Signs of this include:

  • Data formats are designed to fit domestic customers
  • Internal reference data are richer for domestic locations
  • External reference data services are limited to domestic customers

The high prioritizing of domestic data is of course natural for historical reasons, because domestic customers almost certainly are the largest group, and because the rules are common to most delegates in a data quality program.

If we accept the fact that improving data quality will be reflected in an improved bottom line, there is still a margin you may improve by not stopping when having optimal procedures for domestic data.

One way of dealing with this in an easy way is to apply general formats, services and rules that may work for data from all over the world, and this approach may in some cases be the best considering costs and benefits.

But I have no doubt that achieving the best data quality with customer master data is done by exploiting the specific opportunities that exist for each country / culture.

Examples are:

  • The completeness and depth for address (location) data available in each country is very different – so are the rules of the postal service’s operating there
  • Public sector company and citizen registration practice also differs why the quality of external reference data is different and so are the rules of access to the data.
  • Using local character sets, script systems, naming conventions and addressing formats besides (or instead of) what applies to that of the headquarter helps with data quality through real world alignment

My guess is that we will see services in cloud in the near future helping us making the global village also come true for master data quality.

Bookmark and Share

Standardise this, standardize that

Data matching is about linking entities in databases that don’t have a common unique key and are not spelled exactly the same but are so similar, that we may consider them representing the same real world object.

When matching we may:

  • Compare the original data rows using fuzzy logic techniques
  • Standardize the data rows and then compare using traditional exact logic

As suggested in the title of this blog post a common problem with standardization is that this may have two (or more) outcomes just like this English word may be spelled in different ways depending on the culture.

Not at least when working with international data you feel this pain. In my recent social media engagement I had the pleasure of touching this subject (mostly in relation to party master data) on several occasions, including:

  • In a comment to a recent post on this blog Graham Rhind says: Based just on the type of element and their positions in an address, there are at least 131 address formats covering the whole world, and around 40 personal name formats (I’m discovering more on an almost daily basis).
  • Rich Murnane made a post with a fantastic video with Derek Sivers telling about that while we in many parts of the world have named streets with building number assigned according to sequential positions, in Japan you have named blocks between unnamed streets with building numbers assigned according to established sequence.
  • In the Data Matching LinkedIn group Olga Maydanchik and I exchanged experiences on the problem that in American date format you write the month before the day in a date, while in European date format you write the day before the month.

In my work with international data I have often seen that determining what standard is used is depended on both:

  • The culture of the real world entity that the data represents
  • The culture of the person (organisation) that provided the data

So, the possible combination of standards applied to a given data set is made from where the data is, what elements is contained and who entered the data (which is often not carried on).

This is why I like to use both standardisation and standardization and fuzzy logic when selecting candidates and assigning similarity in data matching.

Bookmark and Share

Splitting names

When working through a list of names in order to make a deduplication, consolidation or identity resolution you will meet name fields populated as these:

  • Margaret & John Smith
  • Margaret Smith. John Smith
  • Maria Dolores St. John Smith
  • Johnson & Johnson Limited
  • Johnson & Johnson Limited, John Smith
  • Johnson Furniture Inc., Sales Dept
  • Johnson, Johnson and Smith Sales Training

SplitSome of the entities having these names must be split into two entities before we can do the proper processing.

When you as a human look at a name field, you mostly (given that you share the same culture) know what it is about.

Making a computer program that does the same is an exiting but fearful journey.

What I have been working with includes the following techniques:

  • String manipulation
  • Look up in list of words as given names, family names, titles, “business words”, special characters. These are country/culture specific.
  • Matching with address directories, used for checking if the address is a private residence or a business address.
  • Matching with business directories, used for checking if it is in fact a business name and which part of a name string is not included in the corresponding name.
  • Matching with consumer/citizen directories, used for checking which names are known on an address.
  • Probabilistic learning, storing and looking up previous human decisions.

As with other data quality computer supported processes I have found it useful having the computer dividing the names into 3 pots:

  • A: The ones the computer may split automatically with an accepted failure rate of false positives
  • B: The dubious ones, selected for human inspection
  • C: The clean ones where the computer have found no reason to split (with an accepted failure rate of false negatives)

For the listed names a suggestion for the golden single version of the truth could be:

  • “Margaret & John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
  • “Margaret Smith. John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
  • “Maria Dolores St. John Smith” stays as CONSUMER “Maria Dolores St. John Smith”
  • “Johnson & Johnson Limited” stays as BUSINESS “Johnson & Johnson Limited”
  • “Johnson & Johnson Limited, John Smith” will be split into BUSINESS “Johnson & Johnson Limited” having EMPLOYEE “John Smith”
  • “Johnson Furniture Inc., Sales Dept” will be split into “BUSINESS “Johnson Furniture Inc.” having “DEPARTMENT “Sales Dept”
  • “Johnson, Johnson and Smith Sales Training” stays as BUSINESS “Johnson, Johnson and Smith Sales Training”

For further explanation of the Master Data Types BUSINESS, CONSUMER, DEPARTMENT, EMPLOYEE you may have a look here.

Bookmark and Share