What’s In a Given Name?

I use the term “given name” here for the part of a person's name that in most Western cultures is called a “first name”.

When working with automation of data quality, master data management and data matching, you will encounter a lot of situations where you would like to mimic what we humans do when we look at a given name. And when you have done this a few times, you also learn the risks of doing so.

Here are some of the lessons I have learned:

Gender

Most given names are either male or female, so most of the time you instinctively know the gender when you look at a name. You probably also know the given names in your culture that may be either. What often creates havoc is applying the rules of one culture to data coming from a different culture. The subject was discussed on DataQualityPro here.

Salutation

In some cultures salutation is paramount – not least in Germany. A correct salutation may depend on knowing the gender. The gender may be derived from the given name. But you should not use the given name itself in your greeting.

So writing to “Angela Merkel” will be “Sehr geehrte Frau Merkel” – which translates to “Very honored Mrs. Merkel”.

A small mistake, such as the name being recorded as “Angelo Merkel”, turns into a big mistake when you write “Sehr geehrter Herr Merkel” (Very honored Mr. Merkel) to her.
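
As an illustration, here is a minimal Python sketch of deriving the salutation from the gender, which in turn is derived from the given name. The lookup table is hypothetical and would in practice be maintained per culture/country:

```python
# Hypothetical given-name-to-gender lookup; a real one would be per culture/country
GIVEN_NAME_GENDER = {"Angela": "F", "Angelo": "M", "Kim": None}  # Kim may be either

def german_salutation(given_name: str, surname: str) -> str:
    """Build a German salutation from the gender derived from the given name."""
    gender = GIVEN_NAME_GENDER.get(given_name)
    if gender == "F":
        return f"Sehr geehrte Frau {surname}"
    if gender == "M":
        return f"Sehr geehrter Herr {surname}"
    # Unknown or unisex name: fall back to a gender-neutral greeting
    return "Sehr geehrte Damen und Herren"

print(german_salutation("Angela", "Merkel"))  # Sehr geehrte Frau Merkel
print(german_salutation("Angelo", "Merkel"))  # Sehr geehrter Herr Merkel (the big mistake if the data is wrong)
```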

Age

In a recent post on the DataFlux Community of Experts, Jim Harris wrote about how he received tons of direct mail assuming he was retired, based on where he lives.

I have worked a bit with market segmentation and data (information) quality. I don't know how it is with first names in the United States, but in Denmark you can estimate an age with good probability based on the given name. The statistical bureau provides statistics for each name and birth year. Combining that with location-based demographics, you will get a better response rate in direct marketing.
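
As a sketch of that idea, here is a tiny Python example that estimates an expected age from hypothetical name-and-birth-year counts of the kind a statistical bureau publishes (the numbers below are made up):

```python
from datetime import date
from typing import Optional

# Hypothetical counts of newborns per (given name, birth year);
# in Denmark such figures are published by the statistical bureau
NAME_BIRTH_YEAR_COUNTS = {
    ("Preben", 1950): 980,
    ("Preben", 1970): 210,
    ("Preben", 1990): 15,
}

def estimated_age(given_name: str, today: date = date.today()) -> Optional[float]:
    """Expected age of a person with this given name, weighted by birth-year counts."""
    rows = [(year, count) for (name, year), count in NAME_BIRTH_YEAR_COUNTS.items()
            if name == given_name]
    total = sum(count for _, count in rows)
    if total == 0:
        return None
    return sum((today.year - year) * count for year, count in rows) / total

print(estimated_age("Preben"))  # heavily weighted toward the older cohorts
```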

Nicknames

Nicknames are used very differently across cultures. In Denmark we don't use them that much, and very seldom in business transactions. If you meet a Dane called Jim, his name is actually Jim. If you have a clever piece of software correcting/standardizing the name to James, well, that's not very clever.
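
A hedged sketch of how such a standardization rule could be guarded by culture, so that a Danish Jim stays Jim (the mapping and country list are illustrative only):

```python
# Hypothetical nickname-to-formal-name mapping, assumed valid for US/UK data only
NICKNAME_TO_FORMAL = {"Jim": "James", "Bob": "Robert", "Bill": "William"}
NICKNAME_MAPPING_COUNTRIES = {"US", "GB"}

def standardize_given_name(given_name: str, country_code: str) -> str:
    """Expand a nickname to its formal form only where the rule is culturally valid."""
    if country_code in NICKNAME_MAPPING_COUNTRIES:
        return NICKNAME_TO_FORMAL.get(given_name, given_name)
    return given_name  # a Dane called Jim really is called Jim

print(standardize_given_name("Jim", "US"))  # James
print(standardize_given_name("Jim", "DK"))  # Jim
```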



A Really Bad Address

Many years ago I worked in a midsize insurance company. At that time IT made a huge change in insurance pricing, since it was now possible to differentiate prices based on a lot of factors known to the databases.

The CEO decided that our company should also make some new pricing models based on where the customer lived, since it was perceived that you were more exposed to having your car stolen and your house burgled if you lived in a big city as opposed to a quiet countryside home. But then the question: what exactly should the prices be, and where should the borderlines go?

We, the data people, eagerly ran to the keyboard and fired up the newly purchased executive decision tool from SAS Institute. And yes, there was a different story depending on postal code series, and especially downtown Copenhagen was really bad (I am from Denmark, where Copenhagen is the capital and largest city).

Curious, we examined smaller areas in downtown Copenhagen. The result: it wasn't the crime-exposed red light district that was bad; it was addresses in the business district that hurt the most. OK, more expensive cars and belongings there, we guessed.

Narrowing down further, we were shocked. It was our own company's street that was really, really bad. And finally: it was a customer with the very same house number as the company that had a lot of damage claims attached.

Investigating a bit more, the case was solved. All payments made to specialists doing damage reporting all over the country were attached to a fictitious customer at the company address.

After cleansing the data, the picture wasn't that bad. Downtown Copenhagen is worse than the countryside, but not that bad. But surprisingly the CEO didn't use our data; he merely adopted the pricing model of the leading competitors.

I'm still wondering how these companies did the analysis. They all had headquarters addresses in the same business area.



Returns from Investing in a Data Quality Tool

The classic data quality business case is avoiding sending promotion letters and printed materials to duplicate prospects and customers.

Even as e-commerce moves forward and more complex data quality business cases, such as those related to multi-purpose master data management, become more important, I would like to take a look at the classic business case by examining some different choices of data quality tool.

As you may be used to all kinds of different currencies such as EUR, USD, AUD, GBP and so on, I will use the fictitious currency SSB (Simple Stupid Bananas).

Let’s say we have a direct marketing campaign with these facts:

  • 100,000 names and addresses, ½ of them also with phone number
  • Cost per mail is 3 SSB
  • Response is 4,500 orders with an average profit of 100 SSB

From investigating a sample we know that 10% of the names and addresses are duplicates with slightly different spellings.

So from these figures we know that the cost of a false negative (an actual duplicate that is not found) is 3 SSB. The saving from a true positive is then also 3 SSB.

The cost of a false positive (a found duplicate that actually isn't a duplicate) is a possibly missed order worth: 4,500 / (100,000 * 90 %) * 100 SSB = 5 SSB.
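
Putting the cost model into a few lines of Python using the campaign facts above:

```python
# Campaign facts from above
MAILINGS = 100_000
COST_PER_MAIL = 3            # SSB
ORDERS = 4_500
PROFIT_PER_ORDER = 100       # SSB
DUPLICATE_RATE = 0.10

# A false negative (a duplicate that is not found) costs one wasted mail piece;
# a true positive saves the same amount
cost_false_negative = COST_PER_MAIL                                  # 3 SSB
saving_true_positive = COST_PER_MAIL                                 # 3 SSB

# A false positive removes a genuine prospect, i.e. a possibly missed order
unique_prospects = MAILINGS * (1 - DUPLICATE_RATE)                   # 90,000
cost_false_positive = ORDERS / unique_prospects * PROFIT_PER_ORDER   # 5 SSB

print(cost_false_negative, saving_true_positive, cost_false_positive)
```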

Now let’s examine 3 options for tools for finding duplicates:

A: We already have Excel

B: Buying the leader of the pack data quality tool

C: Buying an algorithm based dedupe tool

A: We already have Excel

You may first sort the 100,000 rows by address and look for duplicates that way. Say you find 2,000 duplicates. Then sort the remaining 98,000 rows by surname and look for duplicates. Say you find 1,000 duplicates. Then sort the remaining 97,000 rows by given name. Say you find 1,000 duplicates. Finally sort the remaining 48,000 rows that have a phone number by phone number. Say you find 1,000 duplicates.

If a person can look for duplicates in 1,000 rows per hour (without making false positives), we will browse a total of 343,000 sorted rows in 343 hours.

Say you hire a student for that and have a Subject Matter Expert spend 15 hours explaining, controlling and verifying the process.

Costs are:

  • 343 student hours at 15 SSB each = 5,145 SSB
  • 15 SME hours at 50 SSB each = 750 SSB

Total costs are 5,895 SSB.

Total savings are 5,000 true positives at 3 SSB each = 15,000 SSB, making a positive ROI of 9,105 SSB in each campaign.

The only thing is that it will take one student more than two months (without quitting) to do the job.

B: Buying the leader of the pack data quality tool

Such a tool may have all kinds of data quality monitoring features, may be integrated smoothly with ETL functionality and so on. For data matching it may use so-called match codes. Doing that, we may expect that the tool will find 7,500 duplicates, where 7,000 are true positives and 500 are false positives.

Costs may be:

  • Tool license fee is 50,000 SSB
  • Training fee is 7,000 SSB
  • 80 hours of external consultancy at 125 SSB each = 10,000 SSB
  • 60 IT hours for training and installation at 50 SSB each = 3,000 SSB
  • 100 SME hours for training and configuration at 50 SSB each = 5,000 SSB

Total costs are 75,000 SSB.

Savings per campaign are 7,000 * 3 SSB – 500 * 5 SSB = 18,500 SSB.

A positive ROI will show up after the 5th campaign.

C: Buying an algorithm based dedupe tool

By using algorithm-based data matching, such a tool may, depending on the threshold setting, find 9,100 duplicates, where 9,000 are true positives and 100 are false positives.

Costs may be:

  • Tool license fee is 5,000 SSB
  • 8 hours of external consultancy for a workshop at 125 SSB each = 1,000 SSB
  • 15 SME hours for training, configuration and pushing the button at 50 SSB each = 750 SSB

Total costs are 6,750 SSB.

Savings per campaign are 9,000 * 3 SSB – 100 * 5 SSB = 26,500 SSB.

A remarkable ROI will show up in the 1st campaign.
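
Pulling the three options together, here is a minimal sketch that recalculates the savings per campaign and the break-even point from the figures above:

```python
# Estimated outcome per option: (true positives, false positives, cost in SSB).
# Note that option A's cost recurs for every campaign, while B and C are mainly one-off.
OPTIONS = {
    "A: Excel and a student":     (5_000,   0,  5_895),
    "B: Leader of the pack tool": (7_000, 500, 75_000),
    "C: Algorithm based dedupe":  (9_000, 100,  6_750),
}

SAVING_PER_TRUE_POSITIVE = 3   # SSB, one wasted mail piece avoided
COST_PER_FALSE_POSITIVE = 5    # SSB, a possibly missed order

for name, (tp, fp, cost) in OPTIONS.items():
    saving_per_campaign = tp * SAVING_PER_TRUE_POSITIVE - fp * COST_PER_FALSE_POSITIVE
    campaigns_to_break_even = cost / saving_per_campaign
    print(f"{name}: {saving_per_campaign:,} SSB saved per campaign, "
          f"break-even after {campaigns_to_break_even:.1f} campaigns")
```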



Algorithm Envy

The term “algorithm envy” was used by Aaron Zornes in his piece on MDM trends when talking about identity resolution.

In my experience there is surely a need for good data matching algorithms.

As I have built a data matching tool myself, I faced that need back in 2005. At that time my tool was merely based on some standardization and parsing, match codes, some probabilistic learning and a few lightweight algorithms like the Hamming distance (more descriptions of these techniques here).
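
For reference, the Hamming distance simply counts the positions where two equally long strings differ. A minimal sketch:

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions where two equally long strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is only defined for strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

print(hamming_distance("JENSEN", "JENSON"))  # 1
print(hamming_distance("JENSEN", "HANSEN"))  # 2
```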

My tool was pretty national (like many other matching tools), as it was tuned for handling Danish names and addresses as well as Swedish, Norwegian, Finnish and German addresses, which are very similar.

The task ahead was to expand the match tool so it could be used to match business-to-business records with the D&B worldbase. This database has business entities from all over the world. The names and addresses in there are only standardized to the extent that is provided by the public sector or other providers for each country.

The records to be matched came from Nordic companies operating globally. For such records you can't assume that they are entered by people who know the name and address format for the country in question. So, all in all, standardization and parsing wasn't the full solution. If you don't trust me, there is more explanation here.

When dealing with international data, match codes become either too complex or too poor. This is also due to the lack of standardization in both of the records being compared.

For the probabilistic learning, my problem was that all the learned data until then had been gathered from Nordic data only. It wouldn't be any good for the rest of the world.

The solution was including an advanced data matching algorithm, in this case Omikron FACT.

Since then the Omikron FACT algorithm has been considerably improved and is now branded as WorldMatch®. Some of the new advantages are dealing with different character sets and script systems and having synonyms embedded directly into the matching logic, which is far superior to using synonyms in a prior standardization process.

For full disclosure I work for the vendor Omikron Data Quality today. But I am not praising the product because of that – I work for Omikron because of the product.



Data Matching 101

Following up on my post no. 100, I can't resist making a post having 101 in the title. I'll use 101 in the meaning of an introduction to a subject. As "Data Quality 101" and "MDM 101" are already widely discussed, I think "Data Matching 101" is a good title.

Data matching deals with the dimension of data quality I like to call uniqueness. I use uniqueness because it is the positive term describing the state we want to bring our data to – as opposed to duplication, which is the state we want to change. The other dimensions of data quality, such as accuracy, consistency and timeliness, also describe desired states.

Data matching is, besides data profiling, the activity within data quality that has been automated the most. No wonder, since duplicates in master data in particular, and master data not being aligned with the real world, are costing organizations incredible amounts of money. Finding duplicates among millions (or even thousands) of records by manual means is impossible. The same is true for matching against directories holding timely descriptions of the real world. You have to use a computerized approach controlled by exactly the amount of manual verification that makes your return on investment positive.

Matching names and addresses (party master data) is the most common area of data matching. Matching product master data is probably going to be the next big thing in matching. I have also been involved in matching location data and timetables.

A computerized approach to data matching may include different techniques like parsing and standardization, using synonyms, assigning match codes, advanced algorithms and probabilistic learning.
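
As a small illustration of one of these techniques, here is a deliberately naive match code in Python; the exact recipe is made up, but it shows how two slightly different spellings can collapse into the same code (and also hints at why such codes struggle with international data, as the Danish Ø is simply dropped):

```python
import re

def match_code(name: str, postal_code: str, street: str) -> str:
    """Very naive match code: first 4 letters of the name + postal code + digits of the street."""
    name_part = re.sub(r"[^A-Z]", "", name.upper())[:4]
    street_digits = "".join(re.findall(r"\d+", street))
    return f"{name_part}-{postal_code}-{street_digits}"

# Two slightly different spellings of the same party end up with the same code
print(match_code("Henrik Sørensen", "2300", "Amagerbrogade 23"))    # HENR-2300-23
print(match_code("Henrick Sörensen", "2300", "Amagerbrogade 23 B")) # HENR-2300-23
```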

All that is best explained with examples. Therefore I am happy to do a webinar called “The Art of Data Matching” as part of a series of free webinars on eLearningCurve. The webinar will be a sightseeing tour looking at examples of challenges and solutions in the data matching world.

Date and time: Well, these are matching examples of expressing the moment the webinar starts (see the sketch after the list):

  • Friday 06/04/10 12pm EDT
  • Friday 04/06/10 18:00 Central European Summer Time
  • Sydney, Sat Jun 5 2:00 AM
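
Just for fun, a small Python sketch (using the standard zoneinfo module) showing that the three notations above really do point to the same moment once normalized to UTC:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# The three notations above, parsed with their respective local time zones
moments = [
    datetime(2010, 6, 4, 12, 0, tzinfo=ZoneInfo("America/New_York")),   # 06/04/10 12pm EDT
    datetime(2010, 6, 4, 18, 0, tzinfo=ZoneInfo("Europe/Copenhagen")),  # 04/06/10 18:00 CEST
    datetime(2010, 6, 5, 2, 0, tzinfo=ZoneInfo("Australia/Sydney")),    # Sat Jun 5 2:00 AM
]

# Normalized to UTC they all collapse into one moment: 2010-06-04 16:00 UTC
print({m.astimezone(ZoneInfo("UTC")) for m in moments})
```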

Link to the eLearningCurve free webinar here.


Post no. 100

This is post number 100 on this blog. Besides being a time for saying thank you to those who have read this blog, those who have re-tweeted the posts and not least those who have commented on the posts, it is also a time for a recapitulation of my opinions (based on my experiences and observations) about data quality.

Let me emphasize three points:

  • Fit for purpose versus real world alignment
  • Diversity in data quality
  • The role of technology in data quality improvement

Fit for purpose versus real world alignment

According to Wikipedia data may be of high quality in two alternative ways:

  • Either they are fit for their intended uses
  • Or they correctly represent the real-world construct to which they refer

My thesis is that, as you include more and more purposes, there is a breakeven point where it becomes less cumbersome to reflect the real-world object than to try to align all known purposes.

This theme is so far covered in 19 posts and pages including:

Diversity in data quality

International and multi-cultural aspects of data quality improvement have been a favorite topic of mine for a long time.

While working with data quality tools and services for many years, I have found that many tools and services are very national. So you might discover that a tool or service will work wonders with data from one country, but be quite ordinary or in fact useless with data from another country.

I have made 15 posts on diversity in data quality so far including:

The role of technology in data quality improvement

You may become a Data Quality professional by coming from either the business side or the technology side of the practice. But more important in my eyes is whether you have made serious attempts at, and succeeded in, understanding the side where you didn't start. I have always strived to be a mixed-skilled person. As I have tried single-handedly to build a data quality tool – or to be more specific, a data matching tool – I do of course write a lot about data quality technology.

This blog includes 37 posts on data quality technology so far including:


Big Time ROI in Identity Resolution

Yesterday I had the chance to make a preliminary assessment of the data quality in one of the local databases holding information about entities involved in carbon trade activities. It is believed that up to 90 percent of the market activity may have been fraudulent, with criminals pocketing 5 billion Euros. There is a description of the scam here from telegraph.co.uk.

Most of my work with data matching is aimed at finding duplicates. In doing this you must avoid finding so-called false positives, so you don't end up merging information about two different real-world entities. But when doing identity resolution for purposes including preventing fraud and scams, you may be interested in finding connections between entities that are not supposed to be connected at all.

The result from making such connections in the carbon trade database was quite astonishing. Here is an example where I have changed the names, addresses, e-mails and phones, but such a pattern was found in several cases:

Here we have an example of a group of entities where the name, address, e-mail or phone is shared in a way that doesn’t seem natural.
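
A minimal sketch of the kind of connection check involved, using made-up records: index the entities by each attribute value and report supposedly unrelated entities that share one. A real implementation would of course also use fuzzy comparison rather than exact equality:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical registrant records (names, e-mails and phones are made up)
entities = [
    {"id": 1, "name": "Green Trade Ltd", "email": "info@greentrade.example", "phone": "+45 1111 1111"},
    {"id": 2, "name": "EcoCarbon ApS",   "email": "info@greentrade.example", "phone": "+45 2222 2222"},
    {"id": 3, "name": "CleanAir GmbH",   "email": "ca@cleanair.example",     "phone": "+45 2222 2222"},
    {"id": 4, "name": "Nordic Wind A/S", "email": "nw@nordicwind.example",   "phone": "+45 3333 3333"},
]

# Index entities by each attribute value
by_value = defaultdict(set)
for e in entities:
    for attribute in ("email", "phone"):
        by_value[(attribute, e[attribute])].add(e["id"])

# Report pairs of supposedly unrelated entities sharing an attribute value
for (attribute, value), ids in by_value.items():
    if len(ids) > 1:
        for a, b in combinations(sorted(ids), 2):
            print(f"Entities {a} and {b} share {attribute} {value}")
```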

My involvement in the carbon trade scam was initiated by a blog post yesterday by my colleague Jan Erik Ingvaldsen, based on the story that journalists, merely by gazing at the database, had found addresses that simply don't exist.

So the question is whether the authorities could have avoided losing 5 billion taxpayer Euros if some identity resolution, including automated fuzzy connection checks and real-world checks, had been implemented. I know that hindsight makes you so much more enlightened about what could have been done once a scam is discovered, but I actually think that there are a lot of other billions of Euros (Pounds, Dollars, Rupees) out there to avoid losing by doing some decent identity resolution.


Data Quality and World Food

I have touched on the analogy between food (quality) and data (quality) several times before, for example in the posts “Bon Appétit” and “Under New Master Data Management”.

Why not continue down that road?

Let’s have a look at some local food that has become popular around the world.

寿司

Imagine you go to a restaurant and order a fish dish. When starting to consume your dinner you realize that the fish hasn't been boiled, fried or in any other way exposed to heat. Then I guess it is perfectly normal to shout out: THE FISH IS RAW – and demand apologies from the chef, the head waiter, Gordon Ramsay or anyone else in charge. Unless, of course, you are in a sushi restaurant, where the famous Japanese dish that may include raw fish is prepared.

Köttbullar

Köttbullar is the Swedish word for meatballs. This would rightfully have stayed a fact known only to Swedes if it weren't for the cheap furniture sold around the world by IKEA. For reasons still unclear to me, IKEA has chosen to serve Köttbullar in the store cafeterias and even sell the stuff along with the particle board furniture on their e-commerce sites.

Pizza

An Italian-originated dish usually brought to you by someone on a bike or, in extreme cases, in a very old car.

McChicken

Selling food of various kinds in the form of a burger works in the United States – and, for reasons that I can't explain, even in France.

Data Quality analogies

Well, let’s just say that data quality tools and services:

  • May be regarded very differently around the world,
  • Usually are sold along with tools and services made for something completely different,
  • Are brought to you in various ways by local vendors and
  • For reasons I can't explain, often are made for use in the United States (no pun intended, only pure admiration of execution).

Bon appétit.


What a Happy Day

I have got a lot of good news today.

First, a Nigerian gentleman wants to deposit 40.5 M$ into my bank account, and 35% is for me to keep. Wow.

Next, it seems that data quality improvement and master data management are not about technology at all. Practically, you only need smart people doing smart processes. Wow.

Nobody actually needs my assistance that much and soon I will have plenty of money in the bank.

Breaking through an open door

This is perhaps a road I have been down before, for example lately in the post The Myth about a Myth.

But it is a pet peeve of mine.

Why are some people always reminding us that this and that must be seen in a business context?

Of course everything we do in our professional life within data quality, master data management, business intelligence and so on must be seen in a business context. Again, I have never seen anyone taking the opposite stance.

I am aware that playing the “business context” card is a friendly reminder when, say, some people become too excited about a tool. But remember, every tool is originally made by people to solve a business challenge, and if the tool continues to exist it has probably done that several times.

It may be that tools are overexposed in our business issue discussions because some people are simply doing their jobs:

  • Vendors are naturally pushing their tools – it’s a business issue
  • Analysts talk about tools and vendors – it’s a business issue
  • Conference organizers invite vendors to make sponsorships and tool exhibitions – it’s a business issue

But I don’t think you are breaking through anything when reminding anyone about the business context. Everyone knows that already.  Take it to the next level.