Big Trouble with Big Names

An often seen issue in party master data management is handling information about your most active customers, suppliers and other roles of interest. These are often big companies with many faces.

I remember meeting that problem way back in the 80’s when I was designing a solution for the Danish Maritime Authorities.  

In relation to a ship there are three different main roles:

  • The owner of the ship, who has some legal rights and obligations
  • The operator of ship, who has responsibilities regarding the seaworthiness of the ship
  • The employer, who has responsibilities regarding the seamen onboard the ship

Sometimes these roles don’t belong to the same company (or person) for a given ship. That real world reality was modeled all right. But even if it practically is the same company, then the roles are materialized very different for each role. I remember this was certainly the case with the biggest ship-owner in Denmark (and also by far the biggest company in Denmark) being the A.P. Moller – Maersk Group.

We really didn’t make a golden record for that golden company in my time on the project.

Bookmark and Share

Business Directory Match: Global versus Local

When doing data quality improvement in business-to-business party master data an often used shortcut is matching your portfolio of business customers with a business directory and preferably picking new customers from the directory in the future.

If you are doing business in more than one country you will have some considerations about what business directory to use like engaging with a local business directory for each country or engaging with a single business directory covering all countries in question.

There are pro’s and con’s.

One subject is conformity. I have met this issue a couple of times. A business directory covering many countries will have a standardized way of formatting the different elements like a postal address, whereas a local (national) business directory will use best practice for the particular country.

An example from my home country Denmark:

The Dun & Bradstreet WorldBase is a business directory holding 170 million business entities from all over the world. A Danish street address is formatted like this:

Address Line 1 = Hovedgaden 12 A, 4. th

Observe that Denmark belongs to that half of the earth where house numbers are written after the street name.

In a local business directory (based on the public registry) you will be able to get this format:

Street name = Hovedgaden
Street code = 202 4321
House number = 012A
Floor = 04
Side/door = TH

Here you get an atomized address with metadata for the atomized elements and the unique address coding used in Denmark.

Bookmark and Share

Deduplicating with a Spreadsheet

Say you have a table with a lot of names, postal addresses, phone numbers and e-mail addresses and you want to remove duplicate rows in this table. Duplicates may be spelled exactly the same, but may also be spelled somewhat different, but still describe the same real world individual or company.

You can do the deduplicating with a spreadsheet.

In old times some spreadsheets had a limit of number of rows to be processed like the 64,000 limit in Excel, but today spreadsheets can process a lot of rows.

In this case you may have the following columns:

  • Name (could be given name and surname or a company name)
  • House number
  • Street name
  • Postal code
  • City name
  • Phone number
  • E-mail address

What you do is that first you sort the sheet by name, then postal code and then street name.

Then you browse down all the rows and focus at one row at the time and from there looks up and down if the rows before or after seems to duplicates. If so, you delete all but one row being the same real world entity.

When finished with all the rows sorted by name, postal code and street name you make an alternate sort, because some possible duplicates may not begin with the same letters in the name field.

So what you do is that you sort the sheet by postal code and then street name and then house number.

Then you browse down all the rows and focus at one row at the time and from there looks up and down if the rows before or after seems to duplicates. If so, you delete all but one row being the same real world entity.

When finished with all the rows sorted by postal code, street name and house number you make an alternate sort, because some possible duplicates may not have the proper postal code assigned or the street name may not start with the same letters.

So what you do is that you sort the sheet by city name and then house number and then name.

Then you browse down all the rows and focus at one row at the time and from there looks up and down if the rows before or after seems to duplicates. If so, you delete all but one row being the same real world entity.

When finished with all the rows sorted by postal code, street name and house number you make an alternate sort, because some duplicates may have moved or have different addresses for other reasons .

So what you do is that you sort the sheet by phone number, then by name and then by postal code.

Then you browse down all the rows and focus at one row at the time and from there looks up and down if the rows before or after seems to duplicates. If so, you delete all but one row being the same real world entity.

When finished with all the rows sorted by phone number, name and then by postal code you make an alternate sort, because some duplicates may not have a phone number or may have different phone numbers.

So what you do is that you sort the sheet by e-mail address, then by name and then by postal code.

Then you browse down all the rows and focus at one row at the time and from there looks up and down if the rows before or after seems to duplicates. If so, you delete all but one row being the same real world entity.

You may:

  • If you only have a few rows do this process within a few hours and possibly find all the duplicates
  • If you have a lot of rows do this process within a few years and possibly find some of the duplicates

PS: The better option is of course avoiding having duplicates in the first place. Unfortunately this is not the case in many situations – here is The Top 5 Reasons for Downstream Cleansing.

Bookmark and Share

The Little Match Girl

The short story (or fairy tale) The Little Match Girl (or The Litlle Match Seller) by Hans Christian Andersen is a sad story with a bad ending, so it shouldn’t actually belong here on this blog where I will try to tell success stories about data quality improvement resulting in happy databases.

However, if I look at the industry of making data matching tools (and data matching technology is a large part of data quality tools) I wonder if the future has ever been that bright.

There are many tools for data matching out there.

Some tool vendors have been acquired by big players in the data management realm as:

  • IBM acquired Accential Software
  • SAS Institute acquired DataFlux
  • Informatica acquired Similarity Systems and Identity Systems
  • Microsoft acquired Zoomix
  • SAP acquired Fuzzy Informatik and Business Objects that acquired FirstLogic
  • Experian acquired QAS
  • Tibco acquired Netrics

(the list may not be complete, just what immediately comes to my mind).

The rest of the pack is struggling with selling matches in the cold economic winter.

There is another fairy tale similar to The Little Match Girl called The Star Money collected by the Brothers Grimm. This story has a happy ending. Here the little girl gives here remaining stuff away for free and is rewarded with money falling down from above. Perhaps this is like The Coming of Age of Open Source as told in a recent Talend blog post?

Well, open source is first expected to break the ice in the Frozen Quadrant in 2012.

Bookmark and Share

A Data Quality Appliance?

Today it was announced that IBM is to acquire Netezza, a data warehouse appliance vendor.

5 years ago I guess the interest for data warehouse appliances was very sparse. I guess this because I attended a session held by Netezza at the 2005 London Information Management conference. We were 3 people in the room: The presenter, a truly interested delegate and me. I was basically in the room because I was the next speaker in the room and wanted to see how things worked out. For the record: It was a good session, I learned a lot about appliances.  

Probably therefore I noticed a piece from 2007 where Philip Howard of Bloor wrote about The scope for appliances. In this article Phillip Howard also suggested other types of appliances, for example a data quality (data matching) appliance.  

I have been around some implementations where we could use the power of an appliance when we have to match a lot of rows. The Achilles’ heel in data matching is candidate selection and often you have to restrict on your methods in order to maintain a reasonable performance.

But I wonder if I ever will see an on promise data quality (data matching) appliance or it will be placed in the cloud. Or maybe there already is one out there? If so, please tell about it.    

Bookmark and Share

The Sound of Soundex

The probably oldest and most used error tolerant algorithm in searching and data matching is a phonetic algorithm called Soundex. If you are not familiar with Soundex: Wikipedia to the rescue here.

In the LinkedIn group Data Matching we seem to have an ongoing discussion about the usefulness of Soundex. Link to the discussion here – if you are not already a member: Please join, spammers are dealt with, though it is OK to brag about your data matching superiority.

To sum up the discussion on Soundex I think we at this stage may conclude:

  • Soundex is of course very poor compared to the more advanced algorithms, but it may be better than nothing (which will be exact searching and matching)
  • Soundex (or a variant of Soundex) may be used for indexing in order to select candidates to be scored with better algorithms.

Let’s say you are going to match 100 rows with names and addresses against a table with 100 million rows with names and addresses and let’s say that the real world individual behind the 100 rows is in fact represented among the 100 million, but not necessary spelled the same.

Your results may be as this:

  • If you use exact automated matching you may find 40 matching rows (40 %).
  • If you use automated matching with (a variant of) Soundex you may find 95 matching rows, but only 70 rows (70 %) are correct matches (true positives) as 25 rows (25 %) are incorrect matches (false positives).
  • If you use automated matching with (a variant of) Soundex indexing and advanced algorithm for scoring you may find 75 matching rows where 70 rows (70 %) are correct matches (true positives) and 5 rows (5 %) are incorrect matches (false positives).
  • By tuning the advanced algorithm you may find 67 matching rows where 65 rows (65 %) are correct matches (true positives) and 2 rows (2 %) are incorrect matches (false positives).

So when using Soundex you will find more matching rows but you will also find more manual work in verifying the results. Adding an advanced algorithm may reduce the manual work or eliminate manual work at the cost of some not found matches (false negatives) and the risk of a few wrong matches (false positives).

PS: I have a page about other Match Techniques including standardization, synonyms and probabilistic learning.

PPS: When googling for if the title of this blog has been used before I found this article from a fellow countryman.

Bookmark and Share

Game, Set, Match

Tennis is one of the sports I practiced a lot when I was young and still like to play when possible.

As a consequence I guess I also like to follow world class tennis not at least now where we finally got a Dane competing for the big titles. I’m thinking about Caroline Wozniacki who is seeded as number one in the ongoing US Open Grand Slam tournament.

So, as an excuse to write a blog post about it I have come up with these connections between Caroline and Data Matching.

The name:

Wozniacki isn’t exactly a Nordic name as she is the daughter of native-born Polish parents. In fact, if the Polish naming practice should be followed her surname should be Wozniacka; the female form of the name. But as practiced in Western countries she has inherited a genderless family name.  Good for matching.

The bet:

Bets on sports event is like scoring in data matching. You are not 100 % sure but rely on probability. Odds for Caroline winning the US Open opening round matches are as 1.01 and 1.02 = 98 – 99 % certainty = pretty sure. But odds get higher as the tournament proceeds to final rounds and it can go either way.

Bookmark and Share

Out-of-Africa

Besides being a memoir by Karen Blixen (or the literary double Isak Dinesen) Out-of-Africa is a hypothesis about the origin of the modern human (Homo Sapiens). Of course there is a competing scientific hypothesis called Multiregional Origin of Modern Humans. Besides that there is of course religious beliefs.

The Out-of-Africa hypothesis suggests that modern humans emerged in Africa 150,000 years ago or so. A small group migrated to Eurasia about 60,000 years ago. Some made it across the Bering Strait to America maybe 40,000 years ago or maybe 15,000 years ago. The Vikings said hello to the Native Americans 1,000 years ago, but cross Atlantic movement first gained pace from 500 years ago, when Columbus discovered America again again.

½ year ago (or so) I wrote a blog post called Create Table Homo_Sapiens. The comment follow up added to the nerdish angle with discussing subjects as mutating tables versus intelligent design and MAX(GEEK) counting.

But on the serious side comments also touched the intended subject about making data models reflect real world individuals.

Tables with persons are the most common entity type in databases around. As in the Out-of-Africa hypothesis it could have been as a simple global common same structural origin. But that is not the way of the world. Some of the basic differences practiced in modeling the person entity are:

  • Cultural diversity: Names, addresses, national ID’s and other basic attributes are formatted differently country by country and in some degree within countries. Most data models with a person entity are build on the format(s) of the country where it is designed.
  • Intended purpose of use: Person master data are often stored in tables made for specific purposes like a customer table, a subscriber table a contact table and so on. Therefore the data identifying the individual is directly linked with attributes describing a specific role of that individual.
  • “Impersonal” use: Person data is often stored in the same table as other party master types as business entities, projects, households et cetera.

Many, many data quality struggles around the world is caused by how we have modeled real world – old world and new world – individuals.

Bookmark and Share

Follow Friday Data Quality

Every Friday on Twitter people are recommending other tweeps to follow using the #FollowFriday (or simply #FF) hash tag.

My username on twitter is @hlsdk.

Sometimes I notice tweeps I follow are recommending the username @hldsk or @hsldk or other usernames with my five letters swapped.

It could be they meant me? – but misspelled the username. Or they meant someone else with a username close to mine?

As the other usernames wasn’t taken I have taken the liberty to create some duplicate (shame on me) profiles and have a bit of (nerdish) fun with it:

@hsldk

For this profile I have chosen the image being the Swedish Chef from the Muppet show. To make the Swedish connection real the location on the profile is set as “Oresund Region”, which is the binational metropolitan area around the Danish capital Copenhagen and the 3rd largest Swedish city Malmoe as explained in the post The Perfect Wrong Answer.

@hldsk

For this profile I have chosen the image being a gorilla originally used in the post Gorilla Data Quality.

This Friday @hldsk was recommended thrice.

But I think only by two real life individuals: Joanne Wright from Vee Media and Phil Simon who also tweets as his new (one-man-band I guess) publishing company.

What’s the point?

Well, one of my main activities in business is hunting duplicates in party master databases.

What I sometimes find is that duplicates (several rows representing the same real world entity) have been entered for a good reason in order to fulfill the immediate purpose of use.

The thing with Phil and his one-man-band company is explained further in the post So, What About SOHO Homes.

By the way, Phil is going to publish a book called The New Small. It’s about: How a New Breed of Small Businesses is Harnessing the Power of Emerging Technologies.

Bookmark and Share

3 out of 10

Just before I left for summer vacation I noticed a tweet by MDM guru Aaron Zornes saying:

This is a subject very close to me as I have worked a lot with business directory matching during the last 15 years not at least matching with the D&B WorldBase.

The problem is that if you match your B2B customers, suppliers and other business partners with a business directory like the D&B WorldBase you could naively expect a 100% match.

If your result is only a 30% hit rate the question is: How many among the remaining 70% are false negatives and how many are true negatives.

True negatives

There may be a lot of reasons for true negatives, namely:

  • Your business entity isn’t listed in the business directory. Some countries like those of the old Czechoslovakia, some English speaking countries in the Pacifics, the Nordic countries and others have a tight public registration of companies and then it is less tight from countries in North America, other European countries and the rest of the world.
  • Your supposed business entity isn’t a business entity. Many B2B customer/prospect tables holds a lot of entities not being a formal business entity but being a lot of other types of party master data.
  • Uniqueness may be different defined in the business directory and your table to be matched. This includes the perception of hierarchies of legal entities and branches – not at least governmental and local authority bodies is a fuzzy crowd. Also the different roles as those of small business owners are a challenge. The same is true about roles as franchise takers and the use of trading styles.

False negatives

In business directory matching the false negatives are those records that should have been matched by an automated function, but isn’t.

The number of false negatives is a measure of the effectiveness of the automated matching tool(s) and rules applied. Big companies often use the magic quadrant leaders in data quality tools, but these aren’t necessary the best tools for business directory matching.

Personally I have found that you need a very complex mix of tools and rules for getting a decent match rate in business directory matching, including combining both deterministic and probabilistic matching. Some different techniques are explained in more details here.

Bookmark and Share