Testing a Data Matching Tool

Many technical magazines run comparative tests of a range of similar products – in the IT world, for example, comparing a range of CPUs or a selection of word processors. These tests compare measurable things such as speed, the ability to actually perform a certain task and, importantly, the price.

With enterprise software such as data quality tools we only have analyst reports evaluating the tools on far less measurable factors, often giving a result roughly equivalent to stating market strength. The analysts haven’t compared actual speed, they haven’t tested the ability to do a certain task, nor have they taken the price into consideration.

A core feature in most data quality tools is data matching. This is the discipline where data quality tools are able to do something considerably better than more common technology such as database managers and spreadsheets, as told in the post about deduplicating with a spreadsheet.

In the LinkedIn data matching group we have on several occasions touched on the subject of doing a once-and-for-all benchmark of all data quality tools in the world.

My guess is that this is not going to happen. So, if you want to evaluate data quality tools, data matching is the prominent issue and you don’t just want a beauty contest, then you have to do as the queen in the fairy tale The Princess and the Pea: make a test.

Some important differentiators in data matching effectiveness may narrow down the scope for your particular requirements like:

  • Are you doing B2C (private names and addresses), B2B (business names and addresses) or both?
  • Do you only have domestic data or do you have international data with diversity issues?
  • Will you only go for one entity type (like customer or product) or are you going for multi-entity matching?

Making a proper test is not trivial.

Often you start by looking at the positive matches provided by the tool, counting the true positives against the false positives. Depending on the purpose, you want to see a very low figure for false positives compared to true positives.

Harder, but at least as important, is looking at the negatives (the records not matched), as explained in the post 3 out of 10.
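
As a minimal illustration (not from any particular tool), here is how the two figures could be summarised, assuming you have manually reviewed the tool’s positive matches and a random sample of the unmatched records; all numbers are made up:

```python
# Summarising a manual review of a matching tool's output.
# Assumes you have labelled a sample of matched pairs (true/false positives)
# and checked a random sample of unmatched records for missed duplicates.

def positive_match_quality(true_positives: int, false_positives: int) -> float:
    """Share of the tool's matches that are correct (precision)."""
    return true_positives / (true_positives + false_positives)

def estimated_missed_duplicates(unmatched_total: int,
                                sample_size: int,
                                missed_in_sample: int) -> float:
    """Extrapolate missed duplicates (false negatives) from a reviewed sample."""
    return unmatched_total * (missed_in_sample / sample_size)

if __name__ == "__main__":
    # Illustrative numbers only
    print(f"Precision: {positive_match_quality(940, 60):.1%}")
    print(f"Estimated missed duplicates: "
          f"{estimated_missed_duplicates(50_000, 500, 15):,.0f}")
```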

The next two features are essential:

  • To what degree are you able to tune the match rules, preferably in a user-friendly way that doesn’t require too much IT expert involvement?
  • Are you able to evaluate dubious matches in a speedy and user-friendly way, as shown in the post called When computer says maybe?

A data matching effort often has two phases:

  • An initial match of all currently stored data, perhaps supported by matching against external reference data. Here speed may be important too. Often you have to balance high speed against poorer results. Try it.
  • Ongoing matching that assists data entry and keeps up with data coming from outside your jurisdiction. Here data quality tools acting as service-oriented architecture components are a great plus, including reusing the rules from the initial match, as sketched after this list. This has to be tested too.
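
To illustrate the idea of reusing the rules from the initial match, here is a hedged Python sketch in which one (deliberately crude) match rule serves both a batch job and an ongoing matching service. The function names, normalisation and threshold are assumptions for illustration, not a description of any particular tool:

```python
# Sketch: one set of match rules reused for the initial batch match and for
# ongoing matching exposed as a service. All names here are hypothetical.

from difflib import SequenceMatcher

def normalise(record: dict) -> str:
    """Crude standardisation: lower-cased name plus postal code."""
    return f"{record.get('name', '').lower().strip()}|{record.get('postal_code', '').strip()}"

def is_match(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """The shared match rule: similarity of the normalised records."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio() >= threshold

def initial_batch_match(records: list[dict]) -> list[tuple[int, int]]:
    """Initial match: compare all pairs (fine for a test, not for millions of rows)."""
    return [(i, j)
            for i in range(len(records))
            for j in range(i + 1, len(records))
            if is_match(records[i], records[j])]

def ongoing_match_service(new_record: dict, existing: list[dict]) -> list[dict]:
    """Ongoing matching at data entry, reusing the very same rule."""
    return [rec for rec in existing if is_match(new_record, rec)]
```

The point is only that the rule lives in one place; a real tool would of course offer far richer rules and a proper service interface.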

And oh yes, from my experience with plenty of data quality tool evaluation processes: price is an issue too. Make sure to count both the license costs for all the needed features and the consultancy needs you learned about from your tests.

Golden Copy Musings

In a recent blog post by Jim Harris called Data Quality is not an Act, it is a Habit the term “golden mean” was mentioned.   

As I commented, the mention of the “golden mean” made me think of the terms “golden copy” and “golden record”, which are often used in data quality improvement and master data management.

In using these terms I think we are mostly aiming at achieving extreme uniqueness. But we should rather go for symmetry, proportion, and harmony.

The golden copy subject is very timely for me, as this weekend I am overseeing the execution of the automated processes that create a baseline for a golden copy of party master data at a franchise operator for a major car rental brand.

In car rental you are dealing with many different party types. You have companies as customers and prospects, and you have individuals who are contacts at the companies, employees using the cars rented by the companies, and individuals who are private renters. A real-world person may have several of these roles. Besides that, we have cases of mixed identities.

During a series of workshops we have defined the rules for merge and survivorship in the golden copy. Though we may be able to go for extreme uniqueness in identifying real-world companies and persons, this may not necessarily serve the business needs and, like it or not, the result must be capable of being related back to the core systems used in daily business.

Therefore this golden copy is based on a beautiful golden mean exposing symmetry, proportion, and harmony.

Deduplicating with a Spreadsheet

Say you have a table with a lot of names, postal addresses, phone numbers and e-mail addresses, and you want to remove duplicate rows in this table. Duplicates may be spelled exactly the same, but they may also be spelled somewhat differently while still describing the same real-world individual or company.

You can do the deduplicating with a spreadsheet.

In the old days some spreadsheets had a limit on the number of rows that could be processed, like the 65,536-row limit in older versions of Excel, but today spreadsheets can handle a lot of rows.

In this case you may have the following columns:

  • Name (could be given name and surname or a company name)
  • House number
  • Street name
  • Postal code
  • City name
  • Phone number
  • E-mail address

What you do first is sort the sheet by name, then postal code and then street name.

Then you browse down all the rows, focusing on one row at a time and looking up and down to see whether the rows before or after seem to be duplicates. If so, you delete all but one of the rows representing the same real-world entity.

When finished with all the rows sorted by name, postal code and street name, you make an alternate sort, because some possible duplicates may not begin with the same letters in the name field.

So what you do is sort the sheet by postal code, then street name and then house number.

Then you browse down all the rows, focusing on one row at a time and looking up and down to see whether the rows before or after seem to be duplicates. If so, you delete all but one of the rows representing the same real-world entity.

When finished with all the rows sorted by postal code, street name and house number, you make an alternate sort, because some possible duplicates may not have the proper postal code assigned, or the street name may not start with the same letters.

So what you do is sort the sheet by city name, then house number and then name.

Then you browse down all the rows, focusing on one row at a time and looking up and down to see whether the rows before or after seem to be duplicates. If so, you delete all but one of the rows representing the same real-world entity.

When finished with all the rows sorted by city name, house number and name, you make an alternate sort, because some duplicates may have moved or may have different addresses for other reasons.

So what you do is sort the sheet by phone number, then name and then postal code.

Then you browse down all the rows, focusing on one row at a time and looking up and down to see whether the rows before or after seem to be duplicates. If so, you delete all but one of the rows representing the same real-world entity.

When finished with all the rows sorted by phone number, name and postal code, you make an alternate sort, because some duplicates may not have a phone number or may have different phone numbers.

So what you do is sort the sheet by e-mail address, then name and then postal code.

Then you browse down all the rows, focusing on one row at a time and looking up and down to see whether the rows before or after seem to be duplicates. If so, you delete all but one of the rows representing the same real-world entity.
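
If you would rather let a small script do the repeated sort-and-scan for you, here is a minimal Python sketch of the same approach. The file name, column headers and similarity threshold are assumptions for illustration:

```python
# Sketch of the sort-and-scan approach from this post, assuming a CSV file
# with the columns listed above (file name and column headers are hypothetical).

import csv
from difflib import SequenceMatcher

SORT_PASSES = [
    ("name", "postal_code", "street_name"),
    ("postal_code", "street_name", "house_number"),
    ("city_name", "house_number", "name"),
    ("phone_number", "name", "postal_code"),
    ("email", "name", "postal_code"),
]

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def candidate_pairs(rows: list[dict]) -> set[tuple[int, int]]:
    """Sort by each key combination and compare neighbouring rows."""
    candidates = set()
    for keys in SORT_PASSES:
        ordered = sorted(range(len(rows)), key=lambda i: [rows[i][k] for k in keys])
        for a, b in zip(ordered, ordered[1:]):
            if similar(rows[a]["name"], rows[b]["name"]) and \
               similar(rows[a]["street_name"], rows[b]["street_name"]):
                candidates.add((min(a, b), max(a, b)))
    return candidates

if __name__ == "__main__":
    with open("contacts.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    for a, b in sorted(candidate_pairs(rows)):
        print(rows[a]["name"], "<->", rows[b]["name"])
```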

You may:

  • If you only have a few rows, you can do this process within a few hours and possibly find all the duplicates
  • If you have a lot of rows, you can do this process within a few years and possibly find some of the duplicates

PS: The better option is of course to avoid having duplicates in the first place. Unfortunately this is not the case in many situations – see The Top 5 Reasons for Downstream Cleansing.

Out of Facebook

A while ago it was announced that Facebook had signed up member number 500,000,000.

If you are working with customer data management you will know that this doesn’t mean that 500,000,000 distinct individuals are using Facebook. Like any customer table, the Facebook member table suffers from a number of data quality issues, such as:

  • Some individuals are signed up more than once using different profiles.
  • Some profiles are not an individual person, but a company or other form of establishment.
  • Some individuals who created a profile are not among us anymore.

Nevertheless, the Facebook member table is a formidable collection of external reference data representing the real-world objects that many companies are trying to master when doing business-to-consumer activities.

For companies doing business-to-business activities, a similar representation of real-world objects will be the 70,000,000+ profiles on LinkedIn, plus profiles in other social business networks around the world, which may act as external reference data for the business contacts in master data hubs, CRM systems and so on.

Customer Master Data sources will expand to embrace:

  • Traditional data entry from field work like a sales representative entering prospect and customer master data as part of Sales Force Automation.
  • Data feed and data integration with traditional external reference data like using a business directory. Such integration will increasingly take place in the cloud and the trend of governments releasing public sector data will add tremendously to this activity.
  • Self registration by prospects and customers via webforms.
  • Social media master data captured during social CRM and probably harvested in more and more structured ways as a new wave of exploiting external reference data.

Doing “Social Master Data Management” will become an integrated part of customer master data management offering both opportunities for approaching a “single version of the truth” and some challenges in doing so.

Of course privacy is a big issue. Norms vary between countries, and so do the legal rules. Norms vary between individuals, and for the same individual acting as a private person and as a business contact. Norms vary between industries and from company to company.

But the fact that 500,000,000 profiles have been created on Facebook in very few years by people from all over the world shows that people are willing to share and that much information can be collected in the cloud. However, no one wants to be spammed as a result of sharing, and indeed there have been some controversies around how data in Facebook is handled.

Anyway, I have no doubt that we will see fewer data entry clerks entering the same information in each company’s separate customer tables, and that we will increasingly share our own master data attributes in the cloud.

Follow Friday Data Quality

Every Friday on Twitter people are recommending other tweeps to follow using the #FollowFriday (or simply #FF) hash tag.

My username on twitter is @hlsdk.

Sometimes I notice tweeps I follow are recommending the username @hldsk or @hsldk or other usernames with my five letters swapped.

It could be that they meant me but misspelled the username. Or did they mean someone else with a username close to mine?

As the other usernames weren’t taken, I have taken the liberty of creating some duplicate profiles (shame on me) and having a bit of (nerdish) fun with it:

@hsldk

For this profile I have chosen the image of the Swedish Chef from The Muppet Show. To make the Swedish connection real, the location on the profile is set to “Oresund Region”, which is the binational metropolitan area around the Danish capital Copenhagen and the third largest Swedish city, Malmoe, as explained in the post The Perfect Wrong Answer.

@hldsk

For this profile I have chosen the image of a gorilla originally used in the post Gorilla Data Quality.

This Friday @hldsk was recommended thrice.

But I think only by two real-life individuals: Joanne Wright from Vee Media and Phil Simon, who also tweets as his new (one-man-band, I guess) publishing company.

What’s the point?

Well, one of my main activities in business is hunting duplicates in party master databases.

What I sometimes find is that duplicates (several rows representing the same real world entity) have been entered for a good reason in order to fulfill the immediate purpose of use.

The thing with Phil and his one-man-band company is explained further in the post So, What About SOHO Homes.

By the way, Phil is going to publish a book called The New Small. It’s about: How a New Breed of Small Businesses is Harnessing the Power of Emerging Technologies.

360° Share of Wallet View

I have found this definition of Share of Wallet on Wikipedia:

Share of Wallet is the percentage (“share”) of a customer’s expenses (“of wallet”) for a product that goes to the firm selling the product. Different firms fight over the share they have of a customer’s wallet, all trying to get as much as possible. Typically, these different firms don’t sell the same but rather ancillary or complementary product.

Measuring your share of given wallets – and your performance in increasing it – is a multi-domain master data management exercise as you have to master both a 360° view of customers and a 360° view of products.

With customer master data you are forced to handle uniqueness (consolidate duplicates) of customers and handle hierarchies of customers, which is further explained in the post 360° Business Partner View.

With product master data you are not only forced to categorize your own products and handle hierarchies within, but you also need to adapt to external categorizations in order to get access to external spending data, probably at a high level for a segment of customers, but sometimes even down to the single customer.

Location master data may be important here for geographical segmentations and identification.
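
As a minimal illustration of the arithmetic once the customer and product views are consolidated (all names and numbers below are made up):

```python
# Illustrative arithmetic only: share of wallet for one customer in one
# product category, given consolidated figures (all numbers are made up).

def share_of_wallet(our_sales_to_customer: float,
                    customer_total_category_spend: float) -> float:
    """The percentage of the customer's category spend that goes to us."""
    return our_sales_to_customer / customer_total_category_spend

# Example: we sold 120,000 into a category where the customer spends 400,000
print(f"{share_of_wallet(120_000, 400_000):.0%}")  # -> 30%
```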

My educated guess is that companies will increasingly rely on better data quality and master data management processes and infrastructure in order to measure precise shares of wallets, and thereby gain advantages in stiff competition, rather than relying on gut feelings and best guesses.

What a Lovely Day

As promised earlier today, here is the first post in an endless row of positive posts about success in data quality improvement.

This beautiful morning I finished yet another of these nice recurring jobs I do from time to time: deduplicating batches of files ready for direct marketing, making sure that only one, the whole one and nothing but one unique message reaches a given individual decision maker, be that in the online or offline mailbox.

Most jobs are pretty similar, and I have a fantastic tool that automates most of the work. I only have the pleasure of learning about the nature of the data and configuring the standardisation and matching process accordingly in a user-friendly interface. After the automated process I enjoy looking for any false positives and checking for false negatives. Sometimes I’m lucky enough to have the chance to repeat the process with a slightly different configuration so we reach the best possible result.

It’s a great feeling that this work reduces my clients’ mailing costs, makes them look smarter and more professional, and facilitates the correct measurement of response rates that is so essential in planning future, even better, direct marketing activities.

But that’s not all. I’m also delighted to be able to have a continuing chat about how we may, over time, introduce data quality prevention upstream at the point of data entry, so we don’t have to do these recurring downstream cleansing activities any more. It’s always fascinating going through all the different applications that many organisations are running, some of them so old that I didn’t dream they still existed. Most times we are able to build a solution that will work in the given landscape, and anyway, soon the credit crunch will be totally gone and here we go.

I’ll be back again with more success from the data quality improvement frontier very soon.

Four Different Data Matching Stage Types

One of the activities I do in my leisure time is cycling. As a consequence I guess I also like to watch cycling on TV (or on the computer), not least the paramount cycling event of the year: Le Tour de France.

In Le Tour de France you basically have four different types of stages:

  • Time trial
  • Stages on flat terrain
  • Stages through hilly landscape
  • Stages in the high mountains

Some riders are specialists in one of the stage types and some riders are more all-around types.

With automated data matching, which is what I do most in my business time, there are basically also four different types of processes:

  • Internal deduplication of rows inside one table
  • Removal of rows in one table which also appear in another table
  • Consolidation of rows from several tables
  • Reference matching with rows in one table against another (big) table

Internal deduplication

Examples of data matching objectives here are finding duplicates in names and addresses before sending a direct mail, or finding the same products in a material master.

The big question in this type of process is whether you are able to balance not making any false positives (being too aggressive) against not leaving too many false negatives behind (losing the game). You also have to think about survivorship when merging into a golden record.
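
As a small illustration of survivorship, here is a hedged Python sketch where the most recently updated non-empty value wins per field; the rule and the field names are assumptions chosen for illustration, and real merge rules are usually agreed in workshops:

```python
# Minimal survivorship sketch: merge duplicate rows into one golden record
# by letting the most recently updated non-empty value win per field.
# Field names and the rule itself are illustrative assumptions.

from datetime import date

def merge_to_golden_record(duplicates: list[dict]) -> dict:
    golden = {}
    # Oldest rows first, so values from newer rows overwrite older ones
    for row in sorted(duplicates, key=lambda r: r["last_updated"]):
        for field, value in row.items():
            if field != "last_updated" and value:
                golden[field] = value
    return golden

rows = [
    {"name": "Acme Corp", "phone": "", "city": "Copenhagen",
     "last_updated": date(2009, 5, 1)},
    {"name": "ACME Corporation", "phone": "+45 12345678", "city": "",
     "last_updated": date(2010, 6, 15)},
]
print(merge_to_golden_record(rows))
# -> {'name': 'ACME Corporation', 'city': 'Copenhagen', 'phone': '+45 12345678'}
```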

In Le Tour de France the overall leader who gets the yellow jersey has to make a good time trial.

Removal

Here examples of data matching objectives would be eliminating nixies (people who don’t want offers by mail) before sending a direct mail, or eliminating bad payers (people you don’t want to offer credit).

This is probably the easiest process, one everyone can do – but at the end of the day some are better sprinters than others.

The best sprinter in Le Tour de France gets the green jersey.

Consolidation

When migrating databases and/or building a master data hub you often have to merge rows from several different tables into a golden copy.

Here you often see the difficulty of making data fit for the immediate purpose of use while at the same time keeping it aligned with the real world, in order to also be able to handle the needs that arise tomorrow.

Often some of the young riders in Le Tour de France make an escape when climbing the hills, and the best young rider gets the white jersey.

Reference match

Business directory matching has been a focus area of mine, including making a solution for matching with the D&B worldbase. The worldbase holds over 165 million rows representing business entities from all over the world.

The results from automated matching with such directories may vary a lot, just as you see huge time differences in Le Tour de France when the riders face the big mountains. Here the best climber gets the polka dot jersey.

Data Quality is an Ingredient, not an Entrée

Fortunately it is increasingly recognized that you don’t get success with Business Intelligence, Customer Relationship Management, Master Data Management, Service Oriented Architecture and many other disciplines without starting with improving your data quality.

But it would be a big mistake to see data quality improvement as an entrée before the main course of BI, CRM, MDM, SOA or whatever is on the menu. You need ongoing prevention against having your data polluted again over time.

Improving and maintaining data quality involves people, processes and technology. Now, I am not neglecting the people and process side, but as my expertise is in the technology part, I would like to mention some of the technological ingredients that help keep data quality at a tasty level in your IT implementations.

Mashups

Many data quality flaws are (not surprisingly) introduced at data entry. Enterprise data mashups with external reference data may help during data entry, like:

  • An address may be suggested from an external source.
  • A business entity may be picked from an external business directory.
  • Various rules exist in different countries for using consumer/citizen directories – why not use the best available where you do business.
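
As a minimal sketch of the first bullet above, here is how an address suggestion from external reference data could look during data entry; the in-memory reference list stands in for a real postal address file or directory service (all data is hypothetical):

```python
# Sketch: suggesting a complete address from external reference data while the
# user types. The in-memory reference list stands in for a real postal address
# file or directory service (hypothetical data).

REFERENCE_ADDRESSES = [
    "Main Street 1, 1000 Copenhagen",
    "Main Street 10, 1000 Copenhagen",
    "Market Square 5, 8000 Aarhus",
]

def suggest_addresses(partial_input: str, limit: int = 5) -> list[str]:
    """Return reference addresses that start with what the user has typed."""
    typed = partial_input.lower().strip()
    return [addr for addr in REFERENCE_ADDRESSES
            if addr.lower().startswith(typed)][:limit]

print(suggest_addresses("main street 1"))
# -> ['Main Street 1, 1000 Copenhagen', 'Main Street 10, 1000 Copenhagen']
```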

External IDs

Getting the data entry right at the root is important, and most (if not all) data quality professionals agree that this is a superior approach compared to doing cleansing operations downstream.

The problem, though, is that most data erodes as time passes. What was right at the time of capture will at some point no longer be right.

Therefore data entry ideally should not only capture a snapshot of correct information but also include raw data elements that make the data easily maintainable.

Error tolerant search

A common workflow when in-house personnel enter new customers, suppliers, purchased products and other master data is that you first search the database for a match. If the entity is not found, you create a new entity. When the search fails to find an actual match, we have a classic and frequent cause of introduced duplicates.

An error-tolerant search is able to find matches despite spelling differences, differently arranged words, various concatenations and many other challenges we face when searching for names, addresses and descriptions.
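
As a minimal sketch of what error tolerance means in practice, the following Python example normalises word order and applies a fuzzy similarity ratio. Real tools use far more sophisticated techniques (phonetic keys, address parsing, synonym handling), so the threshold and helpers here are illustrative only:

```python
# Minimal error-tolerant search sketch: token sorting handles reordered words,
# fuzzy ratio handles spelling differences. Real tools go much further.

from difflib import SequenceMatcher

def normalise(text: str) -> str:
    """Lower-case and sort the words, so 'Smith John' matches 'John Smith'."""
    return " ".join(sorted(text.lower().split()))

def error_tolerant_search(query: str, records: list[str],
                          threshold: float = 0.8) -> list[str]:
    """Return records whose normalised form is similar enough to the query."""
    q = normalise(query)
    return [rec for rec in records
            if SequenceMatcher(None, q, normalise(rec)).ratio() >= threshold]

customers = ["John Smith", "Jon Smiht", "Smith, John", "Joan Smythe", "Peter Hansen"]
print(error_tolerant_search("John Smith", customers))
# -> ['John Smith', 'Jon Smiht', 'Smith, John']
```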

Seeing Is Believing

One of my regular activities as a practice manager at a data quality tool vendor is making what we call a ”Test Report”.

Such a “Test Report” is a preferred presales activity regardless of whether we are up against a competitor or against the option of doing nothing (or nothing more) to improve data quality. In the latter case I usually name our competitor “Laissez-Faire”.

Most of the test reports I do revolve around the most frequent data quality issue: duplicates in party master data – names and addresses.

Looking at what an advanced data matching tool can do with your customer master data and other business partner registries is often the decisive factor for choosing to implement the tool.

I like to do the test with a full extract of all current party master data.

A “Test Report” has two major outcomes:

  • Quantifying the estimated number of different types of duplicates, which is the basis for calculating the expected Return on Investment for implementing such an advanced data matching tool (a back-of-the-envelope sketch of that calculation follows after this list).
  • Qualifying both some typical and some special examples in order to point out the tuning efforts needed both for an initial match and for the recommended ongoing prevention.
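
As a back-of-the-envelope illustration of the Return on Investment calculation mentioned in the first bullet (every figure below is made up and should be replaced with the numbers from your own test report):

```python
# Back-of-the-envelope ROI illustration for a data matching tool.
# Every figure below is made up; plug in the numbers from your own test report.

duplicate_rate = 0.06            # estimated share of rows that are duplicates
rows = 500_000
cost_per_duplicate_mailing = 1.5 # print, postage and handling per wasted mailing
mailings_per_year = 4

annual_waste = duplicate_rate * rows * cost_per_duplicate_mailing * mailings_per_year

license_cost = 60_000            # per year, for all the needed features
consultancy_cost = 25_000        # tuning effort learned from the test

annual_saving = annual_waste - (license_cost + consultancy_cost)
print(f"Annual waste from duplicates: {annual_waste:,.0f}")
print(f"Estimated net annual saving:  {annual_saving:,.0f}")
```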

When participating in follow-up meetings I have found that discussions about what a tool can do (and cannot do) are much more sensible when backed up by concrete numbers and concrete examples from your particular data.
