Data Matching 101

Following up on my post no. 100, I can’t resist making a post with 101 in the title. I’ll use 101 in the sense of an introduction to a subject. As “Data Quality 101” and “MDM 101” are already widely discussed, I think “Data Matching 101” is a good title.

Data matching deals with the dimension of data quality I like to call uniqueness. I use uniqueness because it is the positive term describing the state we want to bring our data to – as opposed to duplication, which is the state we want to change. This is just like the other dimensions of data quality, such as accuracy, consistency and timeliness, which also describe desired states.

Data matching is, besides data profiling, the activity within data quality that has been automated the most. No wonder, since duplicates – especially in master data – and master data not being aligned with the real world are costing organizations incredible amounts of money. Finding duplicates among millions (or even thousands) of records by manual means is impossible. The same is true for matching against directories holding timely descriptions of the real world. You have to use a computerized approach, controlled by exactly the amount of manual verification that keeps your return on investment positive.

Matching names and addresses (party master data) is the most common area of data matching. Matching product master data is probably going to be the next big thing in matching. I have also been involved in matching location data and timetables.

A computerized approach to data matching may include different techniques such as parsing and standardization, using synonyms, assigning match codes, advanced algorithms and probabilistic learning.
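
To make that a little more concrete, here is a minimal sketch in Python – with made-up synonym rules and a deliberately crude match code – of how standardization and match codes can bring two differently spelled party records together as candidates:

```python
import re

# A few illustrative standardization rules - real tools ship with
# comprehensive, country-specific synonym dictionaries.
SYNONYMS = {"rd": "road", "st": "street", "ave": "avenue", "bob": "robert"}

def standardize(text: str) -> str:
    """Lowercase, strip punctuation and replace known synonyms."""
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return " ".join(SYNONYMS.get(t, t) for t in tokens)

def match_code(name: str, address: str) -> str:
    """A crude match code: first letters of the standardized tokens."""
    tokens = (standardize(name) + " " + standardize(address)).split()
    return "".join(t[0] for t in tokens)

a = match_code("Bob Smith", "12 Main St.")
b = match_code("Robert Smith", "12 Main Street")
print(a, b, a == b)  # identical codes -> candidates for closer comparison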

All that is best explained with examples. Therefore I am happy to do a webinar called “The Art of Data Matching” as part of a series of free webinars on eLearningCurve. The webinar will be a sightseeing tour looking at examples of challenges and solutions in the data matching world.

Date and time: Well, these are matching examples of expressing the moment the webinar starts – all pointing at the same instant, as checked in the small sketch after the list:

  • Friday 06/04/10 12pm EDT
  • Friday 04/06/10 18:00 Central European Summer Time
  • Sydney, Sat Jun 5 2:00 AM
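
Just to show that these three representations do point at the same instant, here is a small Python check (the IANA zone names are my own choice for EDT, Central European Summer Time and Sydney):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Friday 06/04/10 12pm EDT, expressed with an IANA time zone
start = datetime(2010, 6, 4, 12, 0, tzinfo=ZoneInfo("America/New_York"))

# The same instant seen from Central Europe and from Sydney
print(start.astimezone(ZoneInfo("Europe/Copenhagen")))  # 2010-06-04 18:00 CEST
print(start.astimezone(ZoneInfo("Australia/Sydney")))   # 2010-06-05 02:00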

Link to the eLearningCurve free webinar here.


Relational Data Quality

Most of the work related to data quality improvement I do is done with data in relational databases and is aimed at creating new relations between data. Examples (from party master data) are:

  • Make a relation between a postal address in a customer table and a real world address (represented in an official address dictionary) – see the sketch after this list.
  • Make a relation between a business entity in a vendor table and a real world business (represented in a business directory most often derived from an official business register).
  • Make a relation between a consumer in one prospect table and a consumer in another prospect table because they are considered to represent the same real world person.
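
A minimal sketch of the first kind of relation, assuming a simple normalized lookup against an official address dictionary (all names and records here are made up for illustration):

```python
# Toy "official address dictionary": normalized address -> real world address ID
ADDRESS_DICTIONARY = {
    "main street 12 5000 odense": "DK-ADDR-001",
    "church road 7 8000 aarhus": "DK-ADDR-002",
}

def normalize(address):
    return " ".join(address.lower().replace(",", " ").replace(".", " ").split())

def relate_to_real_world(customer_address):
    """Return the ID of the matching real world address, or None."""
    return ADDRESS_DICTIONARY.get(normalize(customer_address))

print(relate_to_real_world("Main Street 12, 5000 Odense"))  # DK-ADDR-001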

When striving for multi-purpose data quality it is often necessary to reflect further relations from the real world like:

  • Make a relation in a database reflecting that two (or more) persons belong to the same household (at the same real world address) – see the sketch after this list.
  • Make a relation in the database reflecting that two (or more) companies have the same (ultimate) parent company.
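
As an illustration of the household relation, here is a minimal sketch that simply groups persons on an already matched real world address ID (a real implementation would of course also weigh names and other evidence):

```python
from collections import defaultdict

# Toy person records: (person ID, name, matched real world address ID)
persons = [
    ("P1", "Anna Jensen", "DK-ADDR-001"),
    ("P2", "Peter Jensen", "DK-ADDR-001"),
    ("P3", "Maria Hansen", "DK-ADDR-002"),
]

# Group persons sharing the same real world address into households
households = defaultdict(list)
for person_id, name, address_id in persons:
    households[address_id].append(person_id)

print(dict(households))  # {'DK-ADDR-001': ['P1', 'P2'], 'DK-ADDR-002': ['P3']}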

Having these relations done right is fundamental for any further data quality improvement endeavors and all the exciting business intelligence stuff. Along the way you may continue to have more or less fruitful discussions on, say, the classic question: What is a customer?

But in my eyes, in relation to data quality, it doesn’t matter whether that discussion ends with a given row in your database being a customer, an old customer, a prospect or something else. Building the relations may even help you realize what that someone really is. It could be that a sporadic lead is recognized as belonging to the same household as a good customer. It could be that a vendor is recognized as a subsidiary of a hot prospect. It could be that someone is recognized as being fake. And you may even have business intelligence that, based on the relations, reports a given row in a customer role in one context and another role in another context.

Big Time ROI in Identity Resolution

Yesterday I had the chance to make a preliminary assessment of the data quality in one of the local databases holding information about entities involved in carbon trade activities. It is believed that up to 90 percent of the market activity may have been fraudulent, with criminals pocketing 5 billion Euros. There is a description of the scam here from telegraph.co.uk.

Most of my work with data matching is aimed at finding duplicates. In doing this you must avoid so-called false positives, so you don’t end up merging information about two different real world entities. But when doing identity resolution for other purposes, including preventing fraud and scams, you may be interested in finding connections between entities that are not supposed to be connected at all.

The result from making such connections in the carbon trade database was quite astonishing. In one example – where I have changed the names, addresses, e-mails and phones, but such a pattern was found in several cases – a group of entities shared a name, address, e-mail or phone in a way that doesn’t seem natural.
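
A hedged sketch of how such connections can be surfaced: build a graph where entities sharing an address, e-mail or phone are linked, and pull out the connected groups (the records below are invented, just like the changed details above):

```python
from collections import defaultdict

# Made-up trader records: ID, address, e-mail, phone
entities = [
    ("E1", "1 High Street", "joe@example.com",  "555-0100"),
    ("E2", "1 High Street", "anna@example.com", "555-0199"),
    ("E3", "9 Park Lane",   "joe@example.com",  "555-0123"),
    ("E4", "7 Mill Road",   "carl@example.com", "555-0123"),
    ("E5", "2 Oak Avenue",  "eva@example.com",  "555-0777"),
]

# Link entities that share any identifying attribute
by_value = defaultdict(set)
for eid, *values in entities:
    for value in values:
        by_value[value].add(eid)

# Union-find style grouping of linked entities
parent = {eid: eid for eid, *_ in entities}
def find(x):
    while parent[x] != x:
        x = parent[x]
    return x
for ids in by_value.values():
    ids = sorted(ids)
    for other in ids[1:]:
        parent[find(other)] = find(ids[0])

groups = defaultdict(list)
for eid, *_ in entities:
    groups[find(eid)].append(eid)
print([g for g in groups.values() if len(g) > 1])  # [['E1', 'E2', 'E3', 'E4']]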

My involvement in the carbon trade scam was initiated by a blog post yesterday by my colleague Jan Erik Ingvaldsen, based on the story that journalists, by merely glancing at the database, had found addresses that simply don’t exist.

So the question is whether the authorities could have avoided losing 5 billion taxpayer Euros if some identity resolution, including automated fuzzy connection checks and real world checks, had been implemented. I know that everyone is so much more enlightened about what could have been done once a scam has been discovered, but I actually think that there are a lot of other billions of Euros (Pounds, Dollars, Rupees) out there to avoid losing by doing some decent identity resolution.


Aadhar (or Aadhaar)

The solution to the single most frequent data quality problem – party master data duplicates – is actually very simple. Every person (and every legal entity) gets a unique identifier which is used everywhere by everyone.

Now India jumps on the bandwagon and starts assigning a unique ID to the 1.2 billion people living in India. As I understand it, the project has just been named Aadhar (or Aadhaar). Google Translate tells me this word (आधार) means base or root – please correct me if anyone knows better.

In Denmark we have had such an identifier (one for citizens and one for companies) for many years. It is not used by everyone everywhere – so you are still able to make money as a data quality professional specializing in data matching.

The main reason that the unique citizen identifier is not used all over is of course privacy considerations. As for the unique company identifier, the reason is that data quality is often defined as fitness for the immediate purpose of use.


Merging Customer Master Data

One of the most frequent assignments I have had within data matching is merging customer databases after two companies have been merged.

This is one of the occasions where it doesn’t help to recite the usual data quality mantras like:

  • Prevention and root cause analysis is a better option
  • Change management is a critical factor in ensuring long-term data quality success
  • Tools are not important

It is often essential for the new merged company to have a 360 degree view of business partners as soon as possible in order to maximize synergies from the merger. If the volumes are above just a few thousand entities it is not possible to obtain that using human resources alone. Automated matching is the only realistic option.

The types of entities to be matched may be:

  • Private customers – individuals and households (B2C)
  • Business customers (B2B) on account level, enterprises, legal entities and branches
  • Contacts for these accounts

I have developed a slightly extended version of this typification here.

One of the most common challenges in merging customer databases is that hierarchy management may have been done very differently in the past within the merging organizations. When aligning the different perceptions I have found that a real world approach often reconciles the different lines of reasoning.

The fuzziness needed for the matching basically depends on the common unique keys available in the two databases. These are keys such as citizen IDs (whatever they are labeled around the world) and public company IDs (the same applies). Matching both databases against an external source (per entity type) is an option. “Duns Numbering” is probably the most commonly known type of such an approach. Maintaining a solution for assigning Duns Numbers to customer files from the D&B WorldBase is by the way one of my other assignments, as described here.

The automated matching process may be divided into these three steps:

During my many years of practice in doing this I have found that the results from the automated process may vary considerably in quality and speed depending on the tools used.
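
To make the automated part a bit more concrete, here is a minimal sketch of one possible breakdown – standardization, candidate generation by blocking, and similarity scoring. The step names, data and blocking key are my own illustrative choices, not a definitive recipe:

```python
from difflib import SequenceMatcher
from collections import defaultdict
from itertools import combinations

# Two made-up customer files to be merged
customers_a = [("A1", "Robert Smith", "12 Main Street, Springfield")]
customers_b = [("B7", "Bob Smith", "12 Main St., Springfield"),
               ("B9", "Eva Green", "3 Oak Avenue, Shelbyville")]

def standardize(text):
    return " ".join(text.lower().replace(".", "").replace(",", "").split())

# Standardize and block on a crude key: first letter of the last name token
# plus the last token of the address (here the city)
blocks = defaultdict(list)
for source, records in (("A", customers_a), ("B", customers_b)):
    for rid, name, address in records:
        key = standardize(name).split()[-1][0] + standardize(address).split()[-1]
        blocks[key].append((source, rid, standardize(name + " " + address)))

# Score candidate pairs across the two files within each block
for candidates in blocks.values():
    for (s1, id1, t1), (s2, id2, t2) in combinations(candidates, 2):
        if s1 != s2:
            score = SequenceMatcher(None, t1, t2).ratio()
            print(id1, id2, round(score, 2))  # pairs above a threshold go to review/merge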


Matchback and Master Data Management

The term matchback is used by marketers for the process of determining which marketing activity triggered a given purchase. In these times, where multichannel marketing and sales are embraced by more and more companies, doing matchback is becoming more and more complicated.

The core functionality in matchback is good old data matching, like: Does the name and address in a catalogue mailing match (with a certain similarity) the name and address of a new buyer? But you also have to ask questions such as: Is this buyer in fact a new buyer or did he buy before – in this channel or in another channel? Was this buyer also included in a concurrent email campaign? If private: Is the new buyer in the same household as an old buyer? If business: Does the new buyer belong to the same company family tree as the old buyer? Was the contact actually a contact at an old business customer?

Answering these questions will be a total mess if you don’t have a solid party master data management program in place. You need to:

  • Store (or at least reference) all party entities from all channels in one single so-called golden copy – sketched in code after this list
  • Identify the same real world entities
  • Build the hierarchies necessary for current and possible future uses of data
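
A minimal sketch of what such a golden copy entry could look like (the field names are made up for illustration; a real model would be considerably richer):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GoldenParty:
    party_id: str                          # ID of the golden record
    name: str
    address_id: Optional[str] = None       # link to a matched real world address
    source_refs: List[str] = field(default_factory=list)  # records in each channel
    household_id: Optional[str] = None     # hierarchy: household membership
    parent_party_id: Optional[str] = None  # hierarchy: company family tree

web_buyer = GoldenParty("G-001", "Anna Jensen", address_id="DK-ADDR-001",
                        source_refs=["WEB:4711", "CATALOGUE:98"],
                        household_id="H-12")
print(web_buyer)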

Doing matchback is only one of many activities setting the requirements for a party master data management program within an enterprise. And by the way: When that is up and running, the next thing you need is to manage your product master data the same way in order to make further analyses – and probably you also need a better structure and data quality for your location master data.

I keep my notes about Master Data Management here.


Enterprise Data Mashup and Data Matching

A mashup is a web page or application that uses or combines data or functionality from two or more external sources to create a new service. Mashups can be considered to have an active role in the evolution of social software and Web 2.0. Enterprise Mashups are secure, visually rich web applications that expose actionable information from diverse internal and external information sources. So says Wikipedia.

I think that Enterprise Mashups will need data matching – and data matching will improve from data mashups.

The joys and challenges of Enterprise Mashups were recently touched upon in the post “MDM Mashups: All the Taste with None of the Calories” by Amar Ramakrishnan of Initiate. Data needs to be cleansed and matched before being exposed in an Enterprise Mashup. An Enterprise Mashup is then a fast way to deliver Master Data Management results to the organization.

Party Data Matching has typically been done in these two often separated contexts:

  • Matching internal data like deduplicating and consolidating
  • Matching internal data against an external source like address correction and business directory matching

Increased utilization of multiple functions and multiple sources – like in a mashup – will help make better matching. Some examples I have tried include:

  • If you know whether an address is unique or not, this information can be used to settle the confidence of an individual or household duplicate – see the sketch after this list.
  • If you know whether an address is a single residence or a multiple residence (like a nursing home or campus), this information can be used to settle the confidence of an individual or household duplicate.
  • If you know the frequency of a name (in a given country), this information can be used to settle the confidence of a private, household or contact duplicate.
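
As a small sketch of the idea, here is how such external knowledge could be folded into a duplicate confidence score (the weights and helper names are made up for illustration, not taken from any particular tool):

```python
def duplicate_confidence(name_score, address_score,
                         address_is_unique, name_frequency):
    """Blend raw similarity scores with real world knowledge.

    address_is_unique: the address identifies a single residence
    name_frequency:    how common the name is in the relevant country (0..1)
    """
    confidence = 0.5 * name_score + 0.5 * address_score
    if address_is_unique:
        confidence += 0.10                # a single residence makes a shared address stronger evidence
    confidence -= 0.15 * name_frequency   # a very common name is weaker evidence
    return max(0.0, min(1.0, confidence))

# Same raw similarity, but a unique address and a rare name lift the confidence
print(duplicate_confidence(0.8, 0.9, address_is_unique=True, name_frequency=0.05))
print(duplicate_confidence(0.8, 0.9, address_is_unique=False, name_frequency=0.90))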

As many data quality flaws (not surprisingly) are introduced at data entry, mashups may help during data entry, like:

  • An address may be suggested from an external source.
  • A business entity may be picked from an external business directory.
  • Various rules exist in different countries for using consumer/citizen directories – why not use the best available where you do business?

Also the rise of social media adds new possibilities for mashup content during data entry, data maintenance and other uses of MDM / Enterprise Mashups. Like it or not, your data on Facebook, Twitter and not least LinkedIn are going to be matched and mashed up.


Which came first, the chicken or the egg?

The most common symbol for Easter, which is just around the corner in countries with Christian cultural roots, is the decorated egg.  What a good occasion to have a little “which came first” discussion.

So, where do you start if you want better information quality: Data Governance or Data Quality improvement?

In order to exemplify with something that is known in nearly everyone’s business, let’s look at party master data, where we face the ever recurring question: What is a customer? Do you have to know the precise answer to that question (which looks like a Data Governance exercise) before correcting your party master data (which often is a Data Quality automation implementation)?

I think this question is closely related to the two ways of having high quality data:

  • Either they are fit for their intended uses
  • Or they correctly represent the real-world construct to which they refer

In my eyes the first way, making data fit for their intended uses, is probably the best way if you aim for information quality in one or two silos, but the second way, alignment with the real world, is the better and less cumbersome way if you aim for enterprise wide information quality, where data are fit for current and future multiple purposes.

So, starting with Data Governance and then a long way down the line applying some Data Quality automation like Data Profiling and Data Matching seems to be the way forward if you go for intended use.

On the other hand, if you go for real world alignment, it may be best to start with some Data Profiling and Data Matching in order to realize what the state of your data is and make the first corrections towards having your party master data aligned with the real world. From there you go forward on an iterative Data Governance and Data Quality automation (never ending) journey, which includes discovering what a customer role really is.


What is a best-in-class match engine?

Most recently, in connection with TIBCO acquiring the data matching vendor Netrics, the term best-in-class match engine has been attached to the Netrics product.

First: I have no doubt that the Netrics product is a capable match engine – I know that from discussions in the LinkedIn Data Matching group and here on this blog.

Next: I don’t think anyone knows which product is the best match engine, because I don’t think that all match engines have been benchmarked with a representative set of data.

On top of that there are of course the matching capabilities with different entity types to consider. Here party master data (like customer data) are covered by most products, whereas capabilities with other entity types (be they considered same same or not) are far less exposed.

As match engine products are acquired and integrated into suites, the core matching capabilities somehow become mixed up with a lot of other capabilities, making it hard to compare the match engine alone.

Some independent match engines work stand alone and some may be embedded into other applications.

These may then be the classes to be best in:

  • Match engines in suites
  • Embedded match engines (for say SAP, MS CRM and so on)
  • Stand alone match engines

Many match engines I have seen are tuned to deal with data from the country (culture) where they were born and had their first triumphs. As the US market is still by far the largest for match engines, nominating the best match engine resembles a team becoming World Champions in American Football. International/multi-cultural capabilities will become more and more important in data matching. But indeed we may define a class for each country (culture).

In the old days I heard that one match engine was best for marketing data and another match engine was best for credit risk management. I think those days are over too. With Master Data Management you have to embrace all data purposes.

Some match engines are more successful in certain industries than in others. The biggest differentiator in match effectiveness is whether an engine handles B2C and/or B2B data. B2C is the easiest, B2B is more complex, and embracing both is in my eyes a must for being considered best-in-class – unless we define separate classes for B2C, B2B and both.

As some matching techniques are deterministic and some are probabilistic, the evaluation of the latter will be based on data already processed in a given instance, since the matching gets better and better as the self-learning element is warmed up.
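
To illustrate the distinction in a deliberately simplified way (the rules, weights and records below are invented): a deterministic rule gives a yes/no answer from fixed criteria, while a probabilistic engine weighs evidence into a score – and in a real engine those weights are tuned from data it has already processed:

```python
NAME_WEIGHT, ZIP_WEIGHT = 0.7, 0.3  # in a real engine these weights are learned

def deterministic_match(a, b):
    # Fixed rule: exact standardized name and postal code, yes or no
    return a["name"].lower() == b["name"].lower() and a["zip"] == b["zip"]

def probabilistic_score(a, b):
    # Weighted evidence, compared against upper/lower thresholds downstream
    score = 0.0
    if a["name"].lower() == b["name"].lower():
        score += NAME_WEIGHT
    if a["zip"] == b["zip"]:
        score += ZIP_WEIGHT
    return score

r1 = {"name": "Bob Smith", "zip": "5000"}
r2 = {"name": "Robert Smith", "zip": "5000"}
print(deterministic_match(r1, r2))  # False - the fixed rule rejects the pair
print(probabilistic_score(r1, r2))  # 0.3 - a fuzzier name comparison would lift this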

So, yes, this is an endless, religious-like discussion I have reopened here.


Double Falshood

Always remember to include Shakespeare in a blog, right?

Now, it is actually disputable whether Shakespeare has anything to do with the title of this blog post. Double Falshood is the (first part of the) title of a play claimed to be based on a lost play by Shakespeare (and someone else). The only fact that seems to be true in this story is that the plot of the play(s) is based on an episode in Don Quixote by Cervantes. “The Ingenious Hidalgo Don Quixote of La Mancha”, which is the full title of the novel, is probably best known for the attack on the windmills by Don Quijote (the Spanish version of the name).

All this confusion about sorting out who, what, when and where, and the feeling of tilting at windmills, seems familiar from the daily work of trying to fix master data quality.

And indeed “double falsehood” may be a good term for the classic challenge in the data quality discipline of deduplication, which is to avoid false positives and false negatives at the same time.
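
A tiny sketch of that trade-off: with a single similarity threshold, lowering it catches more true duplicates but creates false positives, and raising it does the opposite (the pairs and scores below are invented):

```python
# (record pair, similarity score, are they really the same entity?)
scored_pairs = [
    ("A-B", 0.95, True),
    ("C-D", 0.80, False),   # similar strangers
    ("E-F", 0.70, True),    # a true duplicate with messy data
]

for threshold in (0.9, 0.75, 0.6):
    false_pos = sum(1 for _, s, same in scored_pairs if s >= threshold and not same)
    false_neg = sum(1 for _, s, same in scored_pairs if s < threshold and same)
    print(f"threshold {threshold}: {false_pos} false positives, {false_neg} false negatives")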

Now, back to work.