Entity Resolution and Big Data

The Wikipedia article on Identity Resolution has this take on the difference between good old data matching and Entity Resolution:

“Here are four factors that distinguish entity resolution from data matching, according to John Talburt, director of the UALR Laboratory for Advanced Research in Entity Resolution and Information Quality:

  • Works with both structured and unstructured records, and it entails the process of extracting references when the sources are unstructured or semi-structured
  • Uses elaborate business rules and concept models to deal with missing, conflicting, and corrupted information
  • Utilizes non-matching, asserted linking (associate) information in addition to direct matching
  • Uncovers non-obvious relationships and association networks (i.e. who’s associated with whom)”

I have a gut feeling that Data Matching and Entity (or Identity) Resolution will merge in the future, as expressed in the post Deduplication vs Identity Resolution.

If you look at the above mentioned factors that distinguish data matching from identity resolution, some of the often mentioned features in the new big data technology shine through:

  • Working with unstructured and semi-structured data is probably the most mentioned difference between working with small data versus working with big data.
  • Working with associations is a feature of graph databases or other similar technologies as mentioned in the post Will Graph Databases become Common in MDM?

So, in the quest to expand small data matching into Entity (or Identity) Resolution, we will be helped by general developments in working with big data.


Hello Leading MDM Vendor

This morning I received messages from a leading MDM vendor about an upcoming webinar on the 12th of September.

INFA 01

As today is the 3rd of October this is strange, and the vendor of course sent out a correction later today:

INFA 02

That’s OK. Shit happens. Even at data quality and MDM vendors’ marketing departments.

I am probably a strange kind of person, having lived in two countries lately, so I got both the original message and the correction to my Scandinavian identity from the vendor’s Scandinavian body:

INFA 03

As well as to my UK identity from the vendor’s UK body:

INFA 04

That’s OK. Getting a 360 degree view of migrating persons is difficult as discussed in the post 180 Degree Prospective Customer View isn’t Unusual.

Both (double) messages have a salutation.

UK:

INFA 05

Scandinavian:

INFA 06

Being Mr. Sorensen in the UK is OK. Using Mister and surname fits with an English stiff upper lip and The Letter ø could be o in the English alphabet.

I’m not sure if Dear Mr. Sørensen is OK in a Scandinavian context. Hello Henrik would be a better fit.
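Folding ø to o also matters when the two identities have to be matched. Here is a minimal sketch in Python, assuming a plain ASCII match key is wanted. The function name `to_ascii_key` and the `SPECIAL` table are made up for illustration, and the table is far from complete. The point is that ø and æ have no Unicode decomposition, so they need an explicit mapping, while a letter like ö decomposes into o plus a combining mark that can be stripped.

```python
import unicodedata

# Letters without a Unicode decomposition need an explicit rule.
# Illustrative table only; a real one would cover more letters.
SPECIAL = {"ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE"}


def to_ascii_key(name: str) -> str:
    """Fold a name to a rough ASCII form for matching purposes."""
    replaced = "".join(SPECIAL.get(ch, ch) for ch in name)
    # NFKD splits letters like ö into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", replaced)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(to_ascii_key("Sørensen"))
print(to_ascii_key("Sörensen"))
```

Such a key is only meant for matching. For the salutation the original spelling should of course be kept.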


Somehow Deduplication won’t Stick

18 years ago I cruised into the data quality realm when making my first deduplication tool. Back then it was an attempt to solve a business case of two companies that were considering merging and wanted to know the intersection of their customers. So far, so good.

Since then I have worked intensively with deduplication and other data matching tools and approaches and also co-authored a leading eLearning course on the matter as seen here.

Deduplication capability is a core feature of many data quality tools, and indeed probably the most mentioned data quality pain is lack of uniqueness, not least in party master data management.

However, in my experience most deduplication efforts don’t stick. Yes, we can process a file ready for direct marketing and purge the messages that might end up in the same offline or online inbox despite spelling differences. But taking it from there and using the techniques to achieve a single customer view is another story. Some obstacles are:

In the comments to the latter 3 year old post the intersection (and non-intersection) of Entity Resolution and Master Data Management (MDM) was discussed.

During my latest work I have become more and more convinced that achieving a single view of something is a lot about entity resolution as expressed in the post The Good, Better and Best Way of Avoiding Duplicates.


The Good, Better and Best Way of Avoiding Duplicates

Having duplicates in databases is the most prominent data quality issue around, and duplicates in party master data, not least, are often pain number one when assessing the impact of data quality flaws.

A duplicate in the data quality sense is two or more records that don’t have exactly the same characters, but are referring to the same real world entity. I have worked with these three different approaches to when to fix the duplicate problem:

  • Downstream data matching
  • Real time duplicate check
  • Search and mash-up of internal and external data

Downstream Data Matching

The good old way of dealing with duplicates in databases is having data matching engines periodically scan through databases highlighting the possible duplicates in order to facilitate merge/purge processes.

Finding the duplicates after they have lived their own lives in databases and already have different kinds of transactions attached is indeed not optimal, but sometimes it’s the only option, as explained in the post Top 5 Reasons for Downstream Cleansing.
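A periodic scan of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`id`, `name`, `address`), the function names and the threshold of 0.85 are all made up, and the standard library `difflib.SequenceMatcher` stands in for whatever comparison a real matching engine would use. Comparing every pair is also only feasible for small files.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def scan_for_duplicates(records, threshold=0.85):
    """Compare all record pairs and return those scoring at or above the threshold."""
    pairs = []
    for left, right in combinations(records, 2):
        score = (similarity(left["name"], right["name"])
                 + similarity(left["address"], right["address"])) / 2
        if score >= threshold:
            pairs.append((left["id"], right["id"], round(score, 2)))
    return pairs


records = [
    {"id": 1, "name": "Henrik Sorensen", "address": "1 Main Street"},
    {"id": 2, "name": "Henrik Sørensen", "address": "1 Main Street"},
    {"id": 3, "name": "Anna Jensen", "address": "22 Park Road"},
]
possible_duplicates = scan_for_duplicates(records)
print(possible_duplicates)
```

The output is a list of possible duplicates for a merge/purge process to review, not a list of confirmed ones.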

Real Time Duplicate Check

The better way is to make the match at data entry where possible. This approach is often orchestrated as a data entry process where a single element or a range of elements is checked when entered. For example, the address may be checked against reference data and a phone number may be checked for an adequate format for the country in question. Finally, when a properly standardized record is submitted, it is checked whether a possible duplicate exists in the database.
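The order of events described above, element checks first and the duplicate check last, might look like the following Python sketch. Everything in it is an assumption made for illustration: the phone patterns are deliberately simplified, the match key is a crude exact key on name and postal code, and `check_entry` is a hypothetical name, not a function of any particular tool.

```python
import re

# Hypothetical, simplified phone formats per country; real rules are richer.
PHONE_PATTERNS = {
    "DK": re.compile(r"^\+45\d{8}$"),
    "GB": re.compile(r"^\+44\d{10}$"),
}


def match_key(name: str, postal_code: str):
    """Crude standardized key: letters of the name plus the compact postal code."""
    return (re.sub(r"[^a-z]", "", name.lower()),
            postal_code.replace(" ", "").upper())


def check_entry(entry, existing):
    """Validate single elements first, then look for a possible duplicate."""
    phone = re.sub(r"[\s-]", "", entry["phone"])
    pattern = PHONE_PATTERNS.get(entry["country"])
    if pattern is None or not pattern.match(phone):
        return {"status": "invalid", "problems": ["phone format"]}
    key = match_key(entry["name"], entry["postal_code"])
    for record in existing:
        if match_key(record["name"], record["postal_code"]) == key:
            return {"status": "possible duplicate", "id": record["id"]}
    return {"status": "ok"}


existing = [{"id": 7, "name": "Anna Jensen", "postal_code": "2100"}]
entry = {"name": "anna  jensen", "postal_code": "2100",
         "phone": "+45 12 34 56 78", "country": "DK"}
print(check_entry(entry, existing))
```

In a real data entry form the address check against reference data would sit next to the phone check, and the final lookup would be a fuzzy one rather than an exact key comparison.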

Search and Mash-Up of Internal and External Data

The best way is in my eyes a process that avoids entering most of the data that is already in the internal databases and taking advantage of data that already exists on the internet as external reference data sources.

iDQ mashup
instant Data Quality

The instant Data Quality concept I currently work with requires the user to enter as little data as possible, for example through rapid address entry, a Google-like search for a name, simply typing a national identification number or, in the worst case, combining some known facts. After that the system makes a series of fuzzy searches in internal and external databases and presents the results as a compact mash-up.

The advantages are:

  • If the real world entity already exists you avoid the duplicate and avoid entering data again. You may at the same time evaluate accuracy against external reference data.
  • If the real world entity doesn’t exist in internal data, you may pick most of the data from external sources and that way avoid typing too much while at the same time ensuring accuracy.
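The search and mash-up idea can be illustrated with a small Python sketch. It is not how the instant Data Quality tool is built. The sources here are plain in-memory lists, the names `fuzzy_search` and `mash_up` are invented for the example, and `difflib.SequenceMatcher` with a floor of 0.6 is an arbitrary stand-in for a real fuzzy search.

```python
from difflib import SequenceMatcher


def fuzzy_search(query, source_name, rows, limit=3, floor=0.6):
    """Return the best fuzzy name hits from one source, tagged with the source."""
    hits = []
    for row in rows:
        score = SequenceMatcher(None, query.lower(), row["name"].lower()).ratio()
        if score >= floor:
            hits.append({"source": source_name, "score": round(score, 2), **row})
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:limit]


def mash_up(query, sources):
    """Search every source and present all hits in one ranked list."""
    results = []
    for source_name, rows in sources.items():
        results.extend(fuzzy_search(query, source_name, rows))
    return sorted(results, key=lambda h: h["score"], reverse=True)


sources = {
    "internal": [{"name": "Anna Jensen"}],
    "external": [{"name": "Anna Jensen", "address": "22 Park Road"},
                 {"name": "Bo Nielsen", "address": "1 Main Street"}],
}
hits = mash_up("anna jensn", sources)
for hit in hits:
    print(hit)
```

A hit from the internal source tells the user that the entity already exists, and a hit from the external source offers data to pick instead of typing it.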


Matching for Multiple Purposes

In a recent post on the InfoTrellis blog we have the good old question in data matching about Deterministic Matching versus Probabilistic Matching.

The post has a good walk through on the topic and reaches this conclusion:

“So, which is better, Deterministic Matching or Probabilistic Matching?  The question should actually be: ‘Which is better for you, for your specific needs?’  Your specific needs may even call for a combination of the two methodologies instead of going purely with one.”
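A combination of the two methodologies could be sketched as follows in Python. The deterministic part is an exact comparison of a strong identifier, and the probabilistic part sums agreement and disagreement weights in the style of the classic Fellegi-Sunter model. The field names, the m- and u-probabilities and the threshold of 8.0 are all assumptions picked for the example, not values from the quoted post.

```python
import math

# Assumed probabilities per field: m = P(agree | same entity),
# u = P(agree | different entities). Illustrative values only.
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "postal_code": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}


def deterministic_match(a, b):
    """Exact match on a strong identifier, if both records carry one."""
    return bool(a.get("national_id")) and a.get("national_id") == b.get("national_id")


def probabilistic_weight(a, b):
    """Sum of log2 agreement and disagreement weights over the fields."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        left, right = a.get(field), b.get(field)
        if left is None or right is None:
            continue  # a missing value adds no evidence either way
        if left == right:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def combined_match(a, b, threshold=8.0):
    """Deterministic rule first, probabilistic score as the fallback."""
    return deterministic_match(a, b) or probabilistic_weight(a, b) >= threshold


a = {"surname": "jensen", "postal_code": "2100", "birth_year": 1970}
b = {"surname": "jensen", "postal_code": "2100", "birth_year": 1970}
c = {"surname": "nielsen", "postal_code": "2100", "birth_year": 1970}
print(combined_match(a, b), combined_match(a, c))
```

Which rule comes first, and where the threshold sits, is exactly the "for your specific needs" part of the conclusion.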

On a side note the author of the post is MARIANITORRALBA. I had to use my combined probabilistic and deterministic in-word parsing supported and social media connected data matching capability to match this concatenated name with the LinkedIn profile of an InfoTrellis employee called Marian Itorralba.

This little exercise brings me to an observation about data matching: matching party master data, not least when you do it for several purposes, is ultimately identity resolution, as discussed in the post The New Year in Identity Resolution.

For that we need what could be called hierarchical data matching.

The reason we need hierarchical data matching is that more and more organizations are looking into master data management and then realize that the classic name and address matching rules do not necessarily fit when party master data are going to be used for multiple purposes. What constitutes a duplicate in one context, like sending a direct mail, doesn’t necessarily make a duplicate in another business function, and vice versa. Duplicates come in hierarchies.

One example is a household. You probably don’t want to send two sets of the same material to a household, but you might want to engage in a 1-to-1 dialogue with the individual members. Another example is that you might do some very different kinds of business with the same legal entity. Financial risk management is the same, but different sales or purchase processes may require very different views.
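The household example can be made concrete with a small Python sketch. It assumes records with `first_name`, `surname` and `address` fields and uses exact, case-insensitive keys, which is a strong simplification of real matching. All function names are invented for the illustration.

```python
from collections import defaultdict


def household_key(record):
    return (record["surname"].casefold(), record["address"].casefold())


def individual_key(record):
    return (record["first_name"].casefold(),) + household_key(record)


def build_hierarchy(records):
    """Group record ids by household, and within a household by individual."""
    households = defaultdict(lambda: defaultdict(list))
    for record in records:
        households[household_key(record)][individual_key(record)].append(record["id"])
    return households


def one_per_household(households):
    """Direct mail view: a single record id per household."""
    return [min(i for ids in people.values() for i in ids)
            for people in households.values()]


def one_per_individual(households):
    """1-to-1 dialogue view: a single record id per individual."""
    return [min(ids) for people in households.values() for ids in people.values()]


records = [
    {"id": 1, "first_name": "Anna", "surname": "Jensen", "address": "Main St 1"},
    {"id": 2, "first_name": "anna", "surname": "jensen", "address": "main st 1"},
    {"id": 3, "first_name": "Bo", "surname": "Jensen", "address": "Main St 1"},
    {"id": 4, "first_name": "Carl", "surname": "Nielsen", "address": "Park Rd 2"},
]
hierarchy = build_hierarchy(records)
print(one_per_household(hierarchy))
print(one_per_individual(hierarchy))
```

The same four records yield two targets for the mailing and three for the dialogue, which is the sense in which duplicates come in hierarchies.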

This matter is discussed in the post called Hierarchical Data Matching and, not least, in its comments.


Know Your Fan

A variant of the saying “Know Your Customer” for a football club will be “Know Your Fan” and indeed fans are customers when they buy tickets. If they can.

FC Copenhagen

FC Copenhagen cruised into stormy waters when they apparently cancelled all purchases for the upcoming Champions League (the paramount European soccer club tournament) clashes against Real Madrid, Juventus and Galatasaray if the purchasers didn’t have a Danish-sounding name. The reason was to prevent mixing fans of the different clubs, but surely this poorly thought-out screening method wasn’t received well among the FC Copenhagen fans not called Jensen, Nielsen or Sørensen.

The story is told in English here on Times of India.

Actually methods of verifying identities are available and cheap in Denmark so I’m surprised to see FC Copenhagen caught offside in this situation.


Hierarchical Data Matching

A year ago I wrote a blog post about data matching published on the Informatica Perspective blog. The post was called Five Future Data Matching Trends.

One of the trends mentioned is hierarchical data matching.

The reason we need what may be called hierarchical data matching is that more and more organizations are looking into master data management and then realize that the classic name and address matching rules do not necessarily fit when party master data are going to be used for multiple purposes. What constitutes a duplicate in one context, like sending a direct mail, doesn’t necessarily make a duplicate in another business function, and vice versa. Duplicates come in hierarchies.

One example is a household. You probably don’t want to send two sets of the same material to a household, but you might want to engage in a 1-to-1 dialogue with the individual members. Another example is that you might do some very different kinds of business with the same legal entity. Financial risk management is the same, but different sales or purchase processes may require very different views.

I usually divide a data matching process into three main steps:

  • Candidate selection
  • Match scoring
  • Match destination

(More information on the page: The Art of Data Matching)

Hierarchical data matching is mostly about the last step where we apply survivorship rules and execute business rules on whether to purge, merge, split or link records.
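The three steps can be sketched end to end in Python. This is a toy under stated assumptions: candidate selection is plain blocking on postal code, match scoring uses `difflib.SequenceMatcher` on the name, and the destination rule (merge at 0.9 or above, link when the address is shared, otherwise keep separate) is made up to show where the business rules would go.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def select_candidates(records):
    """Step 1: only compare records that share a postal code (blocking)."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record["postal_code"]].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


def score(a, b):
    """Step 2: name similarity between 0 and 1."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def destination(a, b, match_score):
    """Step 3: hypothetical business rule deciding what to do with the pair."""
    if match_score >= 0.9:
        return "merge"
    if a["address"] == b["address"]:
        return "link"  # for example members of the same household
    return "keep separate"


records = [
    {"id": 1, "name": "Anna Jensen", "postal_code": "2100", "address": "Main St 1"},
    {"id": 2, "name": "Anna Jensn", "postal_code": "2100", "address": "Main St 1"},
    {"id": 3, "name": "Bo Jensen", "postal_code": "2100", "address": "Main St 1"},
    {"id": 4, "name": "Anna Jensen", "postal_code": "8000", "address": "Other Rd 5"},
]
decisions = {(a["id"], b["id"]): destination(a, b, score(a, b))
             for a, b in select_candidates(records)}
print(decisions)
```

Note that records 1 and 4 are never compared because they sit in different blocks, which is the usual trade-off of candidate selection.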

In my experience there are a lot of data matching tools out there capable of handling candidate selection, match scoring, purging records and to some degree merging records. But solutions are sparse when it comes to more sophisticated things like splitting an original entity into two or more entities, for example by Splitting Names, or linking records in hierarchies in order to build a Hierarchical Single Source of Truth.


Know Your (Foreign Luxury Bag) Customer

A story featured a lot in the media over the last few days is the incident where one of the richest women on the planet, Oprah Winfrey, was told that she couldn’t afford the handbag she wanted to look at in a Zürich shop. Was it racism or a misunderstanding because Oprah isn’t good at speaking German?

Either way it was for sure an example of bad things happening when you don’t know your customer. This story also highlights the issues we have with foreign customers, as Oprah may not be quite as famous in Zürich as in New York.

We have these challenges in customer master data management all over as described in the post Know Your Foreign Customer.

And oh: Maybe it’s time to start a sister blog called Liliendahl on Fashion. This is my second post on luxury handbags. The first post was called Data Quality Luxury.


180 Degree Prospective Customer View isn’t Unusual

My eMail inbox is collecting received mails from several eMail accounts and therefore it’s not unusual to have duplicate messages in there.

This morning I had two eMails coming in to two different eMail accounts probably part of the same campaign but with different messages:

180 degree

Apparently I have landed in two different segments with two different eMail accounts: One technology oriented and one sales and marketing oriented.

Record linking of sparse subscription profiles isn’t easy, and even Informatica, a big player in Master Data Management and Data Quality solutions, has ground to cover in this game.


Where the Streets have Two Names

As told in the post The Art in Data Matching, a common challenge in matching names and addresses is that in some parts of the world the streets have more than one name at the same time because more than one language is in use.

We have the same challenge when building functionality for rapid addressing, that is, functionality that facilitates fast and quality-assured entry of addresses supported by reference data that knows about postal codes, cities and street names.

The below example is taken from the instant Data Quality tool address form:

Finish Swedish

The Finnish capital Helsinki also has an official name in Swedish being Helsingfors and the streets in Helsinki/Helsingfors have both Finnish and Swedish names. So when you start typing a letter suggestions could be in both Finnish and Swedish.
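Suggesting in both languages can be sketched like this in Python. The two street entries are a tiny hand-made sample, not real reference data, and the `suggest` function is a hypothetical illustration rather than the logic of the instant Data Quality tool. Each street is stored once with both names, and a typed prefix is tested against every name of every street.

```python
# Tiny hand-made sample; real reference data would come from a postal source.
STREETS = [
    {"fi": "Mannerheimintie", "sv": "Mannerheimvägen"},
    {"fi": "Aleksanterinkatu", "sv": "Alexandersgatan"},
]


def suggest(prefix, streets=STREETS):
    """Return (matching name, language, full street entry) for a typed prefix."""
    typed = prefix.casefold()
    hits = []
    for street in streets:
        for language, name in street.items():
            if name.casefold().startswith(typed):
                hits.append((name, language, street))
    return hits


for name, language, street in suggest("ma"):
    print(name, language, street)
```

Because every hit carries the full street entry, the form can show the name in the other language next to the one the user started typing.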

What challenges have you encountered with street names in multiple languages?
