Hello Leading MDM Vendor

This morning I received messages from a leading MDM vendor about an upcoming webinar on the 12th of September.

INFA 01

As today is the 3rd of October, this is strange, and the vendor of course sent out a correction later in the day:

INFA 02

That’s OK. Shit happens. Even in data quality and MDM vendors’ marketing departments.

I am probably a kind of strange person, having been living in two countries lately, so I got both the original message and the correction sent to my Scandinavian identity from the vendor’s Scandinavian entity:

INFA 03

As well as to my UK identity from the vendor’s UK entity:

INFA 04

That’s OK. Getting a 360 degree view of migrating persons is difficult, as discussed in the post 180 Degree Prospective Customer View isn’t Unusual.

Both (double) messages have a salutation.

UK:

INFA 05

Scandinavian:

INFA 06

Being Mr. Sorensen in the UK is OK. Using Mister and surname fits with an English stiff upper lip, and The Letter ø could be rendered as o in the English alphabet.

I’m not sure if Dear Mr. Sørensen is OK in a Scandinavian context. Hello Henrik would be a better fit.


Somehow Deduplication won’t Stick

18 years ago I cruised into the data quality realm when making my first deduplication tool. Back then it was an attempt to solve the business case of two companies who were considering merging and wanted to know the intersection of their customer bases. So far, so good.

Since then I have worked intensively with deduplication and other data matching tools and approaches, and have also co-authored a leading eLearning course on the matter, as seen here.

Deduplication capability is a core feature of many data quality tools, and indeed the most frequently mentioned data quality pain is probably lack of uniqueness, not least in party master data management.

However, in my experience most deduplication efforts don’t stick. Yes, we can process a file ready for direct marketing and purge the messages that might end up in the same offline or online inbox despite spelling differences. But taking it from there and using the techniques to achieve a single customer view is another story. Some obstacles are:

In the comments to the latter, three-year-old post, the intersection (and non-intersection) of Entity Resolution and Master Data Management (MDM) was discussed.

During my latest work I have become more and more convinced that achieving a single view of something is very much about entity resolution, as expressed in the post The Good, Better and Best Way of Avoiding Duplicates.


The Good, Better and Best Way of Avoiding Duplicates

Having duplicates in databases is the most prominent data quality issue around, and not least duplicates in party master data are often pain number one when assessing the impact of data quality flaws.

A duplicate in the data quality sense is two or more records that don’t consist of exactly the same characters but refer to the same real world entity. I have worked with these three different approaches to when to fix the duplicate problem:

  • Downstream data matching
  • Real time duplicate check
  • Search and mash-up of internal and external data

Downstream Data Matching

The good old way of dealing with duplicates in databases is having data matching engines periodically scan through the databases, highlighting possible duplicates in order to facilitate merge/purge processes.
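If I were to sketch such a periodic scan in a few lines of Python, it could look like the snippet below. The pairwise name comparison, the 0.85 threshold and the sample records are purely illustrative, not taken from any specific tool:

```python
# Minimal sketch of a periodic downstream deduplication scan producing a
# candidate list for a merge/purge process. Threshold is an illustrative guess.
from difflib import SequenceMatcher
from itertools import combinations

def name_similarity(a, b):
    """Fuzzy similarity between two names (0.0 - 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def scan_for_duplicates(records, threshold=0.85):
    """Compare every pair of records and flag likely duplicates for review."""
    return [
        (a["id"], b["id"], round(name_similarity(a["name"], b["name"]), 2))
        for a, b in combinations(records, 2)
        if name_similarity(a["name"], b["name"]) >= threshold
    ]

customers = [
    {"id": 1, "name": "Henrik Sørensen"},
    {"id": 2, "name": "Henrik Sorensen"},
    {"id": 3, "name": "Jane Doe"},
]
print(scan_for_duplicates(customers))  # [(1, 2, 0.93)]
```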

Finding the duplicates after they have lived their own lives in databases and already have different kinds of transactions attached is indeed not optimal, but sometimes it’s the only option, as explained in the post Top 5 Reasons for Downstream Cleansing.

Real Time Duplicate Check

The better way is to make the match at data entry where possible. This approach is often orchestrated as a data entry process where each element or range of elements is checked when entered. For example, the address may be checked against reference data and a phone number may be checked for an adequate format for the country in question. And then finally, when a properly standardized record is submitted, it is checked whether a possible duplicate exists in the database.
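A rough sketch of such an entry flow could look like the Python below; the format rules and the in-memory repository are simplified stand-ins for real validation services, and the sample data is made up:

```python
# Sketch of a real-time duplicate check at data entry.
# The phone format rules and the repository lookup are illustrative only.
import re
from difflib import SequenceMatcher

def phone_format_ok(phone, country="DK"):
    """Very rough format check: 8 digits for Denmark, 10-11 for the UK."""
    digits = re.sub(r"\D", "", phone)
    return len(digits) == 8 if country == "DK" else 10 <= len(digits) <= 11

def find_possible_duplicates(new_record, existing_records, threshold=0.85):
    """After the record is standardized, look for fuzzy name matches on submit."""
    hits = []
    for rec in existing_records:
        score = SequenceMatcher(None, new_record["name"].lower(),
                                rec["name"].lower()).ratio()
        if score >= threshold:
            hits.append((rec, score))
    return hits

existing = [{"id": 42, "name": "Henrik Liliendahl Sørensen"}]
new_entry = {"name": "Henrik Liliendahl Sorensen", "phone": "12 34 56 78"}

if not phone_format_ok(new_entry["phone"], "DK"):
    print("Reject: phone format not valid for the country")
elif find_possible_duplicates(new_entry, existing):
    print("Warn the user: a possible duplicate already exists")
else:
    print("OK to insert")
```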

Search and Mash-Up of Internal and External Data

The best way is, in my eyes, a process that avoids entering most of the data that is already in the internal databases and takes advantage of data that already exists on the internet as external reference data sources.

iDQ mash-up: instant Data Quality

The instant Data Quality concept I currently work with requires the user to enter as little data as possible, for example through rapid address entry, a Google-like search for a name, simply typing a national identification number or, in the worst case, combining some known facts. After that the system makes a series of fuzzy searches in internal and external databases and presents the results as a compact mash-up.
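A minimal sketch of the idea follows; the internal CRM and the external directory below are just in-memory stand-ins for real services, and the function names are mine, not the product’s:

```python
# Sketch of a search and mash-up flow: the user types as little as possible,
# and the system fuzzy-searches internal and external sources.
from difflib import get_close_matches

INTERNAL_CRM = {"Henrik Liliendahl Sørensen": {"id": 42, "source": "CRM"}}
EXTERNAL_DIRECTORY = {"Henrik Liliendahl Sorensen": {"address": "Some Street 1",
                                                     "source": "external"}}

def fuzzy_lookup(query, reference, cutoff=0.8):
    """Return the reference entries whose keys fuzzily match the query."""
    names = get_close_matches(query, reference.keys(), n=3, cutoff=cutoff)
    return [{**reference[name], "matched_name": name} for name in names]

def mashup(query):
    """Combine internal and external hits into one compact result list."""
    return fuzzy_lookup(query, INTERNAL_CRM) + fuzzy_lookup(query, EXTERNAL_DIRECTORY)

for hit in mashup("Henrik Liliendahl Sorensen"):
    print(hit)
```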

The advantages are:

  • If the real world entity already exists, you avoid the duplicate and avoid entering data again. You may at the same time evaluate accuracy against external reference data.
  • If the real world entity doesn’t exist in internal data, you may pick most of the data from external sources and that way avoid typing too much while at the same time ensuring accuracy.


Matching for Multiple Purposes

A recent post on the InfoTrellis blog takes on the good old question in data matching of Deterministic Matching versus Probabilistic Matching.

The post has a good walk-through of the topic and reaches this conclusion:

“So, which is better, Deterministic Matching or Probabilistic Matching?  The question should actually be: ‘Which is better for you, for your specific needs?’  Your specific needs may even call for a combination of the two methodologies instead of going purely with one.”
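To illustrate the difference, here is a small sketch in Python; the field weights, the threshold and the sample records are illustrative assumptions, not anyone’s recommended settings:

```python
# Sketch contrasting deterministic and probabilistic matching.
from difflib import SequenceMatcher

def deterministic_match(a, b):
    """Match only when an agreed key, here the national ID, is exactly equal."""
    return bool(a.get("national_id")) and a.get("national_id") == b.get("national_id")

def probabilistic_match(a, b, threshold=0.8):
    """Weighted fuzzy score over name and address; match above a threshold."""
    weights = {"name": 0.6, "address": 0.4}
    score = sum(
        w * SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
        for f, w in weights.items()
    )
    return score >= threshold, round(score, 2)

rec1 = {"national_id": None, "name": "Henrik Sørensen", "address": "Some Street 1"}
rec2 = {"national_id": None, "name": "Henrik Sorensen", "address": "Some Str. 1"}

print(deterministic_match(rec1, rec2))   # False: no shared key to agree on
print(probabilistic_match(rec1, rec2))   # (True, ...) based on fuzzy similarity
```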

On a side note, the author of the post is listed as MARIANITORRALBA. I had to use my combined probabilistic and deterministic, in-word parsing supported and social media connected data matching capability to match this concatenated name with the LinkedIn profile of an InfoTrellis employee called Marian Itorralba.
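For fun, here is a tiny sketch of what such in-word parsing could look like, assuming a list of known name tokens to split against (the token list is of course an assumption):

```python
# Playful sketch of in-word parsing: splitting a concatenated name against a
# list of known name tokens.
KNOWN_NAMES = {"MARIAN", "ITORRALBA", "HENRIK", "SORENSEN"}

def split_concatenated(name):
    """Try every split point and keep the ones where both halves are known names."""
    name = name.upper()
    return [
        (name[:i].title(), name[i:].title())
        for i in range(1, len(name))
        if name[:i] in KNOWN_NAMES and name[i:] in KNOWN_NAMES
    ]

print(split_concatenated("MARIANITORRALBA"))  # [('Marian', 'Itorralba')]
```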

This little exercise brings me to an observation about data matching: matching party master data, not least when you do it for several purposes, ultimately is identity resolution, as discussed in the post The New Year in Identity Resolution.

For that we need what could be called hierarchical data matching.

The reason we need hierarchical data matching is that more and more organizations are looking into master data management, and then they realize that the classic name and address matching rules do not necessarily fit when party master data are going to be used for multiple purposes. What constitutes a duplicate in one context, like sending a direct mail, doesn’t necessarily make a duplicate in another business function, and vice versa. Duplicates come in hierarchies.

One example is a household. You probably don’t want to send two sets of the same material to a household, but you might want to engage in a 1-to-1 dialogue with the individual members. Another example is that you might do some very different kinds of business with the same legal entity. Financial risk management is the same, but different sales or purchase processes may require very different views.

This matter is discussed in the post called Hierarchical Data Matching, and not least in the comments to that post.


Hierarchical Data Matching

A year ago I wrote a blog post about data matching published on the Informatica Perspective blog. The post was called Five Future Data Matching Trends.

One of the trends mentioned is hierarchical data matching.

The reason we need what may be called hierarchical data matching is that more and more organizations are looking into master data management, and then they realize that the classic name and address matching rules do not necessarily fit when party master data are going to be used for multiple purposes. What constitutes a duplicate in one context, like sending a direct mail, doesn’t necessarily make a duplicate in another business function, and vice versa. Duplicates come in hierarchies.

One example is a household. You probably don’t want to send two sets of the same material to a household, but you might want to engage in a 1-to-1 dialogue with the individual members. Another example is that you might do some very different kinds of business with the same legal entity. Financial risk management is the same, but different sales or purchase processes may require very different views.
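To illustrate, here is a tiny sketch of how the very same records can be duplicates for one purpose (the mailing) and separate entities for another (the 1-to-1 dialogue); the data is made up:

```python
# Sketch of hierarchy-aware deduplication: the same records collapse into one
# household for a mailing but stay distinct individuals for a 1-to-1 dialogue.
from collections import defaultdict

people = [
    {"id": 1, "name": "Henrik Sørensen", "address": "Some Street 1, 2750"},
    {"id": 2, "name": "Jane Sørensen",   "address": "Some Street 1, 2750"},
    {"id": 3, "name": "John Smith",      "address": "Other Road 9, 2100"},
]

def households(records):
    """For the mailing purpose: collapse records sharing an address."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["address"]].append(rec)
    return groups

def individuals(records):
    """For the 1-to-1 dialogue purpose: every person stays a separate entity."""
    return records

print(len(households(people)))    # 2 households to mail
print(len(individuals(people)))   # 3 individuals to engage with
```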

I usually divide a data matching process into three main steps:

  • Candidate selection
  • Match scoring
  • Match destination

(More information on the page: The Art of Data Matching)

Hierarchical data matching is mostly about the last step where we apply survivorship rules and execute business rules on whether to purge, merge, split or link records.
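To make the three steps concrete, here is a minimal sketch in Python; the blocking key, the scoring and the merge/link thresholds are purely illustrative:

```python
# Sketch of the three steps: candidate selection, match scoring and match
# destination, where business rules decide whether to merge, link or keep apart.
from difflib import SequenceMatcher
from itertools import combinations

def candidate_selection(records):
    """Step 1: only compare records sharing a cheap blocking key (postal code)."""
    blocks = {}
    for rec in records:
        blocks.setdefault(rec["postal_code"], []).append(rec)
    for members in blocks.values():
        yield from combinations(members, 2)

def match_scoring(a, b):
    """Step 2: fuzzy name similarity as a simple match score."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def match_destination(a, b, score):
    """Step 3: business rules on what to do with the scored pair."""
    if score >= 0.95:
        return "merge"   # confident duplicate: survivorship rules apply
    if score >= 0.80:
        return "link"    # probable relation: keep both, link them in a hierarchy
    return "keep"

records = [
    {"id": 1, "name": "Henrik Sørensen", "postal_code": "2750"},
    {"id": 2, "name": "Henrik Sorensen", "postal_code": "2750"},
]
for a, b in candidate_selection(records):
    score = match_scoring(a, b)
    print(a["id"], b["id"], round(score, 2), match_destination(a, b, score))
```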

In my experience there are a lot of data matching tools out there capable of handling candidate selection, match scoring, purging records and, to some degree, merging records. But solutions are sparse when it comes to more sophisticated things like splitting an original entity into two or more entities by, for example, Splitting Names, or linking records in hierarchies in order to build a Hierarchical Single Source of Truth.


180 Degree Prospective Customer View isn’t Unusual

My eMail inbox collects mails received from several eMail accounts, and therefore it’s not unusual to have duplicate messages in there.

This morning I had two eMails coming in to two different eMail accounts, probably part of the same campaign, but with different messages:

180 degree

Apparently I have landed in two different segments with two different eMail accounts: One technology oriented and one sales and marketing oriented.

Record linking of sparse subscription profiles isn’t easy, and even Informatica, a big player in Master Data Management and Data Quality solutions, has ground to cover in this game.


Is Data Cleansing Bad for Data Matching?

Today I stumbled upon an article from Australia in BMC Medical Informatics and Decision Making. The article is called The effect of data cleaning on record linkage quality.

The result of the described research is:

“Data cleaning made little difference to the overall linkage quality, with heavy cleaning leading to a decrease in quality. Further examination showed that decreases in linkage quality were due to cleaning techniques typically reducing the variability – although correct records were now more likely to match, incorrect records were also more likely to match, and these incorrect matches outweighed the correct matches, reducing quality overall.”

This resonates very well with my experience too. Usually I like to match with both original data and standardized (cleansed) data in order to exploit the best of both approaches.
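If I were to sketch that dual approach in a few lines of Python, it could look like this; the cleansing function is of my own making and deliberately heavy-handed:

```python
# Sketch of scoring on both the original and the standardized (cleansed)
# values, so a match created only by cleansing can still be told apart from
# a match that is also there in the raw data.
from difflib import SequenceMatcher

def cleanse(name):
    """Illustrative heavy cleansing: fold Scandinavian letters, drop punctuation,
    whitespace and case."""
    folded = name.lower().translate(str.maketrans({"ø": "o", "æ": "ae", "å": "aa"}))
    return "".join(ch for ch in folded if ch.isalnum())

def dual_score(a, b):
    original = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    cleansed = SequenceMatcher(None, cleanse(a), cleanse(b)).ratio()
    return round(original, 2), round(cleansed, 2)

# Same person, different spelling: the cleansed comparison lifts the score.
print(dual_score("Henrik Sørensen", "Henrik Sorensen"))   # (0.93, 1.0)
# Strings that only become identical after cleansing: the original score keeps
# some of the variability that the cleansing removed.
print(dual_score("ACME A/S", "ACME-AS"))
```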

What are your experiences?


Multi-Channel Data Matching

Most data matching activities going on are related to matching customer, or rather party, master data.

In today’s business world we see data matching related to party master data in these three different channel types:

  • Offline is the good old channel type, where we have the mother of all business cases for data matching: avoiding the unnecessary cost of sending the same material with the postman twice (or more) to the same recipient.
  • Online has been around for some time. While the cost of sending the same digital message to the same recipient may not be a big problem, there are still some other factors to be considered, like:
    • Duplicate digital messages to the same recipient look like spam (even if the recipient provided the different eMail addresses him/herself).
    • You can’t measure a true response rate.
  • Social is the new channel type for data matching. Most business cases for data matching related to social network profiles are probably based on multi-channel issues.

The concept of having a single customer view, or rather a single party view, involves matching identities over offline, online and social channels, and the typical elements used for data matching are not entirely the same for those channels, as seen in the figure to the right.

Most data matching procedures are, in my experience, quite simple, with only a few data elements and no history track taken into consideration. However, we do see more sophisticated data matching environments, often referred to as identity resolution, where historical data, more data elements and even unstructured data are taken into consideration.

When doing multi-channel data matching you can’t avoid going from the popular simple data matching environments to more identity resolution-like environments.
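A small sketch of what cross-channel matching could look like, scoring only on the elements two records actually share; the field weights, the handle and the sample data are illustrative:

```python
# Sketch of multi-channel matching: each channel offers different identifying
# elements, so the comparison uses whatever elements both records carry.
from difflib import SequenceMatcher

def sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

FIELD_WEIGHTS = {"name": 0.5, "email": 0.3, "postal_address": 0.2, "handle": 0.3}

def cross_channel_score(a, b):
    """Score only on the elements both records actually carry."""
    shared = [f for f in FIELD_WEIGHTS if a.get(f) and b.get(f)]
    if not shared:
        return 0.0
    total_weight = sum(FIELD_WEIGHTS[f] for f in shared)
    return sum(FIELD_WEIGHTS[f] * sim(a[f], b[f]) for f in shared) / total_weight

offline = {"channel": "offline", "name": "Henrik Sørensen",
           "postal_address": "Some Street 1, 2750"}
online  = {"channel": "online",  "name": "Henrik Sorensen",
           "email": "henrik@example.com"}
social  = {"channel": "social",  "name": "Henrik Liliendahl Sørensen",
           "handle": "@some_handle"}

print(round(cross_channel_score(offline, online), 2))  # name is the only shared element
print(round(cross_channel_score(online, social), 2))
```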

Some advice for getting it right without too much complication:

  • Emphasize data capture by getting it right the first time. It helps a lot.
  • Get your data models right. Here reflecting the real world helps a lot.
  • Don’t reinvent the wheel. There are services for this out there. They help a lot.

Read more about such a service in the post instant Single Customer View.


Big Data and Data Matching

Data matching has been an established discipline for many years, and most data quality tools have more or less sophisticated features for data matching, just as many MDM (Master Data Management) platforms have data matching capabilities.

The LinkedIn Big Data Quality group

In a way the data matching realm has become slightly dull in recent years. People don’t get excited anymore over a discussion about whether deterministic matching or probabilistic matching is the right way. Soundex is old, edit distance has been around for ages and matchcodes may have outlived themselves.
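For the record, here are minimal sketches of those two old workhorses; not production grade, just the idea:

```python
# Minimal sketches of Soundex and edit distance.

def soundex(name):
    """Classic American Soundex: first letter plus three digit codes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    encoded = [letters[0].upper()]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        if ch not in "hw":            # h and w don't break a run of equal codes
            prev = code
    return ("".join(encoded) + "000")[:4]

def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

print(soundex("Sørensen"), soundex("Sorensen"))  # same code: S652 S652
print(edit_distance("Sørensen", "Sorensen"))     # 1
```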

So, it’s good to see a new beast turning up. Data matching with big data.

It may be about deduplicating (deduping) volumes that are bigger than what traditional data matching can handle. You know: Dedoop’ing.

But it is also very much about matching big data with small data, first and foremost master data. And about having well-matched master data. Kimmo Kontra wrote a good post about that recently. The post is called Big Grease, Big Data, and Big Apple – manholes and MDM.

The case presented by Kimmo holds many exciting implementations of data matching like for example proximity matching of locations.
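As a small sketch of what proximity matching could look like: two records whose addresses geocode to nearly the same point become match candidates even if the street strings differ. The coordinates and the 25 metre threshold below are illustrative:

```python
# Sketch of proximity matching of locations via the haversine distance.
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def proximity_match(rec_a, rec_b, max_metres=25):
    """Treat two records as candidates when their points are close enough."""
    return haversine_m(rec_a["lat"], rec_a["lon"], rec_b["lat"], rec_b["lon"]) <= max_metres

manhole = {"asset": "manhole 4711", "lat": 40.7128, "lon": -74.0060}
report  = {"asset": "cover near Broadway", "lat": 40.7129, "lon": -74.0061}
print(proximity_match(manhole, report))  # True: within ~25 metres of each other
```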


Coma, Wetsuit and Dedoop

The sehr geehrte Damen und Herren (dear ladies and gentlemen) at Universität Leipzig (Leipzig University) are doing a lot of research in the data management realm and put some good effort into naming the stuff.

Here are some of the inventions:

COMA is a system for flexible Combination Of schema Matching Approaches. Let’s hope the thing is still alive.

WETSUIT (Web EnTity Search and fUsIon Tool) is a powerful new mashup tool – and what a nice seven-letter abbreviation, not sticking only to the first letters.

Dedoop (Deduplication with Hadoop) is a prototype for entity matching for big data. Big phonetic Dedupe will be around of course.

Well, you should expect fuzzy abbreviations from this city, as Leipzig means “settlement where the linden trees stand”.
