Location, Location, Location

Now, I am not going to write about the importance of location when selling real estate. Instead, I am going to give three examples of why knowing about the location matters when you are doing data matching, like trying to find duplicates in names and addresses.

Location uniqueness

Let’s say we have these two records:

  • Stefani Germanotta, Main Street, Anytown
  • Stefani Germanotta, Main Street, Anytown

The data is character by character exactly the same. But:

  • There is a very high probability that it is the same real world individual only if there is just one address on Main Street in Anytown.
  • If there are only a few addresses on Main Street in Anytown, there is still a fair probability that this is the same individual.
  • But if there are hundreds of addresses on Main Street in Anytown, the probability that this is the same individual will be below the threshold for many matching purposes.

Of course, if you are sending a direct marketing letter, it is pointless to send both letters, as:

  • Either they will be delivered to the same mailbox.
  • Or both will be returned by the postal service.

So this example highlights a major point in data quality. If you are matching for a single purpose of use, like direct marketing, you may apply simple processing. But if you are matching for multiple purposes of use, like building a master data hub, you cannot avoid some complexity.
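The uniqueness point can be sketched in code. This is a toy illustration, not a real matching engine: the damping formula, the threshold and the address counts are all invented for the example.

```python
# Toy illustration: scale the confidence of an exact name + street match by
# how many delivery addresses exist on that street. The damping formula,
# threshold and address counts are all invented for the example.

def match_confidence(base_confidence: float, addresses_on_street: int) -> float:
    """Damp an exact-match confidence by location uniqueness."""
    if addresses_on_street <= 1:
        return base_confidence              # unique location: keep full confidence
    # More addresses on the street -> lower probability that the two
    # records point at the same household. Simple 1/sqrt(n) damping.
    return base_confidence / addresses_on_street ** 0.5

THRESHOLD = 0.8
for count in (1, 5, 500):
    conf = match_confidence(0.95, count)
    verdict = "match" if conf >= THRESHOLD else "below threshold"
    print(f"{count:>3} addresses on Main Street: confidence {conf:.2f} ({verdict})")
```

With only one address on the street the exact match stands; with hundreds of addresses the same character-by-character match drops below the threshold.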

Location enrichment

Let’s say we have these two records:

  • Alejandro Germanotta, 123 Main Street, Anytown
  • Alejandro Germanotta, 123 Main Street, Anytown

If you know that 123 Main Street in Anytown is a single family house there is a high probability that this is the same real world individual.

But if you know that 123 Main Street in Anytown is a building used as a nursing home or a campus, or that this entrance has many apartments or other kinds of units, then it is not so certain that these records represent the same real world individual (not least if the name is John Smith).

So this example highlights the importance of using external reference data in data matching.

Location geocoding

Let’s say we have these two records:

  • Gaga Real Estate, 1 Main Street, Anytown
  • L.  Gaga Real Estate, Central Square, Anytown

If you match using the street address, the match is not that close.

But if you assign a geocode to each of the two addresses, the two addresses may turn out to be very close (just around the corner), and your match will then be pretty confident.

Assigning geocodes usually serves other purposes than data matching. So this example highlights how enhancing your data may have several positive impacts.
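A minimal sketch of the geocoding point, using the standard haversine formula for great-circle distance. Both coordinate pairs are invented stand-ins for "1 Main Street" and "Central Square" in Anytown.

```python
# Sketch: two addresses that differ as text may be near-identical as geocodes.
# Standard haversine great-circle distance; both coordinate pairs are invented
# stand-ins for "1 Main Street" and "Central Square" in Anytown.
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two WGS84 points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(a))    # mean Earth radius ~ 6,371 km

main_street = (55.6761, 12.5683)      # hypothetical "1 Main Street, Anytown"
central_square = (55.6764, 12.5689)   # hypothetical "Central Square, Anytown"
distance = haversine_m(*main_street, *central_square)
print(f"The two addresses are {distance:.0f} m apart")
```

A text comparison of the two addresses gives a weak match, while the geocodes put them tens of metres apart, which is strong evidence for the same business.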


Real World Alignment

I am currently involved in a data management program dealing with multi-entity (multi-domain) master data management described here.

Besides covering several different data domains, such as business partners, products, locations and timetables, the data also serves multiple purposes of use. The client is within public transit, so the subject areas are terms such as production planning (scheduling), operation monitoring, fare collection and use of service.

A key principle is that the same data should only be stored once, but in a way that makes it serve as high quality information in the different contexts. Doing that often means balancing between the two ways data may be of high quality:

  • Either they are fit for their intended uses
  • Or they correctly represent the real-world construct to which they refer

Some of the balancing has been:

Customer Identification

For some intended uses you don’t have to know the precise identity of a passenger. For other intended uses you must know the identity. The latter cases at my client include giving discounts based on age and on transport needs, like when attending educational activities. Knowing the identity also helps when fighting fraud. So the data governance policy (and a business rule) is that customers for most products must provide a national identification number.

Like it or not: Having the ID makes a lot of things easier. Uniqueness isn’t a big challenge like it is in many other master data programs. It is also a straightforward process when you want to enrich your data. An example here is accurately geocoding where your customers live, which is rather essential when you provide transportation services.

What geocode?

You may use a range of different coordinate systems to express a position, as explained here on Wikipedia. Some systems refer to a round globe (and yes, the real world, the earth, is round), but it is a lot easier to use a system like the one called UTM, where you may easily calculate the distance between two points directly in metres, assuming the real world is as flat as your computer screen.
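Once both points are expressed as easting/northing in the same UTM zone, the "flat screen" distance really is just Pythagoras. A sketch with made-up coordinates:

```python
# Sketch: with both addresses geocoded as easting/northing (in metres) in the
# same UTM zone, distance is plain Pythagoras. Coordinates are made up.
from math import hypot

def utm_distance_m(e1: float, n1: float, e2: float, n2: float) -> float:
    """Planar distance in metres between two points in the same UTM zone."""
    return hypot(e2 - e1, n2 - n1)

customer_home = (723_500.0, 6_176_200.0)   # hypothetical easting, northing
bus_stop = (723_800.0, 6_176_600.0)
print(f"{utm_distance_m(*customer_home, *bus_stop):.0f} m")   # -> 500 m
```

Note the caveat: this shortcut only works when both points lie in the same UTM zone; across zones you are back to the round globe.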



Returns from Investing in a Data Quality Tool

The classic data quality business case is avoiding sending promotion letters and printed materials to duplicate prospects and customers.

Even as e-commerce moves forward and more complex data quality business cases, such as those related to multi-purpose master data management, become more important, I would like to take a look at the classic business case by examining some different choices of data quality tool.

As you may be used to all kinds of different currencies, such as EUR, USD, AUD, GBP and so on, I will use the fictitious currency SSB (Simple Stupid Bananas).

Let’s say we have a direct marketing campaign with these facts:

  • 100,000 names and addresses, ½ of them also with phone number
  • Cost per mail is 3 SSB
  • Response is 4,500 orders with an average profit of 100 SSB

From investigating a sample we know that 10% of the names and addresses are duplicates with slightly different spellings.

So from these figures we know that the cost of a false negative (an actual duplicate that is not found) is 3 SSB. The saving from a true positive is then also 3 SSB.

The cost of a false positive (a found duplicate that actually isn’t a duplicate) is a possibly missed order worth: 4,500 / (100,000 * 90 %) * 100 SSB = 5 SSB.

Now let’s examine 3 options for tools for finding duplicates:

A: We already have Excel

B: Buying the leader of the pack data quality tool

C: Buying an algorithm based dedupe tool

A: We already have Excel

You may first sort the 100,000 rows by address and look for duplicates this way. Say you find 2,000 duplicates. Then sort the remaining 98,000 rows by surname and look for duplicates. Say you find 1,000 duplicates. Then sort the remaining 97,000 rows by given name. Say you find 1,000 duplicates. Finally sort the 48,000 rows that have a phone number by phone number. Say you find 1,000 duplicates.

If a person can look for duplicates in 1,000 rows per hour (without making false positives), we will browse a total of 343,000 sorted rows in 343 hours.

Say you hire a student for that and have the Subject Matter Expert explain, control and verify the process, using 15 hours.

Costs are:

  • 343 student hours at 15 SSB each = 5,145 SSB
  • 15 SME hours at 50 SSB each = 750 SSB

Total costs are 5,895 SSB.

Total savings are 5,000 true positives at 3 SSB each = 15,000 SSB, making a positive ROI of 9,105 SSB in each campaign.

The only thing is that it will take one student more than two months (without quitting) to do the job.
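The sort-and-scan passes above can be caricatured in code. This is deliberately naive, like the manual process it mimics: dropping rows on a single exact key alone is far too aggressive for real data, where the human reviewer checks the whole row before dropping anything. The sample records are invented.

```python
# Deliberately naive caricature of the manual sort-and-scan passes: sort the
# remaining records by one key at a time and keep only the first record per
# key value. A real reviewer checks the whole row, not just one key.

def dedupe_pass(records, key):
    """Sort by one key and keep the first record per non-empty key value."""
    seen, kept = set(), []
    for rec in sorted(records, key=lambda r: r[key] or ""):
        value = rec[key]
        if value and value in seen:
            continue                  # duplicate on this key: drop the row
        if value:
            seen.add(value)
        kept.append(rec)
    return kept

records = [
    {"given": "Stefani", "surname": "Germanotta", "address": "1 Main St", "phone": "555-0100"},
    {"given": "Stephanie", "surname": "Germanotta", "address": "1 Main St", "phone": None},
    {"given": "Stefani", "surname": "Germanota", "address": "1 Main Street", "phone": "555-0100"},
]
for key in ("address", "surname", "given", "phone"):
    records = dedupe_pass(records, key)
print(len(records))   # -> 1 (one row caught on address, one on given name)
```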

B: Buying the leader of the pack data quality tool

Such a tool may have all kinds of data quality monitoring features, may be integrated smoothly with ETL functionality and so on. For data matching it may use so-called match codes. Doing that, we may expect that the tool will find 7,500 duplicates, where 7,000 are true positives and 500 are false positives.

Costs may be:

  • Tool license fee is 50,000 SSB
  • Training fee is 7,000 SSB
  • 80 hours of external consultancy at 125 SSB each = 10,000 SSB
  • 60 IT hours for training and installation at 50 SSB each = 3,000 SSB
  • 100 SME hours for training and configuration at 50 SSB each = 5,000 SSB

Total costs are 75,000 SSB.

Savings per campaign are 7,000 * 3 SSB – 500 * 5 SSB = 18,500 SSB.

A positive ROI will show up after the 5th campaign.

C: Buying an algorithm based dedupe tool

By using algorithm-based data matching, such a tool may, depending on the threshold setting, find 9,100 duplicates, where 9,000 are true positives and 100 are false positives.

Costs may be:

  • Tool license fee is 5,000 SSB
  • 8 hours of external consultancy for a workshop at 125 SSB each = 1,000 SSB
  • 15 SME hours for training, configuration and pushing the button at 50 SSB each = 750 SSB

Total costs are 6,750 SSB.

Savings per campaign are 9,000 * 3 SSB – 100 * 5 SSB = 26,500 SSB.

A remarkable ROI will show up in the 1st campaign.
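For reference, here is the arithmetic for all three options in one place, using only the figures from the text (SSB being the post's fictitious currency).

```python
# The arithmetic behind the three options, using only figures from the text.

TP_SAVING = 3   # SSB saved per duplicate found (one letter not sent)
FP_COST = 5     # SSB of expected profit lost per wrongly merged record

def campaign_savings(true_positives: int, false_positives: int) -> int:
    return true_positives * TP_SAVING - false_positives * FP_COST

# Option A: the student's labour cost recurs in every campaign
a_net = campaign_savings(5_000, 0) - 5_895
print(f"A: net {a_net:,} SSB per campaign")

# Options B and C: a one-off tool cost paid back over campaigns
for name, setup_cost, tp, fp in [("B", 75_000, 7_000, 500), ("C", 6_750, 9_000, 100)]:
    per_campaign = campaign_savings(tp, fp)
    breakeven = -(-setup_cost // per_campaign)   # ceiling division
    print(f"{name}: {per_campaign:,} SSB per campaign, pays back in campaign {breakeven}")
```

Running it confirms the text: option A nets 9,105 SSB per campaign, option B pays back in the 5th campaign, and option C pays back already in the 1st.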



Citizen ID within seconds

Here is a picture of my grandson Jonas taken minutes after he was born. He has a ribbon around his wrist showing his citizen ID, which has just been assigned. There is even a barcode with it on the ribbon.

Now, I have mixed feelings about that. It is indeed very impersonal. But as a data quality professional I do realize that this is a way of solving a problem at the root. Duplicate master data in healthcare is a serious problem as Dylan Jones reported last year when he had a son in this article from DataQualityPro.

A unique citizen ID (national identification number) assigned seconds after a birth has a lot of advantages. As said, it is a foundation for data quality in healthcare from the very start of a life. Later, when you get your first job, you hand the citizen ID to your employer and tax is collected automatically. When the rest of the money is in the bank, you are uniquely identified there. When you turn 18, you are seamlessly put on the electoral roll. Later, your marriage is merely a relation in a government database between your citizen ID and the citizen ID of your beloved one.

Oh joy, Master Data Management at the very best.



Data Matching 101

Following up on my post no. 100, I can’t resist making a post having 101 in the title. I’ll use 101 in the meaning of an introduction to a subject. As “Data Quality 101” and “MDM 101” are already widely discussed, I think “Data Matching 101” is a good title.

Data matching deals with the dimension of data quality I like to call uniqueness. I use uniqueness because it is the positive term describing the state we want to bring our data to – as opposed to duplication, which is the state we want to change. In the same way, the other dimensions of data quality, such as accuracy, consistency and timeliness, also describe desired states.

Data matching is, besides data profiling, the activity within data quality that has been automated the most. No wonder, since duplicates (especially in master data) and master data not being aligned with the real world are costing organizations incredible amounts of money. Finding duplicates among millions (or even thousands) of records by manual means is impossible. The same is true for matching against directories holding timely descriptions of the real world. You have to use a computerized approach, controlled by exactly the amount of manual verification that makes your return on investment positive.

Matching names and addresses (party master data) is the most common area of data matching. Matching product master data is probably going to be the next big thing in matching. I have also been involved in matching location data and timetables.

A computerized approach to data matching may include some different techniques like parsing and standardization, using synonyms, assigning match codes, advanced algorithms and probabilistic learning.
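Two of those techniques, match codes and an algorithmic similarity measure, can be sketched briefly. The match-code recipe here (first three surname consonants plus postal code) is just one invented example of such a recipe, and the 0.85 similarity threshold is arbitrary.

```python
# Sketch of two techniques from the list above: a match code used as a coarse
# blocking key, plus an algorithmic similarity check inside each block.
from difflib import SequenceMatcher

def match_code(name: str, postal_code: str) -> str:
    """Coarse key: first three consonants of the surname plus the postal code."""
    surname = name.split()[-1].upper()
    consonants = [c for c in surname if c.isalpha() and c not in "AEIOUY"]
    return "".join(consonants[:3]) + postal_code

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Edit-distance style similarity on the full name strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

a = ("Stefani Germanotta", "1000")
b = ("Stephanie Germanota", "1000")
print(match_code(*a), match_code(*b))    # -> GRM1000 GRM1000 (same block)
if match_code(*a) == match_code(*b) and similar(a[0], b[0]):
    print("candidate duplicate")
```

The match code quickly groups candidates despite the misspellings, and the slower character-level comparison then only has to run within each group.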

All that is best explained with examples. Therefore I am happy to do a webinar called “The Art of Data Matching” as part of a series of free webinars on eLearningCurve. The webinar will be a sightseeing tour of examples of challenges and solutions in the data matching world.

Date and time: Well, these are matching examples of expressing the moment the webinar starts:

  • Friday 06/04/10 12pm EDT
  • Friday 04/06/10 18:00 Central European Summer Time
  • Sydney, Sat Jun 5 2:00 AM

Link to the eLearningCurve free webinar here.


Relational Data Quality

Most of the work related to data quality improvement I do is done with data in relational databases and is aimed at creating new relations between data. Examples (from party master data) are:

  • Make a relation between a postal address in a customer table and a real world address (represented in an official address dictionary).
  • Make a relation between a business entity in a vendor table and a real world business (represented in a business directory most often derived from an official business register).
  • Make a relation between a consumer in one prospect table and a consumer in another prospect table because they are considered to represent the same real world person.

When striving for multi-purpose data quality it is often necessary to reflect further relations from the real world like:

  • Make a relation in a database reflecting that two (or more) persons belong to the same household (at the same real world address)
  • Make a relation in the database reflecting that two (or more) companies have the same (ultimate) mother.

Getting these relations right is fundamental to any further data quality improvement endeavors and all the exciting business intelligence stuff. In doing that you may continue to have more or less fruitful discussions on, say, the classic question: What is a customer?

But in my eyes, in relation to data quality, it doesn’t matter whether that discussion ends with a given row in your database being a customer, an old customer, a prospect or something else. Building the relations may even help you realize what that someone really is. It could be that a sporadic lead is recognized as belonging to the same household as a good customer. It could be that a vendor is recognized as a daughter company of a hot prospect. It could be that someone is recognized as being fake. And you may even have some business intelligence that, based on the relations, reports a given row as having a customer role in one context and another role in another context.

Aadhar (or Aadhaar)

The solution to the single most frequent data quality problem, party master data duplicates, is actually very simple: every person (and every legal entity) gets a unique identifier which is used everywhere by everyone.

Now India jumps on the bandwagon and starts assigning a unique ID to the 1.2 billion people living in India. As I understand it, the project has just been named Aadhar (or Aadhaar). Google Translate tells me this word (आधार) means base or root – please correct me if anyone knows better.

In Denmark we have had such identifiers (one for citizens and one for companies) for many years. They are not used by everyone everywhere – so you are still able to make money as a data quality professional specializing in data matching.

The main reason that the unique citizen identifier is not used everywhere is of course privacy considerations. As for the unique company identifier, the reason is that data quality often is defined as fitness for the immediate purpose of use.


Enterprise Data Mashup and Data Matching

A mashup is a web page or application that uses or combines data or functionality from two or more external sources to create a new service. Mashups can be considered to have an active role in the evolution of social software and Web 2.0. Enterprise Mashups are secure, visually rich web applications that expose actionable information from diverse internal and external information sources. So says Wikipedia.

I think that Enterprise Mashups will need data matching – and data matching will improve from data mashups.

The joys and challenges of Enterprise Mashups were recently touched upon in the post “MDM Mashups: All the Taste with None of the Calories” by Amar Ramakrishnan of Initiate. Data needs to be cleansed and matched before being exposed in an Enterprise Mashup. An Enterprise Mashup is then a fast way to deliver Master Data Management results to the organization.

Party Data Matching has typically been done in these two often separated contexts:

  • Matching internal data like deduplicating and consolidating
  • Matching internal data against an external source like address correction and business directory matching

Increased utilization of multiple functions and multiple sources – like in a mashup – will help make better matching. Some examples I have tried include:

  • If you know whether an address is unique or not, this information can be used to settle the confidence of an individual or household duplicate.
  • If you know whether an address is a single residence or a multiple residence (like a nursing home or campus), this information can be used to settle the confidence of an individual or household duplicate.
  • If you know the frequency of a name (in a given country), this information can be used to settle the confidence of a private, household or contact duplicate.
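The third bullet can be sketched as a weighting function: an exact match on a rare name is strong evidence, an exact match on a frequent name is weak evidence. The formula, the population size and the bearer counts are all hypothetical; a real mashup would pull frequencies from an external name directory.

```python
# Sketch of the name-frequency idea: rarer name -> stronger match evidence.
# Formula, population size and bearer counts are hypothetical.
from math import log10

def name_match_weight(name_frequency: int, population: int = 5_000_000) -> float:
    """Weight between 0 and 1 for an exact name match."""
    rarity = population / max(name_frequency, 1)
    return min(1.0, log10(rarity) / log10(population))

print(f"John Smith (50,000 bearers): weight {name_match_weight(50_000):.2f}")
print(f"Stefani Germanotta (2 bearers): weight {name_match_weight(2):.2f}")
```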

As many data quality flaws (not surprisingly) are introduced at data entry, mashups may help during data entry, like:

  • An address may be suggested from an external source.
  • A business entity may be picked from an external business directory.
  • Various rules exist in different countries for using consumer/citizen directories – why not use the best available where you do business.

Also, the rise of social media adds new possibilities for mashup content during data entry, data maintenance and other uses of MDM / Enterprise Mashups. Like it or not, your data on Facebook, Twitter and not least LinkedIn are going to be matched and mashed up.


Reduplication: The next big thing?

Today I got a very exciting Master Data Management assignment. Usually I do deduplication processes, which means that two or more rows in a database are merged into one golden record because the original rows represent the same real world entity.

But in this case we are going to split one row into several rows with random keys (a so-called MNUID = Messy Non-Unique IDentifier). Also, names and addresses have to be misspelled in different ways so they are not easily recognized as being the same.

My client, the Danish Tax Authorities, has for years tried to develop methods for taxation above 100% and has finally reached this simple but very efficient method. Until now you as one person or one company pay up to 60% tax, but now each duplicate row will pay 60%. Hereby in phase one you may in fact pay 120%, but in later phases this will be extended to larger duplicate groups paying much higher percentages.

Already some foreign tax authorities have shown deep interest in this model (called Intelligent Reduplication for Supertaxation). First of all our Scandinavian neighbors are very interested, but eventually it may spread to the rest of the world.


What is a best-in-class match engine?

Lately, in connection with TIBCO acquiring data matching vendor Netrics, the term best-in-class match engine has been attached to the Netrics product.

First: I have no doubt that the Netrics product is a capable match engine – I know that from discussions in the LinkedIn Data Matching group and here on this blog.

Next: I don’t think anyone knows which product is the best match engine, because I don’t think all match engines have been benchmarked with a representative set of data.

Then there are of course the matching capabilities with different entity types to consider. Here party master data (like customer data) is covered by most products, whereas capabilities with other entity types (be they considered same same or not) are far less exposed.

As match engine products are acquired and integrated into suites, the core matching capabilities somehow become mixed up with a lot of other capabilities, making it hard to compare the match engine alone.

Some independent match engines work stand alone and some may be embedded into other applications.

These may then be the classes to be best in:

  • Match engines in suites
  • Embedded match engines (for say SAP, MS CRM and so on)
  • Stand alone match engines

Many match engines I have seen are tuned to deal with data from the country (culture) where they were born and had their first triumphs. As the US market is still by far the largest for match engines, the nomination of best match engine resembles a team becoming World Champions in American Football. International/multi-cultural capabilities will become more and more important in data matching. But indeed we may define a class for each country (culture).

In the old days I heard that one match engine was best for marketing data and another match engine was best for credit risk management. I think those days are over too. With Master Data Management you have to embrace all data purposes.

Some match engines are more successful in one industry than in others. The biggest differentiator in match effectiveness is between B2C and B2B data. B2C is the easiest, B2B is more complex, and embracing both is in my eyes a must for being considered best-in-class – unless we define separate classes for B2C, B2B and both.

As some matching techniques are deterministic and some are probabilistic, the evaluation of the latter will depend on the data already processed in a given instance, as the matching gets better and better while the self-learning element warms up.

So yes, this is an endless, religious-like discussion I have reopened here.
