Dealing with annoying customers

No, this is not a blog post about how to handle customers who unjustly complain about everything.

This is a blog post about how to maintain high quality data in customer databases.

When doing that, some types of party entities are more difficult to handle than others. In general B2B (business) entities are more complex than B2C (consumer/citizen) entities. Some of the B2B types I have spent more time with than others are the following:

Restaurants are some of the more demanding guests in our databases:

  • They change owner more often than most other business entities, becoming a new legal entity each time, which is important in some business contexts like credit risk.
  • On the other hand, the address stays the same despite the new owner, which makes it the same entity in the eyes of other business contexts like logistics.
  • In many cases you may have one name (trade style) for the restaurant and another official name for the business – a variant of this is when the restaurant is franchised.

Public sector bodies can’t be sliced the same way as private entities:

  • Often it is hard to state whether a business partner belongs to a narrowly defined or a more broadly defined unit within a governmental or local authority.
  • Public sector bodies tend to have long names that may be written with different words included, different word order and different abbreviations.

Global enterprises may be seen as one or as thousands of customers:

  • The need for hierarchy management is obvious when it comes to handling data about business partners that belong to a global enterprise – risk management, 1-1 marketing, sales force automation and so on will use the same data in many different ways.
  • Company family trees are useful but treacherous. A mother and a daughter company may be very closely connected with lots of shared services, or it may be strictly a matter of ownership with no operational ties at all.

These are some of the facts of life that make it fun, and far from trivial, to conduct data matching and other activities in order to achieve and maintain high quality customer master data.


What is Data Quality anyway?

The above question might seem a bit belated after I have blogged about it for 9 months now. But from time to time I ask myself some questions like:

Is Data Quality an independent discipline? If it is, will it continue to be that?

Data Quality actually is (or should be) a part of a lot of other disciplines.

Data Governance as a discipline is probably the best place to include general data quality skills and methodology – that is, all the people and process sides of data quality practice. Data Governance is an emerging discipline with an evolving definition, says Wikipedia. I think there is a pretty good chance that data quality management as a discipline will increasingly be regarded as a core component of data governance.

Master Data Management is a lot about Data Quality, but MDM could be dead already. Just like SOA. In short: I think MDM and SOA will survive by getting new life from the semantic web and all the data resources in the cloud. For that, MDM and SOA need Data Quality components. Data Quality 3.0 it is.

You may then replace MDM with CRM, SCM, ERP and so on, and thereby extend the use of Data Quality components from dealing only with master data to dealing with transaction data as well.

Next questions: Are Data Quality tools an independent technology? If they are, will they continue to be that?

It’s clear that Data Quality technology is moving from stand-alone batch processing environments, via embedded modules, to – oh yes – SOA components.

If we look at what data quality tools today actually do, they mostly support you with automation of data profiling and data matching, which probably covers only some of the data quality challenges you have.

In recent years there has been a lot of consolidation in the market around Data Integration, Master Data Management and Data Quality, which certainly tells us that the market needs Data Quality technology as components in a bigger scheme along with other capabilities.

But some new pure-play Data Quality vendors are also being established – and I think I often see some old folks from the acquired entities at these new challengers. So independent Data Quality technology is not dead, and doesn’t seem to want to be.


When computer says maybe

When matching customer master data in order to find duplicates, or to find corresponding real world entities in a business directory or a consumer directory, you may use a data quality deduplication tool to do the hard work.

The tool will typically – depending on the capabilities of the tool and the nature of and purpose for the data – find:

A: The positive automated matches.  Ideally you will take samples for manual inspection.

C: The negative automated matches.

B: The dubious part selected for manual inspection.
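
The A/B/C split above can be sketched as thresholding a similarity score. The score range and both thresholds below are illustrative assumptions, not any particular tool's defaults:

```python
# Route candidate pairs into the A/B/C pots by similarity score.
# Scores in 0.0-1.0 and both thresholds are illustrative assumptions.

def classify_match(score, upper=0.92, lower=0.60):
    """Return the pot for a candidate pair: A, B or C."""
    if score >= upper:
        return "A"  # positive automated match (sample for inspection)
    if score <= lower:
        return "C"  # negative automated match
    return "B"      # dubious: queue for manual inspection

# Hypothetical scored pairs
pairs = [("Acme Corp", "ACME Corporation", 0.95),
         ("Acme Corp", "Acme Ltd", 0.75),
         ("Acme Corp", "Zenith plc", 0.10)]

pots = {b: classify_match(s) for _, b, s in pairs}
```

The B pot is what then feeds the manual inspection work.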

Humans are costly resources. Therefore the manual inspection of the B pot (and the A sample) may be supported by a user interface that helps get the job done quickly but accurately.

I have worked with the following features for such functionality:

  • Random sampling for quality assurance – both from the A pot and from the manually settled part of the B pot
  • Check-out and check-in for multiuser environments
  • Presenting a ranked range of computer selected candidates
  • Color coding elements in matched candidates – like:
    • green for (near) exact name,
    • blue for a close name and
    • red for a far from similar name
  • Possibility for marking:
    • as a manual positive match,
    • as a manual negative match (with reason) or
    • as questionable for later or supervisor inspection (with comments)
  • Entering a match found by other methods
  • Removing one or several members from a duplicate group
  • Splitting a duplicate group into two groups
  • Selecting survivorship
  • Applying hierarchy linkage

Anyone else out there who has worked with making or using a man-machine dialogue for this?

Do you mean deduplication or deduplication?

The term deduplication may be two different things in computing:

  • The storage kind of deduplication
  • The data quality kind of deduplication

The storage kind of deduplication refers to reducing the data volumes stored and backed up by finding exactly the same file (or other assemblies of data, I guess) and eliminating all but one copy.

The data quality kind of deduplication is about finding entities in databases that don’t have a common unique key and are not spelled exactly the same, but are so similar that we may consider them to represent the same real world object.

The result of the data quality kind of deduplication may be that all but one duplicate row are eliminated, but most often we actually will add more bytes by linking the duplicate rows and perhaps make a new golden record.
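
A minimal sketch of that linking approach; the survivorship rule used here (longest non-empty value per field wins) is an illustrative assumption, and real tools use far richer rules:

```python
# Link duplicate rows under a group key and derive a golden record.
# The survivorship rule (longest value wins) is an illustrative assumption.

def golden_record(rows):
    """Merge duplicate rows into one consolidated view; rows are kept."""
    fields = rows[0].keys()
    return {f: max((r[f] for r in rows), key=len) for f in fields}

dupes = [{"name": "Acme", "city": ""},
         {"name": "Acme Corporation", "city": "Firenze"}]

# The original rows stay in place; a linked group plus one merged view is added
linked = {"group_id": 1, "members": dupes, "golden": golden_record(dupes)}
```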

This ambiguity sometimes leads to mixing it all up.

I remember some years ago, when I started as employee number one at Omikron Data Quality in the Nordics, we ran a meeting booking campaign. This was done by a telemarketing bureau. They booked a lot of meetings for me, including one at a company that was very interested in tools for deduplication.

It was a very strange meeting until, after 12 minutes and 34 seconds, we concluded that indeed there are two kinds of deduplication in computing.

I also noticed lately that a leading vendor of the data quality kind of deduplication tools promoted their product by referring to articles on cost savings and more related to the storage kind of deduplication.


Cultural Stereotypes, Matching Engines and an Oscar

Normally I’m not that fond of using cultural stereotypes, but nevertheless prompted by a conversation lately (and inspired by the Oscar show) I came to think about the following scenarios:

Indian Style

I have heard that in India you don’t say no if someone asks you to do something. So a Bollywood story could be:

A boss calls in a product manager. He asks him to make a data matching engine that produces no false positives and no false negatives. The product manager knows it is impossible, but can’t say no. The product manager says it may be complicated, but when told they can double the team, he goes back to the developers and initiates the project.

After a month the boss calls the product manager and asks if they are finished. The product manager replies: “Well, we have come a long way, but there are still some unresolved issues and some testing to be done”.

After yet a month the boss calls the product manager again and asks if they are finished. The product manager replies: “Well, we have solved the previous issues, but we have run into some new problems and some more testing has to be done”.

After yet a month the boss calls the product manager again and asks if they are finished. The product manager replies: “Well, we have …”

Danish Habits

In Denmark we have good compensation from the state if we lose our jobs, and anyway we are confident that we will find another one. So the short story (we are good at short films) could be:

The boss calls in the product manager and says “Hi Kim, it’s been decided we will make a matching engine that produces no false positives and no false negatives”.

The product manager leans forward, slams the provided business plan onto the table and says: “If you want such a product you can make it yourself” and leaves the room.

The American Way

It’s my impression that in the United States you (mostly) do what you are told to do. So here the Hollywood story could be:

The boss calls in the product manager and says “Chris, I have got a great idea:  We will make a matching engine that produces no false positives and no false negatives”.

The product manager replies: “That’s impossible”.

The boss says: “Chris, I didn’t ask you about your opinion but told you to make the product”.

The product manager: “You’re the boss”.

The product manager returns to the team. They work hard to make a matching engine with some configurable settings as:

  • No false positives, but false negatives are allowed (recommended)
  • No false negatives, but false positives are allowed
  • No false positives and no false negatives

The boss is satisfied with how the product looks. He passes it on to marketing. Marketing contacts the analysts. The analysts are excited about the product features and write about how this great product (from this well established company) will change the game of data matching.

Standardise this, standardize that

Data matching is about linking entities in databases that don’t have a common unique key and are not spelled exactly the same, but are so similar that we may consider them to represent the same real world object.

When matching we may:

  • Compare the original data rows using fuzzy logic techniques
  • Standardize the data rows and then compare using traditional exact logic
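
The two approaches can be sketched like this; Python's difflib stands in for a real fuzzy-matching engine, and the synonym table is an illustrative assumption:

```python
from difflib import SequenceMatcher

# Approach 1: fuzzy comparison of the original values
def fuzzy_match(a, b, threshold=0.7):
    """Threshold is an illustrative assumption."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Approach 2: standardize first, then compare with traditional exact logic
SYNONYMS = {"st.": "street", "str.": "street"}  # assumed lookup table

def standardize(value):
    return " ".join(SYNONYMS.get(w, w) for w in value.lower().split())

a, b = "Main St.", "Main Street"
by_fuzzy = fuzzy_match(a, b)                        # True
by_standardized = standardize(a) == standardize(b)  # True
```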

As suggested by the title of this blog post, a common problem with standardization is that it may have two (or more) outcomes, just like this English word may be spelled in different ways depending on the culture.

Not least when working with international data do you feel this pain. In my recent social media engagement I have had the pleasure of touching on this subject (mostly in relation to party master data) on several occasions, including:

  • In a comment to a recent post on this blog, Graham Rhind says: “Based just on the type of element and their positions in an address, there are at least 131 address formats covering the whole world, and around 40 personal name formats (I’m discovering more on an almost daily basis).”
  • Rich Murnane made a post with a fantastic video of Derek Sivers explaining that while many parts of the world have named streets with building numbers assigned by sequential position, in Japan you have named blocks between unnamed streets, with building numbers assigned in the order the buildings were established.
  • In the Data Matching LinkedIn group, Olga Maydanchik and I exchanged experiences on the problem that in the American date format you write the month before the day, while in the European date format you write the day before the month.
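
The date-format pitfall in one small sketch: the same string yields two different dates depending on the assumed locale.

```python
from datetime import datetime

raw = "04/05/2010"
us_style = datetime.strptime(raw, "%m/%d/%Y")  # American: April 5th
eu_style = datetime.strptime(raw, "%d/%m/%Y")  # European: May 4th

# Without knowing who entered the data, the value is ambiguous
ambiguous = us_style != eu_style  # True
```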

In my work with international data I have often seen that determining which standard is used depends on both:

  • The culture of the real world entity that the data represents
  • The culture of the person (organisation) that provided the data

So the possible combination of standards applied to a given data set derives from where the data is, what elements it contains and who entered the data (information that is often not carried along).

This is why I like to use both standardisation and standardization and fuzzy logic when selecting candidates and assigning similarity in data matching.


Having the right element to the left

Name, address and place are core attributes in almost any database. You may atomize these attributes into smaller slices, but in doing that: Mind the sequence.

When working with data matching and party master data management, some of the frequently exposed issues are:

Person name

Often a person name is split into first name and last name, but even when assigning these labels you are on slippery ground. Examples:

  • In some cultures, like in East Asia, the family name is written first and the given name is written last.
  • Some notations indicate that the given name isn’t the first element:
    • “DUPONT Michel” is a customary French way of indicating that the family name is the first element
    • “Smith, John” is a universal way of indicating that the family name is the first element

Besides that, we have issues with middle names and other three-part naming, and with salutations, educational titles and job titles mixed into name fields.
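
The notations above can be sketched with a few heuristics; these rules are illustrative, not a complete international name parser:

```python
# Heuristic sketch of the notations discussed; illustrative only.

def split_name(raw):
    """Return (given name, family name) for a few common notations."""
    if "," in raw:                                 # "Smith, John"
        family, given = [p.strip() for p in raw.split(",", 1)]
        return given, family
    parts = raw.split()
    if len(parts) > 1 and parts[0].isupper():      # "DUPONT Michel"
        return " ".join(parts[1:]), parts[0].title()
    return " ".join(parts[:-1]), parts[-1]         # default: family last

split_name("Smith, John")    # ("John", "Smith")
split_name("DUPONT Michel")  # ("Michel", "Dupont")
```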

Street address

Most of the world is divided into two “street address” cultures:

  • In the Americas you write the house number in front of the street name if you are north of the Rio Grande (the US and Canada), but you write the house number after the street name if you are south of the Rio Grande (Mexico, Brazil, Argentina and almost any other country).
  • In Europe you write the house number in front of the street name if you are on the British Isles or in France, but you write the house number after the street name if you are in almost any other country.
  • The rest of the world is also divided in how street addresses are written.

Besides that we have other ways of writing addresses like the block style in Japan.

Place

Most countries have a postal code system – even Ireland will have that soon.

Despite the fact that the city name can in most cases be obtained by looking up the postal code, we often store the city name anyway – for those cases where we can’t.

And if the postal code and the city name are in one string: Oh yes, in some cultures you write the city name in front of the postal code, and in other cultures you do it the opposite way. And oh no: it doesn’t necessarily follow the sequence of the house number and street name.
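
Both sequences can be handled in one small parser sketch; the regex assumes a purely numeric 4-5 digit postal code, which is itself a culture-specific assumption:

```python
import re

# Accept "code city" or "city code"; numeric 4-5 digit code is assumed.
PLACE = re.compile(r"^(?:(\d{4,5})\s+(.+)|(.+?)\s+(\d{4,5}))$")

def parse_place(raw):
    """Return (postal code, city) regardless of the order in the string."""
    m = PLACE.match(raw.strip())
    if not m:
        return None
    code = m.group(1) or m.group(4)
    city = m.group(2) or m.group(3)
    return code, city

parse_place("51234 Firenze")  # ("51234", "Firenze")
parse_place("Firenze 51234")  # ("51234", "Firenze")
```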

In a blog post written a while ago we also had a look into postal address hierarchy, granularity, precision and history.


Candidate Selection in Deduplication

When a recruiter and/or a hiring manager fills a job position, it is basically done by bringing in a number of candidates and then choosing the best fit among them. This of course doesn’t account for the possibility that there may be someone better suited among all those people who were not among the candidates.

We have the same problem in data matching when we are deduplicating, consolidating or matching for other purposes.

Let’s look at the following example. We have two names and addresses:

Banca di Toscana Società per azioni
Machiavelli 12
IT 51234 Firenze

Vanca di Toscana SpA
12, Via Niccolò Machiavelli
Florence
Italy

A human or a mature computerized matching engine will be able to decide that this is the same real world entity, with more or less confidence, depending on taking knowledge like the following into consideration:

  • The ISO country code for Italy is IT
  • Florence is the English name for the city called Firenze in Italian
  • In Italian (as in Spanish, Germanic and Slavic cultures) the house number is written after the street name (as opposed to English and French cultures)
  • In Italian you sometimes omit “Via” (Italian for way) and the given name in a street named after a person
  • “Società per azioni”, with the acronym SpA or S.p.A., is an Italian legal form

But another question is whether the two records are even going to be compared. Due to the above mentioned diversity, and the typo in the first letter of the name in the second record, no ordinary sorting mechanism on the original data will place the two records in the same range.

If one record is in a table with 1,000,000 rows and the other record is in another table with 1,000,000 rows, comparing every row with every row makes a Cartesian product of 1,000,000,000,000 similarity assignments, which is not practical. Nor is a real-time check against 1,000,000 rows for every new entry a practical option.

I have worked with the following techniques for overcoming this challenge:

Parsing and standardization

The address part of the example data may be parsed and standardized (including using geographical reference data) so it is put into the same format, like:

IT, 51234, Via Niccolo Machiavelli, 12

Then you are able to compare rows at a certain geographical depth, like all rows sharing the same entrance, street or postal code.

This technique is, though, heavily dependent on accurate and precise original addresses, and works best when applied per culture.
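
Once standardized, a cheap blocking key can select candidates so that only rows sharing the key are compared, shrinking the Cartesian product to within-block comparisons. The key choice below (country plus postal code) is an illustrative assumption:

```python
from collections import defaultdict

def blocking_key(row):
    # Country + postal code as the candidate-selection key (assumed choice)
    return (row["country"], row["postal"])

def blocks(rows):
    """Group rows by blocking key; compare only within each group."""
    out = defaultdict(list)
    for row in rows:
        out[blocking_key(row)].append(row)
    return out

rows = [{"country": "IT", "postal": "51234", "name": "Banca di Toscana SpA"},
        {"country": "IT", "postal": "51234", "name": "Vanca di Toscana SpA"},
        {"country": "IT", "postal": "00100", "name": "Banca di Roma"}]

candidates = blocks(rows)  # the two Firenze rows end up in the same block
```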

Fuzzy search

Here you make use of the same fuzzy techniques used in similarity assignment when searching.

Probabilistic learning

If some variations of the same name or address were earlier accepted as being the same, these variations may be recorded and used in future searching.

Hybrid

As always in data quality automation, combining the different techniques in a given implementation improves your margins.

Deploying Data Matching

As discussed in my last post a core part of many Data Quality tools is Data Matching. Data Matching is about linking entities in or between databases, where these entities are not already linked with unique keys.

Data Matching may be deployed in some different ways, where I have been involved in the following ones:

External Service Provider

Here your organization sends extracted data sets to an external service provider, where the data are compared, and in many cases also related to other reference sources, all through matching technology. The provider sends back a “golden copy” ready for uploading into your databases.

Some service providers use a Data Matching tool from the market, and others have developed their own solutions. Many solutions grown at the providers are country specific, equipped with a lot of tips and tricks learned from handling data from that country over the years.

The big advantage here is that you gain from the experience – and the reference data collection – at these providers.

Internal Processing

You may implement a data quality tool from the market and use it for comparing your own data, often from disparate internal sources, in order to grow the “golden copy” at home.

Many MDM (Master Data Management) products have some matching capabilities built in.

Also, many leading Business Intelligence tool providers supplement their offering with an (integrated) Data Quality tool with matching capabilities, as an answer to the fact that Business Intelligence on top of duplicated data doesn’t make sense.

Embedded Technology

Many data quality tool vendors provide plug-ins to popular ERP, CRM and SCM solutions, so that data are matched with existing records at the point of entry. For the most popular such solutions, like SAP and MS CRM, there are multiple such plug-ins from different Data Quality technology providers. Then again, many implementation houses have a favorite combination – so in that way you select the matching tool by selecting an implementation house.

SOA Components

The embedded technology is of course not optimal when you operate with several databases, and the commercial bundling may not be the best solution for you either.

Here Service Oriented Architecture thinking helps, so that matching services are available as SOA components at any point in your IT landscape based on centralized rule setting.

Cloud Computing

Cloud computing services offered by external service providers take the best from these two worlds into one offering.

Here the SOA component resides at the external service provider – in the best case combining an advanced matching tool, rich external reference data and the tips and tricks for the particular country and industry in question.


Data Quality Tools Revealed

To be honest: Data Quality tools today solve only a very few of the data quality problems you have. On the other hand, the few problems they do solve may be solved very well, and cannot be solved by any other line of products, or in any practical way by humans, in any quantity or quality.

Data Quality tools mainly support you with automation of:

  • Data Profiling and
  • Data Matching

Data Profiling

Data profiling is the ability to generate statistical summaries and frequency distributions for the unique values and formats found within the fields of your data sources, in order to measure data quality and find critical areas that may harm your business. For more on the subject I recommend reading the introduction provided by Jim Harris in his post “Getting Your Data Freq On”, which is followed up by a series of posts, “Adventures in Data Profiling part 1 – 8”.

Saying that you can’t use other product lines for data profiling is actually only partly true. You may come a long way by using features in popular database managers, as demonstrated in Rich Murnane’s blog post “A very inexpensive way to profile a string field in Oracle”. But for full automation and a full set of out-of-the-box functionality, a data profiling tool will be necessary.
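
A toy version of the frequency-distribution part of profiling, assuming the common digit-to-9, letter-to-A pattern notation; a real tool adds far more than this:

```python
import re
from collections import Counter

def format_pattern(value):
    """Map digits to 9 and letters to A: '51234' -> '99999', 'ABC' -> 'AAA'."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

values = ["51234", "5123", "ABC", "51234"]
value_freq = Counter(values)                              # unique value counts
format_freq = Counter(format_pattern(v) for v in values)  # format counts
```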

The data profiling tool market landscape is – unlike that of data matching – also characterized by the existence of open source tools. Talend is the leading one of those; another is DataCleaner, created by my fellow countryman Kasper Sørensen.

I take the emergence of open source solutions in the realm of data profiling as a sign that this is the technically easiest part of data quality tool invention.

Data Matching

Data matching is the ability to compare records that are not exactly the same but are so similar that we may conclude that they represent the same real world object.

Here too, some popular database managers today have some functionality, like the fuzzy grouping and fuzzy lookup in MS SQL Server. But in order to really automate data matching processes you need a dedicated tool equipped with advanced algorithms and comprehensive functionality for candidate selection, similarity assignment and survivorship settlement.
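
One classic building block behind similarity assignment is edit distance; here is a minimal Levenshtein sketch (real matching engines combine several such measures with culture-aware logic):

```python
# Levenshtein edit distance via dynamic programming, row by row.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Normalize distance into a 0-1 similarity score."""
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

similarity("Banca di Toscana", "Vanca di Toscana")  # high: one edit apart
```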

Data matching tools are essential for processing large numbers of data rows within a short timeframe for example when purging duplicates before marketing campaigns or merging duplicates in migration projects.

Matching technology is increasingly implemented as what is often described as a firewall, where possible new entries are compared to existing rows in databases as upstream prevention against duplication.

Besides handling duplicates, matching techniques are used for correcting postal addresses against official postal references, and for matching data sets against reference databases like B2B and B2C party data directories, as well as against product data systems, all in order to enrich with and maintain more accurate and timely data.

Automation of matching is in no way straightforward, and solutions are constantly met with the challenge of balancing a sufficient number of true positives against too many false positives.
