Double Falshood

Always remember to include Shakespeare in a blog, right?

Now, it is actually disputable whether Shakespeare has anything to do with the title of this blog post. Double Falshood is the (first part of the) title of a play claimed to be based on a lost play by Shakespeare (and someone else). The only fact that seems to be true in this story is that the plot of the play(s) is based on an episode in Don Quixote by Cervantes. "The Ingenious Hidalgo Don Quixote of La Mancha", which is the full name of the novel, is probably best known for the attack on the windmills by Don Quijote (the Spanish version of the name).

All this confusion about sorting out who, what, when and where, and the feeling of tilting at windmills, seems familiar from the daily work of trying to fix master data quality.

And indeed "double falsehood" may be a good term for the classic challenge in the data quality kind of deduplication: avoiding false positives and false negatives at the same time.

Now, back to work.

Who is Responsible for Data Quality?

No, I am not going to continue some of the recent fine debates on who within a given company is data owner, accountable and responsible for data quality.

My point today is that many views on data ownership, the importance of upstream prevention and fitness for purpose of use in a business context are based on an assumption that the data in a given company is entered by that company, maintained by that company and consumed by that company.

In today's business world this is not true in many cases.

Examples:

Direct marketing campaigns

Making a direct marketing campaign and sending out catalogues is often an eye-opener regarding the quality of data in your customer and prospect master files. But such activities are very often outsourced.

Your company extracts a file with, say, 100,000 names and addresses from your databases and you pay a professional service provider a fee per row for doing the rest of the job.

Now the service provider could do you the kind favour of carefully deduplicating the file, eliminating the 5,000 purge candidates and bringing you the pleasant message that the bill will be reduced by 5 %.

Yes I know, some service providers actually include deduplication in their offerings. And yes, I know, they are not always that interested in using an advanced solution for that.

I see the business context here – but unfortunately it’s not your business.

Factoring

Sending out invoices is often a good test of how well customer master data is entered and maintained. But again, using an outsourced service for that, like factoring, is becoming more common.

Your company hands over the name and address, receives most of the money, and the data is out of sight.

Now the factoring service provider has a pretty good interest in assuring the quality of the data and aligning the data with a real world entity.

Unfortunately this cannot be done upstream; it's a downstream batch process, probably with no signalling back to the source.

Customer self service

Today data entry clerks are rapidly being replaced as customers do all the work themselves on the internet. Maybe the form is provided by you, maybe – as often with hotel reservations – the form is provided by a service provider.

So here you basically either have to extend your data governance all the way into your customer's living room or office, or to some degree (fortunately?) accept that the customer owns the data.


When computer says maybe

When matching customer master data in order to find duplicates, or to find corresponding real world entities in a business directory or a consumer directory, you may use a data quality kind of deduplication tool to do the hard work.

The tool will typically – depending on the capabilities of the tool and the nature of and purpose for the data – find:

A: The positive automated matches.  Ideally you will take samples for manual inspection.

C: The negative automated matches.

B: The dubious part selected for manual inspection.
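
To illustrate the three pots, here is a minimal sketch in Python, assuming a matching engine has already assigned a similarity score between 0 and 1 to each candidate pair. The two threshold values are illustrative, not taken from any particular tool:

```python
# Split scored candidate pairs into the A (positive), B (dubious) and C (negative) pots.
UPPER = 0.92   # at or above: automated positive match (A)
LOWER = 0.70   # below: automated negative match (C)

def classify(scored_pairs):
    """scored_pairs is an iterable of (record_1, record_2, score) tuples."""
    pots = {"A": [], "B": [], "C": []}
    for rec_1, rec_2, score in scored_pairs:
        if score >= UPPER:
            pots["A"].append((rec_1, rec_2, score))
        elif score >= LOWER:
            pots["B"].append((rec_1, rec_2, score))   # selected for manual inspection
        else:
            pots["C"].append((rec_1, rec_2, score))
    return pots

pots = classify([
    ("Banca di Toscana SpA", "Banca di Toscana S.p.A.", 0.97),
    ("Banca di Toscana SpA", "Vanca di Toscana SpA", 0.81),
    ("Banca di Toscana SpA", "Banco di Napoli", 0.35),
])
print({pot: len(pairs) for pot, pairs in pots.items()})  # {'A': 1, 'B': 1, 'C': 1}
```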

Humans are a costly resource. Therefore the manual inspection of the B pot (and the A sample) may be supported by a user interface that helps get the job done fast but accurately.

I have worked with the following features for such functionality:

  • Random sampling for quality assurance – both from the A pot and the manually settled matches from the B pot
  • Check-out and check-in for multiuser environments
  • Presenting a ranked range of computer selected candidates
  • Color coding elements in matched candidates (see the sketch after this list) – like:
    • green for a (near) exact name,
    • blue for a close name and
    • red for a far from similar name
  • Possibility for marking:
    • as a manual positive match,
    • as a manual negative match (with reason) or
    • as questionable for later or supervisor inspection (with comments)
  • Entering a match found by other methods
  • Removing one or several members from a duplicate group
  • Splitting a duplicate group into two groups
  • Selecting survivorship
  • Applying hierarchy linkage
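
As an illustration of the color coding mentioned above, here is a minimal sketch using a simple character-based similarity; the thresholds are illustrative, and a real matching tool would use more advanced name matching algorithms:

```python
from difflib import SequenceMatcher

def name_color(name_a: str, name_b: str) -> str:
    """Map name similarity to the traffic light coding used in the inspection screen."""
    score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    if score >= 0.95:
        return "green"   # (near) exact name
    if score >= 0.75:
        return "blue"    # close name
    return "red"         # far from similar name

print(name_color("Banca di Toscana", "Banca di Toscana"))  # green
print(name_color("Banca di Toscana", "Vanca di Toscana"))  # blue - a one-letter typo
print(name_color("Banca di Toscana", "Banco di Napoli"))   # red
```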

Anyone else out there who has worked with making or using a man-machine dialogue for this?

Do you mean deduplication or deduplication?

The term deduplication may mean two different things in computing:

  • The storage kind of deduplication
  • The data quality kind of deduplication

The storage kind of deduplication refers to reducing the data volumes stored and backed up by finding exactly the same file (or other assemblies of data, I guess) and eliminating all but one copy.

The data quality kind of deduplication is about finding entities in databases that don't have a common unique key and are not spelled exactly the same, but are so similar that we may consider them to represent the same real world object.

The result of the data quality kind of deduplication may be that all but one duplicate row are eliminated, but most often we will actually add more bytes by linking the duplicate rows and perhaps making a new golden record.
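
To make the contrast concrete, here is a minimal sketch of both kinds in Python: a content hash for the storage kind, where only byte-for-byte identical data counts, and a similarity threshold for the data quality kind (the threshold value is illustrative):

```python
import hashlib
from difflib import SequenceMatcher

def storage_duplicates(blob_a: bytes, blob_b: bytes) -> bool:
    """Storage kind: duplicates are byte-for-byte identical, detectable by hashing."""
    return hashlib.sha256(blob_a).digest() == hashlib.sha256(blob_b).digest()

def dq_duplicates(name_a: str, name_b: str, threshold: float = 0.85) -> bool:
    """Data quality kind: duplicates may be spelled differently but look alike."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio() >= threshold

print(storage_duplicates(b"Omikron", b"0mikron"))  # False - one byte differs
print(dq_duplicates("Omikron Data Quality", "Omikron Data Quality GmbH"))  # True
```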

This ambiguity sometimes leads to mixing it all up.

I remember some years ago, when I started as employee no. 1 at Omikron Data Quality in the Nordics, we ran a meeting booking campaign. This was done by a telemarketing bureau. They booked a lot of meetings for me, including one at a company that was very interested in tools for deduplication.

It was a very strange meeting until, after 12 minutes and 34 seconds, we concluded that indeed there are two kinds of deduplication in computing.

I also noticed lately that a leading vendor of the data quality kind of deduplication tools promoted their product by referring to articles on cost savings and more relating to the storage kind of deduplication.


Candidate Selection in Deduplication

When a recruiter and/or a hiring manager finds someone for a job position, it is basically done by getting in a number of candidates and then choosing the best fit among them. This of course doesn't account for the fact that there may be a better fit among all those people who were not among the candidates.

We have the same problem in data matching when we are deduplicating, consolidating or matching for other purposes.

Let's look at the following example. We have two names and addresses:

Banca di Toscana Società per azioni
Machiavelli 12
IT 51234 Firenze

Vanca di Toscana SpA
12, Via Niccolò Machiavelli
Florence
Italy

A human or a mature computerized matching engine will be able to decide that this is the same real world entity, with more or less confidence depending on taking knowledge like the following into consideration:

  • The ISO country code for Italy is IT
  • Florence is the English name for the city called Firenze in Italian
  • In Italian (as in Spanish, Germanic and Slavic cultures) the house number is written after the street name (unlike in English and French cultures)
  • In Italian you sometimes omit "Via" (Italian for way) and the first name of the person a street is named after
  • "Società per azioni", with the acronym SpA or S.p.A., is an Italian legal form

But another point is whether the two records are even going to be compared. Due to the above mentioned reasons related to diversity, and the typo in the first letter of the name in the last record, no ordinary sorting mechanism on the original data will get the two records into the same range.

If one record is in a table with 1,000,000 rows and the other record is in another table with 1,000,000 rows, the option of comparing every row with every row makes a Cartesian product of 1,000,000,000,000 similarity assignments, which is not practical. Nor is a real-time check against 1,000,000 rows for every new entry a practical option.

I have worked with the following techniques for overcoming this challenge:

Parsing and standardization

The address part of the example data may be parsed and standardized (including using geographical reference data) so it is put into the same format, like:

IT, 51234, Via Niccolo Machiavelli, 12

Then you are able to compare rows at a certain geographical depth, like all on the same entrance, street or postal code.

This technique is, though, heavily dependent on accurate and precise original addresses, and works best when applied per culture.
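
Here is a minimal sketch of the idea, assuming the addresses have already been parsed into country, postal code and street. Rows sharing a blocking key built from the standardized elements are the only ones compared, which collapses the Cartesian product; the parsing is deliberately naive:

```python
import re
from collections import defaultdict

def blocking_key(country: str, postal_code: str, street: str) -> str:
    """Build a candidate selection key from standardized address elements."""
    # Drop case, accents/digits and an optional leading "Via" so that
    # "Via Niccolò Machiavelli" and "Machiavelli 12" land in the same block.
    street = street.lower()
    street = re.sub(r"^via\s+", "", street)
    street = re.sub(r"[^a-z]", "", street)
    return f"{country}|{postal_code}|{street[-6:]}"  # last letters survive dropped first names

rows = [
    ("IT", "51234", "Machiavelli 12"),
    ("IT", "51234", "Via Niccolò Machiavelli"),
]
blocks = defaultdict(list)
for row in rows:
    blocks[blocking_key(*row)].append(row)

print(dict(blocks))  # one block holding both rows - only these two are compared
```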

Fuzzy search

Here you make use of the same fuzzy techniques used in similarity assignment when searching.

Probabilistic learning

If some variations of the same name or address have earlier been accepted as being the same, these variations may be recorded and used in future searching.
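
A minimal sketch of this, assuming manually confirmed matches are fed into a store of accepted variations that later searches consult:

```python
from collections import defaultdict

variation_store = defaultdict(set)  # canonical name -> variations accepted earlier

def record_accepted_match(canonical: str, variation: str) -> None:
    """Call when a human confirms that a variation represents the canonical name."""
    variation_store[canonical.lower()].add(variation.lower())

def search_terms(name: str) -> set:
    """Expand a search with every variation earlier accepted for that name."""
    return {name.lower()} | variation_store[name.lower()]

record_accepted_match("Banca di Toscana Società per azioni", "Banca di Toscana SpA")
print(search_terms("Banca di Toscana Società per azioni"))
# Both spellings are now used when selecting candidates for future matches.
```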

Hybrid

As always in data quality automation, using all the different techniques in a given implementation improves your margins.

Deploying Data Matching

As discussed in my last post, a core part of many Data Quality tools is Data Matching. Data Matching is about linking entities in or between databases where these entities are not already linked with unique keys.

Data Matching may be deployed in some different ways, where I have been involved in the following ones:

External Service Provider

Here your organization sends extracted data sets to an external service provider, where the data are compared and in many cases also related to other reference sources, all through matching technology. The provider sends back a "golden copy" ready for uploading into your databases.

Some service providers use a Data Matching tool from the market and others have developed their own solutions. Many solutions grown at the providers are country specific, equipped with a lot of tips and tricks learned from handling data from that country over the years.

The big advantage here is that you gain from the experience – and the reference data collection – at these providers.

Internal Processing

You may implement a data quality tool from the market and use it for comparing your own data, often from disparate internal sources, in order to grow the "golden copy" at home.

Many MDM (Master Data Management) products have some matching capabilities built in.

Also, many leading Business Intelligence tool providers supplement the offering with an (integrated) Data Quality tool with matching capabilities, as an answer to the fact that Business Intelligence on top of duplicated data doesn't make sense.

Embedded Technology

Many data quality tool vendors provide plug-ins to popular ERP, CRM and SCM solutions so that data are matched with existing records at the point of entry. For the most popular such solutions, like SAP and MS CRM, there are multiple such plug-ins from different Data Quality technology providers. Then again, many implementation houses have a favorite combination – so in that way you select the matching tool by selecting an implementation house.

SOA Components

Embedded technology is of course not optimal where you operate with several databases, and the commercial bundling may also not be the best solution for you.

Here Service Oriented Architecture thinking helps, so that matching services are available as SOA components at any point in your IT landscape, based on centralized rule setting.
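
A minimal sketch of the thinking, with hypothetical names: one matching service with centrally maintained rules, callable from any application in the landscape at its point of entry:

```python
from difflib import SequenceMatcher

CENTRAL_RULES = {"match_threshold": 0.85}  # maintained in one place, not per application

def match_service(new_name: str, existing_names: list) -> list:
    """Return existing entries similar enough to the new entry, per the central rules."""
    threshold = CENTRAL_RULES["match_threshold"]
    return [name for name in existing_names
            if SequenceMatcher(None, new_name.lower(), name.lower()).ratio() >= threshold]

# Any system - ERP, CRM, a web form - calls the same service before creating a record.
print(match_service("Omikron Data Quality", ["Omicron Data Quality", "Acme Corp"]))
```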

Cloud Computing

Cloud computing services offered by external service providers take the best from these two worlds into one offering.

Here the SOA component resides at the external service provider – in the best case combining an advanced matching tool, rich external reference data and the tips and tricks for your particular country and industry in question.


Data Quality Tools Revealed

To be honest: Data Quality tools today only solve a very few of the data quality problems you have. On the other hand, the few problems they do solve may be solved very well, and cannot be solved by any other line of products or in any practical way by humans in any quantity or quality.

Data Quality tools mainly support you with automation of:

• Data Profiling and
• Data Matching

Data Profiling

Data profiling is the ability to generate statistical summaries and frequency distributions for the unique values and formats found within the fields of your data sources, in order to measure data quality and find critical areas that may harm your business. For a more thorough description of the subject I recommend reading the introduction provided by Jim Harris in his post "Getting Your Data Freq On", which is followed up by the series of posts "Adventures in Data Profiling part 1 – 8".

Saying that you can't use other product lines for data profiling is actually only partly true. You may come a long way by using features in popular database managers, as demonstrated in Rich Murnane's blog post "A very inexpensive way to profile a string field in Oracle". But for full automation and a full set of out-of-the-box functionality, a data profiling tool will be necessary.
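
For illustration, here is a minimal profiling sketch in plain Python, computing the two classic frequency distributions – values and format masks – for a single field (the sample data is made up):

```python
import re
from collections import Counter

phone_numbers = ["+45 1234 5678", "12345678", "+45 1234 5678", "", "4512345678"]

value_freq = Counter(phone_numbers)
# Reduce each value to a format mask: digits become 9, letters become A.
format_freq = Counter(
    re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value)) or "<empty>"
    for value in phone_numbers
)

print(value_freq.most_common())   # spot dominant values and suspicious repeats
print(format_freq.most_common())  # count the distinct formats hiding in the field
```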

The data profiling tool market landscape is – as opposed to that of data matching – also characterized by the existence of open source tools. Talend is the leading one of those; another is DataCleaner, created by my fellow countryman Kasper Sørensen.

I take the emergence of open source solutions in the realm of data profiling as a sign that this is the technically easiest part of data quality tool invention.

Data Matching

Data matching is the ability to compare records that are not exactly the same, but are so similar that we may conclude that they represent the same real world object.

Also here, some popular database managers today have some functionality, like the fuzzy grouping and lookup in MS SQL Server. But in order to really automate data matching processes you need a dedicated tool equipped with advanced algorithms and comprehensive functionality for candidate selection, similarity assignment and survivorship settlement.
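
As a sketch of the survivorship settlement part, here is a made-up rule keeping the most complete and most recent record of a duplicate group; real tools let you configure such rules in detail:

```python
from datetime import date

duplicate_group = [
    {"name": "Banca di Toscana SpA", "phone": None, "updated": date(2008, 3, 1)},
    {"name": "Banca di Toscana Società per azioni", "phone": "+39 055 1234567",
     "updated": date(2009, 11, 15)},
]

def survivor_score(record: dict) -> tuple:
    completeness = sum(1 for value in record.values() if value)  # filled fields
    return (completeness, record["updated"])                     # recency breaks ties

survivor = max(duplicate_group, key=survivor_score)
print(survivor["name"])  # the more complete and more recent record survives
```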

Data matching tools are essential for processing large numbers of data rows within a short timeframe, for example when purging duplicates before marketing campaigns or merging duplicates in migration projects.

Matching technology is increasingly implemented as what is often described as a firewall, where possible new entries are compared to existing rows in databases as upstream prevention against duplication.

Besides handling duplicates, matching techniques are used for correcting postal addresses against official postal references and for matching data sets against reference databases like B2B and B2C party data directories, as well as matching with product data systems, all in order to be able to enrich with and maintain more accurate and timely data.

Automation of matching is in no way straightforward, and solutions for it constantly face the balance of producing a sufficient number of true positives without creating too many false positives.


Select Company_ID from External_Source where possible

At the risk of having the comment area on this blog filled up with SQL statements, I will follow the track and tone from the last post, called Create Table Homo_Sapiens.

In that post some challenges around modelling people in databases were discussed, with focus on uniqueness. Now we will have a look at the same challenges with companies – the other big part of party master data.

Companies often act in the same role as individual people in business processes – not least in the role of customer. Companies also behave like persons in a lot of ways: being born (established), changing name, relocating, marrying (mergers and acquisitions), divorcing (splits) and deceasing (dissolving).

All over the world a lot of people spend their days entering and updating the data held on business partners in numerous databases. The worldwide sum of B2B connections, between a customer and a vendor each entering and maintaining the data about the other, resembles (though less aggressively) the grains on a chessboard story:

  • 2 companies both exchanging goodies makes 1+1 customers and 1+1 vendors = 4 rows
  • 3 companies all exchanging goodies makes 2+2+2 customers and 2+2+2 vendors = 12 rows
  • 4 companies all exchanging goodies makes 3+3+3+3 customers and 3+3+3+3 vendors = 24 rows
  • 5 companies all exchanging goodies makes 4+4+4+4+4 customers and 4+4+4+4+4 vendors = 40 rows
  • n companies all exchanging goodies makes n*(n-1) customers and n*(n-1) vendors = 2*n*(n-1) rows

Last time I checked, the D&B WorldBase held more than 150 million companies. Some are dissolved, and fortunately(?) not everyone does business with everyone – but as said, the sum of B2B connections is huge and the work of entering and maintaining the master data seems overwhelming.

If we look at one single company and how it may be represented differently in databases around only taking basic data as name and address into account, there will be lots of variations. Even in the same table the same real world company often occupies several rows spelled differently.

One of the most effective methods for avoiding duplicates in company master data is plugging into a business directory. By using an externally sourced company ID as a key in your master data, you are able to hold a golden record for that real world entity. As a bonus you are offered updates and access to a lot of additional data you would never be able to collect yourself.
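
A minimal sketch of the idea, where directory_lookup and the ID format are hypothetical stand-ins for a real business directory matching service:

```python
def directory_lookup(name: str, city: str):
    """Pretend call to a business directory, returning its ID for the best match."""
    directory = {("banca di toscana", "firenze"): "IT-0051234"}   # made-up entry and ID
    return directory.get((name.lower(), city.lower()))

customer = {"name": "Banca di Toscana", "city": "Firenze"}
customer["company_id"] = directory_lookup(customer["name"], customer["city"])
# With the directory ID as key, later entries of the same real world company
# resolve to one golden record, and the provider can push updates for it.
print(customer)
```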

Create Table Homo_Sapiens

Create Table is a basic statement in the SQL language, which is the most widespread computer language used when structuring data in databases.

The most common entity in databases around must be rows representing real world human beings (Homo Sapiens) and the different groups we form. Tables for that could have the name Homo_Sapiens, but are usually called Customer, Member, Citizen, Patient, Contact and so on.

The most common data quality issues around are related to accuracy, validity, timeliness, completeness and not least uniqueness of the data we hold about people.

In databases tables are supposed to have a unique primary key. There are two basic types of primary keys:

  • Surrogate keys are typically numbers with no relation (and binding) to the real world. They are made invisible to the users of the applications operating on the database.
  • Natural keys are derived from existing codes or other data identifying an entity in the real world or made for that purpose. They are visible to users and part of electronic, written and verbal communication.

As surrogate keys obviously don't help with real world uniqueness, and there is no common global natural key for all human beings on earth, we have a challenge in creating a good primary key for a Homo Sapiens table.

Inside a given country we have different forms of citizen IDs (national identification numbers) with widely varying terms of use between countries. But even in Scandinavia, where I live and where we have widespread use of unique citizen IDs, most tables that could have the name Homo_Sapiens cannot use a citizen ID as a (unique) primary key, for several reasons, not least that the data is not present in a lot of situations.

Most often we name the tables holding data about human beings by the role people act in within the purpose of use for the data we collect – for example a Customer table. A customer may be an individual but also a household or a business entity. A human being may be a private consumer but also an employee at a business making a purchase, or a business owner making both private purchases and business purchases.

Every business activity always comes down to interacting with individual persons. But as our data is collected for the different roles that individual may have acted in, we have a need for viewing these data related to single human beings. The methods for facilitating this come in different flavours, such as:

  • Deduplication is the classic term used for describing processes where records are linked, merged or purged in order to make a golden copy having only one (parent) database row for each individual person (and other legal entities) – see the sketch after this list. This is usually done by matching data elements in internal tables with names and addresses within a given organisation.
  • Identity Resolution is about the same but – if a distinction is considered to exist – uses a wider range of data, rules and functionality to relate collected data rows to real world entities. In my eyes exploiting external reference data will add considerable efficiency in the years to come within deduplication / identity resolution.
  • Master Data Hierarchy Management again has the same goal of establishing a golden copy of collected data, by emphasising the complex structure of relationships in the real world as well as the related history.
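
As a minimal sketch of this golden copy thinking (the model is illustrative, not taken from any particular tool): one row per human being, with role records linking to it instead of duplicating the person:

```python
from dataclasses import dataclass, field

@dataclass
class HomoSapiens:
    person_id: int                      # surrogate key, invisible to users
    name: str
    roles: list = field(default_factory=list)

@dataclass
class Role:
    person: HomoSapiens
    role_type: str                      # "customer", "contact", "patient" ...
    source_system: str

someone = HomoSapiens(person_id=1, name="Niccolò Machiavelli")
someone.roles.append(Role(someone, "customer", "CRM"))
someone.roles.append(Role(someone, "contact", "ERP"))
# Two source rows and two roles, but still one golden record for the person.
print(someone.name, [role.role_type for role in someone.roles])
```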

Next time I am involved in a data modelling exercise I will propose a Homo_Sapiens table. I wonder about the odds of buy-in from other business and technical delegates.


A New Year Resolution

Also for this year I have made this New Year resolution: I will try to avoid stupid mistakes that actually are easily avoidable.

Just before Christmas 2009 I made such a mistake in my professional work.

It’s not that I don’t have a lot of excuses. Sure I have.

The job was a very small assignment doing what my colleagues and I have done a lot of times before: an Excel sheet with names, addresses, phone numbers and e-mails was to be cleansed for duplicates. The client had got a discount price. As usual it had to be finished very quickly.

I was very busy before Christmas – but accepted this minor trivial assignment.

When the Excel sheet arrived it looked pretty straightforward: some names of healthcare organizations and healthcare professionals working there. I processed the sheet in the Omikron Data Quality Center, scanned the result and found no false positives, made the export suppressing merge/purge candidates and delivered back (what I thought was) a clean sheet.

But the client came back. She had found at least 3 duplicates in the not so clean sheet. Embarrassing, because I didn't ask her (as I usually do) a few obvious questions about what would constitute a duplicate. I had even recently blogged about the very challenge I missed, the one I call "the echo problem".

The problem is that many healthcare professionals have several job positions. Maybe they have a private clinic besides positions at one or several different hospitals. And for this particular purpose a given healthcare professional should only appear once.

Now, this wasn't an MDM project where you have to build complex hierarchy structures, but one of those many downstream cleansing jobs. Yes, they exist, and I predict they will continue to do so in the decade beginning today. And sure, I could easily make a new process ending in a clean sheet fit for that particular purpose based on the data available.

Next time, this year, I will get the downstream data quality job done right the first time so I have more time for implementing upstream data quality prevention in state of the art MDM solutions.
