Sticky Data Quality Flaws

The fight against data quality flaws is most successfully fought at data entry. Once incorrect information has been entered into the system, it often seems nearly impossible to eliminate the falsehood.

A hilarious example is told in an article from telegraph.co.uk. A local council sent a letter to a woman’s pet pig (named Blossom Grant) offering the animal the chance to register to vote in last week’s UK election. The letter is only the culmination of a stream of mail – including plenty of direct marketing – addressed to the pigsty. According to the article, the pigsty was wrongly registered as a residence some years ago after a renovation. Since then the pig’s owner (named Pauline Grant) has tried over and over again to get the error corrected – but with no success.


Big Time ROI in Identity Resolution

Yesterday I had the chance to make a preliminary assessment of the data quality in one of the local databases holding information about entities involved in carbon trade activities. It is believed that up to 90 percent of the market activity may have been fraudulent, with criminals pocketing 5 billion Euros. There is a description of the scam from telegraph.co.uk here.

Most of my work with data matching is aimed at finding duplicates. In doing this you must avoid so-called false positives, so you don’t end up merging information about two different real world entities. But when doing identity resolution for other purposes, including preventing fraud and scams, you may instead be interested in finding connections between entities that are not supposed to be connected at all.

The result of making such connections in the carbon trade database was quite astonishing. Here is an example where I have changed the names, addresses, e-mails and phone numbers; such a pattern was found in several cases:

Here we have a group of entities where the name, address, e-mail or phone number is shared in a way that doesn’t seem natural.
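To give a feel for how such connections can be surfaced, here is a minimal sketch in Python. The records, field names and the simple union-find grouping are illustrative assumptions only – not the actual carbon trade data or the tooling used:

```python
# Minimal sketch: entities sharing an address, e-mail or phone are linked
# into groups for review. All sample records and field names are made up.
from collections import defaultdict

entities = [
    {"id": 1, "name": "Acme Trading Ltd", "address": "1 High St", "email": "joe@example.com", "phone": "555-0101"},
    {"id": 2, "name": "Zenith Carbon ApS", "address": "1 High St", "email": "ann@example.com", "phone": "555-0199"},
    {"id": 3, "name": "Nordic Offsets", "address": "9 Low Rd", "email": "joe@example.com", "phone": "555-0123"},
]

# Union-find so that entities sharing any attribute value end up in one group
parent = {e["id"]: e["id"] for e in entities}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

index = defaultdict(list)  # (field, value) -> entity ids
for e in entities:
    for field in ("address", "email", "phone"):
        index[(field, e[field].lower())].append(e["id"])

for ids in index.values():  # link every entity sharing a value
    for other in ids[1:]:
        union(ids[0], other)

groups = defaultdict(list)
for e in entities:
    groups[find(e["id"])].append(e["id"])

print([g for g in groups.values() if len(g) > 1])  # [[1, 2, 3]]
```

Groups where seemingly unrelated names share an address, e-mail or phone number are exactly the kind of unnatural pattern worth putting in front of an investigator.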

My involvement in the carbon trade scam was initiated by a blog post yesterday from my colleague Jan Erik Ingvaldsen, based on the story that journalists, by merely browsing the database, had found addresses that simply don’t exist.

So the question is whether the authorities could have avoided losing 5 billion taxpayer Euros if some identity resolution, including automated fuzzy connection checks and real world checks, had been implemented. I know that everyone is so much more enlightened about what could have been done once a scam is discovered, but I actually think there are a lot of other billions of Euros (Pounds, Dollars, Rupees) out there that could be saved by doing some decent identity resolution.


Data Quality: The Movie

Learning from courses, books, articles and so on is good – but sometimes it is a bit like watching a movie and then realizing that the real world – especially your world – isn’t exactly like the movie.

Examples:

The parking experience:

The movie: You are going to visit someone in a huge building in the centre of a large city. You take your car to the front of the building and smoothly slide into the free parking spot next to the main entrance.

Real life: You drive round and round for ages until you finally find a free parking spot hardly within walking distance of your destination.

My life: During my 30 years in the IT business I have visited a lot of companies and spent time in their IT departments. Nobody does everything by the book. Not even close.

In my experience, large companies within financial services are perhaps the ones that come within some distance of doing things by the book. This is probably because most books about IT seem to be written by folks whose experience comes from working in large financial services businesses.

(And no, I have absolutely no documentation on that. It is just a gut feeling).

Hitting them hard:

The movie: You are a good guy observing a bad guy harassing a good-looking girl. You engage the bad guy in an intense fist fight; you are hit over and over again, but in the end you win. The good-looking girl thanks you by kissing your beautiful face.

Real life: Well, you may win the fight. But after that you have to go to the hospital and have them fix your face – and during the following month no girl can look at you without feeling very bad.

My life: Recently I was involved in a data management project aimed at producing some new business intelligence results. Executive sponsorship was no problem; the CEO was the initiator. Objectives were pretty clear. High-level business requirements were well known and, not to forget, everyone was fully aware of the impact of data quality. The only issues were the absence of more concrete detailed requirements and business rules for reporting – and of course a politically settled deadline.

Facing the business rule issue we took a data-centric and test-driven approach. We produced incremental results, verified test cases, negotiated business rules based on real data examples, and in the end a first report came out. The result was far from expected, in the sense that the numbers were expected to be different. We dived into the data again, found an unexpected data quality issue and corrected accordingly. The result was still far from expected. Based on one specific expected result we dived into a section of the data, made detailed reports and compared them to the real world. In the end it turned out that the report was right: the gut-feeling perception of the real world had been wrong for a long time.

Now that’s a winner, right? Well, the project is now on hold for political reasons, and it also has a bad name for going over budget and past the deadline.

Looking great:

The movie: Morning scene from the nuclear family. Mommy is looking really great (stylish hair, perfect face) while cooking and serving a nice breakfast and helping the kids do some last-minute homework at the same time.

Real life: I think you know.

My life: Actually I have learned that you don’t have to strive for perfection. With data quality, don’t expect to be able to fix everything and have all data fit for every purpose of use at all times.


Aadhar (or Aadhaar)

The solution to the single most frequent data quality problem, party master data duplicates, is actually very simple: every person (and every legal entity) gets a unique identifier which is used everywhere by everyone.

Now India jumps on the bandwagon and starts assigning a unique ID to the 1.2 billion people living in India. As I understand it the project has just been named Aadhar (or Aadhaar). Google Translate tells me this word (आधार) means base or root – please correct me if anyone knows better.

In Denmark we have had such identifiers (one for citizens and one for companies) for many years. They are not used by everyone everywhere – so you are still able to make money as a data quality professional specializing in data matching.

The main reason that the unique citizen identifier is not used everywhere is of course privacy concerns. As for the unique company identifier, the reason is that data quality is often defined as fitness for the immediate purpose of use.


Data Quality from the Cloud

One of my favorite data quality bloggers, Jim Harris, wrote a blog post this weekend called “Data, data everywhere, but where is data quality?”.

I believe that data quality will be found in the cloud (not the current ash cloud, but to put it more plainly: on the internet). Many of the data quality issues I encounter in my daily work with clients and partners are caused by adequate information not being available at data entry – or not being exploited. But in most cases the information needed already exists somewhere in the cloud. The challenge ahead is how to integrate the information available in the cloud into business processes.

Using external reference data to ensure data quality is not new. Especially in Scandinavia, where I live, this has long been in use because of the tradition of the public sector recording data about addresses, citizens, companies and so on far more intensively than in the rest of the world. The Achilles heel, though, has always been how to smoothly integrate external data into data entry functionality and other data capture processes and, not to forget, how to ensure ongoing maintenance in order to avoid the otherwise inevitable erosion of data quality.

The drivers for increased exploitation of external data are mainly:

  • Accessibility, where the fast growing (semantic) information store in the cloud helps – not least backed up by the worldwide tendency of governments to release public sector data
  • Interoperability, where an increased supply of Service Oriented Architecture (SOA) components will pave the way
  • Cost: the more subscribers to a certain source, the lower the price – plus many sources will simply be free

As said, smooth integration into business processes is key – or, sometimes even better, orchestrating business processes in a new way so that available and affordable information from the cloud is pulled into these processes using only a minimum of costly on-premise human resources.
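As a minimal sketch of what such orchestration could look like at data entry, assume a hypothetical cloud address service; the URL, parameters and response shape below are made up for illustration and do not refer to a real service:

```python
# Minimal sketch of pulling reference data from the cloud at data entry.
# Uses the third-party 'requests' package; the endpoint is a hypothetical placeholder.
from typing import Optional

import requests

def suggest_address(raw_address: str, country: str) -> Optional[dict]:
    """Ask an external address service for a standardised version of an entered address."""
    resp = requests.get(
        "https://address-service.example.com/v1/lookup",  # hypothetical URL
        params={"q": raw_address, "country": country},
        timeout=5,
    )
    resp.raise_for_status()
    candidates = resp.json().get("candidates", [])
    best = candidates[0] if candidates else None
    # Only accept a suggestion when the service is reasonably confident
    return best if best and best.get("confidence", 0) >= 0.9 else None

suggestion = suggest_address("10 downing st, london", "GB")
if suggestion:
    print("Use standardised address:", suggestion["formatted"])
else:
    print("No confident match - queue for manual review")
```

The point is not the particular service but that the lookup happens while the web form is still open or the person is still on the phone, instead of in a downstream cleansing batch.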


Enterprise Data Mashup and Data Matching

A mashup is a web page or application that uses or combines data or functionality from two or more external sources to create a new service. Mashups can be considered to have an active role in the evolution of social software and Web 2.0. Enterprise Mashups are secure, visually rich web applications that expose actionable information from diverse internal and external information sources. So says Wikipedia.

I think that Enterprise Mashups will need data matching – and data matching will benefit from data mashups.

The joys and challenges of Enterprise Mashups were recently touched upon in the post “MDM Mashups: All the Taste with None of the Calories” by Amar Ramakrishnan of Initiate. Data needs to be cleansed and matched before being exposed in an Enterprise Mashup. An Enterprise Mashup is then a fast way to deliver Master Data Management results to the organization.

Party data matching has typically been done in two often separate contexts:

  • Matching internal data, like deduplication and consolidation
  • Matching internal data against an external source, like address correction and business directory matching

Increased utilization of multiple functions and multiple sources – like a mashup – will help make better matching. Some examples I have tried include (a small sketch follows the list):

  • If you know whether an address is unique or not, this information can be used to settle the confidence of an individual or household duplicate.
  • If you know whether an address is a single residence or a multiple residence (like a nursing home or campus), this information can be used to settle the confidence of an individual or household duplicate.
  • If you know the frequency of a name (in a given country), this information can be used to settle the confidence of a private, household or contact duplicate.
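Here is a minimal sketch of the idea; the weights and thresholds are made-up illustrations, not recommended values:

```python
# Minimal sketch: adjust the confidence of a candidate duplicate using
# mashed-up context (residence type and name frequency). Numbers are illustrative.
def duplicate_confidence(base_similarity: float,
                         single_residence: bool,
                         name_frequency: int) -> float:
    """Return an adjusted confidence that two party records represent the same person."""
    confidence = base_similarity
    # A shared address at a multiple residence (campus, nursing home) is weaker evidence
    confidence += 0.10 if single_residence else -0.10
    # A rare name is stronger evidence than a very common one
    if name_frequency < 100:
        confidence += 0.10
    elif name_frequency > 10_000:
        confidence -= 0.10
    return max(0.0, min(1.0, confidence))

# The same similarity score leads to different conclusions once context is added
print(duplicate_confidence(0.85, single_residence=True, name_frequency=12))      # 1.0 -> merge
print(duplicate_confidence(0.85, single_residence=False, name_frequency=50_000)) # ~0.65 -> review
```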

As many data quality flaws (not surprisingly) are introduced at data entry, mashups may also help during data entry, for example:

  • An address may be suggested from an external source.
  • A business entity may be picked from an external business directory.
  • Various rules exist in different countries for using consumer/citizen directories – why not use the best available where you do business?

Also the rise of social media adds new possibilities for mashup content during data entry, data maintenance and other uses of MDM / Enterprise Mashups. Like it or not, your data on Facebook, Twitter and not least LinkedIn are going to be matched and mashed up.


Which came first, the chicken or the egg?

The most common symbol for Easter, which is just around the corner in countries with Christian cultural roots, is the decorated egg. What a good occasion to have a little “which came first” discussion.

So, where do you start if you want better information quality: Data Governance or Data Quality improvement?

In order to exemplify with something that is known to nearly every business, let’s look at party master data, where we face the ever recurring question: What is a customer? Do you have to know the precise answer to that question (which looks like a Data Governance exercise) before correcting your party master data (which is often a Data Quality automation implementation)?

I think this question is closely related to the two ways of having high quality data:

  • Either they are fit for their intended uses
  • Or they correctly represent the real-world construct to which they refer

In my eyes the first way, making data fit for their intended uses, is probably the best way if you aim for information quality in one or two silos, but the second way, alignment with the real world, is the better and less cumbersome way if you aim for enterprise-wide information quality where data are fit for multiple current and future purposes.

So, starting with Data Governance and then, a long way down the line, applying some Data Quality automation like Data Profiling and Data Matching seems to be the way forward if you go for intended use.

On the other hand, if you go for real world alignment it may be best to start with some Data Profiling and Data Matching in order to understand the state of your data and make the first corrections towards having your party master data aligned with the real world. From there you go forward on an iterative, never ending Data Governance and Data Quality automation journey which includes discovering what a customer role really is.


Standardise this, standardize that

Data matching is about linking entities in databases that don’t have a common unique key and are not spelled exactly the same, but are so similar that we may consider them to represent the same real world object.

When matching we may:

  • Compare the original data rows using fuzzy logic techniques
  • Standardize the data rows and then compare using traditional exact logic (both approaches are sketched below)
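Here is a minimal sketch of both approaches on a single name field; the tiny standardisation rules and the similarity threshold are illustrative assumptions only:

```python
# Minimal sketch: standardise-then-exact versus fuzzy comparison of the originals.
import re
from difflib import SequenceMatcher

def standardise(name: str) -> str:
    """Very small normalisation: casing, punctuation and one abbreviation rule."""
    name = re.sub(r"[.,]", " ", name.lower())
    name = re.sub(r"\s+", " ", name).strip()
    return name.replace(" ltd", " limited")

a, b = "ACME Trading Ltd.", "Acme Trading Limited"

# 1) Standardize the rows, then compare with traditional exact logic
print(standardise(a) == standardise(b))  # True

# 2) Compare the original rows with a fuzzy similarity measure
print(SequenceMatcher(None, a.lower(), b.lower()).ratio() > 0.8)  # True (ratio is about 0.86)
```

Both roads lead to a match here, but the standardisation road only works when the rules cover the variation at hand, which is exactly where international data gets difficult.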

As suggested in the title of this blog post, a common problem with standardization is that it may have two (or more) outcomes, just as this English word may be spelled in different ways depending on the culture.

Not least when working with international data do you feel this pain. In my recent social media engagements I have had the pleasure of touching on this subject (mostly in relation to party master data) on several occasions, including:

  • In a comment to a recent post on this blog, Graham Rhind says: “Based just on the type of element and their positions in an address, there are at least 131 address formats covering the whole world, and around 40 personal name formats (I’m discovering more on an almost daily basis).”
  • Rich Murnane made a post with a fantastic video of Derek Sivers explaining that while in many parts of the world we have named streets with building numbers assigned according to sequential position, in Japan you have named blocks between unnamed streets with building numbers assigned according to the order in which they were established.
  • In the Data Matching LinkedIn group Olga Maydanchik and I exchanged experiences about the problem that in the American date format you write the month before the day, while in the European date format you write the day before the month (see the sketch below).
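The date issue is easy to show with a small sketch: the same string yields two different dates depending on which standard is assumed, so the provenance of the record matters when matching:

```python
# Minimal sketch: the same raw value read with an American and a European date format.
from datetime import datetime

raw = "04/05/2010"
american = datetime.strptime(raw, "%m/%d/%Y")  # April 5th, 2010
european = datetime.strptime(raw, "%d/%m/%Y")  # May 4th, 2010
print(american.date(), european.date())        # 2010-04-05 2010-05-04
```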

In my work with international data I have often seen that determining which standard is used depends on both:

  • The culture of the real world entity that the data represents
  • The culture of the person (organisation) that provided the data

So, the possible combination of standards applied to a given data set derives from where the data is, what elements it contains and who entered the data (information which is often not carried along).

This is why I like to use both standardisation and standardization and fuzzy logic when selecting candidates and assigning similarity in data matching.


Unpredictable Inaccuracy

Let’s look at some statements:

• Business Intelligence and Data Mining are based on looking into historical data in order to make better decisions for the future.

• Some of the best results from Business Intelligence and Data Mining are achieved when looking at data in different ways than done before.

• It’s a well known fact that Business Intelligence and Data Mining are very much dependent on the quality of the (historical) data.

• We all agree that you should not start improving data quality (like anything else) without a solid business case.

• Upstream prevention of poor data quality is superior to downstream data cleansing.

Unfortunately the wise statements above have some serious interrelated timing issues:

• The business case can’t be established before we start to look at the data in a different way.

• Data is already stored downstream when that happens.

• Anyway, we don’t know precisely what data quality issues we have in that context before trying out new possible ways of looking at the data.

Solutions to these timing issues may be:

• Always try to have the data reflect the real world objects they represent as closely as possible – or at least include data elements that make enrichment from external sources possible.

• Accept that downstream data cleansing will be needed from time to time and be sure to have the necessary instruments for that.


Data Quality Tools Revealed

To be honest: Data Quality tools today only solve a very few of the data quality problems you have. On the other hand, the few problems they do solve may be solved very well and cannot be solved by any other line of products, or in any practical way by humans at any quantity or quality.

Data Quality tools mainly support you with automation of:

• Data Profiling and
• Data Matching

Data Profiling

Data profiling is the ability to generate statistical summaries and frequency distributions for the unique values and formats found within the fields of your data sources, in order to measure data quality and find critical areas that may harm your business. For more on the subject I recommend reading the introduction provided by Jim Harris in his post “Getting Your Data Freq On”, which is followed up by the series of posts “Adventures in Data Profiling part 1 – 8”.
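As a minimal sketch of what such frequency distributions look like, here is an illustrative example on a made-up phone number column; it is not taken from any particular tool:

```python
# Minimal sketch: a value frequency distribution and a format pattern distribution
# for one column. The sample values are made up for illustration.
import re
from collections import Counter

phone_numbers = ["555-0101", "555-0199", "5550123", "555-0101", None, ""]

def format_pattern(value):
    """Map digits to 9 and letters to A, keeping other characters, to reveal formats."""
    if not value:
        return "<missing>"
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

value_freq = Counter(phone_numbers)                       # spot dominant and odd values
format_freq = Counter(format_pattern(v) for v in phone_numbers)

print(value_freq.most_common(3))  # [('555-0101', 2), ('555-0199', 1), ('5550123', 1)]
print(format_freq)                # Counter({'999-9999': 3, '<missing>': 2, '9999999': 1})
```

A dominant format with a tail of odd patterns and missing values is exactly the kind of critical area a profiling exercise is meant to expose.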

Saying that you can’t use other product lines for data profiling is actually only partly true. You may come a long way by using features in popular database managers, as demonstrated in Rich Murnane’s blog post “A very inexpensive way to profile a string field in Oracle”. But for full automation and a full set of out-of-the-box functionality a data profiling tool will be necessary.

The data profiling tool market landscape is – unlike that of data matching – also characterized by the existence of open source tools. Talend is the leading one; another is DataCleaner, created by my fellow countryman Kasper Sørensen.

I take the emergence of open source solutions in the realm of data profiling as a sign that this is the technically easiest part of data quality tool invention.

Data Matching

Data matching is the ability to compare records that are not exactly the same but are so similar that we may conclude that they represent the same real world object.

Here too, some popular database managers have functionality like the fuzzy grouping and lookup in MS SQL Server. But in order to really automate data matching processes you need a dedicated tool equipped with advanced algorithms and comprehensive functionality for candidate selection, similarity assignment and survivorship settlement.
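A minimal sketch of those three steps on a few made-up party records; the blocking key, the similarity measure and the survivorship rule are illustrative assumptions, not a description of any specific product:

```python
# Minimal sketch of candidate selection, similarity assignment and survivorship.
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "name": "John Smith", "postcode": "2100", "updated": "2010-03-01"},
    {"id": 2, "name": "Jon Smith", "postcode": "2100", "updated": "2010-04-15"},
    {"id": 3, "name": "Jane Miller", "postcode": "8000", "updated": "2009-12-24"},
]

# 1) Candidate selection: only compare records sharing a cheap blocking key
blocks = defaultdict(list)
for r in records:
    blocks[(r["postcode"], r["name"][0].lower())].append(r)

# 2) Similarity assignment: fuzzy name comparison within each block
matches = []
for block in blocks.values():
    for a, b in combinations(block, 2):
        score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
        if score > 0.85:  # illustrative threshold
            matches.append((a, b, score))

# 3) Survivorship: keep the most recently updated record of each matched pair
for a, b, score in matches:
    survivor = max((a, b), key=lambda r: r["updated"])
    print(f"Merge {a['id']} and {b['id']} (score {score:.2f}), survivor: {survivor['id']}")
```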

Data matching tools are essential for processing large numbers of data rows within a short timeframe, for example when purging duplicates before marketing campaigns or merging duplicates in migration projects.

Matching technology is increasingly being implemented as what is often described as a firewall, where possible new entries are compared to existing rows in the database as upstream prevention against duplication.

Besides handling duplicates, matching techniques are used for correcting postal addresses against official postal references, for matching data sets against reference databases like B2B and B2C party data directories, and for matching with product data systems, all in order to enrich with and maintain more accurate and timely data.

Automation of matching is in no way straightforward, and solutions are constantly met with the balancing act of producing a sufficient number of true positives without creating too many false positives.
