Sticky Data Quality Flaws

The fight against data quality flaws is most successfully fought at data entry. Once incorrect information has been entered into a system, it often seems nearly impossible to eliminate the falsehood.

A hilarious example is told in an article from telegraph.co.uk. A local council sent a letter to a woman’s pet pig (named Blossom Grant) offering the animal the chance to register to vote in last week’s UK election. This was only the culmination of a stream of letters – including plenty of direct marketing – addressed to the pigsty. According to the article, the pigsty was wrongly registered as a residence some years ago after a renovation. Since then the pig’s owner (Pauline Grant) has tried over and over again to get the error corrected – but with no success.


Big Time ROI in Identity Resolution

Yesterday I had the chance to make a preliminary assessment of the data quality in one of the local databases holding information about entities involved in carbon trading activities. It is believed that up to 90 percent of the market activity may have been fraudulent, with criminals pocketing 5 billion euros. There is a description of the scam here from telegraph.co.uk.

Most of my work with data matching is aimed at finding duplicates. In doing this you must avoid so-called false positives, so you don’t end up merging information about two different real world entities. But when doing identity resolution for other reasons, including preventing fraud and scams, you may be interested in finding connections between entities that are not supposed to be connected at all.

The result of making such connections in the carbon trade database was quite astonishing. Here is an example where I have changed the names, addresses, e-mails and phone numbers, but such a pattern was found in several cases:

Here we have an example of a group of entities where the name, address, e-mail or phone is shared in a way that doesn’t seem natural.
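To make the idea concrete, here is a minimal sketch of how such unnatural connections could be surfaced: group the records by each attribute and flag values shared by more than one entity. The records and field names below are invented for illustration only.

```python
from collections import defaultdict

# Hypothetical, anonymised records - the field names are assumptions for illustration.
entities = [
    {"id": 1, "name": "Alpha Trading Ltd", "email": "info@alpha.example", "phone": "+44 20 1234 5678"},
    {"id": 2, "name": "Beta Carbon GmbH",  "email": "info@alpha.example", "phone": "+49 30 8765 4321"},
    {"id": 3, "name": "Gamma Credits BV",  "email": "gamma@mail.example", "phone": "+44 20 1234 5678"},
]

def shared_attribute_groups(records, attribute):
    """Group record ids by one attribute and keep values shared by two or more entities."""
    groups = defaultdict(list)
    for record in records:
        value = (record.get(attribute) or "").strip().lower()
        if value:
            groups[value].append(record["id"])
    return {value: ids for value, ids in groups.items() if len(ids) > 1}

for attribute in ("email", "phone"):
    for value, ids in shared_attribute_groups(entities, attribute).items():
        print(f"Entities {ids} share the same {attribute}: {value}")
```

A real identity resolution exercise would of course also use fuzzy comparison of names and addresses, not just exact equality on normalised values.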

My involvement in the carbon trade scam was initiated by a blog post yesterday from my colleague Jan Erik Ingvaldsen, based on the story that journalists, merely by browsing the database, had found addresses that simply don’t exist.

So the question is whether the authorities might have avoided losing 5 billion taxpayer euros if some identity resolution, including automated fuzzy connection checks and real world checks, had been implemented. I know we are all so much more enlightened about what could have been done once a scam is discovered, but I actually think there may be a lot of other billions of euros (pounds, dollars, rupees) out there that could be saved by doing some decent identity resolution.


Who is Responsible for Data Quality?

No, I am not going to continue some of the recent fine debates on who within a given company is the data owner, accountable and responsible for data quality.

My point today is that many views on data ownership, the importance of upstream prevention and fitness for the purpose of use in a business context are based on an assumption that the data in a given company is entered by that company, maintained by that company and consumed by that company.

In today’s business world this is not true in many cases.

Examples:

Direct marketing campaigns

Running a direct marketing campaign and sending out catalogues is often an eye opener regarding the quality of data in your customer and prospect master files. But such things are very often outsourced.

Your company extracts a file with, say, 100,000 names and addresses from your databases, and you pay a professional service provider a fee for each row for doing the rest of the job.

Now the service provider could do you the kind favour of carefully deduplicating the file, eliminating the 5,000 purge candidates and bringing you the pleasant message that the bill will be reduced by 5%.

Yes, I know, some service providers actually include deduplication in their offerings. And yes, I know, they are not always that interested in using an advanced solution for it.

I see the business context here – but unfortunately it’s not your business.

Factoring

Sending out invoices is often a good test of how well customer master data is entered and maintained. But again, using an outsourced service for this, such as factoring, is becoming more common.

Your company hands over the name and address, receives most of the money, and the data is out of sight.

Now the factoring service provider has a strong interest in assuring the quality of the data and aligning it with a real world entity.

Unfortunately this cannot be done upstream; it’s a downstream batch process, probably with no signalling back to the source.

Customer self service

Today data entry clerks are rapidly being replaced as customers do all the work themselves on the internet. Maybe the form is provided by you; maybe – as is often the case with hotel reservations – the form is provided by a service provider.

So here you basically either have to extend your data governance all the way into your customer’s living room or office, or to some degree (fortunately?) accept that the customer owns the data.


Unpredictable Inaccuracy

Let’s look at some statements:

• Business Intelligence and Data Mining are based on looking into historical data in order to make better decisions for the future.

• Some of the best results from Business Intelligence and Data Mining are achieved when looking at data in ways different from those used before.

• It’s a well-known fact that Business Intelligence and Data Mining are very much dependent on the quality of the (historical) data.

• We all agree that you should not start improving data quality (like anything else) without a solid business case.

• Upstream prevention of poor data quality is superior to downstream data cleansing.

Unfortunately the wise statements above have some serious interrelated timing issues:

• The business case can’t be established before we start to look at the data in a different way.

• Data is already stored downstream when that happens.

• Anyway, we don’t know precisely what data quality issues we have in that context before trying out possible new ways of looking at the data.

Solutions to these timing issues may be:

• Always try to have the data reflect the real world objects they represent as closely as possible – or at least include data elements that make enrichment from external sources possible.

• Accept that downstream data cleansing will be needed from time to time and be sure to have the necessary instruments for that.


Data Quality Tools Revealed

To be honest: Data Quality tools today solve only a very few of the data quality problems you have. On the other hand, the few problems they do solve may be solved very well, and they cannot be solved by any other line of products, nor in any practical way by humans at any quantity or quality.

Data Quality tools mainly support you with automation of:

• Data Profiling and
• Data Matching

Data Profiling

Data profiling is the ability to generate statistical summaries and frequency distributions for the unique values and formats found within the fields of your data sources in order to measure data quality and find critical areas that may harm your business. For more on the subject I recommend reading the introduction provided by Jim Harris in his post “Getting Your Data Freq On”, which is followed up by the series of posts “Adventures in Data Profiling part 1 – 8”.

Saying that you can’t use other product lines for data profiling is actually only partly true. You may come a long way by using features in popular database managers, as demonstrated in Rich Murnane’s blog post “A very inexpensive way to profile a string field in Oracle”. But for full automation and a full set of out-of-the-box functionality a data profiling tool will be necessary.
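As a rough illustration of what a value and format profile looks like, here is a minimal sketch in plain Python. The sample values are invented; a real tool would run against whole database columns and offer far more out-of-the-box functionality.

```python
from collections import Counter

# Sample values - in practice these would be read from a database column.
values = ["DK-2100", "2100", "DK-2100", None, "21OO", "2200", ""]

filled = [v for v in values if v not in (None, "")]
value_counts = Counter(filled)
# Map digits to 9 and letters to A to get a format signature per value.
format_counts = Counter(
    "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in v) for v in filled
)

print(f"Fill rate: {len(filled)} of {len(values)}")
print("Most frequent values:", value_counts.most_common(3))
print("Represented formats:", format_counts.most_common())
print("Min/max length:", min(len(v) for v in filled), max(len(v) for v in filled))
```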

The data profiling tool market – in contrast to that of data matching – is also characterized by the existence of open source tools. Talend is the leading one of those; another is DataCleaner, created by my fellow countryman Kasper Sørensen.

I take the emergence of open source solutions in the realm of data profiling as a sign that this is the technically easiest part of data quality tool invention.

Data Matching

Data matching is the ability to compare records that are not exactly the same but are so similar that we may conclude that they represent the same real world object.

Here, too, some popular database managers today have some functionality, like the fuzzy grouping and fuzzy lookup in MS SQL Server. But in order to really automate data matching processes you need a dedicated tool equipped with advanced algorithms and comprehensive functionality for candidate selection, similarity assignment and survivorship settlement.
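A very stripped-down sketch of the first two of those steps might look like the following, using Python’s built-in SequenceMatcher as a stand-in for a real similarity algorithm. The records, the blocking key and the threshold are assumptions for illustration; survivorship settlement is left out.

```python
from difflib import SequenceMatcher

records = [
    {"id": 1, "name": "Jim Harris",      "city": "Boston"},
    {"id": 2, "name": "J. Harris",       "city": "Boston"},
    {"id": 3, "name": "Kasper Sorensen", "city": "Copenhagen"},
]

def blocking_key(record):
    """Candidate selection: only compare records sharing a cheap key (here: city)."""
    return record["city"].lower()

def similarity(a, b):
    """Similarity assignment: a simple edit-distance ratio stands in for a real algorithm."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

THRESHOLD = 0.7
for i, a in enumerate(records):
    for b in records[i + 1:]:
        if blocking_key(a) == blocking_key(b) and similarity(a, b) >= THRESHOLD:
            print(f"Possible duplicate: {a['id']} and {b['id']} ({similarity(a, b):.2f})")
```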

Data matching tools are essential for processing large numbers of data rows within a short timeframe, for example when purging duplicates before marketing campaigns or merging duplicates in migration projects.

Matching technology is increasingly being implemented as what is often described as a firewall, where possible new entries are compared to existing rows in the database as upstream prevention against duplication.
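Such a firewall can be sketched as a check performed at entry time, before the new row is inserted. The names and the threshold below are purely illustrative, and a real firewall would use stronger matching than this simple ratio.

```python
from difflib import SequenceMatcher

existing_customers = ["Pauline Grant", "Jan Erik Ingvaldsen"]

def passes_duplicate_firewall(new_name, existing, threshold=0.85):
    """Flag a new entry that closely matches an existing row before it is inserted."""
    for name in existing:
        if SequenceMatcher(None, new_name.lower(), name.lower()).ratio() >= threshold:
            return False, name
    return True, None

ok, hit = passes_duplicate_firewall("Paulene Grant", existing_customers)
print("Accepted" if ok else f"Possible duplicate of existing customer: {hit}")
```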

Besides handling duplicates, matching techniques are used for correcting postal addresses against official postal references, matching data sets against reference databases like B2B and B2C party data directories, and matching with product data systems – all in order to be able to enrich with and maintain more accurate and timely data.

Automation of matching is in no way straightforward, and solutions are constantly met with the balancing act of producing a sufficient number of true positives without creating too many false positives.


2010 predictions

Today this blog has been live for half a year, Christmas is just around the corner in countries with Christian cultural roots, and a new year – even a new decade – is closing in according to the Gregorian calendar.

It’s time for my 2010 predictions.

Football

Over at the Informatica blog, Chris Boorman and Joe McKendrick are discussing who’s going to win next year’s largest sporting event: the football (soccer) World Cup. I don’t think England, the USA, Germany (or my team Denmark) will make it. Brazil will take a co-favourite victory – and home team South Africa will go to the semi-finals.

Climate

Brazil and South Africa also had leading roles at the recent Climate Summit in my hometown, Copenhagen. Despite heavy executive buy-in, only a very weak deal with no operational Key Performance Indicators was reached there. Money was on the table – but assigned to reactive approaches.

Our hope for avoiding climate catastrophes now rests on national responsibility and technological improvements.

Data Quality

A reactive approach, a lack of enterprise-wide responsibility and a reliance on technological improvements are also well-known circumstances in the realm of data quality.

I think we will have to deal with this next year too. We have to get better at working under these conditions. That means being able to perform reactive projects faster and better while also implementing prevention upstream. Aligning people, processes and technology is as key as ever in doing that.

Some areas where we will see improvements will, in my eyes, be:

  • Exploiting rich external reference data
  • International capabilities
  • Service orientation
  • Small business support
  • Human-like technology

The page Data Quality 2.0 has more content on these topics.

Merry Christmas and a Happy New Year.


Phony Phones and Real Numbers

There are plenty of data quality issues related to phone numbers in party master data. Despite the fact that a phone number should be far less fuzzy than names and addresses, I have spent lots of time having fun with these calling digits.

Challenges include:

  • Completeness – Missing values
  • Precision – Inclusion of country codes, area codes, extensions
  • Reliability – Real world alignment, pseudo numbers: 1234.., 555…
  • Timeliness – Outdated and converted numbers
  • Conformity – Formatting of numbers
  • Uniqueness – Handling shared numbers and multiple numbers per party entity

You may work on improving phone number quality with these approaches:

Profiling:

Here you establish some basic ideas about the quality of a current population of phone numbers. You may look at the following (a small profiling sketch follows the list):

  • Count of filled values
  • Minimum and maximum lengths
  • Represented formats – best inspected per country if international data
  • Minimum and maximum values – highlighting invalid numbers
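A minimal sketch covering the bullets above could look like this. The sample values are invented, and a real exercise would run against the full column, preferably split per country.

```python
from collections import Counter

# Sample phone values - in practice a column from the party master data.
phones = ["+45 33 12 34 56", "33123456", "", None, "12345678", "(0045) 33 12 34 56", "555-0100"]

filled = [p for p in phones if p not in (None, "")]
digits_only = ["".join(c for c in p if c.isdigit()) for p in filled]
# Map digits to 9 and keep other characters to get a format signature per value.
formats = Counter("".join("9" if c.isdigit() else c for c in p) for p in filled)

print(f"Filled values: {len(filled)} of {len(phones)}")
print("Min/max length (digits):", min(map(len, digits_only)), max(map(len, digits_only)))
print("Represented formats:", formats.most_common())
print("Min/max value:", min(digits_only), max(digits_only))
```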

Validation:

National numbering plans can be used as a basis for a next-level check of reliability – both in batch cleansing of a current population and in upstream prevention for new entries. Here numbers not conforming to valid lengths and ranges can be flagged.

You may also make a classification telling whether a number is a fixed-line number or a mobile number – but the boundaries are not totally clear in many cases.

In many countries a fixed-line number includes an area code that indicates the place.
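One way to sketch such validation and classification is to lean on the open source phonenumbers package (a Python port of Google’s libphonenumber). Treating that dependency and the default region as assumptions, a check could look like this:

```python
# Assumed dependency: the open source `phonenumbers` package.
import phonenumbers
from phonenumbers import NumberParseException, PhoneNumberType, geocoder

def validate_phone(raw, default_region="DK"):
    """Check a raw phone value against the national numbering plan and classify it."""
    try:
        number = phonenumbers.parse(raw, default_region)
    except NumberParseException:
        return {"input": raw, "valid": False}
    kind = phonenumbers.number_type(number)
    return {
        "input": raw,
        "valid": phonenumbers.is_valid_number(number),
        "type": {PhoneNumberType.MOBILE: "mobile", PhoneNumberType.FIXED_LINE: "fixed"}.get(kind, "other/unclear"),
        "area": geocoder.description_for_number(number, "en"),  # place hint, mainly for fixed-line numbers
    }

for raw in ("+45 33 12 34 56", "12345", "555-1234"):
    print(validate_phone(raw))
```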

Match and enrichment:

Names and addresses related to missing and invalid phone numbers may be matched with phone books and other directories containing phone numbers, thereby enriching your data and improving completeness.
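A minimal sketch of such enrichment, using an exact lookup on normalised name and address against an invented directory; in practice the match would be fuzzy, as described earlier.

```python
# Invented directory content and customer records, for illustration only.
directory = {
    ("jane example", "main street 1, sampletown"): "+45 33 12 34 56",
}

customers = [
    {"name": "Jane Example", "address": "Main Street 1, Sampletown", "phone": None},
    {"name": "John Example", "address": "Other Road 2, Sampletown", "phone": None},
]

def lookup_key(name, address):
    """Normalise name and address into a simple lookup key."""
    return (name.strip().lower(), address.strip().lower())

for customer in customers:
    if not customer["phone"]:
        customer["phone"] = directory.get(lookup_key(customer["name"], customer["address"]))

print(customers)
```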

Reality check:

Then you may of course call the number and confirm whether you are reaching the right person (or organization). I have, though, never been involved in such an activity, nor been called by someone only asking if I am who I am.

Ongoing Data Maintenance

Getting data entry right at the root is important, and most (if not all) data quality professionals agree that this is a superior approach compared to doing cleansing operations downstream.

The problem, though, is that most data erodes as time passes. What was right at the time of capture will at some point no longer be right.

Therefore data entry ideally must not only be a snapshot of correct information but should also include raw data elements that make the data easily maintainable.

An obvious example: if I tell you that I am 49 years old, that may be just the piece of information you needed to complete a business process. But if you ask me for my birth date, you will also have the age information after a bit of calculation; plus, based on that raw data, you will know when I turn 50 (all too soon), and your organization will know my age if we should do business again later.
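The calculation is trivial once the raw birth date is stored. A small sketch, with an invented birth date:

```python
from datetime import date

def age_on(birth_date, on_date):
    """Derive age from the stored raw birth date rather than storing a snapshot age."""
    return on_date.year - birth_date.year - ((on_date.month, on_date.day) < (birth_date.month, birth_date.day))

birth = date(1960, 5, 17)  # illustrative date only
print("Age today:", age_on(birth, date.today()))
print("Turns 50 on:", date(birth.year + 50, birth.month, birth.day))
```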

Birth dates are stable personal data. Gender pretty much is too. But most other data changes over time. Names change in many cultures in case of marriage and maybe divorce, and people may change names when discovering bad numerology. People move, or a street name may be changed.

There are a great many privacy concerns around identifying individual persons, and the norms differ between countries. In Scandinavia we are used to being identified by our unique citizen ID, though even here within debatable limitations. But solutions are offered for maintaining raw data that will provide valid and timely B2C information at whatever precision is asked for, when needed.

Identifying a business entity, on the other hand, is broadly accepted everywhere. Public sector registrations are a basic source of identifying IDs, with varying uniqueness and completeness around the world. Private providers have developed proprietary ID systems like the DUNS Number from D&B. All in all, such solutions are good sources for ongoing maintenance of your B2B master data assets.

Addresses belonging to business or consumer/citizen entities – or just being addresses – are available as external reference data covering more and more spots on Earth. Ongoing development in open government data helps with availability and completeness, and these data are often deployed in the cloud. Right now it is mostly about visual presentation on maps, but no doubt more services will follow.

Getting data right at entry and being able to maintain the real world alignment is the challenge if you don’t look at your data asset as a throw-away commodity.

Figure 1: one year old prime information

PS: If you forgot to maintain your data: before dumping it, Data Cleansing might be a sustainable alternative.


Upstream prevention by error tolerant search

Fuzzy matching techniques were originally developed for batch processing in order to find duplicates and to consolidate database rows that have no unique identifiers aligning them with the real world.

These processes have traditionally been implemented for downstream data cleansing.

As we know that upstream prevention is much more effective than tidying up downstream, real time data entry checking is becoming more common.

But we are able to go further upstream by introducing error tolerant search capabilities.

A common workflow when in-house personnel are entering new customers, suppliers, purchased products and other master data is that you first search the database for a match. If the entity is not found, you create a new entity. When the search fails to find an actual match, we have a classic and frequent cause of either introducing duplicates or challenging the real time checking.

An error tolerant search is able to find matches despite spelling differences, alternatively arranged words, various concatenations and many other challenges we face when searching for names, addresses and descriptions.
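A toy sketch of such an error tolerant search, normalising word order and case before a fuzzy comparison. The names and the threshold are invented, and a real solution would use far stronger algorithms.

```python
from difflib import SequenceMatcher

master_data = ["Pauline Grant", "Jan Erik Ingvaldsen", "Kasper Sorensen"]

def normalise(text):
    """Lower-case and sort the words so rearranged or differently punctuated names still compare well."""
    return " ".join(sorted(text.lower().replace(",", " ").split()))

def error_tolerant_search(query, candidates, threshold=0.6):
    """Return candidates whose normalised form is fuzzily similar to the query, best first."""
    scored = [(SequenceMatcher(None, normalise(query), normalise(c)).ratio(), c) for c in candidates]
    return [c for score, c in sorted(scored, reverse=True) if score >= threshold]

print(error_tolerant_search("Grant, Paulene", master_data))
```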

Implementation of such features may be as embedded functionality in CRM and ERP systems or, to use my favourite term, as SOA components. So besides classic data quality elements for monitoring and checking, we can add error tolerant search to the component catalogue needed for a good MDM solution.


Master Data meets the Customer

In the old days Master Data was predominantly created, maintained and used by the staff in the organisation holding the data. In many cases this is not the fact anymore. Besides exchanging data with partners when doing business, today the customer – and the prospect – has become an important person to be considered when doing Data Governance and implementing technology around Master Data.

In the online world the customer works with your Master Data when:

  • The customer creates and maintains name, address and communication information by using registration functions
  • The customer searches for and reads product information on web shops and information sites

Having prospects and customers help with the name and address (party) data is apparently great news for lowering costs in the organisation. But in the long run you have got yourself another data silo, and your Data Quality issues have become yet more challenging.

The first thing to do is to optimise your registration forms. An important thing to consider here is that online is worldwide (unless you restrict your site to visitors from a single country). When doing business online with multinational customers, take care that the sequence, formats and labels are useful to everyone and that mandatory checks and other validations are in line with the rules for the country in question.

External reference data may be used for lookup and validation integrated into the registration forms.
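As an illustration of country-dependent validation, here is a minimal sketch checking postal code formats per country. The patterns and the country list are simplified assumptions; a real form would draw on proper reference data.

```python
import re

# Illustrative postal code patterns per country - a real form would cover far more
# countries and take the rules from a reference data source.
POSTAL_CODE_PATTERNS = {
    "DK": r"^\d{4}$",
    "GB": r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$",
    "US": r"^\d{5}(-\d{4})?$",
}

def validate_postal_code(country, postal_code):
    """Return True if the postal code matches the (simplified) format for the country."""
    pattern = POSTAL_CODE_PATTERNS.get(country)
    if pattern is None:
        return True  # unknown country: do not block the registration
    return re.match(pattern, postal_code.strip().upper()) is not None

print(validate_postal_code("DK", "21OO"))      # False - letter O instead of zero
print(validate_postal_code("GB", "SW1A 1AA"))  # True
```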

The concept of “one version of the truth” is a core element in most Master Data Management solutions. Doing deduplication within online registration has privacy considerations. When asking for personal data you can’t prompt “possible duplicate found” and then present the data about someone else. Here you need more than one data quality firewall.

Many organisations are not just either offline or online but operate in both worlds. To maintain the 360 degree view of the customer in this situation you need strong data matching techniques capable of working with both offline and online captured data. As the business case for online registration is very much about reducing staff involvement, this is about using technology and keeping human interaction to a minimum.

When a prospect comes to your site and tries to find information about your products, the first thing they do is very often to use the search function. From deduplication of names and addresses we know that spelling is difficult and that we sometimes use synonyms other than those used in the Master Data descriptions. Add to that the multicultural aspect. The solution here is to use the same fuzzy search techniques that we use for data matching. This is a kind of reuse. I like that.
