What is Data Quality anyway?

The above question might seem a bit belated after I have been blogging about the subject for 9 months now. But from time to time I ask myself questions like:

Is Data Quality an independent discipline? If it is, will it continue to be that?

Data Quality actually is (or should be) a part of a lot of other disciplines.

Data Governance as a discipline is probably the best place to include general data quality skills and methodology – that is, all the people and process sides of data quality practice. Data Governance is an emerging discipline with an evolving definition, says Wikipedia. I think there is a pretty good chance that data quality management as a discipline will increasingly be regarded as a core component of data governance.

Master Data Management is a lot about Data Quality, but MDM could be dead already – just like SOA. In short: I think MDM and SOA will survive, getting new life from the semantic web and all the data resources in the cloud. For that, MDM and SOA need Data Quality components. Data Quality 3.0 it is.

You may then replace MDM with CRM, SCM, ERP and so on, and hereby extend the use of Data Quality components from dealing not only with master data but also with transaction data.

Next questions: Are Data Quality tools an independent technology? If they are, will they continue to be that?

It’s clear that Data Quality technology is moving from stand-alone batch processing environments, over embedded modules, to, oh yes, SOA components.

If we look at what data quality tools today actually do, they mostly support you with automation of data profiling and data matching, which probably covers only some of the data quality challenges you have.
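As a minimal sketch of what the data profiling part automates, here are simple completeness and uniqueness statistics per column in Python. The records and column names are made up for the example:

```python
# Hypothetical party master data records used for illustration only.
records = [
    {"name": "Acme Corp", "phone": "5551234"},
    {"name": "Beta Ltd",  "phone": None},
    {"name": "Acme Corp", "phone": "5559876"},
]

def profile(rows, column):
    """Report fill rate (completeness) and distinct count (uniqueness)."""
    values = [row[column] for row in rows]
    filled = [v for v in values if v is not None]
    return {
        "fill_rate": len(filled) / len(values),
        "distinct": len(set(filled)),
    }
```

A real profiling tool adds pattern analysis, frequency distributions and more, but the principle – counting what is there against what should be there – is the same.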

In recent years there has been a lot of consolidation in the market around Data Integration, Master Data Management and Data Quality, which certainly tells that the market needs Data Quality technology as components in a bigger scheme along with other capabilities.

But some new pure-play Data Quality vendors have also been established – and I think I often see old folks from the acquired entities at these new challengers. So independent Data Quality technology is not dead and doesn’t seem to want to be.


Grandpa’s Story

Now that I have become a grandfather, it’s time for a blog post about lessons learned in life.

One of my favourite authors as a young man was Cyril Northcote Parkinson, the grandfather of the famous Parkinson’s Law, which says:

Work expands so as to fill the time available for its completion.

Early in my career I learned how true this is. My first experience was, like the statistics behind Parkinson’s Law, from within public administration, but later I learned that private enterprises are just the same.

My first real job after graduation was at the Danish Tax Authorities. After having worked there a few years, I was assigned to a mission to assist the Faroe Islands Financial Authorities in developing a modernised tax collection solution.

The Faroe Islands

For those readers who hate old people not sticking to the subject – please continue to the next headline.

For those readers who don’t have a clue about where on earth the Faroe Islands are: Well, 1,000 years ago the Vikings sailed out from Scandinavia and finally made it to say hello to the Native Americans – 500 years before Columbus. When doing that they used islands in the Northern Atlantic as stepping stones: first the British Isles, then the Faroe Islands, Iceland, Greenland and finally Newfoundland on the American coast.

Just like Columbus found America by mistake, as he was actually looking for India, the Vikings probably also found America and the stepping stones by mistake when getting lost on the ocean during storms.

1/100

Back on track. The mission for the Faroe Islands Authorities I joined in the early 1980s seemed impossible. As the Faroese population is only 1/100 of the population of continental Denmark, there were of course only 1/100 of the resources available for making a solution doing exactly the same as the solution built for continental Denmark.

But what I learned was that the solution actually was built using only those resources and in a surprisingly short time (and with minimal help from me and my colleagues).

During my career I have worked in both modest-sized and large organisations, and I have noticed numerous examples of how exactly the same task may consume resources sized not by the nature of the task but by the size of the organisation.

People and technology

Maybe this observation is an explanation of the ever-recurring subject of whether people or technology are most important in projects like improving data quality. If the technology part is (close to) constant but the overall resource consumption grows with the size of the organisation in question, well, then the people part becomes more and more important with the size of the organisation.

Tool making

I have tried single-handedly to build a data quality tool – or to be more specific, a data matching tool. On several occasions it has been benchmarked against products residing as leaders in the Gartner Magic Quadrant for data quality tools, and it didn’t come up short. Some of the features included in the product, called SuperMatch, are described in the post “When computer says maybe”.

It’s my impression that if you look at tool vendors with many employees, it’s only a very small group of people who are actually working on the tool.

Happy Pi Day

Today, March 14 (or 3-14 when writing a date with the month before the day), is Pi Day.

Sometimes I use the analogy of squaring the circle when talking about how to get data quality right. In real life we have to make an approximate construction, similar to when we square the circle – sometimes 22/7 or 3.14 instead of pi with 15,000 decimals is OK, and anyway the exact fix is proven impossible.

Striving for the ultimate 360° view was discussed in a post here on the blog last year called “360° Business Partner View”.

Who is Responsible for Data Quality?

No, I am not going to continue some of the recent fine debates on who within a given company is data owner, accountable and responsible for data quality.

My point today is that many views on data ownership, the importance of upstream prevention and fitness for purpose of use in a business context are based on an assumption that the data in a given company is entered by that company, maintained by that company and consumed by that company.

In today’s business world this is in many cases not true.

Examples:

Direct marketing campaigns

Making a direct marketing campaign and sending out catalogues is often an eye-opener regarding the quality of data in your customer and prospect master files. But such tasks are very often outsourced.

Your company extracts a file with, say, 100,000 names and addresses from your databases, and you pay a professional service provider a fee for each row for doing the rest of the job.

Now the service provider could do you the kind favour of carefully deduplicating the file, eliminating the 5,000 purge candidates and bringing you the pleasant message that the bill will be reduced by 5%.

Yes, I know, some service providers actually include deduplication in their offerings. And yes, I know, they are not always that interested in using an advanced solution for that.

I see the business context here – but unfortunately it’s not your business.

Factoring

Sending out invoices is often a good test of how well customer master data is entered and maintained. But again, using an outsourced service for that, like factoring, is becoming more common.

Your company hands over the name and address, receives most of the money, and the data is out of sight.

Now the factoring service provider has a pretty good interest in assuring the quality of the data and aligning the data with a real-world entity.

Unfortunately this cannot be done upstream; it’s a downstream batch process, probably with no signalling back to the source.

Customer self service

Today data entry clerks are rapidly being replaced, as customers are doing all the work themselves on the internet. Maybe the form is provided by you; maybe – as often with hotel reservations – the form is provided by a service provider.

So here you basically either have to extend your data governance all the way into your customer’s living room or office, or to some degree (fortunately?) accept that the customer owns the data.


When computer says maybe

When matching customer master data in order to find duplicates, or to find corresponding real-world entities in a business directory or a consumer directory, you may use the data quality kind of deduplication tool to do the hard work.

The tool will typically – depending on the capabilities of the tool and the nature of and purpose for the data – find:

A: The positive automated matches. Ideally you will take samples for manual inspection.

B: The dubious part, selected for manual inspection.

C: The negative automated matches.

Humans are costly resources. Therefore the manual inspection of the B pot (and the A sample) may be supported by a user interface that helps get the job done fast but accurately.

I have worked with the following features for such functionality:

  • Random sampling for quality assurance – both from the A pot and from the manually settled matches in the B pot
  • Check-out and check-in for multiuser environments
  • Presenting a ranked range of computer selected candidates
  • Color coding elements in matched candidates – like:
    • green for (near) exact name,
    • blue for a close name and
    • red for a far from similar name
  • Possibility for marking:
    • as a manual positive match,
    • as a manual negative match (with reason) or
    • as questionable for later or supervisor inspection (with comments)
  • Entering a match found by other methods
  • Removing one or several members from a duplicate group
  • Splitting a duplicate group into two groups
  • Selecting survivorship
  • Applying hierarchy linkage
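The A/B/C split itself can be sketched with two similarity thresholds. This is a simplified illustration, not how SuperMatch works; the threshold values and the use of Python’s difflib as the similarity measure are assumptions made for the example:

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; real tools tune these to the data and the purpose.
UPPER = 0.90   # at or above: A pot (automated positive match)
LOWER = 0.60   # below: C pot (automated negative match)

def similarity(name_a: str, name_b: str) -> float:
    """A simple similarity score; real engines combine many fuzzy techniques."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

def classify(name_a: str, name_b: str) -> str:
    """Sort a candidate pair into the A, B or C pot."""
    score = similarity(name_a, name_b)
    if score >= UPPER:
        return "A"   # computer says yes
    if score < LOWER:
        return "C"   # computer says no
    return "B"       # computer says maybe - route to manual inspection
```

Everything landing in the B pot is what the manual inspection interface described above has to make fast but accurate to work through.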

Anyone else out there who has worked with making or using a man-machine dialogue for this?

Do you mean deduplication or deduplication?

The term deduplication may be two different things in computing:

  • The storage kind of deduplication
  • The data quality kind of deduplication

The storage kind of deduplication refers to reducing the data volumes stored and backed up by finding exactly the same file (or other assemblies of data, I guess) and eliminating all but one copy.

The data quality kind of deduplication is about finding entities in databases that don’t have a common unique key and are not spelled exactly the same but are so similar that we may consider them representing the same real-world object.

The result of the data quality kind of deduplication may be that all but one duplicate row are eliminated, but most often we will actually add more bytes by linking the duplicate rows and perhaps making a new golden record.
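A sketch of that linking approach, with hypothetical fields and a deliberately naive survivorship rule (the longest non-empty value wins):

```python
# Hypothetical rows already identified by matching as the same real-world company.
rows = [
    {"id": 1, "name": "Acme Corp",        "phone": None,      "city": "Copenhagen"},
    {"id": 2, "name": "ACME Corporation", "phone": "5551234", "city": None},
]

def build_golden_record(duplicate_group):
    """Link the duplicate rows and survive one value per field,
    rather than deleting all but one row."""
    golden = {"linked_ids": [row["id"] for row in duplicate_group]}
    for field in ("name", "phone", "city"):
        values = [row[field] for row in duplicate_group if row[field] is not None]
        # Naive survivorship rule for illustration: longest non-empty value wins.
        golden[field] = max(values, key=len) if values else None
    return golden
```

Real survivorship rules weigh source trust, recency and completeness, but the point stands: the duplicate group grows into a linked structure instead of shrinking to one row.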

This ambiguity sometimes leads to mixing it all up.

I remember some years ago, when I started as employee number 1 at Omikron Data Quality in the Nordics, we made a meeting booking campaign. This was done by a telemarketing bureau. They booked a lot of meetings for me, including one at a company that was very interested in tools for deduplication.

It was a very strange meeting until, after 12 minutes and 34 seconds, we concluded that there are indeed two kinds of deduplication in computing.

Also, I noticed lately that a leading vendor of the data quality kind of deduplication tools promoted their product by referring to articles on cost savings and other benefits related to the storage kind of deduplication.


Cultural Stereotypes, Matching Engines and an Oscar

Normally I’m not that fond of using cultural stereotypes, but nevertheless, prompted by a conversation lately (and inspired by the Oscar show), I came to think about the following scenarios:

Indian Style

I have heard that in India you don’t say no if someone asks you to do something. So a Bollywood story could be:

A boss calls in a product manager. He asks him to make a data matching engine that produces no false positives and no false negatives. The product manager knows it is impossible but can’t say no. The product manager says it may be complicated, but when told they can double the team, he goes back to the developers and initiates the project.

After a month the boss calls the product manager and asks if they are finished. The product manager replies: “Well, we have come a long way, but there are still some unresolved issues and some testing to be done”.

After yet another month the boss calls the product manager again and asks if they are finished. The product manager replies: “Well, we have solved the previous issues, but we have run into some new problems and some more testing has to be done”.

After yet another month the boss calls the product manager again and asks if they are finished. The product manager replies: “Well, we have ….

Danish Habits

In Denmark we have good compensation from the state if we lose our jobs, and anyway we are confident that we will find another one. So the short story (we are good at short films) could be:

The boss calls in the product manager and says “Hi Kim, it’s been decided we will make a matching engine that produces no false positives and no false negatives”.

The product manager leans forward, slams the provided business plan onto the table and says: “If you want such a product you can make it yourself” and leaves the room.

The American Way

It’s my impression, that in the United States you (mostly) do what you are told to do. So here the Hollywood story could be:

The boss calls in the product manager and says “Chris, I have got a great idea:  We will make a matching engine that produces no false positives and no false negatives”.

The product manager replies: “That’s impossible”.

The boss says: “Chris, I didn’t ask you about your opinion but told you to make the product”.

The product manager: “You’re the boss”.

The product manager returns to the team. They work hard to make a matching engine with some configurable settings such as:

  • No false positives, but false negatives are allowed (recommended)
  • No false negatives, but false positives are allowed
  • No false positives and no false negatives

The boss is satisfied with how the product looks. He passes it on to marketing. Marketing contacts the analysts. The analysts are excited about the product features and write about how this great product (from this well-established company) will change the game of data matching.

Standardise this, standardize that

Data matching is about linking entities in databases that don’t have a common unique key and are not spelled exactly the same but are so similar that we may consider them representing the same real-world object.

When matching we may:

  • Compare the original data rows using fuzzy logic techniques
  • Standardize the data rows and then compare using traditional exact logic

As suggested by the title of this blog post, a common problem with standardization is that it may have two (or more) outcomes, just like this English word may be spelled in different ways depending on the culture.

Not least when working with international data do you feel this pain. In my recent social media engagement I have had the pleasure of touching on this subject (mostly in relation to party master data) on several occasions, including:

  • In a comment to a recent post on this blog, Graham Rhind says: “Based just on the type of element and their positions in an address, there are at least 131 address formats covering the whole world, and around 40 personal name formats (I’m discovering more on an almost daily basis).”
  • Rich Murnane made a post with a fantastic video of Derek Sivers telling that while in many parts of the world we have named streets with building numbers assigned according to sequential position, in Japan you have named blocks between unnamed streets, with building numbers assigned according to the sequence in which the buildings were established.
  • In the Data Matching LinkedIn group, Olga Maydanchik and I exchanged experiences on the problem that in the American date format you write the month before the day, while in the European date format you write the day before the month.
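The date format problem can be shown in a few lines: the same string is a valid date under both conventions but names two different days. A minimal Python illustration:

```python
from datetime import datetime

raw = "03-04-2010"  # month-day-year in the US, day-month-year in Europe

us_date = datetime.strptime(raw, "%m-%d-%Y")  # American: March 4th, 2010
eu_date = datetime.strptime(raw, "%d-%m-%Y")  # European: 3 April 2010
```

Without knowing which culture provided the data, there is no way to tell from the string alone which of the two real-world dates is meant.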

In my work with international data I have often seen that determining which standard is used depends on both:

  • The culture of the real world entity that the data represents
  • The culture of the person (organisation) that provided the data

So the possible combination of standards applied to a given data set derives from where the data is, what elements are contained and who entered the data (information that is often not carried along with the data).

This is why I like to use both standardisation and standardization and fuzzy logic when selecting candidates and assigning similarity in data matching.
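One way to apply both standardisation and standardization when selecting candidates is to generate one candidate key per convention instead of forcing a single “correct” form. A minimal sketch, where the synonym table and the addresses are made up for the example:

```python
# Hypothetical abbreviation table; a real one is culture- and domain-specific.
STREET_SYNONYMS = {"street": "st", "strasse": "str", "vej": "v"}

def candidate_keys(address: str) -> set:
    """Return one key per standardisation applied - not a single 'correct' one."""
    tokens = address.lower().replace(",", " ").split()
    keys = {" ".join(tokens)}  # the address as entered
    standardized = [STREET_SYNONYMS.get(token, token) for token in tokens]
    keys.add(" ".join(standardized))  # the address under the abbreviation standard
    return keys
```

Two entries standardised under different conventions still meet on a shared key – for example “Main Street 7” and “Main St 7” – and fuzzy similarity assignment then takes over from there.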


Unpredictable Inaccuracy

Let’s look at some statements:

• Business Intelligence and Data Mining are based on looking into historical data in order to make better decisions for the future.

• Some of the best results from Business Intelligence and Data Mining are achieved when looking at data in different ways than before.

• It’s a well-known fact that Business Intelligence and Data Mining are very much dependent on the quality of the (historical) data.

• We all agree that you should not start improving data quality (like anything else) without a solid business case.

• Upstream prevention of poor data quality is superior to downstream data cleansing.

Unfortunately the wise statements above have some serious interrelated timing issues:

• The business case can’t be established before we start to look at the data in a different way.

• Data is already stored downstream when that happens.

• Anyway, we don’t know precisely what data quality issues we have in that context before trying out new possible ways of looking at the data.

Solutions to these timing issues may be:

• Always try to have the data reflect the real-world objects they represent as closely as possible – or at least include data elements that make enrichment from external sources possible.

• Accept that downstream data cleansing will be needed from time to time and be sure to have the necessary instruments for that.


Bad word?: Data Owner

When reading a recent excellent blog post called “How to Assign a Data Owner” by Rayk Fenske I once again came to think about how I dislike the word owner in “Data Owner” and “Data Ownership”.

I am not alone. Recently Milan Kucera expressed the same feelings on DataQualityPro. I also remember that Paul Woodward from British Airways said at MDM Summit Europe 2009: Data is owned by the entire company – not by any individuals.

My thoughts are:

  • Owner is a good word where we strive for fit for a single purpose of use in one silo
  • Owner may be a word of choice where we strive for fit for single purposes of use in several silos
  • Owner is a bad word where we strive for fit for multiple purposes of use in several silos

Well, I of course don’t expect that all the issues raised by Rayk will disappear even if we are able to find a better term than “Data Owner”.

Nevertheless, I will welcome better suggestions for coining what is really meant by “Data Ownership”.
