Eurovisions

Diversity in data quality is a recurring subject of mine. I think the issues with data quality and diversity resemble a recurring event in Europe: the yearly Eurovision Song Contest. This year the contest was held in Oslo this past week.

Every participating country brings a song. The lyrics may be in any language, which in practice mostly means either English or one (or more) of the country's local languages. Some songs have an international sound while other songs have a strongly recognizable local sound. This year I noticed:

  • The winning song from Germany was in the international category, performed in English.
  • UK songs usually have an international sound and are performed in English, but this year the British song handicapped itself with a sound more than 20 years old, leading to a similarly dated position in the final.
  • The Netherlands went with a local sound performed in Dutch. A big hit in Holland, I think, but it didn't make it to the final.

The voting process was, as usual, criticized, as neighboring countries tend to favor each other – as seen among the Balkan countries and the Viking nations.


Multi-Entity Master Data Quality

Master data comprises the core entities that describe the ongoing activities in an organization:

  • Business partners (who)
  • Products (what)
  • Locations (where)
  • Timetables (when)

Many Master Data Management and Data Quality initiatives are at first focused on only a single entity type, but sooner or later you are faced with dealing with all entity types and the data quality issues that arise from combining data from each entity type.

In my experience business partner data quality issues are in many ways similar across all industry verticals, while product master data challenges may differ in many ways when comparing companies in various industry verticals. The importance of location data quality varies greatly too, as do the questions about timetable data quality.

A journey in a multi-entity master data world

My latest experience in multi-entity master data quality comes from public transportation.

The most frequent business partner role here is of course the passenger. By the way (so to speak): a passenger may be a direct customer, but the payer may also be someone else. Whether the passenger is defined as a customer or not, however, doesn't change the need for data quality: you will regardless have to solve problems with uniqueness and real world alignment.

The product sold to a passenger is in the first place a travel document like a single ticket or an electronic card holding a season pass. But the service of value to the passenger is a ride from point A to point B, which in many cases is delivered as a trip consisting of a series of rides from point A via point C (and D…) to point B. Having consistent hierarchies in reference data is a must when making data fit for multiple purposes of use in disciplines such as fare collection, scheduling and so on.
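The ride/trip hierarchy can be sketched minimally; the `Ride` type and the stop names below are hypothetical, not from any actual system:

```python
from dataclasses import dataclass

@dataclass
class Ride:
    origin: str       # stop point where the ride starts
    destination: str  # stop point where the ride ends

def is_consistent_trip(rides: list[Ride]) -> bool:
    """A trip is consistent when each ride starts where the previous one ended."""
    return all(prev.destination == nxt.origin
               for prev, nxt in zip(rides, rides[1:]))

# A trip from A to B delivered as rides via C and D
trip = [Ride("A", "C"), Ride("C", "D"), Ride("D", "B")]
print(is_consistent_trip(trip))  # True
```

A check like this is what "consistent hierarchies in reference data" boils down to in practice: the parts must add up to the whole.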

Locations are mainly stop points, including those at the start and end of the rides. These are identified both by a name and by geocoding – either as latitude and longitude on a round globe or as coordinates in a flat representation suitable for a map (on a screen). The distance between stops is important for grouping stops into areas suitable for interchange, e.g. bus stops on each side of a road or bus stops and platforms at a rail station. Working with the precision dimension of data quality is a key to accuracy here.
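The grouping by distance can be sketched with the standard haversine formula; the coordinates below are made up for illustration:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6_371_000 * asin(sqrt(a))  # 6,371,000 m = mean Earth radius

# Two bus stops on opposite sides of a road (made-up coordinates)
d = haversine_m(55.6761, 12.5683, 55.6763, 12.5685)
print(d < 50)  # close enough to group into one interchange area
```

Note how the precision of the stored coordinates directly bounds how small a grouping threshold you can meaningfully use.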

Timetables change over time. It is essential to keep track of timetable validity in offline flyers, websites with passenger information, back office systems and on-board bus computers. Timeliness is as ever vital here.
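Tracking which timetable version applies on a given date can be sketched as follows; the version names and validity periods are hypothetical:

```python
from datetime import date

def valid_timetable(timetables, on_date):
    """Return the version whose validity period covers on_date, else None."""
    for tt in timetables:
        if tt["valid_from"] <= on_date <= tt["valid_to"]:
            return tt["version"]
    return None

timetables = [
    {"version": "winter", "valid_from": date(2010, 1, 1), "valid_to": date(2010, 3, 31)},
    {"version": "summer", "valid_from": date(2010, 4, 1), "valid_to": date(2010, 9, 30)},
]
print(valid_timetable(timetables, date(2010, 6, 4)))  # summer
```

Every consumer – flyer, website, back office, on-board computer – must agree on the same validity lookup, or the timeliness problem reappears.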

Matching transactions – made by drivers and passengers on numerous on-board computers, entered by employees in back office systems, or coming from external sources – with the master data entities that correctly describe each transaction is paramount for an effective daily operation, and it is the foundation for exploiting the data in order to make the right decisions about future services.


Data Matching 101

Following up on my post no. 100, I can’t resist making a post having 101 in the title. I’ll use 101 in the meaning of an introduction to a subject. As “Data Quality 101” and “MDM 101” are already widely discussed, I think “Data Matching 101” is a good title.

Data matching deals with the dimension of data quality I like to call uniqueness. I use uniqueness because it is the positive term describing the state we want to bring our data to – as opposed to duplication, which is the state we want to change. This follows the pattern of the other dimensions of data quality, which also describe desired states: accuracy, consistency, timeliness and so on.

Data matching is, besides data profiling, the activity within data quality that has been automated the most. No wonder, since duplicates – especially in master data – and master data not being aligned with the real world are costing organizations incredible amounts of money. Finding duplicates among millions (or even thousands) of records by manual means is impossible. The same is true for matching against directories holding timely descriptions of the real world. You have to use a computerized approach, supplemented by exactly the amount of manual verification that keeps your return on investment positive.

Matching names and addresses (party master data) is the most common area of data matching. Matching product master data is probably going to be the next big thing in matching. I have also been involved in matching location data and timetables.

A computerized approach to data matching may include several different techniques such as parsing and standardization, using synonyms, assigning match codes, advanced algorithms and probabilistic learning.
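A deliberately crude sketch of three of those techniques – standardization with synonyms, match codes and a similarity algorithm. The synonym list, the match code recipe and the threshold are illustrative assumptions, not a production approach:

```python
import re
from difflib import SequenceMatcher

SYNONYMS = {"st": "street", "rd": "road", "bob": "robert"}  # illustrative only

def standardize(text: str) -> str:
    """Lowercase, strip punctuation and expand known synonyms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(SYNONYMS.get(t, t) for t in tokens)

def match_code(text: str) -> str:
    """A crude match code: the first four consonants of the standardized text."""
    consonants = [c for c in standardize(text) if c.isalpha() and c not in "aeiou"]
    return "".join(consonants[:4])

def similarity(a: str, b: str) -> float:
    """Similarity ratio between two standardized strings (0.0 to 1.0)."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()

print(match_code("Bob Smith") == match_code("Robert Smith"))  # True
print(similarity("Bob Smith", "Robert Smith") > 0.8)          # True
```

Match codes are typically used to group candidate records cheaply; the more expensive similarity comparison then runs only within each group.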

All that is best explained with examples. Therefore I am happy to do a webinar called “The Art of Data Matching” as part of a series of free webinars on eLearningCurve. The webinar will be a sightseeing tour of examples of challenges and solutions in the data matching world.

Date and time: Well, these are matching examples of expressing the moment the webinar starts:

  • Friday 06/04/10 12pm EDT
  • Friday 04/06/10 18:00 Central European Summer Time
  • Sydney, Sat Jun 5 2:00 AM
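That the three notations really do match can be verified with Python's zoneinfo database (Python 3.9+); a quick sketch:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# The three notations from the list, parsed with explicit time zones
new_york = datetime(2010, 6, 4, 12, 0, tzinfo=ZoneInfo("America/New_York"))
berlin   = datetime(2010, 6, 4, 18, 0, tzinfo=ZoneInfo("Europe/Berlin"))
sydney   = datetime(2010, 6, 5, 2, 0, tzinfo=ZoneInfo("Australia/Sydney"))

# Aware datetimes compare by instant, so all three are the same moment
print(new_york == berlin == sydney)  # True
```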

Link to the eLearningCurve free webinar here.


Post no. 100

This is post number 100 on this blog. Besides being a time for saying thank you to those who have read this blog, those who have re-tweeted the posts and, not least, those who have commented on the posts, it is also a time for a recapitulation of my opinions (based on my experiences and observations) about data quality.

Let me emphasize three points:

  • Fit for purpose versus real world alignment
  • Diversity in data quality
  • The role of technology in data quality improvement

Fit for purpose versus real world alignment

According to Wikipedia data may be of high quality in two alternative ways:

  • Either they are fit for their intended uses
  • Or they correctly represent the real-world construct to which they refer

My thesis is that, as you include more and more purposes, there is a breakeven point beyond which it is less cumbersome to reflect the real world object than to try to align with all known purposes.

This theme is so far covered in 19 posts and pages including:

Diversity in data quality

International and multi-cultural aspects of data quality improvement have been a favorite topic of mine for a long time.

Having worked with data quality tools and services for many years, I have found that many tools and services are very national. So you might discover that a tool or service works wonders with data from one country, but is quite ordinary or in fact useless with data from another country.

I have made 15 posts on diversity in data quality so far including:

The role of technology in data quality improvement

Being a data quality professional may be achieved coming from either the business side or the technology side of practice. But more important in my eyes is whether you have made serious attempts, and succeeded, in understanding the side from which you didn’t start. I have always strived to be a mixed-skilled person. As I have single-handedly tried to build a data quality tool – or to be more specific, a data matching tool – I do of course write a lot about data quality technology.

This blog includes 37 posts on data quality technology so far including:


The Next Level

A quote about data quality from Thomas Redman says:

“It is a waste of effort to improve the quality (accuracy) of data no one ever uses.”

I have learned the quote from Jim Harris, who mentioned it most recently in his post: DQ-Tip: “There is no point in monitoring data quality…”

In a comment Phil Simon said: “I love that. I’m jealous that I didn’t think of something so smart.”

I’m guessing Phil was into some irony. If so, I can see why. The statement seems pretty obvious, and at first glance you can’t imagine anyone taking the opposite stance: let’s cleanse some data no one ever uses.

Also, I think it was meant as obvious in Redman’s book Data Driven.

Well, taking it to the next level I can think of the following elaboration:

  1. If you find some data that no one ever uses, you should not only avoid improving the quality of that data, you should actually delete the data and make sure that no one spends time and resources entering or importing the same data in the future.
  2. That is, unless the reason that no one ever uses the data is that the quality of the data is poor. Then you must compare the benefits of improving the data against the costs of doing so. If costs are bigger, proceed with point 1. If benefits are bigger, go to point 3.
  3. It is not a waste of effort to improve the quality of some data no one ever uses.


Relational Data Quality

Most of the data quality improvement work I do is done with data in relational databases and is aimed at creating new relations between data. Examples (from party master data) are:

  • Make a relation between a postal address in a customer table and a real world address (represented in an official address dictionary).
  • Make a relation between a business entity in a vendor table and a real world business (represented in a business directory most often derived from an official business register).
  • Make a relation between a consumer in one prospect table and a consumer in another prospect table because they are considered to represent the same real world person.

When striving for multi-purpose data quality it is often necessary to reflect further relations from the real world like:

  • Make a relation in a database reflecting that two (or more) persons belong to the same household (at the same real world address).
  • Make a relation in the database reflecting that two (or more) companies have the same (ultimate) mother company.
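The household relation can be sketched as a grouping on an already standardized address key; the party records below are made up:

```python
from collections import defaultdict

# Hypothetical party records already matched to standardized real world addresses
parties = [
    {"id": 1, "name": "John Doe", "address_key": "main street 1, 2750 ballerup"},
    {"id": 2, "name": "Jane Doe", "address_key": "main street 1, 2750 ballerup"},
    {"id": 3, "name": "Acme Ltd", "address_key": "harbour road 7, 2100 copenhagen"},
]

households = defaultdict(list)
for party in parties:
    households[party["address_key"]].append(party["id"])

# Persons 1 and 2 share a real world address and so form one household
print(sorted(households["main street 1, 2750 ballerup"]))  # [1, 2]
```

The quality of the household relation is only as good as the address relation it builds on, which is why the first kind of relations must come first.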

Getting these relations right is fundamental for any further data quality improvement endeavors and all the exciting business intelligence stuff. In doing that you may continue to have more or less fruitful discussions on, say, the classic question: what is a customer?

But in my eyes, in relation to data quality, it doesn’t matter whether that discussion concludes that a given row in your database is a customer, an old customer, a prospect or something else. Building the relations may even help you realize what that someone really is. Could be a sporadic lead is recognized as belonging to the same household as a good customer. Could be a vendor is recognized as being a daughter company of a hot prospect. Could be someone is recognized as being fake. And you may even have business intelligence that, based on the relations, reports a given row as having a customer role in one context and another role in another context.

Whether Weather Forecasting or Not

Predicting ROI from a data quality program (and many other business initiatives) is like predicting the weather. You can probably guess whether it is going to be good or bad, but you rarely guess exactly how good or bad it actually turns out.

The chances of predicting the weather right vary with the time of year and your location. I have the pleasure of living in a place (Denmark) where the weather is pretty unpredictable.

Well, winter is usually cold and summer is warm.

We also know that if we have easterly winds coming in from the Russian Steppe during winter, it turns very cold. In summer that wind makes beautiful hot sunny days. Westerly winds in winter coming in from the Atlantic Ocean mean temperatures above freezing. In summer that wind often brings some chill and rain with it.

But these are the main scenarios. Between those rough generalizations there is a myriad of factors, events and not fully understood processes that make weather forecasting a chaotic discipline.

Making business cases for data quality programs has the same challenges. Well, at some spots on the globe (in some parts of the year) you can wake up every morning and be certain that it is going to be a hot sunny day. Likewise, a lot of business activities will without any doubt benefit from better data quality – no further forecasting needed. In other cases it may be uncertain. Here you may rely on previous experiences (case studies by others) and your own position. You may outline a business case and you could be right.

This morning at my place was forecast to be mostly cloudy but dry. It is damned cloudy and raining a bit.

Sticky Data Quality Flaws

Fighting against data quality flaws is often most successfully done at data entry. When incorrect information has been entered into the system it most often seems nearly impossible to eliminate the falsehood.

A hilarious example is told in an article from telegraph.co.uk. A local council sent a letter to a woman’s pet pig (named Blossom Grant), offering the animal the chance to register to vote in last week’s UK election. This is only the culmination of a lot of letters – including tons of direct marketing – addressed to the pigsty. The pigsty was, according to the article, wrongly registered as a residence some years ago after a renovation. Since then the pig’s owner (named Pauline Grant) has tried to get the error corrected over and over again – but with no success.


Big Time ROI in Identity Resolution

Yesterday I had the chance to make a preliminary assessment of the data quality in one of the local databases holding information about entities involved in carbon trade activities. It is believed that up to 90 percent of the market activity may have been fraudulent with criminals pocketing 5 billion Euros. There is a description of the scam here from telegraph.co.uk.

Most of my work with data matching is aimed at finding duplicates. In doing this you must avoid finding so-called false positives, so you don’t end up merging information about two different real world entities. But when doing identity resolution – for purposes including preventing fraud and scams – you may on the contrary be interested in finding connections between entities that are not supposed to be connected at all.

The result from making such connections in the carbon trade database was quite astonishing. Here is an example where I have changed the names, addresses, e-mails and phones, but such a pattern was found in several cases:

Here we have an example of a group of entities where the name, address, e-mail or phone is shared in a way that doesn’t seem natural.
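Such unnatural sharing can be surfaced by indexing entities on their identifying attributes and emitting every pair that shares one; the sample entities below are entirely made up:

```python
from collections import defaultdict
from itertools import combinations

entities = [
    {"id": "E1", "email": "trader@example.com", "phone": "+45 1111"},
    {"id": "E2", "email": "trader@example.com", "phone": "+45 2222"},
    {"id": "E3", "email": "other@example.com",  "phone": "+45 2222"},
]

def connections(entities, keys=("email", "phone")):
    """Return pairs of entity ids sharing any identifying attribute."""
    index = defaultdict(set)
    for entity in entities:
        for key in keys:
            index[(key, entity[key])].add(entity["id"])
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)

print(connections(entities))  # [('E1', 'E2'), ('E2', 'E3')]
```

Add fuzzy standardization of the attributes before indexing, and exact sharing turns into the kind of automated fuzzy connection check discussed here.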

My involvement in the carbon trade scam was initiated by a blog post yesterday by my colleague Jan Erik Ingvaldsen, based on the story that journalists, merely by gazing at the database, had found addresses that simply don’t exist.

So the question is whether the authorities might have avoided losing 5 billion taxpayer Euros if some identity resolution, including automated fuzzy connection checks and real world checks, had been implemented. I know that everyone is so much more enlightened about what could have been done once a scam is discovered, but I actually think that there are a lot of other billions of Euros (Pounds, Dollars, Rupees) out there to avoid losing by doing some decent identity resolution.
