Survey Data Laundering

There are many different terms for data quality improvement activities: data cleaning, data cleansing, data scrubbing and data hygiene, to name a few.

Today I stumbled upon “data laundering” and a site owned by an old colleague of mine from way back, when we were doing stuff not focused on data quality.

Joseph specializes in laundering data from surveys. The issue is that surveys always contain some unreliable responses, which lead to wrong conclusions, which in turn lead to wrong decisions. This is a trail well known in data and information quality.

Unreliable responses resemble outliers in business intelligence: responses from respondents whose answers are far from the most conceivable result. What I like about the presentation of the business value is that the example is about food: what we say we eat versus what we actually consume. Then there is a lot of math, and even an induction mechanism, to support the proposition. Read all about it here.
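The outlier angle can be illustrated with a minimal sketch (not Joseph’s actual method, which involves much more math): flagging survey responses far from the typical answer using the robust modified z-score. All numbers are made up.

```python
from statistics import median

def flag_unreliable(responses, cutoff=3.5):
    """Flag responses far from the typical answer using the modified
    z-score (based on the median absolute deviation).

    A simple stand-in for the real survey laundering math."""
    med = median(responses)
    mad = median(abs(r - med) for r in responses)
    if mad == 0:
        return [False] * len(responses)
    return [0.6745 * abs(r - med) / mad > cutoff for r in responses]

# Invented example: reported daily calorie intake from a food survey
reported = [2100, 1950, 2300, 2050, 9800, 2200]
flags = flag_unreliable(reported)  # the 9800 response gets flagged
```

The median based score is used here because a single extreme response would inflate a cutoff based on mean and standard deviation.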


Non-Obvious Entity Relationship Awareness

In a recent post here on this blog the question was discussed: What is Identity Resolution?

One angle was the interchangeable use of the terms “Identity Resolution” and “Entity Resolution”. These terms can be seen as truly interchangeable, or “Identity Resolution” can be seen as more advanced than “Entity Resolution”, or (my suggestion) “Identity Resolution” relates merely to party master data while “Entity Resolution” can be about all master data domains: parties, locations and products.

Another term sometimes used in this realm is “Non-Obvious Relationship Awareness”. This term, too, relates mainly to finding relationships between parties, for example individuals at a casino who seem to do better than the croupiers. Here’s a link to a (rather old) O’Reilly Radar post on Non-Obvious Relationship Awareness.

Going Multi-Domain

So “Non-Obvious Entity Relationship Awareness” could be about finding these hidden relationships in a multi-domain master data scope.

An example could be non-obvious relationships in a customer/product matrix.

The data supporting this discovery will actually not be found in the master data itself, but in transaction data, probably residing in an Enterprise Data Warehouse (EDW). But a multi-domain master data management platform will be needed to support the complex hierarchies and categorizations required to make the discovery.

One technical aspect of discovering such non-obvious relationships is how chains of keys are stored in the multi-domain master data hub.

Customer Master Data

The transactions, or sums hereof, in the data warehouse will have keys referencing customer accounts. These accounts can be stored in staging areas in the master data hub with references to a golden record for each individual or company in the real world. Depending on the identity resolution available, the golden records will have golden relations to each other, forming hierarchies of households, company family trees, contacts within companies and their movements between companies, and so on.
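As a hedged sketch of such a chain of keys (all table contents and key formats below are invented, not from any actual hub):

```python
# Invented key chain: account keys in the data warehouse reference
# accounts staged in the hub, which reference golden records, which
# carry golden relations such as household membership.
account_to_golden = {"ACC-001": "GR-17", "ACC-002": "GR-17", "ACC-003": "GR-42"}
golden_to_household = {"GR-17": "HH-5", "GR-42": "HH-5"}

def resolve(account_key):
    """Follow the chain from a customer account key to the golden
    record and its household."""
    golden = account_to_golden.get(account_key)
    return golden, golden_to_household.get(golden)

# ACC-001 and ACC-002 turn out to be the same real world person, and
# both golden records belong to the same household.
```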

My guess, as described in the post Who is working where doing what?, is that this will increasingly include social media data.

Product Master Data

Some of the same transactions or sums hereof in the data warehouse will have keys referencing products. These products will exist in the master data hub as members of various hierarchies with different categorizations.

My guess is that future developments in this field will further embrace not just your own products but also competitor products and market data available in the cloud all attached to your hierarchies and categorizations.   


Customer Product Matrix Management

A customer/product matrix is a way of describing the relationships between customer types and product types/attributes.  


Note: Please find some data quality related product descriptions in the post Data Quality and World Food.

Filling out the matrix may be based on prejudices, gut feelings, assumptions, surveys, focus groups or data.

If we go for data, we may do this by collecting available historical data related to sales and inquiries made by persons belonging to each customer type regarding products belonging to each product type.
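Assuming each transaction has already been enriched with a customer type and a product type (the types and transactions below are invented for illustration), filling the matrix from data could be as simple as counting cell by cell:

```python
from collections import Counter

# Invented transactions, each already carrying an assigned customer
# type and product type
transactions = [
    ("single household", "convenience food"),
    ("family household", "bulk groceries"),
    ("family household", "bulk groceries"),
    ("single household", "bulk groceries"),
]

# Each matrix cell holds the number of transactions for that
# (customer type, product type) combination
matrix = Counter(transactions)
```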

In doing that correctly we need two kinds of master data management and data quality assurance in place:

  • Customer Data Integration (CDI) for assigning the accurate real world customer type to the uniquely identified person in transactions coming from all sources – here based on location master data.
  • Product Information Management (PIM) for categorizing the relevant fit for purpose product type.

This reminds me of multi-domain master data management: customer master data (or shall we say party master data), product master data and location master data used to figure out how to do business. I like it – both the master data management part and the mentioned product types.


Business and Pleasure

The data quality and master data management (MDM) realm has many wistful songs about unrequited love with “the business”.

This morning I noticed yet another tweet on Twitter expressing the pain:

Here Gartner analyst Ted Friedman foresees the doom of MDM if we don’t get at least the traction from “the business” that BI (Business Intelligence) is getting.

In my eyes everything we do in Information Technology is about “the business”. Even computer games and digital entertainment are a core part of their respective industries. I also believe that IT is part of “the business”.

“The rest of the business” does accept that some disciplines belong in the IT realm. This goes for database management, programming languages and network protocols. These disciplines are not doomed because of that; “the rest of the business” couldn’t work today without these things around.

Certainly I have seen some IT based disciplines and related tools emerge and then be doomed during my years in the IT business. Does anyone remember CASE tools?

With CASE tools I remember great expectations about business involvement in application design. But according to Wikipedia the main problems with CASE tools are (were): inadequate standardization, unrealistic expectations, slow implementation and weak repository controls.

In other words: “the rest of the business” never really got in touch with CASE tools because they didn’t work as expected.

The business traction we now see around BI (and the enabling tools) is, in my eyes, very much because the tools have matured, actually work, have become more user friendly and seem to create useful results for “the rest of the business”.

Data quality tools and MDM tools must continue in that direction too, because one thing is for sure: data quality tools and MDM tools do not solve any severe problems internally in the IT part of “the business”.

It’s my pleasure being part of that.


Christmas Tree Options

Today, the last Sunday before Christmas, seems to be a good day for selecting a Christmas tree.

We are considering two different options:

  • As most times before, we will find a tree as wide and high as possible for the room, so it may be decorated with as many of the different ornaments we have collected over the years as well as some of the precious things passed down from previous generations. It will be cut above the root, but that’s not a problem since we will throw it away after Christmastide.
  • Another option is a smaller tree, still with the root on, planted in a pot. We will then have to select the decorations carefully. The advantage is that it can be reused on the terrace during the year and then, a little taller, as the Christmas tree again next year.

Well, not that different from the considerations about data quality, data warehouse and business intelligence projects and programs from my workdays.


Donkey Business

When I started focusing on data quality technology 15 years ago I had great expectations about the spread of data quality tools including the humble one I was fabricating myself.

Even if you tell me that tools haven’t spread because people are more important than technology, I think most people in the data and information quality realm agree that the data and information quality cause hasn’t spread as much as it deserves.

Fortunately it seems that the interest in solving data quality issues is getting traction these days. I have noticed two main drivers for that. If we compare with the traditional means of getting a donkey to move forward, one encouragement is like the carrot and the other is like the stick:

  • The carrot is business intelligence
  • The stick is compliance

With business intelligence, a lot has been said and written about how business intelligence doesn’t deliver unless the intelligence is built on a solid, valid data foundation. As a result I have noticed I’m increasingly involved in data quality improvement initiatives aimed at being a foundation for delivering business decisions. One of my favorite data quality bloggers, Jim Harris, has turned that carrot a lot on his blog: Obsessive Compulsive Data Quality.

Another favorite data quality blogger, Ken O’Conner, has written about the stick being compliance work on his blog, where you will find a lot of good points that Ken has learned from his extensive involvement in regulatory requirement issues.

These are interesting times with a lot of requirements for solving data quality issues. As we all know, the stereotypical donkey is not easily driven forward, and we must be careful not to make the burden too heavy.


Valuable Inaccuracy

These days I’m involved in an activity in which, you may say, we are making better information quality by creating data of questionable quality.

The business case is within public transit. In this particular solution passengers use chip cards when boarding buses, but not when alighting. This is a cheaper and smoother solution than the alternative in electronic ticketing, where you have both check-in and check-out. But a major drawback is the missing information about where passengers alighted, which is very useful in business intelligence.

So what we do is, where possible, assume where the passenger alighted. If the passenger (seen as a chip card) within a given timeframe boarded another bus at a stop point that is on or near a succeeding stop point on the previous route, then we assume alighting was at that stop point, though it was not recorded.
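The assumption rule could be sketched like this (the stop names, the two-hour window and the exact-match simplification are mine for illustration; the actual solution also matches stop points that are merely near each other):

```python
from datetime import timedelta

def infer_alighting(succeeding_stops, next_boarding_stop, time_gap,
                    max_gap=timedelta(hours=2)):
    """Assume the passenger alighted at the stop where the same chip
    card next boarded, if that stop is among the succeeding stop
    points of the previous route and the boarding happened within
    the given timeframe. Returns None when nothing can be assumed."""
    if time_gap > max_gap:
        return None
    if next_boarding_stop in succeeding_stops:
        return next_boarding_stop
    return None

# The card's previous trip had succeeding stops A..E, and the card
# boarded another bus 25 minutes later at stop D
alighted = infer_alighting(["A", "B", "C", "D", "E"], "D",
                           timedelta(minutes=25))
```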

Two real life examples are where the passenger makes an interchange, or where the passenger later in the day returns from work, school or another regular activity.

An important prerequisite, however, is that we have good data quality regarding stop point locations, route assignments and other master data and their relations.


A Really Bad Address

Many years ago I worked in a midsize insurance company. At that time IT made a huge change in insurance pricing, since it was now possible to differentiate prices based on a lot of factors known to the databases.

The CEO decided that our company should also make some new pricing models based on where the customer lived, since it was perceived that you are more exposed to having your car stolen and your house broken into if you live in a big city as opposed to a quiet countryside home. But then the question: what exactly should the prices be, and where are the borderlines?

We, the data people, eagerly ran to the keyboard and fired up the newly purchased executive decision tool from SAS Institute. And yes, there was a different story based on postal code series, and especially downtown Copenhagen was really bad (I am from Denmark, where Copenhagen is the capital and largest city).

Curiously, we examined smaller areas in downtown Copenhagen. The result: it wasn’t the crime-exposed red light district that was bad; it was addresses in the business district that hurt the most. OK, more expensive cars and belongings there, we guessed.

Narrowing down further, we were shocked. It was the street of our own company that was really, really bad. And finally: it was a customer with the very same house number as the company that had a lot of damage claims attached.

After investigating a bit more, the case was solved. All payments made to specialists doing damage reporting all over the country were attached to a fictitious customer at the company address.

After cleansing the data the picture wasn’t that bad. Downtown Copenhagen is worse than the countryside, but not by that much. But surprisingly, the CEO didn’t use our data; he merely adopted the pricing model of the leading competitors.

I’m still wondering how those companies did their analysis. They all had headquarters addresses in the same business district.


Relational Data Quality

Most of the work I do related to data quality improvement is done with data in relational databases and is aimed at creating new relations between data. Examples (from party master data) are:

  • Make a relation between a postal address in a customer table and a real world address (represented in an official address dictionary).
  • Make a relation between a business entity in a vendor table and a real world business (represented in a business directory most often derived from an official business register).
  • Make a relation between a consumer in one prospect table and a consumer in another prospect table because they are considered to represent the same real world person.
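The first kind of relation, matching a customer table address against an address dictionary, could be sketched like this (the normalization is deliberately crude and the dictionary entries invented; real address matching is far more elaborate):

```python
def normalize(address):
    """Deliberately crude normalization: lowercase, drop commas,
    collapse whitespace."""
    return " ".join(address.lower().replace(",", " ").split())

# Invented address dictionary: normalized form -> real world address id
address_dictionary = {
    normalize("Main Street 1, 1000 Copenhagen"): "ADDR-0001",
    normalize("Harbour Road 7, 5000 Odense"): "ADDR-0002",
}

def relate_to_real_world(customer_address):
    """Relate a postal address in a customer table to a real world
    address, returning its id, or None when no match is found."""
    return address_dictionary.get(normalize(customer_address))

match = relate_to_real_world("MAIN STREET 1,1000 COPENHAGEN")
```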

When striving for multi-purpose data quality it is often necessary to reflect further relations from the real world like:

  • Make a relation in a database reflecting that two (or more) persons belong to the same household (at the same real world address)
  • Make a relation in the database reflecting that two (or more) companies have the same (ultimate) mother company.
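The household relation could be sketched as grouping persons by the real world address they have already been related to (person and address ids are made up):

```python
from collections import defaultdict

# Invented persons with the real world address id each has already
# been matched against
persons = [
    ("P-1", "ADDR-0001"),
    ("P-2", "ADDR-0001"),
    ("P-3", "ADDR-0002"),
]

def build_households(persons):
    """Relate persons to households: everyone sharing the same real
    world address ends up in the same household."""
    households = defaultdict(list)
    for person_id, address_id in persons:
        households[address_id].append(person_id)
    return dict(households)

households = build_households(persons)  # two persons share ADDR-0001
```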

Having these relations done right is fundamental for any further data quality improvement endeavors and all the exciting business intelligence stuff. In doing that you may continue to have more or less fruitful discussions on say the classic question: What is a customer?

But in my eyes, in relation to data quality, it doesn’t matter if that discussion ends with a given row in your database being a customer, an old customer, a prospect or something else. Building the relations may even help you realize what that someone really is. It could be that a sporadic lead is recognized as belonging to the same household as a good customer. It could be that a vendor is recognized as a daughter company of a hot prospect. It could be that someone is recognized as being fake. And you may even have business intelligence that, based on the relations, reports a given row as having a customer role in one context and another role in another context.

Data Quality: The Movie

Learning from courses, books, articles and so on is good – but sometimes a bit like watching a movie and then realizing that the real world – especially your world – isn’t exactly as in the movie.


The parking experience:

The movie: You are going to visit someone in a huge building in the centre of a large city. You take your car to the front of the building and smoothly place the car on the free parking spot next to the main entrance.

Real life: You drive round and round for ages until you finally find a free parking spot hardly within walking distance of your destination.

My life: I have during my 30 years in the IT business visited a lot of companies and spent time in the IT departments. Nobody does everything by the book. Not even close.

Large companies within financial services are maybe those who, in my experience, come within some distance of doing things by the book. This is probably because most books about IT seem to be written by folks who got their experience from working in large financial service businesses.

(And no, I have absolutely no documentation on that. It is just a gut feeling).

Hitting them hard:

The movie: You are a good guy observing a bad guy harassing a good looking girl. You engage the bad guy in an intense fist fight, you are hit over and over again, but in the end you win. The good looking girl thanks you by kissing your beautiful face.

Real life: Well, you may win the fight. But after that you have to go to the hospital and have them fix your face – and during the following month no girl can look at you without feeling very bad.

My life: Recently I was involved in a data management project aimed at producing some new business intelligence results. Executive sponsorship was no problem; the CEO was the initiator. Objectives were pretty clear. High level business requirements were well known and, not to forget, everyone was fully aware of the impact of data quality. The only issue was the absence of more concrete detailed requirements and business rules for reporting. And of course a politically settled deadline.

Facing the business rule issue, we took a data centric and test driven approach. We produced incremental results, verified test cases, negotiated business rules based on real data examples, and in the end a first report came out. The result was far from expected, in the sense that the numbers were expected to be different. We dived into the data again, found an unexpected data quality issue and corrected accordingly. The result was still far from expected. Based on a specific expected result we dived into a section of the data, made detailed reports and compared with the real world. In the end it turned out that the report was right; the gut feeling perception of the real world had been wrong for a long time.

Now that’s a winner, right? Well, the project is on hold now for political reasons, and the project also has a bad name for going over budget and deadline.

Looking great:

The movie: Morning scene from the nuclear family. Mommy is looking really great (stylish hair, perfect face) while cooking and serving a nice breakfast and helping the kids doing some last minute homework at the same time.

Real life: I think you know.

My life: Actually I have learned that you don’t have to strive for perfection. With data quality, don’t expect to be able to fix everything and have all data fit for every purpose of use at all times.
