History of Data Quality

When did the first data quality issue occur? According to the History section of the Wikipedia article on data quality, it began with the mainframe computer in the United States of America.

Fellow data quality blogger Steve Sarsfield wrote a blog post a few years ago called A Brief History of Data Quality, in which he says: “Believe it or not, the concept of data quality has been touted as important since the beginning of the relational database”.

However, a predominant sentiment in the data quality realm is that data quality is not about technology. It is about people. People are the sinners behind data quality flaws, and as the main part of the problem, people should also be the overwhelming part, if not the only part, of the solution.

So I guess data quality challenges were introduced when people showed up in the real world. How and when that happened is a matter of debate, as discussed in the blog post Out of Africa.

As explained in the post Movable Types, the invention of movable type in printing some hundreds of years ago (the most important invention since someone invented the wheel for the first time) gave a big boost to knowledge sharing among people, and also a big boost to data and information quality issues.

But I think the saying “To err is human, but to really foul things up you need a computer” is valid. Consequently I also think you may need a computer to help with cleaning up the mess and to prevent the mess from happening again. End of (hi)story.    

Finding Finland

This is the fourth post in a series of short blog posts focusing on data quality related to different countries around the world. I am not aiming at presenting a single version of the full truth, but rather a few random observations that I hope someone living in the country, or with knowledge about it, is able to clarify in a comment.

Let’s start with Finnish

Finland is situated in the northeastern corner of Europe. The Finnish language is, together with Estonian and Hungarian (the latter spoken much farther south in Europe), totally different from the languages of the neighboring countries, which are Germanic or Slavic. Swedish is also an official language in Finland, and in some parts of Finland cities and streets have both Finnish and Swedish names (usually totally different).

Galoshes

By far the largest company in Finland is the cell phone maker Nokia. Before the cell phone was invented Nokia made paper and galoshes, the old way of connecting people. From 2006 to 2008 Nokia also owned the data quality firm Identity Systems, which was then sold to Informatica. I guess Identity Systems, together with the Gaelic Tiger firm Similarity Systems, makes up the data matching capabilities at Informatica.

Syslore

One of the remaining (relatively) larger independent data matching firms in the world is Syslore. Syslore is hiding in Finland.

Survey Data Laundering

There are a lot of different words for data quality improvement activities like data cleaning, data cleansing, data scrubbing and data hygiene.

Today I stumbled upon “data laundering” and the site http://www.datalaundering.com that is owned by an old colleague of mine from way back when we were doing stuff not focused on data quality.

Joseph specializes in laundering data from surveys. The issue is that surveys always have some unreliable responses that lead to wrong conclusions, which in turn lead to wrong decisions. This is a trail well known in data and information quality.

Unreliable responses resemble outliers in business intelligence. These are responses from respondents who provide answers distant from the most conceivable result. What I like about the presentation of the business value is that the example is about food: what we say that we eat and what we actually consume. Then there is a lot of math, and even an induction mechanism, to support the proposition. Read all about it here.

Hierarchical Completeness

A common technique used when assessing data quality is data profiling. For example, you may compute measures such as the number of fields in a table that have null or blank values, the distribution of the filled length of a certain field, average values, highest values, lowest values and so on.
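As a minimal sketch of such measures, the snippet below profiles a single field across a small in-memory table. The function name `profile_field`, the sample rows and the field names are made up for illustration; a real profiling tool will offer far more.

```python
from collections import Counter
from statistics import mean


def is_blank(value):
    """A value counts as blank if it is a string holding only whitespace."""
    return isinstance(value, str) and value.strip() == ""


def profile_field(rows, field):
    """Compute a few basic profiling measures for one field (hypothetical helper)."""
    values = [row.get(field) for row in rows]
    filled = [v for v in values if v is not None and not is_blank(v)]
    # Only real numbers take part in min / max / average.
    numeric = [v for v in filled
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    return {
        "rows": len(values),
        "nulls": sum(1 for v in values if v is None),
        "blanks": sum(1 for v in values if is_blank(v)),
        # How many filled values have each length.
        "length_distribution": dict(Counter(len(str(v)) for v in filled)),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "mean": mean(numeric) if numeric else None,
    }


# Illustrative sample data, not taken from any real system.
customers = [
    {"name": "A Company", "postal_code": "1000", "credit_limit": 5000},
    {"name": "", "postal_code": None, "credit_limit": 1000},
    {"name": "The Company", "postal_code": "2100", "credit_limit": None},
]

for field in ("name", "postal_code", "credit_limit"):
    print(field, profile_field(customers, field))
```

Nulls and blanks are counted separately here because they often have different root causes.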

If we look at the most prominent entity types in master data management, customers and products, you may certainly profile your customer tables and product tables, and indeed many data profiling tutorials use these common sorts of tables as examples.

However, in real life profiling an entire customer table or product table will often be quite meaningless. You need to dig into the hierarchies in these data domains to get meaningful measures for your data quality assessment.

Customer master data

In profiling customer master data you must consider the different types of party master data, such as business entities, department entities, consumer entities and contact entities, as the demands for completeness will be different for each type. If your raw data don’t have a solid categorization in place, a prerequisite for data profiling will often be to make such a categorization before going any further.
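A rough sketch of what measuring completeness per party type could look like is shown below. The party types, the required fields per type and the sample records are assumptions chosen for illustration only; your own categorization and requirements will differ.

```python
from collections import defaultdict

# Assumed completeness requirements per party type (illustrative only).
REQUIRED_FIELDS = {
    "business": ["name", "registration_number"],
    "consumer": ["name", "date_of_birth"],
    "contact": ["name", "email"],
}


def is_filled(value):
    """Treat None and whitespace-only strings as not filled."""
    return value is not None and str(value).strip() != ""


def completeness_by_type(rows):
    """Return the share of required fields that are filled, per party type."""
    filled = defaultdict(int)
    expected = defaultdict(int)
    for row in rows:
        party_type = row.get("party_type")
        # Rows with an unknown party type are skipped: they need categorizing first.
        for field in REQUIRED_FIELDS.get(party_type, []):
            expected[party_type] += 1
            if is_filled(row.get(field)):
                filled[party_type] += 1
    return {t: filled[t] / expected[t] for t in expected}


# Illustrative sample data.
parties = [
    {"party_type": "business", "name": "A Company", "registration_number": "BE0123"},
    {"party_type": "business", "name": "The Company", "registration_number": ""},
    {"party_type": "consumer", "name": "Jane Doe", "date_of_birth": None},
]

print(completeness_by_type(parties))
```

A single completeness figure for the whole table would hide the fact that, in this sample, business and consumer records fail for different reasons.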

If your customer data model isn’t too simple, as explained in post A Place in Time, your location data (like shipping addresses, billing addresses, visiting addresses) will be separated from your customer naming and identification data. This hierarchical structure must be considered in your data profiling.

For international customer data there will also be different demands and possibilities for completeness of customer data elements.    

Depending on your industry and way of doing business there may also be different demands for customer data related to different industry verticals, demographic groups and data sourced in different channels. However, this may be slippery ground, as current and, not least, future requirements for multiple uses of the same master data may change the picture.

Product master data

For most businesses the requirements for completeness and other data profiling measures will be very different depending on the product type.

Some requirements will only apply to a small range of products; other requirements apply to a broader range of products.

All in all, the data profiling requirements are an integrated part of hierarchy management for product master data, which makes a very strong case for having data profiling capabilities implemented as part of a product information management (PIM) solution.
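To illustrate how completeness requirements can follow a product hierarchy, here is a small sketch where each category inherits the required attributes of its parents. The categories, attributes and the helper names `required_attributes` and `missing_attributes` are hypothetical and not taken from any particular PIM solution.

```python
# Hypothetical product hierarchy: category -> parent category.
PARENT = {
    "laptops": "electronics",
    "electronics": "all products",
    "galoshes": "footwear",
    "footwear": "all products",
}

# Assumed requirements: broad ones at the top, narrow ones further down.
REQUIRED = {
    "all products": {"name", "price"},
    "electronics": {"voltage"},
    "laptops": {"screen_size"},
    "footwear": {"shoe_size"},
}


def required_attributes(category):
    """Collect requirements from the category and all of its ancestors."""
    required = set()
    while category is not None:
        required |= REQUIRED.get(category, set())
        category = PARENT.get(category)
    return required


def missing_attributes(product):
    """List required attributes that are absent or empty for a product."""
    needed = required_attributes(product["category"])
    return sorted(a for a in needed if product.get(a) in (None, ""))


galosh = {"category": "galoshes", "name": "Rubber galosh", "price": 20}
print(missing_attributes(galosh))
```

Placing a requirement at the right level of the hierarchy means it is stated once and applies to exactly the range of products it concerns.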

Multi-Domain Master Data Management

For master data management solutions embracing both customer data integration (CDI) and product information management (PIM), integrated capabilities for profiling customer master data, location master data and product master data as part of hierarchy management make a lot of sense.

Just as improving data quality isn’t a one-off activity but a continuous program, so is measuring the completeness of your master data of any kind.

Where is the Business?

In technology enabled disciplines we often like to divide an organization into two distinct parts: IT (Information Technology) and “the business”.

I am aware that we do that to emphasize that our solutions have to be business centric as opposed to technology centric. We mustn’t fall into the trap of discussing technology too early, and certainly not of selecting certain technology brands as the first step of our solutions.

A problem however is where to find “the business” in an organization. The top management surely represents all of the business (including the IT part of the business). But in order to find the so called subject matter experts we are looking down the levels in the organization where people don’t belong to “the business” but to sales, marketing, customer service, purchase, production, human resources, finance and so on.

Some technology enabled disciplines belong to a certain department. But disciplines such as (enterprise wide) data quality and master data management are supposed to support most departments. The business. So where do we find the business? And who are we by the way?

Call them?

Assuming it doesn’t matter who we are: Let’s go find “the business”. I guess it doesn’t help calling the reception and asking them to put us through to “the business”. Actually the manned reception probably doesn’t exist today. And it would be surprising to get a machine asking:

  • Do you want to speak with IT? Press 1.
  • Do you want to speak with “the business”? Press 2.

If we are in my home country Denmark we also have a linguistic issue. If I ask Google to translate “the business” from English to Danish I get the word “forretningen”. If I ask Google to translate “forretningen” from Danish back to English I get the word “shop”. So calling “forretningen” will probably get me to the shop floor. Not a bad place, a true gemba, but maybe not the only one.

Everyone belongs to “the business”

In data quality and master data management there is a question used all over to exemplify a common challenge within these disciplines.

The question is: What is a customer?

The challenge is that people from different departments will have different definitions. Marketing defines a customer one way, sales tends to do it a bit differently, finance sees it in yet another way and production has its own viewpoint. And the stereotypical IT guy defines a customer as a row in the customer table.

So now we are asking for Alexander the Great from “the business” to come cutting the Gordian Knot.

That is probably not going to happen.

More likely someone from any business unit will be able to negotiate a proper conceptual solution covering all requirements from the different business units. And from what I see around, it may often be someone whose human resource master data record is related to the IT part of the business. Or was. The main point is having a holistic view of the business where everyone belongs.

Fitness Data

About a month ago I wrote about how my personal data was on-boarded in the local fitness club in the post called Right the First Time.

Since then I have actually succeeded in visiting the gym twice a week and used the amazing technology necessary to get me in action.

As a complete data geek I of course use the full TV screen on the machine not to watch TV but to display the full dashboard with key performance indicators related to my workout. These include:

  • Time done / remaining
  • Pulse with red alert when I’m over the healthy threshold for my age
  • Distance I would have gone if I wasn’t in the same fixed position
  • Calories burned

As with many data presentations we here have a mix of hard facts, like the time done, and then some assumed figures like calories burned. The machine doesn’t really measure the actual accurate burning but calculates the assumed burning as a function of power level, speed, my weight and age.  

It’s actually a question if I really want to know about the calories burned. My conclusion is yes. The time done is wasted anyway, the high pulse doesn’t last and the distance is virtual. So the calories burned fit the purpose of use. It keeps me going.   

Extreme Data Quality

This blog post is inspired by reading a blog post called Extreme Data by Mike Pilcher. Mike is COO at SAND, a leading provider of columnar database technology.

The post revolves around a Gartner approach to extreme data. While the concept of “Big Data” is focused on the volume of data, the concept of “Extreme Data” also takes into account the velocity and the variety of data.

So how do we handle data quality with extreme data, that is, data of great variety moving at high velocity and coming in huge volumes? Will we be able to chase down all root causes of possible poor data quality in extreme data and prevent the issues upstream, or will we have to accept the reality of downstream cleansing of data at the time of consumption?

We might add a sixth reason, the rise of extreme data, to the current Top 5 Reasons for Downstream Cleansing.

Business and Pleasure

The data quality and master data management (MDM) realm has many wistful songs about unrequited love with “the business”.

This morning I noticed yet another tweet on Twitter expressing the pain:

Here Gartner analyst Ted Friedman foresees the doom of MDM if we don’t get at least the traction from “the business” that BI (Business Intelligence) is getting.

In my eyes everything we do in Information Technology is about “the business”. Even computer games and digital entertainment are a core part of the respective industries. I also believe that IT is part of “the business”.

“The rest of the business” does see that some disciplines belong in the IT realm. This goes for database management, programming languages and network protocols. These disciplines are not doomed at all because of that. “The rest of the business” couldn’t work today without these things around.

Certainly I have seen some IT based disciplines and related tools emerge and then be doomed during my years in the IT business. Anyone remember CASE tools?

With CASE tools I remember great expectations about business involvement in application design. But according to Wikipedia the main problems with CASE tools are (were): inadequate standardization, unrealistic expectations, slow implementation and weak repository controls.

In other words: “The rest of the business” never really got in touch with the CASE tools because they didn’t work as intended.

The business traction we see around BI (and the enabling tools) now is in my eyes very much because the tools have matured, actually work, have become more user friendly and seem to create useful results for “the rest of the business”.

Data quality tools and MDM tools must continue to follow that direction too, because for sure: data quality tools and MDM tools do not solve any severe problems internally in the IT part of “the business”.

It’s my pleasure being part of that.

The Art in Data Matching

I’ve just investigated a suspicious customer data match:

A Company on Kunstlaan no 99 in Brussel

was matched with high confidence with:

The Company on Avenue des Arts no 99 in Bruxelles

At first glance it perhaps didn’t look like a confident match, but I guess the computer is right.

The diverse facts are:

  • Brussels is the Belgian capital
  • Belgium has two main languages: French and Flemish (a variant of Dutch)
  • Some parts of the country are French-speaking, some parts are Flemish-speaking and the capital is both
  • Brussels is Bruxelles in French and Brussel in Flemish
  • Kunst is Flemish meaning Art (as in Dutch, German and Scandinavian too)
  • Laan is Flemish meaning Avenue (same origin as Lane I guess)
  • Avenue des Arts is French meaning Avenue of Art (French is easy)

Technically the computer in this case did as follows:

  • Compared the names like “A Company” and “The Company” and found a close edit distance between the two names.
  • Remembered from some earlier occasions that “Kunstlaan” and “Avenue des Arts” were accepted as a match.
  • Remembered from numerous earlier occasions that “Brussel” (or “Brüssel”) and “Bruxelles” were accepted as a match.
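The steps above can be sketched roughly as follows. This is not the actual matching program: the learned pairs are hard-coded in a lookup table instead of being learned probabilistically, and the record layout, the function names and the 0.7 name threshold are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    """Edit distance scaled to 0..1, where 1 means identical (case-insensitive)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


# Stand-in for what the program "remembered" from earlier accepted matches.
LEARNED_EQUIVALENTS = {
    frozenset({"kunstlaan", "avenue des arts"}),
    frozenset({"brussel", "bruxelles"}),
    frozenset({"brüssel", "bruxelles"}),
}


def same_or_learned(a, b):
    """True if the values are equal or were previously accepted as a match."""
    a, b = a.lower(), b.lower()
    return a == b or frozenset({a, b}) in LEARNED_EQUIVALENTS


def is_match(rec_a, rec_b, name_threshold=0.7):
    """Combine fuzzy name comparison with learned street and city equivalents."""
    return (
        name_similarity(rec_a["name"], rec_b["name"]) >= name_threshold
        and same_or_learned(rec_a["street"], rec_b["street"])
        and rec_a["house_number"] == rec_b["house_number"]
        and same_or_learned(rec_a["city"], rec_b["city"])
    )


flemish = {"name": "A Company", "street": "Kunstlaan",
           "house_number": "99", "city": "Brussel"}
french = {"name": "The Company", "street": "Avenue des Arts",
          "house_number": "99", "city": "Bruxelles"}

print(is_match(flemish, french))
```

Without the learned pairs the street and city comparison would fail here, since plain edit distance sees “Kunstlaan” and “Avenue des Arts” as very different strings.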

It may also have been told beforehand that “Kunstlaan” and “Avenue des Arts” are two names of the same street by some Belgian address reference data, which I guess is a must when doing heavy data matching on the Belgian market.

In this case it was a global match environment not equipped with worldwide address reference data, so luckily the probabilistic learning element in the computer program saved the day.

Technology and Maturity

A recurring subject for me and many others is talking and writing about people, processes and technology including which one is most important, in what sequence they must be addressed and, which is my main concern, how they must be aligned.

As we practically always refer to the three elements in the same order (people, processes and technology), there is certainly an implicit sequence.

If we look at maturity models related to data quality we will recognize that order too.

In the low maturity levels people are the most important aspect and the subject that needs the first and most attention. People are also the main enablers for starting to move up the levels.

Then in the middle levels processes are the main concerns as business process reengineering enables going up the levels.

At the top levels we see implemented technology as a main component in the description of being there.    

An example of the growing role of technology is (not surprisingly of course) in the data governance maturity model from the data quality tool vendor DataFlux.

One thing is sure though: You can’t move your organization from the low level to the high level by buying a lot of technology.

It is an evolutionary journey where the technology part comes naturally step by step by taking over more and more of the either trivial or extremely complex work done by people and where technology becomes an increasingly integrated and automated part of the business processes.
