Going Upstream in the Circle

One of the big trends in data quality improvement is moving from downstream cleansing to upstream prevention. So let’s talk about Amazon. No, not the online (book)store, but the river. Also, I am a bit tired of the fact that almost any mention of innovative IT is about that eShop.

A map showing the Amazon River drainage basin reveals what may turn out to be a huge challenge in going upstream and solving the data quality issues at the source: There may be a lot of sources. Okay, the Amazon is the world’s largest river (because it carries more water to the sea than any other river), so this may be a picture of the data streams in a very large organization. But even more modest organizations have many sources of data, just as more modest rivers also have several sources.

By the way: The Amazon River also shares a source with the Orinoco River through the natural Casiquiare Canal, just as many organizations also share sources of data.

Some sources are not so easy to reach, as the most distant source of the Amazon is a glacial stream on a snowcapped 5,597 m (18,363 ft) peak called Nevado Mismi in the Peruvian Andes.

Now, as I promised that the trend on this blog should be positivity and success in data quality improvement, I will not dwell on the amount of work involved in going upstream and preventing dirty data at every source.

I say: Go to the clouds. The clouds are the sources of the water in the river. Also, I think that cloud services will make it a lot easier to improve data quality, as explained in a recent post called Data Quality from the Cloud.

Finally, the clouds over the Amazon River sources are made from water evaporated from the Amazon and a lot of other waters as part of the water cycle. In the same way data has a cycle: it is derived as information and created in a new form as a result of the actions taken based on that information.

I think data quality work in the future will embrace the full data cycle: Downstream cleansing, upstream prevention and linking in the cloud.


Feasible Names and Addresses

Most data quality technology was born in relation to the direct marketing industry back in the good old offline days. The main objectives have been deduplication of names and addresses and making names and addresses fit for mailing.

When working with data quality you have to embrace the full scope of business value in the data, here being the names and addresses.

Back in the 90s I worked with an international fundraising organization. A main activity was sending direct mail with greeting cards for optional sale, with motifs related to seasonal feasts. Deduplication was a must regardless of the country (though the means were very different, but that’s for another day). Obviously the timing of the campaigns and the motifs on the cards differed between countries, but also within the countries based on the names and addresses.

Two examples:

German addresses

When selecting motifs for Christmas cards it’s important to observe that Protestantism is concentrated in the north and east of the country and Roman Catholicism is concentrated in the south and west. (If you think I’m out of season, well, such campaigns are planned in summertime.) So, in the north and east most people prefer Christmas cards with secular motifs such as a lovely winter landscape. In the south and west most people will like a motif with the Madonna and Child. Having well-organized addresses with a connection to demographics was important.

Malaysian names

Malaysia is a very multi-ethnic society. The two largest groups, the ethnic Malays and the Malaysians of Chinese descent, have different seasonal feasts. The best way of handling this in order to fulfill the business model was to assign the names and addresses to the different campaigns based on whether the name was an ethnic Malay name or a Chinese name. Surely an exercise on the edge of what I earlier described in the post What’s in a Given Name?


Did They Put a Man on the Moon?

Recently I have been reading some blog posts circling around having a national ID for citizens in the United States including a post from Steve Sarsfield and another post from Jeffrey Huth of Initiate.

In Denmark, where I live, we have had such a national ID for about half a century. So if you are a vendor with a great solution for data matching and master data management in healthcare and want to approach a Danish prospect in healthcare (which is mainly public sector here), they will tell you that the solution looks really nice, but they don’t have that problem. You can’t stay many seconds as a patient in a Danish hospital before you are asked to provide your national ID. And if you arrived inside your mother, you will be given an ID for life within seconds after you are born.

The same national ID is the basis when we have elections. Some weeks before, the authorities push the button and every person with the right status and age gets a ballot. Therefore we are in disbelief when, every fourth year, we follow the United States electing a president and learn about all the mess in voter registration.

Is that happening in the nation that put a man on the moon in 1969? Or did they? Was it after all a studio recording?


Real World Alignment

I am currently involved in a data management program dealing with multi-entity (multi-domain) master data management described here.

Besides covering several different data domains, such as business partners, products, locations and timetables, the data also serves multiple purposes of use. The client is within public transit, so the subject areas are in terms such as production planning (scheduling), operation monitoring, fare collection and use of service.

A key principle is that the same data should only be stored once, but in a way that makes it serve as high quality information in the different contexts. Doing that often means balancing between the two ways data may be of high quality:

  • Either they are fit for their intended uses
  • Or they correctly represent the real-world construct to which they refer

Some of the balancing has been:

Customer Identification

For some intended uses you don’t have to know the precise identity of a passenger. For other intended uses you must know the identity. The latter cases at my client include giving discounts based on age and transport needs, such as when attending educational activities. Also, knowing the identity helps when fighting fraud. So the data governance policy (and a business rule) is that customers for most products must provide a national identification number.

Like it or not: Having the ID makes a lot of things easier. Uniqueness isn’t a big challenge like in many other master data programs. It is also a straightforward process when you want to enrich your data. An example here is accurately geocoding where your customers live, which is rather essential when you provide transportation services.

What geocode?

You may use a range of different coordinate systems to express a position, as explained here on Wikipedia. Some systems refer to a round globe (and yes, the real world, the earth, is round), but it is a lot easier to use a system like the one called UTM, where you may easily calculate the distance between two points directly in meters, assuming the real world is as flat as your computer screen.
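As a minimal sketch of that flat-world convenience: once two positions are expressed as easting/northing in meters within the same UTM zone (the coordinates below are made up for illustration), the distance is plain Euclidean geometry:

```python
import math

def utm_distance(easting1, northing1, easting2, northing2):
    """Flat-plane (Euclidean) distance in meters between two points
    given in the same UTM zone, where coordinates are already in meters."""
    return math.hypot(easting2 - easting1, northing2 - northing1)

# Two hypothetical positions in the same UTM zone (meters)
d = utm_distance(720000, 6170000, 723000, 6174000)
print(round(d))  # 5000
```

This only holds within one zone; comparing points across UTM zones, or needing geodesic accuracy over long distances, puts you back on the round globe.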



A Really Bad Address

Many years ago I worked in a midsize insurance company. At that time IT made a huge change in insurance pricing, since it was now possible to differentiate prices based on a lot of factors known to the databases.

The CEO decided that our company should also make some new pricing models based on where the customer lived, since it was perceived that you were more exposed to having your car stolen and your house burgled if you lived in a big city as opposed to a quiet countryside home. But then the question: What exactly should the prices be, and where are the borderlines?

We, the data people, eagerly ran to the keyboard and fired up the newly purchased executive decision tool from SAS Institute. And yes, there was a different story based on postal code series, and especially downtown Copenhagen was really bad (I am from Denmark, where Copenhagen is the capital and largest city).

Curiously we examined smaller areas in downtown Copenhagen. The result: It wasn’t the crime-ridden red light district that was bad; it was addresses in the business part that hurt the most. OK, more expensive cars and belongings there, we guessed.

Narrowing down more, we were shocked. It was the street of the company that was really, really bad. And last: It was a customer with the very same house number as the company that had a lot of damage claims attached.

Investigating a bit more, the case was solved. All payments made to specialists doing damage reporting all over the country were attached to a fictitious customer at the company address.

After cleansing the data the picture wasn’t that bad. Downtown Copenhagen is worse than the countryside, but not by that much. But surprisingly the CEO didn’t use our data; he merely adopted the pricing model from the leading competitors.

I’m still wondering how those companies did the analysis. They all had headquarters addresses in the same business area.



Citizen ID within seconds

Here is a picture of my grandson Jonas taken minutes after he was born. He has a ribbon around his wrist showing his citizen ID, which has just been assigned. There is even a barcode for it on the ribbon.

Now, I have mixed feelings about that. It is indeed very impersonal. But as a data quality professional I do realize that this is a way of solving a problem at the root. Duplicate master data in healthcare is a serious problem, as Dylan Jones reported last year when his son was born, in this article from DataQualityPro.

A unique citizen ID (national identification number) assigned seconds after a birth has a lot of advantages. As said, it is a foundation for data quality in healthcare from the very start of a life. Later, when you get your first job, you hand the citizen ID to your employer and tax is collected automatically. When the rest of the money is in the bank, you are uniquely identified there. When you turn 18 you are seamlessly put on the electoral roll. Later, your marriage is merely a relation in a government database between your citizen ID and the citizen ID of your beloved one.

Oh joy, Master Data Management at the very best.



Post no. 100

This is post number 100 on this blog. Besides being a time for saying thank you to those who have read this blog, those who have re-tweeted the posts and, not least, those who have commented on the posts, it is also time for a recapitulation of my opinions (based on my experiences and observations) about data quality.

Let me emphasize three points:

  • Fit for purpose versus real world alignment
  • Diversity in data quality
  • The role of technology in data quality improvement

Fit for purpose versus real world alignment

According to Wikipedia data may be of high quality in two alternative ways:

  • Either they are fit for their intended uses
  • Or they correctly represent the real-world construct to which they refer

My thesis is that, as you include more and more purposes, there is a break-even point where it becomes less cumbersome to reflect the real-world object than to try to align with all the known purposes.

This theme is so far covered in 19 posts and pages including:

Diversity in data quality

International and multi-cultural aspects of data quality improvement have been a favorite topic of mine for a long time.

While working with data quality tools and services for many years I have found that many tools and services are very national. So you might discover that a tool or service will work wonders with data from one country, but be quite ordinary or in fact useless with data from another country.

I have made 15 posts on diversity in data quality so far including:

The role of technology in data quality improvement

You may become a data quality professional coming from either the business side or the technology side of practice. But more important, in my eyes, is whether you have made serious attempts at, and succeeded in, understanding the side from where you didn’t start. I have always strived to be a mixed-skilled person. As I have tried single-handedly to build a data quality tool – or, to be more specific, a data matching tool – I do of course write a lot about data quality technology.

This blog includes 37 posts on data quality technology so far including:


The Next Level

A quote about data quality from Thomas Redman says:

“It is a waste of effort to improve the quality (accuracy) of data no one ever uses.”

I learned the quote from Jim Harris, who mentioned it most recently in his post: DQ-Tip: “There is no point in monitoring data quality…”

In a comment, Phil Simon said: “I love that. I’m jealous that I didn’t think of something so smart.”

I’m guessing Phil was into some irony. If so, I can see why. The statement seems pretty obvious and at first glance you can’t imagine anyone taking the opposite stance: Let’s cleanse some data no one ever uses.

Also, I think it was meant as obvious in Redman’s book Data Driven.

Well, taking it to the next level I can think of the following elaboration:

  1. If you found some data that no one ever uses you should not only avoid improving the quality of that data, you should actually delete the data and make sure that no one uses time and resources for entering or importing the same data in the future.
  2. That is unless the reason that no one ever uses the data is that the quality of the data is poor. Then you must compare the benefits of improving the data against the costs of doing so. If costs are bigger, proceed with point 1. If benefits are bigger, go to point 3.
  3. It is not a waste of effort to improve the quality of some data no one ever uses.
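The three points above are really a small decision procedure. Here is a sketch of it as a function (the function name, the boolean flags and the cost/benefit inputs are mine, purely for illustration):

```python
def next_level_decision(is_used, unused_due_to_poor_quality,
                        improvement_benefit, improvement_cost):
    """Sketch of the elaborated rule for data no one ever uses."""
    if is_used:
        return "improve quality as usual"
    if not unused_due_to_poor_quality:
        # Point 1: truly unused data should be deleted, not improved,
        # and no one should spend resources entering it again
        return "delete the data and stop collecting it"
    if improvement_cost > improvement_benefit:
        # Point 2: poor quality but not worth fixing -> back to point 1
        return "delete the data and stop collecting it"
    # Point 3: improving unused data is not a waste after all
    return "improve the data quality"

print(next_level_decision(False, True,
                          improvement_benefit=100, improvement_cost=10))
# improve the data quality
```

Of course the hard part in practice is estimating the benefit and cost figures, not coding the branches.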


Whether Weather Forecasting or Not

Predicting ROI from a data quality program (and many other business initiatives) is like predicting the weather. Probably you are able to guess whether it is going to be good or bad, but most often you can’t predict exactly how good or bad it actually turns out.

The chances of predicting the weather right vary with the time of year and your location. I have the pleasure of living in a place (Denmark) where the weather is pretty unpredictable.

Well, winter is usually cold and summer is warm.

We also know that if we have easterly winds coming in from the Russian Steppe during winter, it turns very cold. In summer that wind will make beautiful hot sunny days. Westerly winds in the winter coming in from the Atlantic Ocean means temperatures above freezing. In summer that wind often has some chill and rain with it.

But these are the main scenarios. Between those rough generalizations there is a myriad of factors, events and not fully understood processes that make weather forecasting a chaotic discipline.

Making business cases for data quality programs has the same challenges. Well, at some spots on the globe (in some parts of the year) you can wake up every morning and be certain that it is going to be a hot sunny day. Likewise, a lot of business activities will without any doubt benefit from better data quality – no further forecasting needed. In other cases it may be uncertain. Here you may rely on previous experiences (case studies by others) and your position. You may outline a business case, and you could be right.

This morning at my place was forecast to be mostly cloudy but dry. It is damned cloudy and raining a bit.

Sticky Data Quality Flaws

Fighting data quality flaws is often most successfully done at data entry. Once incorrect information has been entered into the system, it most often seems nearly impossible to eliminate the falsehood.

A hilarious example is told in an article from telegraph.co.uk. A local council sent a letter to a woman’s pet pig (named Blossom Grant) offering the animal the chance to register to vote in last week’s UK election. This is only the culmination of a lot of letters – including tons of direct marketing – addressed to the pigsty. The pigsty was, according to the article, wrongly registered as a residence some years ago after a renovation. Since then the pig’s owner (named Pauline Grant) has tried to get the error corrected over and over again – but with no success.
