Superb Bad Data

When working with data and information quality we often use words such as rubbish, poor, bad and other negative words to describe data that need to be enhanced in order to achieve better data quality. However, what is bad may have been good in the context where a particular set of data originated.

Right now I am having some fun with author names.

An example of good and bad could involve an author I have used several times on this blog, namely the late fairy tale writer whose full name is:

Hans Christian Andersen

When browsing through data you will meet his name represented this way:

Andersen, Hans Christian

This representation is fit for the purpose of, for example, looking for a book by this author at a library, where the fiction is sorted by the author's surname.

The question is then: Do you want to have the one representation, the other representation or both?

You may also meet his name in yet another form, in a field other than the name field. For example, there is a main street in Copenhagen called:

H. C. Andersens Boulevard

This is the representation of the real-world name of the street, holding a common form of the author's name with only initials.
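The representations above can be derived from the full name under the simple (and admittedly naive) assumption that the last name token is the surname; a minimal sketch:

```python
# A minimal sketch, assuming the last token is the surname -
# an assumption that fails for many real-world names.
def to_sort_form(full_name: str) -> str:
    """'Hans Christian Andersen' -> 'Andersen, Hans Christian'"""
    parts = full_name.split()
    return f"{parts[-1]}, {' '.join(parts[:-1])}"

def to_initials_form(full_name: str) -> str:
    """'Hans Christian Andersen' -> 'H. C. Andersen'"""
    parts = full_name.split()
    initials = " ".join(f"{p[0]}." for p in parts[:-1])
    return f"{initials} {parts[-1]}"
```

Going the other way, from "H. C. Andersen" back to the full name, is of course not possible without reference data, which is part of why it may be worth keeping more than one representation.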


Diversity in Data Quality in 2010

Diversity in data quality is a favorite topic of mine and diversity has been my theme word in social media engagement this year.

Fortunately I’m not alone. Others have been writing about diversity in data quality in the past year. Here are some of the contributions I remember:

The Dutch data quality tool vendor Human Inference has a blog called Data Value Talk. Here several posts are about diversity in data quality including the post World Languages Day – Linguistic diversity rules in Switserland!

Another blog based in the Netherlands is from Graham Rhind. Graham (a Brit stranded in Amsterdam) is an expert in international issues with data quality and one of his blog posts this year is called Robert the Carrot.

The MDM Vendor IBM Initiate has a lively blog about Master Data Management and Data Quality. One of the posts this year was an introduction to a webinar. The post by Scott Schumacher (in which I’m proud to be mentioned) is called Join Us to Demystify Multi-Cultural Name Matching.

Rich Murnane posted a funny but instructive video with Derek Sivers about Japanese addresses called What is the name of that block? (Again, thanks Rich for the mention).

In the eLearningCurve free webinar series there was a very educational session with Kathy Hunter called Overcoming the Challenges of Global Data. There is also an interview with Kathy Hunter on the DataQualityPro site.

I also remember we debated the state of the art of data quality tools when it comes to international data in the post by Jim Harris called OOBE-DQ, Where Are You? As Jim mentions in his later post called Do you believe in Magic (Quadrants)?: “It must be noted that many vendors (including the “market leaders”) continue to struggle with their International OOBE-DQ”.

I guess that international capabilities in data quality tools and party master data management solutions will be on the agenda in 2011 as well.


Automation

The article on Wikipedia about automation begins like this:

“Automation is the use of control systems and information technologies to reduce the need for human work in the production of goods and services. In the scope of industrialization, automation is a step beyond mechanization. Whereas mechanization provided human operators with machinery to assist them with the muscular requirements of work, automation greatly decreases the need for human sensory and mental requirements as well. Automation plays an increasingly important role in the world economy.

Automation has had a notable impact in a wide range of industries beyond manufacturing (where it began). Once-ubiquitous telephone operators have been replaced largely by automated telephone switchboards and answering machines.”

Often we discuss the role of technology in solving data and information quality issues. Viewpoints differ between:

  • Technology may be part of the problem, but should not be part of the solution
  • Tools may solve a certain part of the problems by automating otherwise time-consuming processes

I am deliberately not stating the extreme viewpoint that tools (or a certain tool) will solve everything, as I have never actually seen or heard that viewpoint, as mentioned in the post Data Quality Tool Exaggerations.

So, given that range, my viewpoint is the latter of the two mentioned above.

If you should, surprisingly, have a more extreme viewpoint, you may go to the OCDQ Blog post called What Does Data Quality Technology Want? and vote for the second option there.


Referrers

I have earlier written about how search terms are one way people get to my blog in the post Picture This.

Another way is being referred from other sources. Lately WordPress, which is my blog service, has improved its statistics so that referring sources are consolidated, which gives you much more meaningful information about your referrers.

My current all-time statistics look like this:

At the time the total number of pageviews was 46,263.

LinkedIn seems to be my main supplier of readers. I am regularly sharing my posts as status updates and as news items in different LinkedIn groups.

But I do think the figures for Twitter are misleading, as they are counted based on where the tweets and re-tweets are read from. The Twitter figure probably covers only the twitter.com site. HootSuite is another way of reading and clicking on links to a blog in a tweet. People who read and click via TweetDeck are, as I understand it, not counted as a referring source, since TweetDeck is a desktop application.

Though I write in English, I do from time to time post user blogs and comments with links on Danish-language sources such as the local Computerworld and another online IT news site called Version2.

When someone (in my case mainly Rich Murnane, I think) StumblesUpon a blog post, you sometimes get a lot of pageviews within an hour or so.

Otherwise, Jim Harris's OCDQ Blog is a constant source of referrals, either due to Jim's kind links to my blog posts or my self-promoting links in comments on Jim's blog posts.


Christmas Tree Options

Today, the last Sunday before Christmas, seems to be a good day for selecting a Christmas tree.

We are considering two different options:

  • As most times before, we will find a tree as wide and tall as the room allows, so it may be decorated with as much as possible of the different stuff we have collected over the years as well as some of the precious things passed down from previous generations. It will be cut above the root, but that's not a problem since we will throw it away after Christmastide.
  • Another option is a smaller tree, root still on, planted in a pot. We will then have to select the decorations carefully. The advantage is that it can be reused on the terrace during the year and then, a little taller, serve as the Christmas tree again next year.

Well, not that different from the considerations about data quality, data warehouse and business intelligence projects and programs from my workdays.


Matching Down Under

As a data matching geek I always love reading about how others have made the great but fearful journey into the data matching world.

This week Wayne Colless of the Australian Attorney-General’s Department kindly made a document about data matching public on the DataQualityPro site. The full title is “Improving the Integrity of Identity Data – Data Matching Better Practice Guidelines, 2009”. Link here.

As Wayne explains in a discussion in the LinkedIn Data Matching group: Australia has no national unique identifier for individuals (such as the US SSN or the numbers recorded on national ID cards used in many other countries) that can be used, so the matching has to rely on non-unique values such as name, address and date of birth.

The document gives very thorough step-by-step guidance on matching individuals' names, addresses and dates of birth. As the document says, you may either build all the logic yourself or buy commercial software that does the same. Either way, you have to understand what the software does in order to tune the processes and set thresholds that are meaningful to you.
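Threshold-based matching on non-unique attributes can be sketched roughly like this; the similarity measure, weights and thresholds below are purely illustrative assumptions, not figures taken from the guidelines:

```python
from difflib import SequenceMatcher

# A rough sketch of threshold-based matching on non-unique attributes.
def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    # Weighted sum of per-field similarities; weights are assumptions.
    weights = {"name": 0.4, "address": 0.3, "dob": 0.3}
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in weights.items())

a = {"name": "Hans Christian Andersen", "address": "1 Main St", "dob": "1805-04-02"}
b = {"name": "H C Andersen", "address": "1 Main Street", "dob": "1805-04-02"}

score = match_score(a, b)
if score >= 0.9:            # upper threshold: automatic match
    decision = "match"
elif score >= 0.7:          # grey zone: route to clerical review
    decision = "clerical review"
else:
    decision = "non-match"
```

The grey zone between the two thresholds is where the clerical review the guidelines talk about comes in; tuning those thresholds is exactly the part you cannot skip even when buying commercial software.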

As Australia is a nation mainly born of immigration, the challenge of adapting the prevailing Anglo-Saxon naming conventions to the reality of name formats coming from all over the world is very apparent. I like that the diversity issues are given good thought in the document.

I also like that the document addresses a subject not mentioned as often as it should be, namely the challenge of embracing historical values when settling a match, as seen in this figure taken from the document:

Whether or not you think you already know the dos and don'ts of data matching (and I guess you never fully do), I really find the document worth reading.


Matching Light Bulbs

This morning I noticed this lightbulb joke in a tweet from @mortensax:

Besides finding it amusing I also related to it since I have used an example with light bulbs in a webinar about data matching as seen here:

The use of synonyms in Search Engine Optimization (SEO) is very similar to the techniques we use in data matching.

Here the problem is that, for example, these two product descriptions may have a fairly high edit distance (being very different character by character) but describe the same product:

  • Light bulb, A 19, 130 Volt long life, 60 W
  • Incandescent lamp, 60 Watt, A19, 130V

while these two product descriptions have an edit distance of only one character substitution, but are not the same product (though being in the same category):

  • Light bulb, 60 Watt, A 19, 130 Volt long life
  • Light bulb, 40 Watt, A 19, 130 Volt long life
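The edit distance referred to here is typically the Levenshtein distance; a minimal implementation sketch, not tied to any particular matching tool:

```python
# Levenshtein edit distance: minimum number of single-character
# insertions, deletions and substitutions to turn one string into another.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

same_product = ("Light bulb, A 19, 130 Volt long life, 60 W",
                "Incandescent lamp, 60 Watt, A19, 130V")
same_category = ("Light bulb, 60 Watt, A 19, 130 Volt long life",
                 "Light bulb, 40 Watt, A 19, 130 Volt long life")

print(levenshtein(*same_product))   # high distance, yet the same product
print(levenshtein(*same_category))  # distance of only 1, yet different products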

Working with product data matching is indeed very enlightening.
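The synonym technique mentioned above can be sketched as normalizing units and synonyms before comparing token sets; the particular rules and synonym choices below are illustrative assumptions:

```python
import re

# A minimal sketch of synonym/unit normalization before comparing
# product descriptions; the rules here are illustrative assumptions.
def normalize(description: str) -> set:
    text = description.lower()
    text = re.sub(r"(\d+)\s*(watt|w)\b", r"\1watt", text)   # unify wattage
    text = re.sub(r"(\d+)\s*(volt|v)\b", r"\1volt", text)   # unify voltage
    text = re.sub(r"\ba\s*19\b", "a19", text)               # unify shape code
    text = text.replace("incandescent lamp", "lamp").replace("light bulb", "lamp")
    tokens = set(re.split(r"[,\s]+", text))
    return tokens - {"", "long", "life"}                    # drop filler words
```

With this normalization the two same-product descriptions compare as equal token sets, while the 60 Watt and 40 Watt bulbs still differ.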


Now, where’s the undo button?

I have just read two blog posts about the dangers of deleting data in the good cause of making data quality improvements.

In his post Why Merging is Evil Scott Schumacher of IBM Initiate describes the horrors of using survivorship rules for merging two (or more) database rows recognized to reflect the same real world entity.

Jim Harris describes the insane practices of getting rid of unwanted data in the post A Confederacy Of Data Defects.

On a personal note, I have just had a related experience from outside the data management world. We have just relocated from a fairly large house to a modest sized apartment. Due to the downsizing and the good opportunity given by the migration, we discarded a lot of stuff in the process. Now we are in the process of buying replacements for those things we shouldn't have thrown away.

As Scott describes in his post about merging, there is an alternative approach to merging, namely linking, with some computational inefficiency attached. Also, in the cases described by Jim, we often don't dare to delete at the root, so instead we keep the original values and make a new cleansed copy without the supposedly unwanted data for the purpose at hand.
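The difference between the two approaches can be sketched like this; the record shapes and the "first non-empty value wins" survivorship rule are illustrative assumptions, not Scott's actual rules:

```python
# A minimal sketch contrasting destructive merge (survivorship)
# with non-destructive linking.
records = {
    1: {"name": "Hans Christian Andersen", "phone": None},
    2: {"name": "H. C. Andersen", "phone": "+45 1234 5678"},
}
links = []  # equivalence sets recorded by linking

def merge(ids):
    """Survivorship: keep the first non-empty value per field, then
    delete the originals - there is no undo button after this."""
    survivor = {}
    for rid in ids:
        for field, value in records[rid].items():
            if not survivor.get(field) and value:
                survivor[field] = value
    for rid in ids:
        del records[rid]
    records[max(ids) + 1] = survivor   # naive new id, for the sketch only
    return survivor

def link(ids):
    """Linking: record that the rows reflect the same real-world entity,
    keeping the originals intact (at some computational cost later)."""
    links.append(set(ids))
```

Calling link([1, 2]) keeps both original rows and merely records the equivalence; merge([1, 2]) produces a single survivor row, and the originals are gone.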

In my relocation project we could have rented a self-storage unit for all the supposedly not-so-needed stuff as well.

It’s a balance. As in all things data quality there isn’t a single right or wrong answer to what to do. And there will always be regrets. Now, where’s the undo button?


My 2011 To Do List

These days are classic times for predicting something about next year in a blog post. This year I will make some egocentric predictions about what I am going to do next year. Fortunately, I think these activities are pretty representative of the trends in the data quality realm.

My three most important challenges in working with data and information quality improvement and master data management will be:

Multi-Domain Master Data Quality

There are several different disciplines and product offerings around, such as:

  • Data Quality tools
  • Customer Data Integration (CDI) solutions
  • Product Information Management (PIM) platforms

These disciplines and the related software packages used to solve the challenges are constantly maturing and expanding to embrace the problems as a whole.

Find more about the subject in my posts on Multi-Domain MDM.

Exploiting rich external reference data sources in the cloud

Working with external reference sources as a means to improve data quality has been a focus area of mine for many years.

Recent developments in governments releasing rich sources of data will help with availability here, but new challenges will also arise, like working with conformity across data sources coming from many different countries in many different ways.

Much of the activity here will happen in the cloud.

See my take on the subject on the page Data Quality 3.0 and read about a concrete implementation in instant Data Quality.

Downstream data cleansing

Despite constant improvements in data quality tools and master data management solutions, moving us from downstream batch cleansing to upstream prevention, there will still be lots of reasons for doing downstream cleansing projects.

Here are the top 5 reasons.

I expect to be involved in at least one of each type next year.
