Tomorrow’s Data Quality Tool

In a blog post called JUDGEMENT DAY FOR DATA QUALITY, published yesterday, Forrester analyst Michele Goetz writes about the future of data quality tools.

Michele says:

“Data quality tools need to expand and support data management beyond the data warehouse, ETL, and point of capture cleansing.”

and continues:

“The real test will be how data quality tools can do what they do best regardless of the data management landscape.”

As described in the post Data Quality Tools Revealed, there are two things data quality tools do better than other tools:

  • Data profiling and
  • Data matching

Some of the new challenges I have worked with when designing tomorrow’s data quality tools are:

  • Point of capture profiling
  • Searching using data matching techniques
  • Embracing social networks

Point of capture profiling:

The sweet thing about profiling your data while you are entering it is that analysis and cleansing become part of the on-boarding business process. The emphasis moves from correction to assistance, as explained in the post Avoiding Contact Data Entry Flaws. Exploiting big external reference data sources at the point of capture is a core element in getting it right before judgment day.
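
A minimal sketch of what assistance at point of capture could look like is shown below; the reference list, the field and the function names are invented for the example, not taken from any particular tool:

```python
# A minimal sketch of point of capture assistance. The reference data source,
# the field (city names) and the function names are hypothetical placeholders.
REFERENCE_CITIES = {"copenhagen": "København", "malmo": "Malmö", "london": "London"}

def assist_city_entry(raw_value: str) -> dict:
    """Suggest a standardized city name while the user is still typing."""
    key = raw_value.strip().lower()
    if key in REFERENCE_CITIES:
        return {"status": "ok", "suggestion": REFERENCE_CITIES[key]}
    # Offer the closest candidates instead of rejecting the entry outright.
    candidates = [v for k, v in REFERENCE_CITIES.items() if k.startswith(key[:3])]
    return {"status": "assist", "suggestion": candidates}

print(assist_city_entry("Copenhagen"))  # {'status': 'ok', 'suggestion': 'København'}
print(assist_city_entry("Copenhagn"))   # {'status': 'assist', 'suggestion': ['København']}
```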

Searching using data matching techniques:

Error tolerant searching is often the forgotten capability when the core features of Master Data Management solutions and data quality tools are outlined. Applying error tolerant search to big reference data sources is, as examined in the post The Big Search Opportunity, a necessity for getting it right before judgment day.
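
As a rough illustration, here is a tiny error tolerant search over a made-up reference set, using difflib from the Python standard library as a stand-in for a real data matching engine:

```python
# A minimal sketch of error tolerant search over a tiny, hypothetical
# reference data set; a real solution would use a proper matching engine.
import difflib

REFERENCE_NAMES = ["Michele Goetz", "Michel August", "Michelle Gates", "Mikkel Godtfredsen"]

def error_tolerant_search(query: str, limit: int = 3) -> list[str]:
    """Return the closest reference entries even if the query is misspelled."""
    return difflib.get_close_matches(query, REFERENCE_NAMES, n=limit, cutoff=0.6)

print(error_tolerant_search("Michele Gotz"))  # still finds "Michele Goetz"
```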

Embracing social networks:

The growth of social networks in recent years has been almost unbelievable. Traditionally data matching has been about comparing names and addresses. As told in the post Addressing Digital Identity, it will be a must to be able to link the new systems of engagement with the old systems of record in order to get it right before judgment day.

How have you prepared for judgment day?

The Cases for UPPER CASE in Data Management

I remember some years ago, when I started SMS’ing, I had an old mobile phone that defaulted to upper case text. After a while my son answered back: “Why are you always yelling at me in SMSes?”

So I learned that you can use lower case in SMSes as well, and that using all caps in SMSes, as in any other writing, usually means that YOU ARE YELLING.

Examining a text for upper case use can, together with polarity classifiers and all that jazz, be used today in sentiment analysis, for example within social media data.
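
As a simple illustration, the share of upper case characters can be computed as one feature among many; the 0.7 threshold below is an arbitrary assumption:

```python
# A small sketch of using upper case as one signal in sentiment analysis.
def shouting_score(text: str) -> float:
    """Share of alphabetic characters that are upper case."""
    letters = [c for c in text if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

def is_probably_shouting(text: str) -> bool:
    return shouting_score(text) > 0.7  # arbitrary threshold for the example

print(is_probably_shouting("WHY ARE YOU ALWAYS YELLING AT ME"))  # True
print(is_probably_shouting("Why are you always yelling at me"))  # False
```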

Within data parsing, words in upper case in person names may tell you something too. In France especially, it is common to indicate a surname with upper case characters only, so for example in the name “AUGUST Michel” the first word is the surname and the second word is the given name.
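
A minimal sketch of exploiting that convention when parsing a name could look like this; real name parsing of course needs far more rules:

```python
# A simplified sketch of parsing a French style name where the surname is
# written in upper case, as in "AUGUST Michel".
def split_french_name(full_name: str) -> dict:
    tokens = full_name.split()
    surname = [t for t in tokens if t.isupper() and len(t) > 1]
    given = [t for t in tokens if t not in surname]
    return {"surname": " ".join(surname), "given_name": " ".join(given)}

print(split_french_name("AUGUST Michel"))
# {'surname': 'AUGUST', 'given_name': 'Michel'}
```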

When matching company names, a word in upper case may indicate an abbreviation. So “THE Ltd” and “The Happy Entrepreneur Ltd” may be a good match despite a horrible edit distance.
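
One way to handle this is to compare the upper case word against the initials of the longer name, as in this small sketch (the list of legal forms is just an example):

```python
# A minimal sketch of treating an upper case word as a possible abbreviation
# when matching company names, so "THE Ltd" can match "The Happy Entrepreneur Ltd".
LEGAL_FORMS = {"ltd", "inc", "gmbh", "a/s"}  # example list, far from complete

def abbreviation_match(short_name: str, long_name: str) -> bool:
    short_core = [t for t in short_name.split() if t.lower() not in LEGAL_FORMS]
    long_core = [t for t in long_name.split() if t.lower() not in LEGAL_FORMS]
    if len(short_core) != 1 or not short_core[0].isupper():
        return False
    initials = "".join(t[0].upper() for t in long_core)
    return short_core[0] == initials

print(abbreviation_match("THE Ltd", "The Happy Entrepreneur Ltd"))  # True
```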

In data migration, when handling names from older systems where all caps have been used, it is common to try to produce better looking names. “JOHN SMITH” will be “John Smith” and “SAM MCCLOUD” should be “Sam McCloud”. In environments with alphabets other than English, national characters may be reintroduced as well. For example, in a German context “JURGEN VON LOW” may come out as “Jürgen von Löw”.
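
A simplified sketch of such proper-casing could look like the following; the particle list, the Mc handling and the reference lookup for national characters are assumptions for the example, not a complete solution:

```python
# A minimal sketch of turning all caps names into better looking names.
PARTICLES = {"von", "van", "de", "di", "der"}          # simplified assumption
REFERENCE_SPELLINGS = {"jurgen von low": "Jürgen von Löw"}  # hypothetical reference data

def propercase_name(name: str) -> str:
    looked_up = REFERENCE_SPELLINGS.get(name.strip().lower())
    if looked_up:
        return looked_up  # national characters reintroduced from reference data
    words = []
    for w in name.lower().split():
        if w in PARTICLES:
            words.append(w)                            # keep particles in lower case
        elif w.startswith("mc") and len(w) > 2:
            words.append("Mc" + w[2:].capitalize())    # SAM MCCLOUD -> Sam McCloud
        else:
            words.append(w.capitalize())
    return " ".join(words)

print(propercase_name("JOHN SMITH"))      # John Smith
print(propercase_name("SAM MCCLOUD"))     # Sam McCloud
print(propercase_name("JURGEN VON LOW"))  # Jürgen von Löw
```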

What about you? Have you stumbled upon some fun with upper case in data management?

Data Quality at Terminal Velocity

Recently the investment bank Saxo Bank made a marketing gimmick with a video showing a BASE jumper trading foreign currency with the bank’s mobile app at terminal velocity (i.e. the maximum speed when free falling).

Today business decisions have to be taken faster and faster in the quest to stay ahead of the competition.

When making business decisions you rely on data quality.

Traditionally data quality improvement has been made by downstream cleansing, meaning that data has been corrected a long time after data capture. There may be some good reasons for that, as explained in the post Top 5 Reasons for Downstream Cleansing.

But most data quality practitioners will say that data quality prevention upstream, at data capture, is better.

I agree; it is better. Also, it is faster. And it supports faster decision making.

The most prominent domain for data quality improvement has always been customer and other party master data. Also in this quest we need instant data quality, as explained in the post Reference Data at Work in the Cloud.

Broken Links

When passing the results of data cleansing activities back to source systems I have often encountered what one might call broken links, which have called for designing data flows that don’t go by the book, don’t match the first picture of the real world and eventually prompt last minute alternate ways of doing things.

I have had the same experience when passing some real (and not real) world bridges lately.

The Trembling Lady: An Unsound Bridge

When I was walking around in London, a sign on the Albert Bridge caught my eye. The sign instructs troops to break step when marching over.

While researching the Albert Bridge on Wikipedia I learned that the bridge has an unsound construction that makes it vibrate, not least when a bunch of troops marches across in rhythm. The bridge has therefore got the nickname “The Trembling Lady”.

It’s an old sign. The bridge is an old bridge. But it’s still standing.

In the same way we often have to deal with old systems running on unstable databases with unsound data models. That’s life. Though it’s not the way we want to see it, we must break the rhythm of otherwise perfectly cleansed data, as discussed in the post Storing a Single Version of the Truth.

The Øresund Bridge: The Sound Link

The sound between the city of Malmö in Sweden and København (Copenhagen) in Denmark can be crossed by the Øresund Bridge. If you look at a satellite picture you may conclude that the bridge isn’t finished. That’s because a part of the link is in fact an undersea tunnel, as told in the post Geocoding from 100 Feet Under.

Your first impression of what can and can’t be done isn’t always the way of the world. Dig into some more sources, find some more charts and you may find a way.

However, life isn’t always easy. Sometimes charts and maps can be deceiving.

Wodna: The Sound of Silence

As reported in the post Troubled Bridge over Water I planned a cycling trip last summer. The route would take us across the Polish river Świna by a bridge I found on Google Maps.

When, after a hard day’s ride in the saddle, we reached the river, the bridge wasn’t there. We had to take a ferry across the river instead.

Maybe I should have known. The bridge on the map was named Wodna. That is Polish for (something with) water.

Marco Polo and Data Provenance

Besides being a data geek I am also interested in pre-modern history. So it’s always nice when I’m able to combine data management and history.

A recurring subject in historian circles is the suspicion that the explorer Marco Polo never actually went to China.

As said in the linked article from The Telegraph: “It is more likely that the Venetian merchant adventurer picked up second-hand stories of China, Japan and the Mongol Empire from Persian merchants whom he met on the shores of the Black Sea – thousands of miles short of the Orient”.

When dealing with data and ramping up data quality, a frequent challenge is that some data wasn’t captured by the data consumer – not even by the organization using the data. Some of the data stored in company databases is second-hand data, and in some cases the overwhelming part of the data is captured outside the organization.

As with the book about Marco Polo’s (alleged) travels, “Description of the World”, this doesn’t mean that you can’t trust anything. But maybe some data are mixed up a bit and maybe some obvious data are missing.

I have earlier touched on this subject in the post Outside Your Jurisdiction and identified second-hand data as one of the Top 5 Reasons for Downstream Cleansing.

What’s best: Safe or sorry?

As I have now moved much closer to downtown I have also changed my car accordingly, so two months ago I squeezed myself into a brand new city car, the Fiat Nuova Cinquecento.

(Un)fortunately the car dealer’s service department called the other day and said a part of the engine had to be replaced because there could be a problem with that part. The manufacturer must have calculated that it’s cheaper (and maybe a better customer experience) to be proactive rather than being reactive and dealing with the problem if it should occur with my car later.

(Un)fortunately that’s not the way we usually do it with possible data problems. So, back to work again. Someone’s direct marketing data just crashed in the middle of a campaign.    

Extreme Data Quality

This blog post is inspired by reading a blog post called Extreme Data by Mike Pilcher. Mike is COO at SAND, a leading provider of columnar database technology.

The post circles around a Gartner approach to extreme data. While the concept of “Big Data” is focused on the volume of data, the concept of “Extreme Data” also takes into account the velocity and the variety of data.

So how do we handle data quality with extreme data being data of great variety moving at high velocity and coming in huge volumes? Will we be able to chase down all root causes of possible poor data quality in extreme data and prevent the issues upstream, or will we have to accept the reality of downstream cleansing of data at the time of consumption?

We might add a sixth reason, the rise of extreme data, to the current Top 5 Reasons for Downstream Cleansing.
