Building an instant Data Quality Service for Quotes

Yesterday’s post, Introducing the Famous Person Quote Checker, touched upon the issue of all the quotes floating around in social media about things supposedly said by famous persons.

The bumblebee can’t fly faster than the speed of light – Albert Einstein

If you were to build a service that could avoid postings with disputable quotes, what considerations would you have then? Well, I guess pretty much the same considerations as with any other data quality prevention service.

Here are three things to consider:

Getting the reference data right

Finding the right sources for, say, reference data for world-wide postal addresses was discussed in the post A Universal Challenge.

The same way, so to speak, it will be hard to find a single source of truth about what famous persons actually said. It will be a daunting task to make a registry of confirmed quotes.

Embracing diversity

Staying with postal addresses this blog has a post called Where the Streets have one Name but Two Spellings.

The same way, so to speak again, quotes are translated, transliterated and have gone through transcription from the original language and writing system. So every quote may have many true versions.

Where to put the check?

As examined in the post The Good, Better and Best Way of Avoiding Duplicates there are three options:

1)      A good and simple option could be to periodically scan through postings in social media and, when a disputable quote is found, send an email to the culprit who posted it. However, it’s probably too late: even if you delete your tweet, the 250 retweets will still be out there. But it’s a reasonable way to start marking up all the disputable quotes out there.

2)      A better option could be a real-time check. You type in a quote on a social media site and the service prompts you: “Hey Dude, that person didn’t say that”. The weak point is that you already did all the typing, and now you have to find a new quote. But it will work when people try to share disputable quotes.

3)    The best option would be that you start typing “If you can’t explain it simply… “ and the service prompts a likely quote as: “Everything should be as simple as it can be, but not simpler – Albert Einstein”.
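The third option can be sketched in a few lines. This is only a minimal illustration under my own assumptions: the registry structure, the `suggest` function and the `min_chars` cut-off are hypothetical, and the single registry entry simply reuses the example above. Each entry keeps the confirmed wording together with known circulating versions, since every quote may have many versions.

```python
import re

# Hypothetical mini-registry. A real one would need curated sources;
# this entry just mirrors the example from the post.
QUOTE_REGISTRY = [
    {
        "confirmed": "Everything should be as simple as it can be, but not simpler",
        "person": "Albert Einstein",
        "variants": [
            "If you can't explain it simply, you don't understand it well enough",
        ],
    },
]


def normalize(text):
    """Lower-case, unify apostrophes and drop punctuation such as ellipses."""
    text = text.lower().replace("’", "'")
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return " ".join(text.split())


def suggest(typed, registry=QUOTE_REGISTRY, min_chars=10):
    """Return (confirmed quote, person) pairs whose known versions start with the typed text."""
    key = normalize(typed)
    if len(key) < min_chars:  # too little typed to suggest anything sensible
        return []
    hits = []
    for entry in registry:
        versions = [entry["confirmed"], *entry["variants"]]
        if any(normalize(v).startswith(key) for v in versions):
            hits.append((entry["confirmed"], entry["person"]))
    return hits


print(suggest("If you can’t explain it simply… "))
```

Plain prefix matching is the simplest possible choice here. Handling translated and transliterated versions would need something considerably more tolerant.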


What’s New in The Data Quality Magic Quadrant?

The Gartner Magic Quadrant for Data Quality Tools 2013 is out. If you don’t want to pay Gartner’s fee for having a look, you can sign up for a free copy on one of the vendors’ websites, for example here at Trillium Software Insights.

So, what’s new this year?

It is pretty much the same picture as last year, with X88 as the only new intruder. Otherwise the news is that some vendors “now appear under slightly different names”. And now Ted Friedman is the only author.

The most exciting part, in my eyes, is the words about how the market will develop. Some seen and foreseen trends are:

  • Information governance programs drive the need for data quality tools.
  • Cloud based deployments are gaining traction.
  • Growth is expected from embracing less-structured data, not least social data, by using big data techniques and sources.

That’s good news.



Somehow Deduplication won’t Stick

18 years ago I cruised into the data quality realm when making my first deduplication tool. Then it was an attempt to solve a business case of two companies who were considering merging and wanted to know the intersection of customers. So far, so good.
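To give an idea of what finding such an intersection involves, here is a minimal sketch using only Python’s standard library. It is not the tool I built back then: the function names, the sample names and the 0.85 threshold are all assumptions made for illustration, and a real matching tool would compare addresses and other attributes too, not just names.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case and strip simple punctuation so 'Acme Corp.' equals 'ACME Corp'."""
    return " ".join(name.lower().replace(".", " ").replace(",", " ").split())


def similarity(a, b):
    """Character-based similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def customer_intersection(customers_a, customers_b, threshold=0.85):
    """Pair each customer in A with its best match in B if it is similar enough."""
    matches = []
    for a in customers_a:
        best = max(customers_b, key=lambda b: similarity(a, b), default=None)
        if best is not None and similarity(a, best) >= threshold:
            matches.append((a, best))
    return matches


# Made-up sample data for the two merging companies.
company_a = ["Acme Corp.", "John Smith", "Nordic Foods"]
company_b = ["ACME Corp", "Jon Smith", "Baltic Shipping"]
print(customer_intersection(company_a, company_b))
```

Comparing every record with every other record like this only works for small files. Larger volumes need some way of limiting the candidate pairs first.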

Since then I have worked intensively with deduplication and other data matching tools and approaches and also co-authored a leading eLearning course on the matter as seen here.

Deduplication capability is a core feature of many data quality tools, and indeed probably the most mentioned data quality pain is lack of uniqueness, not least in party master data management.

However, in my experience most deduplication efforts don’t stick. Yes, we can process a file ready for direct marketing and purge the messages that might end up in the same offline or online inbox despite spelling differences. But taking it from there and using the techniques to achieve a single customer view is another story. Some obstacles are:

In the comments to the latter, 3-year-old post the intersection (and non-intersection) of Entity Resolution and Master Data Management (MDM) was discussed.

During my latest work I have become more and more convinced that achieving a single view of something is a lot about entity resolution as expressed in the post The Good, Better and Best Way of Avoiding Duplicates.


Think global from day one

The title of this post is taken from a blog post by Hans Peter Bech. The post is called Entering a Foreign Market – The 9 Steps to Success for Software Companies.


In the post Hans Peter says:

“German software companies having access to 7% of world demand and US based companies with a domestic market representing 38% of world demand often ignore the global perspective until forced to face the challenge. That’s very fortunate for the smaller companies from the smaller countries!”

This observation from the software market in general certainly also applies to software for data quality improvement and master data management as examined in the post 255 Reasons for Data Quality Diversity.

If you are a software company in the data management space, thinking global may apply to various activities, such as:

  • How the product is designed with respect to handling data from all over the world. Here thinking global from day one is crucial.
  • How the product is marketed to a world-wide audience. Here the global approach could wait a bit.

On the latter matter I have teased one of the magic quadrant data quality tool vendors, Trillium Software, for using a date format on their blog that is only used in the United States. Maybe it’s a small matter and it’s just me who is sensitive to this normal glitch. Anyway, I’m pleased to congratulate Trillium Software on their new blog design with a date format that fits world-wide. Check out the blog, which is a good one indeed, here.
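Why a date format matters can be shown with a small sketch. The date below is just a made-up example and `possible_readings` is a hypothetical helper: it tries both the month-first reading used in the United States and the day-first reading used in much of the rest of the world.

```python
from datetime import date, datetime


def possible_readings(text):
    """Return every calendar date that a slash-separated date string could mean."""
    readings = set()
    for fmt in ("%m/%d/%Y", "%d/%m/%Y"):  # month-first, then day-first
        try:
            readings.add(datetime.strptime(text, fmt).date())
        except ValueError:
            pass  # not a valid date under this reading
    return sorted(readings)


print(possible_readings("11/12/2013"))   # two different dates are possible
print(possible_readings("25/12/2013"))   # only the day-first reading is valid
print(date(2013, 11, 12).isoformat())    # ISO 8601 style: 2013-11-12
```

A year-month-day format, or spelling out the month name, leaves only one reading wherever the reader comes from.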


On Washing Rental Cars and Shared Data

Recently a tweet from Doug Laney of Gartner has been retweeted a lot:

Rented Car

Like most analogies, it may or may not fit depending on the perspective. Actually, rental cars are probably some of the most washed cars, as the rental company washes and cleans the car between every rental.

In the same way as rental cars usually are quite clean I have also found that sharing data is a powerful way to have clean data as told on the page about Data Quality 3.0. This is also the grounding concept behind the instant Data Quality solution I’m working with, where we have just released our iDQ™ MDM Edition.


Where the Streets have one Name but Two Spellings

Last week’s post called Where The Streets have Two Names caught a lot of comments both on this blog and in LinkedIn groups, as here on Data Quality Professionals and on The Data Quality Association, with a lot of examples from around the world of how this challenge actually exists more or less everywhere.

Recently I had the pleasure of experiencing a variant of the challenge when driving around in a rented car in the Saint Petersburg area in Russia. Here the streets usually have only one name, but it may be presented in two different alphabets: the local Cyrillic, or the Latin alphabet I’m used to, which was also included in the reference data on the Sat Nav. So while it was nice for me to type destinations in Latin letters, it was also nice to have directions in Cyrillic in order to follow the progress on road signs.

So here standardization (or standardisation) to one preferred language, alphabet or script system isn’t the best solution. Best of breed solutions for handling addresses must be able to handle several right spellings for the same address.

Street sign in Cyrillic with Latin subtitle
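Handling several right spellings for the same address can be sketched as one street record holding a spelling per script, with every spelling pointing back to the same record. The record layout, the identifier and the function names below are my own assumptions for illustration, not how any particular Sat Nav or address tool does it.

```python
# Hypothetical reference data: one street, one id, two correct spellings.
STREETS = [
    {
        "id": "spb-nevsky",
        "spellings": {"Cyrl": "Невский проспект", "Latn": "Nevsky Prospekt"},
    },
]


def build_index(streets):
    """Map every spelling, case-insensitively, to its street record."""
    index = {}
    for street in streets:
        for spelling in street["spellings"].values():
            index[spelling.casefold()] = street
    return index


INDEX = build_index(STREETS)


def find_street(typed):
    """Look up a street by any of its spellings; None if unknown."""
    return INDEX.get(typed.strip().casefold())


def spelling_in(street, script):
    """Return the street name in the requested script (e.g. to match road signs)."""
    return street["spellings"][script]


street = find_street("nevsky prospekt")   # typed in Latin letters
print(spelling_in(street, "Cyrl"))         # shown as on the road sign
```

The point is that neither spelling is treated as the standard one. Both are stored as equally valid names for the same street.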


Where the Streets have Two Names

As told in the post The Art in Data Matching, a common challenge in matching names and addresses is that in some parts of the world the streets have more than one name at the same time because more than one language is in use.

We have the same challenge when building functionality for rapid addressing, being functionality that facilitates fast and quality assured entry of addresses supported by reference data that knows about postal codes / cities and street names.

The below example is taken from the instant Data Quality tool address form:

Finish Swedish

The Finnish capital Helsinki also has an official name in Swedish being Helsingfors and the streets in Helsinki/Helsingfors have both Finnish and Swedish names. So when you start typing a letter suggestions could be in both Finnish and Swedish.
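As a rough sketch of how such suggestions could work (this is not the actual iDQ implementation), each street can carry a name per language, and whatever is typed is checked against both. The two sample streets and the `suggest_streets` function are assumptions chosen for illustration.

```python
# Small illustrative sample of Helsinki/Helsingfors streets with both official names.
HELSINKI_STREETS = [
    {"fi": "Mannerheimintie", "sv": "Mannerheimvägen"},
    {"fi": "Aleksanterinkatu", "sv": "Alexandersgatan"},
]


def suggest_streets(prefix, streets=HELSINKI_STREETS):
    """Return (name, language) pairs where the name starts with the typed prefix."""
    key = prefix.casefold()
    suggestions = []
    for street in streets:
        for lang in ("fi", "sv"):
            name = street[lang]
            if name.casefold().startswith(key):
                suggestions.append((name, lang))
    return suggestions


print(suggest_streets("Man"))    # both the Finnish and the Swedish name
print(suggest_streets("Alex"))   # only the Swedish name matches
```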

What challenges have you encountered with street names in multiple languages?


The Data Enrichment ABC

A popular and indeed valuable method of avoiding decay of data quality in customer master data and other master data entities is setting up data enrichment services based on third party reference data sources. Examples of such services are:

  • Relocation updates like National Change Of Address services from postal services
  • Change of name, address and a variety of status updates from business directories and in some countries citizen directories too

When using such services you will typically want to consider the following options for how to deal with the updates:

A: Automatic Update

Here your internal master data will be updated automatically when a change is received from the external reference data source.

C: Excluded Update

Here an automated rule will exclude the update, as there may be a range of reasons why you don’t want to update certain entity segments under certain circumstances.

B: Interactive Update

Here the update will require a form of manual intervention either to be fulfilled or excluded based on human decision.

An example will be if a utility supplier receives a relocation update for the occupier at an installation address. This will trigger/support a complex business process far beyond changing the billing address.
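The three options can be pictured as a simple routing rule applied to each incoming update. The sketch below is only an illustration under my own assumptions: the `Update` fields, the segment and field names and the two rule sets are hypothetical, and real rules would be far richer.

```python
from dataclasses import dataclass
from enum import Enum


class Handling(Enum):
    AUTOMATIC = "A"    # apply the change directly
    INTERACTIVE = "B"  # queue for a human decision
    EXCLUDED = "C"     # drop the change by rule


@dataclass
class Update:
    entity_id: str
    segment: str     # e.g. a customer segment in the master data
    field: str       # which attribute the external source wants to change
    new_value: str


# Hypothetical rule sets; the names are made up for this sketch.
EXCLUDED_SEGMENTS = {"key_account"}
INTERACTIVE_FIELDS = {"installation_occupier"}


def route(update):
    """Decide how an update from the external reference source is handled."""
    if update.segment in EXCLUDED_SEGMENTS:
        return Handling.EXCLUDED
    if update.field in INTERACTIVE_FIELDS:
        return Handling.INTERACTIVE
    return Handling.AUTOMATIC


relocation = Update("cust-1", "household", "installation_occupier", "moved out")
print(route(relocation))  # the utility example ends up with a human
```

Checking the exclusion rules first is a design choice in this sketch, so that an excluded segment never reaches a data steward’s queue.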


As explained in the post When Computer Says Maybe, we need functionality within data quality tools and Master Data Management (MDM) solutions to support data stewards in handling these situations cost effectively. This certainly also applies to the B option in data enrichment.

Right now I’m working with designing such data stewardship functionality within the instant Data Quality environment.


The Internet of Things and the Fat-Finger Syndrome

When coining the term “the Internet of Things” Kevin Ashton said:

“The problem is, people have limited time, attention and accuracy—all of which means they are not very good at capturing data about things in the real world.”

Indeed, many, many data quality flaws are due to a human typing the wrong thing. We usually don’t do that intentionally. We do it because we are human.

Typographical errors, and the sometimes dramatic consequences, are often referred to as the “fat-finger syndrome”.

As reported in the post Killing Keystrokes, avoiding typing is a way forward, for example by sharing data instead of typing in the same data (a little bit differently) within every organization.

The Internet of Things, being common access to data provided by a huge number of well defined devices, is another development in avoiding typos.

It’s not that data coming from these devices can’t be flawed. As debated in the post Social Data vs Sensor Data there may be challenges in sensor data due to errors in a human setting up the sensors.

Also misunderstandings by humans in combining sensor data for analytics and predictions may cause consequences as bad as those based on the traditional fat-finger syndrome.

All in all, I guess we won’t see a decrease in the need to address data quality in the future. We will just need to use different approaches, methodologies and tools to fight bad data and information quality.

Are you interested in what all this will be about? Why not join the Big Data Quality group on LinkedIn?


Rapid Addressing, Structured or Unstructured Approach

Systems supporting faster and more accurate registration of addresses are becoming more and more common, and they are also becoming better and better.

I have noticed a structured and an unstructured approach to rapid addressing – and hybrids of course.

Structured Approach

The general concept is that you zero in on the address like this:

  • First you choose a country from a country list (unless it’s always the same country).
  • Then you select a state or province if a state or province is a mandatory part of an address in that country, like it is in the United States, Canada, Australia and India.
  • Then you type a postal code if the country has a postal code system. It may be suggested as you write.
  • Then you type a street if the country has thoroughfare based addressing. It may be suggested as you write. For some countries, like the United Kingdom, or parts of a country, the street is uniquely identified by the postal code.
  • Then you type a building number. It may be suggested if present in reference data.
  • Then you type a unit or other section of the building where applicable. It may be suggested if present in reference data.
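The steps above can be sketched as a cascade where every choice narrows the reference data for the next field. This is a simplified illustration, not a description of any particular product: the tiny reference set, the field order (with state/province and unit left out) and the `suggest_next` function are all my own assumptions.

```python
# Tiny illustrative reference data set; a real one holds millions of records.
REFERENCE = [
    {"country": "DK", "postal_code": "2100", "street": "Østerbrogade", "building": "10"},
    {"country": "DK", "postal_code": "2100", "street": "Østerbrogade", "building": "12"},
    {"country": "DK", "postal_code": "2100", "street": "Strandboulevarden", "building": "5"},
    {"country": "DK", "postal_code": "2200", "street": "Nørrebrogade", "building": "1"},
    {"country": "GB", "postal_code": "SW1A 2AA", "street": "Downing Street", "building": "10"},
]

# Simplified field order; state/province and unit are omitted in this sketch.
FIELD_ORDER = ["country", "postal_code", "street", "building"]


def suggest_next(selected, typed="", reference=REFERENCE):
    """Suggest values for the next field, given the fields already chosen in order."""
    if len(selected) >= len(FIELD_ORDER):
        return None, []  # the address is already complete
    next_field = FIELD_ORDER[len(selected)]
    # Keep only the records that agree with everything chosen so far.
    candidates = [r for r in reference if all(r[f] == v for f, v in selected.items())]
    prefix = typed.casefold()
    values = sorted({r[next_field] for r in candidates
                     if r[next_field].casefold().startswith(prefix)})
    return next_field, values


print(suggest_next({"country": "DK"}, "21"))
print(suggest_next({"country": "DK", "postal_code": "2100"}, "st"))
```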

Unstructured Approach

You type in the sequence in a single string as it suits you and the system figures out along the way what matches and makes suggestions.

This approach may better fit the way the address is known to you, but does on the other hand sometimes require you to start again and thereby the rapidness disappears a bit.
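A very simple way to picture the unstructured approach is to split whatever has been typed into words and keep the addresses where every typed word is the beginning of some word in the address, regardless of order. The sample addresses and the `suggest_addresses` function are assumptions for this sketch. Real systems rank the candidates and tolerate typos as well.

```python
# Illustrative sample of full addresses written as single strings.
ADDRESSES = [
    "Østerbrogade 10, 2100 København Ø, Denmark",
    "Nørrebrogade 1, 2200 København N, Denmark",
    "10 Downing Street, London SW1A 2AA, United Kingdom",
]


def words_of(text):
    """Split a string into case-folded words, ignoring commas."""
    return text.casefold().replace(",", " ").split()


def suggest_addresses(typed, reference=ADDRESSES, limit=5):
    """Return addresses where each typed word prefixes some word of the address."""
    tokens = words_of(typed)
    if not tokens:
        return []
    results = []
    for address in reference:
        address_words = words_of(address)
        if all(any(w.startswith(t) for w in address_words) for t in tokens):
            results.append(address)
    return results[:limit]


print(suggest_addresses("10 down"))      # building number first, as in the UK
print(suggest_addresses("2100 øster"))   # postal code first works too
```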

Hybrid Approach

A common hybrid solution is that you select the country before going unstructured. That cures the worst system glitches.

What’s Your Approach?

What are your experiences as a user? Maybe you are developing rapid addressing and have had your considerations. Where do you stand?
