instant Data Quality

My last blog post was all about how data quality issues are in most cases solved by cleansing data downstream in the data flow within an enterprise, and the reasons for doing so.

However, solving the issues upstream wherever possible is of course the better option. I am therefore very optimistic about a project I’m involved in called instant Data Quality.

The project is about how we can help system users with data entry by adding some easy-to-use technology that explores the cloud for relevant data related to the entry being made. Doing that has two main purposes:

  • Data entry becomes more effective. Less cumbersome investigation and fewer keystrokes.
  • Data quality is safeguarded by better real world alignment.

The combination of a more effective business process that also results in better data quality seems to be good – like a sugar-coated vitamin pill. By the way: The vitamin pill metaphor also serves well as vitamin pills should be supplemented by a healthy lifestyle. It’s the same with data management.

Improving data quality through better real world alignment may go beyond the usual goal of data quality, which is meeting the requirements for the intended purpose of use. This means that you instantly get more by doing less.

Top 5 Reasons for Downstream Cleansing

I guess every data and information quality professional agrees that, when fighting bad data, upstream prevention is better than downstream cleansing.

Nevertheless, most work in fighting bad data quality is done as downstream cleansing. Not least, data quality tools are mostly deployed downstream, where tools outperform manual work in heavy duty data profiling and data matching, as explained in the post Data Quality Tools Revealed.

In my experience the top 5 reasons for doing downstream cleansing are:

1) Upstream prevention wasn’t done

This is an obvious one. By the time you decide to do something about bad data quality the right way, by finding the root causes, improving business processes, affecting people’s attitudes, building a data quality firewall and all that jazz, you still have to do something about the bad data already in the databases.

2) New purposes show up

Data quality is said to be about data being fit for purpose and meeting the business requirements. But new purposes will show up and new requirements have to be met in an ever changing business environment.  Therefore you will have to deal with Unpredictable Inaccuracy.

3) Dealing with externally born data

Upstream isn’t necessarily in your company, as data in many cases is entered Outside Your Jurisdiction.

4) A merger/acquisition strikes

When data from two organizations with different requirements and different levels of data governance maturity is to be merged, something has to be done. Some of the challenges are explained in the post Merging Customer Master Data.

5) Migration happens

Moving data from an old system to a new system is a good chance to do something about poor data quality and start all over the right way, and often you can’t even migrate some data without improving the data quality. You only have to figure out when to cleanse in data migration.

The Sound of Soundex

Probably the oldest and most used error tolerant algorithm in searching and data matching is a phonetic algorithm called Soundex. If you are not familiar with Soundex: Wikipedia to the rescue here.

In the LinkedIn group Data Matching we seem to have an ongoing discussion about the usefulness of Soundex. Link to the discussion here – if you are not already a member: Please join, spammers are dealt with, though it is OK to brag about your data matching superiority.

To sum up the discussion on Soundex I think we at this stage may conclude:

  • Soundex is of course very poor compared to the more advanced algorithms, but it may be better than nothing (which would be exact searching and matching)
  • Soundex (or a variant of Soundex) may be used for indexing in order to select candidates to be scored with better algorithms.
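To make the indexing idea concrete, here is a minimal sketch of the classic American Soundex coding together with a small candidate index keyed on the Soundex code. The function names `soundex`, `build_index` and `candidates` are made up for this illustration and are not taken from any particular tool. The better scoring algorithm that would then rank the candidates is deliberately left out.

```python
from collections import defaultdict

# Letter groups of classic American Soundex; vowels, H, W and Y have no digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the four character Soundex code, or "" if there are no A-Z letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        # H and W do not separate letters with the same digit; vowels do.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]


def build_index(names):
    """Group names by Soundex code so that lookups only touch one bucket."""
    index = defaultdict(list)
    for name in names:
        index[soundex(name)].append(name)
    return index


def candidates(index, query):
    """Return the names sharing the query's Soundex code (to be scored later)."""
    return index.get(soundex(query), [])


if __name__ == "__main__":
    idx = build_index(["Smith", "Smyth", "Schmidt", "Jones"])
    print(candidates(idx, "Smithe"))
```

The point of the index is that a query is compared only with the rows in its own bucket instead of with the whole table. The price is that a true match whose Soundex code differs (for example because of a different first letter) is never selected as a candidate.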

Let’s say you are going to match 100 rows with names and addresses against a table with 100 million rows with names and addresses, and let’s say that the real world individuals behind the 100 rows are in fact all represented among the 100 million, but not necessarily spelled the same.

Your results may look like this:

  • If you use exact automated matching you may find 40 matching rows (40 %).
  • If you use automated matching with (a variant of) Soundex you may find 95 matching rows, but only 70 rows (70 %) are correct matches (true positives) as 25 rows (25 %) are incorrect matches (false positives).
  • If you use automated matching with (a variant of) Soundex indexing and advanced algorithm for scoring you may find 75 matching rows where 70 rows (70 %) are correct matches (true positives) and 5 rows (5 %) are incorrect matches (false positives).
  • By tuning the advanced algorithm you may find 67 matching rows where 65 rows (65 %) are correct matches (true positives) and 2 rows (2 %) are incorrect matches (false positives).

So when using Soundex you will find more matching rows but you will also find more manual work in verifying the results. Adding an advanced algorithm may reduce the manual work or eliminate manual work at the cost of some not found matches (false negatives) and the risk of a few wrong matches (false positives).
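The trade-off can be expressed as precision (the share of found matches that are correct) and recall (the share of real matches that are found). The sketch below only reworks the example figures given above. The `Scenario` class is a hypothetical helper, and treating all 40 exact matches as correct is an assumption, since the example does not say so explicitly.

```python
from dataclasses import dataclass

EXPECTED_MATCHES = 100  # every one of the 100 rows has a real counterpart


@dataclass(frozen=True)
class Scenario:
    label: str
    true_positives: int
    false_positives: int

    @property
    def precision(self) -> float:
        """Correct matches divided by all matches found."""
        return self.true_positives / (self.true_positives + self.false_positives)

    @property
    def recall(self) -> float:
        """Correct matches divided by all matches that exist."""
        return self.true_positives / EXPECTED_MATCHES

    @property
    def false_negatives(self) -> int:
        """Real matches that were not found."""
        return EXPECTED_MATCHES - self.true_positives


SCENARIOS = [
    Scenario("exact matching", 40, 0),  # assumption: no false positives
    Scenario("Soundex only", 70, 25),
    Scenario("Soundex index + advanced scoring", 70, 5),
    Scenario("tuned advanced scoring", 65, 2),
]

if __name__ == "__main__":
    for s in SCENARIOS:
        print(f"{s.label:<34} precision={s.precision:.1%} "
              f"recall={s.recall:.1%} missed={s.false_negatives}")
```

With these figures, Soundex alone gives a precision of about 74 %, adding the advanced scoring raises it to about 93 %, and tuning raises it to about 97 % while recall drops from 70 % to 65 %.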

PS: I have a page about other Match Techniques including standardization, synonyms and probabilistic learning.

PPS: When googling whether the title of this blog post has been used before, I found this article from a fellow countryman.

Data Quality Tools: The Cygnets in Information Quality

Since engaging in the social media community around data and information quality I have noticed quite a lot of mobbing aimed at data quality tools. The sentiment seems to be that data quality tools are no good and will play only a very small role, if any, in solving the data and information quality conundrum.

I like to think of data quality tools as being like the cygnet (the young swan) in the fairy tale “The Ugly Duckling” by Hans Christian Andersen. An immature clumsy flapper in the barnyard. And sure, until now tools have generally not been ready to fly, but have mostly been situated in the downstream corner of the landscape.

Since last September I have been involved in making a new data quality tool. The tool is based on the principles described in the post Data Quality from the Cloud.

We have now seen the first test flights in the real world and I am absolutely thrilled about the testimonials. Examples:

  • “It (the tool) is lean”.  I like that since lean is a production practice that considers the expenditure of resources for any goal other than the creation of value for the end customer to be wasteful.
  • “It is gold”. I like to consider that as a calculated positive business case.
  • “It is the best thing happened in my period of employment”. I think happy people are essential to data quality.

Paraphrasing Andersen: I never dreamed there could be so much happiness, when I was working with ugly ducklings.

Complicated Matters

A while ago I wrote a short blog post about a tweet from the Gartner analyst Ted Friedman saying that clients are disappointed with the ability to support wide deployment of complex business rules in popular data quality tools.

Speaking of popular data quality tools: on the DataFlux Community of Experts blog, DataQualityPro founder Dylan Jones posted a piece this Friday asking: Are Your Data Quality Rules Complex Enough?

Dylan says: “Many people I speak to still rely primarily on basic data profiling as the backbone of their data quality efforts”.

The classic answers to the challenge of complex business rules are:

  • Relying on people to enforce complex business rules. Unfortunately people are not as consistent in enforcing complex rules as computer programs are.
  • Making less complex business rules. Unfortunately the complexity may be your competitive advantage.

In my eyes there is no doubt that data quality tool vendors have a great opportunity in researching and developing tools that are better at deploying complex business rules. In my current involvement in doing so we work with features such as:

  • Deployment as Service Oriented Architecture components. More on this topic here.
  • Integrating multiple external sources. Further explained here.
  • Combining the best algorithms. Example here.

3 out of 10

Just before I left for summer vacation I noticed a tweet by MDM guru Aaron Zornes saying:

This is a subject very close to me as I have worked a lot with business directory matching during the last 15 years, not least matching with the D&B WorldBase.

The problem is that if you match your B2B customers, suppliers and other business partners with a business directory like the D&B WorldBase you could naively expect a 100% match.

If your result is only a 30% hit rate, the question is: How many among the remaining 70% are false negatives and how many are true negatives?

True negatives

There may be a lot of reasons for true negatives, namely:

  • Your business entity isn’t listed in the business directory. Some countries, like those of the old Czechoslovakia, some English speaking countries in the Pacific, the Nordic countries and others, have a tight public registration of companies, while registration is less tight in North America, other European countries and the rest of the world.
  • Your supposed business entity isn’t a business entity. Many B2B customer/prospect tables hold a lot of entities that are not formal business entities but other types of party master data.
  • Uniqueness may be defined differently in the business directory and in your table to be matched. This includes the perception of hierarchies of legal entities and branches – not least governmental and local authority bodies are a fuzzy crowd. Different roles, such as those of small business owners, are also a challenge. The same is true of roles such as franchisees and the use of trading styles.

False negatives

In business directory matching the false negatives are those records that should have been matched by an automated function, but aren’t.

The number of false negatives is a measure of the effectiveness of the automated matching tool(s) and rules applied. Big companies often use the magic quadrant leaders in data quality tools, but these aren’t necessarily the best tools for business directory matching.

Personally I have found that you need a very complex mix of tools and rules for getting a decent match rate in business directory matching, including combining both deterministic and probabilistic matching. Some different techniques are explained in more detail here.
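As a rough illustration of combining the two approaches, the sketch below first tries a deterministic rule (a shared registration number) and then falls back to a probabilistic score in the style of Fellegi-Sunter match weights. Everything specific here is an assumption made for the example: the field names, the `reg_no` key, the m and u probabilities and the threshold are invented and would have to be estimated from real data.

```python
import math

# Hypothetical (m, u) pairs per field:
#   m = probability that the field agrees when two records are the same entity
#   u = probability that the field agrees when they are different entities
FIELD_WEIGHTS = {
    "name": (0.95, 0.01),
    "city": (0.90, 0.10),
    "postcode": (0.90, 0.05),
}


def probabilistic_score(a: dict, b: dict) -> float:
    """Sum log2 agreement or disagreement weights over the compared fields."""
    score = 0.0
    for field, (m, u) in FIELD_WEIGHTS.items():
        # Simplification: a missing value counts as a disagreement.
        if a.get(field) and a.get(field) == b.get(field):
            score += math.log2(m / u)
        else:
            score += math.log2((1 - m) / (1 - u))
    return score


def match(record: dict, candidate: dict, threshold: float = 5.0):
    """Return how the pair matched: "deterministic", "probabilistic" or None."""
    # Deterministic rule: an identical registration number settles it.
    if record.get("reg_no") and record.get("reg_no") == candidate.get("reg_no"):
        return "deterministic"
    if probabilistic_score(record, candidate) >= threshold:
        return "probabilistic"
    return None


if __name__ == "__main__":
    ours = {"name": "acme ltd", "city": "leeds", "postcode": "ls1 4ap"}
    theirs = {"name": "acme ltd", "city": "bradford", "postcode": "ls1 4ap"}
    print(match(ours, theirs), round(probabilistic_score(ours, theirs), 2))
```

In practice the field comparison would itself be error tolerant rather than an exact equality test, which is one reason the real mix of tools and rules gets complex.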

Feasible Names and Addresses

Most data quality technology was born in relation to the direct marketing industry back in the good old offline days. Main objectives have been deduplication of names and addresses and making names and addresses fit for mailing.

When working with data quality you have to embrace the full scope of business value in the data, here being the names and addresses.

Back in the 90’s I worked with an international fund raising organization. A main activity was sending direct mails with greeting cards for optional sale, with motifs related to seasonal feasts. Deduplication was a must regardless of the country (though the means were very different, but that’s for another day). Obviously the timing of the campaigns and the motifs on the cards were different between countries, but also within the countries based on the names and addresses.

Two examples:

German addresses

When selecting motifs for Christmas cards it’s important to observe that Protestantism is concentrated in the north and east of the country and Roman Catholicism is concentrated in the south and west. (If you think I’m out of season, well, such campaigns are planned in summertime). So, in the North and East most people prefer Christmas cards with secular motifs such as a lovely winter landscape. In the South and West most people will like a motif with Madonna and Child. Having well organized addresses with a connection to demographics was important.

Malaysian names

Malaysia is a very multi-ethnic society. The two largest groups, the ethnic Malays and the Malaysians of Chinese descent, have different seasonal feasts. The best way of handling this in order to fulfill the business model was to assign the names and addresses to the different campaigns based on whether the name was an ethnic Malay name or a Chinese name. Surely an exercise on the edge of what I earlier described in the post What’s in a Given Name?

Four Different Data Matching Stage Types

One of the activities I do in my leisure time is cycling. As a consequence I guess I also like to watch cycling on TV (or on the computer), not least the highlight of the cycling year: Le Tour de France.

In Le Tour de France you basically have four different types of stages:

  • Time trial
  • Stages on flat terrain
  • Stages through hilly landscape
  • Stages in the high mountains

Some riders are specialists in one of the stage types and some riders are more all-around types.

With automated data matching, which is what I do the most in my business time, there are basically also four different types of processes:

  • Internal deduplication of rows inside one table
  • Removal of rows in one table which also appear in another table
  • Consolidation of rows from several tables
  • Reference matching with rows in one table against another (big) table
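The four process types can be sketched side by side as below. To keep the example short, a naive normalized key stands in for real error tolerant matching, and survivorship is reduced to "first row wins". Both are simplifying assumptions, and all function names are made up for the illustration.

```python
def match_key(row: str) -> str:
    """Naive match key: lower case, commas and full stops dropped, spaces collapsed."""
    return " ".join(row.lower().replace(",", " ").replace(".", " ").split())


def deduplicate(rows):
    """Internal deduplication: keep the first row seen for each match key."""
    survivors = {}
    for row in rows:
        survivors.setdefault(match_key(row), row)
    return list(survivors.values())


def remove(rows, suppression):
    """Removal: drop rows that also appear in a suppression table."""
    blocked = {match_key(r) for r in suppression}
    return [r for r in rows if match_key(r) not in blocked]


def consolidate(*tables):
    """Consolidation: merge several tables into one deduplicated list."""
    return deduplicate([row for table in tables for row in table])


def reference_match(rows, reference):
    """Reference matching: map each row to its reference row, or None."""
    lookup = {match_key(r): r for r in reference}
    return {row: lookup.get(match_key(row)) for row in rows}


if __name__ == "__main__":
    mailing = ["John Smith, 1 Main St.", "john smith 1 main st", "Ann Lee, 2 Oak Rd"]
    print(deduplicate(mailing))
    print(remove(mailing, ["ANN LEE 2 OAK RD"]))
```

The four functions share the same match key, which mirrors the point that the matching technique can be the same while the process around it, and what counts as a good result, differs.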

Internal deduplication

Examples of data matching objectives here are finding duplicates in names and addresses before sending a direct mail or finding the same products in a material master.

The big question in this type of process is whether you are able to strike a balance between not making any false positives (being too aggressive) and not leaving too many false negatives behind (losing the game). You also have to think about survivorship when merging into a golden record.

In Le Tour de France the overall leader who gets the yellow jersey has to make a good time trial.

Removal

Here the examples of data matching objectives will be eliminating nixies (people who don’t want offerings by mail) before sending a direct mail or eliminating bad payers (people you don’t want to offer a credit).

Probably the easiest process, which everyone can do – but at the end of the day some are better sprinters than others.

The best sprinter in Le Tour de France gets the green jersey.

Consolidation

When migrating databases and/or building a master data hub you often have to merge rows from several different tables into a golden copy.

Here you often see the difficulty of making data fit for the immediate purpose of use and at the same time aligned with the real world, in order to also be able to handle the needs that arise tomorrow.

Often some of the young riders in Le Tour de France make an escape when climbing the hills, and one of them gets the white jersey.

Reference match

Doing business directory matching has been a focus area of mine, including building a solution for matching with the D&B WorldBase. The WorldBase holds over 165 million rows representing business entities from all over the world.

The results from automated matching with such directories may vary a lot, just as you see huge time differences in Le Tour de France when the riders face the big mountains. Here the best climber gets the polka dot jersey.

Data Quality is an Ingredient, not an Entrée

Fortunately it is more and more recognized that you don’t get success with Business Intelligence, Customer Relationship Management, Master Data Management, Service Oriented Architecture and many more disciplines without starting with improving your data quality.

But it would be a big mistake to see Data Quality improvement as an entrée before the main course, be it BI, CRM, MDM, SOA or whatever is on the menu. You have to have ongoing prevention against having your data polluted again over time.

Improving and maintaining data quality involves people, processes and technology. Now, I am not neglecting the people and process side, but as my expertise is in the technology part I would like to mention some of the technological ingredients that help with keeping data quality at a tasty level in your IT implementations.

Mashups

Many data quality flaws are (not surprisingly) introduced at data entry. Enterprise data mashups with external reference data may help during data entry, like:

  • An address may be suggested from an external source.
  • A business entity may be picked from an external business directory.
  • Various rules exist in different countries for using consumer/citizen directories – why not use the best available where you do business.

External ID’s

Getting the data entry right at the root is important, and most (if not all) data quality professionals agree that this is a superior approach compared to doing cleansing operations downstream.

The problem, however, is that most data erodes as time passes. What was right at the time of capture will at some point in time not be right anymore.

Therefore data entry ideally must not only be a snapshot of correct information but should also include raw data elements that make the data easily maintainable.

Error tolerant search

A common workflow when in-house personnel enter new customers, suppliers, purchased products and other master data is that you first search the database for a match. If the entity is not found, you create a new entity. When the search fails to find an actual match, we have a classic and frequent cause of introducing duplicates.

An error tolerant search is able to find matches despite spelling differences, alternatively arranged words, various concatenations and many other challenges we face when searching for names, addresses and descriptions.
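A minimal sketch of such a search, using only Python’s standard library, could look like this. It handles rearranged words by sorting the tokens and handles spelling differences with `difflib.SequenceMatcher`. The function names and the 0.8 threshold are assumptions chosen for the example, not settings from any particular tool, and concatenations (such as “JohnSmith”) are not handled.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lower case, replace punctuation with spaces and sort the words."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return " ".join(sorted(cleaned.split()))


def similarity(a: str, b: str) -> float:
    """Similarity between 0 and 1 on the normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def search(query: str, records, threshold: float = 0.8):
    """Return the records scoring at or above the threshold, best first."""
    scored = [(similarity(query, record), record) for record in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for score, record in scored if score >= threshold]


if __name__ == "__main__":
    existing = ["Smith, John", "Jon Smyth", "Acme Corporation"]
    hits = search("John Smith", existing)
    # Only create a new entity when the error tolerant search finds nothing.
    print(hits if hits else "no match found, create new entity")
```

Comparing the query with every record, as done here, only works for small tables. For larger ones the search would first need an index to select candidates.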

Seeing Is Believing

One of my regular activities as a practice manager at a data quality tool vendor is making what we call a “Test Report”.

Such a “Test Report” is a preferable presale activity regardless of whether we are up against a competitor or the option of doing nothing (or no more) to improve data quality. In the latter case I usually name our competitor “Laissez-Faire”.

Most of the test reports I do revolve around the most frequent data quality issue, which is duplicates in party master data – names and addresses.

Looking at what an advanced data matching tool can do with your customer master data and other business partner registries is often the decisive factor for choosing to implement the tool.

I like to do the test with a full extract of all current party master data.

A “Test Report” has two major outcomes:

  • Quantifying the estimated number of different types of duplicates, which is the basis for calculating expected Return on Investment for implementing such an advanced data matching tool.
  • Qualifying both some typical and some special examples in order to point at the tuning efforts needed both for an initial match and the recommended ongoing prevention.

When participating in follow-up meetings I have found that discussions around what a tool can do (and not do) are much more sensible when backed up by concrete numbers and concrete examples with your particular data.
