How to Avoid True Positives in Data Matching

Now, this blog post title might sound silly, as we generally consider true positives to be the cream of data matching. A true positive means that we have found a match between two data records reflecting the same real-world entity, the match has been confirmed as true, and based on that we can eliminate a harmful and costly duplicate in our records.

The reason this still isn’t an optimal situation is that the duplicate shouldn’t have entered our data store in the first place. Avoiding duplicates up front is by far the best option.

So, how do you do that?

You may aim for low-latency duplicate prevention by catching duplicates in (near) real time, with duplicate checks running after records have been captured but before they are committed to whatever data store holds the entities in question. Still, this is also about finding true positives while at the same time being aware of false positives. A sketch of such a pre-commit check follows below.
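To make this concrete, here is a minimal sketch of what such a pre-commit duplicate check could look like. It is not a description of any particular product: the record fields, the similarity measure (Python’s standard-library difflib) and the 0.8 threshold are all illustrative assumptions.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; real solutions use more robust matching."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def find_probable_duplicates(candidate: dict, existing_records: list[dict],
                             threshold: float = 0.8) -> list[tuple[dict, float]]:
    """Return existing records that look like the same real-world entity."""
    key = f"{candidate['name']} {candidate['address']}"
    hits = []
    for record in existing_records:
        score = similarity(key, f"{record['name']} {record['address']}")
        if score >= threshold:
            hits.append((record, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Usage: flag or block the insert when probable duplicates are found,
# instead of committing the new record to the data store right away.
existing = [{"name": "Acme Corp", "address": "1 Main St, Springfield"}]
new_record = {"name": "ACME Corporation", "address": "1 Main Street, Springfield"}
for record, score in find_probable_duplicates(new_record, existing):
    print(f"Probable duplicate: {record['name']} (similarity {score:.2f})")
```

The point of running this before the commit, rather than as a batch job afterwards, is that a human (or a rule) can resolve the probable duplicate while the context is still fresh.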

Killing Keystrokes

The best way is to aim for instant data quality. That is, instead of keying in data for the (supposedly) new records, you pick the data from data stores already available, presumably in the cloud, through an error-tolerant search that covers external data as well as the records already in the internal data store.
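As an illustration only, here is a rough sketch of that “search before create” idea: the typed query is matched error-tolerantly against both internal records and an external reference source, and the user picks a candidate instead of re-keying it. The source names, sample records and scoring are my own assumptions, not the actual instant Data Quality solution.

```python
from difflib import SequenceMatcher

# Illustrative stand-ins for the internal data store and an external directory.
INTERNAL_RECORDS = [
    {"source": "internal", "name": "Liliendahl Consulting", "city": "Copenhagen"},
]
EXTERNAL_REFERENCE = [
    {"source": "external", "name": "Liliendahl Consulting ApS", "city": "København"},
    {"source": "external", "name": "Lillehammer Consulting AS", "city": "Oslo"},
]

def score(query: str, name: str) -> float:
    """Fuzzy similarity between the typed query and a candidate name."""
    return SequenceMatcher(None, query.lower(), name.lower()).ratio()

def search_before_create(query: str, limit: int = 5) -> list[dict]:
    """Rank internal and external candidates so the user can pick, not type."""
    candidates = INTERNAL_RECORDS + EXTERNAL_REFERENCE
    ranked = sorted(candidates, key=lambda rec: score(query, rec["name"]), reverse=True)
    return ranked[:limit]

# Usage: even a misspelled query surfaces the right party to pick from.
for hit in search_before_create("Lilliendal Consulting"):
    print(f"{hit['source']:>8}: {hit['name']} ({hit['city']})")
```

Picking an existing, already-validated record kills the keystrokes that would otherwise create the duplicate in the first place.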

This is exactly the kind of solution I’m working with right now. And oh yes, it is actually called instant Data Quality.


2 thoughts on “How to Avoid True Positives in Data Matching”

  1. Dave Chamberlain 26th February 2013 / 13:12

    It’s an important point that cannot be made often enough. The programming equivalent has been understood for a long time, the further down the development cycle errors are discovered the more they cost to fix. The same is true of data – the further down the life-cycle data gets the more expensive it is to fix and the more potentially disruptive its effect becomes. At the extreme, when bad data is eventually consumed by a BI or analytic application, and poor business decisions are made, the cost can be millions of times the cost of fixing the data at source.

    • Henrik Liliendahl Sørensen 27th February 2013 / 13:02

      Thanks for commenting Dave. Indeed, instant data quality is very cost effective.
