Big Data and Data Matching

Data matching has been an established discipline for many years. Most data quality tools have more or less sophisticated data matching features, and many MDM (Master Data Management) platforms have data matching capabilities as well.

The LinkedIn Big Data Quality group

In a way, the data matching realm has become slightly dull in recent years. People no longer get excited over a discussion about whether deterministic matching or probabilistic matching is the right way. Soundex is old, edit distance has been around for ages, and matchcodes may have outlived their usefulness.
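For readers who have not met these techniques, here is a minimal sketch of the difference between a deterministic comparison and an edit-distance comparison. The function names and the threshold of 2 are illustrative assumptions, and the threshold rule is only a simple fuzzy stand-in, not a full probabilistic matching model.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def normalize(value: str) -> str:
    """Trim and lowercase so trivial differences do not block a match."""
    return value.strip().lower()


def deterministic_match(a: str, b: str) -> bool:
    """Match only when the normalized values are exactly equal."""
    return normalize(a) == normalize(b)


def fuzzy_match(a: str, b: str, max_distance: int = 2) -> bool:
    """Match when the values are within max_distance edits
    (an assumed threshold, to be tuned per data set)."""
    return edit_distance(normalize(a), normalize(b)) <= max_distance


print(deterministic_match("Jon Smith", "John Smith"))  # exact comparison
print(fuzzy_match("Jon Smith", "John Smith"))          # tolerant comparison
```

The deterministic rule rejects the pair because of one missing letter, while the edit-distance rule accepts it. That trade-off between missed matches and false matches is what the old debates were about.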

So, it’s good to see a new beast turning up. Data matching with big data.

It may be about deduplicating (deduping) volumes that are bigger than traditional data matching can handle. You know: Dedoop’ing.

But it is also very much about matching big data with small data, first and foremost master data, and about having well-matched master data. Kimmo Kontra recently wrote a good post about that, called Big Grease, Big Data, and Big Apple – manholes and MDM.

The case presented by Kimmo holds many exciting implementations of data matching, such as proximity matching of locations.
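Proximity matching of locations can be sketched roughly as follows: compute the great-circle distance between an incoming location and each master data location, and keep those within a radius. This is not taken from Kimmo's post. The identifiers, coordinates and the 50 metre radius below are made-up assumptions for illustration only.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def proximity_matches(event, master_locations, radius_m=50.0):
    """Return (id, distance) pairs for master locations within radius_m
    of the event, nearest first. radius_m is an assumed tolerance."""
    lat, lon = event
    hits = []
    for loc_id, loc_lat, loc_lon in master_locations:
        distance = haversine_m(lat, lon, loc_lat, loc_lon)
        if distance <= radius_m:
            hits.append((loc_id, distance))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical master data: (id, latitude, longitude)
master = [
    ("MH-1", 40.7580, -73.9855),
    ("MH-2", 40.7585, -73.9855),
    ("MH-3", 40.7680, -73.9855),
]

# Hypothetical incoming event location
for loc_id, distance in proximity_matches((40.7581, -73.9855), master):
    print(loc_id, round(distance, 1))
```

Scanning every master location is fine for small master data. Matching big data volumes against it would normally call for a spatial index or grid-based blocking first.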


2 thoughts on “Big Data and Data Matching”

  1. Richard Ordowich 2nd April 2013 / 17:00

    Before looking at matching, it is critical to understand why the matching is required. What are the uses of the matched data? What is the required quality of the matched data? For each use, is there a need for an identical, similar, equivalent or corresponding match? Then it is necessary to define which factors determine what constitutes an identical, similar, equivalent or corresponding match. This is a process we refer to as harmonizing data.

    Once each use and the required match type are defined, then solutions to meet these requirements can be examined. What is the workflow of the data to achieve the match? This involves both technology and human activities. No technology matching solutions work perfectly.

    Once this work is done, there will be a realization that achieving the desired results is challenging. Exceptions will occur. The desired match quality may not be achievable. Changes to the environment will occur and matching may degrade. Adding data to the environment is not necessarily a better solution. The quality of big data should be suspect. Match conflicts will increase, resulting in more labor.
