A Data Quality Appliance?

Today it was announced that IBM is to acquire Netezza, a data warehouse appliance vendor.

Five years ago, I guess, interest in data warehouse appliances was very sparse. I guess so because I attended a session held by Netezza at the 2005 London Information Management conference. There were three of us in the room: the presenter, a truly interested delegate, and me. I was basically there because I was the next speaker in that room and wanted to see how things worked out. For the record: it was a good session, and I learned a lot about appliances.

That is probably why I noticed a piece from 2007 in which Philip Howard of Bloor wrote about The scope for appliances. In that article, Philip Howard also suggested other types of appliances, for example a data quality (data matching) appliance.

I have been involved in some implementations where we could have used the power of an appliance when matching a lot of rows. The Achilles’ heel of data matching is candidate selection, and you often have to restrict your methods in order to maintain reasonable performance.
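
To illustrate why candidate selection is the bottleneck, here is a minimal sketch of one common way to restrict the candidates, usually called blocking. It is not taken from any of the implementations mentioned here; the `blocking_key` rule (the first three letters of the name) and the sample records are assumptions made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(name: str) -> str:
    """Hypothetical blocking key: first three letters of the name, uppercased."""
    return "".join(ch for ch in name.upper() if ch.isalpha())[:3]


def candidate_pairs(records):
    """Yield pairs of record ids that share a blocking key."""
    blocks = defaultdict(list)
    for rec_id, name in records:
        blocks[blocking_key(name)].append(rec_id)
    # Only records inside the same block are compared with each other.
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


# Made-up sample data: (record id, name)
records = [
    (1, "Netezza Inc"),
    (2, "Netezza Corporation"),
    (3, "IBM"),
    (4, "I.B.M."),
    (5, "Bloor Research"),
]

pairs = list(candidate_pairs(records))
all_pairs = len(records) * (len(records) - 1) // 2

print(pairs)      # candidate pairs left after blocking
print(all_pairs)  # number of pairs without any restriction
```

Without blocking, every record is compared with every other record, so the number of pairs grows with the square of the row count. The restriction is also where matches get lost: with this key, "IBM" and "International Business Machines" would never be compared at all.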

But I wonder whether I will ever see an on-premise data quality (data matching) appliance, or whether it will be placed in the cloud. Or maybe there already is one out there? If so, please tell me about it.


4 thoughts on “A Data Quality Appliance?”

  1. Wolfert van Duin 21st September 2010 / 09:20

    Hi Henrik,

    I didn’t want to make any advertisement but you challenged me:) Yes there is such a Match Engine. It is a Dutch based company called Olbico. We have the following time consuming process:
    – Finding the right candidates (low time consuming)
    – Matching the right candidates (heavy time consuming)
    – Sorting the right candidates (medium time consuming)
    These are three different services connected to each other. All the processes are ‘multi-threaded’, so we are able to use multiple processors to do the job in the cloud. The question is what the Data Quality setup will look like and, within this setup, what the customer’s service level agreement is.
    We have Dun & Bradstreet as a European customer for matching trade experiences and they required a ‘fast’ solution. So we leased a ‘big’ server with multiple processors to meet their needs.
    Good luck with your ‘search’.



  2. Henrik Liliendahl Sørensen 21st September 2010 / 13:31

    Thanks Wolfert. Ads are OK if they are within the subject area 🙂

    One of my heavy-duty matching implementations is also related to Dun & Bradstreet: matching with the full WorldBase, which holds 170 million diverse business entities from all over the world.

    I guess a true data matching appliance will be a server built for nothing but data matching.

  3. Arthur Kay 28th September 2010 / 17:49

    Hi Henrik

    At Synaxis Data Services we have used our software to write an efficient cumulative matching process.

    That is, we generate multiple match keys so that we can have multiple attempts to find the right candidates. Each match key is formed from a combination of keys generated from the raw data, such as address, e-mail, telephone and fax numbers etc. A match key may or may not be flagged as valid, depending on the quality of the data it is built from.

    Valid match keys yield relatively small groups of candidates whose keys are exact matches. No time-consuming fuzzy matching is needed, since the fuzziness has been taken care of in the way the original keys were generated (e.g. all vowels stripped).

    Sorting the data on a match key, we assign a matched-group identifier to all the records in a matched group. The matched-group identifier will either be the lowest value of an already-assigned matched-group identifier or, if there are none, the lowest record identifier within the group.

    We repeat the process for different match keys, thus accumulating the power of several match attempts.

    We regularly use this method on databases of many millions of records.

    The matching process uses “standardized” records so the only customized element of the process is the format standardization of incoming data.

    In this sense, the matching process itself can be considered an appliance, and we intend to put it online in the not too distant future.

    Kind regards
    Arthur Kay

  4. Henrik Liliendahl Sørensen 28th September 2010 / 17:59

    Thanks Arthur, sounds like a big washing machine you got there -:)
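
For readers who want to see the steps from comment 3 in code, the cumulative matching idea can be sketched roughly as follows. This is only an illustration of the description in the comment, not Synaxis’ actual software: the field names, the two match key definitions in `MATCH_KEYS`, the `strip_key` rule and the sample records are all assumptions. A key counts as valid here only if every field it is built from is filled in.

```python
from itertools import groupby

VOWELS = set("AEIOU")

# Hypothetical match key definitions: each key combines several fields.
MATCH_KEYS = [("name", "phone"), ("name", "email")]


def strip_key(value: str) -> str:
    """Uppercase, keep letters and digits only, and drop the vowels."""
    return "".join(c for c in value.upper() if c.isalnum() and c not in VOWELS)


def match_key(record, fields):
    """Return the combined key, or None if any part is empty (key not valid)."""
    parts = [strip_key(record.get(f, "")) for f in fields]
    return "|".join(parts) if all(parts) else None


def cumulative_match(records):
    """Return {record id: matched-group id} after one pass per match key."""
    group_of = {}
    for fields in MATCH_KEYS:
        keyed = [(match_key(r, fields), r["id"]) for r in records]
        # Sort on the match key so that exact key matches sit next to each other.
        keyed = sorted((k, i) for k, i in keyed if k is not None)
        for _, run in groupby(keyed, key=lambda pair: pair[0]):
            ids = [i for _, i in run]
            if len(ids) < 2:
                continue  # no match on this key
            # Lowest already-assigned group id, else lowest record id in the group.
            assigned = [group_of[i] for i in ids if i in group_of]
            group_id = min(assigned) if assigned else min(ids)
            for i in ids:
                group_of[i] = group_id
    return group_of


# Made-up sample data
records = [
    {"id": 1, "name": "Anna Jensen", "phone": "555-0100", "email": "anna@example.com"},
    {"id": 2, "name": "Anne Jensen", "phone": "555 0100", "email": ""},
    {"id": 3, "name": "Anna Jensen", "phone": "", "email": "Anna@Example.com"},
    {"id": 4, "name": "Bo Larsen", "phone": "555-0199", "email": "bo@example.com"},
]

groups = cumulative_match(records)
print(groups)
```

In this sample, records 1 and 2 meet on the name and phone key, and record 3 joins the same group on the name and email key in the second pass, which is the accumulating effect described in the comment. Records that never match anything, like record 4, simply get no group.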
