Is Data Cleansing Bad for Data Matching?

Today I stumbled upon an article from Australia in BMC Medical Informatics and Decision Making. The article is called “The effect of data cleaning on record linkage quality”.

The result of the described research is:

“Data cleaning made little difference to the overall linkage quality, with heavy cleaning leading to a decrease in quality. Further examination showed that decreases in linkage quality were due to cleaning techniques typically reducing the variability – although correct records were now more likely to match, incorrect records were also more likely to match, and these incorrect matches outweighed the correct matches, reducing quality overall.”

This resonates very well with my experience too. Usually I like to match with both original data and standardized (cleansed) data in order to exploit the best of both approaches.
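A minimal sketch of this dual-pass idea in Python (the `standardize` and `match` helpers are illustrative stand-ins, not any particular tool):

```python
import re

def standardize(name: str) -> str:
    """Crude cleansing: lowercase, strip punctuation and surrounding spaces."""
    return re.sub(r"[^a-z ]", "", name.lower()).strip()

def match(a: str, b: str) -> bool:
    """Exact comparison; a real engine would score similarity instead."""
    return a == b

def dual_pass_match(a: str, b: str) -> bool:
    """Accept a match if either the original or the cleansed values agree."""
    return match(a, b) or match(standardize(a), standardize(b))

# "J.R. Smith" vs "JR Smith": no exact match, but the cleansed pass agrees.
print(dual_pass_match("J.R. Smith", "JR Smith"))   # True
print(dual_pass_match("Jon Smith", "Jim Smith"))   # False
```

The point is that the cleansed pass catches formatting variation while the original pass still distinguishes values that cleansing would wrongly collapse.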

What are your experiences?


19 thoughts on “Is Data Cleansing Bad for Data Matching?”

  1. Philippe TOULEMONT (@pht176) 28th June 2013 / 13:26

    Though I agree that matching clean or unclean data doesn’t make a big difference, I would say that data cleaning leading to false positives in data matching would indicate a data cleaning failure.

    Did the article give examples?

    • Henrik Liliendahl Sørensen 28th June 2013 / 13:50

      Hi Philippe. I don’t see any examples in the material online, but as I read it, it is especially the nickname lookup that is found to be problematic in this research.

  2. Karen Way 28th June 2013 / 17:18

    My experience has been similar to yours, Henrik, especially in the area of name/address matching. If I use only the cleansed/standardized data, I lose out on the potential matches that could have been made on names/nicknames, but pick up many more false positives because a nickname on the source has been standardized to an incorrect cleansed name, e.g. Will being standardized as William instead of Willis. I like your approach of doing multiple matching passes to ensure the best return.

    Thanks, as always, for a thought provoking post.

  3. irawarrenwhiteside 28th June 2013 / 18:14


    Great question.
    In my experience, when implementing a repeatable matching process, it is helpful to use a data lineage approach, meaning that you carry both the cleansed attributes and the artifact data attributes (the original dirty data).

    Using this concept you can provide an analytical solution that can be used to inspect matching results and drive the “tweaks” necessary to improve future results.
    I have a small post on implementing this concept (data lineage).

    In addition, if you know the relationship between the original data and the cleansed data, you can store the original values as “hashed values” in indexed reference tables (lookup tables) and very quickly determine the cleansed values without processing the records through a “fuzzy match” again.
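    A sketch of this hashed lookup-table idea, with a hypothetical `expensive_cleanse` function standing in for the costly fuzzy-match/standardization step:

```python
import hashlib

# Hypothetical cache: hash of the original value -> previously computed cleansed value.
cleansed_cache: dict[str, str] = {}

def key_of(original: str) -> str:
    """Hashed key for the indexed reference (lookup) table."""
    return hashlib.sha256(original.encode("utf-8")).hexdigest()

def expensive_cleanse(original: str) -> str:
    """Stand-in for a costly fuzzy-match/standardization step."""
    return " ".join(original.upper().split())

def cleanse(original: str) -> str:
    k = key_of(original)
    if k not in cleansed_cache:  # only run the expensive step for unseen values
        cleansed_cache[k] = expensive_cleanse(original)
    return cleansed_cache[k]

print(cleanse("  acme   corp "))  # computed once
print(cleanse("  acme   corp "))  # second call is served from the lookup table
```

    In a production setup the cache would live in an indexed database table keyed on the hash rather than an in-memory dictionary.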

    Ira Warren Whiteside
    Actuality Business Intelligence LLC.

  4. Henrik Liliendahl Sørensen 29th June 2013 / 10:48

    Thanks Karen and Ira for adding in.

  5. Tirthankar Ghosh 29th June 2013 / 10:57

    My experience is similar. Cleansing is a way of losing information. Though we cannot just avoid it, overdoing it leads to ignoring variability and subsequently an increase in false positives.

  6. Vianney Cau 1st July 2013 / 07:20

    I do prefer the normal approach (clean then match). The cleaning step allows us to identify and invalidate bad data (like test data or old legacy data) that leads to performance issues.

    A lot of my clients have very high SLA requirements on online matching, and I always prefer to do it right the first time during online create/update instead of crawling the whole database in batch mode. Reducing the matching consumption is our priority.

  7. Mark Humphries 1st July 2013 / 10:01

    A very salient lesson indeed, and so ironic that data cleansing can ultimately decrease data quality. It strikes me that this is analogous to the problem of overfitting in analytics, and I wonder if there could be an analogous approach for avoiding it. In analytics you split the dataset into training and validation to detect the moment at which overfitting starts to occur.
    I wonder if it would be possible to similarly split a dataset for matching, and so detect the point at which overmatching occurs.
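    A rough sketch of how such a split might work (the labeled pairs and the `SequenceMatcher`-based scorer are purely illustrative):

```python
import random
from difflib import SequenceMatcher

def score(a: str, b: str) -> float:
    """Toy similarity score between two records."""
    return SequenceMatcher(None, a, b).ratio()

# Invented labeled pairs: (record A, record B, is_true_link)
pairs = [("William Smith", "Will Smith", True),
         ("William Smith", "Willis Smith", False),
         ("Anna Jones", "Anna Jones", True),
         ("Anna Jones", "Anne Johns", False)] * 25

random.seed(0)
random.shuffle(pairs)
train, holdout = pairs[:50], pairs[50:]

def f1_at(threshold: float, data) -> float:
    """F-measure of the rule 'link if similarity >= threshold' against the labels."""
    tp = sum(1 for a, b, y in data if score(a, b) >= threshold and y)
    fp = sum(1 for a, b, y in data if score(a, b) >= threshold and not y)
    fn = sum(1 for a, b, y in data if score(a, b) < threshold and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Tune the threshold on the training half, then verify it on the holdout:
best = max((t / 100 for t in range(50, 100)), key=lambda t: f1_at(t, train))
print(best, f1_at(best, train), f1_at(best, holdout))
```

    If holdout quality drops well below training quality as the threshold loosens (or as cleansing gets heavier), that is the overmatching signal Mark describes.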

  8. Henrik Liliendahl Sørensen 2nd July 2013 / 11:05

    Thanks Tirthankar, Vianney and Mark for joining the discussion.

    Indeed, for candidate selection, not least in a real-time environment, standardization can help a lot; however, you may still include the original values when it comes to match scoring.
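    For instance, candidates can be selected via a standardized blocking key while the scoring still uses the raw values (the records and helpers below are invented for illustration):

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Illustrative records: (id, original name)
records = [(1, "J.R. Smith"), (2, "JR Smith"), (3, "Jane Smyth")]

def block_key(name: str) -> str:
    """Standardized key used only for candidate selection: first initial + surname."""
    parts = name.replace(".", "").lower().split()
    return parts[0][0] + "|" + parts[-1]

# 1) Candidate selection on the standardized key...
blocks = defaultdict(list)
for rid, name in records:
    blocks[block_key(name)].append((rid, name))

# 2) ...but score candidate pairs on the original values.
for key, cands in blocks.items():
    for i in range(len(cands)):
        for j in range(i + 1, len(cands)):
            s = SequenceMatcher(None, cands[i][1], cands[j][1]).ratio()
            print(cands[i][0], cands[j][0], round(s, 2))
```

    Standardization keeps the candidate sets small and fast to retrieve, while scoring on the originals preserves the variability that cleansing would throw away.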

    The concept of overfitting and how a nice “match model” made from a certain set of training data fails when new (and bigger) data is introduced applies to data matching as well for sure. A very common issue in matching party data is how a good matching set up for one country fails utterly for data from another country.

  9. Henrik Liliendahl Sørensen 2nd July 2013 / 11:53

    Comments from LinkedIn:

    > Bill Winkler > It is a nice article. For inexperienced individuals, one must be cautious in applying default methods. If you read the article carefully, you will determine that almost 100% of the anomalies are due to the nickname standardization.

    Although we have used nickname tables since 1988, I removed almost 90% of the entries in our tables. Additionally, we only attempt to use the nickname if the original first names do not agree.

    In the U.S., Bobbie might be a woman’s name (on her birth certificate) or a nickname for Roberta. I did not want Bobbie changed to Roberta or Robert.

    As with most preprocessing prior to matching, the routines (name, address standardization) that parse names and addresses into more easily compared components can often yield the biggest matching improvement. The matching parameters can yield less of an improvement. Some of the remaining clean-ups must be used very cautiously.

    > André Wajda > Agree on using both original and standardized for record linkage.

    > Elena Zhuravleva > We use a data quality code on the cleansed data for matching purposes, which allows us to make sure that matched records are standardized correctly before matching them.

    My experience says that a decrease in matching quality comes from the other side – from matching non-standardized data. For example, two non-standardized addresses like “Moscow, Komsomolsky prospekt, 41” and “Russia, Moscow region, Moscow city, Komsomolsky, house 41” could be totally different in terms of the percentage of equal symbols, but the same in standardized format.

    > Richard Northwood > Interesting as ever, Henrik. It depends on how heavy-handed your cleansing is. If the cleansing tolerances are too loose, you could end up with missed results or even corrupt the matching and merging process further down the line, causing you to de-duplicate excessively. You have to take each job on its individual merits and not apply too broad-brush an approach. This is why profiling your source data before developing is absolutely key – to discover any quirks in the data that may break the standard cleansing rules you wish to apply. All the best, Rich
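    Elena’s address example above can be sketched in Python. The “standardizer” below just drops a hand-picked set of filler tokens; a real one would parse the address against reference data:

```python
from difflib import SequenceMatcher

RAW_A = "Moscow, Komsomolsky prospekt, 41"
RAW_B = "Russia, Moscow region, Moscow city, Komsomolsky, house 41"

def similarity(a: str, b: str) -> float:
    """Raw character-level similarity of the unstandardized strings."""
    return SequenceMatcher(None, a, b).ratio()

# Toy standardizer: drop filler tokens and compare the remaining address parts.
NOISE = {"russia", "region", "city", "house", "prospekt"}

def standardize(addr: str) -> tuple:
    tokens = [t.strip(",.").lower() for t in addr.split()]
    return tuple(sorted(set(t for t in tokens if t not in NOISE)))

print(similarity(RAW_A, RAW_B))                   # low on the raw strings
print(standardize(RAW_A) == standardize(RAW_B))   # True once standardized
```

    The raw strings look quite different character by character, yet both reduce to the same (city, street, number) core after standardization.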

  10. Tonia Thompson 2nd July 2013 / 17:21

    Very interesting topic.

    For our matching engine, cleansing improves both accuracy & frequency of matches without increasing false match rates. We, however, don’t do any cleansing on the name – only address and a few other fields just to “level the playing field” for all records.

    I think the bottom line here is that changing the integrity of the data can negatively affect match results, while standardization of common values improves match results.

  11. Stuart Norton 3rd July 2013 / 08:22

    Some good points, but I was thinking along the same lines as Richard Northwood.
    If you fail to understand the subject-area pitfalls of the information you’re standardising, or to appreciate the dangers of false positives for your particular scenario, a heavy-handed cleanse could well be worse than no cleanse.

  12. Steve Tootill 3rd July 2013 / 20:43

    Agree with Stuart and Richard – profile the data before you set up the cleansing and matching rules.

    Nicknames can be a problem – one size doesn’t fit all, it depends on the data and the requirement. Software should allow matching of nickname equivalents and fuzzy matching of first names/nicknames, but not insist on either. One good option is to not use fuzzy matching on recognized first names e.g. you could allow fuzzy matching of Jon and Jin if Jin is unrecognized, but not allow fuzzy matching of Tom and Tim. However, if you’re doing a fuzzy match on a cold mailing, you might want to allow Tom and Tim to match.

    Another useful option is not to allow bridging of fuzzy matches e.g. not group together Jon, Jin and Jim all in the same set of person matches, but let Jin match one of the other two – you probably can’t know if it should match one rather than the other though.
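    A small sketch of the “no fuzzy matching between recognized names” rule (the recognized-name set and similarity threshold are invented for illustration):

```python
from difflib import SequenceMatcher

# Hypothetical set of recognized first names (in practice, a reference table).
RECOGNIZED = {"tom", "tim", "jon", "jim"}

def first_names_match(a: str, b: str, fuzzy_threshold: float = 0.6) -> bool:
    a, b = a.lower(), b.lower()
    if a == b:
        return True
    # Only fall back to fuzzy matching when at least one name is unrecognized;
    # "Tom" vs "Tim" are both real names, so a near-miss means different people.
    if a in RECOGNIZED and b in RECOGNIZED:
        return False
    return SequenceMatcher(None, a, b).ratio() >= fuzzy_threshold

print(first_names_match("Jon", "Jin"))   # True  - "Jin" is unrecognized, fuzzy applies
print(first_names_match("Tom", "Tim"))   # False - both recognized, no fuzzy match
```

    For a cold mailing, as Steve notes, you might simply drop the recognized-name guard and let the fuzzy comparison apply everywhere.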

  13. David Leivesley 5th July 2013 / 10:07

    Great question and some very interesting points here!
    As became evident in that article, nicknames often cause matching problems, especially if the data cleansing approach is one-size-fits-all. Therefore, cleansing should be tailored to the requirements and the nature of the nicknames. When the matching is performed using software that can match nickname equivalents and allows fuzzy matching of nicknames or first names, one should never insist on either at the expense of the other. Nonetheless, it can sometimes be wise to avoid fuzzy matching of recognised nicknames and first names. The use of both source data and cleansed data during matching is advisable, especially when establishing the principles and controls of future data cleansing procedures.

    In the end, the value of cleansing will depend on how good the data cleaning techniques are. Applying a heavy-handed approach with overly loose tolerances may lead to excessive deduplication, missed results and corrupt matching. I believe data cleansing is valuable to data matching, but all necessary precautions should be taken with first names, nicknames and addresses.

  14. Martin Macrae 5th July 2013 / 22:52

    Starting perhaps from table copies, where it doesn’t matter what you mess up, then producing ordered (count) lists and matching them before/after may well indicate whether things get worse. You could even add extra test-table columns of candidate/revised nicknames, but preserving all existing rows and columns would preserve the integrity, as one could use two separate search joins (a join, an outer join, etc.) on the same table and then order the result.

  15. Martin Doyle 8th July 2013 / 11:49

    I think this problem of reducing the quality of matching and linking is to do with the reference data. If you choose to abbreviate Alison, Alan, Albert, Alisandra, Alex, etc. to Al for matching, then it should be no surprise that you will get false matches. If, however, you take into consideration the original data as well, or are a bit smarter in your algorithms, the problem will be minimised or mitigated.

    In my experience, each data entity – person, company, address, phone, email etc. has its own challenges and should be attacked using different strategies to avoid over or under matching records.

  16. Jone Aryan 30th September 2013 / 08:07

    In my experience, data cleaning is really good for database management because it ensures the quality of data and makes it easily accessible.

  17. Mike laure 12th December 2013 / 07:40

    Great post. I think data cleansing and data matching are like two sides of a coin; both are useful for business.

  18. Mathew 21st January 2014 / 11:47

    Nice post. Data cleansing and data matching are both really useful for business because they help to keep the data in a database accurate and duplicate-free.
