Most ready-made tools for data matching focus on party data – names and addresses.
How can these tools help you, and what do they do?
I have worked with these different approaches:
Synonyms: This is in my eyes the most basic approach. You have a list of common translations of different words, like common misspellings, nicknames and so on. This approach is of course very dependent on heavy maintenance and must be reworked for every language/country – and it actually works better with English than with other languages like the (other) Germanic ones, where concatenated words are used (like ‘Main Street’ becoming ‘Mainstreet’).
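A minimal sketch of how such a synonym table could be applied in code – the table entries and the `normalise` function are illustrative examples of mine, not taken from any particular tool:

```python
import re

# Hypothetical synonym table: abbreviations and nicknames mapped to a
# canonical form. A real table needs heavy per-language/country maintenance.
SYNONYMS = {
    "st": "street",
    "str": "street",
    "rd": "road",
    "bob": "robert",
    "bill": "william",
}

def normalise(text: str) -> str:
    """Lower-case, tokenise, and replace each token by its canonical form."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(SYNONYMS.get(t, t) for t in tokens)

normalise("Bob Smith, 12 Main St.")   # -> 'robert smith 12 main street'
```

Two records normalised this way can then be compared directly, at the cost of maintaining the table.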
Match codes: You find these in forms from the very simple to the more sophisticated – going from ignoring vowels, over soundex and metaphone (for English), to proprietary inventions of all kinds. In my eyes match codes work OK for selecting candidates for matching – but fall a bit short when it comes to actually settling the case.
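For illustration, here is the classic American Soundex – one of the simple match codes mentioned above. It keeps the first letter and encodes the following consonants as three digits, so variant spellings of the same (English) name often get the same code:

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:   # skip repeats of the same code
            result += digit
            if len(result) == 4:
                break
        if ch not in "hw":            # h and w do not separate equal codes
            prev = digit
    return result.ljust(4, "0")

soundex("Smith"), soundex("Smyth")   # both -> 'S530'
```

This also shows why soundex is a candidate selector rather than a decider: ‘Robert’ and ‘Rupert’ share the code R163 but are not the same name.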
Algorithms: A complex algorithm is a more sophisticated way to settle whether two differently spelled records make up the same real world entity. You have to deal with truncations, non-phonetic typos, rearranged words and letters and all that jazz. The Levenshtein distance is an example of an algorithm you could use – but such a method is just a fraction of what the commercially used algorithms around can do.
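As a sketch, the Levenshtein distance itself is only a few lines of dynamic programming, counting the minimum number of single-character inserts, deletes and substitutions needed to turn one string into the other:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (inserts, deletes, substitutions)."""
    prev = list(range(len(b) + 1))   # distances from a[:0] to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # delete from a
                            curr[j - 1] + 1,           # insert into a
                            prev[j - 1] + (ca != cb))) # substitute
        prev = curr
    return prev[-1]

levenshtein("Jonathan", "Johnathan")   # -> 1 (one inserted 'h')
```

A commercial matcher would combine something like this with weighting, token rearrangement and much more.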
Probabilistic learning: This is in fact a variation of synonyms, but the collection is not based on upfront maintenance but on collecting users’ actual decisions when verifying automatic matching. The tool registers the frequency and context of the paired elements in those decisions. This of course requires a substantial collection. I have implemented such a feature at organisations where several people verify matching results every day.
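A minimal, hypothetical sketch of the idea – recording verification decisions for element pairs and scoring future pairs by the observed confirmation ratio. The class and method names are mine, not from any specific tool, and a real implementation would also weight by context:

```python
from collections import Counter

class PairLearner:
    """Collects users' match/no-match decisions on element pairs."""

    def __init__(self):
        self.confirmed = Counter()
        self.rejected = Counter()

    def record(self, a: str, b: str, is_match: bool) -> None:
        pair = tuple(sorted((a.lower(), b.lower())))
        (self.confirmed if is_match else self.rejected)[pair] += 1

    def score(self, a: str, b: str) -> float:
        """Share of past decisions confirming this pair; 0.5 if unseen."""
        pair = tuple(sorted((a.lower(), b.lower())))
        yes, no = self.confirmed[pair], self.rejected[pair]
        return yes / (yes + no) if yes + no else 0.5

learner = PairLearner()
for _ in range(9):
    learner.record("Bill", "William", True)   # users keep confirming this pair
learner.record("Bill", "William", False)
learner.score("william", "bill")   # -> 0.9
```

The value of such a feature grows with the volume of verified decisions, which is why it suits organisations where matching results are verified daily.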
Parsing and standardisation are often supplementary methods used to improve the matching. Also, bringing in more data to support the decision is in my eyes a key to actually settling whether some records make up the same real world entity. Business and consumer/citizen directories are available in different forms, coverage and depth around the world.
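As a toy illustration of parsing, a single regular expression can split a simple address line into fields before matching. Real parsers need country-specific rules and reference data; the pattern below is just an assumption about one input shape:

```python
import re

# Hypothetical pattern for a 'number street, city' line.
PATTERN = re.compile(r"^\s*(?P<number>\d+)\s+(?P<street>.+?),\s*(?P<city>.+)$")

def parse_address(line: str) -> dict:
    """Split one address line into named parts, or return it unparsed."""
    m = PATTERN.match(line)
    return m.groupdict() if m else {"raw": line}

parse_address("121 Hightower Road, Wentworthville")
# -> {'number': '121', 'street': 'Hightower Road', 'city': 'Wentworthville'}
```

Matching field by field on parsed output is usually far more reliable than matching whole unstructured lines.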
Nice post! Something I never saw explained so clearly and accessibly on the net before.
These matching techniques are okay for most cases (like you said, I am only skeptical about soundex applied to something other than English).
A simple approach is missing anyway: key mapping 🙂 It is quite commonly used to match securities when each data provider has its own key and there is no widely adopted key standard (ISIN is not used everywhere…).
Probabilistic matching is less straightforward to understand (archiving the probabilities used to match a pair is important) but is more efficient with high volumes (lots of attributes considered in the matching) compared to deterministic matching, provided these statistics are updated often.
Both techniques should be combined: deterministic matching to reduce the number of suspects, and probabilistic matching to enhance it (see IBM MDM Server + QualityStage, etc.).
Some good articles I came across can be found at the end of this post:
But you do not talk about matching in multi-cultural contexts. These techniques reach their limits when you start to deal with data coming from different countries, languages and cultures, where you face:
– format problems for addresses and names (see Graham Rhind, a guru on that topic)
– different languages and alphabets
Creating and maintaining matching rules just becomes a nightmare…
I heard about some semantic technologies with learning engines (to remember users’ decisions) that may help data quality in such cases (see Zoomix, bought by Microsoft).
I am not an identity resolution specialist, but I’m wondering what matching techniques they use…
Just one point, Olivier. Personally I prefer not to use soundex or other available functions of this nature. Instead, I prepare my own phonetics table, which is editable and can be configured based on the nature of the underlying data.
Thanks Olivier. Matching in multi-cultural contexts is actually one of my favourite topics and one of the planned enhancements on these pages.
Working with the D&B Worldbase and other global data projects has given me a good deal of experiences on this matter. See: https://liliendahl.wordpress.com/2009/07/11/the-globalmatchbox/
Also, I was planning a fuller description of probabilistic learning.
Thanks for the interesting post. We at InQuera deal only with product data. Our technology can handle name & address matching, extraction, etc., but we decided to focus on the more complex domain where we have a significant advantage (experience, know-how and technology). Nevertheless, sometimes we need to address problems related to manufacturer/supplier names. A typical challenge is to find/match a manufacturer name and product number hidden in a long, messy product description. It is not so simple, because in many cases the manufacturer name consists of several tokens, some of which are quite confusing (e.g., screws, tools, power tools, ball bearing). Another challenge is to match it as fast as possible (performance!) against customer manufacturer/product-ID pairs found in the customer database (7,000,000!!).
The above is part of one of our solutions – DataRefiner MetaMatch, a unique tool that automatically processes RFPs and matches them against a huge product catalog, either by identifiers (manufacturer/product) or by functionality (technical attributes).
To make it more concrete: one of our customers is among the biggest distributors of technical parts for industry. It has a catalog (SAP MDM) of more than 7,000,000 products. From time to time it receives RFPs from customers in the form of 20,000 or 30,000 product descriptions. The descriptions are free text extracted from the customer’s ERP. Matching the customer’s free text descriptions against the organized distributor catalog is a real challenge.
To do so we apply many of the techniques you mentioned, including several algorithms for distance measuring, probability algorithms, self-learning by example in context, and some proprietary algorithms that use domain (engineering) knowledge.
Because each RFP is written differently (synonyms, units of measure, engineering standards and language), we found that a semantic approach or other natural language techniques are not applicable.
This specific solution (MetaMatch) uses our standard DataRefiner server, as do the rest of our solutions.
I will gladly provide more information if it is interesting enough.
Thanks for your time…
Thanks Yossi. One of our success stories at Omikron is about the Swedish power giant Vattenfall. Vattenfall Europe had a challenge in product data management after many acquisitions and mergers: the same spare parts were listed multiple times. The result of a deduplication project was that 30,000 duplicate products were identified in a database of 400,000. Some vendors had to admit that they had probably sold the same product to different Vattenfall sections, where the term and the price could vary a lot. More on:
I find that a series of techniques works better for (international) name and address data than a single technique. I have always concentrated on the basics: data standardisation, formatting and parsing, through my software GRCTools (http://www.grcdi.nl/dmtools.htm), which uses extensive synonym tables to make each record as accurate but as similar as possible. Synonym tables include place name/postal code tables (about 27 million records – http://www.grcdi.nl/settlements.htm) and thoroughfare constants (street types etc. – about half a million records – http://www.grcdi.nl/addresses.htm).
Once the data is parsed, standardised and formatted, other techniques, such as match coding, work much more effectively and accurately. I avoid techniques like Soundex, though, which is linguistically based on English and works very badly on other-language data. For one project I invented a new “Soundex” system which was less linguistically affected.
Thanks Graham. It seems that there is no single trick here; you have to combine methods. I agree.
Yossi, I agree with your strategy. There are precious few of us who can provide both true entity resolution AND entity analytics technologies for PIM.
At Infoglide, we run the whole gamut of matching – transactional and historical, probabilistic and attribute-based, entities and identities. It seems as though we are both positioned well. Good luck with your business model!
Henrik, thank you again for this excellent forum. Please feel free to visit the industry identity resolution blog site at http://www.identityresolutiondaily.com
I have stumbled on this resource and am finding it most illuminating. The reason I was snooping around the web is that one of my customers has come up with the requirement that “we hold all orders when, in a set of orders, there are more than three orders going to the same address”. I immediately discarded exact matching, purely because the addresses are entered manually by the customer when he enters his order.
The approach I had taken was to concatenate all the address elements together, upcase them, remove weird characters – “,” “.” “#” “(” “)” etc. – and then, as a final step, remove all spaces.
Then performing a frequency count on all the unique strings would tell me how many duplicates I would have.
This approach worked pretty well until I encountered 6 orders that had addresses like:
121, Hightower. Wentworthville (1 order)
121, Hightower Rd. Wentworthville (4 orders)
121, Hightower Road. Wentworthville (1 order)
As you can see I ended up with three unique strings – where in a perfect world I would end up with one!
Simplistically I could (in the same way I removed the weird characters above) translate “RD” “RD.” “ROAD” “ROAD.” to null strings, which would then yield the desired result – until you consider the number of street ‘types’ (Crescents, Ways, Places, Alleys etc.). In addition, a street name like “Broadroad” would break the routine (unless of course I bounded the substitution with a leading space, e.g. ” Road.”). I confess to a little twitchiness, however (I can’t elucidate the twitchiness – just a bad feeling in my bones).
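A sketch of that idea using regex word boundaries instead of a leading space, which leaves “Broadroad” intact while still stripping standalone street types. The street-type list here is just a small illustrative sample and would need extending:

```python
import re
from collections import Counter

# Small example list of street types; a real one would be much longer.
STREET_TYPES = re.compile(r"\b(road|rd|street|st|crescent|cres|way|place|pl|alley)\b",
                          re.IGNORECASE)

def address_key(address: str) -> str:
    """Normalise an address line to a key for frequency counting."""
    key = address.upper()
    key = re.sub(r"[,.#()]", " ", key)     # strip the 'weird' characters
    key = STREET_TYPES.sub(" ", key)       # drop standalone street types only
    return re.sub(r"\s+", "", key)         # finally remove all spaces

orders = ["121, Hightower. Wentworthville",
          "121, Hightower Rd. Wentworthville",
          "121, Hightower Road. Wentworthville"]
counts = Counter(address_key(o) for o in orders)
# all three collapse to the single key '121HIGHTOWERWENTWORTHVILLE'
```

The `\b` boundary only matches at the edge of a word, so “BROADROAD” survives while ” RD.” and ” ROAD.” are removed – addressing the “bad feeling in the bones” about bounded substitutions, at least for this case.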
After doing some googling, my customer’s simple request seems to have opened a can of worms, considering the techniques available to perform this task.
Have you considered using PAF data to standardise and parse the address into its constituent parts? I believe you can also get PAF to deliver a Unique Delivery Point Reference Number (UDPRN) for those addresses that can be PAF standardised.
Thanks for the comment, Stan. I’ve been there too, opening that can of worms when making my first data matching tool. There are, however, also ready-made tools out there embracing and combining several data matching techniques. As always, you have to choose between build or buy based on ROI.
Great presentation. Actually, one of my favourite kinds of work is implementing data quality solutions for non-English speaking regions.
Where can I learn more about the ‘probabilistic learning’ method you mentioned?
Thanks for the comment Tirthankar.
I have planned a blog post on probabilistic learning for a long time – time to do it.