In a recent post on this blog we went through how a process of consolidating master data could involve a match with a business directory.
Having more than a few B2B records often calls for an automated process to do that.
So, how do you do that?
Say you have a B2B record like this (Name, HouseNo, Street, City):
- Smashing Estate, 1, Main Street, Anytown
The business directory has the following entries (ID, Name, HouseNo, Street, City):
- 1, Smashing Estates, , Central Square, Anytown
- 2, Smashing Holding, 1, Main Street, Anytown
- 3, Smashing East, 1, Main Street, Anytown
- 4, Real Consultants, 1, Main Street, Anytown
Several different techniques are in use to settle the matter.
Here are some:
With exact matching, no candidates at all are found.
Say you build a match code for the input and for each directory row from:
- 4 first consonants in City
- 4 first consonants in Street
- HouseNo as 4 digits with leading zeros
- 4 first consonants in Name
This gives the following match codes:
- Input: NTWN-MNST-0001-SMSH
- Directory 1: NTWN-CNTR-0000-SMSH
- Directory 2: NTWN-MNST-0001-SMSH
- Directory 3: NTWN-MNST-0001-SMSH
- Directory 4: NTWN-MNST-0001-RLCN
Here directory entries 2 and 3 will be considered equal hits. You may either let an automated process pick one at random or forward the case to manual inspection.
Many other, more sophisticated match code assignments exist, including phonetic match codes.
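The match code recipe above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for the example, and treating Y as a vowel is an assumption chosen because it reproduces NTWN for Anytown.

```python
# Assumption: Y counts as a vowel, which reproduces NTWN for "Anytown".
VOWELS = set("AEIOUY")

def first_consonants(text, n=4):
    """Return the first n consonants of text, upper-cased."""
    consonants = [c for c in text.upper() if c.isalpha() and c not in VOWELS]
    return "".join(consonants[:n])

def house_no_code(house_no, width=4):
    """Zero-pad the digits of the house number; a missing value becomes 0000."""
    digits = "".join(c for c in house_no if c.isdigit())
    return digits.zfill(width)

def match_code(name, house_no, street, city):
    """Build the City-Street-HouseNo-Name match code."""
    return "-".join([
        first_consonants(city),
        first_consonants(street),
        house_no_code(house_no),
        first_consonants(name),
    ])

INPUT = ("Smashing Estate", "1", "Main Street", "Anytown")
DIRECTORY = [
    (1, "Smashing Estates", "", "Central Square", "Anytown"),
    (2, "Smashing Holding", "1", "Main Street", "Anytown"),
    (3, "Smashing East", "1", "Main Street", "Anytown"),
    (4, "Real Consultants", "1", "Main Street", "Anytown"),
]

input_code = match_code(*INPUT)
# Directory IDs whose match code equals the input's match code
hits = [row_id for row_id, *fields in DIRECTORY if match_code(*fields) == input_code]
print(input_code)  # NTWN-MNST-0001-SMSH
print(hits)        # [2, 3]
```

The weakness is visible right away: everything after the fourth consonant of the name is ignored, so the holding and the "East" entity collide with the input while the closest name sits behind a different street code.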
You may assign a similarity between each element and then calculate a total score of similarity between the input and each directory row.
Often you use a percentage-like measure here, where a similarity of 100 is exact, 90 is close, 75 is fair, and 50 and below is far away.
Selecting the best match candidate with this scoring will result in directory entry 3 as the winner, given that we accept automated matches with a score of 95 or above (and a gap of 5 points between this and the next candidate).
The assignment of similarity and the calculation of the total score may be (and are) implemented in many ways in different solutions.
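As one of those many possible ways, here is a minimal sketch in Python. It assumes the standard library's difflib ratio as the element similarity, a weighted average as the total score, and a score of 0 for missing values. The weights and function names are hypothetical, and the numbers it produces are not the ones behind the result mentioned above.

```python
from difflib import SequenceMatcher

# Hypothetical weights per element; a real solution would tune these.
WEIGHTS = {"name": 0.4, "house_no": 0.2, "street": 0.2, "city": 0.2}

def element_similarity(a, b):
    """Similarity on a 0-100 scale; a missing value scores 0 (an assumption)."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def total_score(input_row, directory_row):
    """Weighted average of the element similarities."""
    return sum(
        weight * element_similarity(input_row[key], directory_row[key])
        for key, weight in WEIGHTS.items()
    )

def select_winner(scores, threshold=95.0, gap=5.0):
    """Return the best candidate ID, or None if it should go to manual inspection.

    The best score must reach the threshold and lead the runner-up by the gap.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] < threshold:
        return None
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < gap:
        return None
    return ranked[0][0]
```

With made-up totals, select_winner({1: 60.0, 2: 88.0, 3: 96.0}) returns 3, while select_winner({2: 96.0, 3: 96.0}) returns None because the two candidates are too close to call.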
Also the selection of candidates plays a role. If you have to select from a directory with millions of rows you may use swapped match codes and other techniques like advanced searching.
The following example is based on a patented method by Dun & Bradstreet.
Based on an element similarity as above, you assign a match grade with one character for each element:
- A being exact or very close e.g. scores above 90
- B being close e.g. scores between 50 and 90
- F being no match e.g. scores below 50
- Z being missing values
With Name, HouseNo, Street and City, in that order, this gives the following match grades:
- Directory 1: AZFA
- Directory 2: BAAA
- Directory 3: BAAA
- Directory 4: FAAA
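Turning element scores into a match grade could look like the sketch below. It is not the patented method itself. The exact boundaries (above 90 for A, 50 to 90 for B) are one reading of the ranges above, and the element scores in the example are hypothetical values chosen only to show the mapping.

```python
def grade(score):
    """Map one element similarity (0-100, or None if a value is missing) to a grade."""
    if score is None:
        return "Z"  # missing value
    if score > 90:
        return "A"  # exact or very close
    if score >= 50:
        return "B"  # close
    return "F"      # no match

def match_grade(scores):
    """Concatenate the grades for Name, HouseNo, Street and City."""
    return "".join(grade(s) for s in scores)

# Hypothetical element scores in the order Name, HouseNo, Street, City
print(match_grade([97, None, 20, 100]))  # AZFA
print(match_grade([70, 100, 100, 100]))  # BAAA
```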
Based on the match grade you have a priority list of combinations giving a confidence code, e.g.:
- AAAA = 10 (High)
- BAAA = 9
- AZAA = 8
- A--A = 1 (Low)
Directory entries 2 and 3 will be winners with confidence code 9, remotely challenged by entry 1 with confidence code 1. Directory entry 4 is out of the game.
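The lookup from match grade to confidence code can be sketched as an ordered list that is scanned from the top. Two assumptions are made here: the lowest pattern is read as "A on Name, anything on HouseNo and Street, A on City", written with a hyphen as the wildcard, and a grade that matches no pattern gets confidence code 0. The list only holds the example entries from above, not a complete table.

```python
# Example priority list only; the wildcard "-" matches any grade (an assumption).
PRIORITY = [("AAAA", 10), ("BAAA", 9), ("AZAA", 8), ("A--A", 1)]

def confidence_code(match_grade):
    """Return the code of the first pattern that fits, or 0 if none does."""
    for pattern, code in PRIORITY:
        if len(pattern) == len(match_grade) and all(
            p in ("-", g) for p, g in zip(pattern, match_grade)
        ):
            return code
    return 0  # assumption: unlisted combinations are out of the game

grades = {1: "AZFA", 2: "BAAA", 3: "BAAA", 4: "FAAA"}
codes = {row_id: confidence_code(g) for row_id, g in grades.items()}
print(codes)  # {1: 1, 2: 9, 3: 9, 4: 0}
```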
I am actually not convinced that the winner should be directory entry 3 (or 2). I think directory entry 1 could be the one if we have to select any one.
Adding additional elements:
While we may not have additional information in the input, we may derive more elements from the ones we have. On top of that, the business directory may hold many more useful elements, e.g.:
- Geocoding may establish that there is a very short distance from “Central Square” to “1 Main Street” thus making directory 1 a better fit.
- LOB code (e.g. SIC or NACE) may confirm that directory 2 is a holding entity, which typically (but not always) is less desirable as a match candidate.
- Hierarchy code may tell that directory 3 is a branch entity, which typically (but not always) is less desirable as a match candidate.
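For the geocoding idea, a rough sketch could compare the two geocoded addresses with the haversine great-circle distance. The coordinates below are entirely hypothetical (Anytown does not exist), as is the 200 metre limit for calling two addresses "very close". A real solution would get its coordinates from a geocoding service.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def is_nearby(point_a, point_b, max_m=200):
    """True if the two (lat, lon) points are within max_m metres (hypothetical limit)."""
    return haversine_m(*point_a, *point_b) <= max_m

# Hypothetical geocodes for "1 Main Street" and "Central Square" in Anytown
main_street_1 = (55.6761, 12.5683)
central_square = (55.6765, 12.5690)
print(is_nearby(main_street_1, central_square))
```

A short distance like this would not prove a match, but it could lift directory 1 from a street mismatch to a plausible candidate.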
Here you either don’t rely on the deterministic approaches shown above or you supplement them with results from confirmed matches involving the same elements and the same combinations and patterns of elements.
This topic deserves a post of its own.