In a recent post on this blog we went through how a process of consolidating master data may involve a match against a business directory.
Having more than a few B2B records often calls for an automated process to do that.
So, how do you do that?
Say you have a B2B record like this (Name, HouseNo, Street, City):
- Smashing Estate, 1, Main Street, Anytown
The business directory has the following entries (ID, Name, HouseNo, Street, City):
- 1, Smashing Estates, , Central Square, Anytown
- 2, Smashing Holding, 1, Main Street, Anytown
- 3, Smashing East, 1, Main Street, Anytown
- 4, Real Consultants, 1, Main Street, Anytown
Several different techniques are used to settle the matter.
Here are some:
Exact match:
Here no candidates are found at all, since none of the directory rows matches the input exactly.
Match codes:
Say you build a match code on the input and on each directory row from:
- the first 4 consonants of City
- the first 4 consonants of Street
- HouseNo as 4 digits with leading zeros
- the first 4 consonants of Name
This makes:
- Input: NTWN-MNST-0001-SMSH
- Directory 1: NTWN-CNTR-0000-SMSH
- Directory 2: NTWN-MNST-0001-SMSH
- Directory 3: NTWN-MNST-0001-SMSH
- Directory 4: NTWN-MNST-0001-RLCN
Here directory entries 2 and 3 will be considered equal hits. You may select one automatically at random or forward the case to manual inspection.
Many other and more sophisticated match code schemes exist, including phonetic match codes.
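As a minimal sketch of the scheme above, assuming that Y counts as a vowel, that short values are padded with X, and that a missing HouseNo becomes 0000 (all choices of this illustration, not prescribed by any standard):

```python
import re

def consonants(text, n):
    """First n consonants, uppercased; padded with X when too short (an assumption here)."""
    cons = re.sub(r"[^BCDFGHJKLMNPQRSTVWXZ]", "", text.upper())
    return (cons + "X" * n)[:n]

def match_code(name, house_no, street, city):
    """Build the City-Street-HouseNo-Name match code described above."""
    house = f"{int(house_no):04d}" if house_no else "0000"
    return "-".join([consonants(city, 4), consonants(street, 4), house, consonants(name, 4)])

print(match_code("Smashing Estate", "1", "Main Street", "Anytown"))    # NTWN-MNST-0001-SMSH
print(match_code("Smashing Estates", "", "Central Square", "Anytown")) # NTWN-CNTR-0000-SMSH
```

Comparing the resulting codes for equality then gives the candidate hits.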
Scoring:
You may assign a similarity to each pair of corresponding elements and then calculate a total similarity score between the input and each directory row.
Often a percentage-like measure is used here, where 100 is an exact match, 90 is close, 75 is fair, and 50 or below is far off.
Selecting the best match candidate with this scoring will make directory entry 3 the winner, given that we accept automated matches with a score of 95 or more (and a gap of at least 5 points between this and the next candidate).
Assigning similarities and calculating the total score may be (and is) implemented in many ways in different solutions.
The selection of candidates also plays a role. If you have to select from a directory with millions of rows, you may use swapped match codes and other techniques such as advanced searching.
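One minimal way to sketch such scoring, here using Python's difflib for element similarity and a hypothetical set of weights, so the exact numbers will not reproduce the illustrative figures above:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Percentage-like similarity of two element values; 0 when either is missing."""
    if not a or not b:
        return 0
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def total_score(record, candidate, weights):
    """Weighted total score across elements (the weights are an assumption here)."""
    total = sum(w * similarity(record[k], candidate[k]) for k, w in weights.items())
    return round(total / sum(weights.values()))

record = {"Name": "Smashing Estate", "HouseNo": "1", "Street": "Main Street", "City": "Anytown"}
directory = [
    {"ID": 1, "Name": "Smashing Estates", "HouseNo": "", "Street": "Central Square", "City": "Anytown"},
    {"ID": 2, "Name": "Smashing Holding", "HouseNo": "1", "Street": "Main Street", "City": "Anytown"},
    {"ID": 3, "Name": "Smashing East", "HouseNo": "1", "Street": "Main Street", "City": "Anytown"},
    {"ID": 4, "Name": "Real Consultants", "HouseNo": "1", "Street": "Main Street", "City": "Anytown"},
]
weights = {"Name": 0.4, "HouseNo": 0.1, "Street": 0.3, "City": 0.2}
best = max(directory, key=lambda c: total_score(record, c, weights))
# With these particular choices, directory entry 3 comes out on top
```

Swapping in another similarity function or other weights will shift the scores, which is exactly why tuning matters in real solutions.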
Matrix:
The following example is based on a patented method by Dun & Bradstreet.
Based on element similarities as above, you assign a match grade with one character per element:
- A being exact or very close e.g. scores above 90
- B being close e.g. scores between 50 and 90
- F being no match e.g. scores below 50
- Z being missing values
Including Name, HouseNo, Street and City, this gives the following match grades:
- Directory 1: AZFA
- Directory 2: BAAA
- Directory 3: BAAA
- Directory 4: FAAA
Based on the match grade you have a priority list of combinations, each giving a confidence code, e.g.:
- AAAA = 10 (High)
- BAAA = 9
- AZAA = 8
- …
- A—A = 1 (Low)
Directory entries 3 and 2 will be winners with confidence code 9, remotely challenged by entry 1 with confidence code 1. Directory entry 4 is out of the game.
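The priority lookup itself can be as simple as a table. The entries below are only the combinations shown above plus a hypothetical low default; they are an illustration, not D&B's actual table:

```python
# Illustrative subset of a confidence code priority table (an assumption, not the real one)
CONFIDENCE = {"AAAA": 10, "BAAA": 9, "AZAA": 8}

def confidence_code(match_grade):
    """Look up the confidence code; in this sketch, unlisted combinations fall back to low."""
    return CONFIDENCE.get(match_grade, 1)

print(confidence_code("BAAA"))  # 9
```

A real priority list would enumerate far more grade combinations and would also encode business rules about which elements matter most.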
Satisfied?
I am actually not convinced that the winner should be directory entry 3 (or 2). I think directory entry 1 could be the one, if we have to select any at all.
Adding additional elements:
While we may not have additional information in the input, we may derive more elements from the ones we have, and the business directory may hold many more useful elements, e.g.:
- Geocoding may establish that there is a very short distance from “Central Square” to “1 Main Street” thus making directory 1 a better fit.
- LOB code (e.g. SIC or NACE) may confirm that directory 2 is a holding entity which typically (but not always) is less desirable as match candidate.
- Hierarchy code may tell that directory 3 is a branch entity which typically (but not always) is less desirable as match candidate.
Probabilistic learning:
Here you don’t rely solely on the deterministic approaches shown above, but supplement them with results from confirmed matches on the same elements and on combinations and patterns of elements.
This topic deserves a post of its own.
Hej Henrik,
Great article.
Can there be a risk that rules based on probabilistic learnings become so complex, that adding a new data source for validation becomes a problem?
Kind regards,
Jane
Hi Jane
Thanks for joining. I guess so, but I must admit that my experience with probabilistic learning is in stable environments regarding the sources for validation. But I know that there are people out there who have more experience in the probabilistic fields.
Hi Jane – I think this is a major issue. We have recently been through a CDI implementation with a leading CDI vendor. They have amazing capability in terms of searching / matching, but there is so much interaction between different algorithms that tuning the process is incredibly specialised. The vendor has stated that adding a new source generally doesn’t require tuning (unless you go to a different country or radically different data standards / quality), so here’s hoping. The tools they provide for assessing matching with a new source are also very good, which is clearly important, as a QA step is always required.
What you have to do is separate the matching decisions from the eventual selection decision.
This means that your solution needs added sophistication to determine the survivor in the case of a DM process, or the Master Record in the case where a Master Database is being created and maintained.
I have achieved control in this area two ways.
1) By assigning a hierarchy to the List Source itself, as a way to direct the matching solution as to which of the matching records will survive.
2) By creating a score on each record based on the completeness of specific data elements going into matching, so the record with the more complete set of data elements will survive the matching.
The beauty of this approach is that you always treat what you have as the base and only replace this data (provided you have permission) with better data when it comes in.