Splitting names

When working through a list of names for deduplication, consolidation or identity resolution, you will meet name fields populated like these:

  • Margaret & John Smith
  • Margaret Smith. John Smith
  • Maria Dolores St. John Smith
  • Johnson & Johnson Limited
  • Johnson & Johnson Limited, John Smith
  • Johnson Furniture Inc., Sales Dept
  • Johnson, Johnson and Smith Sales Training

Some of the entities with these names must be split into two entities before we can process them properly.

When you as a human look at a name field, you mostly (given that you share the same culture) know what it is about.

Making a computer program that does the same is an exciting but fearful journey.

What I have been working with includes the following techniques:

  • String manipulation
  • Lookup in lists of words such as given names, family names, titles, “business words” and special characters. These lists are country/culture specific.
  • Matching with address directories, used for checking if the address is a private residence or a business address.
  • Matching with business directories, used for checking whether it is in fact a business name and which part of the name string is not included in the corresponding directory name.
  • Matching with consumer/citizen directories, used for checking which names are known on an address.
  • Probabilistic learning, storing and looking up previous human decisions.
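
As a rough illustration of the first two techniques (string manipulation and word list lookup), here is a minimal Python sketch. The word lists and the function name `tag_tokens` are hypothetical and far too small for real use; actual lists are large and country/culture specific.

```python
import re

# Hypothetical, tiny word lists for illustration only.
# Real lists are large and country/culture specific.
GIVEN_NAMES = {"margaret", "john", "maria", "dolores"}
FAMILY_NAMES = {"smith", "johnson"}
BUSINESS_WORDS = {"limited", "ltd", "inc", "dept", "sales", "furniture", "training"}
CONNECTORS = {"&", "and"}


def tag_tokens(name):
    """Split a name field into tokens and tag each one by word list lookup."""
    # A token is either "&" or a run of letters; punctuation is dropped.
    tokens = re.findall(r"&|[^\W\d_]+", name)
    tagged = []
    for token in tokens:
        key = token.lower()
        if key in CONNECTORS:
            tag = "CONNECTOR"
        elif key in BUSINESS_WORDS:
            tag = "BUSINESS_WORD"
        elif key in GIVEN_NAMES:
            tag = "GIVEN"
        elif key in FAMILY_NAMES:
            tag = "FAMILY"
        else:
            tag = "UNKNOWN"
        tagged.append((token, tag))
    return tagged


print(tag_tokens("Margaret & John Smith"))
print(tag_tokens("Johnson & Johnson Limited"))
```

Note that a word like “John” can be a given name or part of a family name (“St. John Smith”), which is one reason why lookup alone is not enough and the directory matching techniques are needed as well.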

As with other computer-supported data quality processes, I have found it useful to have the computer divide the names into three pots:

  • A: The ones the computer may split automatically with an accepted failure rate of false positives
  • B: The dubious ones, selected for human inspection
  • C: The clean ones where the computer has found no reason to split (with an accepted failure rate of false negatives)
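
The three pots can be sketched as a simple threshold rule. This assumes that the techniques above produce some score between 0 and 1 expressing how confident the computer is that a name should be split; the score, the function name `assign_pot` and both threshold values are assumptions for illustration, and in practice the thresholds would be tuned to the accepted failure rates.

```python
def assign_pot(split_score, auto_threshold=0.9, review_threshold=0.2):
    """Assign a name to pot A, B or C from a hypothetical split score in [0, 1]."""
    if not 0.0 <= split_score <= 1.0:
        raise ValueError("split_score must be between 0 and 1")
    if split_score >= auto_threshold:
        return "A"  # split automatically, accepting some false positives
    if split_score >= review_threshold:
        return "B"  # dubious, send to human inspection
    return "C"      # leave unsplit, accepting some false negatives


for score in (0.95, 0.5, 0.05):
    print(score, assign_pot(score))
```

Raising `auto_threshold` moves more names from pot A to pot B, trading fewer false positives for more manual work.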

For the listed names a suggestion for the golden single version of the truth could be:

  • “Margaret & John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
  • “Margaret Smith. John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
  • “Maria Dolores St. John Smith” stays as CONSUMER “Maria Dolores St. John Smith”
  • “Johnson & Johnson Limited” stays as BUSINESS “Johnson & Johnson Limited”
  • “Johnson & Johnson Limited, John Smith” will be split into BUSINESS “Johnson & Johnson Limited” having EMPLOYEE “John Smith”
  • “Johnson Furniture Inc., Sales Dept” will be split into BUSINESS “Johnson Furniture Inc.” having DEPARTMENT “Sales Dept”
  • “Johnson, Johnson and Smith Sales Training” stays as BUSINESS “Johnson, Johnson and Smith Sales Training”
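
To show how the first of these suggestions might be reached, here is a minimal Python sketch of a single rule: “given name, connector, given name, family name” becomes two consumers. The pattern, the word lists and the name `split_couple` are hypothetical, and a real solution would combine many such rules with the directory matching described above.

```python
import re

# Hypothetical, tiny word lists for illustration only.
GIVEN_NAMES = {"margaret", "john"}
BUSINESS_WORDS = {"limited", "inc", "dept"}

# Three words joined as "<word> & <word> <word>" or "<word> and <word> <word>".
COUPLE = re.compile(r"^(\w+)\s+(?:&|and)\s+(\w+)\s+(\w+)$", re.IGNORECASE)


def split_couple(name):
    """Split 'Given & Given Family' into two consumers, else leave unresolved."""
    match = COUPLE.match(name.strip())
    if match:
        first, second, family = match.groups()
        both_given = first.lower() in GIVEN_NAMES and second.lower() in GIVEN_NAMES
        if both_given and family.lower() not in BUSINESS_WORDS:
            return [
                ("CONSUMER", f"{first} {family}"),
                ("CONSUMER", f"{second} {family}"),
            ]
    # Anything else is left for other rules or for human inspection.
    return [("UNRESOLVED", name)]


print(split_couple("Margaret & John Smith"))
print(split_couple("Johnson & Johnson Limited"))
```

The pattern alone also matches “Johnson & Johnson Limited”; it is the word list check that keeps that name from being split.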

For further explanation of the Master Data Types BUSINESS, CONSUMER, DEPARTMENT, EMPLOYEE you may have a look here.

11 thoughts on “Splitting names”

  1. Rich Murnane 21st October 2009 / 15:48

    I really enjoyed this entry about “Splitting Names” and I’m very glad I haven’t had to do this.

    Best of luck to you…Rich Murnane

  2. Steve Sarsfield 21st October 2009 / 19:54

    You can often get a parser to recognize two or more people in a single record. Take the case of ‘Margaret and John Smith’. Some companies use a strategy that maintains the original record, but also the parser resulting ‘John Smith’ and ‘Margaret Smith’, each with a unique customer ID but the same household ID.
    At some point, you may have some need to match that record with ‘Maggie Smith’, so that type of data structure comes in handy. Marketing will also want to market products to the head of household and not necessarily to all the members of a household, so you get some benefit there, too.
    You’re right, you see this in financial companies where spouses have both joint accounts and individual accounts. It’s always a challenge.
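
The structure described in this comment (keep the original record, and give each parsed person a unique customer ID but a shared household ID) could look roughly like the following Python sketch. The class names, field names and ID counters are assumptions for illustration, not taken from any particular product.

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical ID generators; a real system would use database keys.
_customer_ids = count(1)
_household_ids = count(1)


@dataclass
class Customer:
    customer_id: int
    household_id: int
    name: str


@dataclass
class Household:
    household_id: int
    original_record: str  # the unsplit name field is kept as it was
    members: list = field(default_factory=list)


def build_household(original_record, parsed_names):
    """Keep the original record and attach one Customer per parsed name."""
    household = Household(next(_household_ids), original_record)
    for name in parsed_names:
        customer = Customer(next(_customer_ids), household.household_id, name)
        household.members.append(customer)
    return household


smiths = build_household("Margaret and John Smith", ["Margaret Smith", "John Smith"])
for member in smiths.members:
    print(member)
```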

  3. Daryl Swinden 22nd October 2009 / 09:56

    Good points in here.
    Yes it certainly is a problematic area and which a computer program will never be 100% accurate on. We’ve worked with this type of data that has been captured without much foresight in where the data should be placed. I’d like to add that problems occur when company names replicate person names. I.e. Ethel Austin, John Lewis & Thomas Cook. Also where contact names can be either way around i.e. James David or even more complicated in terms of genderisation examples such as John Hayley. A can of worms!

  4. Jax 22nd October 2009 / 23:11

    Hang on – what’s the driving force behind name splitting in this case?
    If you’re looking at bank account details, then “John & Jane Smith” is not the same customer as “John Smith” – they may happen to live at the same address. They must be treated differently, because “John & Jane Smith” might be a company name that needs to receive appropriate circulars, whereas “John Smith” is the account that the wife doesn’t know about that John uses for entertaining his mistress.
    There’s a big difference between matching account holders who live at the same address, & account holders who should receive the same mail-outs – & accounts that are really the same & therefore can give mail cost savings.
    The only real ‘purpose’ in the first is to verify against third-party real-world identity data; but where do you draw the line for “John Smith & sons” – does John Smith still live? What about “John Smith & Jane Doe in care of Joe Bloggs” – which of these will be ‘registered’? Which lives at the address? Far too many business rules to take into consideration to make splitting into ‘customers’ useful.

  5. Henrik Liliendahl Sørensen 23rd October 2009 / 17:16

    Thanks Rich, Steve, Daryl and Jax for your comments.

    Jax, I have nevertheless seen and done name splitting on several occasions.

    I think the multi purpose of master data is the most obvious business reason for splitting, as I have tried to explain in this post.

  6. Jackie Roberts 23rd October 2009 / 20:33

    I really enjoyed your blog on Splitting Names. We use a similar analytic process for the manufacturer and supplier names referenced to our spare parts data and the submitted spare parts data to re-classify to the correct product classes of our schema. It is a very challenging and exciting process, especially when we process more than a couple of million records in a year.

  7. Triebs 29th October 2009 / 01:49

    Hi Henrik,

    What do you do if you get a name & address stencil that looks like this?:

    Richard C. & Eileen Smith Trustees
    Konrad J., Carol M, & Clayton C. Smith Beneficiaries
    1322 Broad Meadows, St. Louis, MO 63124
    Revocable Trust

    If you need an answer, let me know. There is software that can not only identify whether this is an organization or multiple individuals, but also retain the relationships between the entities identified in the stencil. The financial services industry has had a hold on this technology for decades.

    Regards,
    Triebs

  8. Henrik Liliendahl Sørensen 29th October 2009 / 07:35

    Thanks Jackie and “Triebs”.

    Jackie it’s true we need computers when millions of records have to be settled say in a migration project.

    The example provided by “Triebs” shows it’s amazing what a computer actually can do.

    Solutions I have worked with don’t go that far, they will basically:

    • Match with a Business Directory in order to find whether such an organization (if it is one) exists with a reasonable similarity. It may vary between countries whether you have public registration of these.
    • String manipulate and give up because of too many names and unknown words.

    But surely, “teaching” the computer domain knowledge from financial service about trusts, trustees and beneficiaries and combining with general name recognizing and splitting capabilities, the computer eventually will learn to produce something like this:

    ADDRESS in “USA” state “MO” postal code “63124xxxx” city “St. Louis” on “Broad Meadows” no “1322” having
    BUSINESS being a “Revocable Trust” having
    • CITIZEN “Richard C. Smith” being Trustee
    • CITIZEN “Eileen Smith” being Trustee
    • CITIZEN “Konrad J. Smith” being Beneficiary
    • CITIZEN “Carol M. Smith” being Beneficiary
    • CITIZEN “Clayton C. Smith” being Beneficiary

  9. Jax 29th October 2009 / 22:04

    The technology for relationship profiling might have been around, but the OASIS standard – xPRL – is not gathering appropriate support. It’s all well & good to say ‘a computer can work that out’, but what next? Who else is working on making the relationship useful for people to use? Mastersoft is.
    Now we’re talking about an application of techniques for dealing with complex customer data.

  10. Mike O'Connor 4th November 2009 / 22:56

    Thanks Henrik for bringing up the topic. Apologies if this has already been mentioned…. But, this challenge is 100% relevant for US based financial institutions and their consumers (Admittedly, I am not as well versed outside of the US). Take a look at the following link and how the federal laws describe to what level consumers are protected by FDIC (deposit insurance). It's a real requirement to split names and properly aggregate the totals. Warning, do not attempt to follow the link if driving or operating heavy machinery. Immediate drowsiness may occur. – https://www.fdic.gov/EDIE/fdic_info.html#04

  11. Henrik Liliendahl Sørensen 7th November 2009 / 08:41

    Thanks for the link Mike.

    I think the need for splitting names in financial services in my home country Denmark is very limited, as we have a way of registering citizens with unique citizen ID’s and every financial account must be attached with one or several such ID’s.
