Deduplicating with a Spreadsheet

Say you have a table with a lot of names, postal addresses, phone numbers and e-mail addresses and you want to remove duplicate rows in this table. Duplicates may be spelled exactly the same, but may also be spelled somewhat different, but still describe the same real world individual or company.

You can do the deduplicating with a spreadsheet.

In old times some spreadsheets had a limit of number of rows to be processed like the 64,000 limit in Excel, but today spreadsheets can process a lot of rows.

In this case you may have the following columns:

  • Name (could be given name and surname or a company name)
  • House number
  • Street name
  • Postal code
  • City name
  • Phone number
  • E-mail address

What you do is that first you sort the sheet by name, then postal code and then street name.

Then you browse down all the rows and focus at one row at the time and from there looks up and down if the rows before or after seems to duplicates. If so, you delete all but one row being the same real world entity.

When finished with all the rows sorted by name, postal code and street name you make an alternate sort, because some possible duplicates may not begin with the same letters in the name field.

So what you do is that you sort the sheet by postal code and then street name and then house number.

Then you browse down all the rows and focus at one row at the time and from there looks up and down if the rows before or after seems to duplicates. If so, you delete all but one row being the same real world entity.

When finished with all the rows sorted by postal code, street name and house number you make an alternate sort, because some possible duplicates may not have the proper postal code assigned or the street name may not start with the same letters.

So what you do is that you sort the sheet by city name and then house number and then name.

Then you browse down all the rows and focus at one row at the time and from there looks up and down if the rows before or after seems to duplicates. If so, you delete all but one row being the same real world entity.

When finished with all the rows sorted by postal code, street name and house number you make an alternate sort, because some duplicates may have moved or have different addresses for other reasons .

So what you do is that you sort the sheet by phone number, then by name and then by postal code.

Then you browse down all the rows and focus at one row at the time and from there looks up and down if the rows before or after seems to duplicates. If so, you delete all but one row being the same real world entity.

When finished with all the rows sorted by phone number, name and then by postal code you make an alternate sort, because some duplicates may not have a phone number or may have different phone numbers.

So what you do is that you sort the sheet by e-mail address, then by name and then by postal code.

Then you browse down all the rows and focus at one row at the time and from there looks up and down if the rows before or after seems to duplicates. If so, you delete all but one row being the same real world entity.

You may:

  • If you only have a few rows do this process within a few hours and possibly find all the duplicates
  • If you have a lot of rows do this process within a few years and possibly find some of the duplicates

PS: The better option is of course avoiding having duplicates in the first place. Unfortunately this is not the case in many situations – here is The Top 5 Reasons for Downstream Cleansing.

Bookmark and Share

4 thoughts on “Deduplicating with a Spreadsheet

  1. Paige Roberts 5th October 2010 / 15:25

    Nice article. Lots of data has been scrubbed with Excel, but this is the first time I’ve ever seen anyone write up the method.

    Trying to de-dupe with a spreadsheet could send you to a rubber room on large data sets, though. That’s why Pervasive came up with Data MatchMerge and did some partnering with Melissa Data and Experion for address completion info.

    It would be great if folks just didn’t make duplicate or incomplete entries, but doesn’t seem very practical to expect. Especially if you’re trying to merge two data sets, for example, when two companies merge (one of your 5 reasons).

    Paige

    • Henrik Liliendahl Sørensen 6th October 2010 / 06:31

      Thanks Paige. Yes, fortunately there are a lot of software packages like the Pervasive Data MatchMerge Solution that will make the described process faster and better than battling with a spreadsheet.

  2. John Owens 6th October 2010 / 10:17

    Nicely Explained, Henrik

    One slight modification that I would suggest is to add another column called “Status” and, when a row ought to be deleted, enter a “D” in this column.

    Values with “D” in this column can be excluded from the next sort using an auto filter on the column with a “custom filter” set to “does not equal” and D in the the adjacent box.

    This a) avoids rows being inadvertently deleted and b) allows you to “undelete” and step back by simply removing the value D from the status column.

    When all rows have been finally sorted, then all those rows with value D in the Status column can be deleted.

    Regards
    John

    • Henrik Liliendahl Sørensen 6th October 2010 / 10:46

      Thanks John. Good advice for those choosing the spreadsheet path.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s