Deduplicating with a Spreadsheet

Say you have a table with a lot of names, postal addresses, phone numbers and e-mail addresses and you want to remove duplicate rows in this table. Duplicates may be spelled exactly the same, but may also be spelled somewhat different, but still describe the same real world individual or company.

You can do the deduplicating with a spreadsheet.

In old times some spreadsheets had a limit of number of rows to be processed like the 64,000 limit in Excel, but today spreadsheets can process a lot of rows.

In this case you may have the following columns:

Name (could be given name and surname or a company name)
House number
Street name
Postal code
City name
Phone number
E-mail address

What you do is that first you sort the sheet by name, then postal code and then street name.

Then you browse down all the rows and focus at one row at the time and from there looks up and down if the rows before or after seems to duplicates. If so, you delete all but one row being the same real world entity.

When finished with all the rows sorted by name, postal code and street name you make an alternate sort, because some possible duplicates may not begin with the same letters in the name field.

So what you do is that you sort the sheet by postal code and then street name and then house number.

When finished with all the rows sorted by postal code, street name and house number you make an alternate sort, because some possible duplicates may not have the proper postal code assigned or the street name may not start with the same letters.

So what you do is that you sort the sheet by city name and then house number and then name.

When finished with all the rows sorted by postal code, street name and house number you make an alternate sort, because some duplicates may have moved or have different addresses for other reasons .

So what you do is that you sort the sheet by phone number, then by name and then by postal code.

When finished with all the rows sorted by phone number, name and then by postal code you make an alternate sort, because some duplicates may not have a phone number or may have different phone numbers.

So what you do is that you sort the sheet by e-mail address, then by name and then by postal code.

You may:

If you only have a few rows do this process within a few hours and possibly find all the duplicates
If you have a lot of rows do this process within a few years and possibly find some of the duplicates

PS: The better option is of course avoiding having duplicates in the first place. Unfortunately this is not the case in many situations – here is The Top 5 Reasons for Downstream Cleansing.

Paige Roberts 5th October 2010 / 15:25

Nice article. Lots of data has been scrubbed with Excel, but this is the first time I’ve ever seen anyone write up the method.

Trying to de-dupe with a spreadsheet could send you to a rubber room on large data sets, though. That’s why Pervasive came up with Data MatchMerge and did some partnering with Melissa Data and Experion for address completion info.

It would be great if folks just didn’t make duplicate or incomplete entries, but doesn’t seem very practical to expect. Especially if you’re trying to merge two data sets, for example, when two companies merge (one of your 5 reasons).

Paige

- Henrik Liliendahl Sørensen 6th October 2010 / 06:31
  
  Thanks Paige. Yes, fortunately there are a lot of software packages like the Pervasive Data MatchMerge Solution that will make the described process faster and better than battling with a spreadsheet.
  
John Owens 6th October 2010 / 10:17

Nicely Explained, Henrik

One slight modification that I would suggest is to add another column called “Status” and, when a row ought to be deleted, enter a “D” in this column.

Values with “D” in this column can be excluded from the next sort using an auto filter on the column with a “custom filter” set to “does not equal” and D in the the adjacent box.

This a) avoids rows being inadvertently deleted and b) allows you to “undelete” and step back by simply removing the value D from the status column.

When all rows have been finally sorted, then all those rows with value D in the Status column can be deleted.

Regards
John

- Henrik Liliendahl Sørensen 6th October 2010 / 10:46
  
  Thanks John. Good advice for those choosing the spreadsheet path.

	Henrik Gabs Lilienda… on Balancing the Business Partner…
	Jeppe Thing Sørensen on Balancing the Business Partner…
	peolsolutions on MDM, Cloud, SaaS, PaaS, IaaS a…
	Henrik Gabs Lilienda… on Is the Holiday Season called C…
	Michael D. on Is the Holiday Season called C…
	Jay Ram on The Disruptive MDM List is…
	Henrik Gabs Lilienda… on The Intersection of Data Obser…
	Shanker on The Intersection of Data Obser…
	Bhavani Shanker on Data Matching Efficiency
	Henrik Gabs Lilienda… on Data Matching Efficiency
	Bhavani Shanker on Data Matching Efficiency
	Henrik Gabs Lilienda… on From Platforms to Ecosyst…
	Michael Fieg on From Platforms to Ecosyst…
	From Platforms to Ec… on What is Collaborative Product…
	From Platforms to Ec… on MDM and Knowledge Graph

Liliendahl on Data Quality

A blog about Master Data Management, Product Information Management, Data Quality Management and more

Deduplicating with a Spreadsheet

Related

4 thoughts on “Deduplicating with a Spreadsheet”

Leave a comment Cancel reply