The term deduplication may mean two different things in computing:
- The storage kind of deduplication
- The data quality kind of deduplication
The storage kind of deduplication refers to reducing the volume of data stored and backed up by finding exactly identical files (or other assemblies of data, I guess) and eliminating all but one copy.
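To make the storage kind concrete, here is a minimal sketch that finds byte-for-byte identical content by comparing hashes. The function name `find_exact_duplicates` and the sample file names are made up for illustration, and real storage products typically work on blocks or chunks rather than whole files held in memory.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(blobs):
    """Group names whose content is byte-for-byte identical.

    `blobs` maps a (hypothetical) file name to its content as bytes.
    """
    by_digest = defaultdict(list)
    for name, content in blobs.items():
        # Identical content always produces the same SHA-256 digest.
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(name)
    # Only digests shared by more than one name indicate duplicates.
    return [names for names in by_digest.values() if len(names) > 1]


# Illustrative data only, not taken from any real system.
files = {
    "report.pdf": b"quarterly numbers",
    "report_copy.pdf": b"quarterly numbers",
    "notes.txt": b"something else",
}
duplicate_groups = find_exact_duplicates(files)
```

In each group all but one copy could be dropped and replaced by a reference to the remaining one.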
The data quality kind of deduplication is about finding entities in databases that don’t have a common unique key and are not spelled exactly the same, but are so similar that we may consider them to represent the same real world object.
The result of the data quality kind of deduplication may be that all but one of the duplicate rows are eliminated, but most often we will actually add more bytes by linking the duplicate rows and perhaps making a new golden record.
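The data quality kind can be sketched in a similar way. This is a simplified illustration, assuming that a plain string similarity from Python's standard `difflib` module and a threshold of 0.85 are good enough. The function names, the threshold and the sample rows are all made up, and real matching tools use far more elaborate comparison logic. Note that nothing is deleted: every row is kept and gets an extra `group` field linking it to its duplicates.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a case-insensitive similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_duplicates(rows, threshold=0.85):
    """Give rows with similar names a shared group id instead of deleting them.

    The threshold is an assumption chosen for this example only.
    """
    group_of = {}
    for i, row in enumerate(rows):
        group_of[i] = i  # every row starts out as its own group
        for j in range(i):
            if similarity(row["name"], rows[j]["name"]) >= threshold:
                group_of[i] = group_of[j]  # join the earlier row's group
                break
    # Return copies of the rows with the link added, so bytes are added, not removed.
    return [dict(row, group=group_of[i]) for i, row in enumerate(rows)]


# Illustrative rows only: no shared key, slightly different spelling.
customers = [
    {"id": 101, "name": "Jon Smith Consulting"},
    {"id": 205, "name": "John Smith Consulting"},
    {"id": 317, "name": "Acme Storage"},
]
linked = link_duplicates(customers)
```

Here the first two rows end up in the same group while the third stays on its own. A golden record could then be built from each group.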
This ambiguity sometimes leads to the two being mixed up.
I remember that some years ago, when I started as employee number 1 at Omikron Data Quality in the Nordics, we ran a meeting booking campaign. This was done by a telemarketing bureau. They booked a lot of meetings for me, including one at a company that was very interested in tools for deduplication.
It was a very strange meeting until, after 12 minutes and 34 seconds, we concluded that there are indeed two kinds of deduplication in computing.
I also noticed lately that a leading vendor of tools for the data quality kind of deduplication promoted its product by referring to articles on cost savings and other benefits that relate to the storage kind of deduplication.