Deduplicating with a Spreadsheet

Say you have a table with a lot of names, postal addresses, phone numbers and e-mail addresses, and you want to remove duplicate rows from this table. Duplicates may be spelled exactly the same, but they may also be spelled somewhat differently while still describing the same real world individual or company.

You can do the deduplication with a spreadsheet.

In the old days some spreadsheets had a limit on the number of rows that could be processed, like the 65,536-row limit in older versions of Excel, but today spreadsheets can process a lot of rows.

In this case you may have the following columns:

  • Name (could be given name and surname or a company name)
  • House number
  • Street name
  • Postal code
  • City name
  • Phone number
  • E-mail address

What you do first is sort the sheet by name, then by postal code and then by street name.

Then you browse down all the rows, focusing on one row at a time, and from there look up and down to see whether the rows before or after seem to be duplicates. If so, you delete all but one of the rows describing the same real world entity.

When finished with all the rows sorted by name, postal code and street name, you make an alternate sort, because some possible duplicates may not begin with the same letters in the name field.

So what you do is sort the sheet by postal code, then street name and then house number.

Then you browse down all the rows, focusing on one row at a time, and from there look up and down to see whether the rows before or after seem to be duplicates. If so, you delete all but one of the rows describing the same real world entity.

When finished with all the rows sorted by postal code, street name and house number, you make an alternate sort, because some possible duplicates may not have the proper postal code assigned, or the street name may not start with the same letters.

So what you do is sort the sheet by city name, then house number and then name.

Then you browse down all the rows, focusing on one row at a time, and from there look up and down to see whether the rows before or after seem to be duplicates. If so, you delete all but one of the rows describing the same real world entity.

When finished with all the rows sorted by city name, house number and name, you make an alternate sort, because some duplicates may have moved or have different addresses for other reasons.

So what you do is sort the sheet by phone number, then by name and then by postal code.

Then you browse down all the rows, focusing on one row at a time, and from there look up and down to see whether the rows before or after seem to be duplicates. If so, you delete all but one of the rows describing the same real world entity.

When finished with all the rows sorted by phone number, name and postal code, you make an alternate sort, because some duplicates may not have a phone number or may have different phone numbers.

So what you do is sort the sheet by e-mail address, then by name and then by postal code.

Then you browse down all the rows, focusing on one row at a time, and from there look up and down to see whether the rows before or after seem to be duplicates. If so, you delete all but one of the rows describing the same real world entity.

You may:

  • If you only have a few rows, do this process within a few hours and possibly find all the duplicates
  • If you have a lot of rows, do this process within a few years and possibly find some of the duplicates
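
If the sheet is too large to review by hand within reasonable time, the same sort-and-scan idea can be scripted. Below is a minimal sketch in Python (not part of the original post) using only the standard library; the file name customers.csv, the column names and the 0.85 similarity threshold are my own assumptions, chosen to match the column list above.

  # A rough automation of the manual passes described above.
  # File name, column names and the 0.85 threshold are assumptions.
  import csv
  from difflib import SequenceMatcher

  def similar(a, b, threshold=0.85):
      # Crude string similarity; good enough to flag candidates for review.
      return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

  # The same sort orders as the manual passes.
  SORT_PASSES = [
      ("name", "postal_code", "street_name"),
      ("postal_code", "street_name", "house_number"),
      ("city_name", "house_number", "name"),
      ("phone_number", "name", "postal_code"),
      ("email", "name", "postal_code"),
  ]

  def flag_candidates(rows):
      # Compare each row only with its neighbour in every sort order.
      candidates = set()
      for keys in SORT_PASSES:
          order = sorted(range(len(rows)), key=lambda i: tuple(rows[i][k] for k in keys))
          for a, b in zip(order, order[1:]):
              if similar(rows[a]["name"], rows[b]["name"]) and \
                 similar(rows[a]["street_name"], rows[b]["street_name"]):
                  candidates.add(tuple(sorted((a, b))))
      return candidates

  with open("customers.csv", newline="", encoding="utf-8") as f:  # hypothetical file
      rows = list(csv.DictReader(f))

  for a, b in sorted(flag_candidates(rows)):
      print("Possible duplicate: row", a + 2, "and row", b + 2)  # +2: header row plus 1-based counting

The sketch only flags candidate pairs for a human to review; just as in the manual passes, the decision to delete all but one row still has to be made by looking at the actual records.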

PS: The better option is of course avoiding having duplicates in the first place. Unfortunately this is not the case in many situations – here are The Top 5 Reasons for Downstream Cleansing.


My Secret

Yesterday I attended a webinar on DataQualityPro with ECCMA ISO 8000 project leader Peter Benson.

Peter had a lot of good sayings, and fortunately Jim Harris, as a result of his live tweeting, has documented a sample of good quotes here.

My favorite:

“Quality data does NOT guarantee quality information, but quality information is impossible without quality data.”

I have personally conducted an experiment that supports that hypothesis. It goes like this:

First, I found a data file on my computer. Lots of data in there, being numbers and letters. And sure, what is interesting is the information I can derive from it for different purposes.

Then I deleted the data file and tried to see how much information was left behind.

Guess what? Not a bit.

I first published that experiment as a comment to one of Jim’s blog posts: Data Quality and the Cupertino Effect.

As documented in the comments on this blog post, the subject of data (quality) versus information (quality) is ever recurring and almost always guarantees a fierce discussion among data/information management professionals.

So, I’ll just tell you this secret: My work in achieving quality information is done by fixing data quality.

And guess what? I have disabled comments on this blog post.


Data Quality Is Like Parenting

Thinking about it: Data Quality has a lot of similarities with parenting.

Some equivalences that come to my mind are:

  • Parenting must be done by everyone who has children; you are not supposed to have an education in education before becoming a parent. The same goes for data. You are not supposed to be a data quality expert before working with data; some common sense will bring you a long way.
  • Some parenting experts never had their own children. I have seen the same with data quality experts too.
  • Many people are more knowledgeable about how other people should raise children than about raising their own children. Same same with data quality.
  • While we may have some noise within the family when parenting, we keep that internal and keep up appearances to the outside. I think everyone has seen the same with data quality.
  • There may be different styles in parenting going from “because I said so” to talking about it. The same is true around data quality improvement efforts.
  • We do see more and more regulation around parenting, like in my country it is now forbidden to slap your kids. I think it should be forbidden to slap your naughty data too.


Consultants

Having just arrived home from summer vacation, I have been thinking a bit about how we consultants act at work. On our vacation we used local guides at some places. These guides were our consultants at places they knew very well and we didn’t know at all. But I also noticed they had some habits which may be considered common weaknesses of practicing consultancy.

Different language

Francisco Caballero has lived all his long life in the beautiful town of Ronda in Southern Spain. He shared his great knowledge about the town with us in his distinctive blend of English and Spanish, spiced up with some Russian, German and probably also Dutch words. I think we understood most of it, though we did have some variances when we compared our perceptions afterwards.

Personal opinions

Besides telling about the town and the history behind it, Señor Caballero also shared his views on politics. He told us about problems with young people today and increasing crime. He remembered things being much better when Generalissimo Franco was in charge. He admitted, though, that today there are no “bandidos” in the mountains as in the old days, but as he put it: “Today all bandidos in Madrid”. I guess he was referring to recent governments.

Assessing risk

Robert is a fifth-generation resident of British descent living in Gibraltar, the small British enclave around the marvelous rock on the southern tip of Spain, facing Africa across the narrow strait. I remember that the opening scene of the James Bond film The Living Daylights is a hazardous car ride down the rock. Robert took us in his taxi on the very same narrow roads, practicing pretty much the same style of driving, while explaining that as we had to get out of and into the car all the time at the different sights, there was really no point in using the seat belts.

Personal commercial agenda

Salam seemed to know everyone and everything in Tangier, the Moroccan city on the northern tip of Africa on the other side of the Strait of Gibraltar. Salam offered us a guided tour where we would go everywhere we wanted and look at everything we fancied, taking as much time as we pleased. Only, when going around he strongly urged us to go to exactly the spice shop he knew, and strongly recommended not sitting at the café we spotted but proceeding to a much better one. As infidels we couldn’t of course go into a mosque, unless (of course) we gave some extra euros.


No Re-Tweets?

12 hours ago I noticed the following tweet on Twitter from the profile @GartnerTedF:

The person behind @GartnerTedF is the analyst Ted Friedman of Gartner, Inc. He is a very important person in the data quality realm as he co-writes the Magic Quadrant.

Many of Ted’s tweets are re-tweeted by other tweeps.

But not this one.

I think I know why: It’s because technology simply doesn’t work.

I have noticed this often. What happens is that Twitter somehow simply doesn’t index some tweets from time to time, so people don’t see them.

New Blog Name?

As reported by Mark Goloboy here, “Data Quality” is becoming a dirty word. “Information Quality” is in vogue.

Maybe I will soon have to change the name of my blog?

Also, one may expect other related terms to be changed, like:

  • Data Governance becomes Information Governance
  • Master Data Management becomes Master Information Management
  • Data Matching becomes Information Matching
  • Data Warehouse becomes Information Warehouse
  • Database becomes Informationbase
  • Information Technology becomes Data Technology

But changing the name of a blog is a serious thing you shouldn’t do too often. I think I will wait and see if the term renaming stops at simply replacing data with information. Some guesses for further renaming:

Information Fitness replaces Data Quality, as data quality is often defined as “fit for the intended purpose of use”, and by replacing data with information that trail becomes even clearer – as opposed to the other trail, being real world alignment.

Information Political Correctness replaces Data Governance, as Data Governance is a lot about policies and the Data Governance practice is a lot about maneuvering in the corporate political landscape.

Master Information Technology (MIT) replaces Master Data Management (MDM)


Why do you watch it?

Statler and Waldorf are a pair of Muppet characters. They are two ornery, disagreeable old men. Despite constantly complaining about the show and how terrible some acts were, they would always be back the following week in the best seats in the house. At the end of one episode, they looked at the camera and asked: “Why do you watch it?”

This is a bit like blogging about data quality, isn’t it? Always describing how bad data is everywhere. Bashing executives who don’t get it. Telling about all the hard obstacles ahead. Explaining that you don’t have to boil the ocean but might succeed by settling for warming up a nice little drop of water.

Despite really wanting to tell a lot of success stories, being the funny Fozzie Bear on the stage, well, I am afraid I too have been spending most of my time on the balcony with Statler and Waldorf.

So, from this day forward: More success stories.

This is the start of a series of 1.3 blog posts…. No, just kidding.


The Slurry Project

When cleansing party master data it is often necessary to typify the records in order to settle whether each one is a business entity, a private consumer, a department (or project) within a business, an employee at a business, a household, or some kind of dirt, test, comic name or other illegible name and address.

Once I did such a cleansing job for a client in the farming sector. When I browsed the result looking for false positives in the illegible group, this name showed up:

  • The Slurry Project (in Danish: Gylleprojektet)

Normally it could be that someone had called a really shitty project a bad name or provided dirty data for whatever reason. But in the context of the farming sector it makes a good name for a project dealing with better exploitation of slurry in growing crops.

A good example of the need for being able to adjust the bad word lists according to the context when cleansing data.
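
As an aside, here is a minimal sketch (mine, not the tool used on that job) of what a context-aware bad word check could look like; the word lists, the "farming" context key and the function name are all made-up examples.

  # A made-up example of a bad word list that honours the client context,
  # so “slurry” is suspicious in general but acceptable for a farming client.
  DEFAULT_BAD_WORDS = {"test", "dummy", "asdf", "slurry"}

  CONTEXT_EXCEPTIONS = {
      "farming": {"slurry"},   # legitimate vocabulary in this sector
  }

  def looks_illegible(name, context=None):
      # Flag a party name as possibly illegible, minus the context exceptions.
      bad_words = DEFAULT_BAD_WORDS - CONTEXT_EXCEPTIONS.get(context, set())
      tokens = {token.strip(".,").lower() for token in name.split()}
      return bool(tokens & bad_words)

  print(looks_illegible("The Slurry Project"))             # True  -> flagged
  print(looks_illegible("The Slurry Project", "farming"))  # False -> accepted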



Sticky Data Quality Flaws

Fighting data quality flaws is often most successfully done at data entry. When incorrect information has been entered into the system, it most often seems nearly impossible to eliminate the falsehood.

A hilarious example is told in an article from telegraph.co.uk. A local council sent a letter to a woman’s pet pig (named Blossom Grant) offering the animal the chance to register to vote in last week’s UK election. This is only the culmination of a lot of letters – including tons of direct marketing – addressed to the pigsty. The pigsty was, according to the article, wrongly registered as a residence some years ago after a renovation. Since then the pig’s owner (named Pauline Grant) has tried to get the error corrected over and over again – but with no success.
