Deduplication vs Identity Resolution

When working with data matching you often find that there are basically two views: a bright view and a dark view.

Traditional data matching, as seen in most data quality tools and master data management solutions, is the bright view: finding duplicates and making a “single customer view”. Identity resolution is the dark view: preventing fraud and catching criminals, terrorists and other villains.

These two poles were discussed in a blog post and the following comments last year. The post was called What is Identity Resolution?

While deduplication and identity resolution may be treated as polar opposites and seemingly contrary disciplines, they are in my eyes interconnected and interdependent. Yin and Yang Data Quality.

At the MDM Summit in London last month one session was about the Golden Nominal: Creating a Single Record View. Here Corinne Brazier, Force Records Manager at the West Midlands Police in the UK, told how a traditional data quality tool with some matching capabilities was used to deal with “customers” who don’t want to be recognized.

The post How to Avoid Losing 5 Billion Euros examined how both traditional data matching tools and identity screening services can be used to prevent and discover fraudulent behavior.

Deduplication becomes better when some element of identity resolution is added to the process. That includes embracing big reference data. Knowing what available sources say about the addresses being matched helps. Knowing what business directories say about companies helps. Knowing what appropriate citizen directories say about individuals helps when deduping records holding data about them.

Identity resolution techniques are based on the same data matching algorithms we use for deduplication. Here, for example, fuzzy search technology helps a lot compared to using wildcards. And of course the same sources mentioned above are key to the resolution.
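
To illustrate that point, here is a minimal sketch in Python; the names and the thresholds are made up for illustration. A wildcard search like "Smith%" misses a transposed spelling entirely, while a fuzzy similarity ratio still scores it as a likely match:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two normalized names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

# A misspelled name still scores high, even though "Smith%" would miss it.
print(similarity("John Smith", "Jon Smiht"))    # high, despite the typos
print(similarity("John Smith", "Maria Lopez"))  # low
```

Real matching engines use more refined techniques (phonetic codes, edit distance variants, culture-specific rules), but the principle is the same.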

Right now I’m dipping deep into the world of big reference data such as address directories, business directories, citizen directories and, the next big thing, social network profiles. I have no doubt that deduplication and identity resolution will be more yinyang than yin and yang in the future.


How to Avoid Losing 5 Billion Euros

Two years ago I made a blog post about how 5 billion Euros were lost due to bad identity resolution at European authorities. The post was called Big Time ROI in Identity Resolution.

In the carbon trade scam criminals were able to trick authorities with fraudulent names and addresses.

One possible way of discovering the fraudsters’ pattern of interrelated names and physical and digital locations was, as explained in the post, to use an “off the shelf” data matching tool in order to achieve what is sometimes called non-obvious relationship awareness. When examining the data I used the Omikron Data Quality Center.

Another and more proactive way would have been upstream prevention by screening identity at data capture.

Identity checking may be more work than you want to include in business processes with a high volume of master data capture, and not least screening the identity of companies and individuals at foreign addresses seems a daunting task.

One way to reduce the time spent on identity screening covering many countries is using a service that embraces many data sources from many countries at the same time. A core technology in doing so is cloud service brokerage. Here your IT department only has to deal with one interface, as opposed to having to find, test and maintain hundreds of different cloud services to get the right data available in business processes.
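
The brokerage idea can be sketched as one interface in front of many country-specific services. All class and method names below are hypothetical, not any particular vendor’s API:

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional

class IdentitySource(ABC):
    """One country-specific identity/reference data service (hypothetical)."""
    @abstractmethod
    def lookup(self, name: str, address: str) -> Optional[dict]: ...

class DanishRegistry(IdentitySource):
    def lookup(self, name: str, address: str) -> Optional[dict]:
        # In reality this would call a national registry web service.
        return {"country": "DK", "name": name, "verified": True}

class UKDirectory(IdentitySource):
    def lookup(self, name: str, address: str) -> Optional[dict]:
        return {"country": "UK", "name": name, "verified": True}

class IdentityBroker:
    """The single interface the IT department has to deal with."""
    def __init__(self, sources: Dict[str, IdentitySource]):
        self.sources = sources

    def screen(self, country: str, name: str, address: str) -> Optional[dict]:
        source = self.sources.get(country)
        return source.lookup(name, address) if source else None

broker = IdentityBroker({"DK": DanishRegistry(), "UK": UKDirectory()})
print(broker.screen("DK", "Jens Jensen", "Hovedgaden 1, Hillerød"))
```

Adding a new country then means plugging one more source into the broker, not changing every business process.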

Right now I’m working with such a solution called instant Data Quality (iDQ).

I really hope there are more organisations and organizations out there wanting to avoid losing 5 billion Euros, Pounds, Dollars, Rupees, whatever, or even a little bit less.


What is Identity Resolution?

We are continuously struggling with defining what it is we are doing: What is data quality? What is Master Data? Lately I’ve been involved in discussions around: What is Identity Resolution? A current discussion on this topic is rolling in the Data Matching LinkedIn group.

This discussion has roots in one of my blog posts called Entity Revolution vs Entity Evolution. Jeffrey Huth of IBM Initiate followed up with the post Entity Resolution & MDM: Interchangeable? In January Phillip Howard of Bloor made a post called There’s identity resolution and then there’s identity resolution (followed up by a correction post the other day called My bad).

It is a “same same but different” discussion. Traditional data matching (or record linkage) as seen in data quality tools and master data management solutions is the bright view: finding duplicates and making a “single business partner view” (or “single party view” or “single customer view”). Identity resolution is the dark view: preventing fraud and catching criminals, terrorists and other villains.

The Gartner Hype Cycle describes the dark view as “Entity Resolution and Analysis”. This discipline is approaching the expectation peak and will, according to Gartner, be absorbed by other disciplines, as no one can tell the difference, I guess.

Certainly there are poles. In an article from 2006 called Identity Resolution and Data Integration David Loshin said: There is a big difference between trying to determine if the same person is being mailed two catalogs instead of one and determining if the individual boarding the plane is on the terrorist list.

But there is also a grey zone.

From a business perspective, for example, the prevention of misuse of a restricted campaign offer is a bit of both sides. Here you want to avoid an existing customer using an offer meant only for new customers. How does that apply to members of the same household or the same company family tree? Or you want to avoid someone using an introduction offer twice by typing her name and address a bit differently.
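
That grey zone case can be sketched mechanically: before granting a new-customer offer, fuzzily compare the applicant’s name and address against existing customers. The normalization rules and the 0.85 threshold below are illustrative only:

```python
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    """Lowercase, drop periods, collapse whitespace."""
    return " ".join(text.lower().replace(".", "").split())

def is_probably_existing(name, address, customers, threshold=0.85):
    """Flag an applicant whose name+address fuzzily matches a known customer."""
    candidate = normalize(name + " " + address)
    for known_name, known_address in customers:
        known = normalize(known_name + " " + known_address)
        if SequenceMatcher(None, candidate, known).ratio() >= threshold:
            return True
    return False

customers = [("Mary Jones", "12 High Street, Leeds")]
# Same person, slightly different spelling, should still be caught.
print(is_probably_existing("Marry Jones", "12 High St., Leeds", customers))
```

A household or company family tree check would additionally compare the address alone, linking different names at the same location.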

From a technical perspective I have an example from working with a newspaper on a big fraud scam, described in the post Big Time ROI in Identity Resolution. Here I had no trouble using a traditional deduplication tool to discover non-obvious relationships. Also, the relationships discovered in traditional data matching end up quite nicely in hierarchy management as part of master data management, as described in the post Fuzzy Hierarchy Management.

And then there is the use of the words identity (resolution) versus entity (resolution).

My feeling is that we could use identity resolution to describe all kinds of matching and linking with party master data, while entity resolution could describe all kinds of matching and linking with all master data entity types as seen in multi-domain master data management. But that’s just my words.


Citizen ID and Biometrics

As I have stated earlier on this blog: the solution to the single most frequent data quality problem, party master data duplicates, is actually very simple: every person (and every legal entity) gets a unique identifier which is used everywhere by everyone.

Some countries, like Denmark where I live, have a unique citizen ID (national identification number). Some countries are on the way, like India with the Aadhaar project. But some of the countries with the largest economies in the world, like the United Kingdom, Germany and the United States, don’t seem to be getting one in the near future.
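
As an illustration of how such an identifier supports data quality checks, here is a sketch of the classic modulus-11 check on the Danish CPR number. Two hedges: numbers issued since 2007 are no longer guaranteed to satisfy this check, and the example number is synthetic, so treat this as a historical sketch rather than a production validator:

```python
def cpr_mod11_ok(cpr: str) -> bool:
    """Classic modulus-11 check for a Danish CPR number (DDMMYY-SSSS).

    Historically all CPR numbers satisfied this check; since 2007 the
    authorities have issued numbers without it, so a failed check no
    longer proves a number is invalid.
    """
    digits = [int(c) for c in cpr if c.isdigit()]
    if len(digits) != 10:
        return False
    weights = [4, 3, 2, 7, 6, 5, 4, 3, 2, 1]
    return sum(d * w for d, w in zip(digits, weights)) % 11 == 0

# A synthetic example constructed to pass the check.
print(cpr_mod11_ok("070761-4285"))  # True
```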

I think the United Kingdom was close recently, but as I understand it the project was cancelled. As seen in a tweet from a discussion on Twitter today, the main obstacles were privacy considerations and costs:

A considerable cost in the suggested project in the United Kingdom, and also as I have seen in discussions about a US project, may be that an implementation today should also include biometric technology.

The question, however, is whether that is necessary.

If we look at the systems in force today, for example in Scandinavia, they were implemented more than 40 years ago, and the Swedish citizen ID was actually introduced without digitalization in 1947. There are ongoing discussions about biometrics here too, as this is inevitable for issuing passports anyway. In the meantime the systems, even without biometrics as a component, continue to make data quality prevention and party master data management a lot easier than elsewhere in the world.

No doubt biometrics will solve some problems related to fraud and the like. But these are rare exceptions. So the cost/benefit analysis for enhancing an existing system with biometrics seems to be negative.

I guess the alleged need for biometrics may have something to do with privacy considerations in a strange way: privacy considerations are often overruled by the requirements for fighting terrorism – and here you need biometrics in identity resolution.


Real World Alignment

I am currently involved in a data management program dealing with multi-entity (multi-domain) master data management described here.

Besides covering several different data domains, such as business partners, products, locations and timetables, the data also serves multiple purposes of use. The client is within public transit, so the subject areas have names like production planning (scheduling), operation monitoring, fare collection and use of service.

A key principle is that the same data should only be stored once, but in a way that makes it serve as high quality information in the different contexts. Doing that often means balancing between the two ways data may be of high quality:

  • Either they are fit for their intended uses
  • Or they correctly represent the real-world construct to which they refer

Some of the balancing has been:

Customer Identification

For some intended uses you don’t have to know the precise identity of a passenger. For other intended uses you must know the identity. The latter cases at my client include giving discounts based on age and on transport need, for example when attending an educational activity. Also, when fighting fraud it helps to know the identity. So the data governance policy (and a business rule) is that customers for most products must provide a national identification number.

Like it or not: having the ID makes a lot of things easier. Uniqueness isn’t a big challenge like in many other master data programs. Enriching your data also becomes a straightforward process. An example here is accurately geocoding where your customers live, which is rather essential when you provide transportation services.

What geocode?

You may use a range of different coordinate systems to express a position, as explained here on Wikipedia. Some systems refer to a round globe (and yes, the real world, the earth, is round), but it is a lot easier to use a system like UTM, where you can easily calculate the distance between two points directly in meters by assuming the real world is as flat as your computer screen.
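
That flat-screen convenience can be shown in a few lines: with UTM easting and northing in meters, the distance between two points in the same zone is plain Pythagoras. The coordinates below are made up:

```python
import math

def utm_distance_m(e1: float, n1: float, e2: float, n2: float) -> float:
    """Euclidean distance in meters between two UTM points in the same zone."""
    return math.hypot(e2 - e1, n2 - n1)

# Two hypothetical customer locations (easting, northing in meters).
print(utm_distance_m(720000, 6180000, 723000, 6184000))  # 5000.0
```

With latitude/longitude on the round globe the same calculation would require a great-circle formula such as haversine.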


Big Time ROI in Identity Resolution

Yesterday I had the chance to make a preliminary assessment of the data quality in one of the local databases holding information about entities involved in carbon trade activities. It is believed that up to 90 percent of the market activity may have been fraudulent, with criminals pocketing 5 billion Euros. There is a description of the scam here.

Most of my work with data matching is aimed at finding duplicates. In doing this you must avoid finding so-called false positives, so you don’t end up merging information about two different real-world entities. But when doing identity resolution, for several reasons including preventing fraud and scams, you may be interested in finding connections between entities that are not supposed to be connected at all.

The result from making such connections in the carbon trade database was quite astonishing. Here is an example where I have changed the names, addresses, e-mails and phones, but such a pattern was found in several cases:

Here we have an example of a group of entities where the name, address, e-mail or phone is shared in a way that doesn’t seem natural.
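
Finding such groups can be done mechanically: link any two records that share an address, e-mail or phone, then read off the connected components. A minimal sketch with made-up records:

```python
from collections import defaultdict

def link_records(records):
    """Group records that share any identifying attribute (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Index records by each shared attribute value, then link them.
    by_value = defaultdict(list)
    for idx, rec in enumerate(records):
        for field in ("address", "email", "phone"):
            if rec.get(field):
                by_value[(field, rec[field])].append(idx)
    for indices in by_value.values():
        for other in indices[1:]:
            union(indices[0], other)

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return [g for g in groups.values() if len(g) > 1]

records = [
    {"name": "Acme Trading",  "address": "1 Main St", "phone": "555-0001"},
    {"name": "Beta Carbon",   "address": "9 Oak Ave", "phone": "555-0001"},
    {"name": "Gamma Credits", "address": "9 Oak Ave", "email": "x@example.com"},
    {"name": "Delta Ltd",     "address": "4 Elm Rd",  "phone": "555-0099"},
]
print(link_records(records))  # the first three records form one suspicious group
```

In practice the shared values would first be fuzzily standardized, so near-identical addresses and phone formats also link up.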

My involvement in the carbon trade scam was initiated by a blog post yesterday by my colleague Jan Erik Ingvaldsen, based on the story that journalists, by merely gazing at the database, had found addresses that simply don’t exist.

So the question is whether the authorities might have avoided losing 5 billion taxpayer Euros if some identity resolution, including automated fuzzy connection checks and real-world checks, had been implemented. I know that everyone is so much more enlightened about what could have been done once a scam is discovered, but I actually think that there may be a lot of other billions of Euros (Pounds, Dollars, Rupees) out there to avoid losing by doing some decent identity resolution.


Information and Data Quality Blog Carnival, February 2010

El Festival del IDQ Bloggers is another name for the monthly recurring roundup of selected (or rather, submitted) blog posts on information and data quality, started last year by the IAIDQ.

This is the February 2010 edition covering posts published in December 2009 and January 2010.

I will go straight to the point:

Daragh O Brien shared the story about a leading Irish Hospital that has come under scrutiny for retaining data without any clear need. This highlights an important relationship between Data Protection/Privacy and Information Quality. Daragh’s post explores some of this relationship through the “Information Quality Lense”. Here’s the story: Personal Data – an Asset we hold on Trust.

Former Publicity Director of the IAIDQ, Daragh has over a decade of coal-face experience in Information Quality Management at the tactical and strategic levels from the Business perspective. He is the Taoiseach (Irish for chieftain) of Castlebridge Associates. Since 2006 he has been writing and presenting about legal issues in Information Quality amongst other topics.

Jim Harris is an independent consultant, speaker, writer and blogger with over 15 years of professional services and application development experience in data quality. Obsessive-Compulsive Data Quality is an independent blog offering a vendor-neutral perspective on data quality.

If you are a data quality professional, know the entire works of Shakespeare by heart and are able to wake up at night and promptly explain the theories of Einstein, you probably know Jim’s blogging. On the other hand: if you don’t know Shakespeare and don’t understand Einstein, then: Jim to the rescue. Read The Dumb and Dumber Guide to Data Quality.

In another post Jim discusses the out-of-box-experience (OOBE) provided by data quality (DQ) software under the title: OOBE-DQ, Where Are You? Jim also posted part 8 of Adventures in Data Profiling – a great series of knowledge sharing on this important discipline within data quality improvement.

Phil Wright is a consultant based in London, UK who specialises in Business Intelligence and Data Quality Management. With 10 years’ experience within the Telecommunications and Financial Services industries, Phil has implemented data quality management programs, led data cleansing exercises and enabled organisations to realise their data management strategy.

The Data Factotum blog is a new blog in the data quality blogosphere, but Phil has kick-started it with 9 great posts during the first month. A balanced approach to scoring data quality is the start of a series on using the balanced scorecard concept in measuring data quality.

Jan Erik Ingvaldsen is a colleague and good friend of mine. In a recent market competition scam, cheap flight tickets from Norwegian Air Shuttle were booked by employees from competitor Cimber Sterling using all kinds of funny names. As usual Jan Erik not only has a nose for a good story but is also able to propose the solutions, as seen in Detecting Scam and Fraud.

In his position as Nordic Sales Manager at Omikron Data Quality, Jan Erik is actually a frequent flyer with Norwegian Air Shuttle. Now he is waiting to see whether he will be included on their vendor list or on the no-fly list.

William Sharp is a writer on technology focused blogs with an emphasis on data quality and identity resolution.

Informatica Data Quality Workbench Matching Algorithms is part of a series of postings where William details the various algorithms available in Informatica Data Quality (IDQ) Workbench. In this post William starts by giving a quick overview of the algorithms available and some typical uses for each. The subsequent postings get more detailed and outline the math behind each algorithm, and the series will finally be finished up with some baseline comparisons using a single set of data.

Personally I really like this kind of ready-made industrial espionage.

IQTrainwrecks hosted the previous blog carnival edition. From this source we also have a couple of postings.

The first was submitted by Grant Robinson, the IAIDQ’s Director of Operations. He shares an amusing but thought provoking story about the accuracy of GPS systems and on-line maps based on his experiences working in Environmental Sciences. Take a dive in the ocean…

Also it is hard to avoid including the hapless Slovak border police and their accidental transportation of high explosives to Dublin due to a breakdown in communication and a reliance on inaccurate contact information. Read all about it.

And finally, we have the post about the return of the Y2k Bug as systems failed to properly handle the move into a new decade, highlighting the need for tactical solutions to information quality problems to be kept under review in a continuous improvement culture in case the problem reoccurs in a different way. Why 2K?

If you missed them, here’s a full list of previous carnival posts:

April 2009 on Obsessive-Compulsive Data Quality by Jim Harris

May 2009 on The DOBlog by Daragh O Brien

June 2009 on Data Governance and Data Quality Insider by Steve Sarsfield

July 2009 by Andrew Brooks

August 2009 on The DQ Chronicle by William E Sharp

September 2009 on Data Quality Edge by Daniel Gent

October 2009 on Tooling around in the IBM Infosphere by Vincent McBurney

November 2009 by the IAIDQ


Santa Quality

On the 3rd of December I feel inspired to relate some data quality issues to Mr. Santa Claus – or what exactly is his name? Is it:

  • Saint Nicholas or
  • Père Noël as they say in French or
  • Weihnachtsmann as they say in German or
  • Julemand as we say in Denmark or
  • Plenty of other local names?

Santa Claus versus Saint Nicholas is an example of the use of nicknames which is a main issue in name matching in many cultures.

It’s also important to observe that the German and Danish names are one word versus the two words in English and French. Many company names and other names in the respective languages share the same linguistic characteristic.
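
Both issues, nicknames and one-word versus two-word spellings, can be handled by normalizing names before comparison. A tiny sketch; the synonym table below is illustrative only, and real matching tools ship much larger, culture-specific tables:

```python
# Illustrative nickname/synonym table, keyed on the despaced form so that
# one-word and two-word spellings hit the same entry.
SYNONYMS = {
    "santaclaus": "saint nicholas",
    "stnicholas": "saint nicholas",
    "saintnicholas": "saint nicholas",
    "julemand": "saint nicholas",
    "julemanden": "saint nicholas",
    "weihnachtsmann": "saint nicholas",
}

def canonical(name: str) -> str:
    """Lowercase, drop periods and spaces, then map known synonyms."""
    key = name.lower().replace(".", "").replace(" ", "")
    return SYNONYMS.get(key, key)

def same_entity(a: str, b: str) -> bool:
    return canonical(a) == canonical(b)

print(same_entity("Santa Claus", "St. Nicholas"))       # True
print(same_entity("Weihnachtsmann", "Weihnachts Mann")) # True
print(same_entity("Santa Claus", "Rudolph"))            # False
```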

Father Christmas is an alternative identification, perhaps more of a job title.

Another question is where he lives.

The North Pole is acknowledged as the correct geographical address in Anglo countries – but there seem to be alternative mailing possibilities such as:

  • Santa Claus, North Pole, Canada, H0H 0H0
  • Father Christmas, North Pole, SAN TA1 (UK)

However, the Finns claim the valid address to be:

In my home country Denmark we will accept nothing but:

  • Julemanden, Box 1615, 3900 Nuuk, Greenland

Finally, I can imagine some of the data quality issues the Santa business has to face:

  • Too many duplicates on the “nice list” leading to heavy overhead in gift spending as well as extra costs in reindeer management.
  • Inaccurate product masters resulting in complaints from nice boys and girls and a lot of scrap and rework.
  • Fraud entries from children already on the ‘naughty list’ may be a challenge.
  • A lot of missing chimney positions may cause severe delivery problems.

But then, why should Santa be smarter than everyone else?
