The New Year in Identity Resolution

You may divide identity resolution into these categories:

  • Hard core identity check
  • Light weight real world alignment
  • Digital identity resolution

Hard Core Identity Check

Some business processes require a solid identity check. This is usually the case for credit approval and employment enrolment, for example. Identity checks are also part of criminal investigation and fighting terrorism.

Services for identity checks vary from country to country because of different regulations and different availability of reference data.

An identity check usually involves the entity who is being checked.

Light Weight Real World Alignment

In data quality improvement and Master Data Management (MDM) you often include some form of identity resolution in order to align your data with the real world. For example, when evaluating the result of a data matching activity with names and addresses, you will perform a lightweight identity resolution that leads to marking the matched results as true or false positives.

This kind of identity resolution usually doesn’t involve the entity being examined.

Digital Identity Resolution

Our existence has increasingly moved to the online world. As discussed in the post Addressing Digital Identity, this means that we will also need ways to include digital identity in traditional identity resolution.

There are of course discussions out there about how far digital identity resolution should be possible. For example, real name policy enforcement in social networks is indeed a hot topic.

Future Trends

With regard to digital identity resolution the jury is still out. In my eyes, the economic consequences of the rising social sphere will inevitably affect the demand for knowing who is out there. The opportunities in establishing identity via digital footprints will also be exploited.

My guess is that the distinction between hard core identity check and real world alignment in data quality improvement and MDM will disappear as reference data becomes more available and the price of reference data goes down.

That’s why I’m right now working with a solution (www.instantdq.com) that combines identity check features and data universe into master data management with the possibility of adding digital identity into the mix.

Fighting Identity Fraud with Identity Fraud

I have previously had issues with SEO agencies posting comments on this blog in their quest to help data quality tool vendors get a better search rank for data quality related terms. Example here.

This happened again today with a recent post called Addressing Digital Identity.

I find it quite funny that the SEO guy is talking about fighting identity fraud while posting a comment under a name that I bet is not his/her real name:

InfoGlide SEO scam

Some Kinds of Reference Data

The term “reference data” and the related discipline Reference Data Management (RDM) are commonly used in the data quality and Master Data Management (MDM) realm.

As with most terms it may be used with slightly different meanings. Usually, but not necessarily always, reference data are core data entities defined outside a given organization.

I have come across the kinds of reference data discussed below:

Reference Data in Investment Banking

The term “reference data” is well established in investment banking. Reference data are core master data entities such as counterparties, securities and currencies. These are the things you deal with in investment banking. They are not made up for a given bank or other single financial institution but are shared across the whole market and should ideally be the same for every institution at exactly the same point in time.

Small Reference Data

In Master Data Management in general we usually see reference data as value lists that help describe and standardize internal master data.

One example is a country list. A list of countries should be the same for every organization in the world. However, available lists do differ, though most variations don’t have any business impact, such as the academic question of whether Antarctica should be in the list or not.

A list of codes describing to which industry a given company belongs is another example of reference data. As examined in the post What are they doing? you may choose to standardize on SIC codes, standardize on NACE codes or develop your own set of codes for that purpose.

Big Reference Data

In geography a country list is at the top level of defining locations. Further down we may have postal code systems within each country, such as ZIP codes in the United States, PLZ codes in Germany and PIN codes in India. Further down still we have every single valid postal address, eventually all over the world. This is what I call big reference data.

One way of sourcing industry codes for your customers, suppliers and other business partners is picking from or enriching from a business directory such as the D&B WorldBase or any of the many other business directories around. Such directories may also be seen as big reference data.

With the dramatic increase in the use of social media, social network profiles have emerged as a new kind of big reference data serving as links to our internal master data.

Addressing Digital Identity

A physical address has traditionally been a core element of doing identity resolution. Stating a name and an address is the most widespread way of telling which person or which company we have (or aim at having) a business or other form of relationship with.

However, during the last 25 years a lot of things have moved from the physical world to the online world. Not least, a lot of things start in the online world and in many cases end up in the physical world. Today selling, the smart way, starts in social media. Final delivery may be digital or may be sending a package or a consultant to a physical address. A thing like dating most often starts in the online world today but surely the aim is a physical encounter.

This new way of life has a tremendous effect on data quality and master data management. Within quality of contact data, the most frequent domain for data quality issues, we have traditionally dealt with verifying names and addresses and deduplicating names and addresses.

As the best way of preventing data quality issues is looking at the root cause, we must address the fact that onboarding of contact data often starts with a digital identity where a physical address isn’t present in the first place but will often be added at a later stage.

As described in the post Social MDM and Systems of Engagement, a new trend in master data management is to establish a link between the new systems of engagement and the old systems of record.

In the same way, data quality issue prevention and data quality improvement will have to cover establishing a link between the new discipline of digital identity resolution and the good old address verification stuff.

Beyond True Positives in Deduplication

The most frequently performed data quality improvement process is deduplication of party master data.

A core functionality of many data quality tools is the capability to find duplicates in large datasets with names, addresses and other party identification data.

When evaluating the result of such a process we usually divide the result of found duplicates into:

  • False positives being automated match results that actually do not reflect real world duplicates
  • True positives being automated match results reflecting the same real world entity
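
The split above can be sketched in a few lines of Python. This is only a minimal illustration: the function name, the record IDs and the idea of keeping manually confirmed pairs in a simple list are assumptions made for the example, not features of any particular data quality tool.

```python
def split_match_results(candidate_pairs, confirmed_pairs):
    """Split automated match pairs into true and false positives.

    candidate_pairs: pairs of record IDs proposed by the automated match.
    confirmed_pairs: pairs a human reviewer confirmed as the same real
                     world entity (hypothetical input for this sketch).
    """
    # The order inside a pair does not matter, so compare as frozensets.
    confirmed = {frozenset(pair) for pair in confirmed_pairs}
    true_positives, false_positives = [], []
    for pair in candidate_pairs:
        if frozenset(pair) in confirmed:
            true_positives.append(pair)
        else:
            false_positives.append(pair)
    return true_positives, false_positives


# Made-up record IDs, for illustration only.
candidates = [("A1", "A2"), ("B1", "B2"), ("C1", "C2")]
confirmed = [("A2", "A1"), ("C1", "C2")]

true_positives, false_positives = split_match_results(candidates, confirmed)
# Share of automated matches that were real duplicates.
precision = len(true_positives) / len(candidates)
```

In this made-up run two of the three automated matches are confirmed, so the precision of the match job is two thirds.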

The difficulties in reaching the above result aside, you would think the rest is easy. Take the true positives, merge into a golden record and purge the unneeded duplicate records in your database.

Well, I have seen so many well executed deduplication jobs end just there, because there are a lot of reasons for not making the golden records.

Sure, a lot of duplicates “are bad” and should be eliminated.

But many duplicates “are good” and have actually been put into the databases for a good reason, supporting different kinds of business processes where one view is needed in one case and another view is needed in another case.

Many, many operational applications, including very popular ERP and CRM systems, do have inferior data models that are not able to reflect the complexity of the real world.

Only a handful of MDM (Master Data Management) solutions are able to do so, but even then the solutions aren’t easy as most enterprises have an IT landscape with all kinds of applications with other business relevant functionality that isn’t replaced by an MDM solution.

What I like to do when working with getting business value from true positives is to build a so-called Hierarchical Single Source of Truth.

The Letter Æ

This blog is written in English. Therefore the letters used are normally restricted to A to Z.

The English alphabet is one of many alphabets using Latin (or Roman) letters. Other alphabets, like the Russian, use Cyrillic letters. Then there are other script systems in the world which besides alphabets are abjads, abugidas, syllabic scripts and symbol scripts. Learn more about these in the post Script Systems.

Æ, which in lower case is æ, was part of the Old English alphabet. For example an Old English king was called Æthelred the Unready.

The letter Æ is a combined AE and is pronounced in English as the first letter in Edmund and Edward.

Today Æ exists in a few alphabets: the Danish/Norwegian, the Faroese and the Icelandic. People and places from the corresponding Viking territories may have the letter Æ/æ as part of the string. For example the home of Microsoft Dynamics AX and NAV is the town Vedbæk north of Copenhagen. When represented in the English alphabet the town name will be Vedbaek.

So Vedbæk and Vedbaek should be a 100% match when doing data matching. And so should Vedbæk and Vedb%C%A6k when systems are as bad as Æthelred the Unready was in handling the Vikings.
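
A minimal Python sketch of such a comparison could look like the one below. The function names and the one-entry transliteration table are assumptions for illustration, and the mangled example value used here is "Vedb%C3%A6k", which is the standard percent-encoding of æ in UTF-8.

```python
from urllib.parse import unquote

# Assumed transliteration table for this sketch; extend as needed.
TRANSLITERATIONS = {"æ": "ae"}


def normalize_place(name: str) -> str:
    """Bring spelling variants of a place name to one comparable form."""
    # Undo percent-encoding left behind by a careless system.
    decoded = unquote(name)
    # Case-insensitive comparison; casefold() turns "Æ" into "æ".
    folded = decoded.casefold()
    # Replace letters missing from the English alphabet by their
    # conventional English spelling.
    for letter, replacement in TRANSLITERATIONS.items():
        folded = folded.replace(letter, replacement)
    return folded


def is_same_place(a: str, b: str) -> bool:
    """Return True when two spellings normalize to the same string."""
    return normalize_place(a) == normalize_place(b)
```

With this normalization Vedbæk, Vedbaek and the percent-encoded variant all compare as equal, while different town names still differ.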

And oh, Æthelred wasn’t actually unready. He was unræd meaning bad-counseled.

Business Entity Identifiers

The least cumbersome way of uniquely identifying a business partner being a company, government body or other form of organization is to use an externally provided number.

However, there are quite a lot of different numbers to choose from.

All-Purpose National Identification Numbers

In some countries, like in Scandinavia, the public sector assigns a unique number to every company to be used in every relation to the public sector and open to be used by the private sector as well for identification purposes.

As reported in the post Single Company View, I worked with the early implementation of such a number in Denmark way back in time.

Single-Purpose National Identification Numbers

In most countries there are multiple systems of numbers for companies, each with an original special purpose. Examples are registration numbers, VAT numbers and employer identification numbers.

My current UK company has both a registration number and a VAT number and, very embarrassingly for a data quality and master data geek, these two numbers have different names and addresses attached.

Other Numbering Systems

The best known business entity numbering system around the world is probably the DUNS-number used by Dun & Bradstreet. As examined in the post Select Company_ID from External_Source Where Possible, the use of DUNS-numbers and similar business directory IDs is a very common way of uniquely identifying business partners.

In the manufacturing and retail world legal entities may, as part of the Global Data Synchronization Network, be identified with a Global Location Number (GLN).

There has been a lot of talk in the financial sector lately around implementing yet another numbering system for legal entities with an identifier usually abbreviated as LEI. Wikipedia has the details about a Legal Entity Identification for Financial Contracts.

These are only some of the most used numbering systems for business entities.

So, the trend doesn’t seem to be a single source of truth but multiple sources making up some kind of the truth.
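
One way to live with multiple numbering systems is to keep every known identifier on the same entity and make each of them a valid lookup key. The Python sketch below shows the idea only; the class names, the scheme labels and all identifier values are made-up placeholders, not real numbers or a real product’s data model.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessEntity:
    name: str
    # Maps a numbering scheme label to this entity's number in that scheme.
    identifiers: dict = field(default_factory=dict)


class IdentifierIndex:
    """Look up one entity through any of its external identifiers."""

    def __init__(self):
        self._by_identifier = {}

    def register(self, entity: BusinessEntity) -> None:
        # Index the entity once per (scheme, value) combination.
        for scheme, value in entity.identifiers.items():
            self._by_identifier[(scheme, value)] = entity

    def find(self, scheme: str, value: str):
        # Returns None when the identifier is unknown.
        return self._by_identifier.get((scheme, value))


# Placeholder values, for illustration only.
acme = BusinessEntity(
    "Example Ltd",
    {"REGISTRATION": "00000001", "VAT": "GB000000000", "DUNS": "000000000"},
)
index = IdentifierIndex()
index.register(acme)
```

Looking up the placeholder VAT number or the placeholder DUNS-number returns the same entity object, which is the point: several sources, one business partner.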

Probabilistic Learning in Data Matching

One of the techniques in data matching I have found most exciting is using machine learning techniques such as probabilistic learning, where manually inspected results of previous automated matching runs are used to make the automated matching results better in the future.

Let’s look at an example. Below we are comparing two data rows with legal entities from Argentina:

The names are a close match (colored blue) as we have two swapped words.

The street addresses are an exact match (colored green).

The places are a mismatch (colored red).

All in all we may have a dubious match to be forwarded for manual inspection. This inspection may, based on additional information or other means, end up with confirming these two records as belonging to the same real world legal entity.

Later we may encounter the two records:

The names are a close match (colored blue).

The street addresses are an exact match (colored green).

The places are basically a mismatch, but as we are learning that “Buenos Aires” and “Capital Federal” may be the same, it is now a close match (colored blue).

All in all we may have a dubious match to be forwarded for manual inspection. This inspection may, based on additional information or other means, end up with confirming these two records as belonging to the same real world legal entity.

In a next match run we may meet these two records:

The names are an exact match (colored green).

The street addresses are an exact match (colored green).

The places are basically a mismatch, but as we are consistently learning that “Buenos Aires” and “Capital Federal” may be the same, it is now an exact match (colored green).

We have a confident automated match with no need of costly manual inspection.
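
The learning steps in this example can be mimicked with a very small Python sketch. Note the assumptions: the class name is invented, and a plain count of manual confirmations with fixed thresholds stands in for a real probabilistic model, which would weigh the evidence statistically instead.

```python
from collections import Counter


class PlaceSynonymLearner:
    """Upgrade place comparisons as manual inspections confirm synonyms.

    Assumed thresholds for this sketch: one confirmation turns a
    mismatch into a close match, two or more into an exact match.
    """

    def __init__(self):
        self._confirmations = Counter()

    @staticmethod
    def _key(a: str, b: str) -> frozenset:
        # Order and letter case of the two place names do not matter.
        return frozenset((a.casefold(), b.casefold()))

    def confirm(self, a: str, b: str) -> None:
        """Record that a manual inspection confirmed a match for this pair."""
        self._confirmations[self._key(a, b)] += 1

    def compare(self, a: str, b: str) -> str:
        """Return 'exact', 'close' or 'mismatch' for two place names."""
        if a.casefold() == b.casefold():
            return "exact"
        seen = self._confirmations[self._key(a, b)]
        if seen >= 2:
            return "exact"
        if seen == 1:
            return "close"
        return "mismatch"
```

Used on the example, comparing “Buenos Aires” with “Capital Federal” gives a mismatch at first, a close match after the first confirmed inspection and an exact match after the second.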

This example is one of many more you may learn about in the new eLearningCurve course called Data Parsing, Matching and De-Duplication.

Staying in Doggerland

Currently I’m travelling a lot between my present home in London, United Kingdom and Copenhagen, Denmark, where I have most of my family and where the iDQ headquarters is.

When flying between London and Copenhagen you pass over the southern North Sea. In the old days (8,000 years ago) this area was a land occupied by human beings. This ancient land is known today as Doggerland.

Sometimes I feel like a citizen of Doggerland not really belonging in the United Kingdom or Denmark.

I still have some phone subscriptions in Denmark that my family and I use there. The phone company seems to have a hard time getting a 360 degree customer view as I have two different spellings of my name and two different addresses as seen on the screen when I look up myself in the iDQ service:

Besides having a Customer Relationship Mess (CRM) the phone company has recently shifted their outsourcing partner (from CSC to TCS). This has caused a lot of additional mess, apparently also closing one of my subscriptions because they failed to register my payments. They did however send a chaser, they say, but to the oldest of the addresses, where I don’t pick up mail anymore.

I called to settle the matter and asked if they could correct the address not in use anymore. They couldn’t. The operator did some kind of query into the citizen hub similar to what I can do on iDQ:

However the customer service guy’s screen just showed that I have no address in Denmark in the citizen hub (called CPR), so he couldn’t change the address.

Apparently the phone company correctly picked up an accurate address in the citizen hub when I got the subscription but failed to update it (along with the other subscriptions) when I moved to another domestic address, and now doesn’t have an adequate business rule for when I’m registered at a foreign address.

So now I’m staying in Doggerland.

Beyond Address Validation

The quality of contact master data is the number one data quality issue around.

Lately there has been a lot of momentum among data quality tool providers in offering services for getting at least the postal address in contact data right. The new services are improved by:

  • Being cloud-based, offering validation services that are implemented at data entry and based on fresh reference data.
  • Being international and thus providing address validation for customer and other party data embracing a globalized world.

Capturing an address that is aligned with the real world may have a significant effect on business outcomes as reported by the tool vendor WorldAddresses in a recent blog post.

However, a valid address based on address reference data only tells you if the address is valid, not if the addressee is (still) at the address, and you are not sure if the name and other master data elements are accurate and complete. Therefore you often need to combine address reference data with other big reference data sources such as business directories and consumer/citizen reference sources.

Using business directories is not new at all. Big reference sources such as the D&B WorldBase and many other directories have been around for many years and have been a core element in many data quality initiatives with customer data in business-to-business (B2B) environments and with supplier master data.

Combining address reference data and business entity reference data makes things even better, also because business directories don’t always come with a valid address.

Using publicly available reference data when registering private consumers, employees and other citizen roles has until now been practiced in some industries and for special reasons. Therefore the big reference data and the services are out there and being used today in some business processes.

Mashing up address reference data, business entity reference data and consumer/citizen reference data is a big opportunity for many organizations in the quest for high quality contact master data, as most organizations actually interact with both companies and private persons if we look at the total mix of business processes.

The next big source is going to be social network profiles. As told in the post Social Master Data Management, social media will be an additional source of knowledge about our business partners. Again, you won’t find the full truth here either. You have to mash up all the sources.
