The Myth about a Myth

A sentiment repeated again and again in relation to Data (Information) Quality improvement goes like this:

“It’s a myth that Data Quality improvement is all about technology”.

In fact you see the same for a lot of other disciplines, such as:

  • “It’s a myth that Master Data Management is all about technology”.
  • “It’s a myth that Business Intelligence is all about technology”.
  • “It’s a myth that Customer Relationship Management is all about technology”.

I have a problem with that: I have never heard anyone say that DQ/MDM/BI/CRM… is all about technology and I have never seen anyone write so.

When I make the above remark the reaction is almost always this:

“Of course not, but I have seen a lot of projects carried out as if they were all about technology – and of course they failed”.

Unquestionably true.

But then the next question is about root cause. Why did those projects seem to be all about technology? I think the reasons were:

  • Poor project management or
  • Bad balance between business and IT involvement or
  • Immature technology alienating business users.

In my eyes there is no myth saying that Data Quality (and a lot of other things) is all about technology. It’s a myth that it’s a myth.


Diversity in City Names

The metro area I live in is called Copenhagen – in English. The local Danish name is København. When I go across the bridge to Sweden the road signs point back at the Swedish variant of the name, Köpenhamn. When the new bridge from Germany to eastern Denmark is finished, the road signs on the German side will point at Kopenhagen. A flight from Paris has the destination Copenhague. From Rome it is Copenaghen. The Latin name is Hafnia.

These language variants of city (and other) names are a challenge in data matching.

If a human is doing the matching, the match may be made because that person knows about the language variations. This is a strength of human processing. But it is also a weakness: if another person doesn’t know about the variations, the matching will be inconsistent and won’t repeat the same results.

Computerized match processing may handle the challenge in different ways, including:

  • The data model may reflect the real world by having places described by multiple names in given languages.
  • Some data matching solutions use synonym lists for this challenge (a minimal sketch of this approach follows the list).
  • Probabilistic learning is another way. The computer finds a similarity between two sets of data describing an entity but with a varying place name. A human may confirm the connection, and the varying place names will then be included in the next automated match.
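
To illustrate the synonym list approach, here is a minimal sketch in Python; the function names and the tiny synonym table are my own illustrative assumptions, not taken from any particular matching product:

# Minimal sketch of synonym-based place name normalisation.
# The synonym table is a tiny illustrative example, not a real reference list.
CITY_SYNONYMS = {
    "københavn": "copenhagen",
    "köpenhamn": "copenhagen",
    "kopenhagen": "copenhagen",
    "copenhague": "copenhagen",
    "copenaghen": "copenhagen",
    "hafnia": "copenhagen",
}

def canonical_city(name: str) -> str:
    """Return the canonical form of a place name, if a language variant is known."""
    key = name.strip().lower()
    return CITY_SYNONYMS.get(key, key)

def same_city(name1: str, name2: str) -> bool:
    """Two place names match if they normalise to the same canonical form."""
    return canonical_city(name1) == canonical_city(name2)

# same_city("København", "Copenhague") -> True
# same_city("Copenhagen", "Malmö")     -> False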

As globalization moves forward, data matching solutions have to deal with diversity in data. A solution may have worked wonders yesterday with domestic data but be useless tomorrow with international data.


2010 predictions

Today this blog has been live for half a year, Christmas is just around the corner in countries with Christian cultural roots, and a new year – even decade – is closing in according to the Gregorian calendar.

It’s time for my 2010 predictions.

Football

Over at the Informatica blog Chris Boorman and Joe McKendrick are discussing who’s going to win next year’s largest sporting event: the football (soccer) World Cup. I don’t think England, the USA, Germany (or my team Denmark) will make it. Brazil takes the victory as a co-favorite – and home team South Africa will go to the semi-finals.

Climate

Brazil and South Africa also had main roles in the recent Climate Summit in my hometown Copenhagen. Despite heavy executive buy-in, a very weak deal with no operational Key Performance Indicators was reached here. Money was on the table – but assigned to reactive approaches.

Our hope for avoiding climate catastrophes is now related to national responsibility and technological improvements.

Data Quality

A reactive approach, lack of enterprise-wide responsibility and reliance on technological improvements are also well-known circumstances in the realm of data quality.

I think we will have to deal with this next year as well. We have to be better at working under these conditions. That means being able to perform reactive projects faster and better while also implementing prevention upstream. Aligning people, processes and technology is as key as ever in doing that.

Some areas where we will, in my eyes, see improvements are:

  • Exploiting rich external reference data
  • International capabilities
  • Service orientation
  • Small business support
  • Human like technology

The page Data Quality 2.0 has more content on these topics.

Merry Christmas and a Happy New Year.


Data Quality and Climate Change Management

A month ago I made a blog post titled “Data Quality and climate politics”. In this post I highlighted some similarities between data governance / data quality and climate politics mainly focussing on why sometimes nothing is done.

Today, one day before the United Nations climate change summit commences in my hometown Copenhagen, it seems that executive buy-in has come through. Over 100 heads of state and government will attend the conference, among them key stakeholders such as Indian prime minister Singh and US president Obama.

The plan for how to manage climate change seems at this moment to have some ingredients with similarities to how to manage data quality change.  

The bill

Related to my previous post Eugene Desyatnik commented on LinkedIn:

In both cases, everyone in their heart agrees it’s a noble cause, and sees how they can benefit — but in both cases, everyone also hopes someone else will pay for most of it.

Progress in fighting climate change seems to be closely related to whether the rich countries agree about paying a fair share.

With enterprise data quality you also can’t rely on one business unit paying for solving all enterprise-wide data quality issues related to common data domains.

Key Performance Indicators

Reductions in greenhouse gas emissions are key performance indicators and goals in fighting climate change – measuring temperatures is more like looking at the final outcome.

For data quality we also know that the business outcome is related to information in context, but in order to track progress we have to measure (raw) data quality at the root.

Using technology

This article from BBC, “Tackling climate change with technology”, points at a wealth of different technologies that may help fight global warming while we still get the power we need. There are pros and cons for each. Some technologies work in some geographies but not elsewhere. Some technologies are mature now and some will be in the future. There is no silver bullet, but a range of different possibilities.

Very similar to data quality technology.

What’s in an eMail Address?

When you are deduping, consolidating or doing identity resolution with party master data, the elements that may be used include names, postal addresses and places, phone numbers, national IDs and eMail addresses.

Types of eMail addresses

In this post I will take a closer look at eMail addresses, based on a general list of party master data types.

You may divide eMail addresses into these types:

CONSUMER/CITIZEN:

This is a private eMail address belonging to an individual person.

Typical formats are myname@hotmail.com and nickname@gmail.com and name123@anymail.com

As a private person you may change your eMail address as time goes by, or have several such addresses at a time, depending on your favourite providers of eMail services and other reasons to split your personality.

HOUSEHOLD:

A household/family may choose to have a shared eMail Address for private use.

The typical format will be xyz-family@anymail.com, where the word family of course could be in a lot of different languages, like famiglia-italiano@email.it

A special usage is the GROUP, where two (or more) names are included, like mary-and-john@anymail.com

EMPLOYEE:

This is the eMail address you are assigned as an employee (including owner) at a company.

Common formats are abc@company.com and name.name@company.com

When you change employer you also change eMail address, and you may have several employers or other assignments at the same time. Also, different formats like initials and full name may point to the same inbox.

DEPARTMENT:

Here the eMail address does not point at a particular person but at some sort of team within a company.

Formats are like sales@company.com, salg@firma.dk and vertrieb@firma.de, naming the sales team in different languages.

Some eMail addresses refer to a specific FUNCTION, like webmaster@company.com

BUSINESS:

This is an eMail address for the entire company.

Most common formats are info@company.com and company@company.com

INVALID:

Often a field designed for an eMail address is populated with invalid values, ranging from obviously wrong values like XXX to harder-to-detect syntax errors and non-existent domains.
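
Just as an illustration, a very rough heuristic classifier along the lines of the types above might look like this sketch in Python; the word lists, the free-mail domain list and the function name are my own assumptions, and a real solution would need far richer reference data and country/language knowledge:

import re

# Hypothetical, incomplete reference lists for illustration only.
FREE_MAIL_DOMAINS = {"hotmail.com", "gmail.com", "anymail.com"}
DEPARTMENT_WORDS = {"sales", "salg", "vertrieb", "support", "webmaster"}
BUSINESS_WORDS = {"info", "contact", "office", "mail"}
FAMILY_WORDS = {"family", "famiglia", "familie"}

EMAIL_SYNTAX = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def classify_email(address: str) -> str:
    """Very rough guess at the party master data type behind an eMail address."""
    address = address.strip().lower()
    if not EMAIL_SYNTAX.match(address):
        return "INVALID"
    local, domain = address.split("@", 1)
    if any(word in local for word in FAMILY_WORDS) or "-and-" in local:
        return "HOUSEHOLD"
    if domain in FREE_MAIL_DOMAINS:
        return "CONSUMER/CITIZEN"
    if local in DEPARTMENT_WORDS:
        return "DEPARTMENT"  # covers the FUNCTION variant too
    if local in BUSINESS_WORDS or local == domain.split(".")[0]:
        return "BUSINESS"
    return "EMPLOYEE"

# classify_email("sales@company.com")    -> "DEPARTMENT"
# classify_email("name123@anymail.com")  -> "CONSUMER/CITIZEN"
# classify_email("XXX")                  -> "INVALID"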

Real world duplication

Many online services are based on registration via an eMail address, assuming that one eMail address represents one real world entity, which of course is not the case.

Even on a service like LinkedIn, where you may attach several eMail addresses to one profile, you do encounter persons with obvious duplicate profiles.

Multi-channel marketing and sales

An increasing number of organisations are doing both offline and online operations today, and when building enterprise-wide master data hubs the eMail address becomes a more and more important element in matching party master data.

In such matching activities the eMail address cannot stand alone but must be combined with the other elements, such as names, postal addresses, phone numbers and national IDs, where available.

Success in automating such processes is based on advanced algorithms in flexible and configurable solutions.

Comment or eMail me

If you have also been battling with this, I will be glad to have your comments here or by mail. My mail is hlsgr@mail.tele.dk and hls@omikron.net and hls@locus.dk and hls@dmpartner.dk and nordic@omikron.net


Man versus Computer. Special Edition.

Following up on my previous post on Man versus Computer, I am actually reminded most workday mornings about how man sucks.

Most workday mornings I leave home in my car heading into the following traffic:

  • A 4-lane motorway rolling in from southern Copenhagen, the rest of Denmark, Germany and ultimately the rest of Eurasia.
  • A 5th lane coming in from a local area.

These 5 lanes then split into:

  • 2 lanes heading for the Danish answer to Silicon Valley (called Ballerup)
  • 3 lanes leading to downtown Copenhagen or the main fair (called Bella Center), the airport, Sweden and the rest of Scandinavia.

Of course you would expect some mingling here. What happens every morning is rather a complete stop in traffic, and the cause is not the merging and splitting but the humans doing the driving:

  • Experienced local selfish drivers staying in the fastest lane until they suddenly want to switch lanes according to their onward route.
  • Inexperienced (in this area) foreign drivers coming up from crowded central Europe in search of tranquility deep in the Swedish forests, having no clue about where to position themselves in this intersection. The same goes for Swedes returning for the opposite reason.
  • Everyone else having fun blocking the lane switching from the selfish types and the foreign ones, who should know better than to pass through in rush hour.

Some solutions to this problem might be:

  • Change Management teaching people better driving habits.
  • An onboard computer in every car taking care of lane positioning. Splitting 5 lanes into 2 + 3 lanes should then go smoothly.

Now I am waiting to see which solution will be implemented first.

Guerrilla Data Quality

Oh yes, in my crazy berserkergang of presenting stupid buzzword suggestions it’s time for “Guerrilla Data Quality”. And this time there are no previous hits on Google to point at as the original source.

But I noticed that “Guerrilla Data Governance” is in use, and as Data Governance and Data Quality are closely related disciplines, I think there could be something to “Guerrilla Data Quality”.

Also recently an article called “How to set data quality goals any business can achieve” was published by Dylan Jones on DataQualityPro. Here the need for setting short-term, realistic goals is emphasised, in contrast to making a full-size, enterprise-wide, all-domain massive initiative. This article focuses on the people and process side of what may be “Guerrilla Data Quality”.

Recently I wrote a blog post called “Driving Data Quality in 2 Lanes” focussing on tool selection for what may be “Guerrilla Data Quality” and the enterprise-wide follow-up.

Actually I guess most Data Quality activity going on is in fact “Guerrilla Data Quality”. The problem then is that most literature and teaching on Data Quality is aimed at the massive, enterprise-wide implementations.

Any thoughts?

Splitting names

When working through a list of names in order to do deduplication, consolidation or identity resolution, you will meet name fields populated like these:

  • Margaret & John Smith
  • Margaret Smith. John Smith
  • Maria Dolores St. John Smith
  • Johnson & Johnson Limited
  • Johnson & Johnson Limited, John Smith
  • Johnson Furniture Inc., Sales Dept
  • Johnson, Johnson and Smith Sales Training

Some of the entities having these names must be split into two entities before we can do the proper processing.

When you as a human look at a name field, you mostly (given that you share the same culture) know what it is about.

Making a computer program that does the same is an exciting but fearful journey.

What I have been working with includes the following techniques (a simplified sketch follows the list):

  • String manipulation
  • Look-up in lists of words such as given names, family names, titles, “business words” and special characters. These are country/culture specific.
  • Matching with address directories, used for checking if the address is a private residence or a business address.
  • Matching with business directories, used for checking if it is in fact a business name and which part of a name string is not included in the corresponding name.
  • Matching with consumer/citizen directories, used for checking which names are known on an address.
  • Probabilistic learning, storing and looking up previous human decisions.
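
A very simplified illustration of the string manipulation and word list look-up techniques could be a sketch like this, with tiny, hypothetical word lists of my own making; the directory matching and probabilistic learning steps are not shown, and only a few obvious patterns are handled:

import re

# Hypothetical, tiny word lists; real lists are country/culture specific.
BUSINESS_WORDS = {"limited", "ltd", "inc", "llc", "a/s", "gmbh"}
GIVEN_NAMES = {"margaret", "john", "maria"}  # stand-in for a real given name directory

def looks_like_business(name: str) -> bool:
    """Guess that a name is a business if it contains a known business word."""
    words = {w.strip(",.").lower() for w in name.split()}
    return bool(words & BUSINESS_WORDS)

def split_name_field(field: str) -> list[str]:
    """Split a few obvious patterns; everything else is left untouched."""
    # Pattern: "Margaret & John Smith" -> two consumers sharing the family name.
    m = re.fullmatch(r"(\w+) & (\w+) (\w+)", field)
    if (m and not looks_like_business(field)
            and m.group(1).lower() in GIVEN_NAMES
            and m.group(2).lower() in GIVEN_NAMES):
        return [f"{m.group(1)} {m.group(3)}", f"{m.group(2)} {m.group(3)}"]
    # Pattern: "Johnson & Johnson Limited, John Smith" -> business plus person/department.
    if "," in field and looks_like_business(field.split(",", 1)[0]):
        business, person = (p.strip() for p in field.split(",", 1))
        return [business, person]
    return [field]

# split_name_field("Margaret & John Smith")
#     -> ["Margaret Smith", "John Smith"]
# split_name_field("Johnson & Johnson Limited, John Smith")
#     -> ["Johnson & Johnson Limited", "John Smith"]
# split_name_field("Maria Dolores St. John Smith") -> unchanged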

As with other computer-supported data quality processes, I have found it useful to have the computer divide the names into 3 pots (a tiny sketch of this triage follows the list):

  • A: The ones the computer may split automatically with an accepted failure rate of false positives
  • B: The dubious ones, selected for human inspection
  • C: The clean ones where the computer has found no reason to split (with an accepted failure rate of false negatives)
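
Assuming some split-confidence score between 0 and 100 has already been computed for each name, the triage itself could be as simple as this sketch; the threshold values are arbitrary examples of my own:

def triage(split_confidence: float, auto_threshold: float = 90, clean_threshold: float = 20) -> str:
    """Route a proposed split to pot A (automatic), B (human inspection) or C (leave as is)."""
    if split_confidence >= auto_threshold:
        return "A"  # split automatically, accepting some false positives
    if split_confidence <= clean_threshold:
        return "C"  # leave untouched, accepting some false negatives
    return "B"      # dubious: queue for human inspection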

For the listed names a suggestion for the golden single version of the truth could be:

  • “Margaret & John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
  • “Margaret Smith. John Smith” will be split into CONSUMER “Margaret Smith” and CONSUMER “John Smith”
  • “Maria Dolores St. John Smith” stays as CONSUMER “Maria Dolores St. John Smith”
  • “Johnson & Johnson Limited” stays as BUSINESS “Johnson & Johnson Limited”
  • “Johnson & Johnson Limited, John Smith” will be split into BUSINESS “Johnson & Johnson Limited” having EMPLOYEE “John Smith”
  • “Johnson Furniture Inc., Sales Dept” will be split into BUSINESS “Johnson Furniture Inc.” having DEPARTMENT “Sales Dept”
  • “Johnson, Johnson and Smith Sales Training” stays as BUSINESS “Johnson, Johnson and Smith Sales Training”

For further explanation of the Master Data Types BUSINESS, CONSUMER, DEPARTMENT, EMPLOYEE you may have a look here.


Man versus Computer

In a recent social network happening Jim Harris and Phil Simon discussed whether IT projects are like the board games Monopoly or Risk.

I notice that both these games are played with dice.

I remember back in the early 80’s I had some programming training by constructing a Yahtzee game on a computer. The following parts were at my disposal:

  • Platform: IBM 8100 minicomputer
  • Language: COBOL compiler
  • User Interface: Screen with 80 characters in 24 rows

As the user interface design options were limited, the exciting part became the one-player mode, where I had to teach (program) the computer which dice to save in a given situation – and make that logic based on patterns rather than every possible combination.

While having some other people test man versus computer in the one-player mode, I found out that I could actually construct a compact program that in the long run won more rounds than (ordinary) people did.

Now, what about games without dice? Here we know that there has been a development even around chess, where the computer is now better than any human.

So, what about data quality? Is it man or computer who is best at solving the matter? A blog post from Robert Barker called “Avoiding False Positives: Analytics or Humans?” has a take on that.

Also, seen from a time and cost perspective, the computer does have some advantages compared to humans.

But we still need humans to select what game is to be played. Throw the dice…


Settling a Match

In a recent post on this blog we went through how a process of consolidating master data could involve a match with a business directory.

Having more than a few B2B records often calls for an automated process to do that.

So, how do you do that?

Say you have a B2B record like this (Name, HouseNo, Street, City):

  • Smashing Estate, 1, Main Street, Anytown

The business directory has the following entries (ID, Name, HouseNo, Street, City):

  • 1, Smashing Estates, , Central Square, Anytown
  • 2, Smashing Holding, 1, Main Street, Anytown
  • 3, Smashing East, 1, Main Street, Anytown
  • 4, Real Consultants, 1, Main Street, Anytown

Several different forms of functionality are in use to settle the matter.

Here are some:

Exact match:

Here no candidates at all are found.

Match codes:

Say you make a match code on input and directory rows with:

  • 4 first consonants in City
  • 4 first consonants in Street
  • 4 digits with leading zeros of HouseNo
  • 4 first consonants in Name

This makes:

  • Input: NTWN-MNST-0001-SMSH
  • Directory 1: NTWN-CNTR-0000-SMSH
  • Directory 2: NTWN-MNST-0001-SMSH
  • Directory 3: NTWN-MNST-0001-SMSH
  • Directory 4: NTWN-MNST-0001-RLCN

Here directory entries 2 and 3 will be considered equal hits. You may select a random automated match or forward the case to manual inspection.

Many other and more sophisticated match code assignments exist, including phonetic match codes.
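
As a sketch of how the simple match code above could be generated – the handling of missing house numbers, the treatment of y as a vowel and the function names are my own assumptions, chosen to reproduce the examples shown:

# Minimal sketch of the match code scheme described above.
VOWELS = set("aeiouyæøå")  # y (and some local vowels) treated as vowels to match the examples

def consonant_code(text: str, length: int = 4) -> str:
    """Return the first `length` consonants of a text, upper-cased."""
    consonants = [c for c in text if c.isalpha() and c.lower() not in VOWELS]
    return "".join(consonants[:length]).upper()

def house_no_code(house_no: str, length: int = 4) -> str:
    """Return the house number as digits with leading zeros; missing numbers become zeros."""
    digits = "".join(c for c in house_no if c.isdigit())
    return digits[:length].zfill(length)

def match_code(name: str, house_no: str, street: str, city: str) -> str:
    """Combine the element codes as City-Street-HouseNo-Name."""
    return "-".join([
        consonant_code(city),
        consonant_code(street),
        house_no_code(house_no),
        consonant_code(name),
    ])

# match_code("Smashing Estate", "1", "Main Street", "Anytown")    -> "NTWN-MNST-0001-SMSH"
# match_code("Smashing Estates", "", "Central Square", "Anytown") -> "NTWN-CNTR-0000-SMSH"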

Scoring:

You may assign a similarity between each pair of elements and then calculate a total similarity score between the input and each directory row.

Often you use a percentage-like measure here, where similarity 100 is exact, 90 is close, 75 is fair, and 50 and below is far away.

(Figure: element similarities and total match scores for the input against each directory entry.)

Selecting the best match candidate with this scoring will result in directory entry 3 as the winner, given that we accept automated matches with a score of 95 (and a gap of 5 points between this and the next candidate).

Assigning similarity and calculating the total score may be (and are) implemented in many ways in different solutions.
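
A deliberately simplistic sketch of one such implementation follows; the similarity measure (here the Python standard library SequenceMatcher) and the element weights are purely illustrative assumptions:

from difflib import SequenceMatcher

def element_similarity(a: str, b: str) -> float:
    """Crude 0-100 similarity between two element values (real solutions use better measures)."""
    if not a or not b:
        return 0.0
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Illustrative weights; a real configuration would be tuned per element and per country.
WEIGHTS = {"Name": 0.4, "HouseNo": 0.1, "Street": 0.3, "City": 0.2}

def match_score(input_row: dict, candidate: dict) -> float:
    """Weighted total similarity between an input record and one directory candidate."""
    return sum(
        weight * element_similarity(input_row.get(element, ""), candidate.get(element, ""))
        for element, weight in WEIGHTS.items()
    )

# Usage (input_row and directory_rows are hypothetical dicts keyed by element name):
# best = max(directory_rows, key=lambda row: match_score(input_row, row))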

Also the selection of candidates plays a role. If you have to select from a directory with millions of rows you may use swapped match codes and other techniques like advanced searching.

Matrix:

The following example is based on a patented method by Dun & Bradstreet.

Based on element similarities as above, you assign a match grade with a character for each element:

  • A being exact or very close e.g. scores above 90
  • B being close e.g. scores between 50 and 90
  • F being no match e.g. scores below 50
  • Z being missing values

Including Name, HouseNo, Street and City this will make the following match grades:

  • Directory 1: AZFA
  • Directory 2: BAAA
  • Directory 3: BAAA
  • Directory 4: FAAA

Based on the match grade you have a priority list of combinations giving a confidence code, e.g.:

  • AAAA = 10 (High)
  • BAAA = 9
  • AZAA = 8
  • A—A = 1 (Low)

Directory entries 3 and 2 will be winners with confidence code 9, remotely challenged by entry 1 with confidence code 1. Directory entry 4 is out of the game.
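
A sketch of the grading and confidence look-up could be as simple as this; the thresholds follow the list above, while the confidence table and the low default are crude placeholders for the real, much richer priority list:

def element_grade(score: float, value_present: bool) -> str:
    """Map one element's similarity score to a match grade character."""
    if not value_present:
        return "Z"
    if score > 90:
        return "A"
    if score >= 50:
        return "B"
    return "F"

# Illustrative priority list only; the real tables behind confidence codes are far richer.
CONFIDENCE_CODES = {"AAAA": 10, "BAAA": 9, "AZAA": 8}

def confidence_code(match_grade: str) -> int:
    """Look up the confidence code for a match grade; unknown patterns get a low default."""
    return CONFIDENCE_CODES.get(match_grade, 1)

# confidence_code("BAAA") -> 9
# confidence_code("AZFA") -> 1 (falls through to the low default)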

Satisfied?

I am actually not convinced that the winner should be directory entry 3 (or 2). I think directory entry 1 could be the one, if we have to select any at all.

Adding additional elements:

While we may not have additional information in the input, we may derive more elements from the ones we have, not to mention that the business directory may hold many more useful elements, e.g. (a small sketch follows the list):

  • Geocoding may establish that there is a very short distance from “Central Square” to “1 Main Street” thus making directory 1 a better fit.
  • LOB code (e.g. SIC or NACE) may confirm that directory 2 is a holding entity, which typically (but not always) is less desirable as a match candidate.
  • Hierarchy code may tell that directory 3 is a branch entity, which typically (but not always) is less desirable as a match candidate.
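
A small sketch of how such derived elements might adjust the ranking; the bonus and penalty values, the field names and the idea of a precomputed distance are purely illustrative assumptions:

from typing import Optional

def adjusted_score(base_score: float, candidate: dict, distance_km: Optional[float]) -> float:
    """Adjust a candidate's match score using derived and directory elements."""
    score = base_score
    if distance_km is not None and distance_km < 0.5:
        score += 3   # geocoding: the two addresses are practically neighbours
    if candidate.get("is_holding"):
        score -= 5   # a holding entity is typically (but not always) less desirable
    if candidate.get("is_branch"):
        score -= 3   # the same typically goes for a branch entity
    return score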

Probabilistic learning:

Here you rely on – or supplement the deterministic approaches shown above with – results from confirmed matches involving the same elements and combinations and patterns of elements.

This topic deserves a post of its own.