Diversity in City Names

The metro area I live in is called Copenhagen – in English. The local Danish name is København. When I go across the bridge to Sweden the road signs points back at the Swedish variant of the name being Köpenhamn. When the new bridge from Germany to east Denmark is finished the road signs on the German side will point at Kopenhagen. A flight from Paris has the destination Copenhague. From Rome it is Copenaghen. The Latin name is Hafnia.

These language variants of city (and other) names is a challenge in data matching.

If a human is doing the matching the match may be done because that person knows about the language variations. This is a strength in human processing. But it is also a weakness in human processing if another person don’t know about the variations and thereby the matching will be inconsistent by not repeating the same results.

Computerized match processing may handle the challenge in different ways, including:

  • The data model may reflect the real world by having places described by multiple names in given languages.
  • Some data matching solutions use synonym listing for this challenge.
  • Probabilistic learning is another way. The computer finds a similarity between two sets of data describing an entity but with a varying place name. A human may confirm the connection and the varying place names then will be included in the next automated match.

As globalization moves forward data matching solutions has to deal with diversity in data. A solution may have made wonders yesterday with domestic data but will be useless tomorrow with international data.

Bookmark and Share

A New Year Resolution

Also for this year I have made this New Year resolution: I will try to avoid stupid mistakes that actually are easily avoidable.

Just before Christmas 2009 I made such a mistake in my professional work.

It’s not that I don’t have a lot of excuses. Sure I have.

The job was a very small assignment doing what my colleagues and I have done a lot of times before: An excel sheet with names, addresses, phone numbers and e-mails was to be cleansed for duplicates. The client had got a discount price. As usual it had to be finished very quickly.

I was very busy before Christmas – but accepted this minor trivial assignment.

When the excel sheet arrived it looked pretty straight forward. Some names of healthcare organizations and healthcare professionals working there. I processed the sheet in the Omikron Data Quality Center, scanned the result and found no false positives, made the export with suppressing merge/purge candidates and delivered back (what I thought was) a clean sheet.

But the client got back. She had found at least 3 duplicates in the not so clean sheet. Embarrassing. Because I didn’t ask her (as I use to do) a few obvious questions about what will constitute a duplicate. I have even recently blogged about the challenge that I call “the echo problem” I missed.

The problem is that many healthcare professionals have several job positions. Maybe they have a private clinic besides positions at one or several different hospitals. And for this particular purpose a given healthcare professional should only appear ones.

Now, this wasn’t a MDM project where you have to build complex hierarchy structures but one of those many downstream cleansing jobs. Yes, they exist and I predict they will continue to do in the decade beginning today. And sure, I could easily make a new process ending in a clean sheet fit for that particular purpose based on the data available.

Next time, this year, I will get the downstream data quality job done right the first time so I have more time for implementing upstream data quality prevention in state of the art MDM solutions.

Bookmark and Share

2010 predictions

Today this blog has been live for ½ year, Christmas is just around the corner in countries with Christian cultural roots and a new year – even decade – is closing in according to the Gregorian calendar.

It’s time for my 2010 predictions.

Football

Over at the Informatica blog Chris Boorman and Joe McKendrick are discussing who’s going to win next years largest sport event: The football (soccer) World Cup. I don’t think England, USA, Germany (or my team Denmark) will make it. Brazil takes a co-favorite victory – and home team South Africa will go to the semi-finals.

Climate

Brazil and South Africa also had main roles in the recent Climate Summit in my hometown Copenhagen. Despite heavy executive buy-in a very weak deal with no operational Key Performance Indicators was reached here. Money was on the table – but assigned to reactive approaches.

Our hope for avoiding climate catastrophes is now related to national responsibility and technological improvements.

Data Quality

Reactive approach, lack of enterprise wide responsibility and reliance on technological improvements are also well known circumstances in the realm of data quality.

I think we have to deal with this also next year. We have to be better at working under these conditions. That means being able to perform reactive projects faster and better while also implementing prevention upstream. Aligning people, processes and technology is a key as ever in doing that. 

Some areas where we will see improvements will in my eyes be:

  • Exploiting rich external reference data
  • International capabilities
  • Service orientation
  • Small business support
  • Human like technology

The page Data Quality 2.0 has more content on these topics.

Merry Christmas and a Happy New Year.

Bookmark and Share

Phony Phones and Real Numbers

There are plenty of data quality issues related to phone numbers in party master data. Despite that a phone number should be far less fuzzy than names and addresses I have spend lots of time having fun with these calling digits.

Challenges includes:

  • Completeness – Missing values
  • Precision – Inclusion of country codes, area codes, extensions
  • Reliability – Real world alignment, pseudo numbers: 1234.., 555…
  • Timeliness – Outdated and converted numbers
  • Conformity – Formatting of numbers
  • Uniqueness – Handling shared numbers and multiple numbers per party entity

You may work with improving phone number quality with these approaches:

Profiling:

Here you establish some basic ideas about the quality of a current population of phone numbers. You may look at:

  • Count of filled values
  • Minimum and maximum lengths
  • Represented formats – best inspected per country if international data
  • Minimum and maximum values – highlighting invalid numbers

Validation:

National number plans can be used as a basis for next level check of reliability – both in batch cleansing of a current population and for an upstream prevention with new entries. Here numbers not conforming to valid lengths and ranges can be marked.

Also you may make some classification telling about if it is a fixed net number or cell number – but boundaries are not totally clear in many cases.

In many countries a fixed net number includes an area code telling about place.

Match and enrichment:

Names and addresses related to missing and invalid phone numbers may be matched with phone books and other directories having phone numbers and thereby enriching your data and improving completeness.

Reality check:

Then you of course may call the number and confirm whether you are reaching the right person (or organization). I have though never been involved in such an activity or been called by someone only asking if I am who I am.

Data Quality and Climate Change Management

A month ago I made a blog post titled “Data Quality and climate politics”. In this post I highlighted some similarities between data governance / data quality and climate politics mainly focussing on why sometimes nothing is done.

Today, 1 day before the United Nations climate change summit commence in my hometown Copenhagen, it seems that executive buy-in has come through. Over 100 heads of states and government will attend the conference among them key stake holders as Indian prime minister Singh and US president Obama.

The plan for how to manage climate change seems at this moment to have some ingredients with similarities to how to manage data quality change.  

The bill

Related to my previous post Eugene Desyatnik commented on LinkedIn:

In both cases, everyone in their heart agrees it’s a noble cause, and sees how they can benefit — but in both cases, everyone also hopes someone else will pay for most of it.

Progress in fighting climate change seems to be closely related to that the rich countries seems to be in agreement about paying a fair share.

With enterprise data quality you also can’t rely on that one business unit will pay for solving all enterprise wide data quality issues related to common data domains. 

Key Performance Indicators

Reductions in greenhouse gas emissions are key performance indicators and goals in fighting climate change – measuring temperatures is more like looking at the final outcome.

For data quality we also knows that the business outcome is related to information in context but in order to look at improving progress we have to measure (raw) data quality at the root.  

Using technology

This article from BBC “Tackling climate change with technologypoints at a wealth of different technologies that may help fighting global warming while we still get the power we need. There is pros and cons for each. Some technologies works in some geographies but not somewhere else. Some technologies are mature now and some will be in the future. There is no silver bullet but a range of different possibilities

Very similar to data quality technology.

Santa Quality

On the 3rd of December I feel inspired to relate some data quality issues to Mr. Santa Claus – or what is exactly the name. Is it:

  • Saint Nicholas or
  • Père Noël as they say in French or
  • Weihnachtsmann as they say in German or
  • Julemand as we say in Denmark or
  • Plenty of other local names?

Santa Claus versus Saint Nicholas is an example of the use of nicknames which is a main issue in name matching in many cultures.

It’s also important to observe that the German and Danish name is one word versus two words in English and French. Many company names and other names in respective languages shares the same linguistic characteristic.

Father Christmas is an alternative identification maybe more being a job title.

Another question is where he lives.

The North Pole is acknowledged as the correct geographical address in Anglo countries – but there seems to be alternative mailing possibilities as:

  • Santa Claus, North Pole, Canada, HOH OHO
  • Father Christmas, North Pole, SAN TA1 (UK)

However the Finish claims the valid address to be:

In my home country Denmark we will accept nothing but:

  • Julemanden, Box 1615, 3900 Nuuk, Greenland

Finally I could imagine which data quality issues the Santa business has to face:

  • Too many duplicates on the “nice list” leading to heavy overhead in gift spending as well as extra costs in reindeer management.
  • Inaccurate product masters resulting in complaints from nice boys and girls and a lot of scrap and rework.
  • Fraud entries from children already on the ‘naughty list’ may be a challenge.
  • A lot of missing chimney positions may cause severe delivery problems.

But then, why should Santa be smarter than everyone else?

Bookmark and Share

What’s in an eMail Address?

When you are deduping, consolidating or doing identity resolution with party master data the elements that may be used includes names, postal addresses and places, phone numbers, national ID’s and eMail addresses.

Types of eMail addresses

In this post I will look closer into eMail addresses based on a general list of types of party master data.

You may divide eMail addresses into these types:

CONSUMER/CITIZEN:

This is a private eMail address belonging to an individual person.

Typical formats are myname@hotmail.com and nickname@gmail.com and name123@anymail.com

You may change your eMail address as a private person as time goes or have several such addresses at a time depending on your favourite providers of eMail services and other reasons to split your personality.

HOUSEHOLD:

A household/family may choose to have a shared eMail Address for private use.

Typical format will be xyz-family@anymail.com where the word family of course could be in a lot of different languages like famiglia-italiano@email.it

A special usage is the GROUP where two (or more) names are included like mary-and-john@anymail.com

EMPLOYEE:

This is the eMail address you are assigned as an employee (including owner) at a company.

Common formats are abc@company.com and name.name@company.com

When you change employer you also change eMail address and you may have several employers or other assignments at the same time. Also different formats like initials and full name may point to the same inbox.

DEPARTMENT:

Here the eMail address is not pointed at a particular person but some sort of a team within a company.

Formats are like sales@company.com and salg@firma.dk and vertrieb@firma.de choosing the sales team in some different languages.

Some eMail are referring to a specific FUNCTION like webmaster@company.com

BUSINESS:

This is an eMail address for the entire company.

Most common formats are info@company.com and company@company.com

INVALID:

Often a field designed for an eMail address is populated with invalid values going from obvious wrong values like XXX to harder detectable syntax errors and not existing domains.

Real world duplication

Many online services are based on registration via an eMail address assuming that one eMail represents one real world entity which of course is not the case.

Even on a service like LinkedIn where you may attach several eMail addresses to one profile you do encounter persons with obvious duplicate profiles.

Multi-channel marketing and sales

An increasing number of organisations are doing both offline and online operations today and when building enterprise wide master data hubs the eMail address becomes an more and more important element in matching party master data.

In such matching activities the eMail address can not stand alone but must be combined with the other elements as names, postal addresses, phone numbers and national ID’s upon availability.

Success in automating such processes is based on advanced algorithms in flexible and configurable solutions.

Comment or eMail me

If you also have been battling here I will be glad to have your comments here or by mail. My mail is hlsgr@mail.tele.dk and hls@omikron.net and hls@locus.dk and hls@dmpartner.dk and nordic@omikron.net

Bookmark and Share

55 reasons to improve data quality

The business value in data quality improvement is an ever recurring topic in the realm of data quality.

In the following I will list the first 55 reasons that comes to my mind for improving data quality related to the single most frequent data quality issue around, which is duplicates (and unresolved hierarchies) in party master data – names and addresses.

It goes like this:

1.  It’s a waste of money sending the same printed material twice or more times to the same individual consumer.

2.  Allowing the same customer enter twice or more times for an introduction offer challenges the return of investment in such campaigns.

3.  When measuring churn and win-back two or more unrelated accounts for the same business hierarchy will produce an incomplete result leading to a wrong decision.

4.  Sending the same promotion eMail twice or more times to the same individual consumer looks like spam even if different eMail addresses are used. Spam has more offending than selling power.

5.  It’s probably a waste of money sending the same printed material with presentation and offerings to a household already having a customer.

6.  Assigning different credit terms for two or more unrelated accounts for the same business hierarchy will make uncontrolled financial risk.

7.  When measuring cross selling results two or more unrelated accounts for the same household will produce an incomplete result leading to a wrong decision.

8.  When measuring life time value two or more unrelated accounts for the same individual consumer will produce a wrong result leading to a wrong decision.

9.  It’s probably a waste of money sending the same printed material twice or more times to the same household.

10.  When measuring life time value two or more unrelated accounts for the same individual being a consumer and a business owner will produce an incomplete result leading to a wrong decision.

11.  When wanting a 1-1 dialogue two or more unrelated accounts for the same individual consumer will not lead to a 1-1 dialogue.

12.  Having companies represented in two or more unrelated accounts for the same company with a different line-of-business assigned will produce an incomplete segmentation.

13.  When trying to point at your best customers being households in order to find similar households two or more unrelated accounts for the same household will produce an incomplete segmentation.

14.  When measuring cross selling results two or more unrelated accounts for the same individual consumer will produce a wrong result leading to a wrong decision.

15.  It’s a waste of money sending printed material with presentation and offerings to an individual consumer already being a customer.

16.  When wanting a 1-1 dialogue two or more unrelated accounts for the same business hierarchy will not lead to a complete 1-1 dialogue.

17.  When measuring life time value two or more unrelated accounts for the same business hierarchy will produce an incomplete result leading to a wrong decision.

18.  Assigning different credit terms for two or more unrelated accounts for the same individual consumer will increase financial risk.

19.  When measuring cross selling results two or more unrelated accounts for the same individual being a consumer and a business owner will produce only an incoherent result leading to a wrong decision.

20.  When wanting a 1-1 dialogue two or more unrelated accounts for the same household will not lead to a true 1-1 dialogue.

21.  Assigning different credit terms for two or more unrelated accounts for the same business entity could increase financial risk.

22.  Having activities related to companies attached to two or more unrelated accounts for the same company will show an incomplete customer history with the risk of taking damaging actions.

23.  It’s a waste of money and credibility sending printed material with presentation and offerings to an individual business decision maker in a business entity already being a customer.

24.  When buying from a supplier having two or more unrelated accounts despite being the same business entity you may miss discount opportunities.

25.  Having companies represented in two or more unrelated accounts for the same company with a different lead source assigned will produce a false measure of marketing and sales performance.

26.  Sending the same promotion eMail or newsletter twice or more times to the same individual business decision maker looks like spam even if different eMail addresses are used. Spam has more offending than selling power.

27.  When measuring  churn and win-back two or more unrelated accounts for the same household will produce an incomplete result leading to a wrong decision.

28.  Having activities related to influencers attached to two or more unrelated business contact records for the same person will show an incomplete business partner history with the risk of retaking already made actions.

29.  When buying from a supplier having two or more unrelated accounts despite they are belonging the same business hierarchy you could miss discount opportunities.

30.  Having activities related to households attached to two or more unrelated accounts for the same household will show an incomplete customer history with the risk of taking insufficient  actions.

31.  When trying to point at your best customers being individual consumers in order to find similar individuals two or more unrelated accounts for the same individual consumer will produce a wrong segmentation.

32.  Having companies represented in two or more unrelated accounts for the same company with a different address assigned will produce an incomplete segmentation.

33.  When measuring life time value two or more unrelated accounts for the same business entity will produce a false result leading to a wrong decision.

34.  Having activities related to decision makers in companies attached to two or more unrelated contacts for the same person will show an incomplete customer contact history with the risk of not taking appropriate actions.

35.  When wanting a 1-1 dialogue two or more unrelated accounts for the same business entity will not lead to a real 1-1 dialogue.

36.  When trying to point at your best customers being companies in order to find similar companies two or more unrelated accounts for the same company will produce a false segmentation.

37.  Maintaining data related to two or more unrelated accounts for the same real world entity will probably be more costly than necessary when exploiting external reference data.

38.  It’s probably a waste of money sending printed material with presentation and offerings to a business entity already being a customer at a higher or lower hierarchy level.

39.  Having individual consumers represented in two or more unrelated accounts for the same individual consumer with a different lead source assigned will produce a wrong measure of marketing and sales performance.

40.  Allowing the same customer re-enter for an offer already turned down (e.g. credit services) will create unnecessary double validation work.

41.  When measuring churn and win-back two or more unrelated accounts for the same business entity will produce a false result leading to a wrong decision.

42.  When wanting a 1-1 dialogue two ore more unrelated accounts for the same individual being a consumer and a business owner will not lead to a sensible 1-1 dialogue.

43.  When measuring cross selling results two or more unrelated accounts for the same business entity will produce a false result leading to a wrong decision.

44.  Having activities related to individual consumers attached to two or more unrelated accounts for the same individual consumer will show an incomplete customer history with the risk of taking wrong actions.

45.  When measuring life time value two or more unrelated accounts for the same household will produce an incomplete result leading to a wrong decision.

46.  Having activities related to customers attached to two or more unrelated accounts for the same real world entity may lead to that different sales representatives are working against each other.

47.  Allowing sales representatives creating new accounts for already existing customers may create time consuming commission disputes.

48.  Having households represented in two or more unrelated accounts for the same household with a different lead source assigned will produce an incomplete measure of marketing and sales performance.

49.  Maintaining data related to two or more unrelated accounts for the same real world entity will consume more manual work than necessary.

50.  When measuring churn and win-back two or more unrelated accounts for the same individual consumer will produce a wrong result leading to a wrong decision.

51.  When buying from a supplier having two or more unrelated accounts despite being the same business entity you may have multiple unnecessary inventory costs.

52.  It’s a waste of money and credibility sending the same printed material twice or more times to the same individual business decision maker.

53.  When measuring churn and win-back two or more unrelated accounts for the same individual being a consumer and a business owner will produce only an incoherent result leading to a wrong decision.

54.  Assigning different credit terms for two or more unrelated accounts for the same household may increase financial risk.

55.  When measuring cross selling results two or more unrelated accounts for the same business hierarchy will produce an incomplete result leading to a wrong decision.

Bookmark and Share

Ongoing Data Maintenance

Getting the right data entry at the root is important and it is agreed by most (if not all) data quality professionals that this is a superior approach opposite to doing cleansing operations downstream.

The problem hence is that most data erodes as time is passing. What was right at the time of capture will at some point in time not be right anymore.

Therefore data entry ideally must not only be a snapshot of correct information but should also include raw data elements that make the data easily maintainable.

An obvious example: If I tell you that I am 49 years old that may be just that piece of information you needed for completing a business process. But if you asked me about my birth date you will have the age information also upon a bit of calculation plus you based on that raw data will know when I turn 50 (all too soon) and your organization will know my age if we should do business again later.

Birth dates are stable personal data. Gender is pretty much too. But most other data changes over time. Names changes in many cultures in case of marriage and maybe divorce and people may change names when discovering bad numerology. People move or a street name may be changed.

There is a great deal of privacy concerns around identifying individual persons and the norms are different between countries. In Scandinavia we are used to be identified by our unique citizen ID but also here within debatable limitations. But you are offered solutions for maintaining raw data that will make valid and timely B2C information in what precision asked for when needed.

Otherwise it is broadly accepted everywhere to identify a business entity. Public sector registrations are a basic source of identifying ID’s having various uniqueness and completeness around the world. Private providers have developed proprietary ID systems like the Duns-Number from D&B. All in all such solutions are good sources for an ongoing maintenance of your B2B master data assets.

Addresses belonging to business or consumer/citizen entities – or just being addresses – are contained as external reference data covering more and more spots on the Earth. Ongoing development in open government data helps with availability and completeness and these data are often deployed in the cloud. Right now it is much about visual presenting on maps, but no doubt about that more services will follow.

Getting data right at entry and being able to maintain the real world alignment is the challenge if you don’t look at your data asset as a throw-away commodity.

Figure 1: one year old prime information

PS: If you forgot to maintain your data: Before dumping Data Cleansing might be a sustainable alternative.

Bookmark and Share