Birthday Party

Today this blog has been online for one year. It’s time for a birthday party.

The economy around a birthday party usually goes like this:

  • You, the guest, spend some money on a nice birthday present
  • I, the host, spend some money on fine food and beverage

Now, a blog is a virtual thing and I reckon that most of my readers live far, far away from the Copenhagen South Coast. So it’s going to be a remote birthday party and, as with most other things happening in the social media realm, no money is actually going to be exchanged.

Anyway, here is what I would have liked to serve in the real world:

Paella

The dish I have prepared the most times when we have guests is Spanish paella. I love paella very much, and so do all our polite guests.

Also, I am a shrimp addict, so I usually like to add two or three different kinds of shrimp, ranging from the smaller but extremely tasty Greenlandic shrimp to delicious giant Thai tiger prawns.

Steak

My second favorite meal is a steak. You probably don’t get a better steak than one from cattle grazing on the Argentinean pampas.

As I live in the Northern Hemisphere, it’s summertime now – perfect weather for preparing the steak outside on the grill.

Wine

There is so much good wine coming from many places around the world. I like Californian wine, wine from Chile, South African wine, Australian wine, French wine and, last but not least, Italian wine, including the unbeatable Amarone.

Beer

As I am a native Dane you will probably expect me to propose a Carlsberg. Don’t get me wrong: Carlsberg is probably a good beer. But there are many other good beers around. When I am in England I like the ultimate mainstream beer: a John Smith’s (now owned by Dutch Heineken). The best mainstream beer, in my opinion, is the Belgian Leffe.

Cheers

Thanks to everyone who has read this blog, subscribed, retweeted, and not least to those who have commented.

What’s In a Given Name?

I use the term “given name” here for the part of a person’s name that in most Western cultures is called a “first name”.

When working with automation of data quality, master data management and data matching, you will encounter a lot of situations where you would like to mimic what we humans do when we look at a given name. And when you have done this a few times, you also learn the risks of doing so.

Here are some of the lessons I have learned:

Gender

Most given names are either for males or for females, so most of the time you instinctively know whether it is a male or a female when you look at a name. You probably also know the given names in your culture that may be both. What often creates havoc is applying the rules of one culture to data coming from a different culture. The subject was discussed on DataQualityPro here.
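
To illustrate, here is a minimal sketch (in Python, with an invented lookup table) of why gender derivation must be keyed by culture and not just by name; the name Andrea, for instance, is typically male in Italy but female in Denmark:

    # Minimal sketch: gender lookup keyed by (culture, given name).
    # The table is invented for illustration; a real solution needs a
    # full reference table per culture.
    GENDER_BY_CULTURE = {
        ("it", "andrea"): "M",  # typically male in Italy
        ("da", "andrea"): "F",  # typically female in Denmark
        ("en", "kim"):    "U",  # may be both in English-speaking cultures
    }

    def guess_gender(culture: str, given_name: str) -> str:
        """Return 'M', 'F', 'U' (unisex) or '?' (unknown)."""
        return GENDER_BY_CULTURE.get((culture, given_name.lower()), "?")

    print(guess_gender("it", "Andrea"))  # M
    print(guess_gender("da", "Andrea"))  # F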

Salutation

In some cultures salutation is paramount – not least in Germany. A correct salutation may depend on knowing the gender, and the gender may be derived from the given name. But you should not use the given name itself in your greeting.

So writing to “Angela Merkel” will be “Sehr geehrte Frau Merkel”, which translates to “Very honored Mrs. Merkel”.

If you have a small mistake, such as the name being “Angelo Merkel”, it turns into a big mistake when you write “Sehr geehrter Herr Merkel” (Very honored Mr. Merkel) to her.
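
A small sketch of how a derived gender could drive the German salutation; the safe move for an unknown or unisex given name is a neutral greeting rather than a guess (the function name is mine, not any standard API):

    # Sketch: build a German salutation from a derived gender. Falling
    # back to the neutral greeting is safer than guessing wrong.
    def german_salutation(gender: str, surname: str) -> str:
        if gender == "F":
            return f"Sehr geehrte Frau {surname}"
        if gender == "M":
            return f"Sehr geehrter Herr {surname}"
        # Unknown or unisex: avoid the "Angelo Merkel" mistake above.
        return "Sehr geehrte Damen und Herren"

    print(german_salutation("F", "Merkel"))  # Sehr geehrte Frau Merkel
    print(german_salutation("?", "Merkel"))  # neutral fallback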

Age

In a recent post on the DataFlux Community of Experts, Jim Harris wrote about how he received tons of direct mail assuming he was retired, based on where he lives.

I have worked a bit with market segmentation and data (information) quality. I don’t know how it is with first names in the United States, but in Denmark you have a good chance of estimating an age based on a given name. The statistical bureau provides statistics for each name and birth year, so combining that with location-based demographics will get you a better response rate in direct marketing.
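
As a sketch, assuming we have name-by-birth-year counts like the ones the statistical bureau publishes (the figures below are invented), age estimation could look like this:

    # Sketch: estimate the most likely birth year for a given name from
    # per-birth-year counts such as those a statistical bureau publishes.
    # The counts below are invented for illustration.
    NAME_YEAR_COUNTS = {
        "jens": {1950: 900, 1970: 400, 1990: 150},
        "emma": {1950: 40, 1990: 800, 2000: 1200},
    }

    def likely_birth_year(given_name: str):
        counts = NAME_YEAR_COUNTS.get(given_name.lower())
        if not counts:
            return None  # unknown name: make no age assumption
        return max(counts, key=counts.get)  # year where the name peaked

    print(likely_birth_year("Jens"))  # 1950 -> probably an older segment
    print(likely_birth_year("Emma"))  # 2000 -> probably a younger segment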

Nicknames

Nicknames are used very differently in various cultures. In Denmark we don’t use them that much, and very seldom in business transactions. If you meet a Dane called Jim, his name is actually Jim. If you have a clever piece of software correcting/standardizing the name to James, well, that’s not very clever.
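
A sketch of the lesson: gate nickname expansion by culture, so Jim only becomes James where nicknames are actually in common use (the mapping and culture codes are illustrative):

    # Sketch: only expand nicknames for cultures where nickname use is
    # common. For Danish data, "Jim" is left alone because it is most
    # likely the registered given name.
    NICKNAME_MAP = {"jim": "james", "bill": "william"}
    NICKNAME_CULTURES = {"en-US", "en-GB"}  # illustrative choice

    def standardize_given_name(name: str, culture: str) -> str:
        if culture in NICKNAME_CULTURES:
            return NICKNAME_MAP.get(name.lower(), name.lower()).title()
        return name  # do not "correct" names from other cultures

    print(standardize_given_name("Jim", "en-US"))  # James
    print(standardize_given_name("Jim", "da-DK"))  # Jim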


Real World Alignment

I am currently involved in a data management program dealing with multi-entity (multi-domain) master data management described here.

Besides covering several different data domains such as business partners, products, locations and timetables, the data also serves multiple purposes of use. The client is within public transit, so the subject areas go by terms such as production planning (scheduling), operation monitoring, fare collection and use of service.

A key principle is that the same data should only be stored once, but in a way that makes it serve as high quality information in the different contexts. Doing that often means balancing between the two ways data may be of high quality:

  • Either they are fit for their intended uses
  • Or they correctly represent the real-world construct to which they refer

Some of the balancing has been:

Customer Identification

For some intended uses you don’t have to know the precise identity of a passenger; for others you must. The latter cases at my client include giving discounts based on age and transport needs, such as when attending educational activities. Knowing the identity also helps when fighting fraud. So the data governance policy (and a business rule) is that customers for most products must provide a national identification number.

Like it or not: having the ID makes a lot of things easier. Uniqueness isn’t a big challenge like in many other master data programs. It is also a straightforward process when you want to enrich your data. An example here is accurately geocoding where your customer lives, which is rather essential when you provide transportation services.
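
As a sketch, taking the Danish CPR number as the national ID: a simple format check on the DDMMYY-SSSS pattern plus the traditional modulus-11 test (note that CPR numbers issued since 2007 are no longer guaranteed to satisfy modulus 11, so that part is advisory only):

    import re

    # Sketch: validate a Danish CPR number with a format check plus the
    # traditional modulus-11 check. Numbers issued since 2007 may fail
    # modulus 11, so treat that part as advisory.
    MOD11_WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)

    def looks_like_cpr(cpr: str) -> bool:
        digits = re.sub(r"\D", "", cpr)  # drop the optional hyphen
        if len(digits) != 10:
            return False
        day, month = int(digits[:2]), int(digits[2:4])
        if not (1 <= day <= 31 and 1 <= month <= 12):
            return False
        checksum = sum(w * int(d) for w, d in zip(MOD11_WEIGHTS, digits))
        return checksum % 11 == 0  # advisory, see note above

    print(looks_like_cpr("070761-4285"))  # True (fabricated example)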

What geocode?

You may use a range of different coordinate systems to express a position, as explained here on Wikipedia. Some systems refer to a round globe (and yes, the real world, the earth, is round), but it is a lot easier to use a system like UTM, where you may easily calculate the distance between two points directly in meters, assuming the real world is as flat as your computer screen.
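
A sketch of that appeal: with two points in the same UTM zone given as easting/northing in meters, the distance is plain Pythagoras:

    import math

    # Sketch: distance between two points in the same UTM zone. Easting
    # and northing are already in meters, so plain Pythagoras will do --
    # no spherical trigonometry needed.
    def utm_distance_m(e1: float, n1: float, e2: float, n2: float) -> float:
        return math.hypot(e2 - e1, n2 - n1)

    # Two made-up points 3 km east and 4 km north apart:
    print(utm_distance_m(720_000, 6_170_000, 723_000, 6_174_000))  # 5000.0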


A Really Bad Address

Many years ago I worked in a midsize insurance company. At that time IT made a huge change in insurance pricing, since it was now possible to differentiate prices based on a lot of factors known to the databases.

The CEO decided that our company should also make some new pricing models based on where the customer lived, since it was perceived that you were more exposed to having your car stolen and your house broken into if you live in a big city as opposed to a quiet countryside home. But then the question: what exactly should the prices be, and where are the borderlines?

We, the data people, eagerly ran to the keyboard and fired up the newly purchased executive decision tool from SAS Institute. And yes, there was a different story based on postal code series, and especially downtown Copenhagen was really bad (I am from Denmark, where Copenhagen is the capital and largest city).

Curiously, we examined smaller areas in downtown Copenhagen. The result: it wasn’t the crime-exposed red light district that was bad; it was addresses in the business district that hurt the most. OK, more expensive cars and belongings there, we guessed.

Narrowing down further, we were shocked. It was the street of the company itself that was really, really bad. And finally: it was a customer having the very same house number as the company that had a lot of damage claims attached.

Investigating a bit more, the case was solved. All payments made to specialists doing damage reporting all over the country were attached to a fictitious customer at the company address.

After cleansing the data the picture wasn’t that bad. Downtown Copenhagen is worse than the countryside, but not that much worse. But surprisingly the CEO didn’t use our data; he merely adopted the pricing model from the leading competitors.

I’m still wondering how those companies did the analysis. They all had headquarters addresses in the same business area.


Returns from Investing in a Data Quality Tool

The classic data quality business case is avoiding sending promotion letters and printed materials to duplicate prospects and customers.

Even as e-commerce moves forward and more complex data quality business cases, such as those related to multi-purpose master data management, become more important, I would like to take a look at the classic business case by examining some different kinds of choices for a data quality tool.

As you may be used to all different kinds of currencies, such as EUR, USD, AUD, GBP and so on, I will use the fictitious currency SSB (Simple Stupid Bananas).

Let’s say we have a direct marketing campaign with these facts:

  • 100,000 names and addresses, half of them also with a phone number
  • Cost per mail is 3 SSB
  • Response is 4,500 orders with an average profit of 100 SSB

From investigating a sample we know that 10% of the names and addresses are duplicates with slightly different spellings.

So from these figures we know that the cost of a false negative (an actual duplicate that is not found) is 3 SSB. The savings of a true positive are then also 3 SSB.

The cost of a false positive (a found duplicate that actually isn’t a duplicate) is a possibly missed order worth: 4,500 / (100,000 * 90%) * 100 SSB = 5 SSB.
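
In code, the cost model from these campaign facts works out as follows:

    # The campaign cost model from the figures above.
    mailings = 100_000
    cost_per_mail = 3        # SSB
    orders = 4_500
    profit_per_order = 100   # SSB
    duplicate_rate = 0.10

    # A missed duplicate (false negative) costs one extra mailing, so
    # finding a true duplicate (true positive) saves the same amount.
    false_negative_cost = cost_per_mail   # 3 SSB
    true_positive_saving = cost_per_mail  # 3 SSB

    # A false positive may cost a lost order: the expected profit from
    # mailing one genuine (non-duplicate) prospect.
    genuine_recipients = mailings * (1 - duplicate_rate)  # 90,000
    false_positive_cost = orders / genuine_recipients * profit_per_order
    print(false_positive_cost)  # 5.0 SSB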

Now let’s examine 3 options for tools for finding duplicates:

A: We already have Excel

B: Buying the leader of the pack data quality tool

C: Buying an algorithm based dedupe tool

A: We already have Excel

You may first sort the 100,000 rows by address and look for duplicates this way. Say you find 2,000 duplicates. Then sort the remaining 98,000 rows by surname and look for duplicates. Say you find 1,000 duplicates. Then sort 97,000 rows by given name. Say you find 1,000 duplicates. Finally sort the 48,000 rows that have a phone number by that number. Say you find 1,000 duplicates.

If a person can look for duplicates in 1,000 rows per hour (without making false positives), we will browse a total of 343,000 sorted rows in 343 hours.

Say you hire a student for that and have the Subject Matter Expert spend 15 hours explaining, controlling and verifying the process.

Costs are:

  • 343 student hours at 15 SSB each = 5,145 SSB
  • 15 SME hours at 50 SSB each = 750 SSB

Total costs are 5,895 SSB.

Total savings are 5,000 true positives at 3 SSB each = 15,000 SSB, making a positive ROI of 9,105 SSB per campaign.

The only thing is that it will take one student more than two months (without quitting) to do the job.

B: Buying the leader of the pack data quality tool

Such a tool may have all kinds of data quality monitoring features, may be integrated smoothly with ETL functionality and so on. For data matching it may use so-called match codes. Doing that, we may expect that the tool will find 7,500 duplicates, where 7,000 are true positives and 500 are false positives.
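
To illustrate the match-code idea, here is a minimal sketch with an invented recipe (real tools use far more elaborate and configurable recipes): reduce each record to a crude key and treat records sharing a key as duplicate candidates:

    import re

    # Sketch of a match code with an invented recipe: the first three
    # letters of the first two name parts plus the postal code.
    def match_code(name: str, postal_code: str) -> str:
        parts = re.findall(r"[a-zæøå]+", name.lower())
        key = "".join(p[:3] for p in parts[:2])
        return f"{key}-{postal_code}"

    print(match_code("John Smith", "2650"))     # johsmi-2650
    print(match_code("John Smithson", "2650"))  # johsmi-2650 -> candidate pair
    print(match_code("Jon Smith", "2650"))      # jonsmi-2650 -> missed (false negative)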

Costs may be:

  • Tool license fee is 50,000 SSB
  • Training fee is 7,000 SSB
  • 80 hours external consultancy at 125 SSB each = 10,000 SSB
  • 60 IT hours for training and installation at 50 SSB each = 3,000 SSB
  • 100 SME hours for training and configuration at 50 SSB each = 5,000 SSB

Total costs are 75,000 SSB.

Savings per campaign are 7,000 * 3 SSB – 500 * 5 SSB = 18,500 SSB.

A positive ROI will show up in the 5th campaign.

C: Buying an algorithm based dedupe tool

By using algorithm-based data matching, such a tool may, depending on the threshold setting, find 9,100 duplicates, where 9,000 are true positives and 100 are false positives.
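
A sketch of the algorithm-based approach, using the standard library’s similarity ratio as a stand-in for a dedicated matching algorithm; the threshold is the knob that trades false positives against false negatives:

    from difflib import SequenceMatcher

    # Sketch: score candidate pairs with a string similarity ratio (a
    # stand-in for a dedicated matching algorithm) and declare a
    # duplicate above a tunable threshold.
    def is_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

    print(is_duplicate("Jon Smith, Main Street 1", "John Smith, Main Street 1"))  # True
    print(is_duplicate("Jon Smith, Main Street 1", "Ann Jones, Oak Road 17"))     # False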

Costs may be:

  • Tool license fee is 5,000 SSB
  • 8 hours external consultancy for a workshop at 125 SSB each = 1,000 SSB
  • 15 SME hours for training, configuration and pushing the button at 50 SSB each = 750 SSB

Total costs are 6,750 SSB.

Savings per campaign are 9,000 * 3 SSB – 100 * 5 SSB = 26,500 SSB.

A remarkable ROI will show up in the 1st campaign.
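
Pulling the three options together, a quick sketch that recomputes the figures above:

    # Recompute the per-campaign savings and break-even points for the
    # three options above.
    TP_SAVING, FP_COST = 3, 5  # SSB per true / false positive

    options = {
        "A: Excel":              {"cost": 5_895,  "tp": 5_000, "fp": 0},
        "B: Leader of the pack": {"cost": 75_000, "tp": 7_000, "fp": 500},
        "C: Algorithm dedupe":   {"cost": 6_750,  "tp": 9_000, "fp": 100},
    }

    for name, o in options.items():
        saving = o["tp"] * TP_SAVING - o["fp"] * FP_COST
        breakeven = -(-o["cost"] // saving)  # ceiling division
        print(f"{name}: {saving} SSB per campaign, positive ROI in campaign {breakeven}")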


The Slurry Project

When cleansing party master data it is often necessary to typify the records in order to settle whether a record represents a business entity, a private consumer, a department (or project) in a business, an employee at a business, a household, or some kind of dirt, test, comic name or otherwise illegible name and address.

Once I made such a cleansing job for a client in the farming sector. When I browsed the result looking for false positives in the illegible group, this name showed up:

  • The Slurry Project (in Danish: Gylleprojektet)

Now, normally it could be that someone gave a really shitty project a bad name or provided dirty data for whatever reason. But in the context of the farming sector it is a perfectly good name for a project dealing with better exploitation of slurry in growing crops.

It is a good example of the need to be able to adjust the bad word lists according to the context when cleansing data.
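
A sketch of that capability: keep the bad word list adjustable per client context, so that “slurry” flags a record as dirt for most clients but not for one in the farming sector:

    # Sketch: a bad word list that can be adjusted per client context.
    DEFAULT_BAD_WORDS = {"test", "dummy", "slurry"}

    def is_illegible(name: str, context_whitelist=frozenset()) -> bool:
        flagged = DEFAULT_BAD_WORDS - set(context_whitelist)
        return any(bad in name.lower() for bad in flagged)

    print(is_illegible("The Slurry Project"))              # True: flagged as dirt
    print(is_illegible("The Slurry Project", {"slurry"}))  # False: fine in farming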


Picture This

How do people find their way to your blog? I use Twitter and LinkedIn to say: hey, I made a new post. And then I pretty much rely on people finding my blog when searching with terms like:

  • Data Quality
  • Master Data Survivorship
  • Fit for purpose

But honestly, the search terms that hit my blog many times more often than the above are the little texts I add to the images I put on every post. And I am pretty sure that those people were not looking for data quality and master data management.

The top term is pearls, including the same word in Russian (жемчуг), Turkish (inci) and Arabic (لآلئ). This word was the text on the image in the post “Universal Pearls of Wisdom”, where I wrote about the new SOA manifesto and how this manifesto might as well be about data quality and a lot of other disciplines and concepts. Probably not very interesting for someone trying to buy pearls. But maybe one or two of the 2,000+ pearl fishers were captured in the data quality net.

The second most used term is gorilla. This was used as the text for the image in the post “Gorilla Data Quality”. Personally I like this gorilla picture, and it seems that approximately 1,600 other people do too. Whether they also like the philosophic ideas around “Gorilla Data Quality” and “Guerilla Data Quality” I am not so sure.

Other terms hitting big are Brueghel and Tower of Babel, used in a post about international challenges in data quality called “The Tower of Babel”, as it was illustrated by a painting by Brueghel. Also Penny Black, used in a post about “Postal Address Hierarchy, Granularity, Precision and History”, raised the pageview counter.

But it doesn’t seem that every little common word will do. Once I used the word traffic, but it didn’t generate any traffic at all.


Algorithm Envy

The term “algorithm envy” was used by Aaron Zornes in his piece on MDM trends when talking about identity resolution.

In my experience there is surely a need for good data matching algorithms.

As I have built a data matching tool myself, I faced that need back in 2005. At that time my tool was merely based on some standardization and parsing, match codes, some probabilistic learning and a few lightweight algorithms like the Hamming distance (more descriptions of these techniques here).
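
For reference, the Hamming distance simply counts the positions where two equal-length strings differ, which is what makes it lightweight and also what limits it once an insertion shifts everything:

    # Minimal Hamming distance: count differing positions in two
    # equal-length strings. Cheap, but a single insertion shifts every
    # later character, which limits its use in name matching.
    def hamming(a: str, b: str) -> int:
        if len(a) != len(b):
            raise ValueError("Hamming distance needs equal-length strings")
        return sum(c1 != c2 for c1, c2 in zip(a, b))

    print(hamming("jensen", "jansen"))  # 1 -> a near match
    print(hamming("jensen", "jensne"))  # 2 -> a transposition already costs more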

My tool was pretty national (like many other matching tools), as it was tuned for handling Danish names and addresses as well as Swedish, Norwegian, Finnish and German addresses, which are very similar.

The task ahead was to expand the match tool so it could be used to match business-to-business records with the D&B Worldbase. This database has business entities from all over the world. The names and addresses in there are only standardized to the extent provided by the public sector or other providers in each country.

The records to be matched came from Nordic companies operating globally. For such records you can’t assume that they were entered by people who know the name and address format of the country in question. So, all in all, standardization and parsing wasn’t the full solution. If you don’t trust me, there is more explanation here.

When dealing with international data, match codes become either too complex or too imprecise. This is also due to the lack of standardization in both of the records being compared.

For the probabilistic learning, my problem was that all the learned data until then had been gathered from Nordic data. It wouldn’t be any good for the rest of the world.

The solution was including an advanced data matching algorithm, in this case Omikron FACT.

Since then the Omikron FACT algorithm has been considerably improved and is now branded as WorldMatch®. Some of the new advantages are dealing with different character sets and script systems and having synonyms embedded directly in the matching logic, which is far superior to using synonyms in a prior standardization process.
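
To illustrate the difference (a sketch of the principle only, not the actual WorldMatch logic): when the comparison itself treats synonym tokens as equal, the source data never has to be rewritten in a prior standardization step:

    # Sketch of synonym-aware comparison (the principle only, not the
    # actual WorldMatch logic): the match step itself treats synonym
    # tokens as equal, so the source data stays untouched.
    SYNONYMS = {
        frozenset({"ltd", "limited"}),
        frozenset({"st", "street"}),
    }

    def tokens_equal(t1: str, t2: str) -> bool:
        return t1 == t2 or frozenset({t1, t2}) in SYNONYMS

    def names_match(a: str, b: str) -> bool:
        ta, tb = a.lower().split(), b.lower().split()
        return len(ta) == len(tb) and all(map(tokens_equal, ta, tb))

    print(names_match("Acme Ltd", "Acme Limited"))  # True, data left as entered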

For full disclosure: I work for the vendor Omikron Data Quality today. But I am not praising the product because of that – I work for Omikron because of the product.


Citizen ID within seconds

Here is a picture of my grandson Jonas taken minutes after he was born. He has a ribbon around his wrist showing his citizen ID, which had just been assigned. There is even a barcode with it on the ribbon.

Now, I have mixed feelings about that. It is indeed very impersonal. But as a data quality professional I do realize that this is a way of solving a problem at the root. Duplicate master data in healthcare is a serious problem, as Dylan Jones reported in this article from DataQualityPro last year, when he had a son.

A unique citizen ID (national identification number) assigned within seconds of birth has a lot of advantages. As said, it is a foundation for data quality in healthcare from the very start of a life. Later, when you get your first job, you hand the citizen ID to your employer and tax is collected automatically. When the rest of the money is in the bank, you are uniquely identified there. When you turn 18 you are seamlessly put on the electoral roll. Later, your marriage is merely a relation in a government database between your citizen ID and the citizen ID of your beloved one.

Oh joy, Master Data Management at the very best.

