Liliendahl on Data Quality

Military Intelligence

2nd September 2010 | Henrik Gabs Liliendahl

Many data quality issues may be prevented by having some intelligent (error tolerant) search going on. I wrote a post about it called Upstream prevention by error tolerant search.

Intelligent search may have a lot of other advantages too.

A scandal related to the Danish military has been going on for a while. The short story is:

A member of the Special Forces wrote a book about combat actions in Afghanistan. The military tried to stop it, because it could help the enemy. In that process they for some reason made an Arabic translation and by mistake leaked it to the press. The key person in the military behind this has the surname “Sønderskov”.

Police “experts” were assigned to find the leak. For a month they searched unsuccessfully for an e-mail address including “Sønderskov”, only to realize: Oh, e-mail addresses can’t have the national character “ø”. It must be either “oe” or “o” instead, as in “Soenderskov” or “Sonderskov”.
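A minimal sketch of what an error tolerant search could look like here, assuming Python. The function names, the sample address and the “æ” and “å” rows of the table are hypothetical illustrations; only the “ø” to “oe” or “o” replacement comes from the story.

```python
from itertools import product

# Assumed transliteration table. Only the "ø" row comes from the story;
# the "æ" and "å" rows are added for illustration.
ASCII_VARIANTS = {
    "ø": ("oe", "o"),
    "æ": ("ae",),
    "å": ("aa", "a"),
}


def ascii_variants(name: str) -> set[str]:
    """Return every ASCII spelling of a name under the table above."""
    options = [ASCII_VARIANTS.get(ch, (ch,)) for ch in name.lower()]
    return {"".join(parts) for parts in product(*options)}


def address_mentions(address: str, surname: str) -> bool:
    """True if the part before '@' contains any ASCII variant of the surname."""
    local_part = address.lower().split("@", 1)[0]
    return any(variant in local_part for variant in ascii_variants(surname))


# Hypothetical address, not taken from the case.
print(address_mentions("x.soenderskov@example.dk", "Sønderskov"))
```

Searching for all variants at once costs next to nothing compared to a month of searching for a spelling that cannot exist in an e-mail address.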

The story (in Danish) is here, from the online computer media outlet Version2.

Game, Set, Match

1st September 2010 | 8th January 2011 | Henrik Gabs Liliendahl

Tennis is one of the sports I practiced a lot when I was young and still like to play when possible.

As a consequence I guess I also like to follow world class tennis, not least now that we finally have a Dane competing for the big titles. I’m thinking about Caroline Wozniacki, who is seeded number one in the ongoing US Open Grand Slam tournament.

So, as an excuse to write a blog post about it I have come up with these connections between Caroline and Data Matching.

The name:

Wozniacki isn’t exactly a Nordic name, as she is the daughter of native-born Polish parents. In fact, if the Polish naming practice were followed her surname would be Wozniacka, the female form of the name. But as practiced in Western countries she has inherited a genderless family name. Good for matching.

The bet:

Betting on sports events is like scoring in data matching. You are not 100 % sure but rely on probability. Odds for Caroline winning the US Open opening round matches are 1.01 and 1.02 = 98 – 99 % certainty = pretty sure. But odds get higher as the tournament proceeds to the final rounds, where it can go either way.
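The arithmetic behind the odds can be sketched as below, assuming decimal odds and Python. The threshold values and the `classify` helper are hypothetical, picked only to show how a probability-like score is typically sorted into buckets in data matching.

```python
def implied_probability(decimal_odds: float) -> float:
    """Convert decimal betting odds to the probability they imply.

    Ignores the bookmaker's margin, so it slightly overstates the real chance.
    """
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds must be greater than 1.0")
    return 1.0 / decimal_odds


# Assumed thresholds, for illustration only.
AUTO_MATCH = 0.95
MANUAL_REVIEW = 0.70


def classify(score: float) -> str:
    """Sort a match score into three buckets: match, review, no match."""
    if score >= AUTO_MATCH:
        return "match"
    if score >= MANUAL_REVIEW:
        return "review"
    return "no match"


for odds in (1.01, 1.02):
    p = implied_probability(odds)
    print(f"odds {odds}: {p:.1%} -> {classify(p)}")
```

Odds of 1.01 and 1.02 give roughly 99 % and 98 %, which under these assumed thresholds would be accepted without manual review.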

Out of Facebook

1st September 2010 | 5th September 2010 | Henrik Gabs Liliendahl | 7 Comments

A while ago it was announced that Facebook had signed up member number 500,000,000.

If you are working with customer data management you will know that this doesn’t mean that 500,000,000 distinct individuals are using Facebook. Like any customer table the Facebook member table will suffer from a number of different data quality issues like:

Some individuals are signed up more than once using different profiles.
Some profiles are not an individual person, but a company or other form of establishment.
Some individuals who created a profile are not among us anymore.

Nevertheless the Facebook member table is a formidable collection of external reference data representing the real world objects that many companies are trying to master when doing business-2-consumer activities.

For companies doing business-2-business activities, a similar representation of real world objects will be the 70,000,000+ profiles on LinkedIn plus profiles in other social business networks around the world, which may act as external reference data for the business contacts in master data hubs, CRM systems and so on.

Customer Master Data sources will expand to embrace:

Traditional data entry from field work like a sales representative entering prospect and customer master data as part of Sales Force Automation.
Data feed and data integration with traditional external reference data like using a business directory. Such integration will increasingly take place in the cloud and the trend of governments releasing public sector data will add tremendously to this activity.
Self registration by prospects and customers via webforms.
Social media master data captured during social CRM and probably harvested in more and more structured ways as a new wave of exploiting external reference data.

Doing “Social Master Data Management” will become an integrated part of customer master data management offering both opportunities for approaching a “single version of the truth” and some challenges in doing so.

Of course privacy is a big issue. Norms vary between countries, and so do the legal rules. Norms vary between individuals, and for the same individual as a private person and as a business contact. Norms vary between industries and from company to company.

But the fact that 500,000,000 profiles have been created on Facebook in very few years by people from all over the world shows that people are willing to share and that much information can be collected in the cloud. However, no one wants to be spammed as a result of sharing, and indeed there have been some controversies around how data in Facebook is handled.

Anyway, I have no doubt that we will see fewer data entry clerks entering the same information in each company’s separate customer tables and that we will increasingly share our own master data attributes in the cloud.

Out-of-Africa

30th August 2010 | 27th March 2012 | Henrik Gabs Liliendahl | 4 Comments

Besides being a memoir by Karen Blixen (or the literary double Isak Dinesen), Out-of-Africa is a hypothesis about the origin of the modern human (Homo sapiens). Of course there is a competing scientific hypothesis called Multiregional Origin of Modern Humans. Besides that there are of course religious beliefs.

The Out-of-Africa hypothesis suggests that modern humans emerged in Africa 150,000 years ago or so. A small group migrated to Eurasia about 60,000 years ago. Some made it across the Bering Strait to America maybe 40,000 years ago or maybe 15,000 years ago. The Vikings said hello to the Native Americans 1,000 years ago, but cross-Atlantic movement only gained pace some 500 years ago, when Columbus discovered America again.

Half a year ago (or so) I wrote a blog post called Create Table Homo_Sapiens. The follow-up comments added to the nerdish angle by discussing subjects such as mutating tables versus intelligent design and MAX(GEEK) counting.

But on the serious side comments also touched the intended subject about making data models reflect real world individuals.

Tables with persons are the most common entity type in databases around. As in the Out-of-Africa hypothesis, they could all have had a simple, common, global structural origin. But that is not the way of the world. Some of the basic differences practiced in modeling the person entity are:

Cultural diversity: Names, addresses, national IDs and other basic attributes are formatted differently country by country and to some degree within countries. Most data models with a person entity are built on the format(s) of the country where they are designed.
Intended purpose of use: Person master data are often stored in tables made for specific purposes like a customer table, a subscriber table, a contact table and so on. Therefore the data identifying the individual is directly linked with attributes describing a specific role of that individual.
“Impersonal” use: Person data is often stored in the same table as other party master types such as business entities, projects, households et cetera.
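One way to avoid tying identity to a single purpose of use is to keep the individual and the roles apart. The sketch below assumes Python dataclasses; all class names, field names and sample values are hypothetical and only illustrate the idea, not a recommended schema.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    """Identifies the real world individual, independent of any purpose."""
    person_id: int
    given_name: str
    family_name: str
    country_code: str  # can drive country specific formatting, e.g. "DK"


@dataclass
class Role:
    """One purpose of use, linked to a person instead of duplicating them."""
    person_id: int
    role_type: str  # e.g. "customer", "subscriber", "contact"
    attributes: dict[str, str] = field(default_factory=dict)


def roles_of(person: Person, roles: list[Role]) -> list[str]:
    """List the role types held by one person, sorted alphabetically."""
    return sorted(r.role_type for r in roles if r.person_id == person.person_id)


# Sample data, made up for illustration.
karen = Person(1, "Karen", "Blixen", "DK")
roles = [Role(1, "subscriber"), Role(1, "customer"), Role(2, "contact")]
print(roles_of(karen, roles))
```

With this split the same individual appears once, however many customer, subscriber or contact roles are attached.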

Many, many data quality struggles around the world are caused by how we have modeled real world – old world and new world – individuals.

Follow Friday Data Quality

28th August 2010 | 7th September 2011 | Henrik Gabs Liliendahl | 7 Comments

Every Friday on Twitter people are recommending other tweeps to follow using the #FollowFriday (or simply #FF) hash tag.

My username on Twitter is @hlsdk.

Sometimes I notice tweeps I follow are recommending the username @hldsk or @hsldk or other usernames with my five letters swapped.

Could it be that they meant me, but misspelled the username? Or did they mean someone else with a username close to mine?
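Checking whether two usernames differ only by swapped letters is easy to sketch. The Python below is an illustration with hypothetical function names; it tests for the same letters in any order and for the narrower case of exactly one swap of neighbouring letters.

```python
def same_letters(a: str, b: str) -> bool:
    """True if both names use exactly the same letters, in any order."""
    return sorted(a.lower()) == sorted(b.lower())


def is_single_transposition(a: str, b: str) -> bool:
    """True if b is a with exactly one pair of neighbouring letters swapped."""
    a, b = a.lower(), b.lower()
    if len(a) != len(b):
        return False
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and a[diffs[0]] == b[diffs[1]]
        and a[diffs[1]] == b[diffs[0]]
    )


for candidate in ("hldsk", "hsldk"):
    print(candidate, is_single_transposition("hlsdk", candidate))
```

Both @hldsk and @hsldk are one neighbouring swap away from @hlsdk, which is the kind of typing slip a matching routine would usually treat as a likely duplicate.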

As the other usernames weren’t taken, I have taken the liberty of creating some duplicate (shame on me) profiles and having a bit of (nerdish) fun with it:

@hsldk

For this profile I have chosen an image of the Swedish Chef from The Muppet Show. To make the Swedish connection real, the location on the profile is set as “Oresund Region”, which is the binational metropolitan area around the Danish capital Copenhagen and the 3rd largest Swedish city Malmoe, as explained in the post The Perfect Wrong Answer.

@hldsk

For this profile I have chosen an image of a gorilla originally used in the post Gorilla Data Quality.

This Friday @hldsk was recommended thrice.

But I think only by two real life individuals: Joanne Wright from Vee Media and Phil Simon who also tweets as his new (one-man-band I guess) publishing company.

What’s the point?

Well, one of my main activities in business is hunting duplicates in party master databases.

What I sometimes find is that duplicates (several rows representing the same real world entity) have been entered for a good reason in order to fulfill the immediate purpose of use.

The thing with Phil and his one-man-band company is explained further in the post So, What About SOHO Homes.

By the way, Phil is going to publish a book called The New Small. It’s about: How a New Breed of Small Businesses is Harnessing the Power of Emerging Technologies.

360° Share of Wallet View

26th August 2010 | 23rd February 2011 | Henrik Gabs Liliendahl | 4 Comments

I have found this definition of Share of Wallet on Wikipedia:

Share of Wallet is the percentage (“share”) of a customer’s expenses (“of wallet”) for a product that goes to the firm selling the product. Different firms fight over the share they have of a customer’s wallet, all trying to get as much as possible. Typically, these different firms don’t sell the same but rather ancillary or complementary product.

Measuring your share of given wallets – and your performance in increasing it – is a multi-domain master data management exercise as you have to master both a 360° view of customers and a 360° view of products.
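The definition above boils down to one division. Here is a minimal sketch in Python; the function name and the sample figures are hypothetical, and the hard part in practice is obtaining a trustworthy figure for the customer's total spend.

```python
def share_of_wallet(own_sales: float, total_spend: float) -> float:
    """Percentage of a customer's spend on a product category that goes to you.

    own_sales:   what the customer bought from your firm in the period
    total_spend: what the customer spent on the category with all firms
    """
    if total_spend <= 0:
        raise ValueError("total spend must be positive")
    if not 0 <= own_sales <= total_spend:
        raise ValueError("own sales must lie between 0 and total spend")
    return 100.0 * own_sales / total_spend


# Made-up figures: the customer spends 1,000 in the category, 250 of it with you.
print(share_of_wallet(250.0, 1000.0))
```

The numerator comes from your own consolidated customer and product master data, while the denominator usually has to come from external data, which is why both 360° views are needed.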

With customer master data you are forced to handle uniqueness (consolidate duplicates) of customers and handle hierarchies of customers, which is further explained in the post 360° Business Partner View.

With product master data you are not only forced to categorize your own products and handle hierarchies within them, but you also need to adapt to external categorizations in order to get access to external data on spending, probably available at a high level for a segment of customers but sometimes even down to the single customer.

Location master data may be important here for geographical segmentations and identification.

My educated guess is that companies will increasingly rely on having better data quality and master data management processes and infrastructure in order to measure precise shares of wallets and thereby gain advantages in a stiff competition rather than relying on gut feelings and best guesses.

Linked Data Quality

24th August 2010 | 20th October 2010 | Henrik Gabs Liliendahl | 4 Comments

The concept of linked data within the semantic web is in my eyes a huge opportunity for getting data and information quality improvement done.

The premises for that are described on the page Data Quality 3.0.

Until now data quality has been largely defined as: Fit for purpose of use.

The problem however is that most data – not least master data – have multiple uses.

My thesis is that there is a breakeven point when including more and more purposes where it will be less cumbersome to reflect the real world object rather than trying to align fitness for all known purposes.

If we look at the different types of master data and the possibilities that may arise from linked data, this is what initially comes to my mind:

Location master data

Location data is among the data types that have already been used the most on the web. Linking a hotel, a company, a house for sale and so on to a map is an immediate visual feature appealing to most people. Many databases around however have poor location data, for example inadequate postal addresses. The demand for making these data “mappable” will become nearly unavoidable, but fortunately the services for doing so with linked data will help.

Hopefully increased open government data will help solve the data supply issue here.

Party master data

Linking party master data to external data sources is not new at all, but unfortunately not as widespread as it could be. The main obstacle until now has been smooth integration into business processes.

Having linked data describing real world entities on the web will make this game a whole lot easier.

Actually I’m working on implementations in this field right now.

Product master data

Traditionally the external data sources available for describing product master data have been few – and hard to find. But surely, a lot of data is already out there waiting to be found, categorized, matched and linked.

Data Quality Is Like Parenting

22nd August 2010 | 5th January 2011 | Henrik Gabs Liliendahl | 2 Comments

Thinking about it: Data Quality has a lot of similarities with parenting.

Some equivalences that come to my mind are:

Parenting must be done by everyone who has children; you are not supposed to have an education in education before becoming a parent. The same goes for data. You are not supposed to be a data quality expert before working with data; some common sense will bring you a long way.
Some parenting experts never had their own children. I have seen the same with data quality experts too.
Many people are more knowledgeable about how other people should raise children than about raising their own children. Same same with data quality.
While we internally in the family may have some noise when parenting, we keep that internal and keep up appearances to the outside. I think everyone has seen the same with data quality.
There may be different styles in parenting going from “because I said so” to talking about it. The same is true around data quality improvement efforts.
We do see more and more regulation around parenting; in my country it is now forbidden to slap your kids. I think it should be forbidden to slap your naughty data too.

Same Same But Different

21st August 2010 | 15th December 2010 | Henrik Gabs Liliendahl | 7 Comments

The two most common master data types are:

Party master data (customers, prospects, suppliers and other business partners)
Product master data

When working with data quality within master data management you may of course encounter some similarities between these two master data types, but you will certainly also meet a range of differences.

The basic activities such as standardization, consolidation and hierarchy building are the same.

Some of the differences I have learned are:

Multi-cultural issues:

Party master data is often stored in a single global format but should be transformed to embrace multi-cultural diversities.
Product master data may have multi-cultural issues but should be transformed into a single global format (of course embracing multi-language hierarchies and so on).

External reference data available:

For party master data the possibilities for real world alignment with external data sources are plenty.
For product master data the possibilities for real world alignment with external data sources are few.

Industry specific requirements:

Requirements for party master data quality are pretty much the same across industries, with a few variations; B2B (corporate customers), B2C (private customers) or both is the most prominent one.
Requirements for product master data quality vary tremendously across different industries.

Your say:

What are your examples of (similarities and) differences between party master data quality and product master data quality?

What are they doing?

19th August 2010 | Henrik Gabs Liliendahl | 12 Comments

A core attribute in customer master data when dealing with business entities is the industry vertical of your customers/prospects (or Line-of-Business or market segment or whatever metadata name you like).

When handling this particular data element you will come across many of the classic different options in data and information management.

Unstructured versus structured

Many early CRM (Customer Relationship Management) implementations offered a free text field for the industry vertical. While this approach may have been good for the free flow in data entry, it has of course created havoc when business intelligence was applied to the CRM data. Countless cleansing projects have been done (and are going on) in order to fix this basic mistake.

Most data entry forms today that have an industry vertical field offer a value list to choose from.

Your list versus an external standard

When having a value list it may be a list of your own creation or be based on an external standard list, for example SIC or NACE codes.

Having a list of your own tends to fulfill the data quality principle of fit for purpose of use while an external standard tends to fulfill the data quality principle of reflecting the real world construct.

The main weaknesses of a list of your own are that it requires continuous manual maintenance and may cause conflicts. Deep down in a discussion on the Initiate MDM blog, Julian Schwarzenbach offered a good example, saying:

“I have also come across ‘flip-flop’ data – which is typically subjective data where two users cannot agree what the correct value is and it keeps getting changed between two values. This could be the classification of a customer by market sector where two different territories are reflecting different capabilities in their territories.” – Link here.
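Flip-flop data of this kind can be spotted in a change history. The sketch below assumes Python and a plain list of historical values for one attribute of one customer; the function name, the `min_changes` threshold and the sample values are hypothetical.

```python
def is_flip_flop(history: list[str], min_changes: int = 3) -> bool:
    """True if a value keeps switching back and forth between exactly two values."""
    # Collapse consecutive repeats so only real changes remain.
    changes = [v for i, v in enumerate(history) if i == 0 or v != history[i - 1]]
    # With only two distinct values left, every change is a switch back or forth.
    return len(set(changes)) == 2 and len(changes) - 1 >= min_changes


# Made-up market sector history for one customer.
print(is_flip_flop(["Retail", "Wholesale", "Retail", "Retail", "Wholesale"]))
```

A record flagged this way is probably not a data entry problem but a sign that two users, or two territories, disagree about the definition.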

The main weaknesses of an external standard are that it seldom offers the granularity you need and that, for global data, the different standards (SIC versions, different national NACE implementations and others) are a pain in the…

One versus several values

Many companies have more than one distinct activity. Catching only one (the primary) value for each company is keeping it simple, stupid. Having more than one value in relevant cases is adding complexity but may lead to better decisions.
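Holding one primary value plus optional further values could be modeled as sketched below, assuming Python dataclasses. The class name, field names and the industry labels are hypothetical; real implementations would typically store codes from a value list rather than free text labels.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessEntity:
    name: str
    primary_industry: str  # the single "keep it simple" value
    secondary_industries: list[str] = field(default_factory=list)

    def industries(self) -> list[str]:
        """All industry values, primary first, without duplicates."""
        seen: list[str] = []
        for value in [self.primary_industry, *self.secondary_industries]:
            if value not in seen:
                seen.append(value)
        return seen

    def is_in(self, industry: str) -> bool:
        """True if the entity has this industry as primary or secondary value."""
        return industry in self.industries()


# Made-up company with two distinct activities.
company = BusinessEntity("Example Ltd", "wholesale", ["retail", "wholesale"])
print(company.industries())
```

Reports that need simplicity can keep using `primary_industry`, while segmentation that benefits from the extra detail can ask `is_in`.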
