Liliendahl on Data Quality

The Toyota Way

3rd November 2010, Henrik Gabs Liliendahl, 2 Comments

Yesterday I visited a Toyota branch office.

While waiting in the unmanned reception (a result of removing waste, known as muda in Japanese, I guess) I had the chance to study the five posters hanging there with the main principles in The Toyota Way:

Challenge: Form a long-term vision and meet challenges with courage and creativity.

Kaizen (continuous improvement): Improve business operations continuously, always driving for innovation and evolution.

Genchi Genbutsu (go and see): Go to the source to find the facts to make correct decisions, build consensus and achieve goals at best speed.

Respect: Respect others. Make every effort to understand each other, take responsibility and do your best to build mutual trust.

Teamwork: Stimulate personal and professional growth, share the opportunities of development and maximize individual and team performance.

What a great way to prepare for a meeting about data quality improvement.

Is a Small Difference a Big Deal?

2nd November 2010 (updated 3rd November 2010), Henrik Gabs Liliendahl

The title of this blog post is stolen from/was inspired by a post on the Nation of Why Not blog. The Nation of Why Not is the branded name of Royal Caribbean. Among many other vessels, Royal Caribbean operates the world’s two largest cruise ships: ‘Oasis of the Seas’ and ‘Allure of the Seas’. The newer ship, ‘Allure of the Seas’, has just left the shipyard in Turku, Finland and passed under the Great Belt Bridge in grey Danish waters on the way to the blue Caribbean Sea.

The Oasis and Allure are sister ships that are supposed to have exactly the same dimensions. But according to the official measurements by DNV, Allure is 50 millimeters longer than Oasis. This has led to some teasing between the crews, and now it has been suggested that NASA should make a new measurement (from up above I guess).

This is a good old classic data quality issue. Is it acceptable to assume that two similar things have the same attributes? Or do you need to measure each thing separately? And if there is a difference, is it a difference in the real world or a difference in measurement?

Now, with the ships I think they are a bit different anyway, as I see that the new ship Allure, unlike Oasis, also has a Samba Grill, Rita’s Cantina and a Starbucks café inside.

The Magic Numbers

31st October 2010 (updated 14th July 2011), Henrik Gabs Liliendahl, 5 Comments

An often raised question and a subject for a lot of blog posts in the data quality realm is whether data quality challenges should be solved by people or technology.

As in all things data quality I don’t think there is a single right answer for that.

Now, in this blog post I will not say what I think the answer(s) to the question should be, but simply describe what I have seen chosen in practice, which has been both people centric solutions and technology centric solutions.

If I look at the situations where people centric solutions have been chosen versus the situations where technology centric solutions have been chosen, the first differentiator seems to be numbers:

If you have only a small number of customers and a single channel where they are entered, the better solution to optimal data quality and uniqueness seems to be a people centric solution.
If you have millions of customers and multiple channels where they are entered, the only practical solution to optimal data quality and uniqueness seems to be a technology centric solution.
If you have only a small number of products and a single channel where they are entered, the only sensible solution to optimal data quality and uniqueness seems to be a people centric solution.
If you have thousands of products coming from multiple channels, the most reliable solution to optimal data quality and uniqueness seems to be a technology centric solution.

So, based on common sense the answer to the people or technology question is that it magically depends on the numbers.

Legal Forms from Hell

27th October 2010 (updated 9th October 2014), Henrik Gabs Liliendahl, 2 Comments

When doing data matching with company names, a basic challenge is that a proper company name in most cultures, in most cases, has two elements:

The actual company name
The legal form

Some worldwide examples:

Informatica Corporation
Talend SA
SAP Deutschland AG & Co. KG
Sony Kabushiki Kaisha
LEGO A/S

There are hundreds of different legal forms, each in full and abbreviated versions. Wikipedia has a list (there called types of business entity).

However, when typing in company names in databases the legal form is often omitted. And even where legal forms are present they may be represented differently in full or abbreviated forms, with varying spelling and punctuation and so on. As the actual company names also suffer from this fuzziness, the complexity is overwhelming.

A common way of handling this issue in data matching is to separate out the legal form and then focus on comparing the remaining part, the actual company name. This has to be done per country, or else you may remove the entire name of a company, as with an Italian company called Société Anonyme, which is a French legal form.
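A minimal sketch of that country-specific separation might look like the following. The `LEGAL_FORMS` table and the `split_legal_form` function are hypothetical names made up for illustration, and the handful of legal forms listed is far from complete; a real implementation would need a much larger reference list per country and more tolerant handling of punctuation.

```python
# Hypothetical, deliberately tiny table of legal forms keyed by ISO country code.
# These entries are illustrative assumptions, not a complete reference list.
LEGAL_FORMS = {
    "US": ["corporation", "corp", "inc", "llc"],
    "FR": ["société anonyme", "sa", "sarl"],
    "DE": ["ag & co. kg", "gmbh", "ag", "kg"],
    "JP": ["kabushiki kaisha", "kk"],
    "DK": ["a/s", "aps"],
    "IT": ["s.p.a.", "spa", "s.r.l.", "srl"],
}


def split_legal_form(name, country):
    """Return (core_name, legal_form) using only the legal forms of `country`.

    legal_form is None when no trailing legal form of that country is found.
    """
    cleaned = " ".join(name.split())  # collapse repeated whitespace
    lowered = cleaned.lower()
    # Try the longest forms first so "ag & co. kg" wins over plain "kg".
    for form in sorted(LEGAL_FORMS.get(country, []), key=len, reverse=True):
        # The name must end with the form as a separate word, or consist of
        # nothing but the form (the case that wipes out the whole name).
        if lowered == form or lowered.endswith(" " + form):
            return cleaned[: -len(form)].rstrip(), cleaned[-len(form):]
    return cleaned, None


print(split_legal_form("SAP Deutschland AG & Co. KG", "DE"))
print(split_legal_form("LEGO A/S", "DK"))
print(split_legal_form("Société Anonyme", "IT"))  # kept whole: not an Italian form
print(split_legal_form("Société Anonyme", "FR"))  # core name is emptied
```

The last two calls show why the country matters: with the Italian list the company name survives untouched, while applying the French list leaves nothing to compare.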

While the practice of having legal forms in company names may serve well for the original purpose of knowing the risk of doing business with that entity, it is certainly not serving the purpose of having the uniqueness data quality dimension solved.

One would think it is time to change the bad (legally required) practice of mixing legal forms with company names, and to serve the original purpose in another, more data quality friendly way.

Golden Copy Musings

22nd October 2010, Henrik Gabs Liliendahl, 2 Comments

In a recent blog post by Jim Harris called Data Quality is not an Act, it is a Habit the term “golden mean” was mentioned.

As I commented, mentioning the “golden mean” made me think about the terms “golden copy” and “golden record” which are often used terms in data quality improvement and master data management.

In using these terms I think we are mostly aiming at achieving extreme uniqueness. But we should rather go for symmetry, proportion, and harmony.

The golden copy subject is very timely for me, as this weekend I am overseeing the execution of the automated processes that create a baseline for a golden copy of party master data at a franchise operator for a major brand in car rental.

In car rental you are dealing with many different party types. You have companies as customers and prospects and you have individuals being contacts at the companies, employees using the cars rented by the companies and individuals being private renters. A real world person may have several of these roles. Besides that we have cases of mixed identities.

During a series of workshops we have worked on defining the rules for merge and survivorship in the golden copy. Though we may be able to go for extreme uniqueness in identifying real world companies and persons, this may not necessarily serve the business needs, and, like it or not, the result may not be capable of being related back to the core systems used in daily business.

Therefore this golden copy is based on a beautiful golden mean exposing symmetry, proportion, and harmony.

The Value of Free Address Data

21st October 2010 (updated 29th May 2012), Henrik Gabs Liliendahl, 1 Comment

In yesterday’s blog post I wrote about Free and Open Sources of Reference Data. As mentioned, we have had some discussions in my home country Denmark about fees for access to public sector data.

However, since 2002 basic Danish public sector data about addresses has been available free of charge. This summer a report about the benefits from this practice was released. Link in Danish here.

I’ll quote the key findings:

The direct economic gains for the Danish community in the last five years 2005-2009 is approximately 471 million DKK (63 million EUR). The total cost until 2009 has been about 15 million DKK (2 million EUR).
Approximately 30% of the profits are made in the public sector and approximately 70% at the private actors.

I think this is a fine example of the win-win situation we’ll get when sharing data between public sector and private sector.

To be called Hamlet or Olaf – that is the question

18th October 2010 (updated 5th November 2010), Henrik Gabs Liliendahl

Right now my family and I are relocating from a house in a southern suburb of Copenhagen into a flat much closer to downtown. As there is a month in between where we haven’t a place of our own, we have rented a cottage (summerhouse) north of Copenhagen not far from Kronborg Castle, which is the scene of the famous Shakespeare play called Hamlet.

Therefore a data quality blog post inspired by Hamlet seems timely.

Though the feigned madness of Hamlet may be a good subject related to data quality, I will instead take a closer data matching look at the name Hamlet.

Shakespeare’s Hamlet is inspired by an old Norse legend, but to me the name Hamlet doesn’t sound very Norse.

Nor does the similar sounding name Amleth, found in the immediate source, Saxo Grammaticus.

If Saxo’s source was a written source, it may have been from Irish monks in Gaelic alphabet as Amhlaoibh where Amhl=owl and aoi=ay and bh=v sounding just like the good old Norse name Olav or Olaf.

So, there is a possible track from Hamlet to Olaf.

Also today a fellow data quality blogger, Graham Rhind, published a post called Robert the Carrot about the same issue. As Graham explains, we often see how data is changed through interfaces, and in the end, after passing through many interfaces, it doesn’t look at all like it did when first entered. There may be a good explanation for each transformation, but the end-to-end similarity is hard to guess when only comparing the first and last versions.

I have often met that challenge in data matching. An example would be if we have the following names at the same address:

Pegy Smith
Peggy Smith
Margaret Smith

A synonym based similarity (or standardization) will find that Margaret and Peggy are duplicates.

An edit distance similarity will find that Peggy and Pegy are duplicates.

A combined similarity algorithm will find that all three names belong to a single duplicate group.
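The three statements above can be sketched in a few lines. This is only an illustration under stated assumptions: the `NICKNAMES` table, the one-edit threshold and the function names are hypothetical, names are assumed to be exactly "first last", and a real matching tool would use far richer synonym lists and scoring.

```python
from itertools import combinations

# Hypothetical nickname table mapping a nickname to its formal name.
NICKNAMES = {"peggy": "margaret", "maggie": "margaret", "bob": "robert"}


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_match(name_a, name_b, max_edits=1):
    """Match on the same surname plus a synonym or near-identical first name."""
    first_a, last_a = name_a.lower().split()  # assumes exactly "first last"
    first_b, last_b = name_b.lower().split()
    if last_a != last_b:
        return False
    synonym = NICKNAMES.get(first_a, first_a) == NICKNAMES.get(first_b, first_b)
    typo = edit_distance(first_a, first_b) <= max_edits
    return synonym or typo


def duplicate_groups(names):
    """Group names transitively: if A matches B and B matches C, all share a group."""
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if is_match(a, b):
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


print(duplicate_groups(["Pegy Smith", "Peggy Smith", "Margaret Smith"]))
```

Pegy and Margaret never match each other directly; they end up in the same group only because each of them matches Peggy, which is why the grouping step has to be transitive.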

The Art of Programming

16th October 2010, Henrik Gabs Liliendahl, 2 Comments

Beginner’s All-purpose Symbolic Instruction Code or simply BASIC is one of the oldest programming languages around and also the first programming language I learned in school back in the 70’s. Later I came across a dialect of BASIC called COMAL, learned and forgot all about ASSEMBLER, made my first business code in COBOL (plus a Yahtzee game), created applications with SPEED and PACE, worked with PowerBuilder, wrote some SQL and made my own data quality tool using MAGIC.

Independent of all the different languages being used, when programming there may be two different basic measures of quality:

1. Good code may mean that the code is well structured, readable by others (including being sensibly documented), reusable, and set up to use the computer resources in the best way possible.
2. Good code (delivered as an application) may mean that it helps solve the business (or gaming) issue addressed through the best possible user experience.

Looking at good code in these two ways resembles the two ways we also measure whether our data is good:

1. Good data may mean that the data is well structured, readable by others (including being sensibly documented), reusable, and reflects the real world in the best way possible.
2. Good data (delivered as information) may mean that it supports solving the business issue addressed through the best possible user experience.

The concern of application (and information) users is point 2.

As a programmer (and data quality professional) you have to consider point 1 in order to achieve point 2. You may get along with a quick and dirty workaround in the short term, but in the long run you have to make it technically right.

Magic Quadrant Diversity

12th October 2010 (updated 22nd July 2011), Henrik Gabs Liliendahl, 2 Comments

The Magic Quadrants from Gartner Inc. rank the tool vendors within a lot of different IT disciplines. In relation to my work, the quadrants for data quality tools and master data management are the most interesting ones.

However, the quadrants examine the vendors in a global scope. But, how are the vendors doing in my country?

I tried to look up a few of the vendors in a local business directory for Denmark provided (free to use on the web) by the local Experian branch.

DataFlux

First up is DataFlux, the (according to Gartner) leading data quality tool vendor.

Result: No hits.

Knowing that DataFlux is owned by SAS Institute will however, with a bit of patience, finally bring you to information about the DataFlux product deep down on the SAS local website.

PS: Though SAS is better known here as the main airline (Scandinavian Airlines System), SAS Institute is actually very successful in Denmark, having a much larger share of the Business Intelligence market here than in most other places.

Informatica

Next up is Informatica, a well positioned company in both the quadrant for data quality tools and customer master data management.

Result: No Hits.

Here you have to know that Informatica is represented in the Nordic area by a company called Affecto. You will find information about the Informatica products deep down on the Affecto website – along with the competing product FirstLogic owned by Business Objects (owned by SAP) also historically represented by Affecto.

Stibo Systems

Stibo Systems may not be as well known as the two above, but is tailing the mega vendors in the quadrant for Product Master Data Management, as mentioned recently in a blog post by Dan Power.

Result: Hit.

They are here with over 500 employees – at least in the legal entity called Stibo, where Stibo Systems is an alternate name and brand. And I’m not kidding; I visited them last month at the impressive headquarters near Århus (the second largest city in Denmark).
