No Privacy Customer Onboarding

This post is a follow-up to today’s #DataKnightsJam on Twitter. Today’s subject was data quality and data privacy.

Diversity in data quality is a subject discussed many times on this blog.

So I want to share a real-life example of a good upstream, get-it-right-the-first-time approach to data sharing that might cross privacy thresholds in other places.

The image to the right is the data entry form from a Swedish webshop used for customer self-registration. The main flow is:

  • You type your national ID (personnummer in Swedish)
  • You press the button
  • The system fetches your name and address data from the public citizen hub
  • The webshop gets an accurate, complete single customer view  

The webshop www.jula.se sells tools for home improvement.
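The flow above can be sketched in Python. The checksum logic is the standard Luhn test used for the Swedish personnummer; the citizen hub client is a hypothetical injected service, as the real citizen hub has its own API and access rules.

```python
# Sketch of the self-registration flow described above.
# The hub_lookup client and its response format are illustrative assumptions.
import re

def validate_personnummer(pnr: str) -> bool:
    """Check length and Luhn checksum of a Swedish personnummer (YYMMDD-NNNN)."""
    digits = re.sub(r"\D", "", pnr)
    if len(digits) == 12:          # allow the century prefix (YYYYMMDDNNNN)
        digits = digits[2:]
    if len(digits) != 10:
        return False
    total = 0
    for i, ch in enumerate(digits):
        n = int(ch)
        if i % 2 == 0:             # double every other digit, starting from the left
            n *= 2
        total += n // 10 + n % 10  # add the digit sum of each product
    return total % 10 == 0

def onboard_customer(pnr: str, hub_lookup) -> dict:
    """Fetch name and address from the citizen hub instead of re-keying them."""
    if not validate_personnummer(pnr):
        raise ValueError("Invalid national ID")
    record = hub_lookup(pnr)       # hub_lookup is an injected service client
    return {"national_id": pnr, **record}
```

The point of the sketch is the upstream principle: the customer only types the identifier, and name and address arrive already correct from the authoritative source.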


Pick Any Two

The project triangle expresses the dilemma that you probably want your project to be good, fast and cheap, but in practice you are only able to prioritize two of these three desirable options. In short:

Good, fast, cheap – pick any two

The pick-any-two-among-three theme can be applied to a lot of other activities by stating three desirable terms of which only two can be combined in real life.

So what could be the pick any two among three themes for data quality?

Of course the good, fast, cheap dilemma also goes for data quality projects. But as data quality management isn’t just a project but an ongoing program, what else?

I have one suggestion:

Fit for purpose, real world alignment, fix it as we go – pick any two

The term “fit for purpose” has become more or less synonymous with “high quality data” and is therefore chosen here to express the good angle of data quality.

Some data, especially those we call master data, are used for multiple purposes within an organization. Therefore some kind of real-world alignment is often used as a fast track to improving data quality, where you don’t spend time analyzing how data may fit multiple purposes at the same time in your organization. Real-world alignment may also fulfill future requirements regardless of the current purposes of use.

Managing data so they are both fit for multiple purposes and aligned with the real world is not something you do cheaply by fixing it as you go. You may pick any two of these options:

  • Make some data fit for purpose by fixing it as the pains show up.
  • Align data with the real world typically by exploiting external reference data as the prices go down.
  • Lay out a thorough plan for having fit for multiple-purpose data aligned with the real world.


We Will Become More Open

Yesterday I read a post called Taking Stock Of DQ Predictions For 2011 by Clarke Patterson of Informatica Corporation. Informatica is a well established vendor within data integration, data quality and master data management. The post is based on a post called Six Data Management Predictions for 2011 by Steve Sarsfield of Talend. Talend is an open source vendor within data integration, data quality and master data management.

One of the six predictions for 2011 is: Data will become more open.

Steve’s (open source based) take on this is:

“In the old days good quality reference data was an asset kept in the corporate lockbox. If you had a good reference table for common misspellings of parts, cities, or names for example, the mind set was to keep it close and away from falling into the wrong hands.  The data might have been sold for profit or simply not available.  Today, there really is no “wrong hands”.  Governments and corporations alike are seeing the societal benefits of sharing information. More reference data is there for the taking on the internet from sites like data.gov and geonames.org.  That trend will continue in 2011.  Perhaps we’ll even see some of the bigger players make announcements as to the availability of their data. Are you listening Google?”

Clarke’s (proprietary software based) take is as follows:

“As data becomes more open, data quality tools will need to be able to handle data from a greater number of sources used for a broader number of purposes.  Gone are the days of single domain data manipulation.  To excel in this new, open market, you’ll need a data quality tool that can profile, cleanse and monitor data regardless of domain, that is also locale-aware and has pre-built rules and reference data.”

I agree with both views, which by the way represent each of The Two Sides To The IT Coin – Data Centric IT vs Process Centric IT – as explained by Robin Bloor in another recent post on the blog of data integration vendor Pervasive Software.
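Steve’s point about reference tables for common misspellings can be illustrated with a minimal sketch: correcting free-text city names against an open reference list using the Python standard library’s fuzzy matching. The city list and cutoff are illustrative assumptions; a real setup would load reference data from a source like geonames.org.

```python
# A minimal sketch of an open reference table for misspelling correction.
# REFERENCE_CITIES is an illustrative stand-in for downloaded open data.
import difflib

REFERENCE_CITIES = ["Copenhagen", "Stockholm", "Oslo", "Helsinki", "Trondheim"]

def standardize_city(raw: str, cutoff: float = 0.8):
    """Return the closest reference city, or None if nothing is close enough."""
    matches = difflib.get_close_matches(raw.title(), REFERENCE_CITIES,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

The cutoff balances false corrections against missed ones; in practice you would tune it per data source rather than hard-code a single value.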

Steve’s and Clarke’s perspectives are also close to me, as my 2011 to-do list includes:

  • Involvement in a solution called iDQ (instant Data Quality). The solution is about how we can help system users during data entry by adding easy-to-use technology that explores the cloud for relevant data related to the entry being made.
  • Helping enhance a hot MDM hub solution with further data quality and multi-domain capabilities.


Right the First Time

Since I have just relocated (and we have just passed the New Year’s resolution point) I have become a member of the nearby fitness club.

Guess what: They got my name, address and birthday absolutely right the first time.

Now, this could have been because the young lady at the counter is a magnificent data entry person. But I think her main competency rightfully is being a splendid fitness instructor.

What she did was ask for my citizen ID card and take the data from there. A little less privacy, yes, but surely a lot better for data quality – or data fitness (credit Frank Harland) you might say.


Hell in Norway

Looking for inappropriate words in customer data is always a risky business. Most of the time there is a legitimate name or a place somewhere containing that word.

Like if you see a city name called “Hell”.

Outside the English speaking parts of the world you will find “Hell” in Norway. It’s a village with its own postal code (NO-7517) situated in the Trondheim metropolitan area. Not least at this time of year, with winter in the Northern Hemisphere, it is surely considerably colder than the religious “Hell”.

But even in the English speaking world you will find a semi legitimate “Hell” in Michigan, United States.
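One way to handle this dilemma is to screen suspect words against a whitelist of known legitimate places, as sketched below. The word list and place list are illustrative assumptions, not a production rule set.

```python
# Sketch of a profanity screen that respects legitimate place names such as
# Hell, Norway (postal code NO-7517) and Hell, Michigan, United States.
SUSPECT_WORDS = {"hell", "damn"}                       # illustrative word list
LEGITIMATE_PLACES = {("hell", "NO"), ("hell", "US")}   # (place name, country)

def flag_city(city: str, country: str) -> bool:
    """Flag a city name for manual review only when it is a suspect word
    AND not a known legitimate place in that country."""
    key = city.strip().lower()
    if key not in SUSPECT_WORDS:
        return False
    return (key, country.upper()) not in LEGITIMATE_PLACES
```

The design choice is to flag for manual review rather than auto-reject, since the whole point of the post is that blanket word filters produce false positives.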


Entity Revolution vs Entity Evolution

Entity resolution is the discipline of uniquely identifying your master data records, typically being those holding data about customers, products and locations. Entity resolution is closely related to the concept of a single version of the truth.

Questions asked during entity resolution include:

  • Is a given customer master data record representing a real world person or organization?
  • Is a person acting as both a private customer and a small business owner going to be seen as the same?
  • Is a product coming from supplier A going to be identified as the same as the same product coming from supplier B?
  • Is the geocode for the center of a parcel the same place as the geocode of where the parcel is bordering a public road?

We may come a long way in automating entity resolution by using advanced data matching and exploiting rich sources of external reference data. We may also handle the complex structures of the real world by using sophisticated hierarchy management, and hereby make an entity revolution in our databases.
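As a minimal illustration of such automated matching, the sketch below clusters customer records that normalize to the same name-and-address key. Real entity resolution engines add fuzzy comparison, survivorship rules and external reference data on top of this; the field names here are assumptions.

```python
# Toy entity resolution: group records whose normalized name + address agree.
from collections import defaultdict
import re

def match_key(name: str, address: str) -> tuple:
    """Strip case, punctuation and whitespace to build a crude match key."""
    norm = lambda s: re.sub(r"[^a-z0-9]", "", s.lower())
    return (norm(name), norm(address))

def resolve(records: list) -> list:
    """Return clusters of records presumed to represent the same
    real-world entity."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[match_key(rec["name"], rec["address"])].append(rec)
    return list(clusters.values())
```

Even this crude key shows the revolution/evolution tension: merging “ACME Ltd.” and “Acme Ltd” may improve insight while breaking processes that rely on the two records staying apart.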

But I am often faced with the fact that most organizations don’t want an entity revolution. There are always plenty of good reasons why different frequent business processes don’t require full entity resolution and will only be complicated by having it (unless drastically reengineered). The tangible, immediate negative business impact of an entity revolution trumps the softer, positive improvement in business insight from such a revolution.

Therefore we mostly make entity evolutions, balancing the current business requirements with the distant ideal of a single version of the truth.


The Value of Free Address Data

In yesterday’s blog post I wrote about Free and Open Sources of Reference Data. As mentioned, we have had some discussions in my home country Denmark about fees for access to public sector data.

However, since 2002 basic Danish public sector data about addresses has been available free of charge. This summer a report about the benefits of this practice was released. Link in Danish here.

I’ll quote the key findings:

  • The direct economic gains for the Danish community in the five years 2005-2009 are approximately 471 million DKK (63 million EUR). The total cost up to 2009 has been about 15 million DKK (2 million EUR).
  • Approximately 30% of the gains are realized in the public sector and approximately 70% by private actors.

I think this is a fine example of the win-win situation we’ll get when sharing data between public sector and private sector.
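Restating the report’s figures as a quick calculation: gains of roughly 471 million DKK against total costs of roughly 15 million DKK correspond to a benefit-cost ratio of about 31 to 1.

```python
# Benefit-cost arithmetic for the Danish free address data, per the report.
gains_mdkk = 471            # direct economic gains 2005-2009, million DKK
costs_mdkk = 15             # total cost up to 2009, million DKK

ratio = gains_mdkk / costs_mdkk          # roughly 31:1
public_share = 0.30 * gains_mdkk         # ~30% realized in the public sector
private_share = 0.70 * gains_mdkk        # ~70% realized by private actors
```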


Free and Open Sources of Reference Data

This Monday I mingled in a tweetjam organized by the open source data integration vendor Talend.

One of the questions discussed was: Are free and open sources of reference data becoming more important in your projects?

When talking “free and open”, not least in the open source realm, we can’t avoid talking about “free for a fee”. Some sources of open data like Geonames are free as in “free beer”. Other data comes with a fee. In my home country Denmark we have had some discussions about the reasoning behind the government putting a fee on mandatorily collected data, and I have observed similar considerations in our close neighbor country Sweden. (By the way: the picture of a bridge that Talend uses a lot, such as on top of its home page, looks like the bridge between Denmark and Sweden.)

One challenge I have met over and over again in using free (maybe for a fee) and open data in data integration and data quality improvement is the cost of conformity. When using open government data there may, apart from the pricing, be a lot of differences between countries in formats, coverage and so on. I think there is great potential in delivering conformed data from many different sources for specific purposes.


Business Directory Match: Global versus Local

When doing data quality improvement in business-to-business party master data an often used shortcut is matching your portfolio of business customers with a business directory and preferably picking new customers from the directory in the future.

If you are doing business in more than one country you will have some considerations about what business directory to use like engaging with a local business directory for each country or engaging with a single business directory covering all countries in question.

There are pros and cons.

One subject is conformity. I have met this issue a couple of times. A business directory covering many countries will have a standardized way of formatting the different elements, like a postal address, whereas a local (national) business directory will use the best practice for the particular country.

An example from my home country Denmark:

The Dun & Bradstreet WorldBase is a business directory holding 170 million business entities from all over the world. A Danish street address is formatted like this:

Address Line 1 = Hovedgaden 12 A, 4. th

Observe that Denmark belongs to that half of the earth where house numbers are written after the street name.

In a local business directory (based on the public registry) you will be able to get this format:

Street name = Hovedgaden
Street code = 202 4321
House number = 012A
Floor = 04
Side/door = TH

Here you get an atomized address with metadata for the atomized elements and the unique address coding used in Denmark.
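To illustrate the conformity gap, the sketch below splits a concatenated WorldBase-style address line into atomized elements. The regular expression is my own illustrative assumption and covers only the common “street number letter, floor. side” pattern shown above; real Danish addresses have many more variants, and the street code comes from the registry, not from parsing.

```python
# Sketch: atomize a concatenated Danish address line into registry-style parts.
import re

ADDRESS_RE = re.compile(
    r"^(?P<street>\D+?)\s+"         # street name, e.g. "Hovedgaden"
    r"(?P<number>\d+\s?[A-Z]?)"     # house number, e.g. "12 A"
    r"(?:,\s*(?P<floor>\d+)\.\s*"   # floor, e.g. "4."
    r"(?P<side>\w+))?$"             # side/door, e.g. "th"
)

def atomize(line: str):
    """Split an address line into street name, house number, floor and side.
    Returns None when the line doesn't fit the simple pattern."""
    m = ADDRESS_RE.match(line.strip())
    if not m:
        return None
    d = m.groupdict()
    return {
        "street_name": d["street"],
        "house_number": d["number"].replace(" ", ""),
        "floor": d["floor"],
        "side": d["side"].upper() if d["side"] else None,
    }
```

This is exactly the work the local directory has already done for you, which is one of its main advantages over a concatenated global format.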


Linked Data Quality

The concept of linked data within the semantic web is in my eyes a huge opportunity for getting data and information quality improvement done.

The premises for that are described on the page Data Quality 3.0.

Until now data quality has been largely defined as: Fit for purpose of use.

The problem however is that most data – not least master data – have multiple uses.

My thesis is that there is a breakeven point when including more and more purposes where it will be less cumbersome to reflect the real world object rather than trying to align fitness for all known purposes.

If we look at the different types of master data and what possibilities that may arise from linked data, this is what initially comes to my mind:

Location master data

Location data is among the data types already used the most on the web. Linking a hotel, a company, a house for sale and so on to a map is an immediate visual feature appealing to most people. However, many databases around have poor location data, for example inadequate postal addresses. The demand for making these data “mappable” will become near unavoidable, but fortunately the services for doing so with linked data will help.

Hopefully increased open government data will help solve the data supply issue here.

Party master data

Linking party master data to external data sources is not new at all, but unfortunately not as widespread as it could be. The main obstacle until now has been smooth integration into business processes.

Having linked data describing real world entities on the web will make this game a whole lot easier.

Actually I’m working on implementations in this field right now.

Product master data

Traditionally the external data sources available for describing product master data have been few – and hard to find. But surely, a lot of data is already out there waiting to be found, categorized, matched and linked.
