No Privacy Customer Onboarding

This post is a follow-up to today’s #DataKnightsJam on Twitter. Today’s subject was data quality and data privacy.

Diversity in data quality is a subject discussed many times on this blog.

So I want to share a real-life example of a good upstream, get-it-right-the-first-time approach to data sharing that might cross privacy thresholds in other places.

The image to the right is the data entry form from a Swedish webshop used for customer self-registration. The main flow is:

  • You type your national ID (personnummer in Swedish)
  • You press the button
  • The system fetches your name and address data from the public citizen hub
  • The webshop gets an accurate, complete single customer view  

The webshop www.jula.se sells tools for home improvement.
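The flow above can be sketched in Python. The checksum logic is the standard Luhn test used for the Swedish personnummer; the citizen hub client is a hypothetical injected service, as the real citizen hub has its own API and access rules.

```python
# Sketch of the self-registration flow described above.
# The hub_lookup client and its response format are illustrative assumptions.
import re

def validate_personnummer(pnr: str) -> bool:
    """Check length and Luhn checksum of a Swedish personnummer (YYMMDD-NNNN)."""
    digits = re.sub(r"\D", "", pnr)
    if len(digits) == 12:          # allow the century prefix (YYYYMMDDNNNN)
        digits = digits[2:]
    if len(digits) != 10:
        return False
    total = 0
    for i, ch in enumerate(digits):
        n = int(ch)
        if i % 2 == 0:             # double every other digit, starting from the left
            n *= 2
        total += n // 10 + n % 10  # add the digit sum of each product
    return total % 10 == 0

def onboard_customer(pnr: str, hub_lookup) -> dict:
    """Fetch name and address from the citizen hub instead of re-keying them."""
    if not validate_personnummer(pnr):
        raise ValueError("Invalid national ID")
    record = hub_lookup(pnr)       # hub_lookup is an injected service client
    return {"national_id": pnr, **record}
```

The point of the sketch is the upstream principle: the customer only types the identifier, and name and address arrive already correct from the authoritative source.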


Pick Any Two

The project triangle expresses the dilemma that you probably want your project to be good, fast and cheap, but in practice you are only able to prioritize two of these three desirable options. In short:

Good, fast, cheap – pick any two

The pick-any-two-among-three theme can be applied to a lot of other activities by stating three desirable terms of which only two can be combined in real life.

So what could be the pick any two among three themes for data quality?

Of course the good, fast, cheap dilemma also goes for data quality projects. But as data quality management isn’t just a project but an ongoing program, what else?

I have one suggestion:

Fit for purpose, real world alignment, fix it as we go – pick any two

The term “fit for purpose” has become more or less synonymous with “high quality data” and is therefore chosen here to express the good angle of data quality.

Some data, especially those we call master data, are used for multiple purposes within an organization. Therefore some kind of real-world alignment is often used as a fast track to improving data quality, where you don’t spend time analyzing how data may fit multiple purposes at the same time in your organization. Real-world alignment may also fulfill future requirements regardless of the current purposes of use.

Managing data so they are both fit for multiple purposes and aligned with the real world is not something you do cheaply by fixing it as you go. You may pick any two of these options:

  • Make some data fit for purpose by fixing it as the pains show up.
  • Align data with the real world typically by exploiting external reference data as the prices go down.
  • Lay out a thorough plan for having fit for multiple-purpose data aligned with the real world.


We Will Become More Open

Yesterday I read a post called Taking Stock Of DQ Predictions For 2011 by Clarke Patterson of Informatica Corporation. Informatica is a well established vendor within data integration, data quality and master data management. The post is based on a post called Six Data Management Predictions for 2011 by Steve Sarsfield of Talend. Talend is an open source vendor within data integration, data quality and master data management.

One of the six predictions for 2011 is: Data will become more open.

Steve’s (open source based) take on this is:

“In the old days good quality reference data was an asset kept in the corporate lockbox. If you had a good reference table for common misspellings of parts, cities, or names for example, the mind set was to keep it close and away from falling into the wrong hands.  The data might have been sold for profit or simply not available.  Today, there really is no “wrong hands”.  Governments and corporations alike are seeing the societal benefits of sharing information. More reference data is there for the taking on the internet from sites like data.gov and geonames.org.  That trend will continue in 2011.  Perhaps we’ll even see some of the bigger players make announcements as to the availability of their data. Are you listening Google?”

Clarke’s (proprietary software based) take is as follows:

“As data becomes more open, data quality tools will need to be able to handle data from a greater number of sources used for a broader number of purposes.  Gone are the days of single domain data manipulation.  To excel in this new, open market, you’ll need a data quality tool that can profile, cleanse and monitor data regardless of domain, that is also locale-aware and has pre-built rules and reference data.”

I agree with both views, which by the way represent each of The Two Sides To The IT Coin – Data Centric IT vs Process Centric IT – as explained by Robin Bloor in another recent post on the blog of data integration vendor Pervasive Software.
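Steve’s point about reference tables for common misspellings can be illustrated with a minimal sketch: correcting free-text city names against an open reference list using the Python standard library’s fuzzy matching. The city list and cutoff are illustrative assumptions; a real setup would load reference data from a source like geonames.org.

```python
# A minimal sketch of an open reference table for misspelling correction.
# REFERENCE_CITIES is an illustrative stand-in for downloaded open data.
import difflib

REFERENCE_CITIES = ["Copenhagen", "Stockholm", "Oslo", "Helsinki", "Trondheim"]

def standardize_city(raw: str, cutoff: float = 0.8):
    """Return the closest reference city, or None if nothing is close enough."""
    matches = difflib.get_close_matches(raw.title(), REFERENCE_CITIES,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

The cutoff balances false corrections against missed ones; in practice you would tune it per data source rather than hard-code a single value.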

Steve’s and Clarke’s perspectives are also close to me, as my 2011 to-do list includes:

  • Involvement in a solution called iDQ (instant Data Quality). The solution is about how we can help system users during data entry by adding easy-to-use technology that explores the cloud for relevant data related to the entry being made.
  • Helping enhance a hot MDM hub solution with further data quality and multi-domain capabilities.


Right the First Time

Since I have just relocated (and we have just passed the New Year’s resolution point) I have become a member of the nearby fitness club.

Guess what: They got my name, address and birthday absolutely right the first time.

Now, this could have been because the young lady at the counter is a magnificent data entry person. But I think her main competency rightfully is being a splendid fitness instructor.

What she did was ask for my citizen ID card and take the data from there. A little less privacy, yes, but surely a lot better for data quality – or data fitness (credit Frank Harland) you might say.


Hell in Norway

Looking for inappropriate words in customer data is always a risky business. Most of the time there is a legitimate name or a place somewhere containing that word.

Like if you see a city name called “Hell”.

Outside the English speaking parts of the world you will find “Hell” in Norway. It’s a village with its own postal code (NO-7517) situated in the Trondheim metropolitan area. Not least at this time of year, with winter in the Northern Hemisphere, it is surely considerably colder than the religious “Hell”.

But even in the English speaking world you will find a semi legitimate “Hell” in Michigan, United States.
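One way to handle this dilemma is to screen suspect words against a whitelist of known legitimate places, as sketched below. The word list and place list are illustrative assumptions, not a production rule set.

```python
# Sketch of a profanity screen that respects legitimate place names such as
# Hell, Norway (postal code NO-7517) and Hell, Michigan, United States.
SUSPECT_WORDS = {"hell", "damn"}                       # illustrative word list
LEGITIMATE_PLACES = {("hell", "NO"), ("hell", "US")}   # (place name, country)

def flag_city(city: str, country: str) -> bool:
    """Flag a city name for manual review only when it is a suspect word
    AND not a known legitimate place in that country."""
    key = city.strip().lower()
    if key not in SUSPECT_WORDS:
        return False
    return (key, country.upper()) not in LEGITIMATE_PLACES
```

The design choice is to flag for manual review rather than auto-reject, since the whole point of the post is that blanket word filters produce false positives.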


Entity Revolution vs Entity Evolution

Entity resolution is the discipline of uniquely identifying your master data records, typically being those holding data about customers, products and locations. Entity resolution is closely related to the concept of a single version of the truth.

Questions asked during entity resolution include:

  • Is a given customer master data record representing a real world person or organization?
  • Is a person acting as both a private customer and a small business owner going to be seen as the same?
  • Is a product coming from supplier A going to be identified as the same as the same product coming from supplier B?
  • Is the geocode for the center of a parcel the same place as the geocode of where the parcel is bordering a public road?

We may come a long way in automating entity resolution by using advanced data matching and exploiting rich sources of external reference data. We may also handle the complex structures of the real world by using sophisticated hierarchy management, and hereby make an entity revolution in our databases.
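As a minimal illustration of such automated matching, the sketch below clusters customer records that normalize to the same name-and-address key. Real entity resolution engines add fuzzy comparison, survivorship rules and external reference data on top of this; the field names here are assumptions.

```python
# Toy entity resolution: group records whose normalized name + address agree.
from collections import defaultdict
import re

def match_key(name: str, address: str) -> tuple:
    """Strip case, punctuation and whitespace to build a crude match key."""
    norm = lambda s: re.sub(r"[^a-z0-9]", "", s.lower())
    return (norm(name), norm(address))

def resolve(records: list) -> list:
    """Return clusters of records presumed to represent the same
    real-world entity."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[match_key(rec["name"], rec["address"])].append(rec)
    return list(clusters.values())
```

Even this crude key shows the revolution/evolution tension: merging “ACME Ltd.” and “Acme Ltd” may improve insight while breaking processes that rely on the two records staying apart.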

But I am often faced with the fact that most organizations don’t want an entity revolution. There are always plenty of good reasons why different frequent business processes don’t require full entity resolution and will only be complicated by having it (unless drastically reengineered). The tangible, immediate negative business impact of an entity revolution trumps the softer, positive improvement in business insight from such a revolution.

Therefore we mostly make entity evolutions, balancing the current business requirements with the distant ideal of a single version of the truth.


The Value of Free Address Data

In yesterday’s blog post I wrote about Free and Open Sources of Reference Data. As mentioned, we have had some discussions in my home country Denmark about fees for access to public sector data.

However, since 2002 basic Danish public sector data about addresses has been available free of charge. This summer a report about the benefits of this practice was released. Link in Danish here.

I’ll quote the key findings:

  • The direct economic gains for the Danish community in the five years 2005-2009 are approximately 471 million DKK (63 million EUR). The total cost up to 2009 has been about 15 million DKK (2 million EUR).
  • Approximately 30% of the gains are realized in the public sector and approximately 70% by private actors.

I think this is a fine example of the win-win situation we’ll get when sharing data between public sector and private sector.
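Restating the report’s figures as a quick calculation: gains of roughly 471 million DKK against total costs of roughly 15 million DKK correspond to a benefit-cost ratio of about 31 to 1.

```python
# Benefit-cost arithmetic for the Danish free address data, per the report.
gains_mdkk = 471            # direct economic gains 2005-2009, million DKK
costs_mdkk = 15             # total cost up to 2009, million DKK

ratio = gains_mdkk / costs_mdkk          # roughly 31:1
public_share = 0.30 * gains_mdkk         # ~30% realized in the public sector
private_share = 0.70 * gains_mdkk        # ~70% realized by private actors
```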


Free and Open Sources of Reference Data

This Monday I mingled in a tweetjam organized by the open source data integration vendor Talend.

One of the questions discussed was: Are free and open sources of reference data becoming more important in your projects?

When talking “free and open”, not least in the open source realm, we can’t avoid talking about “free for a fee”. Some sources of open data like Geonames are free as in “free beer”. Other data comes with a fee. In my home country Denmark we have had some discussions about the reasoning behind the government putting a fee on mandatorily collected data, and I have observed similar considerations in our close neighbor country Sweden. (By the way: the picture of a bridge that Talend uses a lot, such as on top of its home page, looks like the bridge between Denmark and Sweden.)

One challenge I have met over and over again in using free (maybe for a fee) and open data in data integration and data quality improvement is the cost of conformity. When using open government data there may, apart from the pricing, be a lot of differences between countries in formats, coverage and so on. I think there is great potential in delivering conformed data from many different sources for specific purposes.


Business Directory Match: Global versus Local

When doing data quality improvement in business-to-business party master data an often used shortcut is matching your portfolio of business customers with a business directory and preferably picking new customers from the directory in the future.

If you are doing business in more than one country you will have some considerations about what business directory to use like engaging with a local business directory for each country or engaging with a single business directory covering all countries in question.

There are pros and cons.

One subject is conformity. I have met this issue a couple of times. A business directory covering many countries will have a standardized way of formatting the different elements, like a postal address, whereas a local (national) business directory will use the best practice for the particular country.

An example from my home country Denmark:

The Dun & Bradstreet WorldBase is a business directory holding 170 million business entities from all over the world. A Danish street address is formatted like this:

Address Line 1 = Hovedgaden 12 A, 4. th

Observe that Denmark belongs to that half of the earth where house numbers are written after the street name.

In a local business directory (based on the public registry) you will be able to get this format:

Street name = Hovedgaden
Street code = 202 4321
House number = 012A
Floor = 04
Side/door = TH

Here you get an atomized address with metadata for the atomized elements and the unique address coding used in Denmark.
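To illustrate the conformity gap, the sketch below splits a concatenated WorldBase-style address line into atomized elements. The regular expression is my own illustrative assumption and covers only the common “street number letter, floor. side” pattern shown above; real Danish addresses have many more variants, and the street code comes from the registry, not from parsing.

```python
# Sketch: atomize a concatenated Danish address line into registry-style parts.
import re

ADDRESS_RE = re.compile(
    r"^(?P<street>\D+?)\s+"         # street name, e.g. "Hovedgaden"
    r"(?P<number>\d+\s?[A-Z]?)"     # house number, e.g. "12 A"
    r"(?:,\s*(?P<floor>\d+)\.\s*"   # floor, e.g. "4."
    r"(?P<side>\w+))?$"             # side/door, e.g. "th"
)

def atomize(line: str):
    """Split an address line into street name, house number, floor and side.
    Returns None when the line doesn't fit the simple pattern."""
    m = ADDRESS_RE.match(line.strip())
    if not m:
        return None
    d = m.groupdict()
    return {
        "street_name": d["street"],
        "house_number": d["number"].replace(" ", ""),
        "floor": d["floor"],
        "side": d["side"].upper() if d["side"] else None,
    }
```

This is exactly the work the local directory has already done for you, which is one of its main advantages over a concatenated global format.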


Linked Data Quality

The concept of linked data within the semantic web is in my eyes a huge opportunity for getting data and information quality improvement done.

The premises for that are described on the page Data Quality 3.0.

Until now data quality has been largely defined as: Fit for purpose of use.

The problem however is that most data – not least master data – have multiple uses.

My thesis is that there is a breakeven point when including more and more purposes where it will be less cumbersome to reflect the real world object rather than trying to align fitness for all known purposes.

If we look at the different types of master data and what possibilities that may arise from linked data, this is what initially comes to my mind:

Location master data

Location data is among the data types already used the most on the web. Linking a hotel, a company, a house for sale and so on to a map is an immediate visual feature appealing to most people. However, many databases around have poor location data, for example inadequate postal addresses. The demand for making these data “mappable” will become near unavoidable, but fortunately the services for doing so with linked data will help.

Hopefully increased open government data will help solve the data supply issue here.

Party master data

Linking party master data to external data sources is not new at all, but unfortunately not as widespread as it could be. The main obstacle until now has been smooth integration into business processes.

Having linked data describing real world entities on the web will make this game a whole lot easier.

Actually I’m working on implementations in this field right now.

Product master data

Traditionally the external data sources available for describing product master data have been few – and hard to find. But surely, a lot of data is already out there waiting to be found, categorized, matched and linked.
