The Database versus the Hub

In the LinkedIn Multi-Domain MDM group we have an ongoing discussion about why you need a master data hub when you already have a workflow, a UI and a database.

I have been involved in several master data quality improvement programs without the opportunity to store the results in a genuine MDM solution, for example as described in the post Lean MDM. And of course this may very well result in a success story.

However, there are some architectural reasons why many more organizations than those using an MDM hub today may, sooner or later, find benefits in having a master data hub.

Hierarchical Completeness

If we start with product master data, the main issue with storing it is the diversity in requirements for which attributes are needed, and when, depending on the categorization of the products involved.

Typically you will have hundreds or thousands of different attributes, where some are crucial for one kind of product and absolutely ridiculous for another kind of product.

Modeling a single product table with thousands of attributes is not good database practice, and pre-modeling tables for each anticipated categorization is very inflexible.

Setting up mandatory fields at database level for product master data tables is asking for data quality issues, as you can't avoid either over-constraining or under-constraining.

Also, product master data entities are seldom created in one single insertion, but are inserted and updated by several different employees, each responsible for a set of attributes, until the entity is ready to be approved as a whole.

A master data hub, not least those born in the product domain, is built for these realities.

The party domain has hierarchical issues too. One example is whether a state/province is mandatory on an address, which depends on the country in question.
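In code, such category- and country-dependent requirements look like conditional rules rather than fixed database constraints. A minimal sketch, where the category names, attribute lists and country set are made-up illustrations:

```python
# Category-dependent required attributes (illustrative categories only).
REQUIRED_BY_CATEGORY = {
    "apparel": ["name", "size", "colour", "fabric"],
    "electronics": ["name", "voltage", "energy_class"],
}

# Sample of countries where a state/province is required on an address.
STATE_REQUIRED_COUNTRIES = {"US", "AU", "IN"}

def missing_attributes(product: dict) -> list:
    """Return the attributes still missing for the product's category."""
    required = REQUIRED_BY_CATEGORY.get(product.get("category"), ["name"])
    return [attr for attr in required if not product.get(attr)]

def address_state_required(country_code: str) -> bool:
    """State/province is mandatory only for some countries."""
    return country_code.upper() in STATE_REQUIRED_COUNTRIES

product = {"category": "electronics", "name": "Kettle", "voltage": "230V"}
print(missing_attributes(product))   # ['energy_class']
print(address_state_required("DK"))  # False
```

The point of the sketch is that the rules live as data, so a new categorization is a new dictionary entry rather than a new table.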

Single Business Partner View

I like the term "single business partner view" as a higher vision than the more common "single customer view", as we have the same architectural requirements for supplier master data, employee master data and other master data concerning business partners as we have for the, of course, extremely important customer master data.

The uniqueness dimension of data quality has a really hard time in common database management systems. Having duplicate customer, supplier and employee master data records is the most frequent data quality issue around.

In this sense, a duplicate party is not a record with exactly the same fields filled and with exactly the same values spelled exactly the same way, as a database would see it. A duplicate is one record reflecting the same real-world entity as another record, and a duplicate group is several records reflecting the same real-world entity.

Even though some database management systems have fuzzy matching capabilities, they are still very inadequate at finding these duplicates based on several attributes at once, and not least at finding duplicate groups.

Finding duplicates when inserting supposed new entities into your customer list and other party master data containers is only the first challenge concerning uniqueness. Next you have to solve the so-called survivorship questions: which values will survive the unavoidable differences.
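A minimal sketch of what a hub does here, using only the Python standard library: fuzzy matching across several attributes at once, grouping matches into duplicate groups, and a naive most-recent-wins survivorship rule. The threshold, the equal field weighting and the sample records are illustrative assumptions, not production settings:

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: dict, b: dict) -> float:
    """Average fuzzy similarity over name and city."""
    scores = [SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
              for f in ("name", "city")]
    return sum(scores) / len(scores)

def duplicate_groups(records: list, threshold: float = 0.85) -> list:
    """Group records whose pairwise similarity exceeds the threshold."""
    parent = list(range(len(records)))  # union-find over record indexes
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(j)] = find(i)
    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(records[i])
    return [g for g in groups.values() if len(g) > 1]

def survivor(group: list) -> dict:
    """Naive survivorship: the most recently updated record wins."""
    return max(group, key=lambda r: r["updated"])

records = [
    {"name": "Acme Corp", "city": "Copenhagen", "updated": "2011-05-01"},
    {"name": "ACME Corporation", "city": "Copenhagen", "updated": "2011-06-01"},
    {"name": "Beta Ltd", "city": "Aarhus", "updated": "2011-04-01"},
]
for group in duplicate_groups(records):
    print(survivor(group)["name"])  # ACME Corporation
```

Real matching engines use far more sophisticated comparators and weighted rules per attribute, but the shape of the problem — pairwise scoring, grouping, then survivorship — is as above.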

Finally, the result to be stored may have several possible constructions. Maybe a new insertion must be split into two entities belonging to two different hierarchy levels in your party master data universe.

A master data hub will have the capabilities to handle this complexity – some for customer master data only, some also for supplier master data, combined with similar challenges for product master data and eventually other party master data as well.

Domain Real World Awareness

Building hierarchies, filling incomplete attributes, consolidating duplicates and other forms of real-world alignment are most often accomplished by including external reference data.

There are many sources available for party master data, such as address directories, business directories and citizen information, depending on the countries in question.

For product master data, global data synchronization involving common product identifiers and product classifications is becoming very important when doing business the lean way.

Master data hubs know these sources of external reference data, so you, once again, don't have to reinvent the wheel.


Unmaintainability

Following up on my post about word quality, and inspired by a blog post by Joyce Norris-Montanari called "Things That Don't Work So Well – Doing Analytics Before Their Time", in which the word "unmaintainable" is used, I want to challenge my English spell checker even further with the rare, and apparently not really existing, word for a frequent issue: unmaintainability.

I have previously pondered on this blog that you can't expect everything to be just fine from this day forward simply because you got it Right the First Time. Things change.

This argument is about the data as plain data.

But there is also a maintainability (this is apparently a real word) issue around how we store data. I have many times conducted data quality exercises such as deduplication, and matching with and enriching from external reference data, in order to reach a single version of the truth as far as it goes.

An often encountered problem is that this kind of data processing can get us somewhere close to a single version of the truth. But then there is a huge obstacle: you can't get these great results back into the daily databases without destroying some of the correctness, because the data structures don't allow you to do that.

Such unmaintainability is in my eyes a good argument for looking into master data management platforms that allow you to maintain your master data with the complexity that supports the business rules that make your company more competitive.


Good-Bye to the old CRM data model

Today I stumbled upon a blog post called Good-Bye to the “Job” by David Houle, a futurist, strategist and speaker.

In the post it is said: “In the Industrial Age, machines replaced manual or blue-collar labor. In the Information Age, computers replaced office or white-collar workers”.

The post is about that today we can’t expect to occupy one life-long job at a single employer.  We must increasingly create our own job.

My cyberspace friend Phil Simon also wrote about his advanced journey into this space recently in the post Diversifying Yourself Into a Platform Business.

The subject is close to me as I currently have approximately five different occupations as seen in my LinkedIn profile.

A professional angle to this subject is also how that development will turn some traditional data models upside down.

A Customer Relationship Management (CRM) system for business-to-business (B2B) environments has a basic data model with accounts having a number of contacts attached where the account is the parent and the contacts are the children in data modeling language.

Most systems and business processes have trouble following a contact from account (company) to account (company) when the contact gets a new job, or when the same real-world individual is a contact at two or more accounts (companies) at the same time.

I have seen this problem many times and also failed to recognize it myself from time to time as told in the post A New Year Resolution.

My guess is that CRM systems in the B2B realm will over time turn to a more contact-oriented view, probably along with CRM systems relying more on Master Data Management (MDM) hubs in order to effectively reflect a fast, but unevenly, changing world, as the development in the way we have jobs doesn't happen at the same time in all places.
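Such a contact-oriented view could be sketched as a many-to-many model, where the person exists independently of any account and the affiliations carry the time periods, so job changes and parallel roles are just rows. The names and fields below are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Person:
    person_id: int
    name: str

@dataclass
class Affiliation:
    """Links a person to an account for a period; end=None means current."""
    person_id: int
    account: str
    start: str
    end: Optional[str] = None

affiliations = [
    Affiliation(1, "Acme Corp", "2008-01-01", "2010-12-31"),
    Affiliation(1, "Beta Ltd", "2011-01-01"),    # new job
    Affiliation(1, "Gamma A/S", "2011-03-01"),   # parallel role
]

def current_accounts(person_id: int) -> list:
    """All accounts where this person is currently a contact."""
    return [a.account for a in affiliations
            if a.person_id == person_id and a.end is None]

print(current_accounts(1))  # ['Beta Ltd', 'Gamma A/S']
```

Contrast this with the classic parent-child model, where the same individual would exist as three unrelated contact records under three accounts.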


Phishing in Wrong Waters

Yesterday a lot of Danes received an e-mail apparently coming from the tax authorities, but it was a phishing attempt.

The form to be filled may seem professional at first glance, but it actually had errors all over.

 

While such errors may be common in phishing, as the people behind it only need a fraction of the receivers to take the bait, you actually do see many of the same errors in lawful activities.

Some of the errors in the phishing attempt were:

  • It is very unlikely that the public sector would communicate in English instead of Danish
  • They got our national ID for every citizen right; it is called CPR-NR. But why ask for date of birth, as this is included in the national ID?
  • Asking for "Mother Maiden Name" and "The name of your son" seems ridiculous to me. I don't know if it's some kind of custom anywhere else in the world.
  • The address format is (as usual) a United States standard. Here it would be: Address, Postal Code, Town/City.
  • You would never expect the public sector to pay anything to your credit/debit card. Our national ID is connected to a bank account selected for that purpose.
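The redundancy of asking for the date of birth can be illustrated: the first six digits of a CPR number are the birth date in DDMMYY format. (The century is derived from the seventh digit by rules simplified away in this sketch, and the sample number is made up.)

```python
def birth_date_from_cpr(cpr: str) -> tuple:
    """Return (day, month, two-digit year) from a CPR number like '070761-xxxx'."""
    digits = cpr.replace("-", "")
    day, month, year = int(digits[:2]), int(digits[2:4]), int(digits[4:6])
    if not (1 <= day <= 31 and 1 <= month <= 12):
        raise ValueError("Not a valid CPR birth date part")
    return day, month, year

print(birth_date_from_cpr("070761-1234"))  # (7, 7, 61)
```

So a form that has the CPR number already has the birth date, and asking for both only adds an opportunity for inconsistency.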

As the tax authorities stated in a warning e-mail today: “We do not know of anyone who has been cheated by the mail”.

I guess they are right.

Also, if you are doing lawful activities but committing the same kind of diversity errors in your forms: Don’t expect a whole lot of conversion.


Big Master Data

Right now I am overseeing the processing of yet another master data file with millions of records. In this case it is product master data, but with customer-master-data-like attributes too, as we are working with a big pile of author names and related book titles.

The Big Buzz

Having such high numbers of master data records isn't new at all, and compared to the size of the data collections we usually talk about when using the trendy buzzword "big data", it's nothing.

Data collections that qualify as big will usually be files with transactions.

However master data collections are increasing in volume and most transactions have keys referencing descriptions of the master entities involved in the transactions.

The growth of master data collections is also seen in collections of external reference data.

For example, the Dun & Bradstreet Worldbase, holding business entities from around the world, has lately grown quickly from 100 million entities to nearly 200 million entities. Most of the growth has been due to better coverage outside North America and Western Europe, with the BRIC countries coming in fast. A smaller world resulting in bigger data.

Also, one of the BRIC countries, India, is underway with a huge project for uniquely identifying and holding information about every citizen – that's over a billion people. The project is called Aadhaar.

When we extend such external registries to social networking services by doing Social MDM, we are dealing with the very fast growing number of profiles on Facebook, LinkedIn and other services.

Extreme Master Data

Gartner, the analyst firm, has a concept called "extreme data" that rightly points out that this "big data" thing is not only about volume; it is also about velocity and variety.

This is certainly true also for master data management (MDM) challenges.

Master data are exchanged between organizations more and more often and in higher and higher volumes. Data quality focus and maturity will probably not be the same within the exchanging parties. The velocity and volume make it hard to rely on people-centric solutions in these situations.

Add to that the increasing variety in master data. The variety may be international, as the world gets smaller and we have collections of master data embracing many languages and cultures. We also add more and more attributes every day, as for example governments release more data along with the open data trend, and we generally include more and more attributes in order to make better and more informed decisions.

Variety is also an aspect of Multi-Domain MDM, a subject that according to Gartner (the analyst firm once again) is one of the Three Trends That Will Shape the Master Data Management Market.


Mutating Platforms or Intelligent Design

How do we go from single-domain master data management to multi-domain master data management? Will it be through evolution of single-domain solutions or will it require a complete new intelligent design?

The MDM journey

My previous blog post was a book review of "Master Data Management in Practice" by Dalton Cervo and Mark Allen – or rather, the full title of the book is "Master Data Management in Practice: Achieving True Customer MDM".

The customer domain has until now been the most frequent and proven domain for master data management and, as said in the book, the domain where most organizations start the MDM journey, in particular by doing what is usually called Customer Data Integration (CDI).

However, some organizations do start with Product Information Management (PIM). This is mainly due to the magic of numbers: some organizations simply have a higher number of products than customers in the database.

Sooner or later most organizations will continue the MDM journey by embracing more domains.

Achieving Multi-Domain MDM

John Owens wrote a blog post yesterday called "Data Quality: Dead Crows Kill Customers! Dead Crows also Kill Suppliers!" The post explains how some data structures are similar between sales and purchasing. For example, a customer and a supplier are very similar as parties.

Customer Data Integration (CDI) has as its central entity the customer, which is a party. Product Information Management (PIM) has an important entity, the supplier, which is also a party. The data structures and the workflows needed to Create, Read, Update and perhaps Delete these entities are very similar, not least in business-to-business (B2B) environments.

So, when you are going from PIM to CDI, you don't have to start from scratch, not least in a B2B environment.
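The shared structure could be sketched as one party entity carrying one or more roles, so the customer (CDI) and supplier (PIM) sides reuse the same record and the same create/update workflow. The role names and the helper function below are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Party:
    party_id: int
    legal_name: str
    roles: set = field(default_factory=set)

parties = {}  # stand-in for the party store in a hub

def upsert_role(party_id: int, legal_name: str, role: str) -> Party:
    """Create the party if unknown, then attach the role.
    The same flow serves customers (CDI) and suppliers (PIM)."""
    party = parties.setdefault(party_id, Party(party_id, legal_name, set()))
    party.roles.add(role)
    return party

upsert_role(42, "Acme Corp", "customer")   # from the CDI side
upsert_role(42, "Acme Corp", "supplier")   # from the PIM side
print(sorted(parties[42].roles))  # ['customer', 'supplier']
```

One real-world company, one party record, two roles – rather than disconnected customer and supplier tables holding duplicate descriptions of the same entity.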

The trend in the master data management technology market is that many vendors are working their way from being single-domain vendors to being multi-domain vendors – and some are promoting their new intelligent design embracing all domains from day one.

Some other vendors are breeding several platforms (often obtained through acquisition) from different domains into one brand, and some vendors are developing from a single domain into new domains.

Each strategy has its pros and cons. It seems there will be plenty of philosophies to choose from when organizations are going to select the platform(s) to support the multi-domain MDM journey.


Party On

The most frequent data domain addressed in data quality improvement and master data management is parties.

Some of the issues related to parties that keep on creating difficulties are:

  • Party roles
  • International diversity
  • Real world alignment

Party roles

Party data management is often coined as customer data management or customer data integration (CDI).

Indeed, customers are the lifeblood of any enterprise – also when we refer to those who benefit from our services as citizens, patients, clients or whatever term is in use in different industries.

But the full information chain within any organization also includes many other party roles as explained in the post 360° Business Partner View. Some parties are suppliers, channel partners and employees. Some parties play more than one role at the same time.

The classic question "what is a customer?" is of course important to answer in your master data management and data quality journey. But in my eyes there are a lot of things to be solved in party data management that don't need to wait for the answer to that question, which anyway won't be as simple as cutting the Gordian Knot, as said in the post Where is the Business.

International diversity

As discussed in the post The Tower of Babel more and more organizations are met with multi-cultural issues in data quality improvement within party data management.

Whether and when an organization has to deal with international issues of course depends on the degree to which that organization is domestic or internationally active. Though in some countries with several official languages, like Switzerland and Belgium, the multi-cultural topic is mandatory. Typically, in large countries companies grow big before looking abroad, while in smaller countries, like my home country Denmark, even many fairly small companies must address international issues with data quality.

However, as Karen Lopez recently pondered in the post Data Quality in The Wild, Some Where …, actually everyone, even in the United States, has some international data somewhere looking very strange if not addressed properly.

Real world alignment

I often say that real-world alignment, sometimes as opposed to the common definition of data quality as being fit for purpose, is the shortcut to getting data quality right for party master data.

It is however not a straightforward shortcut. There are multiple challenges connected with getting your business-to-business (B2B) records aligned with the real world, as discussed in the post Single Company View. When it comes to business-to-consumer (B2C) or government-to-citizen (G2C), I think the dear people who sometimes comment on this blog did a fine job of balancing mutating tables and intelligent design in the post Create Table Homo_Sapiens.


When a Cloudburst Hit

Some days ago Copenhagen was hit by the most powerful cloudburst ever measured here.

More powerful cloudbursts may be usual in warmer regions of the earth, but this one was very unusual at 55 degrees north.

Fortunately there was only material damage, but the material damage was very extensive. When you take a closer look you may divide the underground constructions into two categories.

The first category is facilities constructed with the immediate purpose of use in mind. Many of these facilities are still out of operation.

The second category is facilities constructed with the immediate purpose of use in mind but also designed to resist heavy pouring rain. These facilities kept working during the cloudburst. One example is the metro. If the metro had been constructed only for the immediate purpose of use, being circling trains below ground, it would have been flooded within minutes, with the risk of lost lives and a standstill for months.

We have the same situation in data management. Things may seem just fine if data are fit for the immediate purpose of use. But when a sudden change in conditions hits, then you get to know about data quality.


A Sudden Change: South Sudan

This tenth Data Quality World Tour blog post is about South Sudan, a new country born today, the 9th of July 2011.

Reference data

The term "reference data" is often used to describe small collections of data that are basically maintained outside the enterprise and are common to all organizations. A list of countries is a good example of reference data.

Sometimes the terms "reference data" and "master data" are used interchangeably. I started a discussion on that subject on the MDM community some time ago.

One problem with reference data such as a country list is whether you are able to keep the list updated. A country list doesn't change every day, but sometimes it actually does, like today with South Sudan as a new country.

Suddenly changing dimensions

If you have master data entities linking to reference data like a country list, it is not that simple when the reference data changes. If you have a customer placed in what is South Sudan today, that entity should rightfully link to Sudan regarding yesterday's transactions; but you may also have changed the name of Sudan to North Sudan, being the continuing part of the former Sudan.

We call that kind of challenge "slowly changing dimensions", but it actually looks like "suddenly changing dimensions" when we have to figure out who belongs where at a certain time.
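One way to handle such a change is to put validity periods on the country reference data – in data warehousing terms, a type 2 slowly changing dimension. The sketch below follows this post's scenario, including the possible renaming of Sudan to North Sudan; the Sudan start date is its 1956 independence:

```python
countries = [
    # (code, name, valid_from, valid_to) — valid_to None = still valid
    ("SD", "Sudan", "1956-01-01", "2011-07-08"),
    ("SD", "North Sudan", "2011-07-09", None),   # continuing part, renamed
    ("SS", "South Sudan", "2011-07-09", None),   # new country
]

def country_name(code: str, on_date: str) -> str:
    """Resolve a country code to its name as of a given date.
    ISO date strings compare correctly as plain strings."""
    for c, name, start, end in countries:
        if c == code and start <= on_date and (end is None or on_date <= end):
            return name
    raise LookupError(f"No {code} on {on_date}")

print(country_name("SD", "2011-07-08"))  # Sudan
print(country_name("SD", "2011-07-09"))  # North Sudan
print(country_name("SS", "2011-07-09"))  # South Sudan
```

Yesterday's transactions then resolve to Sudan and today's to the two successor entries, without rewriting history.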


No NOT NULL

A basic way of ensuring data quality in a database is to define that a certain attribute must be filled. This is done by specifying that the value “null” isn’t allowed or as said in SQL’ish: Setting the NOT NULL constraint.

A common data quality issue is that such constraints almost always are too rigid.

In my last post, called Notes about the North Pole, it was discussed that every place on earth has a latitude and a longitude, except that the North Pole – and the South Pole – doesn't have a longitude. So if you have a table with geocodes, you can't set NOT NULL for the longitude if you (though very unlikely) should store the coordinates for the poles. Alternatively you could store 0 for the longitude to make it complete – but then it would be very inaccurate. 360 degrees inaccurate, so to speak.
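In SQL terms, this simply means leaving the NOT NULL constraint off attributes that are only usually present. A small SQLite sketch of the geocode example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE place (
    name      TEXT NOT NULL,   -- always known
    latitude  REAL NOT NULL,   -- every place has one
    longitude REAL             -- nullable: NULL at the poles
)""")
con.execute("INSERT INTO place VALUES ('North Pole', 90.0, NULL)")
con.execute("INSERT INTO place VALUES ('Copenhagen', 55.7, 12.6)")

# The pole's longitude is legitimately absent rather than falsely 0:
row = con.execute(
    "SELECT longitude IS NULL FROM place WHERE name = 'North Pole'").fetchone()
print(row[0])  # 1 (true)
```

Completeness rules for such attributes then belong in conditional validation logic rather than in a blanket database constraint.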

Another infrequent example from this blog is that every person in my country has a given (first) name and a family (last) name. But there are a few Royal Exceptions. So, no NOT NULL for the family name.

Related to people and places there are plenty of more frequent examples. If you only expect addresses from the United States, Australia or India, setting NOT NULL for the state attribute seems wise. But expect foolish values in there when you get addresses from most other parts of the world. So, no NOT NULL for the state.

A common variant of the mandatory state value is when you register for data quality webinars, white papers and so on. Most often you must select from a value list containing the United States of America – in some cases also mixed in with Canadian Provinces. The NULL option to be used by strangers may hide as “Not Applicable” way down the list among states beginning with N.

I usually select Alaska, which is among the first states in alphabetical order – which also brings me back close to the North Pole, making my data close to 360 degrees of inaccuracy.
