This tenth Data Quality World Tour blog post is about South Sudan, a new country born today, 9 July 2011.
Reference data
The term “reference data” is often used to describe small collections of data that are maintained outside the enterprise and are common to all organizations. A list of countries is a good example of reference data.
Sometimes the terms “reference data” and “master data” are used interchangeably. I started a discussion on that subject on the mdm community some time ago.
One problem with reference data such as a country list is keeping the list up to date. A country list doesn’t change every day, but sometimes it does, as it did today with South Sudan becoming a new country.
Suddenly changing dimensions
If you have master data entities linking to reference data such as a country list, things are not that simple when the reference data changes. If you have a customer located in what is South Sudan today, that entity should rightfully still link to Sudan for yesterday’s transactions, but you may also have renamed Sudan to North Sudan, the continuing part of the former Sudan.
We call that kind of challenge “slowly changing dimensions”, but it actually looks like “suddenly changing dimensions” when we have to figure out who belonged where at a certain time.
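As a rough illustration of that point, here is a minimal Python sketch of an effective-dated country lookup, in the spirit of a type 2 slowly changing dimension. The `location_key` values, the `CountryVersion` class and the `country_on` function are hypothetical names made up for this example, and the history table is a tiny hand-built sample rather than a real reference data set.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CountryVersion:
    location_key: str          # hypothetical stable key for where the customer is
    country_name: str          # country that location belonged to in this period
    valid_from: date           # first day this version applies (inclusive)
    valid_to: Optional[date]   # first day it no longer applies; None = current


# Illustrative sample only: one row per period, never overwritten in place.
HISTORY: List[CountryVersion] = [
    CountryVersion("juba", "Sudan", date(1956, 1, 1), date(2011, 7, 9)),
    CountryVersion("juba", "South Sudan", date(2011, 7, 9), None),
    CountryVersion("khartoum", "Sudan", date(1956, 1, 1), None),
]


def country_on(location_key: str, as_of: date) -> str:
    """Return the country a location belonged to on a given date."""
    for version in HISTORY:
        if version.location_key != location_key:
            continue
        still_open = version.valid_to is None or as_of < version.valid_to
        if version.valid_from <= as_of and still_open:
            return version.country_name
    raise LookupError(f"no country recorded for {location_key!r} on {as_of}")


# Yesterday's transaction still resolves to Sudan, today's to South Sudan.
print(country_on("juba", date(2011, 7, 8)))
print(country_on("juba", date(2011, 7, 9)))
```

The design choice is to keep the old row and close it with an end date instead of updating it, so that a transaction dated before the split keeps pointing at the country that was valid at that time.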
The challenge with reference data, and ultimately with master data, is the variance in the value domains. The list of countries may be slowly changing, but the use of the data is much more dynamic and complex. Even a list of countries is a challenge. You can use the ISO 3166 list of countries, but this list only comprises countries recognized by the UN. In the U.S., native tribes are considered nation states, so the dilemma is whether they should be part of a master reference list of countries or a separate list.
Then we get to the codes used to represent countries. Will the master list contain only the numeric codes, only the alphabetic codes, or both? This is only the tip of the iceberg. The real fun begins when considering the sub-jurisdictions within countries, such as provinces, states and cantons. How do you compile a master data list of sub-jurisdictions? For example, within one client we identified 16 different lists representing “states”. The solution may look simple: just add all the instances of all the states from all the lists. But if you run a query asking for the total number of “states”, what is the correct answer? The answer depends on the context of use, and therefore you will have to identify which states are associated with each context.
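To make the counting problem concrete, here is a small Python sketch. The context names and the handful of codes in `STATE_LISTS` are invented for illustration and are not the client's actual lists; the point is only that the merged list and each context-specific list give different answers to “how many states?”.

```python
from typing import Dict, Set

# Hypothetical contexts, each with its own idea of what counts as a "state".
STATE_LISTS: Dict[str, Set[str]] = {
    "shipping": {"CA", "NY", "DC", "PR"},   # includes a district and a territory
    "sales_tax": {"CA", "NY", "DC"},
    "federal_states": {"CA", "NY"},         # states in the strict sense only
}


def count_states(context: str) -> int:
    """Number of 'states' as understood within one context of use."""
    return len(STATE_LISTS[context])


def count_all_merged() -> int:
    """Number of distinct entries if every list is simply merged into one."""
    return len(set().union(*STATE_LISTS.values()))


for name in STATE_LISTS:
    print(name, count_states(name))
print("merged", count_all_merged())
```

Merging everything gives one number, but no single context would agree with it, which is why each entry needs to carry the contexts it is valid for.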
When considering the numerous contexts of use, interpretations, policies and business processes that define the term “country”, you quickly discover the challenges of master data. It’s then that “workarounds” are suggested, where each context of use imports its own instance of a list of countries and augments it with other data. But then you have to ask: why have a master data list if each consumer creates its own list internally? Reference data is considered the easiest data set to treat as master data, but as I have shown, many challenges remain. Imagine the challenges when getting to other potential master data candidates such as customer or product!
Master data management is a simplistic solution to a complex problem and many discover MDM’s limitations only after they attempt to deploy it. This is one of the primary reasons for the failure rate of MDM deployments. A recent survey concluded that only 24% of MDM projects were successful. Not really surprising.
Thanks for the excellent comment, Datasherpa. Indeed, the challenges in master data management are overwhelming when we have such issues even in real-world alignment for high-level reference data such as countries and states (provinces, cantons).