The typical example of a reference data set is a country table. This is of course a very small data set with around 250 entities. But even that can be complicated as told in the post The Country List.
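Even this small data set illustrates the complexity: the same country turns up under different names and non-standard codes. A minimal sketch of a country lookup (the codes and synonyms below are just a tiny assumed subset of ISO 3166-1, not a complete list):

```python
# A minimal sketch of a country reference lookup. The codes and synonyms
# below are just a tiny assumed subset of ISO 3166-1, not a complete list.

ISO_COUNTRIES = {
    "DK": "Denmark",
    "GB": "United Kingdom",
    "US": "United States",
}

# Names and non-standard codes seen in real-world data, mapped to ISO codes.
SYNONYMS = {
    "UK": "GB",               # common, but not the ISO alpha-2 code
    "DENMARK": "DK",
    "GREAT BRITAIN": "GB",
    "USA": "US",
    "UNITED STATES": "US",
}

def resolve_country(raw):
    """Return the ISO alpha-2 code for a raw country value, or None."""
    key = raw.strip().upper()
    if key in ISO_COUNTRIES:
        return key
    return SYNONYMS.get(key)

print(resolve_country("uk"))       # GB
print(resolve_country("Denmark"))  # DK
```

A real implementation would of course cover the full ISO 3166 list and far more synonyms.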
Reference data can be much bigger. Some flavors of big reference data are:
Third-party data sources
Open government data
Crowd sourced open reference data
Third-party data sources:
The use of third-party data within Master Data Management is discussed in the post Third-Party Data and MDM. These data may also have wider use within the enterprise, not least within business intelligence.
Examples of such data sets are business directories, where the Dun & Bradstreet World Base, probably the best known one today, counts over 200 million business entities from all over the world. Other examples are address and property directories.
Open government data
The above-mentioned directories are often built on top of public sector data, which are becoming more and more open around the world. So an alternative is digging directly into the government data.
Crowd sourced open reference data
There are plenty of initiatives around where directories similar to the commercial and government directories are collected by crowd-sourcing and shared openly.
In social networks, profile data are maintained by the entities in question themselves, which is a great advantage in terms of timeliness of data.
The soul of Master Data Management (MDM) is often explained as the search for a single version of the truth. It has always puzzled me that this search has in many cases been about finding the truth as the best data within different data silos inside a given organization.
Big data, including how MDM and big data can be a good match, has been a well covered subject lately. As discussed in the post Adding 180 Degrees to MDM, this has shed light on how external data may help create better master data by looking at data from the outside in.
At Gartner, the analyst firm, they have phrased that movement as a shift from truth to trust, for example in the post by Andrew White called From MDM to Big Data – From truth to trust.
Don’t get me (and master data) wrong. The truth isn’t out there in a single silver bullet shot. You have to mash up your internal master data with some of the most trustworthy external big reference data. These include commercial directory offerings, open data possibilities, public sector data (made available to private entities) and social networks.
Indeed there are potholes in that path. Timeliness of directories, completeness of open data, consistency, availability and price tags of public sector data, and validity of social network data are common challenges.
It has often been put forward that it’s strange that everyone thinks they can make sense out of big data while even the supposedly best ones can’t get small data right.
A good reminder of that is reported by Gary Allemann in the post Data quality error embarrasses US. The post tells the story of, and the lesson from, a recent incident where a former South African anti-apartheid fighter was detained in the United States because he was still on a terrorist list many years after the world finally changed its view about bad guys and good guys in that struggle.
So, while we have no doubt that the United States security agencies are able to collect and store big data about almost every person (friends and enemies alike), we may have our doubts about whether these guys are able to make any sense of it if they don’t know who is naughty and who is nice at a given time.
When talking about Master Data Management (MDM) we deal with something that could maybe be better coined as Master Entity Management. As a good old (logical or not) data model in the relational database world also has relations between entities, there must of course also be something called Master Relationship Management. And indeed there is, as mentioned by Aaron Zornes in the interview called MDM and Next-Generation Data Sources on Information Management.
As touched upon by Aaron Zornes, the solution to handling relations in the future may come from outside the relational database world in the form of graph databases. This was also discussed in the post Will Graph Databases become Common in MDM?
An often mentioned driver for looking much more into relationships is the promise of finding customer, and other, insights in social data based on the match between traditional master entity data and social network profiles. Handling these relations is an important facet of social MDM, an often mentioned subject on this blog.
Building the relations doesn’t stop with party master entities. There are valuable relations to location master entities and not least crucial relations between party master entities and product master entities, as told in the post Customer Product Matrix Management.
So Master Relationship Management fits very well with the current main trends in the MDM world, namely embracing big data, not least social data, and encompassing multi-domain MDM. The third main trend, MDM in the cloud, also fits. It’s not that we can’t explore all the relations out there from on-premise solutions; it’s just that there is a better relationship in doing so in the cloud.
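As a rough illustration of the idea behind graph-based master relationship management, the sketch below models a few master entities and typed relations between them using plain Python structures. All names are made up, and a real solution would of course use an actual graph database:

```python
# A minimal sketch of master relationship management as a property graph,
# using plain Python structures. Entity names are made up for illustration.
# Nodes are master entities; edges are typed relations between them.

nodes = {
    "party:acme": {"type": "organization", "name": "ACME Ltd"},
    "party:jane": {"type": "person", "name": "Jane Doe"},
    "location:hq": {"type": "address", "value": "1 Main Street"},
    "product:widget": {"type": "product", "name": "Widget"},
}

# Edges as (source, relation, target) triples.
edges = [
    ("party:jane", "EMPLOYED_BY", "party:acme"),
    ("party:acme", "LOCATED_AT", "location:hq"),
    ("party:acme", "SELLS", "product:widget"),
    ("party:jane", "FOLLOWS", "party:acme"),   # a social relation
]

def related(entity, relation=None):
    """Find entities related to the given one, optionally by relation type."""
    return [t for s, r, t in edges
            if s == entity and (relation is None or r == relation)]

print(related("party:acme"))             # ['location:hq', 'product:widget']
print(related("party:jane", "FOLLOWS"))  # ['party:acme']
```

The point of the graph shape is that party, location and product master entities, and even social relations, all live in one traversable structure.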
Veracity is often mentioned as the 4th V of big data besides Volume, Velocity and Variety.
While veracity is of course paramount for a data quality geek like me, veracity is a different kind of thing compared to volume, velocity and variety: those three terms define big data, whereas veracity is more a desirable quality of big data. This argument is often made by Doug Laney of Gartner (the analyst firm), who is behind the volume, velocity and variety concept, which was also coined as Extreme Data at some point.
As mentioned in the post Five Flavors of Big Data, the challenges with data quality – or veracity – are very different with the various types of big data. If I were to order the mentioned types of big data, I would say that veracity has challenges in this order, going from some challenges to huge challenges:
Big reference data
Big transaction data
It’s interesting that you may say that variety has the same increasing order, but volume and velocity don’t necessarily follow that order, apart from the fact that big reference data are less challenging in all respects and therefore maybe aren’t big data at all. However, I like to count them as such. That is because big reference data in my eyes will play a big role in solving the veracity challenge for the other types of big data.
We are often talking about big data as if it is one kind of data, while in fact we need separate approaches to handling, for example, data quality issues with the different sorts of big data.
In the following I will go through some different types of big data and share some observations related to data quality.
The most mentioned type of big data is, I guess, social data, and the opportunity to listen to Twitter streams and Facebook status updates in order to get better customer insight is an often stated business case for analyzing big data.
However, everyone who listens to those data will be aware of the tremendous data quality problems in doing that as told in the post Crap, Damned Crap and Big Data.
Another often mentioned type of big data is sensor data, and as examined in the post Social Data vs Sensor Data these are somewhat different from social data, with less complex data quality issues, but not at all free of data quality flaws, as reported in the post Going in the Wrong Direction.
Following the clicks from people surfing the internet is a third type of big data. This kind of big data shares characteristics with both social data and sensor data, as they are human generated like social data but more fact oriented like sensor data.
Big transaction data
Even traditional transaction data in huge volumes are treated as big data, but they of course inherit the same data quality challenges as all transaction data: even though the data are structured, we may have trouble establishing the right relations to the who, what, where and when in the transactions. And that doesn’t get easier with large volumes.
Big reference data
When reference data grows big we also meet big complexity. Try for example to build a reference data set with all the valid postal addresses in the world. Several standardizing bodies have a hard time making a common model for that right now. Learn about other examples of big reference data and the related complexity in the post Big Reference Data Musings.
Having duplicates in databases is the most prominent data quality issue around, and not least duplicates in party master data are often pain number one when assessing the impact of data quality flaws.
A duplicate in the data quality sense is two or more records that don’t have exactly the same characters but refer to the same real world entity. I have worked with these three different approaches to when to fix the duplicate problem:
Downstream data matching
Real time duplicate check
Search and mash-up of internal and external data
Downstream Data Matching
The good old way of dealing with duplicates in databases is having data matching engines periodically scan through databases highlighting the possible duplicates in order to facilitate merge/purge processes.
Finding the duplicates after they have lived their own lives in databases and already have attached different kinds of transactions is indeed not optimal, but sometimes it’s the only option, as explained in the post Top 5 Reasons for Downstream Cleansing.
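A minimal sketch of such a periodic scan, using simple string similarity to highlight possible duplicate pairs. The sample records and the 0.85 threshold are illustrative assumptions; real matching engines use far more sophisticated techniques:

```python
# A minimal sketch of downstream data matching: scan a table and flag
# record pairs whose texts are similar enough to be possible duplicates.
# The sample records and the 0.85 threshold are illustrative assumptions.
from difflib import SequenceMatcher
from itertools import combinations

customers = [
    (1, "John Smith, 1 Main St"),
    (2, "Jon Smith, 1 Main Street"),
    (3, "Alice Jones, 7 Oak Ave"),
]

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_possible_duplicates(records, threshold=0.85):
    """Return pairs of record ids that look like the same real world entity."""
    return [(id_a, id_b)
            for (id_a, text_a), (id_b, text_b) in combinations(records, 2)
            if similarity(text_a, text_b) >= threshold]

print(find_possible_duplicates(customers))  # [(1, 2)]
```

The flagged pairs would then feed a merge/purge process, typically with a human in the loop for the borderline cases.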
Real Time Duplicate Check
The better way is to make the match at data entry where possible. This approach is often orchestrated as a data entry process where each element or range of elements is checked when entered. For example, the address may be checked against reference data and a phone number may be checked for adequate format for the country in question. And then finally, when a proper standardized record is submitted, it is checked whether a possible duplicate exists in the database.
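As a rough sketch of such a data entry flow: the single elements are standardized first, and only the standardized record is compared against existing rows before it is accepted. The standardization rules and the similarity threshold here are simplified assumptions:

```python
# A minimal sketch of a real-time duplicate check at data entry: the
# elements are standardized first, then the standardized record is
# fuzzy-matched against existing rows before it is accepted. The
# standardization rules and the 0.9 threshold are simplified assumptions.
import re
from difflib import SequenceMatcher

existing = ["john smith|1 main street|anytown"]

def standardize(name, address, city):
    # Collapse whitespace, lower-case, expand a common abbreviation.
    clean = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    address = re.sub(r"\bst\.?$", "street", clean(address))
    return "|".join([clean(name), address, clean(city)])

def check_entry(name, address, city, threshold=0.9):
    record = standardize(name, address, city)
    for row in existing:
        if SequenceMatcher(None, record, row).ratio() >= threshold:
            return ("possible duplicate", row)
    return ("ok to insert", record)

print(check_entry("John  Smith", "1 Main St", "Anytown"))
```

Note how "1 Main St" only collides with the existing "1 main street" row because standardization happens before the duplicate check.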
Search and Mash-Up of Internal and External Data
The best way is, in my eyes, a process that avoids entering most of the data that are already in the internal databases and takes advantage of data that already exist on the internet as external reference data sources.
The instant Data Quality concept I currently work with requires the user to enter as little data as possible, for example through rapid address entry, a Google-like search for a name, simply typing a national identification number or, in the worst case, combining some known facts. After that, the system makes a series of fuzzy searches in internal and external databases and presents the results as a compact mash-up.
The advantages are:
If the real world entity already exists you avoid the duplicate and avoid entering data again. You may at the same time evaluate accuracy against external reference data.
If the real world entity doesn’t exist in internal data you may pick most of the data from external sources, thereby avoiding typing too much and at the same time ensuring accuracy.
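The search-and-mash-up flow could be sketched roughly like this, with a made-up stand-in for the external directory. This is not the actual iDQ™ implementation, just an assumed illustration of the idea:

```python
# A minimal sketch of the search-and-mash-up approach: the user types as
# little as possible, the system fuzzy-searches internal data and external
# reference sources, and candidates are presented in one merged list.
# The "external directory" here is a made-up stand-in for real sources.
from difflib import SequenceMatcher

internal_db = [{"id": "int-1", "name": "Acme Corporation", "city": "London"}]
external_directory = [
    {"id": "ext-9", "name": "Acme Corp Ltd", "city": "London"},
    {"id": "ext-3", "name": "Apex Industries", "city": "Leeds"},
]

def fuzzy_search(query, source, threshold=0.6):
    q = query.lower()
    return [rec for rec in source
            if SequenceMatcher(None, q, rec["name"].lower()).ratio() >= threshold]

def mash_up(query):
    """Merge internal and external candidates, marking where each comes from."""
    hits = [dict(rec, origin="internal") for rec in fuzzy_search(query, internal_db)]
    hits += [dict(rec, origin="external") for rec in fuzzy_search(query, external_directory)]
    return hits

for hit in mash_up("acme corp"):
    print(hit["origin"], hit["id"], hit["name"])
```

The user then picks the matching candidate instead of typing a new record, which avoids the duplicate in the first place.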
Yesterday on The Postcode Anywhere blog Guy Mucklow wrote a nice piece called University Challenge. The blog post is about challenges with shared addresses and a remedy at least for addresses in the United Kingdom.
And sure, I also had my challenges with a shared address in the UK as reported in the post Multi-Occupancy.
But I guess the University Challenge is a universal challenge.
The postal formats and available reference data sources are of course very different around the world. Below is an example from the iDQ™ (instant Data Quality) tool when handling a Danish address with multiple flats. Here the tool continuously displays what options are available to make the address unique:
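The idea of narrowing down to a unique address can be sketched generically like this. The sample units below, with Danish-style floor and door values, are made up, and this is not the actual iDQ™ logic:

```python
# A minimal sketch of guiding a user toward a unique address: given what
# has been entered so far, show which floor/door options remain before the
# address is unique. The sample address data are made up; "tv"/"th" are the
# Danish abbreviations for the left/right door on a floor.
units = [
    {"street": "Main Street 10", "floor": "1", "door": "tv"},
    {"street": "Main Street 10", "floor": "1", "door": "th"},
    {"street": "Main Street 10", "floor": "2", "door": "tv"},
]

def remaining_options(entered):
    """Return the units still matching what the user has typed so far."""
    return [u for u in units
            if all(u.get(k) == v for k, v in entered.items())]

print(len(remaining_options({"street": "Main Street 10"})))                # 3
print(len(remaining_options({"street": "Main Street 10", "floor": "1"})))  # 2
```

Each extra element the user supplies shrinks the candidate list until exactly one unit remains.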
As described in the post Small Data with Big Impact my guess is that we will see Master Data Management solutions as a core element in having data architectures that are able to make sustainable results from dealing with big data.
If we look at party master data, a serious problem with many ERP and CRM systems is that the data model for party master data isn’t good enough to deal with the many different forms in which the parties we hold data about are represented in big data sources, which makes linking traditional systems of record to big data very hard.
Having a Master Data Management (MDM) solution with a comprehensive data model for party master data is essential here.
Some of the capabilities we need are:
Storing multiple occurrences of attributes
People and companies have many phone numbers, many email addresses and many social identities, and you will for sure meet these different occurrences in big data sources. Relating these different occurrences to the same real world entity is essential, as reported in the post 180 Degree Prospective Customer View isn’t Unusual.
An MDM hub with a corresponding data model is the place to manage that challenge in one place.
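A sketch of what such a data model could look like, allowing any number of identifier occurrences per party. The field names are illustrative assumptions, not a specific MDM product's model:

```python
# A minimal sketch of a party master data model that allows multiple
# occurrences of attributes such as phone numbers, email addresses and
# social identities, instead of a single "phone" column. Field names are
# illustrative assumptions, not a specific MDM product's model.
from dataclasses import dataclass, field

@dataclass
class Identifier:
    kind: str    # e.g. "phone", "email", "twitter"
    value: str
    source: str  # where this occurrence was observed

@dataclass
class Party:
    party_id: str
    name: str
    identifiers: list = field(default_factory=list)

    def values_of(self, kind):
        return [i.value for i in self.identifiers if i.kind == kind]

jane = Party("p-1", "Jane Doe")
jane.identifiers.append(Identifier("email", "jane@work.example", "crm"))
jane.identifiers.append(Identifier("email", "jane@home.example", "web form"))
jane.identifiers.append(Identifier("twitter", "@janedoe", "social listening"))

print(jane.values_of("email"))  # ['jane@work.example', 'jane@home.example']
```

Keeping the source of each occurrence makes it possible to weigh, say, a CRM-entered email against one found by social listening.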
Exploiting rich external reference data
As told in the post Where the Streets have Two Names, and emphasized in the comments to the post, the real world has plenty of examples of the same thing having many names. And this real world will be reflected in big data sources.
Your MDM solution should embrace external reference data solving these issues.
Handling the time dimension
In the post A Place in Time the flaws of the usual customer table in ERP and CRM systems are examined. One common issue is handling when attributes change. Change of address happens a lot. And this may be complicated by the fact that we may operate with several address types at the same time, like visiting addresses, billing addresses and correspondence addresses. These different addresses will also pop up in big data sources. And the same goes for other attributes.
You must get that right in your MDM implementation.
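A rough sketch of such a time-aware, multi-type address model, where a change of address adds a new row instead of overwriting history. The dates, types and values are made up:

```python
# A minimal sketch of handling the time dimension for party addresses:
# each address row has a type and a validity period, so a change of
# address adds a new row instead of overwriting history. Dates, types
# and values are illustrative assumptions.
from datetime import date

address_history = [
    {"type": "billing", "value": "1 Old Road",
     "valid_from": date(2010, 1, 1), "valid_to": date(2012, 6, 30)},
    {"type": "billing", "value": "2 New Street",
     "valid_from": date(2012, 7, 1), "valid_to": None},
    {"type": "visiting", "value": "Unit 5, Tech Park",
     "valid_from": date(2011, 3, 1), "valid_to": None},
]

def address_at(addr_type, on_date):
    """Return the address of a given type valid on a given date, if any."""
    for row in address_history:
        if (row["type"] == addr_type
                and row["valid_from"] <= on_date
                and (row["valid_to"] is None or on_date <= row["valid_to"])):
            return row["value"]
    return None

print(address_at("billing", date(2011, 5, 1)))  # 1 Old Road
print(address_at("billing", date(2013, 1, 1)))  # 2 New Street
```

With validity periods in place, a record found in a big data source can be matched against the address that was valid at the time it was observed, not just the current one.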