Big Reference Data – Page 3 – Liliendahl on Data Quality

Four Flavors of Big Reference Data

17th January 2014 | Henrik Gabs Liliendahl | 1 Comment

In the post Five Flavors of Big Data the last flavor mentioned is “big reference data”.

The typical example of a reference data set is a country table. This is of course a very small data set with around 250 entities. But even that can be complicated as told in the post The Country List.

Reference data can be much bigger. Some flavors of big reference data are:

Third-party data sources
Open government data
Crowd sourced open reference data
Social networks

Third-party data sources

The use of third-party data within Master Data Management is discussed in the post Third-Party Data and MDM. These data may also have a wider use within the enterprise, not least within business intelligence.

Examples of such data sets are business directories, where the Dun & Bradstreet World Base, probably the best known one, today counts over 200 million business entities from all over the world. Another example is address and property directories.

Open government data

The above mentioned directories are often built on top of public sector data which are becoming more and more open around the world. So an alternative is digging directly into the government data.

Crowd sourced open reference data

There are plenty of initiatives around where directories similar to the commercial and government directories are collected by crowd-sourcing and shared openly.

Social networks

In social networks, profile data are maintained by the entities in question themselves, which is a great advantage in terms of timeliness of data.

If you are in London please join the TDWI UK and IRM UK complimentary London meet-up on big data on the 19th February 2014 where I will elaborate on the four flavors of big reference data.

A Little Bit of Truth vs A Big Load of Trust

28th November 2013 | Henrik Gabs Liliendahl | 2 Comments

The soul of Master Data Management (MDM) is often explained as the search for a single version of the truth. It has always puzzled me that that search in many cases has been about finding the truth as the best data within different data silos inside a given organization.

Big data, including how MDM and big data can be a good match, has been a well covered subject lately. As discussed in the post Adding 180 Degrees to MDM, this has shed light on how external data may help in having better master data by looking at data from the outside in.

At Gartner, the analyst firm, they have phrased that movement as a shift from truth to trust for example as told in the post by Andrew White called From MDM to Big Data – From truth to trust.

Don’t get me (and master data) wrong. The truth isn’t out there in a single silver bullet shot. You have to mash up your internal master data with some of the most trustworthy external big reference data. This includes commercial directory offerings, open data possibilities, public sector data (made available for private entities) and social networks.

Indeed there are potholes in that path. Timeliness of directories, completeness of open data, consistency, availability and price tags of public sector data, and validity of social network data are common challenges.

Third-Party Data and MDM

24th November 2013 | 27th December 2016 | Henrik Gabs Liliendahl | 1 Comment

A recent blog post called Top 14 Master Data Management Misconceptions by William McKnight has as the last misconception this one:

“14. Third-party data is inappropriate for MDM

Third-party data is largely about extending the profile of important subject areas, which are mastered in MDM. Taking third-party data into organizations has actually kicked off many MDM programs.”

Indeed, using third-party data, which also could be called big external reference data, is in my eyes a very good solution for a lot of use cases. Some of the most popular exploitations today are:

Using a business directory as big reference data for B2B party master data in customer data integration (CDI) and supplier master data management.
Using address directories as big reference data for location master data management also related to party master data management for B2C customer data.
Using product data directories such as the Global Data Synchronization Network (GDSN®) services, the UNSPSC® directory and heaps of industry specific product directories.

The next wave of exploiting external data, which is just kicking off as Social MDM, is digging into social media for sharing data, including:

Using professional social networks such as LinkedIn in B2B environments, where you often find the most timely reference data, not least about contact data related to your business partners’ accounts.
Using consumer oriented social networks such as Facebook for getting to know your B2C customers better.
Using social collaboration as a way to achieve better product master data as told in the post Social PIM.

So You Think You Can Handle Big Data?

29th October 2013 | Henrik Gabs Liliendahl | 1 Comment

It has often been put forward that it is strange that everyone thinks they can make sense out of big data while even the supposed best ones can’t get small data right.

A good reminder of that is reported by Gary Allemann in the post Data quality error embarrasses US. The post tells the story of, and the learning from, a recent incident where a former South African anti-apartheid fighter was detained in the United States because he was still on a terrorist list many years after the world finally changed its view about bad guys and good guys in that struggle.

So, while we have no doubt that the United States security agencies are able to collect and store big data about almost every person (friends and enemies alike), we may have our doubts about whether these guys are able to make any sense of it if they don’t know who is naughty and who is nice at a given time.

Big Data Veracity

2nd October 2013 | Henrik Gabs Liliendahl

Veracity is often mentioned as the 4th V of big data besides Volume, Velocity and Variety.

While veracity is of course paramount for a data quality geek like me, it is a different kind of thing compared to volume, velocity and variety. These three terms define big data, while veracity is more a desirable capability of big data. This argument is often put forward by Doug Laney of Gartner (the analyst firm), who is behind the Volume, Velocity and Variety concept, which was also coined as Extreme Data at some point.

Doug Laney on Veracity — Comment in discussion on the Big Data Quality LinkedIn group

As mentioned in the post Five Flavors of Big Data, the challenges with data quality, or veracity, are very different for the various types of big data. If I should order the mentioned types of big data by veracity, I would put them in this order, going from some challenges to huge challenges:

Big reference data
Big transaction data
Web logs
Sensor data
Social data

It’s interesting that you may say that variety has the same increasing order, but volume and velocity don’t necessarily follow that order, apart from the fact that big reference data is less challenging in all respects and therefore maybe isn’t big data at all. However, I like it to be. That is because big reference data in my eyes will play a big role in solving the veracity challenge for the other types of big data.

Five Flavors of Big Data

24th September 2013 | Henrik Gabs Liliendahl | 6 Comments

We are often talking about big data as if it is one kind of data, while in fact we need separate approaches to handling, for example, data quality issues with different sorts of big data.

In the following I will go through some different types of big data and share some observations related to data quality.

Social data

The most mentioned type of big data is, I guess, social data. The opportunity to listen to Twitter streams and Facebook status updates in order to get better customer insight is an often stated business case for analyzing big data.

However, everyone who listens to those data will be aware of the tremendous data quality problems in doing that as told in the post Crap, Damned Crap and Big Data.

Sensor data

Another often mentioned type of big data is sensor data. As examined in the post Social Data vs Sensor Data, these are somewhat different from social data, with less complex data quality issues, but not at all free of data quality flaws, as reported in the post Going in the Wrong Direction.

Web logs

Following the clicks from people surfing the internet is a third type of big data. This kind of big data shares characteristics with both social data and sensor data, as the data are human generated like social data but more fact oriented like sensor data.

Big transaction data

Even traditional transaction data in huge volumes are treated as big data, but they of course inherit the same data quality challenges as all transaction data. Even though the data are structured, we may have trouble with having the right relations to the who, what, where and when in the transactions. And that isn’t easier with large volumes.

Big reference data

When reference data grows big we also meet big complexity. Try for example to build a reference data set with all the valid postal addresses in the world. Several standardizing bodies have a hard time making a common model for that right now. Learn about other examples of big reference data and the related complexity in the post Big Reference Data Musings.

The Good, Better and Best Way of Avoiding Duplicates

22nd September 2013 | Henrik Gabs Liliendahl | 1 Comment

Having duplicates in databases is the most prominent data quality issue around, and not least duplicates in party master data are often pain number one when assessing the impact of data quality flaws.

A duplicate in the data quality sense is two or more records that don’t have exactly the same characters, but are referring to the same real world entity. I have worked with these three different approaches to when to fix the duplicate problem:

Downstream data matching
Real time duplicate check
Search and mash-up of internal and external data

Downstream Data Matching

The good old way of dealing with duplicates in databases is having data matching engines periodically scan through databases highlighting the possible duplicates in order to facilitate merge/purge processes.

Finding the duplicates after they have lived their own lives in databases and already have different kinds of transactions attached is indeed not optimal, but sometimes it’s the only option, as explained in the post Top 5 Reasons for Downstream Cleansing.
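As a rough illustration of such a periodic scan, here is a minimal sketch in Python. It assumes tiny invented records, a plain character similarity from the standard library and an arbitrary threshold of 0.85; a real matching engine would use far more refined comparison and blocking techniques.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(value.lower().split())


def similarity(a: dict, b: dict) -> float:
    """Average character-level similarity of name and address (0.0 to 1.0)."""
    scores = [
        SequenceMatcher(None, normalize(a[field]), normalize(b[field])).ratio()
        for field in ("name", "address")
    ]
    return sum(scores) / len(scores)


def find_possible_duplicates(records: list, threshold: float = 0.85) -> list:
    """Compare every pair of records and return the pairs at or above the threshold."""
    pairs = []
    for a, b in combinations(records, 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a["id"], b["id"], round(score, 2)))
    return pairs


# Hypothetical party records, invented for illustration.
records = [
    {"id": 1, "name": "Acme Trading Ltd", "address": "12 High Street, London"},
    {"id": 2, "name": "ACME Trading Ltd.", "address": "12 High St, London"},
    {"id": 3, "name": "Nordic Fish A/S", "address": "Havnegade 4, Copenhagen"},
]

for id_a, id_b, score in find_possible_duplicates(records):
    print(f"Possible duplicate: {id_a} and {id_b} (score {score})")
```

The pairs are only highlighted here. Deciding which record survives a merge/purge is a separate step that this sketch leaves out.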

Real Time Duplicate Check

The better way is to make the match at data entry where possible. This approach is often orchestrated as a data entry process where the single element or range of elements is checked when entered. For example, the address may be checked against reference data and a phone number may be checked for adequate format for the country in question. And then finally, when a proper standardized record is submitted, it is checked whether a possible duplicate exists in the database.
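The entry process described above could be sketched like this in Python. The phone patterns are deliberately simplified assumptions rather than complete numbering plans, the address and existing record are invented, and the duplicate check is an exact comparison of standardized values where a real solution would match more loosely.

```python
import re

# Deliberately simplified national phone patterns (assumptions for illustration,
# not complete numbering plans).
PHONE_PATTERNS = {
    "DK": re.compile(r"\d{8}"),
    "GB": re.compile(r"0\d{9,10}"),
}

# Hypothetical stand-ins for an address reference data set and the existing database.
KNOWN_ADDRESSES = {("DK", "havnegade 4, copenhagen")}
EXISTING_RECORDS = [
    {"name": "nordic fish a/s", "address": "havnegade 4, copenhagen"},
]


def standardize(text: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def check_phone(country: str, phone: str) -> bool:
    """Check the phone number against the pattern for the country, if one is known."""
    digits = re.sub(r"[\s\-()]", "", phone)
    pattern = PHONE_PATTERNS.get(country)
    return bool(pattern and pattern.fullmatch(digits))


def check_address(country: str, address: str) -> bool:
    """Check the address against the reference data."""
    return (country, standardize(address)) in KNOWN_ADDRESSES


def submit(record: dict) -> list:
    """Run the element checks first, then the duplicate check on the standardized record."""
    problems = []
    if not check_address(record["country"], record["address"]):
        problems.append("address not found in reference data")
    if not check_phone(record["country"], record["phone"]):
        problems.append("phone format not valid for country")
    if problems:
        return problems
    key = (standardize(record["name"]), standardize(record["address"]))
    for existing in EXISTING_RECORDS:
        if (existing["name"], existing["address"]) == key:
            return ["possible duplicate of an existing record"]
    return []


entry = {
    "name": "Nordic Fish A/S",
    "address": "Havnegade 4,  Copenhagen",
    "country": "DK",
    "phone": "33 12 34 56",
}
print(submit(entry))
```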

Search and Mash-Up of Internal and External Data

The best way is in my eyes a process that avoids entering most of the data that is already in the internal databases and takes advantage of data that already exists on the internet as external reference data sources.

The instant Data Quality concept I currently work with requires the user to enter as little data as possible, for example through a rapid addressing entry, a Google-like search for a name, simply typing a national identification number or, in the worst case, combining some known facts. After that the system makes a series of fuzzy searches in internal and external databases and presents the results as a compact mash-up.

The advantages are:

If the real world entity already exists you avoid the duplicate and avoid entering data again. You may at the same time evaluate accuracy against external reference data.
If the real world entity doesn’t exist in internal data you may pick most of the data from external sources, thereby avoiding typing too much and at the same time ensuring accuracy.
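To make the idea concrete, here is a minimal sketch in Python of a fuzzy search across one internal and one external source, with the hits presented together. It is not how the instant Data Quality tool works internally; the sources, records, scoring and cutoff are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical internal database and external reference source.
INTERNAL = [{"name": "Nordic Fish A/S", "city": "Copenhagen"}]
EXTERNAL = [
    {"name": "Nordic Fish Holding A/S", "city": "Copenhagen"},
    {"name": "Acme Trading Ltd", "city": "London"},
]


def fuzzy_score(query: str, candidate: str) -> float:
    """Character-level similarity between the typed query and a candidate name."""
    return SequenceMatcher(None, query.lower(), candidate.lower()).ratio()


def search(query: str, cutoff: float = 0.6) -> list:
    """Search both sources and return one mash-up list, best hits first."""
    hits = []
    for source, rows in (("internal", INTERNAL), ("external", EXTERNAL)):
        for row in rows:
            score = fuzzy_score(query, row["name"])
            if score >= cutoff:
                hits.append({"source": source, "score": round(score, 2), **row})
    return sorted(hits, key=lambda hit: hit["score"], reverse=True)


def suggest(query: str):
    """Suggest what the user should do based on where the hits came from."""
    hits = search(query)
    if any(hit["source"] == "internal" for hit in hits):
        return "reuse existing record", hits
    if hits:
        return "prefill from external data", hits
    return "enter manually", hits


action, hits = suggest("nordic fish")
print(action)
for hit in hits:
    print(hit["source"], hit["name"], hit["score"])
```

The two advantages listed above map to the first two outcomes of `suggest`: an internal hit means the duplicate is avoided, and an external-only hit means most of the data can be picked rather than typed.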

A Universal Challenge

23rd August 2013 | Henrik Gabs Liliendahl | 2 Comments

Yesterday on The Postcode Anywhere blog Guy Mucklow wrote a nice piece called University Challenge. The blog post is about challenges with shared addresses and a remedy at least for addresses in the United Kingdom.

And sure, I also had my challenges with a shared address in the UK as reported in the post Multi-Occupancy.

But I guess the University Challenge is a universal challenge.

The postal formats and available reference data sources are of course very different around the world. Below is an example from the iDQ™ (instant Data Quality) tool when handling a Danish address with multiple flats. Here the tool continuously displays which options are available to make the address unique:

On MDM, Data Models and Big Data

8th August 2013 | Henrik Gabs Liliendahl | 11 Comments

As described in the post Small Data with Big Impact my guess is that we will see Master Data Management solutions as a core element in having data architectures that are able to make sustainable results from dealing with big data.

If we look at party master data, a serious problem with many ERP and CRM systems around is that the data model for party master data isn’t good enough for dealing with the many different forms in which the parties we hold data about are represented in big data sources. This makes the linking between traditional systems of record and big data very hard.

Having a Master Data Management (MDM) solution with a comprehensive data model for party master data is essential here.

Some of the capabilities we need are:

Storing multiple occurrences of attributes

People and companies have many phone numbers, many eMail addresses and many social identities, and you will for sure meet these different occurrences in big data sources. Relating these different occurrences to the same real world entity is essential, as reported in the post 180 Degree Prospective Customer View isn’t Unusual.

An MDM hub with a corresponding data model is the place to manage that challenge in one place.
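A minimal sketch of such a data model in Python could look like the following. The class names, fields and sample values are assumptions for illustration and not taken from any particular MDM product.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def normalize(identifier: str) -> str:
    """Trim and lowercase so the same identifier always yields the same key."""
    return identifier.strip().lower()


@dataclass
class Party:
    party_id: str
    name: str
    phones: List[str] = field(default_factory=list)
    emails: List[str] = field(default_factory=list)
    social_ids: List[str] = field(default_factory=list)

    def identifiers(self) -> List[str]:
        """All occurrences of all identifying attributes, normalized."""
        return [normalize(v) for v in (*self.phones, *self.emails, *self.social_ids)]


class PartyHub:
    """Keeps one index from any known identifier to the party it belongs to."""

    def __init__(self) -> None:
        self._by_identifier = {}

    def add(self, party: Party) -> None:
        for identifier in party.identifiers():
            self._by_identifier[identifier] = party

    def resolve(self, identifier: str) -> Optional[Party]:
        return self._by_identifier.get(normalize(identifier))


# Hypothetical party with several occurrences of each attribute.
hub = PartyHub()
hub.add(Party(
    party_id="P-1",
    name="Jane Doe",
    phones=["+45 33 12 34 56", "+45 20 12 34 56"],
    emails=["jane@example.com", "Jane.Doe@example.org"],
    social_ids=["@JaneDoe"],
))

print(hub.resolve("jane.doe@example.org").party_id)
print(hub.resolve("@janedoe").party_id)
```

Whichever phone number, eMail address or social identity turns up in a big data source, it resolves to the same party record.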

Exploiting rich external reference data

As told in the post Where the Streets have Two Names, and emphasized in the comments to the post, the real world has plenty of examples of the same thing having many names. And this real world will be reflected in big data sources.

Your MDM solution should embrace external reference data solving these issues.
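One simple way to picture this is an alias index built from reference data, sketched below in Python. The street identifiers and names are invented, and real reference data would of course be far larger and richer.

```python
# Hypothetical external reference data: one identifier per street, several valid names.
STREET_REFERENCE = {
    "street-001": ["Havnegade", "Harbour Street"],
    "street-002": ["Main Street", "Route 9"],
}


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


def build_alias_index(reference: dict) -> dict:
    """Map every known name variant to the identifier of the thing it names."""
    index = {}
    for street_id, names in reference.items():
        for name in names:
            index[normalize(name)] = street_id
    return index


ALIAS_INDEX = build_alias_index(STREET_REFERENCE)


def resolve_street(name: str):
    """Return the street identifier for a name, or None if the name is unknown."""
    return ALIAS_INDEX.get(normalize(name))


print(resolve_street("Main Street") == resolve_street("ROUTE 9"))
```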

Handling the time dimension

In the post A Place in Time the flaws of the usual customer table in ERP and CRM systems are examined. One common issue is handling when attributes change. Change of address happens a lot. And this may be complicated by the fact that we may operate several address types at the same time, like visiting addresses, billing addresses and correspondence addresses. These different addresses will also pop up in big data sources. And the same goes for other attributes.

You must get that right in your MDM implementation.
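As an illustration, address history can be kept as validity periods per address type instead of a single overwritten field. The sketch below is in Python; the field names, the half-open period convention and the sample addresses are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class AddressPeriod:
    address_type: str                 # e.g. "visiting", "billing", "correspondence"
    address: str
    valid_from: date
    valid_to: Optional[date] = None   # None means the address is still valid

    def covers(self, day: date) -> bool:
        """True if the day falls in the period (valid_from inclusive, valid_to exclusive)."""
        return self.valid_from <= day and (self.valid_to is None or day < self.valid_to)


def address_at(history: List[AddressPeriod], address_type: str, day: date) -> Optional[str]:
    """Return the address of the given type that was valid on the given day."""
    for period in history:
        if period.address_type == address_type and period.covers(day):
            return period.address
    return None


# Hypothetical history for one party: the billing address changed in 2012.
history = [
    AddressPeriod("billing", "12 High Street, London", date(2009, 1, 1), date(2012, 6, 1)),
    AddressPeriod("billing", "4 Market Square, Leeds", date(2012, 6, 1)),
    AddressPeriod("visiting", "12 High Street, London", date(2009, 1, 1)),
]

print(address_at(history, "billing", date(2011, 3, 15)))
print(address_at(history, "billing", date(2013, 8, 8)))
```

With a model like this, an old address met in a big data source can still be related to the right party for the time it was valid.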
