The term “big data” is huge these days. As Steve Sarsfield suggest in a blog post yesterday called Big Data Hype is an Opportunity for Data Management Pros, well, let’s ride on the wave (or is it tsunami?).
The definition of “big data” is as with many buzzwords not crystal clear as examined in a post called It’s time for a new definition of big data on Mike2.0 by Robert Hillard. The post suggests that big may be about volume, but is actually more about big complexity.
As I have worked intensively with large amounts of rich reference data, I have a homemade term called “big reference data”.
Big Reference Data Sets
Reference Data is a term often used either instead of Master Data or as related to Master Data. Reference data is those data defined and (initially) maintained outside a single organization. Examples from the party master data realm are a country list, a list of states in a given country or postal code tables for countries around the world.
The trend is that organizations seek to benefit from having reference data in more depth than those often modest populated lists mentioned above.
An example of a big reference data set is the Dun & Bradstreet WorldBase. This reference data set holds around 300 different attributes describing over 200 million business entities from all over world.
This data set is at first glance well structured with a single (flat) data model for all countries. However, when you work with it you learn that the actual data is very different depending on the different original sources for each country. For example addresses from some countries are standardized, while this isn’t the case for other countries. Completeness and other data quality dimensions vary a lot too.
Another example of a large reference data set is the United Kingdom electoral roll that is mentioned in the post Inaccurately Accurate. As told in the post there are fit for purpose data quality issues. The data set is pretty big, not at least if you span several years, as there is a distinct roll for every year.
Big Reference Data Mashup
Complexity, and opportunity, also arises when you relate several big reference data sets.
Lately DataQualityPro had an interview called What is AddressBase® and how will it improve address data quality? Here Paul Malyon of Experian QAS explains about a new combined address reference source for the United Kingdom.
Now, let’s mash up the AddressBase, the WorldBase and the Electoral Rolls – and all the likes.
Image called Castle in the Sky found on photobotos.