Using a Data Lake for Reference Data

TechTarget has recently published a definition of the term data lake.

In the explanation it is mentioned that the term data lake is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried. The explanation also states that: “While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.”

A data lake is an approach to overcome the known big data characteristics being volume, velocity and variety, where probably the former one being variety is the most difficult to overcome with a traditional data warehouse approach.

If we look at traditional ways of using data warehouses, this has revolved around storing internal transaction data linked to internal master data. With the raise of big data there will be a swift to encompassing more and more external data. One kind of external data is reference data, being data that typically is born outside a given organization and data that has many different purposes of use.

Sharing data with the outside must be a part of your big data approach. This goes for including traditional flavours of big data as social data and sensor data as well what we may call big reference data being pools of global data and bilateral data as explained on this blog on the page called Data Quality 3.0. The data lake approach may very well work for big reference data as it may for other flavours of big data.

The BrightTalk community on Big Data and Data Management has a formidable collection of webinars and videos on big data and data management topics. I am looking forward to contribute there on the 25^th June 2015 with a webinar about Big Reference Data.

2920 Charlottenlund, Danmark

ramonchen 24th June 2015 / 01:27

Great post Henrik as usual. @Reltio we use graph/NoSQL to store master data, reference data, interaction and transaction data at limitless data volumes. We not only can manage big data but big metadata. When you deliver full adit capabilities and allow tracking and versioning of attribute level values at any point in time as well as unstructured data, you quickly hit petabyte scale. Especially when you are also a business facing app which monitors who looks at data from a complete compliance and governance perspective. It’s no longer about how big your master or reference data set is anymore. Graph and NoSQL is a prerequisite if you have any intentions of moving into the new world of data-driven applications.

	Henrik Gabs Lilienda… on Balancing the Business Partner…
	Jeppe Thing Sørensen on Balancing the Business Partner…
	peolsolutions on MDM, Cloud, SaaS, PaaS, IaaS a…
	Henrik Gabs Lilienda… on Is the Holiday Season called C…
	Michael D. on Is the Holiday Season called C…
	Jay Ram on The Disruptive MDM List is…
	Henrik Gabs Lilienda… on The Intersection of Data Obser…
	Shanker on The Intersection of Data Obser…
	Bhavani Shanker on Data Matching Efficiency
	Henrik Gabs Lilienda… on Data Matching Efficiency
	Bhavani Shanker on Data Matching Efficiency
	Henrik Gabs Lilienda… on From Platforms to Ecosyst…
	Michael Fieg on From Platforms to Ecosyst…
	From Platforms to Ec… on What is Collaborative Product…
	From Platforms to Ec… on MDM and Knowledge Graph

Liliendahl on Data Quality

A blog about Master Data Management, Product Information Management, Data Quality Management and more

Using a Data Lake for Reference Data

Related

One thought on “Using a Data Lake for Reference Data”

Leave a comment Cancel reply