TechTarget has recently published a definition of the term data lake.
According to the explanation, the term data lake is gaining acceptance as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried. The explanation also states: “While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.”
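To make the idea of not defining the schema until the data is queried a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the record contents, the `RAW_LAKE` list and the `query` helper are all hypothetical and not taken from the TechTarget definition or from any particular data lake product.

```python
import json

# Hypothetical raw records, landed in the "lake" as-is in one flat collection.
# Nothing forces them to share a schema at write time.
RAW_LAKE = [
    '{"source": "crm", "customer": "Acme", "amount": "120.50"}',
    '{"source": "sensor", "device_id": 7, "temp_c": 21.4}',
    '{"source": "crm", "customer": "Globex", "amount": "80"}',
]


def query(lake, schema, where):
    """Apply a schema (field name -> converter) only when the data is read."""
    for line in lake:
        record = json.loads(line)
        if not where(record):
            continue
        # Skip records that do not carry every field this query asks for.
        if not all(field in record for field in schema):
            continue
        yield {field: convert(record[field]) for field, convert in schema.items()}


# The schema is chosen by the query, not by the storage layer.
crm_schema = {"customer": str, "amount": float}
rows = list(query(RAW_LAKE, crm_schema, lambda r: r.get("source") == "crm"))
print(rows)
```

A different query could read the same raw records with a different schema, for example `{"device_id": int, "temp_c": float}` for the sensor data, without anything being reloaded or remodelled first.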
A data lake is an approach to overcoming the well-known big data characteristics of volume, velocity and variety, where the last one, variety, is probably the most difficult to handle with a traditional data warehouse approach.
Traditional use of data warehouses has revolved around storing internal transaction data linked to internal master data. With the rise of big data there will be a shift towards encompassing more and more external data. One kind of external data is reference data: data that is typically born outside a given organization and that has many different purposes of use.
Sharing data with the outside must be a part of your big data approach. This goes for traditional flavours of big data such as social data and sensor data, as well as what we may call big reference data: pools of global data and bilateral data, as explained on this blog on the page called Data Quality 3.0. The data lake approach may very well work for big reference data as it may for other flavours of big data.
The BrightTalk community on Big Data and Data Management has a formidable collection of webinars and videos on big data and data management topics. I am looking forward to contributing there on 25 June 2015 with a webinar about Big Reference Data.
Great post, Henrik, as usual. @Reltio we use graph/NoSQL to store master data, reference data, interaction and transaction data at limitless data volumes. We can manage not only big data but also big metadata. When you deliver full audit capabilities and allow tracking and versioning of attribute-level values at any point in time, as well as unstructured data, you quickly hit petabyte scale. Especially when you are also a business-facing app which monitors who looks at data from a complete compliance and governance perspective. It’s no longer about how big your master or reference data set is. Graph and NoSQL are a prerequisite if you have any intention of moving into the new world of data-driven applications.