Do You Like the Lake?

4th December 2013, Henrik Gabs Liliendahl

Today Capgemini, as a result of a co-innovation partnership with Pivotal, released its take on information management in the big data era in a piece called The Principles of the Business Data Lake.

The business data lake concept is a new attempt at getting rid of all the Excel spreadsheets that business people maintain because of limitations in today’s enterprise data warehouses and in the business intelligence solutions sitting on top of that extracted, transformed and loaded data.

In the business data lake you load raw data, including unstructured data sources. Single view and related governance are restricted to master and reference data.

It’s not that you are going to load all the data in the world into your business data lake. You will link internal and external data where and when needed.

Thomas Redman has made a famous metaphor in the data quality realm about a polluted lake, where the best way to deal with the pollution is to prevent polluted water from streaming into the lake. I guess the rise of big data challenges that take, as told some years ago in the post Extreme Data Quality.

In the business data lake we will have polluted data. In that view I think it’s a good thing that master and reference data have a special place in the lake.

What do you think? Do you like the lake – the old and/or the new one?

johnowensblog 5th December 2013 / 00:20

Hi Henrik

The concept of the “Data Lake” is an old one. Cartoon images of employees sitting and fishing in the “Corporate Data Pond” were commonplace in every large enterprise 15-20 years ago, as part of marketing why an enterprise ought to invest in a corporate database.

The message that the images were supposed to convey was that employees at all levels of the organisation would have ready access to all of the data that they required. (Most of these projects actually failed, and their demise gave birth to the insidious ERP.)

The data pond or lake might seem like a laudable concept. However, it is fatally flawed at many levels and leaves unanswered too many critical questions such as:
o What data do you put into the lake?
o How do you get the data into the lake?
o What structure does it have?
o Who defines this structure?
o How do you transform disparate data sets into this structure?
o Will this structure support the functional needs of the enterprise?
o How do you avoid generating Fifth Normal Form errors (which are an almost guaranteed byproduct of such sets of structured and unstructured data)?

Thomas Redman is right about pollution. This would not be a data lake, it would be a data dump – and a toxic one at that.

This is one piece of data archaeology that Capgemini would be best advised to bury again.

Regards
John

- Henrik Liliendahl Sørensen 10th December 2013 / 12:24
  
  Thanks for commenting, John. So we have gone from a pond to a lake in the big data era.
  


Liliendahl on Data Quality

A blog about Master Data Management, Product Information Management, Data Quality Management and more
