Do You Like the Lake?

Today Capgemini, as a result of a co-innovation partnership with Pivotal, released their take on information management in the big data era in a piece called The Principles of the Business Data Lake.

The business data lake concept is a new attempt at getting rid of all the Excel spreadsheets that business people maintain because of limitations in today’s enterprise data warehouses and in the business intelligence solutions sitting on top of the extracted, transformed and loaded data.

In the business data lake you load raw data, including data from unstructured sources. A single view and the related governance are restricted to master and reference data.

It’s not that you are going to load all the data in the world into your business data lake. You will link internal and external data where and when needed.

Thomas Redman made a famous metaphor in the data quality realm about a polluted lake, where the best way to deal with the pollution is to prevent polluted water from streaming into the lake. I guess the rise of big data challenges that take, as discussed some years ago in the post Extreme Data Quality.

In the business data lake we will have polluted data. Given that, I think it’s a good thing that master and reference data have a special place in the lake.

What do you think? Do you like the lake – the old and/or the new one?


2 thoughts on “Do You Like the Lake?”

  1. johnowensblog 5th December 2013 / 00:20

    Hi Henrik

    The concept of the “Data Lake” is an old one. Cartoon images of employees sitting and fishing in the “Corporate Data Pond” were commonplace in every large enterprise 15-20 years ago, as part of marketing why an enterprise ought to invest in a corporate database.

    The message that the images were supposed to convey was that employees at all levels of the organisation would have ready access to all of the data that they required. (Most of these projects actually failed and their demise gave birth to the insidious ERP.)

    The data pond or lake might seem like a laudable concept. However, it is fatally flawed at many levels and leaves unanswered too many critical questions such as:
    o What data do you put into the lake?
    o How do you get the data into the lake?
    o What structure does it have?
    o Who defines this structure?
    o How do you transform disparate data sets into this structure?
    o Will this structure support the functional needs of the enterprise?
    o How do you avoid generating Fifth Normal Form errors (which are an almost guaranteed byproduct of such sets of structured and unstructured data)?

    Thomas Redman is right about pollution. This would not be a data lake, it would be a data dump – and a toxic one at that.

    This is one piece of data archaeology that Capgemini would be best advised to bury again.


    • Henrik Liliendahl Sørensen 10th December 2013 / 12:24

      Thanks for commenting John. So we have gone from a pond to a lake in the big data era.
