Whether the data lake concept is a good idea or not is discussed very intensively in the data management social media community.
The fear, and actual observations made, is that that a data lake will become a data dump. No one knows what is in there, where it came from, who is going to clean up the mess and eventually have a grip on how it should be handled in the future – if there is a future for the data lake concept.
Please folks. We have some concepts from the small data world that we must apply. Here are three of the important ones:
In short, metadata is data about data. Even though the great thing about a data lake is that the structure and all purposes of the data does not have to be cut in stone beforehand, at least all data that is delivered to a data lake must be described. An example of such an implementation is examined in the post Sharing Metadata.
You must also have the means to tag who delivered the data. If your data lake is within a business ecosystem, this should include the legal entity that has provided the data as told in the post Using a Business Entity Identifier from Day One.
Above all, you must have a framework to govern ownership (Responsibility, Accountability, Consultancy and who must be Informed), policies and standards and other stuff we know from a data governance framework. If the data lake expand across organizations by incorporating second party and third party data, we need a cross company data governance framework as for example highlighted on Product Data Lake Documentation and Data Governance.