Whether the data lake concept is a good idea or not is discussed very intensively in the data management social media community.
The fear, backed by actual observations, is that a data lake will become a data dump. No one knows what is in there, where it came from, who is going to clean up the mess, or who will eventually get a grip on how it should be handled in the future – if there is a future for the data lake concept.
Please, folks. We have some concepts from the small data world that we must apply. Here are three of the important ones:
In short, metadata is data about data. Even though the great thing about a data lake is that the structure and all purposes of the data do not have to be set in stone beforehand, all data that is delivered to a data lake must at least be described. An example of such an implementation is examined in the post Sharing Metadata.
You must also have the means to tag who delivered the data. If your data lake is within a business ecosystem, this should include the legal entity that provided the data, as described in the post Using a Business Entity Identifier from Day One.
Above all, you must have a framework to govern ownership (who is Responsible, Accountable, Consulted and Informed), policies and standards, and the other stuff we know from a data governance framework. If the data lake expands across organizations by incorporating second party and third party data, we need a cross company data governance framework, as highlighted for example in Product Data Lake Documentation and Data Governance.
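The three concepts above (describing the data, tagging who delivered it, and governing ownership) can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions: the names DeliveryMetadata, DataLake, deliver and describe are hypothetical, and the provider identifier shown in the usage note is a placeholder, not a real business entity identifier.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class DeliveryMetadata:
    """Hypothetical descriptor that travels with every delivery into the lake."""
    dataset_name: str
    description: str          # metadata: what the data is about
    provider_entity_id: str   # lineage: the legal entity that delivered the data
    responsible: str          # governance: who does the work
    accountable: str          # governance: who answers for the data
    consulted: tuple[str, ...] = ()
    informed: tuple[str, ...] = ()
    delivered_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class DataLake:
    """Toy in-memory lake that refuses deliveries that are not described."""

    REQUIRED = ("dataset_name", "description", "provider_entity_id",
                "responsible", "accountable")

    def __init__(self) -> None:
        self._store: dict[str, tuple[DeliveryMetadata, object]] = {}

    def deliver(self, meta: DeliveryMetadata, payload: object) -> None:
        # The payload itself may have any structure; only the description is mandatory.
        missing = [name for name in self.REQUIRED
                   if not getattr(meta, name).strip()]
        if missing:
            raise ValueError(f"delivery rejected, missing metadata: {missing}")
        self._store[meta.dataset_name] = (meta, payload)

    def describe(self, dataset_name: str) -> DeliveryMetadata:
        """Answer 'what is in there and where did it come from?'."""
        return self._store[dataset_name][0]
```

A delivery could then look like `lake.deliver(DeliveryMetadata("drills", "Power drill catalogue", "ENTITY-0001", "pim-team", "head-of-product"), raw_rows)`, while the same call with an empty description would be rejected before anything lands in the lake.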
This question was raised on this blog back in January this year in the post Tough Questions About MDM.
Since then, the term blockchain has been used more and more, both in general and in relation to Master Data Management (MDM). As you know, we love fancy new terms in our otherwise boring industry.
However, there are good reasons to consider using the blockchain approach when it comes to master data. A blockchain approach can be described as decentralized consensus, which can be seen as the opposite of a centralized registry. Now that the MDM discipline has been around for more than a decade, most practitioners agree that the single source of truth is not practically achievable within a given organization of a certain size. Moreover, in the age of business ecosystems, it will be even harder to achieve that between trading partners.
This way of thinking forms the backbone of the MDM venture called Product Data Lake that I'm working on right now. Yes, we love buzzwords. As if cloud computing, social network thinking, big data architecture and preparing for the Internet of Things weren't enough, we can add the blockchain approach as a predicate too.
In Product Data Lake this approach is used to establish consensus about the information and digital assets related to a given product and each instance of that product (physical asset or thing) where it makes sense. If you are interested in how that develops, why not follow Product Data Lake on LinkedIn?
The importance of looking at your enterprise as a part of business ecosystems was recently stressed by Gartner, the analyst firm, as reported in an article with the very long title stating: Gartner Says CIOs Need to Take a Leadership Role in Creating a Business Ecosystem to Drive a Digital Platform Strategy.
In my eyes, this trend will have a huge impact on how data management platforms should be delivered in the future. Until now, much of the methodology and technology for data management platforms has been limited to how these things are handled within the corporate walls. We will need a new breed of data management platforms built for business ecosystems.
Such platforms will have the characteristics of other new approaches to handling data. They will resemble social networks where you request and accept connections. They will embrace data as big data and data lakes, where not every purpose of data consumption is set in stone before collecting data. These platforms will predominantly be based in the cloud.
Right now I am working on putting such a data management service up in the cloud. The aim is to support product data sharing for business ecosystems. I will welcome you, and your trading partners, as subscribers to the service. If you help trading partners with Product Information Management (PIM), there is a place for you as an ambassador. Anyway, please start by following Product Data Lake on LinkedIn.
The differences between a data warehouse and a data lake have been discussed a lot, for example here and here.
To summarize, the main point in my eyes is this: in a data warehouse, the purpose and structure are determined before uploading data, while in a data lake the purpose and structure of the data need only be determined before downloading it. As a result, a data warehouse is characterized by rigidity and a data lake by agility.
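The contrast between deciding structure before upload and deciding it before download can be illustrated with a small Python sketch. It is a toy under stated assumptions, not how any particular warehouse or lake product works: the column list and the function names warehouse_load, lake_upload and lake_download are made up for the illustration.

```python
import json

# Hypothetical structure that the warehouse fixes before any data arrives.
WAREHOUSE_COLUMNS = ("sku", "name", "weight_kg")


def warehouse_load(table: list, record: dict) -> None:
    """Structure decided before upload: anything that does not fit is rejected."""
    if set(record) != set(WAREHOUSE_COLUMNS):
        raise ValueError(f"record does not match schema {WAREHOUSE_COLUMNS}")
    table.append(tuple(record[col] for col in WAREHOUSE_COLUMNS))


def lake_upload(lake: list, record: dict) -> None:
    """The lake stores the raw record as delivered, whatever its fields are."""
    lake.append(json.dumps(record))


def lake_download(lake: list, fields: list) -> list:
    """Structure decided before download: the consumer picks the fields it needs."""
    return [
        {name: raw.get(name) for name in fields}
        for raw in map(json.loads, lake)
    ]
```

With this sketch, a record carrying an unexpected field such as voltage fails in warehouse_load but is accepted by lake_upload, and a consumer interested in voltage can later ask for it with lake_download(lake, ["sku", "voltage"]); records that never had the field simply come back with None for it.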
Agility is a good thing, but of course, you have to put some control on top of it as reported in the post Putting Context into Data Lakes.
Furthermore, there are some great opportunities in extending the use of the data lake concept beyond the traditional use of a data warehouse. You should think beyond using a data lake within a given organization and envision how you can share a data lake within your business ecosystem. Moreover, you should consider not only using the data lake for analytical purposes but also embark on a mission to utilize a data lake for operational purposes.
The venture I am working on right now has this second take on a data lake. The Product Data Lake exists in the context of sharing product information between trading partners in an agile and process-driven way. The providers of product information, typically manufacturers and upstream distributors, upload product information according to the data management maturity level of their organization. This information may very well for now be stored according to traditional data warehouse principles. The receivers of product information, typically downstream distributors and retailers, download product information according to the data management maturity level of their organization. This information may very well for now end up in a data store organized by traditional data warehouse principles.
The other approaches I have seen for sharing product information between trading partners are built on having a data-warehouse-like solution between trading partners, with a high degree of consensus around purpose and structure. Such solutions are in my eyes only successful when restricted narrowly to a given industry, probably within a given geography and for a given span of time.
By utilizing the data lake concept in the exchange zone between trading partners you can share information according to your own pace of maturing in data management and take advantage of data sharing where it fits in your roadmap to digitalization. The business ecosystems where you participate are great sources of data for both analytical and operational purposes and we cannot wait until everyone agrees on the same purpose and structure. It only takes two to start the tango.
The title of this blog post is the topic of my international keynote at the Stammdaten Management Forum 2016 in Düsseldorf, Germany on the 8th of November 2016. You can see the agenda for this conference, which starts on the 7th and ends on the 9th, here.
Data Quality 3.0 is a term I have used over the years here on the blog to describe how I see data quality, along with other disciplines within data management, changing. This change is about going from focusing on internal data stores and cleansing within them to focusing on external sharing of data and using your business ecosystem and third party data to drastically speed up data quality improvement.
Industry 4.0 is the current trend of automation and data exchange in manufacturing technologies. When we talk about big data, most will agree that success with big data exploitation hinges on proper data quality within master data management. In my eyes, the same can be said about success with Industry 4.0. The data exchange that is the foundation of automation must be secured by commonly understood master data.
So this is the promising way forward: By using data exchange in business ecosystems you improve the data quality of master data. This improved master data ensures successful data exchange within Industry 4.0.
The term data lake has become popular along with the rise of big data. A data lake is a new way of storing data that is more agile than what we have been used to in data warehouses. This is mainly based on the principle that you should not have to think through every way of consuming data before storing the data.
This agility is also the main reason for fear around data lakes. A possible lack of control and standardization leads to warnings that a data lake will quickly develop into a data swamp.
In my eyes we need solutions built on the data lake concept if we want business agility – and we do want that. But I also believe that we need to put data in data lakes in context.
Fortunately, there are many examples of movements in that direction. A recent article called The Informed Data Lake: Beyond Metadata by Neil Raden has a lot of good arguments for a better context-driven approach to data lakes.
As reported in the post Multi-Domain MDM 360 and an Intelligent Data Lake the data management vendor Informatica is on that track too.
In all humility, my vision for data lakes is that a context-driven data lake can serve purposes beyond analytical use within a single company and become a driver for business agility within business ecosystems like cross company supply chains, as expressed in the LinkedIn Pulse post called Data Lakes in Business Ecosystems.
This week I had the pleasure of being at the Informatica MDM 360 event in Paris. The “360” predicate is all over Informatica's communication. There are the MDM 360 events around the world. The Product 360 solution – the new wrap of the old Heiler PIM solution, as I understand it. The Supplier 360 solution. Some Customer 360 stuff including the Cloud Customer 360 for Salesforce edition.
All these solutions constitute one of the leading Multi-Domain MDM offerings on the market – if not the leading one. We will be wiser on that question when Gartner (the analyst firm) publishes its first Multi-Domain MDM Magic Quadrant later this year, as reported in the post Gravitational Waves in the MDM World.
Until now, Informatica has been very well positioned for Customer MDM, but not among the leaders for Product MDM in the ranking according to Gartner. Other analysts, such as Information Difference, have Informatica in the top right corner of the (Multi-Domain) MDM landscape as seen here.
MDM and big data is another focus area for Informatica, and Informatica has certainly been one of the first MDM vendors to embrace big data – and not just with wording in marketing. Today we cannot say big data without saying data lake. Informatica names its offering the Intelligent Data Lake.
For me, it will be interesting to see how Informatica can take full Multi-Domain MDM leadership by combining a good Product MDM solution with an Intelligent Data Lake.