The concept of a data lake has until now mainly been used to describe how data gathered by a given organization can be organized for analytical purposes. The data lake concept is closely tied to the term big data, which means that a data lake caters for handling huge volumes of data, arriving with high velocity and in heaps of varieties. A data lake is much more agile than data marts and data warehouses, where you have to determine the purpose of the use of data beforehand.
In my eyes the idea that every organization should gather all the data of interest behind its own firewall does not make sense. Some data will for sure be only for the eyes of people behind the corporate walls. Here you should indeed put your private data into your own data lake. But a lot of data is commonly known data that not everyone should spend time collecting and eventually cleansing. Here you should collaborate within your industry or other sphere around data lakes for public data.
Perhaps most importantly you should share data lakes with other members of your business ecosystem. You probably already do share data within business ecosystems with your trading partners. Until now such sharing has resembled the concepts of data marts and data warehouses, being very purpose-specific exchanges of data built on common standards understood beforehand.
Right now I am working on a cloud service called the Product Data Lake. Here we host a data lake for sharing product data in the business ecosystems of manufacturers, distributors, merchants and large end users of product information. You can join our journey by following us here on LinkedIn at the Product Data Lake LinkedIn Company Page.
In today’s blog post over on The Disruptive Master Data Management Solutions List the CEO of AllSight, David Corrigan, examines 3 Reasons MDM No Longer Delivers a Customer 360.
Here David explores the topics of the new era of the customer 360-degree view: encompassing all customer data, covering analytical and operational usages, and improving customer experience.
The post includes this testimonial from Deotis Harris, Senior Director, MDM at Dell EMC: “We saw an opportunity to leverage AllSight’s modern technology (Customer Intelligence), coupled with our legacy systems such as Master Data Management (MDM), to provide the insight required to enable our sellers, marketers and customer service reps to create better experiences for our customers.”
By the way: Being an MDM practitioner who has spent many years with customer 360 and is now spending equal chunks of time with product 360, I find the forward-looking topics very similar between customer 360 and product 360. In short:
- The span of product data to handle has increased dramatically in recent years as told in the post Self-service Ready Product Data.
- We can use the same data architecture for analytical and operational purposes as mentioned in the post The Intersection of MDM and Big Data.
- It is all about creating better experiences for your customers.
Back in 2015 Gartner, within a Magic Quadrant for MDM, described two observed ways of connecting big data and master data management, as reported in the post Two Ways of Exploiting Big Data with MDM.
In short, the two ways observed were:
- Capabilities to perform MDM functions directly against copies of big data sources such as social network data copied into a Hadoop environment. Gartner then found that there have been very few successful attempts (from a business value perspective) to implement this use case, mostly as a result of an inability to perform governance on the big datasets in question.
- Capabilities to link traditionally structured master data against those sources. Gartner then found that this use case is also sparse, but more common and more readily able to prove value. This use case is also gaining some traction with other types of unstructured data, such as content, audio and video.
In my eyes the ability to perform governance on big datasets is key. In fact, master data will tend to be more externally generated and maintained, just like big data usually is. This will change our ways of doing information governance as for example discussed in the post MDM and SCM: Inside and outside the corporate walls.
Eventually, we will see use cases of intersections of MDM and big data. The one I am working with right now is about how you can improve sharing of product master data (product information) between trading partners. While this quest may be used for analytical purposes, which is the stated aim of big data, this service will fundamentally serve operational purposes, which is the predominant aim of master data management.
This big data, or rather data lake, approach is about how we, by linking metadata, connect the different perceptions of product information that exist in cross-company supply chains. While everyone being on the same standard at the same time would be optimal, this is quite utopian. Therefore, we must encourage pushing product information (including rich textual content, audio and video) with the provider’s standard and do the “schema-on-read” stuff when each of the receivers pulls the product information for their purposes.
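To make the “schema-on-read” idea a bit more concrete, here is a minimal sketch in Python. It is not how Product Data Lake is implemented; the provider and receiver names, the attribute names and the `attribute_links` table are all hypothetical and only illustrate the principle of storing data in the provider's standard and translating through linked metadata when a receiver pulls.

```python
# Hypothetical sketch: every name and attribute below is illustrative only.

# Product information is stored exactly as the provider pushed it.
lake = [
    {
        "provider": "manufacturer-a",
        "payload": {
            "Colour": "red",
            "NetWeightKg": 1.2,
            "VideoUrl": "https://example.com/a.mp4",
        },
    },
]

# Linked metadata: which provider attribute corresponds to which receiver attribute.
attribute_links = {
    ("manufacturer-a", "merchant-x"): {"Colour": "color", "NetWeightKg": "weight_kg"},
}


def pull(receiver: str) -> list[dict]:
    """Translate stored records into the receiver's terms at read time."""
    results = []
    for record in lake:
        links = attribute_links.get((record["provider"], receiver), {})
        view = {
            target: record["payload"][source]
            for source, target in links.items()
            if source in record["payload"]
        }
        results.append(view)
    return results


print(pull("merchant-x"))
```

The stored payload is never rewritten. A second receiver with another standard would only need its own entry in the link table, and a receiver with no links yet simply gets an empty view.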
If you want to learn more about how that goes, you can follow Product Data Lake here.
The rise of big data is very much driven by a craving for getting more insight on your (prospective) customers. However, the coin has a (better) flip side.
Looking at it from the other side
As customers, we will strike back. We do not need to be told what to buy. But we do want to know what we are buying. This means we want to be able to see rich product information when making a self-service purchase. This subject was examined in the post You Must Supplement Customer Insight with Rich Product Data.
Many companies who are involved in selling to private and business customers are ramping up maintenance of product data by implementing in-house Product Information Management (PIM) solutions as told in yesterday’s guest post on this blog. The article is called The Relation of PIM to Retail Success.
One further challenge is that you have to get product information from the source, usually being the manufacturers.
Big data approaches work for both
Just as data lakes are used as the place to harvest customer insight, the data lake concept can be the approach to provide product insight to end customers as well.
The problem with having product data flowing from manufacturers to distributors and merchants is that not everyone uses the same standard, format, structure and taxonomy for product information.
The solution is a data lake shared by the business ecosystem. It is called Product Data Lake.
Product information is the kind of data that usually flows across companies. The most common routes start with the hard facts about a product originating at the manufacturer. Then the information may be used on the brand's own website, propagated to a marketplace (online shop-in-shop) and also propagated downstream to distributors and merchants.
The challenge to the manufacturer is that this represents many different ways of providing product information, not least when it comes to distributors and merchants, as these will require different structures and formats using various standards and will not be on the same maturity level.
Looking at this from the downstream side, the distributors and merchants, we have the opposite challenge. Manufacturers provide product information in different structures and formats using various standards and are not on the same maturity level.
Supply chain participants can tackle this in a passive or an active way. Unfortunately, many have chosen, or are about to choose, the passive way. It goes like this:
- As a manufacturer, we have a product data portal where trading partners who want to do business with us, obviously the best manufacturer in our field, can download the product information we have in our structure and format using the standards we have found best.
- As a distributor/merchant we have a supplier product data portal where trading partners who want to do business with us, the leading player in our field, can upload the product information we for the time being will require in our structure and format using the standard(s) we have found best.
This approach seems to work if you are bigger than your trading partner. And many times one will be bigger than the other. But unless you are very big, you will in many cases not be the biggest. And in all cases where you are the biggest, you will not be seen as a company being easy to do business with, which eventually will decide how big you will stay.
The better way is the active way, creating a win-win situation for all trading partners as described in the article about Product Data Lake Business Benefits.
A man with one watch knows what time it is, but a man with two watches is never quite sure. This old saying could be modernized to say that a person with one smart device knows the truth, but a person with two smart devices is never quite sure.
An example from my own life is measuring my daily steps in order to motivate me to be more fit. Currently I have two data streams coming in. One is managed by the app Google Fit and one is managed by the app S Health (from Samsung).
This morning a screenshot of both, taken at the same time, looked like this:
So, how many steps did I take this morning? 2,047 or 2,413?
The steps are presented on the same device: a smartphone. They are, however, measured on two different devices. Google Fit data are measured on the smartphone itself while S Health data are measured on a connected smartwatch. Therefore, I might not be wearing these devices in the exact same way. For example, I am the kind of Luddite who does not bring the phone to the loo.
With the rise of the Internet of Things (IoT) and the expected intensive use of the big data streams coming from all kinds of smart devices, we will face heaps of similar cases, where we have two or more sets of data telling the same story in a different way.
A key to utilizing these data in the best-fit way is to understand from what and where these data come. Knowing that is achieved through modern Master Data Management (MDM).
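As a small illustration of that point, the step counts above can be linked to master data about the devices that produced them. This is only a sketch under my own assumptions: the device identifiers, the `worn_on` attribute and its values are hypothetical, and the two counts are taken to be 2,047 and 2,413.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    device_id: str
    kind: str      # e.g. "smartphone" or "smartwatch"
    worn_on: str   # assumed attribute: where the device usually sits


@dataclass(frozen=True)
class StepReading:
    app: str
    device_id: str
    steps: int


# Master data about the devices (hypothetical identifiers and attributes).
devices = {
    "phone-1": Device("phone-1", "smartphone", "pocket"),
    "watch-1": Device("watch-1", "smartwatch", "wrist"),
}

# The two data streams, each pointing back to the device that measured it.
readings = [
    StepReading("Google Fit", "phone-1", 2047),
    StepReading("S Health", "watch-1", 2413),
]


def describe(reading: StepReading) -> str:
    """Put a reading in context by looking up the device master data."""
    device = devices[reading.device_id]
    return (
        f"{reading.steps} steps via {reading.app}, "
        f"measured on a {device.kind} ({device.worn_on})"
    )


for reading in readings:
    print(describe(reading))

difference = abs(readings[0].steps - readings[1].steps)
print(f"Difference between the two streams: {difference} steps")
```

Neither number is wrong as such. With the device master data attached, the difference becomes something you can explain rather than just a conflict between two truths.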
At Product Data Lake we in all humbleness are supporting that by sharing data about the product models for smart devices and in the future by sharing data about each device as told in the post Adding Things to Product Data Lake.
Whether the data lake concept is a good idea or not is discussed very intensively in the data management social media community.
The fear, and the actual observation made, is that a data lake will become a data dump. No one knows what is in there, where it came from, who is going to clean up the mess, or who will eventually have a grip on how it should be handled in the future, if there is a future for the data lake concept.
Please folks. We have some concepts from the small data world that we must apply. Here are three of the important ones:
In short, metadata is data about data. Even though the great thing about a data lake is that the structure and all purposes of the data do not have to be set in stone beforehand, at least all data that is delivered to a data lake must be described. An example of such an implementation is examined in the post Sharing Metadata.
You must also have the means to tag who delivered the data. If your data lake is within a business ecosystem, this should include the legal entity that has provided the data as told in the post Using a Business Entity Identifier from Day One.
Above all, you must have a framework to govern ownership (Responsibility, Accountability, Consultation and who must be Informed), policies and standards and other stuff we know from a data governance framework. If the data lake expands across organizations by incorporating second party and third party data, we need a cross-company data governance framework as for example highlighted on Product Data Lake Documentation and Data Governance.
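The three concepts above can be sketched as a simple gate at the point of delivery. This is a minimal illustration, not any real product's API: the `DataLake` class, its `deliver` method and the entity identifiers are hypothetical. The rule it enforces is just that nothing gets into the lake without a description (metadata) and a known providing legal entity (lineage), with the list of known entities standing in for what a governance framework would maintain.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LakeEntry:
    payload: dict
    description: str          # metadata: what this data is
    provider_entity_id: str   # lineage: the legal entity that delivered it
    received_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class DataLake:
    def __init__(self, known_entities: set[str]):
        # In practice this list would be maintained under data governance.
        self.known_entities = known_entities
        self.entries: list[LakeEntry] = []

    def deliver(self, payload: dict, description: str, provider_entity_id: str) -> LakeEntry:
        """Accept any payload shape, but never undescribed or anonymous data."""
        if not description.strip():
            raise ValueError("delivered data must be described")
        if provider_entity_id not in self.known_entities:
            raise ValueError(f"unknown provider: {provider_entity_id}")
        entry = LakeEntry(payload, description, provider_entity_id)
        self.entries.append(entry)
        return entry


example_lake = DataLake(known_entities={"ENTITY-001"})
example_lake.deliver({"sku": "A1", "colour": "red"}, "Product sheet", "ENTITY-001")
print(len(example_lake.entries))
```

Note that the payload itself is still free-form. The control sits on top of the agility, not in place of it.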
This question was raised on this blog back in January this year in the post Tough Questions About MDM.
Since then the term blockchain has been used more and more, both in general and in relation to Master Data Management (MDM). As you know, we love new fancy terms in our otherwise boring industry.
However, there are good reasons to consider using the blockchain approach when it comes to master data. A blockchain approach can be described as decentralized consensus, which can be seen as the opposite of a centralized registry. Now that the MDM discipline has been around for more than a decade, most practitioners agree that the single source of truth is not practically achievable within a given organization of a certain size. Moreover, in the age of business ecosystems, it will be even harder to achieve that between trading partners.
This way of thinking is the backbone of the MDM venture called Product Data Lake I’m working with right now. Yes, we love buzzwords. As if cloud computing, social network thinking, big data architecture and preparing for Internet of Things weren’t enough, we can add blockchain approach as a predicate too.
In Product Data Lake this approach is used to establish consensus about the information and digital assets related to a given product and each instance of that product (physical asset or thing) where it makes sense. If you are interested in how that develops, why not follow Product Data Lake on LinkedIn?
The importance of looking at your enterprise as a part of business ecosystems was recently stressed by Gartner, the analyst firm, as reported in an article with the very long title stating: Gartner Says CIOs Need to Take a Leadership Role in Creating a Business Ecosystem to Drive a Digital Platform Strategy.
In my eyes, this trend will have a huge impact on how data management platforms should be delivered in the future. Until now much of the methodology and technology for data management platforms has been limited to how these things are handled within the corporate walls. We will need a new breed of data management platforms built for business ecosystems.
Such platforms will have the characteristics of other new approaches to handling data. They will resemble social networks where you request and accept connections. They will embrace data as big data and data lakes, where not every purpose of data consumption is set in stone before collecting data. These platforms will predominantly be based in the cloud.
Right now I am working with putting such a data management service up in the cloud. The aim is to support product data sharing for business ecosystems. I will welcome you, and your trading partners, as subscribers to the service. If you help trading partners with Product Information Management (PIM) there is a place for you as an ambassador. Anyway, please start with following Product Data Lake on LinkedIn.
The differences between a data warehouse and a data lake have been discussed a lot as for example here and here.
To summarize, the main point in my eyes is: In a data warehouse the purpose and structure are determined before uploading data, while with a data lake the purpose and structure of data need only be determined before downloading data. As a result, a data warehouse is characterized by rigidity and a data lake is characterized by agility.
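The contrast can be shown in a few lines of Python. This is a deliberately simplified sketch with hypothetical function and column names, where plain lists stand in for the warehouse table and the lake: the warehouse decides on structure at upload and rejects what does not fit, while the lake accepts the record as delivered and lets each consumer decide on structure at download.

```python
WAREHOUSE_COLUMNS = ("sku", "price")  # structure fixed before any upload


def warehouse_upload(table: list[tuple], record: dict) -> None:
    """Structure decided at upload: a record that does not fit is rejected."""
    if set(record) != set(WAREHOUSE_COLUMNS):
        raise ValueError(f"record must have exactly the columns {WAREHOUSE_COLUMNS}")
    table.append(tuple(record[column] for column in WAREHOUSE_COLUMNS))


def lake_upload(lake: list[dict], record: dict) -> None:
    """The lake stores whatever arrives, in whatever shape."""
    lake.append(dict(record))


def lake_download(lake: list[dict], wanted: tuple[str, ...]) -> list[tuple]:
    """Structure decided at download: missing attributes come back as None."""
    return [tuple(record.get(column) for column in wanted) for record in lake]


record = {"sku": "A1", "price": 9.5, "video": "a.mp4"}

table: list[tuple] = []
try:
    warehouse_upload(table, record)
except ValueError as error:
    print("warehouse rejected:", error)

lake: list[dict] = []
lake_upload(lake, record)
print(lake_download(lake, ("sku", "video")))
print(lake_download(lake, ("sku", "price")))
```

The same stored record serves two different purposes here, neither of which had to be known when it was uploaded. That is the agility, and it is also why some control on top is needed.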
Agility is a good thing, but of course, you have to put some control on top of it as reported in the post Putting Context into Data Lakes.
Furthermore, there are some great opportunities in extending the use of the data lake concept beyond the traditional use of a data warehouse. You should think beyond using a data lake within a given organization and envision how you can share a data lake within your business ecosystem. Moreover, you should consider not only using the data lake for analytical purposes but commence on a mission to utilize a data lake for operational purposes.
The venture I am working on right now has this second take on a data lake. The Product Data Lake exists in the context of sharing product information between trading partners in an agile and process-driven way. The providers of product information, typically manufacturers and upstream distributors, upload product information according to the data management maturity level of that organization. This information may very well for now be stored according to traditional data warehouse principles. The receivers of product information, typically downstream distributors and retailers, download product information according to the data management maturity level of that organization. This information may very well for now end up in a data store organized by traditional data warehouse principles.
The other approaches I have seen for sharing product information between trading partners are built on having a data-warehouse-like solution between trading partners with a high degree of consensus around purpose and structure. Such solutions are in my eyes only successful when restricted narrowly to a given industry, probably within a given geography and for a given span of time.
By utilizing the data lake concept in the exchange zone between trading partners you can share information according to your own pace of maturing in data management and take advantage of data sharing where it fits in your roadmap to digitalization. The business ecosystems where you participate are great sources of data for both analytical and operational purposes and we cannot wait until everyone agrees on the same purpose and structure. It only takes two to start the tango.