The rise of big data is very much driven by a craving for more insight into your (prospective) customers. However, as always, the coin has a flip side.
Looking at it from the other side
As customers, we will strike back. We do not need to be told what to buy. But we do want to know what we are buying. This means we want to be able to see rich product information when making a self-service purchase. This subject was examined in the post You Must Supplement Customer Insight with Rich Product Data.
Many companies that sell to private and business customers are ramping up maintenance of product data by implementing in-house Product Information Management (PIM) solutions, as told in yesterday’s guest post on this blog. The article is called The Relation of PIM to Retail Success.
One further challenge is that you have to get product information from the source, which usually is the manufacturer.
Big data approaches work for both
Just as data lakes are used as the place to harvest customer insight, the data lake concept can be the approach to providing product insight to end customers as well.
The problem with having product data flowing from manufacturers to distributors and retailers is that not everyone uses the same standard, format, structure and taxonomy for product information.
The solution is a data lake shared by the business ecosystem. It is called Product Data Lake.
Product information is the kind of data that usually flows across companies. On the most common routes, the hard facts about a product originate at the manufacturer. Then the information may be used on the brand’s own website, propagated to a marketplace (online shop-in-shop) and also propagated downstream to distributors and retailers.
The challenge to the manufacturer is that this represents many different ways of providing product information, not least when it comes to distributors and retailers, as these will require different structures and formats using various standards and will not be on the same maturity level.
Looking at this from the downstream side, that of the distributors and retailers, we have the opposite challenge. Manufacturers provide product information in different structures and formats using various standards and are not on the same maturity level.
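To make the mismatch concrete, here is a minimal sketch of translating one manufacturer record into a retailer's structure. Everything in it is an assumption for illustration only: the attribute names, the unit conversion and the `translate` helper are hypothetical and do not describe any real standard or the way Product Data Lake works.

```python
# Hypothetical manufacturer record; the attribute names are made up.
MANUFACTURER_RECORD = {
    "ArticleNo": "WM-1000",
    "NetWeightKg": 62.5,
    "Colour": "white",
}

# Hypothetical links: manufacturer attribute -> (retailer attribute, converter).
LINKS = {
    "ArticleNo": ("supplier_sku", str),
    "NetWeightKg": ("weight_g", lambda kg: round(kg * 1000)),
    "Colour": ("color", str.upper),
}


def translate(record, links):
    """Return the record in the receiver's structure plus the attributes left unmapped."""
    translated = {}
    unmapped = []
    for name, value in record.items():
        if name in links:
            target_name, convert = links[name]
            translated[target_name] = convert(value)
        else:
            # Nothing is silently dropped: the receiver can see what was not linked.
            unmapped.append(name)
    return translated, unmapped


retailer_record, unmapped = translate(MANUFACTURER_RECORD, LINKS)
```

Each pair of trading partners needs its own set of links like this, which is why the number of structures, formats and standards in play matters so much.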
Supply chain participants can meet this challenge in a passive or an active way. Unfortunately, many have chosen, or are about to choose, the passive way. It goes like this:
- As a manufacturer, we have a product data portal where trading partners who want to do business with us, who obviously are the best manufacturer in our field, can download the product information we have in our structure and format using the standards we have found best.
- As a distributor/retailer, we have a supplier product data portal where trading partners who want to do business with us, the leading player in our field, can upload the product information we for the time being require in our structure and format using the standard(s) we have found best.
This approach seems to work if you are bigger than your trading partner. And many times one will be bigger than the other. But unless you are very big, you will in many cases not be the biggest. And in all cases where you are the biggest, you will not be seen as a company that is easy to do business with, which eventually will decide how big you will stay.
The better way is the active way, which creates a win-win situation for all trading partners, as described in the article about Product Data Lake Business Benefits.
A man with one watch knows what time it is, but a man with two watches is never quite sure. This old saying could be modernized: a person with one smart device knows the truth, but a person with two smart devices is never quite sure.
An example from my own life is measuring my daily steps in order to motivate me to be more fit. Currently I have two data streams coming in. One is managed by the app Google Fit and one is managed by the app S Health (from Samsung).
This morning, readings taken from both at the same time looked like this:
So, how many steps did I take this morning? 2,047 or 2,413?
The steps are presented on the same device, a smartphone. They are, though, measured on two different devices. Google Fit data are measured on the smartphone itself, while S Health data are measured on a connected smartwatch. Therefore, I might not be wearing these devices in exactly the same way. For example, I am the kind of Luddite who does not bring the phone to the loo.
With the rise of the Internet of Things (IoT) and the expected intensive use of the big data streams coming from all kinds of smart devices, we will face heaps of similar cases, where we have two or more sets of data telling the same story in a different way.
A key to utilizing these data in the best way is to understand what and where these data come from. That knowledge is achieved through modern Master Data Management (MDM).
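As a small illustration of keeping the "from what and where" together with the data, here is a sketch using my two step counts. The `StepReading` structure is hypothetical, and which app reported which count is an assumption made only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepReading:
    steps: int
    app: str          # the app managing the data stream
    measured_on: str  # the device that did the actual measuring


# Assumption: the pairing of counts and apps is chosen for illustration.
readings = [
    StepReading(steps=2047, app="Google Fit", measured_on="smartphone"),
    StepReading(steps=2413, app="S Health", measured_on="smartwatch"),
]


def by_device(readings):
    """Show each count next to the device it was measured on."""
    return {reading.measured_on: reading.steps for reading in readings}


def spread(readings):
    """Difference between the highest and the lowest count."""
    counts = [reading.steps for reading in readings]
    return max(counts) - min(counts)
```

With the source attached, the gap between the two counts stops being a contradiction and becomes a question about how each device was worn.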
At Product Data Lake we are, in all humility, supporting that by sharing data about the product models for smart devices and, in the future, by sharing data about each device, as told in the post Adding Things to Product Data Lake.
Whether the data lake concept is a good idea or not is discussed very intensively in the data management social media community.
The fear, backed by actual observations, is that a data lake will become a data dump. No one knows what is in there, where it came from, who is going to clean up the mess or who will eventually get a grip on how it should be handled in the future, if there is a future for the data lake concept.
Please folks. We have some concepts from the small data world that we must apply. Here are three of the important ones:
In short, metadata is data about data. Even though the great thing about a data lake is that the structure and all the purposes of the data do not have to be set in stone beforehand, at the very least all data that is delivered to a data lake must be described. An example of such an implementation is examined in the post Sharing Metadata.
You must also have the means to tag who delivered the data. If your data lake is within a business ecosystem, this should include the legal entity that has provided the data as told in the post Using a Business Entity Identifier from Day One.
Above all, you must have a framework to govern ownership (who is Responsible, who is Accountable, who must be Consulted and who must be Informed), policies, standards and the other things we know from a data governance framework. If the data lake expands across organizations by incorporating second-party and third-party data, we need a cross-company data governance framework, as for example highlighted on Product Data Lake Documentation and Data Governance.
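The three concepts can be sketched as a simple gate in front of the lake: nothing gets in without a description, a provider and an accountable owner. This is only a sketch under my own assumptions. The `Delivery` and `DataLake` names and their fields are hypothetical and are not taken from any product.

```python
from dataclasses import dataclass


@dataclass
class Delivery:
    payload: str
    description: str         # metadata: what this data is
    provider_entity_id: str  # who delivered it (hypothetical legal entity identifier)
    accountable: str         # governance: who is accountable for the data


class DataLake:
    REQUIRED = ("description", "provider_entity_id", "accountable")

    def __init__(self):
        self.entries = []

    def accept(self, delivery):
        """Store the delivery as-is, but only if it is described, tagged and owned."""
        missing = [name for name in self.REQUIRED
                   if not getattr(delivery, name).strip()]
        if missing:
            raise ValueError("delivery rejected, missing: " + ", ".join(missing))
        self.entries.append(delivery)
```

Note that the gate says nothing about the structure of the payload itself. The data can stay as flexible as a data lake allows, as long as we know what it is, who sent it and who answers for it.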
This question was raised on this blog back in January this year in the post Tough Questions About MDM.
Since then the term blockchain has been used more and more, both in general and in relation to Master Data Management (MDM). As you know, we love fancy new terms in our otherwise boring industry.
However, there are good reasons to consider using the blockchain approach when it comes to master data. A blockchain approach can be described as decentralized consensus, which can be seen as the opposite of a centralized registry. Now that the MDM discipline has been around for more than a decade, most practitioners agree that the single source of truth is not practically achievable within an organization of a certain size. Moreover, in the age of business ecosystems, it will be even harder to achieve that between trading partners.
This way of thinking is the backbone of the MDM venture called Product Data Lake that I’m working with right now. Yes, we love buzzwords. As if cloud computing, social network thinking, big data architecture and preparing for the Internet of Things weren’t enough, we can add the blockchain approach as a predicate too.
In Product Data Lake this approach is used to establish consensus about the information and digital assets related to a given product and each instance of that product (physical asset or thing) where it makes sense. If you are interested in how that develops, why not follow Product Data Lake on LinkedIn?
The importance of looking at your enterprise as a part of business ecosystems was recently stressed by Gartner, the analyst firm, as reported in an article with the very long title stating: Gartner Says CIOs Need to Take a Leadership Role in Creating a Business Ecosystem to Drive a Digital Platform Strategy.
In my eyes, this trend will have a huge impact on how data management platforms should be delivered in the future. Until now much of the methodology and technology for data management platforms has been limited to how these things are handled within the corporate walls. We will need a new breed of data management platforms built for business ecosystems.
Such platforms will have the characteristics of other new approaches to handling data. They will resemble social networks where you request and accept connections. They will embrace data as big data and data lakes, where not every purpose of data consumption is set in stone before collecting data. These platforms will predominantly be based in the cloud.
Right now I am working on putting such a data management service up in the cloud. The aim is to support product data sharing for business ecosystems. I will welcome you, and your trading partners, as subscribers to the service. If you help trading partners with Product Information Management (PIM), there is a place for you as an ambassador. Anyway, please start by following Product Data Lake on LinkedIn.
The differences between a data warehouse and a data lake have been discussed a lot, as for example here and here.
To summarize, the main point in my eyes is: In a data warehouse the purpose and structure are determined before uploading data, while with a data lake the purpose and structure of data can be determined before downloading data. As a result, a data warehouse is characterized by rigidity and a data lake is characterized by agility.
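The difference can be sketched in a few lines. This is a toy illustration and not how any real warehouse or lake product works: the column names and the three helper functions are hypothetical, and the lake is just a list of raw JSON strings.

```python
import json

# Hypothetical warehouse table: the structure is fixed before any upload.
WAREHOUSE_COLUMNS = ("sku", "name", "price")


def load_into_warehouse(table, row):
    """Structure decided before upload: rows that do not fit are rejected."""
    if set(row) != set(WAREHOUSE_COLUMNS):
        raise ValueError("row does not match the warehouse structure")
    table.append(tuple(row[column] for column in WAREHOUSE_COLUMNS))


def load_into_lake(lake, raw_document):
    """No structure required on upload: the document is stored as delivered."""
    lake.append(raw_document)


def read_from_lake(lake, wanted_fields):
    """Structure decided before download: each reader picks the fields it needs."""
    for raw_document in lake:
        document = json.loads(raw_document)
        yield {name: document.get(name) for name in wanted_fields}
```

For example, a document carrying a `voltage` attribute would be rejected by `load_into_warehouse`, while the lake stores it and lets one reader ask for `sku` and `voltage` and another for `sku` and `name`. Fields a document does not carry simply come back empty.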
Agility is a good thing, but of course, you have to put some control on top of it as reported in the post Putting Context into Data Lakes.
Furthermore, there are some great opportunities in extending the use of the data lake concept beyond the traditional use of a data warehouse. You should think beyond using a data lake within a given organization and envision how you can share a data lake within your business ecosystem. Moreover, you should consider not only using the data lake for analytical purposes but also embarking on a mission to utilize a data lake for operational purposes.
The venture I am working on right now has this second take on a data lake. The Product Data Lake exists in the context of sharing product information between trading partners in an agile and process-driven way. The providers of product information, typically manufacturers and upstream distributors, upload product information according to the data management maturity level of that organization. This information may very well for now be stored according to traditional data warehouse principles. The receivers of product information, typically downstream distributors and retailers, download product information according to the data management maturity level of that organization. This information may very well for now end up in a data store organized by traditional data warehouse principles.
The other approaches I have seen for sharing product information between trading partners are built on having a data warehouse-like solution between trading partners with a high degree of consensus around purpose and structure. Such solutions are in my eyes only successful when restricted narrowly to a given industry, probably within a given geography and for a given span of time.
By utilizing the data lake concept in the exchange zone between trading partners, you can share information according to your own pace of maturing in data management and take advantage of data sharing where it fits in your roadmap to digitalization. The business ecosystems where you participate are great sources of data for both analytical and operational purposes, and we cannot wait until everyone agrees on the same purpose and structure. It only takes two to start the tango.