The title of this blog post is the topic of my international keynote at the Stammdaten Management Forum 2016 in Düsseldorf, Germany on the 8th November 2016. You can see the agenda for this conference, which starts on the 7th and ends on the 9th, here.
Data Quality 3.0 is a term I have used over the years here on the blog to describe how I see data quality, along with other disciplines within data management, changing. This change is about going from focusing on internal data stores and cleansing within them to focusing on external sharing of data and using your business ecosystem and third-party data to drastically speed up data quality improvement.
Industry 4.0 is the current trend of automation and data exchange in manufacturing technologies. When we talk about big data, most will agree that success with big data exploitation hinges on proper data quality within master data management. In my eyes, the same can be said about success with Industry 4.0. The data exchange that is the foundation of automation must be secured by commonly understood master data.
So this is the promising way forward: by using data exchange in business ecosystems, you improve the data quality of master data. This improved master data in turn ensures successful data exchange within Industry 4.0.
The term data lake has become popular along with the rise of big data. A data lake is a new way of storing data that is more agile than what we have been used to in data warehouses. This is mainly based on the principle that you should not have to think through every way of consuming data before storing it.
This agility is also the main reason for fear around data lakes. A possible lack of control and standardization leads to warnings that a data lake will quickly develop into a data swamp.
In my eyes, we need solutions built on the data lake concept if we want business agility – and we do want that. But I also believe that we need to put the data in data lakes into context.
In all humbleness, my vision for data lakes is that a context-driven data lake can serve purposes beyond analytical use within a single company and become a driver for business agility within business ecosystems such as cross-company supply chains, as expressed in the LinkedIn Pulse post called Data Lakes in Business Ecosystems.
This week I had the pleasure of being at the Informatica MDM 360 event in Paris. The “360” predicate is all over Informatica's communication. There are the MDM 360 events around the world. The Product 360 solution – the new wrap of the old Heiler PIM solution, as I understand it. The Supplier 360 solution. Some Customer 360 stuff, including the Cloud Customer 360 for Salesforce edition.
All these solutions constitute one of the leading Multi-Domain MDM offerings on the market – if not the leading one. We will be wiser on that question when Gartner (the analyst firm) publishes its first Multi-Domain MDM Magic Quadrant later this year, as reported in the post Gravitational Waves in the MDM World.
Until now, Informatica has been very well positioned for Customer MDM, but not among the leaders for Product MDM in Gartner's ranking. Other analysts, such as Information Difference, have Informatica in the top right corner of the (Multi-Domain) MDM landscape, as seen here.
MDM and big data is another focus area for Informatica, and Informatica has certainly been one of the first MDM vendors to embrace big data – and not just in marketing wording. Today we cannot say big data without saying data lake. Informatica names its offering the Intelligent Data Lake.
For me, it will be interesting to see how Informatica can take full Multi-Domain MDM leadership by combining a good Product MDM solution with an Intelligent Data Lake.
The Product Data Lake is a cloud service for sharing product data in the ecosystems of manufacturers, distributors, retailers and end users of product information.
As an upstream provider of product data, be it a manufacturer or an upstream distributor, you have these requirements:
When you introduce new products to the market, you want to make the related product data and digital assets available to your downstream partners in a uniform way
When you win a new downstream partner you want the means to immediately and professionally provide product data and digital assets for the agreed range
When you add new products to an existing agreement with a downstream partner, you want to be able to provide product data and digital assets instantly and effortlessly
When you update your product data and related digital assets, you want a fast and seamless way of pushing it to your downstream partners
When you introduce a new product data attribute or digital asset type, you want a fast and seamless way of pushing it to your downstream partners.
The Product Data Lake facilitates these requirements by letting you push your product data into the lake in your in-house structure, which may or may not be fully or partly compliant with an international standard.
As an upstream provider, you may want to push product data and digital assets from several different internal sources.
The Product Data Lake tackles this requirement by letting you operate several upload profiles.
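To illustrate the idea, here is a minimal sketch of an upstream push using several upload profiles, each describing one internal source and its in-house attribute structure. All names (UploadProfile, the field names, the sample records) are hypothetical illustrations, not the actual Product Data Lake API.

```python
class UploadProfile:
    """One internal source and its in-house attribute structure."""
    def __init__(self, name, attribute_map):
        self.name = name
        self.attribute_map = attribute_map  # in-house field -> lake field

    def to_lake_record(self, record):
        # Keep the in-house structure; rename only the fields we can map,
        # and pass unknown fields through untouched.
        return {self.attribute_map.get(k, k): v for k, v in record.items()}

# Two internal sources with different in-house structures
pim_profile = UploadProfile("PIM", {"ItemNo": "sku", "Desc": "description"})
erp_profile = UploadProfile("ERP", {"MATNR": "sku", "MAKTX": "description"})

lake = []
lake.append(pim_profile.to_lake_record(
    {"ItemNo": "A-100", "Desc": "Cordless drill", "VoltageV": 18}))
lake.append(erp_profile.to_lake_record(
    {"MATNR": "A-100", "MAKTX": "Cordless drill"}))

print(lake[0])
# {'sku': 'A-100', 'description': 'Cordless drill', 'VoltageV': 18}
```

The point of the sketch is that each upload profile carries its own mapping, so the provider pushes from several systems without forcing them into one common structure first.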
As a downstream receiver of product data, be it a downstream distributor, retailer or end user, you have these requirements:
When you engage with a new upstream partner, you want the means to quickly and seamlessly link and transform product data and digital assets for the agreed range from the upstream partner
When you add new products to an existing agreement with an upstream partner, you want to be able to link and transform product data and digital assets in a fast and seamless way
When your upstream partners update their product data and related digital assets, you want to be able to receive the updated product data and digital assets instantly and effortlessly
When you introduce a new product data attribute or digital asset type, you want a fast and seamless way of pulling it from your upstream partners
If you have a backlog of product data and digital asset collection with your upstream partners, you want a fast and cost-effective approach to backfill the gap.
The Product Data Lake facilitates these requirements by letting you pull your product data from the lake in your in-house structure, which may or may not be fully or partly compliant with an international standard.
In the Product Data Lake, you can take the role of being an upstream provider and a downstream receiver at the same time by being a midstream subscriber to the Product Data Lake. Thus, Product Data Lake covers the whole supply chain from manufacturing to retail and even the requirements of B2B (Business-to-Business) end users.
The Product Data Lake uses the data lake concept for big data by letting the transformation and linking of data between many structures be done when data are to be consumed for the first time. The goal is that the workload in this system resembles an iceberg, where 10% of the ice is above water and 90% is below. In the Product Data Lake, manually setting up the links and transformation rules should be 10% of the work, while the remaining 90% is automated in the exchange zones between trading partners.
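The iceberg idea above can be sketched in a few lines: transformation rules between a trading-partner pair are set up manually once (the 10% above water), and then every subsequent update flows through automatically (the 90% below). The function names, partner names and field mappings here are made up for illustration; this is not the actual Product Data Lake implementation.

```python
# (provider, receiver) -> {provider field: receiver field}
rules = {}

def link_partners(provider, receiver, field_map):
    """The one-off manual step: map the provider's structure to the receiver's."""
    rules[(provider, receiver)] = field_map

def exchange(provider, receiver, record):
    """The automated step: applied to every later push without manual work."""
    field_map = rules[(provider, receiver)]
    return {field_map.get(k, k): v for k, v in record.items()}

# Manual once, per partner pair
link_partners("ManufacturerA", "RetailerB",
              {"sku": "article_no", "description": "name"})

# Every later product update now flows through automatically
update = {"sku": "A-100", "description": "Cordless drill", "voltage_v": 18}
out = exchange("ManufacturerA", "RetailerB", update)
print(out)
# {'article_no': 'A-100', 'name': 'Cordless drill', 'voltage_v': 18}
```

The design choice is that the mapping lives in the exchange zone between the two partners, not inside either partner's own system.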
Product Information Management (PIM) has in recent years emerged as an important technology-enabled discipline for every company taking part in a supply chain. These companies are manufacturers, distributors, retailers and large end users of tangible products, requiring a drastically increased variety of product data to be used in ecommerce and other self-service based ways of doing business.
At the same time, we have seen the rise of big data. Now, if you look at every single company, product data handled by PIM platforms perhaps does not count as big data. Sure, the variety is a huge challenge and the reason for being of PIM solutions, as they handle this variety better than traditional Master Data Management (MDM) solutions and ERP solutions.
The variety is about very different requirements in data quality dimensions based on where a given product sits in the product hierarchy. Measuring completeness has to be done for the concrete levels in the hierarchy, as a given attribute may be mandatory for one product but absolutely ridiculous for another product. An example is voltage for a power tool versus for a hammer. With consistency, there may be attributes with common standards (for example product name) but many attributes will have specific standards for a given branch in the hierarchy.
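The completeness rule above, where an attribute is mandatory for one branch of the hierarchy and irrelevant for another, can be sketched like this. The category names and mandatory-attribute lists are made up for the example:

```python
# Which attributes are mandatory depends on where the product
# sits in the hierarchy.
mandatory = {
    "power tools": ["name", "voltage"],  # voltage makes sense here...
    "hand tools":  ["name"],             # ...but not for a hammer
}

def completeness(product):
    """Share of mandatory attributes (for this product's hierarchy
    level) that are actually filled in."""
    required = mandatory[product["category"]]
    filled = [a for a in required if product.get(a) not in (None, "")]
    return len(filled) / len(required)

drill  = {"category": "power tools", "name": "Cordless drill"}  # voltage missing
hammer = {"category": "hand tools",  "name": "Claw hammer"}

print(completeness(drill))   # 0.5 - incomplete without voltage
print(completeness(hammer))  # 1.0 - voltage is not required here
```

Measuring completeness against one flat list of attributes would wrongly penalize the hammer; measuring per hierarchy level gives each product a fair score.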
Product information also encompasses digital assets, such as PDF files with product sheets, line drawings and lots of other stuff, product images and an increasing amount of videos with installation instructions and other content. The volume is then already quite big.
Volume and velocity really come into the game when we look at ecosystems of manufacturers, distributors and retailers. The total flow of product data can then be described with the common characteristics of big data: volume, velocity and variety. Even if you look at a given company and its first degree of separation with trading partners, we are talking about big data, with an overwhelming throughput of new product links between trading partners and updates to product information that are – or not least should have been – exchanged.
Within big data we have the concept of a data lake. A key success factor of a data lake solution is minimizing the use of spreadsheets. In the same way, we can use a data lake, sitting in the exchange zone between trading partners, for product information as elaborated further in the post Gravitational Collapse in the PIM Space.
Master Data Management (MDM) is a bit more than 10 years old, as told in last year's post called Happy 10 Years Birthday MDM Solutions. MDM has developed from the two disciplines called Customer Data Integration (CDI) and Product Information Management (PIM). For example, the MDM Institute was originally called The Customer Data Integration Institute and still has this website: http://www.tcdii.com/.
Today Multi-Domain MDM is about managing customer, or rather party, master data together with product master data and other master data domains as visualized in the post A Master Data Mind Map.
You may argue that PIM (Product Information Management) is not the same as Product MDM. This question was examined in the post PIM, Product MDM and Multi-Domain MDM. In my eyes, the benefits of keeping PIM as part of Multi-Domain MDM are bigger than the benefits of separating PIM and MDM. It is about expanding MDM across the sell-side and the buy-side of the business, eventually enabling wide use of customer self-service and supplier self-service.
The external self-service theme will in my eyes be at the centre of where MDM is going in the future. In going down that path, there will be consequences for how we see data governance, as discussed in the post Data Governance in the Self-Service Age. Another aspect of how MDM is going to be seen from the outside in is the increased use of third-party reference data and the link between big data and MDM, as touched upon in the post Adding 180 Degrees to MDM.
Besides Multi-Domain MDM and the links between MDM and big data, a much-mentioned future trend in MDM is doing MDM in the cloud. The latter is in my eyes a natural consequence of the external self-service themes and the increased use of third-party reference data.
We all know the pain of receiving e-mails with offers that are totally beside what you need.
Now Twitter has joined this spamming habit, which is a bit surprising, because with all the talk about big data and what it can do for prospect and customer insight, you would think that Twitter knows something about you.
Well, apparently not.
I operate two Twitter accounts. One named @hlsdk used for my general interaction with the data management community and one named @ProductDataLake used for a start-up service called Product Data Lake.
For both accounts, I am flooded with e-mails from Twitter about increasing my Holiday sales by using their ad services.
My business is not Business-to-Consumer (B2C), selling stuff to consumers, for whom the coming season is a high peak in the Western World. My business is Business-to-Business (B2B), where the coming season is, when it comes to sales, a standstill in the Western World.
Back in 2011 Gartner, the analyst firm, predicted that these three things would shape the Master Data Management (MDM) market:
MDM in the Cloud
MDM and Social Networks
The third point was, in 2012 after the rise of big data, rephrased to MDM and Big Data, as reported in the post called The Big MDM Trend.
In my experience, all three themes are still valid, with slow but steady uptake.
But have any new trends shown up in the past years?
In a 2015 post called “Master Data Management Merger Tardis and The Future of MDM”, Ramon Chen of Reltio puts forward some new possibilities to be discussed, among those machine learning and cognitive computing. I agree with Ramon on this theme, though these topics have been around in general for decades without really breaking through. But we need more of this in MDM for sure.
My own favourite MDM trend is a shift from focussing on internally captured master data to collaboration with external business partners as explained in the post MDM 3.0 Musings.
In that quest, I am looking forward to my next speaking session, which will be in Helsinki, Finland on the 8th December. There is an interview on that with yours truly available on the Talentum Master Data Management 2015 site.
The Gartner 2015 Magic Quadrant for Master Data Management of Customer Data Solutions is out. One way of getting the report without being a Gartner customer is through this link on the Informatica site.
Successful providers of Master Data Management (MDM) solutions will sooner or later need to offer ways of connecting MDM with big data.
In the Customer MDM quadrant, Gartner, without mentioning whether this relates to Customer MDM only or Multi-Domain MDM in general, mentions two ways of connecting MDM with big data:
Capabilities to perform MDM functions directly against copies of big data sources such as social network data copied into a Hadoop environment. Gartner have found that there have been very few successful attempts (from a business value perspective) to implement this use case, mostly as a result of an inability to perform governance on the big datasets in question.
Capabilities to link traditionally structured master data against those sources. Gartner have found that this use case is also sparse, but more common and more readily able to prove value. This use case is also gaining some traction with other types of unstructured data, such as content, audio and video.
I also think the ability to perform governance on big datasets is key. In fact, in my eyes master data will tend to be more externally generated and maintained, just like big data usually is. This will change our ways of doing information governance, as discussed in my previous post on this blog. That post, which was by the way inspired by the Gartner product MDM person, is called MDM and SCM: Inside and outside the corporate walls.
In the explanation it is mentioned that the term data lake is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried. The explanation also states that: “While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.”
A data lake is an approach to overcoming the known big data characteristics of volume, velocity and variety, of which the last one, variety, is probably the most difficult to overcome with a traditional data warehouse approach.
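The schema-on-read principle behind this, where the schema and data requirements are not defined until the data is queried, can be sketched in a few lines. The field names and the alias mapping are made up for illustration:

```python
import json

# Records land in the lake as-is; no upfront schema is enforced on write.
lake = [
    json.dumps({"sku": "A-100", "desc": "Cordless drill"}),
    json.dumps({"article": "A-200", "name": "Claw hammer"}),
]

def read_with_schema(raw, aliases):
    """Schema-on-read: the consumer decides at query time which source
    fields map onto its own target fields."""
    record = json.loads(raw)
    return {target: next((record[c] for c in candidates if c in record), None)
            for target, candidates in aliases.items()}

# One consumer's view of the lake, defined only when the data is queried
schema = {"id": ["sku", "article"], "description": ["desc", "name"]}
rows = [read_with_schema(raw, schema) for raw in lake]
print(rows)
# [{'id': 'A-100', 'description': 'Cordless drill'},
#  {'id': 'A-200', 'description': 'Claw hammer'}]
```

A data warehouse would have rejected or reshaped the second record at load time; the lake keeps both raw, and it is the variety that is absorbed at read time.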
If we look at traditional ways of using data warehouses, these have revolved around storing internal transaction data linked to internal master data. With the rise of big data, there will be a shift towards encompassing more and more external data. One kind of external data is reference data, being data that is typically born outside a given organization and has many different purposes of use.
Sharing data with the outside must be a part of your big data approach. This goes for traditional flavours of big data such as social data and sensor data, as well as what we may call big reference data, being pools of global data and bilateral data, as explained on this blog on the page called Data Quality 3.0. The data lake approach may very well work for big reference data, as it may for other flavours of big data.
The BrightTalk community on Big Data and Data Management has a formidable collection of webinars and videos on big data and data management topics. I am looking forward to contributing there on the 25th June 2015 with a webinar about Big Reference Data.