Is big data all about analytics?

My answer to the question in the title of this blog post is NO. In my eyes big data is not just data warehouse 3.0. It is also data quality 3.0.

The concept of the data lake is growing in popularity in the big data world, and so are the warnings about your data lake becoming a data swamp, a data marsh or a data cesspool. Doing analytic work on a nice data lake sounds great. Doing it in a huge swamp, a large marsh or a giant cesspool does not sound so nice.

In nature a lake stays fresh by having a good upstream supply of water as well as a downstream outlet. In much the same way, your data lake should not be a closed system or a dump within your organization.

Sharing data with the outside world must be part of your big data approach. This goes for the traditional flavours of big data, such as social data and sensor data, as well as what we may call big reference data: pools of global data and bilateral data, as explained on this blog on the page called Data Quality 3.0.

The BrightTalk community on Big Data and Data Management has a formidable collection of webinars and videos on big data and data management topics. I am looking forward to contributing there on 25th June 2015 with a webinar about Big Reference Data.


CDI, PIM, MDM and Beyond

The TLAs (Three Letter Acronyms) in the title of this blog post stand for:

  • Customer Data Integration
  • Product Information Management
  • Master Data Management

CDI and PIM are commonly seen as predecessors to MDM. For example, the MDM Institute was originally called The Customer Data Integration Institute and still has this website: http://www.tcdii.com/.

Today Multi-Domain MDM is about managing customer, or rather party, master data together with product master data and other master data domains, as visualized in the post A Master Data Mind Map. Some of the most frequent other master data domains are location master data and asset master data; the latter was explored in the post Where is the Asset? A less frequent master data domain is The Calendar MDM Domain.

You may argue that PIM (Product Information Management) is not the same as Product MDM. This question was examined in the post PIM, Product MDM and Multi-Domain MDM. In my eyes the benefits of keeping PIM as part of Multi-Domain MDM are bigger than the benefits of separating PIM and MDM. It is about expanding MDM across the sell-side and the buy-side of the business, eventually enabling wide use of customer self-service and supplier self-service.

The external self-service theme will in my eyes be at the centre of where MDM is going in the future. In going down that path there will be consequences for how we see data governance, as discussed in the post Data Governance in the Self-Service Age. Another aspect of how MDM is going to be seen from the outside in is the increased use of third party reference data and the link between big data and MDM, as touched upon in the post Adding 180 Degrees to MDM.

Besides Multi-Domain MDM and the links between MDM and big data, a much mentioned future trend in MDM is doing MDM in the cloud. The latter is in my eyes a natural consequence of the external self-service theme and the increased use of third party reference data. Together with the general benefits of the SaaS (Software as a Service) and DaaS (Data as a Service) concepts, these will make MDM morph into something like MDaaS (Master Data as a Service) – an at least nearly ten year old idea, by the way, as seen in this BeyeNetwork article by Dan E Linstedt.


The Pros and Cons of MDM 3.0

A recent post on this blog was called Three Stages of MDM Maturity. This post ponders the need to extend your Master Data Management (MDM) solution to external business partners and take more advantage of third party data providers. We may call this MDM 3.0.

In a comment on LinkedIn Bernard PERRINEAU says:

MDM 3.0 Pros and Cons

Starting with the most often mentioned point against extending your MDM solution to the outside: Vipul Aroh of Verdantis rightfully mentions, in a comment to the post, a widespread hesitancy around doing so. I think/hope this hesitancy is the same as the hesitancy we saw when Salesforce.com first emerged. Many people didn’t foresee a great future for Salesforce.com, because putting your customer base into the cloud was seen as a huge risk. But eventually the operational advantages have in most cases trumped the perceived risks.

Ironically, the existence of CRM systems, in the cloud or not, is a hindrance for MDM solutions becoming the system of entry, or supporting data entry, for the customer master data domain. I remember talking to an MDM vendor CEO about putting such features for customer data entry into an MDM solution; his reply was something like: “Clients don’t want that, they want to consolidate downstream”. I think it is a pity that “clients want” to automate the mess and that MDM and other vendors want to help them with that.

That said, there are IT system landscape circumstances to be overcome in order to put your MDM solution to the forefront.

But when doing that, and even when starting to do that, the advantages are plentiful. A story about the start of such a journey for customer master data is shared in the post instant Data Quality at Work. This approach is examined further in the post instant Single Customer View. To summarize: you gain by getting data quality right the first time and at the same time save time (and time is money) in the data collection stage.
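
To make the right-the-first-time idea concrete, here is a minimal sketch in Python of checking a customer record at the point of entry rather than consolidating downstream. The fields, rules and the postal code pattern are purely illustrative assumptions, not the approach from the posts mentioned above.

```python
import re

# A minimal sketch of "right the first time" data entry; the fields and
# rules are illustrative assumptions, not a real validation service.

def validate_new_customer(record: dict) -> list:
    """Return a list of issues found before the record enters the MDM hub."""
    issues = []
    if not record.get("name", "").strip():
        issues.append("Name is missing")
    # Very rough postal code shape check (assumed UK-style for illustration)
    if not re.match(r"^[A-Z0-9 ]{3,8}$", record.get("postal_code", "")):
        issues.append("Postal code does not look valid")
    return issues

record = {"name": "Acme Ltd", "postal_code": "EC1A 1BB"}
problems = validate_new_customer(record)
print(problems or "OK to register")
```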

When it comes to product master data, I think everyone working in that field acknowledges the insanity in how the same data are retyped, or messed around in spreadsheets, between manufacturers, distributors, retailers and end users. Some approaches to overcoming this are explored in the post Sharing Product Master Data, and an example is sketched below. Each of these approaches has its pros and cons.
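
As a toy illustration of the alternative to retyping: a product record can be published once as structured data and parsed by everyone downstream. The attribute names below are invented for this example and do not represent any particular standard from that post.

```python
import json

# A toy illustration of sharing product master data downstream instead of
# retyping it; the attribute names are invented for this example.

product = {
    "gtin": "05712345678900",
    "name": "Garden hose 25 m",
    "net_weight_kg": 3.2,
}

# The manufacturer publishes the record once ...
published = json.dumps(product)

# ... and distributor, retailer and end user parse the same record
# instead of re-keying it into their own systems.
received = json.loads(published)
print(received["name"], received["net_weight_kg"], "kg")
```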

The rise of big data also points in the direction of having your MDM solution exposed to the outside, as touched upon in the post Adding 180 Degrees to MDM.


The Countryside Data Quality Journey Through 2015

I guess this is the time for blog posts about big things that are going to happen in 2015. But you see, we could also take a route away from the motorways and highways and see how the traditional way of life is still unfolding in the data quality landscape.

Lost

While the innovators and early adopters are fighting with big data quality, the late majority are still trying to get their heads around how to manage small data. And that is a good thing, because you cannot utilize big data without solving small data quality problems, not least around master data, as told in the post How important is big data quality?

Shitterton

Solving data quality problems is not just about fixing data. It is very much also about fixing the structures around data, as explained in a post, featuring the pope, called When Bad Data Quality isn’t Bad Data.

No Mans Land

A common roadblock on the way to solving data quality issues is that things that are everybody’s problem tend to be no one’s problem. Implementing a data governance programme is evolving as the answer to that conundrum. As with many things in life, data governance is about thinking big and starting small, as told in the post Business Glossary to Full-Blown Metadata Management or Vice Versa.

Ugley

Data governance revolves a lot around people’s roles, and there are also some specific roles within data governance. Data owners have been known for a long time, data stewards have been around for some time, and now we also see Chief Data Officers emerge, as examined in the post The Good, the Bad, and the Ugly Data Governance Role.

As experienced recently, somewhere in the countryside, while discussing how to get going with a big and shiny data governance programme, there is indeed still a lot to do with trivial data quality issues, such as fields being too short to capture the real world, as reported in the post Everyday Year 2000 Problems and sketched below.
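
As a hedged illustration of such an everyday problem, here is a tiny Python sketch of a field sized for yesterday's world silently truncating today's values; the column width and phone numbers are made up.

```python
# An illustrative everyday "Year 2000 problem": a field defined too short
# for the real world. The column width and numbers are made up.

PHONE_WIDTH = 10  # fine for old national numbers, too short with a +45 prefix

def store_phone(raw: str) -> str:
    if len(raw) > PHONE_WIDTH:
        # Many legacy systems truncate silently instead of rejecting
        return raw[:PHONE_WIDTH]
    return raw

print(store_phone("12345678"))     # fits: '12345678'
print(store_phone("+4512345678"))  # silently becomes '+451234567'
```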

Wales


Customer MDM Magic Wordles

The Gartner Magic Quadrant for Master Data Management of Customer Data 2014 is out. One place to get it for free is the registration page offered in the Informatica communication here.

So, what is good and what is bad when looking for an MDM vendor if you are focusing on customer data right now?

Some words in the strengths assessment of vendors are:

Magic plus

Some words in the cautions assessment of vendors are:

Magic minus


The Scary Data Lake

The concept of the data lake seems to be having a revival these days. Perhaps it reemerged about a year ago, as told in the post Do You Like the Lake?

The idea of having a data lake scares the hell out of data quality people, as seen in the title used by Gary Allemann in the post Data Lake vs Data Cesspool.

The data lake is mostly promoted as a data source for analytics, as opposed to something that is part of daily operations. That is horrifying enough. Imagine Joe last month using 80 % of his time fixing data quality issues when doing one batch of analytics. And this month Sue spends 80 % of her time fixing data quality issues in the same data lake in her analytic quest, and 50 % of Sue’s data quality issues are in fact the same as Joe’s challenges from last month.
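
A minimal sketch of one way out, assuming hypothetical names throughout: if Joe's cleansing rules had been registered once against the lake, Sue's analysis could reuse them instead of rediscovering half of them.

```python
# A hypothetical shared rule registry for a data lake, so cleansing work
# done for one analysis is not repeated for the next. Purely illustrative.

CLEANSING_RULES = []

def rule(fn):
    """Register a cleansing rule once; every later analysis reuses it."""
    CLEANSING_RULES.append(fn)
    return fn

@rule
def trim_whitespace(record):
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

@rule
def empty_to_none(record):
    return {k: (None if v == "" else v) for k, v in record.items()}

def read_from_lake(records):
    """Apply every registered rule to each record pulled from the lake."""
    for rec in records:
        for fn in CLEANSING_RULES:
            rec = fn(rec)
        yield rec

# Joe registered the rules last month; Sue's query reuses them for free.
print(list(read_from_lake([{"name": " Acme ", "city": ""}])))
```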

As Halloween is just around the corner, it is time to ask: What is your data lake horror story?

Hadooween


Post No. 666

This is post number 666 on this blog. 666 is the number of the beast. Something diabolic.

The first post on my blog came out in June 2009 and was called Qualities in Data Architecture. It was about how we should talk a bit less about bad data quality and instead focus a bit more on success stories around data quality. I haven’t been able to stick to that all the time. There are so many good data quality train wrecks out there, such as the one told in the post called Sticky Data Quality Flaws.

Some of my favorite subjects around data quality were lined up in Post No. 100.

The biggest thing that has happened in the data quality realm during the five years this blog has been live is probably the rise of big data. Or rather the rise of the term big data. This proves to me that change usually starts with technology. Then after some time we start thinking about processes, and finally about people’s roles and responsibilities.


Data Quality 3.0 Revisited

Back in 2010 I played around with the term Data Quality 3.0. This concept is about how we increasingly use external data within data management, as opposed to the traditional use of internal data, which is data that has been typed into our databases by employees or collected internally in other ways.


The rise of big data has definitely fueled the thinking around using external data as reported in the post Adding 180 Degrees to MDM.

There are other internal and external aspects too, for example internal and external business rules, as examined in the post Two Kinds of Business Rules within Data Governance. This post has been discussed in the Data Governance Know How group on LinkedIn.

In a comment Thomas Tong says:

“It’s really fun when the internal components of governance are running smooth, giving the opportunity to focus on external connections to your data governance program. Finding the right balance between internal and external influences is key, as external governance partners can reduce the load/complexity of your overall governance program. It also helps clarify the difference between a “external standard” vs “internal standard”, as well as what is “reference data” vs “master data”… and a little preview of your probable integration strategy with external.”

This resonates very much with my mindset. Since 2010 my own data quality journey has increasingly embraced Master Data Management (MDM) and Data Governance as told in the recent blog post called Data Governance, Data Quality and MDM.

So, in my quest to coin these 3 disciplines into one term, I may, besides using the word information, also put 3.0 into the naming: “Information Quality 3.0”, hmmm …..


Fitness, Data Quality, Big Data and IT Projects

This weekend I’m in Copenhagen where, unlike when I’m in London, I enjoy a bicycle ride.

In the old days I had a small cycle computer that gave you a few key performance indicators about your ride, such as riding time, distance covered, and average and maximum speed. Today you can use an app on your smartphone and along the way have the current figures displayed on your smartwatch.

As explained in the post American Exceptionalism in Data Management, the first thing I do when installing an app is to change Fahrenheit to Celsius, the date format to a usable one and, in this context not least, miles to kilometers.

The cool thing is that the user interface on my smartwatch reports my usual speed in kilometers per hour as miles per hour, making me 60 % faster than I used to be. So next year I will join the Tour de France, making Jens Voigt (aka Der Alte) look like a youngster.
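
For the arithmetic behind that figure: 1 mile is about 1.61 kilometers, so a km/h value displayed under an mph label overstates your speed by roughly 61 %. A quick sketch in Python:

```python
KM_PER_MILE = 1.609344

true_speed_kmh = 25.0          # what the app actually measured
shown_as_mph = true_speed_kmh  # same number under the wrong unit label
true_speed_mph = true_speed_kmh / KM_PER_MILE

print(f"Real speed: {true_speed_mph:.1f} mph")                     # ~15.5 mph
print(f"Apparent boost: {shown_as_mph / true_speed_mph - 1:.0%}")  # ~61%
```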

Viking tour
A Viking tour around Roskilde and Vallø Borgring. Click for report with a wonderful mixup of date formats.

Using such an app is also a good example of why we have big data today. The app tracks a lot of data, such as the detailed route on a map with x, y and z coordinates, split speed per kilometer and other useful stuff. Analyzing these data tells me the Tour de France maybe isn’t a good idea. After what I thought was 100 miles, but was 100 kilometers, my speed went from slow to grandpa.
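
To illustrate the kind of derived data such an app produces, here is a small Python sketch computing split speed per kilometer from timestamped distance samples; the numbers are made up.

```python
# Made-up (distance_km, elapsed_seconds) samples at each whole kilometer
samples = [(0.0, 0), (1.0, 150), (2.0, 290), (3.0, 470)]

for (d0, t0), (d1, t1) in zip(samples, samples[1:]):
    kmh = (d1 - d0) / ((t1 - t0) / 3600)  # kilometers divided by hours
    print(f"km {int(d1)}: {kmh:.1f} km/h")
```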

That’s a bit like IT projects, by the way. Regardless of timeframe, they slow down in progress after 80 % of the plan has been covered.


EU to regulate the term “big data”

Today it has been announced that the European Union will regulate the use of the term “big data”.

“The volume of misuse of the term big data has gone way over what is acceptable,” says an EU spokesperson. Therefore the Commission will initiate a snap roadmap for legislation, leading to every use of the term big data having to be approved by the authorities beforehand.

A variety of ways to declare that your use of the term big data has been approved will be put into force for the different languages used within the Union. So far France has announced that “big data appellation d’originalité contrôlée” will be used there.

Velocity is the word that best describes the planned process for clamping down on the misuse of the term big data. As soon as 2020, every member state must have started the legislation process, and no later than 2025 the rules must be implemented in national laws. However, there is a great deal of skepticism over whether things can move that fast.

Say big data one more time
