The concept of the data lake seems to be having a revival these days. Perhaps it reemerged about a year ago, as told in the post Do You Like the Lake?
The idea of having a data lake scares the hell out of data quality people as seen in the title used by Garry Allemann in the post Data Lake vs Data Cesspool.
The data lake is mostly promoted as a data source for analytics, as opposed to something that is part of daily operations. That is horrifying enough. Imagine Joe last month using 80% of his time fixing data quality issues while doing one batch of analytics. This month Sue spent 80% of her time fixing data quality issues in the same data lake in her analytic quest, and 50% of Sue's data quality issues are in fact the same as Joe's challenges from last month.
As Halloween is just around the corner, it is time to ask: What is your data lake horror story?
It has often been said, written, blogged and tweeted that data itself is useless. It is all about information.
Indeed. In the same way money itself is worthless. It is all about all the good stuff you can buy for money.
So, if you care about money, you should care about data too.
This is post number 666 on this blog. 666 is the number of the beast. Something diabolic.
The first post on my blog came out in June 2009 and was called Qualities in Data Architecture. That post was about how we should talk a bit less about bad data quality and instead focus a bit more on success stories around data quality. I haven't been able to stick to that all the time. There are so many good data quality train wrecks out there, such as the one told in the post called Sticky Data Quality Flaws.
Some of my favorite subjects around data quality were lined up in Post No. 100.
The biggest thing that has happened in the data quality realm during the five years this blog has been live is probably the rise of big data. Or rather, the rise of the term big data. This proves to me that change usually starts with technology. Then, after some time, we start thinking about processes, and finally about people's roles and responsibilities.
A frequent update on my LinkedIn home page these days is about the HiPPO principle. The HiPPO principle describes a leadership style that prioritizes the leader's opinion over using data, as explained in the Forbes article here.
The hippo (hippopotamus) is one of the largest animals on this planet. So is the rhino (rhinoceros). The rhino is critically endangered because it is hunted by humans for a very small part of its body: the horn.
I guess anyone who has been in business for some years has met the hippo. Probably you have also experienced a rhino hunt: a project or programme of very large size aiming at a quite narrow business objective, one that may have been expressed as a simple slogan by a hippo.
Gartner (the analyst firm), represented by Saul Judah, takes data quality back to basics in the recent post called Data Quality Improvement.
While I agree with the sentiment around measuring the facts as expressed in the post, I am cautious about assuming that everything is good when data are fit for the purpose of business operations.
Some clues lie in the data quality dimensions mentioned in the post:
Accuracy (for now):
As said in the Gartner post, data are indeed temporal. The real world changes and so do business operations. By the time you have made your data fit for the purpose of use, the business operations have changed. And by the time you have re-fitted your data for the new purpose of use, the business operations have changed again.
Furthermore, most organizations can't take all business operations into account at the same time. If you go down the fit-for-purpose track, you will typically address a single business objective and make data fit for that purpose. Not least when dealing with master data, there are many business objectives and derived purposes of use. In my experience that leads to this conclusion:
“While we value that data are of high quality if they are fit for the intended use we value more that data correctly represent the real-world construct to which they refer in order to be fit for current and future multiple purposes”
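One way to act on the "accuracy (for now)" point is to treat accuracy as something that decays over time and flag records whose real-world verification has grown too old. The sketch below is a minimal, hypothetical illustration of that idea; the field name `last_verified` and the one-year threshold are assumptions, not anything prescribed by the Gartner post.

```python
from datetime import date, timedelta

# Hypothetical sketch: accuracy is temporal, so a record verified against
# the real world long ago should be re-checked. Field names are illustrative.
def stale_records(records, today, max_age_days=365):
    """Return the records not verified within max_age_days of today."""
    threshold = today - timedelta(days=max_age_days)
    return [r for r in records if r["last_verified"] < threshold]

customers = [
    {"id": 1, "last_verified": date(2014, 9, 1)},
    {"id": 2, "last_verified": date(2012, 3, 15)},
]
# Only customer 2 falls outside the one-year freshness window.
print(stale_records(customers, today=date(2014, 10, 20)))
```

The point is not the code but the mindset: a fitness-for-purpose check passed last year says little about fitness today.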
Existence – an aspect of completeness:
The Gartner post mentions a data quality dimension called existence. I tend to see this as an aspect of the more broadly used term completeness.
For example, achieving fit-for-purpose completeness in product master data has been a huge challenge for many organizations within retail and distribution in recent years, as explained in the post Customer Friendly Product Master Data.
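Completeness in this sense can be made measurable as a fill rate per attribute: the share of products that actually have a value for each attribute the business needs. The sketch below is a minimal, hypothetical example; the attribute names are invented for illustration and not taken from any particular product data model.

```python
# Hypothetical sketch: measure completeness (fill rate) per attribute of
# product master data. Attribute names here are illustrative only.
def fill_rates(products, attributes):
    """Share of products with a non-empty value for each attribute."""
    total = len(products)
    return {
        attr: sum(1 for p in products if p.get(attr) not in (None, "")) / total
        for attr in attributes
    }

products = [
    {"sku": "A1", "description": "Blue chair", "weight_kg": 4.2},
    {"sku": "A2", "description": "", "weight_kg": None},
    {"sku": "A3", "description": "Oak table", "weight_kg": 18.0},
]
print(fill_rates(products, ["description", "weight_kg"]))
```

A low fill rate on, say, weight or description is exactly the kind of gap that turns a webshop's product pages customer-unfriendly.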
Back in 2010 I played around with the term Data Quality 3.0. This concept is about how we increasingly use external data within data management, as opposed to the traditional use of internal data, which are data that have been typed into our databases by employees or collected internally in other ways.
The rise of big data has definitely fueled the thinking around using external data as reported in the post Adding 180 Degrees to MDM.
There are other internal and external aspects too, for example internal and external business rules, as examined in the post Two Kinds of Business Rules within Data Governance. That post has been discussed in the Data Governance Know How group on LinkedIn.
In a comment Thomas Tong says:
“It’s really fun when the internal components of governance are running smooth, giving the opportunity to focus on external connections to your data governance program. Finding the right balance between internal and external influences is key, as external governance partners can reduce the load/complexity of your overall governance program. It also helps clarify the difference between a “external standard” vs “internal standard”, as well as what is “reference data” vs “master data”… and a little preview of your probable integration strategy with external.”
This resonates very much with my mindset. Since 2010 my own data quality journey has increasingly embraced Master Data Management (MDM) and Data Governance as told in the recent blog post called Data Governance, Data Quality and MDM.
So, in my quest to coin one term for these three disciplines, I may, besides the word information, also put 3.0 into the name: "Information Quality 3.0", hmmm …..
Yesterday Daragh O Brien posted an Open Letter to my Information Quality Peers. The essence is that Daragh isn’t completely satisfied with how things are in The International Association for Information and Data Quality (IAIDQ).
That reminds me that I was a charter member of the IAIDQ.
But checking now, I probably haven't renewed the membership. This is not deliberate. It may just have slipped. Maybe because, as one of Daragh's critique points notes, broadcasting from the IAIDQ has decreased in recent years.
> Correction: Double checking I am actually still a member. I renewed for 2 years last time (usually I’m not that careless with money). I just lost my Charter Mbr designation in the process.
Another critique point raised by Daragh is the failed mission to make the organization truly international, as the organization has had difficulties maintaining chapters around the world.
Forming and maintaining regional chapters is about gaining and upholding a critical mass of active members. An example that this is possible is the German Information Quality Society – Deutsche Gesellschaft für Informations- und Datenqualität e. V. However, this organization doesn't seem to be an IAIDQ chapter, but rather another church obeying the same god.
The current unrest in the IAIDQ is not the first of its kind. I remember that some years ago one of the founding members, Larry English, sent a strange email to members saying that he had quit the organization because he was not satisfied with something.
It is ironic that information quality practitioners are preaching communication and collaboration, but we don’t seem to get it when it comes to organizing our own little world.