If you search on Google for “data quality” you will find the ever-recurring discussion on how we can define data quality.
This is also true for the top-ranked non-sponsored articles, such as the Wikipedia page on data quality and an article from Profisee called Data Quality – What, Why, How, 10 Best Practices & More!
The two predominant definitions are that data is of high quality if the data:
- Is fit for the intended purpose of use.
- Correctly represents the real-world construct that the data describes.
Personally, I think it is a balance.
In theory I am on the side of the second definition. This is probably because I most often work with master data, where the same data have multiple purposes.
However, as a consultant helping organizations get the funding in place and get the data quality improvement done within time and budget, I do end up on the other side.
What about you? Where do you stand on this question?
The differences between a data warehouse and a data lake have been discussed a lot, for example here and here.
To summarize, the main point in my eyes is this: in a data warehouse the purpose and structure are determined before uploading data, while with a data lake the purpose and structure of data can be determined before downloading data. As a result, a data warehouse is characterized by rigidity and a data lake by agility.
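The contrast can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the field names, the fixed schema and the three helper functions are hypothetical and not taken from any particular product.

```python
import json

# Hypothetical fixed structure, agreed on before any data is uploaded.
WAREHOUSE_SCHEMA = {"sku": str, "price": float}


def load_into_warehouse(table, record):
    """Warehouse style: structure is enforced at upload time."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError("record does not match the predefined structure")
    for field, field_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[field], field_type):
            raise ValueError(f"{field} must be of type {field_type.__name__}")
    table.append(record)


def load_into_lake(lake, raw_payload):
    """Lake style: the raw payload is stored as is, with no structure imposed."""
    lake.append(raw_payload)


def read_from_lake(lake, fields):
    """Lake style: the structure is chosen by the reader at download time."""
    rows = []
    for raw_payload in lake:
        document = json.loads(raw_payload)
        rows.append({field: document.get(field) for field in fields})
    return rows
```

The warehouse function rejects a record with an unexpected attribute, while the lake accepts it and leaves the choice of fields to each reader. That is the agility, and also the reason some control has to be put on top.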
Agility is a good thing, but of course, you have to put some control on top of it as reported in the post Putting Context into Data Lakes.
Furthermore, there are some great opportunities in extending the use of the data lake concept beyond the traditional use of a data warehouse. You should think beyond using a data lake within a given organization and envision how you can share a data lake within your business ecosystem. Moreover, you should consider not only using the data lake for analytical purposes but also embark on a mission to utilize a data lake for operational purposes.
The venture I am working on right now has this second take on a data lake. The Product Data Lake exists in the context of sharing product information between trading partners in an agile and process driven way. The providers of product information, typically manufacturers and upstream distributors, upload product information according to the data management maturity level of that organization. This information may very well for now be stored according to traditional data warehouse principles. The receivers of product information, typically downstream distributors and retailers, download product information according to the data management maturity level of that organization. This information may very well for now end up in a data store organized by traditional data warehouse principles.
The other approaches I have seen for sharing product information between trading partners are built on having a data warehouse like solution between the trading partners, with a high degree of consensus around purpose and structure. Such solutions are in my eyes only successful when restricted narrowly to a given industry, probably within a given geography, for a given span of time.
By utilizing the data lake concept in the exchange zone between trading partners you can share information according to your own pace of maturing in data management and take advantage of data sharing where it fits in your roadmap to digitalization. The business ecosystems where you participate are great sources of data for both analytical and operational purposes and we cannot wait until everyone agrees on the same purpose and structure. It only takes two to start the tango.
The term evergreen is known from botany as plants staying green all year and from music as songs not just being a hit for a few months but capable of generating royalties for years and years.
Data should also stay evergreen. I am a believer in the “first time right” principle as explained in the post instant Single Customer View. However, you must also keep your data quality fresh as examined in the post Ongoing Data Maintenance.
If we look at customer, or rather party, Master Data Management (MDM), it is much about real world alignment. In party master data management you describe entities such as persons and legal entities in the real world, and you should have descriptions that reflect the current state (and sometimes historical states) of these entities. Some reflections will be The Relocation Event. And as even evergreen trees go away, and “My Way” hopefully will go away someday, you also must be able to perform Undertaking in MDM.
With product MDM it is much about data being fit for multiple future purposes of use as reported in the post Customer Friendly Product Master Data.
This is post number 666 on this blog. 666 is the number of the beast. Something diabolic.
The first post on my blog came out in June 2009 and was called Qualities in Data Architecture. This post was about how we should talk a bit less about bad data quality and instead focus a bit more on success stories around data quality. I haven’t been able to stick to that all the time. There are so many good data quality train wrecks out there, such as the one told in the post called Sticky Data Quality Flaws.
Some of my favorite subjects around data quality were lined up in Post No. 100. They are:
The biggest thing that has happened in the data quality realm during the five years this blog has been live is probably the rise of big data. Or rather the rise of the term big data. This proves to me that changes usually start with technology. Then, after some time, we start thinking about processes and finally people's roles and responsibilities.
Gartner (the analyst firm), represented by Saul Judah, takes data quality back to basics in the recent post called Data Quality Improvement.
While I agree with the sentiment around measuring the facts as expressed in the post, I am cautious about assuming that everything is good when data are fit for the purpose of business operations.
Some clues lie in the data quality dimensions mentioned in the post:
Accuracy (for now):
As said in the Gartner post, data are indeed temporal. The real world changes and so do business operations. By the time you have got your data fit for the purpose of use, the business operations have changed. And by the time you have got your data re-fit for the new purpose of use, the business operations have changed again.
Furthermore, most organizations can’t take all business operations into account at the same time. If you go down the fit for purpose track you will typically address a single business objective and make data fit for that purpose. Not least when dealing with master data, there are many business objectives and derived purposes of use. In my experience that leads to this conclusion:
“While we value that data are of high quality if they are fit for the intended use we value more that data correctly represent the real-world construct to which they refer in order to be fit for current and future multiple purposes”
Existence – an aspect of completeness:
The Gartner post mentions a data quality dimension called existence. I tend to see this as an aspect of the more broadly used term completeness.
For example, having a fit for purpose completeness related to product master data has been a huge challenge for many organizations within retail and distribution during recent years, as explained in the post Customer Friendly Product Master Data.
Data are of high quality if they are fit for the purpose of use. This mantra has been around in the data management realm for many years.
In a recent article by Andy Hayler on CIO about MDM at Harrods there is a good example of a piece of data of such a high quality. It is a product description:
XX 6621/74 BLK VNN SS TOP 969B S
I guess this product description was nicely fit for the purpose of use when Harrods handled their product data in a material master in an ERP system. But when switching from buy-side focus to sell-side focus in a multi-channel world, this product description means nothing to the customer.
Such problems with changing purposes of use for product master data are not only a luxury problem at Harrods but a common challenge within retail and distribution. The challenge involves having customer friendly product descriptions, a range of atomized product attributes that vary by product category and related digital assets that help the customer.
Organizations everywhere are, as explained by Andy Hayler, tackling this challenge by implementing Master Data Management (MDM) solutions, in this case those specialized in Product Information Management (PIM).
MDM is said to be about a single version of the truth. While this in the customer (or rather party) MDM world is much about achieving uniqueness by matching and merging several different representations of the same real world individual or legal entity, the main challenge in product MDM is a bit different. Here completeness is a big issue. This involves gathering several different pieces of the truth from different sources. And a certain level of completeness may be fit for the purpose of use today but not fit enough tomorrow.
So, how can organizations overcome the huge task of gathering so much product data? I think it is much about Sharing Product Master Data.
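To make the point about completeness concrete, here is a minimal sketch of measuring it per purpose of use and of gathering pieces of the truth from several sources. Everything in it is an assumption for illustration: the attribute names, the two purposes, the placeholder SKU and the invented supplier feed values. Only the product description is the one quoted above.

```python
# Hypothetical attribute requirements per purpose of use.
# In practice these vary by product category.
REQUIRED_BY_PURPOSE = {
    "buy_side": ["sku", "short_description"],
    "sell_side": ["sku", "short_description", "customer_description",
                  "colour", "size", "image_url"],
}


def merge_sources(*sources):
    """Gather pieces of the truth: keep the first non-empty value per attribute."""
    merged = {}
    for source in sources:
        for attribute, value in source.items():
            if value not in (None, "") and attribute not in merged:
                merged[attribute] = value
    return merged


def completeness(record, purpose):
    """Share of the attributes required for a purpose that are actually filled."""
    required = REQUIRED_BY_PURPOSE[purpose]
    filled = sum(1 for attribute in required if record.get(attribute) not in (None, ""))
    return filled / len(required)


# Placeholder records; the SKU and the supplier values are invented.
erp_record = {"sku": "EXAMPLE-1",
              "short_description": "XX 6621/74 BLK VNN SS TOP 969B S"}
supplier_feed = {"sku": "EXAMPLE-1", "colour": "Black", "size": "S", "image_url": ""}
```

With these made-up requirements the ERP record alone is fully complete for the buy-side purpose but covers only two of the six sell-side attributes. Merging in the supplier feed raises that to four of six, which is the kind of gain sharing is meant to deliver.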
A recent post on this blog was called Omni-purpose MDM. It discussed to what degree MDM solutions should cover all business cases where Master Data Management plays a part.
Master Data Management (MDM) is very much about data quality. A recurring question in the data quality realm is whether data quality should be seen as the degree to which data are fit for the purpose of use, or whether the degree of real world alignment is a better measurement.
The other day Jim Harris published a blog post called Data Quality has a Rotating Frame of Reference. In a comment Jim takes up the example of having a valid address in your database records and how measuring address validity may make no sense for measuring how data quality supports a certain business objective.
My experience is that if you look at one business objective at a time, measuring data quality against the purpose of use is of course sound. However, if you have several different business objectives using the same data, you will usually discover that aligning with the real world fulfills all the needs. This is explained further within the concept of Data Quality 3.0.
Using the example of a valid address, measurements, and the actual prevention of data quality issues, typically work with degrees of validity, notably:
- The validity at different levels such as area, entrance and specific unit, as examined in the post A Universal Challenge.
- The validity of related data elements as an address may be valid but the addressee is not as examined in the post Beyond Address Validation.
Data quality needs for a specific business objective also change over time. A valid address may be irrelevant for invoicing if either the mail carrier gets it there anyway or we invoice electronically, but having a valid address and addressee suddenly becomes relevant to the purpose of use if the invoice is not paid and we have to chase the debt.
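The degrees of validity and the shifting needs can be sketched as follows. This is a minimal illustration, not a real address validation API: the level names follow the list above, while the purposes and their minimum requirements are my own assumptions.

```python
from enum import IntEnum


class AddressValidity(IntEnum):
    """Degrees of address validity, from nothing verified to the specific unit."""
    NONE = 0
    AREA = 1
    ENTRANCE = 2
    UNIT = 3


# Hypothetical minimum needs per purpose: (address level, addressee must be valid).
REQUIRED_BY_PURPOSE = {
    "electronic_invoicing": (AddressValidity.NONE, False),
    "postal_invoicing": (AddressValidity.AREA, False),
    "debt_collection": (AddressValidity.UNIT, True),
}


def fit_for_purpose(level, addressee_valid, purpose):
    """True if the measured validity meets the assumed needs of the purpose."""
    minimum_level, needs_addressee = REQUIRED_BY_PURPOSE[purpose]
    return level >= minimum_level and (addressee_valid or not needs_addressee)
```

The same record, valid only at area level and with an unverified addressee, passes for both invoicing purposes and fails for debt collection. A record aligned with the real world at unit level with a valid addressee passes for all three.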
In MDM (Master Data Management) there is the term Multi-Domain MDM, which is about how we manage parties, products, locations and other entity types respectively. Handling master data within a Multi-Channel environment encompassing offline, online and social channels is a huge challenge within MDM today. Yet another multi view of MDM is handling different facets of master data, being:
Handling entities is the core of master data management. Ensuring that master data are fit for multiple purposes, most often by ensuring real world alignment, is the basic goal of master data management. Entity resolution is a key discipline in doing that. In the party master data domain, doing Customer Data Integration (CDI) is the good old activity aiming at compiling all the customer data silos in the enterprise into a golden copy with golden records. Product Information Management (PIM) is another ancestor in the MDM evolution history, predominantly focusing on the entities.
A possible distinction between Master Entity Management and Master Relation Management is discussed in the post Another Facet of MDM: Master Relationship Management.
As we get better and better solutions for handling entities, the innovation shifts to handling the relationships between entities. These relations exist for example in Multi-Channel environments by linking entities in the old systems of record with the same real world entities in the new systems of engagement, as told in the post Social MDM and Systems of Engagement.
Getting the master data right the first time is crucial.
In product master data management getting to that stage is often done by managing a flow of events where the product data are completed and approved by a team of knowledge workers.
In party master data management a way of ensuring first time right is examined in the post instant Single Customer View. But that is only the start. Party master data has a life cycle with important events such as:
Babbling about data quality, real world alignment and maps is a regular topic on this blog and this Saturday is no exception.
This week I stumbled on a discussion in the “Data, Data, Data” community on Google Plus. There was a map:
The map visualizes what the world would look like if every internet user had an equal amount of space to live on. This gives the land masses on the earth a different shape than in reality, given:
- Population density
- Internet penetration
As internet penetration is the main purpose of the map, the penetration percentages for the different countries are highlighted by color in order to be fit for the purpose of use, showing the highest penetration in Canada, Northern Europe, Qatar, South Korea and New Zealand.
Some countries seem to have disappeared from the planet as mentioned in the comments on Google Plus: Singapore, Taiwan (officially Republic of China) and North Korea (officially Democratic People’s Republic of Korea). The latter one has probably gone because of no data or no users. Well, probably both reasons.
On a side note it’s a bit peculiar that countries on the map are labeled by the ISO 3 character code and not the 2 character code that more resembles country domains on the internet.
In a recent comment here on this blog the relevance of Master Data Management (MDM) solutions was questioned because in real business life different business units see master data very differently, though the data describe the same real world entity. And it’s not the first time I hear this argument.
The issue is similar to the Greenland problem in geography. When using the most common projection for visualizing a round earth on a flat map, the Mercator projection, Greenland has a true shape but looks to be about the same size as Africa, though Africa is over 10 times as large as Greenland.
As examined in the post Sharing data is key to a single version of the truth this is similar to the problems in fulfilling multiple uses embracing all business units in an enterprise:
- If a map shows a limited part of the world the difference doesn’t matter that much. This is similar to fitting the purpose of use in a single business unit.
- If the map shows the whole world we may have all kinds of different projections offering different kinds of views on the world, each having some advantages and disadvantages, like when we do enterprise MDM.
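The size of the Greenland effect can be estimated with a small calculation. On a Mercator map the local area exaggeration relative to the equator is 1 / cos²(latitude). The sketch below uses that formula; the choice of 72 degrees north as a representative latitude for Greenland is my own assumption, and the function name is made up for the example.

```python
import math


def mercator_area_inflation(latitude_deg):
    """Local area exaggeration on a Mercator map relative to the equator."""
    return 1.0 / math.cos(math.radians(latitude_deg)) ** 2


# Assumed representative latitudes: Africa straddles the equator,
# Greenland is taken to lie around 72 degrees north.
africa_inflation = mercator_area_inflation(0.0)
greenland_inflation = mercator_area_inflation(72.0)
```

At the equator the factor is 1, while at 72 degrees it is roughly 10.5. An exaggeration of that order is enough to make Greenland look comparable to Africa on the flat map, while a map of a limited part of the world hardly notices the difference.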
Today we have new technology coming to the rescue. If you go into Google Earth the world indeed looks round and you may have any high altitude view of an apparently round world. If you go closer the map tends to be more and more flat.
My guess is that the solutions to the multiple uses conundrum within MDM will also be offered from the cloud, through innovative solutions that reflect the real world entities and relate those to a variety of business functions used in different business units, offering a range of views that support multiple purposes of use.