Gartner (the analyst firm), represented by Saul Judah, takes data quality back to basics in the recent post called Data Quality Improvement.
While I agree with the sentiment around measuring the facts as expressed in the post, I am cautious about assuming that everything is good once data are fit for the purpose of business operations.
Some clues lie in the data quality dimensions mentioned in the post:
Accuracy (for now):
As said in the Gartner post, data are indeed temporal. The real world changes and so do business operations. By the time you have made your data fit for the purpose of use, the business operations have changed. And by the time you have re-fitted your data for the new purpose of use, the business operations have changed again.
Furthermore, most organizations can’t take all business operations into account at the same time. If you go down the fit-for-purpose track you will typically address a single business objective and make data fit for that purpose. Not least when dealing with master data there are many business objectives and derived purposes of use. In my experience that leads to this conclusion:
“While we value that data are of high quality if they are fit for the intended use we value more that data correctly represent the real-world construct to which they refer in order to be fit for current and future multiple purposes”
Existence – an aspect of completeness:
The Gartner post mentions a data quality dimension called existence. I tend to see this as an aspect of the more broadly used term completeness.
For example, achieving fit-for-purpose completeness for product master data has been a huge challenge for many organizations within retail and distribution during recent years, as explained in the post Customer Friendly Product Master Data.
This weekend I’m in Copenhagen where, opposite to when in London, I enjoy a bicycle ride.
In the old days I had a small cycle computer that gave you a few key performance indicators about your ride, such as riding time, distance covered, and average and maximum speed. Today you can use an app on your smartphone and have current figures displayed on your smartwatch along the way.
As explained in the post American Exceptionalism in Data Management, the first thing I do when installing an app is to change Fahrenheit to Celsius, the date format to a usable one and, not least in this context, miles to kilometers.
The cool thing is that the user interface on my smartwatch reports my usual speed in kilometers per hour as miles per hour, making me 60 % faster than I used to be. So next year I will join the Tour de France, making Jens Voigt (aka Der Alte) look like a youngster.
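The arithmetic behind that flattering figure can be sketched in a few lines of Python (a toy illustration, not the app’s actual code):

```python
# Why a km/h figure displayed unchanged but labelled mph flatters you by
# roughly 60 %: the number stays the same, but a mile is 1.609 km.
KM_PER_MILE = 1.609344

def apparent_speedup(speed_kmh: float) -> float:
    """Fractional 'improvement' implied by mislabelling km/h as mph."""
    implied_kmh = speed_kmh * KM_PER_MILE  # what the mph label claims, in km/h
    return implied_kmh / speed_kmh - 1

print(round(apparent_speedup(25.0) * 100))  # about 61 % faster, at any speed
```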
Using such an app is also a good example of why we have big data today. The app tracks a lot of data, such as the detailed route on a map with x, y and z coordinates, split speed per kilometer and other useful stuff. Analyzing these data tells me the Tour de France maybe isn’t a good idea. After what I thought was 100 miles, but was 100 kilometers, my speed went from slow to grandpa.
That’s a bit like IT projects, by the way. Regardless of timeframe, they slow down after 80 % of the plan has been covered.
Usually data models are made to fit a specific purpose of use. As reported in the post A Place in Time, this often leads to data quality issues when the data are going to be used for purposes different from the originally intended one. Among many examples, we not least have heaps of customer tables like this one:
Compared to how the real world works, this example has some diversity flaws, like:
state code as a key to a state table will only work with one country (the United States)
zipcode is a United States-only term, as opposed to the more generic “Postal Code”
fname (First name) and lname (Last name) don’t work in cultures where given name and surname appear in the opposite sequence
The lengths of the state, zipcode and most other fields are obviously too small almost anywhere else
More seriously we have:
fname and lname (First name and Last name), and probably also phone, should belong to a party entity of their own, acting as a contact related to the company
company name should belong to a party entity of its own, acting in the role of customer
address1, address2, city, state and zipcode should belong to a place entity of its own, probably as the current visiting place related to the company
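A minimal sketch of the separation suggested above, using hypothetical Python data classes (the entity and field names are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative sketch: parties and places are entities in their own right,
# linked by roles, rather than columns flattened into one customer table.

@dataclass
class Place:
    address_lines: list[str]        # free-form lines instead of address1/address2
    city: str
    region: Optional[str]           # "state" only where the country has states
    postal_code: Optional[str]      # generic, not US-only "zipcode"
    country: str

@dataclass
class Party:
    name: str                       # full name; no fname/lname ordering assumption
    party_type: str                 # "person" or "organization"

@dataclass
class Role:
    party: Party
    role_type: str                  # e.g. "customer", "contact"
    place: Optional[Place] = None   # e.g. current visiting place
    related_to: Optional[Party] = None  # e.g. contact person related to a company

company = Party("Acme Ltd", "organization")
contact = Party("Li Wei", "person")
roles = [
    Role(company, "customer",
         Place(["1 High Street"], "London", None, "SW1A 1AA", "GB")),
    Role(contact, "contact", related_to=company),
]
print(len(roles))  # 2
```

The point of the sketch is that the same party can later take on new roles (supplier, prospect, contact) without remodelling, which is exactly what a single-purpose customer table cannot do.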
In my experience, looking at the real world helps a lot when making data models that can survive for years and support use cases different from the one in immediate question. I’m not talking about introducing scope creep, but just thinking a little bit about what the real world looks like when you are modelling something in that world, which usually is the case when working with Master Data Management (MDM).
Real world alignment is often seen as a measure of data quality competing with the popular approach of seeing data quality as fitness for the purpose of use.
When we try to narrow down what constitutes quality of data, we may use data quality dimensions. So, what do data quality dimensions look like in the light of real world alignment? Here are a few thoughts:
Uniqueness is probably the data quality dimension that most closely relates to real world alignment, as the opposite of uniqueness is duplication, which in the data quality world means that two or more different data records describe the same real world entity.
Accuracy is best measured as the degree to which data describe something in the real world.
Credibility was recently proposed as an important data quality dimension by Malcolm Chisholm on Information Management in the article called Data Credibility: A New Dimension of Data Quality? Here credibility means that data are free of malicious manipulation performed to fulfill an evil purpose of use.
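As a toy illustration of the uniqueness dimension, here is a minimal Python sketch that flags duplicate candidates, meaning records that probably describe the same real world entity, using simple string similarity (the records and the 0.8 threshold are invented examples, not a production matching approach):

```python
from difflib import SequenceMatcher

# Two records are duplicate candidates when they likely describe the same
# real world entity, even though the strings differ.

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

records = ["Acme Limited, London", "ACME Ltd., London", "Beta Corp, Leeds"]

pairs = [
    (r1, r2)
    for i, r1 in enumerate(records)
    for r2 in records[i + 1:]
    if similarity(r1, r2) > 0.8   # arbitrary example threshold
]
print(pairs)  # the two Acme records are flagged as one candidate pair
```

Real matching tools use far more sophisticated comparison (phonetics, tokenization, reference data), but the principle of comparing records against a model of the real world entity is the same.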
Right now I’m working with data stewardship functionality in the instant Data Quality MDM Edition, where the relocation event, the deceased event and other important events in party master data life-cycle management are supported as part of an MDM service.
Updating with some of these events may be done automatically, while other events require manual intervention.
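A hypothetical sketch of how such life-cycle events might be routed, with event names and routing rules invented for illustration and not reflecting any actual product API:

```python
# Hypothetical routing of party life-cycle events: some update the master
# record automatically, others are queued for a data steward to confirm.

AUTO_EVENTS = {"relocation"}            # e.g. applied from an authoritative source
MANUAL_EVENTS = {"deceased", "merger"}  # e.g. needs steward confirmation

def route_event(event_type: str) -> str:
    if event_type in AUTO_EVENTS:
        return "auto-update"
    if event_type in MANUAL_EVENTS:
        return "steward-queue"
    return "reject"

print(route_event("relocation"))  # auto-update
print(route_event("deceased"))    # steward-queue
```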
A recent post on this blog was called Omni-purpose MDM. Herein it is discussed to what degree MDM solutions should cover all business cases where Master Data Management plays a part.
Master Data Management (MDM) is very much about data quality. A recurring question in the data quality realm is whether data quality should be seen as the degree to which data are fit for the purpose of use, or whether the degree of real world alignment is a better measurement.
The other day Jim Harris published a blog post called Data Quality has a Rotating Frame of Reference. In a comment Jim takes up the example of having a valid address in your database records and how measuring address validity may make no sense for measuring how data quality supports a certain business objective.
My experience is that if you look at one business objective at a time, measuring data quality against the purpose of use is of course sound. However, if you have several different business objectives using the same data, you will usually discover that aligning with the real world fulfills all the needs. This is explained further within the concept of Data Quality 3.0.
Using the example of a valid address, measurements, and actual data quality prevention, typically work with degrees of validity, notably:
The validity at different levels, such as area, entrance and specific unit, as examined in the post A Universal Challenge.
The validity of related data elements, as when an address is valid but the addressee is not, as examined in the post Beyond Address Validation.
Data quality needs for a specific business objective also change over time. While a valid address may be irrelevant for invoicing if the mail carrier gets it there anyway, or we invoice electronically, having a valid address and addressee suddenly becomes fit for the purpose of use if the invoice is not paid and we have to chase the debt.
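The idea of degrees of validity can be sketched as follows, with invented level names and a hypothetical check structure:

```python
# Sketch of "degrees of validity": an address can verify at one level
# (area, entrance, unit, addressee) while failing at a finer one, and the
# level a business objective needs changes over time. Levels are illustrative.

LEVELS = ["area", "entrance", "unit", "addressee"]

def validity_degree(checks: dict[str, bool]) -> str:
    """Return the finest level up to which the address verifies."""
    passed = "none"
    for level in LEVELS:
        if not checks.get(level, False):
            break
        passed = level
    return passed

invoice_address = {"area": True, "entrance": True, "unit": False}
print(validity_degree(invoice_address))  # entrance
```

For electronic invoicing, "entrance" may be good enough; for debt collection, nothing short of "addressee" will do.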
In a recent tweet Ted Friedman of Gartner (the analyst firm) said:
I think he is right.
Duplicates have always been pain number one in most places when it comes to the cost of poor data quality.
Though I have been in the data matching business for many years and have fought duplicates with deduplication tools in numerous battles, the war doesn’t seem to be won by using deduplication tools alone, as told in the post Somehow Deduplication Won’t Stick.
Eventually deduplication always comes down to entity resolution: you have to decide which results are true positives and which are useless false positives, and wonder how many false negatives you didn’t catch, which means how much money you didn’t get back from your deduplication investment.
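The end game described above can be illustrated with a tiny Python sketch using invented record pairs (in practice the true pairs are of course not known up front, which is exactly why entity resolution takes human judgement):

```python
# Why deduplication ends in entity resolution: pairs flagged by a matching
# tool must be judged true or false positives, while false negatives are the
# duplicates the tool never surfaced. All pairs here are invented examples.

tool_pairs = {("r1", "r2"), ("r3", "r4")}   # pairs the tool flagged
true_pairs = {("r1", "r2"), ("r5", "r6")}   # pairs that really are duplicates

true_positives = tool_pairs & true_pairs    # correctly caught
false_positives = tool_pairs - true_pairs   # useless matches to review
false_negatives = true_pairs - tool_pairs   # money left on the table

print(sorted(true_positives), sorted(false_positives), sorted(false_negatives))
```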
The article is about the implications for marketing caused by the rise of social media, which now finally seems to eliminate what we have known as business-to-business (B2B) and more or less merges B2B and business-to-consumer (B2C).
As discussed here on the blog several times, starting way back in 2009 in the post Echoes in the Database, a problem with B2B is indeed that while business transactions take place between legal entities, a lot of business processes take place between employees related to the selling and buying entities. You may call that employee-to-employee (E2E), people-to-people (P2P) or indeed human-to-human (H2H).
Related to databases, data quality and Master Data Management (MDM), this means we need real world alignment with two kinds of parties:
The legal entities between which the business transactions take place
The natural persons (the employees) between which the business processes take place
While B2B and B2C may merge in the way we do messaging, the distinction between B2B and B2C will remain in many other aspects. Even in social media we see it, as for example two of the most used social networks, Facebook and LinkedIn, clearly belong mainly to B2C and B2B respectively for marketing and social selling purposes.
The location domain is, after the customer (or rather party) domain and the product domain, the most frequently addressed domain for Master Data Management (MDM).
In my recent work I have seen a growing interest in handling location data as part of an MDM program.
Traditionally, location data in many organizations have been handled in two main ways:
As part of other domains, typically as address attributes for customer and other party entities
As a silo for special business processes that involve spatial data, using Geographic Information Systems (GIS), for example in engineering and demographic market research.
Handling location data most often involves using external reference data, as location data don’t have the same privacy considerations that party data, not least data describing natural persons, tend to have, and, as opposed to product data, location data are pretty much the same to everyone.
MDM for the location domain is very much about bringing the two above-mentioned ways of working with locations together while consistently exploiting external reference data.
As in all MDM work, data quality is the important factor, and the usual data quality dimensions indeed apply here as well. Some challenges are:
Uniqueness and precision: Locations come in hierarchies. As told in the post The Postal Address Hierarchy, when referring to textual addresses we have levels such as country, region, city or district, thoroughfare (street) or block, building number and unit within a building. Uniqueness may be defined within one of these levels. As discussed in the post Where is the Spot?, the precision and use case for coordinates may cause uniqueness issues too.
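A small sketch, with invented field names, of how uniqueness can be defined within a chosen level of the hierarchy: two records may be duplicates at building level yet distinct at unit level.

```python
# Illustrative address hierarchy: a record's uniqueness key depends on
# which level of the hierarchy you define uniqueness within.

LEVELS = ["country", "city", "street", "building", "unit"]

def key_at(record: dict, level: str) -> tuple:
    """Uniqueness key for a record down to and including the given level."""
    idx = LEVELS.index(level) + 1
    return tuple(record.get(l) for l in LEVELS[:idx])

a = {"country": "DK", "city": "Copenhagen", "street": "Nygade",
     "building": "4", "unit": "1th"}
b = {"country": "DK", "city": "Copenhagen", "street": "Nygade",
     "building": "4", "unit": "2tv"}

print(key_at(a, "building") == key_at(b, "building"))  # True: same building
print(key_at(a, "unit") == key_at(b, "unit"))          # False: distinct units
```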
Yesterday we had a call from British Gas (or probably a call centre hired by British Gas) explaining the great savings possible by switching from our current provider, which by the way is: British Gas. This is a classic data quality issue in direct marketing operations: accurately separating your current customers from entities belonging to the new market.
As I have learned that your premier identity proof in the United Kingdom is your utility bill, this incident may be seen as somewhat disturbing – or on further thought, maybe a business opportunity 🙂
At iDQ we develop a solution that may be positioned in the space between data quality prevention and identity checking, by addressing the identity resolution aspect during data capture.