MDM Vendor Revenues According to Gartner

A recent post on this blog has the title MDM Spending Might be 5 Billion USD per Year.

The 5 B USD figure was a guesstimate based on an estimate by Information Difference putting the total yearly revenue collected by MDM software vendors at 1.6 B USD.

Prash Chandramohan, who works at Informatica, made a follow-up blog post with the title The Size of the Global Master Data Management Market. In it, Prash mentions some of the uncertainties involved when making such a guesstimate.

In a LinkedIn discussion on that post, Ben Rund, who is at Riversand, asked about other sources – Gartner and others.

The latest Gartner MDM Magic Quadrant mentions the 2017 revenues as estimated by Gartner:

MDM market vendors re Gartner

It is worth noting that Oracle is no longer a Gartner MDM Magic Quadrant vendor, yet the Gartner report indicates that Oracle still has an MDM (or is it ADM?) revenue from the installed base resembling that of the other mega-vendors: SAP, IBM and Informatica.

Update: The revenues mentioned are assumed to be software license and maintenance. The vendors may then have additional professional services revenue.

The 14 MDM vendors that qualified for inclusion in the latest quadrant constituted, according to Gartner estimates, 84% of the estimated MDM market revenue (software and maintenance) for 2017 – which, according to the Gartner inclusion criteria, must exclude Oracle.
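
If you want to turn that coverage figure into a total market estimate, the arithmetic is a simple grossing-up exercise. Here is a rough sketch in Python – the covered-revenue number is a placeholder, not a figure from the Gartner report:

```python
# Grossing up from covered revenue to an implied total market size.
# NOTE: covered_revenue_musd is a placeholder, not a figure from the Gartner report.
covered_share = 0.84           # the 14 included vendors' share of 2017 MDM revenue
covered_revenue_musd = 1_300   # hypothetical sum of their revenues, in million USD

implied_total_musd = covered_revenue_musd / covered_share
print(f"Implied total MDM market (software and maintenance): {implied_total_musd:,.0f} M USD")
```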

The Trouble with Data Quality Dimensions

Data Quality Dimensions

Data quality dimensions are some of the most used terms when explaining why data quality is important, what data quality issues can look like and how you can measure data quality. Ironically, we sometimes use the same data quality dimension term for two different things or use two different data quality dimension terms for the same thing. Some of the troubling terms are:

Validity / Conformity – same same but different

Validity is most often used to describe whether data filled into a data field obeys a required format or is among a list of accepted values. Databases are usually good at enforcing this, for example by ensuring that an entered date follows the required day-month-year sequence and is a real calendar date, or by cross-checking a data value against another table to see if the value exists there.

The problems arise when data is moved between databases with different rules and when data is captured in textual forms before being loaded into a database.

Conformity is often used to describe whether data adheres to a given standard, like an industry or international standard. Due to complexity and other circumstances, such a standard may not, or only partly, be implemented as database constraints or by other means. Therefore, a given piece of data may be a valid database value without being in compliance with a given standard.

For example, the code value for a colour being “0,255,0” may be in the accepted format, with all elements in the accepted range between 0 and 255 for an RGB colour code. But the standard for a given product colour attribute may only allow the value “Green” and the other common colour names, and “0,255,0” will, when translated, end up as “Lime” or “High green”.
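
To make the distinction concrete, here is a minimal sketch in Python. The allowed colour names are made up for illustration; the point is that a value can pass the validity check on format and range while failing the conformity check against the standard:

```python
import re

# Validity: three comma-separated integers, each between 0 and 255.
def is_valid_rgb(value: str) -> bool:
    match = re.fullmatch(r"(\d{1,3}),(\d{1,3}),(\d{1,3})", value)
    return bool(match) and all(0 <= int(part) <= 255 for part in match.groups())

# Conformity: the product colour standard only allows common colour names
# (illustrative list).
ALLOWED_COLOUR_NAMES = {"Green", "Red", "Blue", "Black", "White"}

def conforms_to_standard(colour_name: str) -> bool:
    return colour_name in ALLOWED_COLOUR_NAMES

print(is_valid_rgb("0,255,0"))        # True  - accepted format and range
print(conforms_to_standard("Lime"))   # False - "0,255,0" translates to "Lime",
                                      #         which the standard does not allow
```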

Accuracy / Precision – true, false or not sure

The difference between accuracy and precision is a well-known statistical subject.

In the data quality realm, accuracy is most often used to describe whether the data value corresponds correctly to a real-world entity. If we, for example, have the postal address of the person “Robert Smith” recorded as “123 Main Street in Anytown”, this data value may be accurate because this person (for the moment) lives at that address.

But if “123 Main Street in Anytown” has 3 different apartments each having its own mailbox, the value does not, for a given purpose, have the required precision.

If we work with geocoordinates, we have the same challenge. A given accurate geocode may have sufficient precision to tell the direction to the nearest supermarket, but not be precise enough to know in which apartment the out-of-milk smart refrigerator is.
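
As a rough illustration of precision with geocoordinates, the sketch below uses the rule of thumb that one degree of latitude is roughly 111 km, so each decimal place you keep (or drop) changes the positional uncertainty by a factor of ten. The figures are approximations, not survey-grade numbers:

```python
# Rule of thumb: one degree of latitude is roughly 111,320 metres, so each
# decimal place of a coordinate narrows the uncertainty by a factor of ten.
METRES_PER_DEGREE = 111_320

for decimals in range(1, 7):
    uncertainty_m = METRES_PER_DEGREE * 10 ** -decimals
    print(f"{decimals} decimal place(s) ~ {uncertainty_m:,.2f} m")

# Around 4 decimals (~11 m) may be precise enough to point at the nearest
# supermarket, but not to tell apartments in the same building apart.
```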

Timeliness / Currency – when time matters

Timeliness is most often used to state if a given data value is present when it is needed. For example, you need the postal address of “Robert Smith” when you want to send a paper invoice or when you want to establish his demographic stereotype for a campaign.

Currency is most often used to state if the data value is accurate at a given time – for example if “123 Main Street in Anytown” is the current postal address of “Robert Smith”.
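
A minimal sketch of the difference, with hypothetical fields: timeliness asks whether the address is present when we need it, while currency asks whether it is the right address at a given point in time:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class PostalAddress:
    value: Optional[str]       # None: the address has not been captured yet
    valid_from: date
    valid_to: Optional[date]   # None: still the current address

def is_timely(address: Optional[PostalAddress]) -> bool:
    # Timeliness: the value is present when we need it (e.g. at invoicing time).
    return address is not None and address.value is not None

def is_current(address: PostalAddress, as_of: date) -> bool:
    # Currency: the value is accurate at the given point in time.
    return address.valid_from <= as_of and (address.valid_to is None or as_of <= address.valid_to)

robert = PostalAddress("123 Main Street, Anytown", date(2015, 1, 1), None)
print(is_timely(robert), is_current(robert, date.today()))   # True True
```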

Uniqueness / Duplication – positive or negative

Uniqueness is the positive term and duplication is the negative term for the same issue.

We strive to have uniqueness by avoiding duplicates. In data quality lingo duplicates are two (or more) data values describing the same real-world entity. For example, we may assume that

  • “Robert Smith at 123 Main Street, Suite 2 in Anytown”

is the same person as

  • “Bob Smith at 123 Main Str in Anytown”
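
A minimal sketch of why these two records may be flagged as duplicates: after normalising nicknames and street abbreviations (the tiny lookup tables below are illustrative, not a complete rule set), the two descriptions collapse to the same comparison key:

```python
# Illustrative lookup tables - a real matching engine would use far richer
# rules, phonetic keys and/or a trained model.
NICKNAMES = {"bob": "robert"}
STREET_ABBREVIATIONS = {"str": "street"}
FILLER = {"at", "in", "suite", "2"}   # drop filler words and unit details

def comparison_key(record: str) -> str:
    tokens = []
    for token in record.lower().replace(",", " ").split():
        token = NICKNAMES.get(token, token)
        token = STREET_ABBREVIATIONS.get(token, token)
        if token not in FILLER:
            tokens.append(token)
    return " ".join(tokens)

a = "Robert Smith at 123 Main Street, Suite 2 in Anytown"
b = "Bob Smith at 123 Main Str in Anytown"
print(comparison_key(a) == comparison_key(b))   # True - candidate duplicate pair
```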

Completeness / Existence – to be, or not to be

Completeness is most often used to tell to what degree all required data elements are populated.

Existence can be used to tell whether the data elements needed for a given purpose are defined in a given dataset at all.

So “Bob Smith at 123 Main Str in Anytown” is complete if we need name, street address and city, but only 75% complete if we need name, street address, city and preferred colour – provided that preferred colour exists as a data element in the dataset.
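
The 75% figure is simply the share of required data elements that are populated. A minimal sketch, with illustrative field names:

```python
def completeness(record: dict, required_fields: list) -> float:
    # Share of the required data elements that are actually populated.
    filled = sum(1 for field in required_fields if record.get(field) not in (None, ""))
    return filled / len(required_fields)

bob = {"name": "Bob Smith", "street": "123 Main Str", "city": "Anytown"}

print(completeness(bob, ["name", "street", "city"]))                      # 1.0
print(completeness(bob, ["name", "street", "city", "preferred_colour"]))  # 0.75
```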


Human Errors and Data Quality

Every time there is a survey about what causes poor data quality the most ticked answer is human error. This is also the case in the Profisee 2019 State of Data Management Report where 58% of the respondents said that human error is among the most prevalent causes of poor data quality within their organization.

This topic was also examined some years ago in the post called The Internet of Things and the Fat-Finger Syndrome.

Errare humanum est

Even the Romans knew this, as Seneca the Younger said “errare humanum est”, which translates to “to err is human”. He also added “but to persist in error is diabolical”.

So, how can we avoid persisting in having human errors in data? Here are three main approaches:

  • Better humans: There is a whip called Data Governance. In a data governance regime you define data policies and data standards. You build an organizational structure with a data governance council (or any better name), have data stewards and data custodians (or any better title). You set up a business glossary. And then you carry on with a data governance framework.
  • Machines: Robotic Process Automation (RPA) has, besides operational efficiency, the advantage that machines, unlike humans, do not make mistakes when they are tired and bored.
  • Data Sharing: Human errors typically occur when typing in data. However, most data are already typed in somewhere. Instead of retyping data, and thereby potentially introducing your own misspelling or other mistake, you can connect to data that is already digitalized and validated. This is especially doable for master data, as examined in the article about Master Data Share and sketched below.
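
As a sketch of the data sharing idea – the reference register and its keys are entirely hypothetical – the capture process pulls an already validated address from a shared source instead of having a human retype it:

```python
# Hypothetical shared reference source - in practice a national address
# register, a business directory or a master data sharing service.
SHARED_ADDRESS_REGISTER = {
    "ANYTOWN-MAIN-123": {"street": "123 Main Street", "city": "Anytown", "postal_code": "1000"},
}

def capture_address(reference_key: str) -> dict:
    """Fetch an already validated address instead of having it retyped."""
    try:
        return dict(SHARED_ADDRESS_REGISTER[reference_key])
    except KeyError:
        raise ValueError(f"Unknown address reference: {reference_key}") from None

print(capture_address("ANYTOWN-MAIN-123"))
```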

10 Years

This blog has now been online for 10 years.

pont_du_gard
Pont du Gard

Looking back at the first blog posts, I think the themes touched upon are still valid.

The first post from June 2009 was about data architecture. 2000 years ago, the Roman writer, architect and engineer Marcus Vitruvius Pollio wrote that a structure must exhibit the three qualities of firmitas, utilitas, venustas – that is, it must be strong or durable, useful, and beautiful. This is true today – both in architecture and data architecture – as told in the post Qualities in Data Architecture.

A recurring topic on this blog has been a discussion around the common definition of data quality as being that the data is fit for the intended purpose of use. The opening of this topic was made in the post Fit for what purpose?

brueghel-tower-of-babel
Tower of Babel by Brueghel

Diversity in data quality has been another repeating topic. Several old tales, including in Genesis and the Qur’an, tell of a great tower built by mankind at a time when all people spoke a single language. Since then, mankind has been confused by having multiple languages. And indeed, we still are, as pondered in the post The Tower of Babel.

Thanks to all who are reading this blog and not least to all who from time to time take the time to comment, like and share.

greatbeltbridge
Great Belt Bridge

Data Quality and the Climate Issue

The similarities between raising awareness of data quality issues and the climate issue were touched upon 10 years ago here on this blog in the post Data Quality and Climate Politics.

The challenges are still the same.

Many examples have been published picturing the results of climate change. A recent one is the image from Greenland showing huskies pulling sleds not over the usual ice, but through water.

Greenland-melting-ice-sheet-0613-01-exlarge-169

(Image taken by Steffen Malskær Olsen, @SteffenMalskaer, here published on CNN)

We also see statistics showing a development towards melting ice masses with rising sea levels as the foreseeable result. However, statistics can always be questioned. Is the ice thickening somewhere else? Has this happened many times before?

These kinds of questions show the layers we must go through to get from data quality to information quality, then decision quality, and on top the wisdom of applying the right knowledge, whether that is to achieve business outcomes or to avoid climate change.

DIKW data quality

 

Marathon, Spartathlon and Data Quality

Tomorrow there is a Marathon race in my home city, Copenhagen. Eight years ago, a post on this blog revolved around some data quality issues connected with the Marathon race. The post was called How long is a Marathon?

Marathon
Pheidippides at the end of his Marathon race in a classic painting

However, another information quality issue is whether there ever was a first Marathon race run by Pheidippides. Historians today do not think so. It has something to do with data lineage. The written mention of the 42.195 (or so) kilometre effort from Marathon to Athens by Pheidippides is from Plutarch, whose record was made some 500 years after the events. The first written source about the Battle of Marathon is from Herodotus. It was written (in a historian’s perspective) only 40 years after the events. He did not mention the Marathon run. However, he wrote that Pheidippides ran from Athens to Sparta. That is 245 kilometres.

By the way: His mission in Sparta was to get help. But the Spartans did not have time. They were in the middle of an SAP roll-out (or something similarly festive).

Some people run the 245-kilometre route in what is called a Spartathlon. In a data and information quality context, this reminds me that improving data quality, and thereby information quality, is not a sprint. Not even a Marathon. It is a Spartathlon.

 

Machine Learning, Artificial Intelligence and Data Quality

Using machine learning (ML) and then artificial intelligence (AI) to automate business processes is a hot topic and on the wish list at most organizations. However, many, including yours truly, warn that automating business processes based on data with data quality issues is a risky thing.

In my eyes, we need to take a phased approach and use ML and AI twice to ensure the right business outcomes from AI-automated business processes. ML and AI can be used to rationalize data and overcome data quality issues, as exemplified in the post The Art in Data Matching.

Instead of applying ML and AI to the dirty dataset at hand for a given business process, the right way will be to first use ML and AI to understand and assess relevant datasets within the organization, and then use the rationalized data, which can be understood by machines, for sustainable automation of business processes.
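
As a toy illustration of that phased approach – with a simple stand-in rule where the real thing would be ML-driven matching – the data is rationalized first, and only the rationalized records feed the automated process:

```python
# Illustrative two-step pipeline: a stand-in rule plays the role of ML-driven
# matching, and only the rationalized data feeds the automated process.
raw_customers = [
    {"name": "Robert Smith", "city": "Anytown"},
    {"name": "Bob Smith", "city": "Anytown"},   # likely duplicate of the record above
]

def rationalize(records):
    """Stand-in for ML/AI-driven matching and standardization."""
    seen, rationalized = set(), []
    for record in records:
        key = (record["name"].replace("Bob", "Robert"), record["city"])  # toy matching rule
        if key not in seen:
            seen.add(key)
            rationalized.append({"name": key[0], "city": key[1]})
    return rationalized

def automate_process(records):
    """Stand-in for the AI-automated business process running on clean data."""
    for record in records:
        print(f"Sending campaign to {record['name']} in {record['city']}")

automate_process(rationalize(raw_customers))   # prints one line, not two
```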

ML AI DQ

Most of these rationalized data will be master data, where forward-looking vendors are moving to include ML and AI in Master Data Management solutions, as examined in the post Artificial Intelligence (AI) and Master Data Management (MDM).

Looking at The Data Quality Tool World with Different Metrics

The latest market report on data quality tools from Information Difference is out. In the introduction to the data quality landscape Q1 2019, this example of the consequences of a data quality issue is mentioned: “Christopher Columbus accidentally landed in America when he based his route on calculations using the shorter 4,856 foot Roman mile rather than the 7,091 foot Arabic mile of the Persian geographer that he was relying on.”
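
As a tiny worked example of the unit mix-up behind that quote, using only the two mile definitions the quote itself provides:

```python
ROMAN_MILE_FEET = 4_856
ARABIC_MILE_FEET = 7_091

# Every mile stated in Arabic miles but calculated with the Roman mile is
# understated by the difference between the two definitions.
shortfall_per_mile = ARABIC_MILE_FEET - ROMAN_MILE_FEET
print(f"Shortfall per mile: {shortfall_per_mile} feet "
      f"(a Roman mile is only {ROMAN_MILE_FEET / ARABIC_MILE_FEET:.0%} of an Arabic mile)")

# Illustrative only: over a 1,000 mile leg the estimate comes up short by
# roughly 2.2 million feet - about 680 km.
print(f"Over 1,000 miles: {1_000 * shortfall_per_mile:,} feet short")
```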

Information Difference has the vendors on the market plotted this way:

Information Difference DQ Landscape Q1 2019

As reported in the post Data Quality Tools are Vital for Digital Transformation, Gartner also recently published a market report with vendor positions. The two reports are, in terms of evaluating vendors, like Roman and Arabic miles: same same but different, and they may bring you to a different place depending on which one you choose to use.

Vendors evaluated by Information Difference but not by Gartner include the veteran solution providers Melissa and Datactics. On the other side, Gartner has evaluated, for example, Talend, Information Builders and Ataccama. Gartner’s evaluation is more spread out than Information Difference’s, where most vendors are rated almost equally.

PS: If you need any help in your journey across the data quality world, here are some Popular Offerings.