What’s New in The Data Quality Magic Quadrant?

The Gartner Magic Quadrant for Data Quality Tools 2013 is out. If you don’t want to pay Gartner’s fee for a look, you can sign up for a free copy on one of the vendors’ websites, for example at Trillium Software Insights.

So, what’s new this year?

It is pretty much the same picture as last year, with X88 as the only new intruder. Otherwise, the news is that some vendors “now appear under slightly different names” and that Ted Friedman is now the sole author.

The most exciting part, in my eyes, is the section on how the market will develop. Some seen and foreseen trends are:

  • Information governance programs drive the need for data quality tools.
  • Cloud based deployments are gaining traction.
  • Growth is expected in embracing less-structured data, not least social data, by using big data techniques and sources.

That’s good news.

Data Quality Tools

Famous False Positives

You should Beware of False Positives in Data Matching. A false positive in the data quality realm is a match of two (or more) identities that do not actually represent the same real-world entity.
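As an illustration (my own hedged sketch, not from any particular tool, and the records are hypothetical): a naive string similarity score can easily pair two records that belong to different real-world persons, for example a father and son at the same address.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical records: very similar strings, possibly two different persons
record_a = "John Smith, 12 High Street, London"
record_b = "Jon Smith, 12 High Street, London"

score = similarity(record_a, record_b)
# A naive threshold rule would merge these two records -- a false positive
# if Jon is actually John's son living at the same address.
is_match = score > 0.9
```

This is exactly why serious matching goes beyond string similarity and pulls in more evidence, such as birth dates or external identifiers, before declaring a match.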

Throughout history and within art we have seen some false positives too. Here are my three favorites:

The Piltdown Man

In 1912 a British amateur archaeologist apparently found a fossil claimed to be the missing link between apes and man: the so-called Piltdown Man. Backed by the British Museum, it stood as a true discovery until 1953, when it was finally revealed as a hoax. It was disputed throughout those years but defended by the British establishment, perhaps out of envy of the French, on whose soil the first Cro-Magnon man was found, and of the Germans, whose name-giving true discovery was made in the Neandertal.

Eventually the Piltdown Man was exposed as a combination of a medieval human upper skull, an orangutan jawbone and chimpanzee teeth.

Image: Barry Nelson as Jimmy Bond in Casino Royale, 1954

James and Jimmy Bond

As told in the post My Name is Bond. Jimmy Bond: James Bond is a British intelligence agent and Jimmy Bond is an American agent. Whether two identities residing in different countries are the same is always a question, as discussed (about me) in the post Hello Leading MDM Vendor.

Dupond et Dupont

In English they are known as Thomson and Thompson. In the original Belgian/French (and, in my childhood, Danish) comics about the adventures of Tintin, they are known as Dupond et Dupont. They are two incompetent detectives who look alike and have names with a low edit distance and the same phonetic sound. For their twin names in a lot of other languages, check the Wikipedia article here.
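The two detective names are a textbook data matching case. As a hedged sketch (my own illustrative code, not from any specific tool), edit distance catches Thomson/Thompson, while a simplified Soundex encoding catches Dupond/Dupont:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character edits."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def soundex(name: str) -> str:
    """Simplified Soundex: first letter plus up to three digit codes."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    result, prev = [], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result.append(code)
        prev = code
    return (name[0].upper() + "".join(result) + "000")[:4]

print(levenshtein("Thomson", "Thompson"))    # 1: one inserted letter
print(soundex("Dupond"), soundex("Dupont"))  # the same code for both
```

Note how the two techniques complement each other: the phonetic code differs for Thomson/Thompson, while the edit distance alone wouldn't tell you that Dupond and Dupont sound identical, which is why matching tools typically combine several such measures.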

And hey, today I’m going to Belgium, home country of the creator of these two guys, to attend the Belgian Data Quality Association congress tomorrow.

Entity Resolution and Big Data

The Wikipedia article on Identity Resolution has this take on the difference between good old data matching and entity resolution:

”Here are four factors that distinguish entity resolution from data matching, according to John Talburt, director of the UALR Laboratory for Advanced Research in Entity Resolution and Information Quality:

  • Works with both structured and unstructured records, and it entails the process of extracting references when the sources are unstructured or semi-structured
  • Uses elaborate business rules and concept models to deal with missing, conflicting, and corrupted information
  • Utilizes non-matching, asserted linking (associate) information in addition to direct matching
  • Uncovers non-obvious relationships and association networks (i.e. who’s associated with whom)”
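The third factor, asserted linking, can be sketched with a tiny union-find structure (a hedged illustration of mine; the record ids and the external register are hypothetical): two records that never match directly can still resolve to the same entity through a link asserted by an external source.

```python
class EntityResolver:
    """Minimal union-find over record ids; links may come from
    direct matching or from asserted (externally provided) links."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def link(self, a, b):
        self.parent[self.find(a)] = self.find(b)

resolver = EntityResolver()
resolver.link("rec1", "rec2")  # direct match: similar name and address
resolver.link("rec2", "rec3")  # asserted link: same id in an external register
# rec1 and rec3 never matched directly, yet they resolve to one entity
```

The transitive grouping is the point: entity resolution treats matches and asserted links as edges in a graph and takes the connected components as entities, rather than deciding each record pair in isolation.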

I have a gut feeling that data matching and entity (or identity) resolution will melt together in the future, as expressed in the post Deduplication vs Identity Resolution.

If you look at the above-mentioned factors that distinguish entity resolution from data matching, some of the often mentioned features of new big data technology shine through:

  • Working with unstructured and semi-structured data is probably the most mentioned difference between working with small data versus working with big data.
  • Working with associations is a feature of graph databases and similar technologies, as mentioned in the post Will Graph Databases become Common in MDM?

So, in the quest to expand small data matching into entity (or identity) resolution, we will be helped by general developments in working with big data.

Data Quality, Real World Alignment and Visualization by Maps

Babbling about data quality, real world alignment and maps is a regular topic on this blog and this Saturday is no exception.

This week I stumbled on a discussion in the “Data, Data, Data” community on Google Plus. There was a map:

Image: Internet population 2011 hex cartogram

The map visualizes how the world would look if every internet user had an equal amount of space to live on. This gives the land masses on the earth a different shape than in reality, given:

  • Population density
  • Internet penetration

As internet penetration is the main purpose of the map, the penetration percentage for each country is highlighted by color in order to be fit for the purpose of use, showing the highest penetration in Canada, Northern Europe, Qatar, South Korea and New Zealand.

Some countries seem to have disappeared from the planet, as mentioned in the comments on Google Plus: Singapore, Taiwan (officially the Republic of China) and North Korea (officially the Democratic People’s Republic of Korea). The latter has probably gone because of no data or no users. Well, probably both.

On a side note, it’s a bit peculiar that countries on the map are labeled with the ISO 3166 alpha-3 codes rather than the alpha-2 codes, which more closely resemble country domains on the internet.

Hello Leading MDM Vendor

This morning I received messages from a leading MDM vendor about an upcoming webinar on the 12th of September.

INFA 01

As today is the 3rd of October, this is strange, and the vendor of course sent out a correction later in the day:

INFA 02

That’s OK. Shit happens. Even in data quality and MDM vendors’ marketing departments.

I am probably a kind of strange person, having been living in two countries lately, so I got both the original message and the correction sent to my Scandinavian identity from the vendor’s Scandinavian body:

INFA 03

As well as to my UK identity from the vendor’s UK body:

INFA 04

That’s OK. Getting a 360 degree view of migrating persons is difficult as discussed in the post 180 Degree Prospective Customer View isn’t Unusual.

Both (double) messages have a salutation.

UK:

INFA 05

Scandinavian:

INFA 06

Being Mr. Sorensen in the UK is OK. Using Mister and the surname fits with an English stiff upper lip, and The Letter ø could be rendered as o in the English alphabet.

I’m not sure if Dear Mr. Sørensen is OK in a Scandinavian context. Hello Henrik would be a better fit.

Big Data Veracity

Veracity is often mentioned as the 4th V of big data besides Volume, Velocity and Variety.

While veracity is of course paramount for a data quality geek like me, veracity is a different kind of thing compared to volume, velocity and variety: those three terms define big data, while veracity is more a desirable capacity of big data. This argument is often prompted by Doug Laney of Gartner (the analyst firm), who is behind the volume, velocity and variety concept, which was also coined as Extreme Data at some point.

Doug Laney on Veracity
Comment in discussion on the Big Data Quality LinkedIn group

As mentioned in the post Five Flavors of Big Data, the challenges with data quality – or veracity – are very different across the various types of big data. If I were to order the mentioned types of big data, I would say that the veracity challenges increase in this order, going from some challenges to huge challenges:

  • Big reference data
  • Big transaction data
  • Web logs
  • Sensor data
  • Social data

It’s interesting that you may say variety has the same increasing order, but volume and velocity don’t necessarily follow it, apart from the fact that big reference data is less challenging in all respects and therefore maybe isn’t big data at all. However, I like it to be. That is because big reference data, in my eyes, will play a big role in solving the veracity challenge for the other types of big data.

Why don’t MDM Implementations Stick?

Former Gartner (the analyst firm) MDM guru John Radcliffe has established his own business and blog and started off by revealing some dirty secrets about how sticky MDM implementations are. Quote:

“Another interesting thing was something that we found during Magic Quadrant reference checking. Increasingly the initial MDM champion, who made the business case, chose the software and led the MDM program had now moved on. The new guy (or gal) in the role often didn’t have the same enthusiasm (putting it politely) for MDM generally, for the MDM software that was installed or for the incumbent MDM software supplier.”

You may read John Radcliffe’s blog here.

A pretty bad review of MDM vendors’ merits indeed. But, as I have experienced during several decades in the IT business, this observation could probably be made not only in the MDM realm.

However, it would be good to learn how MDM implementations could be made stickier. What are MDM implementations missing? Is it:

  • The functionality in MDM solutions that needs improvement?
  • The often massive consultancy that comes with an MDM tool not meeting expectations?
  • Enterprises not actually being ready for MDM?

My take is: all of the above, in the order mentioned. What is your take?

Somehow Deduplication won’t Stick

18 years ago I cruised into the data quality realm when making my first deduplication tool. Back then it was an attempt to solve a business case for two companies that were considering merging and wanted to know the intersection of their customers. So far, so good.

Since then I have worked intensively with deduplication and other data matching tools and approaches and also co-authored a leading eLearning course on the matter as seen here.

Deduplication capability is a core feature of many data quality tools, and indeed the most mentioned data quality pain is probably lack of uniqueness, not least in party master data management.

However, in my experience most deduplication efforts don’t stick. Yes, we can process a file ready for direct marketing and purge the messages that might end up in the same offline or online inbox despite spelling differences. But taking it from there and using the techniques to achieve a single customer view is another story. Some obstacles are:

In the comments to the latter 3-year-old post, the intersection (and non-intersection) of entity resolution and Master Data Management (MDM) was discussed.

During my latest work I have become more and more convinced that achieving a single view of something is a lot about entity resolution, as expressed in the post The Good, Better and Best Way of Avoiding Duplicates.
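The direct marketing purge mentioned above can be sketched as a simple pipeline: normalize each record into a candidate match key (here with a crude Scandinavian-to-English letter mapping), group on the key and keep one record per group. This is my own hedged illustration with made-up records; real tools use far more sophisticated matching and survivorship rules.

```python
import re
from collections import defaultdict

def match_key(name: str, postal_code: str) -> str:
    """Crude candidate key: simplified lowercase letters plus postal code."""
    simplified = name.lower().replace("ø", "o").replace("æ", "ae").replace("å", "aa")
    return re.sub(r"[^a-z]", "", simplified) + "|" + postal_code.strip()

records = [
    {"name": "Henrik Sørensen", "postal_code": "2100"},
    {"name": "Henrik Sorensen", "postal_code": "2100"},  # spelling variant
    {"name": "Hanne Sørensen", "postal_code": "2100"},
]

groups = defaultdict(list)
for rec in records:
    groups[match_key(rec["name"], rec["postal_code"])].append(rec)

# Keep the first record of each group as the survivor
deduplicated = [group[0] for group in groups.values()]
print(len(deduplicated))  # 2: the two Henriks collapse into one
```

A one-off purge like this is exactly what doesn’t stick: the moment new records flow in, the merge decisions and survivorship choices are lost unless they live in an ongoing single-customer-view process.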

Will Graph Databases become Common in MDM?

One of my pet peeves in data quality for CRM and ERP systems is the commonly used way of looking at entities, not least party entities, in a flat data model, as told in the post A Place in Time.

Party master data, and related location master data, will eventually be modeled in very complex models, and surely we see more and more examples of that. For example, I remember that a long time ago I worked with the ERP system that later became Microsoft Dynamics AX. Back then I had issues with its simplistic and role-unaware data model. As I’m currently working on a project using the AX 2012 Address Book, it’s good to see that things have certainly developed.

This blog has quite a few posts on hierarchy management in Master Data Management (MDM) and even Hierarchical Data Matching. But I have to admit that even complex relational data models and hierarchical approaches in fact don’t align completely with the real world.

In a comment to the post Five Flavors of Big Data, Mike Ferguson asked about graph data quality. In my eyes, using graph databases in master data management will indeed bring us closer to the real world and thereby deliver better data quality for master data.
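As a minimal sketch of the idea (all node, edge and property names here are purely illustrative): in a property graph, parties and locations become nodes, while the roles that flat models struggle with become typed, attributed edges.

```python
# A tiny property graph as plain Python structures: labeled nodes
# and typed edges that each carry their own properties.
nodes = {
    "p1": {"label": "Person", "name": "Henrik"},
    "c1": {"label": "Company", "name": "Example Ltd"},
    "a1": {"label": "Address", "street": "12 High Street"},
}
edges = [
    ("p1", "EMPLOYEE_OF", "c1", {"since": 2011}),
    ("p1", "RESIDES_AT", "a1", {"role": "home"}),
    ("c1", "LOCATED_AT", "a1", {"role": "office"}),  # same address, new role
]

def related(node_id, edge_type):
    """All targets reachable from node_id via a given edge type."""
    return [t for s, e, t, _ in edges if s == node_id and e == edge_type]
```

The point is that the same address node can play different roles for different parties without duplicating the address, which is exactly where flat CRM and ERP models fall short of the real world.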

I remember that at this year’s MDM Summit Europe, Aaron Zornes suggested that a graph database would be the best choice for reflecting the most basic reference dataset: the country list. And in master data too, you should think, though I doubt that relational databases and hierarchy management will be out of fashion for a while.

So it would be good to know if you have seen or worked with graph databases in master data management beyond representing a static analysis result as a graph.

Image: a property graph, from the Wikipedia article on graph databases

Undertaking in MDM

In the post Last Time Right, the bad consequences of not handling the fact that one of your customers isn’t among us anymore were touched upon.

This sad event is a major trigger in party master data lifecycle management, like The Relocation Event I described last week.

In the data quality realm, handling so-called deceased data has largely been about suppression services in direct marketing. But as we develop more advanced master data services, handling the many aspects of the deceased event emerges as an important capability.

Like with relocation you may learn about the sad event in several ways:

  • A message from relatives
  • Subscription to external reference data services, which will be different from country to country
  • Investigation upon returned mail via postal services

Apart from Business-to-Consumer (B2C) activities, the deceased event also has relevance in Business-to-Business (B2B), where we may call it the dissolved event.

One benefit of having central master data management functionality is that every party role and related business process can be notified about the status, which may trigger a workflow.

One area where I have worked with handling this situation is public transit, where subscriptions to public transport services are cancelled upon learning about a death, thus lifting some burden from relatives and also avoiding processes for paying back money.

Right now I’m working with data stewardship functionality in the instant Data Quality MDM Edition, where the relocation event, the deceased event and other important events in party master data lifecycle management must be supported by functionality embracing external reference data and internal master data.
