Quality of Data Behind the Data Quality Magic Quadrant

Last week the Gartner Magic Quadrant for Data Quality Tools was published. You may have a free look through some of the vendors’ sites. For example, SAP has a link here.

I’m not going into who are leaders, visionaries, challengers or niche players. I’m a bit puzzled about who is in there at all.

We may look at two UK based vendors:

  • Datactics has a good position among the niche players
  • Experian QAS is not in the quadrant, but is mentioned among the vendors not meeting the inclusion criteria

If you look up Datactics on LinkedIn there are 14 employees there. If you look up Experian QAS UK on LinkedIn there are 369 employees there (and QAS has subsidiaries around the world too). This balance of strength resembles what I know from business directories.

Now, the inclusion criteria set up by Gartner may make a lot of sense, but I find it strange that they so obviously fail to reflect market reality.

Please find more information about how another analyst includes players (compared to Gartner) in the post The Data Quality Tool Vendor Difference.

Developing LEGO® bricks and SOA components

These days the Lego company is celebrating 80 years in business. The celebration includes a YouTube video telling The LEGO® Story.

As I was born close to the Lego home in Billund, Denmark, I also remember having a considerable amount of Lego bricks to play with as a child in the 60’s.

In computer software, Lego bricks are often used as a metaphor for building systems with Service Oriented Architecture (SOA) components, as discussed for example in this article called Can SOA and architecture really be described with ‘Lego blocks’?

Today using SOA components in order to achieve data quality improvement with master data is a playground for me.

As described in the post Service Oriented Data Quality, SOA components have a lot to offer:

• Reuse is one of the core principles of SOA. Having the same data quality rules applied to every entry point of the same sort of data will help with consistency.

• Interoperability will make it possible to prevent data quality issues as close to the root as possible.

• Composability makes it possible to combine functionality with different advantages – e.g. combining internal checks with external reference data.
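To make the reuse and composability points a bit more concrete, here is a minimal sketch in Python. It is not taken from any particular SOA product: the rule names, the record fields and the tiny city list standing in for external reference data are all assumptions made up for illustration.

```python
from typing import Callable, Dict, Iterable, List

# A rule takes a record and returns a list of issues (empty when the record is clean).
Rule = Callable[[Dict[str, str]], List[str]]


def require_postal_code(record: Dict[str, str]) -> List[str]:
    """Internal check: the record must carry a non-empty postal code."""
    if record.get("postal_code", "").strip():
        return []
    return ["missing postal code"]


def make_reference_check(known_cities: Iterable[str]) -> Rule:
    """External check: build a rule from a reference set of city names."""
    known = {city.casefold() for city in known_cities}

    def check(record: Dict[str, str]) -> List[str]:
        city = record.get("city", "")
        return [] if city.casefold() in known else [f"unknown city: {city!r}"]

    return check


def compose(*rules: Rule) -> Rule:
    """Composability: combine several rules into one rule."""
    def combined(record: Dict[str, str]) -> List[str]:
        issues: List[str] = []
        for rule in rules:
            issues.extend(rule(record))
        return issues

    return combined


# Hypothetical reference data; a real service would call an external source.
validate_party = compose(
    require_postal_code,
    make_reference_check({"Billund", "London"}),
)


# Reuse: the same rule is applied at every entry point for this kind of data.
def web_form_entry(record: Dict[str, str]) -> List[str]:
    return validate_party(record)


def batch_import(records: Iterable[Dict[str, str]]) -> List[List[str]]:
    return [validate_party(record) for record in records]
```

The point of the sketch is only that both entry points call the very same composed rule, so a web form and a batch import cannot drift apart in what they accept.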

We Need Better Search

Often we have all the information we need. What we don’t have is the right means to search in and make sense of all the information.

It’s now been a little more than a year since the terrible terrorist attacks in Norway carried out by a right-wing extremist.

Since then an investigation has been carried out in order to find out if the tragic incident could have been avoided. A report is due tomorrow, but bits and pieces are already appearing in the press now.

Today the Norwegian newspaper Aftenposten has an article about the inadequate searching features available to the Norwegian Police Intelligence. Article in Norwegian here.

As I understand it the Police Intelligence did have a few registrations about suspicious activities by the terrorist. Probably not enough to act upon before the tragedy. But even if they had more information they wouldn’t have been able to match it with the technology available and prevent the attacks.

It’s a shame.

Olympic Darlings and Big Data Experts

The Olympic Games produce two kinds of darlings.

One kind is the big winners such as Usain Bolt and Michael Phelps.

The other kind is the big losers. As reported in the post Olympic Moments the 1988 Winter Games had the Brit “Eddie the Eagle” in ski jumping. The 2000 Sydney Summer Games had the swimmer Eric “The Eel” Moussambani. The 2012 London Summer Games now has Hamadou Djibo Issaka in rowing.

The ski jumper Eddie the Eagle came from a country that hates snow and comes to a full stop at the first sight of the white fluffy stuff from above. The rower Hamadou Djibo Issaka comes from Niger, a country almost only covered by desert.

Such bravery in competing way out of your comfort zone naturally brings me to the subject of big data experts.

A while ago I noticed a tweet by Neil Raden:

Oh yes. It’s amazing how many big data experts we have seen emerging in the short life of the big data buzz.

Searching for Data Quality (and Decency)

As I have mentioned here on the blog (and maybe even too often) I am right now involved in making the roadmap for and promoting a tool for getting better data quality by searching and mashing up available external information in the cloud and in internal master databases.

The tool is called iDQ (instant Data Quality).

In promoting such a solution we are interested in engaging in a dialogue with people who are searching for data quality.

So are a lot of other vendors in the data quality tool market of course.

In that quest vendors are looking for a better ranking in search engines when people are searching for data quality, data cleansing and similar terms.

An often used technique for that is link building. Here you (over) use the terms data quality, data cleansing and so on, and every time you make a link from the term to your home page.

Examples are the blog posts from DQ Global and an endless stream of data quality news from Experian QAS.

However, some vendors’ link building is done not only on their own blogs and news lists but also on other sites, for example by making comments on this blog.

Examples are this one linking to Experian QAS and this one linking to HelpIT.

It is my impression that these comments are made by SEO agencies hired by the vendors. The agencies make comments with a random name like in these cases “Smith” (ah, John Smith, I know him) and “Peter Parker” (or is it Spider-Man).

Methinks: This may help promote tools when searching for data quality. But it doesn’t help with finding decency.

Hot and Magic Medal Counting

In the ongoing Olympic Games one often displayed list is the list of medals per nation.

The list reminds me of the occasional analyst report ranking Data Quality tools and Master Data Management (MDM) solutions. The latest one is freshly pressed, as told in the post called Product Information Management is HOT for Business by Ventana Research, where the PIM vendors are ranked with Stibo Systems being the most HOT.

The counting of medals in the Olympic Games in London this afternoon looks like this:

As expected, the top race is between the big teams from the United States and China, just as the mega vendors of tools also always receive good rankings from analysts, though with a few exceptions as reported in the post The Data Quality Tool Vendor Difference, where the Gartner MAGIC Quadrant is compared with the ranking from Information Difference.

As often seen, the home team, Great Britain and Northern Ireland, is also doing very well. With tools we also see that Most Times the Home Team Wins, despite analyst rankings, when a local client selects a tool.

Other big teams such as Russia, Japan and Australia are currently struggling to get more gold medals to climb the list if ranked by gold (instead of total number of medals). Perhaps we will see a closer race with more teams in the last week, just as expected with MDM tools as reported in the post Photo Finish in MDM Vendor Race.

The smaller nations often do better in a small range of disciplines, like Ethiopia in running and Denmark in rowing and sailing. This resembles the situation described in the post Who is not Using Data Quality MAGIC, as there are plenty of Data Quality tools out there that are very feasible for certain tasks and local circumstances.

Naming the Olympians

The British newspaper The Guardian has a feature on their website where you can get data about the Olympians. Link here: London 2012 Olympic athletes: the full list.

Browsing the list is a good reminder of the world-wide diversity we have with person names.

The names are here formatted with the surname(s) followed by the given name(s). The surname is in upper case.

This sequence of names is what the Chinese and other East Asian Olympians are used to, as opposed to Olympians from places where the first name is the given name and the last name is the surname.

Having the surname in upper case also shows where Olympians have two surnames, as is the custom in Spanish cultures.
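The convention of surname(s) in upper case followed by given name(s) can be turned back into separate fields with very little code. The sketch below is my own illustration, not how The Guardian prepared the list, and the function name is made up. It assumes every surname token is fully upper case, so it will stumble on surnames with mixed case and on given names that consist of a single capital letter.

```python
from typing import Tuple


def split_olympian_name(full_name: str) -> Tuple[str, str]:
    """Split 'SURNAME(S) Given name(s)' into (surnames, given names).

    Leading tokens written entirely in upper case are taken as surnames;
    everything from the first other token onwards is taken as given names.
    """
    tokens = full_name.split()
    surname_tokens = []
    for token in tokens:
        if token.isupper():
            surname_tokens.append(token)
        else:
            break
    given_tokens = tokens[len(surname_tokens):]
    return " ".join(surname_tokens), " ".join(given_tokens)


print(split_olympian_name("JIANG Wenwen"))        # ('JIANG', 'Wenwen')
# An invented name, only to show the two-surname case:
print(split_olympian_name("GARCIA LOPEZ Maria"))  # ('GARCIA LOPEZ', 'Maria')
```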

And oh yes. The South African guy has JIM as his surname.

Finally, from this screen shot there is a good question. Is JIANG Wenwen superb at both synchronized swimming and track cycling – or is it two different Olympians with the same name? Some names are very common in China. A little googling tells me it is two different persons. The synchronized swimmer is more related to her twin sister and swimming partner JIANG Tingting.

Let’s check if there is more than one “John Smith”.

Nope.

But it could be fun if “Kim Smith” and “Kimberley Smith” came from the same country.

Many Olympians actually don’t have the names reflected in this sheet as many have names in a different alphabet or script system.

The Danish cyclist “SORENSEN Nicki” actually shares my last name, as we know him as “Nicki Sørensen”. The Serbian, Ukrainian and Russian Olympians have their original names in the Cyrillic alphabet, but these have been transliterated to the English alphabet, and Olympians from countries with script systems other than an alphabet have had their names transcribed to the (English) alphabet.
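When matching names such as “Sørensen” against “SORENSEN”, a common trick is to fold both to a plain comparison key. Here is a small sketch using Python's standard unicodedata module. The function names and the little table of special letters are my own assumptions: the table is needed because a letter like “ø” has no Unicode decomposition, and the spellings chosen in it are just one possible convention. The sketch does nothing for Cyrillic or other scripts, which need proper transliteration.

```python
import unicodedata

# Letters without a Unicode decomposition must be mapped explicitly.
# This table is an illustrative assumption, not a complete standard.
SPECIAL_LETTERS = str.maketrans({
    "ø": "o", "Ø": "O",
    "æ": "ae", "Æ": "AE",
    "ß": "ss",
})


def fold_diacritics(name: str) -> str:
    """Replace special letters, then strip combining accent marks."""
    replaced = name.translate(SPECIAL_LETTERS)
    decomposed = unicodedata.normalize("NFKD", replaced)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_key(surname: str) -> str:
    """Upper-case comparison key, in the style of the Olympic list."""
    return fold_diacritics(surname).upper()


print(match_key("Sørensen"))  # SORENSEN
print(match_key("Müller"))    # MULLER
```

A key like this is only for comparing; the original spelling should of course be kept as the name itself.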

So, is the list bad data quality?

Photo Finish in MDM Vendor Race

With the London Olympics going on we will probably see a lot of winners after a photo finish.

I noticed another photo finish in a recent analyst report called The MDM Landscape Q2 2012 by the Information Difference.

The MDM (Master Data Management) vendors are scored by technology and market strength. If we look at the technology axis – the vertical one, there is a close race.

Orchestra shared the victory on twitter:

Kalido was also mentioned on twitter:

The linked press release from Kalido has a subtitle telling that Kalido was in front of the megavendors.

As mentioned in the report, the vendors are actually not competing in the exact same discipline. Some vendors’ MDM offerings are part of a larger suite, some vendors focus on a single domain (like product) or industry, and some vendors are generalists embracing multi-domain MDM.

This situation is also why another analyst firm, Gartner, has two magic quadrants for MDM vendors: One for customer MDM and one for product MDM.

However, the trend is that more and more vendors are going towards multi-domain MDM. I know that for sure as I have been involved in one of the product MDM specialists’ journeys within multi-domain MDM.

So we could expect an even closer match in the Multi-Domain MDM race in the years to come.

Probabilistic Learning in Data Matching

One of the techniques in data matching I have found most exciting is using machine learning techniques such as probabilistic learning, where manually inspected results of previous automated matching are used to make the automated matching results better in the future.

Let’s look at an example. Below we are comparing two data rows with legal entities from Argentina:

The names are a close match (colored blue) as we have two swapped words.

The street addresses are an exact match (colored green).

The places are a mismatch (colored red).

All in all we may have a dubious match to be forwarded for manual inspection. This inspection may, based on additional information or other means, end up confirming these two records as belonging to the same real-world legal entity.

Later we may encounter the two records:

The names are a close match (colored blue).

The street addresses are an exact match (colored green).

The places are basically a mismatch, but as we are learning that “Buenos Aires” and “Capital Federal” may be the same, it is now a close match (colored blue).

All in all we may have a dubious match to be forwarded for manual inspection. This inspection may, based on additional information or other means, end up confirming these two records as belonging to the same real-world legal entity.

In a next match run we may meet these two records:

The names are an exact match (colored green).

The street addresses are an exact match (colored green).

The places are basically a mismatch, but as we are consistently learning that “Buenos Aires” and “Capital Federal” may be the same, it is now an exact match (colored green).

We have a confident automated match with no need of costly manual inspection.
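The three match runs above can be mimicked with a few lines of Python. Please take this as a toy sketch only: a simple counter of manual confirmations stands in for a real probabilistic model, and the class name, the thresholds (one confirmation gives a close match, two give an exact match) and the decision rule are all assumptions of mine chosen to reproduce the example.

```python
from collections import Counter

EXACT, CLOSE, MISMATCH = "exact", "close", "mismatch"


class PlaceSynonymLearner:
    """Learns from manual inspections that two place names may be the same."""

    def __init__(self, close_after: int = 1, exact_after: int = 2):
        self.confirmations = Counter()
        self.close_after = close_after
        self.exact_after = exact_after

    @staticmethod
    def _key(place_a: str, place_b: str) -> frozenset:
        # Order and letter case of the two values should not matter.
        return frozenset({place_a.casefold(), place_b.casefold()})

    def confirm(self, place_a: str, place_b: str) -> None:
        """Record that a manual inspection confirmed a match with these places."""
        self.confirmations[self._key(place_a, place_b)] += 1

    def compare(self, place_a: str, place_b: str) -> str:
        """Grade the place comparison using what has been learned so far."""
        if place_a.casefold() == place_b.casefold():
            return EXACT
        seen = self.confirmations[self._key(place_a, place_b)]
        if seen >= self.exact_after:
            return EXACT
        if seen >= self.close_after:
            return CLOSE
        return MISMATCH


def decide(name: str, street: str, place: str) -> str:
    """Assumed rule: all exact is automatic, one weak spot goes to manual review."""
    results = (name, street, place)
    if all(result == EXACT for result in results):
        return "automatic match"
    if results.count(MISMATCH) <= 1:
        return "manual inspection"
    return "no match"


learner = PlaceSynonymLearner()

# First run: the places are a mismatch, the pair goes to manual inspection.
print(decide(CLOSE, EXACT, learner.compare("Buenos Aires", "Capital Federal")))
learner.confirm("Buenos Aires", "Capital Federal")

# Second run: the places are now a close match, still manual inspection.
print(decide(CLOSE, EXACT, learner.compare("Buenos Aires", "Capital Federal")))
learner.confirm("Buenos Aires", "Capital Federal")

# Third run: the places now count as exact, so the match is automatic.
print(decide(EXACT, EXACT, learner.compare("Capital Federal", "Buenos Aires")))
```

A real implementation would also count the inspections that rejected a match, so that the learned weight reflects a probability rather than a plain tally.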

This example is one of many more you may learn about in the new eLearningCurve course called Data Parsing, Matching and De-Duplication.

Doctor Livingstone, I Presume?

The title of this blog post is a famous quote from history (which, like most quotes, is disputed) said by Henry Morton Stanley (who actually was born John Rowlands) when he found Doctor Livingstone (David Livingstone) deep in the African jungle in 1871 after a six-month expedition with 200 men through unknown territory.

Today it’s much easier to find people. Mobile phone use, credit card transactions and tweet positions lead the way, unless of course you really, really don’t want to be found, as it was with Osama bin Mohammed bin Awad bin Laden.

One of the biggest issues in data quality is real world alignment of the data registered about persons. As told in the post Out of Africa, there are some issues in the way we handle such data, such as:

  • Cultural diversity: Names, addresses, national ID’s and other basic attributes are formatted differently country by country and to some degree within countries. Most data models with a person entity are built on the format(s) of the country where they are designed.
  • Intended purpose of use: Person master data are often stored in tables made for specific purposes like a customer table, a subscriber table, a contact table and so on. Therefore the data identifying the individual is directly linked with attributes describing a specific role of that individual.
  • “Impersonal” use: Person data is often stored in the same table as other party master types such as business entities, projects, households et cetera.
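One way of picturing a model that avoids the first two issues is sketched below with Python dataclasses. It is my own illustration and not a model from the post Out of Africa: the class names, the fields and the source system names are hypothetical. The idea is simply that the person is kept apart from the roles the person plays, and that the name order is stored as data instead of being baked into the table layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    """Identity of an individual, independent of any business role."""
    given_names: List[str]
    surnames: List[str]
    name_order: str = "given-first"  # or "surname-first"
    country_code: str = ""

    def display_name(self) -> str:
        given = " ".join(self.given_names)
        family = " ".join(self.surnames)
        if self.name_order == "surname-first":
            parts = [family, given]
        else:
            parts = [given, family]
        return " ".join(part for part in parts if part)


@dataclass
class Role:
    """A specific purpose of use, such as customer, subscriber or contact."""
    kind: str
    source_system: str


@dataclass
class PartyRecord:
    """Links one person to any number of roles."""
    person: Person
    roles: List[Role] = field(default_factory=list)


explorer = PartyRecord(
    person=Person(given_names=["David"], surnames=["Livingstone"]),
    roles=[Role("subscriber", "newsletter"), Role("contact", "crm")],
)
print(explorer.person.display_name())         # David Livingstone
print([role.kind for role in explorer.roles]) # ['subscriber', 'contact']
```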

Besides that I have found that many organizations don’t use the sources available today in getting data quality right when it comes to contact data.

It’s not that I suggest actually hacking into mobile phone use logs and so on. There are a lot of sources that do not compromise privacy and let you exploit external reference data, as explained in the post Beyond Address Validation.
