Data Quality 2.0 meets MDM 2.0

My current “Data Quality 2.0” endeavor started as a spontaneous heading on the topic of where, in my opinion, the data quality industry is going in the near future. But partly encouraged by being friendly slammed at buzzword bingo, I have surfed the Web (2.0) looking for other 2.0’s. They are plentiful and frequent.

This piece by Mehmet Orun called “MDM 2.0: Comprehensive MDM” really caught my interest. Data Quality and MDM (Master Data Management) are closely related. When you do MDM you work much of the time with Data Quality issues, and doing Data Quality is most often doing Master Data Quality.

So assuming “Data Quality 2.0” and “MDM 2.0” are about what is referenced in the links above, it’s quite natural that many points are shared between the two terms.

Service Oriented Architecture (SOA) is one of the binding elements, as Data Quality solutions and MDM solutions will share Reference and Master Data Management services handling data stewardship, match-link, match-merge, address lookup, address standardization, address verification and data change management, executing Information Discrepancy Resolution Processes that embrace internal and external data.
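
As a rough sketch of what such a shared service contract could look like, here is a minimal Python sketch; all class and method names are illustrative assumptions on my part, not any vendor’s actual API:

```python
# A minimal sketch of shared Data Quality / MDM services behind a SOA
# facade. All names here are illustrative assumptions, not a real API.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class PartyRecord:
    name: str
    address: str
    country: str


class ReferenceDataServices(ABC):
    """Service contract shared by Data Quality and MDM solutions."""

    @abstractmethod
    def standardize_address(self, record: PartyRecord) -> PartyRecord:
        """Return the record with the address in the national standard format."""

    @abstractmethod
    def verify_address(self, record: PartyRecord) -> bool:
        """Check the address against external reference data."""

    @abstractmethod
    def match_link(self, record: PartyRecord,
                   candidates: list[PartyRecord]) -> int | None:
        """Return the index of the matching candidate, or None (match-link)."""
```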

The mega-vendors will certainly bundle their Data Quality and MDM offerings using more or less SOA, and the ongoing vendor consolidation adds to that wave. But hopefully we will also see some true SOA, where best-of-breed “Data Quality 2.0” and “MDM 2.0” technology is implemented with strong business support under a broader solution plan to meet the intended business need, focusing on how the information is created, used and managed for multiple purposes in a multi-cultural environment.

Actually I should have added a (part 1) to the heading of this post. But I will try to make 2.0-free headings in the following posts on the next generation milestones in Data Quality and MDM coexistence. It is possible – I did so in my previous post called Master Data Quality: The When Dimension.


Master Data Quality: The When Dimension

Often we use the who, what and where terms in defining master data as opposed to transaction data, like saying:

  • Transaction data accurately identifies who, what, where and when and
  • Master data accurately describes who, what and where

Who is easily related to our business partners, what to the products we sell, buy and use, and where to the locations of the events.

In some industries when is also easily related to master data entities, like a timetable valid for a given period in public transportation. A fiscal year in financial reporting also belongs to the when side of things.

But when is also a factor in improving data quality and preventing data quality issues related to our business partners, products, locations and assigned categories, because the descriptions of these entities do change over time.

This phenomenon is known as “slowly changing dimensions” when building data warehouses and attempting to make sense of data with business intelligence.

But the “when” dimension also matters in matching, deduplication and identity resolution. Having data with the finest actuality doesn’t necessarily lead to a good match, as you may be comparing with data that doesn’t have the same actuality. Here history tracking is a solution: storing former names, addresses, phone numbers, e-mail addresses, descriptions, roles and relations.
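
A minimal sketch of that idea, assuming each party keeps dated versions of its attributes (the class and field names are mine, purely for illustration):

```python
# History tracking for matching: compare an incoming value against ALL
# historical values, not just the current one, so records of a different
# "actuality" still match. Names and dates are made-up illustrations.
from dataclasses import dataclass
from datetime import date


@dataclass
class AttributeVersion:
    value: str
    valid_from: date
    valid_to: date | None  # None means "still current"


def matches_any_version(candidate: str, history: list[AttributeVersion]) -> bool:
    """True if the candidate equals any former or current value."""
    return any(candidate.casefold() == v.value.casefold() for v in history)


# Example: an old address still links the record to the right party.
address_history = [
    AttributeVersion("12 Old Street", date(2001, 1, 1), date(2007, 6, 30)),
    AttributeVersion("34 New Avenue", date(2007, 7, 1), None),
]
print(matches_any_version("12 old street", address_history))  # True
```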

Such complexity is often not handled in the master data containers around – and even less in matching environments.

My guess is that the future will bring publicly accessible reference data in the cloud, describing our master data entities with a rich complexity including the when – the time – dimension, along with matching environments capable of exploiting it.


Alignment of business and IT

Being a Data Quality professional may be achieved coming from either the business side or the technology side of practice. But more important in my eyes is whether you have made serious attempts, and succeeded, in understanding the side from which you didn’t start.

Many blog posts around the data quality conundrum discuss the role of the business side versus the role of the technology side, and different weights are given to these sides in different contexts. It should not be surprising to a Data Quality professional that there is no absolutely true or absolutely false simple answer to such a question. Fortunately I find that most discussions, when they are had, end up with the “peace on earth” sentiment:

  • Of course it’s the business requirements striving for business value that govern any initiative using technology in order to improve business performance
  • Of course the emergence (or discovery) of new technology may change the way you arrange business processes in order to gain competitive business performance

From that point of view I am looking forward to continued discussions on all the important issues around data and information quality improvement and prevention, such as, but not limited to:

  • What is the business value of better information quality
  • How to gather business requirements related to information quality in order to make data fit for purpose(s)
  • Who is needed to accomplish the data quality improvement tasks – probably people from business, IT and all those mixed ones (credit: Jim Harris of OCDQblog)
  • When is the data quality technology so mature that it will cope with issues in a way not seen before
  • Which different kinds of methodologies and techniques are best for different sort of data quality challenges
  • Where on earth are the answers to all these questions


Data Quality Milestones

I have a page on this blog with the heading “Data Quality 2.0”. The page is about what, in my opinion, the near future will bring in the data quality industry. In recent days there have been some comments on the topic. My current summing up on the subject is this:

The Data Quality X.X versions are merely maturity milestones, where:

Data Quality 0.0 may be seen as a Laissez-faire state where nothing is done.

Data Quality 1.0 may be seen as projects for improving downstream data quality, typically using batch cleansing with nationally oriented techniques in order to make data fit for purpose.

Data Quality 2.0 may be seen as agile implementations of upstream data quality prevention, enterprise wide as well as in small businesses, using combined multi-cultural techniques and exploiting cloud-based reference data in order to maintain data fit for multiple purposes.

The art of Business Directory Matching

A business directory is a list of companies in a given area and perhaps a given industry. One very useful type of such a directory related to data quality is a list of all companies in a given country. In many countries the authorities maintain such a list; in other places it’s a matter of assembling local lists or other forms of data capture. Many private service providers offer such lists, often with added information value of different kinds.

If you take the customer/prospect master table from an enterprise doing B2B in a given country, one would believe that the rows in that table would match 100% to the business directory of that country. I am not talking about all data being spelled exactly as in the directory, but “only” about the same real world object being reflected.

During many years of providing solutions for business directory matching and tuning these, as well as handling such match services from colleagues in the business, I have very, very seldom seen a 100% match – even 90% matches are very rare.

Why is that so? Some of the reasons – related to the classic data quality dimensions – that I have stumbled over have been:

Completeness of business directories varies from country to country and between the lists provided by vendors. Some countries, like those of the old Czechoslovakia, some English speaking countries in the Pacific, the Nordics and others, have a tight registration, while registration is less tight in countries in North America, other European countries and the rest of the world.

Actuality in business directories also differs a lot. It also matters whether the business directory covers dissolved entities and includes history tracking, like former names and addresses. Then take the actuality of the customer/prospect table to be matched into account, and once again the time dimension has a lot to say.

Validity, accuracy and consistency, both concerning the directory and the table to be matched, are a natural cause of mismatch. Also, many B2B customer/prospect tables hold a lot of entities that are not formal business entities but many other types of party master data.

Uniqueness may be defined differently in the directory and the table to be matched. This includes the perception of hierarchies of legal entities and branches – not least governmental and local authority bodies are a fuzzy crowd. Different roles, like those of a small business owner, also create challenges. The same is true of franchisee roles and the use of trading styles.

Then of course the applied automated match technique and the human interaction involved are factors in the resulting match rate and the quality of the match, measured as the frequency of false positives.
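
As a small sketch of those two figures, here is how match rate and false positive frequency could be computed from a manually reviewed sample (the counts are made-up illustration values):

```python
# Match rate and false positive frequency, given a manually reviewed
# sample of automatic matches. All counts below are made-up examples.
def match_rate(matched: int, total: int) -> float:
    """Share of rows linked automatically to the directory."""
    return matched / total


def false_positive_rate(false_positives: int, matched: int) -> float:
    """Share of automatic matches pointing at the wrong real world entity."""
    return false_positives / matched


total_rows = 10_000
auto_matched = 7_400     # rows linked automatically to the directory
false_positives = 55     # wrong links found in manual review

print(f"match rate: {match_rate(auto_matched, total_rows):.1%}")                      # 74.0%
print(f"false positives: {false_positive_rate(false_positives, auto_matched):.2%}")   # 0.74%
```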

Data Quality and Common Sense

My favourite story is the fairy tale “The Emperor’s New Clothes” by Hans Christian Andersen.

In this tale an emperor hires two swindlers (aka consultants) who offer him the finest dress made from the most beautiful cloth. This cloth, they tell him, is invisible to anyone who is either stupid or unfit for his position. In fact there is no cloth at all, but no one (until, at the end, a little child) dares to say so.

The Data Quality discipline is tormented by belonging to both the business side and the technology side of practice. This means that we have to live with the buzzwords and the smartness coming from both the management consultants and the technology consultants and vendors – including myself.

So you really have to believe in a lot of the things and terms being said in order not to look stupid or unfit for your position.

A way to cope with this is to look behind all the fine terms and recognize that most of what is said and presented is just another way of expressing common sense. Some examples:

Business Process: What you do at work – e.g. selling some stuff and putting data about it into a database so it’s ready for invoicing.

Referential Integrity Error: When you sold something not in the database. You may pick another item from the current list.

Bad Change Management: When someone tells you to do it in another way. Now.

Organisational Resistance: When you find that way completely ridiculous because no one tells you why.

Fuzzy logic: This is about the common nature of most questions in life. Statements are not absolutely true or absolutely false but somewhere in between depending on the angle from where you observe.

Business Intelligence: When someone puts your data along with some other data into a new context visualised in a graph in order to replace human gut feeling.

Poor Enterprise Wide Data Quality: The invoicing went well. The decision made from the graph didn’t. 

Data Governance: Meetings and documents about what went wrong with the data and how we can do better.
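
As a toy illustration of the “Referential Integrity Error” entry above, with made-up item numbers:

```python
# Toy illustration of a referential integrity check: you sold something
# not in the item database. Item numbers are purely made up.
items_in_database = {"A100", "A200", "B300"}


def record_sale(item_id: str) -> None:
    if item_id not in items_in_database:
        # In common-sense terms: pick another item from the current list.
        raise ValueError(f"Referential integrity error: unknown item {item_id!r}")
    print(f"Sale of {item_id} recorded, ready for invoicing.")


record_sale("A100")    # fine
# record_sale("Z999")  # would raise the referential integrity error
```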

My experience is that the most successful data quality improvements are made when they are guided by common sense and expressed as being just that. From there you may find great inspiration, practical skills and tools in each area of expertise.

The Tower of Babel

Several old tales, including in the Genesis and the Qur’an, tell of a great tower built by mankind at a time when all people shared a single language. Since then mankind has been confused by having multiple languages. And indeed we still are.

Multi-cultural issues are one of the really big challenges in data quality improvement. This includes not only language variations but also different character sets reflecting different alphabets and script systems, naming traditions, address formats, measurement units, privacy norms and government registration practices, to name the ones I have experienced.

As globalization moves forward these challenges become more and more important. Enterprises tend to standardize worldwide on tools and services, shared service centres take care of data covering many countries, and so on. When an employee works with data from another country, he often wrongly applies his local standards to these data and thereby challenges the data quality more than seen before.

Recently I updated this site with pages around “The art of Matching”. One topic is “Match Techniques”, and the comments posted there were very much about the need for methods that solve the problems arising from having multi-cultural data. Have a look.

International and multi-cultural aspects of data quality improvement have been a favourite topic of mine for a long time.

Whether and when an organisation has to deal with international issues of course depends on whether, and to what degree, that organisation is domestic or active internationally. Though in some countries with several official languages, like Switzerland and Belgium, the multi-cultural topic is mandatory from the start. Typically companies in large countries grow big before looking abroad, while in smaller countries, like my home country Denmark, even many fairly small companies must address international issues with data quality.

Some of the many different observations I have made include the following:

  • Nicknames are a top issue in name matching in some cultures, but not of much importance in other cultures (see the sketch after this list)
  • Family names are a key element in identifying households in some cultures, but not very useful in other cultures
  • Address verification and correction is very useful in some countries but close to impossible in other countries
  • Business directories are complete, consistent and available in some countries, but not that good in other countries
  • Citizen information is available to private entities in some countries, but is a no go in other countries
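
To illustrate the first observation, here is a minimal sketch of culture-dependent nickname handling in name matching; the tiny nickname table is a made-up sample, not real reference data:

```python
# Culture-dependent name matching: nickname tables matter for e.g. US
# data but add little for Danish data. The tables are tiny samples.
NICKNAMES = {
    "en_US": {"bill": "william", "bob": "robert", "dick": "richard"},
    "da_DK": {},  # nicknames play a much smaller role in Danish matching
}


def normalize_given_name(name: str, locale: str) -> str:
    """Map a nickname to its formal form for the given culture."""
    name = name.strip().casefold()
    return NICKNAMES.get(locale, {}).get(name, name)


print(normalize_given_name("Bill", "en_US"))  # william
print(normalize_given_name("Bill", "da_DK"))  # bill (left as-is)
```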

Having worked with data quality tools and services for many years, I have found that many tools and services are very national. So you might discover that a tool or service works wonders with data from one country, but is quite ordinary or in fact useless with data from another country.


The GlobalMatchBox

10 years ago I spent most of the summer delivering my first large project after becoming a sole proprietor. The client – or actually rather the partner – was Dun & Bradstreet’s Nordic operation, which needed an agile solution for matching customer files with its Nordic business reference data sets. The application was named MatchBox.

This solution has grown over the years, and D&B’s operations in the Nordics and other parts of Europe are now run by Bisnode.

Today matching is done against the entire WorldBase, holding close to 150 million business entities from all over the world – with all the diversity you can imagine. On the technology side the application has been bundled with the indexing capabilities of www.softbool.com and the similarity cleverness of www.omikron.net (disclosure: today I work for Omikron), all built with the RAD tool www.magicsoftware.com. The application is now called GlobalMatchBox.
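
The general pattern behind such an engine, common to most large-scale matching, is a two-step one: an index lookup narrows the millions of entities down to a handful of candidates, and a similarity function then scores them. A minimal sketch, where the blocking key and the use of difflib are simplifying assumptions of mine, not a description of the actual GlobalMatchBox internals:

```python
# Two-step business directory matching: (1) an index lookup produces a
# small candidate set, (2) a similarity function scores the candidates.
# Key design and threshold-free scoring are simplifying assumptions.
from difflib import SequenceMatcher


def index_key(name: str, country: str) -> str:
    """Crude blocking key: country plus the first four letters of the name."""
    return f"{country}:{name.casefold()[:4]}"


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


directory = [("DK", "Acme Trading A/S"), ("DK", "Acme Consulting ApS"), ("SE", "Acme AB")]
index: dict[str, list[str]] = {}
for country, name in directory:
    index.setdefault(index_key(name, country), []).append(name)

# Match an incoming customer row against the indexed directory.
query_country, query_name = "DK", "ACME TRADING AS"
candidates = index.get(index_key(query_name, query_country), [])
best = max(candidates, key=lambda c: similarity(query_name, c), default=None)
print(best, round(similarity(query_name, best), 2) if best else None)
```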

It has been a great but fearful pleasure for me to have been able to work with setting up and tuning such a data matching engine and environment. Everybody who has worked with data matching knows about the scars you get from avoiding false positives and false negatives. You know it is just not good enough to say that you are only able to automatically match 40% of the records when it is supposed to be 100%.

So this project has been a very different experience compared to the occasional SMB (Small and Medium sized Business) hit and run data quality improvement projects I also do, as described in my previous post. With D&B we are not talking about months but years of tuning, and I have been guilty of practicing excessive consultancy.


The Statue of Liberty versus The Little Mermaid

The Statue of Liberty in New York harbor is 46 metres (151 ft) high – 93 metres (305 ft) including foundation and pedestal.

The Little Mermaid sits on a rock in the Copenhagen harbour. The relatively small size of the statue typically surprises tourists visiting for the first time. The Little Mermaid statue is only 1.25 metres (4 ft) high.

Actually most things in Denmark are smaller than in the US – including the size of companies. Of course there are Maersk, Carlsberg and Lego, but most companies there are SMB’s (Small and Medium sized Businesses) in a global sense.

As Graham Rhind points out in his blog post http://grcdi.blogspot.com/2009/05/what-about-rest-of-data.html, most literature about data quality focuses completely on data held in large corporate entities. Statistically the relative number of SMB’s is probably close to the same everywhere – but having only a few large companies somehow shifts the focus more to the SMB’s in my country (and our Nordic neighbours).

This is why I have actually worked with data quality improvement both at SMB’s and at large companies.

The most significant differences I have seen are, probably not surprisingly, on the data governance part, where you have to use much more agile (guerrilla) approaches with the SMB’s.

The technology part is pretty much the same – but ROI is king as ever. With SMB’s, results must show up almost immediately; there is no room for months of tuning. Software must be user friendly; there is no room for excessive consultancy.

I can recommend that all data quality professionals do an SMB implementation in order to sharpen their skills and tools.


Service Oriented Data Quality


Service Oriented Architecture (SOA) has been a buzzword for some years.

In my opinion SOA is a golden opportunity for getting benefits from data quality tools that we haven’t been able to achieve with the technology and approaches seen until now (besides the other SOA benefits, such as being independent of technology).

Many data quality implementations until now have been batch cleansing operations suffering from very little sustainability. I have seen lots of well cleansed data that never made it back to the sources, or was only partially updated in operational databases. And even then, a great deal of the updated cleansed data wasn’t maintained and kept clean from there.

Embedded data quality functionality in different ERP, CRM and ETL solutions has been around for a long time. These solutions may serve their purpose very well when implemented. But often they are not implemented, because the bundling of distinct ERP, CRM and ETL solutions and consultancies with specific advantages and data quality tools with specific advantages may not always be a perfect match. Also, having different ERP, CRM and ETL solutions often means different data quality tools and functionality, probably not doing the same thing the same way.

Data Quality functionality deployed as SOA components has a lot to offer:

Reuse is one of the core principles of SOA. Having the same data quality rules applied to every entry point of the same sort of data will help with consistency.

Interoperability will make it possible to deploy data quality prevention as close to the root as possible.

Composability makes it possible to combine functionality with different advantages – e.g. combining internal checks with external reference data.
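
A minimal sketch of such a composed service, where the external check is a stand-in for a real reference data call and all names are illustrative assumptions:

```python
# Composable data quality service: internal rules (reused at every entry
# point) combined with an external reference data check. Illustrative only.
from dataclasses import dataclass


@dataclass
class ValidationResult:
    valid: bool
    messages: list[str]


def internal_checks(record: dict) -> ValidationResult:
    """Rules owned by the enterprise, reused at every entry point (reuse)."""
    messages = []
    if not record.get("name"):
        messages.append("name is mandatory")
    if "@" not in record.get("email", ""):
        messages.append("email looks malformed")
    return ValidationResult(not messages, messages)


def external_reference_check(record: dict) -> ValidationResult:
    """Stand-in for a call to an external address verification service."""
    # A real deployment would invoke the SOA component over the network here.
    return ValidationResult(True, [])


def validate_party(record: dict) -> ValidationResult:
    """Compose internal rules with external reference data (composability)."""
    results = [internal_checks(record), external_reference_check(record)]
    return ValidationResult(all(r.valid for r in results),
                            [m for r in results for m in r.messages])


print(validate_party({"name": "Acme A/S", "email": "info@acme.dk"}).valid)  # True
```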

During the last few years I have been on projects implementing data quality as SOA components. The results seem very promising so far, but I think we have only just started.
