Driving Data Quality in 2 Lanes

Yesterday I visited a client to take part in a workshop on rolling out a Data Quality Desktop tool to more users within that organisation.

This organisation makes use of 2 different Data Quality tools from Omikron:

  • The Data Quality Server, a complete framework of SOA-enabled Data Quality functionality, where the IT department is a critical part of the implementation.
  • The Data Quality Desktop tool, a user-friendly piece of Windows software that any PC user can install, yet with sophisticated cleansing and matching features.

During the few hours of this workshop we were able to link several different departmental data sources to the server-based MDM hub, set up and confirm the business rules involved, and report the likely outcome if the process were to be repeated.

Some of the scenarios exercised will continue to run as ad hoc departmental processes, while others will be promoted to services embraced by the enterprise-wide server implementation.

As I – for various reasons – drove a long distance to this event, I had time to compare the data quality progress made by different organisations with the traffic on the roads, where we have:

  • Large buses carrying people and large lorries carrying products: the most sustainable way of transport, but slow and not too dynamic. Like the enterprise-wide server implementations of Data Quality tools.
  • Private cars heading for different destinations at different but faster speeds. Like the desktop Data Quality tools.

I noticed that:

  • One lane with buses or lorries works fine but slowly.
  • One lane with private cars is a bit of a mess with some hazardous driving.
  • One lane with buses, lorries and private cars tends to be deadly.
  • 2 (or more) lanes work nicely with good driving habits.

So, encouraged by the workshop and the drive, I feel comfortable with the idea of using both kinds of Data Quality tools: coherent, user-involved agile processes backed by desktop tools, and a sustainable enterprise-wide solution at the same time.


Upstream prevention by error tolerant search

Fuzzy matching techniques were originally developed for batch processing, in order to find duplicates and consolidate database rows that have no unique identifier linking them to the real world.

These processes have traditionally been implemented for downstream data cleansing.

As we know that upstream prevention is much more effective than tidying up downstream, real-time data entry checking is becoming more common.

But we can go even further upstream by introducing error tolerant search capabilities.

A common workflow when in-house personnel enter new customers, suppliers, purchased products and other master data is that you first search the database for a match, and if the entity is not found, you create a new entity. When the search fails to find an actual match, we have a classic and frequent cause of either introducing duplicates or challenging the real-time checking.

An error tolerant search is able to find matches despite spelling differences, alternative word arrangements, various concatenations and many other challenges we face when searching for names, addresses and descriptions.
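
As a minimal sketch of how such an error tolerant search could work (using only Python’s standard library, not any particular vendor’s tool; the customer records are made up):

```python
from difflib import SequenceMatcher

# A tiny illustrative "customer database" (all records are invented).
customers = [
    "Omikron Data Quality GmbH, Pforzheim",
    "Acme Trading Ltd, London",
    "Nordic Supplies A/S, Copenhagen",
]

def error_tolerant_search(query, records, threshold=0.6):
    """Return records whose similarity to the query exceeds the threshold,
    best match first, despite spelling variations and rearranged words."""
    def similarity(a, b):
        # Sorting the words makes the comparison tolerant of word order.
        a_norm = " ".join(sorted(a.lower().split()))
        b_norm = " ".join(sorted(b.lower().split()))
        return SequenceMatcher(None, a_norm, b_norm).ratio()
    scored = [(similarity(query, r), r) for r in records]
    return [r for score, r in sorted(scored, reverse=True) if score >= threshold]

# A reordered query with "Limited" spelled out still finds the record.
matches = error_tolerant_search("Trading Acme Limited London", customers)
# matches == ["Acme Trading Ltd, London"]
```

A production-grade search would add phonetic codes and n-gram indexing, but the principle of scoring candidates instead of demanding exact equality is the same.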

Such features may be implemented as embedded functionality in CRM and ERP systems or as my favourite term: SOA components. So besides classic data quality elements for monitoring and checking, we can add error tolerant search to the component catalogue needed for a good MDM solution.


The new face of Data Matching

When matching database records holding data about a person, we traditionally use string attributes such as Citizen/Tax ID, Name, Address, Phone and Email.

Today I stumbled over a company called Polar Rose that specialises in recognising people’s faces in pictures. Its current use is tagging people in Facebook pictures, but really, this technology could make Data Matching, Identity Resolution and Deduplication better.

We already know that fuzzy matching on names and addresses has plenty of challenges with false positives and false negatives. Surely I imagine the same issues with facial recognition. But we also know from comparing strings that the more different information we can gather, the better we are at avoiding false matches. So combining fuzzy string matching with facial recognition (where a picture is available) could add more human-like reliability to matching technology.
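
To illustrate the combination, here is a sketch where a fuzzy name score and a facial recognition score (just a placeholder number here; a real score would come from a service such as Polar Rose) are blended with assumed weights:

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Fuzzy string similarity between two names (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def combined_match_score(name_a, name_b, face_similarity=None,
                         name_weight=0.6, face_weight=0.4):
    """Blend fuzzy name matching with a facial recognition score.
    The weights are illustrative assumptions, not tuned values; when
    no picture is available the score falls back to names alone."""
    name_score = name_similarity(name_a, name_b)
    if face_similarity is None:
        return name_score
    return name_weight * name_score + face_weight * face_similarity

# Two spellings of the same name; 0.9 stands in for a face match score.
score = combined_match_score("Jon Smith", "John Smyth", face_similarity=0.9)
```

The point is simply that an extra, independent signal lets a borderline string match be confirmed or rejected, which is exactly how adding phone or email data already helps today.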

Right now I am considering whether to add this feature to Data Quality 2.0 or leave it for Data Quality 3.0.

Sweden meets United States


Finding duplicate customers may be a very different task depending on which country you are from and which country the data originates from.

Besides all the various character sets, naming traditions and address formats, the differing availability of external reference data makes some things easy – and other things very hard.

Most of the technology, descriptions and examples presented are from the United States.

But say you are a Swedish company with Swedish persons in your database, and among those these 2 rows (name, address, postal code and city):

  • Oluf Palme, Sveagatan 67, 10001 Stockholm
  • Oluf Palme, Savegatan 76, 10001 Stockholm

What you do is plug into the government-provided citizen master data hub and ask for a match. The outcome can be:

  • The same citizen ID is returned because the person has relocated. It’s a duplicate.
  • Two different citizen IDs are returned. It’s not a duplicate.
  • Only one or no citizen ID is returned. Leave it or do fuzzy matching.

If you go for fuzzy matching then you had better be good, because all the easy ones are handled and you are left with the ones where false positives and false negatives are most likely. Often you will only do fuzzy matching if you have phone numbers, email addresses or other data to support the match.
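
The decision flow above can be sketched like this; `lookup_citizen_id` is a hypothetical stand-in for the real hub service, and the IDs are invented:

```python
def decide(row_a, row_b, lookup_citizen_id):
    """Classify a candidate pair using citizen IDs where the hub can
    resolve them, and fall back to fuzzy matching otherwise."""
    id_a = lookup_citizen_id(row_a)
    id_b = lookup_citizen_id(row_b)
    if id_a is not None and id_b is not None:
        # Same ID despite different addresses: the person relocated.
        return "duplicate" if id_a == id_b else "not a duplicate"
    # One or both lookups failed: leave it or escalate to fuzzy matching.
    return "fuzzy matching needed"

# Toy hub where both spellings resolve to the same citizen (relocation).
hub = {
    "Oluf Palme, Sveagatan 67, 10001 Stockholm": "ID-1",
    "Oluf Palme, Savegatan 76, 10001 Stockholm": "ID-1",
}
result = decide("Oluf Palme, Sveagatan 67, 10001 Stockholm",
                "Oluf Palme, Savegatan 76, 10001 Stockholm",
                hub.get)
# result == "duplicate"
```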

Another angle is that it is almost only Swedish companies who use this service with the government-provided reference data – but anyone holding Swedish data may use it upon approval.

Data quality solutions for party master data are not only about fuzzy matching but also about integrating with external reference data, exploiting all the various worldwide possibilities and supporting the logic and logistics of doing so. Also, we know that upstream prevention as close to the root as possible is better than downstream cleansing.

Deployment of such features as composable SOA components is described in a previous post here.

Master Data meets the Customer

In the old days Master Data was predominantly created, maintained and used by the staff of the organisation holding the data. In many cases this is no longer true. Besides exchanging data with business partners, today the customer – and the prospect – has become an important person to consider when doing Data Governance and implementing technology around Master Data.

In the online world the customer works with your Master Data when:

  • The customer creates and maintains name, address and communication information by using registration functions
  • The customer searches for and reads product information on web shops and information sites

Having prospects and customers help with name and address (party) data is apparently great news for lowering costs in the organisation. But in the long run you have got yourself another data silo, and your Data Quality issues have become yet more challenging.

The first thing to do is to optimise your registration forms. An important thing to consider here is that online is worldwide (unless you restrict your site to visitors from a single country). When doing business online with multinational customers, take care that the sequence, formats and labels are useful to everyone and that mandatory checks and other validations are in line with the rules for the country in question.

External reference data may be used for lookup and validation integrated in the registration forms.
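
A sketch of what country-dependent form validation could look like; the rules below are simplified illustrations, not complete national formats:

```python
import re

# Illustrative, deliberately incomplete country rules.
COUNTRY_RULES = {
    "DK": {"postal_code": re.compile(r"^\d{4}$"),           # Danish: 4 digits
           "mandatory": ["name", "street", "postal_code", "city"]},
    "US": {"postal_code": re.compile(r"^\d{5}(-\d{4})?$"),  # ZIP or ZIP+4
           "mandatory": ["name", "street", "postal_code", "city", "state"]},
}

def validate_registration(form, country):
    """Return a list of problems found; an empty list means the form passes."""
    rules = COUNTRY_RULES.get(country)
    if rules is None:
        return [f"no validation rules for country {country}"]
    problems = [f"missing {field}" for field in rules["mandatory"]
                if not form.get(field)]
    code = form.get("postal_code", "")
    if code and not rules["postal_code"].match(code):
        problems.append("postal code format invalid for " + country)
    return problems

form = {"name": "Hans Jensen", "street": "Nytorv 2",
        "postal_code": "1450", "city": "København"}
problems = validate_registration(form, "DK")
# problems == []
```

Note how the same form would fail under the US rules: the field set and the format rules both change with the country, which is exactly why one hard-coded national form does not travel well.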

The concept of “one version of the truth” is a core element in most Master Data Management solutions. Doing deduplication within online registration has privacy implications. When asking for personal data you can’t prompt “Possible duplicate found” and then present data about someone else. Here you need more than one data quality firewall.
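
One way such a data quality firewall could behave is sketched below: the caller learns only that a likely duplicate exists, never the other party’s data (the threshold and the records are illustrative):

```python
from difflib import SequenceMatcher

# Illustrative existing registrations (made-up data).
existing = ["Maria Hansen, Strandvejen 10, 2900 Hellerup"]

def registration_firewall(new_entry, records, threshold=0.85):
    """Check a new online registration against existing records.
    For privacy reasons the response only reveals THAT a likely
    duplicate exists, never its content; resolution is routed to an
    internal data steward rather than shown to the visitor."""
    for record in records:
        ratio = SequenceMatcher(None, new_entry.lower(), record.lower()).ratio()
        if ratio >= threshold:
            return "possible duplicate - route to internal review"
    return "accepted"

verdict = registration_firewall(
    "Maria Hansen, Strandvejen 10, 2900 Hellerup", existing)
# verdict == "possible duplicate - route to internal review"
```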

Many organisations are not just either offline or online but operate in both worlds. To maintain the 360-degree view of the customer in this situation, you need strong data matching techniques capable of working with both offline and online captured data. As the business case for online registration is very much about reducing staff involvement, this is about using technology and keeping human interaction to a minimum.

When a prospect comes to your site and tries to find information about your products, very often the first thing they do is use the search function. From deduplication of names and addresses we know that spelling is difficult and that we sometimes use synonyms other than those in the Master Data descriptions. Add to that the multi-cultural aspect. The solution here is to use the same fuzzy search techniques that we use for data matching. This is a kind of reuse. I like that.


Data Quality 2.0 meets MDM 2.0

My current “Data Quality 2.0” endeavour started as a spontaneous heading on the topic of where, in my opinion, the data quality industry is going in the near future. But partly encouraged by being slammed in a friendly way at buzzword bingo, I have surfed the Web 2.0 for other 2.0’s. They are plenty and frequent.

This piece by Mehmet Orun called “MDM 2.0: Comprehensive MDM” really caught my interest. Data Quality and MDM (Master Data Management) are closely related. When you do MDM you work much of the time with Data Quality issues, and doing Data Quality is most often doing Master Data Quality.

So assuming “Data Quality 2.0” and “MDM 2.0” are about what is referenced in the links above, it’s quite natural that the two terms share many points.

Service Oriented Architecture (SOA) is one of the binding elements, as Data Quality solutions and MDM solutions will share Reference and Master Data Management services handling data stewardship, match-link, match-merge, address lookup, address standardisation, address verification and data change management, carrying out Information Discrepancy Resolution Processes that embrace internal and external data.

The mega-vendors will certainly bundle their Data Quality and MDM offerings using more or less SOA. The ongoing vendor consolidation adds to that wave. But hopefully we will also see some true SOA where best-of-breed “Data Quality 2.0” and “MDM 2.0” technology is implemented with strong business support, under a broader solution plan that meets the intended business need by focusing on how information is created, used and managed for multiple purposes in a multi-cultural environment.
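
A sketch, in the SOA spirit, of how such services could be composed behind one small interface; the service names and logic are illustrative, not any vendor’s actual API:

```python
from typing import Protocol

class DataQualityService(Protocol):
    """The one method every composable service agrees to expose."""
    def process(self, record: dict) -> dict: ...

class AddressStandardizer:
    def process(self, record: dict) -> dict:
        # Toy standardisation: normalise the city name to upper case.
        record = dict(record)
        record["city"] = record.get("city", "").upper()
        return record

class CountryDefaulter:
    def process(self, record: dict) -> dict:
        # Toy enrichment: default a missing country code.
        record = dict(record)
        record.setdefault("country", "DK")
        return record

def run_pipeline(record: dict, services: list) -> dict:
    """Each service is independently replaceable - the essence of
    composing best-of-breed components under one solution plan."""
    for service in services:
        record = service.process(record)
    return record

result = run_pipeline({"city": "Århus"},
                      [AddressStandardizer(), CountryDefaulter()])
# result == {"city": "ÅRHUS", "country": "DK"}
```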

Actually I should have added “(part 1)” to the heading of this post. But I will try to make 2.0-free headings in following posts on the next generation milestones in Data Quality and MDM coexistence. It is possible – I did so in my previous post called Master Data Quality: The When Dimension.


Alignment of business and IT

Being a Data Quality professional may be achieved coming from either the business side or the technology side of practice. But more important, in my eyes, is whether you have made serious attempts – and succeeded – in understanding the side where you didn’t start.

Many blog posts written around the data quality conundrum discuss the role of the business side versus the role of the technology side, and various weights are given to these sides in different contexts. It should not surprise a Data Quality professional that there is no absolutely true or absolutely false simple answer to such a question. Fortunately I find that most discussions, when they take place, end up with the “peace on earth” sentiment:

  • Of course it’s the business requirements, striving for business value, that govern any initiative using technology to improve business performance
  • Of course the emergence (or discovery) of new technology may change the way you arrange business processes in order to gain competitive business performance

From that point of view I am looking forward to continued discussions of all the important issues around data and information quality improvement and prevention, including but not limited to:

  • What is the business value of better information quality
  • How to gather business requirements related to information quality in order to make data fit for purpose(s)
  • Who is needed to accomplish the data quality improvement tasks – probably people from business, IT and all those mixed ones (credit: Jim Harris of OCDQblog)
  • When is the data quality technology so mature that it will cope with issues in a way not seen before
  • Which different kinds of methodologies and techniques are best for different sort of data quality challenges
  • Where on earth are the answers to all these questions


Data Quality Milestones

I have a page on this blog with the heading “Data Quality 2.0”. The page is about what, in my opinion, the near future will bring in the data quality industry. In recent days there have been some comments on the topic. My current summing up of the subject is this:

Data Quality X.X versions are merely maturity milestones, where:

Data Quality 0.0 may be seen as a Laissez-faire state where nothing is done.

Data Quality 1.0 may be seen as projects for improving downstream data quality typically using batch cleansing with national oriented techniques in order to make data fit for purpose.

Data Quality 2.0 may be seen as agile implementation of enterprise wide and small business data quality upstream prevention using multi-cultural combined techniques exploiting cloud based reference data in order to maintain data fit for multiple purposes.

The art of Business Directory Matching

A business directory is a list of companies in a given area and perhaps a given industry. One very useful type of such a directory, related to data quality, is a list of all companies in a given country. In many countries the authorities maintain such a list; in other places it’s a matter of assembling local lists or other forms of data capture. Many private service providers offer such lists, often with added information value of different kinds.

If you take the customer/prospect master table from an enterprise doing B2B in a given country, one should believe the rows in that table would match 100% to the business directory of that country. I am not talking about all data being spelled exactly as in the directory, but “only” about the same real world object being reflected.

During many years of providing solutions for business directory matching, tuning these, and handling such match services from colleagues in the business, I have very, very seldom seen a 100% match – even 90% matches are very rare.

Why is that so? Some of the reasons – related to the classic data quality dimensions – I have stumbled over are:

Completeness of business directories varies from country to country and between the lists provided by vendors. Some countries, like those of the old Czechoslovakia, some English speaking countries in the Pacific, the Nordics and others, have tight registration, while registration is less tight in countries in North America, other European countries and the rest of the world.

Actuality of business directories also differs a lot. It is also important whether the business directory covers dissolved entities and includes history tracking such as former names and addresses. Then take the actuality of the customer/prospect table to be matched, and once again the time dimension has a lot to say.

Validity, accuracy and consistency, both of the directory and of the table to be matched, are natural causes of mismatch. Also, many B2B customer/prospect tables hold a lot of entities that are not formal business entities but many other types of party master data.

Uniqueness may be defined differently in the directory and the table to be matched. This includes the perception of hierarchies of legal entities and branches – not least governmental and local authority bodies are a fuzzy crowd. Different roles, such as that of a small business owner, also make for challenges. The same is true of roles as franchise takers and the use of trading styles.

Then of course the applied automated match technique and the human interaction executed are factors in the resulting match rate and in the quality of the match, measured as the frequency of false positives.
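
The two resulting figures can be expressed simply; the numbers in the example are invented for illustration:

```python
def match_statistics(total_rows, matched_rows, reviewed_matches,
                     false_positives_found):
    """Match rate over the whole table, plus a false positive rate
    estimated from a manually reviewed sample of the matches."""
    match_rate = matched_rows / total_rows
    false_positive_rate = false_positives_found / reviewed_matches
    return match_rate, false_positive_rate

# Example: 100,000 B2B rows, 82,000 matched to the directory,
# 500 matches reviewed by hand, 15 judged to be false positives.
rate, fp = match_statistics(100_000, 82_000, 500, 15)
# rate == 0.82, fp == 0.03
```

Tracking both figures matters: raising the match rate by loosening the match criteria usually raises the false positive rate at the same time.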

The Tower of Babel

Several old tales, including in Genesis and the Qur’an, tell of a great tower built by mankind at a time when all people had a single language. Since then, mankind has been confused by having multiple languages. And indeed we still are.

Multi-cultural issues are one of the really big challenges in data quality improvement. This includes not only language variations but also different character sets reflecting different alphabets and script systems, naming traditions, address formats, units of measure, privacy norms and government registration practices, to name the ones I have experienced.

As globalisation moves forward these challenges become more and more important. Enterprises tend to standardise worldwide on tools and services, shared service centres take care of data covering many countries, and so on. When an employee works with data from another country, he often wrongly applies his local standards to the data and thereby challenges data quality more than seen before.

Recently I updated this site with pages around “The art of Matching”. One topic is “Match Techniques”, and the comments posted there were very much about the need for methods that solve the problems arising from having multi-cultural data. Have a look.

International and multi-cultural aspects of data quality improvement have been a favourite topic of mine for a long time.

Whether and when an organisation has to deal with international issues depends, of course, on the degree to which that organisation is domestic or internationally active. In some countries with several official languages, like Switzerland and Belgium, the multi-cultural topic is mandatory regardless. Typically companies in large countries grow big before looking abroad, while in smaller countries, like my home country Denmark, even many fairly small companies must address international issues with data quality.

Some of the many different observations I have made includes the following:

  • Nicknames are a top issue in name matching in some cultures, but not of much importance in other cultures
  • Family names are a key element in identifying households in some cultures, but not very useful in other cultures
  • Address verification and correction is very useful in some countries but close to impossible in other countries
  • Business directories are complete, consistent and available in some countries, but not that good in other countries
  • Citizen information is available for private entities in some countries, but is a no go in other countries

Having worked with data quality tools and services for many years, I have found that many tools and services are very national. So you might discover that a tool or service works wonders with data from one country, but is quite ordinary or in fact useless with data from another country.
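
As a tiny illustration of the nickname observation above (the nickname table is a made-up sample, not a real reference set, and the country codes are illustrative):

```python
# A small sample of common US English nicknames; a real reference set
# would be far larger and maintained per culture.
NICKNAMES_US = {"bill": "william", "bob": "robert", "dick": "richard"}

def normalize_given_name(name, country):
    """Expand common nicknames for cultures where that helps matching;
    leave names untouched where nickname expansion would do harm."""
    name = name.lower().strip()
    if country == "US":
        return NICKNAMES_US.get(name, name)
    return name

def names_match(a, b, country):
    return normalize_given_name(a, country) == normalize_given_name(b, country)

us_match = names_match("Bill", "William", "US")   # True
dk_match = names_match("Bill", "William", "DK")   # False
```

This is exactly why a very national tool can look brilliant at home and useless abroad: the rule that saves a match in one culture introduces false positives in another.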
