Driving Data Quality in 2 Lanes

Yesterday I visited a client to take part in a workshop on bringing a Data Quality Desktop tool to more users within that organisation.

This organisation makes use of 2 different Data Quality tools from Omikron:

  • The Data Quality Server, a complete framework of SOA enabled Data Quality functionality, where the IT department needs to be a critical part of the implementation.
  • The Data Quality Desktop tool, a user friendly piece of Windows software installable by any PC user, yet with sophisticated cleansing and matching features.

During the few hours of this workshop we were able to link several departmental data sources to the server based MDM hub, set up and confirm the business rules for doing so, and report the foreseeable outcome of the process if it were to be repeated.

Some of the scenarios exercised will continue to run as ad hoc departmental processes and others will be upgraded into services embraced by the enterprise wide server implementation.

As I, for various reasons, travelled to this event by car over a fairly long distance, I had time to compare the data quality progress made by different organisations with the traffic on the roads, where we have:

  • Large busses with people and large lorries with products: the most sustainable way of transport, but slow going and not too dynamic. Like the enterprise wide server implementations of Data Quality tools.
  • Private cars heading for different destinations at different but faster speeds. Like the desktop Data Quality tools.

I noticed that:

  • One lane with busses or lorries works fine but slowly.
  • One lane with private cars is a bit of a mess with some hazardous driving.
  • One lane with busses, lorries and private cars tends to be deadly.
  • 2 (or more) lanes work nicely with good driving habits.

So, encouraged by the workshop and the ride, I feel comfortable with the idea of using both kinds of Data Quality tools in order to have coherent, user involved agile processes backed by some tools and a sustainable enterprise wide solution at the same time.


Upstream prevention by error tolerant search

Fuzzy matching techniques were originally developed for batch processing in order to find duplicates and consolidate database rows that have no unique identifiers linking them to the real world.

These processes have traditionally been implemented for downstream data cleansing.

As we know that upstream prevention is much more effective than tidying up downstream, real time data entry checking is becoming more common.

But we are able to go further upstream by introducing error tolerant search capabilities.

A common workflow when in-house personnel enter new customers, suppliers, purchased products and other master data is that you first search the database for a match. If the entity is not found, you create a new entity. When the search fails to find an actual match, we have a classic and frequent cause of either introducing duplicates or challenging the real time checking.

An error tolerant search is able to find matches despite spelling differences, alternatively arranged words, various concatenations and many other challenges we face when searching for names, addresses and descriptions.
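
To illustrate, here is a minimal sketch in Python of such a "search before create" step, using simple normalisation and similarity from the standard library. The data, the field names and the 0.9 threshold are my own illustrative assumptions, not any particular tool's API:

    from difflib import SequenceMatcher

    # Illustrative in-house customer table; in real life a database.
    CUSTOMERS = [
        {"id": 1, "name": "Omikron Data Quality GmbH", "city": "Pforzheim"},
        {"id": 2, "name": "Dun & Bradstreet Nordic", "city": "Copenhagen"},
    ]

    def normalise(text):
        # Lowercase and sort the words so "Data Quality Omikron GmbH"
        # still matches "Omikron Data Quality GmbH" despite the
        # alternative word arrangement.
        return " ".join(sorted(text.lower().split()))

    def find_candidates(name, threshold=0.9):
        # Yield rows whose normalised name is similar enough to the
        # query, tolerating spelling differences like "Omikrom".
        query = normalise(name)
        for row in CUSTOMERS:
            score = SequenceMatcher(None, query, normalise(row["name"])).ratio()
            if score >= threshold:
                yield row, round(score, 2)

    # Entry workflow: search first, only create if nothing plausible is found.
    matches = list(find_candidates("Omikrom Data Quality GmbH"))
    if matches:
        print("Possible existing entities:", matches)
    else:
        print("No match found - safe to create a new entity.")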

Implementation of such features may be as embedded functionality in CRM and ERP systems or as my favourite term: SOA components. So besides classic data quality elements for monitoring and checking, we can add error tolerant search to the component catalogue needed for a good MDM solution.


The new face of Data Matching

When matching database records holding data about a person, we traditionally use string attributes such as Citizen/Tax ID, Name, Address, Phone and Email.

Today I stumbled over a company called Polar Rose that specialises in recognising people's faces in pictures. The current use is tagging people in Facebook pictures, but really, this technology could make Data Matching, Identity Resolution and Deduplication better.

We already know that fuzzy matching with names and addresses has plenty of challenges with false positives and false negatives. Surely I imagine the same issues with facial recognition. But we also know from comparing strings that the more different pieces of information we can gather, the better we are at avoiding false matches. So combining fuzzy string matching and facial recognition (where a picture is available) could add more human-like judgement to matching technology and improve reliability.
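
As a thought experiment, a combined score could look like the sketch below. Note that face_similarity() is a pure placeholder for a facial recognition service (Polar Rose's actual interface is not known to me), and the weights are arbitrary assumptions:

    from difflib import SequenceMatcher

    def name_similarity(a, b):
        # Fuzzy string comparison on names, returning 0..1.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def face_similarity(photo_a, photo_b):
        # Placeholder for a call to a facial recognition service
        # returning a 0..1 similarity. Hard-coded for illustration only.
        return 0.91

    def combined_match_score(rec_a, rec_b, name_weight=0.6, face_weight=0.4):
        name_score = name_similarity(rec_a["name"], rec_b["name"])
        if rec_a.get("photo") and rec_b.get("photo"):
            # A picture is available on both sides: blend the two signals.
            return name_weight * name_score + face_weight * face_similarity(
                rec_a["photo"], rec_b["photo"])
        # No picture available: fall back to string evidence alone.
        return name_score

    a = {"name": "Jon Smith", "photo": "a.jpg"}
    b = {"name": "John Smyth", "photo": "b.jpg"}
    print(round(combined_match_score(a, b), 2))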

Right now I am considering whether to add this feature to Data Quality 2.0 or leave it for Data Quality 3.0.

Sweden meets United States

Finding duplicate customers may be a very different task depending on which country you are from and which country the data originate from.

Besides all the various character sets, naming traditions and address formats, the differing availability of external reference data makes the task easy in one place and very hard in another.

Most of the technology, descriptions and examples around are from the United States.

But say you are a Swedish company with Swedish persons in your database, and among those are these 2 rows (name, address, postal code and city):

  • Oluf Palme, Sveagatan 67, 10001 Stockholm
  • Oluf Palme, Savegatan 76, 10001 Stockholm

What you do is plug into the government provided citizen master data hub and ask for a match. The outcome can be:

  • The same citizen ID is returned because the person has relocated. It’s a duplicate.
  • Two different citizen IDs are returned. It’s not a duplicate.
  • Either only one or no citizen ID is returned. Leave it or do fuzzy matching.

If you go for fuzzy matching then you had better be good, because all the easy ones are handled and you are left with the ones where false positives and false negatives are most likely. Often you will only do fuzzy matching if you have phone numbers, email addresses or other data to support the match.
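
As a sketch, the decision flow above could be wired up like this, where lookup_citizen_id() stands in for the call to the citizen hub (its real interface will of course differ):

    def resolve_pair(row_a, row_b, lookup_citizen_id):
        # lookup_citizen_id(row) returns a citizen ID, or None if not found.
        id_a = lookup_citizen_id(row_a)
        id_b = lookup_citizen_id(row_b)
        if id_a and id_b:
            # Same ID despite different addresses means a relocation.
            return "duplicate" if id_a == id_b else "not a duplicate"
        # Only one or no ID returned: leave it, or fall back to fuzzy
        # matching, preferably supported by phone, email or other data.
        return "leave it or do fuzzy matching"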

Another angle is that it is almost only Swedish companies who use this service with the government provided reference data, but anyone holding Swedish data may use it upon approval.

Data quality solutions for party master data are not only about fuzzy matching but also about integrating with external reference data, exploiting all the various worldwide possibilities and supporting the logic and logistics of doing so. Also, we know that upstream prevention as close to the root as possible is better than downstream cleansing.

Deployment of such features as composable SOA components is described in a previous post here.

Master Data Quality: The When Dimension

Often we use the who, what and where terms when defining master data as opposed to transaction data, like saying:

  • Transaction data accurately identifies who, what, where and when and
  • Master data accurately describes who, what and where

Who is easily related to our business partners, what to the products we sell, buy and use, and where to the locations of the events.
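
As a small illustration (with made up field names), the distinction could look like this:

    # Master data: describes who, what and where.
    master_customer = {"customer_id": 42, "name": "Oluf Palme",
                       "city": "Stockholm"}                 # who
    master_product = {"product_id": 7,
                      "description": "Bookcase, birch"}     # what
    master_location = {"store_id": 3, "city": "Stockholm"}  # where

    # Transaction data: accurately identifies who, what, where and when.
    transaction = {"customer_id": 42, "product_id": 7, "store_id": 3,
                   "timestamp": "2009-08-01T10:15:00"}      # when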

In some industries when is also easily related to master data entities, like a timetable valid for a given period in public transportation. Also, a fiscal year in financial reporting belongs to the when side of things.

But when is also a factor in improving data quality and preventing data quality issues related to our business partners, products, locations and assigned categories, because the descriptions of these entities do change over time.

This fact is known as “slowly changing dimensions” when building data warehouses and attempting to make sense of data with business intelligence.

But the when dimension also matters in matching, deduplication and identity resolution. Having data with the finest actuality doesn’t necessarily lead to a good match, as you may be comparing with data not having the same actuality. Here history tracking is a solution: storing former names, addresses, phones, e-mail addresses, descriptions, roles and relations.
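
A minimal sketch of what history tracking could add to matching, assuming each golden record keeps a list of former values (the field names are illustrative):

    from difflib import SequenceMatcher

    def best_similarity(incoming_name, record):
        # Compare against the current name AND all former names, so data
        # of a different actuality can still produce a match.
        candidates = [record["name"]] + record.get("former_names", [])
        return max(
            SequenceMatcher(None, incoming_name.lower(), c.lower()).ratio()
            for c in candidates
        )

    golden = {"name": "Anna Berg", "former_names": ["Anna Lindqvist"]}
    print(round(best_similarity("Anna Lindquist", golden), 2))  # hits a former name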

Such complexity is often not handled in the master data containers around, and even less in matching environments.

My guess is that the future will bring publicly accessible reference data in the cloud describing our master data entities with a rich complexity including the when (time) dimension, along with capable matching environments around it.


Data Quality Milestones

I have a page on this blog with the heading “Data Quality 2.0”. The page is about what, in my opinion, the near future will bring in the data quality industry. In recent days there have been some comments on the topic. My current summing up on the subject is this:

The Data Quality X.X versions are merely maturity milestones where:

Data Quality 0.0 may be seen as a Laissez-faire state where nothing is done.

Data Quality 1.0 may be seen as projects for improving downstream data quality, typically using batch cleansing with nationally oriented techniques in order to make data fit for purpose.

Data Quality 2.0 may be seen as agile implementation of enterprise wide and small business data quality upstream prevention using multi-cultural combined techniques exploiting cloud based reference data in order to maintain data fit for multiple purposes.

The Tower of Babel

Several old tales, including those in Genesis and the Qur’an, tell of a great tower built by mankind at a time when all people spoke a single language. Since then mankind has been confused by having multiple languages. And indeed we still are.

Multi-cultural issues are among the really big challenges in data quality improvement. This includes not only language variations but also different character sets reflecting different alphabets and script systems, naming traditions, address formats, measurement units, privacy norms and government registration practices, to name the ones I have experienced.

As globalization moves forward these challenges become more and more important. Enterprises tend to standardise worldwide on tools and services, shared service centres take care of data covering many countries, and so on. When an employee works with data from another country he often wrongly applies his local standards to those data and thereby challenges the data quality more than seen before.

Recently I updated this site with pages around “The art of Matching”. One topic is “Match Techniques”, and the comments posted there were very much about the need for methods that solve the problems arising from having multi-cultural data. Have a look.

International and multi-cultural aspects of data quality improvement have been a favourite topic of mine for a long time.

Whether and when an organisation has to deal with international issues depends, of course, on whether and to what degree that organisation is domestic or active internationally, though in some countries with several official languages, like Switzerland and Belgium, the multi-cultural topic is mandatory anyway. Typically companies in large countries grow big before looking abroad, while in smaller countries, like my home country Denmark, even many fairly small companies must address international issues with data quality.

Some of the many different observations I have made include the following:

  • Nicknames are a top issue in name matching in some cultures, but not of much importance in others
  • Family names are a key element in identifying households in some cultures, but not very useful in others
  • Address verification and correction is very useful in some countries but close to impossible in others
  • Business directories are complete, consistent and available in some countries, but not that good in others
  • Citizen information is available to private entities in some countries, but is a no go in others

While working with data quality tools and services for many years, I have found that many tools and services are very national. So you might discover that a tool or service works wonders with data from one country, but is quite ordinary or in fact useless with data from another country.
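
To make the point concrete, a matching engine could carry a per-country configuration along the lines below. The rule values are simplified assumptions for demonstration only, not authoritative statements about any country:

    COUNTRY_RULES = {
        "US": {"use_nicknames": True,  "household_by_family_name": False,
               "verify_address": True,  "citizen_info_available": False},
        "DK": {"use_nicknames": False, "household_by_family_name": False,
               "verify_address": True,  "citizen_info_available": True},
    }

    def matching_profile(country_code):
        # Fall back to a conservative profile for countries we don't know,
        # rather than wrongly applying one country's standards to another.
        default = {"use_nicknames": False, "household_by_family_name": False,
                   "verify_address": False, "citizen_info_available": False}
        return COUNTRY_RULES.get(country_code, default)

    print(matching_profile("US"))
    print(matching_profile("SE"))  # unknown here -> conservative default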


The GlobalMatchBox

10 years ago I spent most of the summer delivering my first large project after setting up as a sole proprietorship. The client, or actually rather the partner, was Dun & Bradstreet’s Nordic operation, who needed an agile solution for matching customer files with their Nordic business reference data sets. The application was named MatchBox.

This solution has grown over the years while D&B’s operation in the Nordics and other parts of Europe is now operated by Bisnode.

Today matching is done with the entire WorldBase holding close to 150 million business entities from all over the world – with all the diversity you can imagine. On the technology side the application has been bundled with the indexing capacities of www.softbool.com and the similarity cleverness of www.omikron.net (disclosure: today I work for Omikron) all built with the RAD tool www.magicsoftware.com. The application is now called GlobalMatchBox.

It has been a great but fearful pleasure for me to have been able to work with setting up and tuning such a data matching engine and environment. Everybody who has worked with data matching knows about the scars you get from avoiding false positives and false negatives. You know that it is just not good enough to say that you are only able to automatically match 40% of the records when it is supposed to be 100%.

So this project has been a very different experience compared to the occasional SMB (Small and Medium size Business) hit and run data quality improvement projects I also do, as described in my previous post. With D&B we are not talking about months but years of tuning, and I have been guilty of practicing excessive consultancy.


The Statue of Liberty versus The Little Mermaid

The Statue of Liberty in New York harbour is 46 metres (151 ft) high, or 93 metres (305 ft) including foundation and pedestal.

The Little Mermaid sits on a rock in the Copenhagen harbour. The relatively small size of the statue typically surprises tourists visiting for the first time. The Little Mermaid statue is only 1.25 metres (4 ft) high.

Actually most things in Denmark are smaller than in the US, also the size of companies. Of course there are Maersk, Carlsberg and Lego, but most companies there are SMB’s (Small and Medium sized Businesses) in a global sense.

As Graham Rhind points out in his blog http://grcdi.blogspot.com/2009/05/what-about-rest-of-data.html most literature about data quality is fixed completely on data held in large corporate entities. Statistically the relative number of SMB’s is probably close to the same, but having only a few large companies somehow shifts the focus more to the SMB’s in my country (and our Nordic neighbours).

This is why I have actually worked with data quality improvement both at SMB’s and at large companies.

The most significant differences I have seen are, probably not surprisingly, on the data governance part, where you have to use much more agile (guerrilla) approaches with the SMB’s.

The technology part is pretty much the same, but ROI is king as ever. With SMB’s, results must show up almost immediately; there is no room for months of tuning. Software must be user friendly; there is no room for excessive consultancy.

I can recommend that all data quality professionals do an SMB implementation in order to sharpen their skills and tools.


Service Oriented Data Quality

Service Oriented Architecture (SOA) has been a buzzword for some years.

In my opinion SOA is a golden opportunity for getting benefits from data quality tools that we haven’t been able to achieve with the technology and approaches seen until now (besides the other SOA benefits such as technology independence).

Many data quality implementations until now have been batch cleansing operations suffering from very little sustainability. I have seen lots of well cleansed data never making it back to the sources, or only being partially updated in operational databases. And even then, much of the cleansed data that did get updated was not maintained afterwards, and new errors were not prevented from there.

Embedded data quality functionality in various ERP, CRM and ETL solutions has been around for a long time. These solutions may serve their purpose very well when implemented. But often they are not implemented, because the bundling of a distinct ERP, CRM or ETL solution and consultancy with its specific advantages and a data quality tool with its specific advantages may not always be a perfect match. Also, having different ERP, CRM and ETL solutions then often means different data quality tools and functionality, probably not doing the same thing the same way.

Data Quality functionality deployed as SOA components has a lot to offer:

Reuse is one of the core principles of SOA. Having the same data quality rules applied to every entry point of the same sort of data will help with consistency.

Interoperability will make it possible to deploy data quality prevention as close to the root as possible.

Composability makes it possible to combine functionality with different advantages – e.g. combining internal checks with external reference data.
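
As a sketch of what such composability could look like, here small single-purpose data quality services are chained into one pipeline that can be reused at every entry point. The service names are invented, and check_address_with_reference_data() is a placeholder for a call-out to an external reference data provider:

    def standardise_name(record):
        # Internal check: trim excess whitespace and normalise casing.
        record["name"] = " ".join(record["name"].split()).title()
        return record

    def check_address_with_reference_data(record):
        # Placeholder for a call to external reference data; here it only
        # flags whether an address is present at all.
        record["address_verified"] = bool(record.get("address"))
        return record

    def compose(*services):
        # Reuse: the same chain of rules applied at every entry point
        # of the same sort of data helps with consistency.
        def pipeline(record):
            for service in services:
                record = service(record)
            return record
        return pipeline

    validate_party = compose(standardise_name, check_address_with_reference_data)
    print(validate_party({"name": "  oluf   palme ", "address": "Sveagatan 67"}))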

During the last years I have been on projects implementing data quality as SOA components. The results seem very promising so far, but I think we have only just started.
