Upstream prevention by error-tolerant search

Fuzzy matching techniques were originally developed for batch processing, in order to find duplicates and consolidate database rows that have no unique identifiers tying them to the real world.

These processes have traditionally been implemented for downstream data cleansing.

As we know that upstream prevention is much more effective than tidying up downstream, real-time data entry checking is becoming more common.

But we are able to go even further upstream by introducing error-tolerant search capabilities.

A common workflow when in-house personnel are entering new customers, suppliers, purchased products and other master data is that you first search the database for a match, and if the entity is not found, you create a new one. When the search fails to find an actual match, we have a classic and frequent cause of either introducing duplicates or challenging the real-time checking.

An error-tolerant search is able to find matches despite spelling differences, differently arranged words, various concatenations and many other challenges we face when searching for names, addresses and descriptions.
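To make the idea concrete, here is a minimal sketch in Python using only the standard library; the customer list, threshold and normalization are illustrative assumptions, and a real implementation would use a dedicated matching engine with phonetic and domain-specific rules.

```python
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    # Lower-case and sort the words, so differently arranged names
    # like "Palme Oluf" and "Oluf Palme" compare as equal.
    return " ".join(sorted(text.lower().split()))

def error_tolerant_search(query: str, candidates: list[str], threshold: float = 0.7) -> list[tuple[str, float]]:
    # Return candidates whose similarity to the query meets the threshold,
    # best matches first, despite misspellings and re-ordered words.
    query_norm = normalize(query)
    hits = []
    for candidate in candidates:
        score = SequenceMatcher(None, query_norm, normalize(candidate)).ratio()
        if score >= threshold:
            hits.append((candidate, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# A misspelled, re-ordered query still finds the existing customer.
customers = ["Oluf Palme", "Olof Svensson", "Palme Trading AB"]
print(error_tolerant_search("Palme Olluf", customers))
```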

Implementation of such features may be as embedded functionality in CRM and ERP systems or as my favourite term: SOA components. So besides classic data quality elements for monitoring and checking, we can add error-tolerant search to the component catalogue needed for a good MDM solution.


Sweden meets United States


Finding duplicate customers may be a very different task depending on which country you are from and which country the data originates from.

Besides all the various character sets, naming traditions and address formats, the differing possibilities with external reference data make some things easy – and other things very hard.

Most of the technology, descriptions and examples presented around are from the United States.

But say you are a Swedish company with Swedish persons in your database, and among those are these two rows (name, address, postal code and city):

  • Oluf Palme, Sveagatan 67, 10001 Stockholm
  • Oluf Palme, Savegatan 76, 10001 Stockholm

What you do is plug into the government-provided citizen master data hub and ask for a match. The outcome can be:

  • The same citizen ID is returned because the person has relocated. It’s a duplicate.
  • Two different citizen IDs are returned. It’s not a duplicate.
  • Either only one or no citizen ID is returned. Leave it or do fuzzy matching.

If you go for fuzzy matching then you better be good, because all the easy ones are handled and you are left with the ones where false positives and false negatives are most likely. Often you will only do fuzzy matching if you have phone numbers, email addresses or other data to support the match.
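As a sketch only, assuming a stand-in lookup function for the government-provided hub, the decision logic could look like this; the function names and return values are hypothetical:

```python
def resolve_possible_duplicate(row_a, row_b, lookup_citizen_id):
    # lookup_citizen_id is a hypothetical stand-in for the government-provided
    # citizen master data hub; it returns a citizen ID or None for no match.
    id_a = lookup_citizen_id(row_a)
    id_b = lookup_citizen_id(row_b)

    if id_a and id_b:
        # Same ID despite different addresses: the person has relocated.
        return "duplicate" if id_a == id_b else "not a duplicate"

    # Only one or no citizen ID returned: the easy cases are exhausted.
    # Either leave the rows alone or fall back to fuzzy matching, preferably
    # supported by phone numbers, email addresses or other corroborating data.
    return "leave or fuzzy match"
```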

Another angle is that it is almost only Swedish companies who use this service with the government-provided reference data – but anyone holding Swedish data may use it upon approval.

Data quality solutions for party master data are not only about fuzzy matching but also about integrating with external reference data, exploiting all the various worldwide possibilities and supporting the logic and logistics in doing that. Also, we know that upstream prevention as close to the root as possible is better than downstream cleansing.

Deployment of such features as composable SOA components is described in a previous post here.

Data Quality 2.0 meets MDM 2.0

My current “Data Quality 2.0” endeavor started as a spontaneous heading on the topic of where, in my opinion, the data quality industry is going in the near future. But partly encouraged by being friendly slammed on the buzzword bingo, I have surfed the Web 2.0 for other 2.0’s. They are plentiful and frequent.

This piece by Mehmet Orun called “MDM 2.0: Comprehensive MDM” really caught my interest. Data Quality and MDM (Master Data Management) are closely related. When you do MDM you work much of the time with Data Quality issues, and doing Data Quality is most often doing Master Data Quality.

So assuming “Data Quality 2.0” and “MDM 2.0” are about what is referenced in the links above, it’s quite natural that many points are shared between the two terms.

Service Oriented Architecture (SOA) is one of the binding elements, as Data Quality solutions and MDM solutions will share Reference and Master Data Management services handling data stewardship, match-link, match-merge, address lookup, address standardization, address verification and data change management, carried out through Information Discrepancy Resolution Processes embracing internal and external data.
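As an illustration only, such shared services could sit behind small, technology-neutral interfaces; the service names, methods and return types below are assumptions, not any particular vendor's API.

```python
from typing import Protocol

class AddressVerificationService(Protocol):
    # Shared by both the Data Quality solution and the MDM hub.
    def standardize(self, address: str, country: str) -> str: ...
    def verify(self, address: str, country: str) -> bool: ...

class MatchService(Protocol):
    def match_link(self, record: dict) -> list[str]:
        # Return the IDs of existing records linked to this record.
        ...
    def match_merge(self, record_ids: list[str]) -> str:
        # Merge the given records and return the ID of the surviving record.
        ...
```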

The mega-vendors will certainly bundle their Data Quality and MDM offerings using more or less SOA. The ongoing vendor consolidation adds to that wave. But hopefully we will also see some true SOA where best-of-breed “Data Quality 2.0” and “MDM 2.0” technology will be implemented with strong business support under a broader solution plan to meet the intended business need, focusing on how the information is created, used and managed for multiple purposes in a multi-cultural environment.

Actually I should have added a (part 1) to the heading of this post. But I will try to make 2.0-free headings in following posts on the next generation milestones in Data Quality and MDM coexistence. It is possible – I did that in my previous post called Master Data Quality: The When Dimension.


Service Oriented Data Quality


Service Oriented Architecture (SOA) has been a buzzword for some years.

In my opinion SOA is a golden opportunity for getting benefits from data quality tools that we haven’t been able to achieve with the technology and approaches seen until now (besides the other SOA benefits, such as being independent of technology).

Many data quality implementations until now have been batch cleansing operations suffering from very little sustainability. I have seen lots of well-cleansed data never making it back to the sources, or only being partially updated in operational databases. And even then, a great deal of that updated, cleansed data wasn’t maintained, and new errors weren’t prevented from there.

Embedded data quality functionality in different ERP, CRM and ETL solutions has been around for a long time. These solutions may serve their purpose very well when implemented. But often they are not implemented, because bundling distinct ERP, CRM and ETL solutions and consultancies with specific advantages together with data quality tools with specific advantages may not always be a perfect match. Also, having different ERP, CRM and ETL solutions then often means different data quality tools and functionality, probably not doing the same thing the same way.

Data Quality functionality deployed as SOA components has a lot to offer:

Reuse is one of the core principles of SOA. Having the same data quality rules applied to every entry point of the same sort of data will help with consistency.

Interoperability will make it possible to deploy data quality prevention as close to the root as possible.

Composability makes it possible to combine functionality with different advantages – e.g. combining internal checks with external reference data.
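A small sketch of that composition, assuming a hypothetical internal format rule and a tiny stand-in for an external postal reference dataset:

```python
def check_postal_code_format(postal_code: str) -> bool:
    # Internal check: Swedish postal codes are five digits.
    return postal_code.isdigit() and len(postal_code) == 5

def verify_against_reference(postal_code: str, city: str, reference: dict) -> bool:
    # External reference data: does the postal code actually belong to the city?
    return reference.get(postal_code) == city

def composed_address_check(postal_code: str, city: str, reference: dict) -> bool:
    # Composability: two independent components combined into one service call.
    return check_postal_code_format(postal_code) and verify_against_reference(postal_code, city, reference)

# Usage with a tiny illustrative reference dataset.
reference_data = {"10001": "Stockholm"}
print(composed_address_check("10001", "Stockholm", reference_data))  # True
print(composed_address_check("1001A", "Stockholm", reference_data))  # False
```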

In recent years I have been on projects implementing data quality as SOA components. The results seem very promising so far, but I think we have only just started.
