What a Happy Day

I have got a lot of good news today.

First, a Nigerian gentleman wants to deposit 40.5 M $ into my bank account, and 35 % is for me to keep. Wow.

Next, it seems that data quality improvement and master data management are not about technology at all. Practically, you only need smart people running smart processes. Wow.

Nobody actually needs my assistance that much and soon I will have plenty of money in the bank.

What is a best-in-class match engine?

Lately, in connection with TIBCO acquiring data matching vendor Netrics, the term best-in-class match engine has been attached to the Netrics product.

First: I have no doubt that the Netrics product is a capable match engine – I know that from discussions in the LinkedIn Data Matching group and here on this blog.

Next: I don’t think anyone knows which product is the best match engine, because I don’t think that all match engines have been benchmarked with a representative set of data.

On top of that there are of course the matching capabilities for different entity types to consider. Here party master data (like customer data) are covered by most products, whereas capabilities for other entity types (whether considered the same or not) are far less exposed.

As match engine products are acquired and integrated into suites, the core matching capabilities become mixed up with a lot of other capabilities, making it hard to compare the match engine alone.

Some independent match engines work stand-alone and some may be embedded in other applications.

These may then be the classes to be best in:

  • Match engines in suites
  • Embedded match engines (for say SAP, MS CRM and so on)
  • Stand-alone match engines

Many match engines I have seen are tuned to deal with data from the country (culture) where they were born and had their first triumphs. As the US market is still by far the largest for match engines, the nomination of a best match engine resembles a team becoming World Champions in American Football. International/multi-cultural capabilities will become more and more important in data matching. But indeed we may define a class for each country (culture).

In the old days I heard that one match engine was best for marketing data and another match engine was best for credit risk management. I think those days are over too. With Master Data Management you have to embrace all data purposes.

Some match engines are more successful in one industry than in others. The biggest differentiator in match effectiveness is between B2C and B2B data. B2C is the easier, B2B is more complex, and embracing both is in my eyes a must for being considered best-in-class – unless we define separate classes for B2C, B2B and both.

As some matching techniques are deterministic and some are probabilistic, the evaluation of the latter will depend on the data already processed in a given instance, since the matching gets better and better as the self-learning element warms up.
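The difference between the two techniques can be sketched in a few lines of code. This is a minimal illustration, not any vendor's actual engine; the field names and weights are my own assumptions, and in a real probabilistic engine the weights would be re-estimated from match decisions already made in the instance:

```python
def deterministic_match(a, b):
    """Deterministic rule: fires only on exact agreement on name and postal code."""
    return a["name"] == b["name"] and a["postal"] == b["postal"]

def probabilistic_score(a, b, weights):
    """Probabilistic style: a weighted sum of per-field agreement.
    The weights are where the self-learning element would kick in,
    as they can be tuned from decisions made on data already processed."""
    score = 0.0
    for field, weight in weights.items():
        if a.get(field) == b.get(field):
            score += weight
    return score

rec1 = {"name": "Banca di Toscana", "postal": "51234", "city": "Firenze"}
rec2 = {"name": "Banca di Toscana", "postal": "51234", "city": "Florence"}

weights = {"name": 0.5, "postal": 0.25, "city": 0.25}
print(deterministic_match(rec1, rec2))              # True: name and postal agree
print(probabilistic_score(rec1, rec2, weights))     # 0.75: city disagrees
```

Where the deterministic rule answers only yes or no, the probabilistic score leaves room for a "maybe" band, which is exactly what warms up over time.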

So, yes, this is an endless, religious-like discussion I have reopened here.


What is Data Quality anyway?

The above question might seem a bit belated after I have been blogging about it for 9 months now. But from time to time I ask myself questions like:

Is Data Quality an independent discipline? If it is, will it continue to be that?

Data Quality is (or should be) actually a part of a lot of other disciplines.

Data Governance as a discipline is probably the best place to include general data quality skills and methodology – that is, all the people and process sides of data quality practice. Data Governance is an emerging discipline with an evolving definition, says Wikipedia. I think there is a pretty good chance that data quality management as a discipline will increasingly be regarded as a core component of data governance.

Master Data Management is a lot about Data Quality, but MDM could be dead already. Just like SOA. In short: I think MDM and SOA will survive, getting new life from the semantic web and all the data resources in the cloud. For that, MDM and SOA need Data Quality components. Data Quality 3.0 it is.

You may then replace MDM with CRM, SCM, ERP and so on, and hereby extend the use of Data Quality components from dealing not only with master data but also with transaction data.

Next questions: Are Data Quality tools an independent technology? If they are, will they continue to be that?

It’s clear that Data Quality technology is moving from stand-alone batch processing environments, via embedded modules, to, oh yes, SOA components.

If we look at what data quality tools today actually do, they in fact mostly support you with automation of data profiling and data matching, which probably covers only some of the data quality challenges you have.

In recent years there has been a lot of consolidation in the market around Data Integration, Master Data Management and Data Quality, which certainly tells that the market needs Data Quality technology as components in a bigger scheme along with other capabilities.

But some new pure Data Quality players are also being established – and I often see old folks from the acquired entities at these new challengers. So independent Data Quality technology is not dead and doesn’t seem to want to be.


Grandpa’s Story

Now that I have become a grandfather, it’s time for a blog post about lessons learned in life.

One of my favourite authors as a young man was Cyril Northcote Parkinson, the grandfather of the famous Parkinson’s Law, which says:

Work expands so as to fill the time available for its completion.

Early in my career I learned how true this is. My first experience was, like the statistics behind Parkinson’s Law, from within public administration, but later I learned that private enterprises are just the same.

My first real job after graduation was at the Danish Tax Authorities. After having worked there a few years I was assigned to a mission to assist the Faroe Islands Financial Authorities in developing a modernised tax collection solution.

The Faroe Islands

For those readers that hate old people not sticking to the subject – please continue to the next headline.

For those readers who don’t have a clue about where on earth the Faroe Islands are: Well, 1000 years ago the Vikings sailed out from Scandinavia and finally made it to say hello to the Native Americans – 500 years before Columbus. In doing so they used islands in the Northern Atlantic as stepping stones: first the British Isles, then the Faroe Islands, Iceland, Greenland and finally Newfoundland on the American coast.

Just like Columbus found America by mistake, as he was actually looking for India, the Vikings probably also found America and the stepping stones by mistake when getting lost on the ocean during storms.

1/100

Back on track. The mission for the Faroe Islands Authorities I joined in the early 1980’s seemed impossible. As the Faroese population is only 1/100 of the population of continental Denmark, there was of course only 1/100 of the resources available for making a solution doing exactly the same as the solution built for continental Denmark.

But what I learned was that the solution actually was built using only those resources and in a surprisingly short time (and with minimal help from me and my colleagues).

During my career I have worked in both modest-sized and large organisations, and I have noticed numerous examples of how exactly the same task may consume resources sized not by the nature of the task but by the size of the organisation.

People and technology

Maybe this observation is an explanation of the ever-recurring question of whether people or technology is most important in projects like improving data quality. If the technology part is (close to) constant but the overall resource consumption grows with the size of the organisation in question, well, then the people part becomes more and more important with the size of the organisation.

Tool making

I have tried single-handedly to build a data quality tool – or, to be more specific, a data matching tool. On several occasions it has been benchmarked against products residing as leaders in the Gartner Magic Quadrant for data quality tools, and it didn’t come out short. Some of the features included in the product, called SuperMatch, are described in the post “When computer says maybe”.

It’s my impression that if you look at tool vendors with many employees, only a very small group of people is actually working on the tool.

When computer says maybe

When matching customer master data in order to find duplicates, or to find corresponding real-world entities in a business or consumer directory, you may use a deduplication kind of data quality tool to do the hard work.

The tool will typically – depending on the capabilities of the tool and the nature of and purpose for the data – find:

A: The positive automated matches. Ideally you will take samples for manual inspection.

C: The negative automated matches.

B: The dubious part selected for manual inspection.
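The three-way split can be sketched as a pair of thresholds on a similarity score. The threshold values below are purely illustrative assumptions; in practice they depend on the capabilities of the tool and the nature of and purpose for the data:

```python
def classify(score, upper=0.85, lower=0.40):
    """Route a similarity score to the A, B or C pot.
    Thresholds are illustrative, not from any particular product."""
    if score >= upper:
        return "A"   # positive automated match (sample for inspection)
    if score < lower:
        return "C"   # negative automated match
    return "B"       # dubious - selected for manual inspection

print([classify(s) for s in (0.95, 0.60, 0.10)])  # ['A', 'B', 'C']
```

Moving the thresholds trades automation against manual workload: a wider middle band means fewer wrong automated decisions but a bigger B pot for the costly human reviewers.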

Humans are costly resources. Therefore the manual inspection of the B pot (and the A sample) may be supported by a user interface that helps get the job done fast but accurately.

I have worked with the following features for such functionality:

  • Random sampling for quality assurance – both from the A pot and from the manually settled part of the B pot
  • Check-out and check-in for multiuser environments
  • Presenting a ranked range of computer selected candidates
  • Color coding elements in matched candidates – like:
    • green for (near) exact name,
    • blue for a close name and
    • red for a far from similar name
  • Possibility for marking:
    • as a manual positive match,
    • as a manual negative match (with reason) or
    • as questionable for later or supervisor inspection (with comments)
  • Entering a match found by other methods
  • Removing one or several members from a duplicate group
  • Splitting a duplicate group into two groups
  • Selecting survivorship
  • Applying hierarchy linkage

Anyone else out there who has worked with making or using a man-machine dialogue for this?

Cultural Stereotypes, Matching Engines and an Oscar

Normally I’m not that fond of using cultural stereotypes, but nevertheless prompted by a conversation lately (and inspired by the Oscar show) I came to think about the following scenarios:

Indian Style

I have heard that in India you don’t say no if someone asks you to do something. So a Bollywood story could be:

A boss calls in a product manager. He asks him to make a data matching engine that produces no false positives and no false negatives. The product manager knows it is impossible, but can’t say no. The product manager says it may be complicated, but when told they can double the team he goes back to the developers and initiates the project.

After a month the boss calls the product manager and asks if they are finished. The product manager replies: “Well, we have come a long way, but there are still some unresolved issues and some testing to be done”.

After yet a month the boss calls the product manager again and asks if they are finished. The product manager replies: “Well, we have solved the previous issues, but we have run into some new problems and some more testing has to be done”.

After yet a month the boss calls the product manager again and asks if they are finished. The product manager replies: “Well, we have ….

Danish Habits

In Denmark we have a good compensation from the state if we lose our jobs and anyway we are confident that we will find another one. So the short story (we are good at short films) could be:

The boss calls in the product manager and says “Hi Kim, it’s been decided we will make a matching engine that produces no false positives and no false negatives”.

The product manager leans forward, slams the provided business plan onto the table and says: “If you want such a product you can make it yourself” and leaves the room.

The American Way

It’s my impression, that in the United States you (mostly) do what you are told to do. So here the Hollywood story could be:

The boss calls in the product manager and says “Chris, I have got a great idea:  We will make a matching engine that produces no false positives and no false negatives”.

The product manager replies: “That’s impossible”.

The boss says: “Chris, I didn’t ask you about your opinion but told you to make the product”.

The product manager: “You’re the boss”.

The product manager returns to the team. They work hard to make a matching engine with some configurable settings such as:

  • No false positives, but false negatives are allowed (recommended)
  • No false negatives, but false positives are allowed
  • No false positives and no false negatives

The boss is satisfied with how the product looks. He passes it on to marketing. Marketing contacts the analysts. The analysts are excited about the product features and write about how this great product (from this well-established company) will change the game of data matching.

Under new Master Data Management

”Under new management” is a common sign in the window of a restaurant. The purpose of the sign is to tell: Yes, we know: Really bad food was served in a really bad way here. But from now on we have a new management dedicated to serving really good food in a really good way.

By the way: Restaurants are one of the more challenging business entities to handle in Party Master Data Management:

  • They change owner more often than most other business entities, becoming a new legal entity each time, which is important for some business contexts like credit risk.
  • On the other hand it’s the same address despite a new owner, which makes it the same entity in the eyes of other business contexts like logistics.
  • In many cases you may have a name (trade style) of the restaurant and another official name of the business – a variant of this is when the restaurant is franchised.
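The bullets above can be made concrete with a small data structure. This is a hypothetical sketch (the record fields and identifiers are my own invention, not any MDM product's model) of why one restaurant can be "the same" party in one context and a brand-new one in another:

```python
from dataclasses import dataclass

@dataclass
class RestaurantPartyRecord:
    trade_style: str        # the name on the sign in the window
    legal_name: str         # official name of the business
    legal_entity_id: str    # new identifier on every change of ownership
    location_id: str        # stable identifier for the address

def same_for_credit_risk(a, b):
    """Credit risk cares about the legal entity, which changes with the owner."""
    return a.legal_entity_id == b.legal_entity_id

def same_for_logistics(a, b):
    """Logistics cares about the address, which survives a change of owner."""
    return a.location_id == b.location_id

before = RestaurantPartyRecord("Mario's", "Mario Rossi ApS", "LE-1", "LOC-7")
after  = RestaurantPartyRecord("Mario's", "Luigi Bianchi ApS", "LE-2", "LOC-7")
print(same_for_credit_risk(before, after))  # False: a new legal entity
print(same_for_logistics(before, after))    # True: the same address
```

The point is that "same entity" is not one question but several, and a party master data model has to carry enough identifiers to answer each of them.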

Master Data Management is not trivial – serving restaurants or not.

Improving Master Data Management starts with the sign in the window: Yes, we know: Really bad information was served here in a really bad way. But from now on we have a new master data management dedicated to serving really good information in a really good way.

Then you may have a look at the menu. Do we have the right mix of menu items for the guests we like to serve? How are we going to govern a steady flow of fresh raw data that’s going to be prepared and selected from the menu and end up at the tables?

What about the waiters’ attitude? Serving is much more fun if you are proud of the dishes coming from the kitchen. It’s pleasant to bring compliments from guests back to the kitchen – not least when given along with great tips.

The information chef has to be very much concerned about the raw data quality and the tools available for what may be similar to rinsing, slicing, mixing and boiling food.

Bon appetit.


Candidate Selection in Deduplication

When a recruiter and/or a hiring manager finds someone for a job position, it is basically done by getting in a number of candidates and then choosing the best fit among them. This of course doesn’t account for the fact that there may be someone better suited among all those people who were not among the candidates.

We have the same problem in data matching when we are deduplicating, consolidating or matching for other purposes.

Let’s look at the following example. We have two names and addresses:

Banca di Toscana Società per azioni
Machiavelli 12
IT 51234 Firenze

Vanca di Toscana SpA
12, Via Niccolò Machiavelli
Florence
Italy

A human or a mature computerized matching engine will be able to decide that this is the same real-world entity, with more or less confidence, by taking some knowledge into consideration, such as:

  • The ISO country code for Italy is IT
  • Florence is the English name for the city called Firenze in Italian
  • In Italian (as in Spanish, Germanic and Slavic cultures) the house number is written after the street name (as opposed to English and French cultures)
  • In Italian you sometimes omit “Via” (Italian for way) and the first name in a street named after a person
  • “Società per azioni” with the acronym SpA or S.p.A is an Italian legal form

But another point is whether the two records are even going to be compared. Due to the above-mentioned reasons related to diversity, and the typo in the first letter of the name in the last record, no ordinary sorting mechanism on the original data will get the two records into the same range.

If one record is in a table with 1,000,000 rows and the other record is in another table with 1,000,000 rows, comparing every row with every row makes a Cartesian product of 1,000,000,000,000 similarity assignments, which is not practical. Nor is a real-time check against 1,000,000 rows for every new entry a practical option.

I have worked with the following techniques for overcoming this challenge:

Parsing and standardization

The address part of the example data may be parsed and standardized (including using geographical reference data) so it is put into the same format, like:

IT, 51234, Via Niccolo Machiavelli, 12

Then you are able to compare rows at a certain geographical depth, like all rows on the same entrance, street or postal code.

This technique is, however, heavily dependent on accurate and precise original addresses, and it works best when applied per culture.
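Once addresses are on the same format, candidate selection can be done by grouping rows on a blocking key built from the standardized parts. A minimal sketch, assuming country plus postal code as the key (field names and sample rows are illustrative):

```python
from collections import defaultdict
from itertools import combinations

rows = [
    {"id": 1, "country": "IT", "postal": "51234",
     "street": "Via Niccolo Machiavelli", "house": "12"},
    {"id": 2, "country": "IT", "postal": "51234",
     "street": "Via Niccolo Machiavelli", "house": "12"},
    {"id": 3, "country": "IT", "postal": "00100",
     "street": "Via Roma", "house": "1"},
]

# Group rows by a blocking key on the standardized address.
blocks = defaultdict(list)
for row in rows:
    blocks[(row["country"], row["postal"])].append(row)

# Only rows sharing a block become candidate pairs for similarity assignment,
# avoiding the full Cartesian product over all rows.
candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)  # [(1, 2)] - one pair instead of three
```

This is also where the dependence on accurate addresses shows: a wrong postal code puts a record in the wrong block and the true duplicate is never compared.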

Fuzzy search

Here you make use of the same fuzzy techniques used in similarity assignment when searching.
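One way to sketch this, assuming a character n-gram index as the fuzzy technique (the threshold and sample names are illustrative): the typo in "Vanca di Toscana" still shares most trigrams with "Banca di Toscana", so the record is retrieved as a candidate where plain sorting on the original data never would get the two into the same range.

```python
from collections import defaultdict

def trigrams(text):
    """The set of 3-character substrings of a lowercased string."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

# Build an inverted index from trigram to record ids.
index = defaultdict(set)
names = {1: "Banca di Toscana", 2: "Credito Romano"}
for rec_id, name in names.items():
    for gram in trigrams(name):
        index[gram].add(rec_id)

def candidates(query, min_shared=3):
    """Record ids sharing at least min_shared trigrams with the query."""
    hits = defaultdict(int)
    for gram in trigrams(query):
        for rec_id in index[gram]:
            hits[rec_id] += 1
    return {rec_id for rec_id, n in hits.items() if n >= min_shared}

print(candidates("Vanca di Toscana"))  # {1}
```

The retrieved candidates then go through the full similarity assignment; the fuzzy search only narrows the field.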

Probabilistic learning

If some variations of the same name or address have earlier been accepted as being the same, these variations may be recorded and used in future searching.
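A minimal sketch of that idea, assuming a simple variant map (the Firenze/Florence pair is taken from the example above; the function names are my own): once a match decision has been accepted, the variation is recorded, and later searches expand the query with everything the value has been matched with.

```python
learned_variations = {}  # canonical value -> set of accepted variants

def record_accepted_match(canonical, variant):
    """Remember that these two values were accepted as being the same."""
    learned_variations.setdefault(canonical, set()).add(variant)

def expand(term):
    """Return the term plus every value it has been matched with earlier."""
    expanded = {term}
    for canonical, variants in learned_variations.items():
        if term == canonical or term in variants:
            expanded |= {canonical} | variants
    return expanded

record_accepted_match("Firenze", "Florence")
print(sorted(expand("Florence")))  # ['Firenze', 'Florence']
```

This is the "warming up" effect: the more decisions the instance has processed, the more variations the search can draw on.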

Hybrid

As always in data quality automation, combining the different techniques in a given implementation improves your margins.

Deploying Data Matching

As discussed in my last post, a core part of many Data Quality tools is Data Matching. Data Matching is about linking entities in or between databases, where these entities are not already linked with unique keys.

Data Matching may be deployed in some different ways, where I have been involved in the following ones:

External Service Provider

Here your organization sends extracted data sets to an external service provider, where the data are compared, and in many cases also related to other reference sources, all through matching technology. The provider sends back a “golden copy” ready for uploading into your databases.

Some service providers use a Data Matching tool from the market and others have developed their own solutions. Many solutions grown at the providers are country-specific, equipped with a lot of tips and tricks learned from handling data from that country over the years.

The big advantage here is that you gain from the experience – and the reference data collection – at these providers.

Internal Processing

You may implement a data quality tool from the market and use it for comparing your own data, often from disparate internal sources, in order to grow the “golden copy” at home.

Many MDM (Master Data Management) products have some matching capabilities built in.

Also, many leading Business Intelligence tool providers supplement the offering with an (integrated) Data Quality tool with matching capabilities, as an answer to the fact that Business Intelligence on top of duplicated data doesn’t make sense.

Embedded Technology

Many data quality tool vendors provide plug-ins to popular ERP, CRM and SCM solutions so that data are matched with existing records at the point of entry. For the most popular solutions, such as SAP and MS CRM, there are multiple such plug-ins from different Data Quality technology providers. Then again, many implementation houses have a favorite combination – so in that way you select the matching tool by selecting an implementation house.

SOA Components

Embedded technology is of course not optimal where you operate with several databases, and the commercial bundling may also not be the best solution for you.

Here Service Oriented Architecture thinking helps, so that matching services are available as SOA components at any point in your IT landscape based on centralized rule setting.
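The SOA idea can be sketched as one centrally configured matching component that any application in the landscape calls, instead of each system embedding its own rules. The class, field names and threshold below are illustrative assumptions, not a real product's interface:

```python
class MatchingService:
    """A matching component any caller in the IT landscape can use."""

    def __init__(self, rules):
        self.rules = rules  # centralized rule setting, shared by all callers

    def is_duplicate(self, a, b):
        """Score per-field agreement and compare with the central threshold."""
        score = sum(
            weight for field, weight in self.rules["weights"].items()
            if a.get(field) == b.get(field)
        )
        return score >= self.rules["threshold"]

# One central configuration, consumed by CRM, ERP and any other caller.
service = MatchingService(
    {"weights": {"name": 0.5, "postal": 0.3, "city": 0.2}, "threshold": 0.7}
)
crm_record = {"name": "Banca di Toscana", "postal": "51234", "city": "Firenze"}
erp_record = {"name": "Banca di Toscana", "postal": "51234", "city": "Florence"}
print(service.is_duplicate(crm_record, erp_record))  # True
```

The point of the design is that changing the central rules changes matching behaviour at every point of entry at once.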

Cloud Computing

Cloud computing services offered by external service providers take the best from these two worlds into one offering.

Here the SOA component resides at the external service provider – in the best case combining an advanced matching tool, rich external reference data and the tips and tricks for the particular country and industry in question.


Data Quality Tools Revealed

To be honest: Data Quality tools today solve only a few of the data quality problems you have. On the other hand, the few problems they do solve may be solved very well, and they cannot be solved by any other line of products, or in any practical way by humans, with the same quantity or quality.

Data Quality tools mainly support you with automation of:

  • Data Profiling and
  • Data Matching

Data Profiling

Data profiling is the ability to generate statistical summaries and frequency distributions for the unique values and formats found within the fields of your data sources, in order to measure data quality and find critical areas that may harm your business. For more description on the subject I recommend reading the introduction provided by Jim Harris in his post “Getting Your Data Freq On”, which is followed up by a series of posts on the “Adventures in Data Profiling part 1 – 8”.

Saying that you can’t use other product lines for data profiling is actually only partly true. You may come a long way by using features in popular database managers, as demonstrated in Rich Murnane’s blog post “A very inexpensive way to profile a string field in Oracle”. But for full automation and a full set of out-of-the-box functionality a data profiling tool will be necessary.
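To show how far general-purpose tools take you, here is the core of profiling done with the standard library alone: a frequency distribution of the values in a field, and of their formats (letters collapsed to A, digits to 9). The sample data are made up for illustration:

```python
from collections import Counter
import re

phone_numbers = ["555-0100", "555-0101", "5550102", "N/A", "555-0100"]

# Frequency distribution of the unique values in the field.
value_freq = Counter(phone_numbers)

# Frequency distribution of formats: letters -> A, digits -> 9.
format_freq = Counter(
    re.sub(r"\d", "9", re.sub(r"[A-Za-z]", "A", v)) for v in phone_numbers
)
print(value_freq.most_common(1))  # [('555-0100', 2)]
print(format_freq)                # '999-9999' dominates; outliers stand out
```

The outlier formats ('9999999' and 'A/A') are exactly the critical areas a profiling tool would flag, only the tool does it out of the box across all fields.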

The data profiling tool market landscape is – in contrast to that of data matching – also characterized by the existence of open source tools. Talend is the leading one of those; another is DataCleaner, created by my fellow countryman Kasper Sørensen.

I take the emergence of open source solutions in the realm of data profiling as a sign that this is the technically easiest part of data quality tool invention.

Data Matching

Data matching is the ability to compare records that are not exactly the same but are so similar that we may conclude that they represent the same real-world object.

Here too, some popular database managers today have some functionality, like the fuzzy grouping and lookup in MS SQL. But in order to really automate data matching processes you need a dedicated tool equipped with advanced algorithms and comprehensive functionality for candidate selection, similarity assignment and survivorship settlement.
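To illustrate the similarity assignment part, here is a classic edit-distance measure of the kind a dedicated tool packages alongside candidate selection and survivorship settlement. This is one well-known algorithm (Levenshtein distance), not a claim about what any particular product uses; the normalization is a common simple choice:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Normalized to 1.0 for identical strings, 0.0 for nothing shared."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

# The typo example from the candidate selection post scores very high.
print(round(similarity("Vanca di Toscana", "Banca di Toscana"), 3))  # 0.938
```

A real engine would combine several such measures per field and weigh them together, but the principle – a graded score instead of a yes/no comparison – is the same.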

Data matching tools are essential for processing large numbers of data rows within a short timeframe, for example when purging duplicates before marketing campaigns or merging duplicates in migration projects.

Matching technology is increasingly implemented as what is often described as a firewall, where possible new entries are compared to existing rows in databases as upstream prevention against duplication.

Besides handling duplicates, matching techniques are used for correcting postal addresses against official postal references, matching data sets against reference databases like B2B and B2C party data directories, and matching with product data systems – all in order to enrich with and maintain more accurate and timely data.

Automation of matching is in no way straightforward, and solutions are constantly faced with balancing the production of a sufficient number of true positives against creating too many false positives.
