Having the right element to the left

Name, address and place are core attributes in almost any database. You may atomize these attributes into smaller slices, but in doing that: Mind the sequence.

When working with data matching and party master data management some of the frequent exposed issues are:

Person name

Often a person name is split into first name and last name, but even when assigning these labels you are on slippery ground. Examples:

  • In some cultures like in east Asia the family name is written first and the given name is written last.
  • Some notations indicate that the given name isn’t the first element:
    • “DUPONT Michel” is a custom French way of telling, that the family name is the first element
    • “Smith, John” is an universal way of telling, that the family name is the first element

Besides that we have issues with middle names and other three part naming and having salutation, education and job titles mixed up in name fields.

Street address

Most of the world is divided into two “street address” cultures:

  • In the Americas you write the house number in front of street name if you are north of Rio Grande being US and CA, but you write the house number after the street name if you are south of Rio Grande being MeXico, BRazil, ARgentina and almost any other country.
  • In Europe you write the house number in front of street name if you are on the British Isles or in France, but you write the house number after the street name if you are in almost any other country.
  • The rest of the world is also divided in writing street addresses.

Besides that we have other ways of writing addresses like the block style in Japan.

Place

Most countries have a postal code system – even Ireland will have that soon.

Despite the fact that a city name in most cases can be obtained by looking up the postal code we often do store the city name anyway – for those cases that we can’t.

And if the postal code and the city name is in one string: Oh yes, in some cultures you write the city name in front of the postal code and in other cultures you do it the opposite way. And oh no: It doesn’t necessary follow the sequence of the house number and street name.

In a blog post written a while ago we also had a look into postal address hierarchy, granularity, precision and history.

Bookmark and Share

Under new Master Data Management

”Under new management” is a common sign in the window of a restaurant. The purpose of the sign is to tell: Yes, we know: Really bad food was served in a really bad way here. But from now on we have a new management dedicated to serve really good food in a really good way.

By the way: Restaurants are one of the more challenging business entities to handle in Party Master Data Management:

  • They do change owner more often than most other business entities making them a new legal entity each time which is important for some business contexts like credit risk.
  • On the other hand it’s the same address despite a new owner, which makes it being the same entity in the eyes of other business contexts like logistics.
  • In many cases you may have a name (trade style) of the restaurant and another official name of the business – a variant of this is when the restaurant is franchised.

Master Data Management is not trivial – serving restaurants or not.

Improving Master Data Management starts with the sign in the window: Yes, we know: Really bad information was served here in a really bad way. But from now on we have a new master data management dedicated to serve really good information in a really good way.

Then you may have a look at the menu. Do we have the right mix of menu items for the guests we like to serve? How are we going to govern a steady flow of fresh raw data that’s going to be prepared and selected from the menu and end up at the tables?

What about the waiters attitude? Serving is much more fun if you are proud about the dishes coming from the kitchen. It’s pleasant to bring compliments from guests back to the kitchen – not at least given along with great tips.

The information chef have to be very much concerned about the raw data quality and the tools available for what may be similar to rinsing, slicing, mixing and boiling food.

Bon appetit.

Bookmark and Share

Candidate Selection in Deduplication

When a recruiter and/or a hiring manager finds someone for a job position it is basically done by getting in a number of candidates and then choose the best fit among them. This of course don’t make up for, that there may be someone better fit among all those people that were not among the candidates.

We have the same problem in data matching when we are deduplicating, consolidating or matching for other purposes.

Lets look at the following example. We have 2 names and addresses:

Banca di Toscana Società per azioni
Machiavelli 12
IT 51234 Firenze

Vanca di Toscana SpA
12, Via Niccolò Machiavelli
Florence
Italy

A human or a mature computerized matching engine will be able to decide, that this is the same real world entity with more or less confidence depending on taking some knowledge in consideration as:

  • The ISO country code for Italy is IT
  • Florence is the English name for the city called Firenze in Italian
  • In Italian (like Spanish, Germanic and Slavic cultures) the house number is written after the street name (opposite to in English and French cultures)
  • In Italian you sometimes don’t write “Via” (Italian for way) and the first name in a street named after a person
  • “Società per azioni” with the acronym SpA or S.p.A is an Italian legal form

But another point is if the 2 records even is going to be compared. Due to the above mentioned reasons related to diversity and the typo in the first letter of the name in the last record no ordinary sorting mechanism on the original data will get the 2 records in the same range.

If the one record is in a table with 1,000,000 rows and the other record is in another table with 1,000,000 rows the option of comparing every row with every row makes a Cartesian product of 1,000,000,000,000 similarity assignments which is not practical. Also a real-time check with 1,000,000 rows for every new entry don’t make a practical option.

I have worked with the following techniques for overcoming this challenge:

Parsing and standardization

The address part of the example data may be parsed and standardized (including using geographical reference data) so it is put on the same format like:

IT, 51234, Via Niccolo Machiavelli, 12

Then you are able to compare rows in a certain geographical depth like all on same entrance, street or postal code.

This technique is though heavily dependent on accurate and precise original addresses and works best applied for each different culture.

Fuzzy search

Here you make use of the same fuzzy techniques used in similarity assignment when searching.

Probabilistic learning

If earlier some variations of the same name or address is accepted as being the same, these variations may be recorded and used in future searching.

Hybrid

As always in data quality automation, using all different techniques in a given implementation makes your margins better.

Deploying Data Matching

As discussed in my last post a core part of many Data Quality tools is Data Matching. Data Matching is about linking entities in or between databases, where these entities are not already linked with unique keys.

Data Matching may be deployed in some different ways, where I have been involved in the following ones:

External Service Provider

Here your organization sends extracted data sets to an external service provider where the data are compared and also in many cases related to other reference sources all through matching technology. The provider sends back a “golden copy” ready for uploading in your databases.

Some service provider’s uses a Data Matching tool from the market and others has developed own solutions. Many solutions grown at the providers are country specific equipped with a lot of tips and tricks learned from handling data from that country over the years.

The big advantage here is that you gain from the experience – and the reference data collection – at these providers.

Internal Processing

You may implement a data quality tool from the market and use it for comparing your own data often from disparate internal sources in order to grow the “golden copy” at home.

Many MDM (Master Data Management) products have some matching capabilities build in.

Also many leading Business Intelligence tool providers supplement the offering with a (integrated) Data Quality tool with matching capabilities as an answer to the fact, that Business Intelligence on top of duplicated data doesn’t make sense.

Embedded Technology

Many data quality tool vendors provide plug-ins to popular ERP, CRM and SCM solutions so that data are matched with existing records at the point of entry. For the most popular such solutions as SAP and MS CRM there is multiple such plug-in’s from different Data Quality technology providers. Then again many implementation houses have a favorite combination – so in that way you select the matching tool by selecting an implementation house.

SOA Components

The embedded technology is of course not optimal where you operate with several databases and the commercial bundling may also not be the actual best solution for you.

Here Service Oriented Architecture thinking helps, so that matching services are available as SOA components at any point in your IT landscape based on centralized rule setting.

Cloud Computing

Cloud computing services offered from external service providers takes the best from these two worlds into one offering.

Here the SOA component resides at the external service provider – in best case combining an advanced matching tool, rich external reference data and the tips and tricks for your particular country and industry in question.

Bookmark and Share

Data Quality Tools Revealed

To be honest: Data Quality tools today only solves a very few of the data quality problems you have. On the other hand, the few problems they do solve may be solved very well and can not be solved by any other line of products or in any practically way by humans in any quantity or quality.

Data Quality tools mainly support you with automation of:

• Data Profiling and
• Data Matching

Data Profiling

Data profiling is the ability to generate statistical summaries and frequency distributions for the unique values and formats found within the fields of your data sources in order to measure data quality and find critical areas that may harm your business. For more description on the subject I recommend reading the introduction provided by Jim Harris in his post “Getting Your Data Freq On”, which is followed up by a series of posts on the “Adventures in Data Profiling part 1 – 8”

Saying that you can’t use other product lines for data profiling is actually only partly true. You may come a long way by using features in popular database managers as demonstrated in Rich Murnanes blog post “A very inexpensive way to profile a string field in Oracle”. But for full automation and a full set of out-of-the-box functionality a data profiling tool will be necessary.

The data profiling tool market landscape is – opposite to that of data matching – also characterized by the existence of open source tools. Talend is the leading one of those, another one is DataCleaner created by my fellow countryman Kasper Sørensen.

I take the emerge of open source solutions in the realm of data profiling as a sign of, that this is the technically easiest part of data quality tool invention.

Data Matching

Data matching is the ability to compare records that are not exactly the same but are so similar that we may conclude, that they represent the same real world object.

Also here some popular database managers today have some functionality like the fuzzy grouping and lookup in MS SQL. But in order to really automate data matching processes you need a dedicated tool equipped with advanced algorithms and comprehensive functionality for candidate selection, similarity assignment and survivorship settlement.

Data matching tools are essential for processing large numbers of data rows within a short timeframe for example when purging duplicates before marketing campaigns or merging duplicates in migration projects.

Matching technology is becoming more popular implemented as what is often described as a firewall, where possible new entries are compared to existing rows in databases as an upstream prevention against duplication.

Besides handling duplicates matching techniques are used for correcting postal addresses against official postal references and matching data sets against reference databases like B2B and B2C party data directories as well as matching with product data systems all in order to be able to enrich with and maintain more accurate and timely data.

Automation of matching is in no way straightforward and solutions for that are constantly met with the balancing of producing a sufficient number of true positives without creating just that number of too many false positives.

Bookmark and Share

The Myth about a Myth

A sentiment repeated again and again related to Data (Information) Quality improvement goes like this:

“It’s a myth that Data Quality improvement is all about technology”.

In fact you see the same related to a lot of other disciplines as:

  • “It’s a myth that Master Data Management is all about technology”.
  • “It’s a myth that Business Intelligence is all about technology”.
  • “It’s a myth that Customer Relationship Management is all about technology”.

I have a problem with that: I have never heard anyone say that DQ/MDM/BI/CRM… is all about technology and I have never seen anyone writing so.

When I make the above remark the reaction is almost always this:

“Of course not, but I have seen a lot of projects carried out as if they were all about technology – and of course they failed”.

Unquestionable true.

But the next question is then about root cause. Why did those projects seem to be all about technology? I think it was:

  • Poor project management or
  • Bad balance between business and IT involvement or
  • Immature technology alienating business users.

In my eyes there is no myth about that Data Quality (and a lot of other things) is all about technology. It’s a myth it’s a myth.

Bookmark and Share

Bon Appetit

If I enjoy a restaurant meal it is basically unimportant to me what raw ingredients from where were used and which tools the chef used during preparing the meal. My concerns are whether the taste meet my expectations, the plate looks delicious in my eyes, the waiter seems nice and so on.

This is comparable to when we talk about information quality. The raw data quality and the tools available for exposing the data as tasty information in a given context is basically not important to the information consumer.

But in the daily work you and I may be the information chef. In that position we have to be very much concerned about the raw data quality and the tools available for what may be similar to rinsing, slicing, mixing and boiling food.

Let’s look at some analogies.

Best before

Fresh raw ingredients is similar to actualized raw data. Raw data also has a best before date depending on the nature of the data. Raw data older than that date may be spiced up but will eventually make bad tasting information.

One-stop-shopping

Buying all your raw ingredients and tools for preparing food – or taking the shortcut with ready made cookie cutting stuff – from a huge supermarket is fast and easy (and then never mind the basket usually also is filled with a lot of other products not on the shopping list).

A good chef always selects the raw ingredients from the best specialized suppliers and uses what he consider the most professional tools in the preparing process.

Making information from raw data has the same options.

Compliance

Governments around the world has for long time implemented regulations and inspection regarding food mainly focused at receiving, handling and storing raw ingredients.

The same is now going on regarding data. Regulations and inspections will naturally be directed at data as it is originated, stored and handled.

Diversity

Have you ever tried to prepare your favorite national meal in a foreign country?

Many times this is not straightforward. Some raw ingredients are simply not available and even some tools may not be among the kitchen equipment.

When making information from raw data under varying international conditions you often face the same kind of challenges.

Information and Data Quality Blog Carnival, February 2010


El Festival del IDQ Bloggers is another name for the monthly recurring post of selected (actually rather submitted) blog posts on information and data quality started last year by the IAIDQ.

This is the February 2010 edition covering posts published in December 2009 and January 2010.

I will go straight to the point:

Daragh O Brien shared the story about a leading Irish Hospital that has come under scrutiny for retaining data without any clear need. This highlights an important relationship between Data Protection/Privacy and Information Quality. Daragh’s post explores some of this relationship through the “Information Quality Lense”. Here’s the story: Personal Data – an Asset we hold on Trust.

Former Publicity Director of the IAIDQ, Daragh has over a decade of coal-face experience in Information Quality Management at the tactical and strategic levels from the Business perspective. He is the Taoiseach (Irish for chieftain) of Castlebridge Associates. Since 2006 he has been writing and presenting about legal issues in Information Quality amongst other topics.

Jim Harris is an independent consultant, speaker, writer and blogger with over 15 years of professional services and application development experience in data quality. Obsessive-Compulsive Data Quality is an independent blog offering a vendor-neutral perspective on data quality.

If you are a data quality professional, know the entire works by Shakespeare by heart and are able to wake up at night and promptly explain the theories of Einstein you probably know Jim’s blogging. On the other hand: If you don’t know Shakespeare, don’t understand Einstein, then: Jim to the rescue. Read The Dumb and Dumber Guide to Data Quality.

In another post Jim discusses the out-of-box-experience (OOBE) provided by data quality (DQ) software under the title: OOBE-DQ, Where Are You? Jim also posted part 8 of Adventures in Data Profiling – a great series of knowledge sharing on this important discipline within data quality improvement.

Phil Wright is a consultant based in London, UK who specialises in Business Intelligence and Data Quality Management.  With 10 years experience within the Telecommunications and Financial Services Industries, Phil has implemented data quality management programs, led data cleansing exercises and enabled organisations to realise their data management strategy.

The Data Factotum blog is a new blog in the Data Quality blogosphere, but Phil has kick started with 9 great posts during the first month. A balanced approach to scoring data quality is the start of a series on the topic of using the balanced scoreboard concept in measuring data quality.

Jan Erik Ingvaldsen is a colleague and good friend of mine. In a recent market competition scam cheap flight tickets from Norwegian Air Shuttle was booked by employees from competitor Cimber Sterling using all kinds of funny names. As usual Jan Erik not only has a nose for a good story but he is also able to propose the solutions as seen here in Detecting Scam and Fraud.

In his position as Nordic Sales Manager at Omikron Data Quality Jan Erik actually is a frequent flyer at Norwegian Air Shuttle. Now he is waiting whether he will be included on their vendor list or on the no-fly list.

William Sharp is a writer on technology focused blogs with an emphasis on data quality and identity resolution.

Informatica Data Quality Workbench Matching Algorithms is part of a series of postings were William details the various algorithms available in Informatica Data Quality (IDQ) Workbench. In this post William start by giving a quick overview of the algorithms available and some typical uses for each. The subsequent postings gets more detailed and outline the math behind the algorithm and will finally be finished up with some baseline comparisons using a single set of data.

Personally I really like this kind of ready made industrial espionage.

IQTrainwrecks hosted the previous blog carnival edition. From this source we also has a couple of postings.

The first was submitted by Grant Robinson, the IAIDQ’s Director of Operations. He shares an amusing but thought provoking story about the accuracy of GPS systems and on-line maps based on his experiences working in Environmental Sciences. Take a dive in the ocean…

Also it is hard to avoid including the hapless Slovak border police and their accidental transportation of high explosives to Dublin due to a breakdown in communication and a reliance on inaccurate contact information. Read all about it.

And finally, we have the post about the return of the Y2k Bug as systems failed to properly handle the move into a new decade, highlighting the need for tactical solutions to information quality problems to be kept under review in a continuous improvement culture in case the problem reoccurs in a different way. Why 2K?

If you missed them, here’s a full list of previous carnival posts:

April 2009 on Obsessive-Compulsive Data Quality by Jim Harris

May 2009 on The DOBlog by Daragh O Brien

June 2009 on Data Governance and Data Quality Insider by Steve Sarsfield

July 2009 on AndrewBrooks.co.uk by Andrew Brooks

August 2009 on The DQ Chronicle by William E Sharp

September 2009 on Data Quality Edge by Daniel Gent

October 2009 on Tooling around in the IBM Infosphere by Vincent McBurney

November 2009 on IQTrainwrecks.com by IAIDQ

Bookmark and Share

Select Company_ID from External_Source where possible

With the risk of having the comment area on this blog filled up with SQL statements I will follow the track and tone from the last post called Create Table Homo_Sapiens.

In the last post some challenges around modelling people in databases was discussed with focus on uniqueness. Now we will have a look at the same challenges with companies – the other big part of party master data.

Companies often act in the same role as individual people in business processes – not at least in the role as a customer. Companies also behave as persons in a lot of ways like being born (establish), change name, relocate, marry (mergers and acquisitions), divorce (split) and decease (dissolve).

All over the world a lot of people spend the days entering and updating the data held on business partners in numerous databases. The world wide sum of B2B connections between a customer and a vendor each entering and maintaining the data about the other resembles (though less aggressive) the grains on a chessboard story:

  • 2 companies both exchanging goodies makes 1+1 customers and 1+1 vendors = 4 rows
  • 3 companies all exchanging goodies makes 2+2+2 customers and 2+2+2 vendors = 12 rows
  • 4 companies all exchanging goodies makes 3+3+3+3 customers and 3+3+3+3 vendors = 24 rows
  • 5 companies all exchanging goodies makes 4+4+4+4+4 customers and 4+4+4+4+4 vendors = 40 rows
  • n companies all exchanging goodies makes n*(n-1) customers and n*(n-1) vendors = 2*n*(n-1) rows

Last time I checked the D&B WorldBase held more the 150 millions companies. Some are dissolved and fortunately? everyone doesn’t do business with everyone – but as said, the sum of B2B connections is huge and the work in entering and maintaining the master data seems overwhelming.

If we look at one single company and how it may be represented differently in databases around only taking basic data as name and address into account, there will be lots of variations. Even in the same table the same real world company often occupies several rows spelled differently.

One of the most effective methods for avoiding duplicates of company master data is plugging into a business directory. By using an external sourced company ID as a key in your master data you are able to hold a golden record of that real world entity. As a bonus you are offered updates and access to a lot of additional data you would never be able to collect yourself.

Create Table Homo_Sapiens

Create Table is a basic statement in the SQL language which is the most widespread computer language used when structuring data in databases.

The most common entity in databases around must be rows representing real world human beings (Homo Sapiens) and the different groups we form. Tables for that could have the name Homo_Sapiens but is usually called Customer, Member, Citizen, Patient, Contact and so on.

The most common data quality issues around is related to accuracy, validity, timeliness, completeness and not at least uniqueness with the data we hold about people.

In databases tables are supposed to have a unique primary key. There are two basic types of primary keys:

  • Surrogate keys are typically numbers with no relation (and binding) to the real world. They are made invisible to the users of the applications operating on the database.
  • Natural keys are derived from existing codes or other data identifying an entity in the real world or made for that purpose. They are visible to users and part of electronic, written and verbal communication.

As surrogate keys obviously don’t help with real world uniqueness and there are no common global natural key for all human beings on the earth we have a challenge in creating a good primary key for a Homo Sapiens table.

Inside a given country we have different forms of citizen ID’s (national identification number) with very varying terms of use between the countries. But even in Scandinavia where I live and we have widespread use of unique citizen ID’s most tables that could have the name Homo_Sapiens cannot use a Citizen ID as (unique) primary key for several reasons as well as that data is not present in a lot of situations.

Most often we name the tables holding data about human beings by the role people will act in within the purpose of use for the data we collect. For example Customer Table. A customer may be an individual but also a household or a business entity. A human being may be a private consumer but also an employee at a business making a purchase or a business owner making both private purchases and business purchases.

Every business activity always comes down to interacting with individual persons. But as our data is collected for the different roles that individual may have acted in, we have a need for viewing these data related to single human beings. The methods for facilitating this have different flavours as:

  • Deduplication is the classic term used for describing processes where records are linked, merged or purged in order to make a golden copy having only one (parent) database row for each individual person (and other legal entities). This is usually done by matching data elements in internal tables with names and addresses within a given organisation.
  • Identity Resolution is about the same but  – if a distinction is considered to exist – uses a wider range of data, rules and functionality to relate collected data rows to real world entities. In my eyes exploiting external reference data will add considerable efficiency in the years to come within deduplication / identity resolution.
  • Master Data Hierarchy Management again have the same goal of establishing a golden copy of collected data by emphasising on reflecting the complex structure of relationships in the real world as well as the related history.

Next time I am involved in a data modelling exercise I will propose a Homo_Sapiens table. Wonder about the odds for buy in from other business and technical delegates.

Bookmark and Share