The Art in Data Matching

I’ve just investigated a suspicious customer data match:

A Company on Kunstlaan no 99 in Brussel

was matched with high confidence with:

The Company on Avenue des Arts no 99 in Bruxelles

At first glance it perhaps didn’t look like a confident match, but I guess the computer is right.

The diverse facts are:

  • Brussels is the Belgian capital
  • Belgium has three official languages, of which two matter here: French and Flemish (a variant of Dutch). The third is German
  • Some parts of the country are French-speaking, some parts are Flemish-speaking and the capital is both
  • Brussels is Bruxelles in French and Brussel in Flemish
  • Kunst is Flemish for Art (as in Dutch, German and the Scandinavian languages too)
  • Laan is Flemish for Avenue (same origin as Lane I guess)
  • Avenue des Arts is French for Avenue of the Arts (French is easy)

Technically the computer in this case did as follows:

  • Compared the names like “A Company” and “The Company” and found a close edit distance between the two names.
  • Remembered from some earlier occasions that “Kunstlaan” and “Avenue des Arts” were accepted as a match.
  • Remembered from numerous earlier occasions that “Brussel” (or “Brüssel”) and “Bruxelles” were accepted as a match.

It may also have been told beforehand that “Kunstlaan” and “Avenue des Arts” are two names of the same street in some Belgian address reference data, which I guess is a must when doing heavy data matching on the Belgian market.

In this case it was a global match environment not equipped with worldwide address reference data, so luckily the probabilistic learning element in the computer program saved the day.
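The steps above can be sketched roughly as follows. This is only a minimal illustration and not the actual matching engine: the `ACCEPTED_PAIRS` table, the record layout and the distance threshold are all assumptions made up for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


# Hypothetical memory of value pairs that were accepted as matches earlier.
ACCEPTED_PAIRS = {
    frozenset({"kunstlaan", "avenue des arts"}),
    frozenset({"brussel", "bruxelles"}),
    frozenset({"brüssel", "bruxelles"}),
}


def same_value(a: str, b: str) -> bool:
    """True if the values are equal or were accepted as a pair before."""
    a, b = a.strip().lower(), b.strip().lower()
    return a == b or frozenset({a, b}) in ACCEPTED_PAIRS


def is_match(rec_a: dict, rec_b: dict, max_name_distance: int = 3) -> bool:
    """Assumed rule: close names, plus street, house number and city agreeing."""
    name_distance = levenshtein(rec_a["name"].lower(), rec_b["name"].lower())
    return (
        name_distance <= max_name_distance
        and same_value(rec_a["street"], rec_b["street"])
        and rec_a["house_no"] == rec_b["house_no"]
        and same_value(rec_a["city"], rec_b["city"])
    )


flemish = {"name": "A Company", "street": "Kunstlaan",
           "house_no": "99", "city": "Brussel"}
french = {"name": "The Company", "street": "Avenue des Arts",
          "house_no": "99", "city": "Bruxelles"}
print(is_match(flemish, french))  # True
```

A fixed threshold and a yes/no lookup are cruder than a probabilistic engine, which would weigh how often a pair has been accepted before. The sketch only shows how learned pairs can stand in for missing address reference data.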

Lots of Product Names

In master data management the two most prominent domains are:

  • Parties and
  • Products

In the quest for finding representations of parties actually being the same real world party and finding representations of products actually being the same real world product we typically execute fuzzy data matching of:

  • Party names such as person names and company names
  • Product descriptions

However, I have often seen party names being an integral part of matching products.

Some examples:

Manufacturer Names:

A product is most often regarded as distinct not only based on the description but also based on the manufacturer. So besides being sharp on matching product descriptions for light bulbs, you must also consider whether, for example, the following manufacturer company names are the same or not:

  • Koninklijke Philips Electronics N.V.
  • Phillips
  • Philips Electronic

Author Names:

A book is a product. The title of the book is the description. But also the author’s person name counts. So how do we collect the entire works made by the author:

  • Hans Christian Andersen
  • Andersen, Hans Christian
  • H. C. Andersen

as all three representations are superb bad data?
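One hedged way to collect the three forms is to reduce each to a crude match key made of the surname plus the initials of the given names. The function name `author_key` and the key design are assumptions for illustration only, and the key is deliberately loose: it would also group any other author whose name reduces to H. C. Andersen.

```python
def author_key(name: str) -> tuple[str, str]:
    """Reduce an author name to (surname, given-name initials), lowercased.

    Handles "Given Names Surname" and "Surname, Given Names".
    Empty names are not handled in this sketch.
    """
    name = name.strip()
    if "," in name:
        # Library style: surname first, then the given names.
        surname, given = (part.strip() for part in name.split(",", 1))
    else:
        # Natural order: the last token is taken as the surname.
        *given_parts, surname = name.split()
        given = " ".join(given_parts)
    # Treat dots as separators so "H. C." and "H.C." both yield "hc".
    initials = "".join(token[0] for token in given.replace(".", " ").split())
    return surname.lower(), initials.lower()


names = ["Hans Christian Andersen", "Andersen, Hans Christian", "H. C. Andersen"]
print({author_key(n) for n in names})  # {('andersen', 'hc')}
```

Candidates sharing a key would still need a closer look, for example by comparing the full given names when both records have them.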

Bear Names:

A certain kind of teddy bear has a product description like “Plush magenta teddy bear”. But each bear may have a pet name like “Lots-O’-Huggin’ Bear” or just “Lotso” for short, as seen in the film “Toy Story 3”. And seriously: In real business I have worked with building a bear data model and the related data matching.

PS: For those who have seen Toy Story 3: Is that Lotso one or two real world entities?  

So I’m not a Capricorn?

Yesterday was my birthday. Being born on the 14th of January makes me a Capricorn according to astrology.

Only there is a slight problem. As told in an article in The Huffington Post, an astronomer has kindly remarked that the assignment of signs to the calendar was made thousands of years ago. In the meantime the earth’s axis has shifted relative to the stars, so we should have completely new signs (and personalities?) today.

I guess astrology qualifies as a data and information quality train wreck by forgetting one of the most common pitfalls in data quality: Things change.

Superb Bad Data

When working with data and information quality we often use words such as rubbish, poor and bad, along with other negative words, when describing data that needs to be enhanced in order to achieve better data quality. However, what is bad may have been good in the context where a particular set of data originated.

Right now I have some fun with author names.

An example of good and bad could be an author I have used several times on this blog, namely the late fairy tale writer whose full name is:

Hans Christian Andersen

When gazing through data you will meet his name represented this way:

Andersen, Hans Christian

This representation is fit for the purpose of use, for example when looking for a book by this author at a library, where fiction is sorted by the surname of the author.

The question is then: Do you want to have the one representation, the other representation or both?

You may also meet his name in another form in another field than the name field. For example there is a main street in Copenhagen called:

H. C. Andersens Boulevard

This is the representation of the real world name of the street, holding a common form of the author’s name with only initials.

Storing a Single Version of the Truth

An ever recurring subject in the data quality and master data management (MDM) realms is whether we can establish a single version of the truth.

The most prominent example is whether an enterprise can implement and maintain a single version of the truth about business partners being customers, prospects, suppliers and so on.

In the quest for establishing that (fully reachable or not) single version of the truth we use identity resolution techniques such as data matching, and we exploit ever increasing sources of external reference data.

However, whatever is possible in aiming for that (fully reachable or not) single version of the truth, I am often limited by the practical possibilities for storing it.

In storing party master data (and other kinds of data) we may consider these three different ways:

Flat files

This “Keep It Simple, Stupid” way of storing data has been in steady retreat. It is still common, however, and new inventions of big flat file structures of data are emerging.

Also, many external sources of reference data are still flat-file-like, and the overwhelming choice for exchanging reference and master data is flat files.

Despite lots of workaround solutions for storing the complex links of the real world in flat files, we basically end up using very simplified representations of the real world (and the truth derived) in those flat files.

Relational databases

Most Customer Relationship Management (CRM) systems are based on a relational data model. However, that model is mostly quite basic regarding master data structures, which makes it less than straightforward to reflect the most common hierarchical structures of the real world, such as company family trees, contacts working for several accounts and individuals forming a household.

Master Data Management hubs are of course built for storing exactly these hierarchical kinds of structures. Common challenges here are that there is often no point in doing so as long as the surrounding applications can’t follow, and that you may often restrict your use to a simplified model anyway, such as an industry model.

Neural networks

The relations between parties in the real world are in fact not truly hierarchical. That is why we look into the inspiration from the network of biological neurons.

Doing that has been an option I have heard about for many years, but I am still waiting to meet it as a concrete choice when delivering a single version of the truth.
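To make the network idea a bit more concrete, here is a minimal sketch of party relations stored as a plain graph of typed links rather than as a tree. It is not a neural network in the machine-learning sense, only a network-shaped data structure, and the class name, party names and relation types are all invented for the example.

```python
from collections import defaultdict, deque


class PartyNetwork:
    """Parties as nodes, typed relations as undirected links (assumed model)."""

    def __init__(self) -> None:
        # party -> set of (relation, other party)
        self._links = defaultdict(set)

    def relate(self, a: str, relation: str, b: str) -> None:
        # Direction is ignored for simplicity; both ends see the link.
        self._links[a].add((relation, b))
        self._links[b].add((relation, a))

    def connected(self, start: str) -> set[str]:
        """All parties reachable from start through any chain of relations."""
        seen, queue = {start}, deque([start])
        while queue:
            party = queue.popleft()
            for _relation, other in self._links[party]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        return seen - {start}


net = PartyNetwork()
net.relate("Ann", "works for", "Acme Ltd")        # one contact ...
net.relate("Ann", "works for", "Bolt Inc")        # ... on two accounts
net.relate("Ann", "household", "Bob")             # and in a household
net.relate("Acme Ltd", "subsidiary of", "Acme Holding")
print(sorted(net.connected("Bob")))
```

Ann belongs to two accounts and a household at once, which a strict hierarchy with a single parent per record cannot express without duplicating her.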

Entity Revolution vs Entity Evolution

Entity resolution is the discipline of uniquely identifying your master data records, typically being those holding data about customers, products and locations. Entity resolution is closely related to the concept of a single version of the truth.

Questions to be asked during entity resolution include:

  • Is a given customer master data record representing a real world person or organization?
  • Is a person acting as a private customer and a small business owner going to be seen as the same?
  • Is a product coming from supplier A going to be identified as the same as the same product coming from supplier B?
  • Is the geocode for the center of a parcel the same place as the geocode of where the parcel is bordering a public road?

We may come a long way in automating entity resolution by using advanced data matching and exploiting rich sources of external reference data. We may also be able to handle the complex structures of the real world by using sophisticated hierarchy management, and thereby make an entity revolution in our databases.

But I am often faced with the fact that most organizations don’t want an entity revolution. There are always plenty of good reasons why various frequent business processes don’t require full entity resolution and will only be complicated by having it (unless drastically reengineered). The tangible immediate negative business impact of an entity revolution trumps the softer positive improvement in business insight from such a revolution.

Therefore we are mostly making entity evolutions, balancing the current business requirements with the distant ideal of a single version of the truth.

Big Trouble with Big Names

An often seen issue in party master data management is handling information about your most active customers, suppliers and other roles of interest. These are often big companies with many faces.

I remember meeting that problem way back in the 1980s when I was designing a solution for the Danish Maritime Authority.

In relation to a ship there are three different main roles:

  • The owner of the ship, who has some legal rights and obligations
  • The operator of the ship, who has responsibilities regarding the seaworthiness of the ship
  • The employer, who has responsibilities regarding the seamen onboard the ship

Sometimes these roles don’t belong to the same company (or person) for a given ship. That real world reality was modeled all right. But even when it is in practice the same company, that company materializes very differently in each role. I remember this was certainly the case with the biggest ship-owner in Denmark (and also by far the biggest company in Denmark), the A.P. Moller – Maersk Group.

We really didn’t make a golden record for that golden company in my time on the project.
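Looking back, the modeling question can be sketched like this: keep one party record and let each ship point to it once per role. This is an illustrative assumption about how such a model could look, not the design that was actually built; the class names, the ship name and the party id are made up.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    OWNER = "owner"        # legal rights and obligations
    OPERATOR = "operator"  # seaworthiness of the ship
    EMPLOYER = "employer"  # seamen onboard the ship


@dataclass(frozen=True)
class Party:
    """One record per real world company or person (the golden record)."""
    party_id: int
    name: str


@dataclass
class Ship:
    name: str
    roles: dict[Role, Party]

    def parties(self) -> set[Party]:
        """Distinct real world parties behind the three roles."""
        return set(self.roles.values())


maersk = Party(1, "A.P. Moller – Maersk Group")
ship = Ship("Example Ship", {
    Role.OWNER: maersk,
    Role.OPERATOR: maersk,
    Role.EMPLOYER: maersk,
})
print(len(ship.parties()))  # 1
```

The hard part in practice is not this structure but recognizing that three differently spelled role records all deserve to point at the same `Party`.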

instant Data Quality

My last blog post was all about how data quality issues in most cases are being solved by doing data cleansing downstream in the data flow within an enterprise and the reasons for doing that.

However, solving the issues upstream wherever possible is of course the better option. Therefore I am very optimistic about a project I’m involved in called instant Data Quality.

The project is about how we can help system users doing data entry by adding some easy to use technology that explores the cloud for relevant data related to the entry being done. Doing that has two main purposes:

  • Data entry becomes more effective. Less cumbersome investigation and fewer keystrokes.
  • Data quality is safeguarded by better real world alignment.

The combination of a more effective business process that also results in better data quality seems to be good – like a sugar-coated vitamin pill. By the way: The vitamin pill metaphor also serves well as vitamin pills should be supplemented by a healthy lifestyle. It’s the same with data management.

Implementing improved data quality by better real world alignment may go beyond the usual goal for data quality, which is meeting the requirements for the intended purpose of use. This means that you are instantly getting more by doing less.

Complicated Matters

A while ago I wrote a short blog post about a tweet from the Gartner analyst Ted Friedman saying that clients are disappointed with the ability to support wide deployment of complex business rules in popular data quality tools.

Speaking of popular data quality tools: on the DataFlux Community of Experts blog, Dylan Jones, founder of DataQualityPro, posted a piece this Friday asking: Are Your Data Quality Rules Complex Enough?

Dylan says: “Many people I speak to still rely primarily on basic data profiling as the backbone of their data quality efforts”.

The classic answers to the challenge of complex business rules are:

  • Relying on people to enforce complex business rules. Unfortunately people are not as consistent in enforcing complex rules as computer programs are.
  • Making less complex business rules. Unfortunately the complexity may be your competitive advantage.

In my eyes there is no doubt that data quality tool vendors have a great opportunity in research and development of tools that are better at deploying complex business rules. In my current involvement in doing so we work with features such as:

  • Deployment as Service Oriented Architecture components. More on this topic here.
  • Integrating multiple external sources. Further explained here.
  • Combining the best algorithms. Example here.

Out-of-Africa

Besides being a memoir by Karen Blixen (or the literary double Isak Dinesen), Out-of-Africa is a hypothesis about the origin of the modern human (Homo sapiens). Of course there is a competing scientific hypothesis called Multiregional Origin of Modern Humans. Besides that there are of course religious beliefs.

The Out-of-Africa hypothesis suggests that modern humans emerged in Africa 150,000 years ago or so. A small group migrated to Eurasia about 60,000 years ago. Some made it across the Bering Strait to America maybe 40,000 years ago or maybe 15,000 years ago. The Vikings said hello to the Native Americans 1,000 years ago, but cross-Atlantic movement only gained pace some 500 years ago, when Columbus discovered America again.

Half a year ago (or so) I wrote a blog post called Create Table Homo_Sapiens. The follow-up comments added to the nerdish angle by discussing subjects such as mutating tables versus intelligent design and MAX(GEEK) counting.

But on the serious side comments also touched the intended subject about making data models reflect real world individuals.

Tables with persons are the most common entity type in databases around. As in the Out-of-Africa hypothesis, there could have been one simple, common structural origin shared globally. But that is not the way of the world. Some of the basic differences practiced in modeling the person entity are:

  • Cultural diversity: Names, addresses, national IDs and other basic attributes are formatted differently country by country and to some degree within countries. Most data models with a person entity are built on the format(s) of the country where they are designed.
  • Intended purpose of use: Person master data are often stored in tables made for specific purposes like a customer table, a subscriber table, a contact table and so on. Therefore the data identifying the individual is directly linked with attributes describing a specific role of that individual.
  • “Impersonal” use: Person data is often stored in the same table as other party master types such as business entities, projects, households et cetera.

Many, many data quality struggles around the world are caused by how we have modeled real world – old world and new world – individuals.
