Finding Me

19th April 2012

Many people have many names and addresses. So have I.

A search for me within Danish reference sources in the iDQ tool gives the following result:

Green T is positive in the Danish Telephone Books. Red C is negative in the Danish Citizen Hub. Green C is positive in the Danish Citizen Hub.

Even though I have left Denmark I’m still registered with some phone subscriptions there. And my phone company hasn’t fully achieved single customer view yet, as I’m registered there with two slightly different middle (sur)names.

Following me to the United Kingdom, I’m registered here under even more names.

It’s not that I’m attempting some kind of fraud, but as my surname contains The Letter Ø, and that letter isn’t part of the English alphabet, my National Insurance Number (kind of similar to the Social Security Number in the US) is registered by the name “Henrik Liliendahl Sorensen”.

But as the United Kingdom doesn’t have a single citizen view, I am separately registered at the National Health Service with the name “Henrik Sorensen”. This is due to a sloppy realtor, who omitted my middle (sur)name on a flat rental contract. That name was carried over by British Gas onto my electricity bill. That document is (surprisingly for me) my most important identity paper in the UK, and it was used as proof of address when registering for health service.

How about you, do you also have several identities?



The Big Search Opportunity

3rd April 2012

The other day Bloomberg Businessweek ran an article reporting that Facebook Delves Deeper Into Search.

I have always advocated better search functionality in order to get more business value from your data. That certainly also applies to big data.

In a recent post called Big Reference Data Musings here on the blog, the challenge of utilizing large external data sources for getting better master data quality was discussed. In a comment Greg Leman pointed out that there often isn’t a single source of the truth, as you might expect from, say, a huge reference data source such as the Dun & Bradstreet WorldBase, which holds information about business entities from all over the world.

Indeed, our search capabilities optimally must span several sources. In the business directory search realm you may include several sources at a time, for example supplementing the D&B WorldBase with EuroContactPool if you do business in Europe, or with the source called Wiki-Data (being renamed to AvoxData) if you are in financial services and want to utilize the new Legal Entity Identifier (LEI) for counterparty uniqueness in conjunction with other more complete sources.

As examined in the post Search and if you are lucky you will find, combining search on external reference data sources and internal master data sources is a big opportunity too. In doing that you must, as described in the follow-up piece named Wildcard Search versus Fuzzy Search, get the search technology right.

I see in the Bloomberg article that Facebook don’t intend to completely reinvent the wheel for searching big data, as they have hired a Google veteran, the Danish computer scientist Lars Rasmussen, for the job.



Wildcard Search versus Fuzzy Search

13th February 2012

My last post about search functionality in Master Data Management (MDM) solutions was called Search and if you are lucky you will find.

In the comments the use of wildcards versus fuzzy search was touched upon.

The problem with wildcards

I have a company called “Liliendahl Limited” as this is the spelling of the name as it is registered with the Companies House for England and Wales.

But say someone is searching using one of the following strings:

  • “Liliendahl Ltd”,
  • “Liliendal Limited” or
  • “Liljendahl Limited”

Search functionality should in these situations return the hit “Liliendahl Limited”.

Using wildcard characters could, depending on the specific syntax, produce a hit in all combinations of the spelling with a string like this: “lil?enda*l l*”.

The problem, however, is that most users don’t have the time, patience and skills to construct these search strings with wildcard characters. And maybe the registered name was spelled slightly differently in a way the chosen wildcard characters don’t cover.
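As a small illustration of both points, here is a sketch that runs the wildcard string from above against the spellings in this post. It assumes the shell-style wildcard syntax of Python's fnmatch module, where "?" matches exactly one character and "*" matches any run of characters, including none. Other search tools may use a different syntax. The misspelling "Lilliendahl" is a made-up extra case.

```python
from fnmatch import fnmatchcase

# The wildcard string from the post, in fnmatch syntax (an assumption).
PATTERN = "lil?enda*l l*"

def wildcard_hit(text, pattern=PATTERN):
    """Return True if the lower-cased text matches the wildcard pattern."""
    return fnmatchcase(text.lower(), pattern)

spellings = [
    "Liliendahl Limited",   # the registered name
    "Liliendahl Ltd",
    "Liliendal Limited",
    "Liljendahl Limited",
    "Lilliendahl Limited",  # hypothetical typo the pattern does not anticipate
]

for spelling in spellings:
    print(spelling, "->", wildcard_hit(spelling))
```

The first four spellings are hits, while the doubled "l" in the last one falls outside the pattern, so the user would have to think of yet another wildcard string.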

Matching algorithms

Tools for batch matching of name strings have been around for many years. When doing a batch match you can’t practically use wildcard characters. Instead matching algorithms typically rely on one, or in the best case a combination, of these techniques:

The same techniques can be used for interactive search thus reaching a hit in one fast search.
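A minimal sketch of what such an interactive search could look like is shown below. It is not the algorithm of any particular product: it combines a tiny synonym table with the generic similarity ratio from Python's standard difflib module. The synonym table and the two extra company names in the register are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table; a real one would be far larger.
SYNONYMS = {"ltd": "limited"}

def standardize(name):
    """Lower-case the name and replace known abbreviations."""
    words = name.lower().replace(".", "").split()
    return " ".join(SYNONYMS.get(word, word) for word in words)

def fuzzy_search(query, registered_names):
    """Return the registered name most similar to the query."""
    standardized_query = standardize(query)

    def score(name):
        return SequenceMatcher(None, standardized_query, standardize(name)).ratio()

    return max(registered_names, key=score)

# Only the first name comes from the post; the others are made up.
REGISTER = ["Liliendahl Limited", "Lilleborg Limited", "Lindahl Logistics Ltd"]

for query in ["Liliendahl Ltd", "Liliendal Limited", "Liljendahl Limited"]:
    print(query, "->", fuzzy_search(query, REGISTER))
```

The user types the name as he or she remembers it, with no wildcard characters, and the scoring does the rest. A production search would use an index rather than scoring every registered name.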

Fuzzy search

I have worked with the Omikron FACT algorithm for batch matching. This algorithm has since been implemented as a fuzzy search algorithm as well.

One area of use for this is when webshop users are searching for a product or service within your online shop. This feature is, along with other eCommerce capabilities, branded as FACT-Finder.

The fuzzy search capabilities are also used in a tool I’m involved with called iDQ. Here external reference data sources, in combination with internal master data sources, are searched in an error tolerant way, thus making data available for the user despite heaps of spelling possibilities.



Search and if you are lucky you will find

9th February 2012

This morning I was following the tweet stream from the ongoing Gartner Master Data Management (MDM) conference here in London, when another tweet caught my eye:

This reminded me that (error tolerant) search is The Overlooked MDM Feature.

Good search functionality is essential for making the most out of your well managed master data.

Search functionality may be implemented in these main scenarios:

Inside Search

You should be able to quickly find what is inside your master data hub.

The business benefits from having fast error tolerant search as a capability inside your master data management solution are plenty, including:

  • Better data quality by upstream prevention against duplicate entries as explained in this post.
  • More efficiency by bringing down the time users spend on searching for information about entities in the master data hub.
  • Higher employee satisfaction by eliminating a lot of the frustration that otherwise comes from not finding what you know must be inside the hub already.

MDM inside search capabilities apply to multiple domains: party, product and location master data.

Search the outside

You should be able to quickly find what you need to bring inside your master data hub.

Data entry may improve a lot by having fast error tolerant search that explores the cloud for relevant data related to the entry being made. Doing that has two main purposes:

  • Data entry becomes more effective with less cumbersome investigation and fewer keystrokes.
  • Data quality is safeguarded by better real world alignment.

Preferably the inside and the outside search should be the same mash-up.

Searching the outside applies especially to location and party master data.

Search from the outside

Website search applies especially to product master data and in some cases also to related location master data as described in the post Product Placement.

Your website users should be able to quickly find what you publish from your master data hub, be that descriptions of physical products, services or research documents as in the case of Gartner, which is an analyst firm.

As said in the tweet at the top of this post, (good) search makes the life of your coming and current customers much easier. Do I need to emphasize the importance of good customer experience?



Reference Data at Work in the Cloud

5th January 2012

One of the product development programs I’m involved in is about exploiting rich external reference data and using these data in order to get data quality right the first time and being able to maintain optimal data quality over time.

The product is called instant Data Quality (abbreviated as iDQ ™). I have briefly described the concept in an earlier post called instant Data Quality.

iDQ ™ combines two concepts:

  • Software as a Service
  • Data as a Service

While most similar solutions are bundled with one specific data provider, the iDQ ™ concept embraces a range of data sources. The current scope is customer master data, where iDQ ™ may include Business-to-Business (B2B) directories, Business-to-Consumer (B2C) directories, real estate directories, Postal Address Files and even social media network data from external sources, as well as internal master data, all presented at the same time in a compact mash-up.

The product has already gained substantial success in my home country Denmark, leading to the formation of a company working solely with development and sales of iDQ ™.

The results iDQ ™ customers gain may seem simple, but they are the core advantages of better data quality that most enterprises are looking for, as stated by one of Denmark’s largest companies:

“For DONG Energy iDQ ™ is a simple and easy solution when searching for master data on individual customers. We have 1,000,000 individual customers. They typically relocate a few times during the time they are customers of us. We use iDQ ™ to find these customers so we can send the final accounts to the new address. iDQ ™ also provides better master data because here we have an opportunity to get names and addresses correctly spelled.

iDQ ™ saves time because we can search many databases at the time. Earlier we had to search several different databases before we found the right master data on the customer.”

Please find more testimonials (in Danish) here.

I hope to be able to link to testimonials in more languages in the future.



Matching Light Bulbs

15th December 2010

This morning I noticed this lightbulb joke in a tweet from @mortensax:

Besides finding it amusing I also related to it since I have used an example with light bulbs in a webinar about data matching as seen here:

The use of synonyms in Search Engine Optimization (SEO) is very similar to the techniques we use in data matching.

Here the problem is that for example these two product descriptions may have a fairly high edit distance (very different character by character), but are the same:

  • Light bulb, A 19, 130 Volt long life, 60 W
  • Incandescent lamp, 60 Watt, A19, 130V

while these two product descriptions have an edit distance of only one substitution of a character, but are not the same product (though they are in the same category):

  • Light bulb, 60 Watt, A 19, 130 Volt long life
  • Light bulb, 40 Watt, A 19, 130 Volt long life
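To make the point concrete, here is a minimal sketch of the classic Levenshtein edit distance (counting insertions, deletions and substitutions of single characters), applied to the two pairs above. The function and variable names are only for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]

# Same product, very different strings.
same_product = ("Light bulb, A 19, 130 Volt long life, 60 W",
                "Incandescent lamp, 60 Watt, A19, 130V")

# Different products, nearly identical strings.
different_products = ("Light bulb, 60 Watt, A 19, 130 Volt long life",
                      "Light bulb, 40 Watt, A 19, 130 Volt long life")

print("same product:      ", edit_distance(*same_product))
print("different products:", edit_distance(*different_products))
```

A character-by-character measure alone ranks the wrong pair as the better match, which is why product matching also needs synonyms and an understanding of which attributes (here the wattage) are decisive.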

Working with product data matching is indeed very enlightening.



The Overlooked MDM Feature

7th December 2010

When engaging in the social media community dealing with master data management, an often seen subject is creating a list of important capabilities for the technical side of master data management. I have on some occasions commented on such posts by adding a feature I often see omitted from these lists, namely: Error tolerant search functionality. Examples from the DataFlux CoE blog here and the LinkedIn Master Data Management Interest Group here.

Error tolerant search (also called fuzzy search) technology is closely related to data matching technology. But where data matching is basically non-interactive, error tolerant search is highly interactive.

Most people know error tolerant search from googling. You enter something with a typo and Google prompts you back with: Did you mean…? When looking for entities in master data management hubs you certainly need something similar. Spelling of names, addresses, product descriptions and so on is not easy, not least in a globalized world.

As in data matching, error tolerant search may use lists of synonyms as the basic technology. But the use of algorithms is also common, ranging from an oldie like the Soundex phonetic algorithm to more sophisticated algorithms.

The business benefits from having error tolerant search as a capability in your master data management solution are plenty, including:

  • Better data quality by upstream prevention against duplicate entries as explained in this post.
  • More efficiency by bringing down the time users spend on searching for information about entities in the master data hub.
  • Higher employee satisfaction by eliminating a lot of the frustration that otherwise comes from not finding what you know must be inside the hub already.

Error tolerant search has been one of the core features in the master data management implementations where I have been involved. What about you?



The Sound of Soundex

14th September 2010

Probably the oldest and most used error tolerant algorithm in searching and data matching is a phonetic algorithm called Soundex. If you are not familiar with Soundex: Wikipedia to the rescue here.

In the LinkedIn group Data Matching we seem to have an ongoing discussion about the usefulness of Soundex. Link to the discussion here – if you are not already a member: Please join, spammers are dealt with, though it is OK to brag about your data matching superiority.

To sum up the discussion on Soundex I think we at this stage may conclude:

  • Soundex is of course very poor compared to the more advanced algorithms, but it may be better than nothing (which would be exact searching and matching).
  • Soundex (or a variant of Soundex) may be used for indexing in order to select candidates to be scored with better algorithms.
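The second point can be sketched roughly as follows. The code implements the common American Soundex rules, uses the code as an index key to select candidates, and then scores the candidates with the similarity ratio from Python's standard difflib module. That ratio is only a stand-in for a "better algorithm", and the threshold of 0.85 and the sample names are assumptions chosen for the illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher

GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}

def soundex(name):
    """American Soundex: first letter plus three digits, e.g. 'R163'."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]  # ignores other characters
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(letter, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the previous code
    return (result + "000")[:4]

def build_index(names):
    """Group names by their Soundex code."""
    index = defaultdict(list)
    for name in names:
        index[soundex(name)].append(name)
    return index

def match(query, index, threshold=0.85):
    """Select candidates by Soundex, then keep those scoring above the threshold."""
    candidates = index.get(soundex(query), [])
    scored = [(SequenceMatcher(None, query.lower(), c.lower()).ratio(), c)
              for c in candidates]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]

index = build_index(["Sorensen", "Soerensen", "Robert", "Rupert"])
print(soundex("Robert"), soundex("Rupert"))  # same code, different names
print(match("Robert", index))                # scoring drops the Rupert candidate
```

Soundex on its own would report Robert and Rupert as a match. Using it only to pick candidates keeps the comparison workload small, while the scoring step removes false positives of that kind.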

Let’s say you are going to match 100 rows with names and addresses against a table with 100 million rows with names and addresses, and let’s say that the real world individual behind each of the 100 rows is in fact represented among the 100 million, but not necessarily spelled the same.

Your results may look like this:

  • If you use exact automated matching you may find 40 matching rows (40 %).
  • If you use automated matching with (a variant of) Soundex you may find 95 matching rows, but only 70 rows (70 %) are correct matches (true positives) as 25 rows (25 %) are incorrect matches (false positives).
  • If you use automated matching with (a variant of) Soundex indexing and an advanced algorithm for scoring you may find 75 matching rows where 70 rows (70 %) are correct matches (true positives) and 5 rows (5 %) are incorrect matches (false positives).
  • By tuning the advanced algorithm you may find 67 matching rows where 65 rows (65 %) are correct matches (true positives) and 2 rows (2 %) are incorrect matches (false positives).
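The trade-off in these four scenarios can be expressed as precision (the share of found matches that are correct) and recall (the share of the 100 real matches that are found). The small calculation below uses the figures from the list. It assumes that exact matching produces no false positives, which the list does not state explicitly.

```python
TOTAL_REAL_MATCHES = 100  # every one of the 100 rows has a true counterpart

# scenario -> (true positives, false positives)
SCENARIOS = {
    "exact matching": (40, 0),  # zero false positives is an assumption
    "Soundex": (70, 25),
    "Soundex indexing + advanced scoring": (70, 5),
    "tuned advanced scoring": (65, 2),
}

def precision_recall(true_positives, false_positives, total=TOTAL_REAL_MATCHES):
    """Return (precision, recall) for one matching scenario."""
    found = true_positives + false_positives
    return true_positives / found, true_positives / total

for scenario, (tp, fp) in SCENARIOS.items():
    precision, recall = precision_recall(tp, fp)
    print(f"{scenario}: precision {precision:.1%}, recall {recall:.1%}")
```

Soundex alone lifts recall from 40 % to 70 % but roughly one in four of its matches is wrong, whereas tuning the advanced algorithm trades five correct matches for a precision of about 97 %.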

So when using Soundex you will find more matching rows but you will also find more manual work in verifying the results. Adding an advanced algorithm may reduce the manual work or eliminate manual work at the cost of some not found matches (false negatives) and the risk of a few wrong matches (false positives).

PS: I have a page about other Match Techniques including standardization, synonyms and probabilistic learning.

PPS: When googling to see whether the title of this post had been used before, I found this article from a fellow countryman.



Military Intelligence

2nd September 2010

Many data quality issues may be prevented by having some intelligent (error tolerant) search going on. I wrote a post about it called Upstream prevention by error tolerant search.

Intelligent search may have a lot of other advantages too.

A scandal related to the Danish Military has been going on for a while. The short story is:

A member of the Special Forces wrote a book about combat actions in Afghanistan. The Military tried to stop it, because it could help the enemy. In that process they for some reason made an Arabic translation and by some mistake leaked that to the press. The key person at the military involved in doing that has the surname “Sønderskov”.

Police “experts” were assigned to find the leak. For a month they unsuccessfully searched for an e-mail address including “Sønderskov” only to realize: Oh, e-mail addresses can’t have the national character “ø”. It must be either “oe” or “o” instead, as in “Soenderskov” or “Sonderskov”.
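A search tool could have expanded the name into its e-mail friendly spellings automatically. The sketch below does that for “ø” with the two replacements mentioned above. The replacements for “æ” and “å” and the sample e-mail addresses are assumptions added for the illustration.

```python
from itertools import product

# "ø" -> "oe" or "o" comes from the story; the other two rows are assumptions.
REPLACEMENTS = {
    "ø": ("oe", "o"),
    "æ": ("ae", "e"),
    "å": ("aa", "a"),
}

def ascii_variants(name):
    """Return every spelling of the name with national characters replaced."""
    options = [REPLACEMENTS.get(char, (char,)) for char in name.lower()]
    return {"".join(parts) for parts in product(*options)}

def address_mentions(address, name):
    """True if the e-mail address contains any ASCII variant of the name."""
    lowered = address.lower()
    return any(variant in lowered for variant in ascii_variants(name))

# Hypothetical addresses for illustration only.
addresses = ["j.soenderskov@example.org", "p.hansen@example.org"]

print(sorted(ascii_variants("Sønderskov")))
print([a for a in addresses if address_mentions(a, "Sønderskov")])
```

With the variants generated up front, one search covers both spellings instead of a month of looking for a character that cannot occur.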

The story (in Danish) is here, from the online computer media Version2.



Data Quality is an Ingredient, not an Entrée

9th July 2010

Fortunately it is more and more recognized that you don’t get success with Business Intelligence, Customer Relationship Management, Master Data Management, Service Oriented Architecture and many more disciplines without starting with improving your data quality.

But it would be a big mistake to see Data Quality improvement as an entrée before the main course being BI, CRM, MDM, SOA or whatever is on the menu. You have to have ongoing prevention against having your data polluted again over time.

Improving and maintaining data quality involves people, processes and technology. Now, I am not neglecting the people and process side, but as my expertise is in the technology part I would like to mention some of the technological ingredients that help with keeping data quality at a tasty level in your IT implementations.

Mashups

Many data quality flaws are (not surprisingly) introduced at data entry. Enterprise data mashups with external reference data may help during data entry, like:

  • An address may be suggested from an external source.
  • A business entity may be picked from an external business directory.
  • Various rules exist in different countries for using consumer/citizen directories – why not use the best available where you do business.

External ID’s

Getting the right data entry at the root is important, and most (if not all) data quality professionals agree that this approach is superior to doing cleansing operations downstream.

The problem, however, is that most data erodes as time passes. What was right at the time of capture will at some point in time not be right anymore.

Therefore data entry ideally must not only be a snapshot of correct information but should also include raw data elements that make the data easily maintainable.

Error tolerant search

A common workflow when in-house personnel are entering new customers, suppliers, purchased products and other master data is that you first search the database for a match. If the entity is not found, you create a new entity. When the search fails to find an actual match, we have a classic and frequent cause of duplicates.

An error tolerant search is able to find matches despite spelling differences, alternatively arranged words, various concatenations and many other challenges we face when searching for names, addresses and descriptions.
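As a rough sketch of how the three challenges named above can be handled, the function below compares two strings in two ways: with the words sorted (to cope with rearranged words) and with the words glued together (to cope with different concatenations), using the similarity ratio from Python's standard difflib module to tolerate spelling differences. This is one simple approach under those assumptions, not a description of any specific product, and the threshold of 0.9 is an arbitrary choice.

```python
import re
from difflib import SequenceMatcher

def words(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def similarity(a, b):
    """Best of a word-order-insensitive and a concatenation-insensitive score."""
    words_a, words_b = words(a), words(b)
    sorted_score = SequenceMatcher(
        None, " ".join(sorted(words_a)), " ".join(sorted(words_b))).ratio()
    joined_score = SequenceMatcher(
        None, "".join(words_a), "".join(words_b)).ratio()
    return max(sorted_score, joined_score)

def probable_duplicates(new_entry, existing_entries, threshold=0.9):
    """Existing entries that should be shown before a new entity is created."""
    return [entry for entry in existing_entries
            if similarity(new_entry, entry) >= threshold]

print(similarity("Henrik Sorensen", "Sorensen, Henrik"))  # rearranged words
print(similarity("Data Flux", "DataFlux"))                # concatenation
print(probable_duplicates("Liliendal Limited", ["Liliendahl Limited"]))
```

Running such a check in the search step of the workflow gives the person entering data a chance to pick the existing entity instead of creating a duplicate.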


