Don’t Mess (Up) with Jensen

A big story in the Danish media this weekend is that a little harbor restaurant specializing in fish has been denied the right to continue using the name Jensens Fiskerestaurant (Jensen’s Fish Restaurant in English). A lower court had earlier disallowed the name Jensens Fiskehus (Jensen’s Fish House in English).

The opponent is a large restaurant chain called Jensen’s Bøfhus (Jensen’s Beef House in English).

This has brought a so-called shitstorm down on the restaurant chain in social media, not least on Facebook. Jensen is the most common surname in Denmark. A bit more than a quarter of a million people, which is 5 percent of the population, are called Jensen. So how can a big chain be the only one allowed to use the name Jensen for a restaurant?

PS: I remember this nasty restaurant chain name from when I coded name parsing routines in the old days. “Jensen’s Bøfhus” initially came out as “S. Bøfhus Jensen”. Part of the remedy was to apply external reference data to name parsing, such as checking whether a business entity with a similar name exists at the address.
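The remedy can be sketched roughly as follows. This is a minimal illustration, not the original routine: the directory contents, the address, the similarity threshold and the naive parser (which merely reproduces the “S. Bøfhus Jensen” outcome) are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical reference data: business names registered at each address.
BUSINESS_DIRECTORY = {
    "Havnegade 1, Odense": ["Jensen's Bøfhus"],
}


def normalize(name: str) -> str:
    """Lowercase, unify apostrophes and collapse whitespace."""
    return " ".join(name.lower().replace("’", "'").split())


def naive_person_parse(name: str) -> str:
    """Naive parser: first token is the surname, the rest are given names."""
    tokens = [t for t in re.split(r"[^\w]+", name) if t]
    surname, given = tokens[0], tokens[1:]
    # Single letters are treated as initials.
    given = [g.upper() + "." if len(g) == 1 else g for g in given]
    return " ".join(given + [surname])


def parse_name(name: str, address: str, threshold: float = 0.85) -> dict:
    """Check the directory first; fall back to person parsing on a miss."""
    for business in BUSINESS_DIRECTORY.get(address, []):
        score = SequenceMatcher(None, normalize(name), normalize(business)).ratio()
        if score >= threshold:
            return {"type": "business", "name": business}
    return {"type": "person", "name": naive_person_parse(name)}
```

With the directory hit, the name is kept as a business; at an address without a matching business the naive parser still produces the person-style result.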


Reading the right Reading

In order to have all my travel arrangements in one place I use a service called TripIt. When I receive email confirmations from airlines, hotels, train planners and so on, I simply forward them to plans@tripit.com, and within seconds they build or amend an itinerary for me that is available in an app.

Today I noticed a slight flaw though. I was going by train from London, UK up to the Midlands via a large town in the UK called Reading.

The strange thing in the itinerary was that the interchange in Reading was placed chronologically after arriving at and leaving the final destination.

A closer look at the data revealed two strange issues:

  • Reading was spelled Reading, PA
  • The time zone for the interchange was set to EST

Hmmm… There must be a town called Reading in Pennsylvania across the pond. TripIt must, when automatically reading the email, have chosen the US Reading for this ambiguous town name and thereby attached the US Eastern time zone to the interchange.

Picking the right Reading for me in the plan made the itinerary look much more sensible.
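One way such an ambiguity could be resolved automatically is to let the unambiguous stops in the same itinerary vote on the country. The sketch below is only an illustration of that idea and says nothing about how TripIt works; the gazetteer, the stop names other than Reading and the voting rule are assumptions.

```python
from collections import Counter
from typing import NamedTuple


class Place(NamedTuple):
    name: str
    country: str
    timezone: str


# Hypothetical gazetteer; a real one would hold many more candidates per name.
GAZETTEER = {
    "London Paddington": [Place("London Paddington", "GB", "Europe/London")],
    "Reading": [
        Place("Reading, PA", "US", "America/New_York"),
        Place("Reading, Berkshire", "GB", "Europe/London"),
    ],
    "Coventry": [Place("Coventry", "GB", "Europe/London")],
}


def resolve_stops(stops):
    """Pick, for each stop, the candidate whose country the clear stops favor."""
    # Only stops with a single candidate get to vote on the country.
    votes = Counter(
        GAZETTEER[stop][0].country for stop in stops if len(GAZETTEER[stop]) == 1
    )
    # On a tie, max() keeps the first listed candidate.
    return {
        stop: max(GAZETTEER[stop], key=lambda place: votes[place.country])
        for stop in stops
    }


itinerary = resolve_stops(["London Paddington", "Reading", "Coventry"])
```

Here the two unambiguous British stops outvote the Pennsylvanian candidate, so the interchange gets the British time zone.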


Using External Data in Data Matching

One of the things that data quality tools do is data matching. Data matching is mostly related to the party master data domain. It is about comparing two or more data records that do not have exactly the same data but describe the same real world entity.

A common approach is to compare data records in internal master data repositories within your organization. However, there are great advantages in bringing in external reference data sources to support the data matching.

Some of the ways of doing that which I have worked with include these kinds of big reference data:

Business directories:

The business-to-business (B2B) world does not have privacy issues to the degree we see in the business-to-consumer (B2C) world. Therefore there are many business directories out there with a quite complete picture of which business entities exist in a given country, and even in regions and the whole world.

A common approach is to first match your internal B2B records against a business directory and obtain a unique key for each business entity. The next step, matching business entities with that unique key, is a no-brainer.

The problem, though, is that an automatic match between internal B2B records and a business directory most often does not yield a 100 % hit rate. Not even close, as examined in the post 3 out of 10.
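The two steps, and the less than perfect hit rate, can be sketched like this. It is a minimal illustration under stated assumptions: the directory keys, the names and the 0.8 threshold are made up, and plain string similarity stands in for whatever matching a real data quality tool would apply.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical business directory: unique key -> registered name.
DIRECTORY = {
    "D-001": "Jensen's Bøfhus A/S",
    "D-002": "Jensens Fiskerestaurant",
}


def directory_key(name, threshold=0.8):
    """Step 1: return the key of the most similar directory entry, or None."""
    scored = [
        (SequenceMatcher(None, name.lower(), registered.lower()).ratio(), key)
        for key, registered in DIRECTORY.items()
    ]
    score, key = max(scored)
    return key if score >= threshold else None


def group_by_key(names):
    """Step 2: records sharing a directory key are the same business entity."""
    groups = defaultdict(list)
    for name in names:
        groups[directory_key(name)].append(name)
    return dict(groups)


groups = group_by_key(["Jensens Bøfhus", "Jensen's Bofhus A/S", "Hansens Is"])
```

The first two records end up under the same key, while the third finds no directory entry and lands under `None`, which is the part of the hit rate that needs other means.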

Address directories:

Address directories are mostly used in order to standardize postal address data, so that two addresses in internal master data that can be standardized to an address written in exactly the same way can be better matched.

A deeper use of address directories is to exploit related property data. The probability of two records with “John Smith” on the same address being a true positive match is much higher if the address is a single-family house as opposed to a high-rise building, nursing home or university campus.
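As a rough sketch of how property data could feed into a match score: the weights below are illustrative assumptions, not measured values, and multiplying them onto a name similarity is just one simple way of combining the two signals.

```python
# Illustrative weights (assumptions): how likely it is that two records with
# the same name at this kind of address are the same person.
SAME_PERSON_WEIGHT = {
    "single_family_house": 0.95,
    "high_rise": 0.60,
    "nursing_home": 0.50,
    "university_campus": 0.40,
}


def match_score(name_similarity: float, property_type: str) -> float:
    """Scale a name similarity (0..1) by the property-type weight."""
    weight = SAME_PERSON_WEIGHT.get(property_type, 0.5)  # unknown type: neutral
    return name_similarity * weight
```

Two identical “John Smith” records then score higher at a single-family house than at a university campus, so fewer of the latter pass an automatic match threshold.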

Relocation services:

A common cause of false negatives in data matching is that you have compared two records where one of the postal addresses is an old one.

Bringing in National Change of Address (NCOA) services for the countries in question will help a lot.

The optimal way of doing that (and utilizing business and address directories) is to make it a continuous element of Master Data Management (MDM) as explored in the post The Relocation Event.
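A small sketch of how a change of address lookup removes such a false negative. The lookup table, names and addresses are hypothetical; a real NCOA service would be queried through the provider's own interface.

```python
# Hypothetical change-of-address data: (name, old address) -> new address.
CHANGE_OF_ADDRESS = {
    ("john smith", "1 old road, reading"): "2 new street, london",
}


def current_address(name: str, address: str) -> str:
    """Return the latest known address, or the given one if no move is known."""
    key = (name.lower(), address.lower())
    return CHANGE_OF_ADDRESS.get(key, address.lower())


def same_party(rec_a: dict, rec_b: dict) -> bool:
    """Compare two records on name and on their current addresses."""
    return (
        rec_a["name"].lower() == rec_b["name"].lower()
        and current_address(**rec_a) == current_address(**rec_b)
    )


old = {"name": "John Smith", "address": "1 Old Road, Reading"}
new = {"name": "John Smith", "address": "2 New Street, London"}
```

Comparing the raw addresses of `old` and `new` would miss the match, while `same_party(old, new)` catches it because the old address is first brought up to date.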


Completeness is still bad, while uniqueness is improving

In a recent report called The State of Marketing Data, prepared by NetProspex, over 60 million B2B records were analyzed in order to assess the quality of the data, measured as fitness for use for marketing purposes.

An interesting finding was that, out of a maximum score of 5.0, duplication (the dark side of uniqueness) was given an average score of 4.2, while completeness was given an average score of 2.7.


This corresponds well with my experience. We have in the data quality realm worked very hard with deduplication tools using data matching approaches over the years and results are showing up. We are certainly not there yet, but it seems that completeness, and in my experience also accuracy, are data quality dimensions currently suffering more.

In my eyes the remedy for improvement in completeness and accuracy goes hand in hand with even better uniqueness. It is about getting the basic data right the first time as described in the post instant Single Customer View and being able to keep up completeness and accuracy as told in the post External Events, MDM and Data Stewardship.


Identity Resolution and Social Data


Identity resolution is a hot potato when we look into how we can exploit big data and, within that frame, not least social data.

Some of the most frequently mentioned use cases for big data analytics revolve around listening to social data streams and combining that with traditional sources within customer intelligence. In order to do that we need to know who is talking out there, and that must be done by using identity resolution features encompassing social networks.

The first challenge is what we are able to do: how we can technically expand our data matching capabilities to use profile data and other clues from social media. This subject was discussed in a recent post on DataQualityPro called How to Exploit Big Data and Maintain Data Quality, an interview with Dave Borean of InfoTrellis, in which Dave mentioned the InfoTrellis “contextual entity resolution” approach.

The second challenge is what we are allowed to do. Social networks have a natural interest in protecting their members’ privacy, and they also have a commercial interest in doing so. The degree of privacy protection varies between social networks. Twitter is quite open, but on the other hand it holds very little that is usable for identity resolution, and making sense of the streams is an issue as well. Networks such as Facebook and LinkedIn are, for good reasons, not so easy to exploit due to the (changing) game rules applied.

As said in my interview on DataQualityPro called What are the Benefits of Social MDM: It is a kind of a goldmine in a minefield.


Unique Data = Big Money

In a recent tweet Ted Friedman of Gartner (the analyst firm) said:

[Image: tweet by Ted Friedman on reference data]

I think he is right.

Duplicates have always been pain number one in most places when it comes to the cost of poor data quality.

Though I have been in the data matching business for many years and have been fighting duplicates with deduplication tools in numerous battles, the war doesn’t seem to be won by using deduplication tools alone, as told in the post Somehow Deduplication Won’t Stick.

Eventually deduplication always comes down to entity resolution, when you have to decide which results are true positives and which are useless false positives, and wonder how many false negatives you didn’t catch, which means how much money you didn’t get in return for your deduplication investment.
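The three outcomes are commonly summarized as precision and recall. The sketch below only shows the arithmetic; the counts are made-up numbers, and in practice the false negatives can only be estimated, for example from a manually reviewed sample.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of the reported matches that really are duplicates."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives: int, false_negatives: int) -> float:
    """Share of the real duplicates that the deduplication run caught."""
    return true_positives / (true_positives + false_negatives)


# Illustrative counts (assumptions): 90 true and 10 false matches reported,
# and an estimated 30 real duplicates that were never caught.
run_precision = precision(90, 10)
run_recall = recall(90, 30)
```

In this made-up run nine out of ten reported matches are right, but a quarter of the real duplicates, and the money tied to them, are still out there.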

Bringing in new, even obscure, reference sources is in my eyes a very good idea, as examined in the post The Good, Better and Best Way of Avoiding Duplicates.


Ways of Sharing Master Data

The “buy vs. build” option is well known within many disciplines, not least around your IT application stack. The trend here is that where you did a lot of in-house programming in the old days, today you tend to buy more and more stuff to avoid reinventing the wheel. Yesterday there was a post on that on Informatica Perspectives. The post is called Stop The Hand-Coding Madness!.

We certainly also see that trend when it comes to Master Data Management (MDM) solutions. And my guess is that we will see that trend too when it comes to the master data itself.

What has puzzled me over the years is how a lot of organizations spend time on, and make their own errors in, typing in the name, address and other core data about the individuals and companies they do business with, or alternatively let us business partners type in our name, address and other data again and again – sometimes with a little remembering help from Google.

With product data you see the same data being retyped again and again, with heaps of errors and shortcuts, from when the description and specifications are registered at the manufacturer, then again at a couple of wholesalers, at a lot of retailers and, for some product types such as spare parts, in heaps of end user organizations.

In order to avoid this madness there are some different ways in which master data can be shared between organizations:

Using commercial third party data

Using third party directories is a well known way of buying your master data.

Business directories have been used for ages. The Dun & Bradstreet WorldBase is probably the most widely known example, but there are plenty of alternatives out there when it comes to specific regions and countries. Where it was earlier common to use these sources for downstream data enrichment, we now see more services for picking the id, names, addresses and other data in the data entry process.

Address directories are becoming very useful, for example in rapid addressing, which saves time and ensures data quality for addresses when they are entered.

Product directories with related services can also help in managing product master data.

Digging into open government data

In many countries the public company registry is available as a raw business directory, and in some countries there are also possibilities with citizen data. The public sector is often the root source for address data, which is becoming more widely available, in some cases even with related property data, as told in the post Making Data Quality Gangnam Style.

As it often isn’t in the genes of public sector bodies to provide nice and easy ways of getting to these data, there are good opportunities for private enterprises to add that service on top of the open government data.

Having your own data locker

Instead of having businessmen controlling your data, or trusting the government to do so, the idea of a personally controlled data locker has gained interest. In the UK there is such a service called Mydex.

Relying on social collaboration

Most people, and companies too, are doing a good job of maintaining their profile data on social networks. So this is in many cases the place to go to find out where someone is and what they are doing right now.

Social collaboration is also a possible way to share product data between manufacturers, wholesalers, retailers and end users. There is a service for that called Actualog.


Data Quality vs Identity Checking

Yesterday we had a call from British Gas (or probably a call centre hired by British Gas) explaining the great savings possible if we switched from our current provider – which by the way is: British Gas. This is a classic data quality issue in direct marketing operations: accurately separating your current customers from entities belonging to a new market.

As I have learned that your premier identity proof in the United Kingdom is your utility bill, this incident may be seen as somewhat disturbing – or by further thinking, maybe a business opportunity 🙂

At iDQ we develop a solution that may be positioned in the space between data quality prevention and identity check by addressing the identity resolution aspect during data capture.

The nearly two year old post The New Year in Identity Resolution explains some different kinds of identity resolution being:

  • Hard core identity check
  • Light weight real world alignment
  • Digital identity resolution

Since then I have seen a slow but steady convergence of these activities.


Our Double Trouble

Using the royal we is usually reserved for majestic people, but as a person with a being in two countries at the same time, I do sometimes feel that I am we.

So, this morning we once again found our way to London Heathrow Airport for one of our many trips between London and Copenhagen as we have lived in the United Kingdom the last couple of years but still have many business and private ties with The Kingdom of Denmark where we (is that was or were?) born, raised and worked and from where we still hold a passport.

Most public sector and private sector business processes and master data management implementations simply don’t cope with fast-evolving globalization. Reflecting on this, flying over Doggerland, we recall situations where:

  • We as a prospect or customer in a global brand are stored as a duplicate record for each country as told in the post Hello Leading MDM Vendor.
  • You as an employee in a multi-national firm have a duplicate record for each country you have worked in.

People moving between countries are still treated as an exception not covered by adequate business rules and data capture procedures. Most things are sorted out eventually, but it always takes a whole lot more trouble than if you are just born, raised and stay in the same country.

When we landed in Copenhagen this morning we (is that was or were?) able to use the new local smart travel card in order to travel on with public transit. But it wasn’t easy getting the card, we remember. With a foreign address you can’t apply online. So we had to queue up at the Central Station, fill in a form and explain that we don’t have an official document with our address in the UK – and we avoided explaining the shocking fact that in the UK your electricity bill is your premier proof of almost anything related to your identity.

What about you? Do you have a being in several countries? Any war stories experienced related to your going back and forth?


Famous False Positives

You should Beware of False Positives in Data Matching. A false positive in the data quality realm is a match of two (or more) identities that actually aren’t the same real world entity.

Throughout history and within art we have seen some false positives too. Here are my three favorites:

The Piltdown Man

In 1912 a British amateur archeologist apparently found a fossil claimed to be the missing link between apes and man: the so-called Piltdown Man. Backed by the British Museum, it stood as a true discovery until 1953, when it was finally revealed as a hoax. It was disputed throughout those years, however, but defended by the British establishment, maybe out of envy of the French, who had the Cro-Magnon man first found there, and of the Germans, who had a true discovery in Neandertal that gave the find its name.

Eventually the Piltdown Man was exposed as a medieval human upper skull, an orangutan jawbone and chimpanzee teeth.

[Image: Barry Nelson as Jimmy Bond in Casino Royale, 1954]

James and Jimmy Bond

As told in the post My Name is Bond. Jimmy Bond: James Bond is British intelligence and Jimmy Bond is an American agent. It’s always a question if two identities residing in different countries are the same as discussed (about me) in the post Hello Leading MDM Vendor.

Dupond et Dupont

In English they are known as Thomson and Thompson. In the original Belgian/French piece of art about the adventures of Tintin (and in the Danish comics of my childhood) they are known as Dupond et Dupont. They are two incompetent detectives who look alike and have names with a low edit distance and the same phonetic sound. For twin names in a lot of other languages check the Wikipedia article here.
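Both properties are easy to check in a few lines. The sketch below uses Levenshtein distance and a simplified Soundex as stand-ins; choosing these two algorithms is an assumption, as a matching tool may well use other distance and phonetic measures.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"), "L": "4",
    **dict.fromkeys("MN", "5"), "R": "6",
}


def soundex(name: str) -> str:
    """Simplified Soundex: first letter plus up to three consonant codes."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    last = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        digit = SOUNDEX_CODES.get(letter, "")
        if digit and digit != last:
            code += digit
        if letter not in "HW":  # H and W do not separate equal codes
            last = digit
    return (code + "000")[:4]
```

For the two detectives, `edit_distance("Dupond", "Dupont")` is 1 and both names get the Soundex code `D153`, which is exactly the kind of pair a matching tool will flag.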

And hey, today I’m going to Belgium, the home country of the creator of these two guys, to be at the Belgian Data Quality Association congress tomorrow.
