Liliendahl on Data Quality

Instant Data Enrichment

31st May 201215th January 2013Henrik Gabs LiliendahlLeave a comment

Data enrichment is one of the core activities within data quality improvement. Data enrichment is about updating your data in order to be more real world aligned by correcting and completing with data from external reference data sources.

Traditionally data enrichment has been a follow up activity to data matching and doing data matching as a prerequisite for data enrichment has been a good part of my data quality endeavor during the recent 15 years as reported in the post The GlobalMatchBox.

During the last couple of years I have tried to be part of the quest for doing something about poor data quality by moving the activities upstream. Upstream data quality prevention is better than downstream data cleansing wherever applicable. Doing the data enrichment at data capture is the fast track to improve data quality for example by avoiding contact data entry flaws.

It’s not that you have to enrich with all the possible data available from external sources at once. What is the most important thing is that you are able to link back to external sources without having to do (too much) fuzzy data matching later. Some examples:

Getting a standardized address at contact data entry makes it possible for you to easily link to sources with geo codes, property information and other location data at a later point.
Obtaining a company registration number or other legal entity identifier (LEI) at data entry makes it possible to enrich with a wealth of available data held in public and commercial sources.
Having a person’s name spelled according to available sources for the country in question helps a lot when you later have to match with other sources.

In that way your data will be fit for current and future multiple purposes.

Most Times the Home Team Wins

25th May 201225th May 2012Henrik Gabs LiliendahlLeave a comment

This summer is going to be huge if you like sports. The Olympics is coming to London and only 14 days away from now we have the European football (soccer) championship in Poland and Ukraine.

As usual hopes are high for the England soccer team. But statistics doesn’t support the hopes. The England team haven’t really succeeded since the World Cup victory on home ground at Wembley in 1966. That victory was mainly (and now I’m going to be shot in the streets of London) due to a ghost goal.

In business, and in data quality and MDM business too, the home team usually also wins.

Yesterday I noticed a tweet telling that the MDM tool vendor Orchestra Network has been selected as tool vendor by a large bank. The bank is Credit Agricole, a big financial service provider based in France. Orchestra Networks is also based in France. A home win so to say.

In the post The Pond it was told how else dominating American tool vendors may in the first place succeed in expansion to Europe by coming to London, but in fact having a hard time competing in continental Europe due to diversity issues.

European tool vendors going to North America often tries to disguise as a home team. Orchestra Network for example uses Boston & Paris as place of origin in the messaging. Other examples are the leading open source data management tool vendor Talend with dual head quarter in Paris and California, hot Danish MDM vendor Stibo Systems messaging out of Atlanta and the Swedish business intelligence success QlikTech who officially has moved to Pennsylvania.

The Problem with Multiple Purposes of Use

23rd May 201223rd May 2012Henrik Gabs Liliendahl3 Comments

Today I noticed this tweet by Malcolm Chisholm:

I agree.

The problem with the “fitness for use” or “fit for the purpose of use” definition of data quality has been a recurring subject on this blog starting with the post Fit for What Purpose? through to lately the post Inaccurately Accurate discussing the data quality of the British electoral roll seen from either a strict electoral point of view and the point of view from external use of the electoral roll.

The problem with “fitness of use” becomes clear when data quality has to be addressed within master data management. Master data has, per definition so to say, many uses.

My thesis is that there is a breakeven point when including more and more purposes where it will be less cumbersome to reflect the real world object rather than trying to align all known purposes.

Today Jim Harris made an (as ever) excellent post related to how data actually represents what it purports to represent – now and tomorrow too. Find the post called Syncing versus Streaming on the Data Roundtable.

Avoiding Contact Data Entry Flaws

19th May 201229th May 2012Henrik Gabs Liliendahl9 Comments

Contact data is the data domain most often mentioned when talking about data quality. Names and addresses and other identification data are constantly spelled wrong, or just different, by the employees responsible of entering party master data.

Cleansing data long time after it has been captured is a common way of dealing with this huge problem. However, preventing typos, wrong hearings and multi-cultural misunderstandings at data entry is a much better option wherever applicable.

I have worked with two different approaches to ensure the best data quality for contact data entered by employees. These approaches are:

Correction and
Assistance

Correction

With correction the data entry clerk, sales representative, customer service professional or whoever is entering the data will enter the name, address and other data into a form.

After submitting the form, or in some cases leaving each field on the form, the application will check the content against business rules and available reference data and return a warning or error message and perhaps a correction to the entered data.

As duplicated data is a very common data quality issue in contact data, a frequent example of such a prompt is a warning about that a similar contact record already exists in the system.

Assistance

With assistance we try to minimize the needed number of key strokes and interactively help with searching in available reference data.

For example when entering address data assistance based data entry will start with the highest geographical level:

If we are dealing with international data the country will set the context and know about if a state or province is needed.
Where postal codes (like ZIP) exists, this is the fast path to the city.
In some countries the postal code only covers one street (thoroughfare), so that’s settled by the postal code. In other situations we will usually have a limited number of streets that can be picked from a list or settled with the first characters.

(I guess many people know this approach from navigation devices for cars.)

When the valid address is known you may catch companies from business directories being on that address and, depending on the country in question, you may know citizens living there from phone directories and other sources and of course the internal party master data, thus avoiding entering what is already known about names and other data.

When catching business entities a search for a name in a business directory often leads to being able to pick a range of identification data and other valuable data and not at least a reference key to future data updates.

Lately I have worked intensively with an assistance based cloud service for business processes embracing contact data entry. We have some great testimonials about the advantages of such an approach here: instant Data Quality Testimonials.

Social MDM, Privacy and Data Quality

17th May 2012Henrik Gabs Liliendahl4 Comments

The term “Social MDM” has been promoted quite well this week not at least as part of the social media information stream from the ongoing user conference of the tool vendor Informatica.

In a blog post called Informatica 9.5 for Big Data Challenge #2: Social Jody Ko of Informatica introduces the opportunities and challenges.

In the closing remarks Judy says: “There’s still a long way to go to bring social data into the mainstream enterprise, in part due to concerns over privacy and the potential “creepiness” factor of mining social data.”

As I understand it the spearhead Social MDM part of the tool release is a Facebook App that provides connectivity between Facebook and the MDM solution.

Industry analyst R “Ray” Wang examines this in the blog post News Analysis: Informatica Launches MDM 9.5. The analysis states that it now is time to “drive data out of Facebook and not into Facebook”.

The opportunities and challenges of driving data out of Facebook was discussed in a post called exactly Out of Facebook here on the blog some years ago.

Balancing privacy with data hoarding is still for sure a subject that in no way is settled and probably never will be.

Connecting systems of record in traditional MDM solutions with social network profiles is in no way a walk over too. The classic data quality challenges with uniqueness of records and completeness of data only gets more difficult, but also, there are great opportunities for getting a better picture of your customers and other business partners.

If you are interested in Social MDM and the related challenges and opportunities there is a LinkedIn group for Social MDM.

The group is new, less than a month old at the present time, but there is already a lot of content to dip into, including:

A discussion called: What’s the business case?
A survey on data quality mechanisms focused on author identity resolution in social media
A poll: What is a good MDMish name for social network profiles?

Deduplication vs Identity Resolution

15th May 201229th May 2012Henrik Gabs LiliendahlLeave a comment

When working with data matching you often finds that there basically is a bright view and a dark view.

Traditional data matching as seen in most data quality tools and master data management solutions is the bright view: Being about finding duplicates and making a “single customer view”. Identity resolution is the dark view: Preventing fraud and catching criminals, terrorists and other villains.

These two poles were discussed in a blog post and the following comments last year. The post was called What is Identity Resolution?

While deduplication and identity resolution may be treated as polar opposites and seemingly contrary disciplines they are in my eyes interconnected and interdependent. Yin and Yang Data Quality.

At the MDM Summit in London last month one session was about the Golden Nominal, Creating a Single Record View. Here Corinne Brazier, Force Records Manager at the West Midlands Police in the UK told about how a traditional data quality tool with some matching capabilities was used to deal with “customers” who don’t want to be recognized.

In the post How to Avoid Losing 5 Billion Euros it was examined how both traditional data matching tools and identity screening services can be used to prevent and discover fraudulent behavior.

Deduplication becomes better when some element of identity resolution is added to the process. That includes embracing big reference data in the process. Knowing what is known in available sources about the addresses that is being matched helps. Knowing what is known in business directories about companies helps. Knowing what is known in appropriate citizen directories when deduping records holding data about individuals helps.

Identity Resolution techniques is based on the same data matching algorithms we use for deduplication. Here for example a fuzzy search technology helps a lot compared to using wildcards. And of course the same sources as mentioned above are a key to the resolution.

Right now I’m dipping deep into the world of big reference data as address directories, business directories, citizen directories and the next big thing being social network profiles. I have no doubt about that deduplication and identity resolution will be more yinyang than yin and yang in the future.

Häagen-Dazs Datakvalitet

12th May 201221st March 2014Henrik Gabs LiliendahlLeave a comment

There is a term called foreign branding. Foreign branding is describing an implied cachet or superiority of products and services with foreign-sounding names

Häagen-Dazs ice cream is an example of foreign branding. Though the brand was established in New York the name was supposed to sound Scandinavian.

However, Häagen-Dazs does sound and look somewhat strange to a Scandinavian. The reason is probably that the constellation of the letters “äa” and “zs” are not part of any native Scandinavian words.

By the way, datakvalitet is the Scandinavian compound word for data quality.

Getting datakvalitet right in world wide data isn’t easy. What works in some countries doesn’t work in other countries, not at least when we are talking datakvalitet regarding party master data such as customer master data, supplier master data and employee master data.

One of the reasons why datakvalitet for party master data is different is the various possibilities with applying big reference data sources. For example the availability of citizen data is different in New York than in Scandinavia. This affects the ways of reaching optimal datakvalitet as reported in the post Did They Put a Man on the Moon.

As part of the ongoing globalization handling international datakvalitet is becoming more and more common. Many enterprises try to deploy enterprise wide datakvalitet initiatives and shared service centers handles party master data uncommon to the people working there. This often results in finding a strange word like Häagen-Dazs.

The Data Quality Tool Vendor Difference

10th May 201210th May 2012Henrik Gabs Liliendahl2 Comments

How do analysts look at the data quality tool vendor market? As with everything data quality there are differences and apparently no single source of truth.

Gartner has its magic quadrant. They sell it for money, but usually you are able to get a free copy from the leading vendors.

The Information Difference has its DQ Landscape in the cloud for free.

It is interesting to compare which vendors are included in the latest main pictures, as I have tried below:

The number of x’s is a rough measure of the ability to execute / market strength.

Three smaller vendors are considered by Gartner, but not by The Information Difference and vice versa. Two midsize vendors are included by The Information Difference, but not by Gartner. Experian QAS are included as a big one by The Information Difference, but did not (yet) meet the inclusion criteria used by Gartner.

	Henrik Gabs Lilienda… on Balancing the Business Partner…
	Jeppe Thing Sørensen on Balancing the Business Partner…
	peolsolutions on MDM, Cloud, SaaS, PaaS, IaaS a…
	Henrik Gabs Lilienda… on Is the Holiday Season called C…
	Michael D. on Is the Holiday Season called C…
	Jay Ram on The Disruptive MDM List is…
	Henrik Gabs Lilienda… on The Intersection of Data Obser…
	Shanker on The Intersection of Data Obser…
	Bhavani Shanker on Data Matching Efficiency
	Henrik Gabs Lilienda… on Data Matching Efficiency
	Bhavani Shanker on Data Matching Efficiency
	Henrik Gabs Lilienda… on From Platforms to Ecosyst…
	Michael Fieg on From Platforms to Ecosyst…
	From Platforms to Ec… on What is Collaborative Product…
	From Platforms to Ec… on MDM and Knowledge Graph