Follow Friday Data Quality

Every Friday on Twitter people recommend other tweeps to follow using the #FollowFriday (or simply #FF) hashtag.

My username on twitter is @hlsdk.

Sometimes I notice tweeps I follow are recommending the username @hldsk or @hsldk or other usernames with my five letters swapped.

It could be they meant me but misspelled the username. Or maybe they meant someone else with a username close to mine?

As the other usernames weren’t taken, I have taken the liberty of creating some duplicate (shame on me) profiles and having a bit of (nerdish) fun with it:

@hsldk

For this profile I have chosen the Swedish Chef from The Muppet Show as the image. To make the Swedish connection real, the location on the profile is set as “Oresund Region”, which is the binational metropolitan area around the Danish capital Copenhagen and the 3rd largest Swedish city Malmoe, as explained in the post The Perfect Wrong Answer.

@hldsk

For this profile I have chosen a gorilla image originally used in the post Gorilla Data Quality.

This Friday @hldsk was recommended thrice.

But I think only by two real life individuals: Joanne Wright from Vee Media and Phil Simon, who also tweets as his new (one-man-band, I guess) publishing company.

What’s the point?

Well, one of my main activities in business is hunting duplicates in party master databases.

What I sometimes find is that duplicates (several rows representing the same real world entity) have been entered for a good reason in order to fulfill the immediate purpose of use.

The thing with Phil and his one-man-band company is explained further in the post So, What About SOHO Homes.

By the way, Phil is going to publish a book called The New Small. It’s about: How a New Breed of Small Businesses is Harnessing the Power of Emerging Technologies.


Linked Data Quality

The concept of linked data within the semantic web is in my eyes a huge opportunity for getting data and information quality improvement done.

The premises for that are described on the page Data Quality 3.0.

Until now data quality has been largely defined as: Fit for purpose of use.

The problem however is that most data – not least master data – have multiple uses.

My thesis is that there is a breakeven point when including more and more purposes where it will be less cumbersome to reflect the real world object rather than trying to align fitness for all known purposes.

If we look at the different types of master data and what possibilities may arise from linked data, this is what initially comes to my mind:

Location master data

Location data is among the data types already used the most on the web. Linking a hotel, a company, a house for sale and so on to a map is an immediate visual feature appealing to most people. Many databases around, however, have poor location data, for example inadequate postal addresses. The demand for making these data “mappable” will grow until it is near unavoidable, but fortunately services based on linked data will help in doing so.

Hopefully increased open government data will help solve the data supply issue here.

Party master data

Linking party master data to external data sources is not new at all, but unfortunately not as widespread as it could be. The main obstacle until now has been smooth integration into business processes.

Having linked data describing real world entities on the web will make this game a whole lot easier.

Actually I’m working on implementations in this field right now.

Product master data

Traditionally the external data sources available for describing product master data have been few – and hard to find. But surely, a lot of data is already out there waiting to be found, categorized, matched and linked.


Feasible Names and Addresses

Most data quality technology was born in relation to the direct marketing industry back in the good old offline days. Main objectives have been deduplication of names and addresses and making names and addresses fit for mailing.

When working with data quality you have to embrace the full scope of business value in the data, here being the names and addresses.

Back in the 90’s I worked with an international fund raising organization. A main activity was sending direct mails with greeting cards for optional sale with motifs related to seasonal feasts. Deduplication was a must regardless of the country (though the means were very different, but that’s for another day). Obviously the timing of the campaigns and the motifs on the cards differed between countries, but also within the countries based on the names and addresses.

Two examples:

German addresses

When selecting motifs for Christmas cards it’s important to observe that Protestantism is concentrated in the north and east of the country and Roman Catholicism is concentrated in the south and west. (If you think I’m out of season, well, such campaigns are planned in summertime.) So, in the north and east most people prefer Christmas cards with secular motifs such as a lovely winter landscape. In the south and west most people will like a motif with Madonna and Child. Having well organized addresses with a connection to demographics was important.

Malaysian names

Malaysia is a very multi-ethnic society. The two largest groups, the ethnic Malays and the Malaysians of Chinese descent, have different seasonal feasts. The best way of handling this in order to fulfill the business model was to assign the names and addresses to the different campaigns based on whether the name was an ethnic Malay name or a Chinese name. Surely an exercise on the edge of what I earlier described in the post What’s in a Given Name?


New Blog Name?

As reported by Mark Goloboy here ”Data Quality” is becoming a dirty word. ”Information Quality” is in vogue.

Maybe I will soon have to change the name of my blog?

Also one may expect other related terms will be changed, like:

  • Data Governance becomes Information Governance
  • Master Data Management becomes Master Information Management
  • Data Matching becomes Information Matching
  • Data Warehouse becomes Information Warehouse
  • Database becomes Informationbase
  • Information Technology becomes Data Technology

But changing the name of a blog is a serious thing you shouldn’t do too often. I think I will wait and see if the term renaming stops at simply replacing data with information. Some guesses for further renaming:

Information Fitness replaces Data Quality, as data quality is often defined as “fit for intended purpose of use” and by replacing data with information that trail becomes even more clear – as opposed to the other trail, being real world alignment.

Information Political Correctness replaces Data Governance, as Data Governance is a lot about policies and the Data Governance practice is a lot about maneuvering in the corporate political landscape.

Master Information Technology (MIT) replaces Master Data Management (MDM)


Four Different Data Matching Stage Types

One of the activities I do in my leisure time is cycling. As a consequence I guess I also like to watch cycling on TV (or on the computer), not least the paramount cycling event of the year: Le Tour de France.

In Le Tour de France you basically have four different types of stages:

  • Time trial
  • Stages on flat terrain
  • Stages through hilly landscape
  • Stages in the high mountains

Some riders are specialists in one of the stage types and some riders are more all-around types.

With automated data matching, which is what I do the most in my business time, there are basically also four different types of processes:

  • Internal deduplication of rows inside one table
  • Removal of rows in one table which also appear in another table
  • Consolidation of rows from several tables
  • Reference matching with rows in one table against another (big) table

Internal deduplication

Examples of data matching objectives here are finding duplicates in names and addresses before sending a direct mail or finding the same products in a material master.

The big question in this type of process is whether you are able to balance between not making any false positives (being too aggressive) while not leaving too many false negatives behind (losing the game). You also have to think about survivorship when merging into a golden record.
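To make the false positive / false negative trade-off concrete, here is a minimal sketch of such a deduplication pass. All rows are made up, the similarity measure is a crude stand-in for real matching logic, and the threshold is illustrative:

```python
from difflib import SequenceMatcher

# Hypothetical customer rows: (id, name, address)
rows = [
    (1, "Stefani Germanotta", "123 Main Street, Anytown"),
    (2, "Stephani Germanota", "123 Main St, Anytown"),
    (3, "John Smith", "9 High Street, Othertown"),
]

def similarity(a, b):
    """Crude similarity on the concatenated, lower-cased name and address."""
    return SequenceMatcher(None,
                           " ".join(a[1:]).lower(),
                           " ".join(b[1:]).lower()).ratio()

THRESHOLD = 0.85  # too aggressive gives false positives, too cautious false negatives

# Pairwise comparison; real tools use blocking to avoid O(n^2) compares
duplicates = [(a[0], b[0])
              for i, a in enumerate(rows)
              for b in rows[i + 1:]
              if similarity(a, b) >= THRESHOLD]

# Naive survivorship rule: the longest (most complete) record survives
def golden_record(a, b):
    return a if len(" ".join(a[1:])) >= len(" ".join(b[1:])) else b
```

Raising THRESHOLD towards 1.0 would miss the misspelled duplicate (a false negative), while lowering it too far would start pairing unrelated rows (false positives).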

In Le Tour de France the overall leader who gets the yellow jersey has to make a good time trial.

Removal

Here the examples of data matching objectives will be eliminating nixies (people who don’t want offers by mail) before sending a direct mail or eliminating bad payers (people you don’t want to offer a credit).
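As a sketch, this removal step can be seen as an anti-join on a normalized match key. The names and addresses below are made up, and a real solution would use standardized, fuzzy keys rather than exact lower-cased strings:

```python
# Reduce each row to a normalized match key
def key(name, address):
    return f"{name}|{address}".lower().strip()

# Hypothetical mailing list and suppression ("nixie") list
mailing_list = [
    ("Stefani Germanotta", "123 Main Street, Anytown"),
    ("John Smith", "9 High Street, Othertown"),
]
nixies = {key("John Smith", "9 High Street, Othertown")}

# Anti-join: keep only rows whose key is not in the suppression list
cleaned = [row for row in mailing_list if key(*row) not in nixies]
```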

Probably the easiest process – everyone can do it – but at the end of the day some are better sprinters than others.

The best sprinter in Le Tour de France gets the green jersey.

Consolidation

When migrating databases and/or building a master data hub you often have to merge rows from several different tables into a golden copy.

Here you often see the difficulty of making data fit for the immediate purpose of use and at the same time aligned with the real world, in order to also be able to handle the needs that arise tomorrow.
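A toy example of field-level survivorship when consolidating two source rows into one golden record. The rows and the precedence rule (the longest non-empty value wins) are purely illustrative:

```python
# Hypothetical rows for the same customer from two source tables
crm_row     = {"name": "Stefani Germanotta", "phone": None, "email": "sg@example.com"}
billing_row = {"name": "S. Germanotta", "phone": "555-0100", "email": None}

def consolidate(a, b):
    """Field-level survivorship: prefer a filled value, and prefer the
    longer one as a crude proxy for completeness."""
    golden = {}
    for field in a:
        values = [v for v in (a[field], b[field]) if v]
        golden[field] = max(values, key=len) if values else None
    return golden

golden = consolidate(crm_row, billing_row)
```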

Often some of the young riders in Le Tour de France make an escape when climbing the hills and get the white jersey.

Reference match

Doing business directory matching has been a focus area of mine, including making a solution for matching with the D&B Worldbase. The Worldbase holds over 165 million rows representing business entities from all over the world.

The results from automated matching with such directories may vary a lot, like you see huge time differences in Le Tour de France when the riders face the big mountains. Here the best climber gets the polka dot jersey.
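Matching against a reference table of that size is only feasible with some kind of candidate selection (“blocking”). Here is a sketch with a two-row directory standing in for the millions of real rows; the blocking key (city plus first letter of the name) is just an illustration:

```python
from collections import defaultdict

# Tiny stand-in for a large business directory
directory = [
    {"duns": "111", "name": "Gaga Real Estate", "city": "Anytown"},
    {"duns": "222", "name": "Acme Corp", "city": "Othertown"},
]

# Blocking: index the directory on a cheap key so each incoming record
# is only compared against a handful of candidates, not every row
index = defaultdict(list)
for rec in directory:
    index[(rec["city"].lower(), rec["name"][0].lower())].append(rec)

def candidates(name, city):
    """Return the directory rows sharing the blocking key."""
    return index[(city.lower(), name[0].lower())]
```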


Mixed Identities

A frequent challenge when building a customer master data hub is dealing with incoming records from operational systems where the data in one record belongs to several real world entities.

One situation may be that a name contains two (or more) real world names. This situation was discussed in the post Splitting names.

Another situation may be that:

  • The name belongs to real world entity X
  • The address belongs to real world entity Y
  • The national identification number belongs to real world entity Z

Fortunately most cases only have 2 different real world representations like X and Y or Y and Z.

An example I have encountered often is when a company delivers a service through another organization. Then you may have:

  • The name of the 3rd party organization in the name column(s)
  • The address of the (private) end user in the address columns

Or, as I remember seeing once:

  • The name of the (private) end user in the name column(s)
  • The address of the (private) end user in the address columns
  • The company national identification number of the 3rd party organization in the national ID column

Of course the root cause solution to this will be a better (and perhaps more complex) way of gathering master data in the operational systems. But most companies have old and not easily changeable systems running core business activities, and swapping to new systems in a rush isn’t something just done either. Also, data gathering may take place outside your company, making the data governance much more political.

A solution downstream at the data matching gates of the master data hub may be to facilitate complex hierarchy building.
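One way to sketch such hierarchy building: instead of forcing the incoming record into one flat customer row, resolve each part to its own entity and store the relation between them. All names, ids and the relation label below are made up:

```python
# Hypothetical incoming record whose columns point at different
# real world entities (the national id belongs to a 3rd party company)
incoming = {
    "name": "John Smith",
    "address": "123 Main Street, Anytown",
    "national_id": "DK12345678",
}

# In the hub, each part becomes its own entity, and the hierarchy
# records the relation instead of mixing the identities in one row
hub = {
    "entities": {
        "P1": {"type": "person",  "name": incoming["name"],
               "address": incoming["address"]},
        "C1": {"type": "company", "national_id": incoming["national_id"]},
    },
    "relations": [("C1", "delivers service to", "P1")],
}
```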

Oftentimes the single customer view in the master data hub will be challenged from the start, as the data in some perception is fit for the intended purpose of use.


Location, Location, Location

Now, I am not going to write about the importance of location when selling real estate, but I am going to provide three examples about knowing the location when you are doing data matching, like trying to find duplicates in names and addresses.

Location uniqueness

Let’s say we have these two records:

  • Stefani Germanotta, Main Street, Anytown
  • Stefani Germanotta, Main Street, Anytown

The data is character by character exactly the same. But:

  • There is only a very high probability that it is the same real world individual if there is just one address on Main Street in Anytown.
  • If there are only a few addresses on Main Street in Anytown, you will still have a fair probability that this is the same individual.
  • But if there are hundreds of addresses on Main Street in Anytown, the probability that this is the same individual will be below the threshold for many matching purposes.
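The tiers above could be sketched as a simple function of how many addresses an address reference reports for the street; the probabilities are illustrative, not calibrated:

```python
# Illustrative match probability for two character-identical records,
# driven by how many addresses exist on the named street
def match_probability(addresses_on_street):
    if addresses_on_street == 1:
        return 0.95  # only one possible mailbox: very high probability
    if addresses_on_street <= 5:
        return 0.75  # a few addresses: still a fair probability
    return 0.30      # hundreds of doors: below many matching thresholds
```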

Of course, if you are sending a direct marketing letter it is pointless to send both letters, as:

  • Either they will be delivered in the same mailbox.
  • Or both will be returned by postal service.

So this example highlights a major point in data quality. If you are matching for a single purpose of use like direct marketing you may apply simple processing. But if you are matching for multiple purposes of use like building a master data hub, you can’t avoid some kind of complexity.

Location enrichment

Let’s say we have these two records:

  • Alejandro Germanotta, 123 Main Street, Anytown
  • Alejandro Germanotta, 123 Main Street, Anytown

If you know that 123 Main Street in Anytown is a single family house there is a high probability that this is the same real world individual.

But if you know that 123 Main Street in Anytown is a building used as a nursing home or a campus, or that this entrance has many apartments or other kinds of units, then it is not so certain that these records represent the same real world individual (not least if the name is John Smith).

So this example highlights the importance of using external reference data in data matching.
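A sketch of how such reference data could feed the match decision. The building types, the common-name list and the probabilities are all made up for illustration:

```python
# Hypothetical external reference data: what kind of building an address is
building_type = {
    "123 main street, anytown": "single family house",
    "456 main street, anytown": "nursing home",
}

COMMON_NAMES = {"john smith"}  # tiny stand-in for a name frequency table

def duplicate_probability(name, address):
    kind = building_type.get(address.lower(), "unknown")
    if kind == "single family house":
        return 0.9  # one household behind the door
    if kind in ("nursing home", "campus", "apartment building"):
        # Many residents behind one address; a common name makes it worse
        return 0.2 if name.lower() in COMMON_NAMES else 0.5
    return 0.6  # unknown building type
```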

Location geocoding

Let’s say we have these two records:

  • Gaga Real Estate, 1 Main Street, Anytown
  • L. Gaga Real Estate, Central Square, Anytown

If you match using the street address, the match is not that close.

But if you assign geocodes to the two addresses, the addresses may turn out to be very close (just around the corner) and your match will then be pretty confident.

Assigning geocodes usually serves other purposes than data matching. So this example highlights how enhancing your data may have several positive impacts.
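The proximity check can be sketched with the haversine (great-circle) formula. The two coordinates below are made-up stand-ins for “1 Main Street” and “Central Square”, and the 100 metre threshold is illustrative:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two coordinates."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6_371_000 * asin(sqrt(a))  # mean Earth radius

# Hypothetical geocodes for two textually different addresses
dist = haversine_m(55.6761, 12.5683, 55.6764, 12.5689)
close_enough = dist < 100  # metres; threshold depends on purpose
```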


The Slurry Project

When cleansing party master data it is often necessary to typify the records in order to settle whether each one is a business entity, a private consumer, a department (or project) in a business, an employee at a business, a household or some kind of dirt such as a test, a comic name or another illegible name and address.

Once I made such a cleansing job for a client in the farming sector. When I browsed the result looking for false positives in the illegible group this name showed up:

  • The Slurry Project (in Danish: Gylleprojektet)

So, normally it could be that someone called a really shitty project a bad name or provided dirty data for whatever reason. But in the context of the farming sector it makes a good name for a project dealing with better exploitation of slurry in growing crops.

A good example of the need for having the capability to adjust the bad word lists according to the context when cleansing data.
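Such a context-aware check could be sketched like this; the word lists and the “farming” context label are made up for illustration:

```python
# Illustrative bad word list with per-context exceptions
BAD_WORDS = {"slurry", "test", "dummy"}
CONTEXT_EXCEPTIONS = {"farming": {"slurry"}}

def is_illegible(name, context=None):
    """Flag a name as dirt unless its bad words are fine in this context."""
    words = set(name.lower().split())
    bad = words & BAD_WORDS
    bad -= CONTEXT_EXCEPTIONS.get(context, set())
    return bool(bad)

# "The Slurry Project" is dirt in most contexts, but a perfectly
# good project name for a client in the farming sector
```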



Post no. 100

This is post number 100 on this blog. Besides this being a time for saying thank you to those who have read this blog, those who have re-tweeted the posts and not least those who have commented on the posts, it is also time for a recapitulation of my opinions (based on my experiences and observations) about data quality.

Let me emphasize three points:

  • Fit for purpose versus real world alignment
  • Diversity in data quality
  • The role of technology in data quality improvement

Fit for purpose versus real world alignment

According to Wikipedia data may be of high quality in two alternative ways:

  • Either they are fit for their intended uses
  • Or they correctly represent the real-world construct to which they refer

My thesis is that there is a breakeven point when including more and more purposes where it will be less cumbersome to reflect the real world object rather than trying to align all known purposes.

This theme is so far covered in 19 posts and pages.

Diversity in data quality

International and multi-cultural aspects of data quality improvement have been a favorite topic of mine for a long time.

While working with data quality tools and services for many years I have found that many tools and services are very national. So you might discover that a tool or service will make wonders with data from one country, but be quite ordinary or in fact useless with data from another country.

I have made 15 posts on diversity in data quality so far.

The role of technology in data quality improvement

Being a Data Quality professional may be achieved by coming from the business side or the technology side of practice. But more important in my eyes is whether you have made serious attempts at, and succeeded in, understanding the side where you didn’t start. I have always strived to be a mixed-skill person. As I have tried single-handedly to build a data quality tool – or to be more specific, a data matching tool – I do of course write a lot about data quality technology.

This blog includes 37 posts on data quality technology so far.
