Is Data Cleansing Bad for Data Matching?

Today I stumbled upon an article from Australia in BMC Medical Informatics and Decision Making. The article is called The effect of data cleaning on record linkage quality.

The result of the described research is:

“Data cleaning made little difference to the overall linkage quality, with heavy cleaning leading to a decrease in quality. Further examination showed that decreases in linkage quality were due to cleaning techniques typically reducing the variability – although correct records were now more likely to match, incorrect records were also more likely to match, and these incorrect matches outweighed the correct matches, reducing quality overall.”

This resonates very well with my experience too. Usually I like to match with both original data and standardized (cleansed) data in order to exploit the best of both approaches.
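Matching on both the original and the cleansed values could look roughly like the sketch below. The `standardize` rules and the `match_level` labels are hypothetical choices for illustration, not the method used in the cited research. The point is only that a match found on the original values can be treated as stronger evidence than one that appears only after cleaning.

```python
import re


def standardize(name: str) -> str:
    """Illustrative cleansing: lowercase, drop punctuation, collapse spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())


def match_level(a: str, b: str) -> str:
    """Compare on the original values first, then on the cleansed values."""
    if a == b:
        return "exact"         # identical as captured
    if standardize(a) == standardize(b):
        return "standardized"  # equal only after cleaning: weaker evidence
    return "none"
```

Keeping the two levels apart lets a matching process weigh them differently instead of letting cleaning remove the variability that separates records.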

What are your experiences?


In the future, data quality will be more social

Every time I walk on or off a plane at London Gatwick Airport, I nod at an HSBC advert saying that in the future, selling will be more social.

A natural consequence of this will also be that data quality improvement (and master data management) will be more social.

One example is how complex sales, meaning sales processes typically found in business-to-business (B2B) environments, will depend heavily on exploiting professional social networks, as discussed in the DataQualityPro interview about the benefits of Social MDM.

Traditional Master Data Management (MDM) and related data quality improvement in B2B environments have been largely about a single view of the business account and the legal entity behind it. As Social Customer Relationship Management (CRM) is much about the relations to the business contacts, the people side of business, we need a solid master data foundation behind the people who are those contacts.

The same individual may in fact be an important influencer related to a range of business accounts, each being a legal entity with whom you are aiming for a sales contract. You need a single view of that. So many sales contracts are based on a relation to a buyer moving from one business account to another. You need to be the winner in that game, and the answer may very well be your ability to do better social MDM and embrace the data quality issues related to that.

Social selling of course also relates to business-to-consumer (B2C) activities, and in doing that we will see new data quality issues. When exploiting social networks, in both B2B and B2C activities, you need to link traditional attributes such as name and address with new attributes from the online and social world, as explained in the post Multi-Channel Data Matching.

Besides exploiting social networks, we will also see social collaboration as a means to improve data quality. Social collaboration will go beyond collaboration within a single company and extend to the ecosystems of manufacturers, distributors, resellers and end users. A good example of this is the social collaboration platform called Actualog, which is about sharing product master data and thereby improving product data quality.


Call me on Phone, Mobile or Skype

When calling people in order to have a long distance conversation there are three main ways today:

  • The landline phone, which has been around since the 19th century and penetrated most homes and businesses in the last century
  • The mobile phone, which came around in the 70’s and spread rapidly in the 90’s
  • Skype, a voice over internet service that grew in the 00’s

Using these services involves an identifier, which may be stored in customer tables and other party master data repositories, with some implications for data management and identity resolution:

The Landline Phone Number

The landline phone number is a very common attribute in databases and is often used as the main identifier of a customer in ERP and CRM solutions.

Using a landline phone number for identity resolution has some challenges, including:

  • As with most attributes, phone numbers may change. Depending on the country in question, they may change during relocation, and most phone number systems get an upgrade over the years.
  • In business-to-business (B2B) a company typically has more than one phone number.
  • In business-to-consumer (B2C) the landline phone number belongs to a household rather than a single individual. That may be good or bad depending on the purpose of use.

The Mobile Phone Number

Mobile phone numbers also pile up in databases. In relation to identity resolution there are issues with mobile phone numbers, namely:

  • They change a lot.
  • It’s not always clear to whom a number actually belongs:
    • A company paid phone may be used for both business and pleasure and may be transferred to another individual
    • In a household a person may be registered for a range of mobile phones used by individual members of the household including children

The Skype ID

I seldom see databases with Skype IDs. In my experience Skype IDs aren’t used a lot in internal master data. They reside in Skype and in social network profiles such as LinkedIn.

A final rant

Today I hardly ever use a landline phone, I use my mobile once in a while and I use Skype a lot. Not because it’s convenient, but because the telecom companies have decided to charge for international mobile calls in ways so greedy that they make Somali sea pirates look like honest businessmen.


Shifts in Data Quality Tool Vendor Landscape

The Information Difference is an analyst firm that every year publishes a free online paper ranking the data quality tool vendors. The 2013 data quality tool landscape is out now.

An interesting trend is the shifts in who is in the main picture. Here are the 2012 and 2013 participants:

[Figure: The Information Difference landscape participants, 2012 and 2013]

The number of x’s is a rough measure of market strength.

While X88 is a new vendor in the landscape there are four vendors that have dropped from the main picture to the list of other vendors.

I have earlier compared the Gartner Data Quality Tool Magic Quadrant and The Information Difference Landscape in the post The Data Quality Tool Vendor Difference, and put the spotlight on Experian QAS as a vendor appearing differently by not being in the Gartner Quadrant, as reported here. This year Experian QAS has also dropped from The Information Difference Landscape main picture. Not the way to go, I guess, considering the many efforts of Experian QAS to be a leading data quality tool vendor.

Other vendors have dropped from their position in the picture. DQ Global is one. Oracle as well. And then Talend. Both Oracle and Talend are doing much more than data quality, and probably some focus has shifted to other things. Talend, for example, has put a lot of emphasis on big data recently.

It’s going to be exciting to see what happens on another source of truth, being the Gartner Data Quality Tool Magic Quadrant, this year.


Entrepreneurs within Social MDM

Some of the established vendors in the Master Data Management (MDM) realm may be working on integrating social data, and some apparently aren’t. Either way, as with many other new technologies, we will probably see the big movements coming from entrepreneurs.

I have noticed some new startups. Two are, not surprisingly, from the San Francisco Bay Area, and one is, perhaps surprisingly, from Saint Petersburg, the original one in Russia.

Reltio is working with multi-channel, including the social channel, data integration. Their raison d’être is:

“As a business user in Sales, Marketing or Compliance you always work with information from multiple sources of data, then why is it that most of your existing applications cannot handle data from multiple sources (internal, third party or social) or channels of interaction to provide you with the benefit of insights from this related information. Reltio is working to fill this gap….”.

Fliptop is doing the matching between your current party master data records and the same real world entities in the social sphere:

“Fliptop’s Customer Intelligence platform provides companies with an on-demand data scientist for their leads and contacts. Using publicly available information including social data to score and enrich leads, companies can prioritize their pipeline, better target their audience and know more about their customers.”

Actualog is into Social PIM (Product Information Management):

“Actualog is an innovative cloud-based social Product Information Management platform that brings together the expertise and knowledge of the manufacturers and most competent customers around the world. Actualog helps companies to share information about products, materials and technologies focusing on complex technical products using the ideas of social interaction.”

Have you noticed some Social MDM and related startups? – or are you actually one?


Multi-Channel Data Matching

Most data matching activities going on are related to matching customer, or rather party, master data.

In today’s business world we see data matching related to party master data in three different channel types:

  • Offline is the good old channel type, where we have the mother of all business cases for data matching: avoiding the unnecessary cost of sending the same material with the postman twice (or more) to the same recipient.
  • Online has been around for some time. While the cost of sending the same digital message to the same recipient may not be a big problem, there are still some other factors to be considered, like:
    • Duplicate digital messages to the same recipient look like spam (even if the recipient provided the different eMail addresses himself or herself).
    • You can’t measure a true response rate
  • Social is the new channel type for data matching. Most business cases for data matching related to social network profiles are probably based on multi-channel issues.

The concept of having a single customer view, or rather a single party view, involves matching identities across offline, online and social channels. The typical elements used for data matching are not entirely the same for those channels, as the figure Multi-channel data matching shows.

Most data matching procedures are in my experience quite simple, with only a few data elements and no history tracking taken into consideration. However, we do see more sophisticated data matching environments, often referred to as identity resolution, where historical data, more data elements and even unstructured data are taken into consideration.

When doing multi-channel data matching you can’t avoid going from the popular simple data matching environments to environments that look more like identity resolution.

Some advice for getting it right without too much complication:

  • Emphasize data capture by getting it right the first time. It helps a lot.
  • Get your data models right. Here reflecting the real world helps a lot.
  • Don’t reinvent the wheel. There are services for this out there. They help a lot.

Read more about such a service in the post instant Single Customer View.


Big Data and Data Matching

Data matching has been an established discipline for many years. Most data quality tools have more or less sophisticated features for data matching, and many MDM (Master Data Management) platforms have data matching capabilities as well.

[Figure: The LinkedIn Big Data Quality group]

In a way the data matching realm has become slightly dull in recent years. People don’t get excited anymore over a discussion about whether deterministic matching or probabilistic matching is the right way. Soundex is old, edit distance has been around for ages and matchcodes may have outlived themselves.
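For readers who haven’t met the two old workhorses, here is a minimal sketch of both. The Soundex variant is the commonly described American one, and I am assuming that variant here since implementations differ in the details. The edit distance is the plain Levenshtein distance. The function names are my own.

```python
# Consonant groups and their Soundex digits (American Soundex, assumed variant).
CODES = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in group:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return a four-character phonetic code, or "" if there are no letters."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    last = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        digit = CODES.get(ch, "")
        if digit and digit != last:
            code += digit
        last = digit  # vowels reset the run, so a repeated code counts again
    return (code + "000")[:4]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if ca == cb else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]
```

Soundex groups names that sound alike under one code, while edit distance counts how many keystrokes separate two spellings. In practice the two are often used together.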

So, it’s good to see a new beast turning up. Data matching with big data.

It may be about deduplicating (deduping) volumes that are bigger than traditional data matching can handle. You know: Dedoop’ing.

But it is also very much about matching big data with small data, first and foremost master data, and about having well-matched master data. Kimmo Kontra wrote a good post about that recently. The post is called Big Grease, Big Data, and Big Apple – manholes and MDM.

The case presented by Kimmo holds many exciting implementations of data matching like for example proximity matching of locations.
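Proximity matching of locations can be sketched as a distance check between two coordinates. The example below uses the haversine formula. The 50 metre tolerance and the function names are assumptions of mine for illustration and are not taken from Kimmo’s post.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (an approximation)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_proximity_match(p, q, max_distance_m: float = 50.0) -> bool:
    """Treat two (lat, lon) points as the same place if they are close enough."""
    return haversine_m(p[0], p[1], q[0], q[1]) <= max_distance_m
```

The right tolerance depends on the precision of the coordinates and on how densely the matched objects are placed.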


Small Data with Big Impact

In an ongoing discussion on LinkedIn there are some good points on the question: How important is data quality for big data compared to data quality for small data?

A repeated sentiment in the comments is that data quality for small data is going to be more important with the rise of big data.

The small data we are talking about here is first and foremost master data.

Master Data Challenges with Big Data

As with traditional transaction data, master data also describes the who, what, where and when of big data.

If we have issues with completeness, timeliness and uniqueness in our master data, any prediction based on big data matched with master data is going to be as chaotic as weather forecasts.

We also need to expand the range of entities embraced by our master data management implementations as exemplified in the post Social MDM and Future Competitive Intelligence.

Matching Big Data with Master Data

Some of the issues in matching big data with master data I have stumbled upon are:

  • Who: How do we link the real world entities reflected in our traditional systems of record with the real world entities behind who’s talking in systems of engagement? This question was touched on in the post Making Sense with Social MDM.
  • What: How do we manage our product hierarchies and product descriptions so they fulfill both (different) internal purposes and external usage? More on this in the post Social PIM.
  • Where: How do we identify a given place? If you think this is easy, why not read the post Where is the Spot?
  • When: Dates and times come in many formats, and relating events to the wrong schedule may have us Going in the Wrong Direction.

How: You may for example follow this blog. Subscription is in the upper right corner 🙂


Coma, Wetsuit and Dedoop

The sehr geehrte Damen und Herren at Universität Leipzig (Leipzig University) are doing a lot of research in the data management realm and put some good effort into naming the stuff.

Here are some of the inventions:

COMA is a system for flexible Combination Of schema Matching Approaches. Let’s hope the thing is still alive.

WETSUIT (Web EnTity Search and fUsIon Tool) is a new powerful mashup tool – and what a nice seven letter abbreviation not sticking only to the first letters.

Dedoop (Deduplication with Hadoop) is a prototype for entity matching for big data. Big phonetic Dedupe will be around of course.

Well, you should expect fuzzy abbreviations from this city, as Leipzig means “settlement where the linden trees stand”.


Making sense with Social MDM

A few days ago Jeff Jonas of IBM made a new blog post called Master Data Management (MDM) vs. Sensemaking.

In it, Jeff Jonas ponders the differences between the data matching algorithms we use in traditional MDM, predominantly name and address matching, and the kind of identity resolution we need when we, for example, try to listen to and make sense of the signals in the social media data streams.

Jeff Jonas says: “Different missions, different tools.  Some organizations will use one or the other; most organizations will want both.”  

I tend to disagree slightly with Jeff Jonas. As told in the post The New Year in Identity Resolution I think we will need a connection between the old systems of record and the new systems of engagement.

Indeed the algorithms will be used differently and indeed we need different thresholds of confidence for different tasks. But I think we will have to make the integration story a bit more complicated in order to make sensible decisions across the two missions.
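The idea of different thresholds of confidence for different tasks can be sketched like this. The task names and threshold values are purely hypothetical and only illustrate that the same match score may be good enough for one mission and not for another.

```python
# Hypothetical confidence thresholds per task; the values are illustrative only.
THRESHOLDS = {
    "merge_master_records": 0.95,  # system of record: be very sure
    "link_social_profile": 0.80,   # system of engagement: a bit looser
    "listen_for_signals": 0.50,    # sensemaking: weak evidence is still useful
}


def allowed_tasks(match_score: float) -> list:
    """Return the tasks whose confidence threshold the match score meets."""
    return [task for task, minimum in THRESHOLDS.items() if match_score >= minimum]
```

A score of 0.85 would, under these assumed values, allow linking a social profile but not merging two master data records.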
