The Good, Better and Best Way of Avoiding Duplicates

Having duplicates in databases is the most prominent data quality issue around, and not least duplicates in party master data are often pain number one when assessing the impact of data quality flaws.

A duplicate in the data quality sense is two or more records that don’t have exactly the same characters but refer to the same real world entity. I have worked with these three different approaches to fixing the duplicate problem:

  • Downstream data matching
  • Real time duplicate check
  • Search and mash-up of internal and external data

Downstream Data Matching

The good old way of dealing with duplicates in databases is having data matching engines periodically scan through databases highlighting the possible duplicates in order to facilitate merge/purge processes.

Finding the duplicates after they have lived their own lives in databases and already have different kinds of transactions attached is indeed not optimal, but sometimes it’s the only option as explained in the post Top 5 Reasons for Downstream Cleansing.

Real Time Duplicate Check

The better way is to make the match at data entry where possible. This approach is often orchestrated as a data entry process where a single element or a range of elements is checked when entered. For example the address may be checked against reference data and a phone number may be checked for adequate format for the country in question. And then finally, when a properly standardized record is submitted, it is checked whether a possible duplicate exists in the database.
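
A very simplified sketch of such a real time duplicate check could look like this in Python; the similarity measure and the 0.85 threshold are just illustrative stand-ins for what a real matching engine would do:

```python
from difflib import SequenceMatcher

def normalize(record):
    """Standardize a party record before matching (case, whitespace)."""
    return {k: " ".join(str(v).upper().split()) for k, v in record.items()}

def similarity(a, b):
    """Fuzzy similarity between two normalized records (0.0 - 1.0)."""
    return SequenceMatcher(None, " ".join(a.values()),
                           " ".join(b.values())).ratio()

def check_duplicate(new_record, existing_records, threshold=0.85):
    """Return possible duplicates found at data entry time."""
    candidate = normalize(new_record)
    return [r for r in existing_records
            if similarity(candidate, normalize(r)) >= threshold]

db = [{"name": "John Smith", "address": "1 Main Street, Springfield"}]
hits = check_duplicate({"name": "Jon Smith",
                        "address": "1 Main St., Springfield"}, db)
```

The point is that the check runs before the record is committed, so the user can pick the existing record instead of creating a duplicate.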

Search and Mash-Up of Internal and External Data

The best way is, in my eyes, a process that avoids re-entering most of the data that is already in the internal databases and takes advantage of data that already exists on the internet as external reference data sources.

[Image: the instant Data Quality (iDQ) mash-up]

The instant Data Quality concept I currently work with requires the user to enter as little data as possible, for example through rapid address entry, a Google-like search for a name, simply typing a national identification number or, in the worst case, combining some known facts. After that the system makes a series of fuzzy searches in internal and external databases and presents the results as a compact mash-up.

The advantages are:

  • If the real world entity already exists you avoid the duplicate and avoid entering data again. You may at the same time evaluate accuracy against external reference data.
  • If the real world entity doesn’t exist in internal data you may pick most of the data from external sources, thereby avoiding typing too much while at the same time ensuring accuracy.
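
A toy sketch of the mash-up idea in Python; the substring search and the in-memory data sources are of course stand-ins for real fuzzy search capabilities and external reference data services:

```python
def search_internal(term, internal_db):
    """Search internal master data (simplistic substring match here)."""
    t = term.upper()
    return [r for r in internal_db if t in r["name"].upper()]

def search_external(term, external_source):
    """Lookup in an external reference source, e.g. a business registry
    (stubbed as a list for illustration)."""
    t = term.upper()
    return [r for r in external_source if t in r["name"].upper()]

def mash_up(term, internal_db, external_source):
    """Present internal and external hits side by side, so the user can
    pick an existing record (avoiding a duplicate) or prefill the new
    record from external data (avoiding retyping)."""
    return {
        "internal": search_internal(term, internal_db),
        "external": search_external(term, external_source),
    }
```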


MDM is all about Software Brands

LinkedIn is a great social service for professionals. I often read descriptions of LinkedIn with the sentiment that LinkedIn is a recruitment platform. However, in my opinion LinkedIn is much more than that. To me LinkedIn is more about networking, knowledge sharing, social marketing and social selling.

But that said, recruiters are certainly very active on LinkedIn. It probably happens every week that I’m contacted on LinkedIn by a recruiter with an MDM (Master Data Management) job.

The opening is practically always like this:

“We are looking for a candidate with experience with <brand>….”, where <brand> is Informatica, Oracle, IBM, SAP and other well known brands in the MDM sphere.

As I don’t suppose the recruiters make up the top requirement themselves, this number one requirement probably comes from the hiring organization. So to users of MDM, MDM is all about the software brand. Never mind people and processes. That’s easy. Technology is the hard part, not least mastering the master data technology that was bought after a thorough selection process.


Matching for Multiple Purposes

A recent post on the InfoTrellis blog takes up the good old question in data matching: Deterministic Matching versus Probabilistic Matching.

The post has a good walk-through of the topic and reaches this conclusion:

“So, which is better, Deterministic Matching or Probabilistic Matching?  The question should actually be: ‘Which is better for you, for your specific needs?’  Your specific needs may even call for a combination of the two methodologies instead of going purely with one.”
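
The difference between the two methodologies, and the suggested combination of them, could be sketched like this in Python (the key field, weights and threshold are made up for illustration):

```python
from difflib import SequenceMatcher

def deterministic_match(a, b):
    """Deterministic: exact agreement on a predefined key field."""
    return a["national_id"] == b["national_id"]

def probabilistic_match(a, b, threshold=0.8):
    """Probabilistic: weighted fuzzy similarity over several fields."""
    weights = {"name": 0.6, "address": 0.4}
    score = sum(w * SequenceMatcher(None, a[f].upper(), b[f].upper()).ratio()
                for f, w in weights.items())
    return score >= threshold

def combined_match(a, b):
    """The combination the post suggests: trust the deterministic rule
    when the key is present, fall back to probabilistic scoring."""
    if a.get("national_id") and b.get("national_id"):
        return deterministic_match(a, b)
    return probabilistic_match(a, b)
```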

On a side note, the author of the post is MARIANITORRALBA. I had to use my combined probabilistic and deterministic, in-word parsing supported and social media connected data matching capability to match this concatenated name with the LinkedIn profile of an InfoTrellis employee called Marian Itorralba.

This little exercise brings me to an observation about data matching: matching party master data, not least when you do this for several purposes, ultimately is identity resolution as discussed in the post The New Year in Identity Resolution.

For that we need what could be called hierarchical data matching.

The reason we need hierarchical data matching is that more and more organizations are looking into master data management and then they realize that the classic name and address matching rules do not necessarily fit when party master data are going to be used for multiple purposes. What constitutes a duplicate in one context, like sending a direct mail, doesn’t necessarily make a duplicate in another business function and vice versa. Duplicates come in hierarchies.

One example is a household. You probably don’t want to send two sets of the same material to a household, but you might want to engage in a 1-to-1 dialogue with the individual members. Another example is that you might do some very different kinds of business with the same legal entity. Financial risk management is the same, but different sales or purchase processes may require very different views.

This matter is discussed in the post called Hierarchical Data Matching, not least in its comments.


Think global from day one

The title of this post is taken from a blog post by Hans Peter Bech. The post is called Entering a Foreign Market – The 9 Steps to Success for Software Companies.


In the post Hans Peter says:

“German software companies having access to 7% of world demand and US based companies with a domestic market representing 38% of world demand often ignore the global perspective until forced to face the challenge. That’s very fortunate for the smaller companies from the smaller countries!”

This observation from the software market in general certainly also applies to software for data quality improvement and master data management as examined in the post 255 Reasons for Data Quality Diversity.

If you are a software company in the data management space, thinking global may apply to various activities such as:

  • How the product is designed in respect to handling data from all over the world. Here thinking global from day one is crucial.
  • How the product is marketed to a world-wide audience. Here the global approach could wait a bit.

On the latter matter I have teased one of the magic quadrant data quality tool vendors, Trillium Software, for using a date format only used in the United States on their blog. Maybe it’s a small matter and it’s just me being sensitive to this common glitch. Anyway, I’m pleased to congratulate Trillium Software on their new blog design with a world-wide fit date format. Check out the blog, which is a good one indeed, here.
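
The glitch is easy to illustrate in Python: the same date written US-style reads as another date in most of the world, while the ISO 8601 format is unambiguous:

```python
from datetime import date

d = date(2013, 5, 4)

# US-style: read as May 4 in the United States, April 5 in most of Europe
us_style = d.strftime("%m/%d/%Y")   # "05/04/2013"

# ISO 8601: year-month-day, unambiguous everywhere
iso_style = d.isoformat()           # "2013-05-04"
```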


What’s so Special About MDM?

In a blog post from yesterday, one of my favorite bloggers, Loraine Lawson, writes:

“Take master data management, for instance. Oh sure, experts preach that it’s a discipline, not “just” a technology, but come on. Did anybody ever hear about MDM before MDM solutions were created?”

The post is called: Let’s Talk: Do You Really Need an Executive Sponsor for MDM?

And yes, we do need an executive sponsor. We also need a business case, as we must avoid doing it big bang style, and we need to establish metrics for measuring success and so on.

All wise advice, just as there are similar wise sayings about data quality improvement initiatives, business intelligence (BI) implementations, customer relationship management (CRM) system roll-outs and almost any other kind of technology enabled project.

I touched on this subject some years ago in the post Universal Pearls of Wisdom.

So let’s talk:

  • Is an executive sponsor more important for Master Data Management (MDM) than for Business Intelligence (BI)?
  • Is a business case more important for Master Data Management (MDM) than for Supply Chain Management (SCM)?
  • Is big bang style more dangerous for Master Data Management (MDM) than for Service Oriented Architecture (SOA)?

And oh, don’t just tell me that I can’t compare apples and pears.


Multi-Domain MDM Market Update

In the recent post called The MDM Landscape is Slowly Changing I wrote about some findings in the latest MDM market research document by the Information Difference.

Recently Bloor published their view on the MDM Market including who is in or close to the bull’s eye. You may read the document called Master Data Management Market update by following the press release on the paper from Informatica here.

The two views agree on a lot of things, including how Multi-Domain MDM is becoming the norm. The alignment of views is no surprise, as I guess there is only one Andy Hayler around in the MDM sphere and he is the man behind both documents.

And hey, if you agree with Andy about Multi-Domain MDM, why not join the LinkedIn Multi-Domain MDM group.


Things we do with fresh master data in old packing

In a recent blog post called Understanding the sources of master data Prashanta Chandramohan writes:

“Often times, master data sources are legacy in nature, built and maintained over a long period of time, lack documentation and include procedures and terminologies which are no longer relevant in the current context.”

This saying resonates very well with my experiences.

Implementing Master Data Management (MDM) solutions doesn’t take place in a green field. Most of the hard work is not about how to build a perfect master data environment but about how to work around what over the years has been done badly with master data for many good reasons.

Some of the perhaps mundane but persistent challenges I have worked with are:

  • Quite a few old systems hold data such as names, addresses, product descriptions and so on only in upper case. You may want to convert that to a more readable mix of upper and lower case (according to the culture in question) on an ongoing basis. When handling master data entities with names beyond the English alphabet, you may even want to optimize the use of national characters in strings that before only allowed characters from the English alphabet.
  • Fields are used for other things than their original purpose because there is no other way. While ongoing conversions, including parsing, may not be the best way around this, it is often the only way to go.
  • Due to limited search capabilities in old systems you may write personal names starting with the surname (in cultures where that’s not common), twist company name elements around and so on. This may not look nice when mashing up with other sources and may limit the use for other purposes, so here too conversions may be the only way to go.
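
A taste of the first challenge in Python; the particle and prefix rules below are only illustrative, as real culture-aware casing needs far more extensive reference lists:

```python
def to_mixed_case(name, particles=("VAN", "DER", "DE", "VON")):
    """Convert an all-upper-case legacy name to mixed case.
    Naive title-casing is a start, but name particles (van, der, ...)
    and prefixes like Mc need culture-specific rules; the lists here
    are just a small illustrative sample."""
    words = []
    for w in name.split():
        if w in particles:
            words.append(w.lower())
        elif w.startswith("MC") and len(w) > 2:
            words.append("Mc" + w[2:].capitalize())
        else:
            words.append(w.capitalize())
    return " ".join(words)
```

A plain `str.title()` would get "McDonald" and "van der Berg" wrong, which is exactly why the conversion is harder than it first looks.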

Please find some more on the fun in doing those things in the post The Cases for UPPER CASE in Data Management.


Hierarchical Data Matching

A year ago I wrote a blog post about data matching published on the Informatica Perspective blog. The post was called Five Future Data Matching Trends.

One of the trends mentioned is hierarchical data matching.

The reason we need what may be called hierarchical data matching is that more and more organizations are looking into master data management and then they realize that the classic name and address matching rules do not necessarily fit when party master data are going to be used for multiple purposes. What constitutes a duplicate in one context, like sending a direct mail, doesn’t necessarily make a duplicate in another business function and vice versa. Duplicates come in hierarchies.

One example is a household. You probably don’t want to send two sets of the same material to a household, but you might want to engage in a 1-to-1 dialogue with the individual members. Another example is that you might do some very different kinds of business with the same legal entity. Financial risk management is the same, but different sales or purchase processes may require very different views.

I usually divide a data matching process into three main steps:

  • Candidate selection
  • Match scoring
  • Match destination

(More information on the page: The Art of Data Matching)
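
The three steps could be sketched like this in Python, with a deliberately simplistic blocking key and scoring function standing in for what real matching engines do:

```python
from difflib import SequenceMatcher
from itertools import combinations

def candidate_selection(records):
    """Step 1: block on a cheap key (here: first letter of the name)
    so we only score plausible pairs instead of all pairs."""
    blocks = {}
    for r in records:
        blocks.setdefault(r["name"][:1].upper(), []).append(r)
    for block in blocks.values():
        yield from combinations(block, 2)

def match_scoring(a, b):
    """Step 2: score each candidate pair."""
    return SequenceMatcher(None, a["name"].upper(), b["name"].upper()).ratio()

def match_destination(pairs, threshold=0.85):
    """Step 3: decide what to do with each pair - here we simply flag
    matches; real solutions purge, merge, split or link records."""
    return [(a, b) for a, b in pairs if match_scoring(a, b) >= threshold]
```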

Hierarchical data matching is mostly about the last step where we apply survivorship rules and execute business rules on whether to purge, merge, split or link records.

In my experience there are a lot of data matching tools out there capable of handling candidate selection, match scoring, purging records and to some degree merging records. But solutions are sparse when it comes to more sophisticated things like splitting an original entity into two or more entities, for example by Splitting Names, or linking records in hierarchies in order to build a Hierarchical Single Source of Truth.
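
The household example of linking rather than merging could be sketched like this in Python (with address matching simplified to exact equality, which a real solution would of course replace with proper address matching):

```python
def link_households(persons):
    """Link individual records into household groups by shared address,
    keeping the individual records intact for 1-to-1 dialogue while the
    household level serves purposes like direct mail."""
    households = {}
    for p in persons:
        # Normalize the address to use as a simplistic linking key
        key = " ".join(p["address"].upper().split())
        households.setdefault(key, []).append(p)
    return households
```

The point is that nothing is purged or merged: both the individual level and the household level remain available, each serving its own business purpose.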


180 Degree Prospective Customer View isn’t Unusual

My eMail inbox collects received mails from several eMail accounts and therefore it’s not unusual to have duplicate messages in there.

This morning I had two eMails coming in to two different eMail accounts, probably part of the same campaign but with different messages:

[Image: the two eMails]

Apparently I have landed in two different segments with two different eMail accounts: One technology oriented and one sales and marketing oriented.

Record linking of sparse subscription profiles isn’t easy, and even Informatica, a big player in Master Data Management and Data Quality solutions, has ground to cover in this game.


The MDM Landscape is Slowly Changing

This year’s version of the MDM (Master Data Management) Landscape report from Information Difference is out.

The report confirms some trends in MDM offerings also mentioned here on the blog. Some sayings from the Information Difference report are:

  • “The market is starting to dabble in cloud-based implementations…”
  • “There continues to be a demand for MDM offerings to handle reference data….”
  • “ …still very much in their early stages, are support for Big Data…”

Categorizing the vendors into the traditional division of Customer Data Integration (CDI) versus Product Information Management (PIM) support is becoming less relevant as new Multi-Domain offerings are coming out and larger Product Master Data specialists such as Hybris and Heiler have been snapped up by megavendors. This leaves Stibo as the only remaining large PIM vendor, but Stibo has actually already rebranded itself as a Multi-Domain player and has been working seriously on that for a couple of years.

You may view the full Information Difference MDM Landscape report here.
