MDM and Knowledge Graph

As examined in the previous post Data Fabric and Master Data Management, the knowledge graph approach is on the rise.

Utilizing a knowledge graph has an overlap with Master Data Management (MDM).

Ten years ago, MDM and Data Quality Management included a small niche discipline called (among other things) entity resolution, as explored in the post Non-Obvious Entity Relationship Awareness. Its aim was the same as what can be delivered today at a much larger scale using knowledge graph technology.

During the past decade there have been examples of using graph technology for MDM, as mentioned for example in the post Takeaways from MDM Summit Europe 2016. However, most attempts to combine MDM and graph have been about visualizing the relationships in MDM using a graph presentation.

When utilizing knowledge graph approaches you will be able to detect many more relationships than those that are currently managed in MDM. This fact is the foundation for a successful co-existence between MDM and knowledge graph with these synergies:

  • MDM hubs can enrich the knowledge graph with proven descriptions of the entities that are the nodes (vertices) in the knowledge graph.
  • Additional relationships (edges) and entities (nodes) detected in the knowledge graph that are of operational and/or general analytic interest enterprise-wide can be proven and managed in MDM.

In this way you can create new business benefits from both MDM and knowledge graph.

Are These Familiar Hierarchies in Your MDM / DQM / PIM Solution?

The term family is used in different contexts within Master Data Management (MDM), Data Quality Management (DQM) and Product Information Management (PIM) when working with hierarchy management and entity resolution.

Here are three frequent examples:

Consumer / citizen family

When handling party master data about consumers / citizens we can deal with the basic definition of a family, being a group consisting of two parents and their children living together as a unit.

This is used when the business scenario targets not only each individual person but also a household with a shared economy. When identifying a household, a common parameter is that the persons live at the same postal address (at the same time), while observing constellations such as:

  • Nuclear families consisting of a female and a male adult (and their children)
  • Rainbow families where the gender is not an issue
  • Extended families consisting of more than two generations
  • Persons who happen to live on the same postal address
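
The household identification described above can be sketched roughly as follows. This is a minimal illustration, not a production rule set: the record layout (`name`, `surname`, `address`) and both function names are assumptions made for the example, the addresses are assumed to be already standardized, and a shared surname is only a weak hint given the different family name constructions around the world.

```python
from collections import defaultdict


def group_households(persons):
    """Group person records by postal address; each group is a household candidate."""
    by_address = defaultdict(list)
    for person in persons:
        by_address[person["address"]].append(person)
    return dict(by_address)


def shares_family_name(members):
    """Weak hint that co-residents are a family rather than unrelated persons."""
    return len(members) > 1 and len({m["surname"] for m in members}) == 1


# Hypothetical sample records with already standardized addresses.
persons = [
    {"name": "Anna Jensen", "surname": "Jensen", "address": "1 Main Street"},
    {"name": "Peter Jensen", "surname": "Jensen", "address": "1 Main Street"},
    {"name": "Li Wei", "surname": "Wei", "address": "2 Side Road"},
]

households = group_households(persons)
for address, members in households.items():
    print(address, [m["name"] for m in members], shares_family_name(members))
```

Persons who merely happen to live at the same address end up in the same candidate group, so a real solution would need further evidence before treating the group as a household with a shared economy.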

There are multicultural aspects of these constellations, including the different family name constructions around the world and the varying frequency and acceptance of rainbow families, as well as the frequency of extended families.

Company family tree

When handling party master data about companies / organizations, a valuable piece of information is how the companies / organizations are related, most commonly pictured as a company family tree with mothers and sisters. In theory this tree can have an unlimited number of levels. The basic levels are:

  • A global ultimate mother being the company that ultimately owns (fully or partly) a range of companies in several countries.
  • A national ultimate mother being the company that owns (fully or partly) a range of companies in a given country.
  • A legal entity being the basic registered company within a country having some form of a business entity identifier.
  • A branch owned by a legal entity and operating from a given postal / visiting address.
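
As a rough sketch of the levels above, the ultimate mothers can be found by walking ownership links upwards. The company names, the `parent_of` and `country_of` mappings and both function names are hypothetical, and the national variant makes a simplifying assumption: it stops at the first owner registered in another country.

```python
def global_ultimate_mother(company, parent_of):
    """Follow ownership links upwards until a company without an owner is reached."""
    seen = {company}
    while company in parent_of:
        company = parent_of[company]
        if company in seen:
            raise ValueError("circular ownership detected")
        seen.add(company)
    return company


def national_ultimate_mother(company, parent_of, country_of):
    """Follow ownership links upwards while the owner is in the same country."""
    country = country_of[company]
    seen = {company}
    current = company
    while current in parent_of and country_of[parent_of[current]] == country:
        current = parent_of[current]
        if current in seen:
            raise ValueError("circular ownership detected")
        seen.add(current)
    return current


# Hypothetical family: a Danish branch, two Danish legal entities, a US owner.
parent_of = {
    "Acme Aarhus Branch": "Acme Denmark A/S",
    "Acme Denmark A/S": "Acme Nordic A/S",
    "Acme Nordic A/S": "Acme Global Inc",
}
country_of = {
    "Acme Aarhus Branch": "DK",
    "Acme Denmark A/S": "DK",
    "Acme Nordic A/S": "DK",
    "Acme Global Inc": "US",
}

print(global_ultimate_mother("Acme Aarhus Branch", parent_of))
print(national_ultimate_mother("Acme Aarhus Branch", parent_of, country_of))
```

Partial ownership and companies with several owners are left out here; representing those would require a graph rather than a single parent per company.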

You can build your own company tree describing your customers, suppliers and other business partners. Alternatively, or as a supplement, you can rely on third party business directories. It is worth noting here that a national source will only go to the national ultimate mother level, while a global source can include the global ultimate mother and thus form larger families.

Having a company family view in your master data repository is a valuable information asset within credit risk, supply risk, discount opportunities, cross-selling and more.

Product family

The term “product family” is often used to define a level in a homegrown product classification / product grouping scheme. That level can have levels above and below it with other names such as “product line”, “product category”, “product class”, “product group”, “product type” and more.

Sometimes it is also used as a term to define a product with a family of variants below, where variants are the same product produced and kept in stock in different colours, sizes and more.

Read more about Stock Keeping Units (SKUs), product variants, product identification and product classification in the post Five Product Information Management Core Aspects.

Using External Data in Data Matching

One of the things that data quality tools do is data matching. Data matching is mostly related to the party master data domain. It is about comparing two or more data records that do not have exactly the same data but describe the same real world entity.

A common approach is to compare data records in internal master data repositories within your organization. However, there are great advantages in bringing in external reference data sources to support the data matching.

The ways of doing that I have worked with include these kinds of big reference data:

Business directories:

The business-to-business (B2B) world does not have privacy issues to the degree we see in the business-to-consumer (B2C) world. Therefore there are many business directories out there with a quite complete picture of which business entities exist in a given country and even in regions and the whole world.

A common approach is to first match your internal B2B records against a business directory and obtain a unique key for each business entity. The next step, matching business entities on that unique key, is a no-brainer.

The problem, though, is that an automatic match between internal B2B records and a business directory most often does not yield a 100 % hit rate. Not even close, as examined in the post 3 out of 10.
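
The two-step approach can be sketched as below. Everything specific here is an assumption made for illustration: the directory keys, the record identifiers, the 0.85 threshold and the use of Python's `difflib.SequenceMatcher` as a stand-in for a real matching engine. Records that cannot be matched with enough confidence are kept aside, reflecting the less than 100 % hit rate.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop dots and commas, and collapse whitespace."""
    return " ".join(name.lower().replace(".", "").replace(",", "").split())


def lookup_directory_key(record_name, directory, threshold=0.85):
    """Step 1: return the key of the best matching directory entry, or None."""
    best_key, best_score = None, 0.0
    for key, directory_name in directory.items():
        score = SequenceMatcher(
            None, normalize(record_name), normalize(directory_name)
        ).ratio()
        if score > best_score:
            best_key, best_score = key, score
    return best_key if best_score >= threshold else None


def match_via_directory(records, directory, threshold=0.85):
    """Step 2: internal records sharing a directory key are the same entity."""
    groups, unmatched = defaultdict(list), []
    for record_id, name in records.items():
        key = lookup_directory_key(name, directory, threshold)
        if key is None:
            unmatched.append(record_id)
        else:
            groups[key].append(record_id)
    return dict(groups), unmatched


# Hypothetical business directory and internal records.
directory = {"DK-1001": "Acme Trading A/S", "DK-1002": "Nordic Widgets ApS"}
records = {
    "crm-1": "ACME Trading A/S",
    "erp-7": "Acme Trading AS",
    "crm-2": "Bob's Bike Repair",
}

groups, unmatched = match_via_directory(records, directory)
print(groups)
print(unmatched)
```

The unmatched records are where the manual work, or a second reference source, comes in.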

Address directories:

Address directories are mostly used in order to standardize postal address data, so that two addresses in internal master data that can be standardized to an address written in exactly the same way can be better matched.

A deeper use of address directories is to exploit related property data. The probability of two records with “John Smith” at the same address being a true positive match is much higher if the address is a single-family house as opposed to a high-rise building, nursing home or university campus.
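
Both uses of an address directory can be sketched together. The abbreviation table is a toy substitute for a real address directory, and the confidence figures per property type are purely illustrative assumptions, not measured values.

```python
# Toy abbreviation table; a real address directory would do this far better
# (for example, "st" can also mean "saint").
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}

# Illustrative confidence that two same-name records at one address are the
# same person, by property type. These numbers are assumptions.
CONFIDENCE_BY_PROPERTY_TYPE = {
    "single_family_house": 0.95,
    "high_rise": 0.60,
    "university_campus": 0.50,
}


def standardize_address(address):
    """Write an address in one canonical form so equal addresses compare equal."""
    tokens = address.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def match_confidence(record_a, record_b, property_type_of):
    """Score a same-name, same-address pair using the property type."""
    address_a = standardize_address(record_a["address"])
    address_b = standardize_address(record_b["address"])
    same_name = record_a["name"].lower() == record_b["name"].lower()
    if not same_name or address_a != address_b:
        return 0.0
    # Unknown property types fall back to a cautious default.
    return CONFIDENCE_BY_PROPERTY_TYPE.get(property_type_of.get(address_a), 0.5)


# Hypothetical property data keyed by standardized address.
property_type_of = {
    "12 main street": "single_family_house",
    "1 tower road": "high_rise",
}

house = match_confidence(
    {"name": "John Smith", "address": "12 Main St."},
    {"name": "John Smith", "address": "12 Main Street"},
    property_type_of,
)
tower = match_confidence(
    {"name": "John Smith", "address": "1 Tower Rd"},
    {"name": "John Smith", "address": "1 Tower Road"},
    property_type_of,
)
print(house, tower)
```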

Relocation services:

A common cause of false negatives in data matching is that you have compared two records where one of the postal addresses is an old one.

Bringing in National Change of Address (NCOA) services for the countries in question will help a lot.

The optimal way of doing that (and utilizing business and address directories) is to make it a continuous element of Master Data Management (MDM) as explored in the post The Relocation Event.
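
A minimal sketch of how change of address data can remove such false negatives follows. The `moves` table is a hypothetical stand-in for what a relocation service would deliver, and real services have their own formats and access rules.

```python
def current_address(name, address, moves):
    """Follow registered moves for a person until the latest known address."""
    visited = set()
    while (name, address) in moves and address not in visited:
        visited.add(address)
        address = moves[(name, address)]
    return address


def same_party(record_a, record_b, moves):
    """Compare two records after resolving both to their current address."""
    if record_a["name"] != record_b["name"]:
        return False
    resolved_a = current_address(record_a["name"], record_a["address"], moves)
    resolved_b = current_address(record_b["name"], record_b["address"], moves)
    return resolved_a == resolved_b


# Hypothetical change of address data: (name, old address) -> new address.
moves = {
    ("john smith", "1 old road"): "2 new street",
    ("john smith", "2 new street"): "3 latest avenue",
}

old_record = {"name": "john smith", "address": "1 old road"}
new_record = {"name": "john smith", "address": "3 latest avenue"}
print(same_party(old_record, new_record, moves))
```

Without the relocation data the two records differ on address and would most likely be missed by a name and address match.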


Identity Resolution and Social Data

Identity Resolution

Identity resolution is a hot potato when we look into how we can exploit big data and, within that frame, not least social data.

Some of the most frequently mentioned use cases for big data analytics revolve around listening to social data streams and combining that with traditional sources within customer intelligence. In order to do that we need to know who is talking out there, and that must be done by using identity resolution features encompassing social networks.

The first challenge is what we are able to do: how we can technically expand our data matching capabilities to use profile data and other clues from social media. This subject was discussed in a recent post on DataQualityPro called How to Exploit Big Data and Maintain Data Quality, an interview with Dave Borean of InfoTrellis, in which Dave mentioned the InfoTrellis “contextual entity resolution” approach.

The second challenge is what we are allowed to do. Social networks have a natural interest in protecting their members’ privacy, and they also have a commercial interest in doing so. The degree of privacy protection varies between social networks. Twitter is quite open, but on the other hand it holds very little that is usable for identity resolution, and making sense of the streams is an issue. Networks such as Facebook and LinkedIn are, for good reasons, not so easy to exploit due to the (changing) game rules applied.

As said in my interview on DataQualityPro called What are the Benefits of Social MDM: It is a kind of a goldmine in a minefield.


Unique Data = Big Money

In a recent tweet Ted Friedman of Gartner (the analyst firm) said:

[Image: tweet by Ted Friedman on reference data]

I think he is right.

Duplicates have always been pain number one in most places when it comes to the cost of poor data quality.

Though I have been in the data matching business for many years and have fought duplicates with deduplication tools in numerous battles, the war doesn’t seem to be won by using deduplication tools alone, as told in the post Somehow Deduplication Won’t Stick.

Eventually deduplication always comes down to entity resolution: you have to decide which results are true positives and which are useless false positives, and wonder how many false negatives you didn’t catch, which means how much money you didn’t get in return for your deduplication investment.
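
The three outcomes can be counted when a verified set of true duplicate pairs is available, which in practice is the hard part. The sketch below assumes such a set exists; the function name and the sample pairs are made up for illustration.

```python
def matching_quality(found_pairs, true_pairs):
    """Count true positives, false positives and false negatives for a dedup run."""
    found = {frozenset(pair) for pair in found_pairs}  # pair order is irrelevant
    true = {frozenset(pair) for pair in true_pairs}
    true_positives = len(found & true)
    false_positives = len(found - true)
    false_negatives = len(true - found)
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        # Share of reported duplicates that are real duplicates.
        "precision": true_positives / len(found) if found else 0.0,
        # Share of real duplicates that the tool actually caught.
        "recall": true_positives / len(true) if true else 0.0,
    }


# Hypothetical result of a deduplication run versus a manually verified truth.
found_pairs = [("a", "b"), ("c", "d"), ("e", "f")]
true_pairs = [("b", "a"), ("c", "d"), ("g", "h"), ("i", "j")]

quality = matching_quality(found_pairs, true_pairs)
print(quality)
```

Low recall is the money left on the table: duplicates that exist but were never found.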

Bringing in new, even obscure, reference sources is in my eyes a very good idea, as examined in the post The Good, Better and Best Way of Avoiding Duplicates.


Data Quality vs Identity Checking

Yesterday we had a call from British Gas (or probably a call centre hired by British Gas) explaining the great savings possible if switching from the current provider – which by the way is: British Gas. This is a classic data quality issue in direct marketing operations: accurately separating your current customers from entities belonging to a new market.

As I have learned that your premier identity proof in the United Kingdom is your utility bill, this incident may be seen as somewhat disturbing – or by further thinking, maybe a business opportunity 🙂

At iDQ we develop a solution that may be positioned in the space between data quality prevention and identity check by addressing the identity resolution aspect during data capture.

The nearly two year old post The New Year in Identity Resolution explains some different kinds of identity resolution being:

  • Hard core identity check
  • Light weight real world alignment
  • Digital identity resolution

Since then I have seen a slow but steady convergence of these activities.


Our Double Trouble

Using the royal we is usually only for majestic people, but as a person with a being in two countries at the same time, I do sometimes feel that I am we.

So, this morning we once again found our way to London Heathrow Airport for one of our many trips between London and Copenhagen as we have lived in the United Kingdom the last couple of years but still have many business and private ties with The Kingdom of Denmark where we (is that was or were?) born, raised and worked and from where we still hold a passport.

Most public sector and private sector business processes and master data management implementations simply don’t cope with the fast evolving globalization. Reflecting on this, flying over Doggerland, we recall situations where:

  • We as a prospect or customer in a global brand are stored as a duplicate record for each country as told in the post Hello Leading MDM Vendor.
  • You as an employee in a multi-national firm have a duplicate record for each country you have worked in.

People moving between countries are still treated as an exception not covered by adequate business rules and data capture procedures. Most things are sorted out eventually, but it always takes a whole lot more trouble than if you were simply born and raised in, and stayed in, the same country.

When we landed in Copenhagen this morning we (is that was or were?) able to use the new local smart travel card in order to travel on with public transit. But it wasn’t easy getting the card, we remember. With a foreign address you can’t apply online. So we had to queue up at the Central Station, fill in a form and explain that we don’t have an official document with our address in the UK – and we avoided explaining the shocking fact that in the UK your electricity bill is your premier proof of almost anything related to your identity.

What about you? Do you have a being in several countries? Any war stories experienced related to your going back and forth?


Entity Resolution and Big Data

The Wikipedia article on Identity Resolution has this take on the difference between good old data matching and Entity Resolution:

”Here are four factors that distinguish entity resolution from data matching, according to John Talburt, director of the UALR Laboratory for Advanced Research in Entity Resolution and Information Quality:

  • Works with both structured and unstructured records, and it entails the process of extracting references when the sources are unstructured or semi-structured
  • Uses elaborate business rules and concept models to deal with missing, conflicting, and corrupted information
  • Utilizes non-matching, asserted linking (associate) information in addition to direct matching
  • Uncovers non-obvious relationships and association networks (i.e. who’s associated with whom)”

I have a gut feeling that Data Matching and Entity (or Identity) Resolution will melt together in the future as expressed in the post Deduplication vs Identity Resolution.

If you look at the above mentioned factors that distinguish data matching from identity resolution, some of the often mentioned features in the new big data technology shine through:

  • Working with unstructured and semi-structured data is probably the most mentioned difference between working with small data versus working with big data.
  • Working with associations is a feature of graph databases or other similar technologies as mentioned in the post Will Graph Databases become Common in MDM?

So, in the quest to evolve small data matching into Entity (or Identity) Resolution, we will be helped by general developments in working with big data.


Matching for Multiple Purposes

In a recent post on the InfoTrellis blog we have the good old question in data matching about Deterministic Matching versus Probabilistic Matching.

The post has a good walk through on the topic and reaches this conclusion:

“So, which is better, Deterministic Matching or Probabilistic Matching?  The question should actually be: ‘Which is better for you, for your specific needs?’  Your specific needs may even call for a combination of the two methodologies instead of going purely with one.”
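
To make the distinction concrete, here is a minimal sketch of both styles, not taken from the InfoTrellis post. The deterministic rule requires exact agreement on chosen key fields. The probabilistic score follows the common Fellegi-Sunter style of summing log-weights per field, and the m/u probabilities used are illustrative assumptions, as are the field names and sample records.

```python
import math


def same_value(a, b, field):
    """Case-insensitive, whitespace-tolerant comparison of one field."""
    return a[field].strip().lower() == b[field].strip().lower()


def deterministic_match(a, b, keys=("name", "postal_code")):
    """Match only if every key field agrees exactly after normalization."""
    return all(same_value(a, b, key) for key in keys)


# Assumed probabilities per field: m = P(agree | same entity),
# u = P(agree | different entities). Real values are estimated from data.
FIELD_PROBABILITIES = {
    "name": (0.95, 0.01),
    "postal_code": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}


def probabilistic_score(a, b):
    """Sum agreement and disagreement log-weights; higher means more likely a match."""
    score = 0.0
    for field, (m, u) in FIELD_PROBABILITIES.items():
        if same_value(a, b, field):
            score += math.log2(m / u)
        else:
            score += math.log2((1 - m) / (1 - u))
    return score


# Hypothetical records: same person, but one has an outdated postal code.
a = {"name": "John Smith", "postal_code": "2100", "birth_year": "1970"}
b = {"name": "john smith ", "postal_code": "2200", "birth_year": "1970"}
c = {"name": "Mary Jones", "postal_code": "8000", "birth_year": "1985"}

print(deterministic_match(a, b), probabilistic_score(a, b))
print(deterministic_match(a, c), probabilistic_score(a, c))
```

The deterministic rule rejects the pair with the outdated postal code, while the probabilistic score stays positive because the other fields agree. Combining the two, as the quoted conclusion suggests, could mean accepting deterministic hits directly and sending the remaining pairs through the score with a threshold.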

On a side note, the author of the post is MARIANITORRALBA. I had to use my combined probabilistic and deterministic in-word parsing supported and social media connected data matching capability to match this concatenated name with the LinkedIn profile of an InfoTrellis employee called Marian Itorralba.

This little exercise brings me to an observation about data matching: matching party master data, not least when you do it for several purposes, is ultimately identity resolution, as discussed in the post The New Year in Identity Resolution.

For that we need what could be called hierarchical data matching.

The reason we need hierarchical data matching is that more and more organizations are looking into master data management and then realize that the classic name and address matching rules do not necessarily fit when party master data are going to be used for multiple purposes. What constitutes a duplicate in one context, like sending a direct mail, doesn’t necessarily make a duplicate in another business function and vice versa. Duplicates come in hierarchies.

One example is a household. You probably don’t want to send two sets of the same material to a household, but you might want to engage in a 1-to-1 dialogue with the individual members. Another example is that you might do some very different kinds of business with the same legal entity. Financial risk management is the same, but different sales or purchase processes may require very different views.
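
The household example can be sketched as a match that returns a level instead of a yes/no answer, with each purpose deciding which levels count as a duplicate. The level names, the purpose table and the record layout are assumptions for illustration, and the addresses are assumed to be standardized already.

```python
def match_level(a, b):
    """Return the hierarchy level at which two party records coincide, if any."""
    same_address = a["address"] == b["address"]
    same_name = a["name"] == b["name"]
    if same_address and same_name:
        return "individual"
    if same_address:
        # Simplification: co-residents are treated as one household.
        return "household"
    return None


# Which match levels count as a duplicate depends on the purpose.
DUPLICATE_LEVELS_BY_PURPOSE = {
    "direct_mail": {"individual", "household"},
    "one_to_one_dialogue": {"individual"},
}


def is_duplicate(a, b, purpose):
    """Decide duplication for a given business purpose."""
    return match_level(a, b) in DUPLICATE_LEVELS_BY_PURPOSE[purpose]


# Hypothetical household members.
anna = {"name": "anna jensen", "address": "1 main street"}
peter = {"name": "peter jensen", "address": "1 main street"}

print(is_duplicate(anna, peter, "direct_mail"))
print(is_duplicate(anna, peter, "one_to_one_dialogue"))
```

The same idea extends to the company side, where branch, legal entity and ultimate mother would be the levels.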

This matter is discussed in the post called Hierarchical Data Matching and, not least, in its comments.


Know Your Fan

A variant of the saying “Know Your Customer” for a football club will be “Know Your Fan” and indeed fans are customers when they buy tickets. If they can.

FC Copenhagen

FC Copenhagen cruised into stormy waters when they apparently cancelled all purchases for the upcoming Champions League (the paramount European club football tournament) clashes against Real Madrid, Juventus and Galatasaray if the purchasers didn’t have a Danish sounding name. The reason was to prevent mixing fans of the different clubs, but surely this poorly thought-out screening method wasn’t received well among the FC Copenhagen fans not called Jensen, Nielsen or Sørensen.

The story is told in English here on Times of India.

Actually, methods of verifying identities are available and cheap in Denmark, so I’m surprised to see FC Copenhagen caught offside in this situation.
