One of the things that data quality tools do is data matching. Data matching is mostly related to the party master data domain. It is about comparing two or more data records that do not hold exactly the same data but describe the same real-world entity.
A common approach is to compare data records in internal master data repositories within your organization. However, there are great advantages in bringing in external reference data sources to support the data matching.
Some of the ways of doing that which I have worked with include these kinds of big reference data:
The business-to-business (B2B) world does not have privacy issues to the degree we see in the business-to-consumer (B2C) world. Therefore there are many business directories out there with a quite complete picture of which business entities exist in a given country, and even in regions and the whole world.
A common approach is to first match your internal B2B records against a business directory and obtain a unique key for each business entity. The next step, matching business entities on that unique key, is a no-brainer.
The problem, though, is that an automatic match between internal B2B records and a business directory most often does not yield a 100% hit rate. It is not even close, as examined in the post 3 out of 10.
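The two-step approach can be sketched roughly as below. This is a minimal illustration only: the directory contents, the key format, the `normalize` helper and the 0.85 threshold are all assumptions made for the example, and a real business directory service would bring its own matching logic. Records that fall below the threshold end up in an unmatched pile, which is the part that keeps the hit rate under 100%.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical business directory: unique key -> registered name.
DIRECTORY = {
    "D-001": "Acme Trading Ltd",
    "D-002": "Nordic Shipping A/S",
}


def normalize(name: str) -> str:
    """Lower-case and drop punctuation so trivial differences do not matter."""
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch == " ").strip()


def directory_key(name: str, threshold: float = 0.85):
    """Return the key of the most similar directory entry, or None if too weak."""
    best_key, best_score = None, 0.0
    for key, registered in DIRECTORY.items():
        score = SequenceMatcher(None, normalize(name), normalize(registered)).ratio()
        if score > best_score:
            best_key, best_score = key, score
    return best_key if best_score >= threshold else None


def group_by_directory_key(names):
    """Step 1: assign a directory key. Step 2: group records sharing a key."""
    groups, unmatched = defaultdict(list), []
    for name in names:
        key = directory_key(name)
        if key is None:
            unmatched.append(name)  # needs manual review or another source
        else:
            groups[key].append(name)
    return dict(groups), unmatched


internal = ["ACME Trading Ltd.", "Acme Trading Limited", "Nordic Shipping AS", "Smith & Sons"]
groups, unmatched = group_by_directory_key(internal)
```

Once every matched record carries a directory key, finding duplicates among them is a plain group-by on that key; the fuzzy comparison is only needed once, against the directory.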
Address directories are mostly used in order to standardize postal address data, so that two addresses in internal master data that can be standardized to an address written in exactly the same way can be better matched.
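As a rough sketch of why standardization helps, the snippet below rewrites two differently typed addresses into one form before comparing them. The abbreviation table is a made-up stand-in; a real address directory would supply the authoritative spelling per country rather than a handful of hard-coded rules.

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {
    "st": "street",
    "str": "street",
    "rd": "road",
    "ave": "avenue",
    "apt": "apartment",
}


def standardize_address(raw: str) -> str:
    """Lower-case, strip punctuation and expand known abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", raw.lower()).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


a = standardize_address("12 Main St., Apt 4")
b = standardize_address("12 main street apartment 4")
same_address = a == b
```

Naive rules like these break quickly (think of "St" meaning "Saint"), which is one reason to lean on a proper address directory instead of home-grown tables.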
A deeper use of address directories is to exploit related property data. The probability of two records with “John Smith” on the same address being a true positive match is much higher if the address is a single-family house as opposed to a high-rise building, nursing home or university campus.
A common cause of false negatives in data matching is that you have compared two records where one of the postal addresses is an old one.
Bringing in National Change of Address (NCOA) services for the countries in question will help a lot.
The optimal way of doing that (and utilizing business and address directories) is to make it a continuous element of Master Data Management (MDM) as explored in the post The Relocation Event.
Identity resolution is a hot potato when we look into how we can exploit big data and, within that frame, not least social data.
Some of the most frequently mentioned use cases for big data analytics revolve around listening to social data streams and combining that with traditional sources within customer intelligence. In order to do that we need to know who is talking out there, and that must be done by using identity resolution features encompassing social networks.
The second challenge is what we are allowed to do. Social networks have a natural interest in protecting members’ privacy, and they also have a commercial interest in doing so. The degree of privacy protection varies between social networks. Twitter is quite open but, on the other hand, holds very little that is usable for identity resolution, and making sense of the streams is an issue as well. Networks such as Facebook and LinkedIn are, for good reasons, not so easy to exploit due to the (changing) game rules applied.
In a recent tweet Ted Friedman of Gartner (the analyst firm) said:
I think he is right.
Duplicates have always been pain number one in most places when it comes to the cost of poor data quality.
Though I have been in the data matching business for many years and have been fighting duplicates with deduplication tools in numerous battles, the war doesn’t seem to be won by using deduplication tools alone, as told in the post Somehow Deduplication Won’t Stick.
Eventually, deduplication always comes down to entity resolution: you have to decide which results are true positives and which are useless false positives, and wonder how many false negatives you didn’t catch, that is, how much return you didn’t get on your deduplication investment.
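The three outcomes can be made concrete with a small sketch. It assumes you have a manually verified sample of true duplicate pairs to compare against; without one, the false negatives are exactly the number you can only wonder about. The function name and the record ids are made up for the example.

```python
def evaluate(predicted_pairs, true_pairs):
    """Compare suggested duplicate pairs with a verified sample of true pairs."""
    # frozenset makes (1, 2) and (2, 1) count as the same pair.
    predicted = {frozenset(pair) for pair in predicted_pairs}
    truth = {frozenset(pair) for pair in true_pairs}
    tp = len(predicted & truth)   # suggested and really duplicates
    fp = len(predicted - truth)   # suggested but not duplicates
    fn = len(truth - predicted)   # real duplicates the tool missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


result = evaluate(
    predicted_pairs=[(1, 2), (3, 4), (5, 6)],
    true_pairs=[(2, 1), (3, 4), (7, 8), (9, 10)],
)
```

In this toy sample the tool found two of the four real duplicate pairs and raised one false alarm, so half of the possible return on the deduplication effort was left on the table.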
Yesterday we had a call from British Gas (or probably a call centre hired by British Gas) explaining the great savings possible if switching from the current provider – which by the way is: British Gas. This is a classic data quality issue in direct marketing operations, namely accurately separating your current customers from entities belonging to a new market.
As I have learned that your premier identity proof in the United Kingdom is your utility bill, this incident may be seen as somewhat disturbing – or by further thinking, maybe a business opportunity 🙂
At iDQ we develop a solution that may be positioned in the space between data quality prevention and identity check by addressing the identity resolution aspect during data capture.
Using the royal we is usually only for majestic people, but as a person with a being in two countries at the same time, I do sometimes feel that I am we.
So, this morning we once again found our way to London Heathrow Airport for one of our many trips between London and Copenhagen as we have lived in the United Kingdom the last couple of years but still have many business and private ties with The Kingdom of Denmark where we (is that was or were?) born, raised and worked and from where we still hold a passport.
Most public sector and private sector business processes and master data management implementations simply don’t cope with fast-evolving globalization. Reflecting on this, flying over Doggerland, we recall situations where:
We as a prospect or customer in a global brand are stored as a duplicate record for each country as told in the post Hello Leading MDM Vendor.
You as an employee in a multi-national firm have a duplicate record for each country you have worked in.
People moving between countries are still treated as an exception not covered by adequate business rules and data capture procedures. Most things are sorted out eventually, but it always takes a whole lot more trouble than if you are simply born and raised, and stay, in the same country.
When we landed in Copenhagen this morning we (is that was or were?) able to use the new local smart travel card in order to travel on with public transit. But it wasn’t easy getting the card, we remember. With a foreign address you can’t apply online. So we had to queue up at the Central Station, fill in a form and explain that we don’t have an official document with our address in the UK – and we avoided explaining the shocking fact that in the UK your electricity bill is your premier proof of almost anything related to your identity.
What about you? Do you have a being in several countries? Any war stories experienced related to your going back and forth?
“Here are four factors that distinguish entity resolution from data matching, according to John Talburt, director of the UALR Laboratory for Advanced Research in Entity Resolution and Information Quality:
Works with both structured and unstructured records, and it entails the process of extracting references when the sources are unstructured or semi-structured
Uses elaborate business rules and concept models to deal with missing, conflicting, and corrupted information
Utilizes non-matching, asserted linking (associate) information in addition to direct matching
Uncovers non-obvious relationships and association networks (i.e. who’s associated with whom)”
The post has a good walk-through of the topic and reaches this conclusion:
“So, which is better, Deterministic Matching or Probabilistic Matching? The question should actually be: ‘Which is better for you, for your specific needs?’ Your specific needs may even call for a combination of the two methodologies instead of going purely with one.”
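To make the contrast tangible, here is a minimal sketch of both styles. The deterministic rule demands exact agreement on a fixed set of fields. The probabilistic score adds a log-weight per field in the spirit of Fellegi-Sunter record linkage. The field names and the m/u probabilities are invented for illustration; in practice they would be estimated from your own data.

```python
import math

# Hypothetical per-field probabilities (assumptions, not measured values):
# m = P(field agrees | same entity), u = P(field agrees | different entities).
FIELD_PROBS = {
    "name": (0.95, 0.01),
    "postcode": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}


def deterministic_match(a, b, keys=("name", "postcode", "birth_year")):
    """Match only if every chosen field is exactly equal."""
    return all(a[key] == b[key] for key in keys)


def probabilistic_score(a, b):
    """Sum agreement and disagreement weights; a higher score means a likelier match."""
    score = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if a[field] == b[field]:
            score += math.log2(m / u)              # agreement adds evidence
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return score


r1 = {"name": "john smith", "postcode": "2100", "birth_year": 1970}
r2 = {"name": "john smith", "postcode": "2200", "birth_year": 1970}

exact = deterministic_match(r1, r2)
score = probabilistic_score(r1, r2)
```

Here the deterministic rule rejects the pair because of the differing postcode, while the probabilistic score stays positive because name and birth year agree. Where to put the cut-off for that score is a business decision, which is where a combination of the two methodologies often comes in.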
This little exercise brings me to an observation about data matching, namely that matching party master data, not least when you do this for several purposes, ultimately is identity resolution, as discussed in the post The New Year in Identity Resolution.
For that we need what could be called hierarchical data matching.
The reason we need hierarchical data matching is that more and more organizations are looking into master data management, and then they realize that the classic name and address matching rules do not necessarily fit when party master data are going to be used for multiple purposes. What constitutes a duplicate in one context, like sending a direct mail, doesn’t necessarily make a duplicate in another business function, and vice versa. Duplicates come in hierarchies.
One example is a household. You probably don’t want to send two sets of the same material to a household, but you might want to engage in a 1-to-1 dialogue with the individual members. Another example is that you might do some very different kinds of business with the same legal entity. Financial risk management is the same, but different sales or purchase processes may require very different views.
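The household example can be sketched as a two-level grouping, where the same records yield different answers to "how many parties do we have?" depending on the purpose. The records and the exact-equality comparison are simplifying assumptions; real data would need the standardization and fuzzy matching discussed earlier, and two identical names at one address are not always the same person.

```python
from collections import defaultdict

# Made-up, already standardized records.
records = [
    {"id": 1, "name": "john smith", "address": "1 high street"},
    {"id": 2, "name": "john smith", "address": "1 high street"},
    {"id": 3, "name": "mary smith", "address": "1 high street"},
    {"id": 4, "name": "ann jones", "address": "5 mill lane"},
]


def build_hierarchy(recs):
    """Group record ids by household (address), then by individual (name)."""
    households = defaultdict(lambda: defaultdict(list))
    for rec in recs:
        households[rec["address"]][rec["name"]].append(rec["id"])
    return {address: dict(people) for address, people in households.items()}


hierarchy = build_hierarchy(records)

# Direct mail: one piece per household.
mailing_targets = len(hierarchy)
# 1-to-1 dialogue: one contact per individual.
dialogue_targets = sum(len(people) for people in hierarchy.values())
```

Four source records become two households for the mailing and three individuals for the dialogue, without any record being merged away for good.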