3 out of 10

Just before I left for summer vacation I noticed a tweet by MDM guru Aaron Zornes saying:

This is a subject very close to me, as I have worked a lot with business directory matching during the last 15 years, not least matching with the D&B WorldBase.

The problem is that if you match your B2B customers, suppliers and other business partners with a business directory like the D&B WorldBase you could naively expect a 100% match.

If your result is only a 30% hit rate, the question is: how many among the remaining 70% are false negatives, and how many are true negatives?
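One way to put rough numbers on that question is to review a random sample of the unmatched records by hand and extrapolate. The sketch below assumes such a reviewed sample exists; the function name and all counts are made up for illustration and are not taken from any real matching job.

```python
def estimate_false_negatives(total, matched, sample_size, sample_false_negatives):
    """Extrapolate false negatives among unmatched records from a hand-reviewed sample."""
    unmatched = total - matched
    # Share of the reviewed sample that should have been matched
    false_negative_share = sample_false_negatives / sample_size
    estimated_fn = round(unmatched * false_negative_share)
    return {
        "hit_rate": matched / total,
        "estimated_false_negatives": estimated_fn,
        "estimated_true_negatives": unmatched - estimated_fn,
        # Hit rate you could reach if every false negative were found
        "potential_hit_rate": (matched + estimated_fn) / total,
    }


# Illustrative figures only: 30% hit rate, 200 unmatched records reviewed by hand
result = estimate_false_negatives(
    total=10_000, matched=3_000, sample_size=200, sample_false_negatives=50
)
print(result)
```

The smaller the reviewed sample, the shakier the extrapolation, so the result is best read as an indication rather than a measurement.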

True negatives

There may be a lot of reasons for true negatives, namely:

  • Your business entity isn’t listed in the business directory. Some countries, like those of the old Czechoslovakia, some English-speaking countries in the Pacific, the Nordic countries and others, have tight public registration of companies, while registration is less tight in North America, other European countries and the rest of the world.
  • Your supposed business entity isn’t a business entity. Many B2B customer/prospect tables hold a lot of entities that are not formal business entities but other types of party master data.
  • Uniqueness may be defined differently in the business directory and in the table to be matched. This includes the perception of hierarchies of legal entities and branches – not least governmental and local authority bodies, which are a fuzzy crowd. The different roles of small business owners are also a challenge. The same is true of roles such as franchisees and the use of trading styles.

False negatives

In business directory matching the false negatives are those records that should have been matched by an automated function, but aren’t.

The number of false negatives is a measure of the effectiveness of the automated matching tool(s) and rules applied. Big companies often use the magic quadrant leaders in data quality tools, but these aren’t necessarily the best tools for business directory matching.

Personally I have found that you need a very complex mix of tools and rules to get a decent match rate in business directory matching, including a combination of deterministic and probabilistic matching. Some different techniques are explained in more detail here.
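A minimal sketch of how such a combination might look: first a deterministic pass on a registration number, then a fallback on name similarity. Note that the similarity score from Python's difflib is only a simple stand-in for proper probabilistic matching, and the field names (reg_no, name, country), the legal-form list and the 0.85 threshold are assumptions for illustration, not anything prescribed by a real directory.

```python
from difflib import SequenceMatcher

LEGAL_FORMS = {"a/s", "as", "aps", "ltd", "inc", "gmbh", "ag", "llc"}


def normalize(name):
    """Lowercase, strip punctuation (except '/') and drop common legal-form tokens."""
    cleaned = "".join(ch if ch.isalnum() or ch in "/ " else " " for ch in name.lower())
    return " ".join(tok for tok in cleaned.split() if tok not in LEGAL_FORMS)


def match_record(record, directory, threshold=0.85):
    """Return (directory_entry, method, score) for one record."""
    # Deterministic pass: an exact registration number settles the match
    reg_no = record.get("reg_no")
    if reg_no:
        for entry in directory:
            if entry.get("reg_no") == reg_no:
                return entry, "deterministic", 1.0

    # Fuzzy pass: best name similarity among entries in the same country
    best, best_score = None, 0.0
    name = normalize(record["name"])
    for entry in directory:
        if entry["country"] != record["country"]:
            continue
        score = SequenceMatcher(None, name, normalize(entry["name"])).ratio()
        if score > best_score:
            best, best_score = entry, score
    if best is not None and best_score >= threshold:
        return best, "probabilistic", best_score
    return None, "no match", best_score


# Made-up directory entries and records
directory = [
    {"reg_no": "12345678", "name": "Acme Trading A/S", "country": "DK"},
    {"reg_no": "87654321", "name": "Nordic Widgets ApS", "country": "DK"},
]
exact = match_record({"reg_no": "12345678", "name": "ACME", "country": "DK"}, directory)
fuzzy = match_record({"name": "Nordic Widget Aps.", "country": "DK"}, directory)
missing = match_record({"name": "Completely Different Ltd", "country": "DK"}, directory)
```

Lowering the threshold trades false negatives for false positives, which is why a real setup tends to need far more rules than this.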

Consultants

Having just arrived home from summer vacation, I have been thinking a bit about how we consultants act at work. On our vacation we used local guides in some places. These guides were our consultants in places they knew very well and we didn’t know at all. But I also noticed they had some habits which may be considered common weaknesses in practicing consultancy.

Different language

Francisco Caballero has lived all his long life in the beautiful town of Ronda in Southern Spain. He shared his great knowledge about the town with us in his distinguished blend of English and Spanish, spiced up with some Russian, German and probably also Dutch words. I think we understood most of it, though our perceptions did differ somewhat when we compared them afterwards.

Personal opinions

Besides telling us about the town and its history, Señor Caballero also shared his views on politics. He told us about the problems with young people today and about increasing crime. He remembered that things were much better when Generalissimo Franco was in charge. He admitted, though, that today there are no “bandidos” in the mountains as in the old days, but as he put it: “Today all bandidos in Madrid”. I guess he was referring to recent governments.

Assessing risk

Robert is of British descent, the fifth generation of his family living in Gibraltar, the small British enclave around the marvelous rock on the southern tip of Spain, facing Africa across the narrow strait. I remember that the opening scene of the James Bond film The Living Daylights is a hazardous car ride down the rock. Robert took us in his taxi on the very same narrow roads, practicing pretty much the same style of driving while explaining that, as we had to get in and out of the car all the time at the different sights, there was really no point in using the safety belts.

Personal commercial agenda

Salam seemed to know everyone and everything in Tangier, the Moroccan city on the northern tip of Africa on the other side of the Strait of Gibraltar. Salam offered us a guided tour where we would go everywhere we wanted and look at everything we fancied, taking as much time as we pleased. Only, when going around, he strongly urged us to go to exactly that spice shop he knew, and strongly recommended not sitting at the café we had spotted but proceeding to a much better one. As infidels we couldn’t of course go into a mosque, unless (of course) we gave some extra Euro.

Business Directory Musings

This coming Sunday I will have worked professionally within Information Technology for 30 years. As I will be on a (well deserved!) vacation in Andalusia on Sunday, I’d better post my thoughts today.

I have had a lot of different positions and worked in a lot of different domains. The single subject I have worked with the most is business directories.

My first job was at the Danish Tax Authorities, and one of the assignments was being secretary to the committee working for a joint registration of companies in Denmark. Besides learning a lot about working in politically driven organizations and about aligning business and technology, I feel good about having been part of the start of building a public sector master data directory. Such directories are essential for an effective public administration and can also be used as external reference data in private enterprises, as a valuable means of improving the data quality of business partner master data.

Since then I have worked a lot on improving data quality through matching solutions around business directories. These range from the Dun & Bradstreet WorldBase, holding nearly 170 million business entities from all over the world, through databases like the EuroContactPool, to national databases holding either all (available) businesses in a single country or given industry segments.

I guess I will also be spending some additional years from now on integrating business directory information into business processes as smoothly as possible, preferably along with a range of other kinds of external reference data.

One of the new sources building up in the cloud in the realm of business directories is master data references in social networks. The LinkedIn Companies feature is a prominent example. Of course such directories have some data quality issues. This is seen in looking at the companies where I currently work:

  • DM Partner A/S seems OK
  • Omikron Data Quality has 90 employees according to the company profile (filled out by yours truly). Then it’s strange that there are only 25 profiles in the network. But that’s because most employees are in Germany where the competing network called Xing is stronger.
  • Trapeze Group Europe has not been updated with a recent merger, and not all members have updated their profiles accordingly yet. But I’m sure that will be done as time goes by.

I have no doubt though that including information from social networks will become a part of integrating business partner master data in my future.

Social Master Data Management

The term “Social CRM” has been around for a while. Just as traditional CRM (Customer Relationship Management) is heavily dependent on proper MDM (Master Data Management), enterprise-wide social CRM will depend on a proper social MDM element in order to be a success.

The challenge in social MDM will be that we are not going to replace some data sources for MDM, but we are actually going to add some more sources and handle the integration of these sources with the sources for traditional CRM and MDM and other new sources coming from the cloud.

Customer Master Data sources will expand to embrace:

  • Traditional data entry from field work like a sales representative entering prospect and customer master data as part of Sales Force Automation.
  • Data feed and data integration with external reference data like using a business directory. Such integration will increasingly take place in the cloud and the trend of governments releasing public sector data will add tremendously to this activity.
  • Self registration by prospects and customers via webforms.
  • Social media master data captured during social CRM and probably harvested in more and more structured ways.

Social media master data are found as profiles in services such as Facebook, mainly for business-to-consumer activities, LinkedIn, mainly for business-to-business activities, and Twitter somewhere in between. These are only some prominent examples of such services. While LinkedIn may be dominant for professional use in English-speaking countries and in countries where English is widely spoken, such as Scandinavia and the Netherlands, other regions are far less penetrated by LinkedIn. In German-speaking countries, for example, the similar network service called Xing is much more crowded. So, when embracing global business you will have to acknowledge the diversity found in social network services.

A good way to integrate all these sources in business processes is using mashups. An example would be a mashup for entering customer data. If you are entering a business entity you may want to know:

  • What is already known in internal databases about that entity – either via a centralized MDM hub or throughout disparate databases?
  • Is the visit address correct according to public sector data?
  • How is the business account related to other business entities learned from a business directory?
  • Do we recognize the business contact in social networks – maybe we did have contact before in another relation?

If you are entering a consumer entity you may want to know:

  • Does that person already exist in our internal databases – as an individual and as a household?
  • What do we know about the residence address from public sector data?
  • Can we obtain additional data from phone book directories, nixie lists and whatever else is available, affordable and legal in the country in question?
  • How do we connect in social media?
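As a rough sketch of the idea, a data-entry mashup for a business entity could fan one lookup out to several sources and show the answers side by side. Everything named below (the source labels, the stub lookup functions and their return values) is a hypothetical stand-in; real sources would be an MDM hub, a business directory service, public sector address data and so on.

```python
def lookup_business_entity(name, sources):
    """Ask every configured source about the entity and collect the answers."""
    profile = {}
    for label, lookup in sources.items():
        try:
            profile[label] = lookup(name)
        except Exception as error:
            # One failing source should not block the data entry
            profile[label] = {"error": str(error)}
    return profile


# Hypothetical stubs standing in for real services
INTERNAL_DB = {"Acme Trading": {"customer_id": 42}}


def internal_lookup(name):
    return INTERNAL_DB.get(name)


def directory_lookup(name):
    return {"parent": "Acme Holding"} if name == "Acme Trading" else None


def address_lookup(name):
    raise TimeoutError("address service unavailable")


sources = {
    "internal": internal_lookup,
    "business_directory": directory_lookup,
    "public_address": address_lookup,
}
profile = lookup_business_entity("Acme Trading", sources)
print(profile)
```

Keeping each source behind the same simple call makes it easy to add or drop sources per country, which matters given how much availability and legality differ.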

Of course privacy is a big issue. Norms vary between countries, and so do the legal rules. Norms vary between individuals, and for the same individual as a private person and as a business contact. Norms vary between industries and from company to company.

If aligning people, processes and technology didn’t matter before, it will when dealing with social master data management.

No Re-Tweets?

Twelve hours ago I noticed the following tweet on Twitter from the profile @GartnerTedF:

The person behind @GartnerTedF is the analyst Ted Friedman of Gartner, Inc. He is a very important person in the data quality realm as he co-writes the Magic Quadrant.

Ted’s tweets are usually re-tweeted by other tweeps.

But not this one.

I think I know why: It’s because technology simply doesn’t work.

I have noticed this often. What happens is that Twitter somehow simply doesn’t index some tweets from time to time, so people don’t see them.

Going Upstream in the Circle

One of the big trends in data quality improvement is going from downstream cleansing to upstream prevention. So let’s talk about the Amazon. No, not the online (book)store, but the river. Besides, I am a bit tired of almost any mention of innovative IT being about that eShop.

A map showing the Amazon River drainage basin may reveal what can turn out to be a huge challenge in going upstream and solving the data quality issues at the source: there may be a lot of sources. Okay, the Amazon is the world’s largest river (because it carries more water to the sea than any other river), so this may be a picture of the data streams in a very large organization. But even more modest organizations have many sources of data, just as more modest rivers also have several sources.

By the way: the Amazon River also shares a source with the Orinoco River through the natural Casiquiare Canal, just as many organizations also share sources of data.

Some sources are not easy to reach, such as the most distant source of the Amazon, a glacial stream on a snowcapped 5,597 m (18,363 ft) peak called Nevado Mismi in the Peruvian Andes.

Now, as I promised that the trend on this blog should be about positivity and success in data quality improvement, I will not dwell on the amount of work involved in going upstream and preventing dirty data from every source.

I say: Go to the clouds. The clouds are the sources of the water in the river. I also think that cloud services will help a lot in improving data quality more easily, as explained in a recent post called Data Quality from the Cloud.

Finally, the clouds over the Amazon River sources are made from water evaporated from the Amazon and a lot of other waters as part of the water cycle. In the same way data has a cycle of being derived as information and created in a new form as a result of the actions made from using the information.

I think data quality work in the future will embrace the full data cycle: Downstream cleansing, upstream prevention and linking in the cloud.

Feasible Names and Addresses

Most data quality technology was born in relation to the direct marketing industry back in the good old offline days. Main objectives have been deduplication of names and addresses and making names and addresses fit for mailing.

When working with data quality you have to embrace the full scope of business value in the data, here being the names and addresses.

Back in the 90’s I worked with an international fundraising organization. A main activity was sending direct mail with greeting cards for optional sale, with motifs related to seasonal feasts. Deduplication was a must regardless of the country (though the means were very different, but that’s for another day). Obviously the timing of the campaigns and the motifs on the cards differed between countries, but also within the countries based on the names and addresses.

Two examples:

German addresses

When selecting motifs for Christmas cards it’s important to observe that Protestantism is concentrated in the north and east of the country and Roman Catholicism in the south and west. (If you think I’m out of season, well, such campaigns are planned in summertime.) So, in the north and east most people prefer Christmas cards with secular motifs such as a lovely winter landscape. In the south and west most people will like a motif with Madonna and Child. Having well-organized addresses with a connection to demographics was important.

Malaysian names

Malaysia is a very multi-ethnic society. The two largest groups, the ethnic Malays and the Malaysians of Chinese descent, have different seasonal feasts. The best way of handling this in order to fulfill the business model was to assign the names and addresses to the different campaigns based on whether the name was an ethnic Malay name or a Chinese name. Surely an exercise on the edge of what I earlier described in the post What’s in a Given Name?

What’s in a Blog Post Title?

I don’t know about you. But I am a slave to numbers and statistics and can’t help following my WordPress statistics telling me about pageviews – not least pageviews per post.

There are huge differences in the number of visitors who view the different posts. The post with the highest number of views on my blog has more than 2,500 views and the post with the lowest number has only 15 views.

To be honest, the ones with over 500 views are mainly visited due to some image search circumstances explained here, so views actually related to data quality varies between 15 and approximately 500. That’s still a huge difference.

I have yet to find out precisely what makes the difference.

It can’t be the content, can it? Basically people don’t know the content before opening.

No doubt the time of posting – not to mention the time of telling about the post on sites such as Twitter and LinkedIn – has something to say. On Twitter the re-tweet action is important, I have noticed. And of course re-tweet action relies on timing and on the first readers finding the content worth a re-tweet.

There is surely also a relation between the number of comments and the number of views. I see that in my numbers.

Obviously the title of the blog post must be important. But from my numbers I can’t figure out how, except for the observation that a technical title seems to rule over philosophical stuff, as discussed here last year on DataQualityPro.

So, the title of this post is not the preface to explaining it all but a genuine question to you who for some reason came by: What’s in a Blog Post Title?

New Blog Name?

As reported by Mark Goloboy here, “Data Quality” is becoming a dirty word. “Information Quality” is in vogue.

Maybe I will soon have to change the name of my blog?

Also one may expect other related terms will be changed, like:

  • Data Governance becomes Information Governance
  • Master Data Management becomes Master Information Management
  • Data Matching becomes Information Matching
  • Data Warehouse becomes Information Warehouse
  • Database becomes Informationbase
  • Information Technology becomes Data Technology

But changing the name of a blog is a serious thing you shouldn’t do too often. I think I will wait and see if the term renaming stops at simply replacing data and information. Some guesses for further renaming:

Information Fitness replaces Data Quality, as data quality is often defined as “fit for intended purpose of use”, and by replacing data with information that trail becomes even clearer – as opposed to the other trail, real-world alignment.

Information Political Correctness replaces Data Governance, as Data Governance is a lot about policies and the Data Governance practice is a lot about maneuvering in the corporate political landscape.

Master Information Technology (MIT) replaces Master Data Management (MDM)

What a Lovely Day

As promised earlier today, here is the first post in an endless row of positive posts about success in data quality improvement.

This beautiful morning I finished yet another of these nice recurring jobs I do from time to time: Deduplicating bunches of files ready for direct marketing making sure that only one, the whole one and nothing but one unique message reaches a given individual decision maker, be that in the online or offline mailbox.

Most jobs are pretty similar and I have a fantastic tool that automates most of the work. I only have the pleasure to learn about the nature of the data and configure the standardisation and matching process accordingly in a user friendly interface. After the automated process I’m enjoying looking for any false positives and checking for false negatives. Sometimes I’m so lucky that I have the chance to repeat the process with a slightly different configuration so we reach the best result possible.
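For readers who want a feel for what the standardisation and matching steps do, here is a toy sketch. It is not how my tool works; the abbreviation list, the field names, the postal-code blocking and the 0.9 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

ABBREVIATIONS = {"st.": "street", "st": "street", "rd.": "road", "rd": "road"}


def standardise(record):
    """Build a comparable string: lowercase, drop commas, expand a few abbreviations."""
    text = f"{record['name']} {record['street']}".lower().replace(",", " ")
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in text.split())


def find_duplicates(records, threshold=0.9):
    """Return candidate duplicate pairs as (index_a, index_b, score).

    Records are only compared within the same postal code to limit the work.
    """
    by_postcode = {}
    for idx, rec in enumerate(records):
        by_postcode.setdefault(rec["postcode"], []).append(idx)

    pairs = []
    for indexes in by_postcode.values():
        for a, b in combinations(indexes, 2):
            score = SequenceMatcher(
                None, standardise(records[a]), standardise(records[b])
            ).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs


# Made-up mailing list rows
records = [
    {"name": "John Smith", "street": "12 High St.", "postcode": "1000"},
    {"name": "Jon Smith", "street": "12 High Street", "postcode": "1000"},
    {"name": "Mary Jones", "street": "5 Mill Rd", "postcode": "1000"},
    {"name": "John Smith", "street": "12 High Street", "postcode": "2000"},
]
pairs = find_duplicates(records)
print(pairs)
```

The last row is never compared with the first two because its postal code differs, which is exactly the kind of false negative worth checking for after the automated run.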

It’s a great feeling that this work reduces the costs of mailings for my clients, makes them look smarter and more professional and facilitates the correct measurement of response rates that is so essential in planning future, even better direct marketing activities.

But that’s not all. I’m also delighted to be able to have a continuing chat about how we may, over time, introduce data quality prevention upstream at the point of data entry, so we don’t have to do these recurring downstream cleansing activities any more. It’s always fascinating going through all the different applications that many organisations are running, some of them so old that I never dreamed they still existed. Most times we are able to build a solution that will work in the given landscape, and anyway soon the credit crunch is totally gone and here we go.

I’ll be back again with more success from the data quality improvement frontier very soon.
