This coming Sunday I will have worked professionally in information technology for 30 years. As I will be on a (well deserved!) vacation in Andalusia on Sunday, I’d better post my thoughts today.
I have had a lot of different positions and worked in a lot of different domains. The single subject I have worked with the most is business directories.
My first job was at the Danish Tax Authorities, and one of my assignments was being secretary to the committee working for a joint registration of companies in Denmark. Besides learning a lot about working in politically driven organizations and about aligning business and technology, I feel good about having been part of the start of building a public sector master data directory. Such directories are essential for an effective public administration, and private enterprises can use them as external reference data, a valuable means of improving the data quality of business partner master data.
Since then I have worked a lot with improving data quality through matching solutions built around business directories. These range from the Dun & Bradstreet WorldBase, holding nearly 170 million business entities from all over the world, through databases like the EuroContactPool, to national databases holding either all (available) businesses in a single country or given industry segments.
I guess I will also spend some additional years integrating business directory information into business processes as smoothly as possible, preferably along with a range of other kinds of external reference data.
One of the new sources building up in the cloud in the realm of business directories is master data references in social networks. The LinkedIn Companies feature is a prominent example. Of course such directories have some data quality issues. This can be seen by looking at the companies where I currently work:
- DM Partner A/S seems OK
- Omikron Data Quality has 90 employees according to the company profile (filled out by yours truly). Then it’s strange that there are only 25 profiles in the network. But that’s because most employees are in Germany where the competing network called Xing is stronger.
- Trapeze Group Europe has not been updated with a recent merger, and not all employees have changed their profiles accordingly yet. But I’m sure that will be done as time goes by.
I have no doubt though that including information from social networks will become a part of integrating business partner master data in my future.
At the risk of having the comment area on this blog filled up with SQL statements, I will follow the track and tone of the last post, called Create Table Homo_Sapiens.
In the last post some challenges around modelling people in databases were discussed, with a focus on uniqueness. Now we will have a look at the same challenges with companies – the other big part of party master data.
Companies often act in the same role as individual people in business processes – not least in the role of customer. Companies also behave like persons in a lot of ways: they are born (established), change name, relocate, marry (mergers and acquisitions), divorce (split) and decease (dissolve).
All over the world a lot of people spend their days entering and updating the data held on business partners in numerous databases. The worldwide sum of B2B connections between a customer and a vendor, each entering and maintaining data about the other, resembles (though it grows less aggressively than) the story of the grains on a chessboard:
- 2 companies both exchanging goodies makes 1+1 customers and 1+1 vendors = 4 rows
- 3 companies all exchanging goodies makes 2+2+2 customers and 2+2+2 vendors = 12 rows
- 4 companies all exchanging goodies makes 3+3+3+3 customers and 3+3+3+3 vendors = 24 rows
- 5 companies all exchanging goodies makes 4+4+4+4+4 customers and 4+4+4+4+4 vendors = 40 rows
- n companies all exchanging goodies makes n*(n-1) customers and n*(n-1) vendors = 2*n*(n-1) rows
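The arithmetic in the list above can be checked with a few lines of Python. This is only a sketch of the formula as stated; the function name `master_data_rows` is made up for the illustration.

```python
def master_data_rows(n: int) -> int:
    """Rows maintained in total when each of n companies trades with all others.

    Every company keeps (n - 1) customer rows and (n - 1) vendor rows.
    """
    customer_rows = n * (n - 1)
    vendor_rows = n * (n - 1)
    return customer_rows + vendor_rows


for n in (2, 3, 4, 5):
    print(n, "companies:", master_data_rows(n), "rows")
```

The growth is quadratic rather than exponential, which is why it is less aggressive than the chessboard doubling.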
Last time I checked, the D&B WorldBase held more than 150 million companies. Some are dissolved and fortunately? everyone doesn’t do business with everyone – but as said, the sum of B2B connections is huge and the work in entering and maintaining the master data seems overwhelming.
If we look at one single company and how it may be represented in different databases, there will be lots of variations, even when taking only basic data such as name and address into account. Even in the same table the same real world company often occupies several rows spelled differently.
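To illustrate how spelling variations of one company can hide in a table, here is a minimal sketch of a crude name normalization. The company name, the list of legal forms and the function `normalize_name` are all assumptions made up for the example. Real matching needs far more than this (addresses, fuzzy similarity, country specific rules).

```python
import re

# Hypothetical, far from complete list of legal form tokens to ignore.
LEGAL_FORMS = {"as", "aps", "ltd", "limited", "inc", "gmbh"}


def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation and legal form tokens, collapse spaces."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    tokens = [t for t in cleaned.split() if t not in LEGAL_FORMS]
    return " ".join(tokens)


# Three rows that a human would recognize as the same (invented) company.
rows = ["Acme Trading A/S", "ACME TRADING AS", "acme  trading a.s."]
print({normalize_name(r) for r in rows})
```

All three spellings collapse to the same normalized string, so an exact comparison on the raw names would have kept three duplicates.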
One of the most effective methods for avoiding duplicates of company master data is plugging into a business directory. By using an externally sourced company ID as a key in your master data, you are able to hold a golden record of that real world entity. As a bonus you are offered updates and access to a lot of additional data you would never be able to collect yourself.
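A minimal sketch of the idea, assuming each incoming row has already been resolved to a directory ID: the IDs like `DIR-001` and the classes `Company` and `PartnerMaster` are hypothetical and not taken from any real directory or product.

```python
from dataclasses import dataclass, field


@dataclass
class Company:
    directory_id: str            # key sourced from the external directory
    name: str                    # name as first registered in our table
    aliases: set = field(default_factory=set)  # other spellings seen


class PartnerMaster:
    """Holds one golden record per external directory ID."""

    def __init__(self) -> None:
        self._by_id = {}

    def upsert(self, directory_id: str, name: str) -> Company:
        record = self._by_id.get(directory_id)
        if record is None:
            # First time we see this real world entity: create the record.
            record = Company(directory_id, name)
            self._by_id[directory_id] = record
        elif name != record.name:
            # Same entity, different spelling: keep it as an alias only.
            record.aliases.add(name)
        return record

    def __len__(self) -> int:
        return len(self._by_id)


master = PartnerMaster()
master.upsert("DIR-001", "Acme Trading A/S")
master.upsert("DIR-001", "ACME TRADING AS")
print(len(master))
```

Because the key is the directory ID and not the name, the second spelling ends up as an alias on the existing record instead of a new row.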
A business directory is a list of companies in a given area and perhaps a given industry. One very useful type of such a directory related to data quality is a list of all companies in a given country. In many countries the authorities maintain such a list; in other places it’s a matter of assembling local lists or other forms of data capture. Many private service providers offer such lists, often with added information value of different kinds.
If you take the customer/prospect master table from an enterprise doing B2B in a given country, you would expect the rows in that table to match 100% to the business directory of that country. I am not saying that all data should be spelled exactly as in the directory, but “only” that the same real world object is reflected.
During many years of providing solutions for business directory match and tuning these as well as handling such match services from colleagues in the business I have very, very seldom seen a 100% match – even 90% matches are very rare.
Why is that so? Some of the reasons – related to the classic data quality dimensions – I have stumbled over have been:
Completeness of business directories varies from country to country and between the lists provided by vendors. Some countries, like those of the former Czechoslovakia, some English-speaking countries in the Pacific, the Nordics and others, have tight registration, while registration is less tight in North America, other European countries and the rest of the world.
Actuality in business directories also differs a lot. It also matters whether the business directory covers dissolved entities and includes history tracking like former names and addresses. Then take the actuality of the customer/prospect table to be matched, and once again the time dimension has a lot to say.
Validity, accuracy and consistency, concerning both the directory and the table to be matched, are a natural cause of mismatch. Also many B2B customer/prospect tables hold a lot of entities that are not formal business entities but other types of party master data.
Uniqueness may be defined differently in the directory and the table to be matched. This includes the perception of hierarchies of legal entities and branches – not least governmental and local authority bodies are a fuzzy crowd. Different roles, such as those of a small business owner, also create challenges. The same is true of roles such as franchise takers and the use of trading styles.
Then of course the applied automated match technique and the human interaction executed are factors of the resulting match rate and the quality of the match measured as frequency of false positives.
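The two measures mentioned here, match rate and frequency of false positives, can be sketched as follows. This assumes a manually verified sample is available as the truth. The function `match_quality` and the sample IDs are made up for the illustration.

```python
def match_quality(proposed: dict, verified: dict) -> tuple:
    """Return (match rate, false positive rate among the proposed matches).

    proposed: row ID -> directory ID suggested by the matching, or None
    verified: row ID -> directory ID confirmed by a human, or None
    """
    if not proposed:
        return 0.0, 0.0
    matched = {row: d for row, d in proposed.items() if d is not None}
    match_rate = len(matched) / len(proposed)
    # A false positive is a proposed match that the human did not confirm.
    wrong = sum(1 for row, d in matched.items() if verified.get(row) != d)
    fp_rate = wrong / len(matched) if matched else 0.0
    return match_rate, fp_rate


proposed = {"r1": "DIR-001", "r2": "DIR-002", "r3": None, "r4": "DIR-009"}
verified = {"r1": "DIR-001", "r2": "DIR-002", "r3": "DIR-003", "r4": "DIR-004"}
rate, fp = match_quality(proposed, verified)
print(f"match rate {rate:.0%}, false positives {fp:.0%} of matches")
```

In the sample, row `r3` is a false negative (a match exists but was not found), which lowers the match rate without counting as a false positive.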
10 years ago I spent most of the summer delivering my first large project after becoming a sole proprietor. The client – or actually rather the partner – was Dun & Bradstreet’s Nordic operation, who needed an agile solution for matching customer files with their Nordic business reference data sets. The application was named MatchBox.
This solution has grown over the years while D&B’s operation in the Nordics and other parts of Europe is now operated by Bisnode.
Today matching is done with the entire WorldBase holding close to 150 million business entities from all over the world – with all the diversity you can imagine. On the technology side the application has been bundled with the indexing capacities of www.softbool.com and the similarity cleverness of www.omikron.net (disclosure: today I work for Omikron) all built with the RAD tool www.magicsoftware.com. The application is now called GlobalMatchBox.
It has been a great but fearful pleasure for me to have been able to work with setting up and tuning such a data matching engine and environment. Everybody who has worked with data matching knows about the scars you get when avoiding false positives and false negatives. You know that it is just not good enough to say that you are only able to automatically match 40% of the records when it is supposed to be 100%.
So this project has been a very different experience compared to the occasional SMB (Small and Medium-sized Business) hit and run data quality improvement projects I also do, as described in my previous post. With D&B we are not talking about months but years of tuning, and I have been guilty of practicing excessive consultancy.