A core attribute in customer master data for business entities is the industry vertical of your customers and prospects (or line of business, or market segment, or whatever metadata name you like).
When handling this particular data element you will come across many of the classic choices in data and information management.
Unstructured versus structured
Many early CRM (Customer Relationship Management) implementations offered a free text field for the industry vertical. While this approach may have been good for the free flow of data entry, it of course created havoc when business intelligence was applied to the CRM data. Countless cleansing projects have been done (and are still going on) in order to fix this basic mistake.
Most data entry forms today that capture an industry vertical offer a value list to choose from.
Your list versus an external standard
When having a value list it may be a list of your own creation or be based on an external standard list, for example SIC or NACE codes.
Having a list of your own tends to fulfill the data quality principle of fit for purpose of use while an external standard tends to fulfill the data quality principle of reflecting the real world construct.
The main weaknesses of a list of your own are that it requires continuous manual maintenance and may cause conflicts. Deep into a discussion on the Initiate MDM blog, Julian Schwarzenbach offered a good example, saying:
“I have also come across ‘flip-flop’ data – which is typically subjective data where two users cannot agree what the correct value is and it keeps getting changed between two values. This could be the classification of a customer by market sector where two different territories are reflecting different capabilities in their territories.”
The main weaknesses of external standards are that they seldom offer the granularity you need and that, for global data, the different standards (SIC versions and different national NACE implementations and others) are a pain in the…
One versus several values
Many companies have more than one distinct activity. Capturing only one (the primary) value for each company is keeping it simple, stupid. Having more than one value in relevant cases adds complexity but may lead to better decisions.
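As a rough illustration of the trade-off, here is a minimal Python sketch of a record that holds one primary value plus optional secondary values. The class name, field names and the sample company are all hypothetical; the codes are only there to have something to store.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CompanyIndustry:
    """Industry classification for one company (hypothetical structure)."""
    company_id: str
    primary_code: str  # the single "keep it simple" value
    secondary_codes: List[str] = field(default_factory=list)

    def all_codes(self) -> List[str]:
        # Primary first, then secondaries, without duplicates.
        seen: List[str] = []
        for code in [self.primary_code, *self.secondary_codes]:
            if code not in seen:
                seen.append(code)
        return seen

    def is_in(self, code: str, primary_only: bool = True) -> bool:
        # primary_only=True gives the simple view; False gives the richer one.
        if primary_only:
            return code == self.primary_code
        return code in self.all_codes()


# Hypothetical company with a primary and one secondary activity.
acme = CompanyIndustry("C-001", primary_code="62.01", secondary_codes=["70.22"])
```

With this shape, a segmentation on the primary value alone would miss the company for its secondary activity, while `acme.is_in("70.22", primary_only=False)` would find it. That is the added complexity, and possibly the better decision.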
Hi Henrik,
I like this “follow on” to your earlier post on Data Quality 3.0. It highlights further failings of the “fit for purpose” approach to information management and data quality.
It reminds me of an experience I had with one client that historically had its own Market Classification codes. The client added a new system that captured NACE codes (wonderful). Over time, NACE codes would be captured for every business customer, allowing improved segmentation.
Unfortunately:
a) The client’s core systems (databases) could not accommodate the new NACE code – hence the old code stayed in the old systems, and the NACE code in the new system.
b) The client’s automated “Master Customer” merge/demerge process failed to accommodate the new NACE code.
RESULT: NACE code was captured for each business customer at data entry – then discarded in the Master Customer merge/demerge process.
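One way such a loss could be avoided is a merge that carries over every populated attribute it sees instead of copying a fixed list of columns. The Python sketch below is only an assumption about how that might look; the function name, the field names and the sample records are hypothetical and not taken from the client's systems.

```python
from typing import Dict, List


def merge_customer(records: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge source records into one master record.

    Every attribute found in any source is kept. Later records win on
    conflict, but an empty value never overwrites a populated one.
    """
    master: Dict[str, str] = {}
    for record in records:
        for key, value in record.items():
            if value not in (None, ""):
                master[key] = value
    return master


# Hypothetical source records: the old system knows only the in-house code,
# the new system captures the NACE code.
legacy = {"customer_id": "C-001", "name": "Acme Ltd", "market_class": "IT-SVC"}
new_system = {"customer_id": "C-001", "name": "Acme Ltd",
              "nace_code": "62.01", "market_class": ""}

master = merge_customer([legacy, new_system])
```

Here the master record ends up with both `market_class` and `nace_code`, because the merge does not depend on a column list that predates the new attribute.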
I detected the problem when helping my client add a new downstream system that needed to segment based on the NACE code.
Rgds Ken
Most times when I hear enterprises shouting loudly about how important unstructured data is to their business they are simply hiding behind the fact that they do not know how to structure it.
In one such organisation, the head of the Information Management Department said to me, “Have you never heard of unstructured information? Have you never heard of a novel?”
Well now, let’s see how unstructured a novel really is.
>> It has characters with names and relationships to, and interactions with, other characters. Structure!
>> The characters have life histories – events that happen in chronological order. Structure!
>> Characters have dialogue, a series of words and sentences, delivered in a defined sequence. Structure!
>> The novel has plots and subplots related to characters and events. Structure!
>> It has sentences, paragraphs and chapters all related in a defined order. Structure!
It’s almost impossible to find any part of a novel that does not have structure. In fact if it did not have this structure it would be meaningless and unreadable – it simply would not make sense. It would not be a novel.
I would suggest that any enterprise with unstructured information will also find it meaningless, unreadable and senseless.
The basic SIC and NACE may not provide the level of granularity an organisation requires internally but this can be added through a simple matrix mapping that would ensure that it met both internal and global needs.
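A matrix mapping of this kind can be sketched in a few lines of Python. The internal segment names below are hypothetical, and the NACE and US SIC codes are illustrative values that should be checked against the official lists before any real use.

```python
from typing import Dict, List

# Hypothetical internal segments, each mapped to one NACE and one US SIC code.
# The internal list is finer grained: two segments share the same standard codes.
SEGMENT_MATRIX: Dict[str, Dict[str, str]] = {
    "DATA-CONSULTING": {"nace": "70.22", "sic": "8742"},
    "CUSTOM-SOFTWARE": {"nace": "62.01", "sic": "7371"},
    "SOFTWARE-FOR-RETAIL": {"nace": "62.01", "sic": "7371"},
}


def to_standard(segment: str, standard: str) -> str:
    """Translate an internal segment into the chosen external standard."""
    return SEGMENT_MATRIX[segment][standard]


def segments_for(standard: str, code: str) -> List[str]:
    """List the internal segments that roll up to one external code."""
    return sorted(
        segment
        for segment, codes in SEGMENT_MATRIX.items()
        if codes[standard] == code
    )
```

The internal list keeps its granularity for in-house use, while `to_standard` gives the value to share externally and `segments_for` shows which internal segments sit behind one standard code.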
Thanks for the post, Henrik. Good to have you back.
Regards
John
Thanks for another great post, Henrik.
John,
Since you opened the literary door using a novel as an example of unstructured data that actually has structure, but the enterprise doesn’t know how to structure it…
I am afraid that I have to disagree, with both your underlying premise and your specific example.
The word novel derives from the Latin novella meaning “new,” and novella is now used to refer to a “short story of something new,” basically a shorter version of a novel.
There is a structure in novels, but it is a structure eerily reminiscent of the structure we impose on reality by describing it with data. A novel is a fictional narrative creating a static, artificial, and comforting, but false, sense of reality.
To make sense of the novel, to make sense of the data, we must enter its false reality, we must believe this false reality is real.
But it is, as Samuel Taylor Coleridge wrote, this “a semblance of truth sufficient to procure for these shadows of imagination that willing suspension of disbelief for the moment, which constitutes poetic faith.”
“The final belief,” Wallace Stevens once wrote, “is to believe in a fiction, which you know to be a fiction, there being nothing else.”
Data is a fiction we believe in, which we know to be a fiction, but there being nothing else, data is the fiction through which we tell ourselves the story of reality.
And that story is always novella and is always written in an a priori language, and never in an a posteriori language.
To believe otherwise is to mistake the fiction of data for the non-fiction of reality.
Best Regards,
Jim
Ken and John, thanks a lot for the marvelous comments.
Jim,
Thank you for throwing open this debate with a dissenting view. That’s healthy.
I think John may be on thin ice with the idea that a novel constitutes “Structured Data”. I agree with him that a good novel has structure, but I think it is different to what data quality professionals would regard as “Structured Data”.
I believe the point that John is making (and I agree with him) is that free format text is a nightmare from a Data Quality / Data Governance perspective. Henrik cites a perfect example in his post, regarding early CRM systems that allowed free format entry of business/industry types.
I realise and accept that the brave new world of social media means that free format text is here to stay, and Data Quality professionals must deal with it. However, wherever possible, data entry should be performed from selection lists only – selection lists that guarantee that only valid values are selected, and business rules are observed.
Ken
This may be heresy to many, but I’m not sure I agree completely with the general assumption that structured data is always better than unstructured.
It’s only better for data input when the options are exhaustive, and that’s certainly not the case with business types. You won’t find a designation to describe my business in SIC, NACE, the Yellow Pages or in most other places, so any attempt to disallow any free text input would inevitably mean that inaccuracy creeps into the database, and that’s deadly for data quality.
Then there’s the huge time costs required at data entry to locate the correct classification amongst the very numerous options in those systems. Even if you do find a classification, there’s a good chance it’s not the right one, so you’re getting consistent and valid data but not correct and accurate data.
Furthermore, input systems which comply with business rules are all well and good, but those business rules have to accurately reflect reality, and many do not. I see this all the time with, for example, job title lists on input forms, where the business rules suggest that there are only 10 job titles in the world. Yeah, sure.
I often find that classifications reduce accuracy, regardless of how carefully they are implemented, and I am not averse to free form text input when closed questions won’t work. You do, then, have to post-classify that data for any business intelligence purposes, but I do often find that that works better than creating inaccurate data, which is often difficult to correct downstream.
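Post-classification can be as simple or as clever as the data demands. The Python sketch below uses naive keyword rules purely as an illustration; the rules, the category names and the function name are hypothetical, and a real project would likely need something more robust.

```python
from typing import Dict, Optional

# Hypothetical keyword rules; the first matching keyword wins.
RULES: Dict[str, str] = {
    "consult": "Consulting",
    "software": "Software",
    "market": "Marketing",
}


def post_classify(free_text: str) -> Optional[str]:
    """Derive a category from free text, or return None if no rule matches.

    The free text itself is not changed, so it can be reclassified
    whenever the rules or the category list change.
    """
    text = free_text.lower()
    for keyword, category in RULES.items():
        if keyword in text:
            return category
    return None
```

An answer such as "I'm a data consultant" would land in "Consulting" under these rules, and anything that matches no rule stays unclassified rather than being forced into a wrong category.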
“Once upon a time and a very good time it was there was a moocow coming down along the road and this moocow that was coming down along the road met a nicens little boy named baby tuckoo . . .”
This is the opening line from the novel A Portrait of the Artist as a Young Man by James Joyce.
Stephen Dedalus, Joyce’s fictional alter ego, is the protagonist, and the structure of this novel’s unstructured data can be quite challenging, especially the opening chapter since it is written from the perspective of a young child discovering both the world and the words used to describe it.
Harry Levin, who edited a collection of Joyce’s work, commented that “the novelist through his command of words, is a mediator between the world of ideas and the world of reality.”
I think that this is also an apt job description for any data management professional, who is a mediator between the world of ideas, whether they be recorded in the structured data of databases or the unstructured data of tweets, and the world of reality, which is what all of that structured and unstructured data are discovering and attempting to describe.
What is a customer master data object other than the fictional alter ego of the real-world person that an organization does business with?
Is Stephen Dedalus the equivalent of James Joyce?
Is the database record, identified by CUSTOMER_KEY = 123, the equivalent of the real-world person it points its digital finger at and says “once upon a time and a very good time it was there was a customer record stored in the database and this customer record stored in the database described a nicens person whose name was entered as Jim. . .”?
Wow, a discussion ranging from mere technicalities to deep philosophical thoughts based on literary references.
To start on the technical path with Graham’s comments: I think using standards such as SIC and NACE is often done along with integrating with an external business directory, so a given value is based on what that directory holds as the “single source of truth” on the industry vertical for that company, as mentioned as a practice in Ken’s first comment. Maybe it is a bit stupid, but quite simple.
For the philosophical question from Jim: Is Stephen Dedalus the equivalent of James Joyce? Answer: Yes – score = 97.12.
Jim
Novel actually comes from the Latin “novus” meaning new. “Novella” was a name conjured up to mean a shorter and lighter novel. The term “Novel” was used because it told the reader that this was going to be a new tale, one they had not read before, it was unique.
When we gather data about a new customer we are writing a new story called “Everything That This Enterprise Needs to Know About the Unique Customer …..”
As you say, “Data is a fiction we believe in, which we know to be a fiction, but there being nothing else, data is the fiction through which we tell ourselves the story of reality.”
But to extract the “truth” from data within our specific organisations we must know the structure in which it was (or ought to have been) laid down.
Unstructured data is merely data. It has no intrinsic meaning. Structure gives data context and turns it into information.
Perhaps we need novel ways of looking at unstructured data in order to turn it into a useful truth.
“Data is a fiction we believe in, which we know to be a fiction, but there being nothing else, data is the fiction through which we tell ourselves the story of reality.”
Yes, absolutely right … except for “but there being nothing else” … because we have a choice to make the fiction we accept more like reality. Creating a single version of the truth, where that truth is fiction, is really a Pyrrhic exercise, and we’re better than that.
Let me illustrate better what I meant with my comment above. This is what happens when the Dutch yellow pages calls me to check my database entry:
“What does your company do?”
“I’m a data consultant”
“What’s that then?”
I explain.
“Oh, we don’t have a category for that. I’ll put you down as ‘Direct Marketing’”.
Pure fiction. My answer has not been recorded. The only entry into their database is “Direct Marketing”. This cannot be corrected or post-classified (if, for example, the company adds or removes categories) because there’s no indication that this is not correct. What should happen is this:
“Oh, we can’t classify that. I’ll type it in”. In that case my answer might have been classified on the spot (or flagged as requiring post-classification), but the truth would (also) have been recorded, and would always be available for any future interpretation.
I understand entirely that context and circumstances often overrule best practice, but I, for one, have moved my own customer database to a less structured system because I need to know the truth about each customer, not a classification of them; and the amount of data I have allows me to use my brain to classify (if required) instead of a computer. That won’t always work, but we needn’t be dogmatic about data structure – I think accurate data is better (and far more valuable) than inaccurate but structured data.
Hi Graham,
You have raised excellent points, and I agree with your comment “Accurate data is better (and far more valuable) than inaccurate but structured data”. We need to strive for “Accurate, structured data”.
Your Dutch Yellow Pages example is excellent. However, it illustrates a flaw in the data collection process – not a flaw in the concept that data should be captured in a structured manner. I completely agree with your proposal of what should happen.
Your proposal, as a general rule, should be applied to all data collection processes (effectively applying the 80:20 rule). Accurately categorise what we can, and then allow exceptions to be captured as “other”, with the “other” being captured in free format (and, as you say, “always be available for future interpretation”).
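A minimal Python sketch of that rule might look like the following. The value list, the class and the function names are hypothetical; the point is only that the stated answer is always stored and that unclassifiable answers are flagged rather than forced into a category.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical value list offered at data entry.
VALUE_LIST = {"Direct Marketing", "Software", "Consulting"}


@dataclass
class IndustryEntry:
    category: str       # a value from VALUE_LIST, or "Other"
    stated_text: str    # what the customer actually said, always kept
    needs_review: bool  # True when the answer could not be classified


def capture(stated_text: str, chosen: Optional[str] = None) -> IndustryEntry:
    """Record the stated answer plus a category when one genuinely fits."""
    if chosen in VALUE_LIST:
        return IndustryEntry(chosen, stated_text, needs_review=False)
    # No fitting category: keep the truth and flag it for post-classification.
    return IndustryEntry("Other", stated_text, needs_review=True)
```

In the Yellow Pages example, `capture("I'm a data consultant")` would produce an "Other" entry with the original words intact and a review flag set, so the record can be reinterpreted if the category list changes.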
In your earlier comment, you mentioned the challenge of “the huge time costs required at data entry to locate the correct classification amongst the very numerous options in those systems.” You are right, this is a challenge – a challenge that requires good data entry process design. One approach I’ve seen work well for NACE codes is to use a “three stage process” that mirrors the NACE code structure. This requires the data entry person to first select a high-level business/industry, followed by two lower levels of detail. (Not perfect – since I have seen many examples of “Other”, “Other”, “Other”). Another approach, which will become more popular, will be to pull the SIC/NACE code in real time from a trusted external reference data supplier (given the company name / identifier).
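To make the three stage idea concrete, here is a small Python sketch of a drill-down over a nested lookup. The tree is a tiny illustrative excerpt with abbreviated labels, which should be checked against the official NACE Rev. 2 publication, and the function name is hypothetical.

```python
from typing import Dict, List, Optional

# Tiny illustrative excerpt: section -> division -> class code -> label.
NACE_TREE: Dict[str, Dict[str, Dict[str, str]]] = {
    "J Information and communication": {
        "62 Computer programming, consultancy": {
            "62.01": "Computer programming activities",
            "62.02": "Computer consultancy activities",
        },
    },
    "M Professional, scientific and technical activities": {
        "70 Head offices; management consultancy": {
            "70.22": "Business and other management consultancy",
        },
        "73 Advertising and market research": {
            "73.11": "Advertising agencies",
        },
    },
}


def options(section: Optional[str] = None,
            division: Optional[str] = None) -> List[str]:
    """Return the choices for the next stage, given the selections so far."""
    if section is None:
        return list(NACE_TREE)                 # stage 1: high-level sections
    if division is None:
        return list(NACE_TREE[section])        # stage 2: divisions in the section
    return list(NACE_TREE[section][division])  # stage 3: class codes


# Walk the three stages for one hypothetical data entry session.
stage1 = options()
stage2 = options(stage1[0])
stage3 = options(stage1[0], stage2[0])
```

Each stage shows only a handful of choices instead of the full list, which is where the time saving at data entry would come from.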
You cite the example of your own customer database, which you have moved to a “less structured system because you need to know the truth about each customer”. That model will work perfectly well for small businesses with a close relationship with each customer, and the time to read the truth.
Unfortunately that model does not work when large amounts of information have to be passed or shared between organisations or between parts of organisations, especially when the shared information needs to be “segmented”, e.g. by business/industry type.
This very debate has been made possible by the development of “standards” that facilitate the sharing of information easily over the web.
Standards such as XBRL now facilitate the sharing of structured business information. Is this business information “accurate”? A topic for further debate.
Ken