12 thoughts on “What are they doing?”

  1. kenoconnordataconsultant 19th August 2010 / 11:41

    Hi Henrik,

    I like this “follow on” to your earlier post on Data Quality 3.0. It highlights further failings of the “fit for purpose” approach to information management and data quality.

    It reminds me of an experience I had with one client that historically had its own Market Classification codes. The client added a new system that captured NACE codes (wonderful). Over time, NACE codes would be captured for every business customer, allowing improved segmentation.

    Unfortunately:
    a) The client’s core systems (databases) could not accommodate the new NACE code – hence the old code stayed in the old systems, and the NACE code in the new system.

    b) The client’s automated “Master Customer” merge/demerge process failed to accommodate the new NACE code.

    RESULT: NACE code was captured for each business customer at data entry – then discarded in the Master Customer merge/demerge process.

    I detected the problem when helping my client add a new downstream system that needed to segment based on the NACE code.

    Rgds Ken

  2. John Owens 19th August 2010 / 11:51

    Most times when I hear enterprises shouting loudly about how important unstructured data is to their business they are simply hiding behind the fact that they do not know how to structure it.

    In one such organisation, the head of the Information Management Department said to me, “Have you never heard of unstructured information? Have you never heard of a novel?”

    Well now, let’s see how unstructured a novel really is.

    >> It has characters with names and relationships to, and interactions with, other characters. Structure!

    >> The characters have life histories – events that happen in chronological order. Structure!

    >> Characters have dialogue, a series of words and sentences, delivered in a defined sequence. Structure!

    >> The novel has plots and subplots related to characters and events. Structure!

    >> It has sentences, paragraphs and chapters all related in a defined order. Structure!

    It’s almost impossible to find any part of a novel that does not have structure. In fact if it did not have this structure it would be meaningless and unreadable – it simply would not make sense. It would not be a novel.

    I would suggest that any enterprise with unstructured information will also find it meaningless, unreadable and senseless.

    The basic SIC and NACE may not provide the level of granularity an organisation requires internally but this can be added through a simple matrix mapping that would ensure that it met both internal and global needs.

    Thanks for the post, Henrik. Good to have you back.

    Regards
    John

    • Jim Harris 19th August 2010 / 16:03

      Thanks for another great post, Henrik.

      John,

      Since you opened the literary door using a novel as an example of unstructured data that actually has structure, but the enterprise doesn’t know how to structure it…

      I am afraid that I have to disagree, with both your underlying premise and your specific example.

      The word novel derives from the Latin novella meaning “new,” and novella is now used to refer to a “short story of something new,” basically a shorter version of a novel.

      There is a structure in novels, but it is a structure eerily reminiscent of the structure we impose on reality by describing it with data. A novel is a fictional narrative creating a static, artificial, and comforting, but false, sense of reality.

      To make sense of the novel, to make sense of the data, we must enter its false reality, we must believe this false reality is real.

      But it is, as Samuel Taylor Coleridge wrote, this “a semblance of truth sufficient to procure for these shadows of imagination that willing suspension of disbelief for the moment, which constitutes poetic faith.”

      “The final belief,” Wallace Stevens once wrote, “is to believe in a fiction, which you know to be a fiction, there being nothing else.”

      Data is a fiction we believe in, which we know to be a fiction, but there being nothing else, data is the fiction through which we tell ourselves the story of reality.

      And that story is always novella and is always written in an a priori language, and never in an a posteriori language.

      To believe otherwise is to mistake the fiction of data for the non-fiction of reality.

      Best Regards,

      Jim

  3. Henrik Liliendahl Sørensen 19th August 2010 / 13:54

    Ken and John, thanks a lot for the marvelous comments.

  4. kenoconnordataconsultant 19th August 2010 / 18:02

    Jim,

    Thank you for throwing open this debate with a dissenting view. That’s healthy.

    I think John may be on thin ice with the idea that a novel constitutes “Structured Data”. I agree with him that a good novel has structure, but I think it is different to what data quality professionals would regard as “Structured Data”.

    I believe the point that John is making (and I agree with him) is that free format text is a nightmare from a Data Quality / Data Governance perspective. Henrik cites a perfect example in his post, regarding early CRM systems that allowed free format entry of business/industry types.

    I realise and accept that the brave new world of social media means that free format text is here to stay, and Data Quality professionals must deal with it. However, wherever possible, data entry should be performed from selection lists only – selection lists that guarantee that only valid values are selected, and business rules are observed.

    Ken

  5. Graham Rhind 19th August 2010 / 18:23

    This may be heresy to many, but I’m not sure I agree completely with the general assumption that structured data is always better than unstructured.

    It’s only better for data input when the options are exhaustive, and that’s certainly not the case with business types. You won’t find a designation to describe my business in SIC, NACE, the Yellow Pages or in most other places, so any attempt to disallow any free text input would inevitably mean that inaccuracy creeps into the database, and that’s deadly for data quality.

    Then there’s the huge time costs required at data entry to locate the correct classification amongst the very numerous options in those systems. Even if you do find a classification, there’s a good chance it’s not the right one, so you’re getting consistent and valid data but not correct and accurate data.

    Furthermore, input systems which comply with business rules are all well and good, but those business rules have to accurately reflect reality, and many do not. I see this all the time with, for example, job title lists on input forms, where the business rules suggest that there are only 10 job titles in the world. Yeah, sure.

    I often find that classifications reduce accuracy, regardless of how carefully they are implemented, and I am not averse to free form text input when closed questions won’t work. You do, then, have to post-classify that data for any business intelligence purposes, but I do often find that that works better than creating inaccurate data, which is often difficult to correct downstream.

  6. Jim Harris 19th August 2010 / 19:07

    “Once upon a time and a very good time it was there was a moocow coming down along the road and this moocow that was coming down along the road met a nicens little boy named baby tuckoo . . .”

    This is the opening line from the novel A Portrait of the Artist as a Young Man by James Joyce.

    Stephen Dedalus, Joyce’s fictional alter ego, is the protagonist, and the structure of this novel’s unstructured data can be quite challenging, especially the opening chapter since it is written from the perspective of a young child discovering both the world and the words used to describe it.

    Harry Levin, who edited a collection of Joyce’s work, commented that “the novelist through his command of words, is a mediator between the world of ideas and the world of reality.”

    I think that this is also an apt job description for any data management professional, who is a mediator between the world of ideas, whether they be recorded in the structured data of databases or the unstructured data of tweets, and the world of reality, which is what all of that structured and unstructured data are discovering and attempting to describe.

    What is a customer master data object other than the fictional alter ego of the real-world person that an organization does business with?

    Is Stephen Dedalus the equivalent of James Joyce?

    Is the database record, identified by CUSTOMER_KEY = 123 the equivalent of the real-world person it points its digital finger at and says “once upon a time and a very good time it was there was a customer record stored in the database and this customer record stored in the database described a nicens person whose name was entered as Jim. . .”

  7. Henrik Liliendahl Sørensen 19th August 2010 / 22:34

    Wow, a discussion ranging from mere technicalities to deep philosophical thoughts based on literary references.

    To start on the technical path and Graham’s comments: I think using standards such as SIC and NACE is often done along with integrating with an external business directory, so a given value is based on what that directory holds as the “single source of truth” on the industry vertical for that company, as mentioned as a practice in Ken’s first comment. Maybe it is a bit stupid, but quite simple.

    For the philosophical question from Jim: Is Stephen Dedalus the equivalent of James Joyce? Answer: Yes – score = 97.12.

  8. John Owens 20th August 2010 / 01:04

    Jim

    Novel actually comes from the Latin “novus” meaning new. “Novella” was a name conjured up to mean a shorter and lighter novel. The term “Novel” was used because it told the reader that this was going to be a new tale, one they had not read before, it was unique.

    When we gather data about a new customer we are writing a new story called “Everything That This Enterprise Needs to Know About the Unique Customer …..”

    As you say, “Data is a fiction we believe in, which we know to be a fiction, but there being nothing else, data is the fiction through which we tell ourselves the story of reality.”

    But to extract the “truth” from data within our specific organisations we must know the structure in which it was (or ought to have been) laid down.

    Unstructured data is merely data. It has no intrinsic meaning. Structure gives data context and turns it into information.

    Perhaps, we need novel ways of looking at unstructured data in order to turn it into a useful truth.

  9. Graham Rhind 20th August 2010 / 07:08

    “Data is a fiction we believe in, which we know to be a fiction, but there being nothing else, data is the fiction through which we tell ourselves the story of reality.”

    Yes, absolutely right … except for “but there being nothing else” … because we have a choice to make the fiction we accept more like reality. Creating a single version of the truth, where that truth is fiction, is really a Pyrrhic exercise, and we’re better than that.

    Let me illustrate better what I meant with my comment above. This is what happens when the Dutch yellow pages calls me to check my database entry:

    “What does your company do?”

    “I’m a data consultant”

    “What’s that then?”

    I explain.

    “Oh, we don’t have a category for that. I’ll put you down as ‘Direct Marketing’”.

    Pure fiction. My answer has not been recorded. The only entry into their database is “Direct Marketing”. This cannot be corrected or post-classified (if, for example, the company adds or removes categories) because there’s no indication that this is not correct. What should happen is this:

    “Oh, we can’t classify that. I’ll type it in”. In that case my answer might have been classified on the spot (or flagged as requiring post-classification), but the truth would (also) have been recorded, and would always be available for any future interpretation.

    I understand entirely that context and circumstances often overrule best practice, but I, for one, have moved my own customer database to a less structured system because I need to know the truth about each customer, not a classification of them; and the amount of data I have allows me to use my brain to classify (if required) instead of a computer. That won’t always work, but we needn’t be dogmatic about data structure – I think accurate data is better (and far more valuable) than inaccurate but structured data.

  10. kenoconnordataconsultant 20th August 2010 / 10:52

    Hi Graham,

    You have raised excellent points, and I agree with your comment “Accurate data is better (and far more valuable) than inaccurate but structured data”. We need to strive for “Accurate, structured data”.

    Your Dutch Yellow Pages example is excellent. However, it illustrates a flaw in the data collection process – not a flaw in the concept that data should be captured in a structured manner. I completely agree with your proposal of what should happen.

    Your proposal, as a general rule, should be applied to all data collection processes (effectively applying the 80:20 rule). Accurately categorise what we can, and then allow exceptions to be captured as “other”, with the “other” being captured in free format (and as you say “always be available for future interpretation”).

    In your earlier comment, you mentioned the challenge of “the huge time costs required at data entry to locate the correct classification amongst the very numerous options in those systems.” You are right, this is a challenge – a challenge that requires good data entry process design. One approach I’ve seen work well for NACE codes is to use a “three stage process”, that mirrors the NACE code structure. This requires the data entry person to first select a High level business/industry, followed by two lower levels of detail. (Not perfect – since I have seen many examples of “Other”, “Other”, “Other”). Another approach, which will become more popular, will be to pull the SIC/NACE code in real time, from a trusted external reference data supplier (given the company name / identifier).

    You cite the example of your own customer database, which you have moved to a “less structured system because you need to know the truth about each customer”. That model will work perfectly well for small businesses with a close relationship with each customer, and the time to read the truth.

    Unfortunately that model does not work when large amounts of information have to be passed or shared between organisations or between parts of organisations, especially when the shared information needs to be “segmented”, e.g. by business/industry type.

    This very debate has been made possible by the development of “standards” that facilitate the sharing of information easily over the web.

    Standards such as XBRL now facilitate the sharing of structured business information. Is this business information “accurate”? A topic for further debate.

    Ken
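
The capture pattern that Graham and Ken converge on in comments 9 and 10 (classify what you can, always keep the original free-text answer, and flag anything unclassified for later) could be sketched roughly as follows. This is only an illustration, not anything a commenter actually built: `KNOWN_CODES`, `IndustryEntry` and `capture_industry` are hypothetical names, and the two codes shown are a tiny illustrative excerpt that should be checked against the official NACE list before use.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, illustrative excerpt of a classification list.
# A real system would load the full, official table instead.
KNOWN_CODES = {
    "73.11": "Advertising agencies",
    "62.02": "Computer consultancy activities",
}


@dataclass
class IndustryEntry:
    raw_answer: str              # what the customer actually said, always kept
    code: Optional[str] = None   # classification code, if a valid one was chosen
    needs_review: bool = False   # True when post-classification is still required


def capture_industry(raw_answer: str, code: Optional[str] = None) -> IndustryEntry:
    """Record the free-text answer and, where one fits, a valid code."""
    raw = raw_answer.strip()
    if code in KNOWN_CODES:
        return IndustryEntry(raw_answer=raw, code=code)
    # No fitting category: keep the answer and flag it for post-classification
    # rather than forcing an inaccurate code into the record.
    return IndustryEntry(raw_answer=raw, needs_review=True)
```

With this shape, the Yellow Pages call from comment 9 would be stored as `capture_industry("data consultant")`: no code, the original answer preserved, and a flag that lets the entry be post-classified if categories are added later.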
