A very common data quality issue is when a field in a data record is populated with more than one piece of information.
Sometimes this is done as a work around, because we have a piece of information, but we haven’t a field with that distinct purpose of use. Then we find a more or less related existing field where in we can squeeze this additional piece of information.
But we also have some very common cases where this bad habit is required by external business rules or wide spread tradition.
Legal Form in Company Names
This example is examined in the post Legal Forms from Hell.
One should think that it is time for changing the bad (legal demanded) practice of mixing legal forms with company names and serve the original purpose in another more data quality friendly way.
An Address Line
An address line will typically hold a couple of elements as a street (thoroughfare) name, a house number and maybe some kind of unit identification.
By the way the order of street name and house number is opposite in approximately two equal parts of the world, with the exception of places where numbering within blocks between streets is the standard.
Education in Person Name
You can put professor in front of your name and even MBA – Master of Business Administration!! – after your name in the name field.
In the next few days I will put AFCM (Accidental Field Content Misuser) after my name.
Henrik, excellent… someone who gets it !!! to be blunt, it is the Lazy Best Practice that gets implemented here… changing the data model to support multiple fields in a name, entity or the like, is just too freaking difficult.
So, what can we do about this… yep, I hope you are saying “parse the information, and when mastering it (MDM for example), put the relevant data where it needs to be. ” i.e. First Name has a place, and Last Name has a place.
Oh well, I find a lot of data sources with “Informatica Corporation” as a Legal Name, so sometimes we need to keep it that way because, it is a legal entity… but you can always have a reference table or another set of attributes to call it “Informatica” or INFA or what else you like to spin “Informatica Corporation”… sometimes that Legal Entity has to be kept, for Traceability, Audit and Compliance…
Always like reading your posts… cheers…
@Garnie, I agree with you – it is a very good reflection. You are trying to give a recipe how to solve it. I agree with you we could put the “First Name” as the first etc. It is perfect if you are able to separate information to specific columns. But here starts other issues, like the lenght of name, etc. Other issue can be writing of academic titles (i.e. M.D, MUDr.,….) because it can be influenced by acts valid for a specific country, etc.
It is a first side of this problem. Only metadata management – describing significance of the field is necessary including having common definition of this field, and as well defined quality requirements.
For example in my country are small cities (mostly villages) not having streets. And you can imagine simple impact if the street is a mandatory field. At many cases it is an “important” situation because you could not be able to enter client into your system. I think it is possible to find other examples.
Information (data) quality starts in design. If you have system not allowing to set-up street as optional (at the case above) people at branches (if missing business rules) will be looking for an approach how to resolve it – from his/her view. And imediatelly you have an information (or if you can – data) quality issue.
And it would be possible to continue by training, etc…..I would like to say, metadata is not a medicine for all…. it is a much complex problem.
Milan, agreed, if you have the opportunity for a green field / new environment, but most of my engagements deal with enterprises that have existing systems / data, through organic growth or worse, acquisition (from mergers, or acquire data from supplies vendors or the like) …
The challenge is multi dimensional:
– existing … sometimes you can’t change the data. This is due to either legal reasons (legal entities) or that the data exists in systems that have been there before (think ERP or Legacy systems).
– cost … changing models on all your systems is a huge cost. is it worth time to spend resources, money and downtown to change the models to fix the “Naming Fields” ?
There are many more dimensions, too numerous to share here. I also agree on Metadata / Glossary of terms. I recently engaged with a client that they established a glossary of terms (with business context and contributions from the business) and had those terms related back to the Metadata (specifically Data Quality and Data Integration jobs) that reflected the linage and any impact to reports if the “Business Logic Changed” We used that knowledge to find the challenges, then using parsing and other data integration tools (DQ for example) to correct – split out names (and create reference tables) for the consuming applications.
In all, we solved a problem that existed. Parsing & Metadata are just two ways to help attack the “Putting Two Things in One Field”
Thanks for the reply Milan, yes, very complex, why we have folks like you and Henrik involved with the community of thought leaders… good stuff…
Thanks for your post, good reading as always… Even though AFCM would be incorrect if you’re not putting it in by accident 😃.
Your post brings back memories of trying to correctly tokenize customer address data back in 1997, trying to standardize address data, some of which had titles, suffixes, prefixes, middle initials, etc…
Thanks guys for commenting.
@Leif, well, it may be an accident within Aristotelian substance theory.
I find that much of this issue is poor implementation. Most of the major applications have fields for the information to be handled correctly. But the implementation does not properly take them account for their intended use. My (least) favorite example is the c/o in the address line. Why? “Because the shipping label does not print properly if I put the name in the c/o field”!