I remember when I had the first chemistry lesson in high school our teacher told us that we should forget all about the chemistry we had learned in primary school, because this was a too simply model not reflecting how the real world of chemistry actually work.
Since I have started working with data quality and master data management I have a pet peeve in data modeling, namely being the probably most common example of doing data modeling: The classic customer table. Example from a SQL tutorial here:
Compared to how the real world works this example has some diversity flaws, like:
- state code as a key to a state table will only work with one country (the United States)
- zipcode is a United States description only opposite to the more generic “Postal Code”
- fname (First name) and lname (Last name) don’t work in cultures where given name and surname have the opposite sequence
- The length of the state, zipcode and most other fields are obviously too small almost anywhere
More seriously we have:
- fname and lname (First name and Last name) and probably also phone should belong to an own party entity acting as a contact related to the company
- company name should belong to an own party entity acting in the role as customer
- address1, address2, city, state, zipcode should belong to an own place entity probably as the current visiting place related to the company
Now I know this is just a simple example from a tutorial where you should not confuse by adding too much complexity. Agreed.
However many home grown solutions in business life and even many commercial ready-made applications use that kind of a data model to describe one of the most important business entities being our customers.
It may be that such a model does fit the purpose of use in some operations. Sometimes yes, sometimes no. But when reusing data from such a model on enterprise level and when adding business intelligence you are in big trouble. That is why we need master data hubs and why we need to transform data coming into the master data hub.
From such a customer record we don’t create just one golden record. We make or link several different related multi-domain entities as:
- The contact as a person in our party domain – maybe we knew her before
- The company in our party domain – maybe we knew the sister as a supplier before
- The address in our place (location) domain – maybe we knew that address as a place in time before
Ah, the curse of bad examples. Your post fits well with my usual caveat that ALL examples are wrong. They are usually wrong for the reason you state: that the real world is too complex. I agree, but there is no valid reason for authors to make most of the mistakes you point out here.
When I do manuscript reviews, this is a common criticism of mine.
We need to stop teaching bad habits.
Thanks Karen for dropping by this place. Not completely sure if I want the honor having you reviewing some of my tutorials 🙂
Henrik,
You are spot on with your observation….even though example is simplistic it makes a real and practical point.
Another dimension of what you are saying is one of my pet peeves (since we are talking about pet peeves, thought I would add it here :))….while defining data models for operational systems, operational system developers/designers should consider impact on Analytical usage of the data. Very little or no attention is paid to the analytical usage of operational data….and this is where many of the legacy (and some newly developed) data models come short in satisfying holistic data/analytical requirements of the business.
I had made similar argument around data quality (in the context of building operational applications) in one of my posts Inject Data Quality before It Becomes Data Quality Issue
Vish Agashe
Thanks Vish. Indeed, better data models upstream is far better than introducing the right data model downstream.
I just worked on a data migration where in the zip code field if patient was from other than the US, it was noted as “foreign”.
Any data migration projects, one of the first questions I ask is “What are your data work arounds?”
Good question Jackie: “What are your data work arounds?”