A basic way of ensuring data quality in a database is to define that a certain attribute must be filled. This is done by specifying that the value “null” isn’t allowed or as said in SQL’ish: Setting the NOT NULL constraint.
A common data quality issue is that such constraints almost always are too rigid.
In my last post called Notes about the North Pole it was discussed that every place on earth has a latitude and a longitude except that the North Pole – and the South Pole – hasn’t a longitude. So if you have a table with geocodes you can’t set NOT NULL for the longitude if you (though very unlikely) should store the coordinates for the poles. Alternatively you could store 0 for longitude to make it complete – but then it would be very inaccurate. 360 degree inaccurate so to speak.
Another infrequent example from this blog is that every person in my country has a given (first) name and a family (last) name. But there are a few Royal Exceptions. So, no NOT NULL for the family name.
Related to people and places there are plenty of more frequent examples. If you only expect addresses form United States, Australia or India setting the NOT NULL for the state attribute seems wise. But expect foolish values in here when you get addresses from most other parts of the world. So, no NOT NULL for the state.
A common variant of the mandatory state value is when you register for data quality webinars, white papers and so on. Most often you must select from a value list containing the United States of America – in some cases also mixed in with Canadian Provinces. The NULL option to be used by strangers may hide as “Not Applicable” way down the list among states beginning with N.
I usually select Alaska which is among the first states in the alphabetical order – which also brings me back close to the North Pole making my data close to 360 degree inaccuracy.
Very interesting perspective Henrik.
The real question which should be asked is do you really need to ask the questions in the first place? Just because someone is registering for your webinar, do you really need to know their state or providence code? Perhaps just an email, name and phone number are more than sufficient, these registration forms have gotten way too detailed for my liking.
I believe that in “most” situations constraints such as not nulls and foreign keys are hugely valuable. From my experience with forms like the ones you’ve mentioned, folks will typically use default or incorrect values after they “try” to find their “correct” value.
Thanks for writing the post, I really enjoyed it…Rich
Thanks for commenting Rich.
I am actually also old school relational database practitioner – only with some bad experiences around null’s.
And I totally agree with you on registration forms length – sigh.
Hi Henrik
When it comes to NOT NULL values one can be either a pedant or pragmatist.
The pedant would point out that not ALL longitude values are are NOT NULL, citing the example of the North/South Poles. He/she would then build a database allowing NULL values in the field, thus enabling this to left blank in millions/billions of entries, where it is inappropriate, just in case someone wanted to enter the co-ordinates of the north pole.
The pragmatist would accept that there was two exceptions for the situation, e.g the Poles, but would employ the standard accepted convention of entering the value 0 W for these locations. Millions of erroneous entries avoided.
The same is true for Family Names. The pragmatist would ask, “How many of your customers are monarchs?”. If this is a low number – and for most businesses this is likely to be the case – then the pragmatist would, for the sake of data quality, make Family Name NOT NULL, adding the business rule, “If the customer is a monarch, then enter the value ‘King’ or ‘Queen’ here”>
Data quality is definitely all about rigour – yet this should not be mistaken for pedantry.
When it come to data quality, the pragmatic perfectionist will serve the enterprise far better than piqued pedant!
Regards
John
Point taken John. These are exceptions that confirm the rule and we must actually often choose practice before theory or as Yoggi Berra said: “In theory, there’s no difference between theory and practice. But in practice there is.”
Great post Henrik. And an excellent point from John about avoiding pedantry.
However, when balancing the pedantry vs practicality debate it is important to consider the downstream uses of the data.
For example if you insist on making a post code field mandatory (Not NULL) and then extract that onto mailing labels, then it is possible you’ll get Irish customers putting in noise, nonsense, or profanities (my postman loves the last one… and sometimes I leave him a note in a company’s postcode field). Having a clear standard default that people can enter (I’ve seen one company inform Irish customers to type “NONE”.. which is accurate and passes the not NULL constraint but looks odd on an address label)
You also need to consider the up-stream implications. If a field is mandatory, quite often the data quality team or the front-end interface team will seek to put some validation routines into the data capture process. This can often result in people who have no postcodes not being able to complete a transaction because the form won’t accept our fudge. Usually when faced with that on a US form I type “90210” because the pervasiveness of US TV culture means it is one zip code that I know by heart.
That is inaccurate and inconsistent data. But the junkmail still arrives. So does that mean it is ‘fit for purpose’?
This is an interesting discussion. If fields are bit allowed to be null, then a project team needs to think carefully about what the allowable values and what the default value should be. You have to use great care not to default to an answer that is inappropriate if the question has not been asked or that would be wrong.
In regards to address, as the world becomes more and more globalized, it becomes obvious that address needs to be driven off of a country selection. Once your user has selected the country, they can be presented with a list of states or provinces and enter a zip code (in the US) or a postal code for other countries. Graham Rhind has written some great books on the topic of globalization and localization.
Continuing the lines of “what’s practical” I also take into account the expected reporting maturity of those who access of the data.
Two questions come to mind:
1) Am I likely to have users querying with where clause filters on a nullable field?
2) If so, are those users likely to have a clue about NULL’s impact in tertiary logic?
The answer to the 2nd question is nearly always “no” if the system is allowing ad-hoc queries.
For this reason alone I tend to avoid NULL in fields that are candidates for where clause filtering. Proper design? Not really. More accurate reports? Usually.
Thanks a lot Daragh, Lee and Beth for adding to the discussion. Good points on data globalization (a favorite subject of mine), fitting upstream purposes of use and how 3 valued logic could be powerful but usually are painful.
Beware NULL avoidance. For example, if your business process is performing a credit scoring algorithm for, say, a credit card approval, a value or “zero” in “monthly mortgage payment” is evaluated a lot differently than “NULL” (meaning “unknown”) in the same field. I encountered this scenario about 25 years ago and, since then, I have allowed NULL value in any field that could contain it.
As well, I’m pretty safe in saying that the alternative of using NULL surrogates (“space” for character fields, “zero” or “9999[…]9999” for numeric fields, “January 1, 0000” or “December 31, 9999” for date fields, etc.) causes just as much overt SQL value-checking as allowing NULLs does.
To NULL or not to NULL — that is the question. And every business problem and data model deserves to have the question answered for every data fields modeled and implemented.
Data Rocks!!
Thanks for adding Jeff. Seems like the question about using Null can’t be answered with a simple yes or no 🙂