CREATE TABLE is a basic statement in SQL, the most widespread computer language for structuring data in databases.
The most common entity in databases must be rows representing real-world human beings (Homo sapiens) and the various groups we form. Such a table could be named Homo_Sapiens, but is usually called Customer, Member, Citizen, Patient, Contact and so on.
The most common data quality issues relate to the accuracy, validity, timeliness, completeness and, not least, uniqueness of the data we hold about people.
In databases tables are supposed to have a unique primary key. There are two basic types of primary keys:
- Surrogate keys are typically numbers with no relation (or binding) to the real world. They are kept invisible to the users of the applications operating on the database.
- Natural keys are derived from existing codes or other data identifying an entity in the real world or made for that purpose. They are visible to users and part of electronic, written and verbal communication.
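The two key types can sit side by side in one table definition. Here is a toy sketch using Python's built-in sqlite3 module; the table name, column names and the national_id natural key are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE Person (
        person_id   INTEGER PRIMARY KEY AUTOINCREMENT, -- surrogate key: meaningless number, hidden by the application
        national_id TEXT UNIQUE,                       -- natural key: visible, quoted in communication
        person_name TEXT NOT NULL
    )
""")
con.execute(
    "INSERT INTO Person (national_id, person_name) VALUES (?, ?)",
    ("010180-1234", "Jane Example"),
)
# The application joins on the surrogate key; people quote the natural key.
surrogate_key = con.execute(
    "SELECT person_id FROM Person WHERE national_id = ?", ("010180-1234",)
).fetchone()[0]
```

Note that the UNIQUE constraint on the natural key only guarantees uniqueness within this one table, not in the real world, which is exactly the challenge discussed here.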
As surrogate keys obviously don’t help with real-world uniqueness, and there is no common global natural key for all human beings on Earth, we have a challenge in creating a good primary key for a Homo_Sapiens table.
Most often we name the tables holding data about human beings after the role people act in within the purpose for which the data is collected, for example a Customer table. A customer may be an individual, but also a household or a business entity. A human being may be a private consumer, but also an employee making purchases on behalf of a business, or a business owner making both private and business purchases.
Every business activity ultimately comes down to interacting with individual persons. But as our data is collected for the different roles that individual may have acted in, we need to view these data in relation to single human beings. The methods for facilitating this come in different flavours:
- Deduplication is the classic term for processes where records are linked, merged or purged in order to make a golden copy with only one (parent) database row for each individual person (and each other legal entity). This is usually done by matching data elements such as names and addresses in a given organisation’s internal tables.
- Identity Resolution has roughly the same goal but – if a distinction is considered to exist – uses a wider range of data, rules and functionality to relate collected data rows to real-world entities. In my eyes, exploiting external reference data will add considerable efficiency to deduplication / identity resolution in the years to come.
- Master Data Hierarchy Management again has the same goal of establishing a golden copy of collected data, but emphasises reflecting the complex structure of relationships in the real world, as well as the related history.
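The matching step these three flavours share can be caricatured in a few lines. This is a deliberately naive sketch (the names and the normalisation rule are invented); real deduplication and identity resolution engines use far richer rules and external reference data:

```python
from collections import defaultdict

# Candidate rows collected under different roles, each with its own surrogate id.
rows = [
    (1, "Phillip Wright"),
    (2, "Phil Wright"),
    (3, "Philip Wright"),
    (4, "Anna Smith"),
]

def match_key(name):
    # Very naive normalisation: first initial plus surname, lowercased.
    parts = name.lower().split()
    return parts[0][0] + " " + parts[-1]

# Group rows whose normalised names collide; each group is a candidate
# for one golden (parent) record per real-world person.
groups = defaultdict(list)
for row_id, name in rows:
    groups[match_key(name)].append(row_id)
```

Here rows 1, 2 and 3 collide on the key "p wright" and become candidates for merging into a single golden record.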
Next time I am involved in a data modelling exercise I will propose a Homo_Sapiens table. I wonder about the odds of buy-in from the other business and technical delegates.
Excellent post Henrik,
Here is a SQL template for the Homo_Sapiens table that will guarantee data quality:
CREATE TABLE Homo_Sapiens (
    PERSON_ID   INTEGER      NOT NULL,
    PERSON_NAME VARCHAR(100) NOT NULL,
    PRIMARY KEY (PERSON_ID)
);

CREATE TRIGGER Perfect_Data_Quality
AFTER INSERT ON Homo_Sapiens
FOR EACH ROW
BEGIN
    IF 1 <= (SELECT COUNT(*) FROM Homo_Sapiens) THEN
        SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Perfect data quality preserved';
    END IF;
END;
Thanks Jim, I see you speak other languages than plain English – SQL is another one.
I have worked on a solution to bypass your data quality constraint and found a way. I will create 6.8 billion tables called Homo_Sapiens_0000000001 and so on, and then later find a way to consolidate the tables.
REVOKE CREATE TABLE FROM Henrik;
Yes, SQL does count as another language I speak – perhaps one of the many universal languages of Information Technology?
DBAs have always been especially cautious of what database privileges I was granted – and for good reason.
🙂 Nice job here, but Henrik, the consolidation will be very simple, the tables are all empty because of Jim’s trigger (1<=count).
Thanks Tibor, my SQL and database skills are rusty. Will it count itself before a COMMIT?
If so, and had Jim not played the REVOKE card, I could have added the person name to the table name.
On the serious side, I have actually seen such Tom and Jerry games (the cartoon’s hungry cat and smart mouse) played between application builders and application users.
Some data quality flaws originate from overly simple constraints.
How many ERP systems have an additional entry in the vendor table for each employee so that additional compensation or expenses can be paid? In this case, a table which should only contain companies (the companies we buy things from) has people in it too. Even if we add an extra attribute for the employee number to enable the lookup, it’s messy. But it’s messy because the original system did not have the functionality to pay employees in the ways we needed.
I wonder if in 50 years, the company resulting from the merger of Oracle and SAP (SAPACLE?) will have an ERP that has “the real world complete model” that solves all these issues by modelling everything at an incredibly fine grain?
Until then I suppose we have to be pragmatic, and do the best we can within our constraints. 😉
To be honest I don’t know exactly whether the trigger can see the count before a COMMIT. Maybe somebody has the time to test it on different DB solutions 🙂
And remaining on the serious side, I would see with pleasure a new blog entry about cartoon figures, e.g. Magilla Gorilla.
Tibor, I love to be serious if it is OK to include cartoon characters.
James, thanks for commenting. I know SAP already has an overwhelming number of tables, so the SAPACLE vision could look like a nightmare.
But sometimes I have the thought that it is a bit strange how fine-grained we model very specific business challenges compared to how clumsily we model the most common business entities, namely party master data.
I recently saw a bespoke system within a small organisation used by their sales consultants. The primary key for the ‘Consultants’ table was based upon consultant name, rather than an ID field. There was also a web-based data entry form that was used to log an activity into an ‘Activity’ table, with a free entry text field for “Sales Consultant”.
This resulted in the following entries in the ‘Activity’ table:
Date | Consultant | Activity
01/01/10 | Phillip Wright | Telephone Call Made
02/01/10 | Phil Wright | E-Mail confirmation sent
02/01/10 | Philip Wright | Meeting booked
You can see the issues of linking the ‘Activity’ table back to the ‘Consultants’ table.
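One conventional fix for this design is a surrogate-key ‘Consultants’ table with a foreign key from ‘Activity’, so a mistyped consultant is rejected instead of silently creating a new name variant. A sketch using Python's sqlite3, with names borrowed from the example above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # sqlite needs this pragma per connection
con.execute("""
    CREATE TABLE Consultants (
        consultant_id   INTEGER PRIMARY KEY,
        consultant_name TEXT NOT NULL
    )
""")
con.execute("""
    CREATE TABLE Activity (
        activity_date TEXT,
        consultant_id INTEGER NOT NULL REFERENCES Consultants (consultant_id),
        activity      TEXT
    )
""")
con.execute("INSERT INTO Consultants VALUES (1, 'Phillip Wright')")
con.execute("INSERT INTO Activity VALUES ('2010-01-01', 1, 'Telephone call made')")

# An activity for an unknown consultant now fails outright
# instead of creating yet another spelling of the same person.
try:
    con.execute("INSERT INTO Activity VALUES ('2010-01-02', 99, 'E-mail sent')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```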
Excellent one, Henrik!!!
As you depict, designing tables, table structures and the composition of unique identifiers must be done from a wider business perspective rather than taking only the present scenario into account.
Your post reminds me of a scenario I encountered during an assignment for one of the world’s leading financial institutions (don’t wanna name them though 🙂 ) back here in India. The bank had a table Customer_CDelinquent that held the records of customers delinquent on credit cards, with the unique identifier structured as Branch_Code + Card_Type + Customer_ID. When the bank introduced the loans vertical a year later, it had to create another table, Customer_LDelinquent, with the unique identifier structured as Branch_Code + Loan_Type + Customer_ID. Eventually there were customers whose records fell into both categories (card + loan), so in order to link these records of the same delinquent customer they had to create a master table, Customer_Delinquent, with the unique identifier Branch_Code + Customer_ID, which became the master (parent) data linked to the loan/card (child) data. But hold on, is the problem solved? No! The tables still reflected only the delinquency tracking part of the business. Finally, after several rounds of iteration, a CUSTOMER_MASTER was created that now forms an important input to all the systems of the bank.
Moral: IT systems must be designed by forecasting the change that could occur in the business environment, and, as your article points out, table naming conventions and unique key structuring must be done in a holistic manner rather than for the current business climate alone!
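The end state described above can be sketched as a parent table keyed on Branch_Code + Customer_ID, with the card and loan delinquency tables as children. This is a guess at the shape, using Python's sqlite3 and invented values:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    CREATE TABLE Customer_Master (
        branch_code TEXT, customer_id TEXT,
        PRIMARY KEY (branch_code, customer_id));

    CREATE TABLE Card_Delinquent (
        branch_code TEXT, customer_id TEXT, card_type TEXT,
        PRIMARY KEY (branch_code, customer_id, card_type),
        FOREIGN KEY (branch_code, customer_id)
            REFERENCES Customer_Master (branch_code, customer_id));

    CREATE TABLE Loan_Delinquent (
        branch_code TEXT, customer_id TEXT, loan_type TEXT,
        PRIMARY KEY (branch_code, customer_id, loan_type),
        FOREIGN KEY (branch_code, customer_id)
            REFERENCES Customer_Master (branch_code, customer_id));
""")
con.execute("INSERT INTO Customer_Master VALUES ('BR01', 'C1001')")
con.execute("INSERT INTO Card_Delinquent VALUES ('BR01', 'C1001', 'GOLD')")
con.execute("INSERT INTO Loan_Delinquent VALUES ('BR01', 'C1001', 'AUTO')")

# The parent key makes it trivial to find customers delinquent on both.
both = con.execute("""
    SELECT COUNT(*) FROM Customer_Master m
    WHERE EXISTS (SELECT 1 FROM Card_Delinquent c
                  WHERE c.branch_code = m.branch_code AND c.customer_id = m.customer_id)
      AND EXISTS (SELECT 1 FROM Loan_Delinquent l
                  WHERE l.branch_code = m.branch_code AND l.customer_id = m.customer_id)
""").fetchone()[0]
```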
Henrik, very good posting here.
Jim Harris, I believe your trigger would raise a mutating table error:
Mutation on a Homo_Sapiens table – it doesn’t get any better than this one!
Thanks Phil and Satesh for providing these real life examples spanning from small to large organisations.
Rich, thanks, mutating tables – an alternative to intelligent design?
Interesting read. I would suggest having a look at party models like the Universal Data Model (http://www.univdata.com/).
Ramon, thanks for sharing.
I am also very fond of generic data models.
Recently I have worked a lot with data management within public transportation. Being a person who uses a private car all too much, I was happy to find the answer to a lot of data modelling challenges in such a generic model (www.transmodel.org) for that trade.
I also remember examining a generic data model for party master data a while ago (can’t find the link now). But I remember it did have a PERSON entity, which is probably a better name than Homo_Sapiens.
I am such a geek…
I’m laughing at Harris’ SQL statements as my own internal processor tries to debug the code.
SELECT NAME, *ID
GROUP BY MAX(GEEK)
ORDER BY NAME
I always suspected you guys were techies at heart!
Great post Henrik –
The cartoon is super, and reminds me of Darwin’s theory of evolution – which is what I have personally observed in every enterprise I have encountered.
For some reason (probably human nature, and the desire to be unique) – “standards” fall by the wayside, to be replaced by mutations and evolutions of both software and data.
On the other hand… If standards, and “the right thing” was done from day one – there would be no need for us…
Thanks Ken. I guess you can say that Data Quality people could be like dentists: they preach the right way but $hope$ that way isn’t followed 🙂
Phil, I have raised my geek score with a new post “Select Company_ID From External_Source where possible”.
Great article. We have something similar implemented in my current shop. Our data architect used data concepts as opposed to traditional modelling, so all individuals, organisations etc. belong to a PARTY table which is then subtyped into PARTY_ORG, PARTY_IND and so on. Very flexible and extensible, but a little complex.
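A minimal sketch of such a supertype/subtype party model, again using Python's sqlite3; the column names are my own invention, not the actual schema described above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    -- Supertype: one row per party of any kind.
    CREATE TABLE PARTY (
        party_id   INTEGER PRIMARY KEY,
        party_type TEXT NOT NULL CHECK (party_type IN ('IND', 'ORG')));

    -- Subtypes share the supertype's key, one-to-one.
    CREATE TABLE PARTY_IND (
        party_id    INTEGER PRIMARY KEY REFERENCES PARTY (party_id),
        given_name  TEXT,
        family_name TEXT);

    CREATE TABLE PARTY_ORG (
        party_id INTEGER PRIMARY KEY REFERENCES PARTY (party_id),
        org_name TEXT);
""")
con.execute("INSERT INTO PARTY VALUES (1, 'IND')")
con.execute("INSERT INTO PARTY_IND VALUES (1, 'Jane', 'Example')")
con.execute("INSERT INTO PARTY VALUES (2, 'ORG')")
con.execute("INSERT INTO PARTY_ORG VALUES (2, 'Example Ltd')")

party_count = con.execute("SELECT COUNT(*) FROM PARTY").fetchone()[0]
individuals = con.execute("""
    SELECT p.party_id, i.given_name
    FROM PARTY p JOIN PARTY_IND i ON i.party_id = p.party_id
""").fetchall()
```

The design choice here is that anything common to all parties (relationships, addresses, roles) can hang off PARTY, while person-specific and organisation-specific attributes live in the subtypes.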
Thanks Muddasir. You’re right. Sometimes you just can’t keep it simple when you make a picture of a complex real world.