About NULL
https://news.harvard.edu/wp-content/uploads/2022/05/EHT_PR_Main_Image_Wide_2500-1600x900.jpg

About NULL

1. Introduction

Today (17-may-2022) we had a board meeting of DAMA Netherlands. One of my colleagues (Mark van der Veen) proposed a useful way of positioning where we want to make a contribution, using three systems (Note: I'm using the term 'system' very loosely here):

  1. The practical domain where we solve real-world problems in organizations
  2. The scientific domain where we develop theories and contribute to a sound body of knowledge (ideally through a peer-reviewed publications). Theories explain/predict phenomena in the real world, may be invalidated through counter-evidence etc. Here, we mean both fundamental research and applied research
  3. The heuristics domain where we develop rules of thumb, frameworks, and practical tools that are good enough to solve challenges in the real world.

Each of these levels has value / is equally important and ideally there is a balance between them. That is, it would be useful if our heuristics are at least inspired by scientific results. It would be equally useful if people using these heuristics would understand the (scientific) theories in order to evaluate when it is ok to use these heuristics, and when this would entail serious risk.

Also today, a colleague came with an interesting challenge, which is loosely defined as:

I am working at a large governmental organization where we are overhauling the data/systems landscape. The challenge is to come up with a way to ensure that data is stored in / flows between systems in an effective and efficient manner. To this end we want to use a canonical data model in 3NF. We keep having discussions about the use of NULL-values for this canonical model. What is the recommendation here?

When putting these two together during the long drive home after our board meeting, I made the following mental observations:

  1. The relational model is one of the pertinent theories here, and definitely comes from the scientific domain. Key contributors are scholars such as E. F. Codd, C. J. Date, F. Pascal and many others.
  2. The choice to go for a canonical model is one that is inspired by heuristics. Using a canonical model is a method that has proven its value in many situations and is likely to be useful here as well. Along the same lines, the DAMA DMBOK fits in the heuristics space.
  3. The problem that my colleague articulated is a real one with significant impact, and it pays to think about it carefully.

The heuristics fall short to give a proper answer, so we need to go back to the theory. What can we say about the use of NULL-values?

2. What the theory says

It seems safe to say that E. F. Codd is the father of the relational model. I went back to his landmark publication of 1970 [1]. In section 1.3 Codd writes:

The term relation is used here in its accepted mathematical sense. Given set S1, S2, ..., Sn (not necessarily distinct), R is a relation on these n sets if it is a set of n tuples each of which has the first element from S1, the second element from S2, and so on.

This first publication does not yet explicitly mention NULL values (but does already introduce normal forms). It appears (the careful wording is deliberate) that the postion of Codd in this paper is that tuples in a relation represent true propositions about the underlying domain, which is in line with the information principle (loosely: all true statements about a domain are expressed as tuples in relations) and closed world assumption (loosely: all statements in a database are true, and everything that is not in the database is therefore not true).

In a later paper of 1979 [2], Codd changed his position with regard to NULL values. He proposed an extension of the relational model where a "null value" could denote either (1) a value that is currently unknown, or (2) a property inapplicable. It became immediately clear that this meant a departure from the information principle/ closed world assumption since we are now able to include statements of which the truth is presently unknown. To clarify, consider the following expressions:

  1. The PERSON with First name "Bas" and Last name "Van Gils" was born on Birthdate "6-december 1976"
  2. The PERSON with First name "Bas" and Last name "Van Gils" lives in the city NULL

Expression 1 is a true proposition (at least, according to my passport it is true). The second expression is not a proposition: it is neither true or false. In mathematics, it would be called a predicate since there is still an unbound variable. What Codd proposed was that it was "ok" to include these as well in a database.

His long term collaborator, C. J. Date, did not agree. In several of his publications he argues against the use of NULL values. Here, I'll include a quote of [3] to clarify his position:

The fact is, I reject the concept of "nulls" entirely; that is, it is my very strong opinion that nulls have no place in the relational model. (Codd thought otherwise, obviously, but I have strong reasons for taking the position I do.)?

A bit further he explains his position:

The fundamental point I want to make is that certain boolean expressions—and therefore certain queries—produce results that are correct according to three-valued logic but not correct in the real world.??

I should briefly explain the notion of a three-valued logic (3VL). Previously I already mentioned that, under the closed world assumption, propositions either either true or false. That is, there are only two truth-values: true, and false. When we allow NULL-values (and thus: predicates) in our database, we also allow for a third truth value namely unknown. According to C. J. Date, in a 3VL situations we can make inferences that are logically correct, but incorrect in the real world which is certainly an undesirable situation. As an illustration of this point, consider the following:

No alt text provided for this image

This is a relatively simple example using the PostgreSQL database. There is one table (relation) with three attributes. Two of these attributes make up the primary key. For the third, NULL "values" are allowed (NULL isn't a value, it marks the absence of a value). The predicate of this table (relation) is: the PERSON with a FIRSTNAME and a LASTNAME is born in BIRTHCITY.

You would expect that the second query would list everyone who is not born in Tilburg. Instead, it lists only those persons of whom we know that they are not in Tilburg. The subtle difference is that the row (tuple) with the NULL could go either way: it could be that this person was born in Tilburg, but then again, s/he could be born elsewhere. A better way to formulate it is to tell the dbms exactly what to do: look at all the persons that we know, and subtract those whom are known to be born in Tilburg. That would give is the correct answer to our puzzle. This is exactly what the third query does. A colleague from Tilburg University (nickname: Fruit) informed me recently that postgreSQL also supports a "IS DISTINCT FROM" function also gives the correct results (this is what the fourth query does). So certainly, there are ways to deal with NULL for this specific situation, but I'm willing to bet a good bottle of scotch that most practitioners are barely aware of the challenges, let alone the available solutions/work-arounds.

The third scholar I consulted is F. Pascal. Several years ago, I obtained a copy of his "database debunkings" [4]. In article nr 26 he has something to say about NULL-values that goes even further than C. J. Date wrote:

NULL is a SQL-specific implementation of a missing value mark. Codd did not endorse SQL as a true relational data language, let alone NULL, which should not be confused with his solution to missing data consisting of two missing value markers: (1) A-mark for 'applicable but unknown', and (2) I-mark for 'inapplicable'. Unfortunately they were a mistake. Any and all such markers are not values - NULL value is a contradiction in terms -and cannot be treated as such, including by the relational algebra. They violate Codd's own core Information Principle (IP) which mandates that all information in a relational database be represented explicitly and in exactly one way: as values of relation attributes defined on domains.?

After this, he explains that implementing Codd's ideas, in fact, required a 4VL (four-valued logic) and goes on to explain (in my words) that this leads to a) a very confusing situation which is b) hard to implement in theory, and c) is most definitely poorly / incorrectly implemented in the current SQL databases.

3. Applying the theory

Studying the relevant theory, it quickly becomes apparent that it might be possible to specify data models that allow for NULL-values, but it is a Very Bad Idea. More to the point:

  • It appears that a "heuristic" has emerged that "null values are a good way to deal with optional attributes and values that are currently unknown".
  • Studying the theory we can conclude that this is a bad heuristic.
  • A better heuristic would be to avoid null-values, unless you (a) have a very very very good reason for including them, and (b) know exactly what you are doing and why.

My hypothesis is that many data modellers allow optional attributes/ NULL "values" because it appears a quick fix for a practical problem. The teacher and scientist in me wants to say: being lazy is not a good reason for sloppy and poor design, so don't do it.

4. Recommendation to my colleague

This analysis leads to the following recommendation to my colleague.

First, note the purpose of a canonical model: to understand meaning and structure of data across systems. This will be a key design artefact for building and integrating systems. Given the application domain, making (big) mistakes in the data could have serious impact on citizens, so we should be risk averse. Recommendation: do not, under any circumstance, allow sloppy design for this artefact. Instead, use a proper 3NF/BCNF data model (if you'd like to know more about this, please get in touch).

Second, note that the canonical model will be used for implementation of new systems and data flows. Here, too, it pays to be careful and risk averse. If recent history has taught us anything, it is that sloppy design will eventually have Bad Consequence, so be very very careful in what you do.

It could, of course, be the case that performance issues or other considerations lead to denormalization (and thus: null-values). I've heard dba's claim that this is "necessary for performance reasons" often. I am not aware of any scientifically valid experiments in this are but I strongly suspect that this happens a lot less often that is presently claimed. If any such research does exist, then I stand corrected (and would love to see these papers).

What did I learn by writing this today? Two things: (1) understanding theory may be hard, but (2) there is nothing more practical than a good theory.


P.S. Comments and suggestions are more than welcome

Bibliography

[1] E. F. Codd (1970). A relational model of data for large shared data banks. In: communications of the ACM, 13(6).

[2] E. F. Codd (1979). Extending the relational model to capture more meaning. In: ACM transactions on database systems, 4(4).

[3] C. J. Date (2005). Database in depth, relational theory for practitioners. O'Reilly.

[4] F. Pascal (2016). The dbdebunk guide to misconceptions about data fundamentals. Databse debunkings, San Francisco.


No alt text provided for this image
Manfred J.

Professor an der Universit?t Sk?vde, Schweden

2 年

Hi Bas! Thanks for reviving the discussion on NULL values. Pragmatically, NULL can mean different things: the value is not yet known (but will eventually be filled), the value is not applicable for this record, the value was supposed to be known but was not entered. I think Yannis Vassiliou researched the subject earlier. From a logic point of view, a NULL value creates problems when for example retrieving all employees not living in a certain city. In my ConceptBase/Telos/RDF mind, the problem of NULL values does not exist since all relations are binary in ConceptBase/Telos. So, if the city of an employee is NULL in a relatrional database, we would simply not have a fact employee-city(bas,_) in the binary relation employee-city. Closed-world assumption applies here as well. Description logics (DL, a logic of unary and binary predicates) uses the open world assumption. That is, if something is not in the database, then one cannot just assume it is untrue. For example, in DL one could express that each employee has a city in which s/he lives. Then, even if the city is unknown, 'bas' would be returned as an answer to a query to list all employees that live in a city.

回复
Ronald Kunenborg

Enterprise Data Architect, data expert, coach and data modeller

2 年

I recommend that people read Fabian Pascal on this topic. Very clear, very practical in his solutions. One place where you can see his solution at work is in the Anacredit regulation where DNB has published the canonical 3NF model as well as the technical file layout for the reported data. This model contains very few optional fields, almost all of them have been modelled out.

Bas van der Raadt, PhD

Enterprise Ontologist | Founder hapsah.org

2 年

The NULL problem seems to be a very specific problem to relational databases. Graph databases have at least a way of making schema distinctions per entity instance, which a relational database doesn't. That allows for making a distinction between attribute doesn't apply versus value of applying attribute is unknown. One of the reasons I started a personal project of mine is to experiment how to cope in a complete different way wiith these kinds of data problems. I found a way to prevent the NULL problem altogether. You can make schema distinctions between different versions of the same entity type in one schema, so you never have to use the NULL value for that. And you can indicate if a value for an attribute is optional or not within the set of instances of the same type and version. You can have a try yourselves at: www.hapsah.org.

Bas, dank je wel voor deze notitie. Ik ken het probleem van de NULL (in mijn tijd: NILL) waarde al heel lang. De oorzaak ervan is dat databases (ook de relationele) los staan van de werkelijkheid. Ze komen voort uit en zijn behept met nadelen van de opslagtechnologie (lees: disks). Ik citeer je conclusie: What did I learn by writing this today? Two things: (1) understanding theory may be hard, but (2) there is nothing more practical than a good theory. De verwijzing naar Kurt Lewin is me uit het hart gegrepen. Ik voeg er een citaat van Sumantra Ghoshal aan toe: nothing is as dangerous as a bad theory. De theorie?n van Codd en Date zijn goed, maar ze hebben hun beperkingen. Een ervan is dat ze niet zijn bedoeld om de toestand van een wereld af te beelden. Daarvoor moet je terecht bij Wittgenstein of bij de FI-theorie (die je kent). Een feit is het geval of niet, en, zo ja, is bekend of niet. Als het feit van jouw geboorteplaats niet bekend is, is het niet bekend, punt uit. Dan kan er ook geen representatie ervan in een database zitten. Ik zie maar één manier om relationele databases hier goed mee om te laten gaan. Dat is het (uitsluitende) gebruik van binaire relaties. Dat heeft consequenties voor de performance, so what?

Martijn Evers

Enterprise Knowledge Facilitator/Philosopher, Full Scale (Holistic) Knowledge and data Architect

2 年

Impressed, not many know all the theoretical bits around NULL. Alas, normalization in relation to missing values is poorly understood. You failed to mention 6NF, which is far superior to 3NF/BCNF. If you don't use 6NF, you should use variations of 5NF that honor value existence (Nobody knows that one, btw). In practice, we still need to cater understanding the reason behind information that is not supplied. NULLs in databases are only there to fix database normalization issues because databases have the serious drawback that their storage paradigms are directly or indirectly tied to their data schemas. My biggest concerns however are using NULLS in (technical) communication, since implementation issues can be easily mitigated, but communication/meaning is far harder. Regardless if you use NULLS or other markers, Domain value exception handling (i.e. catering for values you don't have) is the real PITA.

要查看或添加评论,请登录

Bas van Gils的更多文章

  • 2024 reflection

    2024 reflection

    It has become a tradition to write an end of year reflection and post it here, on LinkedIN. Typically I do this by…

    4 条评论
  • The new generation, democracy, and data

    The new generation, democracy, and data

    Once or twice per year, I take the time to sit back and reflect, sharing my thoughts on what happened and what I expect…

    5 条评论
  • Review: data meesterschap

    Review: data meesterschap

    Recent zag ik op LinkedIN de aankondiging van het boek "data meesterschap" met als ondertitel "praktijkgerichte…

    1 条评论
  • Meyer's Management Models

    Meyer's Management Models

    There is a proverb that says "May you live in interesting times." Recent months have certainly been that: interesting.

  • Thoughts on data and models

    Thoughts on data and models

    In a recent conversation with prof. Steven De Haes , we discussed the necessity of a central theme in my research.

    18 条评论
  • Reflection 2023

    Reflection 2023

    The year 2023 has been “interesting” to say the least: several highs and lows. Usually I write my reflection a little…

    4 条评论
  • Mid-year reflection 2023

    Mid-year reflection 2023

    It has become a habit to write a reflection twice a year, and to post it here on LinkedIN. I find it useful to look…

    7 条评论
  • AI & running a company

    AI & running a company

    Not so long ago, I stumbled across the LinkedIN posts of Jo?o Ferr?o dos Santos about starting/running a company with…

  • 2022 reflection

    2022 reflection

    December is upon is. It is becoming a tradition for me to write a reflection about the year.

  • In memoriam: Kiske van Gils

    In memoriam: Kiske van Gils

    Introduction In January of 2022, my father passed away. His formal name was Cornelis, Adrianus, Johannes, Anthony…

    66 条评论

社区洞察

其他会员也浏览了