Thoughts on data and models
In a recent conversation with prof. Steven De Haes, we discussed the necessity of a central theme in my research. We concluded that the main themes (plural) of my work are broad. Loosely, you can say that the main theme is digital transformation.
I’ve been thinking about the notion of focus for quite some time. Surely, the notion of the digital divide (TDD) fits within the overall theme of digital transformation. Even more, if we want to say something specific about TDD, then we must build up a good understanding of what this is. I consider that a modeling challenge. When we want to say something about the size and impact of TDD, then we need to collect data. All in all, it seems that particularly data (and data management
Perhaps some focus might be a good idea at some point. For now, researching these concepts is both interesting and fun. I keep in mind the three domains that were suggested by prof. Jan Verelst (and at the same time were paraphrased by Mark van der Veen MMIT-Trilingual):
- We are dealing with real world organizations with real world problems. Data and models are not just some academic issue.
- There are several scientific theories about data/models. These are partly philosophical in nature, and partly based on a study of real world organizations. Sadly, they are insufficient to explain reality and help organizations solve their challenges.
- Then there is also the heuristics domain. Many professionals use “good practices” which may or may not be grounded in scientific theory. There’s nothing wrong with using heuristics, but it would be nice if professionals are at least aware of the underlying scientific theories when applicable. That would, hopefully, ensure that we’ll use the right heuristics at the right time in the right way.
In the near future, I will be working with prof. Henderik Alex Proper on a series of lecture notes (potentially: books) on models and modeling, particularly in the digital transformation space. I suspect that the topic of data will also come up. As a preparation for that, I will use this note to capture my thinking on a number of topics. I will try to refrain from citing a bunch of papers and keep the tone of voice “light”. The purpose is mainly to summarize my thinking and paint a general picture. A formal exploration will surely follow at some point!
Reality
The first thing I want to do is take a stance on reality. In my view, reality exists independent of how we perceive it. Yet, the way we make sense of the world may differ quite a bit. For example, two people may observe some “thing” and call it a rock. One of the two observers may know it to be a diorite whereas the other doesn’t. So far so good. This is a rather simple example that involves physical objects in the real world that we can, quite literally, bump into. The same goes, more or less, for non-physical things. Our two observers (call them A and B) may witness two people making some kind of deal. A may recognize it as a complex refinance deal with all sorts of legal checks and balances, whereas B may recognize it simply as “a deal”. And as a last example, there may be a situation where A recognizes some phenomenon as a complex piece of equipment whereas B doesn’t get any further than calling it a “ding-a-ma-goo” (i.e., no clue what it is). Regardless of how we experience it, I strongly believe that reality is “out there” for us to experience.
Observing and experiencing reality
If reality is “out there” for us to observe and experience, then it makes sense to state that we as humans are part of that reality. A lot has been written about what it means to observe/experience reality. Some discussions focus on the physical aspects: our eyes are able to pick up visual signals of certain frequencies (but not others), whereas our skin is able to detect certain degrees of touch, etc. Others pertain to what an experience “does” for us. For example, running gives a runner’s high, listening to Vespers by Rachmaninov is in and of itself almost a religious experience, and there is a feeling of loss when someone passes away.
For present purposes, we do not have to go that far down the rabbit hole. It is sufficient to state that observing/experiencing reality does something to the brain, so this is where “the magic happens”: this is where we make sense of the world.
Ontology
The notion of ontology is difficult to explain. I’ve had several discussions with Prof. Giancarlo Guizzardi and Prof. Henderik Alex Proper about ontological models, which are even more difficult to express. I do like the second definition that the Merriam-Webster dictionary gives:
A particular theory about the nature of being or the kinds of things that have existence.
In my words, (an) ontology is a set of “glasses” that we can put on, that helps us to see reality from a specific perspective. Allow me to elaborate. Suppose that we see a situation in the real world where some person buys a fountain pen. If we look at this from a financial perspective (that is: adopt a financial ontology for interpreting what is going on), then we will see such things as the reduction in physical assets that the company has but at the same time an increase in financial assets, as well as an obligation to pay taxes at some point. If we adopt an ontology around writing instruments, then we’d see that a fountain pen is a specific kind of writing instrument which may have a specific type of ink filling system, purpose (better for writing than drawing), etc.
Knowledge (and thus also the ontologies that we are comfortable with) grows by learning. As an example, taking a class in finance will strengthen the financial ontology (i.e., the kinds of things that belong to the finance territory).
Last but not least, I also think that our minds are more than capable of making sense of some observation/experience using different ontologies. In our day to day lives we do this all the time. When buying a new fountain pen, I will weigh the advantages and characteristics of the new pen (writing instruments ontology) and consider whether it is worth the investment (financial ontology).
Knowledge and information
The terms knowledge and information are often combined with data and wisdom, after Ackoff’s famous statements about what came to be known as the DIKW pyramid: data - information - knowledge - wisdom. I suspect that his statements are often misinterpreted. Perhaps my interpretation is off, too (meaning: not as he intended it). For what it is worth, this section presents my take.
The “stuff” that we have in our head is knowledge. I believe this to be a finite set of knowledge particles (sometimes referred to as infons in the literature) that are interconnected to form a web. Every observation/experience adds to this web. When we age and start to become forgetful, elements or connections among elements in the web start to fade and go away. Some parts of the web are easily accessible. Others not so much (think of the “Hmmm, let me think about that!” moments you have experienced).
This allows me to also define the term information. In a way, we can see an observation/experience of reality as a state change. We have a certain amount of knowledge before the observation/experience, and a certain amount after the observation/experience. To me, the information (associated with an observation/experience) is the difference between our knowledge before and after that observation/experience.
Note that we can also think about what we already know, so perhaps observation/experience is not the best set of verbs. Thinking about what we already know could lead to new information too.
Note also that we can observe/experience something more than once. There are certain books (e.g., Data and Reality by W. Kent) that I’ve read more than once. Each time you get something new from a re-read. This implies that there is no such thing as “the” information that you can get from an observation/experience.
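The idea that information depends on what the observer already knows can be sketched in a few lines of Python. This is only an illustration under a strong simplifying assumption: knowledge is treated as a flat set of propositions, ignoring the web of connections described above. The function name `observe` and the example propositions are hypothetical.

```python
from __future__ import annotations


def observe(
    knowledge: frozenset[str], observation: frozenset[str]
) -> tuple[frozenset[str], frozenset[str]]:
    """Return (knowledge after the observation, information gained)."""
    after = knowledge | observation    # the observation adds to what we know
    information = after - knowledge    # information = difference before/after
    return after, information


# Two readers with different prior knowledge read the same (hypothetical) text.
text = frozenset({"data are signs", "signs stand for a referent"})
reader_a = frozenset({"databases contain rows"})
reader_b = frozenset({"databases contain rows", "data are signs"})

_, info_a = observe(reader_a, text)
_, info_b = observe(reader_b, text)

print(sorted(info_a))
print(sorted(info_b))
```

The same text yields a different difference for each reader, which is the sense in which there is no such thing as “the” information in an observation.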
This makes knowledge and information mental constructs. Asking for information about a certain topic is, following this way of thinking, a little off. It would be slightly more correct to ask for knowledge about a certain topic. Since that figure of speech is so prevalent, we will simply have to accept it.
Data
What we have discussed so far is mostly a mental process: we observe/experience new things and that has some effect on the knowledge that we have in our brain. We can even pinpoint the elusive term ‘information’. This is useful for our own purposes, but typically there is also a need to interact with others, for example because we write a paper together, work together, or wish to finish some project/adventure together. Since direct brain-to-brain connections are, to the best of my knowledge, not yet feasible, we need another mechanism to convey what is in our mind.
Using the terminology from the field of semiotics, we have to produce a sign that captures our understanding of a referent (domain, the thing we are observing/experiencing) such that that sign can stand for that referent. These signs can be anything: the words we speak when interacting with a colleague, the picture we draw to explain some concept, the formal proposition in predicate logic, or even a tuple from the relational model that expresses such a proposition. As an example of the latter, the tuple
TUPLE { FirstName Name:Bas, LastName Name:Van Gils, BirthDate Date:06-dec-1976 }
represents the proposition:
the Person with first name ‘Bas’ and last name ‘Van Gils’ has birthdate ‘06-dec-1976’
Note that in order to express ourselves, we need to use some language - ideally one that the person we intend to interact with can also understand. The proposition in natural language (or even: restricted natural language) can probably be understood by most English speakers, whereas the tuple-representation is in a language that is likely best understood by people with a mathematics or computer science background. This implies that choosing the language or representation is important.
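To make the point about representation languages concrete, here is a small Python sketch in which one and the same understanding of the domain is rendered as two different signs: a tuple-style notation and a sentence in restricted natural language. The class `PersonFact` and its method names are hypothetical and only serve the illustration; the output formats mimic the example above.

```python
from dataclasses import dataclass
from datetime import date

# Fixed abbreviations so the output does not depend on the system locale.
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def format_date(d: date) -> str:
    """Render a date as dd-mon-yyyy, e.g. 06-dec-1976."""
    return f"{d.day:02d}-{MONTHS[d.month - 1]}-{d.year}"


@dataclass(frozen=True)
class PersonFact:
    first_name: str
    last_name: str
    birth_date: date

    def as_tuple_notation(self) -> str:
        """A sign aimed at readers with a mathematics or CS background."""
        return (
            f"TUPLE {{ FirstName Name:{self.first_name}, "
            f"LastName Name:{self.last_name}, "
            f"BirthDate Date:{format_date(self.birth_date)} }}"
        )

    def as_proposition(self) -> str:
        """A sign in restricted natural language."""
        return (
            f"the Person with first name '{self.first_name}' and "
            f"last name '{self.last_name}' has birthdate "
            f"'{format_date(self.birth_date)}'"
        )


fact = PersonFact("Bas", "Van Gils", date(1976, 12, 6))
print(fact.as_tuple_notation())
print(fact.as_proposition())
```

Both strings stand for the same referent; which one works better depends entirely on who is supposed to read it.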
In my view, data are signs. In my book Data in Context, I define data as “the representation of our understanding of a domain such that it can stand for some domain”. This heavily relies on semiotics and also allows me to conclude that data has structure and meaning. In my book, I elaborate on the notion of context (i.e., a representation can stand for what happens in reality in some context, but perhaps not in another). This is beyond the scope of the present discussion.
The diagram below summarizes the discussion so far:
There is a little bit more to say about the languages that are used for representation. I already alluded to the fact that different languages may be used and that not everyone understands every language. I want to make it explicit that some “languages” (quotes deliberate) can be quite informal: boxes and arrows, a scribble on a piece of paper, a low grunt to signal disagreement. These are all quite capable of representing our understanding of a domain such that they can stand for that domain.
Observations on how we talk about data/information
A common figure of speech is: Please give me some information about X. If you follow the line of reasoning in this paper, then this question should be interpreted as: Please give me the representation of your knowledge about X such that, when I see this representation, it will give me information about X. That would be an awkward question indeed, so perhaps the more colloquial and more common form makes more sense in practice.
We tend to distinguish between structured data and unstructured data. The former usually refers to rows in tables in databases on some platform. The latter tends to refer to e-mails, chats, documents, images etc. Two observations about this: 1) if we follow the line of reasoning in this document, where data represents our understanding of a domain such that it can stand for that domain, then data must have structure and meaning. How else can we conclude that indeed some sign can stand for that domain? 2) modern legislation (such as the archival law [archiefwet in Dutch]) speaks of documents regardless of their shape or form. A row in a table in a database can also be seen as a document. These two observations suggest, to me at least, that all data must have structure and meaning and that therefore the term unstructured data is a terrible misnomer. I propose we use “differently structured data” until we have a better name.
The terms mis-information and dis-information are used frequently (hyphen added for clarity). The terms are difficult to define. Scholars like Floridi have made a brave attempt. In my view, mis-information is the term that we use for data that is not correct (i.e., it misrepresents what is going on in some domain) such that observing this data will lead to information that incorrectly represents something about the world. Dis-information is similar, but here the data is purposely constructed in such a way that observing it will lead to an incorrect understanding of the domain. This is definitely something I will have to study more at a later point in time.
Data and models
I am very much impressed by, and inclined to build on, the work by Prof. Guizzardi and Prof. Proper about domain conceptualization. They speak of a domain model (i.e., a model of a domain, where the term domain should be interpreted in a similar fashion as I have done in this paper so far). Their notion of a domain model is also built on the foundation of the semiotic triangle:
A social artifact that is acknowledged by an observer to represent an abstraction of some domain for a particular purpose.
Note the following:
- The phrase “social artifact” alludes to the fact that it is some sort of representation.
- The artifact is created with a purpose. For example, we can create a model with the purpose of understanding some domain (a good name would be: a conceptual domain model) or we could create a model with the purpose of designing something we want to eventually construct/realize (a good name would be: a blueprint). I am assuming that the purpose also determines the ontology with which we look at the domain.
- Artifact is a very generic term. A boxes and arrows diagram is surely an artifact, but so is a photograph, a miniature building and perhaps even the score of your favorite piece of music. I am assuming that the types of models we are considering are put in a language that is in line with the ontology selected based on the purpose of the modeling exercise.
- The acknowledgement is the equivalent of the can stand for phrase in my definition of data.
- Last but not least, I want to point out that Guizzardi and Proper speak of the abstraction of some domain - rather than the domain itself.
This last point requires a bit more thought. You could argue that the process of observing/experiencing something and making sense of it (leading to information) already is a process of abstraction. Whether that is a productive line of thought? I’m not sure. Something to think about for sure.
Since the definitions of data and model are both rooted in semiotics, I have been pushing the definitions a little to see what the main differences and main similarities are between the two.
- When creating a model, we effectively produce an artifact. It seems safe to say that the artifact is a sign that captures our understanding of a domain and can, indeed, also stand for that domain (which is very much in line with the definition of a model).
- When creating data, we do this for a purpose. We need human assessment to confirm that, indeed, the data can stand for reality - a statement that is dangerously close to “represent an abstraction of some domain”.
- In some cases (e.g., fact-based modeling), the modeling process starts with collecting a representative set of examples (i.e., data!) which is the basis for abstraction, leading to models. We can ask the question whether the model is an abstraction of the data or of the domain (or both!).
I’m inclined to conclude that data can also be a model of a domain (if stakeholders agree that the data is an abstraction of that domain) and that some models are also data. A counterexample is a LEGO miniature of a house which would, in my opinion, be a model but not data. The following diagram explains this further.
Exploring the notion of ‘models’
It seems that professionals with different backgrounds use the term model in many different ways, and going through the motions of aligning terminology has been frustrating at times. In this section, I hope to go through some things that I would call a model, and present some thoughts. I’ll take a simple example of cats in a cat population in some area.
For the first type of model, I’m going to assume that the purpose is to get a shared understanding of the types of things we’re talking about here. Typically this has been the realm of conceptual modelling (or: conceptual domain modeling
The ERD says something about two types of things (the cat population and cats), properties/attributes of these types of things, and a relationship between them. Simply put: a cat population (with some size, that can be computed) consists of cats with a birth date, name, and gender. Perhaps not super fancy, but it illustrates the point that we can study a domain through some ontological lens and create a diagram that captures the abstraction that we’ve made.
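The structure that the ERD describes can also be written down in code. The sketch below is one possible rendering in Python, not a transcription of the original diagram: the class and attribute names are assumptions based on the description above, and the example cats are invented.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Cat:
    """Entity type 'cat' with the three attributes mentioned in the text."""
    name: str
    birth_date: date
    gender: str


@dataclass
class CatPopulation:
    """Entity type 'cat population'; it consists of cats."""
    cats: list[Cat] = field(default_factory=list)

    @property
    def size(self) -> int:
        # The size is derived (computed) rather than stored.
        return len(self.cats)


# Invented example instances, purely for illustration.
population = CatPopulation([
    Cat("Minou", date(2019, 4, 2), "female"),
    Cat("Tom", date(2021, 8, 15), "male"),
])
print(population.size)
```

Like the diagram, this captures only the types of things and their relationship. It says nothing yet about how the population changes over time.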
I would argue that such a diagram is rather static: it captures only the types of things in the domain but not the dynamic aspects of a cat population in terms of its size. This is where system dynamics
Here we are dealing with three variables as well as the interplay between them. From our basic ontology about animals and their reproduction, we know that cat births increase the population which, in turn, leads to more cat births. This is a reinforcing loop. Similarly, we know that cat deaths reduce the cat population which in turn also reduces the number of cat deaths. This is a balancing loop.
Let’s assume we’ve made this model and feel that there is more to know about this domain. Rather than continue our study of that domain, we can now study the model that we’ve just created! This will, hopefully, lead to more insights on the domain that we’re interested in. We could, for example, come to the conclusion that the rate of births and deaths is of interest in this domain. Still in the realm of system dynamics, we could end up with:
Such a stock and flow diagram gives a good overview of the variables in the domain, how they interact, and what that means for the overall cat population. I would argue that this is a pretty abstract model. Not all stakeholders would be able to interpret/understand this correctly. This is not due to the fact that the ontology / resulting information is so complex. The abstraction level and representation language might make it difficult for them to see what is going on. This may be the basis for generating a simple (and more concrete) graph:
Going back to the ERD that we started with, it is simple to see that we are still dealing with the same domain. We’re still using a similar ontology and we’re still trying to understand the domain (purpose). Yet the models that we produce are very different.
I want to point out one last thing about these models. If we put the stock-and-flow model in a tool then we can play with it. Then the model becomes interactive, allowing us to do a simulation. At the same time, if we guesstimate the values of the variables, then the model also becomes a predictive model with the purpose of figuring out if we’re still happy with the number of cats in the area. The correctness (is it a correct abstraction of a domain such that it can stand for that domain) can only be assessed by comparing the predicted values with real values. Would we still call the simulation model data?
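To give an impression of what “playing with” such a model could look like, here is a minimal Python sketch of a one-stock model with a birth inflow and a death outflow. Everything in it is an assumption made for illustration: the yearly time step, the guesstimated rates, the starting population, and above all the “observed” counts, which are made-up placeholders and not measurements.

```python
def simulate_cat_population(initial, birth_rate, death_rate, years):
    """Step a single stock (cats) forward one year at a time."""
    population = float(initial)
    trajectory = [population]
    for _ in range(years):
        births = birth_rate * population   # reinforcing loop: more cats, more births
        deaths = death_rate * population   # balancing loop: more cats, more deaths
        population += births - deaths
        trajectory.append(population)
    return trajectory


def mean_absolute_error(predicted, observed):
    """Average absolute gap between predicted and observed values."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


# Guesstimated parameters (hypothetical).
predicted = simulate_cat_population(
    initial=100, birth_rate=0.30, death_rate=0.20, years=5
)

# Made-up placeholder counts standing in for real yearly observations.
observed = [100, 108, 121, 130, 149, 158]

error = mean_absolute_error(predicted, observed)
print([round(p, 1) for p in predicted])
print(round(error, 2))
```

Changing the rates and re-running is the simulation part. Comparing `predicted` with counts from the real area is the only way to judge whether the model can stand for that domain.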
Parting thoughts
This "short note" turns out to be a bit longer than I originally thought. As the saying goes: sorry for the long letter, I didn't have time to write a short one. Perhaps I'm further with my thinking than I anticipated.
I don't want to go into all the sources that I studied to arrive at this point. Perhaps at a later point, when I turn this into an article. I do want to mention a few people that have influenced my thinking. First, I recall several interesting discussions with Prof. Henderik Alex Proper and Prof. Giancarlo Guizzardi. Some of this also goes back to discussions with prof. Hans Weigand, prof. Patrick van Bommel, and prof. Stijn Hoppenbrouwers. Of course, I also want to mention the discussions with people from the DEMO-community, notably prof. Victor van Reijswoud, prof. Hans Mulder, prof. Mark Mulder, prof. Eduard BABKIN, and prof. Jan Dietz. And last but not least, I want to thank my esteemed colleague Alfred Stern for introducing me to the work of Floridi. Much appreciated!
Data Architect at Tata Steel BV, designer of data constructs.
(10 months ago) This is an interesting one: "The “stuff” that we have in our head is knowledge. I believe this to be a finite set of knowledge particles (sometimes referred to as infons in literature) that are interconnected to form a web. Every observation/experience adds to this web. When we age and start to become forgetful, elements or connections among elements in the web start to fade and go away. Some parts of the web are easily accessible. Others not so much (think of the “Hmmm, let me think about that!”-moments you have experienced)." While this may be true for dementia, I believe that the 'fading' is mostly due to the overload of the brain with knowledge past and present, possibly even updated (state changed) over a lifetime with new and contradicting information. While the 'data' is still there, the number of connections and ontologies to travel and decide between (those state changes) become too much to digest at the speed required.
Ask me how to reduce bias
(10 months ago) Hi, just initial questions while reading: have you seen Soren Brier's cybersemiotics? What is your opinion on it? And also Beynon-Davies' work on information?
Nice article. The data vs model debate is also one of perspective IMO. I’ve written about this here: https://blog.selman.org/2023/05/02/custom-attributes-properties-and-concepts/
Researcher and Developer, Object Event Modeling & Simulation (sim4edu.com/reading/oems)
(10 months ago) Normally, people discussing "data and models", or "data modeling", do not consider a problem domain as a dynamical system, but rather restrict their attention to its static conceptual structures. With your cat population example, you consider a discrete dynamical system (since births and deaths are discrete events), but choose to describe it with a continuous model (adopting System Dynamics diagrams). Notice that it would be more faithful to describe your cat population domain/system with a discrete process model, like in Discrete Event Simulation. In this approach, you would not only model object types, as you did in your ERD of a cat population, but also the two event types "cat birth" and "cat death". Then, after describing the types of objects and events of your domain in an information model, you would also make a process model (e.g., an Object Event Graph) based on the information model. The focus on these two most important categories of things, objects and events, can be motivated/justified by a foundational ontology like UFO, which includes an Ontology of Objects/Endurants (UFO-A) and an Ontology of Events/Perdurants (UFO-B). Giancarlo Guizzardi + Henderik Alex Proper
Full Professor of Computer Science working in the areas of Conceptual Modeling, Formal Ontology, Enterprise Semantics, Advanced Information Systems Engineering, and Foundations of Data Science and AI.
(10 months ago) Interesting reading, Bas van Gils. I think we would answer some of these questions differently (e.g., I don’t think information is a mental construct) but we certainly agree very much on the importance of these questions! Let’s continue these interesting discussions and please keep asking interesting and important questions! (And thanks for citing our stuff)