Intensional and Extensional Quality
There is a lot of talk about the quality of data, but I often have the impression that it is discussed almost as if to exorcise it: its importance is recognized without delving into its complexities, its real meaning, or the implications it has for the quality of whatever is generated from the data, according to the obvious principle that errors are amplified, together with their harmful effects, in the actions that are then implemented by virtue of the decisions taken.
The first aspect that I believe is worth analyzing is that quality is not an absolute value, a sort of ontological property of the data, but depends on the use you want to make of that data. We could say, with a certain freedom of interpretation, that the use determines the very nature of the data and, consequently, also their quality, making it, for example, more than adequate for one specific use and decidedly insufficient for another.
This dependence on use should not be surprising: it is a manifestation of the intentionality that governs our very being in the world and the way we act on it, an action that depends precisely on our intentions, which are strong enough to determine the meaning of things and, therefore, their quality as well.
If we accept this sort of qualitative relativism, we should not pursue an absolute quality of the data - which I doubt could ever exist - but rather understand how deep our intervention should go and, above all, where it should be directed. I believe this last point is crucial: correctly identifying where quality should be investigated is a far more important issue than its measurement itself. This leads us to consider the two aspects that together characterize a piece of data in its entirety - intensionality and extensionality - as the reference elements on which to focus our attention and against which to take note of what can and cannot be done.
While the quality of the extensional component is widely and well treated - after all, it is not difficult to understand whether a measurement or a value is correct and precise - the same cannot be said of the intensional one, that is, of whether a datum, by virtue of what it represents and how it represents it, is suitable for the purposes for which one wants to use it. We must not forget that using data that does not adequately represent what one wants to study, however high the quality of its extension, is decidedly worse than using data that is a perfect conceptualization but whose extension is improvable or even approximate.
In other words, with a perhaps irreverent comparison, we could say that to drive a screw into a piece of wood, a one-euro screwdriver is better than a hundred-euro hammer: even before thinking about the values that a piece of data takes on, we should worry about whether it is the right data to use. Reducing the quality of a piece of data to that of the values that constitute its extension is not only reductive but also dangerous, because it can lead us to believe we are using the perfect piece of data when, due to a wrong choice made right at the start, we are not.
If the breadth of the concept of quality is now clear, it remains to understand what can be done, how far to go, and what to expect from efforts to improve it, issues on which I believe there is a substantial difference between the intensional and extensional components. For the second, which is substantially objective - or rather, to quote John Searle, epistemologically objective - there are well-tested techniques: for example, replacing missing values with values generated from the statistical properties of the known ones, or using synthetic data to increase the sample size or correct imbalances. The question of the intensional component is much more complex, first of all because here the subjective dimension clearly prevails[1], since the definition of the data is shaped by the role played by those who curate it, and secondly because whoever models something inevitably does so in relation to its primary use, the use that actually made the modeling necessary. As Ludwig Wittgenstein reminds us when he tells us that "meaning is use"[2], this use is always "here and now", which does not guarantee that the same person, at a different moment in time, will not be led to call into question what has already been modeled, simply because the use they want to make of it has changed.
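To make the contrast concrete, here is a minimal sketch of one of the well-tested extensional techniques mentioned above: filling missing values with a statistic computed from the known values. It assumes pandas is available; the column names and figures are purely illustrative, not taken from any real dataset.

```python
# A deliberately simple example of extensional quality repair:
# impute each missing value with the mean of the observed values
# in its column. Column names and values are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "temperature": [21.0, None, 23.0, 22.0, None],
    "humidity": [40.0, 42.0, None, 41.0, 43.0],
})

# df.mean() is computed only on the known values; fillna() then
# replaces each gap with its column's mean.
imputed = df.fillna(df.mean(numeric_only=True))

print(imputed)
```

Note that no amount of such repair touches the intensional question: a perfectly imputed "temperature" column is still useless if temperature was the wrong thing to measure in the first place.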
The difficulty of acting on the intensional component, compared with the extensional one, therefore emerges forcefully, a difficulty aggravated by the consideration that we human beings are splendid simplifying machines. On the one hand, this has allowed us to adapt and evolve over time with a minimum expenditure of energy; on the other, it tends to make us use what is already available rather than analyze it critically to verify whether it is precisely what we need, a way of acting that greatly increases the risk of reaching incorrect conclusions - not because the values of the chosen data are of poor quality, but because the chosen data is only an approximate representation of what one really wants to investigate. In short, I have the clear impression that, in the well-known phrase "garbage in, garbage out", the reference to garbage always points to the accuracy of measurement and collection and never to the semantic adequacy of the data being used.
But a solution, or at least an attempt at one, must be made, and so we must ask ourselves what can be done to tend towards a qualitative balance between the two components, particularly for the intensional one. This cannot ignore the fact that every decision is always a choral action, where everyone's subjective vision must harmonize with the corporate one - ideally objective - and where both have equal dignity, given that in every organization there are different objectives and responsibilities which, while working towards a common goal, must necessarily take into account the different points of view that characterize the individual organizational roles.
Therefore, since we cannot eliminate subjectivity - not only is it not possible, it would not even make sense[3] - we must ask ourselves how best to accept this undecidability and make it part of daily action, elevating the inevitable semantic diversity into a value and making it visible, so that awareness of its existence becomes a criterion of choice every time one wants to model new data. If one really feels one cannot do without new data, it can be created by distilling, from the interpretations already made of that data, the elements from which to build its further version: if it is true that reinventing the wheel every time does not make much sense, continuously trying to improve it perhaps does, because each wheel has its ideal terrain on which to roll.
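One way to make this semantic diversity visible is to record every interpretation of a concept alongside the others, so that a new modeling effort starts from what exists rather than from scratch. The sketch below is only an illustration of that idea, not a real catalog design; all names, roles, and definitions in it are hypothetical.

```python
# A hedged illustration: each business concept keeps the full set of
# interpretations made of it, each tied to the role and the "here and
# now" use that motivated it. All content here is invented.
from dataclasses import dataclass, field

@dataclass
class Interpretation:
    owner: str          # the organizational role holding this point of view
    definition: str     # what the datum means for that role
    intended_use: str   # the use that made this modeling necessary

@dataclass
class DataConcept:
    name: str
    interpretations: list = field(default_factory=list)

    def add(self, interp: Interpretation) -> None:
        self.interpretations.append(interp)

customer = DataConcept("customer")
customer.add(Interpretation("sales", "anyone with a signed contract",
                            "revenue reporting"))
customer.add(Interpretation("support", "anyone with an active ticket",
                            "workload planning"))

# Before modeling "customer" yet again, one can inspect - and distill
# from - the interpretations that already exist.
for i in customer.interpretations:
    print(i.owner, "->", i.definition)
```

The point is not the code itself but the discipline it encodes: the different meanings coexist explicitly instead of silently overwriting one another.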
This desire for awareness, and the way of acting that follows from it, affects various dimensions within every organization, whether public or private. The cultural one, because everyone must free themselves from the innate tendency to start from scratch, in the belief that their needs are unique and that what already exists is therefore unsuitable in principle. The organizational one, because a collective way of working must be spread and supported - one that respects the needs of individuals but, at the same time, brings them back to a sort of overall conceptual harmony, starting from what is already available without precluding the definition of new data where there is a proven need. The technological one, because any organizational system is doomed to failure unless solutions are made available that manage the underlying complexity and hide it from users, giving them the simplicity of use that lets them concentrate on what needs to be done rather than on how to do it: solutions offering features to evaluate the extensional quality and, where possible, correct it as previously described, but above all creating a digital place where the information heritage can be explored and understood, because this is always where we must start, keeping the link between the two components alive. In the end, the choice of data will always be a dance, where intensionality leads and extensionality follows, and if this does not happen with harmony and grace, the stumble will always be around the corner.
#dataquality #semantics #dataintegration #datamarketplace #datacatalog
[1] The ontological objectivity defined by John Searle does not help here, because we are not talking about the concept itself, but rather about its representation and it is precisely this aspect that introduces an inevitable subjectivity. For example, according to Searle, mountains are ontologically objective, in the sense that they exist independently of a perceiving subject, but their representation within a data model or ontology will be the work of a specific subject or group of subjects and, inevitably, it will tend to capture some characteristics to the detriment of others.
[2] “Philosophische Untersuchungen” – Ludwig Wittgenstein – 1953. The full form is “The meaning of a word is its use in language”.
[3] One might think of adopting an autocratic organizational model, where a group of elected officials have the power to impose what they believe to be the meaning of each piece of data, but I doubt that such a choice leads anywhere other than self-annihilation.