The Case for RDF (Revisited)
Over the past few years, the uptake of RDF has grown steadily. In some domains, such as asset management and systems engineering, this growth is quite significant and driven by national and European standards. In this article, I revisit the core arguments for choosing RDF as the prime vehicle for representing and exchanging data and data models. Now that project proposals involving RDF are becoming more common, it is useful to have these arguments at hand. To bring out the unique features of RDF more clearly, I contrast it with XML. In the final section, I answer the question of why RDF is so popular in the domain of asset management by showing how its core advantages help meet important requirements.
… then they fight us
Despite popular belief, it was not Mahatma Gandhi who said that first they ignore you, then they laugh at you, then they fight you, and then you win. Be that as it may, as far as the uptake of RDF is concerned, we are most certainly in the then-they-fight-you stage. During the past 18 months, I have personally witnessed three separate RDF implementation projects come under heavy attack. In one case, the plug was pulled before the project could properly start. In all three cases, the resistance came from people who argued that the project goals would be much better served by XML. Why use obscure and unproven technologies when we have dozens of highly experienced XML experts in house? RDF tooling is so much more complex and expensive than XML tooling. Oh, and by the way, RDF is XML!
For people who, like me, have a passion for RDF and semantic technology, this is the best news in years. When incumbent players with vested interests start picking fights with the underdog, large-scale change is in the air.
It also means that, as RDF enthusiasts, we had better get our act together. Project plans involving RDF will be proposed and discussed more often than before. Many of us will be in a position to weigh in, directly or indirectly. Unfortunately, when put on the spot, it is often hard to make the case for RDF clearly and concisely. I remember several recent occasions where, afterwards, I had at least some regrets. If you have been working with RDF for a long time, your focus tends to be on the details. I used to read, think, and talk frequently and intensively about the bigger picture and the potential of RDF back when we were still mostly ignored. So I dug up the main arguments from memory, so that the next time I have a chance to make the case for RDF, I will be prepared.
The core argument for RDF
The core argument for RDF rests on three properties it uniquely combines — I am indebted to Ralph Hodgson of TopQuadrant for the precise wording of the following. RDF is a machine-readable language for representing knowledge (including data and data models) that supports:
- Well-partitioned inheritance with the ability to define traits (aspects) for property mixins
- Composability both at design time and run-time; this we call the “graph coalescence principle”
- Introspection, that is, the ability to query and use data models at run-time
Let us zoom in on each of these properties.
Inheritance
Class hierarchy constitutes one of the most basic and essential tools of knowledge engineering and has been used to make sense of the world around us since ancient times.
A class hierarchy is needed to express generalizations properly. The notion of a class implies that there are instances of that class. Inheriting properties by being an instance of a class and by being a subclass of a parent class are two different things. There are also different modes of being a subclass. Take a person who is also a customer of some web shop. Being a customer does not involve identity; being a person does. You can stop being a customer and still be the same person, but you cannot turn into an orange, or otherwise cease to exist as a person, and still be the same customer. The relevant notion of a mixin is defined in the literature on formal ontology, see Guizzardi (2005); an example called “Spatial thing” follows later on. (Be careful not to confuse the ontological notions of class, instance, and mixin with what these terms mean in object-oriented programming.) RDF and RDFS support all these notions by design; they constitute, incidentally, a significant part of these languages' semantics. XML has no semantics and no notion of class.
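The distinction between instance-of and subclass-of can be made concrete with a small sketch. The following plain-Python fragment models triples as tuples; the ex: names are invented for illustration, and a real system would of course use an RDF library and SPARQL rather than hand-rolled loops. It shows how a resource inherits class membership through the rdfs:subClassOf hierarchy:

```python
# Two inheritance modes RDF(S) distinguishes:
# rdf:type (instance-of) vs rdfs:subClassOf (subclass-of).
graph = {
    ("ex:Cologne", "rdf:type", "ex:City"),               # instance-of
    ("ex:City", "rdfs:subClassOf", "ex:Place"),          # subclass-of
    ("ex:Place", "rdfs:subClassOf", "ex:SpatialThing"),  # mixin-like trait
}

def superclasses(cls, g):
    """Transitive closure over rdfs:subClassOf."""
    found = set()
    todo = [cls]
    while todo:
        c = todo.pop()
        for s, p, o in g:
            if s == c and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                todo.append(o)
    return found

def types_of(resource, g):
    """All classes a resource belongs to, direct and inherited."""
    direct = {o for s, p, o in g if s == resource and p == "rdf:type"}
    return direct | {sup for c in direct for sup in superclasses(c, g)}

print(sorted(types_of("ex:Cologne", graph)))
# ['ex:City', 'ex:Place', 'ex:SpatialThing']
```

In SPARQL, the same traversal is a one-liner with the property path `a/rdfs:subClassOf*`; the point here is only that the inheritance semantics is part of the data, not of application code.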
Composability
The next feature is composability. RDF represents knowledge as a graph: a set of statements, each consisting of a pair of nodes connected by a labelled edge. From this elementary mathematical property, it follows that any two graphs can be combined into a larger graph without changing anything in either of them. In contrast, you cannot combine two arbitrary XML documents or two arbitrary database tables into a new document or table without doing work and changing things. The graph coalescence principle is important for supporting the modularity of data graphs and model graphs, and for combining data from different sources with ease.
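Because an RDF graph is literally a set of triples, merging is just set union. A minimal sketch (triples as tuples; the ex: and wd: identifiers are illustrative, not real vocabulary terms):

```python
# Graph coalescence: merging two RDF graphs is plain set union.
# No restructuring, no schema migration, nothing in either graph changes.
local_graph = {
    ("ex:Cologne", "ex:ballColour", "green"),
    ("ex:Cologne", "owl:sameAs", "wd:Q365"),
}
remote_graph = {
    ("wd:Q365", "wd:population", "1073096"),
}

merged = local_graph | remote_graph  # that is all it takes

print(len(merged))  # 3 -- both sources intact, now one queryable graph
```

Compare this with merging two XML documents: there is no general union operation on trees, so every combination of sources needs bespoke transformation code.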
Introspection
The third and last feature to discuss is introspection. In the classical approach to managing data with computers, the data model in terms of which the data are expressed is implicit, at least from the viewpoint of the computer. It is hard-coded, partly in the database structure or the XML document, partly in application logic. In RDF, the data model is expressed in the same language as the data itself. Hence, a computer can use the model to do useful things with the data: generating data display and data entry forms, executing validation rules and other rules, and making use of more advanced forms of knowledge representation.
As an example of this latter capability, consider that the function of a pump is to displace fluids or gas by creating pressure differentials. In a data set that describes assets including pumps, you would not want to explicitly specify this for every pump separately. Rather, you define this in the data model. Next, if you need to select all assets with a particular function in a certain area, you need to query the data graph and the model graph at the same time. Of course, people have found ways to work with that kind of information in XML and relational databases too. But they must invent methods that are specific to each case. In RDF, we use standard knowledge engineering vocabulary to achieve this. A previous article of mine discusses the pump example — taken from a real-world case — in more detail.
How this creates value: an example
Let me now illustrate how these three features work together to create value using a simple example. Consider the following screen image of TopBraid EDG, the generic RDF platform created by TopQuadrant. It displays a dataset with geographical data that lives on my local computer. Some of the data, however, are not in this local dataset but are retrieved from another dataset that lives on the web: wikidata.org. To get the population size of Cologne displayed, a simple mapping rule in the data model declares how this property value can be retrieved from wikidata.org using the source’s data structure. I also added a rule that calculates the size of the circle on the map based on the size of the population. You can download the data graph and the model graph (along with other examples) from the TopQuadrant website.
This example illustrates each of the three properties simultaneously. Introspection is at play in several ways: the form layout displaying the data is generated on the fly from information in the data model, and some of the displayed data are obtained from elsewhere or computed in real time following rules defined in the model. The data model is not a passive UML diagram sitting in a PDF document. Rather, it is a working piece of declarative code that does useful things at run-time.
Composability plays an equally important part. The local graph containing my geography data is combined with the graph describing the data model. This latter graph combines with SKOS, a generic data model graph published by W3C, on which it is based. In addition, the local data graph combines with the wikidata data graph, which itself coalesces with its data model. I could go on describing the graphs that play a role here. Why isn't everyone simply blown away by the amazing power that the principle of graph coalescence offers?
Finally, the example would be unthinkable without the different modes of inheritance that are at play. The resource displayed on the form is Cologne. This resource is of a type, specifically, the type “City”. This statement determines which properties Cologne can, must and cannot have. Some properties, like ball size and colour (for displaying on the map) and the optional property “is capital of” are specifically defined for cities. Other properties are inherited. The property “preferred label” (including the rules governing its use) is inherited from the class of concepts, of which City is (ultimately) a subclass. The properties longitude and latitude are also inherited by City, but from a class called Spatial thing, a mixin.
Before ending this section, let us turn to the critique from the sceptics quoted earlier. I plead guilty to the charge that the tool I use here is indeed complex and expensive. It is also very useful for processing and managing data, much more so than simple and cheap XML tools. By the way, if you do not want to spend money on tools, you can use the free, open-source RDF platform VocBench. With a little effort, the same example works on that platform too, as it does on any platform that supports the relevant standards. In fact, now that I think of it, this is the fourth fundamental property of RDF: it is completely and rigorously standardized. Not only its syntax (as with XML), but also its semantics.
Why Object Type Libraries need RDF
An Object Type Library, or OTL for short, is a data model for asset data. It defines the language in terms of which we can express knowledge — and communicate — about these assets. The information flow supporting the management of infrastructural assets is increasingly driven by RDF-based OTLs. In the Netherlands, most OTLs are published in closed networks, but some of them are published publicly, such as Waternet OTL, CB-NL, RWS OTL, and GWSW. Today, there are about 200 asset databases available (though only to a restricted audience) that use GWSW as their data model.
A growing body of Dutch and European standards pertaining to OTL design all but requires the use of RDF. RDF-based OTLs are the object of serious research. RDF is at the core of the business strategy of the Netherlands' Cadastre, Land Registry and Mapping Agency (Kadaster). This development results from the growing need for simplicity in combining information from different sources. To achieve this simplicity, asset data need to be FAIR: findable, accessible, interoperable, and reusable. RDF is an essential tool for making that possible.
Information about infrastructural assets is typically dispersed over a vast landscape of different systems and different private and public organisations. With RDF, this data soup transforms into a coherent knowledge graph that effortlessly spans borders of any kind. Processes that depend on this information can therefore be disruptively improved. Among other things, exchanging large amounts of information about assets between organisations would become much more straightforward.
When different parties use the same OTL, information can be exchanged seamlessly. But this is difficult to achieve. The asset owner will require the use of their own OTL, which is, of course, designed to optimally support the owner’s requirements. A construction company involved in constructing or maintaining the asset must deal with many different customers, each requiring a different OTL. Thus, the ability to translate is key. A standardized semantics of inheritance helps with that, as do graph coalescence (so that it is relatively easy to express mapping rules) and introspection (so that we can query data and models simultaneously).
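The shape of such a translation can be sketched as a declarative mapping applied uniformly over triples, which is essentially what a SPARQL CONSTRUCT rule does. The otlA:/otlB: terms below are made up for the example:

```python
# OTL-to-OTL translation as a declarative term mapping over triples.
# In practice this would be a SPARQL CONSTRUCT query; the principle is
# the same: rewrite terms, leave the graph structure untouched.
mapping = {
    "otlA:PumpStation": "otlB:Gemaal",
    "otlA:hasCapacity": "otlB:capaciteit",
}

source = {
    ("ex:PS7", "rdf:type", "otlA:PumpStation"),
    ("ex:PS7", "otlA:hasCapacity", "250"),
}

translated = {
    (mapping.get(s, s), mapping.get(p, p), mapping.get(o, o))
    for s, p, o in source
}
print(sorted(translated))
# [('ex:PS7', 'otlB:capaciteit', '250'), ('ex:PS7', 'rdf:type', 'otlB:Gemaal')]
```

Because both OTLs are graphs with standardized inheritance semantics, the mapping rules themselves can be stored as just another graph and queried introspectively.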
The most common needs that an OTL must satisfy include the following. An OTL typically must offer the capability to:
Relate the data model (OTL) to other data models
- To translate data structures, you need to map your OTL to another party’s OTL using mapping rules
- To enrich your asset data with information from different sources, you need to link your OTL to other ontologies such as BOT (building topology), OMG (geometry linking), DOT (damage linking and topology), and many, many more
Define a class hierarchy of assets to express that, for instance, a booster station is a wastewater station, which is a pump station, which is a constructed object, which, finally, is a physical object. The notions of class and inheritance need to have a formally standardized meaning to avoid misinterpretation when translating data
Express constraints on these classes and articulate rules that relate them to other things, such as a rule that says that a two-pylon bridge is a bridge that consists of exactly two pylons and one bridge deck. This need arises for several reasons:
- To manage data quality, by validating existing and incoming data against the business rules
- To be able to make selections of resources from the OTL for specific purposes. A project that constructs a two-pylon bridge is not interested in tunnels or railway crossings: only in the relevant subclass of bridge plus all related classes and properties. That requires that the business rules describing asset types are explicitly modelled in the OTL. Through these rules, we can select the resources “pylon” and “bridge deck” along with “two-pylon bridge” from the OTL
- To generate forms for displaying and editing data. The OTL becomes so large that handcrafting forms is not feasible
Provide a means to define compositional structures in a unified way, such as a breakdown of:
- Parts and their subparts (a bridge has pylons as parts)
- Functions and subfunctions (a pylon transfers load)
- Requirements and more elementary requirements (a cable must have a strength)
- Work, units of work and tasks
Support modularity. An OTL quickly becomes very large and very complex. In practice, organisations struggle to create, use, manage, and maintain their OTLs. A modular structure is necessary to support these processes effectively.
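The selection requirement above (extracting just the two-pylon-bridge fragment from a large OTL) can be sketched in the same triples-as-tuples style. All otl: names are invented for illustration; a real OTL would express the composition rules in a standard vocabulary such as SHACL or OWL:

```python
# Selecting an OTL fragment by following explicitly modelled rules:
# start from a class and keep everything reachable via subclass and
# part-of relations, ignoring unrelated branches (tunnels, crossings).
otl = {
    ("otl:TwoPylonBridge", "rdfs:subClassOf", "otl:Bridge"),
    ("otl:TwoPylonBridge", "otl:hasPart", "otl:Pylon"),
    ("otl:TwoPylonBridge", "otl:hasPart", "otl:BridgeDeck"),
    ("otl:Tunnel", "otl:hasPart", "otl:TunnelTube"),
}

def fragment_for(cls, g):
    """Class plus everything reachable via subClassOf / hasPart."""
    keep = {cls}
    todo = [cls]
    while todo:
        c = todo.pop()
        for s, p, o in g:
            if s == c and p in ("rdfs:subClassOf", "otl:hasPart") and o not in keep:
                keep.add(o)
                todo.append(o)
    return keep

print(sorted(fragment_for("otl:TwoPylonBridge", otl)))
# ['otl:Bridge', 'otl:BridgeDeck', 'otl:Pylon', 'otl:TwoPylonBridge']
```

Note that nothing tunnel-related is selected: the business rules, being part of the graph, determine the module boundary.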
These requirements can be met by making use of RDF's unique features: inheritance, composability, and introspection, and, of course, its unique level of standardization. It is no surprise, therefore, that RDF plays a role of growing importance. Suppose we used XML instead to express OTLs. All the features that RDF delivers by design would have to be coded in application logic, and there are no standards out there that offer guidance on this. Such a project would be a complete waste of money. We can expect, therefore, that the uptake of RDF in the context of OTLs will continue to speed up for years to come.
…and then we win
The domain of asset management is not unique. Other domains deal with similar problems and could equally profit from cashing in on the unique features that RDF offers. On the other hand, change is not easy. It necessitates acquiring new skill sets and saying goodbye to old ones. Organisations need to climb a long and potentially fraught learning curve. Investments are necessary before you reap the benefits. Mistakes will be made. Moreover, there is the cultural change that the introduction of RDF involves. Ultimately, however, when it offers real benefits, new technology cannot be stopped. After asset management, many other domains will follow. And if we keep our act together, we can speed up this process.