Torn Between UUIDs and Friendly IRIs? Use Both!
Having been involved in the Semantic space for more than a decade and a half, I've seen quite a few arguments that seem to be eternal. Do you use upper ontologies or not? Is SHACL better than OWL? Property Graphs vs. Semantic Graphs? Yet of all the arguments that I've heard, one of the most common is whether it is better to use randomly generated IRIs (usually using some form of UUID) or some form of human friendly IRI?
This is often one of the first questions that people coming to knowledge graphs grapple with, and nearly every semantic project I've worked on has spent hours trying to come up with a good answer, with opinions evenly split between the two sides. For instance, which is better?
#Turtle urn:uuid:3cfaecd7-ef1a-479e-a614-734d48836d25
or
#Turtle https://dscnews.com/article/_my-article-on-guids-by-kurt-cagle
If the goal here is to ensure uniqueness, then there is no question that the first expression is unique. A version 4 UUID carries 122 random bits (roughly 5.3 × 10^36 possible values), so you would need to generate on the order of a few quintillion of them before a collision became likely. Even in a superfast environment with lots of data, the first form will be about as unique as you could possibly want.
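To put rough numbers on that uniqueness claim, here is a small sketch using Python's standard `uuid` module. It assumes version 4 UUIDs with 122 random bits and uses the usual birthday-bound approximation; the function names are mine and purely illustrative.

```python
import math
import uuid

# A version 4 UUID has 128 bits, of which 6 are fixed (version and variant).
RANDOM_BITS = 122
SPACE = 2 ** RANDOM_BITS


def new_opaque_iri() -> str:
    """Return a freshly generated opaque identifier of the form 'urn:uuid:...'."""
    return uuid.uuid4().urn


def collision_probability(n: int) -> float:
    """Birthday-bound approximation: chance of at least one duplicate among n UUIDs."""
    return -math.expm1(-(n * n) / (2 * SPACE))


def ids_for_even_odds() -> float:
    """Approximate number of UUIDs needed for a 50% chance of a single collision."""
    return math.sqrt(2 * SPACE * math.log(2))


print(new_opaque_iri())
print(collision_probability(10 ** 9))  # estimate for a billion identifiers
print(f"{ids_for_even_odds():.2e}")    # identifiers needed for even odds
```

Under these assumptions, a billion identifiers still leave the collision estimate far below anything you would notice in practice.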
What the second form provides, however, is identification. I do not know what the first IRI stands for. Neither do you. I do know what the second indicates. The first form is great if the goal is to reduce collisions between identifiers, but from a pedagogical (i.e., teaching) perspective, the second expression is far more valuable.
This is especially true for the workhorses of an ontology: the classes and properties that are used to draw relationships together. If I have a situation where I want to say that a particular article is of type article, using UUIDs will give you the following very helpful assertion*:
#Turtle
<urn:uuid:203c11f8-91cd-4c23-8a8a-0be13efd403e> <urn:uuid:c3ef1d56-f896-4478-af16-33a1c6357416> <urn:uuid:fec14769-c595-4438-8f45-9aa1f017ed55> .
* The statement just given was with tongue firmly in cheek.
The statement just given might very well be the same as the following:
#Turtle
<https://dscnews.com/article/_My-article-on-guids-by-kurt-cagle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://dscnews.com/class/_Article> .
Then again, it may be something completely different, and you would never know.
The reality is that we routinely use both UUIDs and friendly IRIs, and even (sometimes) make efficient use of namespaces. The problem, though, is that all too often the either/or conundrum obscures the fact that not only can you use both together, you probably should.
For instance, namespaces (which in my personal opinion are badly underutilized) are far more helpful when you do want to use friendly IRIs.
#Turtle
@prefix article: <https://dscnews.com/article/> .
@prefix class: <https://dscnews.com/class/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

article:_My-article-on-guids-by-kurt-cagle rdf:type class:_Article .
Namespaces work reasonably well in conjunction with friendly IRIs. While this seems like a lot for one statement, when you have ten billion such assertions in your database, being able to manually see what's going on makes a big difference in getting to the root of a problem. It also tends to make it easier to write queries in languages such as SPARQL:
#SPARQL
select ?article where {
    ?article rdf:type class:_Article
}
So given this, why don't we use friendly names more often? There are actually a number of reasons, some valid, some rather bogus:
- Friendly names are much more likely to collide with one another (e.g., two different concepts may end up with the same friendly name) than UUIDs.
- Generating friendly names can be computationally intensive, especially as the rules may change from one object to the next.
- If the content behind a generated friendly name changes, the resulting IRI may no longer accurately describe the content of the record.
For instance, if I changed the name of my article to "Versioning and RDF objects", the IRI would no longer reflect the content of the object in question.
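The collision and drift problems in the list above are easy to reproduce. The sketch below assumes a hypothetical slug rule (lowercase, runs of non-alphanumerics collapsed to hyphens, a `-by-author` suffix) chosen to match the example IRI shown earlier; `slugify` and `friendly_iri` are illustrative names, not part of any library.

```python
import re

BASE = "https://dscnews.com/article/"  # namespace taken from the example above


def slugify(text: str) -> str:
    """Lowercase the text and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def friendly_iri(title: str, author: str) -> str:
    """Build a human-readable IRI from a title and an author name (assumed rule)."""
    return f"{BASE}_{slugify(title)}-by-{slugify(author)}"


original = friendly_iri("My Article on Guids", "Kurt Cagle")
lookalike = friendly_iri("My article on GUIDs!", "Kurt Cagle")
renamed = friendly_iri("Versioning and RDF objects", "Kurt Cagle")

print(original)
print(original == lookalike)  # two different titles collapse to one IRI
print(original == renamed)    # a renamed article no longer matches its old IRI
```

Two slightly different titles land on the same IRI, while renaming the article produces an IRI that no longer matches the one already published.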
However, what if a friendly name were in fact simply another property of a resource whose IRI is a randomly generated GUID? Consider the case above, and let's say that every object did have a UUID name. Blank nodes are in fact one way of talking about such UUID objects, as a Turtle blank node such as
_:article1
may be represented internally as:
urn:uuid:203c11f8-91cd-4c23-8a8a-0be13efd403e
or perhaps (for base 64 fans):
urn:base64:ehNB5L2aAEa91B-AdE61cA
A blank node simply says: here's a randomly generated sequence of characters that can guarantee non-collision.
So, if I create a statement:
# Turtle
_:article    # = urn:7KpAGMxIe
    owl:sameAs article:_My-article-on-guids-by-kurt-cagle ;
    rdfs:label "My Article on Guids"^^xsd:string ;
    schema:author _:author_kurt_cagle .

_:author_kurt_cagle    # = urn:ehNB5L2aA
    owl:sameAs author:_kurtCagle ;
    rdfs:label "Kurt Cagle"^^xsd:string ;
    .
in essence, what I am doing is associating a UUID with a friendly IRI. If you have a reasoner in use, the engine should allow you to use the friendly IRI anywhere that the UUID is invoked. If you don't (and increasingly that's the case), you can take advantage of SPARQL to follow the reference yourself:
#SPARQL
select ?author ?author_label ?article ?article_label where {
    ?article owl:sameAs article:_My-article-on-guids-by-kurt-cagle .
    ?article rdfs:label ?article_label .
    ?article schema:author ?author .
    ?author rdfs:label ?author_label .
}
This generates a table along the lines of:
+---------------+--------------+---------------+-----------------------+
| author        | author_label | article       | article_label         |
+---------------+--------------+---------------+-----------------------+
| urn:ehNB5L2aA | "Kurt Cagle" | urn:7KpAGMxIe | "My Article on Guids" |
+---------------+--------------+---------------+-----------------------+
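For readers who want to see what the query is doing mechanically, here is a minimal in-memory sketch in Python. It treats the triples from the Turtle above as plain tuples and performs the same joins by hand; the helper names are hypothetical, and this is not how a real triple store evaluates SPARQL.

```python
# The example triples, with blank nodes replaced by their (assumed) internal URNs.
triples = {
    ("urn:7KpAGMxIe", "owl:sameAs", "article:_My-article-on-guids-by-kurt-cagle"),
    ("urn:7KpAGMxIe", "rdfs:label", "My Article on Guids"),
    ("urn:7KpAGMxIe", "schema:author", "urn:ehNB5L2aA"),
    ("urn:ehNB5L2aA", "owl:sameAs", "author:_kurtCagle"),
    ("urn:ehNB5L2aA", "rdfs:label", "Kurt Cagle"),
}


def objects(subject, predicate):
    """All objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(predicate, obj):
    """All subjects that have the given predicate and object."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def article_rows(public_iri):
    """Follow owl:sameAs from the friendly IRI, then join labels and authors."""
    rows = []
    for article in subjects("owl:sameAs", public_iri):
        for article_label in objects(article, "rdfs:label"):
            for author in objects(article, "schema:author"):
                for author_label in objects(author, "rdfs:label"):
                    rows.append((author, author_label, article, article_label))
    return rows


for row in article_rows("article:_My-article-on-guids-by-kurt-cagle"):
    print(row)
```

The friendly IRI is only ever used as a lookup key; every other join runs over the opaque internal identifiers.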
So far, this makes a good argument for using UUIDs. Where things become a little more problematic is when the resource in question is itself a property or a class. For instance, let's say that rather than using rdfs:label and schema:author, we use thing:label and article:author (with appropriately defined namespaces). The Turtle requires three more assertions:
#Turtle
_:thing-label owl:sameAs thing:label .
_:article_author owl:sameAs article:author .
_:article _:article_author _:author_kurt_cagle .
The SPARQL changes accordingly:
#SPARQL
select ?author ?author_label ?article ?article_label where {
    ?article_author owl:sameAs article:author .
    ?thing_label owl:sameAs thing:label .
    ?article rdfs:label ?article_label .
    ?article ?article_author ?author .
    ?author rdfs:label ?author_label .
}
The first statement in the select block is worth examining. In effect, the variable ?article_author gets assigned the randomly generated UUID of the blank node that has an owl:sameAs relationship with the friendly IRI article:author. Note that article:author is not itself the assigned IRI of the relationship - it only appears as an object, never a subject. So the first statement can be read as "find the UUID that has the owl:sameAs property value of article:author".
So, while this makes for some interesting syntactic sugar, why does it matter? It turns out that it matters a great deal for the problem of versioning. The biggest issue that you face with versioning is that triples are not really records. They don't have intrinsic cohesiveness. If I change a property value on a given entity, the collection of triples associated with that subject is now a different entity. In other words, to track versions, the entities must become immutable.
If the system of triples doesn't change once created, then immutability isn't an issue. In any reasonable system, however, entities do change, and you need some way to keep the identity of the object the same while letting its versions change. This can be done through this same kind of mechanism, along with SPARQL UPDATE.
For instance, let's say that you wanted to add a new triple to the entity that identifies a particular topic. The following represents the initial and updated version of the article entity:
#Turtle
# Namespaces here
_:_author_kurt-cagle owl:sameAs author:_kurt_cagle .
_:_topic_semantics owl:sameAs topic:_semantics .

_:_articleVersion1
    a class:_Article ;
    owl:sameAs article:_My_article_on_guids_by_kurt_cagle ;
    rdfs:label "My Article on Guids" ;
    article:author _:_author_kurt-cagle ;
    version:creationDate "2019-03-21"^^xsd:date ;
    .

_:_articleVersion2
    a class:_Article ;
    owl:sameAs article:_My_article_on_guids_by_kurt_cagle ;
    rdfs:label "My Article on Guids" ;
    article:author _:_author_kurt-cagle ;
    article:topic _:_topic_semantics ;
    version:precedingVersion _:_articleVersion1 ;
    version:creationDate "2021-01-06"^^xsd:date ;
    version:currentVersion "true"^^xsd:boolean ;
    .
In this case, _:_articleVersion1 and _:_articleVersion2 are both UUID-based IRIs. They are bound to the same named IRI, article:_My_article_on_guids_by_kurt_cagle. When a new version is created, the version:currentVersion triple is removed from the old version, and a version:precedingVersion triple is added to the new version, linking it to its immediate predecessor. (This is an exercise that can be done in SPARQL Update, and is left to the reader.)
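As a stand-in for that exercise, the version bump described above can be sketched with a plain Python set of tuples playing the role of the triple store. This is an illustration under my own assumptions, not SPARQL Update itself: `new_version` and `current_version` are hypothetical helpers, and values are kept as simple strings.

```python
import uuid

SAME_AS = "owl:sameAs"
CURRENT = "version:currentVersion"
PRECEDING = "version:precedingVersion"


def current_version(triples, public_iri):
    """Return the internal node flagged as current for public_iri, or None."""
    for s, p, o in triples:
        if p == SAME_AS and o == public_iri and (s, CURRENT, "true") in triples:
            return s
    return None


def new_version(triples, public_iri, properties):
    """Add a new version node for public_iri and return its opaque internal IRI."""
    previous = current_version(triples, public_iri)
    node = uuid.uuid4().urn  # opaque internal identifier for this version
    triples.add((node, SAME_AS, public_iri))
    for predicate, value in properties.items():
        triples.add((node, predicate, value))
    if previous is not None:
        # The old version keeps all of its data but loses the "current" flag.
        triples.discard((previous, CURRENT, "true"))
        triples.add((node, PRECEDING, previous))
    triples.add((node, CURRENT, "true"))
    return node


store = set()
public = "article:_My_article_on_guids_by_kurt_cagle"
v1 = new_version(store, public, {"rdfs:label": "My Article on Guids"})
v2 = new_version(store, public, {"rdfs:label": "My Article on Guids",
                                 "article:topic": "topic:_semantics"})
print(current_version(store, public) == v2)
```

The earlier version's triples are never edited apart from the current flag, which is what keeps each version effectively immutable.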
To get the most recent version, given the friendly IRI, the query is straightforward:
#SPARQL
select ?version where {
    ?version owl:sameAs ?publicArticle .
    ?version version:currentVersion ?isCurrentVersion .
    filter (?isCurrentVersion = "true"^^xsd:boolean)
    values ?publicArticle { article:_My_article_on_guids_by_kurt_cagle }
}
The use of the variable ?publicArticle should be instructive: it is an IRI that is used outside of the current system and is publicly available. The internal IRIs, in this case, are irrelevant - they exist primarily to act as unique keys into the triple store.
Additionally, I've not covered the specific use case of blank nodes acting as a surrogate for properties, primarily because more than a few triple stores do not let you use blank nodes as predicates. This doesn't mean that you can't create ersatz blank nodes (something like the above base64 or UUID based URN, for instance), which can then be used to version classes and predicates. That's a topic for another article, however.
Finally, it should be noted here that I've made use of the owl:sameAs statement as a way of creating an explicit relationship between a named IRI and an anonymously generated one. All too often people coming into the world of RDF read the property as asserting that A is the same as B. In an inferential system (one with a reasoner), this is a useful side effect, but what it really means is that there is a (potentially one-sided) relationship that ties two IRIs together. I could have just as readily called it thing:hasPublicIRI and been just as correct (and perhaps more precise).
Summary
All too often in computing circles you get an either/or mindset where one approach or the other is the ONLY way to work. That mindset, unfortunately, can blind you as a programmer or information architect to the ways that you can work with both approaches to create far more flexibility in the way that you design information spaces. Anonymous vs named IRIs is a classic example of this: be creative, use both!
Kurt Cagle is the author of The Cagle Report and is the Community Editor for Data Science Central, a TechTarget property. When not writing about semantics or editing other people's writing, he writes science fiction and urban fantasy set in the Pacific Northwest.
Head of Innovation Propylon
4y: I have grappled with this my whole life :-) A turning point for me was reading Naming and Necessity by Saul Kripke and the concept of rigid designators. I feel a great unease with opaque identifiers. It might be genetic or something because I can see their value :-) But I just cannot..... Anyway, the middle ground I use a lot is semantic identifiers that include a timestamp. The timestamp "locks down" the de-referencing of the identifier to a point-in-time and helps greatly in dealing with the inevitability of change through time. Part of my problem, perhaps, is that I am from the file-system generation. I am used to the idea of giving names to byte-streams with a view to finding them again - by name. The cloud compute paradigm seems to me to be a lot less concerned with naming bytestreams for findability. Instead, we have intermediating layers of filtering systems and search systems that find byte-streams based on their content.
Model Manager | Enterprise Architecture & ArchiMate Advocate | Expert in MBSE, PLM, STEP Standards & Ontologies | Open Source Innovator(ArchiCG)
4y: Kurt Cagle Interesting article on an important topic. Working in the Product Lifecycle Management area, where we have to deal with manufacturing data exchange, sharing, and long-term archiving within the supply chain and across all phases of the life of an industrial product (which can be 50 years long), identification of data assets is quite important. We have to face heterogeneous identification and naming systems. One important point: let's avoid significant identifiers, as the function and primary usage of an identifier is not to give information. When using names, in particular with a namespace context (which is often related to a decomposition, e.g. enterprise/department/service/uniqueName, where we contextualize the identified thing inside the organization managing the identification process), we risk having to face organizational changes (in one year, my company changed its name 3 times) even though we considered the name immutable. However, as you state, that's theory, and in practice people often (I do it too) rely on friendly names and manage their uniqueness. Fortunately, my favorite ontology editor, Protégé, has started to provide features that make it easy to switch what is presented in the interface (URIs, labels, or any other annotation of your choice), making it easier to combine non-significant IDs with naming. For something planned for the long term in an industrial context, it is mandatory to manage both in a standardized way. Thanks again for the article, Kurt.