
Torn Between UUIDs and Friendly IRIs? Use Both!

Kurt Cagle

Editor In Chief @ The Cagle Report | Ontologist | Author | Iconoclast

Published: February 5, 2021

Having been involved in the semantic space for more than a decade and a half, I've seen quite a few arguments that seem to be eternal. Do you use upper ontologies or not? Is SHACL better than OWL? Property graphs or semantic graphs? Yet of all the arguments that I've heard, one of the most common is whether it is better to use randomly generated IRIs (usually some form of UUID) or some form of human-friendly IRI.

This is likely to be one of the first questions that people new to knowledge graphs grapple with, and on every semantic project I've worked on it has taken hours to come up with a good answer, with opinions evenly split between the two sides. For instance, which is better?

#Turtle

urn:uuid:3cfaecd7-ef1a-479e-a614-734d48836d25

#Turtle

https://dscnews.com/article/_my-article-on-guids-by-kurt-cagle

If the goal here is to ensure uniqueness, then there is no question that the first expression is unique. A version 4 (random) UUID carries 122 random bits, so there are roughly 5.3 × 10^36 possible values, and you would need to generate on the order of 2.7 × 10^18 of them before the odds of even one collision reached 50 percent. Even in a superfast environment with lots of data, the first form will be about as unique as you could possibly want.
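To put rough numbers on that uniqueness claim, here is a small Python sketch using the standard birthday approximation. It assumes identifiers are drawn uniformly at random from the 122 random bits of a version 4 UUID; the function names are made up for illustration.

```python
import math
import uuid

# A version 4 UUID fixes 6 of its 128 bits (version and variant), leaving 122 random bits.
RANDOM_BITS = 122


def collision_probability(n: int, bits: int = RANDOM_BITS) -> float:
    """Approximate chance that n random identifiers contain at least one duplicate."""
    space = 2 ** bits
    # Birthday approximation: p ~ 1 - exp(-n(n-1) / (2 * space)).
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-(n * (n - 1)) / (2 * space))


def ids_for_probability(p: float, bits: int = RANDOM_BITS) -> float:
    """Approximate number of identifiers needed to reach collision probability p."""
    return math.sqrt(2 * (2 ** bits) * math.log(1 / (1 - p)))


if __name__ == "__main__":
    print(uuid.uuid4().urn)                       # a fresh urn:uuid:... identifier
    print(f"{collision_probability(10**10):.3e}")  # ten billion identifiers
    print(f"{ids_for_probability(0.5):.3e}")       # identifiers needed for a 50% chance
```

Even at ten billion identifiers, the estimated chance of a single collision stays vanishingly small, which is the practical point of the argument.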

What the second form provides, however, is identification. I do not know what the first IRI stands for. Neither do you. I do know what the second indicates. The first form is great if the goal is reducing collisions between identifiers, but from a pedagogical (that is, teaching) perspective, the second expression is far more valuable.

This is especially true for the workhorses of an ontology: the classes and properties that are used to draw relationships together. If I have a situation where I want to say that a particular article is of type article, using UUIDs will give you the following very helpful assertion*:

#Turtle
<urn:uuid:203c11f8-91cd-4c23-8a8a-0be13efd403e>
     <urn:uuid:c3ef1d56-f896-4478-af16-33a1c6357416>
          <urn:uuid:fec14769-c595-4438-8f45-9aa1f017ed55> .

* The statement just given was with tongue firmly in cheek.

The statement just given might very well be the same as the following:

#Turtle
<https://dscnews.com/article/_My-article-on-guids-by-kurt-cagle>
     <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
         <https://dscnews.com/class/_Article> .

Then again, it may be something completely different, and you would never know.

The reality is that we routinely use both UUIDs and friendly IRIs, and even (sometimes) make efficient use of namespaces. The problem, though, is that all too often the either/or conundrum obscures the fact that not only can you use both together, you probably should.

For instance, namespaces (which in my personal opinion are badly underutilized) are far more helpful when you do want to use friendly IRIs.

#Turtle
@prefix article: <https://dscnews.com/article/>.
@prefix class: <https://dscnews.com/class/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.

article:_My-article-on-guids-by-kurt-cagle   rdf:type   class:_Article.

Namespaces work reasonably well in conjunction with friendly IRIs. While this seems like a lot for one statement, when you have ten billion such assertions in your database, being able to manually see what's going on makes a big difference in getting to the root of a problem. It also tends to make it easier to write queries in languages such as SPARQL:

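#SPARQL

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix class: <https://dscnews.com/class/>

select ?article where {
     ?article rdf:type class:_Article
     }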

So given this, why don't we use friendly names more often? There are actually a number of reasons, some valid, some rather bogus:

Friendly names are much more likely to collide with one another (e.g., two different concepts may end up with the same friendly name) than UUIDs.
Generating friendly names can be computationally intensive, especially as the rules may change from one object to the next.
If the content behind a generated friendly name changes, the resulting IRI may no longer accurately describe the content of the resulting record.

For instance, if I changed the name of my article to "Versioning and RDF objects", the IRI would no longer reflect the content of the object in question.
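Two of those objections, collisions and stale names, are easy to see in a short Python sketch. The naming rule here (lower-case the title and author, collapse everything else to hyphens) is a hypothetical one chosen for illustration, not a rule taken from any particular system.

```python
import re
import unicodedata

# Namespace borrowed from the article's examples.
BASE = "https://dscnews.com/article/"


def friendly_iri(title: str, author: str) -> str:
    """Build a friendly IRI from a title and author (hypothetical naming rule)."""
    text = unicodedata.normalize("NFKD", f"{title} by {author}")
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", text).strip("-")
    return f"{BASE}_{slug}"


original = friendly_iri("My Article on GUIDs", "Kurt Cagle")
lookalike = friendly_iri("My article on GUIDs!", "Kurt Cagle")
renamed = friendly_iri("Versioning and RDF Objects", "Kurt Cagle")

print(original)
print(original == lookalike)  # two distinct titles collapse to one IRI
print(original == renamed)    # renaming the article changes the IRI it would generate
```

The first comparison shows a collision between two different source titles; the second shows that once the title changes, the IRI already in circulation no longer matches what the rule would produce.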

However, what if a friendly name were simply another property of a resource whose IRI was a randomly generated GUID? Consider the case above, and let's say that every object did have a UUID name. Blank nodes are in fact one way of talking about such UUID objects, as a Turtle blank node such as

_:article1

may be represented internally as:

urn:uuid:203c11f8-91cd-4c23-8a8a-0be13efd403e

or perhaps (for base 64 fans):

urn:base64:ehNB5L2aAEa91B-AdE61cA

A blank node simply says: here's a randomly generated sequence of characters that can guarantee non-collision.
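As a rough illustration of the two spellings above, the following Python sketch turns a UUID into the shorter base64 form and back. The `urn:base64:` prefix is this article's own notation, so treat it as illustrative rather than as a registered URN namespace; the helper names are likewise invented for the example.

```python
import base64
import uuid


def uuid_to_base64_urn(u: uuid.UUID) -> str:
    """Encode the 16 raw bytes of a UUID as URL-safe base64 without padding."""
    token = base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")
    return f"urn:base64:{token}"


def base64_urn_to_uuid(urn: str) -> uuid.UUID:
    """Reverse uuid_to_base64_urn by restoring the padding and decoding."""
    token = urn.rsplit(":", 1)[1]
    padded = token + "=" * (-len(token) % 4)
    return uuid.UUID(bytes=base64.urlsafe_b64decode(padded))


internal = uuid.UUID("203c11f8-91cd-4c23-8a8a-0be13efd403e")
short = uuid_to_base64_urn(internal)

print(internal.urn)  # the urn:uuid: spelling
print(short)         # the same 128 bits as a 22-character token
```

Both strings carry the same 128 bits, so the choice between them is about readability and length, not uniqueness.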

So, if I create a statement:

#Turtle

_:article             # = urn:7KpAGMxIe
    owl:sameAs article:_My-article-on-guids-by-kurt-cagle;
    rdfs:label "My Article on Guids"^^xsd:string;
    schema:author _:author_kurt_cagle
    .


_:author_kurt_cagle    # = urn:ehNB5L2aA
    owl:sameAs author:_kurtCagle;
    rdfs:label "Kurt Cagle"^^xsd:string;
    .

in essence, what I am doing is associating a UUID with a friendly IRI. If you have a reasoner in use, the engine should allow you to use the friendly IRI anywhere that the UUID is invoked. If you don't (and increasingly that's the case), you can take advantage of SPARQL to follow the link yourself:

#sparql

select ?author ?author_label ?article ?article_label where {
   ?article owl:sameAs article:_My-article-on-guids-by-kurt-cagle.
   ?article rdfs:label ?article_label.
   ?article schema:author ?author.
   ?author rdfs:label ?author_label.
   }

This generates a table along the lines of:

+------------------+----------------+-------------------+-----------------------+
|     author       |   author_label |      article      |   article_label       |
+------------------+----------------+-------------------+-----------------------+
| urn:ehNB5L2aA    | "Kurt Cagle"   |   urn:7KpAGMxIe   | "My Article on Guids" |
+------------------+----------------+-------------------+-----------------------+
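For readers who want to see the lookup without a triple store, here is a minimal Python sketch of the same join. It is an in-memory stand-in, not a SPARQL engine: triples are plain tuples of strings, the internal identifiers are the abbreviated ones from the comments in the Turtle, and `match` and `articles_with_authors` are hypothetical helpers.

```python
from typing import Iterator, Optional

Triple = tuple[str, str, str]

# The five assertions from the Turtle, with blank nodes replaced by their internal names.
TRIPLES: list[Triple] = [
    ("urn:7KpAGMxIe", "owl:sameAs", "article:_My-article-on-guids-by-kurt-cagle"),
    ("urn:7KpAGMxIe", "rdfs:label", "My Article on Guids"),
    ("urn:7KpAGMxIe", "schema:author", "urn:ehNB5L2aA"),
    ("urn:ehNB5L2aA", "owl:sameAs", "author:_kurtCagle"),
    ("urn:ehNB5L2aA", "rdfs:label", "Kurt Cagle"),
]


def match(s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple that agrees with the given terms; None acts as a variable."""
    for t in TRIPLES:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


def articles_with_authors(friendly_iri: str) -> list[dict[str, str]]:
    """Mirror the four triple patterns of the query, starting from the friendly IRI."""
    rows = []
    for article, _, _ in match(p="owl:sameAs", o=friendly_iri):
        for _, _, article_label in match(s=article, p="rdfs:label"):
            for _, _, author in match(s=article, p="schema:author"):
                for _, _, author_label in match(s=author, p="rdfs:label"):
                    rows.append({
                        "author": author,
                        "author_label": author_label,
                        "article": article,
                        "article_label": article_label,
                    })
    return rows


rows = articles_with_authors("article:_My-article-on-guids-by-kurt-cagle")
print(rows)
```

The friendly IRI appears only as the starting point of the lookup; every other step works on the internal identifiers.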

So far, this makes a good argument for using UUIDs. Where things become a little more problematic is when the resource in question is itself a property or a class. For instance, let's say that rather than using rdfs:label and schema:author, we use thing:label and article:author (with appropriately defined namespaces). The Turtle requires three more assertions:

#Turtle

_:thing-label    owl:sameAs thing:label.
_:article_author owl:sameAs article:author.
# Illustrative only: standard RDF does not allow a blank node in the predicate position.
_:article _:article_author _:author_kurt_cagle.

The SPARQL changes accordingly:

#SPARQL

select ?author ?author_label ?article ?article_label where {
   ?article_author owl:sameAs article:author.
   ?thing_label owl:sameAs thing:label.
   ?article ?thing_label ?article_label.
   ?article ?article_author ?author.
   ?author ?thing_label ?author_label.
   }

The first statement in the select block is worth examining. In effect, the variable ?article_author gets assigned the randomly generated UUID of the blank node that has an owl:sameAs relationship with the friendly IRI article:author. Note that article:author is not itself the assigned IRI of the relationship: it only appears as an object, never a subject. So the first statement can be read as "find the UUID that has the owl:sameAs property value of article:author".

So, beyond being some interesting syntactic sugar, why does this matter? It turns out that it matters a great deal for the problem of versioning. The biggest issue that you face with versioning is that triples are not really records. They don't have intrinsic cohesiveness. If I change a property value on a given entity, the collection of triples associated with that subject is now a different entity. In other words, to track versions, each version of an entity must be treated as immutable.

If the system of triples doesn't change once created, then immutability isn't an issue. In any reasonable system, however, entities do change, and you need some way to keep the identity of the object the same while the version changes. This can be done through this same kind of mechanism, along with SPARQL UPDATE.

For instance, let's say that you wanted to add a new triple to the entity that identifies a particular topic. The following represents the initial and updated version of the article entity:

#Turtle
# Namespaces here

_:_author_kurt-cagle owl:sameAs author:_kurt_cagle.
_:_topic_semantics owl:sameAs topic:_semantics.

_:_articleVersion1
     a class:_Article;
     owl:sameAs article:_My_article_on_guids_by_kurt_cagle;
     rdfs:label "My Article on Guids";
     article:author _:_author_kurt-cagle;
     version:creationDate "2019-03-21"^^xsd:date;
     .

_:_articleVersion2
     a class:_Article;
     owl:sameAs article:_My_article_on_guids_by_kurt_cagle;
     rdfs:label "My Article on Guids";
     article:author _:_author_kurt-cagle;
     article:topic _:_topic_semantics;
     version:precedingVersion _:_articleVersion1;
     version:creationDate "2021-01-06"^^xsd:date;
     version:currentVersion "true"^^xsd:boolean;
     .

In this case, _:_articleVersion1 and _:_articleVersion2 are both UUID-based IRIs. They are bound to the same named IRI, article:_My_article_on_guids_by_kurt_cagle. When a new version is created, the version:currentVersion triple is removed from the old version, and a version:precedingVersion triple is added, linking the new version to its immediate predecessor. (This is an exercise that can be done in SPARQL Update, and is left to the reader.)
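The bookkeeping itself is small enough to sketch in Python. This is not the SPARQL Update, only an in-memory stand-in for the same steps, under some loud assumptions: triples are tuples of strings in a set, each property has a single value per version, the author and topic are written as friendly IRIs rather than blank nodes, and `new_version` and `current_version` are hypothetical helper names.

```python
import uuid
from typing import Optional

Triple = tuple[str, str, str]


def current_version(store: set[Triple], public_iri: str) -> Optional[str]:
    """Return the internal IRI flagged as the current version of public_iri, if any."""
    for s, p, o in store:
        if p == "owl:sameAs" and o == public_iri \
                and (s, "version:currentVersion", "true") in store:
            return s
    return None


def new_version(store: set[Triple], public_iri: str,
                changes: dict[str, str], created: str) -> str:
    """Add a new version node for public_iri; earlier versions keep their content."""
    previous = current_version(store, public_iri)
    node = uuid.uuid4().urn  # the randomly generated internal identifier
    carried: dict[str, str] = {}
    if previous is not None:
        # Copy content triples forward, leaving version bookkeeping behind.
        for s, p, o in store:
            if s == previous and not p.startswith("version:"):
                carried[p] = o
        # The old version stops being current; the new one points back to it.
        store.discard((previous, "version:currentVersion", "true"))
        store.add((node, "version:precedingVersion", previous))
    carried.update(changes)
    carried["owl:sameAs"] = public_iri
    for p, o in carried.items():
        store.add((node, p, o))
    store.add((node, "version:creationDate", created))
    store.add((node, "version:currentVersion", "true"))
    return node


store: set[Triple] = set()
PUBLIC = "article:_My_article_on_guids_by_kurt_cagle"
v1 = new_version(store, PUBLIC, {
    "rdf:type": "class:_Article",
    "rdfs:label": "My Article on Guids",
    "article:author": "author:_kurt_cagle",
}, "2019-03-21")
v2 = new_version(store, PUBLIC, {"article:topic": "topic:_semantics"}, "2021-01-06")

print(current_version(store, PUBLIC) == v2)
```

The first version keeps every content triple it started with; only the current-version flag moves, which is what lets the public IRI stay stable while the internal identifiers change.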

To get the most recent version, given the friendly IRI, the query is straightforward:

select ?version where {
     ?version owl:sameAs ?publicArticle.
     ?version version:currentVersion ?isCurrentVersion.
     filter ( ?isCurrentVersion = "true"^^xsd:boolean )
     values ?publicArticle { article:_My_article_on_guids_by_kurt_cagle }
}

The use of the variable ?publicArticle should be instructive: it is an IRI that is used outside of the current system and is publicly available. The internal IRIs, in this case, are irrelevant - they exist primarily to act as unique keys into the triple store.

Additionally, I've not covered the specific use case of blank nodes acting as surrogates for properties, primarily because more than a few triple stores do not let you use blank nodes as predicates. This doesn't mean that you can't create ersatz blank nodes (something like the base64 or UUID-based URNs above, for instance), which can then be used to version classes and predicates. That's a topic for another article, however.

Finally, it should be noted here that I've made use of the owl:sameAs statement as a way of creating an explicit relationship between a named IRI and an anonymously generated one. All too often, people coming into the world of RDF read the property as saying simply that A is the same as B. In an inferential system (one with a reasoner), this is a useful side effect, but what it really means here is that there is a (potentially one-sided) relationship that ties two IRIs together. I could have just as readily called it thing:hasPublicIRI and been just as correct (and perhaps more precise).

Summary

All too often in computing circles you get an either/or mindset where one approach or the other is the ONLY way to work. That mindset, unfortunately, can blind you as a programmer or information architect to the ways that you can work with both approaches to create far more flexibility in the way that you design information spaces. Anonymous vs named IRIs is a classic example of this: be creative, use both!

Kurt Cagle is the author of The Cagle Report and is the Community Editor for Data Science Central, a TechTarget property. When not writing about semantics or editing other people's writing, he writes science fiction and urban fantasy set in the Pacific Northwest.


Sean McGrath

Head of Innovation Propylon

4 years ago

I have grappled with this my whole life :-) A turning point for me was reading Naming and Necessity by Saul Kripke and the concept of rigid designators. I feel a great unease with opaque identifiers. It might be genetic or something because I can see their value :-) But I just cannot..... Anyway, the middle ground I use a lot is semantic identifiers that include a timestamp. The timestamp "locks down" the de-referencing of the identifier to a point-in-time and helps greatly in dealing with the inevitability of change through time. Part of my problem, perhaps, is that I am from the file-system generation. I am used to the idea of giving names to byte-streams with a view to finding them again - by name. The cloud compute paradigm seems to me to be a lot less concerned with naming bytestreams for findability. Instead, we have intermediating layers of filtering systems and search systems that find byte-streams based on their content.

Nicolas Figay

Model Manager | Enterprise Architecture & ArchiMate Advocate | Expert in MBSE, PLM, STEP Standards & Ontologies | Open Source Innovator(ArchiCG)

4 years ago

Kurt Cagle Interesting article on an important topic. Working in the Product Lifecycle Management area, where we have to deal with manufacturing data exchange, sharing and long term archiving within the supply chain and all along the phases of the life of an industrial product (which can be 50 years long), identification of data assets is quite important. We have to face heterogeneous identification or naming systems. One important point: let's avoid providing significant identifiers, as the function and primary usage of an identifier is not to give information. Using names, in particular with a namespace context (which is often related to a decomposition, e.g. enterprise/department/service/uniqueName, where we contextualize the identified thing inside an organization managing the identification process), we risk having to face organization changes (in one year, my company changed its name 3 times) while we considered it immutable. However, as you state, that's theory, and in practice people often (I do it too) rely on friendly names, managing their uniqueness. Fortunately, my favorite ontology editor, Protégé, started to provide features allowing you to switch easily what is presented in the interface (URIs, labels or any other annotation of your choice), making it easier to combine non significant IDs with naming. For something planned for the long term in an industrial context, it is mandatory to manage both in a standardized way. Thanks again for the article, Kurt.
