On Ontologies, LLMs & Domain Specific Languages (DSLs)
What is a programming language? I think, before ChatGPT, Gemini, etc., that this was one of those questions with an obvious answer. A programming language was a set of instructions for a computer to follow to achieve a particular objective. By that light, C++, Java, Javascript, Python, R, etc. were programming languages, while SQL, SPARQL, XPath, etc., were, well, something else. Query languages, perhaps. Spreadsheets were applications. Everything had a bucket, and all was right with the world.
Today, things are not quite so cut and dried, although I'm not sure that the boundaries were ever as precise as we tended to make them. All programming languages, every single one of them, started with the intention of making programming more accessible. Every language designer has claimed, in one form or another, that this or that language will simplify the process of writing programs and make them easier to understand.
Javascript started as a very simple mechanism to connect DOM elements on a web page, using a syntax that borrowed a lot from Java. It has since become one of the most used (and abused) languages on the planet, primarily because it was needed for the web, was eventually incorporated into back-end servers, and has now become integral to just about all of the major applications in use today. Python has a similar history and would likely have been one more fairly obscure language (cf. Ruby) had someone not developed a set of libraries for data analytics, thus catapulting it past R, which was explicitly designed to do data analytics but not much else.
These languages, for the most part, are imperative. As a general rule of thumb, most languages fall into one of six major categories:
It should be noted that none of these is absolute - you can program declaratively with Javascript, for example, even though most people use it imperatively - but they do generally describe broad patterns.
The Value and Peril of DSLs
A Domain Specific Language (or DSL) is usually a special-purpose language intended to perform a certain limited set of actions. As the name suggests, DSLs typically have a very limited scope, such as DOT or MermaidJS, both DSLs for creating network graphs and related diagrams. DSLs, more often than not, are declarative - they describe the state of things relative to other things. For instance, SVG, Scalable Vector Graphics, is a DSL describing vector-based graphics. It is normally expressed in XML, but nothing explicitly requires XML - it could have just as readily been expressed in JSON, Turtle, YAML, or, for that matter, in any other arbitrary format.
The primary benefit of a DSL is that it is a specialist's tool: it does one set of tasks very well without imposing the burden of learning a general-purpose language. This makes DSLs astonishingly useful to subject matter experts, who often have very specific requirements. One interesting case study I saw recently involved using a DSL to encode street characteristics for urban planning applications. Indeed, many DSLs start out as notational shorthand that then gets standardized and systematized to do things like generate diagrams, annotations, documents, and so forth.
Of course, this often exposes the flip-side to this benefit. The notation that is created often makes a great deal of sense to the original user, but as others begin to adopt it, the flaws in the original model can become obvious. Such flaws may be syntactical (the notation may be inconsistent, for instance), but these things can usually be fixed by agreeing upon a standard syntax, then deprecating older syntax in favor of the new.
Sometimes the flaws come when trying to mix different modality types - trying to add imperative features to a declarative language, for instance (I've made this neat airplane description language, now I want the airplane to fly). Because such use cases are often not obvious until after the DSL has been around for a while, adding that level of functionality may be beyond the scope of the original design, and the new functionality consequently often ends up as yet another DSL that is only marginally connected to the first.
This points to the big headache in DSL design - creating modularization and extensibility. Modularization - the art of encapsulation - is a key part of most larger languages but because DSLs typically are intended for very specific use cases, the idea of rebuilding with modularization and extensibility in mind is often not on the original designer's wish list.
For a long time, I have felt that one of HTML's biggest flaws was that it was not designed for extensibility, meaning that it couldn't evolve as a language. Language purists will often argue that extensibility can dilute the language's value. In reality, lack of extensibility simply means that people have to go outside of the language itself to extend it inconsistently, because they WILL extend it somehow.
DSLs and Ontologies
An ontology is a DSL. There, I've said it. I recently wrote an article about RDF and Mermaid diagrams, where I hinted at this point, but I want to make it explicit here. Once you strip away the syntax, the expression of the ontology in a particular dialect, what you ultimately have is semantics, or put another way, intent. A DSL is the representation (specification) of a model (a conceptualization).
For instance, let's say that I wanted to create a genealogy model. Real models for genealogies can get very complicated (as is true of almost all social institutions), but at its core, a genealogy is about marriage (in any of its myriad forms) and offspring. I can mock up a simple DSL for that:
Jane Doe <=> Dennis Fry:
    Marcia Fry <=> Bill Goethe:
        Kevin Goethe
    => Jenna Marques:
        Lila Fry-Marques
    Michael Fry <=> Elaine Shoe:
        Dan Shoe
This is a very basic DSL. It could be extended with dates and things like that, but let's keep this simple for now. It tells us a lot of things, however: we have a way of identifying both people and marriages. The notation:
A <=> B :
tells us that two people (A and B) married and had zero or more children. Children are listed on indented lines beneath the marriage. A single name indicates an unmarried individual, while a child who married is written using the same A <=> B : syntax. When the same person has multiple (presumably sequential) marriages, an arrow followed by a name (=> C :) indicates a further marriage for the previously listed child.
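To make those rules concrete, here is a minimal Python sketch of a parser for the notation. It is not a reference implementation: the `Marriage` class, the `parse_genealogy` function, and the four-space indentation are my own assumptions for illustration, as is the convention that the first-named spouse on an indented line is the child of the enclosing marriage.

```python
from dataclasses import dataclass, field

# Sample input, using four-space indentation (an assumption of this sketch).
SAMPLE = """\
Jane Doe <=> Dennis Fry:
    Marcia Fry <=> Bill Goethe:
        Kevin Goethe
    => Jenna Marques:
        Lila Fry-Marques
    Michael Fry <=> Elaine Shoe:
        Dan Shoe
"""

@dataclass
class Marriage:
    spouses: tuple
    children: list = field(default_factory=list)

def parse_genealogy(text):
    """Return the marriages found in the DSL text, in order of appearance."""
    marriages = []
    stack = []        # (indent, Marriage) pairs for the enclosing marriages
    last_person = {}  # indent -> first-named spouse most recently seen there
    for raw in text.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip())
        line = raw.strip()
        # Drop marriages that do not enclose the current line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1] if stack else None
        if line.endswith(":"):
            body = line[:-1].strip()
            if body.startswith("=>"):
                # Further marriage of the person named at this indent level;
                # that person is already recorded as a child of the parent.
                first = last_person[indent]
                second = body[2:].strip()
            else:
                first, second = (s.strip() for s in body.split("<=>"))
                last_person[indent] = first
                if parent is not None:
                    parent.children.append(first)
            marriage = Marriage((first, second))
            marriages.append(marriage)
            stack.append((indent, marriage))
        elif parent is not None:
            # A bare name is an unmarried child of the enclosing marriage.
            parent.children.append(line)
    return marriages

marriages = parse_genealogy(SAMPLE)
for m in marriages:
    print(" & ".join(m.spouses), "->", m.children)
```

The parser keeps only what the notation itself states: who married whom, and which children belong to which marriage. Everything else (identifiers, classes, properties) belongs to the model layer.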
This can be expressed in RDF, using Turtle:
@prefix Person: <https://www.example.com/ns/Person#> .
@prefix Marriage: <https://www.example.com/ns/Marriage#> .

# Every individual is typed as a Person.
Person:JaneDoe a Person: .
Person:DennisFry a Person: .
Person:MarciaFry a Person: .
Person:BillGoethe a Person: .
Person:KevinGoethe a Person: .
Person:JennaMarques a Person: .
Person:LilaFryMarques a Person: .
Person:MichaelFry a Person: .
Person:ElaineShoe a Person: .
Person:DanShoe a Person: .

# Each marriage names its spouses; children point back to the marriage.
Marriage:m1 a Marriage: ;
    Marriage:hasSpouse Person:JaneDoe ;
    Marriage:hasSpouse Person:DennisFry ;
    .
Person:MarciaFry Person:childOf Marriage:m1 .
Person:MichaelFry Person:childOf Marriage:m1 .
Marriage:m2 a Marriage: ;
    Marriage:hasSpouse Person:MarciaFry ;
    Marriage:hasSpouse Person:BillGoethe ;
    .
Person:KevinGoethe Person:childOf Marriage:m2 .
Marriage:m3 a Marriage: ;
    Marriage:hasSpouse Person:MarciaFry ;
    Marriage:hasSpouse Person:JennaMarques ;
    .
Person:LilaFryMarques Person:childOf Marriage:m3 .
Marriage:m4 a Marriage: ;
    Marriage:hasSpouse Person:MichaelFry ;
    Marriage:hasSpouse Person:ElaineShoe ;
    .
Person:DanShoe Person:childOf Marriage:m4 .
Now, this is more verbose (and in general Turtle will be), but the important point is that there DOES exist an underlying ontological model, which the DSL simply represents in shorthand notation. That same model can also be drawn as a Mermaid diagram.
This diagram can be expressed using the Mermaid Markdown language as follows:
flowchart BT
    Person:JaneDoe([<b>Person</b>\nJaneDoe]):::Person
    Person:DennisFry([<b>Person</b>\nDennisFry]):::Person
    Person:MarciaFry([<b>Person</b>\nMarciaFry]):::Person
    Person:BillGoethe([<b>Person</b>\nBillGoethe]):::Person
    Person:KevinGoethe([<b>Person</b>\nKevinGoethe]):::Person
    Person:JennaMarques([<b>Person</b>\nJennaMarques]):::Person
    Person:LilaFryMarques([<b>Person</b>\nLilaFryMarques]):::Person
    Person:MichaelFry([<b>Person</b>\nMichaelFry]):::Person
    Person:ElaineShoe([<b>Person</b>\nElaineShoe]):::Person
    Person:DanShoe([<b>Person</b>\nDanShoe]):::Person
    Marriage:m1([<b>Marriage</b>\nJane Doe And Dennis Fry]):::Marriage
    Marriage:m2([<b>Marriage</b>\nMarcia Fry And Bill Goethe]):::Marriage
    Marriage:m3([<b>Marriage</b>\nMarcia Fry And Jenna Marques]):::Marriage
    Marriage:m4([<b>Marriage</b>\nMichael Fry And Elaine Shoe]):::Marriage
    Marriage:m1 -- has spouse --> Person:JaneDoe
    Marriage:m1 -- has spouse --> Person:DennisFry
    Person:MarciaFry -- child of --> Marriage:m1
    Marriage:m2 -- has spouse --> Person:BillGoethe
    Marriage:m2 -- has spouse --> Person:MarciaFry
    Person:KevinGoethe -- child of --> Marriage:m2
    Marriage:m3 -- has spouse --> Person:MarciaFry
    Marriage:m3 -- has spouse --> Person:JennaMarques
    Person:LilaFryMarques -- child of --> Marriage:m3
    Person:MichaelFry -- child of --> Marriage:m1
    Marriage:m4 -- has spouse --> Person:MichaelFry
    Marriage:m4 -- has spouse --> Person:ElaineShoe
    Person:DanShoe -- child of --> Marriage:m4
    classDef Person stroke:black,fill:blue,color:white;
    classDef Marriage stroke:black,fill:green,color:white;
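Because the Turtle and the Mermaid source describe the same triples, one can be generated mechanically from the other. The following Python sketch shows the idea under some simplifying assumptions: the triples are held as plain tuples rather than parsed from Turtle, the `label_for` and `to_mermaid` helpers are hypothetical names of my own, and node labels use the local name only (so a marriage is labelled `m1` rather than with the spouses' names).

```python
import re

# A few of the genealogy triples, written as (subject, predicate, object).
TRIPLES = [
    ("Marriage:m1", "Marriage:hasSpouse", "Person:JaneDoe"),
    ("Marriage:m1", "Marriage:hasSpouse", "Person:DennisFry"),
    ("Person:MarciaFry", "Person:childOf", "Marriage:m1"),
]

def label_for(predicate):
    """Turn a prefixed camelCase name such as 'Marriage:hasSpouse' into 'has spouse'."""
    local = predicate.split(":", 1)[1]
    return re.sub(r"(?<!^)(?=[A-Z])", " ", local).lower()

def to_mermaid(triples):
    """Emit Mermaid flowchart source with one node per resource and one edge per triple."""
    nodes = []
    for subject, _, obj in triples:
        for node in (subject, obj):
            if node not in nodes:
                nodes.append(node)
    lines = ["flowchart BT"]
    for node in nodes:
        cls, local = node.split(":", 1)
        # The prefix doubles as the Mermaid class name used for styling.
        lines.append(f"    {node}([<b>{cls}</b>\\n{local}]):::{cls}")
    for subject, predicate, obj in triples:
        lines.append(f"    {subject} -- {label_for(predicate)} --> {obj}")
    return "\n".join(lines)

diagram = to_mermaid(TRIPLES)
print(diagram)
```

The `classDef` styling lines are left out here; they are presentation rather than model, and can be appended as fixed text.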
One of the advantages of creating such a model is that you can also experiment to see if there are improvements to be made. For instance, suppose that a person has a property Person:hasMarriage.
Person:JaneDoe Person:hasMarriage Marriage:m1 .
In SHACL, this would be represented as:
@prefix sh: <http://www.w3.org/ns/shacl#> .

# A Person may be the child of a Marriage.
Person: a sh:NodeShape ;
    sh:targetClass Person: ;
    sh:name "Person" ;
    sh:property Person:childOf ;
    .
Person:childOf a sh:PropertyShape ;
    sh:name "child of" ;
    sh:path Person:childOf ;
    sh:class Marriage: ;
    .
# A Marriage has spouses, each of which is a Person.
Marriage: a sh:NodeShape ;
    sh:targetClass Marriage: ;
    sh:name "Marriage" ;
    sh:property Marriage:hasSpouse ;
    .
Marriage:hasSpouse a sh:PropertyShape ;
    sh:name "spouse" ;
    sh:path Marriage:hasSpouse ;
    sh:class Person: ;
    .
This has nothing to do with syntax, but everything to do with underlying meaning or intent. This SHACL (everything in the sh: namespace) can be thought of as a DSL for describing structure or schema, and isn't limited to Turtle notation.
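To show that the shapes carry meaning independently of Turtle, here is a hand-rolled Python sketch that applies just the two sh:class constraints above to data held in ordinary dictionaries and tuples. This is an illustration only, not a SHACL engine: the `SHAPES` table and the `check` function are assumptions of mine, and a real project would use a proper SHACL validator instead.

```python
# The two sh:class constraints from the shapes above, restated as a lookup:
# subject class -> {property path -> required class of the value}.
SHAPES = {
    "Person:": {"Person:childOf": "Marriage:"},
    "Marriage:": {"Marriage:hasSpouse": "Person:"},
}

def check(types, triples, shapes=SHAPES):
    """Return the triples whose object does not have the class its shape requires."""
    problems = []
    for subject, predicate, obj in triples:
        expected = shapes.get(types.get(subject), {}).get(predicate)
        if expected is not None and types.get(obj) != expected:
            problems.append((subject, predicate, obj))
    return problems

# rdf:type assertions for a small slice of the genealogy.
types = {
    "Person:MarciaFry": "Person:",
    "Person:BillGoethe": "Person:",
    "Person:KevinGoethe": "Person:",
    "Marriage:m2": "Marriage:",
}
good = [
    ("Marriage:m2", "Marriage:hasSpouse", "Person:MarciaFry"),
    ("Marriage:m2", "Marriage:hasSpouse", "Person:BillGoethe"),
    ("Person:KevinGoethe", "Person:childOf", "Marriage:m2"),
]
# A modelling mistake: a child pointing at a person instead of a marriage.
bad = good + [("Person:KevinGoethe", "Person:childOf", "Person:BillGoethe")]

print(check(types, good))
print(check(types, bad))
```

The same table could be loaded from JSON, YAML, or Turtle; the constraint it expresses does not change.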
DSLs, Templates and LLMs
Large Language Models (LLMs) such as ChatGPT are template-driven, meaning that, rather than actually storing data as structures, they use proximity metrics in a very large information space to determine what properties get retrieved, usually in whatever format the prompt asks for.
For instance, given the above schema (and adding birth, death, and marriage dates), I asked ChatGPT to convert the above schema to JSON-LD:
{
  "@context": {
    "sh": "https://www.w3.org/ns/shacl#",
    "xsd": "https://www.w3.org/2001/XMLSchema#",
    "Person": "https://example.org/Person#",
    "Marriage": "https://example.org/Marriage#"
  },
  "@graph": [
    {
      "@id": "Person:PersonShape",
      "@type": "sh:NodeShape",
      "sh:targetClass": "Person:Person",
      "sh:property": [
        {
          "sh:path": "Person:hasChild",
          "sh:nodeKind": "sh:IRI",
          "sh:class": "Person:Person",
          "sh:name": "has child"
        },
        {
          "sh:path": "Person:dateOfBirth",
          "sh:datatype": "xsd:date",
          "sh:name": "date of birth"
        },
        {
          "sh:path": "Person:dateOfDeath",
          "sh:datatype": "xsd:date",
          "sh:name": "date of death",
          "sh:minCount": 0
        }
      ]
    },
    {
      "@id": "Marriage:MarriageShape",
      "@type": "sh:NodeShape",
      "sh:targetClass": "Marriage:Marriage",
      "sh:property": [
        {
          "sh:path": "Marriage:hasSpouse",
          "sh:nodeKind": "sh:IRI",
          "sh:class": "Person:Person",
          "sh:name": "has spouse"
        },
        {
          "sh:path": "Marriage:dateMarried",
          "sh:datatype": "xsd:date",
          "sh:name": "date married"
        }
      ]
    }
  ]
}
with the prompt command:
Comprehensively convert the SHACL format to JSON
The "comprehensively" request was to ensure that ChatGPT wouldn't cheat by providing only a fragment. ChatGPT prefers working with JSON over Turtle, so I usually make requests transformed to JSON for any templating work.
Once you have the template, you can then ask ChatGPT to populate that structure:
Given the JSON based SHACL, retrieve a JSON output showing the descendants of Queen Elizabeth II and Prince Philip.
which generates the output:
{
  "@context": {
    "Person": "https://example.org/Person#",
    "Marriage": "https://example.org/Marriage#"
  },
  "@graph": [
    {
      "@id": "Person:QueenElizabethII",
      "@type": "Person:Person",
      "Person:name": "Queen Elizabeth II",
      "Person:dateOfBirth": "1926-04-21"
    },
    {
      "@id": "Person:PrincePhilip",
      "@type": "Person:Person",
      "Person:name": "Prince Philip",
      "Person:dateOfBirth": "1921-06-10",
      "Person:dateOfDeath": "2021-04-09"
    },
    {
      "@id": "Marriage:QueenElizabethIIAndPrincePhilip",
      "@type": "Marriage:Marriage",
      "Marriage:hasSpouse": ["Person:QueenElizabethII", "Person:PrincePhilip"],
      "Marriage:dateMarried": "1947-11-20"
    },
    {
      "@id": "Person:PrinceCharles",
      "@type": "Person:Person",
      "Person:name": "Prince Charles",
      "Person:dateOfBirth": "1948-11-14",
      "Person:childOf": "Marriage:QueenElizabethIIAndPrincePhilip"
    },
    {
      "@id": "Person:PrincessDiana",
      "@type": "Person:Person",
      "Person:name": "Princess Diana",
      "Person:dateOfBirth": "1961-07-01",
      "Person:dateOfDeath": "1997-08-31"
    },
    {
      "@id": "Marriage:PrinceCharlesAndPrincessDiana",
      "@type": "Marriage:Marriage",
      "Marriage:hasSpouse": ["Person:PrinceCharles", "Person:PrincessDiana"],
      "Marriage:dateMarried": "1981-07-29"
    },
    {
      "@id": "Person:PrinceWilliam",
      "@type": "Person:Person",
      "Person:name": "Prince William",
      "Person:dateOfBirth": "1982-06-21",
      "Person:childOf": "Marriage:PrinceCharlesAndPrincessDiana"
    },
    {
      "@id": "Person:PrincessCatherine",
      "@type": "Person:Person",
      "Person:name": "Princess Catherine",
      "Person:dateOfBirth": "1982-01-09"
    },
    {
      "@id": "Marriage:PrinceWilliamAndPrincessCatherine",
      "@type": "Marriage:Marriage",
      "Marriage:hasSpouse": ["Person:PrinceWilliam", "Person:PrincessCatherine"],
      "Marriage:dateMarried": "2011-04-29"
    },
    {
      "@id": "Person:PrinceGeorge",
      "@type": "Person:Person",
      "Person:name": "Prince George",
      "Person:dateOfBirth": "2013-07-22",
      "Person:childOf": "Marriage:PrinceWilliamAndPrincessCatherine"
    },
    {
      "@id": "Person:PrincessCharlotte",
      "@type": "Person:Person",
      "Person:name": "Princess Charlotte",
      "Person:dateOfBirth": "2015-05-02",
      "Person:childOf": "Marriage:PrinceWilliamAndPrincessCatherine"
    },
    {
      "@id": "Person:PrinceLouis",
      "@type": "Person:Person",
      "Person:name": "Prince Louis",
      "Person:dateOfBirth": "2018-04-23",
      "Person:childOf": "Marriage:PrinceWilliamAndPrincessCatherine"
    },
    {
      "@id": "Person:PrinceHarry",
      "@type": "Person:Person",
      "Person:name": "Prince Harry",
      "Person:dateOfBirth": "1984-09-15",
      "Person:childOf": "Marriage:PrinceCharlesAndPrincessDiana"
    },
    {
      "@id": "Person:MeghanMarkle",
      "@type": "Person:Person",
      "Person:name": "Meghan Markle",
      "Person:dateOfBirth": "1981-08-04"
    },
    {
      "@id": "Marriage:PrinceHarryAndMeghanMarkle",
      "@type": "Marriage:Marriage",
      "Marriage:hasSpouse": ["Person:PrinceHarry", "Person:MeghanMarkle"],
      "Marriage:dateMarried": "2018-05-19"
    },
    {
      "@id": "Person:ArchieMountbattenWindsor",
      "@type": "Person:Person",
      "Person:name": "Archie Mountbatten-Windsor",
      "Person:dateOfBirth": "2019-05-06",
      "Person:childOf": "Marriage:PrinceHarryAndMeghanMarkle"
    },
    {
      "@id": "Person:LilibetMountbattenWindsor",
      "@type": "Person:Person",
      "Person:name": "Lilibet Mountbatten-Windsor",
      "Person:dateOfBirth": "2021-06-04",
      "Person:childOf": "Marriage:PrinceHarryAndMeghanMarkle"
    }
  ]
}
This same output can be used to generate diagrams automatically, with a simple prompt:
Given this data and the SHACL schema, generate a Mermaid diagram that shows the relationships between marriages and the descendants of Queen Elizabeth II and Prince Philip.
Generating the following diagram:
Finally, we can roundtrip back to the original DSL, as follows:
Given the following Genealogical Domain Specific Language (DSL) as a template, generate a corresponding DSL output based upon the JSON data.
Jane Doe <=> Dennis Fry:
    Marcia Fry <=> Bill Goethe:
        Kevin Goethe
    Marcia Fry <=> Jenna Marques:
        Lila Fry-Marques
    Michael Fry <=> Elaine Shoe:
        Dan Fry
This then generated the following genealogical DSL output for the family of Queen Elizabeth II.
Queen Elizabeth II <=> Prince Philip:
    King Charles <=> Princess Diana:
        Prince William <=> Princess Catherine:
            Prince George
            Princess Charlotte
            Prince Louis
        Prince Harry <=> Meghan Markle:
            Archie Mountbatten-Windsor
            Lilibet Mountbatten-Windsor
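The same round trip can also be done deterministically, which is a useful cross-check on what the LLM returns. Below is a minimal Python sketch that walks a JSON graph of the shape shown earlier and prints the genealogy DSL. The `to_dsl` function is hypothetical, the graph is a trimmed subset of the generated data, and repeated marriages are written out in full (`Name <=> Spouse:`) rather than with the `=>` shorthand.

```python
# A trimmed subset of the generated graph, using the same property names.
GRAPH = [
    {"@id": "Person:QueenElizabethII", "@type": "Person:Person",
     "Person:name": "Queen Elizabeth II"},
    {"@id": "Person:PrincePhilip", "@type": "Person:Person",
     "Person:name": "Prince Philip"},
    {"@id": "Marriage:QueenElizabethIIAndPrincePhilip", "@type": "Marriage:Marriage",
     "Marriage:hasSpouse": ["Person:QueenElizabethII", "Person:PrincePhilip"]},
    {"@id": "Person:PrinceCharles", "@type": "Person:Person",
     "Person:name": "Prince Charles",
     "Person:childOf": "Marriage:QueenElizabethIIAndPrincePhilip"},
    {"@id": "Person:PrincessDiana", "@type": "Person:Person",
     "Person:name": "Princess Diana"},
    {"@id": "Marriage:PrinceCharlesAndPrincessDiana", "@type": "Marriage:Marriage",
     "Marriage:hasSpouse": ["Person:PrinceCharles", "Person:PrincessDiana"]},
    {"@id": "Person:PrinceWilliam", "@type": "Person:Person",
     "Person:name": "Prince William",
     "Person:childOf": "Marriage:PrinceCharlesAndPrincessDiana"},
    {"@id": "Person:PrinceHarry", "@type": "Person:Person",
     "Person:name": "Prince Harry",
     "Person:childOf": "Marriage:PrinceCharlesAndPrincessDiana"},
]

def to_dsl(graph, root_marriage_id, indent="    "):
    """Render the marriage tree rooted at root_marriage_id in the genealogy DSL."""
    names = {n["@id"]: n["Person:name"] for n in graph if n["@type"] == "Person:Person"}
    marriages = {n["@id"]: n for n in graph if n["@type"] == "Marriage:Marriage"}
    children = {}
    for node in graph:
        parent = node.get("Person:childOf")
        if parent:
            children.setdefault(parent, []).append(node["@id"])

    def render(marriage_id, depth, lead=None):
        spouses = list(marriages[marriage_id]["Marriage:hasSpouse"])
        if lead in spouses:
            # Put the descendant first, as the DSL examples do.
            spouses.remove(lead)
            spouses.insert(0, lead)
        first, second = spouses
        lines = [f"{indent * depth}{names[first]} <=> {names[second]}:"]
        for child in children.get(marriage_id, []):
            own = [mid for mid, m in marriages.items() if child in m["Marriage:hasSpouse"]]
            if not own:
                lines.append(f"{indent * (depth + 1)}{names[child]}")
            for mid in own:
                lines.extend(render(mid, depth + 1, lead=child))
        return lines

    return "\n".join(render(root_marriage_id, 0))

dsl = to_dsl(GRAPH, "Marriage:QueenElizabethIIAndPrincePhilip")
print(dsl)
```

Comparing this output with what the LLM produces is a cheap way to spot dropped or invented family members.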
By the way, I also asked ChatGPT for some suggestions to add dates (a critical component) to the above DSL, and it made the following suggestion that I rather liked (I've updated some of the out-of-date information here as well):
Queen Elizabeth II (1926-2022) <=1947=> Prince Philip (1921-2021):
    King Charles (1948-) <=1981-1996=> Princess Diana (1961-1997):
        Prince William (1982-) <=2011=> Princess Catherine (1982-):
            Prince George (2013-)
            Princess Charlotte (2015-)
            Prince Louis (2018-)
        Prince Harry (1984-) <=2018=> Meghan Markle (1981-):
            Archie Mountbatten-Windsor (2019-)
            Lilibet Mountbatten-Windsor (2021-)
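One reason I like this extension is that it stays easy to read by machine. As a rough sketch (the regular expressions and the `parse_person` and `parse_line` helpers are my own assumptions, and only four-digit years are handled), a single line of the dated notation can be taken apart like this in Python:

```python
import re

# "Name (born-died)", where the death year may be absent.
PERSON = re.compile(r"^(?P<name>.+?)\s*\((?P<born>\d{4})-(?P<died>\d{4})?\)$")
# "<=start=>" or "<=start-end=>", with optional surrounding whitespace.
LINK = re.compile(r"\s*<=(?P<start>\d{4})(?:-(?P<end>\d{4}))?=>\s*")

def parse_person(text):
    match = PERSON.match(text.strip())
    if match is None:
        raise ValueError(f"not a dated person: {text!r}")
    return {
        "name": match["name"],
        "born": int(match["born"]),
        "died": int(match["died"]) if match["died"] else None,
    }

def parse_line(line):
    """Parse one DSL line into either a single person or a dated marriage."""
    body = line.strip().rstrip(":")
    link = LINK.search(body)
    if link is None:
        return {"person": parse_person(body)}
    return {
        "spouses": [parse_person(body[:link.start()]), parse_person(body[link.end():])],
        "married": int(link["start"]),
        "ended": int(link["end"]) if link["end"] else None,
    }

marriage = parse_line("King Charles (1948-) <=1981-1996=> Princess Diana (1961-1997):")
child = parse_line("        Prince George (2013-)")
print(marriage)
print(child)
```

Indentation handling is left out here on purpose; it works the same way as in the undated notation.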
Takeaways
We are reaching the end of the syntactical era of programming. This is a big deal, as most major computer language shifts in the last 60 years have ultimately been about syntax optimization, and about creating complex (and deep) APIs to cover just about every possible contingency. Java 14 has 4,500 classes, and as an approximation about ten times that many methods, setters and getters. That's insane. Python and Javascript are not quite there yet, but the numbers are still climbing. No one can be an expert on something that has nearly five thousand classes, and frankly, we spend a huge amount of time and money migrating from language to language trying to find that magic optimization sweet spot.
What is going to happen is that we will likely see the rise of DSLs, hopefully, built upon decent underlying data models, that can effectively migrate from language to language based on user preference. This is going to have a significant impact:
This change won't happen overnight, but it will likely represent a sea change in how we communicate with computer systems. By the way, most prompt "dialogs" are themselves DSLs of some sort, idiomatic representations which can annotate or direct changes to the output. You see this especially with image, sound and video processing, but the whole field of prompt engineering can be thought of as a form of creating DSLs.
In Media Res,
Kurt Cagle
Editor, The Cagle Report