RAG, Context and Knowledge Graphs
Copyright 2024 Kurt Cagle / The Cagle Report
There is an interesting tug-of-war going on right now. On one side are the machine learning folks, those who have been harnessing neural networks for a few years. In their view of the world, everything should be a neural net, and if only everyone else would adopt neural nets, they would see that it solves all of humanity's problems. To them, everything is an LLM.
On the other side of the rope are the Symbolic AI folk, mostly rallying around knowledge graphs. A knowledge graph is a graph of information, a connected network where each node is a concept and each edge is a relationship. This camp has been around for a lot longer, though it can be argued that neural nets and knowledge graphs have both been around since the 1950s, and that the debate about which is superior has been going on for nearly as long.
The latest point of contention concerns two similar technologies that serve somewhat different functions. The first is Retrieval Augmented Generation, known by the rather inflated appellation RAG. RAG can be thought of as an escape hatch for LLMs, a way for other data stores to weigh in on matters within large language models, one that involves (horrors!) querying external datasets.
The second technology involves context, which can be thought of as the number of tokens that can be held in an LLM's "memory" at any given time. The larger the context, the more memory is available to build the information used within a response, and consequently the more "accurate" (i.e., non-hallucinatory) a given response is likely to be. In the neural network world, if you could build a large enough context, you would not in fact need any other data source, because everything could be kept in the context.
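To make the idea of context concrete, the following sketch counts how many tokens a piece of text would consume in a model's context window. It assumes the tiktoken library and its cl100k_base encoding; the 128,000-token window is purely illustrative.

# A minimal sketch of what "context" means in token terms, assuming the
# tiktoken library (pip install tiktoken) and the cl100k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

document = "A knowledge graph is a connected network of concepts and relationships. " * 500
tokens = enc.encode(document)

context_window = 128_000  # illustrative context size for a large model
print(f"Document uses {len(tokens)} tokens of a {context_window:,}-token window.")

Once a document (or conversation history) exceeds that budget, something has to be dropped or summarized, which is exactly where the diminishing returns discussed below set in.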
There are a few limitations with context, however, that can be seen after a bit of introspection:
These factors don't obviate the power of contexts - you can, in fact, do a number of useful things with a sufficiently large context, just as you can with any memory system. Beyond a certain point, however, making the context larger yields diminishing returns, and contexts generally don't solve the problem of dynamic data, which is something RAG is much better suited for.
The Problems (and Power) of Knowledge Graphs
It's worth noting that knowledge graphs have their own set of problems, which have prevented them from gaining the same level of penetration that LLMs are beginning to enjoy.
Given these things, using knowledge graphs would seem like a non-starter. However, their advantages outweigh the disadvantages significantly:
Driving Natural Language KG Queries With LLM
These factors combine to make knowledge graphs powerful tools for managing large language models, while at the same time LLMs can be used to simplify the process of querying these databases.
Most queries actually rely upon both structural and taxonomic ontologies: the first identifying structural relationships, the second describing entities such as people, places, things, categories, and intellectual property.
One useful combination of ontologies is the schematic structure of schema.org, which provides a surprisingly comprehensive general ontology for describing common constructs, along with the use of Wikipedia URLs to identify various entities, from presidents to countries to movies to technologies and books.
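To make this pairing concrete, here is a minimal sketch (using the rdflib Python library) of a single resource described structurally with schema.org and identified, along with its producer, by Wikipedia URIs. The triples are illustrative rather than drawn from any real dataset.

# A minimal sketch of the schema.org + Wikipedia pairing, using rdflib
# (pip install rdflib). The triples are illustrative, not authoritative.
from rdflib import Graph, Literal, Namespace, RDF

SCHEMA = Namespace("https://schema.org/")
WIKI = Namespace("https://en.wikipedia.org/wiki/")

g = Graph()
g.bind("schema", SCHEMA)
g.bind("wiki", WIKI)

# schema.org supplies the structural vocabulary; Wikipedia URIs act as the
# reference taxonomy identifying the individual entities.
g.add((WIKI.Blade_Runner, RDF.type, SCHEMA.Movie))
g.add((WIKI.Blade_Runner, SCHEMA.name, Literal("Blade Runner")))
g.add((WIKI.Blade_Runner, SCHEMA.producer, WIKI.Ridley_Scott))

print(g.serialize(format="turtle"))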
The following prompt set the stage in ChatGPT:
You are a knowledge graph accessible via SPARQL. Your dataset is built using schema.org as the underlying structure, and using Wikipedia URIs in order to identify resources (these use the namespace prefixes schema: and wiki: respectively). The schema.org properties and classes are all defined using SHACL. Translate each prompt into a SPARQL query (and display this prompt), then produce sample output from this query as Turtle. Continue until the words "end queries" are typed as a prompt. If a property URI is not in the schema.org namespace, then use ex: <https://example.com/ns#> as the namespace for the property.
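The same stage-setting can be done programmatically rather than through the chat interface. The sketch below is a hypothetical setup assuming the openai Python client (version 1.x); the model name is illustrative, and the user turn anticipates the first query discussed below.

# A sketch of priming the model with the "act as a knowledge graph" system
# prompt, assuming the openai Python client (>=1.0); model name illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a knowledge graph accessible via SPARQL. Your dataset is built "
    "using schema.org as the underlying structure, and using Wikipedia URIs "
    "to identify resources (namespace prefixes schema: and wiki: respectively). "
    "Translate each prompt into a SPARQL query, then produce sample output as Turtle."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "List all movies produced by Ridley Scott"},
    ],
)
print(response.choices[0].message.content)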
The first query is a simple one:
List all movies produced by Ridley Scott
The initial script cogitates a bit, then spits out:
PREFIX schema: <https://schema.org/>
PREFIX wiki: <https://en.wikipedia.org/wiki/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?movie ?movieLabel WHERE {
  ?movie a schema:Movie ;
         schema:producer wiki:Ridley_Scott .
  ?movie rdfs:label ?movieLabel .
}
It then generates sample output as follows:
@prefix schema: <https://schema.org/> .
@prefix wiki: <https://en.wikipedia.org/wiki/> .

wiki:Alien a schema:Movie ;
    schema:name "Alien" ;
    schema:producer wiki:Ridley_Scott .

wiki:Blade_Runner a schema:Movie ;
    schema:name "Blade Runner" ;
    schema:producer wiki:Ridley_Scott .

wiki:The_Martian_(film) a schema:Movie ;
    schema:name "The Martian" ;
    schema:producer wiki:Ridley_Scott .
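If you want to verify that the generated query behaves sensibly, a local sanity check is straightforward. The sketch below loads the sample Turtle into rdflib and runs the same SELECT shape against it; because the sample data carries schema:name rather than rdfs:label, the label is bound from schema:name here - an assumption on my part, not part of the generated query.

# A sketch of sanity-checking the generated query locally with rdflib.
# The sample data uses schema:name, so the label is bound from that property.
from rdflib import Graph

sample_turtle = """
@prefix schema: <https://schema.org/> .
@prefix wiki: <https://en.wikipedia.org/wiki/> .

wiki:Alien a schema:Movie ;
    schema:name "Alien" ;
    schema:producer wiki:Ridley_Scott .

wiki:Blade_Runner a schema:Movie ;
    schema:name "Blade Runner" ;
    schema:producer wiki:Ridley_Scott .
"""

query = """
PREFIX schema: <https://schema.org/>
PREFIX wiki: <https://en.wikipedia.org/wiki/>

SELECT ?movie ?movieLabel WHERE {
  ?movie a schema:Movie ;
         schema:producer wiki:Ridley_Scott ;
         schema:name ?movieLabel .
}
"""

g = Graph()
g.parse(data=sample_turtle, format="turtle")
for row in g.query(query):
    print(row.movie, row.movieLabel)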
There are a few items to note. In this case, ChatGPT has spidered both Wikipedia and schema.org, and consequently has the identifiers for both in its rather capacious latent space. This is very useful: it means that ChatGPT already carries at least one (and in fact several) schemas and reference taxonomies, so if you are using either ChatGPT or Copilot, you stand a good chance of being able to do inferencing simply by following the schema.org model, without having to do anything else in prep.
The second thing to note is that an intrinsic lookup has already taken place to retrieve the reference to wiki:Ridley_Scott. You can change this so that the producer is a parameter passed via a VALUES statement, by adding a further rule:
when generating a SPARQL query, pass parameters in using the $ notation rather than ?, and assign the parameter passed as a default using the VALUES construct.
The $ in SPARQL queries is equivalent to the question mark, but it is often used to indicate a variable being passed into the SPARQL query from outside. If no variable is passed, the default value given in the VALUES clause (wiki:George_Lucas in the example below) is used.
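In application code, this convention usually amounts to simple query templating: the VALUES line supplies the default binding for $producer, and a caller can overwrite it before the query is sent off. The sketch below is one hypothetical way to do this in Python; the build_query helper is illustrative, not part of any SPARQL library.

# A sketch of the parameter-passing convention: VALUES supplies the default
# binding for $producer, and a caller may substitute another wiki: CURIE.
QUERY_TEMPLATE = """
PREFIX schema: <https://schema.org/>
PREFIX wiki: <https://en.wikipedia.org/wiki/>

SELECT ?movie ?movieLabel WHERE {
  VALUES $producer { %(producer)s }
  ?movie a schema:Movie ;
         schema:producer $producer ;
         schema:name ?movieLabel .
}
"""

def build_query(producer: str = "wiki:George_Lucas") -> str:
    """Fill in the $producer default; callers may pass any wiki: CURIE."""
    return QUERY_TEMPLATE % {"producer": producer}

print(build_query())                     # uses the wiki:George_Lucas default
print(build_query("wiki:Ridley_Scott"))  # caller-supplied parameter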
what were the films produced by George Lucas, including the year of release and the genre of the film?
This generates the SPARQL query:
PREFIX schema: <https://schema.org/>
PREFIX wiki: <https://en.wikipedia.org/wiki/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT {
  ?movie a schema:Movie ;
         schema:name ?movieLabel ;
         schema:producer $producer ;
         schema:releaseDate ?releaseDate ;
         schema:genre ?genreLabel .
}
WHERE {
  VALUES $producer { wiki:George_Lucas }
  ?movie a schema:Movie ;
         schema:producer $producer ;
         schema:releaseDate ?releaseDate ;
         schema:genre ?genre .
  ?movie rdfs:label ?movieLabel .
  ?genre rdfs:label ?genreLabel .
}
and the sample output:
@prefix schema: <https://schema.org/> .
@prefix wiki: <https://en.wikipedia.org/wiki/> .

wiki:Star_Wars a schema:Movie ;
    schema:name "Star Wars" ;
    schema:producer wiki:George_Lucas ;
    schema:releaseDate "1977-05-25" ;
    schema:genre "Science Fiction" .

wiki:American_Graffiti a schema:Movie ;
    schema:name "American Graffiti" ;
    schema:producer wiki:George_Lucas ;
    schema:releaseDate "1973-08-01" ;
    schema:genre "Comedy" .
The advantage of this approach is that you can use it to build templates. Note also that this query made use of the CONSTRUCT command to build a set of triples. That output in turn can also be used to construct a compact JSON-LD document:
Create a compact JSON-LD representation of this document.
{
  "@context": {
    "schema": "https://schema.org/",
    "wiki": "https://en.wikipedia.org/wiki/",
    "name": "schema:name",
    "type": "@type",
    "id": "@id",
    "producer": "schema:producer",
    "releaseDate": "schema:releaseDate",
    "genre": "schema:genre"
  },
  "@graph": [
    {
      "id": "wiki:Star_Wars",
      "type": "schema:Movie",
      "name": "Star Wars",
      "producer": {"id": "wiki:George_Lucas"},
      "releaseDate": "1977-05-25",
      "genre": "Science Fiction"
    },
    {
      "id": "wiki:American_Graffiti",
      "type": "schema:Movie",
      "name": "American Graffiti",
      "producer": {"id": "wiki:George_Lucas"},
      "releaseDate": "1973-08-01",
      "genre": "Comedy"
    }
  ]
}
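The same compaction can also be done programmatically. Assuming rdflib 6 or later (which bundles a JSON-LD serializer), a sketch like the following produces compact JSON-LD directly from the Turtle, using a context that mirrors the one above.

# A sketch of producing compact JSON-LD from the Turtle with rdflib 6+,
# which bundles a JSON-LD serializer; the context mirrors the one above.
from rdflib import Graph

turtle_data = """
@prefix schema: <https://schema.org/> .
@prefix wiki: <https://en.wikipedia.org/wiki/> .

wiki:Star_Wars a schema:Movie ;
    schema:name "Star Wars" ;
    schema:producer wiki:George_Lucas ;
    schema:releaseDate "1977-05-25" ;
    schema:genre "Science Fiction" .
"""

context = {
    "schema": "https://schema.org/",
    "wiki": "https://en.wikipedia.org/wiki/",
    "name": "schema:name",
    "producer": {"@id": "schema:producer", "@type": "@id"},
    "releaseDate": "schema:releaseDate",
    "genre": "schema:genre",
}

g = Graph()
g.parse(data=turtle_data, format="turtle")
print(g.serialize(format="json-ld", context=context, indent=2))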
Finally, an LLM call can convert the Turtle response into a form that can be passed to LangChain as part of a RAG (Retrieval Augmented Generation) process:
From the last Turtle response, generate an output that would be passed on to a LangChain calling this as part of a RAG process.
[
  {
    "title": "Star Wars",
    "content": "Star Wars is a science fiction movie produced by George Lucas. It was released on May 25, 1977."
  },
  {
    "title": "American Graffiti",
    "content": "American Graffiti is a comedy movie produced by George Lucas. It was released on August 1, 1973."
  }
]
Note that the LLM has converted the content into a minimal text representation for each movie, taking advantage of the fact that it knows the identifiers in order to retrieve labels for each of the relevant fields.
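On the LangChain side, that JSON maps naturally onto Document objects that can then be embedded and indexed for retrieval. The sketch below assumes the langchain_core.documents import path, which has shifted between LangChain releases, so treat it as an assumption rather than a fixed API.

# A sketch of wrapping the RAG-ready JSON as LangChain Document objects.
import json
from langchain_core.documents import Document

rag_payload = """
[
  {"title": "Star Wars",
   "content": "Star Wars is a science fiction movie produced by George Lucas. It was released on May 25, 1977."},
  {"title": "American Graffiti",
   "content": "American Graffiti is a comedy movie produced by George Lucas. It was released on August 1, 1973."}
]
"""

documents = [
    Document(page_content=item["content"], metadata={"title": item["title"]})
    for item in json.loads(rag_payload)
]

# These documents can then be embedded into a vector store of choice and
# retrieved as grounding context during the generation step.
for doc in documents:
    print(doc.metadata["title"], "->", doc.page_content)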
You can also see that ChatGPT does have an understanding of at least Wikipedia entries:
User:
What does wiki:George_Lucas represent?
ChatGPT:
In the context of the data model being used, wiki:George_Lucas represents a resource identified by a Wikipedia URI. Specifically, it refers to the Wikipedia page for George Lucas, the famous film director, producer, and screenwriter known for creating the Star Wars and Indiana Jones franchises. The prefix wiki: is used as a shorthand for the base URL of Wikipedia, which is typically https://en.wikipedia.org/wiki/. Therefore, wiki:George_Lucas would expand to https://en.wikipedia.org/wiki/George_Lucas, which is the URL of the Wikipedia page about George Lucas.
Conclusion
This is a fairly basic example of what can be done by combining knowledge graphs and LLMs, and it is a topic I will revisit in the near future. This approach of combining schema.org and Wikipedia is especially useful because it is generally much easier to convert RDF into other RDF forms than it is to convert non-RDF into RDF.
In media res,
Kurt Cagle
Editor, The Cagle Report
Founder & CEO at OpenLink Software | Driving GenAI-Based AI Agents | Harmonizing Disparate Data Spaces (Databases, Knowledge Bases/Graphs, and File System Documents)
11 个月Kurt Cagle, I've taken your basic SPARQL example and expanded it to demonstrate how hyperlinks aid sharing and comparison of what's generated by various Chat Bots. See: https://www.dhirubhai.net/pulse/large-language-model-llm-chat-bot-comparison-basic-sparql-idehen-5cv2e/
Helping people govern and make sense of data
11 个月Yes, KG's take a bit of work. But, so do most software implementations. The productivity gains of KG + LLM for enterprises is fun to watch - and we're still in the 1st inning!
Balaji D Loganathan Surendran Sukumaran
Information, Data, Knowledge Expert
11 个月I think you have to change the first (select -> construct) query or the result (graph -> table). Nice article (as usual ??) ????♂?
Disambiguation Specialist
11 个月??