RAG, Context and Knowledge Graphs
Where did that robot get to?!


Copyright 2024 Kurt Cagle / The Cagle Report

There is an interesting tug-of-war going on right now. On one side are the machine learning folks, those who have been harnessing neural networks for a few years. In their view of the world, everything should be a neural net, and if only everyone else would adopt neural nets, they would see that these networks solve all of humanity's problems. To them, everything is an LLM.

On the other side of the rope are the Symbolic AI folk, mostly rallying around knowledge graphs. A knowledge graph is a graph of information, a connected network where each node is a concept and each edge is a relationship. These people have been around for a lot longer, though it can be argued that neural nets and knowledge graphs have both been around since the 1950s, and that the debate about which is superior has been going on for nearly as long.

The latest point of contention concerns two similar technologies that serve somewhat different functions. The first is Retrieval Augmented Generation, known by the rather unflattering appellation RAG. RAG can be thought of as an escape hatch for LLMs, a way for other data stores to weigh in on matters within large language models, which involves (horrors!) querying external datasets.

The second technology involves context, which can be thought of as the number of tokens that can be held in an LLM's "memory" at any given time. The larger the context, the more memory that's available to build information used within a response, and consequently the more "accurate" (i.e., non-hallucinatory) a given response is likely to be. In the neural network world, if you could build a large enough context, you would not in fact need any other data source, because everything could be kept in the context.

There are a few limitations with context, however, that can be seen after a bit of introspection:

  • The larger the context, the longer it takes to read that context. This is especially true because some queries are effectively negative, in that they return no reasonable responses (how many Tyrannosaurs can be found in South America, for instance), so the cost of reading through all of those context tokens looking for an answer can be prohibitive.
  • Sometimes you simply can't get all the data that you're looking for within an LLM. The information outside of LLMs (even taking into account the massive Internet crawlers that Bing, Google, etc. all maintain) still dwarfs the data within LLMs by several orders of magnitude. This should not be that surprising.
  • In both humans and LLMs, most contextual information is not relevant to the focused task at hand. Humans are remarkably adept at screening most of this out (one of the loudest sounds in the human body is the sound of blood rushing into and out of your ears, yet, except in very rare circumstances, you likely don't even realize that it's there). This doesn't hold true for contexts (though there are some interesting developments there). Determining what is not relevant within a context stream is a considerably more costly venture for a machine-learning system.

These factors don't obviate the power of contexts - you can, in fact, do a number of useful things with a sufficiently large context, just as you can with any memory system. However, after a certain point, making the context larger yields diminishing returns, and contexts generally don't solve the problem of dynamic data, which is something that RAG is much better suited for.
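
To make the division of labor concrete, here is a minimal sketch of a RAG step in Python. The helper names (retrieve_documents, call_llm) are hypothetical placeholders rather than any particular library's API; the point is simply that retrieval happens outside the model, and only the relevant snippets are placed into the context.

from typing import List

def retrieve_documents(query: str, k: int = 3) -> List[str]:
    """Placeholder: fetch the k most relevant snippets from an external
    store (knowledge graph, search index, vector database, etc.)."""
    raise NotImplementedError("wire this up to your own data store")

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to whatever LLM endpoint you use."""
    raise NotImplementedError("wire this up to your own LLM client")

def answer_with_rag(question: str) -> str:
    # 1. Retrieve only the material relevant to the question...
    snippets = retrieve_documents(question)
    # 2. ...then splice it into the prompt, keeping the context small.
    prompt = (
        "Answer the question using only the sources below.\n\n"
        + "\n\n".join(f"Source {i + 1}: {s}" for i, s in enumerate(snippets))
        + f"\n\nQuestion: {question}"
    )
    # 3. The LLM generates its answer from a compact, targeted context.
    return call_llm(prompt)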

The Problems (and Power) of Knowledge Graphs

It's worth noting that knowledge graphs have their own set of problems, which have prevented them from gaining the same level of penetration that LLMs are beginning to enjoy.

  • They require a fairly significant amount of work to design, set up and manage, especially if you are unfamiliar with them.
  • While they can be deeply queried using SPARQL or similar languages, writing such queries often requires specialist knowledge that is not widely available.
  • You need to understand the schema used to query the knowledge graph, and such schemas can be complex and poorly documented.

Given these things, using knowledge graphs would seem like a non-starter. However, their advantages outweigh the disadvantages significantly:

  • It is possible to summarize complex, highly interdependent relationships with knowledge graphs in ways that would be difficult or even impossible to do in relational databases.
  • Knowledge graphs work with unique conceptual identifiers. This means that concepts can be related to other concepts in different systems, making them quite useful for data interchange across systems as well as for handling master data management.
  • Knowledge graphs can be serialized and transferred between systems because they are database-agnostic. They are also designed to remove duplication. They can be represented as JSON, XML, Turtle, CSV or many other formats.
  • They are a superset of other formats - you can represent XML, JSON, Excel documents, relational data, and document structures as a graph, but it's much harder to represent hierarchical JSON or XML in a traditional relational database without treating it as a BLOB of some sort.
  • While the way that LLMs represent information is very different from knowledge graphs, embedding knowledge graphs into LLMs yields considerably better inferencing and reasoning from the LLM than would otherwise be the case, as well as significantly mitigating hallucinations.

Driving Natural Language KG Queries With LLM

These factors combine to make knowledge graphs powerful tools for managing large language models, while at the same time LLMs can be used to simplify the process of querying these databases.

Most queries actually rely upon both structural and taxonomic ontologies: the first identifying structural relationships, the second describing entities such as people, places, things, categories, and intellectual property.

One useful combination of ontologies is the schematic structure of schema.org, which provides a surprisingly comprehensive general ontology for describing common constructs, along with the use of Wikipedia URLs to identify individual entities, from presidents to countries to movies to technologies and books.
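
As a quick illustration of what that combination looks like in practice, the following sketch builds a tiny graph with rdflib (version 6 or later is assumed; the movie facts are just sample data), using schema.org terms for structure and Wikipedia URIs as entity identifiers, then serializes it as Turtle. The same graph could just as easily be emitted as JSON-LD or N-Triples, which is what makes it database-agnostic.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

SCHEMA = Namespace("https://schema.org/")
WIKI = Namespace("https://en.wikipedia.org/wiki/")

g = Graph()
g.bind("schema", SCHEMA)
g.bind("wiki", WIKI)

# schema.org supplies the vocabulary; Wikipedia URIs identify the entities.
g.add((WIKI.Blade_Runner, RDF.type, SCHEMA.Movie))
g.add((WIKI.Blade_Runner, SCHEMA["name"], Literal("Blade Runner")))
g.add((WIKI.Blade_Runner, SCHEMA.producer, WIKI.Ridley_Scott))

# Swap "turtle" for "json-ld" or "nt" to target a different serialization.
print(g.serialize(format="turtle"))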

The following prompt sets the stage in ChatGPT:

You are a knowledge graph accessible via SPARQL. Your dataset is built using schema.org as the underlying structure, and using Wikipedia URIs in order to identify resources (these use the namespace prefixes schema: and wiki: respectively). The schema.org properties and classes are all defined using SHACL. Translate each prompt into a SPARQL query (and display this prompt), then produce sample output from this query as Turtle. Continue until the words "end queries" are typed as a prompt. If a property URI is not in the schema.org namespace, then use ex: <https://example.com/ns#> as the namespace for the property.

The first query is a simple one:

List all movies produced by Ridley Scott        

Given this initial script, ChatGPT cogitates a bit, then spits out:

PREFIX schema: <https://schema.org/>
PREFIX wiki: <https://en.wikipedia.org/wiki/>

SELECT ?movie ?movieLabel WHERE {
  ?movie a schema:Movie ;
         schema:producer wiki:Ridley_Scott .
  ?movie rdfs:label ?movieLabel .
}        

It then generates sample output as follows:

@prefix schema: <https://schema.org/> .
@prefix wiki: <https://en.wikipedia.org/wiki/> .

wiki:Alien a schema:Movie ;
    schema:name "Alien" ;
    schema:producer wiki:Ridley_Scott .

wiki:Blade_Runner a schema:Movie ;
    schema:name "Blade Runner" ;
    schema:producer wiki:Ridley_Scott .

wiki:The_Martian_(film) a schema:Movie ;
    schema:name "The Martian" ;
    schema:producer wiki:Ridley_Scott .        

There are a few items to note. First, ChatGPT has spidered both Wikipedia and Schema.org, and consequently has the identifiers for both in its rather capacious latent space. This is very useful because it means that ChatGPT does in fact have at least one (and in fact several) schemas and reference taxonomies already loaded, so if you are using either ChatGPT or Copilot, you stand a good chance of being able to do inferencing if you follow the schema.org model, without having to do anything else in prep.

The second thing to note is that an intrinsic lookup has already taken place to retrieve the reference to wiki:Ridley_Scott. You can change this so that it is a parameter passed in via a VALUES statement, with an additional rule:

when generating a SPARQL query, pass parameters in using the $ notation rather than ?, and assign the parameter passed as a default using the VALUES construct.        

The $ in SPARQL queries is equivalent to the question mark, but it is often used to indicate a variable being passed into the SPARQL query. If no variable is passed, the default value given in the VALUES clause (wiki:George_Lucas in the example below) is used.
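
On the calling side, one way to take advantage of this convention is sketched below: a query template carries the $producer placeholder with a default VALUES binding, and a small helper swaps in whichever Wikipedia identifier the caller supplies before the query is sent off. The template and helper are illustrative assumptions, not part of the original chat exchange.

QUERY_TEMPLATE = """
PREFIX schema: <https://schema.org/>
PREFIX wiki:   <https://en.wikipedia.org/wiki/>
SELECT ?movie ?movieLabel WHERE {
  VALUES $producer { %(default_producer)s }
  ?movie a schema:Movie ;
         schema:producer $producer ;
         schema:name ?movieLabel .
}
"""

def build_query(producer_qname: str = "wiki:George_Lucas") -> str:
    # Substitute only the default binding; the $producer variable itself stays
    # in place, so the same template serves every producer.
    return QUERY_TEMPLATE % {"default_producer": producer_qname}

print(build_query())                     # defaults to wiki:George_Lucas
print(build_query("wiki:Ridley_Scott"))  # caller-supplied value

With that convention in place, the next prompt in the session exercises it: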

what were the films produced by George Lucas, including the year of release and the genre of the film?        

This generates the SPARQL query:

PREFIX schema: <https://schema.org/>
PREFIX wiki: <https://en.wikipedia.org/wiki/>
CONSTRUCT {
?movie a schema:Movie ;
     schema:name ?movieLabel ;
     schema:producer $producer ;
     schema:releaseDate ?releaseDate ;
     schema:genre "Science Fiction" .
}
WHERE {
VALUES $producer { wiki:George_Lucas }
  ?movie a schema:Movie ;
         schema:producer $producer ;
         schema:releaseDate ?releaseDate ;
         schema:genre ?genre .
  ?movie rdfs:label ?movieLabel .
  ?genre rdfs:label ?genreLabel .
}        

and the sample output of

@prefix schema: <https://schema.org/> .
@prefix wiki: <https://en.wikipedia.org/wiki/> .

wiki:Star_Wars a schema:Movie ;
    schema:name "Star Wars" ;
    schema:producer wiki:George_Lucas ;
    schema:releaseDate "1977-05-25" ;
    schema:genre "Science Fiction" .

wiki:American_Graffiti a schema:Movie ;
    schema:name "American Graffiti" ;
    schema:producer wiki:George_Lucas ;
    schema:releaseDate "1973-08-01" ;
    schema:genre "Comedy" .        

The advantage to this approach is that you can use it to build templates. Note also that this made use of the CONSTRUCT command to build a set of triples. This output in turn can also be used to construct a compact JSON-LD document:

Create a compact JSON-LD representation of this document.        
{
  "@context": {
    "schema": "https://schema.org/",
    "wiki": "https://en.wikipedia.org/wiki/",
    "name": "schema:name",
    "type": "@type",
    "id": "@id",
    "producer": "schema:producer",
    "releaseDate": "schema:releaseDate",
    "genre": "schema:genre"
  },
  "@graph": [
    {
      "id": "wiki:Star_Wars",
      "type": "schema:Movie",
      "name": "Star Wars",
      "producer": {"id": "wiki:George_Lucas"},
      "releaseDate": "1977-05-25",
      "genre": "Science Fiction"
    },
    {
      "id": "wiki:American_Graffiti",
      "type": "schema:Movie",
      "name": "American Graffiti",
      "producer": {"id": "wiki:George_Lucas"},
      "releaseDate": "1973-08-01",
      "genre": "Comedy"
    }
  ]
}        
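
Because JSON-LD is just another RDF serialization, a graph library can confirm that this compact document carries the same triples as the earlier Turtle. Here is a minimal sketch using rdflib (version 6 or later, which bundles JSON-LD support, is assumed; the filename is illustrative):

from rdflib import Graph

g = Graph()
# movies.jsonld holds the compact JSON-LD document shown above.
g.parse("movies.jsonld", format="json-ld")

# Round-tripping back to Turtle should reproduce the triples the
# CONSTRUCT query generated.
print(g.serialize(format="turtle"))
print(len(g), "triples")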

Finally, this can make use of an LLM call to convert the Turtle response into a form that can be passed into LangChain as part of a RAG (Retrieval Augmented Generation) process:

From the last Turtle response, generate an output that would be passed on to a LangChain calling this as part of a RAG process.        
[
  {
    "title": "Star Wars",
    "content": "Star Wars is a science fiction movie produced by George Lucas. It was released on May 25, 1977."
  },
  {
    "title": "American Graffiti",
    "content": "American Graffiti is a comedy movie produced by George Lucas. It was released on August 1, 1973."
  }
]
        

Note that the LLM has converted the content into a minimal text representation for each movie, taking advantage of the fact that it knows the identifiers in order to retrieve labels for each of the relevant fields.
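
If you wanted to hand this output to LangChain yourself rather than relying on the chat session to format it, a short wrapper along these lines would do. This is a sketch under a couple of assumptions: the Document import path varies between LangChain versions, and the embedding and vector store steps are left out.

# Depending on your LangChain version the import may instead be
# `from langchain.schema import Document`.
from langchain_core.documents import Document

records = [
    {
        "title": "Star Wars",
        "content": "Star Wars is a science fiction movie produced by George Lucas. "
                   "It was released on May 25, 1977.",
    },
    {
        "title": "American Graffiti",
        "content": "American Graffiti is a comedy movie produced by George Lucas. "
                   "It was released on August 1, 1973.",
    },
]

# Each record becomes a Document whose text gets indexed for retrieval and
# whose title rides along as metadata.
docs = [
    Document(page_content=r["content"], metadata={"title": r["title"]})
    for r in records
]

# From here, docs can be embedded into whatever vector store the RAG
# pipeline uses (FAISS, Chroma, etc.) and retrieved at query time.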

You can also see that ChatGPT does have an understanding of at least Wikipedia entries:

User:
What does wiki:George_Lucas represent?

ChatGPT:
In the context of the data model being used, wiki:George_Lucas represents a resource identified by a Wikipedia URI. Specifically, it refers to the Wikipedia page for George Lucas, the famous film director, producer, and screenwriter known for creating the Star Wars and Indiana Jones franchises. The prefix wiki: is used as a shorthand for the base URL of Wikipedia, which is typically https://en.wikipedia.org/wiki/. Therefore, wiki:George_Lucas would expand to https://en.wikipedia.org/wiki/George_Lucas, which is the URL of the Wikipedia page about George Lucas.

Conclusion

This is a fairly basic example of what can be done by combining knowledge graphs and LLMs, and it is a topic I will revisit in the near future. This approach of combining schema.org and Wikipedia is also especially useful because it is generally much easier to convert RDF into other RDF forms than it is to convert non-RDF into RDF.

In media res,

Oh, damn, Han didn't shoot first this time. Better get the mop.


Kurt Cagle

Editor, The Cagle Report

Kingsley Uyi Idehen

Founder & CEO at OpenLink Software | Driving GenAI-Based AI Agents | Harmonizing Disparate Data Spaces (Databases, Knowledge Bases/Graphs, and File System Documents)


Kurt Cagle, I've taken your basic SPARQL example and expanded it to demonstrate how hyperlinks aid sharing and comparison of what's generated by various Chat Bots. See: https://www.dhirubhai.net/pulse/large-language-model-llm-chat-bot-comparison-basic-sparql-idehen-5cv2e/

Sam Evans

Helping people govern and make sense of data


Yes, KG's take a bit of work. But, so do most software implementations. The productivity gains of KG + LLM for enterprises is fun to watch - and we're still in the 1st inning!

Jan Campschroer

Information, Data, Knowledge Expert


I think you have to change the first (select -> construct) query or the result (graph -> table). Nice article (as usual).

