Augmenting Large Language Models with Dynamic External Data
Nithila Jeyakumar CPCU
Assistant VP @ Berkshire Hathaway | IT Leadership, Strategic Thinking
I find myself looking to get under the hood of the latest technologies in order to deliver production-ready software products that support our business needs. One area that has caught my attention recently is, of course, generative AI.
Building on my previous primer: generative AI refers to artificial intelligence systems that can generate new content like text, images, audio, code, and more. Unlike traditional machine learning models, which output predictions or classifications based on their training data, generative AI enables much more open-ended and creative applications by constructing entirely new content.
Here, I’d like to get into Retrieval-Augmented Generation (RAG). RAG is an approach that combines a pre-trained Large Language Model (like GPT-3) with a retrieval system, allowing the model to access and use external information from a database or corpus during inference.
All standalone Large Language Models (LLMs) have a fixed knowledge base that reflects a specific cutoff point in time. Unlike continual learning methods that can continuously ingest new information, the pre-training process for LLMs is a one-time endeavor. This means that LLMs only have access to data and information available up until the cutoff date used during their training. Any knowledge or events after that point are not represented in the LLM's understanding unless additional steps are taken to update or fine-tune the model on newer data. That training data is also generic and may lack the contextual information needed for enterprise production use.
So, what is RAG?
RAG is a specific type of generative AI model that augments its generations by also retrieving and conditioning on relevant information from a knowledge base. This allows the model to incorporate up-to-date, factual knowledge into its generated outputs in a way that improves accuracy, consistency, and truthfulness compared to models operating solely from their training data.
As the name implies, these models retrieve relevant external data the language model wasn't trained on. This supplementary data, pulled in real-time based on the input query, gets fed into the model to provide extra context for its generated response. Crucially, the retrieved information must directly apply to the topic at hand. By augmenting with this timely, contextualized retrieval, the model can produce more complete, accurate outputs than using just its initial training data. The retrieval component ensures access to the latest pertinent knowledge before generation.
The retrieval-augmented approach hinges on one key principle: allowing the model to fetch semantically relevant data for the specific context or prompt. At its core, the architecture identifies and incorporates applicable knowledge sources before generating output, enhancing the model compared to those using only initial training data.
Let’s dive in a little deeper
The RAG process involves the use of three elements:
1. The embedding model
2. The retriever, often a vector database
3. And the generator, the LLM
For retrieval to work, the data must be in 'embedding' form - text represented as numerical vectors. This embedded format allows efficient retrieval of relevant information from the knowledge base.
And more importantly, these embeddings have a similarity principle: similar concepts will have similar vectors.
For instance, consider the concepts of 'apple' and 'orange.' We perceive them as related - both are fruits, grow on trees, are round, and have a distinctive smell. In vector form, 'apple' could be represented as [2.5, 1.2, -0.8] while 'orange' is [2.3, 1.5, -0.7]. Each number in the vector corresponds to an attribute or characteristic of that concept. The similar values across the vectors indicate shared attributes between apples and oranges, capturing their intuitive relatedness. Concepts with more divergent vectors would signify greater dissimilarity in their underlying attributes.

After we have the embeddings, we insert them into the vector database (the retriever), a high-dimensional database that stores these embeddings.
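To make the similarity principle concrete, here is a minimal Python sketch using the toy vectors above. The three-dimensional values are illustrative only; real embedding models produce vectors with hundreds or thousands of dimensions, and the 'car' vector is a made-up example of an unrelated concept.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means very similar direction, low or negative means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional embeddings from the example above
apple  = np.array([2.5, 1.2, -0.8])
orange = np.array([2.3, 1.5, -0.7])
car    = np.array([-1.9, 0.4, 3.1])   # hypothetical unrelated concept

print(cosine_similarity(apple, orange))  # high (~0.99) -> related concepts
print(cosine_similarity(apple, car))     # low/negative -> unrelated concepts
```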
Then, whenever the user sends a request like "give me similar results to a 'green apple'", the vector database performs a 'semantic query'. It calculates the vector for 'green apple', something like [2.6, 1.1, -0.9], and searches for other vectors in the embedded data that are numerically closest to those values. This allows it to retrieve concepts or entries that have highly similar attributes or characteristics to a 'green apple', even if they don't explicitly use those words. The semantic query identifies matches based on the relatedness captured in the vector representations, not just literal keyword matching.
In other words, it extracts the vectors closest (in distance) to that of the user’s query.
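A production vector database does this at scale with approximate nearest-neighbor indexes, but the core of a semantic query can be sketched as a brute-force similarity search over the stored embeddings. Again, the vectors here are the toy values from the example, purely illustrative.

```python
import numpy as np

# A toy "vector database": embeddings keyed by the text they represent
store = {
    "apple":  np.array([2.5, 1.2, -0.8]),
    "orange": np.array([2.3, 1.5, -0.7]),
    "car":    np.array([-1.9, 0.4, 3.1]),
}

def semantic_query(query_vec: np.ndarray, k: int = 2) -> list[str]:
    """Return the k stored entries whose vectors are closest (by cosine similarity) to the query."""
    def score(item):
        vec = item[1]
        return np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
    ranked = sorted(store.items(), key=score, reverse=True)
    return [text for text, _ in ranked[:k]]

green_apple = np.array([2.6, 1.1, -0.9])   # embedding of "green apple" from the example
print(semantic_query(green_apple))         # -> ['apple', 'orange']
```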
Once we have the extracted content, we build the LLM prompt, which has:

· The user’s request
· The extracted content
· and, generally, a set of system instructions
A typical system instruction, provided as part of prompt engineering to guide the model's output style, could be "respond concisely."
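Putting those three pieces together, the prompt could be assembled along these lines. This is a minimal sketch: the exact template, delimiters, and role structure depend on the model and framework you use, and `build_prompt` is just an illustrative helper name.

```python
def build_prompt(user_request: str, retrieved_chunks: list[str],
                 system_instruction: str = "Respond concisely using only the provided context.") -> str:
    """Combine system instructions, retrieved context, and the user's request into one prompt string."""
    context = "\n\n".join(retrieved_chunks)
    return (
        f"{system_instruction}\n\n"
        f"Context:\n{context}\n\n"
        f"User request:\n{user_request}"
    )

prompt = build_prompt(
    user_request="Give me results similar to a 'green apple'.",
    retrieved_chunks=["Apples are round tree fruits...", "Oranges are citrus fruits..."],
)
print(prompt)
```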
So that’s RAG in a nutshell: a system that provides relevant content in real time (at inference time) alongside the user query to enhance the LLM’s response. The secret sauce in RAG systems is the language model's in-context learning superpowers, which allow it to use previously unseen data to make accurate predictions without any weight updates.
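Tying the three elements together, the whole inference-time flow looks roughly like this. Everything below is a stand-in: `embed` is a toy character hash rather than a real embedding model, the "database" is a plain Python list, and `generate` is a placeholder where a real system would call an LLM API.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedder: hashes characters into a small fixed-size vector (not semantically meaningful)."""
    vec = np.zeros(8)
    for i, ch in enumerate(text.lower()):
        vec[i % 8] += ord(ch)
    return vec / np.linalg.norm(vec)

documents = [
    "Green apples are tart fruits often used in baking.",
    "Oranges are citrus fruits rich in vitamin C.",
    "RAG combines retrieval with a generator LLM.",
]
doc_vectors = [embed(d) for d in documents]   # index step: embed the knowledge base up front

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k documents whose embeddings are closest to the query embedding."""
    q = embed(query)
    scores = [float(np.dot(q, v)) for v in doc_vectors]
    top = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)[:k]
    return [documents[i] for i in top]

def generate(prompt: str) -> str:
    """Placeholder generator: a real system would call an LLM here."""
    return f"[LLM response conditioned on a prompt of {len(prompt)} characters]"

query = "Tell me about green apples."
context = "\n".join(retrieve(query))   # a real embedding model would make this retrieval semantically meaningful
answer = generate(f"Respond concisely.\n\nContext:\n{context}\n\nQuestion: {query}")
print(answer)
```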
But wait, what’s the catch?
To help explain, visualize the output of a RAG system as a pair of jeans pieced together from mismatched fabrics.
Although those trousers might work for some, most folks aren't going to want to rock that style: there’s no homogeneity. Likewise, although retrieval-augmented models work for some use cases, other applications fail to fully leverage them, because there is no inherent cohesion between the components, even though retrieval was built with the intention of unobtrusively supplementing language models.
In the standard RAG setup, pretraining, fine-tuning, and Reinforcement Learning from Human Feedback (RLHF), the essential stages of standard LLM training, are applied only to the LLM; they are not performed over the combined system that includes both the LLM and the retriever (the vector database).
In more technical terms, this means that during backpropagation, the algorithm used to train these models, gradients are not propagated through both the LLM and the retriever, so the system as a whole never learns from the training data in unison. The complete, connected pipeline is not trained end-to-end.
An additional factor to consider when implementing standard RAG is the arrival of LLMs that can handle enormous sequence lengths.
LLMs like Gemini 1.5 or Claude 3 have huge context windows: up to a million tokens (roughly 750,000 words) in their production-released models and up to 10 million tokens (roughly 7.5 million words) in the research labs. This means these models can be fed extremely long sequences of text in every single prompt. For reference, the entire Harry Potter book series runs to around 1,084,170 words, so a 7.5-million-word context window could fit it roughly seven times over, in every single prompt.
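For the back-of-the-envelope math behind that comparison, using the common rough assumption of about 0.75 words per token:

```python
tokens = 10_000_000                 # research-lab context window cited above
words_per_token = 0.75              # rough rule of thumb for English text
harry_potter_words = 1_084_170      # approximate word count of the full series

window_words = tokens * words_per_token          # 7,500,000 words
print(window_words / harry_potter_words)         # ~6.9 -> roughly seven times over
```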
So, I leave you with something to think about: do we really need a retrieval system and knowledge base at all, instead of just feeding the relevant information into every prompt?
Looking forward to continuing to explore the technical and strategic frontiers of AI in the years ahead.
Onward and Upward!