Fundamentals of RAG - Retrieval Augmented Generation - Part 1

Retrieval Augmented Generation (RAG) is an innovative approach that combines the power of retrieval-based models and generation-based models to enhance the performance of natural language processing tasks. In this article, we will delve into the intricacies of RAG and explore its potential applications in various domains.

It is common knowledge that Large Language Models are limited to answering questions or producing text solely based on the information they were trained on. For instance, if a language model was trained in 2018 and is asked about Covid, it is likely that the model will not have any knowledge about it. One effective method to enhance the knowledge of a language model is to fine-tune it using the most recent data available.

However, there exists an alternative method known as Retrieval Augmented Generation (RAG) that proves to be highly beneficial, particularly in the realm of Question-Answering. Recently, X (formerly known as Twitter) has utilized this technique to develop their latest language model named Grok.

Grok by X

Grok has the ability to access real-time data from tweets and respond to inquiries about current trends. This article delves into the concept of Retrieval Augmented Generation, detailing the pipelines and architecture of each component.

One of the key motivations behind RAG is the fact that LLMs may not have been exposed to all the data that is relevant to you. Private or recent data might not have been part of the pre-training process for these LLMs. As illustrated in the graph, the number of tokens they are pre-trained on is extensive, but it will always be limited in comparison to the private or recent data that is important to you.

Context Window Vs Token for various variants of LLMs

However, it is worth noting that LLMs have context windows that keep expanding, from a few thousand tokens to hundreds of thousands of tokens. This allows us to incorporate information from external sources, spanning from dozens to hundreds of pages.

One way to think about this is that LLMs are the kernel of a new kind of operating system, and connecting them to external data is a central capability in the development of this emerging operating system.

LLMs as a Central Processing Unit

Let's briefly review what a language model is.

What is a Language Model?

A language model is a probabilistic model that assigns probabilities to sequences of words. In practice, a language model allows us to compute the following:

LLM as a Conditional Probability

The language model allows us to predict the probability of the next word in a sequence: for example, the probability of the word "India" following the sequence "Delhi is a city in". This is exactly the kind of probability that we model with a language model.
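In symbols (standard next-token notation, not taken from the article's figure), the model estimates

P(w_t \mid w_1, w_2, \ldots, w_{t-1}), \qquad \text{e.g.} \quad P(\text{"India"} \mid \text{"Delhi is a city in"})

where w_1, ..., w_{t-1} are the tokens seen so far and w_t is the next token.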

We usually train a neural network to predict these probabilities. A neural network trained on a very large corpus of text is known as a Large Language Model (LLM).

How do we train and run inference with a Language Model?

Training

A language model is trained on a corpus of text, i.e. a large collection of documents. Typically, language models are trained on the entirety of Wikipedia along with millions of web pages. This extensive training enables the language model to amass a wealth of knowledge.

We usually train a Transformer-based neural network as the language model.

Inference

To run inference with a Language Model, we construct a prompt and task the model with generating the remaining text by progressively adding tokens.

LLMs Poem - Proof of existence of Infinite Prime Numbers

When the poem is read in its entirety, it becomes comprehensible. It aligns with the content we specifically requested or wrote in the prompt.

You Are What You Eat

A Language Model is limited to generating text and information based on its training data. Consequently, if the model is exclusively trained on English content, it is highly unlikely to produce Japanese or French output. In order to introduce new concepts, it is necessary to fine-tune the model.

Cons of Fine-tuning

  • It can be computationally expensive
  • The number of parameters of the model may not be sufficient to capture all the knowledge we want to teach to it. That's why LLaMA was introduced with 7B, 13B and 70B parameters.
  • Fine-tuning is not additive. It may replace existing knowledge of the model with new knowledge. For example, a language model trained on English that is (heavily) fine-tuned on Chinese may "forget" English

We can compensate for these drawbacks of fine-tuning with prompt engineering.

Prompt Engineering comes to the aid!

It is feasible to educate a language model on how to execute a new task by manipulating the prompt. In other words, it is achievable to request the language model to carry out a task that it was not explicitly trained for by utilizing the prompt.

For example, by using the "few-shot" prompting technique.

Here is an illustration where we initially provide an instruction to the language model, explaining the task it is expected to perform. Subsequently, we present the language model with a few examples on how to execute this task, and finally, we request the language model to perform it.

For instance: Ringo, a parrot, enjoys playing pranks on its companion Bob by substituting all the names with "squawk".

Learning with Prompt Engineering

The LLM (say, chatGPT) was not trained to perform this task, but by looking at the prompt and the examples we provided, it was able to come up with the right solution.

Question-Answering with Prompt Engineering

We can utilize prompt engineering for Question-Answering purposes as well, employing a similar approach. This involves constructing a comprehensive prompt that consists of an instruction, context, question, and answer.

For instance, we can inquire about Grok-0 from a language model like chatGPT. Despite the fact that Grok-0 is a recent introduction and chatGPT may not be familiar with it, the model can still retrieve the relevant information from the context provided.

Prompt Engineering
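To make this concrete, here is a hedged sketch of such a prompt written as a simple Python template (the exact wording, and the idea that some context about Grok-0 has already been pasted in, are illustrative assumptions rather than the prompt shown in the figure):

    context = "...paragraphs about Grok-0 retrieved or pasted by hand..."   # illustrative placeholder
    question = "How many parameters are there in Grok-0?"

    prompt = (
        "You are an assistant that answers questions using only the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

The model is then asked to complete the text after "Answer:", which forces it to ground its reply in the supplied context.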

Fine-Tuning a model is NOT always wrong, as it can be a valid approach to addressing knowledge gaps in the Language Model. This typically leads to improved results in terms of quality when focusing on specific content, rather than solely relying on prompt engineering.

The pros of Fine-Tuning

  • Higher quality results compared to prompt engineering
  • Smaller context size (input size) during inference, since we don't need to include the context and instructions

Furthermore, as we observed previously, in order to ask chatGPT a question we had to construct an extensive prompt, so the number of tokens fed to the model was considerably large. The more context we offer, the more information the language model has to generate an accurate response. Typically a larger context is necessary, but the drawback is that it is more computationally intensive.

However, through fine-tuning we can actually shrink this context, because we no longer need to provide it: the language model has been fine-tuned on the specific data that we want to question it about. For a properly fine-tuned model, we only need to pose the question. Consequently, without supplying any context, we can ask about the number of parameters in Grok-0, and if the language model has been fine-tuned correctly it will be capable of generating the correct answer.

Retrieval Augmented Generation Pipeline

Now we need to introduce the Retrieval Augmented Generation pipeline; that is our next step, and we will explore each building block that makes up the pipeline.

QA with Retrieval Augmented Generation

So imagine we want to do question answering with Retrieval Augmented Generation, compared to what we did before with prompt engineering alone.

Question-Answering with RAG

Question - "How many parameters are there in Groq-0 ?

Imagine a scenario where we have access to additional resources in the form of documents that may contain the answer we are looking for. These documents could be in PDF format or they could be web pages that hold the information we seek. To effectively analyze these documents, we break them down into smaller chunks of text. For instance, a document may consist of multiple pages, each page containing paragraphs, and each paragraph comprising sentences.

Typically, we divide the documents into smaller sentence fragments, and the web pages are segmented in a similar manner. Subsequently, we generate embeddings for these sentences, where each embedding is a fixed-size vector that encapsulates the meaning of the sentence. These embeddings are then stored in a vector database, allowing us to analyze how they function. Next, we transform a query, which is a sentence, into an embedding using the same method employed for the documents. We then search for this query embedding within our database, which contains numerous embeddings representing sentences from our documents, yielding results with the most closely matching (or similar) embeddings for our specific query.
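As a minimal sketch of this chunking-and-indexing step (assuming the sentence-transformers library and a simple in-memory NumPy "index"; a real system would use a proper vector database and a smarter chunking strategy):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence-embedding model will do

    # 1. Split each document into small chunks (here: naive fixed-size word windows)
    def chunk(text, size=100):
        words = text.split()
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

    documents = ["...full text extracted from a PDF...", "...text of a web page..."]
    chunks = [c for doc in documents for c in chunk(doc)]

    # 2. Embed every chunk; each embedding is a fixed-size vector capturing the chunk's meaning
    embeddings = model.encode(chunks, normalize_embeddings=True)

    # 3. "Store" them: here the vector database is just the matrix plus the original texts
    index = {"vectors": np.asarray(embeddings), "texts": chunks}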

Furthermore, every embedding is linked to the specific text it originates from, enabling the vector database to retrieve the original text associated with that particular embedding.

If, for instance, we inquire about the number of parameters in Grok-0, the vector database will scan through all of its embeddings and provide us with the ones that most closely align with our query. Consequently, it will likely surface all the relevant information regarding Grok-0 and the parameters it encompasses.

After obtaining the context and query, we proceed to generate a prompt template. This template is based on the previous prompt used, where you act as an assistant trained to respond to inquiries using the provided context. We insert both the context and query into the prompt template, similar to our previous approach. Subsequently, we input this modified prompt into the language model. By doing so, the language model becomes capable of answering our question by utilizing the given context.

With Retrieval Augmented Generation, we are not simply fine-tuning a model to respond to our inquiries; instead, we are employing prompt engineering. Additionally, we are implementing a Vector Database, which allows us to access the relevant context based on our query. This enables the system to retrieve the necessary context to effectively answer the specific question posed to the language model. Subsequently, by utilizing this context along with our question, the language model can likely provide the correct answer.

Now let's see what embedding vectors are and how they work.

Why do we use vectors to represent words?

Given the words "apple", "bits" and "information", if we represent the embedding vectors using only 2 dimensions (X, Y) and plot them, we hope to see something like this: the angle between words with similar meaning is small, while the angle between words with different meaning is big. So, the embeddings "capture" the meaning of the words they represent by projecting them into a high-dimensional space of size d_model (the embedding size).

Similarity as a measure of angle between tokens

For example, the words "bits" and "information" will point in similar directions in space, as they both capture the same kind of semantic meaning, and we can measure the similarity by measuring the angle between them. Hence, the angle between "bits" and "information" is very small, while the angle between "apple" and "bits" is very large, because they belong to different semantic groups and therefore have different meanings.

We commonly use the cosine similarity, which is based on the dot product between the two vectors.
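Concretely, for two embedding vectors a and b the cosine similarity is

\cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\;\sqrt{\sum_i b_i^2}}

which is 1 when the vectors point in the same direction, 0 when they are orthogonal, and -1 when they point in opposite directions.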

Word Embeddings - the Idea

So let's see how the idea of representing words as embeddings came about.

  • Words that are synonyms tend to occur in the same context (surrounded by the same words). For example, the words "doctor" and "surgeon" usually occur surrounded by words like "physician", "hospital", "medical", "clinic", "prescription", etc.
  • The inverse can also be true: words that occur in the same context tend to have similar meanings. This is known as the distributional hypothesis
  • This means that to capture the meaning of a word, we also need to have access to its context (the words surrounding it).
  • This is why we employ the Self-Attention mechanism in the Transformer model to capture contextual information for every token. The self-attention mechanism relates every token to all the other tokens in the sentence.

Word Embeddings - the Cloze Task

  • Imagine I give you the following sentence:

Delhi is the ______ of India, and it has numerous government buildings.

Can you tell me what the missing word is?

  • It's obvious! The missing word is "capital", because by looking at the rest of the sentence, it is the one that makes the most sense.
  • This is how we train BERT: we want the Self-Attention mechanism to relate all the input tokens with each other, so that BERT has enough information about the "context" of the missing word to predict it.

How do we train embedding vectors in BERT?

Input Sentence - "Delhi, being the capital of India, is home to numerous government offices."

Assuming the randomly selected word to be masked is "capital", as before, this leads to a masked input sequence containing 14 tokens that is fed into BERT. Since BERT is a transformer model, an input sequence of 14 tokens produces an output sequence of 14 tokens.

BERT will specifically focus on the masked position, which is the 4th position (i.e. token-3, with index = 3). Subsequently, we will compare token-3 (at the 4th position) with the target token, "capital", and calculate the loss. Finally, we will run the back-propagation process to update the weights.

Essentially, our objective is for the resulting token, 'token-3' to represent 'capital'.

BERT - Masked Language Modeling
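For intuition, here is a small inference-only sketch of masked-token prediction using a pre-trained BERT from the Hugging Face transformers library (this illustrates only the prediction step, not the training loop with loss and back-propagation described above):

    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")

    text = "Delhi, being the [MASK] of India, is home to numerous government offices."
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    # Find the position of the [MASK] token and take the most likely prediction there
    mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    predicted_id = logits[0, mask_index].argmax().item()
    print(tokenizer.decode([predicted_id]))   # expected to print something like "capital"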

Sentence Embeddings

  • We can also use the Self-Attention mechanism to capture the "meaning" of an entire sentence.
  • We can use a pre-trained BERT model to produce embeddings of entire sentences.

Let us see how.

Sentence Embedding with BERT

Sentence Embedding - Training architecture

Sentence Embedding: Comparisons with BERT

How can we compare Sentence Embeddings to see if two sentences have similar meaning?

We could use the cosine similarity, which measures the cosine of the angle between the two vectors. A small angle results in a high cosine similarity score.

Cosine Similarity Formula

But there's a problem !!

Nobody told BERT that the embeddings it produces should be comparable with the cosine similarity, i.e. two similar sentences should be represented by vectors pointing to the same direction in space.

How can we teach BERT to produce embeddings that can be compared with a similarity function of our choice?

Introducing Sentence BERT

Introducing Sentence BERT, a widely used model for generating embeddings for complete sentences that can be evaluated using a similarity function of our choosing, such as the cosine similarity score.

Sentence BERT was introduced in a Paper with the following title:

Sentence-BERT Research Paper

Furthermore, we will explore the significance and functionality of its components, delving into the concept of a Siamese Network.

Sentence BERT - Architecture

Imagine we have two sentences that are similar in meaning as follows:

Sentence A - I like to be in my house

Sentence B - I enjoy staying home

In order to utilize BERT, it is necessary to convert each sentence into a sequence of tokens. For instance, if Sentence A consists of 7 tokens and Sentence B consists of 4 tokens, both sentences are fed into BERT separately. As a result, BERT generates outputs of 7 tokens and 4 tokens for Sentence A and Sentence B respectively. Subsequently, we perform mean pooling as previously done, where we average all output tokens to generate a single vector of 768 dimensions (if BERT Base is being utilized).

So we'll have two vector embeddings: Sentence Embedding A and Sentence Embedding B. We then measure the cosine similarity between these two vectors. Because we are training the BERT model, we also have a target cosine similarity: for two sentences A and B that are quite similar in meaning, the target may be very close to 1, since we want the angle between them to be small. We calculate the loss between the target and the output of the model, and then run back-propagation to update all the weights of the model.

Siamese Network

You may observe that this particular model consists of two branches that are identical in structure, known as a Siamese Network. Such a network is composed of two or more branches that are identical not only in architecture but also in weights. When representing the Siamese Network we visually show two branches, but when coding the model there is only one branch (one set of weights) that processes both inputs before the cosine similarity is computed.

At the operational level, we first feed sentence A into the model, followed by sentence B. We then calculate the cosine similarity between the sentence embeddings generated by BERT for sentences A and B, compare this similarity to the target, and run back-propagation, so that only the single shared set of parameters is adjusted.

To illustrate, we have depicted two branches for the sake of representation. However, it is important to note that there is actually only one branch, one set of weights, and the same number of parameters.

In this manner, by training the Sentence BERT, sentence embeddings will be generated in a way where similar sentences exhibit comparable cosine similarity (or a high cosine similarity measure).
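In practice, a pre-trained Sentence-BERT style model can be used off the shelf through the sentence-transformers library; a minimal sketch (the model name is just one common choice):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-mpnet-base-v2")

    emb_a = model.encode("I like to be in my house")
    emb_b = model.encode("I enjoy staying home")

    # Similar sentences should give a cosine similarity close to 1
    print(util.cos_sim(emb_a, emb_b))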

How to reduce the size of sentence embeddings?

Recall that the BERT Base model generates embeddings of size 768. To obtain sentence embeddings of a smaller size, such as 512, we can achieve this by adding a linear layer after the pooling layer.
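With the sentence-transformers library this corresponds to stacking a Dense module after the pooling module; a sketch assuming BERT Base (768 dimensions) projected down to 512:

    from torch import nn
    from sentence_transformers import SentenceTransformer, models

    word_model = models.Transformer("bert-base-uncased")                    # token embeddings of size 768
    pooling = models.Pooling(word_model.get_word_embedding_dimension())    # mean pooling -> 768-dim sentence vector
    dense = models.Dense(in_features=768, out_features=512,
                         activation_function=nn.Tanh())                    # linear layer -> 512-dim sentence vector

    model = SentenceTransformer(modules=[word_model, pooling, dense])
    print(model.encode("Delhi is a city in India").shape)                  # (512,)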

Effective Approaches for Teaching New Concepts to LLM

As we saw earlier, there are two strategies to augment the knowledge of a language model:

  1. Fine-tuning on a custom dataset, and
  2. using a vector database made up of embeddings

We can also use a combination of both strategies: fine-tuning for a few epochs and then using a vector database to complement the model with knowledge retrieved from the web.

Combined Fine-tuning + RAG

Whichever approach we choose to implement, it is essential that we offer a reliable, scalable, and user-friendly service for constructing our Retrieval Augmented Generation Pipelines.

Let us now delve into the realm of Vector databases, understanding their nature and the functioning of their matching algorithm. Specifically, we shall explore how the vector database conducts a search for our query within the multitude of vectors it has stored.

Vector Database - Introduction

A Vector Database is designed to store fixed-dimensional vectors, also known as embeddings. It allows us to perform queries on the database to identify all the embeddings that closely resemble a given query vector. This is achieved by utilizing a distance metric, typically cosine similarity, although the Euclidean distance can also be employed. The database employs a modified version of the KNN (K Nearest Neighbor) algorithm or another similarity search algorithm.

Vector DBs are also used for finding similar songs (e.g. Spotify), images (e.g. Google Images) or products (e.g. Amazon).

KNN - A Naive Approach

Imagine we want to search for the query in our database: a simple way would be comparing the query with all the vectors, sorting them by distance, and keeping the top K.

Searching via KNN

The KNN algorithm enables us to compare a specific vector with all the vectors stored in our database. We can sort them either by distance or by similarity, depending on our preference, and retain the top-K best matching vectors.

For instance, let's consider a scenario where we have a query - "What is the total count of parameters in Grok-0". Additionally, we possess a vector database consisting of one million embeddings (to be precise, we actually have billions of embeddings, so one million is not a significant number in this context). In this KNN (K-Nearest Neighbors) with a naive approach, the simplest method to match the query with all the other vectors is by utilizing our cosine similarity function. Consequently, the similarity between the query and the first embedding might be 0.34, while with the second embedding it could be 0.51, and so forth. Subsequently, we arrange the embeddings in descending order based on their cosine similarity scores and retain the top K embeddings, which could be either the top 4 or top 5 embeddings, depending on the value of K.

Now, this is actually a very simple approach, but it is also very slow: if there are N embedding vectors (in this case N = 1,000,000) and each one has D dimensions (D = 768 in the case of BERT Base), then the computational complexity is of the order of O(N*D), which is far too slow for large collections.
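A brute-force version of this search is only a few lines of NumPy (assuming the query and the stored embeddings are already L2-normalized, so a dot product equals cosine similarity):

    import numpy as np

    def knn_search(query_vec, db_vectors, k=5):
        # db_vectors: (N, D) matrix of normalized embeddings; query_vec: (D,) normalized vector
        similarities = db_vectors @ query_vec           # one dot product per stored vector: O(N * D)
        top_k = np.argsort(similarities)[::-1][:k]      # sort by similarity and keep the best K
        return top_k, similarities[top_k]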

Let us see if there are better approaches.

Similarity Search: Let's trade precision for speed

The previous naive approach we employed, which was rather simplistic, consistently yielded precise outcomes as it involved comparing the query with all the stored vectors. However, what if we were to minimize the number of comparisons while still achieving accurate results with a high probability?

The metric we usually care about in Similarity Search is Recall.

Precision Vs Recall

In the forthcoming sections, we shall delve into an algorithm known as Hierarchical Navigable Small Worlds (HNSW) that aims to find Approximate Nearest Neighbors.

HNSW - In the Real World

It is the same algorithm that powers Qdrant, the open source Vector DB used by Twitter's (now X) Grok LLM, which can access tweets in real time.

HNSW - Idea #1

HNSW is an evolution of the Navigable Small Worlds algorithm for Approximate Nearest Neighbors, which is based on the concept of Six Degrees of Separation.

Milgram's research was designed to investigate the social relationships among individuals in the United States. At first, the participants were based in Nebraska and Kansas and were assigned the task of delivering a letter to a specific person in Boston. Nevertheless, they were not permitted to send the letter directly to the intended recipient. Instead, they were instructed to pass it on to an acquaintance whom they knew on a first-name basis, and who they believed might have a better chance of knowing the target individual.

At the end of the small-world experiment, Milgram found that most of the letters reached the final recipient in five or six steps, giving rise to the idea that people all over the world are connected by six degrees of separation.

In 2016, Facebook found that its 1.59 billion users were connected on average by 3.5 degrees of separation. This means that you and Mark Zuckerberg are, on average, only about 3.5 connections apart!

Navigable Small Worlds - Searching for K-NN

The NSW algorithm constructs a graph that, similar to the connections between friends on Facebook, links nearby vectors while limiting the overall number of connections. To illustrate, each vector can be linked to a maximum of 6 other vectors, resembling the concept of Six Degrees of Separation.

Given the following query: "How many Encoder layers are there in the Transformer model?"

How does the algorithm find the K Nearest Neighbors?

Searching for the K Nearest Neighbors

Navigable Small Worlds - Inserting a New Vector

We can insert a new vector by searching for its top-K nearest neighbors with the search algorithm described above and creating an edge between the new vector and each of the top-K results.

HNSW - Idea #2

To go from NSW (Navigable Small Worlds) to HNSW (Hierarchical Navigable Small Worlds), we need to borrow the idea behind the data structure known as the skip list.

The skip list is a data structure that maintains a sorted list and allows search and insertion with an average O(log N) time complexity.

Multi-level skip lists

Let's search for the number 9.

To begin with, notice that this skip list consists of multiple linked lists, one per level: level-0 (H0), level-1 (H1), and so on, for a total of four levels. The lowest level, level-0, contains the highest number of items, while the number of items decreases as we move up the levels. To search for a specific number, say 9, we start from the head of the highest level, H3. The next item at this level is 5, and the item after it is END (E), which means the number 9 must be found at a lower level. We therefore descend from node 5 to the level below and inspect the next item, which is 17. Since 17 is larger than 9, our target must be further down, so we descend again, compare with the next node, which is 9, and at this point we halt the search.

Searching number 9
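A sketch of the search routine over an already-built skip list (insertion and random level assignment are omitted; the node layout is an assumption for illustration):

    class Node:
        def __init__(self, value, level):
            self.value = value
            self.forward = [None] * (level + 1)   # forward[i] points to the next node at level i

    def skiplist_search(head, target, top_level):
        node = head
        for level in range(top_level, -1, -1):    # start at the sparsest (highest) level
            # move right while the next value at this level is still smaller than the target
            while node.forward[level] is not None and node.forward[level].value < target:
                node = node.forward[level]
            # then drop down one level and repeat
        candidate = node.forward[0]
        return candidate if candidate is not None and candidate.value == target else None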

HNSW - Hierarchical Navigable Small Worlds

To create the Hierarchical Navigable Small Worlds we combine the concept of navigable small worlds with the idea of skip list.

Hierarchical Navigational Small Worlds

So we have a lower level (level-0) with more nodes and more connections, and an upper level (level-3) with fewer nodes and fewer connections.

The lowest level has the maximum number of nodes and connections, so we say it is dense. Similarly, the uppermost level has the minimum number of nodes and connections, so we say it is sparse.

Hierarchical Navigable Small Worlds + Skip lists

How does the search work in this graph?

Suppose there is a query similar to the previous one, and we aim to conduct a search within this graph. We select a random entry point at the highest level of the graph and proceed to visit it. Subsequently, we calculate the cosine similarity between the Query and the visited node, along with all its connections. Consequently, we identify the optimal node at the top layer, which happens to be the purple one, and proceed to descend to the subsequent layer.

We descend one level and replicate the same process to pinpoint the most suitable local node. This sequence is reiterated until we descend to the lowest level and identify the most appropriate local node, at which point we stop the search.
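As a rough sketch of that greedy descent (heavily simplified: a real HNSW implementation keeps a candidate list whose size is controlled by an ef parameter rather than a single best node; all names here are illustrative):

    def hnsw_search(query, entry_id, layers, vectors, similarity):
        # layers: list of adjacency dicts, ordered from the top (sparsest) layer down to layer 0
        current = entry_id
        for graph in layers:
            improved = True
            while improved:                                   # greedy local search within this layer
                improved = False
                for neighbour in graph.get(current, []):
                    if similarity(query, vectors[neighbour]) > similarity(query, vectors[current]):
                        current, improved = neighbour, True
        return current                                        # best local node found at the bottom layer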


Searching Process

Having observed the functioning of Vector DB, let us now revisit the process of Retrieval Augmented Generation by summarizing our findings thus far.

To begin with, we possess a query and a collection of documents and web pages from which we seek to retrieve knowledge. Subsequently, we divide them into smaller segments of text and generate embeddings using Sentence BERT (for instance). These embeddings are then stored in a vector database.

We convert our query into an embedding and then use the HNSW algorithm to search for this embedding in the vector database, just as we did previously. This algorithm identifies the top-K embeddings that best match our query, allowing us to trace back to the original text they were derived from. Next, we merge the query and the retrieved context into a prompt template. This combined input is then fed into the Large Language Model (LLM). Finally, the LLM generates a response based on the provided prompt.
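Continuing the earlier in-memory sketch, the retrieval-plus-generation step might look roughly like this (call_llm is a placeholder for whatever LLM API is being used, not a real function; a production system would query its vector database's HNSW index instead of the brute-force dot product shown here):

    import numpy as np

    def retrieve(query, index, model, k=3):
        q = model.encode([query], normalize_embeddings=True)[0]
        scores = index["vectors"] @ q                  # cosine similarity (vectors are normalized)
        top_k = np.argsort(scores)[::-1][:k]
        return [index["texts"][i] for i in top_k]

    def answer(query, index, model):
        context = "\n\n".join(retrieve(query, index, model))
        prompt = (
            "You are an assistant that answers questions using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {query}\nAnswer:"
        )
        return call_llm(prompt)                        # placeholder: replace with your LLM of choice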


RAG-Pipeline




To be continued in Part 2, where we will code RAG from scratch...

