RAG (Retrieval Augmented Generation) 101


Large Language Models (LLMs) have recently been a huge trend in building AI-based solutions. This is no surprise: they have been trained on large text corpora covering almost every area a human can possibly cover, so they have a very wide knowledge base. But sometimes they tend to provide made-up facts, outright falsehoods or complete nonsense when they encounter something they don't know. This is due to many factors, some of which are:

  • Source-reference divergence
  • Exploitation through jailbreak prompts
  • Reliance on incomplete or contradictory datasets
  • Overfitting and lack of novelty
  • Guesswork from vague or insufficiently detailed prompts

This phenomenon is often called "hallucination", and it has been the most common issue with almost all LLMs when they are used for a downstream task. For example, ChatGPT and Google Bard (now Gemini) recorded hallucination rates of 10% and 29% respectively in the PHM knowledge examination.

There are several ways to counter hallucination, and the following are some common methods.

  1. Contextual prompt engineering
  2. Domain Adaptation and Augmentation
  3. Adjusting model parameters or incorporating additional parameters

RAG, or Retrieval Augmented Generation, falls under domain adaptation and augmentation, and it is a proven and sustainable way to counter the hallucination issue when using an LLM for a downstream task.

The concept of RAG is pretty simple: the LLM is connected to your specific dataset. The LLM's parameters are kept intact, but the connection to the knowledge base allows the model to refer to that data and provide better results, reducing its vulnerability to hallucination.

How does this happen? In a RAG pipeline, the LLM dynamically incorporates the knowledge base (KB) data during the generation process. The model is allowed to access and use the data in the KB in real time, without altering the data itself, so its responses become more contextually relevant to the request.
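To make that concrete, here is a tiny, self-contained sketch of the idea. The knowledge base, the naive keyword retriever and the prompt template are all illustrative stand-ins; a real pipeline would use an embedding-based retriever and then send the assembled prompt to an actual LLM.

```python
# Minimal illustration of "augmenting the prompt with KB data at generation time".
# The KB entries, retriever and prompt template below are toy stand-ins.
knowledge_base = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm GMT.",
    "Premium plans include priority email support.",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Rank KB entries by naive keyword overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query: str) -> str:
    """System prompt + retrieved context + user query: this string goes to the LLM."""
    context = "\n".join(retrieve(query))
    return (
        "Answer using ONLY the context below. If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

print(build_prompt("When can I get a refund?"))
```

Notice that the LLM itself is untouched; only the prompt it sees changes from request to request.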


Overview of RAG (Simplified)

In a RAG pipeline, the reference/KB data first has to be "indexed". Indexing here means preparing the data so that it can be queried. When a user queries for a specific piece of information, the index should be able to filter down to the most contextually relevant data. The LLM then uses the filtered data, the user query and an instructive prompt (often called the system prompt) to produce the response.

Creating a RAG pipeline involves a few steps; a minimal end-to-end sketch follows the list.

  1. Loading the data : The data that will serve as the knowledge base is loaded and stored. The data is not restricted to a single type or format; it could be text, a PDF, a website, an image, audio, etc.
  2. Indexing : Most of that data is unstructured, so it has to be structured in a way that makes it queryable. Almost always the data is vectorized, i.e. converted into vectors, and in most cases embedding models are used for this. Embedding models have the remarkable ability to extract contextual information from the data and represent it in vector form. How well that contextual information is captured, however, depends largely on the model.
  3. Storing : Once the data is indexed, it should be stored; otherwise it would have to be re-indexed again and again. The well-known vector databases (or indexes) are used for this. These VDBs are similar to NoSQL databases but tailored for vector operations, so the embedding vectors are saved here along with some metadata.
  4. Querying : For a given user input, the most contextually similar/relevant reference data point(s) are searched for in the index. The user input is first embedded using the same embedding model, and the index then finds the most relevant vectors for the vectorized query in its vector space using a similarity metric. Some of the most used similarity metrics are cosine, Euclidean and Hamming (yeah, they are not new; just some old-school vector distance metrics).
  5. Evaluation : The LLM's response is evaluated for its effectiveness and relevance to the user input. This step gives a good intuition about the pipeline's overall performance. Two aspects of the RAG pipeline are evaluated here, the retriever and the LLM itself, via retrieval evaluation and response evaluation respectively.
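Below is a minimal sketch of steps 1-4 using the sentence-transformers library for embeddings and a plain NumPy array as a stand-in for a vector database. The model name and the toy documents are illustrative choices, not requirements.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# 1. Load: a toy knowledge base (in practice: chunks of PDFs, web pages, transcripts, ...)
documents = [
    "RAG connects an LLM to an external knowledge base.",
    "Vector databases store embedding vectors together with metadata.",
    "Cosine similarity measures the angle between two vectors.",
]

# 2. Index: convert every chunk into an embedding vector
model = SentenceTransformer("all-MiniLM-L6-v2")   # any text embedding model works here
doc_vectors = model.encode(documents, normalize_embeddings=True)

# 3. Store: an in-memory array stands in for a real vector database (Qdrant, Chroma, FAISS, ...)
index = np.asarray(doc_vectors)

# 4. Query: embed the user input with the SAME model and rank by cosine similarity
query = "Where are the embeddings kept?"
query_vec = model.encode([query], normalize_embeddings=True)[0]
scores = index @ query_vec        # dot product of unit vectors == cosine similarity
for i in np.argsort(-scores)[:2]:
    print(f"{scores[i]:.3f}  {documents[i]}")
```

The top-scoring chunks plus the user query are then assembled into the augmented prompt shown earlier.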
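Retrieval evaluation (step 5) can start as simply as checking whether the expected chunk appears in the top-k results for a set of hand-labelled queries. The snippet below reuses model, index and documents from the sketch above; the query/label pairs are made up purely for illustration.

```python
# Toy retrieval evaluation: hit rate @ k over hand-labelled (query, expected chunk index) pairs.
eval_set = [
    ("What does RAG connect an LLM to?", 0),
    ("Where are embeddings kept?", 1),
]

def hit_rate_at_k(pairs, k=2):
    hits = 0
    for question, expected_idx in pairs:
        q_vec = model.encode([question], normalize_embeddings=True)[0]
        top_k = np.argsort(-(index @ q_vec))[:k]
        hits += int(expected_idx in top_k)
    return hits / len(pairs)

print(f"hit rate @ 2: {hit_rate_at_k(eval_set):.2f}")
```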


Myth-buster : RAG is not just for text data; it can also be implemented for other data types such as audio and images. All you need is an embedding model that supports those formats and an LLM with multi-modal capabilities.
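For example, CLIP-style models embed images and text into a shared vector space, so a text query can retrieve images with the same cosine-similarity machinery. Here is a small sketch using the CLIP checkpoint shipped with sentence-transformers; the image paths are placeholders.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

clip = SentenceTransformer("clip-ViT-B-32")   # embeds both images and text

# Index: embed a couple of images from the knowledge base (placeholder paths)
image_paths = ["diagrams/rag_pipeline.png", "diagrams/vector_db.png"]
image_vectors = clip.encode([Image.open(p) for p in image_paths])

# Query: a text query lands in the same vector space, so cross-modal search just works
query_vector = clip.encode("a diagram of a vector database")
print(util.cos_sim(query_vector, image_vectors))   # cosine similarity per image
```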


Following are some tools/tech stacks that can be used to implement a RAG pipeline for your needs.

  1. Frameworks : LlamaIndex and LangChain
  2. Indexes/Vector databases : Qdrant, Weaviate, Pinecone, Chroma, FAISS, etc.
  3. LLMs : OpenAI GPT models, Google Gemini and PaLM, Anthropic Claude, Mistral AI's Mistral models, Meta Llama
  4. Embeddings : OpenAI text embedding models, CLIP (image embeddings), Hugging Face models (e.g. BGE models), Meta ImageBind (multi-modal capabilities)
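As a reference point, here is roughly what the quick-start with LlamaIndex looks like. The import paths follow recent llama-index releases and may differ in older versions, and a "data/" folder plus a configured OpenAI API key (the default LLM and embedding backend) are assumptions.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # 1. load the KB files
index = VectorStoreIndex.from_documents(documents)     # 2-3. embed and store (in-memory by default)
query_engine = index.as_query_engine()                 # retrieval + LLM generation in one object
print(query_engine.query("Summarise the refund policy."))
```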


More advanced multi-modal RAG pipelines allow users to query both text and images within a single system.

