Retrieval-Augmented Generation (RAG)

What is RAG?

  • Retrieval-Augmented Generation (RAG) is the idea of providing LLMs with additional information from an external knowledge source, combining the strengths of retrieval systems and generative models. LLMs often cannot produce a good answer when it requires knowledge that was not part of their training data, such as recent or domain-specific information. RAG bridges the gap between the LLM's general knowledge and this external content, helping the model generate more accurate and contextually grounded results.

How Does RAG Work?

First, we convert external documents into a format the LLM can work with, usually vector embeddings.

Steps to convert your documents into vectors (a minimal sketch in Python follows this list):

  • Collect all source documents
  • Clean documents
  • Load documents
  • Split text into chunks
  • Create embeddings for the text chunks
  • Store the embeddings in a vector store
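
To make these steps concrete, here is a minimal ingestion sketch in Python. This is not the exact pipeline from my notebook: the sentence-transformers model name, the naive character-based chunker, and the in-memory NumPy array standing in for a vector store are illustrative assumptions; in practice you would typically use LangChain or LlamaIndex together with a dedicated vector database.

```python
# Minimal ingestion sketch: documents -> chunks -> embeddings -> "vector store".
# Assumptions: plain-text files in ./docs, the all-MiniLM-L6-v2 embedding model,
# naive fixed-size chunking, and a NumPy matrix standing in for a vector database.
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer


def load_documents(folder: str = "docs") -> list[str]:
    """Collect and load all source documents (here: plain .txt files)."""
    return [p.read_text(encoding="utf-8") for p in Path(folder).glob("*.txt")]


def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size character chunks (very naive splitter)."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


documents = load_documents()                                        # collect + load
chunks = [c for doc in documents for c in split_into_chunks(doc)]   # split into chunks

# Create embeddings for the text chunks (model choice is an assumption).
model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = model.encode(chunks, normalize_embeddings=True)

# "Store" them: a NumPy matrix here; swap in FAISS, Chroma, Weaviate, etc. in practice.
vector_store = np.asarray(chunk_embeddings)
print(f"Stored {vector_store.shape[0]} chunks of dimension {vector_store.shape[1]}")
```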

Flow of RAG

The RAG process comes in three key parts (a short code sketch follows this list):

  • Retrieval: The user query is used to retrieve relevant context from an external knowledge source. For this, the user query is embedded with an embedding model into the same vector space as the additional context in the vector database. This allows us to perform a similarity search, and the top-k closest data objects from the vector database are returned.
  • Augmentation: The user query and the retrieved additional context are stuffed into a prompt template.
  • Generation: The retrieval-augmented prompt is fed to the LLM, which generates the final response.
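
Here is the same flow as a rough code sketch. It assumes the chunks and normalized chunk_embeddings produced by the ingestion sketch above, and call_llm is only a placeholder for whatever LLM client you use, not a specific provider API.

```python
# Minimal retrieval -> augmentation -> generation sketch.
# Assumes `chunks` and normalized `chunk_embeddings` from the ingestion sketch,
# and a placeholder `call_llm` instead of a real provider SDK.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

PROMPT_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}
Answer:"""


def call_llm(prompt: str) -> str:
    """Placeholder: swap in your LLM client of choice (OpenAI, local model, ...)."""
    raise NotImplementedError


def answer(question: str, chunks, chunk_embeddings, top_k: int = 3) -> str:
    # Retrieval: embed the query into the same vector space and take the top-k
    # chunks by cosine similarity (a dot product, since embeddings are normalized).
    query_embedding = model.encode([question], normalize_embeddings=True)[0]
    scores = chunk_embeddings @ query_embedding
    top_indices = np.argsort(scores)[::-1][:top_k]

    # Augmentation: stuff the retrieved chunks and the query into the template.
    context = "\n\n".join(chunks[i] for i in top_indices)
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)

    # Generation: the retrieval-augmented prompt is fed to the LLM.
    return call_llm(prompt)
```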

What Problems Does RAG Solve?

  • Reduced hallucinations:

An LLM may generate responses that are inaccurate or irrelevant to the context, especially when it is guessing about things it does not know. RAG allows the LLM to draw upon external knowledge sources to supplement its internal knowledge, grounding responses in retrieved facts.

  • Up-to-date information:

If the external documents used for retrieval are regularly updated, the RAG model can have more recent information. This solves the problem of producing outdated and incorrect information.

  • Easy updates:

RAG frameworks bypass the need for costly, time-intensive retraining and updating of foundation models. The source data can be updated easily by adding new documents.

  • Domain-specific knowledge:

RAG is an effective way to augment the foundation model with domain-specific data. The LLM will be able to provide contextually relevant responses tailored to that domain-specific data.

Prompt Engineering, RAG, or Fine-Tuning?

The choice between Prompt Engineering, RAG (Retrieval-Augmented Generation), and Fine-Tuning depends on the specific use case and requirements. Each approach serves different purposes and is suited to different scenarios. Here are a few questions you need to consider.

  • Will the amount of knowledge in a pre-trained model suffice for what I need it to do or does my use case require additional info and context?
  • Is my use case a standardized task or is it a domain-specific task?
  • Do I have a plethora of training data or am I limited?
  • Does the task require additional context and does the information need to be up-to-date?

Here's a brief overview of when to use each approach:

Prompt Engineering

When you want to provide specific instructions or guidance to the AI model for generating responses. It's ideal for situations where you have a clear idea of what you want the AI to produce and where the use case relies mostly on the model's pre-trained knowledge.

RAG (Retrieval-Augmented Generation)

When you need AI to retrieve and incorporate information from a large knowledge base or corpus into its responses. It's beneficial when the context and relevance of information matter.

Fine-Tuning

When you want to adapt a pre-trained language model to perform specific tasks or excel in a particular domain. It's valuable for tasks where you have a large amount of task-specific training data available.

Disadvantages of RAG

  • Latency Issues: The two-step process of first retrieving documents and then generating responses can introduce latency. This might not be suitable for applications that require real-time responses.
  • Context Length Limitation: We have to be cautious of the maximum context length that the decoder transformer can handle. For example, ChatGPT has a maximum context length of 4,096 tokens (roughly 3 pages of single-spaced English text). If the combined length of the input sequence and the retrieved documents exceeds this limit, some information will have to be truncated, which can affect the quality of the response (see the token-budget sketch after this list).
  • Dependent on Semantic Search: The effectiveness of RAG is highly reliant on the quality of the semantic search. If the search retrieves irrelevant or low-quality documents, the generated responses may also be of poor quality.
  • Requires Existing Data: RAG depends on having an existing database of documents to retrieve from. Without a substantial database, it’s not possible to leverage the benefits of RAG.
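
To make the context-length point concrete, here is a small sketch that counts tokens with tiktoken and greedily keeps retrieved chunks until a token budget is used up. The 4,096-token limit and the cl100k_base encoding mirror the ChatGPT example above; both are assumptions you should adjust for the model you actually use.

```python
# Token-budget sketch: keep only as many retrieved chunks as fit in the context
# window. The cl100k_base encoding and the 4096-token limit are illustrative
# assumptions taken from the ChatGPT example above, not universal values.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
MAX_CONTEXT_TOKENS = 4096
RESERVED_FOR_PROMPT_AND_ANSWER = 1000  # rough allowance for template + completion


def count_tokens(text: str) -> int:
    return len(encoding.encode(text))


def fit_chunks_to_budget(retrieved_chunks: list[str]) -> list[str]:
    """Greedily keep the highest-ranked chunks until the token budget is spent."""
    budget = MAX_CONTEXT_TOKENS - RESERVED_FOR_PROMPT_AND_ANSWER
    kept, used = [], 0
    for chunk in retrieved_chunks:  # assumed ordered best-first by the retriever
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept
```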

Implementation:

The easiest way to implement RAG is to use a framework such as LangChain or LlamaIndex. I have implemented RAG on Andrew Huberman's podcast using LlamaIndex. Here is the link to the code, and a minimal sketch follows below.

https://github.com/ravi2799/RAG/blob/main/Llama_Index%20.ipynb
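
As a taste of how little code this takes, below is a minimal LlamaIndex sketch. It is not the notebook linked above: the data folder, the example question, and the top-k value are illustrative; the imports follow the pre-1.0 llama_index layout (newer releases use llama_index.core); and the default embedding model and LLM are OpenAI-backed, so an API key is assumed.

```python
# Minimal LlamaIndex sketch (imports follow the pre-1.0 `llama_index` layout;
# newer versions expose the same classes under `llama_index.core`). Assumes
# OPENAI_API_KEY is set, since the default embeddings and LLM are OpenAI-backed.
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Load documents (e.g., a folder of podcast transcripts; the path is illustrative).
documents = SimpleDirectoryReader("data").load_data()

# Chunk, embed, and store the documents in an in-memory vector index.
index = VectorStoreIndex.from_documents(documents)

# Retrieval + augmentation + generation in one call.
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What protocols are discussed for improving sleep?")
print(response)
```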

References:

  1. https://medium.com/@minh.hoque/retrieval-augmented-generation-grounding-ai-responses-in-factual-data-b7855c059322
  2. https://gradient.ai/blog/rag-101-for-enterprise
  3. https://docs.llamaindex.ai/en/latest/getting_started/concepts.html
  4. https://cobusgreyling.medium.com/rag-retrieval-augmented-generation-c81044081e6f






