Retrieval Augmented Generation (RAG) overview
Images taken from https://www.youtube.com/watch?v=6cv3s73MXFo


Generative AI learns from large snapshots of content (text, images, code, audio) to generate similar or even new content. Text generation is the most common LLM use case, but there are impressive LLM-based apps for images (like DALL-E), for code (like GitHub Copilot), and more.

In this article I'll present the Retrieval Augmented Generation (RAG) framework for designing and developing applications that need more information than what was used to pre-train the generative AI models. As you read, you'll find some code examples I created to illustrate the narrative. An OpenAI API key is needed to run the code.

Some LLM limitations

Large Language Models (or LLMs) are the source of natural language generation. ChatGPT's disruption was an inflection point for the popularity of generative AI, and many use cases are being developed very quickly (text summarization, text extraction, semantic search, chatbots, etc.). However, LLMs have some limitations; they are not perfect. Here are a couple of shortcomings:

1. LLMs were pre-trained with content that is not up to date, so they cannot answer questions about recent events; if they do, they are likely to hallucinate.

2. Even if an LLM is re-trained with fresh data, it still does not learn from all available data (for instance, a company's internal and private documents).

To illustrate the problem, let's ask the gpt-3.5-turbo model about the term langchain. As you may know, langchain is a framework for building applications based on large language models.

# Ask gpt-3.5-turbo about the term langchain
# (assumes the OPENAI_API_KEY environment variable is set)
from langchain.chat_models import ChatOpenAI

responder = ChatOpenAI(model_name="gpt-3.5-turbo", max_tokens=48)
print(responder.predict("tell me about langchain"))

I'm sorry, but I couldn't find any information about "Langchain." It is possible that you may be referring to a concept, organization, or product that is not widely known or recognized. Can you please provide more context or clarify your query?        

As you can see, the gpt-3.5-turbo model cannot return a useful answer because its training data only goes up to September 2021, while the langchain framework was released in October 2022.

Retrieval Augmented Generation

It's clear that an LLM's knowledge base (sometimes called parametric knowledge) needs to be complemented or augmented with more information.

So, Retrieval Augmented Generation (RAG) is a framework in which LLMs are complemented with an additional knowledge base in order to produce more reliable, specific and/or up-to-date answers.

The RAG idea was proposed in May 2020 by Patrick Lewis and his co-authors, who combined a pre-trained seq2seq transformer with a Wikipedia vector index. Complementing the model with additional data led to better answers.

The next exhibit presents the RAG architecture.

RAG architecture

Let's describe the RAG components. Data curation is probably the most important element because it ensures that the information used as the complementary knowledge base is compelling, secure and fresh. A pool of files is reviewed, filtered and in some cases summarized (to reduce costs). Data stewards are the owners of the documents; they ensure that the curated data bucket or repository contains clean and valuable information that will be sent to the indexing process.
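As a rough illustration of what a curation pass could look like (a hypothetical sketch: the approval flag and length threshold below are my assumptions, not part of a standard pipeline), a script could keep only steward-approved files and summarize long ones before indexing:

# Hypothetical curation pass: keep approved files, summarize long ones to cut costs
from langchain.chat_models import ChatOpenAI

summarizer = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

def curate(text, approved, max_chars=2000):
    if not approved:            # data stewards decide what enters the repository
        return None
    if len(text) <= max_chars:  # short documents are indexed as-is
        return text
    # summarize long documents to reduce indexing and prompting costs
    return summarizer.predict("Summarize the following document:\n\n" + text)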

For indexing, data should be split and complemented with document metadata; then an embedding model (for instance GCP's textembedding-gecko@001 or OpenAI's text-embedding-ada-002) is used to embed the documents, that is, to cast text into vectors of numbers. These vectors and the metadata are stored in a vector database (like Pinecone, PostgreSQL + pgvector, ChromaDB or AWS Neptune).
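As a minimal sketch of what embedding means (assuming an OpenAI API key is configured, as in the rest of this article), a single piece of text can be cast into a vector like this:

# Embed a single piece of text with text-embedding-ada-002
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')
vector = embeddings.embed_query("LangChain is a framework for LLM applications")
print(len(vector))  # ada-002 returns vectors with 1,536 entries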

The third component is a Gen AI application that lets end users launch prompts and receive answers (for instance a chatbot or a Q&A tool). In the backend of this application sits the Retriever (the fourth component): a snippet of langchain or LlamaIndex code that retrieves information from the vector database and combines it with the LLM to generate answers.

So, when an end user prompts, the query is embedded by the embedding model; the Vector Database then receives the embedded prompt and executes a search algorithm (most commonly nearest neighbor search) to return the vectors and metadata that best fit the received prompt.

The Vector Database output (chunks of documents and metadata) is then sent to the LLM, which generates a relevant answer that is presented to the end user on the Gen AI application front end.
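As a preview of this retrieval step, the snippet below runs a similarity search directly against the Chroma vector store that is built in the indexing section later in this article (the query text is just an example):

# Embed the prompt and return the k chunks that best match it, with metadata
docs = vectorstore.similarity_search("what is langchain", k=2)
for doc in docs:
    print(doc.metadata["source"], "->", doc.page_content[:80])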

Let's run some code to help the gpt-3.5-turbo model answer some questions about langchain.

1. Data preparation

Let's define some documents that contain information about langchain from different websites. There is not much data cleansing to do. The URLs of the websites the content was taken from will serve as the metadata in this case. The documents are short, so there is no need to split them up.

# Define some documents about langchain
from langchain.schema import Document
documents=[
          Document(page_content="LangChain is a framework for developing applications powered by language models. It enables applications that: Are context-aware: connect a language model to sources of context (prompt instructions, few shot examples, content to ground its response in, etc.)Reason: rely on a language model to reason (about how to answer based on provided context, what actions to take, etc.",
                    metadata={"source":"https://api.python.langchain.com/en/v0.0.339/schema/langchain.schema.document.Document.html"}  
                   ),
          Document(page_content="LangChain was launched in October 2022 as an open source project by Harrison Chase, while working at machine learning startup Robust Intelligence. ",
                    metadata={"source":"https://en.wikipedia.org/wiki/LangChain"}  
                   ),
          Document(page_content="LangChain can also be installed on Python with a simple pip command: pip install langchain.? To install all LangChain dependencies (rather than only those you find necessary), you can run the command pip install langchain[all]",
                    metadata={"source":"https://www.ibm.com/topics/langchain"}),
          Document(page_content="Let’s put together a simple question-answering prompt template. We first need to install the langchain library.pip install langchain",
                    metadata={"source":"https://www.pinecone.io/learn/series/langchain/langchain-intro/"})
                   ]        

2. Indexing

The text-embedding-ada-002 model is used to embed the previously defined list of documents, and Chroma is used as the vector database. Embeddings are vectors of numbers; the ada-002 model generates vectors with 1,536 entries. To illustrate the embedding results, the first five numbers of the first document's embedding are printed.

# Create the embeddings object, specifying the embedding model,
# and build the Chroma vector store from the documents
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')
vectorstore = Chroma.from_documents(documents, embeddings)

# Present the first five entries of the first document's embedding
vectorstore._collection.get(include=['embeddings'])['embeddings'][0][0:5]
[-0.007547943852841854,
 0.0023824041709303856,
 -0.0114842364564538,
 -0.027040034532546997,
 0.006394785828888416]        

3. Prompting app

Gradio is an interesting framework for building a simple tool to launch prompts. The qa object used below is defined in the Retriever section.

# Create the Gen AI front end

import gradio as gr

# Define a function that calls the retriever
# (qa is defined in the Retriever section below)
def retrieving(prompt):
    result = qa({"query": prompt})
    return [result['result'], result['source_documents']]

with gr.Blocks() as gen_ai_front:
    title = gr.HTML("<h1>RAG example</h1>")
    question = gr.Textbox(label="Type your question")
    answer = gr.Textbox(label="The answer")
    btn = gr.Button("Send your prompt")
    btn.click(fn=retrieving, inputs=question, outputs=answer)

gen_ai_front.launch(share=False)

A local URL is generated where the front end can be opened in a browser.

4. Retriever

The retriever chains the vector database and the LLM to generate answers. The RetrievalQA parameters can be consulted on the langchain documentation website.

# Define the retriever
from langchain.chains import RetrievalQA

llm = ChatOpenAI(temperature=0)

retriever = vectorstore.as_retriever(
    search_type='similarity', search_kwargs={"k": 2})

# qa integrates the llm and the vectorstore
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type='stuff',
                                 retriever=retriever,
                                 return_source_documents=True)

5. LLM inference and answer

Prompting "what is langchain" returns an answer generated by the LLM and the documents that the LLM used to create the response.

As you can see, the retriever can now answer prompts about langchain. This is the value of the RAG architecture. The main benefits of RAG are:

  • It is possible to build Gen AI apps (chatbots, Q&A tools) that generate answers based on fresh and domain-specific information.
  • Fewer hallucinations.
  • There is no need to re-train the LLM with additional data, which makes RAG cost effective.
  • It democratizes knowledge across organizations.

RAG challenges

Probably the most important RAG challenges are:

  • Keeping the complementary documents fresh and of high quality. This is challenging for companies where data stewardship is not a common practice.
  • Indexing costs. As the pool of documents grows, indexing costs increase as well, so managing the Vector Database and its inputs is key to keeping Gen AI apps healthy.
  • Hallucinations still happen.

Conclusions

RAG is a good framework for complementing an LLM's knowledge base and getting better answers to prompts. It makes it possible to create chatbots and Q&A tools that can dig into larger and fresher document pools to produce compelling answers.

Although RAG is a cost-effective way to improve Gen AI applications, hallucinations can still occur. So, techniques like Reinforcement Learning from Human Feedback (RLHF) and transfer learning are also useful for improving Gen AI applications in cases where accuracy is critical.


References

Langchain. Retrieval-augmented generation (RAG). https://python.langchain.com/docs/use_cases/question_answering/

Langchain. Self-querying. https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/

Lewis, P. et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. https://arxiv.org/pdf/2005.11401.pdf

Nvidia. What Is Retrieval-Augmented Generation, aka RAG? https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/
