Retrieval Augmented Generation (RAG)

RAG: An innovation for optimizing language models

Introduction

Retrieval Augmented Generation (RAG) is an advanced technique in natural language processing (NLP) that combines two powerful approaches: retrieval and generation. A RAG model uses a retrieval system to extract relevant information from a vast knowledge base and then employs a generative model to create responses based on this information. This approach allows large language models (LLMs) to access updated and pertinent data, improving the quality and reliability of the generated responses.

How Retrieval Augmented Generation (RAG) works

The RAG process is divided into two main stages:

  • Retrieval Step: The system searches for relevant documents from external sources, such as web pages, databases, or text corpora, in response to an input query.
  • Generation Step: The retrieved passages are used as context by a generative model, which produces a more enriched and informed response than what would be generated from training data alone. This approach is particularly effective in ensuring that the model's output is relevant, accurate, and applicable in various contexts without the need for complete retraining (a minimal sketch of both stages follows below).
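
Here is a minimal, illustrative sketch of these two stages in Python. The retrieve and generate callables are hypothetical stand-ins for the vector search engine and the language model that the full examples later in this article implement.

# Minimal sketch of the two RAG stages (illustrative only).
# `retrieve` and `generate` are placeholder callables: in a real system they wrap
# a vector store (e.g. FAISS) and an LLM, as shown in the complete examples below.
def rag_answer(query, retrieve, generate, top_k=3):
    # Retrieval step: fetch the passages most relevant to the query.
    passages = retrieve(query, top_k=top_k)
    # Generation step: answer the question using the retrieved passages as context.
    context = "\n\n".join(passages)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)

# Example wiring (hypothetical): rag_answer("What is RAG?", my_retriever, my_llm)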

The importance of Retrieval Augmented Generation

LLMs, while being fundamental components in the evolution of artificial intelligence (AI), have certain limitations, such as a tendency to generate inaccurate, outdated, or incorrect information. These limitations arise from the static nature of training data, which is not always up-to-date. The main issues include:

  • Inaccurate Information: When the answer is not present in the training dataset.
  • Generic or Outdated Responses: In situations requiring specific and current information.
  • Non-authoritative Sources: Compromising the quality of the output.
  • Terminological Confusion: Due to the use of similar terms in different contexts.

RAG addresses these challenges by enabling the model to reference external, authoritative knowledge sources, enhancing the relevance and accuracy of responses.

Benefits of Retrieval Augmented Generation

Adopting RAG technology offers numerous advantages:

  • Cost Efficiency: It improves language models without the high costs associated with complete model retraining.
  • Information Updating: Allows for the integration of up-to-date information, ensuring responses remain pertinent.
  • Increased User Trust: Responses supported by authoritative sources enhance transparency and trust in the generated output.
  • Greater Control: Developers can tailor and optimize the information sources used, improving the quality and security of the responses.

The RAG process in detail

The RAG process involves several key stages:

  • Creation of External Data: New information from external sources is converted into numerical representations and stored in a vector database.
  • Retrieval of Relevant Information: The user's input is transformed into a vector representation and matched against the database to identify the most relevant data.
  • LLM Prompt Augmentation: The model uses the retrieved data to enrich the user's input, improving the accuracy of the generated responses (see the toy sketch after this list).
  • Continuous Data Updates: To maintain relevance, external data is updated periodically or in real-time.
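
The toy sketch below illustrates the retrieval and prompt-augmentation stages: it ranks a set of made-up document vectors against a made-up query vector with cosine similarity and then builds the augmented prompt. The corpus, the vectors, and the augment_prompt helper are invented for illustration only; a real system would use an embedding model and a vector database such as FAISS, as in the examples later in this article.

import numpy as np

def cosine_similarity(query_vec, doc_vectors):
    # Cosine similarity between one query vector and each stored document vector.
    return (doc_vectors @ query_vec) / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vec) + 1e-10
    )

def augment_prompt(question, query_vec, doc_vectors, documents, top_k=2):
    # Retrieval of relevant information: rank the stored vectors against the query vector.
    scores = cosine_similarity(query_vec, doc_vectors)
    best = np.argsort(scores)[::-1][:top_k]
    # LLM prompt augmentation: prepend the retrieved passages to the user's question.
    context = "\n".join(documents[i] for i in best)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Toy "vector database": three passages with made-up embedding vectors.
documents = [
    "RAG combines retrieval with text generation.",
    "FAISS performs fast similarity search over vectors.",
    "Embeddings map text into a high-dimensional vector space.",
]
doc_vectors = np.random.rand(3, 8).astype("float32")  # stand-in for real document embeddings
query_vec = np.random.rand(8).astype("float32")       # stand-in for the embedded user query

print(augment_prompt("What is RAG?", query_vec, doc_vectors, documents))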

Final considerations

Retrieval Augmented Generation represents a significant advancement in the ability of large language models to generate more relevant and reliable responses. By integrating external information and keeping data up-to-date, RAG enhances the effectiveness of LLMs in various application contexts, making generative AI an increasingly valuable resource for organizations.

Challenges of Retrieval Augmented Generation

Despite its many benefits, RAG also presents some challenges:

  • Data Quality: The quality of responses depends largely on the quality of the retrieved data.
  • Computational Complexity: RAG requires increased computational and temporal complexity, involving both retrieval and generation stages.
  • Consistency and Relevance: Issues may arise if the retrieved data is not well-aligned with the query or context.

Real world applications of Retrieval Augmented Generation

RAG finds application in many areas, such as creative content generation, the analysis and synthesis of data from diverse sources, and the creation of articles enriched with up-to-date information. For example, RAG can generate articles on the latest scientific discoveries by drawing on reliable academic sources or analyze financial reports to extract relevant information.

Code example to understand RAG

This article presents two examples of implementing Retrieval Augmented Generation (RAG) in Python on Google Colab. These examples demonstrate two distinct approaches to building a RAG system: one that is more manual and requires handling all the steps, and another that leverages Hugging Face APIs to significantly simplify the process.

First example: Full implementation of a RAG system

In the first example, we show how to manually implement all the functionalities of a RAG system using pre-trained models, managing each step of the process:

  • Generating Embeddings: The generate_embedding function converts text into numerical embeddings using a pre-trained model (such as LLAMA). These embeddings represent the semantic meaning of the text in a high-dimensional vector space, which is useful for subsequent retrieval operations.
  • Creating and Saving Embeddings: The create_and_save_embeddings function generates embeddings for a corpus of documents and saves them in an HDF5 file, an efficient file format for handling large amounts of data.
  • Loading Embeddings and Creating the FAISS Index: Subsequently, the embeddings are loaded from an HDF5 file, and a FAISS index is created. FAISS (Facebook AI Similarity Search) is a library used to perform fast searches of similar vectors in large datasets, accelerating the retrieval process.
  • Improved Retrieval with Diversity Control: The retrieve_passages function uses the FAISS index to retrieve the most relevant passages from the corpus based on a specific query. To avoid repetition, the code implements a diversity control mechanism that ensures the retrieved passages are unique, thereby improving the quality of the context provided to the generative model.
  • Generating the Response: Finally, the generate_answer function uses the retrieved passages as context to generate a response to the given question. This function allows control over various parameters, such as the maximum length of the response, creativity, and repetition penalty.
  • Coordinating the RAG Process: The ask_question function combines all these steps, coordinating retrieval and generation to effectively answer a specific question.

Second example: Using Hugging Face APIs to simplify RAG implementation

In the second example, a Hugging Face pipeline is used to dramatically simplify the implementation of RAG, automating many of the steps handled manually in the first example:

  • Creating the Text Generation Pipeline: The llm_pipeline pipeline from Hugging Face is configured for text generation, setting parameters like max_new_tokens, temperature, and top_p to control the generated response. This pipeline internally manages the model and tokenizer, significantly simplifying the implementation.
  • Managing Embeddings: A pre-trained embedding model, such as sentence-transformers/all-MiniLM-L6-v2, is used to generate dense vector representations of the corpus. These embeddings can be loaded from previously saved files or generated anew if necessary.
  • Creating and Managing the FAISS Index: The FAISS library is used to create or load a vector index that facilitates fast retrieval of documents based on embedding similarity. This process is streamlined by the direct integration with Hugging Face, which automatically handles many complex operations.
  • Retrieval and Combination of Documents: Using the retriever associated with the FAISS index, the most relevant documents for a given query are retrieved. These documents are then combined into a single block of text, ready to be used as context in the generation process.
  • Generating the Response: The retrieval_qa function passes the combined context and the question to the text-generation pipeline to produce a response; LLMChain, together with the prompt template defined below, offers an equivalent, more structured way to do the same. The response is further cleaned by removing any repetitions, ensuring greater coherence and readability.
  • Prompt Template: A well-structured prompt is defined using PromptTemplate, which specifies how the context and question should be presented to the generative model. This helps guide the model in producing concise and accurate responses.

In summary, while the first example demonstrates how to build a RAG system from scratch by manually managing all the steps, the second example shows how Hugging Face APIs can automate much of this process, making development faster and more accessible without sacrificing the power and flexibility of the system.

Let's start by defining the components common to both programs

Library Installation

!pip install faiss-cpu
!pip install langchain sentence-transformers faiss-cpu
!pip install --upgrade transformers
!pip install -U langchain-community
!pip install unstructured
!pip install cloud-tpu-client
!pip install PyPDF2
!pip install python-docx

Library import, Google Drive access, Hugging Face authentication Token definition, and specification of the pre-trained LLAMA model

# Import the necessary libraries
import torch
import os
import sys
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW
import torch.nn.functional as F
from tqdm import tqdm
import faiss
import numpy as np
import h5py
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain import LLMChain, PromptTemplate
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.schema import Document
from langchain.llms import HuggingFacePipeline
import csv
import PyPDF2
from bs4 import BeautifulSoup
from docx import Document as DocxDocument
from langchain.docstore import InMemoryDocstore
import difflib
from google.colab import drive

# Google Drive mounting
drive.mount('/content/drive', force_remount=True)

# Defining paths for files
path = '/content/drive/My Drive/<your path>/'
pathDocs = path + 'docs'
corpus_file = path + 'corpus.txt'
corpus_hdf5 = path + 'corpus.h5'
embeddings_hdf5 = path + 'embeddings.h5'
index_file = path + 'faiss_index.index'
pathFaissIndexBin = path + 'faiss_index.bin'
pathEmbeddingNpy = path + 'embeddings.npy'

# Setting up the Hugging Face authentication token
huggingface_token = "<your token>"

# Specify the model name
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

Checking TPU and GPU availability

if 'COLAB_TPU_ADDR' in os.environ:
    print("TPU found. Configuring...")
    try:
        import torch_xla
        import torch_xla.core.xla_model as xm
        device = xm.xla_device()
        print("TPU configured successfully. Device:", device)
    except ImportError as e:
        print("Error importing torch_xla:", e)
        print("Switching to GPU or CPU.")
        if torch.cuda.is_available():
            device = torch.device("cuda")
            print("GPU usage. Device:", device)
        else:
            device = torch.device("cpu")
            print("CPU usage. Device:", device)
else:
    print("No TPU found.")
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print("GPU usage. Device:", device)
    else:
        device = torch.device("cpu")
        print("CPU usage. Device:", device)

Load the Tokenizer and model from Hugging Face

# Loading the tokenizer and model:
# This command loads the tokenizer associated with the pre-trained model.
# The tokenizer is responsible for converting text into tokens that can be processed by the model.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=huggingface_token)

# Setting the padding token:
# This line sets the padding token (pad_token) to the end-of-sequence token (eos_token) if the tokenizer does not already have a padding token defined.
# This ensures that all sequences have the same length during the batching process.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Loading the pre-trained model:
# Finally, the pre-trained language model is loaded from Hugging Face and transferred to the configured device (TPU, GPU, or CPU).
model = AutoModelForCausalLM.from_pretrained(model_name, use_auth_token=huggingface_token).to(device)

Model Inference

def generate_response(prompt, max_length=200, temperature=0.7, top_k=50, top_p=0.9):
    # Tokenize the prompt and move the tensors to the device
    inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(device)
    # Generate the response using the model with advanced decoding techniques
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        do_sample=True,  # sampling must be enabled for temperature/top_k/top_p to take effect
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        pad_token_id=tokenizer.pad_token_id,
        no_repeat_ngram_size=2
    )
    # Decode the generated response and return it
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

prompt = input("Enter the prompt: ")
response = generate_response(prompt)
print("Model response:", response)        

HDF5 file generation

This code is designed to handle the reading and processing of various file types (such as .txt, .pdf, .csv, .docx, .html) and convert them into a unified format, which is then stored in an HDF5 file for future use. Below is a detailed description of what each part of the code does:

  1. Document Class
  2. Loading Functions
  3. Function load_and_save_corpus_to_hdf5_extended
  4. Returning the Documents

This code is ideal for scenarios where it is necessary to unify and structure data from various sources (textual, web, database) and save it in an efficient format for further analysis or processing, such as use with machine learning models or advanced search.

# Creating the Document class
class Document:
    def __init__(self, page_content, metadata=None):
        self.page_content = page_content
        self.metadata = metadata if metadata is not None else {}

# Loading functions for various file types
def load_txt(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return Document(page_content=file.read(), metadata={"source": file_path})

def load_pdf(file_path):
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            text += page.extract_text()
        return Document(page_content=text, metadata={"source": file_path})

def load_csv(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        reader = csv.reader(file)
        text = "\n".join([", ".join(row) for row in reader])
        return Document(page_content=text, metadata={"source": file_path})

def load_docx(file_path):
    doc = DocxDocument(file_path)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    return Document(page_content="\n".join(full_text), metadata={"source": file_path})

def load_html(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
        soup = BeautifulSoup(content, 'html.parser')
        return Document(page_content=soup.get_text(separator='\n'), metadata={"source": file_path})

def load_and_save_corpus_to_hdf5_extended(directory_path, hdf5_path, web_urls=None, sql_queries=None, sql_connections=None, chunk_size=500, chunk_overlap=50):
    # Load files from the directory and process them based on their type
    documents = []
    for root, dirs, files in os.walk(directory_path):
        for file in files:
            file_path = os.path.join(root, file)
            if file.endswith('.txt'):
                documents.append(load_txt(file_path))
            elif file.endswith('.pdf'):
                documents.append(load_pdf(file_path))
            elif file.endswith('.csv'):
                documents.append(load_csv(file_path))
            elif file.endswith('.docx'):
                documents.append(load_docx(file_path))
            elif file.endswith('.html'):
                documents.append(load_html(file_path))
    # Add data from web URLs
    if web_urls:
        for url in web_urls:
            web_content = download_and_load_web_content(url)
            documents.append(Document(page_content=web_content, metadata={"source": url}))
    # Add data from SQL queries
    if sql_queries and sql_connections:
        for query, connection_string in zip(sql_queries, sql_connections):
            sql_content = load_data_from_sql(query, connection_string)
            documents.append(Document(page_content=sql_content, metadata={"source": connection_string}))
    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    texts = text_splitter.split_documents(documents)
    combined_corpus = [text.page_content for text in texts]
    # Encode the corpus in UTF-8 format and save to HDF5
    encoded_corpus = [line.encode('utf-8') for line in combined_corpus]
    with h5py.File(hdf5_path, 'w') as hf:
        hf.create_dataset('corpus', data=np.array(encoded_corpus, dtype='S'))
    return texts

def load_corpus_from_hdf5(hdf5_path):
    with h5py.File(hdf5_path, 'r') as hf:
        data = hf['corpus'][:]
    return [Document(page_content=content.decode('utf-8')) for content in data]
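
Note that load_and_save_corpus_to_hdf5_extended calls two helpers, download_and_load_web_content and load_data_from_sql, that are not shown in the article (they are only needed when web_urls or sql_queries are actually passed). Below is one possible minimal sketch of them, assuming the requests, pandas, and SQLAlchemy packages are available in the Colab environment; adapt it to your own sources.

import requests
import pandas as pd
from bs4 import BeautifulSoup
from sqlalchemy import create_engine

def download_and_load_web_content(url):
    # Fetch a web page and return its visible text.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup.get_text(separator='\n')

def load_data_from_sql(query, connection_string):
    # Run a SQL query and return the result set as plain text (CSV-formatted).
    engine = create_engine(connection_string)
    df = pd.read_sql(query, engine)
    return df.to_csv(index=False)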


FIRST EXAMPLE OF RAG

Part 1

  • Embeddings: Embeddings are vector representations of data (such as words, sentences, or documents) in a continuous high-dimensional space. These vectors capture the semantic features of the data so that vectors that are similar in meaning are close together in the vector space. In the context of natural language, embeddings are often used to represent words or sentences so that they can be used for machine learning tasks such as classification or similarity searching.
  • FAISS: FAISS is a library developed by Facebook AI Research designed to perform similarity searches on large amounts of high-dimensional vectors quickly and efficiently. It is often used for tasks such as similar image search, clustering, and, as in this case, document retrieval in a RAG system. FAISS works by creating indexes that reduce the number of comparisons needed to find similar vectors, using techniques such as vector quantization and space partitioning.

# Function to generate embeddings from the model.
# This function takes some text, tokenizes it using the provided tokenizer, and then uses the LLaMA model to generate an embedding.
# The embedding is a dense numerical representation of the text, usually in a high-dimensional vector space.
# Embeddings are used to capture the semantic meaning of words or sentences in a format that can be easily manipulated by machine learning algorithms.
# The `outputs.hidden_states[-1].mean(dim=1)` call takes the last layer of hidden states from the model and averages them across the token dimension (dim=1), resulting in a single vector for each sentence.
def generate_embedding(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    embedding = outputs.hidden_states[-1].mean(dim=1).cpu().numpy()
    return embedding

# Creating and saving embeddings in an HDF5 file.
# This function takes a corpus of documents, generates embeddings for each document, and saves them in an HDF5 file.
# HDF5 is a file format that allows large amounts of data to be stored in an efficient and organized way, with quick access to specific parts of the data.
# The `for doc in corpus` loop iterates over each document in the corpus, generates the embedding using the `generate_embedding` function, and adds it to the list of embeddings.
# Finally, all embeddings are saved in the 'embeddings' dataset of the HDF5 file.
def create_and_save_embeddings(corpus, tokenizer, model, hdf5_path):
    with h5py.File(hdf5_path, 'w') as hf:
        embeddings = []
        for doc in corpus:
            embedding = generate_embedding(doc.page_content, tokenizer, model)
            embeddings.append(embedding)
        hf.create_dataset('embeddings', data=np.vstack(embeddings))

# Loading embeddings from the HDF5 file.
# This function loads embeddings from a previously created HDF5 file and returns them as a NumPy array.
# This is useful for restoring embeddings without having to regenerate them each time.
def load_embeddings_from_hdf5(hdf5_path):
    with h5py.File(hdf5_path, 'r') as hf:
        embeddings = hf['embeddings'][:]
    return embeddings

# Creating the FAISS index with modified parameters.
# This function creates a FAISS index for efficient retrieval of embeddings.
# FAISS (Facebook AI Similarity Search) is a library developed by Facebook AI Research that allows you to quickly search for similar vectors in large, high-dimensional datasets.
# It is often used to speed up the retrieval process in RAG systems.
# - `nlist` is the number of clusters into which the embeddings are split. This value can affect the speed and accuracy of the retrieval.
# - `nprobe` is the number of clusters to explore during the search. The higher the `nprobe`, the more accurate the search will be, but at the expense of performance.
# The function first creates a quantizer (using `IndexFlatL2`, which uses the Euclidean distance) and then creates an index with `IndexIVFFlat`, which is a type of FAISS index that divides the vector space into clusters to speed up the search.
# The index is trained on the embeddings and then all embeddings are added to it.
def create_faiss_index(embedding_dim, embeddings, nlist=100, nprobe=10):
    nlist = min(nlist, len(embeddings))
    quantizer = faiss.IndexFlatL2(embedding_dim)
    index = faiss.IndexIVFFlat(quantizer, embedding_dim, nlist, faiss.METRIC_L2)
    index.train(embeddings)
    index.add(embeddings)
    index.nprobe = nprobe
    return index

# Load the corpus and save it to HDF5
corpus = load_and_save_corpus_to_hdf5_extended(pathDocs, corpus_hdf5)

# Generate embeddings and save them to HDF5
create_and_save_embeddings(corpus, tokenizer, model, embeddings_hdf5)

# Load the embeddings and create the FAISS index
embeddings = load_embeddings_from_hdf5(embeddings_hdf5)
embedding_dim = embeddings.shape[1]
index = create_faiss_index(embedding_dim, embeddings, nlist=100)
faiss.write_index(index, index_file)

Part 2

  • Diversity-controlled retrieval: This part of the code improves the quality of the retrieval results by avoiding passages that are too similar to one another. This is especially important in contexts where variety of information is critical, for example when the retrieved context is used to generate a response. Diversity among the passages ensures that the generated response is more informative and less redundant.
  • Response generation with context: Using a well-chosen context is crucial for the quality of the generated responses. The context is built by concatenating the retrieved passages and providing this concatenation as input to the generative model. Generation parameters, such as temperature, top_k, and repetition_penalty, are used to regulate the creativity, diversity, and consistency of the generated response.

# Enhanced retrieval with diversity checking.
# This function retrieves relevant passages from the corpus given a query.
# 1. The query is converted into an embedding using the `generate_embedding` function.
# 2. The FAISS index is used to find the `top_k` passages most similar to the query.
# 3. To avoid repetition, the output is filtered, ensuring that the retrieved texts are unique.
# Diversity is handled using a set (`seen_texts`) that keeps track of the passages already seen; new passages are added only if they have not already been retrieved.
# This is important to ensure that the context passed to the generative model is diverse and non-repetitive.
# Finally, a single text is returned that concatenates all the retrieved passages.
def retrieve_passages(query, index, corpus, tokenizer, model, top_k=10):
    query_embedding = generate_embedding(query, tokenizer, model).reshape(1, -1)
    distances, indices = index.search(query_embedding, top_k)
    unique_indices = []
    seen_texts = set()
    for idx in indices[0]:
        if corpus[idx].page_content not in seen_texts:
            unique_indices.append(idx)
            seen_texts.add(corpus[idx].page_content)
        if len(unique_indices) >= top_k:
            break
    retrieved_texts = [corpus[idx].page_content for idx in unique_indices]
    # print("Retrieved passages:", retrieved_texts)
    return " ".join(retrieved_texts)

# Improved response generation function with context management.
# This function generates an answer to the given question, using the retrieved texts as context.
# 1. An input string is created that includes the retrieved context and the question.
# 2. The generative model uses this input to produce an answer.
# Important parameters:
# - `max_new_tokens`: Limits the length of the generated answer.
# - `temperature`: Controls the creativity of the model (lower values make the generation more deterministic).
# - `top_k`: Limits the number of possible next tokens considered at each step.
# - `repetition_penalty`: Penalizes repetition to improve the diversity of the generation.
# The function returns the generated answer as text.
def generate_answer(question, retrieved_texts, model, tokenizer, max_new_tokens=500, temperature=1.0, top_k=50, repetition_penalty=1.1):
    input_text = f"\n\nContext: {retrieved_texts}\n\nQuestion: {question}\n\nAnswer:"
    inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).to(model.device)
    attention_mask = inputs['attention_mask']
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=attention_mask,
        max_new_tokens=max_new_tokens,
        do_sample=True,  # sampling must be enabled for temperature and top_k to take effect
        temperature=temperature,
        top_k=top_k,
        repetition_penalty=repetition_penalty,
        pad_token_id=tokenizer.pad_token_id
    )
    # Make sure that the extracted answer is only the final part of the generation
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Split the generated text on the term "Answer:" and keep only the final part
    StrAnswer = decoded_output.split("Answer:")[1].strip() if "Answer:" in decoded_output else decoded_output
    return StrAnswer

# This function coordinates the querying process of the system.
# 1. Retrieves the most relevant passages using `retrieve_passages`.
# 2. Generates an answer using `generate_answer`.
# This is the final integration point where retrieval and generation are combined to answer the given query.
def ask_question(question, index, corpus, tokenizer, model, temperature=1.0, top_k=50):
    retrieved_texts = retrieve_passages(question, index, corpus, tokenizer, model)
    answer = generate_answer(question, retrieved_texts, model, tokenizer, temperature=temperature, top_k=top_k)
    return answer

# System query.
question = input("Enter the prompt: ")
StrAnswer = ask_question(question, index, corpus, tokenizer, model)

# Display only the question and answer
print("Question:", question)
print("Answer:", StrAnswer)

Strengths:

  • Customized Embedding Management and FAISS Index: This program handles the generation of embeddings and the creation of the FAISS index directly. This approach allows for greater control over how embeddings are created and managed, which can be useful for specific customizations.
  • HDF5 Storage: Using the HDF5 format to store the corpus and embeddings is efficient in terms of space and time, particularly useful for large datasets.
  • Duplicate Result Filtering: The retrieval function includes a mechanism to avoid repeating results, enhancing the quality of the generated responses.
  • Debugging: Built-in debugging output (such as the optional print of retrieved passages) makes it easy to monitor which passages are being retrieved and how the system generates responses.

Weaknesses:

  • Complexity: The code is more complex and requires a good understanding of how FAISS and embeddings work to be modified or extended.
  • Process Efficiency: It may require more time for setup and execution, given the need to manually manage the embedding and retrieval pipeline.


SECOND EXAMPLE OF RAG

Part 1

  • Hugging Face Pipeline: In this code, a Hugging Face pipeline is used for text generation. The pipeline helps simplify the text generation process by providing an interface to the model, tokenizer, and configurations such as max_new_tokens, temperature, and top_p.
  • Embeddings: Embeddings are created using a sentence-transformers model, which is optimized for creating dense representations of sentences. These embeddings are essential for performing semantic searches in the corpus.
  • FAISS and VectorStore: FAISS is used to build an index that allows for fast searches across embeddings. In this case, the process of creating a VectorStore that includes FAISS-based retrieval has been automated.
  • PromptTemplate: The template defines the format of the prompt that will be used to generate the response, including both the context and the question.
  • LLMChain: LLMChain can connect the generative model (LLM) with the prompt template, structuring response generation; in the code shown, the pipeline is called directly with the combined prompt, which achieves the same result in fewer steps.
  • Repetition removal: A useful feature to improve the quality of the generated response, avoiding repetitions that may emerge during generation.

# Creating a Hugging Face pipeline for text generation.
# The pipeline is a key component that facilitates the use of the model for various tasks such as text-generation, sentiment-analysis, etc.
# In this case, we configure the pipeline for text generation, setting several parameters to control the response:
# - `max_new_tokens`: Limits the length of the generated response, preventing overly long responses.
# - `temperature`: Controls the "creativity" of the response; lower values produce more deterministic responses.
# - `top_p`: Nucleus sampling, considers only tokens with a cumulative probability up to 90%, reducing the risk of inconsistent responses.
llm_pipeline = pipeline(
    "text-generation",
    model=model,  # Pre-trained model, such as a LLaMA or GPT model
    tokenizer=tokenizer,  # Tokenizer associated with the model
    max_new_tokens=200,  # Limit the maximum number of tokens in the generated response
    temperature=0.7,  # Control creativity (lower values make the model more conservative)
    top_p=0.9  # Nucleus sampling, consider only tokens with a cumulative probability up to 90%
)

# Create an LLM object for use in LangChain chains.
# The `HuggingFacePipeline` object wraps the Hugging Face pipeline, making it compatible with LangChain.
# This allows the pipeline to be easily integrated into more complex workflows.
llm = HuggingFacePipeline(pipeline=llm_pipeline)

# Loading documents using the previously defined function.
# This function loads a corpus of documents and saves them in an HDF5 file for efficient access.
# This step may have already been done previously and the data can be reused.
texts = load_and_save_corpus_to_hdf5_extended(pathDocs, corpus_hdf5)

# Creating or loading embeddings with Hugging Face
# Embeddings are dense vector representations of documents. These numeric vectors capture the semantic content of texts, allowing you to compare and search for similar documents.
# We use a pre-trained embedding model called "sentence-transformers/all-MiniLM-L6-v2".
# This model is optimized to generate compact and fast-to-compute embeddings, ideal for semantic retrieval.
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Try to load saved embeddings, otherwise generate and save them
# Embeddings can be recalculated each time or, as in this case, saved and reloaded to improve performance.
# Here, we try to load embeddings from a `.npy` file; if not available, they are generated from scratch and saved.
try:
    embeddings = np.load(pathEmbeddingNpy)
except FileNotFoundError:
    embeddings = [embedding_model.embed_query(text.page_content) for text in texts]
    np.save(pathEmbeddingNpy, np.array(embeddings))

# Creating or loading a FAISS VectorStore.
# FAISS (Facebook AI Similarity Search) is a library developed by Facebook to perform efficient searches over high-dimensional vectors.
# It is used here to create an index that facilitates fast retrieval of documents based on the similarity of embeddings.
# If the FAISS index already exists (previously saved), it is reloaded; otherwise, a new index is created and saved for future use.
try:
    if os.path.exists(pathFaissIndexBin):
        index = faiss.read_index(pathFaissIndexBin)
        # The saved file contains only the raw FAISS index, so the document store and the
        # position-to-id mapping are rebuilt here from `texts` (assumed to be in the same
        # order as when the index was originally created).
        docstore = InMemoryDocstore({str(i): doc for i, doc in enumerate(texts)})
        index_to_docstore_id = {i: str(i) for i in range(len(texts))}
        vectorstore = FAISS(embedding_function=embedding_model.embed_query, index=index, docstore=docstore, index_to_docstore_id=index_to_docstore_id)
    else:
        raise FileNotFoundError
except (FileNotFoundError, RuntimeError):
    vectorstore = FAISS.from_documents(documents=texts, embedding=embedding_model)
    faiss.write_index(vectorstore.index, pathFaissIndexBin)

# Retriever configuration
# The retriever is responsible for retrieving the most relevant documents from the corpus given a query embedding.
# Using the FAISS index, the retriever can perform fast searches based on similarity between the query embedding and the document embeddings.
retriever = vectorstore.as_retriever()

Part 2

# Function to combine documents manually.
# After retrieving the relevant documents, it is often useful to combine them into a single text.
# This function takes a list of documents and concatenates them into a single block of text, separated by blank lines.
# This combined text can then be used as context to generate a more informed response.
def combine_documents(docs):
    combined_text = "\n\n".join([doc.page_content for doc in docs])
    return combined_text

# Repetition removal function
# When generating text, models may occasionally repeat sentences.
# This function removes such repetitions, improving the readability and consistency of the response.
# It splits the text into sentences, keeps track of which sentences have already been seen, and builds a final result free of near-duplicates.
def remove_repetitions(text):
    # Divide the text into sentences.
    sentences = text.split(". ")
    seen = []
    result = []
    for sentence in sentences:
        clean_sentence = sentence.strip()
        # Compare the current sentence with those already seen.
        if clean_sentence and not any(difflib.SequenceMatcher(None, clean_sentence, s).ratio() > 0.8 for s in seen):
            result.append(clean_sentence)
            seen.append(clean_sentence)
    # Recombine the sentences into a single text.
    return ". ".join(result) + "."

# Function to perform the retrieval and QA process.
# This function handles the entire Retrieval-Augmented Generation (RAG) process:
# 1. Uses the retriever to get the most relevant documents based on the query.
# 2. Combines the retrieved documents into a single context.
# 3. Passes the combined context and the query to the generative model via the text-generation pipeline.
# 4. Removes any repetitions in the generated answer.
# Finally, it returns the final system-generated answer.
def retrieval_qa(question):
    docs = retriever.get_relevant_documents(question)
    combined_text = combine_documents(docs)
    input_text = f"Context: {combined_text}\n\nQuestion: {question}\n\nAnswer:"
    response = llm_pipeline(input_text)[0]['generated_text']
    response = response.split("Answer:")[1].strip() if "Answer:" in response else response
    return remove_repetitions(response)

# Creating a prompt template.
# A well-structured prompt is essential to guide the generative model to produce relevant answers.
# The `PromptTemplate` defines the format of the prompt, including the context (combined texts) and the question to be answered.
# The template helps keep the model focused on the query.
template = """{context}
Request: {question}
Answer: """
prompt = PromptTemplate(input_variables=["context", "question"], template=template)

# System query.
question = input("Enter the prompt: ")
response = retrieval_qa(question)
print(f"Question: {question}")
print(f"Answer: {response}")

Strengths:

  • Use of High-Level Hugging Face APIs: This program utilizes high-level Hugging Face APIs, which simplify the process of creating the pipeline and make the code more readable and easier to maintain.
  • Integrated LLM Pipeline: The creation of the Hugging Face pipeline for text generation is straightforward and well-integrated, reducing the need for complex manual configurations.
  • Modularity and Simplicity: The code is more modular and easier to understand and modify, with well-defined functions for each stage of the process.
  • Automatic Document Combining: The approach to combining documents and generating responses is more streamlined and easily adaptable to new contexts or questions.

Weaknesses:

  • Less Control Over Embeddings: By using predefined Hugging Face embedding functions, there is less fine-grained control over embeddings compared to the first program.
  • Potential Overhead: The use of high-level pipelines might introduce a slight performance overhead compared to the more customized approach of the first program.


If you enjoyed this article, you can read the previous one or continue with the next article!


Ing. Giovanni Masi

www.dhirubhai.net/in/giovanni-masi

Email: [email protected]
