Retrieval Augmented Generation (RAG)
Giovanni MASI
Computer Science Engineer | Artificial Intelligence Researcher | Subject Matter Expert at eCampus University | Advisory Board Member at Kwaai | Artificial Intelligence Group Coordinator at Order of Engineers
RAG: An innovation for optimizing language models
Introduction
Retrieval Augmented Generation (RAG) is an advanced technique in natural language processing (NLP) that combines two powerful approaches: retrieval and generation. A RAG model uses a retrieval system to extract relevant information from a vast knowledge base and then employs a generative model to create responses based on this information. This approach allows large language models (LLMs) to access updated and pertinent data, improving the quality and reliability of the generated responses.
How Retrieval Augmented Generation (RAG) works
The RAG process is divided into two main stages: retrieval, in which the most relevant passages are fetched from an external knowledge base, and generation, in which the language model produces a response conditioned on those retrieved passages.
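To make these two stages concrete before looking at the full implementations, here is a deliberately tiny, self-contained sketch in Python: retrieval is simulated with naive word overlap and generation with a stub function, whereas the examples later in this article use a FAISS index and a Llama model for these steps.
# Toy knowledge base standing in for the real document corpus
knowledge_base = [
    "RAG combines a retriever with a generative language model.",
    "The retriever fetches the passages most relevant to the user's query.",
    "The generator produces an answer conditioned on the retrieved passages.",
]
def retrieve(query, corpus, top_k=2):
    # Stage 1 - Retrieval: rank passages by naive word overlap with the query
    score = lambda passage: len(set(query.lower().split()) & set(passage.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:top_k]
def generate(prompt):
    # Stage 2 - Generation: a real system would call an LLM here; this stub just echoes the prompt
    return "(answer an LLM would generate from) " + prompt
question = "What does the retriever do in RAG?"
context = " ".join(retrieve(question, knowledge_base))
print(generate(f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"))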
The importance of Retrieval Augmented Generation
LLMs, while being fundamental components in the evolution of artificial intelligence (AI), have certain limitations, such as a tendency to generate inaccurate, outdated, or incorrect information. These limitations arise from the static nature of the training data, which is not always up to date. The main issues include hallucinations (plausible-sounding but false statements), knowledge that is frozen at training time, and answers that cannot be traced back to verifiable sources.
Benefits of Retrieval Augmented Generation
Adopting RAG technology offers numerous advantages: responses grounded in up-to-date, domain-specific information; a reduced tendency to hallucinate; the ability to point to the sources used for an answer; and the possibility of updating the knowledge base without retraining or fine-tuning the model.
How Retrieval Augmented Generation works
The RAG process involves several key stages: ingesting and chunking the source documents, generating an embedding for each chunk, indexing the embeddings in a vector store, retrieving the chunks most similar to the user's query, and finally generating an answer with the retrieved chunks injected into the prompt.
Final considerations
Retrieval Augmented Generation represents a significant advancement in the ability of large language models to generate more relevant and reliable responses. By integrating external information and keeping data up-to-date, RAG enhances the effectiveness of LLMs in various application contexts, making generative AI an increasingly valuable resource for organizations.
Challenges of Retrieval Augmented Generation
Despite its many benefits, RAG also presents some challenges: the quality of the answer depends heavily on the quality of retrieval (irrelevant passages degrade the response), indexing and search add latency and infrastructure costs, and the external knowledge base itself must be curated and kept up to date.
Real world applications of Retrieval Augmented Generation
RAG finds application in many areas, such as creative content generation, the analysis and synthesis of data from diverse sources, and the creation of articles enriched with up-to-date information. For example, RAG can generate articles on the latest scientific discoveries by drawing on reliable academic sources or analyze financial reports to extract relevant information.
Code example to understand RAG
This article presents two examples of implementing Retrieval Augmented Generation (RAG) in Python on Google Colab. These examples demonstrate two distinct approaches to building a RAG system: one that is more manual and requires handling all the steps, and another that leverages Hugging Face APIs to significantly simplify the process.
First example: Full implementation of a RAG system
In the first example, we show how to manually implement all the functionalities of a RAG system using pre-trained models, managing each step of the process: loading and chunking the documents, generating embeddings with the LLM itself, building a FAISS index, retrieving the most relevant passages for a query, and generating the final answer.
Second example: Using Hugging Face APIs to simplify RAG implementation
In the second example, a Hugging Face pipeline is used to dramatically simplify the implementation of RAG, automating many of the steps handled manually in the first example: embeddings are produced by a dedicated sentence-transformers model, the FAISS vector store and retriever are created through LangChain, and generation is delegated to the text-generation pipeline.
In summary, while the first example demonstrates how to build a RAG system from scratch by manually managing all the steps, the second example shows how Hugging Face APIs can automate much of this process, making development faster and more accessible without sacrificing the power and flexibility of the system.
Let's start by defining the components common to both programs
Library Installation
!pip install faiss-cpu
!pip install langchain sentence-transformers faiss-cpu
!pip install --upgrade transformers
!pip install -U langchain-community
!pip install unstructured
!pip install cloud-tpu-client
!pip install PyPDF2
!pip install python-docx
Library import, Google Drive access, Hugging Face authentication token definition, and specification of the pre-trained Llama model
# Import the necessary libraries
import torch
import os
import sys
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW
import torch.nn.functional as F
from tqdm import tqdm
import faiss
import numpy as np
import h5py
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain import LLMChain, PromptTemplate
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.schema import Document
from langchain.llms import HuggingFacePipeline
import csv
import PyPDF2
from bs4 import BeautifulSoup
from docx import Document as DocxDocument
from langchain.docstore import InMemoryDocstore
import difflib
from google.colab import drive
# Google Drive mounting
drive.mount('/content/drive', force_remount=True)
# Defining paths for files
path = '/content/drive/My Drive/<your path>/'
pathDocs = path + 'docs'
corpus_file = path + 'corpus.txt'
corpus_hdf5 = path + 'corpus.h5'
embeddings_hdf5 = path + 'embeddings.h5'
index_file = path + 'faiss_index.index'
pathFaissIndexBin = path + 'faiss_index.bin'
pathEmbeddingNpy = path + 'embeddings.npy'
# Setting up the Hugging Face authentication token
huggingface_token = "<your token>"
# Specify the model name
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
Checking TPU and GPU availability
if 'COLAB_TPU_ADDR' in os.environ:
    print("TPU found. Configuring...")
    try:
        import torch_xla
        import torch_xla.core.xla_model as xm
        device = xm.xla_device()
        print("TPU configured successfully. Device:", device)
    except ImportError as e:
        print("Error importing torch_xla:", e)
        print("Switching to GPU or CPU.")
        if torch.cuda.is_available():
            device = torch.device("cuda")
            print("GPU usage. Device:", device)
        else:
            device = torch.device("cpu")
            print("CPU usage. Device:", device)
else:
    print("No TPU found.")
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print("GPU usage. Device:", device)
    else:
        device = torch.device("cpu")
        print("CPU usage. Device:", device)
Load the Tokenizer and model from Hugging Face
# Loading the tokenizer and model:
# This command loads the tokenizer associated with the pre-trained model.
# The tokenizer is responsible for converting text into tokens that can be processed by the model.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=huggingface_token)
# Setting the padding token:
# This line sets the padding token (pad_token) to the end-of-sequence token (eos_token) if the tokenizer does not already have a padding token defined.
# This ensures that all sequences have the same length during the batching process.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# Loading the pre-trained model:
# Finally, the pre-trained language model is loaded from Hugging Face and transferred to the configured device (TPU, GPU, or CPU).
model = AutoModelForCausalLM.from_pretrained(model_name, use_auth_token=huggingface_token).to(device)
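Note that an 8-billion-parameter model loaded in full precision may not fit in the memory of a standard Colab GPU. A common variant, not used in the article's code, is to load the weights in half precision via the `torch_dtype` argument:
# model = AutoModelForCausalLM.from_pretrained(
#     model_name, use_auth_token=huggingface_token, torch_dtype=torch.float16
# ).to(device)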
Model Inference
def generate_response(prompt, max_length=200, temperature=0.7, top_k=50, top_p=0.9):
    # Tokenize the prompt and move the tensors to the device
    inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(device)
    # Generate the response using the model with advanced decoding techniques
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        do_sample=True,  # sampling must be enabled for temperature/top_k/top_p to take effect
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        pad_token_id=tokenizer.pad_token_id,
        no_repeat_ngram_size=2
    )
    # Decode the generated response and return it
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response
prompt = input("Enter the prompt: ")
response = generate_response(prompt)
print("Model response:", response)
HDF5 file generation
This code is designed to handle the reading and processing of various file types (such as .txt, .pdf, .csv, .docx, and .html) and convert them into a unified format, which is then stored in an HDF5 file for future use. Each part of the code is described by the inline comments below.
This approach is ideal for scenarios where data from heterogeneous sources (text files, web pages, databases) must be unified, structured, and saved in an efficient format for further analysis or processing, such as use with machine learning models or semantic search.
# Creating the Document class
class Document:
    def __init__(self, page_content, metadata=None):
        self.page_content = page_content
        self.metadata = metadata if metadata is not None else {}
# Loading functions for various file types
def load_txt(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return Document(page_content=file.read(), metadata={"source": file_path})
def load_pdf(file_path):
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            text += page.extract_text()
        return Document(page_content=text, metadata={"source": file_path})
def load_csv(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        reader = csv.reader(file)
        text = "\n".join([", ".join(row) for row in reader])
        return Document(page_content=text, metadata={"source": file_path})
def load_docx(file_path):
    doc = DocxDocument(file_path)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    return Document(page_content="\n".join(full_text), metadata={"source": file_path})
def load_html(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
        soup = BeautifulSoup(content, 'html.parser')
        return Document(page_content=soup.get_text(separator='\n'), metadata={"source": file_path})
def load_and_save_corpus_to_hdf5_extended(directory_path, hdf5_path, web_urls=None, sql_queries=None, sql_connections=None, chunk_size=500, chunk_overlap=50):
    # Load files from the directory and process them based on their type
    documents = []
    for root, dirs, files in os.walk(directory_path):
        for file in files:
            file_path = os.path.join(root, file)
            if file.endswith('.txt'):
                documents.append(load_txt(file_path))
            elif file.endswith('.pdf'):
                documents.append(load_pdf(file_path))
            elif file.endswith('.csv'):
                documents.append(load_csv(file_path))
            elif file.endswith('.docx'):
                documents.append(load_docx(file_path))
            elif file.endswith('.html'):
                documents.append(load_html(file_path))
    # Add data from web URLs
    # (download_and_load_web_content is not defined in this article; it is only needed if web_urls is provided)
    if web_urls:
        for url in web_urls:
            web_content = download_and_load_web_content(url)
            documents.append(Document(page_content=web_content, metadata={"source": url}))
    # Add data from SQL queries
    # (load_data_from_sql is not defined in this article; it is only needed if sql_queries is provided)
    if sql_queries and sql_connections:
        for query, connection_string in zip(sql_queries, sql_connections):
            sql_content = load_data_from_sql(query, connection_string)
            documents.append(Document(page_content=sql_content, metadata={"source": connection_string}))
    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    texts = text_splitter.split_documents(documents)
    combined_corpus = [text.page_content for text in texts]
    # Encode the corpus in UTF-8 format and save to HDF5
    encoded_corpus = [line.encode('utf-8') for line in combined_corpus]
    with h5py.File(hdf5_path, 'w') as hf:
        hf.create_dataset('corpus', data=np.array(encoded_corpus, dtype='S'))
    return texts
def load_corpus_from_hdf5(hdf5_path):
    with h5py.File(hdf5_path, 'r') as hf:
        data = hf['corpus'][:]
    return [Document(page_content=content.decode('utf-8')) for content in data]
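The web and SQL branches of the loader above are not exercised in the examples that follow, and the helper functions they rely on are not defined in this article. Purely as an illustration of how those parameters could be passed (the URL, query, and connection string are placeholders):
# texts = load_and_save_corpus_to_hdf5_extended(
#     pathDocs, corpus_hdf5,
#     web_urls=["https://example.com/article"],      # would require download_and_load_web_content
#     sql_queries=["SELECT abstract FROM papers"],   # would require load_data_from_sql
#     sql_connections=["<your connection string>"],
# )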
FIRST EXAMPLE OF RAG
Part 1
# Function to generate embeddings from the model.
# This function takes some text, tokenizes it using the provided tokenizer, and then uses the LLaMA model to generate an embedding.
# The embedding is a dense numerical representation of the text, usually in a high-dimensional vector space.
# Embeddings are used to capture the semantic meaning of words or sentences in a format that can be easily manipulated by machine learning algorithms.
# The `outputs.hidden_states[-1].mean(dim=1)` expression takes the last layer of hidden states from the model and averages it across the token dimension (dim=1), resulting in a single vector for each sentence.
def generate_embedding(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    embedding = outputs.hidden_states[-1].mean(dim=1).cpu().numpy()
    return embedding
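# A quick sanity check (illustrative, not part of the original article): the pooled embedding
# should have shape (1, hidden_size), e.g. (1, 4096) for the 8B Llama model used here.
# emb = generate_embedding("Retrieval Augmented Generation combines search and generation.", tokenizer, model)
# print(emb.shape)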
# Creating and saving embeddings in an HDF5 file.
# This function takes a corpus of documents, generates embeddings for each document, and saves them in an HDF5 file.
# HDF5 is a file format that allows large amounts of data to be stored in an efficient and organized format, with quick access to specific parts of the data.
# The `for doc in corpus` loop iterates over each document in the corpus, generates the embedding using the `generate_embedding` function, and adds it to the list of embeddings.
# Finally, all embeddings are saved in the 'embeddings' dataset of the HDF5 file.
def create_and_save_embeddings(corpus, tokenizer, model, hdf5_path):
    with h5py.File(hdf5_path, 'w') as hf:
        embeddings = []
        for doc in corpus:
            embedding = generate_embedding(doc.page_content, tokenizer, model)
            embeddings.append(embedding)
        hf.create_dataset('embeddings', data=np.vstack(embeddings))
# Loading embeddings from the HDF5 file.
# This function loads embeddings from a previously created HDF5 file and returns them as a numpy array.
# This is useful for restoring embeddings without having to regenerate them each time.
def load_embeddings_from_hdf5(hdf5_path):
    with h5py.File(hdf5_path, 'r') as hf:
        embeddings = hf['embeddings'][:]
    return embeddings
# Creating FAISS index with modified parameters
# This function creates a FAISS index for efficient retrieval of embeddings.
# FAISS (Facebook AI Similarity Search) is a library developed by Facebook AI Research that allows you to quickly search for similar vectors in large, high-dimensional datasets.
# It is often used to speed up the retrieval process in RAG systems.
# - `nlist` is the number of clusters into which the embeddings are split. This value can affect the speed and accuracy of the retrieval.
# - `nprobe` is the number of clusters to explore during the search. The higher the `nprobe`, the more accurate the search will be, but at the expense of performance.
# The function first creates a quantizer (using `IndexFlatL2`, which uses the Euclidean distance) and then creates an index with `IndexIVFFlat`, which is a type of FAISS index that divides the vector space into clusters to speed up the search.
# The index is trained on the embeddings and then all embeddings are added to it.
def create_faiss_index(embedding_dim, embeddings, nlist=100, nprobe=10):
    nlist = min(nlist, len(embeddings))
    quantizer = faiss.IndexFlatL2(embedding_dim)
    index = faiss.IndexIVFFlat(quantizer, embedding_dim, nlist, faiss.METRIC_L2)
    index.train(embeddings)
    index.add(embeddings)
    index.nprobe = nprobe
    return index
# Load corpus and save to HDF5
corpus = load_and_save_corpus_to_hdf5_extended(pathDocs, corpus_hdf5)
# Generate embeddings and save to HDF5
create_and_save_embeddings(corpus, tokenizer, model, embeddings_hdf5)
# Load embeddings and create FAISS index
embeddings = load_embeddings_from_hdf5(embeddings_hdf5)
embedding_dim = embeddings.shape[1]
index = create_faiss_index(embedding_dim, embeddings, nlist=100)
faiss.write_index(index, index_file)
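On subsequent runs, the corpus and the FAISS index do not need to be rebuilt; they can be reloaded from the files saved above. A minimal sketch (assuming those files already exist on Drive):
# corpus = load_corpus_from_hdf5(corpus_hdf5)
# index = faiss.read_index(index_file)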
Part 2
# Enhanced retrieval with diversity checking.
# This function retrieves relevant passages from the corpus given a query.
# 1. The query is converted into an embedding using the `generate_embedding` function.
# 2. The FAISS index is used to find the `top_k` passages most similar to the query.
# 3. To avoid repetition, the output is filtered to ensure that the retrieved texts are unique.
# Diversity is handled using a set (`seen_texts`) that keeps track of the passages already seen; new passages are added only if they have not already been retrieved.
# This is important to ensure that the context passed to the generative model is diverse and non-repetitive.
# Finally, a single text is returned that concatenates all the retrieved passages.
def retrieve_passages(query, index, corpus, tokenizer, model, top_k=10):
    query_embedding = generate_embedding(query, tokenizer, model).reshape(1, -1)
    distances, indices = index.search(query_embedding, top_k)
    unique_indices = []
    seen_texts = set()
    for idx in indices[0]:
        if corpus[idx].page_content not in seen_texts:
            unique_indices.append(idx)
            seen_texts.add(corpus[idx].page_content)
        if len(unique_indices) >= top_k:
            break
    retrieved_texts = [corpus[idx].page_content for idx in unique_indices]
    # print("Retrieved passages:", retrieved_texts)
    return " ".join(retrieved_texts)
# Improved response generation function with context management.
# This function generates an answer to the given question, using the retrieved texts as context.
# 1. An input string is created that includes the retrieved context and the question.
# 2. The generative model uses this input to produce an answer.
# Important parameters:
# - `max_new_tokens`: Limits the length of the generated answer.
# - `temperature`: Controls the creativity of the model (lower values make the generation more deterministic).
# - `top_k`: Limits the number of possible next tokens considered at each step.
# - `repetition_penalty`: Penalizes repetition to improve the diversity of the generation.
# The function returns the generated answer as text.
def generate_answer(question, retrieved_texts, model, tokenizer, max_new_tokens=500, temperature=1.0, top_k=50, repetition_penalty=1.1):
    input_text = f"\n\nContext: {retrieved_texts}\n\nQuestion: {question}\n\nAnswer:"
    inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).to(model.device)
    attention_mask = inputs['attention_mask']
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=attention_mask,
        max_new_tokens=max_new_tokens,
        do_sample=True,  # sampling must be enabled for temperature/top_k to take effect
        temperature=temperature,
        top_k=top_k,
        repetition_penalty=repetition_penalty,
        pad_token_id=tokenizer.pad_token_id
    )
    # Make sure that the extracted answer is only the final part of the generation
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Split the generated text on the term "Answer:" and take only the final part
    StrAnswer = decoded_output.split("Answer:")[1].strip() if "Answer:" in decoded_output else decoded_output
    return StrAnswer
# This function coordinates the querying process of the system.
# 1. Retrieves the most relevant passages using `retrieve_passages`.
# 2. Generates an answer using `generate_answer`.
# This is the final integration point where retrieval and generation are combined to answer the given query.
def ask_question(question, index, corpus, tokenizer, model, temperature=1.0, top_k=50):
    retrieved_texts = retrieve_passages(question, index, corpus, tokenizer, model)
    answer = generate_answer(question, retrieved_texts, model, tokenizer, temperature=temperature, top_k=top_k)
    return answer
# System query.
question = input("Enter the prompt: ")
StrAnswer = ask_question(question, index, corpus, tokenizer, model)
# View only the question and answer
print("Question:", question)
print("Answer:", StrAnswer)
Strengths: complete control over every stage of the pipeline (document loading, chunking, embedding generation, FAISS index parameters, prompt construction, and decoding), which makes the system transparent and easy to customize or debug.
Weaknesses: considerably more code to write and maintain; embeddings obtained by mean-pooling the hidden states of a causal LLM are not optimized for semantic search, and running the 8B model for every embedding is slow and memory-intensive.
SECOND EXAMPLE OF RAG
Part 1
# Creating a Hugging Face pipeline for text generation.
# The pipeline is a key component that facilitates the use of the model for various tasks such as text-generation, sentiment-analysis, etc.
# In this case, we configure the pipeline for text generation, setting several parameters to control the response:
# - `max_new_tokens`: Limits the length of the generated response, preventing responses that are too long.
# - `temperature`: Controls the "creativity" of the response; lower values produce more deterministic responses.
# - `top_p`: Nucleus sampling; considers only tokens with a cumulative probability up to 90%, reducing the risk of inconsistent responses.
llm_pipeline = pipeline(
    "text-generation",
    model=model,  # Pre-trained model, such as a LLaMA or GPT model
    tokenizer=tokenizer,  # Tokenizer associated with the model
    max_new_tokens=200,  # Limit the maximum number of tokens in the generated response
    temperature=0.7,  # Controls creativity (lower values make the model more conservative)
    top_p=0.9  # Nucleus sampling, consider only tokens with a cumulative probability up to 90%
)
# Create an LLM object for use in LangChain chains.
# The `HuggingFacePipeline` object wraps the Hugging Face pipeline, making it compatible with LangChain.
# This allows the pipeline to be easily integrated into more complex workflows.
llm = HuggingFacePipeline(pipeline=llm_pipeline)
# Loading documents using the previously defined function.
# This function loads a corpus of documents and saves them in an HDF5 file for efficient access.
# This step may have already been done previously and the data can be reused.
texts = load_and_save_corpus_to_hdf5_extended(pathDocs, corpus_hdf5)
# Creating or loading embeddings with Hugging Face
# Embeddings are dense vector representations of documents. These numeric vectors capture the semantic content of texts, allowing you to compare and search for similar documents.
# We use a pre-trained embedding model called "sentence-transformers/all-MiniLM-L6-v2".
# This model is optimized to generate compact and fast-to-compute embeddings, ideal for semantic retrieval.
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Try to load saved embeddings, otherwise generate and save them
# Embeddings can be recalculated each time or, as in this case, saved and reloaded to improve performance.
# Here, we try to load embeddings from a `.npy` file; if not available, they are generated from scratch and saved.
try:
    embeddings = np.load(pathEmbeddingNpy)
except FileNotFoundError:
    embeddings = [embedding_model.embed_query(text.page_content) for text in texts]
    np.save(pathEmbeddingNpy, np.array(embeddings))
# Creating or loading a FAISS VectorStore.
# FAISS (Facebook AI Similarity Search) is a library developed by Facebook to perform efficient searches over high-dimensional vectors.
# It is used here to create an index that enables fast retrieval of documents based on the similarity of embeddings.
# If the FAISS index already exists (previously saved), it is reloaded; otherwise, a new index is created and saved for future use.
try:
    if os.path.exists(pathFaissIndexBin):
        index = faiss.read_index(pathFaissIndexBin)
        # Note: reconstructing the vectorstore from a raw index also requires the docstore and the
        # index_to_docstore_id mapping from a previous run; if they are not available, the NameError
        # below is caught and the vectorstore is rebuilt from the documents.
        vectorstore = FAISS(embedding_function=embedding_model.embed_query, index=index, docstore=docstore, index_to_docstore_id=index_to_docstore_id)
    else:
        raise FileNotFoundError
except (FileNotFoundError, RuntimeError, NameError):
    vectorstore = FAISS.from_documents(documents=texts, embedding=embedding_model)
    faiss.write_index(vectorstore.index, pathFaissIndexBin)
# Retriever configuration
# The retriever is responsible for retrieving the most relevant documents from the corpus given a query.
# Using the FAISS index, the retriever can perform fast searches based on the similarity between the query embedding and the document embeddings.
retriever = vectorstore.as_retriever()
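Before wiring the retriever into the full question-answering function, it can be sanity-checked on its own. A minimal sketch (the query string is just an illustrative example):
sample_query = "What is Retrieval Augmented Generation?"  # illustrative query
for i, doc in enumerate(retriever.get_relevant_documents(sample_query)):
    print(f"[{i}] {doc.page_content[:100]}")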
Part 2
# Function to combine documents manually.
# After retrieving the relevant documents, it is often useful to combine them into a single text.
# This function takes a list of documents and concatenates them into a single block of text, separated by blank lines.
# This combined text can then be used as context to generate a more informed response.
def combine_documents(docs):
    combined_text = "\n\n".join([doc.page_content for doc in docs])
    return combined_text
# Repetition removal function
# When generating text, models may occasionally repeat sentences.
# This function removes such repetitions, improving the readability and consistency of the response.
# It splits the text into sentences, keeps track of which sentences have already been seen (using fuzzy matching), and builds a final result free of near-duplicates.
def remove_repetitions(text):
    # Split the text into sentences.
    sentences = text.split(". ")
    seen = []
    result = []
    for sentence in sentences:
        clean_sentence = sentence.strip()
        # Compare the current sentence with those already seen.
        if clean_sentence and not any(difflib.SequenceMatcher(None, clean_sentence, s).ratio() > 0.8 for s in seen):
            result.append(clean_sentence)
            seen.append(clean_sentence)
    # Recombine the sentences into a single text.
    return ". ".join(result) + "."
# Function to perform the retrieval and QA process.
# This function handles the entire Retrieval-Augmented Generation (RAG) process:
# 1. Uses the retriever to get the most relevant documents based on the query.
# 2. Combines the retrieved documents into a single context.
# 3. Passes the combined context and the query to the generative model via the text-generation pipeline.
# 4. Removes any repetitions in the generated answer.
# Finally, it returns the final system-generated answer.
def retrieval_qa(question):
    docs = retriever.get_relevant_documents(question)
    combined_text = combine_documents(docs)
    input_text = f"Context: {combined_text}\n\nQuestion: {question}\n\nAnswer:"
    response = llm_pipeline(input_text)[0]['generated_text']
    response = response.split("Answer:")[1].strip() if "Answer:" in response else response
    return remove_repetitions(response)
# Creating a prompt template.
# A well-structured prompt is essential to guide the generative model toward producing relevant answers.
# The `PromptTemplate` defines the format of the prompt, including the context (combined texts) and the question to be answered.
# Note that `retrieval_qa` above builds its prompt manually; this template is a LangChain-native way of expressing the same structure.
template = """{context}
Request: {question}
Answer: """
prompt = PromptTemplate(input_variables=["context", "question"], template=template)
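# A minimal sketch (not part of the original flow) of how this template could be wired into a
# LangChain LLMChain instead of formatting the prompt manually inside `retrieval_qa`:
# qa_chain = LLMChain(llm=llm, prompt=prompt)
# answer = qa_chain.run(context=combine_documents(retriever.get_relevant_documents(question)), question=question)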
# System query.
question = input("Enter the prompt: ")
response = retrieval_qa(question)
print(f"Question: {question}")
print(f"Answer: {response}")
Strengths: far less code to write, since LangChain and the Hugging Face pipeline take care of embedding, indexing, and retrieval, and the dedicated sentence-transformers model produces compact embeddings that are well suited to semantic search.
Weaknesses: less fine-grained control over the individual steps, and the behavior depends on the abstractions (and version changes) of the underlying libraries.
Ing. Giovanni Masi
Email: [email protected]