RAG Pipeline with Deepseek-R1
Sourabh Solkar
Introduction
Before we dive into building the RAG (Retrieval-Augmented Generation) pipeline, let me set the context.
Imagine we have some non-public data, such as an internal research paper, HR policy document, or a confidential contract. Now, we want employees within the company to be able to ask questions related to that document, and the LLM should provide answers strictly based on the context of that document, not generic answers that traditional language models typically generate.
To solve this problem, we are going to build a RAG pipeline that will fetch the most relevant data from the document and pass it to the LLM for more accurate and context-aware responses.
Prerequisite
Python 3 with the following packages installed: langchain, langchain-huggingface, sentence-transformers, scikit-learn, numpy, and requests. You'll also need Ollama to run Deepseek-R1 locally (we set it up in Step 9).
Step 1: Collect Text Data
Gather any non-public text data, such as a dummy HR policy, research paper, or a sample contract. Alternatively, you can quickly create a personal text document (avoiding any sensitive information).
text = """
put_dummy_data_here
"""
Step 2: Split Data into Chunks
Use a text-splitting library to break the text into smaller chunks. Here, we use LangChain's RecursiveCharacterTextSplitter:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=2)
chunks = splitter.split_text(text)
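As a quick sanity check (optional, not part of the original steps), you can inspect how many chunks the splitter produced and peek at the first few:
# Optional: inspect the chunking result
print(f"Number of chunks: {len(chunks)}")
print(chunks[:3])  # the first few chunks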
Step 3: Convert Chunks into Vectors (Embedding Process)
In this step, we transform each text chunk into a numeric vector, known as an embedding. Embeddings capture the semantic meaning of the text, which is what lets us compare the chunks against a query later.
from langchain_huggingface import HuggingFaceEmbeddings
# Generate embeddings
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
embeddings = embedding_model.embed_documents(chunks)
print(embeddings)
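The result is a list containing one vector per chunk. As an optional check, you can confirm the counts line up; all-MiniLM-L6-v2 produces 384-dimensional vectors:
# Optional: one embedding per chunk, each 384-dimensional for all-MiniLM-L6-v2
print(len(embeddings))     # should equal len(chunks)
print(len(embeddings[0]))  # 384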
Step 4: Embed the Query
Now, embed the query to convert it into a numeric vector, similar to the text chunks.
# Embed the query
query = "What is the timeline for review?"
query_embedding = embedding_model.embed_query(query)
print(query_embedding)
Step 5: Reshape the Query Embedding
To ensure compatibility for similarity search, reshape the query embedding into a 2D array.
import numpy as np
# Reshape query embedding to 2D array
query_embedding = np.array(query_embedding).reshape(1, -1)
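The reshape turns the single query vector into a matrix with one row, which is the 2D input that scikit-learn's cosine_similarity expects. A quick optional check:
# Optional: confirm the query is now a (1, n_dimensions) array
print(query_embedding.shape)  # (1, 384) when using all-MiniLM-L6-v2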
Step 6: Compute Similarity Scores
from sklearn.metrics.pairwise import cosine_similarity
# Compute similarity between query and all document embeddings
similarity_scores = cosine_similarity(query_embedding, embeddings)
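cosine_similarity returns a matrix of pairwise scores, with one row for the query and one column per chunk, which is why the next step indexes similarity_scores[0]. An optional peek at the result:
# Optional: similarity_scores has shape (1, number_of_chunks)
print(similarity_scores.shape)
print(similarity_scores[0][:5])  # scores for the first five chunks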
Step 7: Retrieve Top Matching Chunks
top_k_indices = similarity_scores[0].argsort()[::-1]
# Print the top matching chunks and build the context
top_k = 3
context = ""
for i in top_k_indices[:top_k]:
    print(f"Score: {similarity_scores[0][i]:.4f} -> {chunks[i]}")
    # Append the top chunks to build the context
    context += chunks[i] + " "
print(context)
Step 8: Build the Final Prompt for the LLM
Combine the retrieved context and the query to create the final prompt.
final_prompt = f"Context: {context} \n\nQuery: {query} \n\nAnswer:"
print(final_prompt)
This prompt will be fed into the language model (LLM) to generate a more accurate and context-aware response. In our case, we chose Deepseek-R1 because it's free :-)
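Optionally, you can make the grounding requirement explicit in the prompt. The sketch below is a variation on the prompt above (not part of the original steps); it asks the model to refuse when the context doesn't contain the answer:
# Optional stricter prompt: instructs the model to answer only from the retrieved context
strict_prompt = (
    "Answer the question using ONLY the context below. "
    "If the answer is not in the context, say you don't know.\n\n"
    f"Context: {context}\n\nQuery: {query}\n\nAnswer:"
)
print(strict_prompt)
If you want tighter grounding, pass strict_prompt instead of final_prompt in Step 10.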
Step 9: Set Up Deepseek-R1 Model
To run Deepseek-R1 locally, we'll use Ollama, a tool that lets you run LLMs on your own machine without relying on cloud-based APIs.
1️⃣ Install Ollama
Download and install Ollama from https://ollama.com (installers are available for macOS, Windows, and Linux).
2️⃣ Pull the Deepseek-R1 Model
Run the following command to download and start the Deepseek-R1 (1.5B) model (the first run pulls the weights automatically):
ollama run deepseek-r1:1.5b
Deepseek-R1 1.5B is lightweight and ideal for local setups. Heavier distilled variants such as 7B or 70B require more GPU power and RAM.
3️⃣ Run the Ollama Server
After pulling the model, start the Ollama server:
ollama serve
By default, the Ollama server listens at: http://localhost:11434
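Before wiring the pipeline to Ollama, it helps to confirm the server is reachable from Python. This is an optional check; the root endpoint simply reports that Ollama is running:
import requests

# Optional: confirm the Ollama server is reachable
resp = requests.get("http://localhost:11434")
print(resp.status_code, resp.text)  # expect 200 and "Ollama is running"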
Step 10: Generate the Final Answer Using Deepseek-R1
Now that we have the query and relevant context, it's time to generate the final answer by sending the prompt to Deepseek-R1 via Ollama API.
import requests
import json

# Send the final prompt to the local Deepseek-R1 model via the Ollama chat API
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-r1:1.5b",
        "messages": [
            {"role": "user", "content": final_prompt}
        ]
    },
    stream=True,  # the chat endpoint streams the answer as newline-delimited JSON
)

# Handle the streaming response
for chunk in response.iter_lines():
    if chunk:
        data = json.loads(chunk.decode('utf-8'))
        message = data.get('message', {}).get('content', '')
        if message:
            print(message, end='', flush=True)
1️⃣ We send a POST request to the local Ollama endpoint (localhost:11434), which serves the Deepseek-R1 model.
2️⃣ The final prompt (context + query) is passed in the messages body.
3️⃣ We handle the streaming response, which allows us to read the output in real time.
4️⃣ Finally, the response is printed as the LLM generates the text.
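If you'd rather receive the whole answer in a single response instead of a stream, the chat endpoint also accepts "stream": false and returns one JSON object. A minimal alternative sketch:
import requests

# Alternative: disable streaming and read the full answer at once
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-r1:1.5b",
        "messages": [{"role": "user", "content": final_prompt}],
        "stream": False,  # return one JSON object instead of newline-delimited chunks
    },
)
print(response.json()["message"]["content"])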
Congrats! Your RAG Pipeline with Deepseek-R1 is Complete 🎉
Github : https://github.com/jhm164/RAG