Building an AI-Powered Research Assistant with LangChain: A Step-by-Step Guide
Bishwa kiran Poudel
Former Vice President at CSIT Association of Nepal Purwanchal
In today's age, research can be overwhelming due to the sheer volume of information available. Wouldn't it be great to have an AI research assistant that can browse the web, retrieve relevant articles, summarize key points, and even generate well-structured reports? With LangChain, let's build just that.
This tutorial guides you through building an AI research assistant using LangChain, enabling automated web search, document processing, and intelligent summarization.
Architecture Overview
Here's what our core components look like:
- Web search with the DuckDuckGo API to find relevant articles
- Document loading with WebBaseLoader to scrape their text
- Text splitting with RecursiveCharacterTextSplitter to fit LLM token limits
- A Chroma vector store with Gemini embeddings for semantic search
- A retriever that pulls the most relevant chunks for a query
- Gemini-powered summarization into a structured report, saved as Markdown
Seems easy enough, right? Let's dig in.
Prerequisites
Before we start, you'll need the following packages:
pip install langchain langchain_community langchain_text_splitters bs4 chromadb duckduckgo-search langchain-google-genai
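The Gemini components below also need a Google AI API key. A minimal setup sketch, assuming you have generated a key in Google AI Studio (langchain-google-genai reads it from the GOOGLE_API_KEY environment variable):

import os

# Assumed placeholder: paste your own key from Google AI Studio.
os.environ["GOOGLE_API_KEY"] = "your-api-key-here"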
1. Internet Search & Data Collection
To begin, we need relevant research articles on our chosen topic, so we use the DuckDuckGo search API to fetch them. We limit the results to 5 to keep things simple.
from langchain_community.tools import DuckDuckGoSearchResults
from langchain_community.utilities.duckduckgo_search import DuckDuckGoSearchAPIWrapper
import os
import re

def get_links(keyword):
    # Fetch up to 5 search results for the keyword
    wrapper = DuckDuckGoSearchAPIWrapper(max_results=5)
    search = DuckDuckGoSearchResults(api_wrapper=wrapper)
    # The tool returns one formatted string; extract the article URLs from it
    results = search.run(tool_input=keyword)
    return re.findall(r'link:\s*(https?://[^\],\s]+)', results)
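A quick sanity check (the URL below is a placeholder; live results will vary):

links = get_links("AI in Nepal")
print(links)   # e.g. ['https://example.com/article-1', ...]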
2. Document Loading
After obtaining the links, the next step is to extract useful content. Using WebBaseLoader (which internally employs BeautifulSoup), we scrape the text of each article.
from langchain_community.document_loaders import WebBaseLoader

# Load the text content of every article returned by the search
document_loader = WebBaseLoader(web_path=get_links("AI in Nepal"))
docs = document_loader.load()
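Each loaded Document carries the page text plus metadata such as the source URL, which you can inspect before moving on:

print(len(docs))                    # one Document per successfully loaded URL
print(docs[0].metadata["source"])   # the URL the text came from
print(docs[0].page_content[:200])   # first 200 characters of scraped text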
3. Splitting Documents into Chunks
Research papers and articles are lengthy, and LLMs have token constraints. To handle this efficiently, we divide the extracted text into small overlapping fragments with RecursiveCharacterTextSplitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = splitter.split_documents(docs)
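The 200-character overlap keeps neighboring chunks from losing context at their boundaries. A quick check of the result:

print(f"Split {len(docs)} documents into {len(splits)} chunks")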
4. Storing Data in a Vector Database
To make our research efficient, we convert these text chunks into embeddings and store them in a Chroma vector database. This allows us to perform rapid and intelligent retrieval with semantic search.
from langchain_community.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings
embedding_model = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vector_store = Chroma.from_documents(documents=splits, embedding=embedding_model)
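By default this store lives only in memory, so every run re-embeds everything. If you want to reuse the index across runs, Chroma can persist it to disk; a sketch, with ./chroma_db as an assumed location:

# Variant that writes the index to disk so later runs can reload it
vector_store = Chroma.from_documents(
    documents=splits,
    embedding=embedding_model,
    persist_directory="./chroma_db",
)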
5. Intelligent Retrieval
When a user asks for a summary, we do not retrieve all the stored data. Instead, we use similarity search to retrieve the most relevant chunks based on the query.
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 5})
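You can test retrieval on its own before wiring it to the LLM; the query here is just an example:

for doc in retriever.invoke("AI adoption in Nepal"):
    print(doc.metadata.get("source"), "->", doc.page_content[:100])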
6. Summarization & Report Generation
Finally, the wizardry of LLMs. We feed the retrieved content into a structured prompt and use Google's Gemini model to generate a clean, well-organized research summary.
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import PromptTemplate
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0.7)
summary_template = """
List the key findings from the following research articles:

Context: {context}

Ensure it is well-structured, concise, and includes key takeaways.
"""

prompt = PromptTemplate.from_template(template=summary_template)

def summarize_research(topic):
    # Retrieve the most relevant chunks for the topic, then summarize them
    retrieved_docs = retriever.invoke(topic)
    context = "\n".join(doc.page_content for doc in retrieved_docs)
    response = llm.invoke(prompt.format(context=context))
    return response.content

response = summarize_research("Nepal AI Boom")
print(response)
7. Saving the Research Report
Finally, the report generated by the AI is saved as a Markdown file for sharing, editing, or future use.
def save_file(content, filename):
    # Write the report into a dedicated folder, creating it if needed
    directory = "research_summaries"
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, filename), 'w') as f:
        f.write(content)
    print(f"✅ File saved as {filename}")

save_file(content=response, filename="Research_Report.md")
Running the Research Assistant
All that's left is to run:
python main.py
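If you've collected the snippets above into a single main.py, the entry point can be as small as this sketch (the topic string is just an example):

def main():
    topic = "Nepal AI Boom"
    summary = summarize_research(topic)
    save_file(content=summary, filename="Research_Report.md")

if __name__ == "__main__":
    main()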
With LangChain, we have built an AI-powered research assistant that can:
- Search the web for relevant articles
- Load and split their content into manageable chunks
- Store embeddings in a vector database for semantic retrieval
- Summarize the most relevant findings with Gemini
- Save the result as a Markdown report
This can be further extended by incorporating a chat interface, enabling real-time interactions for research questions.
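As a starting point for that extension, here is a minimal command-line loop that reuses summarize_research for each question; a sketch, not a full chat interface:

# Simple REPL over the stored research chunks
while True:
    question = input("Ask a research question (or type 'quit'): ")
    if question.strip().lower() == "quit":
        break
    print(summarize_research(question))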
What’s next?
And for those of you running on a busy schedule, here's the Notebook.