Build a Local LLM-Powered Q&A Assistant with Python, Ollama & Streamlit — No GPU Required! [Hands-on Learning with Python, LLMs, & Streamlit]
Shanoj Kumar V
VP - Senior Technology Architecture Manager @ Citi | LLMs, AI Agents & RAG | Cloud & Big Data | Author
TL;DR
Local Large Language Models (LLMs) have made it possible to build powerful AI apps on everyday hardware — no expensive GPU or cloud API needed. In this Day 1 tutorial, we’ll walk through creating a Q&A chatbot powered by a local LLM running on your CPU, using Ollama for model management and Streamlit for a friendly UI. Along the way, we emphasize good software practices: a clean project structure, robust fallback strategies, and conversation context handling. By the end, you’ll have a working AI assistant on your machine and hands-on experience with Python, LLM integration, and modern development best practices. Get ready for a practical, question-driven journey into the world of local LLMs!
Introduction: The Power of Local LLMs
Have you ever wanted to build your own AI assistant like ChatGPT without relying on cloud services or high-end hardware? The recent emergence of optimized, open-source LLMs has made this possible even on standard laptops. By running these models locally, you gain complete privacy, eliminate usage costs, and get a deeper understanding of how LLMs function under the hood.
In this Day 1 project of our learning journey, we’ll build a Q&A application powered by locally running LLMs through Ollama. This project teaches not just how to integrate with these models, but also how to structure a professional Python application, design effective prompts, and create an intuitive user interface.
What sets our approach apart is a focus on question-driven development — we’ll learn by doing. At each step, we’ll pose real development questions and challenges (e.g., “How do we handle model failures?”) and solve them hands-on. This way, you’ll build a genuine understanding of LLM application development rather than just following instructions.
Learning Note: What is an LLM? A large language model (LLM) is a type of machine learning model designed for natural language processing tasks like understanding and generating text. Recent open-source LLMs (e.g. Meta’s LLaMA) can run on everyday computers, enabling personal AI apps.
Project Overview: A Local LLM Q&A Assistant
The Concept
We’re building a chat Q&A application that connects to Ollama (a tool for running LLMs locally), formats user questions into effective prompts, and maintains conversation context for follow-ups. The app will provide a clean web interface via Streamlit and include fallback mechanisms for when the primary model isn’t available. In short, it’s like creating your own local ChatGPT that you fully control.
Key Learning Objectives
- Running open-source LLMs locally on a CPU through Ollama
- Structuring a Python project with a clean separation of concerns
- Designing reusable prompt templates for different kinds of questions
- Building an interactive chat interface with Streamlit
- Adding fallback strategies and conversation context handling so the app stays usable
Learning Note: What is Ollama? Ollama is an open-source tool that lets you download and run popular LLMs on your local machine through a simple API. We’ll use it to manage our models so we can generate answers without any cloud services.
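To make this concrete, here is a minimal sketch, separate from the project code, of what talking to Ollama from Python looks like. It assumes Ollama is serving on its default port 11434 and that you have already pulled a model named llama3; substitute whatever model you have installed:

import requests

# Send a single prompt to a locally running Ollama server.
# Assumes `ollama serve` is running and the "llama3" model has been pulled.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
    timeout=120,
)
response.raise_for_status()
print(response.json()["response"])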
Project Structure
We’ve organized our project with the following structure to ensure clarity and easy maintenance:
GitHub Repository: day-01-local-qa-app
day-01-local-qa-app/
│── docs/                        # Documentation and learning materials
│   │── images/                  # Diagrams and screenshots
│   │── README.md                # Learning documentation
│
│── src/                         # Source code
│   │── app.py                   # Main Streamlit application
│   │── config/                  # Configuration settings
│   │   │── settings.py          # Application settings
│   │── models/                  # LLM integration
│   │   │── llm_loader.py        # Model loading and integration
│   │   │── prompt_templates.py  # Prompt engineering templates
│   │── utils/                   # Utility functions
│   │   │── helpers.py           # Helper functions
│   │   │── logger.py            # Logging setup
│
│── tests/                       # Test files
│── README.md                    # Project documentation
│── requirements.txt             # Python dependencies
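One file the article doesn’t reproduce is the settings module that the rest of the code imports. Based purely on the setting names used later in this article (OLLAMA_HOST, DEFAULT_TEMPERATURE, and so on), a minimal sketch of src/config/settings.py might look like the following; the model names and default values are illustrative placeholders, not the repository’s actual contents:

"""
Application settings (illustrative sketch; adjust values to your setup).
"""
# Ollama server and default models
OLLAMA_HOST = "http://localhost:11434"
DEFAULT_OLLAMA_MODEL = "llama3"           # placeholder: any model you have pulled
DEFAULT_HF_MODEL = "google/flan-t5-base"  # placeholder CPU-friendly fallback

# Generation defaults
DEFAULT_TEMPERATURE = 0.7
DEFAULT_MAX_LENGTH = 512

# Model lists shown in the UI when the APIs can't be queried
AVAILABLE_OLLAMA_MODELS = ["llama3", "mistral"]
AVAILABLE_HF_MODELS = ["google/flan-t5-base", "google/flan-t5-small"]

# Streamlit page settings
APP_TITLE = "LocalLLM Q&A Assistant"
APP_ICON = "🤖"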
The Architecture: How It All Fits Together
Our application follows a layered architecture with a clean separation of concerns:
Let’s explore each component in this architecture:
1. User Interface Layer (Streamlit)
The Streamlit framework provides our web interface, handling chat display and input, sidebar controls for backend and model selection, sliders for generation parameters, and status information such as Ollama connectivity and the last generation time.
Learning Note: What is Streamlit? Streamlit (streamlit.io) is an open-source Python framework for building interactive web apps quickly. It lets us create a chat interface in just a few lines of code, perfect for prototyping our AI assistant.
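If you haven’t used Streamlit’s chat components before, here is a tiny standalone sketch, independent of our app, that shows how little code a chat loop needs; the echo response is just a stand-in for a real LLM call:

import streamlit as st

st.title("Minimal chat demo")

# Keep the conversation across reruns
if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the history on every rerun
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

# Handle a new question (here we simply echo it back)
if question := st.chat_input("Say something..."):
    st.session_state.messages.append({"role": "user", "content": question})
    answer = f"You said: {question}"  # stand-in for a real LLM call
    st.session_state.messages.append({"role": "assistant", "content": answer})
    st.rerun()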
2. Application Logic Layer
The core application logic manages conversation state across turns, routes each question through the appropriate prompt template, measures generation time, and handles errors so a failed request never crashes the app.
3. Model Integration Layer
This layer handles all LLM interactions: checking whether Ollama is reachable, sending prompts to its API, listing the models each backend offers, and falling back to a Hugging Face model when the primary backend is unavailable.
Learning Note: Hugging Face Models as Fallback — Hugging Face hosts many pre-trained models that can run locally. In our app, if Ollama’s model fails, we can query a smaller model from Hugging Face’s library to ensure the assistant still responds. This way, the app remains usable even if the primary model isn’t running.
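As a quick illustration (a sketch, not the project’s actual fallback code), loading a small Hugging Face model on the CPU with the transformers pipeline looks roughly like this; google/flan-t5-small is just one example of a CPU-friendly model:

from transformers import pipeline

# Load a small text2text model on CPU (device=-1); the weights download on first run
fallback = pipeline("text2text-generation", model="google/flan-t5-small", device=-1)

result = fallback("Answer the question: What is the capital of France?", max_length=64)
print(result[0]["generated_text"])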
4. Utility Layer
Supporting functions and configuration underpin the layers above: centralized settings (settings.py), application logging (logger.py), and helpers for timing generation and checking Ollama’s status (helpers.py).
By separating these layers, we make the app easier to understand and modify. For instance, you could swap out the UI (Layer 1) or the LLM engine (Layer 3) without heavily affecting other parts of the system.
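The utility module is imported by the model and application code shown later in this article but isn’t reproduced there. A minimal sketch of the three helpers that code expects (check_ollama_status, time_function, format_time) could look like this, assuming Ollama’s /api/tags endpoint serves as the health check:

"""
Utility helpers (illustrative sketch of src/utils/helpers.py).
"""
import time
import functools
import requests

from utils.logger import logger


def check_ollama_status(host):
    """Return True if the Ollama server answers on its /api/tags endpoint."""
    try:
        response = requests.get(f"{host}/api/tags", timeout=2)
        return response.status_code == 200
    except requests.RequestException:
        return False


def time_function(func):
    """Decorator that logs how long the wrapped function took to run."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        logger.info(f"{func.__name__} took {time.time() - start:.2f}s")
        return result
    return wrapper


def format_time(seconds):
    """Format a duration in seconds as a human-readable string."""
    return f"{seconds:.2f} seconds"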
Data Flow: From Question to Answer
Here’s a step-by-step breakdown of how a user’s question travels through our application and comes back with an answer:
1. The user types a question into the Streamlit chat input.
2. The question is appended to the conversation history kept in session state.
3. The prompt template combines the new question with earlier turns into a single prompt.
4. The LLM manager sends that prompt to Ollama, or to the Hugging Face fallback if Ollama is unavailable.
5. The generated answer is displayed in the chat and stored back into the history, ready for the next follow-up.
Key Implementation Insights
GitHub Repository: day-01-local-qa-app
Effective Prompt Engineering
The quality of responses from any LLM depends heavily on how we structure our prompts. In our application, the prompt_templates.py file defines templates for various use cases. For example, a simple question-answering template might look like:
"""
Prompt templates for different use cases.
"""
class PromptTemplate:
"""
Class to handle prompt templates and formatting.
"""
@staticmethod
def qa_template(question, conversation_history=None):
"""
Format a question-answering prompt.
Args:
question (str): User question
conversation_history (list, optional): List of previous conversation turns
Returns:
str: Formatted prompt
"""
if not conversation_history:
return f"""
You are a helpful assistant. Answer the following question:
Question: {question}
Answer:
""".strip()
# Format conversation history
history_text = ""
for turn in conversation_history:
role = turn.get("role", "")
content = turn.get("content", "")
if role.lower() == "user":
history_text += f"Human: {content}\n"
elif role.lower() == "assistant":
history_text += f"Assistant: {content}\n"
# Add the current question
history_text += f"Human: {question}\nAssistant:"
return f"""
You are a helpful assistant. Here's the conversation so far:
{history_text}
""".strip()
@staticmethod
def coding_template(question, language=None):
"""
Format a prompt for coding questions.
Args:
question (str): User's coding question
language (str, optional): Programming language
Returns:
str: Formatted prompt
"""
lang_context = f"using {language}" if language else ""
return f"""
You are an expert programming assistant {lang_context}. Answer the following coding question with clear explanations and example code:
Question: {question}
Answer:
""".strip()
@staticmethod
def educational_template(question, topic=None, level="beginner"):
"""
Format a prompt for educational explanations.
Args:
question (str): User's question
topic (str, optional): The topic area
level (str): Knowledge level (beginner, intermediate, advanced)
Returns:
str: Formatted prompt
"""
topic_context = f"about {topic}" if topic else ""
return f"""
You are an educational assistant helping a {level} learner {topic_context}. Provide a clear and helpful explanation for the following question:
Question: {question}
Explanation:
""".strip()
This template-based approach keeps prompts consistent across the app, injects conversation history only when it exists, and adapts the assistant’s persona to the task at hand, whether that is general Q&A, coding help, or an educational explanation.
In short, good prompt engineering helps the LLM give better answers by setting the stage properly.
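To see what the template actually produces, here is a small usage sketch with a made-up two-turn history; you could run it in a Python shell from the src/ directory:

from models.prompt_templates import PromptTemplate

# A made-up conversation so far
history = [
    {"role": "user", "content": "What is Ollama?"},
    {"role": "assistant", "content": "Ollama is a tool for running LLMs locally."},
]

# The follow-up question gets wrapped together with the prior turns
prompt = PromptTemplate.qa_template("Does it need a GPU?", conversation_history=history)
print(prompt)
# The printed prompt contains the Human/Assistant turns followed by:
# Human: Does it need a GPU?
# Assistant: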
Resilient Model Management
A key lesson in LLM app development is planning for failure. Things can go wrong — the model might not be running, an API call might fail, etc. Our llm_loader.py implements a sophisticated fallback mechanism to handle these cases:
"""
LLM loader for different model backends (Ollama and HuggingFace).
"""
import sys
import json
import requests
from pathlib import Path
from transformers import pipeline
# Add src directory to path for imports
src_dir = str(Path(__file__).resolve().parent.parent)
if src_dir not in sys.path:
sys.path.insert(0, src_dir)
from utils.logger import logger
from utils.helpers import time_function, check_ollama_status
from config import settings
class LLMManager:
"""
Manager for loading and interacting with different LLM backends.
"""
def __init__(self):
"""Initialize the LLM Manager."""
self.ollama_host = settings.OLLAMA_HOST
self.default_ollama_model = settings.DEFAULT_OLLAMA_MODEL
self.default_hf_model = settings.DEFAULT_HF_MODEL
# Check if Ollama is available
self.ollama_available = check_ollama_status(self.ollama_host)
logger.info(f"Ollama available: {self.ollama_available}")
# Initialize HuggingFace model if needed
self.hf_pipeline = None
if not self.ollama_available:
logger.info(f"Initializing HuggingFace model: {self.default_hf_model}")
self._initialize_hf_model(self.default_hf_model)
def _initialize_hf_model(self, model_name):
"""Initialize a HuggingFace model pipeline."""
try:
self.hf_pipeline = pipeline(
"text2text-generation",
model=model_name,
max_length=settings.DEFAULT_MAX_LENGTH,
device=-1, # Use CPU
)
logger.info(f"Successfully loaded HuggingFace model: {model_name}")
except Exception as e:
logger.error(f"Error loading HuggingFace model: {str(e)}")
self.hf_pipeline = None
@time_function
def generate_with_ollama(self, prompt, model=None, temperature=None, max_tokens=None):
"""
Generate text using Ollama API.
Args:
prompt (str): Input prompt
model (str, optional): Model name
temperature (float, optional): Sampling temperature
max_tokens (int, optional): Maximum tokens to generate
Returns:
str: Generated text
"""
if not self.ollama_available:
logger.warning("Ollama not available, falling back to HuggingFace")
return self.generate_with_hf(prompt)
model = model or self.default_ollama_model
temperature = temperature or settings.DEFAULT_TEMPERATURE
max_tokens = max_tokens or settings.DEFAULT_MAX_LENGTH
try:
# Updated: Use 'completion' endpoint for newer Ollama versions
request_data = {
"model": model,
"prompt": prompt,
"temperature": temperature,
"max_tokens": max_tokens,
"stream": False
}
# Try the newer completion endpoint first
response = requests.post(
f"{self.ollama_host}/api/chat",
json={"model": model, "messages": [{"role": "user", "content": prompt}], "stream": False},
headers={"Content-Type": "application/json"}
)
if response.status_code == 200:
result = response.json()
return result.get("message", {}).get("content", "")
# Fall back to completion endpoint
response = requests.post(
f"{self.ollama_host}/api/completion",
json=request_data,
headers={"Content-Type": "application/json"}
)
if response.status_code == 200:
result = response.json()
return result.get("response", "")
# Fall back to the older generate endpoint
response = requests.post(
f"{self.ollama_host}/api/generate",
json=request_data,
headers={"Content-Type": "application/json"}
)
if response.status_code == 200:
result = response.json()
return result.get("response", "")
else:
logger.error(f"Ollama API error: {response.status_code} - {response.text}")
return self.generate_with_hf(prompt)
except Exception as e:
logger.error(f"Error generating with Ollama: {str(e)}")
return self.generate_with_hf(prompt)
@time_function
def generate_with_hf(self, prompt, model=None, temperature=None, max_length=None):
"""
Generate text using HuggingFace pipeline.
Args:
prompt (str): Input prompt
model (str, optional): Model name
temperature (float, optional): Sampling temperature
max_length (int, optional): Maximum length to generate
Returns:
str: Generated text
"""
model = model or self.default_hf_model
temperature = temperature or settings.DEFAULT_TEMPERATURE
max_length = max_length or settings.DEFAULT_MAX_LENGTH
# Initialize model if not done yet or if model changed
if self.hf_pipeline is None or self.hf_pipeline.model.name_or_path != model:
self._initialize_hf_model(model)
if self.hf_pipeline is None:
return "Sorry, the model is not available at the moment."
try:
result = self.hf_pipeline(
prompt,
temperature=temperature,
max_length=max_length
)
return result[0]["generated_text"]
except Exception as e:
logger.error(f"Error generating with HuggingFace: {str(e)}")
return "Sorry, an error occurred during text generation."
def generate(self, prompt, use_ollama=True, **kwargs):
"""
Generate text using the preferred backend.
Args:
prompt (str): Input prompt
use_ollama (bool): Whether to use Ollama if available
**kwargs: Additional generation parameters
Returns:
str: Generated text
"""
if use_ollama and self.ollama_available:
return self.generate_with_ollama(prompt, **kwargs)
else:
return self.generate_with_hf(prompt, **kwargs)
def get_available_models(self):
"""
Get a list of available models from both backends.
Returns:
dict: Dictionary with available models
"""
models = {
"ollama": [],
"huggingface": settings.AVAILABLE_HF_MODELS
}
# Get Ollama models if available
if self.ollama_available:
try:
response = requests.get(f"{self.ollama_host}/api/tags")
if response.status_code == 200:
data = response.json()
models["ollama"] = [model["name"] for model in data.get("models", [])]
else:
models["ollama"] = settings.AVAILABLE_OLLAMA_MODELS
except:
models["ollama"] = settings.AVAILABLE_OLLAMA_MODELS
return models
This approach ensures our application remains functional even when the Ollama server isn’t running, when an endpoint has changed between Ollama versions, or when a particular model fails to load.
By layering these fallbacks, we avoid a total failure. If Ollama doesn’t respond, the app will automatically try another route or model so the user still gets an answer.
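In practice, the rest of the application only needs the high-level generate() method. A minimal usage sketch, run from the src/ directory with either Ollama running or a Hugging Face fallback configured, looks like this:

from models.llm_loader import LLMManager

llm_manager = LLMManager()

# See which models each backend currently exposes
print(llm_manager.get_available_models())

# generate() picks Ollama when it's available and falls back to HuggingFace otherwise
answer = llm_manager.generate("Explain what a context window is in one sentence.")
print(answer)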
Conversation Context Management
LLMs have no built-in memory between requests — they treat each prompt independently. To create a realistic conversational experience, our app needs to remember past interactions. We manage this using Streamlit’s session state and prompt templates:
"""
Main application file for the LocalLLM Q&A Assistant.
This is the entry point for the Streamlit application that provides a chat interface
for interacting with locally running LLMs via Ollama, with fallback to HuggingFace models.
"""
import sys
import time
from pathlib import Path
# Add parent directory to sys.path
sys.path.append(str(Path(__file__).resolve().parent))
# Import Streamlit and other dependencies
import streamlit as st
# Import local modules
from config import settings
from utils.logger import logger
from utils.helpers import check_ollama_status, format_time
from models.llm_loader import LLMManager
from models.prompt_templates import PromptTemplate
# Initialize LLM Manager
llm_manager = LLMManager()
# Get available models
available_models = llm_manager.get_available_models()
# Set page configuration
st.set_page_config(
page_title=settings.APP_TITLE,
page_icon=settings.APP_ICON,
layout="wide",
initial_sidebar_state="expanded"
)
# Add custom CSS
st.markdown("""
<style>
.main .block-container {
padding-top: 2rem;
}
.stChatMessage {
background-color: rgba(240, 242, 246, 0.5);
}
.stChatMessage[data-testid="stChatMessageContent"] {
border-radius: 10px;
}
</style>
""", unsafe_allow_html=True)
# Initialize session state
if "messages" not in st.session_state:
st.session_state.messages = []
if "generation_time" not in st.session_state:
st.session_state.generation_time = None
# Sidebar with configuration options
with st.sidebar:
st.title("?? Settings")
# Model selection
st.subheader("Model Selection")
backend_option = st.radio(
"Select Backend:",
["Ollama", "HuggingFace"],
index=0 if llm_manager.ollama_available else 1,
disabled=not llm_manager.ollama_available
)
if backend_option == "Ollama" and llm_manager.ollama_available:
model_option = st.selectbox(
"Ollama Model:",
available_models["ollama"],
index=0 if available_models["ollama"] else 0,
disabled=not available_models["ollama"]
)
use_ollama = True
else:
model_option = st.selectbox(
"HuggingFace Model:",
available_models["huggingface"],
index=0
)
use_ollama = False
# Generation parameters
st.subheader("Generation Parameters")
temperature = st.slider(
"Temperature:",
min_value=0.1,
max_value=1.0,
value=settings.DEFAULT_TEMPERATURE,
step=0.1,
help="Higher values make the output more random, lower values make it more deterministic."
)
max_length = st.slider(
"Max Length:",
min_value=64,
max_value=2048,
value=settings.DEFAULT_MAX_LENGTH,
step=64,
help="Maximum number of tokens to generate."
)
# About section
st.subheader("About")
st.markdown("""
This application uses locally running LLM models to answer questions.
- Primary: Ollama API
- Fallback: HuggingFace Models
""")
# Show status
st.subheader("Status")
ollama_status = "? Connected" if llm_manager.ollama_available else "? Not available"
st.markdown(f"**Ollama API**: {ollama_status}")
if st.session_state.generation_time:
st.markdown(f"**Last generation time**: {st.session_state.generation_time}")
# Clear conversation button
if st.button("Clear Conversation"):
st.session_state.messages = []
st.rerun()
# Main chat interface
st.title("?? LocalLLM Q&A Assistant")
st.markdown("Ask a question and get answers from a locally running LLM.")
# Display chat messages
for message in st.session_state.messages:
with st.chat_message(message["role"]):
st.markdown(message["content"])
# Chat input
if prompt := st.chat_input("Ask a question..."):
# Add user message to history
st.session_state.messages.append({"role": "user", "content": prompt})
# Display user message
with st.chat_message("user"):
st.markdown(prompt)
# Generate response
with st.chat_message("assistant"):
message_placeholder = st.empty()
message_placeholder.markdown("Thinking...")
try:
# Format prompt with template and history
template = PromptTemplate.qa_template(
prompt,
st.session_state.messages[:-1] if len(st.session_state.messages) > 1 else None
)
# Measure generation time
start_time = time.time()
# Generate response
if use_ollama:
response = llm_manager.generate_with_ollama(
template,
model=model_option,
temperature=temperature,
max_tokens=max_length
)
else:
response = llm_manager.generate_with_hf(
template,
model=model_option,
temperature=temperature,
max_length=max_length
)
# Calculate generation time
end_time = time.time()
generation_time = format_time(end_time - start_time)
st.session_state.generation_time = generation_time
# Log generation info
logger.info(f"Generated response in {generation_time} with model {model_option}")
# Display response
message_placeholder.markdown(response)
# Add assistant response to history
st.session_state.messages.append({"role": "assistant", "content": response})
except Exception as e:
error_message = f"Error generating response: {str(e)}"
logger.error(error_message)
message_placeholder.markdown(f"?? {error_message}")
# Footer
st.markdown("---")
st.markdown(
"Built with Streamlit, Ollama, and HuggingFace. "
"Running LLMs locally on CPU. "
"<br><b>Author:</b> Shanoj",
unsafe_allow_html=True
)
This approach stores every user and assistant message in st.session_state.messages, replays the full history each time Streamlit reruns the script, and passes the earlier turns into the prompt template so the model can resolve follow-up questions.
Without this, the assistant would give disjointed answers with no memory of what was said before. Managing state is crucial for a chatbot-like experience.
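The key mechanic is worth isolating: Streamlit re-executes the entire script on every interaction, so anything not stored in st.session_state is lost between turns. This stripped-down sketch shows just the history handling used above:

import streamlit as st
from models.prompt_templates import PromptTemplate  # assumes running from the src/ directory

if "messages" not in st.session_state:
    st.session_state.messages = []   # survives Streamlit reruns; ordinary variables do not

if question := st.chat_input("Ask a follow-up..."):
    st.session_state.messages.append({"role": "user", "content": question})
    # Everything except the question we just appended is the prior history
    history = st.session_state.messages[:-1] if len(st.session_state.messages) > 1 else None
    prompt = PromptTemplate.qa_template(question, history)
    # ...send `prompt` to the LLM, then append the assistant reply to the history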
Challenges and Solutions
Throughout development, we faced a few specific challenges. Here’s how we addressed each:
Challenge 1: Handling Different Ollama API Versions
Ollama’s API has evolved, meaning an endpoint that worked in one version might not exist in another. To make our app robust to these changes, we implemented multiple endpoint attempts (as shown earlier in generate_with_ollama). In practice, the code tries the newest endpoint first (/api/chat), and if that request fails (for example with a 404 Not Found), it automatically falls back to older endpoints (/api/completion, then /api/generate).
Solution: By cascading through possible endpoints, we ensure compatibility with different Ollama versions without requiring the user to manually update anything. The assistant “just works” with whichever API is available.
Challenge 2: Python Path Management
In a modular Python project, getting imports to work correctly can be tricky, especially when running the app from different directories or as a module. We encountered issues where our modules couldn’t find each other. Our solution was to use explicit path management at runtime:
# At the top of a module nested under src/ (e.g. src/models/llm_loader.py)
import sys
from pathlib import Path

# Add the src/ directory to sys.path for module discovery
src_dir = str(Path(__file__).resolve().parent.parent)
if src_dir not in sys.path:
    sys.path.insert(0, src_dir)
Solution: This ensures that the src/ directory is always in Python’s module search path, so modules like models and utils can be imported reliably regardless of how the app is launched. This explicit approach prevents those “module not found” errors that often plague larger Python projects.
Challenge 3: Balancing UI Responsiveness with Processing Time
LLMs can take several seconds (or more) to generate a response, which might leave the user staring at a blank screen wondering if anything is happening. We wanted to keep the UI responsive and informative during these waits.
Solution: We implemented a simple loading indicator in the Streamlit UI. Before sending the prompt to the model, we display a temporary message:
# In src/app.py, just before calling the LLM generate function
message_placeholder = st.empty()
message_placeholder.markdown("_Thinking..._")
# Call the model to generate the answer (this may take several seconds)
response = llm_manager.generate(prompt)
# Once we have a response, replace the placeholder with the answer
message_placeholder.markdown(response)
Using st.empty() gives us a placeholder in the chat area that we can update later. First we show a “Thinking…” message immediately, so the user knows the question was received. After generation finishes, we overwrite that placeholder with the actual answer. This provides instant feedback (no more frozen feeling) and improves the user experience greatly.
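An equally simple alternative, shown here as a design option rather than what the repository uses, is Streamlit’s built-in spinner, which displays an animated status message while a block of code runs:

# Alternative: built-in spinner instead of a manual placeholder
with st.spinner("Thinking..."):
    response = llm_manager.generate(prompt)
st.markdown(response)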
Running the Application
Now that everything is implemented, running the application is straightforward. From the project’s root directory, execute the Streamlit app:
streamlit run src/app.py
This will launch the Streamlit web interface in your browser. Here’s what you can do with it:
The application automatically detects available Ollama models on your machine. If the primary model isn’t available, it will gracefully fall back to a secondary option (e.g., a Hugging Face model you’ve configured) so you’re never left without an answer. You now have your own private Q&A assistant running on your computer!
Learning Note: Tip — Installing Models. Make sure you have at least one LLM model installed via Ollama (for example, LLaMA or Mistral). You can run ollama pull <model-name> to download a model. Our app will list and use any model that Ollama has available locally.
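Before launching the UI, you can also sanity-check your Ollama setup from Python. This short sketch, assuming the default host and port, lists whatever models the server currently has installed:

import requests

# Ask the local Ollama server which models are installed
response = requests.get("http://localhost:11434/api/tags", timeout=5)
response.raise_for_status()
models = [m["name"] for m in response.json().get("models", [])]
print("Installed Ollama models:", models or "none; run `ollama pull <model-name>` first")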
GitHub Repository: day-01-local-qa-app