Fine-Tuning TinyLlama for Q&A on Structured Company Data: A Hands-On Guide with LoRA
Thirumalesh Konathala (PhD)
AI Innovation Leader, Advisor | GenAI , PredictiveAI Researcher | AI Architect | Analytics Director | Data Science Leader | Ex-Amazonian | Guest Speaker CSIR - IITR | HCU | ISI |
Introduction
In my previous article, I demonstrated how to fine-tune DeepSeek R1 1.5B for domain-specific text generation on Google Colab with a free T4 GPU using LoRA (Low-Rank Adaptation).
The key takeaway was that LoRA enables fine-tuning large models efficiently, even on constrained hardware.
This follow-up article explores how well a much smaller model, TinyLlama-1.1B-Chat, handles the same workflow when fine-tuned for Q&A on structured company data.
Why TinyLlama?
TinyLlama-1.1B-Chat is a compact 1.1B-parameter model optimized for low-memory inference and chat-style tasks, which makes it a practical candidate for fine-tuning on Colab's free T4 GPU.
This guide provides end-to-end code for fine-tuning TinyLlama on structured company data and testing its Q&A capabilities.
Prerequisites: Colab provides free NVIDIA T4 GPUs, but the GPU has to be enabled manually (a quick verification snippet follows these steps):
- Go to "Runtime" → "Change runtime type"
- Select T4 GPU from the "Hardware Accelerator" dropdown
- Click Save
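Optional: a quick sanity check (a minimal sketch, not part of the original notebook) to confirm the T4 runtime is active before proceeding:
# Verify that a CUDA-capable GPU is visible to PyTorch
import torch
if torch.cuda.is_available():
    print("GPU detected:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected - check Runtime > Change runtime type")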
The Colab notebook from my experiments is linked below.
Link - Colab Notebook
Now, let's start with fine-tuning TinyLlama for Q&A on Structured Company Data
Step 1: Install Required Libraries
Note: We use Hugging Face’s transformers, datasets, peft (for LoRA), and torch (for training).
!pip install transformers datasets peft torch
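Note (environment assumption): recent versions of the Hugging Face Trainer also depend on the accelerate package. Colab usually ships with it preinstalled, but if TrainingArguments raises an ImportError later, install it as well:
!pip install accelerate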
Step 2: Simulated Company Performance Data for Fine-Tuning
Note: This dataset mimics structured financial & operational reports of a company
Company Name: AlphaTech Inc.
Quarterly Revenue:
Q1: $50M
Q2: $65M
Q3: $80M
Q4: $90M
Net Profit:
Q1: $5M
Q2: $10M
Q3: $12M
Q4: $15M
Customer Growth:
Q1: 10,000 new users
Q2: 25,000 new users
Q3: 40,000 new users
Q4: 55,000 new users
Why? LLMs trained on structured numerical and text data can learn factual QA patterns useful for financial analysis, business intelligence, and decision-making.
Step 3: Load the Pre-Trained Model
Note: We use TinyLlama-1.1B-Chat, optimized for low-memory inference & chat-based tasks.
# Define the model name
# model_ref = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
model_ref = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
Step 4: Create Structured Q&A Pairs for Training
Note: The Q&A dataset is designed to simulate real-world business inquiries.
# Structured Q&A from the simulated data for training purpose
qa_data = [
{"question": "What was the revenue for AlphaTech Inc. in Q3?", "answer": "$80M"},
{"question": "How many new customers did AlphaTech acquire in Q2?", "answer": "25,000 new users"},
{"question": "Which quarter did AlphaTech enter the Asian market?", "answer": "Q3"},
{"question": "What was the net profit of AlphaTech in Q4?", "answer": "$15M"},
{"question": "What investment did AlphaTech make in Q1?", "answer": "Invested $5M in AI research."},
{"question": "What major challenge did AlphaTech face in Q3?", "answer": "Rising competition."},
{"question": "Which region did AlphaTech expand to in Q4?", "answer": "Latin America."},
{"question": "What was the main regulatory challenge faced by AlphaTech?", "answer": "Regulatory compliance challenges in Q4."}
Why? This helps the model learn structured fact retrieval, making it useful for automated business Q&A systems.
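If you want to go beyond hand-written pairs, the same report can be expanded into Q&A pairs with simple templates. A minimal sketch (the quarterly_figures dict and the template wording are my own illustration, not part of the notebook):
# Hypothetical helper: derive additional Q&A pairs from the structured figures via templates
quarterly_figures = {
    "Q1": {"revenue": "$50M", "net_profit": "$5M", "new_users": "10,000 new users"},
    "Q2": {"revenue": "$65M", "net_profit": "$10M", "new_users": "25,000 new users"},
    "Q3": {"revenue": "$80M", "net_profit": "$12M", "new_users": "40,000 new users"},
    "Q4": {"revenue": "$90M", "net_profit": "$15M", "new_users": "55,000 new users"},
}
templated_qa = []
for quarter, figures in quarterly_figures.items():
    templated_qa.append({"question": f"What was the revenue for AlphaTech Inc. in {quarter}?", "answer": figures["revenue"]})
    templated_qa.append({"question": f"What was the net profit of AlphaTech in {quarter}?", "answer": figures["net_profit"]})
    templated_qa.append({"question": f"How many new customers did AlphaTech acquire in {quarter}?", "answer": figures["new_users"]})
# qa_data = qa_data + templated_qa  # optionally merge with the hand-written pairs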
Step 5: Load Model & Tokenizer
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType
from datasets import Dataset
from transformers import TrainingArguments, Trainer
import gc
# Define the model name
model_name = model_ref
# Load pre-trained model & tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
Why? Loads TinyLlama’s pre-trained knowledge and moves it to GPU for faster fine-tuning.
Tip: Observe the model size and CPU/GPU utilisation from this point onward.
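A small helper like the one below (my own addition, not in the notebook) makes that observation concrete:
# Report parameter count and current GPU memory usage
num_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {num_params / 1e9:.2f}B")
if torch.cuda.is_available():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"GPU memory reserved: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")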
Step 6: Data preprocessing, Tokenization, LoRA configuration and training arguments
def format_qa(example):
return {
"text": f"Question: {example['question']} Answer: {example['answer']}"
}
qa_dataset = Dataset.from_list(qa_data)
formatted_dataset = qa_dataset.map(format_qa)
# Tokenization
def preprocess_function(examples):
inputs = tokenizer(
examples['text'],
truncation=True,
padding="max_length",
max_length=512
)
# For causal LM training, labels are a copy of input_ids; the model shifts them internally
inputs["labels"] = inputs["input_ids"].copy()
return inputs
# Apply tokenization
tokenized_qa_dataset = formatted_dataset.map(preprocess_function, batched=True)
# Define LoRA configuration
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
)
# Wrap model with LoRA
model = get_peft_model(model, lora_config)
# Set Training Arguments
training_args = TrainingArguments(
per_device_train_batch_size=1, # Adjusted for GPU memory limitations
gradient_accumulation_steps=8, # To simulate a larger batch size
warmup_steps=100,
max_steps=250,
learning_rate=2e-4,
fp16=True, # Enable mixed precision training
logging_steps=10,
output_dir="outputs",
report_to="none",
remove_unused_columns=False,
)
Why? LoRA reduces memory usage by fine-tuning only key layers instead of all 1.1B parameters.
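You can verify this directly: PEFT exposes a helper that reports how many parameters are actually trainable (the figures in the comment are rough expectations, not from my run):
# With r=16 on q_proj and v_proj, only a tiny fraction of the 1.1B parameters is trainable
model.print_trainable_parameters()
# Expected output is on the order of a few million trainable params, well under 1% of the total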
Step 7: Model Fine-tuning and Save
# Move model to CPU to free memory before training
model = model.to("cpu")
# Initialize the Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_qa_dataset,
)
# Free up memory before training
gc.collect() # Garbage collection
torch.cuda.empty_cache() # Clears CUDA cache
print("GPU clache cleared")
# Optimize model with torch.compile (improves execution speed)
model = torch.compile(model)
# Move model back to GPU for training
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# Start training
trainer.train()
# Save the fine tuned model
model.save_pretrained("fine-tuned-QA-tinyllama-1.1B")
tokenizer.save_pretrained("fine-tuned-QA-tinyllama-1.1B")
Why? Saving the fine-tuned weights and tokenizer lets us reload the model later for inference without retraining.
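Note that save_pretrained on a PEFT-wrapped model stores only the small LoRA adapter, not the full 1.1B-parameter base model. Recent transformers versions can load the adapter directory directly (as in Step 8); if you prefer a standalone checkpoint instead, here is a minimal sketch for merging the adapter into the base weights (the "-merged" output directory name is my own choice):
# Optional: merge the LoRA adapter into the base model for standalone deployment
from peft import PeftModel
from transformers import AutoModelForCausalLM
base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
merged = PeftModel.from_pretrained(base, "fine-tuned-QA-tinyllama-1.1B")
merged = merged.merge_and_unload()  # folds the LoRA weights into the base layers
merged.save_pretrained("fine-tuned-QA-tinyllama-1.1B-merged")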
Step 8: Load the Fine-Tuned Model for Inference
# Load the fine-tuned model
model_path = "fine-tuned-QA-tinyllama-1.1B"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
def generate_answer(question, max_length=50):
prompt = f"Question: {question} Answer:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
# do_sample=True so that temperature/top_k/top_p actually take effect
output = model.generate(**inputs, max_length=max_length, do_sample=True, temperature=0.7, top_k=50, top_p=0.9)
return tokenizer.decode(output[0], skip_special_tokens=True)
Step 9: Test the Fine-Tuned Model
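A simple loop along these lines drives the tests (the specific questions shown here are illustrative; the exact ones I used are in the notebook):
# Probe seen questions, paraphrases, aggregation, and an unknown company
test_questions = [
    "What was the revenue for AlphaTech Inc. in Q3?",    # seen during training
    "What was AlphaTech's Q3 revenue?",                  # paraphrase of a seen question
    "What was AlphaTech's total revenue for the year?",  # requires aggregation
    "What was the revenue for BetaSoft Inc. in Q3?",     # different (unknown) company
]
for q in test_questions:
    print(f"Q: {q}")
    print(f"A: {generate_answer(q)}\n")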
Observation 1: The results are accurate on known questions (Passed)
Observation 2: The model gave the same answer to similarly worded questions (Failed)
Observation 3: The model failed to aggregate the known information (Failed)
Observation 4: The model could not tell a different company apart from AlphaTech (Failed)
Key Takeaways
- TinyLlama is well-suited for structured Q&A with LoRA fine-tuning.
- Memory-efficient LoRA allows training on Google Colab's free GPU.
- Batch size must remain small (1) to prevent OOM errors.
- Training on more diverse data improves generalization.
Next Steps
- Expand the dataset with more diverse financial/business questions
- Fine-tune for longer (e.g., max_steps=1000)
- Evaluate performance with BLEU (a minimal sketch follows below)
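As a starting point for that evaluation, a sketch using NLTK's sentence-level BLEU (the question/reference pair is a placeholder; for very short factual answers, exact-match accuracy may be more informative than BLEU):
# Compare a generated answer against the expected answer with smoothed sentence BLEU
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
smoother = SmoothingFunction().method1
reference = "$80M".split()
prediction_text = generate_answer("What was the revenue for AlphaTech Inc. in Q3?")
prediction = prediction_text.split("Answer:")[-1].strip().split()  # strip the prompt before scoring
print(f"BLEU: {sentence_bleu([reference], prediction, smoothing_function=smoother):.3f}")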
Would you fine-tune LLMs for business Q&A? Drop your thoughts below!