Fine-Tuning TinyLlama for Q&A on Structured Company Data: A Hands-On Guide with LoRA

Introduction

In my previous article, I demonstrated how to fine-tune DeepSeek R1 1.5B for domain-specific text generation on Google Colab with a free T4 GPU using LoRA (Low-Rank Adaptation).

https://www.dhirubhai.net/pulse/fine-tune-deepseek-r1-15b-free-gcp-colab-t4-hands-on-konathala-phd--4bluf/

The key takeaway was that LoRA enables fine-tuning large models efficiently, even on constrained hardware.

This follow-up article explores:

  • Fine-tuning TinyLlama-1.1B-Chat for a Question-Answering (QA) task
  • Using structured company performance data as training data
  • Training on Google Colab with LoRA to optimize memory usage
  • Handling challenges like CUDA memory constraints & batch size tuning


Why TinyLlama?

  • A lightweight 1.1B parameter model, making it more suitable for QA fine-tuning on free-tier GPUs.
  • Optimized for chat-based and reasoning tasks, aligning well with financial Q&A applications.
  • Supports LoRA, enabling efficient fine-tuning without modifying full model weights.

This guide provides end-to-end code for fine-tuning TinyLlama on structured company data and testing its Q&A capabilities.


Prerequisites: Colab provides free NVIDIA T4 GPUs, but the GPU has to be enabled manually:

  1. Open Google Colab
  2. Click “+ New Notebook”
  3. Enable GPU:

- Go to "Runtime" → "Change runtime type"

- Select T4 GPU from the "Hardware Accelerator" dropdown

- Click Save

The Colab notebook from my experiments is linked below.

Link - Colab Notebook


Now, let's start fine-tuning TinyLlama for Q&A on structured company data.

Step 1: Install Required Libraries

Note: We use Hugging Face’s transformers, datasets, peft (for LoRA), and torch (for training).

!pip install transformers datasets peft torch        
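
Optionally, confirm the installed versions before moving on (a quick sanity check I am adding here; it is not in the original notebook):

import torch, transformers, peft, datasets

# Print library versions so the run is easier to reproduce
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
print("datasets:", datasets.__version__)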

Step 2: Simulated Company Performance Data for Fine-Tuning

Note: This dataset mimics structured financial & operational reports of a company

Company Name: AlphaTech Inc.

Quarterly Revenue:
Q1: $50M
Q2: $65M
Q3: $80M
Q4: $90M

Net Profit:
Q1: $5M
Q2: $10M
Q3: $12M
Q4: $15M

Customer Growth:
Q1: 10,000 new users
Q2: 25,000 new users
Q3: 40,000 new users
Q4: 55,000 new users        

Why? LLMs trained on structured numerical and text data can learn factual QA patterns useful for financial analysis, business intelligence, and decision-making.
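
To keep this report handy in the notebook (for generating or checking Q&A pairs later), one option is to store it as a plain Python string. This is just a convenience sketch; the variable name company_report is mine, and the Q&A pairs in Step 4 also draw on narrative details of the fuller report (market expansion, investments, challenges).

# Raw report kept as a reference string (hypothetical helper, not required for training)
company_report = """
Company Name: AlphaTech Inc.

Quarterly Revenue:  Q1: $50M | Q2: $65M | Q3: $80M | Q4: $90M
Net Profit:         Q1: $5M  | Q2: $10M | Q3: $12M | Q4: $15M
Customer Growth:    Q1: 10,000 | Q2: 25,000 | Q3: 40,000 | Q4: 55,000 new users
"""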

Step 3: Load the Pre-Trained Model

Note: We use TinyLlama-1.1B-Chat, optimized for low-memory inference & chat-based tasks.

# Define the model name
# model_ref = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
model_ref = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"        

Step 4: Create Structured Q&A Pairs for Training

Note: The Q&A dataset is designed to simulate real-world business inquiries.

# Structured Q&A from the simulated data for training purpose

qa_data = [
    {"question": "What was the revenue for AlphaTech Inc. in Q3?", "answer": "$80M"},
    {"question": "How many new customers did AlphaTech acquire in Q2?", "answer": "25,000 new users"},
    {"question": "Which quarter did AlphaTech enter the Asian market?", "answer": "Q3"},
    {"question": "What was the net profit of AlphaTech in Q4?", "answer": "$15M"},
    {"question": "What investment did AlphaTech make in Q1?", "answer": "Invested $5M in AI research."},
    {"question": "What major challenge did AlphaTech face in Q3?", "answer": "Rising competition."},
    {"question": "Which region did AlphaTech expand to in Q4?", "answer": "Latin America."},
    {"question": "What was the main regulatory challenge faced by AlphaTech?", "answer": "Regulatory compliance challenges in Q4."}        

Why? This helps the model learn structured fact retrieval, making it useful for automated business Q&A systems.


Step 5: Load Model & Tokenizer

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType
from datasets import Dataset
from transformers import TrainingArguments, Trainer
import gc


# Define the model name
model_name = model_ref

# Load pre-trained model & tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
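
# Some Llama-family tokenizers ship without a pad token; fall back to EOS just in case
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token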

# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)        

Why? Loads TinyLlama’s pre-trained knowledge and moves it to GPU for faster fine-tuning.

Tip: keep an eye on the model size and CPU & GPU utilisation from this point onwards.
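
For example, a quick footprint check (a small sketch added here, not from the original notebook) can report the parameter count and current GPU memory usage:

# Rough footprint check: parameter count and current GPU memory usage
num_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {num_params / 1e9:.2f}B")

if torch.cuda.is_available():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"GPU memory reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")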

Step 6: Data preprocessing, Tokenization, LoRA configuration and training arguments


def format_qa(example):
    return {
        "text": f"Question: {example['question']} Answer: {example['answer']}"
    }

qa_dataset = Dataset.from_list(qa_data)
formatted_dataset = qa_dataset.map(format_qa)

# Tokenization
def preprocess_function(examples):
    inputs = tokenizer(
        examples['text'],
        truncation=True,
        padding="max_length",
        max_length=512
    )

    # For causal LM training, labels are a copy of input_ids; the model shifts them internally
    # (Note: pad tokens are included in the loss here; masking them with -100 is a common refinement)
    inputs["labels"] = inputs["input_ids"].copy()
    return inputs

# Apply tokenization
tokenized_qa_dataset = formatted_dataset.map(preprocess_function, batched=True)


# Define LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)

# Wrap model with LoRA
model = get_peft_model(model, lora_config)


# Set Training Arguments

training_args = TrainingArguments(
    per_device_train_batch_size=1,  # Adjusted for GPU memory limitations
    gradient_accumulation_steps=8,  # To simulate a larger batch size
    warmup_steps=100,
    max_steps=250,
    learning_rate=2e-4,
    fp16=True,  # Enable mixed precision training
    logging_steps=10,
    output_dir="outputs",
    report_to="none",
    remove_unused_columns=False,
)
        

Why? LoRA reduces memory usage by fine-tuning only key layers instead of all 1.1B parameters.

  • Batch size = 1 (prevents CUDA Out-of-Memory (OOM) errors on T4 GPU)
  • Gradient accumulation = 8 (simulates larger batches)
  • Mixed precision (fp16) for faster training
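
To see how few weights LoRA actually trains, PEFT's print_trainable_parameters helper reports trainable vs. total parameters (the exact numbers depend on the configuration above):

# Typically well under 1% of the 1.1B parameters are trainable with this LoRA config
model.print_trainable_parameters()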


Step 7: Fine-Tune and Save the Model


# Move model to CPU to free memory before training
model = model.to("cpu")

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_qa_dataset,
)

# Free up memory before training

gc.collect()  # Garbage collection
torch.cuda.empty_cache()  # Clears CUDA cache
print("GPU clache cleared")

# Optimize model with torch.compile (improves execution speed)
# Note: the Trainer above was created with the uncompiled model, so this mainly benefits later
# direct calls; TrainingArguments(torch_compile=True) is the supported way to compile inside Trainer
model = torch.compile(model)

# Move model back to GPU for training
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Start training
trainer.train()

# Save the fine tuned model
model.save_pretrained("fine-tuned-QA-tinyllama-1.1B")
tokenizer.save_pretrained("fine-tuned-QA-tinyllama-1.1B")        

Why?

  • Clears unused GPU memory
  • Uses torch.compile() to speed up execution
  • Uses Hugging Face's Trainer API for efficient training
  • Saves trained model weights for future inference
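
Because the model is wrapped with PEFT, save_pretrained stores only the small LoRA adapter weights rather than a full 1.1B-parameter checkpoint. If a standalone merged model is ever needed, one option (a sketch, not part of the original notebook) is to reload the base model and merge the adapters into it:

from peft import PeftModel

# Reload the base model and attach the saved LoRA adapters
base_model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
peft_model = PeftModel.from_pretrained(base_model, "fine-tuned-QA-tinyllama-1.1B")

# Fold the adapter weights into the base weights and save a standalone model
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("fine-tuned-QA-tinyllama-1.1B-merged")
tokenizer.save_pretrained("fine-tuned-QA-tinyllama-1.1B-merged")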

Step 8: Load the Fine-Tuned Model for Inference

# Load the fine-tuned model
model_path = "fine-tuned-QA-tinyllama-1.1B"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def generate_answer(question, max_length=50):
    prompt = f"Question: {question} Answer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        # do_sample=True is required for temperature/top_k/top_p to take effect
        output = model.generate(**inputs, max_length=max_length, do_sample=True, temperature=0.7, top_k=50, top_p=0.9)

    return tokenizer.decode(output[0], skip_special_tokens=True)        


Step 9: Test the Fine-Tuned Model
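
The exact test prompts are not reproduced in the article, but the probing looked roughly like this; the questions below are illustrative, mixing training questions with paraphrases, an aggregation query, and a made-up company:

# Questions seen during training
print(generate_answer("What was the revenue for AlphaTech Inc. in Q3?"))
print(generate_answer("What was the net profit of AlphaTech in Q4?"))

# Paraphrase, aggregation, and unseen-company probes
print(generate_answer("How much revenue did AlphaTech make in the third quarter?"))
print(generate_answer("What was AlphaTech's total revenue for the year?"))
print(generate_answer("What was the revenue of BetaSoft Inc. in Q3?"))  # BetaSoft is a fictitious name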

Observation 1: Answers were accurate for questions seen during training (Passed)
Observation 2: The model returned the same answer for similar but distinct questions (Failed)
Observation 3: The model failed to aggregate known information (Failed)
Observation 4: The model could not tell different companies apart (Failed)


Key Takeaways

  • TinyLlama is well-suited for structured Q&A with LoRA fine-tuning.
  • Memory-efficient LoRA allows training on Google Colab’s free GPU.
  • Batch size must remain small (1) to prevent OOM errors.
  • Training on more diverse data improves generalization.


Next Steps

  • Expand the dataset with more diverse financial/business questions
  • Fine-tune for longer (e.g., max_steps=1000)
  • Evaluate performance with BLEU (see the sketch below)
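
For the BLEU idea, a minimal sketch with Hugging Face's evaluate library could look like the following (evaluate is not in the Step 1 install list, so it needs a pip install evaluate first):

import evaluate

bleu = evaluate.load("bleu")

predictions, references = [], []
for item in qa_data:
    generated = generate_answer(item["question"])
    # generate_answer returns the prompt plus the answer; keep only the generated answer part
    predictions.append(generated.split("Answer:", 1)[-1].strip())
    references.append([item["answer"]])

results = bleu.compute(predictions=predictions, references=references)
print(f"BLEU: {results['bleu']:.3f}")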


Would you fine-tune LLMs for business Q&A? Drop your thoughts below!
