Fine-Tuning TinyLlama for Q&A on Structured Company Data: A Hands-On Guide with LoRA
Thirumalesh Konathala (PhD)
AI Innovation Leader, Advisor | GenAI , PredictiveAI Researcher | AI Architect | Analytics Director | Data Science Leader | Ex-Amazonian | Guest Speaker CSIR - IITR | HCU | ISI |
Introduction
In my previous article, I demonstrated how to fine-tune DeepSeek R1 1.5B for domain-specific text generation on Google Colab with a free T4 GPU using LoRA (Low-Rank Adaptation).
The key takeaway was that LoRA enables fine-tuning large models efficiently, even on constrained hardware.
This follow-up article explores how well a much smaller model, TinyLlama-1.1B-Chat, handles the same workflow when fine-tuned for Q&A on structured company data.
Why TinyLlama?
TinyLlama-1.1B-Chat is a compact 1.1B-parameter model optimized for low-memory inference and chat-style tasks, which makes it a practical candidate for fine-tuning on Colab's free T4 GPU.
This guide provides end-to-end code for fine-tuning TinyLlama on structured company data and testing its Q&A capabilities.
Prerequisites: Colab provides free NVIDIA T4 GPUs, but the GPU has to be enabled manually (a quick verification snippet follows these steps):
- Go to "Runtime" → "Change runtime type"
- Select T4 GPU from the "Hardware Accelerator" dropdown
- Click Save
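Optional: a quick sanity check (a minimal sketch, not part of the original notebook) to confirm the T4 runtime is active before proceeding:
# Verify that a CUDA-capable GPU is visible to PyTorch
import torch
if torch.cuda.is_available():
    print("GPU detected:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected - check Runtime > Change runtime type")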
The Colab notebook from my experiments is linked below.
Link - Colab Notebook
Now, let's start with fine-tuning TinyLlama for Q&A on Structured Company Data
Step 1: Install Required Libraries
Note: We use Hugging Face’s transformers, datasets, peft (for LoRA), and torch (for training).
!pip install transformers datasets peft torch
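Note (environment assumption): recent versions of the Hugging Face Trainer also depend on the accelerate package. Colab usually ships with it preinstalled, but if TrainingArguments raises an ImportError later, install it as well:
!pip install accelerate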
Step 2: Simulated Company Performance Data for Fine-Tuning
Note: This dataset mimics structured financial & operational reports of a company
Company Name: AlphaTech Inc.
Quarterly Revenue:
Q1: $50M
Q2: $65M
Q3: $80M
Q4: $90M
Net Profit:
Q1: $5M
Q2: $10M
Q3: $12M
Q4: $15M
Customer Growth:
Q1: 10,000 new users
Q2: 25,000 new users
Q3: 40,000 new users
Q4: 55,000 new users
Why? LLMs trained on structured numerical and text data can learn factual QA patterns useful for financial analysis, business intelligence, and decision-making.
Step 3: Load the Pre-Trained Model
Note: We use TinyLlama-1.1B-Chat, optimized for low-memory inference & chat-based tasks.
# Define the model name
# model_ref = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
model_ref = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
Step 4: Create Structured Q&A Pairs for Training
Note: The Q&A dataset is designed to simulate real-world business inquiries.
# Structured Q&A from the simulated data for training purpose
qa_data = [
{"question": "What was the revenue for AlphaTech Inc. in Q3?", "answer": "$80M"},
{"question": "How many new customers did AlphaTech acquire in Q2?", "answer": "25,000 new users"},
{"question": "Which quarter did AlphaTech enter the Asian market?", "answer": "Q3"},
{"question": "What was the net profit of AlphaTech in Q4?", "answer": "$15M"},
{"question": "What investment did AlphaTech make in Q1?", "answer": "Invested $5M in AI research."},
{"question": "What major challenge did AlphaTech face in Q3?", "answer": "Rising competition."},
{"question": "Which region did AlphaTech expand to in Q4?", "answer": "Latin America."},
{"question": "What was the main regulatory challenge faced by AlphaTech?", "answer": "Regulatory compliance challenges in Q4."}
Why? This helps the model learn structured fact retrieval, making it useful for automated business Q&A systems.
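If you want to go beyond hand-written pairs, the same report can be expanded into Q&A pairs with simple templates. A minimal sketch (the quarterly_figures dict and the template wording are my own illustration, not part of the notebook):
# Hypothetical helper: derive additional Q&A pairs from the structured figures via templates
quarterly_figures = {
    "Q1": {"revenue": "$50M", "net_profit": "$5M", "new_users": "10,000 new users"},
    "Q2": {"revenue": "$65M", "net_profit": "$10M", "new_users": "25,000 new users"},
    "Q3": {"revenue": "$80M", "net_profit": "$12M", "new_users": "40,000 new users"},
    "Q4": {"revenue": "$90M", "net_profit": "$15M", "new_users": "55,000 new users"},
}
templated_qa = []
for quarter, figures in quarterly_figures.items():
    templated_qa.append({"question": f"What was the revenue for AlphaTech Inc. in {quarter}?", "answer": figures["revenue"]})
    templated_qa.append({"question": f"What was the net profit of AlphaTech in {quarter}?", "answer": figures["net_profit"]})
    templated_qa.append({"question": f"How many new customers did AlphaTech acquire in {quarter}?", "answer": figures["new_users"]})
# qa_data = qa_data + templated_qa  # optionally merge with the hand-written pairs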
Step 5: Load Model & Tokenizer
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType
from datasets import Dataset
from transformers import TrainingArguments, Trainer
import gc
# Define the model name
model_name = model_ref
# Load pre-trained model & tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
Why? Loads TinyLlama’s pre-trained knowledge and moves it to GPU for faster fine-tuning.
Tip: Observe the model size and CPU/GPU utilisation from this point onward.
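A small helper like the one below (my own addition, not in the notebook) makes that observation concrete:
# Report parameter count and current GPU memory usage
num_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {num_params / 1e9:.2f}B")
if torch.cuda.is_available():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"GPU memory reserved: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")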
Step 6: Data preprocessing, Tokenization, LoRA configuration and training arguments
def format_qa(example):
return {
"text": f"Question: {example['question']} Answer: {example['answer']}"
}
qa_dataset = Dataset.from_list(qa_data)
formatted_dataset = qa_dataset.map(format_qa)
# Tokenization
def preprocess_function(examples):
inputs = tokenizer(
examples['text'],
truncation=True,
padding="max_length",
max_length=512
)
# For causal LM training, labels are a copy of input_ids; the model shifts them internally
inputs["labels"] = inputs["input_ids"].copy()
return inputs
# Apply tokenization
tokenized_qa_dataset = formatted_dataset.map(preprocess_function, batched=True)
# Define LoRA configuration
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
)
# Wrap model with LoRA
model = get_peft_model(model, lora_config)
# Set Training Arguments
training_args = TrainingArguments(
per_device_train_batch_size=1, # Adjusted for GPU memory limitations
gradient_accumulation_steps=8, # To simulate a larger batch size
warmup_steps=100,
max_steps=250,
learning_rate=2e-4,
fp16=True, # Enable mixed precision training
logging_steps=10,
output_dir="outputs",
report_to="none",
remove_unused_columns=False,
)
Why? LoRA reduces memory usage by fine-tuning only key layers instead of all 1.1B parameters.
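You can verify this directly: PEFT exposes a helper that reports how many parameters are actually trainable (the figures in the comment are rough expectations, not from my run):
# With r=16 on q_proj and v_proj, only a tiny fraction of the 1.1B parameters is trainable
model.print_trainable_parameters()
# Expected output is on the order of a few million trainable params, well under 1% of the total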
Step 7: Model Fine-tuning and Save
# Move model to CPU to free memory before training
model = model.to("cpu")
# Initialize the Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_qa_dataset,
)
# Free up memory before training
gc.collect() # Garbage collection
torch.cuda.empty_cache() # Clears CUDA cache
print("GPU clache cleared")
# Optimize model with torch.compile (improves execution speed)
model = torch.compile(model)
# Move model back to GPU for training
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# Start training
trainer.train()
# Save the fine tuned model
model.save_pretrained("fine-tuned-QA-tinyllama-1.1B")
tokenizer.save_pretrained("fine-tuned-QA-tinyllama-1.1B")
Why? Saving the fine-tuned weights and tokenizer lets us reload the model later for inference without retraining.
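Note that save_pretrained on a PEFT-wrapped model stores only the small LoRA adapter, not the full 1.1B-parameter base model. Recent transformers versions can load the adapter directory directly (as in Step 8); if you prefer a standalone checkpoint instead, here is a minimal sketch for merging the adapter into the base weights (the "-merged" output directory name is my own choice):
# Optional: merge the LoRA adapter into the base model for standalone deployment
from peft import PeftModel
from transformers import AutoModelForCausalLM
base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
merged = PeftModel.from_pretrained(base, "fine-tuned-QA-tinyllama-1.1B")
merged = merged.merge_and_unload()  # folds the LoRA weights into the base layers
merged.save_pretrained("fine-tuned-QA-tinyllama-1.1B-merged")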
Step 8: Load the Fine-Tuned Model for Inference
# Load the fine-tuned model
model_path = "fine-tuned-QA-tinyllama-1.1B"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
def generate_answer(question, max_length=50):
prompt = f"Question: {question} Answer:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
# do_sample=True so that temperature/top_k/top_p actually take effect
output = model.generate(**inputs, max_length=max_length, do_sample=True, temperature=0.7, top_k=50, top_p=0.9)
return tokenizer.decode(output[0], skip_special_tokens=True)
Step 9: Test the Fine-Tuned Model
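A simple loop along these lines drives the tests (the specific questions shown here are illustrative; the exact ones I used are in the notebook):
# Probe seen questions, paraphrases, aggregation, and an unknown company
test_questions = [
    "What was the revenue for AlphaTech Inc. in Q3?",    # seen during training
    "What was AlphaTech's Q3 revenue?",                  # paraphrase of a seen question
    "What was AlphaTech's total revenue for the year?",  # requires aggregation
    "What was the revenue for BetaSoft Inc. in Q3?",     # different (unknown) company
]
for q in test_questions:
    print(f"Q: {q}")
    print(f"A: {generate_answer(q)}\n")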
Observation 1: The results are accurate on known questions (Passed)
Observation 2: The model gave the same answer to similarly worded questions (Failed)
Observation 3: The model failed to aggregate the known information (Failed)
Observation 4: The model could not tell a different company apart from AlphaTech (Failed)
Key Takeaways
- TinyLlama is well-suited for structured Q&A with LoRA fine-tuning.
- Memory-efficient LoRA allows training on Google Colab's free GPU.
- Batch size must remain small (1) to prevent OOM errors.
- Training on more diverse data improves generalization.
Next Steps
- Expand the dataset with more diverse financial/business questions
- Fine-tune for longer (e.g., max_steps=1000)
- Evaluate performance with BLEU (a minimal sketch follows below)
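As a starting point for that evaluation, a sketch using NLTK's sentence-level BLEU (the question/reference pair is a placeholder; for very short factual answers, exact-match accuracy may be more informative than BLEU):
# Compare a generated answer against the expected answer with smoothed sentence BLEU
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
smoother = SmoothingFunction().method1
reference = "$80M".split()
prediction_text = generate_answer("What was the revenue for AlphaTech Inc. in Q3?")
prediction = prediction_text.split("Answer:")[-1].strip().split()  # strip the prompt before scoring
print(f"BLEU: {sentence_bleu([reference], prediction, smoothing_function=smoother):.3f}")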
Would you fine-tune LLMs for business Q&A? Drop your thoughts below!