I fine-tuned a LLaMA on Vertex AI using torchtune for $10
Sorry for the click-baity title, but I want to clarify upfront: the fine-tuned model from this process won't be as broadly capable as a general-purpose foundation model, but it can be highly effective for specific, narrow tasks. This model is trained on a relatively small dataset (~500MB) and based on a modest 3B-parameter architecture. In contrast, foundation models like LLaMA-3 (70B) and GPT-4 (parameter count undisclosed, but widely assumed to be far larger than GPT-3's 175B) are trained on massive datasets: LLaMA-3 was trained on roughly 15 trillion tokens (on the order of 60TB of text), while GPT-4's training corpus is undisclosed but presumed much larger still, and its Turbo variant supports a 128,000-token context window for long-form content. While my setup doesn't match the broad generalization of these large-scale models, it excels at domain-specific tasks, making it a highly cost-effective alternative for targeted applications—which, in reality, is sufficient for most practical use cases.
I wanted to see if I could fine-tune a LLaMA model without breaking the bank. Training even a 3B-parameter LLaMA on a local machine is nearly impossible without a high-end GPU—trust me, I tried. Apple Silicon doesn't support CUDA, and much of this tooling (bitsandbytes in particular) assumes an NVIDIA GPU. Beyond that, out-of-memory errors and unbearably slow training times made me realize I needed a cloud solution. So I turned to Google Cloud Vertex AI, and with some optimizations I managed to fine-tune a 3B LLaMA on ~500MB of data for just around $10!
Setting up Vertex AI for Fine-Tuning
To start, I set up Google Cloud Vertex AI Workbench, which provides a cloud-based Jupyter environment for training models. To begin with, I enabled the Vertex AI, Compute Engine, and Cloud Storage APIs.
Then I created a Vertex AI Workbench instance—specifically, a user-managed notebook. I selected a Deep Learning VM image with PyTorch 2.4, an NVIDIA A100 GPU with 40GB of VRAM, and a 100GB boot disk.
Install Dependencies
Once my JupyterLab environment was up, I installed all necessary libraries by running:
pip install torchtune transformers accelerate bitsandbytes peft datasets kfp google-cloud-aiplatform
These packages are essential for working with LLaMA models, applying QLoRA, and optimizing training performance.
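Before loading anything, it's worth a quick check that the notebook kernel actually sees the GPU—a missing accelerator is the most common reason the later steps crawl:
import torch

# Should print True and "NVIDIA A100-SXM4-40GB" (or similar) on the Workbench instance
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))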
Load LLaMA-3.2-3B with QLoRA
To load the model, I used the transformers library with the 3B-parameter Llama-3.2-3B-Instruct checkpoint. Since full fine-tuning is memory-intensive, I applied QLoRA, which loads the base model in 4-bit and trains small LoRA adapters on top—cutting VRAM usage dramatically while retaining model accuracy.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a pad token

# 4-bit NF4 quantization is what turns plain LoRA into QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, quantization_config=bnb_config, device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
This setup ensures that fine-tuning can be performed efficiently without exceeding GPU memory limits.
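A quick way to confirm the adapter is wired up correctly is to ask PEFT how many parameters are actually trainable; continuing from the snippet above, with r=8 on just q_proj and v_proj it should be well under 1% of the model:
# Prints trainable vs. total parameter counts for the PEFT-wrapped model
model.print_trainable_parameters()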
Data Preparation
For fine-tuning, I used the first 10% of the Alpaca dataset (roughly 5,200 instruction-response pairs), which keeps training fast while still covering a good variety of tasks. I tokenized the dataset before training.
from datasets import load_dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train[:10%]")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256)
dataset = dataset.map(tokenize_function, batched=True)
This ensures the dataset is properly formatted for the model.
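Before kicking off training, it also helps to eyeball one formatted example—the Alpaca split ships a ready-made text field that already combines the instruction, optional input, and response:
# Inspect the first prompt/response pair exactly as the model will see it
print(dataset[0]["text"])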
Fine-Tuning the Model
I then fine-tuned the model, keeping to a single epoch to manage costs. The training loop below uses Hugging Face's Trainer API (torchtune can drive the same LoRA fine-tune through its tune CLI recipes, but the Trainer keeps everything inside one notebook).
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./finetuned_llama",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    logging_steps=10,
    learning_rate=2e-4,
    save_strategy="no",
    bf16=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    # mlm=False gives the causal-LM objective (labels derived from input_ids)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
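Because save_strategy is set to "no", nothing is written to disk automatically, so I'd persist the LoRA adapter and run a quick smoke test once training finishes. A minimal sketch, reusing the model and tokenizer from above—the prompt just approximates the Alpaca format:
# Persist only the LoRA adapter weights (a few megabytes) plus the tokenizer
model.save_pretrained("./finetuned_llama")
tokenizer.save_pretrained("./finetuned_llama")

# Quick smoke test with an Alpaca-style prompt
prompt = "### Instruction:\nExplain what QLoRA does in one sentence.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))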
This process took around 4 hours and 15 minutes on an A100 GPU, costing me around $10. If I had used an RTX 4090 instead, I estimate it would have taken around 6-7 hours, possibly reducing costs to $7-$8 but at the expense of longer training time.
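Working backwards from those figures gives a useful sanity check: $10 over 4.25 hours comes to roughly $2.35 per GPU-hour for the A100, while $7-$8 over 6-7 hours implies about $1.0-$1.3 per hour for the RTX 4090.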
Cost Analysis for Different GPU Configurations
Fine-tuning cost varies based on the GPU type and model size:
Cost Breakdown by GPU
Cost Breakdown by Model Size
For budget-conscious users, fine-tuning a 1B or 3B LLaMA on an RTX 4090 is the most cost-effective option. Larger models, such as the 13B and 70B variants, require multi-GPU setups or TPUs, which increases costs significantly.
Beyond this scale, it's worth digging into distributed training—both data parallelism and model parallelism—and weighing whether an in-house GPU cluster would pay off in the long run. Hugging Face offers a comprehensive guide on this topic: https://huggingface.co/spaces/nanotron/ultrascale-playbook
Alternatively, if you don’t need full customization and control over the training process, you can use Vertex AI Studio’s Tuning feature to fine-tune models with minimal setup. While the exact cost of this feature depends on model size and training duration, it provides a simplified, managed approach to fine-tuning without the need for infrastructure management.
However, if you require continuous re-finetuning whenever new data becomes available, you can build a Kubeflow Pipeline on Vertex AI to automatically retrain the model. This approach is particularly useful in AdTech, where models need to stay updated with the latest trends.
For example, in programmatic advertising, a retraining run can be triggered whenever a fresh batch of campaign performance or user-interaction data lands.
By automating fine-tuning with Kubeflow Pipelines, businesses can ensure that their models continuously adapt to real-time data, improving performance and relevance while minimizing manual intervention.
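To make that concrete, here is a minimal sketch of what such a pipeline could look like with the KFP SDK and the google-cloud-aiplatform client, assuming the training logic from earlier is packaged into a single component—every image, project ID, and GCS path below is a placeholder:
from kfp import dsl, compiler
from google.cloud import aiplatform

# Hypothetical single-step pipeline: the component body would wrap the QLoRA
# training script shown earlier; image and paths are placeholders.
@dsl.component(base_image="pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime")
def finetune_step(dataset_uri: str, output_dir: str):
    # Download data from dataset_uri, run the QLoRA fine-tune, upload the adapter to output_dir
    ...

@dsl.pipeline(name="llama-refinetune")
def refinetune_pipeline(dataset_uri: str, output_dir: str):
    finetune_step(dataset_uri=dataset_uri, output_dir=output_dir)

compiler.Compiler().compile(refinetune_pipeline, "refinetune_pipeline.json")

# Submit to Vertex AI Pipelines; a Cloud Scheduler or Pub/Sub trigger can kick
# this off whenever fresh data lands.
aiplatform.init(project="my-project", location="us-central1")
job = aiplatform.PipelineJob(
    display_name="llama-refinetune",
    template_path="refinetune_pipeline.json",
    parameter_values={
        "dataset_uri": "gs://my-bucket/new-data",
        "output_dir": "gs://my-bucket/adapters",
    },
)
job.run()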