I fine-tuned a LLaMA on Vertex AI using torchtune for $10

Finetuning with torchtune on Vertex AI

Sorry for the click-baity title, but I want to clarify up front: the fine-tuned model from this process won't be as broadly capable as a general-purpose foundation model, but it can be highly effective for specific narrow tasks. This model is trained on a relatively small dataset (~500MB) and based on a modest 3B-parameter architecture. In contrast, foundation models like LLaMA-3 (70B) and GPT-4 (parameter count undisclosed, but widely believed to exceed GPT-3's 175B) are trained on massive datasets: LLaMA-3 was trained on approximately 15 trillion tokens (roughly 60TB of text), and GPT-4's training corpus, while undisclosed, is estimated to be at least as large. (The 128,000-token figure often quoted for GPT-4 is its context window, i.e. how much long-form content it can process at once, not the size of its training data.) While my setup doesn't match the broad generalization capabilities of these large-scale models, it excels in domain-specific tasks, making it a highly cost-effective alternative for targeted applications, which, in reality, is sufficient for most practical use cases.

I wanted to see if I could fine-tune a LLaMA model without breaking the bank. Training a LLaMA-2-3B on a local machine is nearly impossible without a high-end GPU; trust me, I tried. For starters, Apple's GPUs don't support CUDA, so PyTorch has to fall back on the much slower Metal (MPS) backend. Beyond that, out-of-memory errors and unbearably slow training times made me realize I needed a cloud solution. So I turned to Google Cloud Vertex AI, and with some optimizations I managed to fine-tune LLaMA-2-3B on ~500MB of data for just around $10!

Setting up Vertex AI for Fine-Tuning

To start, I set up Google Cloud Vertex AI Workbench, which provides a cloud-based Jupyter environment for training models. First, I enabled the Vertex AI, Compute Engine, and Cloud Storage APIs.
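
If you prefer the command line, the same APIs can be enabled with gcloud (my shorthand here, assuming the gcloud CLI is installed and pointed at your project):

gcloud services enable aiplatform.googleapis.com \
    compute.googleapis.com \
    storage.googleapis.com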

Then I created a Vertex AI Workbench instance, a user-managed notebook to be specific. I selected "Deep Learning VM" and chose PyTorch 2.4. I used an NVIDIA A100 GPU with 40GB of VRAM and a 100GB boot disk.
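
The equivalent CLI invocation looks roughly like the following; the instance name and zone are placeholders, the image family may differ across gcloud versions, and the a2-highgpu-1g machine type comes with a single A100 40GB attached:

gcloud notebooks instances create llama-finetune \
    --location=us-central1-a \
    --machine-type=a2-highgpu-1g \
    --vm-image-project=deeplearning-platform-release \
    --vm-image-family=pytorch-latest-gpu \
    --boot-disk-size=100 \
    --install-gpu-driver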

Install Dependencies

Once my JupyterLab environment was up, I installed all necessary libraries by running:

pip install torchtune transformers accelerate bitsandbytes peft datasets kfp google-cloud-aiplatform        

These packages are essential for working with LLaMA models, applying QLoRA, and optimizing training performance.

Load LLaMA-2-3B with QLoRA

To load the LLaMA-2-3B model, I used the transformers library. Since full fine-tuning is memory-intensive, I applied QLoRA: the frozen base weights are quantized to 4-bit precision and only small low-rank adapter matrices are trained on top, which sharply reduces VRAM usage while retaining most of the model's accuracy.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_NAME = "meta-llama/Llama-2-3b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default

# QLoRA: quantize the frozen base weights to 4-bit NF4 to cut VRAM usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, device_map="auto", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)

# Train small low-rank adapters on the attention query/value projections
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

This setup ensures that fine-tuning can be performed efficiently without exceeding GPU memory limits.
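
As a quick sanity check, peft can report how small the trainable adapter footprint actually is:

# Prints something like: trainable params: ~4M || all params: ~3B || trainable%: ~0.13
model.print_trainable_parameters()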

Data Preparation

For fine-tuning, I used 10% of the Alpaca dataset (about 5,200 of its ~52,000 instruction-following examples), which keeps training fast while still producing good results. I tokenized the dataset before training.

from datasets import load_dataset

# Alpaca's "text" column already holds the fully formatted prompt + response
dataset = load_dataset("tatsu-lab/alpaca", split="train[:10%]")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256)

dataset = dataset.map(tokenize_function, batched=True)

This ensures the dataset is properly tokenized for the model. One detail worth noting: for causal language modeling, the Trainer also needs a labels field, which is easiest to supply with a data collator, as sketched below.
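
The simplest route I know of (using the standard Hugging Face collator; this is my addition rather than part of the original setup) is to let DataCollatorForLanguageModeling create the labels at batch time:

from transformers import DataCollatorForLanguageModeling

# mlm=False selects causal-LM behavior: labels are copied from input_ids
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)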

Fine-Tuning the Model

I then fine-tuned the model with a single epoch to keep costs manageable. Since torchtune drives its fine-tuning through CLI recipes rather than a Python Trainer class, the training loop below uses the Hugging Face Trainer API, which works directly with the peft model built above.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./finetuned_llama",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size of 32
    num_train_epochs=1,
    logging_steps=10,
    learning_rate=2e-4,
    save_strategy="no",
    bf16=True,                       # A100s support bfloat16 natively
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,     # supplies the labels, as set up above
)
trainer.train()

This process took around 4 hours and 15 minutes on an A100 GPU and cost me around $10. If I had used an RTX 4090 instead, I estimate it would have taken around 6-7 hours, possibly bringing the cost down to $7-$8 at the expense of the longer training time.
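
The back-of-envelope math (hourly rates vary by region and by on-demand vs. spot pricing, so treat the figure as approximate): 4.25 hours × ~$2.40/hour for a single-A100 instance ≈ $10.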

Cost Analysis for Different GPU Configurations

Fine-tuning cost varies based on the GPU type and model size:

Cost Breakdown by GPU

[Chart: estimated fine-tuning cost for different GPUs]

Cost Breakdown by Model Size

[Chart: estimated fine-tuning cost for different model sizes]

For budget-conscious users, fine-tuning LLaMA-1B or 3B on an RTX 4090 is the most cost-effective solution. Larger models (13B and beyond) require TPUs or multi-GPU setups, significantly increasing costs.

After this point, it would be beneficial to delve deeper into distributed training, covering both data parallelism and model parallelism, and to explore whether investing in an in-house GPU cluster would pay off in the long run. Hugging Face offers a comprehensive guide on this topic: https://huggingface.co/spaces/nanotron/ultrascale-playbook
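
To make the data-parallel idea concrete, here is a minimal, self-contained sketch (my own illustration with a stand-in model, not code from this fine-tuning run). Each process owns one GPU and a shard of the data, and gradients are averaged across processes; launch it with torchrun --nproc_per_node=<num_gpus> train_ddp.py:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; torchrun sets LOCAL_RANK for each of them
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 512).cuda(local_rank)  # stand-in for the real model
model = DDP(model, device_ids=[local_rank])

# ... training loop here: backward() all-reduces gradients across GPUs ...

dist.destroy_process_group()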

Alternatively, if you don’t need full customization and control over the training process, you can use Vertex AI Studio’s Tuning feature to fine-tune models with minimal setup. While the exact cost of this feature depends on model size and training duration, it provides a simplified, managed approach to fine-tuning without the need for infrastructure management.

However, if you require continuous re-finetuning whenever new data becomes available, you can build a Kubeflow Pipeline on Vertex AI to automatically retrain the model. This approach is particularly useful in AdTech, where models need to stay updated with the latest trends.

For example, in programmatic advertising, re-finetuning can be triggered when:

  • New user engagement data is available to optimize ad personalization.
  • Campaign performance metrics indicate a shift in audience behavior.
  • A/B testing results suggest that existing ad creatives need refinement.

By automating fine-tuning with Kubeflow Pipelines, businesses can ensure that their models continuously adapt to real-time data, improving performance and relevance while minimizing manual intervention. A minimal sketch of such a pipeline follows.
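
To show the shape of this setup, here is a minimal sketch using the kfp v2 SDK and the google-cloud-aiplatform client installed earlier. The component body, project, region, bucket, and display names are hypothetical placeholders; a real retraining component would run the QLoRA fine-tuning job from above:

from kfp import dsl, compiler
from google.cloud import aiplatform

@dsl.component(base_image="python:3.10")
def finetune_step(dataset_uri: str):
    # Placeholder body: a real component would launch the QLoRA fine-tuning run
    print(f"Fine-tuning on {dataset_uri}")

@dsl.pipeline(name="llama-refinetune")
def refinetune_pipeline(dataset_uri: str):
    finetune_step(dataset_uri=dataset_uri)

compiler.Compiler().compile(refinetune_pipeline, "pipeline.json")

# Submit to Vertex AI Pipelines (project/region/bucket are hypothetical)
aiplatform.init(project="my-project", location="us-central1",
                staging_bucket="gs://my-bucket")
job = aiplatform.PipelineJob(
    display_name="llama-refinetune",
    template_path="pipeline.json",
    parameter_values={"dataset_uri": "gs://my-bucket/new_data.jsonl"},
)
job.run()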
