I fine-tuned a LLaMA on Vertex AI using torchtune for $10
Sorry for the click-baity title, but I want to clarify upfront: the fine-tuned model from this process won't be as broadly capable as a general-purpose foundation model, but it can be highly effective for specific, narrow tasks. This model is trained on a relatively small dataset (~500MB) and based on a modest 3B-parameter architecture. In contrast, foundation models like LLaMA-3 (70B) and GPT-4 (parameter count undisclosed, but widely assumed to be far larger than GPT-3's 175B) are trained on massive datasets: LLaMA-3 was trained on roughly 15 trillion tokens (on the order of 60TB of text), while GPT-4's training corpus is undisclosed but presumed much larger still, and its Turbo variant supports a 128,000-token context window for long-form content. While my setup doesn't match the broad generalization of these large-scale models, it excels at domain-specific tasks, making it a highly cost-effective alternative for targeted applications—which, in reality, is sufficient for most practical use cases.
I wanted to see if I could fine-tune a LLaMA model without breaking the bank. Training even a 3B-parameter LLaMA on a local machine is nearly impossible without a high-end GPU—trust me, I tried. Apple Silicon doesn't support CUDA, and much of this tooling (bitsandbytes in particular) assumes an NVIDIA GPU. Beyond that, out-of-memory errors and unbearably slow training times made me realize I needed a cloud solution. So I turned to Google Cloud Vertex AI, and with some optimizations I managed to fine-tune a 3B LLaMA on ~500MB of data for just around $10!
Setting up Vertex AI for Fine-Tuning
To start, I set up Google Cloud Vertex AI Workbench, which provides a cloud-based Jupyter environment for training models. To begin with, I enabled the Vertex AI, Compute Engine, and Cloud Storage APIs.
Then I created a Vertex AI Workbench instance—specifically, a user-managed notebook. I selected a Deep Learning VM image with PyTorch 2.4, an NVIDIA A100 GPU with 40GB of VRAM, and a 100GB boot disk.
Install Dependencies
Once my JupyterLab environment was up, I installed all necessary libraries by running:
pip install torchtune transformers accelerate bitsandbytes peft datasets kfp google-cloud-aiplatform
These packages are essential for working with LLaMA models, applying QLoRA, and optimizing training performance.
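Before loading anything, it's worth a quick check that the notebook kernel actually sees the GPU—a missing accelerator is the most common reason the later steps crawl:
import torch

# Should print True and "NVIDIA A100-SXM4-40GB" (or similar) on the Workbench instance
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))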
Load LLaMA-3.2-3B with QLoRA
To load the model, I used the transformers library with the 3B-parameter Llama-3.2-3B-Instruct checkpoint. Since full fine-tuning is memory-intensive, I applied QLoRA, which loads the base model in 4-bit and trains small LoRA adapters on top—cutting VRAM usage dramatically while retaining model accuracy.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a pad token

# 4-bit NF4 quantization is what turns plain LoRA into QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, quantization_config=bnb_config, device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
This setup ensures that fine-tuning can be performed efficiently without exceeding GPU memory limits.
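A quick way to confirm the adapter is wired up correctly is to ask PEFT how many parameters are actually trainable; continuing from the snippet above, with r=8 on just q_proj and v_proj it should be well under 1% of the model:
# Prints trainable vs. total parameter counts for the PEFT-wrapped model
model.print_trainable_parameters()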
Data Preparation
For fine-tuning, I used the first 10% of the Alpaca dataset (roughly 5,200 instruction-response pairs), which keeps training fast while still covering a good variety of tasks. I tokenized the dataset before training.
from datasets import load_dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train[:10%]")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256)
dataset = dataset.map(tokenize_function, batched=True)
This ensures the dataset is properly formatted for the model.
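Before kicking off training, it also helps to eyeball one formatted example—the Alpaca split ships a ready-made text field that already combines the instruction, optional input, and response:
# Inspect the first prompt/response pair exactly as the model will see it
print(dataset[0]["text"])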
Fine-Tuning the Model
I then fine-tuned the model, keeping to a single epoch to manage costs. The training loop below uses Hugging Face's Trainer API (torchtune can drive the same LoRA fine-tune through its tune CLI recipes, but the Trainer keeps everything inside one notebook).
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./finetuned_llama",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    logging_steps=10,
    learning_rate=2e-4,
    save_strategy="no",
    bf16=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    # mlm=False gives the causal-LM objective (labels derived from input_ids)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
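Because save_strategy is set to "no", nothing is written to disk automatically, so I'd persist the LoRA adapter and run a quick smoke test once training finishes. A minimal sketch, reusing the model and tokenizer from above—the prompt just approximates the Alpaca format:
# Persist only the LoRA adapter weights (a few megabytes) plus the tokenizer
model.save_pretrained("./finetuned_llama")
tokenizer.save_pretrained("./finetuned_llama")

# Quick smoke test with an Alpaca-style prompt
prompt = "### Instruction:\nExplain what QLoRA does in one sentence.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))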
This process took around 4 hours and 15 minutes on an A100 GPU, costing me around $10. If I had used an RTX 4090 instead, I estimate it would have taken around 6-7 hours, possibly reducing costs to $7-$8 but at the expense of longer training time.
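Working backwards from those figures gives a useful sanity check: $10 over 4.25 hours comes to roughly $2.35 per GPU-hour for the A100, while $7-$8 over 6-7 hours implies about $1.0-$1.3 per hour for the RTX 4090.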
Cost Analysis for Different GPU Configurations
Fine-tuning cost varies based on the GPU type and model size:
Cost Breakdown by GPU
Cost Breakdown by Model Size
For budget-conscious users, fine-tuning a 1B or 3B LLaMA on an RTX 4090 is the most cost-effective option. Larger models, such as the 13B and 70B variants, require multi-GPU setups or TPUs, which increases costs significantly.
Beyond this scale, it's worth digging into distributed training—both data parallelism and model parallelism—and weighing whether an in-house GPU cluster would pay off in the long run. Hugging Face offers a comprehensive guide on this topic: https://huggingface.co/spaces/nanotron/ultrascale-playbook
Alternatively, if you don’t need full customization and control over the training process, you can use Vertex AI Studio’s Tuning feature to fine-tune models with minimal setup. While the exact cost of this feature depends on model size and training duration, it provides a simplified, managed approach to fine-tuning without the need for infrastructure management.
However, if you require continuous re-finetuning whenever new data becomes available, you can build a Kubeflow Pipeline on Vertex AI to automatically retrain the model. This approach is particularly useful in AdTech, where models need to stay updated with the latest trends.
For example, in programmatic advertising, a retraining run can be triggered whenever a fresh batch of campaign performance or user-interaction data lands.
By automating fine-tuning with Kubeflow Pipelines, businesses can ensure that their models continuously adapt to real-time data, improving performance and relevance while minimizing manual intervention.
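To make that concrete, here is a minimal sketch of what such a pipeline could look like with the KFP SDK and the google-cloud-aiplatform client, assuming the training logic from earlier is packaged into a single component—every image, project ID, and GCS path below is a placeholder:
from kfp import dsl, compiler
from google.cloud import aiplatform

# Hypothetical single-step pipeline: the component body would wrap the QLoRA
# training script shown earlier; image and paths are placeholders.
@dsl.component(base_image="pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime")
def finetune_step(dataset_uri: str, output_dir: str):
    # Download data from dataset_uri, run the QLoRA fine-tune, upload the adapter to output_dir
    ...

@dsl.pipeline(name="llama-refinetune")
def refinetune_pipeline(dataset_uri: str, output_dir: str):
    finetune_step(dataset_uri=dataset_uri, output_dir=output_dir)

compiler.Compiler().compile(refinetune_pipeline, "refinetune_pipeline.json")

# Submit to Vertex AI Pipelines; a Cloud Scheduler or Pub/Sub trigger can kick
# this off whenever fresh data lands.
aiplatform.init(project="my-project", location="us-central1")
job = aiplatform.PipelineJob(
    display_name="llama-refinetune",
    template_path="refinetune_pipeline.json",
    parameter_values={
        "dataset_uri": "gs://my-bucket/new-data",
        "output_dir": "gs://my-bucket/adapters",
    },
)
job.run()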