Geek Out Time: Exploring LoRA on Google Colab: the Challenges of Base Model Upgrades
(Also on Constellar tech blog https://medium.com/the-constellar-digital-technology-blog/exploring-lora-on-google-colab-the-challenges-of-base-model-upgrades-91fd9809511c)
How do you address the need to retrain models every time the base model is upgraded? Retraining can be computationally expensive and time-consuming, so a method that preserves fine-tuning effort even as the base model evolves is appealing. Enter LoRA (Low-Rank Adaptation), a technique that makes fine-tuning efficient by training only a small set of additional parameters. Let’s walk through fine-tuning GPT-2 with LoRA on a minimal dataset, look at the results, and discuss the constraints on reusability.
Why LoRA?
Traditional fine-tuning updates all parameters of the model, requiring vast compute resources. LoRA adapts only specific layers by introducing trainable low-rank matrices, significantly reducing memory requirements. This makes it ideal for fine-tuning large models on consumer-grade GPUs.
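To make the idea concrete, here is a minimal sketch of the mechanism (my own illustration, not the PEFT implementation): the original weight matrix is frozen, and a trainable low-rank update, scaled by alpha / r, is added on top of its output.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Toy illustration of the LoRA idea: freeze the base layer and learn only
    # a low-rank correction B @ A, scaled by alpha / r.
    def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the original weights are never updated
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no effect at start
        self.scaling = alpha / r

    def forward(self, x):
        # base output plus the trainable low-rank "delta"
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Only A and B are trained: for a 768x768 projection with r=4 that is
# 2 * 4 * 768 parameters instead of 768 * 768.
layer = LoRALinear(nn.Linear(768, 768))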
Setup Overview
Here’s what I used: a Google Colab GPU runtime, GPT-2 as the base model loaded in 8-bit, the Hugging Face transformers, peft, datasets, and bitsandbytes libraries, and a small subset of WikiText-2 as training data.
Fine-Tuning Process
Step 1: Preparing the Environment
Ensure all dependencies are installed:
pip install transformers peft accelerate datasets bitsandbytes huggingface-hub
Authenticate with Hugging Face to access models and datasets.
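In a Colab notebook, one convenient way to do this (assuming you already have an access token from your Hugging Face account settings) is the interactive login helper:

from huggingface_hub import notebook_login

notebook_login()  # paste your Hugging Face access token when prompted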
Step 2: Loading the Base Model
I loaded GPT-2 with 8-bit quantization using the Hugging Face Transformers library. This step saved GPU memory while maintaining acceptable performance.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True  # 8-bit quantization via bitsandbytes
)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token, so reuse EOS
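As an optional sanity check, you can ask Transformers how much memory the 8-bit model actually occupies:

# Rough memory footprint of the quantized model, in MB
print(f"Model memory footprint: {model.get_memory_footprint() / 1024**2:.1f} MB")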
Step 3: Applying LoRA
LoRA modifies specific attention layers in the model. I used the PEFT library to apply LoRA to GPT-2:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=4,                        # rank of the low-rank update matrices
    lora_alpha=8,               # scaling factor
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.2
)
lora_model = get_peft_model(model, lora_config)
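It is worth confirming how small the trainable portion really is; PEFT models expose a helper that prints the trainable versus total parameter counts:

# With r=4 on GPT-2's c_attn layers, only a tiny fraction of parameters is trainable
lora_model.print_trainable_parameters()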
Step 4: Preprocessing the Dataset
I used a small subset of the WikiText-2 dataset, tokenized and padded for training:
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
train_subset = dataset["train"].select(range(500))
eval_subset = dataset["validation"].select(range(50))

# Tokenization function
def preprocess_function(batch):
    tokenized = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)
    tokenized["labels"] = tokenized["input_ids"]
    return tokenized

train_dataset = train_subset.map(preprocess_function, batched=True)
eval_dataset = eval_subset.map(preprocess_function, batched=True)
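If you want a quick (optional) check that preprocessing did what you expect, each example should now carry fixed-length input_ids and matching labels:

# Optional sanity check: every example is padded/truncated to 512 tokens
sample = train_dataset[0]
print(len(sample["input_ids"]), len(sample["labels"]))  # expect 512 512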
Step 5: Fine-Tuning
Using the Hugging Face Trainer, I fine-tuned the model for 3 epochs with LoRA layers enabled:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./gpt2-lora-results",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    evaluation_strategy="steps",
    eval_steps=10,
    save_steps=10,
    logging_steps=10,
    learning_rate=5e-5,
    fp16=True
)

trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()
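If you want a single end-of-training number (an extra step beyond the run above), you can pull the final evaluation loss from the Trainer and convert it to perplexity:

import math

eval_metrics = trainer.evaluate()
print(f"eval loss: {eval_metrics['eval_loss']:.4f}, "
      f"perplexity: {math.exp(eval_metrics['eval_loss']):.2f}")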
Results
The LoRA-adapted GPT-2 trained successfully on the small dataset. Here’s a snapshot of the training process:
Step    Training Loss    Validation Loss
10      9.086200         No log
20      9.108600         No log
30      8.945400         No log
40      8.854900         No log
50      8.748100         No log
Constraints of LoRA in Terms of Reusability
While LoRA is highly efficient, there are certain constraints on its reusability. Before getting to them, here is how the adapters are saved, reloaded, and used for inference.
Saving and Reloading LoRA Layers
After training, I saved the LoRA layers separately for reusability:
lora_model.save_pretrained("./gpt2-lora-adapters")
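This writes only the adapter files, not a full copy of GPT-2; listing the directory (an optional check) should show a small adapter config plus the adapter weights:

import os
print(os.listdir("./gpt2-lora-adapters"))  # typically adapter_config.json plus the adapter weight file(s)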
These adapters can be reloaded into the base model using:
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
loaded_lora_model = PeftModel.from_pretrained(base_model, "./gpt2-lora-adapters")
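If you prefer a standalone checkpoint for deployment, PEFT can also fold the adapter weights back into the base weights. This assumes a full-precision base model; merging into an 8-bit quantized model may not be supported, depending on the PEFT version.

# Optional: merge the LoRA weights into the base model for standalone use
merged_model = loaded_lora_model.merge_and_unload()
merged_model.save_pretrained("./gpt2-lora-merged")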
Inference
With the fine-tuned LoRA model, I tested text generation on a query:
input_text = "Explain the significance of the industrial revolution."
inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=512)
inputs = {key: value.to("cuda") for key, value in inputs.items()}

outputs = loaded_lora_model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Generated Output:
Explain the significance of the industrial revolution. What is it?
This was an important question to ask, as we all know from our own experience in this era and with so many other crises that occurred before us: what are those changes about which one should be concerned or even optimistic…
Thoughts
LoRA adapters are tightly coupled to the specific base model checkpoint used during training. If the base model undergoes significant changes — such as an architectural overhaul or enhancements to its vocabulary or embeddings — the existing LoRA adapters may no longer align with the updated structure. This means that while LoRA reduces the scope of fine-tuning, it doesn’t completely eliminate the need for retraining when the base model is updated. For instance, an upgrade from GPT-2 to GPT-3 would likely render previous LoRA adapters incompatible due to differences in architecture and parameter distribution.
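One practical way to see this coupling is to inspect the saved adapter configuration, which records the exact base checkpoint and module names the adapters were trained against:

from peft import PeftConfig

adapter_config = PeftConfig.from_pretrained("./gpt2-lora-adapters")
print(adapter_config.base_model_name_or_path)  # "gpt2": the checkpoint these adapters expect
# adapter_config.json also stores r, lora_alpha and the target module names ("c_attn"),
# which must still exist in any upgraded base model for the adapters to load cleanly.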
Nonetheless, LoRA does offer significant advantages. Even when retraining is required, the process is much faster and less resource-intensive than full fine-tuning. Moreover, when a base model upgrade retains most of the original structure (e.g., a minor revision or additional pretraining), existing LoRA adapters may still work with little or no adjustment. This makes LoRA a practical way to manage upgrades in a computationally efficient manner.
Conclusion
LoRA makes it possible to adapt architectures like GPT-2 on resource-constrained hardware, offering flexibility and efficiency. While it doesn’t fully address the challenge of base model upgrades, its ability to simplify retraining and enable quick adaptation makes it a valuable tool. Give it a shot and have fun!