Fine-Tuning Florence-2 Base Model on a Custom Dataset for Image Captioning
Royal Cyber Asia
A purpose-led organization helping businesses thrive through innovation, technology, & forward-thinking.
Introduction
In the world of AI and machine learning, fine-tuning pre-trained models on custom datasets has become a popular way to achieve state-of-the-art performance on specific tasks. This article walks you through the process of fine-tuning the Florence-2 base model on your own dataset, sharing insights and solutions along the way. Vision-Language Models (VLMs) such as Florence-2 differ from Large Language Models (LLMs): LLMs are designed mainly for text-based tasks such as summarization, translation, and text generation, while VLMs are built for tasks like object detection, image captioning, and visual question answering. By combining textual and visual information, VLMs can generate outputs that are more accurate and contextually rich, and large vision-language models show strong zero-shot capabilities and generalize well.
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
In June 2024, Microsoft released Florence-2, an advanced vision foundation model that uses a prompt-based approach to handle a variety of computer vision and vision-language tasks. The model can perform segmentation, object detection, and captioning from simple text prompts. Florence-2 excels at multi-task learning by leveraging the large FLD-5B dataset, which consists of 5.4 billion annotations across 126 million images. Its sequence-to-sequence architecture lets it perform exceptionally well in both zero-shot and fine-tuned settings, making it a competitive vision foundation model.
In this guide, we will explore the steps involved in fine-tuning the Florence-2 base model on a custom dataset, ensuring that you can harness its full potential for your specific applications.
Preparing the Dataset
As with any fine-tuning process, dataset preparation comes first. In this example, we use a dataset of damaged-car photos, where each image is paired with a description specifying the kind of damage. This is how we got our dataset ready:
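The exact steps depend on how your raw data is stored; the sketch below is one possible approach, assuming a hypothetical images/ folder and a captions.csv file with file_name and description columns, and shows how they could be turned into train/test splits on the Hugging Face Hub:

# A minimal sketch, assuming a hypothetical images/ folder and captions.csv
# (columns: file_name, description) -- adapt the paths and names to your data.
import pandas as pd
from datasets import Dataset, Image

df = pd.read_csv("captions.csv")                               # hypothetical captions file
ds = Dataset.from_dict({
    "image": ["images/" + name for name in df["file_name"]],   # paths to the damage photos
    "description": df["description"].tolist(),                 # ground-truth damage captions
}).cast_column("image", Image())                               # decode paths into PIL images

splits = ds.train_test_split(test_size=25, seed=42)            # e.g. 125 train / 25 test examples
splits.push_to_hub("your-username/DamageCarDataset")           # hypothetical repo id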
Below is a snapshot of the prepared dataset:
Fine-Tuning Process
Fine-tuning a pre-trained model involves several steps, from loading the model and processor to training on the custom dataset. Here’s how to fine-tune the Florence-2 base model:
Install the required libraries for Florence-2:
!pip install -q datasets flash_attn timm einops
Next, we load the dataset from the Hugging Face Hub:
from datasets import load_dataset
data = load_dataset("tahaman/DamageCarDataset")
# Check the shape of the dataset
train_shape = len(data['train'])
test_shape = len(data['test'])
print(f"Train Dataset Shape: {train_shape} examples")
print(f"Test Dataset Shape: {test_shape} examples")
Train Dataset Shape: 125 examples
Test Dataset Shape: 25 examples
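Before moving on, it helps to peek at a single record; each split exposes an image column (a PIL image) and a description column with the ground-truth caption, as used later in this article:

# Peek at one training example: a PIL image plus its ground-truth description.
sample = data["train"][0]
print(type(sample["image"]), sample["image"].size)   # PIL image and its (width, height)
print(sample["description"][:200])                   # first part of the damage description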
Load the Pre-trained Model and Processor:
We can load the model with the AutoModelForCausalLM class and the processor with the AutoProcessor class from the transformers library. Note that we need to pass trust_remote_code=True, since Florence-2 ships its own modeling code rather than being a native transformers architecture.
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
torch.cuda.empty_cache()
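Optionally, a quick sanity check confirms the model size and that it landed on the expected device (the parameter count should come out at roughly 0.23 billion):

# Optional sanity check: model size and device placement.
num_params = sum(p.numel() for p in model.parameters())
print(f"Florence-2-base parameters: {num_params / 1e9:.2f}B")   # roughly 0.23B
print(f"Model is on: {next(model.parameters()).device}")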
Before diving into the fine-tuning process, it’s crucial to understand how the pre-trained model performs with our dataset. We ran inference on a few examples from our dataset to see the initial performance of the Florence-2 base model.
# Function to run the model on an example
def run_example(task_prompt, text_input, image):
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input
    # Ensure the image is in RGB mode
    if image.mode != "RGB":
        image = image.convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer
# Test the function with a few examples from your dataset
for idx in range(2):
    image = data['train'][idx]['image']
    description = run_example("Describe the damage to the car.", '', image)
    print(f"Generated Description: {description}")
    display(image.resize([350, 350]))
Image 1:
Generated Description: {'Describe the damage to the car.':
'\nThe image shows a close up of a car with a crack in the side of it.
The car appears to be in need of repair, as evidenced by the scratches and
dents on the surface of the car.\n'}
Image 2:
Generated Description: {'Describe the damage to the car.':
'\nThe image shows a close up of a car with a broken windshield and a
yellow line on the side of it. The car appears to be in a state of disrepair,
with scratches and dents visible on the glass.\n'}
After running the model on two sample images from the dataset, we can compare it with the actual descriptions:
Image 1 Actual Description: “The image shows a close-up of a car’s body panel, specifically around the wheel arch area. There is noticeable damage labeled as a “scratch.” The scratch is quite extensive, with the paint visibly scraped off, exposing the underlying material. The damage appears to have affected a significant portion of the panel, with some areas showing deeper gouges and others lighter abrasions. The car’s paint color is a light metallic shade, possibly silver or gray. The tire and part of the wheel well are visible at the bottom right of the image.”
Image 2 Actual Description: “The image shows a close-up view of a car’s exterior, specifically focusing on a damaged area. The damage is labeled as a “scratch.” The scratch appears to be quite severe, with visible paint removal and underlying material exposed. The scratch is located near the edge of a panel, possibly near the wheel well or a door seam. The surrounding paint is a metallic gray color, and the scratch reveals a yellowish layer beneath the surface. The damage is significant enough to be easily noticeable”.
These descriptions show the limitations of the pre-trained Florence-2 base model when applied directly to our dataset without any fine-tuning: the model does not accurately describe the damage in the photos and frequently produces extraneous or inaccurate details.
Next, we need to prepare our dataset specifically for the task at hand. This involves creating a custom dataset class and adding a task prefix to construct the prompts appropriately.
from torch.utils.data import Dataset

class DamageCarDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        example = self.data[idx]
        prompt = "Describe the damage to the car."
        description = example['description']
        image = example['image']
        if image.mode != "RGB":
            image = image.convert("RGB")
        return prompt, description, image

# Create datasets
train_dataset = DamageCarDataset(data['train'])
val_dataset = DamageCarDataset(data['test'])
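As a quick check, the wrapped dataset should return a (prompt, description, image) triple:

# Sanity check: the wrapped dataset returns (prompt, description, image).
prompt, description, image = train_dataset[0]
print(prompt)                    # "Describe the damage to the car."
print(description[:120])         # start of the ground-truth caption
print(image.size, image.mode)    # image dimensions and "RGB"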
Now let’s move on to fine-tuning. We will create our data loaders, define a data collator, and start training. On an A100 with 40 GB of memory, we can use a batch size of 6. If you’re training on a T4 with 15 GB of VRAM, use a batch size of 1 or 2, depending on the size of the model and dataset.
import os
from torch.utils.data import DataLoader
from tqdm import tqdm
def collate_fn(batch):
    prompts, descriptions, images = zip(*batch)
    inputs = processor(text=list(prompts), images=list(images), return_tensors="pt", padding=True).to(device)
    return inputs, descriptions

# Create DataLoaders
batch_size = 2  # 6 on an A100
num_workers = 0
train_loader = DataLoader(train_dataset, batch_size=batch_size, collate_fn=collate_fn, num_workers=num_workers, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, collate_fn=collate_fn, num_workers=num_workers)
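Optionally, pull one collated batch to confirm the tensor shapes before committing to a full training run:

# Optional: inspect one collated batch and confirm the tensor shapes.
inputs, descriptions = next(iter(train_loader))
print(inputs["input_ids"].shape)     # (batch_size, prompt_length)
print(inputs["pixel_values"].shape)  # (batch_size, 3, height, width) from the processor
print(len(descriptions))             # batch_size raw caption strings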
# Training Function
from torch.optim import AdamW  # torch's AdamW (the deprecated transformers.AdamW also works on older versions)
from transformers import get_scheduler

def train_model(train_loader, val_loader, model, processor, epochs=10, lr=1e-6):
    optimizer = AdamW(model.parameters(), lr=lr)
    num_training_steps = epochs * len(train_loader)
    lr_scheduler = get_scheduler(
        name="linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps,
    )
    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss = 0
        for batch in tqdm(train_loader, desc=f"Training Epoch {epoch + 1}/{epochs}"):
            inputs, descriptions = batch
            input_ids = inputs["input_ids"]
            pixel_values = inputs["pixel_values"]
            labels = processor.tokenizer(text=list(descriptions), return_tensors="pt", padding=True, return_token_type_ids=False).input_ids.to(device)
            outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=labels)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            train_loss += loss.item()
        avg_train_loss = train_loss / len(train_loader)
        print(f"Average Training Loss: {avg_train_loss}")

        # Validation phase
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch in tqdm(val_loader, desc=f"Validation Epoch {epoch + 1}/{epochs}"):
                inputs, descriptions = batch
                input_ids = inputs["input_ids"]
                pixel_values = inputs["pixel_values"]
                labels = processor.tokenizer(text=list(descriptions), return_tensors="pt", padding=True, return_token_type_ids=False).input_ids.to(device)
                outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=labels)
                loss = outputs.loss
                val_loss += loss.item()
        avg_val_loss = val_loss / len(val_loader)
        print(f"Average Validation Loss: {avg_val_loss}")

        # Save a checkpoint (model + processor) after each epoch
        output_dir = f"./model_checkpoints/epoch_{epoch+1}"
        os.makedirs(output_dir, exist_ok=True)
        model.save_pretrained(output_dir)
        processor.save_pretrained(output_dir)
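Because a checkpoint is saved after every epoch, any of them can be reloaded later for evaluation or to resume training. A minimal sketch (the epoch number is illustrative; note that, as discussed later in this article, the saved config.json may need the vision_config fix before the checkpoint loads cleanly):

# Reload a locally saved checkpoint (the epoch number is illustrative).
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint_dir = "./model_checkpoints/epoch_10"
model = AutoModelForCausalLM.from_pretrained(checkpoint_dir, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(checkpoint_dir, trust_remote_code=True)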
After training, we will push the model to the Hugging Face Hub. To do so, we first need to log in with write access. Make sure to use a fine-grained token (create the repository first and grant the token access to it).
from huggingface_hub import notebook_login
notebook_login()
Once you have logged in, you will see output like this:
Token is valid (permission: fineGrained)
In the last step, we freeze the image encoder for this tutorial. The authors report a performance improvement when fine-tuning with an unfrozen image encoder compared with keeping it frozen, but note that unfreezing it increases resource usage.
for param in model.vision_tower.parameters():
    param.requires_grad = False  # freeze the image encoder; set to True to leave it unfrozen (trainable)
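Before launching training, you can verify how much of the model will actually be updated:

# Verify which parameters will be updated during training.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M")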
train_model(train_loader, val_loader, model, processor, epochs=10)
Throughout the fine-tuning process, we observed a steady decline in both training and validation loss, showing that the model was learning the task. The validation loss remained slightly higher than the training loss, suggesting there is still room for improvement. Increasing the number of epochs could allow the model to learn more complex patterns in the data and reduce the validation loss further, but it would require more computational resources and time.
You can push the model and processor as follows:
model.push_to_hub("tahaman/DamageCarModel")
processor.push_to_hub("tahaman/DamageCarModel")
We performed our experiments in a lower-resource setup to evaluate the model’s capabilities in constrained fine-tuning environments: we froze the vision encoder and used a batch size of 2 on a T4 GPU in Google Colab. We also tested the model with both frozen and unfrozen image encoders.
Model Testing
Now let’s test our custom fine-tuned Florence-2-base model.
# Testing the Model
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
from PIL import Image
import matplotlib.pyplot as plt
import textwrap

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned model and processor
model = AutoModelForCausalLM.from_pretrained("tahaman/DamageCarModel", trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("tahaman/DamageCarModel", trust_remote_code=True)

# Function to run the model on an example
def run_example(task_prompt, text_input, image_path):
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input
    # Load and preprocess the image
    image = Image.open(image_path)
    if image.mode != "RGB":
        image = image.convert("RGB")
    # Tokenize inputs
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    # Generate output
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    # Ensure parsed_answer is a string
    if isinstance(parsed_answer, dict):
        parsed_answer = str(parsed_answer)
    # Display the image
    plt.imshow(image)
    plt.axis('off')
    plt.show()
    # Print the description with wrapping
    wrapped_description = textwrap.fill(parsed_answer, width=120)
    print(f"Generated Description:\n{wrapped_description}")
    return parsed_answer

# Test the function with an image from your test set
image_path = "/content/test.jpg"
description = run_example("Describe the damage to the car.", '', image_path)
"AssertionError: only DaViT is supported for now."
After pushing the model to the Hugging Face Hub, attempting to load it results in the error “AssertionError: only DaViT is supported for now.” Florence-2 uses DaViT (Dual Attention Vision Transformer) as its vision backbone, and its modeling code checks that the vision backbone declared in the configuration is DaViT. Because the config.json saved with our fine-tuned checkpoint (tahaman/DamageCarModel) did not declare this, loading it with AutoModelForCausalLM fails.
After some investigation, I found that my config.json file was missing the required vision_config entry:
"vision_config": {
"model_type": "davit"
}
The config.json file contains all the necessary configuration parameters for appropriately initializing the model architecture. If any essential parameter is missing or incorrect, the model will fail to initialize.
To resolve this issue, there is a simple workaround: use the config.json shipped with the original Florence-2-base model. Copy all the configuration parameters from Florence-2-base/config.json into your repository’s config.json, except for the first few header lines, i.e.:
{
    "_name_or_path": "florence2",
    "architectures": [
        "Florence2ForConditionalGeneration"
    ],
    "auto_map": {
        "AutoConfig": "configuration_florence2.Florence2Config",
        "AutoModelForCausalLM": "modeling_florence2.Florence2ForConditionalGeneration"
    },
Impact of Replacing config.json:
Reason for the assertion error: this error occurs when the model type required in the vision config is missing or incorrect. The Florence-2 model expects specific configuration values for its vision component, and altering or omitting the vision_config can lead to unexpected loading failures after fine-tuning.
By updating config.json with the correct vision_config, we ensure that the model architecture is properly described, which allows us to load and use our fine-tuned model without errors. This adjustment does not affect the performance of the fine-tuned model, since it only restores configuration metadata.
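For reference, here is a minimal sketch of one way to apply the fix programmatically instead of editing the file by hand (it assumes your token has write access to the model repository; copying the values manually works just as well):

# Sketch: copy the vision_config from the original Florence-2-base config.json
# into the fine-tuned repo's config.json, then upload the patched file.
import json
from huggingface_hub import hf_hub_download, upload_file

base_cfg = json.load(open(hf_hub_download("microsoft/Florence-2-base", "config.json")))
ft_cfg = json.load(open(hf_hub_download("tahaman/DamageCarModel", "config.json")))

ft_cfg["vision_config"] = base_cfg["vision_config"]   # restore the DaViT vision settings

with open("config.json", "w") as f:
    json.dump(ft_cfg, f, indent=2)

upload_file(path_or_fileobj="config.json", path_in_repo="config.json",
            repo_id="tahaman/DamageCarModel")          # requires a write-access token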
After successfully loading our model, here are the results obtained from our fine-tuned model:
Generated Description: {‘Describe the damage to the car.’: ‘The image shows the front view of a brown car with visible damage. The front bumper is severely damaged, with visible scratches and dents. The headlight and grille are also damaged, indicating a significant impact. The car appears to be parked on a white surface, possibly a garage or workshop. The damage is likely to the front of the car, specifically the front grille and bumper area.’}
We evaluated the test image using the Florence-2-base model without any fine-tuning:
Generated Description: {‘Describe the damage to the car.’: ‘the car is on the floor and there is a wall in the background. The car appears to be a Toyota Innova Crysta.’}
We also evaluated the Florence-2-large model without any fine-tuning:
Generated Description: {‘Describe the damage to the car.’: ‘The image shows a brown car parked in front of a gray wall. The car appears to be in a state of disrepair, with rust and dents visible on the body.’}
For comparison, we also evaluated our custom fine-tuned model with an unfrozen image encoder:
Generated Description: {‘Describe the damage to the car.’: ‘The image shows the front view of a brown car, specifically the front part of the vehicle. The car appears to be in a state of disrepair, with the front grille and headlights severely damaged. The damage is located on the left side of the front bumper, with some areas of the headlight and bumper missing. The front bumper is also damaged, with visible signs of wear and tear. The vehicle is parked on a white surface, and the background is a plain grey color.’}
Conclusion
Comparing the frozen- and unfrozen-encoder models against the off-the-shelf Florence-2-base and Florence-2-large outputs above shows that our fine-tuned model performed fairly well. We adapted the model to a specific use case even with a small dataset of only 125 samples and limited computational power. This highlights the capabilities of Microsoft’s Florence-2-base, which, with just 0.23 billion parameters, can outperform much larger models on a range of computer vision and vision-language tasks.