Fine-Tuning Florence-2 Base Model on a Custom Dataset for Image Captioning
Royal Cyber Asia
A purpose-led organization helping businesses thrive through innovation, technology, & forward-thinking.
Introduction
In the world of AI and machine learning, fine-tuning pre-trained models on custom datasets has become a popular way to achieve state-of-the-art performance on specific tasks. This article walks you through the process of fine-tuning the Florence-2 base model on your own dataset, sharing insights and solutions along the way. Vision-Language Models (VLMs) such as Florence-2 differ from Large Language Models (LLMs): LLMs are designed mainly for text-based tasks such as summarization, translation, and text generation, while VLMs are built for tasks like object detection, image captioning, and visual question answering. By combining textual and visual information, VLMs can generate outputs that are more accurate and contextually rich, and large vision-language models show strong zero-shot capabilities and generalize well.
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
In June 2024, Microsoft released Florence-2, an advanced vision foundation model that uses a prompt-based approach to handle a variety of computer vision and vision-language tasks. The model can perform segmentation, object detection, and captioning from simple text prompts. Florence-2 excels at multi-task learning by leveraging the large FLD-5B dataset, which consists of 5.4 billion annotations across 126 million images. Its sequence-to-sequence architecture lets it perform exceptionally well in both zero-shot and fine-tuned settings, making it a competitive vision foundation model.
In this guide, we will explore the steps involved in fine-tuning the Florence-2 base model on a custom dataset, ensuring that you can harness its full potential for your specific applications.
Preparing the Dataset
As with any fine-tuning process, dataset preparation comes first. In this example, we use a dataset of damaged-car photos, where each image is paired with a description specifying the kind of damage. This is how we got our dataset ready:
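The exact steps depend on how your raw data is stored; the sketch below is one possible approach, assuming a hypothetical images/ folder and a captions.csv file with file_name and description columns, and shows how they could be turned into train/test splits on the Hugging Face Hub:

# A minimal sketch, assuming a hypothetical images/ folder and captions.csv
# (columns: file_name, description) -- adapt the paths and names to your data.
import pandas as pd
from datasets import Dataset, Image

df = pd.read_csv("captions.csv")                               # hypothetical captions file
ds = Dataset.from_dict({
    "image": ["images/" + name for name in df["file_name"]],   # paths to the damage photos
    "description": df["description"].tolist(),                 # ground-truth damage captions
}).cast_column("image", Image())                               # decode paths into PIL images

splits = ds.train_test_split(test_size=25, seed=42)            # e.g. 125 train / 25 test examples
splits.push_to_hub("your-username/DamageCarDataset")           # hypothetical repo id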
Below is a snapshot of the prepared dataset:
Fine-Tuning Process
Fine-tuning a pre-trained model involves several steps, from loading the model and processor to training on the custom dataset. Here’s how to fine-tune the Florence-2 base model:
Install the required libraries for Florence-2:
!pip install -q datasets flash_attn timm einops
Next, we load the dataset from the Hugging Face Hub:
from datasets import load_dataset
data = load_dataset("tahaman/DamageCarDataset")
# Check the shape of the dataset
train_shape = len(data['train'])
test_shape = len(data['test'])
print(f"Train Dataset Shape: {train_shape} examples")
print(f"Test Dataset Shape: {test_shape} examples")
Train Dataset Shape: 125 examples
Test Dataset Shape: 25 examples
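Before moving on, it helps to peek at a single record; each split exposes an image column (a PIL image) and a description column with the ground-truth caption, as used later in this article:

# Peek at one training example: a PIL image plus its ground-truth description.
sample = data["train"][0]
print(type(sample["image"]), sample["image"].size)   # PIL image and its (width, height)
print(sample["description"][:200])                   # first part of the damage description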
Load the Pre-trained Model and Processor:
We can load the model with the AutoModelForCausalLM class and the processor with the AutoProcessor class from the transformers library. Note that we need to pass trust_remote_code=True, since Florence-2 ships its own modeling code rather than being a native transformers architecture.
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
torch.cuda.empty_cache()
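Optionally, a quick sanity check confirms the model size and that it landed on the expected device (the parameter count should come out at roughly 0.23 billion):

# Optional sanity check: model size and device placement.
num_params = sum(p.numel() for p in model.parameters())
print(f"Florence-2-base parameters: {num_params / 1e9:.2f}B")   # roughly 0.23B
print(f"Model is on: {next(model.parameters()).device}")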
Before diving into the fine-tuning process, it’s crucial to understand how the pre-trained model performs with our dataset. We ran inference on a few examples from our dataset to see the initial performance of the Florence-2 base model.
# Function to run the model on an example
def run_example(task_prompt, text_input, image):
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input
    # Ensure the image is in RGB mode
    if image.mode != "RGB":
        image = image.convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer
# Test the function with a few examples from your dataset
for idx in range(2):
    image = data['train'][idx]['image']
    description = run_example("Describe the damage to the car.", '', image)
    print(f"Generated Description: {description}")
    display(image.resize([350, 350]))
Image 1:
Generated Description: {'Describe the damage to the car.':
'\nThe image shows a close up of a car with a crack in the side of it.
The car appears to be in need of repair, as evidenced by the scratches and
dents on the surface of the car.\n'}
Image 2:
Generated Description: {'Describe the damage to the car.':
'\nThe image shows a close up of a car with a broken windshield and a
yellow line on the side of it. The car appears to be in a state of disrepair,
with scratches and dents visible on the glass.\n'}
After running the model on two sample images from the dataset, we can compare it with the actual descriptions:
Image 1 Actual Description: “The image shows a close-up of a car’s body panel, specifically around the wheel arch area. There is noticeable damage labeled as a “scratch.” The scratch is quite extensive, with the paint visibly scraped off, exposing the underlying material. The damage appears to have affected a significant portion of the panel, with some areas showing deeper gouges and others lighter abrasions. The car’s paint color is a light metallic shade, possibly silver or gray. The tire and part of the wheel well are visible at the bottom right of the image.”
Image 2 Actual Description: “The image shows a close-up view of a car’s exterior, specifically focusing on a damaged area. The damage is labeled as a “scratch.” The scratch appears to be quite severe, with visible paint removal and underlying material exposed. The scratch is located near the edge of a panel, possibly near the wheel well or a door seam. The surrounding paint is a metallic gray color, and the scratch reveals a yellowish layer beneath the surface. The damage is significant enough to be easily noticeable”.
These descriptions show the limitations of the pre-trained Florence-2 base model when applied directly to our dataset without any fine-tuning: the model does not accurately describe the damage in the photos and frequently produces extraneous or inaccurate details.
Next, we need to prepare our dataset specifically for the task at hand. This involves creating a custom dataset class and adding a task prefix to construct the prompts appropriately.
from torch.utils.data import Dataset

class DamageCarDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        example = self.data[idx]
        prompt = "Describe the damage to the car."
        description = example['description']
        image = example['image']
        if image.mode != "RGB":
            image = image.convert("RGB")
        return prompt, description, image

# Create datasets
train_dataset = DamageCarDataset(data['train'])
val_dataset = DamageCarDataset(data['test'])
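As a quick check, the wrapped dataset should return a (prompt, description, image) triple:

# Sanity check: the wrapped dataset returns (prompt, description, image).
prompt, description, image = train_dataset[0]
print(prompt)                    # "Describe the damage to the car."
print(description[:120])         # start of the ground-truth caption
print(image.size, image.mode)    # image dimensions and "RGB"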
Now let’s move on to fine-tuning. We will create our data loaders, define a data collator, and start training. On an A100 with 40 GB of memory, we can use a batch size of 6. If you’re training on a T4 with 15 GB of VRAM, use a batch size of 1 or 2, depending on the size of the model and dataset.
import os
from torch.utils.data import DataLoader
from tqdm import tqdm
def collate_fn(batch):
    prompts, descriptions, images = zip(*batch)
    inputs = processor(text=list(prompts), images=list(images), return_tensors="pt", padding=True).to(device)
    return inputs, descriptions

# Create DataLoaders
batch_size = 2  # 6 on an A100
num_workers = 0
train_loader = DataLoader(train_dataset, batch_size=batch_size, collate_fn=collate_fn, num_workers=num_workers, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, collate_fn=collate_fn, num_workers=num_workers)
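Optionally, pull one collated batch to confirm the tensor shapes before committing to a full training run:

# Optional: inspect one collated batch and confirm the tensor shapes.
inputs, descriptions = next(iter(train_loader))
print(inputs["input_ids"].shape)     # (batch_size, prompt_length)
print(inputs["pixel_values"].shape)  # (batch_size, 3, height, width) from the processor
print(len(descriptions))             # batch_size raw caption strings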
# Training Function
from torch.optim import AdamW  # torch's AdamW (the deprecated transformers.AdamW also works on older versions)
from transformers import get_scheduler

def train_model(train_loader, val_loader, model, processor, epochs=10, lr=1e-6):
    optimizer = AdamW(model.parameters(), lr=lr)
    num_training_steps = epochs * len(train_loader)
    lr_scheduler = get_scheduler(
        name="linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps,
    )
    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss = 0
        for batch in tqdm(train_loader, desc=f"Training Epoch {epoch + 1}/{epochs}"):
            inputs, descriptions = batch
            input_ids = inputs["input_ids"]
            pixel_values = inputs["pixel_values"]
            labels = processor.tokenizer(text=list(descriptions), return_tensors="pt", padding=True, return_token_type_ids=False).input_ids.to(device)
            outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=labels)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            train_loss += loss.item()
        avg_train_loss = train_loss / len(train_loader)
        print(f"Average Training Loss: {avg_train_loss}")

        # Validation phase
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch in tqdm(val_loader, desc=f"Validation Epoch {epoch + 1}/{epochs}"):
                inputs, descriptions = batch
                input_ids = inputs["input_ids"]
                pixel_values = inputs["pixel_values"]
                labels = processor.tokenizer(text=list(descriptions), return_tensors="pt", padding=True, return_token_type_ids=False).input_ids.to(device)
                outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=labels)
                loss = outputs.loss
                val_loss += loss.item()
        avg_val_loss = val_loss / len(val_loader)
        print(f"Average Validation Loss: {avg_val_loss}")

        # Save a checkpoint (model + processor) after each epoch
        output_dir = f"./model_checkpoints/epoch_{epoch+1}"
        os.makedirs(output_dir, exist_ok=True)
        model.save_pretrained(output_dir)
        processor.save_pretrained(output_dir)
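Because a checkpoint is saved after every epoch, any of them can be reloaded later for evaluation or to resume training. A minimal sketch (the epoch number is illustrative; note that, as discussed later in this article, the saved config.json may need the vision_config fix before the checkpoint loads cleanly):

# Reload a locally saved checkpoint (the epoch number is illustrative).
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint_dir = "./model_checkpoints/epoch_10"
model = AutoModelForCausalLM.from_pretrained(checkpoint_dir, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(checkpoint_dir, trust_remote_code=True)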
After training, we will push the model to the Hugging Face Hub. To do so, we first need to log in with write access. Make sure to use a fine-grained token (create the repository first and grant the token access to it).
from huggingface_hub import notebook_login
notebook_login()
Once you have logged in, you will see output like this:
Token is valid (permission: fineGrained)
In the last step, we freeze the image encoder for this tutorial. The authors report a performance improvement when fine-tuning with an unfrozen image encoder compared with keeping it frozen, but note that unfreezing it increases resource usage.
for param in model.vision_tower.parameters():
    param.requires_grad = False  # freeze the image encoder; set to True to leave it unfrozen (trainable)
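Before launching training, you can verify how much of the model will actually be updated:

# Verify which parameters will be updated during training.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M")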
train_model(train_loader, val_loader, model, processor, epochs=10)
Throughout the fine-tuning process, we observed a steady decline in both training and validation loss, showing that the model was learning the task. The validation loss remained slightly higher than the training loss, suggesting there is still room for improvement. Increasing the number of epochs could allow the model to learn more complex patterns in the data and reduce the validation loss further, but it would require more computational resources and time.
You can push the model and processor as follows:
model.push_to_hub("tahaman/DamageCarModel")
processor.push_to_hub("tahaman/DamageCarModel")
We performed our experiments in a lower-resource setup to evaluate the model’s capabilities in constrained fine-tuning environments: we froze the vision encoder and used a batch size of 2 on a T4 GPU in Google Colab. We also tested the model with both frozen and unfrozen image encoders.
Model Testing
Now let’s test our custom fine-tuned Florence-2-base model.
# Testing the Model
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
from PIL import Image
import matplotlib.pyplot as plt
import textwrap

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned model and processor
model = AutoModelForCausalLM.from_pretrained("tahaman/DamageCarModel", trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("tahaman/DamageCarModel", trust_remote_code=True)

# Function to run the model on an example
def run_example(task_prompt, text_input, image_path):
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input
    # Load and preprocess the image
    image = Image.open(image_path)
    if image.mode != "RGB":
        image = image.convert("RGB")
    # Tokenize inputs
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    # Generate output
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    # Ensure parsed_answer is a string
    if isinstance(parsed_answer, dict):
        parsed_answer = str(parsed_answer)
    # Display the image
    plt.imshow(image)
    plt.axis('off')
    plt.show()
    # Print the description with wrapping
    wrapped_description = textwrap.fill(parsed_answer, width=120)
    print(f"Generated Description:\n{wrapped_description}")
    return parsed_answer

# Test the function with an image from your test set
image_path = "/content/test.jpg"
description = run_example("Describe the damage to the car.", '', image_path)
"AssertionError: only DaViT is supported for now."
After pushing the model to the Hugging Face Hub, attempting to load it results in the error “AssertionError: only DaViT is supported for now.” Florence-2 uses DaViT (Dual Attention Vision Transformer) as its vision backbone, and its modeling code checks that the vision backbone declared in the configuration is DaViT. Because the config.json saved with our fine-tuned checkpoint (tahaman/DamageCarModel) did not declare this, loading it with AutoModelForCausalLM fails.
After some investigation, I found that my config.json file was missing the required vision_config entry:
"vision_config": {
"model_type": "davit"
}
The config.json file contains all the necessary configuration parameters for appropriately initializing the model architecture. If any essential parameter is missing or incorrect, the model will fail to initialize.
To resolve this issue, there is a simple workaround: use the config.json shipped with the original Florence-2-base model. Copy all the configuration parameters from Florence-2-base/config.json into your repository’s config.json, except for the first few header lines, i.e.:
{
    "_name_or_path": "florence2",
    "architectures": [
        "Florence2ForConditionalGeneration"
    ],
    "auto_map": {
        "AutoConfig": "configuration_florence2.Florence2Config",
        "AutoModelForCausalLM": "modeling_florence2.Florence2ForConditionalGeneration"
    },
Impact of Replacing config.json:
Reason for the assertion error: this error occurs when the model type required in the vision config is missing or incorrect. The Florence-2 model expects specific configuration values for its vision component, and altering or omitting the vision_config can lead to unexpected loading failures after fine-tuning.
By updating config.json with the correct vision_config, we ensure that the model architecture is properly described, which allows us to load and use our fine-tuned model without errors. This adjustment does not affect the performance of the fine-tuned model, since it only restores configuration metadata.
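For reference, here is a minimal sketch of one way to apply the fix programmatically instead of editing the file by hand (it assumes your token has write access to the model repository; copying the values manually works just as well):

# Sketch: copy the vision_config from the original Florence-2-base config.json
# into the fine-tuned repo's config.json, then upload the patched file.
import json
from huggingface_hub import hf_hub_download, upload_file

base_cfg = json.load(open(hf_hub_download("microsoft/Florence-2-base", "config.json")))
ft_cfg = json.load(open(hf_hub_download("tahaman/DamageCarModel", "config.json")))

ft_cfg["vision_config"] = base_cfg["vision_config"]   # restore the DaViT vision settings

with open("config.json", "w") as f:
    json.dump(ft_cfg, f, indent=2)

upload_file(path_or_fileobj="config.json", path_in_repo="config.json",
            repo_id="tahaman/DamageCarModel")          # requires a write-access token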
After successfully loading our model, here are the results obtained from our fine-tuned model:
Generated Description: {‘Describe the damage to the car.’: ‘The image shows the front view of a brown car with visible damage. The front bumper is severely damaged, with visible scratches and dents. The headlight and grille are also damaged, indicating a significant impact. The car appears to be parked on a white surface, possibly a garage or workshop. The damage is likely to the front of the car, specifically the front grille and bumper area.’}
We evaluated the test image using the Florence-2-base model without any fine-tuning:
Generated Description: {‘Describe the damage to the car.’: ‘the car is on the floor and there is a wall in the background. The car appears to be a Toyota Innova Crysta.’}
We also evaluated the Florence-2-large model without any fine-tuning:
Generated Description: {‘Describe the damage to the car.’: ‘The image shows a brown car parked in front of a gray wall. The car appears to be in a state of disrepair, with rust and dents visible on the body.’}
For comparison, we also evaluated our custom fine-tuned model with an unfrozen image encoder:
Generated Description: {‘Describe the damage to the car.’: ‘The image shows the front view of a brown car, specifically the front part of the vehicle. The car appears to be in a state of disrepair, with the front grille and headlights severely damaged. The damage is located on the left side of the front bumper, with some areas of the headlight and bumper missing. The front bumper is also damaged, with visible signs of wear and tear. The vehicle is parked on a white surface, and the background is a plain grey color.’}
Conclusion
Comparing the frozen- and unfrozen-encoder models against the off-the-shelf Florence-2-base and Florence-2-large outputs above shows that our fine-tuned model performed fairly well. We adapted the model to a specific use case even with a small dataset of only 125 samples and limited computational power. This highlights the capabilities of Microsoft’s Florence-2-base, which, with just 0.23 billion parameters, can outperform much larger models on a range of computer vision and vision-language tasks.