A gentle introduction to Parameter-Efficient Fine-Tuning for Vision Models
PEFT Techniques for Visual Fine-Tuning, Bruce X.B. Yu et al.

Foundation models are expected to keep growing in size. We have come a long way from ResNet-50 with about 23 million parameters to ViTs with 22 billion parameters (https://arxiv.org/abs/2302.05442).

This increase in model size makes the traditional fine-tuning techniques for vision models impractical:

a. Fine-tuning all parameters on a downstream task

b. Fine-tuning only the last fully connected layers

Given the increasing size and scale of models, we are in a new paradigm of visual tuning that goes beyond tuning the entire model or just the task head.

NLP saw this boom of large models a few years ago, and a lot of research has already been done on fine-tuning such models efficiently. These techniques are known as Parameter-Efficient Fine-Tuning, or PEFT for short. The computer vision community has taken inspiration from PEFT and introduced similar techniques for vision models.

PEFT techniques for vision models can be grouped into five categories.

1. Fine Tuning

Consider this the standard version of transfer learning. We either tune the whole model, or just the task head (a new fully connected layer) that we add on top of an already pre-trained model. These pre-trained models are trained on benchmark datasets such as ImageNet. If you are only tuning the task head, think of the pre-trained model as a feature extractor. A minimal sketch of both options follows.
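Here is a minimal PyTorch sketch of both options, assuming torchvision is available; the choice of ResNet-50 and the number of classes are illustrative.

```python
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pre-trained ResNet-50.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Option A: tune the task head only -- freeze the backbone so it
# acts as a fixed feature extractor.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a new, trainable task head.
num_classes = 10  # illustrative downstream task
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Option B: full fine-tuning -- skip the freezing loop above and
# train every parameter on the downstream task.
```

When building the optimizer, pass only the parameters with requires_grad=True so the frozen backbone stays untouched.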

For large models, full fine-tuning has several challenges:

a) For each task, you need to store a separate copy of the model.

b) It does not generalize well on unseen data, especially under distribution shift.

c) Fine-tuning only the task head often does not give satisfactory performance.

2. Prompt Tuning

Inspired by prompt tuning in NLP. Think of visual prompts as an example or head start: you draw a single bounding box around an object in an image, the model takes that bounding box as a prompt, and it returns bounding boxes for all objects of that class. That single bounding box was a visual prompt. This video by Andrew Ng (https://www.youtube.com/watch?v=FE88OOUBonQ) does a good job of highlighting visual prompt tuning.

There is also a lot of ongoing research on language- and vision-language-based prompt tuning for vision models. These techniques are primarily incorporated into LLVMs (Large Language Vision Models).
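Beyond prompting with boxes or text, prompt tuning can also operate at the token level: learnable prompt vectors are prepended to the patch embeddings of a frozen ViT, and only those prompts (plus the task head) are trained. Below is a minimal sketch in the style of shallow visual prompt tuning; the VisualPromptWrapper class and its parameters are illustrative assumptions, not the API of any specific library.

```python
import torch
import torch.nn as nn

class VisualPromptWrapper(nn.Module):
    """Prepends learnable prompt tokens to a frozen ViT's patch embeddings.
    Only these prompts are trained; the backbone stays frozen."""

    def __init__(self, num_prompts: int, embed_dim: int):
        super().__init__()
        # Learnable prompt tokens with a small random initialization.
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, embed_dim) * 0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim), produced by the
        # frozen patch-embedding layer of the ViT.
        batch_size = patch_tokens.shape[0]
        prompts = self.prompts.expand(batch_size, -1, -1)
        # The frozen transformer encoder then processes prompts + patches together.
        return torch.cat([prompts, patch_tokens], dim=1)
```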

3. Adapter Tuning

This involves adding extra trainable parameters to a frozen vision model. Some initial efforts include:

(a) Incremental Learning Methods: Learn new information over time without forgetting previous knowledge. There are 3 types:

  • Task-incremental learning: Learns a new task without forgetting old ones
  • Domain-incremental learning: Learns a new domain of data without forgetting old ones
  • Class-incremental learning: Learns to recognize new classes without forgetting old ones

(b) Domain Adaptation Methods: Allow the model to adapt to new domains or contexts by leveraging previously learned domains. Domain adaptation can be further classified into two types:

  • Supervised Domain Adaptation: Uses a labeled dataset from the target domain.
  • Unsupervised Domain Adaptation: Uses an unlabeled dataset from the target domain.

Modern adapter tuning for visual models can be classified into the following types (a sketch of a basic adapter module follows this list):

  • Sequential Adapters
  • Parallel Adapters
  • Mix Adapters
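Here is a minimal sketch of a sequential bottleneck adapter, assuming a transformer backbone with hidden size dim; the BottleneckAdapter name and the bottleneck size are illustrative.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Sequential bottleneck adapter: down-project -> nonlinearity ->
    up-project, with a residual connection. It is inserted after a frozen
    sub-layer, and only its small number of parameters is trained."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        # Zero-initialize the up-projection so the adapter starts out as an
        # identity mapping and does not disturb the frozen model at step 0.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))
```

A parallel adapter instead runs alongside the frozen sub-layer and adds its output to the sub-layer's output; mix adapters combine both placements.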

4. Parameter Tuning


This technique involves directly updating a subset of the model's weights. The main variants are:

  1. Updating the biases only: Commonly known as BitFit, we tune only the bias terms while keeping all weights frozen.
  2. Updating the weights only: We tune a small portion of the weights by introducing low-rank update matrices. This technique is commonly known as LoRA (a minimal sketch follows this list).
  3. Updating both weights and biases
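Here is a minimal sketch of the LoRA idea, wrapping a frozen linear layer with a trainable low-rank update (the product of two small matrices B and A); the LoRALinear name and the r/alpha defaults are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where A is (r x in) and B is (out x r)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pre-trained weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # B is zero-initialized, so training starts exactly from the
        # pre-trained behaviour of the frozen base layer.
        return self.base(x) + self.scale * ((x @ self.lora_a.T) @ self.lora_b.T)
```

Because only A and B are trained, the trainable parameters per layer drop from in x out to r x (in + out).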

5. Remapping Tuning

Think of remapping tuning as distillation. Instead of directly fine-tuning, we take the learned knowledge of a big model (the teacher) and transfer it to a small model (the student). There is a lot of work on distillation, and it comes in many flavors, such as knowledge distillation, attention distillation, etc. A sketch of the classic knowledge-distillation objective follows.
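Here is a minimal sketch of the classic knowledge-distillation objective; the distillation_loss function and the temperature/mixing defaults are illustrative choices.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 4.0,
                      alpha: float = 0.7) -> torch.Tensor:
    """Weighted sum of (a) KL divergence between temperature-softened
    teacher and student distributions and (b) cross-entropy on the labels."""
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)  # rescale so gradients stay comparable across temperatures
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```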



This post would not have been possible without the guidance of Muhammad Uzair Khattak (do follow him), who pointed me to the right resources and got me thinking more deeply about the topic.


References:

  1. Visual Tuning, Bruce X.B. Yu et al. [https://arxiv.org/pdf/2305.06061.pdf]
  2. Generative AI course by DeepLearning.AI
