A gentle introduction to Parameter Efficient Fine-Tuning for Vision Models
Ahmad Anis
Foundation models are expected to keep scaling up. We have come a long way from ResNet-50, with 23 million parameters, to ViTs with 22 billion parameters (https://arxiv.org/abs/2302.05442).
This increase in model size makes the traditional fine-tuning techniques for vision models impractical:
a. Fine-tuning all parameters on a downstream task
b. Fine-tuning only the last fully connected layers
Given this increasing size and scale, we are in a new paradigm of visual tuning that goes beyond tuning the entire model or just the task head.
NLP saw this boom of large models a few years ago, and a lot of research has already been done on fine-tuning models efficiently. These techniques are known as Parameter Efficient Fine-Tuning, or PEFT for short. The computer vision community has taken inspiration from PEFT and introduced similar techniques for vision models.
PEFT techniques in Vision Models can be grouped into 5 categories.
1. Fine Tuning
Consider this the standard version of transfer learning. We either tune the whole model, or just the task head (a new fully connected layer) added on top of an already pre-trained model. These pre-trained models are trained on benchmark datasets such as ImageNet. If you are only tuning the task head, you can think of the pre-trained model as a frozen feature extractor.
For large models, full fine-tuning has several challenges:
a) You need a separate copy of the model for each task.
b) It often fails to generalize to unseen data, particularly under distribution shift.
c) Fine-tuning only the task head often does not give satisfactory performance.
2. Prompt Tuning
Inspired by prompt tuning in NLP. Think of a visual prompt as an example or head start: e.g., you draw a single bounding box around an object in an image, the model takes that bounding box as a prompt, and it returns bounding boxes for every instance of that class. That single bounding box was a visual prompt. This video by Andrew Ng (https://www.youtube.com/watch?v=FE88OOUBonQ) does a good job of highlighting visual prompt tuning.
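A related line of research (e.g. the Visual Prompt Tuning paper) learns the prompt itself: a few trainable tokens are prepended to a frozen ViT's patch embeddings, and only those tokens (plus a task head) are trained. A minimal PyTorch sketch, with dimensions chosen to match a hypothetical ViT-B (196 patch tokens, 768-dimensional embeddings):

```python
import torch
import torch.nn as nn

class VisualPromptTokens(nn.Module):
    """Prepend learnable prompt tokens to a frozen ViT's patch embeddings.

    A sketch of VPT-style prompt tuning: the backbone stays frozen and
    only `prompts` (and a task head, not shown) receive gradients.
    """

    def __init__(self, num_prompts: int = 5, embed_dim: int = 768):
        super().__init__()
        # The only new trainable parameters: a handful of prompt vectors.
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, embed_dim) * 0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim) from the frozen embedder.
        batch = patch_tokens.shape[0]
        prompts = self.prompts.expand(batch, -1, -1)
        # The concatenated sequence is then fed to the frozen transformer blocks.
        return torch.cat([prompts, patch_tokens], dim=1)

x = torch.randn(2, 196, 768)      # e.g. 14x14 patches from a ViT-B
out = VisualPromptTokens()(x)     # sequence grows by num_prompts tokens
```

The frozen transformer never knows the extra tokens are learned; it just processes a slightly longer sequence.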
There is also a lot of ongoing research on language- and vision-language-based prompt tuning for vision models. These are primarily incorporated in LLVMs (Large Language Vision Models).
3. Adapter Tuning
This involves adding small trainable modules (adapters) to a frozen vision model. Some initial efforts include:
(a) Incremental Learning Methods: learn new information over time without forgetting what was learned before.
(b) Domain Adaptation Methods
Modern adapter tuning for vision models can be further classified into several families of methods.
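Many modern adapter methods share a common building block: a small bottleneck MLP with a residual connection, inserted into each block of a frozen backbone, with only the adapter weights updated during fine-tuning. A minimal PyTorch sketch (the dimensions are illustrative):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""

    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        # Zero-initialize the up-projection so the adapter starts as an
        # identity function and does not disturb the pretrained features
        # at the beginning of training.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

x = torch.randn(2, 196, 768)
y = Adapter()(x)   # identical to x at initialization, thanks to zero-init
```

With a 64-dimensional bottleneck inside a 768-dimensional model, each adapter adds roughly 100K parameters per block, a small fraction of a full transformer block.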
4. Parameter Tuning
This technique involves directly updating a small subset of the model's weights, for example tuning only the bias terms or a sparse subset of parameters.
5. Remapping Tuning
Think of remapping tuning as distillation: instead of fine-tuning directly, we transfer the learned knowledge of a large model into a small one. There is a large body of work on distillation, with many variants such as knowledge distillation and attention distillation.
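As a sketch, the classic knowledge-distillation objective (Hinton-style softened logits) blends a hard-label cross-entropy term with a KL term that pulls the student's predictions toward the teacher's; the temperature and mixing weight below are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a softened-logit KL term."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients are comparable across temperatures
    return alpha * hard + (1 - alpha) * soft

# Toy usage with random logits for a 10-class problem.
s = torch.randn(4, 10)   # student logits
t = torch.randn(4, 10)   # teacher logits
y = torch.randint(0, 10, (4,))
loss = distillation_loss(s, t, y)
```

Raising the temperature `T` softens both distributions, exposing the teacher's relative confidence across wrong classes, which is often where the useful "dark knowledge" lives.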
This post would not have been possible without the guidance of Muhammad Uzair Khattak (Do follow him) who gave me the right resources to learn about it and made me think more about it.