VITA multimodal LLM
Ivan Isaev
ML tech-lead and senior engineer | Ex-Head of ML & DS | Ex-Head of Engineering | Kaggle Competitions Master
Lately, I've been working a lot with multimodal LLMs to generate video descriptions. This post is about the multimodal LLM VITA, which was published as open source in September 2024.
Hugging Face has a page with fresh papers on multimodal LLMs. I recommend tracking it if you want to stay up to date with this topic.
VITA paper: https://huggingface.co/papers/2408.05211
VITA Git page: https://github.com/VITA-MLLM/VITA/tree/main
What is VITA
VITA is an open-source, high-performance multimodal base model that simultaneously supports video, image, text, and audio inputs. The model accepts either pure text/audio inputs or video/image combined with text/audio inputs. Its training recipe covers both the construction of multimodal training data and a multi-stage training pipeline.
The overall training pipeline of VITA consists of three stages:
- LLM instruction tuning
- Multimodal alignment (in my view the hardest part, and the one that often works poorly, for example in vanilla ImageBind)
- Multimodal instruction tuning
1. LLM Instruction Tuning
For LLM instruction tuning VITA uses Mixtral 8x7B. In the paper and code you can find the system prompts for image input, video input, and pure text input.
2. Multimodal Alignment
In this stage, the authors aim to bridge the representation gap between text and the other modalities, laying the groundwork for multimodal understanding.
2.1 Visual Modality: Visual Encoder
The authors employ InternViT-300M-448px as the visual encoder.
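To get a feel for what this encoder produces, here is a minimal sketch of loading it from the Hugging Face Hub, adapted from the InternViT-300M-448px model card (VITA wires the encoder into its own codebase, so treat this only as a way to poke at the features; the image path is a placeholder):

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Load the visual encoder used by VITA straight from the Hub.
model = AutoModel.from_pretrained(
    "OpenGVLab/InternViT-300M-448px",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).cuda().eval()
processor = CLIPImageProcessor.from_pretrained("OpenGVLab/InternViT-300M-448px")

image = Image.open("some_frame.jpg").convert("RGB")   # placeholder path
pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

with torch.no_grad():
    outputs = model(pixel_values)
# (1, num_patches + 1, hidden_dim) patch features; VITA's visual connector
# maps these into the LLM token space.
print(outputs.last_hidden_state.shape)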
2.2 Audio Modality: Audio Encoder
The input audio is first processed through a Mel filter bank block, which breaks the signal into frequency bands on the mel scale, mimicking the nonlinear human perception of sound. The authors then apply 4x CNN downsampling layers followed by 24 transformer layers, 341M parameters in total, to process the input features, and use a simple two-layer MLP as the audio-text modality connector. In the end, every 2 seconds of audio input is encoded into 25 tokens.
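There is no released training code for this encoder, so the following is only a shape-level PyTorch sketch of the pipeline described above (mel filter bank, 4x CNN downsampling, transformer, two-layer MLP connector). Everything except the facts stated in the paper (4 conv stages, 24 transformer layers, 2-layer MLP) is a placeholder assumption of mine, and the real model compresses 2 s of audio into exactly 25 tokens, which this sketch does not reproduce.

import torch
import torch.nn as nn
import torchaudio

# Shape-level sketch only, not VITA's code: mel filter bank -> 4x CNN
# downsampling -> transformer encoder -> two-layer MLP connector.
class AudioEncoderSketch(nn.Module):
    def __init__(self, n_mels=80, d_model=1024, llm_dim=4096, n_layers=24):
        super().__init__()
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=400, hop_length=160, n_mels=n_mels)
        # Four strided convolutions give a 16x downsampling in time and frequency.
        blocks, in_ch = [], 1
        for _ in range(4):
            blocks += [nn.Conv2d(in_ch, 32, kernel_size=3, stride=2, padding=1), nn.GELU()]
            in_ch = 32
        self.downsample = nn.Sequential(*blocks)
        self.proj = nn.Linear(32 * (n_mels // 16), d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Two-layer MLP connector into the LLM embedding space.
        self.connector = nn.Sequential(
            nn.Linear(d_model, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, wav):                   # wav: (batch, samples) at 16 kHz
        mel = self.mel(wav).unsqueeze(1)      # (batch, 1, n_mels, frames)
        x = self.downsample(mel)              # (batch, 32, n_mels/16, frames/16)
        x = x.permute(0, 3, 1, 2).flatten(2)  # (batch, frames/16, 32 * n_mels/16)
        x = self.encoder(self.proj(x))
        return self.connector(x)              # pseudo audio tokens for the LLM

tokens = AudioEncoderSketch()(torch.randn(1, 2 * 16000))  # 2 seconds of audio
print(tokens.shape)  # (1, 13, 4096) here; the real encoder yields 25 tokens per 2 s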
VITA, like most video-to-text multimodal models, uses an ASR-style audio encoder by default (the encoder is trained on ASR datasets). There is no code to train the audio encoder, only weights. It would be interesting to try VITA with an audio encoder from a sound recognition (audio tagging) model instead.
I didn't investigate this in depth, but I think such encoders could be plugged into VITA almost out of the box, as long as they are finetuned together with the model (if you have tried this, please let me know :)
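For illustration, a hypothetical adapter could look like this: any audio-tagging backbone that emits a (batch, time, feature) sequence sits behind a fresh two-layer MLP connector into the LLM embedding space. All names and dimensions here are made up, and the connector (and ideally the encoder) would still need finetuning inside VITA.

import torch.nn as nn

class TaggingEncoderAdapter(nn.Module):
    # Hypothetical: wrap a sound-recognition backbone (e.g. a PANNs/AST-style model)
    # so it feeds the LLM through the same kind of two-layer MLP connector.
    def __init__(self, tagging_encoder, feat_dim, llm_dim=4096):
        super().__init__()
        self.encoder = tagging_encoder
        self.connector = nn.Sequential(
            nn.Linear(feat_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, wav):
        feats = self.encoder(wav)        # expected shape: (batch, time, feat_dim)
        return self.connector(feats)     # pseudo audio tokens for the LLM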
Training pipeline of VITA
The first stage, LLM instruction tuning, enhances the language model Mixtral 8x7B by expanding its vocabulary and fine-tuning it on a high-quality bilingual text corpus, achieving proficiency in both Chinese and English.
The second stage, multimodal alignment, connects the individual encoders to the LLM so it can process the other modalities. Using a large collection of high-quality multimodal data, the authors align the text feature space with those of video, image, and audio.
The last stage, multimodal instruction tuning, teaches the model to follow text or audio instructions to understand an image or video. A specially designed state token distinguishes the type of input query, which facilitates subsequent multimodal human-computer interaction.
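For intuition, the mechanism is roughly this; the token strings below follow my reading of the paper and may not match the repo's exact format.

# Illustrative only: a state token prefixes the query so the model knows whether it
# is answering a spoken query, ignoring background noise, or reading a text query.
STATE_AUDIO_QUERY = "<1>"   # effective speech query
STATE_NOISY_AUDIO = "<2>"   # noisy/background audio, should not trigger an answer
STATE_TEXT_QUERY = "<3>"    # pure text query

prompt = f"{STATE_TEXT_QUERY} Describe what happens in this video."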
To get a first description with VITA
- Clone the VITA repo to your GPU machine:
git clone https://<your_git_PAT_token>@github.com/VITA-MLLM/VITA.git
- Create and activate a venv, install the dependencies:
cd VITA && python3.10 -m venv vita_demo && source vita_demo/bin/activate && pip install -r requirements.txt
- Run the demo:
python video_audio_demo.py
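The script expects at least the model weights and an input; at the time of writing the repo README invokes it roughly like this (flag names may have changed, so check the current README; the paths and question are placeholders):

python video_audio_demo.py --model_path <path_to_VITA_weights> --image_path <path_to_test_image> --model_type mixtral-8x7b --conv_mode mixtral_two --question "Describe this image."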
Please let me know if you play (or plan to play) with VITA finetuning or audio classification models; I think we'd have interesting details to discuss ;)