VITA multimodal LLM

Lately, I've been working a lot with multimodal LLMs to generate video descriptions. This post is about the multimodal LLM VITA, which was published as open source in September 2024.

Hugging Face has a page that tracks fresh papers on multimodal LLMs. I recommend following it if you want to stay up to date with this topic.

VITA paper: https://huggingface.co/papers/2408.05211

VITA Git page: https://github.com/VITA-MLLM/VITA/tree/main

What is VITA

VITA is an open-source, high-performance multimodal base model that simultaneously supports video, image, text, and audio inputs. The model accepts either pure text/audio inputs or video/image combined with text/audio inputs. Its training recipe covers both the construction of multimodal training data and a multi-stage training pipeline.

The overall training pipeline of VITA consists of three stages:

  • LLM instruction tuning
  • Multimodal alignment (in my opinion the hardest part, and the one that often works poorly, for example in vanilla ImageBind)
  • Multimodal instruction tuning

1. LLM Instruction Tuning

For LLM instruction tuning VITA uses Mixtral 8x7B. In the paper and the code you can find the system prompts for image input, video input, and pure text input.
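
The exact prompt texts live in the repo, so as a purely hypothetical illustration, the per-modality selection boils down to something like the sketch below; the placeholder prompt strings and the build_prompt helper are mine, not VITA's.

# Hypothetical sketch: picking a system prompt per input type.
# The real prompt texts are defined in the VITA repo; these placeholders are not them.
SYSTEM_PROMPTS = {
    "image": "<system prompt used when an image accompanies the query>",
    "video": "<system prompt used when a video accompanies the query>",
    "text": "<system prompt for pure text/audio queries>",
}

def build_prompt(user_query: str, input_type: str = "text") -> str:
    """Prepend the modality-specific system prompt to the user query."""
    return f"{SYSTEM_PROMPTS[input_type]}\n\n{user_query}"

print(build_prompt("Describe what happens in the clip.", input_type="video"))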

2. Multimodal Alignment

In this stage, the authors aim to bridge the representation gap between text and the other modalities, laying the groundwork for multimodal understanding.

2.1 Visual Modality: Visual Encoder

The authors employ InternViT-300M-448px as the visual encoder.
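
For reference, the standalone encoder can be pulled from Hugging Face and run on a single image. The sketch below follows the usage pattern from the InternViT-300M-448px model card, not VITA's own loading code, and frame.jpg is just any image you have on disk.

# Minimal sketch: extracting visual features with InternViT-300M-448px,
# following the Hugging Face model card usage (not VITA's own loading code).
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

model = AutoModel.from_pretrained(
    "OpenGVLab/InternViT-300M-448px",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda().eval()
processor = CLIPImageProcessor.from_pretrained("OpenGVLab/InternViT-300M-448px")

image = Image.open("frame.jpg").convert("RGB")   # any RGB image/frame on disk
pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

with torch.no_grad():
    outputs = model(pixel_values)

# outputs.last_hidden_state holds the patch embeddings that a connector feeds to the LLM.
print(outputs.last_hidden_state.shape)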

2.2 Audio Modality: Audio Encoder

The input audio is first processed through a mel filter bank block. This block decomposes the audio signal into individual frequency bands on the mel scale, mimicking the nonlinear human perception of sound. The authors then use 4× CNN downsampling layers followed by 24 transformer layers, totaling 341M parameters, to process the input features, and a simple two-layer MLP as the audio-text modality connector. In the end, every 2 seconds of audio input is encoded into 25 tokens.
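
To see where the "2 seconds → 25 tokens" rate can come from, here is a back-of-the-envelope check. The 10 ms mel hop and the overall 8× temporal reduction of the CNN stack are my assumptions (typical ASR front-end settings), not numbers taken from the paper.

# Back-of-the-envelope token-rate check (assumed front-end settings, not from the paper).
audio_seconds = 2.0
hop_ms = 10.0                                  # assumed mel filter bank hop size
mel_frames = audio_seconds * 1000 / hop_ms     # -> 200 mel frames for 2 s of audio
temporal_downsampling = 8                      # assumed overall reduction of the CNN stack
tokens = mel_frames / temporal_downsampling
print(tokens)                                  # 25.0, matching the rate quoted in the paper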

VITA, like most video-to-text multimodal models, uses an ASR-style audio encoder by default (the audio encoder is trained on ASR datasets). There is no code to train the audio encoder, only weights. It would be interesting to try VITA with an audio encoder from a sound recognition model, for example an audio tagging model.

I didn't investigate this in depth, but I think audio encoders from sound recognition models could be plugged into VITA out of the box, though the model would still need to be fine-tuned with them (if you have tried this, please let me know :)
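
To make the idea concrete, here is a rough sketch of what "plugging in a sound recognition encoder" could look like: take embeddings from an off-the-shelf audio tagging model (the Audio Spectrogram Transformer checkpoint below is just one possible choice) and map them into the LLM embedding space with a two-layer MLP connector, mirroring the connector design described above. This is a hypothetical experiment, not code from the VITA repo, and the connector would still have to be trained.

# Hypothetical sketch: audio-tagging features + a two-layer MLP connector
# (not VITA code; the connector weights would have to be trained/fine-tuned).
import torch
import torch.nn as nn
from transformers import ASTFeatureExtractor, ASTModel

AST_CKPT = "MIT/ast-finetuned-audioset-10-10-0.4593"    # AudioSet-trained tagging model
extractor = ASTFeatureExtractor.from_pretrained(AST_CKPT)
encoder = ASTModel.from_pretrained(AST_CKPT).eval()

class AudioConnector(nn.Module):
    """Two-layer MLP mapping audio features to the LLM hidden size (4096 for Mixtral 8x7B)."""
    def __init__(self, in_dim: int, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, x):
        return self.mlp(x)

connector = AudioConnector(encoder.config.hidden_size)

waveform = torch.randn(16000 * 2)                        # stand-in for 2 s of 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = encoder(**inputs).last_hidden_state       # (1, frames, hidden)
audio_tokens = connector(features)                       # pseudo audio tokens for the LLM
print(audio_tokens.shape)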

Training pipeline of VITA

The first stage, LLM Instruction Tuning, enhances the language model Mixtral 8x7B by expanding its vocabulary and fine-tuning it on a high-quality bilingual text corpus, making it proficient in both Chinese and English.

The second stage, Multimodal Alignment, connects the individual encoders to the LLM so it can process the various modalities. By collecting a substantial amount of high-quality multimodal data, the authors align the text feature space with those of video, image, and audio.

The last stage, Multimodal Instruction Tuning, enables the model to follow text or audio instructions to understand images or video. A specially designed state token is used to distinguish the type of input query, facilitating subsequent multimodal human-computer interaction.
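
As a toy illustration of the state-token idea (the token names below are invented for this sketch; VITA defines its own special tokens), a token prepended to each query tells the downstream logic what kind of query it is handling:

# Toy illustration of a state token distinguishing query types.
# Token names are invented for this sketch; VITA defines its own special tokens.
STATE_TOKENS = {
    "text_query": "<state_text>",
    "audio_query": "<state_audio>",
    "noisy_audio": "<state_noise>",
}

def tag_query(query: str, query_type: str) -> str:
    """Prepend the state token so the interaction logic can branch on the query type."""
    return f"{STATE_TOKENS[query_type]} {query}"

print(tag_query("What is shown in the video?", "audio_query"))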

To get a first description with VITA

  • Clone the VITA repo to your GPU machine:

git clone https://<your_git_PAT_token>@github.com/VITA-MLLM/VITA.git        

  • Create and activate a venv, install the libs:

cd VITA && python3.10 -m venv vita_demo && source vita_demo/bin/activate && pip install -r requirements.txt        

  • Run the demo (depending on the repo version, you will likely need to pass the model weights path and the input media as command-line arguments; check the script and the README for the exact flags):

python video_audio_demo.py        

Please let me know if you are playing (or plan to play) with VITA fine-tuning or audio classification models. I think we would have interesting details to discuss ;)
