VITA multimodal LLM
Ivan Isaev
ML tech-lead and senior engineer | Ex-Head of ML & DS | Ex-Head of Engineering | Kaggle Competitions Master
Lately, I've been working a lot with multimodal LLMs to generate video descriptions. This post is about the multimodal LLM VITA, which was published as open source in September 2024.
Hugging Face has a page with fresh papers on multimodal LLMs. I recommend tracking it if you want to stay up to date with this topic.
VITA paper: https://huggingface.co/papers/2408.05211
VITA Git page: https://github.com/VITA-MLLM/VITA/tree/main
What is VITA
VITA is an open-source, high-performance multimodal base model that simultaneously supports video, image, text, and audio inputs. The model accepts either pure text/audio inputs or video/image combined with text/audio inputs. Its training recipe covers both the construction of multimodal training data and a multi-stage training pipeline.
The overall training pipeline of VITA consists of three stages:
- LLM instruction tuning
- Multimodal alignment (in my view the hardest part, and the one that often works poorly, for example in vanilla ImageBind)
- Multimodal instruction tuning
1. LLM Instruction Tuning
For LLM instruction tuning VITA uses Mixtral 8x7B. In the paper and code you can find the system prompts for image input, video input, and pure text input.
2. Multimodal Alignment
In this stage, the authors aim to bridge the representation gap between text and the other modalities, laying the groundwork for multimodal understanding.
2.1 Visual Modality: Visual Encoder
The authors employ InternViT-300M-448px as the visual encoder.
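To get a feel for what this encoder produces, here is a minimal sketch of loading it from the Hugging Face Hub, adapted from the InternViT-300M-448px model card (VITA wires the encoder into its own codebase, so treat this only as a way to poke at the features; the image path is a placeholder):

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Load the visual encoder used by VITA straight from the Hub.
model = AutoModel.from_pretrained(
    "OpenGVLab/InternViT-300M-448px",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).cuda().eval()
processor = CLIPImageProcessor.from_pretrained("OpenGVLab/InternViT-300M-448px")

image = Image.open("some_frame.jpg").convert("RGB")   # placeholder path
pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

with torch.no_grad():
    outputs = model(pixel_values)
# (1, num_patches + 1, hidden_dim) patch features; VITA's visual connector
# maps these into the LLM token space.
print(outputs.last_hidden_state.shape)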
2.2 Audio Modality: Audio Encoder
The input audio is first processed through a Mel filter bank block, which breaks the signal into frequency bands on the mel scale, mimicking the nonlinear human perception of sound. The authors then apply 4x CNN downsampling layers followed by 24 transformer layers, 341M parameters in total, to process the input features, and use a simple two-layer MLP as the audio-text modality connector. In the end, every 2 seconds of audio input is encoded into 25 tokens.
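There is no released training code for this encoder, so the following is only a shape-level PyTorch sketch of the pipeline described above (mel filter bank, 4x CNN downsampling, transformer, two-layer MLP connector). Everything except the facts stated in the paper (4 conv stages, 24 transformer layers, 2-layer MLP) is a placeholder assumption of mine, and the real model compresses 2 s of audio into exactly 25 tokens, which this sketch does not reproduce.

import torch
import torch.nn as nn
import torchaudio

# Shape-level sketch only, not VITA's code: mel filter bank -> 4x CNN
# downsampling -> transformer encoder -> two-layer MLP connector.
class AudioEncoderSketch(nn.Module):
    def __init__(self, n_mels=80, d_model=1024, llm_dim=4096, n_layers=24):
        super().__init__()
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=400, hop_length=160, n_mels=n_mels)
        # Four strided convolutions give a 16x downsampling in time and frequency.
        blocks, in_ch = [], 1
        for _ in range(4):
            blocks += [nn.Conv2d(in_ch, 32, kernel_size=3, stride=2, padding=1), nn.GELU()]
            in_ch = 32
        self.downsample = nn.Sequential(*blocks)
        self.proj = nn.Linear(32 * (n_mels // 16), d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Two-layer MLP connector into the LLM embedding space.
        self.connector = nn.Sequential(
            nn.Linear(d_model, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, wav):                   # wav: (batch, samples) at 16 kHz
        mel = self.mel(wav).unsqueeze(1)      # (batch, 1, n_mels, frames)
        x = self.downsample(mel)              # (batch, 32, n_mels/16, frames/16)
        x = x.permute(0, 3, 1, 2).flatten(2)  # (batch, frames/16, 32 * n_mels/16)
        x = self.encoder(self.proj(x))
        return self.connector(x)              # pseudo audio tokens for the LLM

tokens = AudioEncoderSketch()(torch.randn(1, 2 * 16000))  # 2 seconds of audio
print(tokens.shape)  # (1, 13, 4096) here; the real encoder yields 25 tokens per 2 s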
VITA, like most video-to-text multimodal models, uses an ASR-style audio encoder by default (the encoder is trained on ASR datasets). There is no code to train the audio encoder, only weights. It would be interesting to try VITA with an audio encoder from a sound recognition (audio tagging) model instead.
I didn't investigate this in depth, but I think such encoders could be plugged into VITA almost out of the box, as long as they are finetuned together with the model (if you have tried this, please let me know :)
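For illustration, a hypothetical adapter could look like this: any audio-tagging backbone that emits a (batch, time, feature) sequence sits behind a fresh two-layer MLP connector into the LLM embedding space. All names and dimensions here are made up, and the connector (and ideally the encoder) would still need finetuning inside VITA.

import torch.nn as nn

class TaggingEncoderAdapter(nn.Module):
    # Hypothetical: wrap a sound-recognition backbone (e.g. a PANNs/AST-style model)
    # so it feeds the LLM through the same kind of two-layer MLP connector.
    def __init__(self, tagging_encoder, feat_dim, llm_dim=4096):
        super().__init__()
        self.encoder = tagging_encoder
        self.connector = nn.Sequential(
            nn.Linear(feat_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, wav):
        feats = self.encoder(wav)        # expected shape: (batch, time, feat_dim)
        return self.connector(feats)     # pseudo audio tokens for the LLM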
Training pipeline of VITA
The first stage, LLM instruction tuning, enhances the language model Mixtral 8x7B by expanding its vocabulary and fine-tuning it on a high-quality bilingual text corpus, achieving proficiency in both Chinese and English.
The second stage, multimodal alignment, connects the individual encoders to the LLM so it can process the other modalities. Using a large collection of high-quality multimodal data, the authors align the text feature space with those of video, image, and audio.
The last stage, multimodal instruction tuning, teaches the model to follow text or audio instructions to understand an image or video. A specially designed state token distinguishes the type of input query, which facilitates subsequent multimodal human-computer interaction.
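For intuition, the mechanism is roughly this; the token strings below follow my reading of the paper and may not match the repo's exact format.

# Illustrative only: a state token prefixes the query so the model knows whether it
# is answering a spoken query, ignoring background noise, or reading a text query.
STATE_AUDIO_QUERY = "<1>"   # effective speech query
STATE_NOISY_AUDIO = "<2>"   # noisy/background audio, should not trigger an answer
STATE_TEXT_QUERY = "<3>"    # pure text query

prompt = f"{STATE_TEXT_QUERY} Describe what happens in this video."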
To get a first description with VITA
- Clone the VITA repo to your GPU machine:
git clone https://<your_git_PAT_token>@github.com/VITA-MLLM/VITA.git
- Create and activate a venv, install the dependencies:
cd VITA && python3.10 -m venv vita_demo && source vita_demo/bin/activate && pip install -r requirements.txt
- Run the demo:
python video_audio_demo.py
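The script expects at least the model weights and an input; at the time of writing the repo README invokes it roughly like this (flag names may have changed, so check the current README; the paths and question are placeholders):

python video_audio_demo.py --model_path <path_to_VITA_weights> --image_path <path_to_test_image> --model_type mixtral-8x7b --conv_mode mixtral_two --question "Describe this image."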
Please let me know if you play (or plan to play) with VITA finetuning or audio classification models; I think we'd have interesting details to discuss ;)