Multimodal LLMs

LLMs are pretty cool, but sometimes, we need to go beyond just working with text. Wouldn’t it be nice to chat with an AI model about your images, music, or videos? For instance, conversations like:

  • "Check this receipt – is the total amount correct?" – "No, the actual sum of all the items is 990, but the listed total is 1000."
  • "Tell me what this song is about." – "This song is about emptiness and heartbreak."
  • "At what point in this video does the lecturer start to talk about logistic regression?" – "At 10:12."

The models capable of these kinds of tasks are usually called Multimodal LLMs (or simply “MLLMs”). In this context, “multimodal” means they can work with several different modalities (for example, text and images), while the “LLM” part (Large Language Model) is, as we’ll see, the brain behind this sort of model.

The picture below shows an example of a multimodal chat with an MLLM called Sphinx. Answering questions about images is quite an important task, known as Visual Q&A.

In this post, I’ll share:

  • The main architectural ideas behind MLLMs – using the example of VLMs (Visual Language Models), which, so far, are the most developed type of MLLMs.
  • How they are trained and evaluated, what they are capable of, and what they lack.
  • Some further information on using MLLMs for video, working with subregions, and so on.

This is a version of a long read that I've prepared for the Practical Generative AI program.

MLLM Basics

There are high-quality proprietary models out there, such as GPT-4V(ision), and these are cool, but also completely obscure in terms of their architecture and training. This means we can only bask in what they are capable of delivering:

There are also some open source options which we'll check out in this post.

High-level MLLM architecture

A typical MLLM is a kind of Frankenstein creature assembled from three parts:

  • A pre-trained image encoder, which makes a vector (or a sequence of vectors) out of an image.
  • A pre-trained LLM that produces the output.
  • An adapter mechanism that maps image encoder outputs and a text prompt into the input space of the LLM.

Here’s an example of an MLLM architecture (BLIP-2):

The LLM is the main ingredient here. All the other components simply supply the LLM with high-quality, informative input, while it does all the reasoning. More capable LLMs tend to result in more capable MLLMs. (As a note, Vicuña is quite popular among open source MLLMs.)

The image encoder is often a version of CLIP; this is because CLIP embeddings have an innate alignment with text data. (For example, EVA-CLIP is quite popular.) This component takes an image and creates an informative vector embedding (or sometimes several vectors). By the way, if you want the MLLM to work with a different modality, you just need to use an appropriate encoder. All the other parts of the pipeline remain the same.

The LLM consumes vectors: text token embeddings. We want to supply image embeddings to the LLM as "token embeddings" too. They can be placed, for example, before the actual text instruction (but they can also be interleaved):

Of course, these "token embeddings" do not correspond to any real text tokens. They are just vectors in token embedding space, each "describing the whole image in one virtual token". For this, we need an adapter mechanism that can convert image embeddings into vectors carrying text-ish semantics.
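To make this concrete, here's a minimal PyTorch-style sketch of the idea (all module names and dimensions are made up for illustration; the adapter here is just a single linear projection):

```python
import torch
import torch.nn as nn

# Illustrative sizes: the image encoder outputs 1024-d vectors,
# the LLM expects 4096-d token embeddings.
IMG_DIM, LLM_DIM = 1024, 4096

# The adapter: here, the simplest option -- a single linear projection.
adapter = nn.Linear(IMG_DIM, LLM_DIM)

def build_llm_inputs(image_features, text_token_embeddings):
    """
    image_features:        [batch, n_img_tokens, IMG_DIM]  (from the image encoder)
    text_token_embeddings: [batch, n_txt_tokens, LLM_DIM]  (from the LLM's embedding table)
    """
    # Map image features into the LLM's token-embedding space.
    image_tokens = adapter(image_features)
    # Prepend the image "tokens" to the text tokens (interleaving is also possible).
    return torch.cat([image_tokens, text_token_embeddings], dim=1)
```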

This adapter mechanism is the place where researchers can apply their creativity; it can be as simple as a one- or two-layer MLP projection, or more invasive, resembling LoRA (more details further below!)

Training process overview

At a high level, the training process has two potential stages:

A. Pre-training on image captioning data (such as LAION or COCO). Here, the model is trained to align text and image data. That said, thanks to LLM capabilities, a pre-trained MLLM often exhibits few-shot visual Q&A and other abilities. (Improving caption quality may lead to better models, see the ShareGPT4V paper.)

The image encoder usually stays frozen, and often the LLM as well, although sometimes it is also fine-tuned or parameter-efficiently fine-tuned (with something like LoRA). The adapter is always trainable.
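As a rough sketch of what this freezing scheme looks like in code (PyTorch, with toy stand-in modules; in reality these would be a CLIP vision tower, an LLM like Vicuña, and the adapter connecting them):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real components (names and sizes are illustrative).
vision_encoder = nn.Linear(768, 1024)
llm = nn.Linear(4096, 4096)
adapter = nn.Linear(1024, 4096)

# Freeze the image encoder (and, in this sketch, the LLM as well).
for module in (vision_encoder, llm):
    for p in module.parameters():
        p.requires_grad = False

# The adapter stays trainable, and only its parameters go into the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in adapter.parameters() if p.requires_grad], lr=1e-3
)
```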

B. Instructional fine-tuning. This stage directly teaches an MLLM to follow instructions like "Tell me the eye color of the person standing to the left of the golden retriever".

However, creating instruction data manually is prohibitively expensive for open source developers even for LLMs, and even more so for MLLMs. Therefore, people create synthetic data, either by repurposing supervised learning datasets (Visual Q&A, for example) or by asking GPT-4 to create instructions and answers based on captions and object bounding boxes. Like this:

The quality of these instructions sometimes leaves much to be desired, and researchers are often able to improve quality a bit just by fixing mistakes in the data.

Those who can afford such a luxury can also use interleaved data for instruction tuning, that is, pictures mixed with text in a single narrative.

Now, if you're still curious, let's see how these principles show up in several important models.

LLaVA: the simplest VLM

Introduced in Visual Instruction Tuning, Apr '23.

This is probably the simplest MLLM, so let's look closely at its architecture and its training procedures.

LLaVA: Architecture

LLaVA connects CLIP and Vicuña through the most basic adapter. It’s just a single fully connected layer (projection) that turns image embeddings into "token embeddings" for the LLM. In classic LLaVA, one image goes exactly into one "token embedding" – although it's technically possible to map an image onto a sequence of "tokens" carrying more information about it.

(The authors later showed that taking a two-layer MLP as an adapter significantly improves quality.)
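For illustration, here are both adapter variants side by side (dimensions are placeholders; the real ones depend on the chosen CLIP encoder and LLM):

```python
import torch.nn as nn

CLIP_DIM, LLM_DIM = 1024, 4096   # placeholder sizes

# Classic LLaVA: a single linear projection from CLIP space to LLM token space.
linear_adapter = nn.Linear(CLIP_DIM, LLM_DIM)

# Later variant: a two-layer MLP, reported to give a noticeable quality boost.
mlp_adapter = nn.Sequential(
    nn.Linear(CLIP_DIM, LLM_DIM),
    nn.GELU(),
    nn.Linear(LLM_DIM, LLM_DIM),
)
```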

LLaVA: Training

Training is carried out in two stages:

Stage 1. Pre-training. The model is trained on image captioning data:

(image, -) --> text description

Note that we don't use the text modality in the input. We can afford this because we'll account for it during the next training stage.

Also, the vision encoder and LLM are frozen during pre-training. Only the projection layer is trained.

Stage 2. Instruction tuning. During this stage, both the projection layer and the LLM are fine tuned.

But where do we get the data? At this point, almost no image-based instruction data is available, and gathering it with humans is way too expensive. So, the authors create synthetic data with the help of GPT-4!

To do this, they transform pictures into the two following types of text representation:

  • Comprehensive captions
  • Bounding boxes for various objects in the pictures

GPT-4 is then used to create the following types of instruction data from captions and bounding boxes:

  • Conversations between an assistant and a person asking questions about the image. The answers are written as if the assistant were seeing the image and answering the questions.
  • Detailed descriptions.
  • Complex reasoning.

This way, the authors were able to collect 158k unique language-image instructions. You've already seen examples in the Training process overview section.

CogVLM: a LoRA-like adapter

(You can also check the paper where it was first introduced: CogVLM: visual expert for large language models.)

LLaVA's approach to connecting a vision encoder with an LLM is characterized as shallow alignment: image features are mapped into the input space of an LLM. Roughly speaking, this can be compared with prompt tuning.

CogVLM injects this information deeper into the LLM, in a sense like LoRA does.

Let's look again at the architecture of CogVLM:

The adapter mechanism consists of the following components:

First, a LLaVA-like MLP adapter that projects ViT outputs to the space of text embeddings. It is a two-layer MLP.

Then, there are visual expert modules inside every transformer block, both in the attention and MLP parts. They work as follows:

Image "tokens" go first, then we have text tokens. We have trainable matrices:

These matrices create queries, keys, and values for image "tokens". Then, all queries, keys and values are concatenated into matrices Q, K, V and the final attention scores are calculated. That Mask thing above stands for the triangular mask that prevents the model from "looking into the past".

The FFN layer is applied separately to image and text tokens:
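Roughly (again, in my own notation), the image and text parts of the hidden state go through separate feed-forward networks, and the results are stacked back along the sequence dimension:

$$
\text{FFN}(X) = \begin{bmatrix} \text{FFN}_{\text{img}}(X_{\text{img}}) \\ \text{FFN}_{\text{txt}}(X_{\text{txt}}) \end{bmatrix},
$$

where $\text{FFN}_{\text{img}}$ is the new trainable visual expert and $\text{FFN}_{\text{txt}}$ is the frozen FFN of the original LLM.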

What else? Q-Former

Curiously, the first MLLMs were much more complex than LLaVA and its naive MLP adapter. An example of an earlier architecture is BLIP-2; its adapter, called Q-Former, is marked in yellow in this diagram:

As you can see, it has two transformer trunks: one accepts image input through a cross-attention mechanism, and the other prepares input text for the LLM (this contrasts with LLaVA which didn't have an additional adapter for texts). These trunks are aligned during training via Image-Text Matching loss and Image-Text Contrastive loss.

We won't go into details of this architecture, but feel free to check the paper: from time to time, the Q-Former block emerges in various models.

MM1: Analysis & Insights from Multimodal LLM Pre-training

The MM1 paper was released quite recently (at the time of writing), in March of 2024. The authors performed a series of ablation experiments, training models with different architectures, on different data, and so on, to understand the best pre-training strategies. They ended up with several interesting insights:

  1. In terms of the encoder, image resolution has the highest impact on quality, followed by model size and training data composition. And as we know, perception is a weak spot in the system, so making the encoder big could be good for the downstream quality.
  2. The type of adapter has little effect, while the number of visual tokens per image is (unsurprisingly) important.
  3. Training on interleaved data (whole narratives with interleaved text and images) boosts few-shot and text-only performance. Text-only data also helps with this. (For context: few-shot for multimodal tasks means you supply texts with pictures as few-shot examples.)
  4. During instructional fine-tuning, it's best to mix caption data, interleaved data, and text-only data in a 5:5:1 proportion (a toy sampling sketch follows this list).
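As an illustration of the last point, here is a toy sampler that draws training examples from the three sources in a 5:5:1 proportion (the dataset contents are placeholders):

```python
import random

# Placeholder datasets; in practice these would be large collections of examples.
caption_data = ["<caption example>"]
interleaved_data = ["<interleaved example>"]
text_only_data = ["<text-only example>"]

sources = [caption_data, interleaved_data, text_only_data]
weights = [5, 5, 1]   # the 5:5:1 mixture from the MM1 ablations

def sample_example(rng=random):
    # Pick a source according to the mixture weights, then a random example from it.
    source = rng.choices(sources, weights=weights, k=1)[0]
    return rng.choice(source)
```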

Leveraging these insights, the authors built MM1, a family of multimodal models with up to 30B parameters (both dense and mixture-of-experts) that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks.

Evaluating VLMs

There are many facets of VLM quality. In papers, you sometimes see figures like this one, taken from the CogVLM paper:


Each "spike" is a benchmark. For each VLM, some of the benchmarks may get into the training dataset while others are used for evaluation. I'll discuss several major types of benchmarks and give dataset examples.

First up, of course, image captioning is a popular benchmark type; it can check the very basic ability to understand what's in a picture. (Example: NoCaps.)

However, we also expect a VLM to be a worthy conversationalist capable of reasoning about what's happening in an image. So, there are many Visual Question Answering (VQA) datasets checking different layers of understanding and types of image-based inference.

OKVQA (Outside Knowledge Visual Question Answering) which asks tricky questions that require combining perception with knowledge:


TextVQA which is focused on text perception:

MM-Vet:

Visual7W, which also contains visual pointing questions:

So, as you see, there's much to evaluate: from understanding of space geometry to OCR proficiency. There is no single benchmark that we can trust 100%, so please don't forget to evaluate your VLM thoroughly on your downstream tasks.

Shortcomings of VLMs

The paper How Far Are We from Intelligent Visual Deductive Reasoning? from March 2024 provides ablation-based analyses of different VLM components and their contribution (or lack thereof) to overall quality. I'll share some of the main insights:

  1. I've already mentioned that the LLM is like the "brains" of an MLLM, the image encoder is its "eyes", and the adapter mechanism is the "optical nerve". While the LLM part is, at present, usually quite powerful, the "blind spot" of current MLLMs is perception: most likely the encoder, but probably the adapter as well. So, if you're making a new VLM, taking a powerful encoder with a high resolution can prove beneficial.
  2. Few-shot learning, even assisted with chain-of-thought reasoning, doesn't work well with VLMs. The authors of the paper found that: "surprisingly, all the tested models, including GPT-4V and Gemini, struggle with a high failure rate even when the in-context example is identical to the current task being solved".

Just like LLMs, VLMs are prone to hallucinations. For example, they like describing items that are not present in the input image (see this paper for details).

That said, there are probably ways of mitigating this. In text-only cases (at least in a RAG setup), you can reduce the hallucination rate by requiring the LLM to give exact quotes justifying its answers. For VLMs, it can help to ask the model to provide bounding boxes for the objects it mentions (see Shikra for analysis). Of course, you need to fine-tune your VLM for that.

From images to videos

A video is essentially a sequence of frames which we can transform into a sequence of embeddings with the same image encoder. However, additional hacks are needed to make it work well. I'll mention a few.

Video-ChatGPT uses spatial + temporal feature selection:

Spatial pooling averages every frame along the spatial axes, so that you get a tensor of shape [batch, time, embedding_dim].

Temporal pooling does the same along the time axis, giving a tensor of shape [batch, (height/patch_size) x (width/patch_size), embedding_dim].

These tensors are concatenated and sent through the adapter layer into the LLM.
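A minimal sketch of this pooling scheme (just the tensor shapes; the encoder, adapter and LLM are out of scope), assuming the per-frame patch features come as a [batch, time, patches, dim] tensor:

```python
import torch

def pool_video_features(frame_features):
    """
    frame_features: [batch, time, patches, dim] -- per-frame patch embeddings
    """
    # Spatial pooling: average over the patch axis -> [batch, time, dim]
    spatially_pooled = frame_features.mean(dim=2)
    # Temporal pooling: average over the time axis -> [batch, patches, dim]
    temporally_pooled = frame_features.mean(dim=1)
    # Concatenate along the token axis -> [batch, time + patches, dim];
    # this then goes through the adapter into the LLM.
    return torch.cat([spatially_pooled, temporally_pooled], dim=1)

# Example: batch of 2 videos, 8 frames each, 16x16 = 256 patches, 1024-d features.
feats = torch.randn(2, 8, 256, 1024)
print(pool_video_features(feats).shape)   # torch.Size([2, 264, 1024])
```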

Video-LLaVA uses LanguageBind encoders pre-trained for alignment between the modalities of text and videos, without additional feature engineering.

However, for longer videos we run into the familiar problem of working with a long context. Let's look at one attempt at coping with this.

World Model on Million-length Video and Language with Blockwise RingAttention is an interesting paper discussing MLLM training for long texts and videos. In terms of architecture, the model is pretty straightforward, and much resembles LLaVA. However, it's curious that they use VQGAN as an image encoder; it tokenizes 256 × 256 input images into 16 × 16 discrete tokens. In contrast to CLIP embeddings, this requires additional training for alignment with text, but its discrete nature is probably beneficial.

Apart from some (important) technicalities, the core of the paper is the training process. The recipe is actually quite simple:

  1. The model is first trained on texts, with the context length growing up to a maximum of 1M tokens, and only then does training with images and videos begin.
  2. For both modalities, training is carried out in multiple (5) stages with increasing context length. For images and videos, it's 1K, then 8K, then 32K, then 128K, and finally 1M.
  3. Training starts with simple tasks: first image captioning, then video annotation, and only then real instruction tuning with videos.
  4. The RoPE base θ is scaled up together with the context length during the text-only part; after that, it is left alone (see the note after this list).
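For context, this is the standard RoPE parametrization (not something specific to this paper): RoPE rotates each pair of embedding dimensions with frequencies

$$
\omega_i = \theta^{-2i/d}, \qquad i = 0, 1, \ldots, d/2 - 1,
$$

so increasing the base $\theta$ from the usual 10,000 to a much larger value slows the lowest frequencies down and keeps positions distinguishable over a much longer context.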

You can check the datasets they used in the screenshot:

Of course, dealing with a long context requires additional work to make it efficient; the authors use Blockwise RingAttention with FlashAttention, hence the paper name.

Here's the comparison of the three models measured against popular video-LLM benchmarks:


Advanced features

1. Multimodal output

Authors are sometimes able to make an MLLM not only process things from different modalities, but also output images, texts, videos and other things. (See NExT-GPT: Any-to-Any Multimodal LLM.) In any case, the LLM stays the main think tank and all the other components provide anything->text and text->anything transformations.

2. Working with image subregions

One of the potentially useful features of a VLM is the ability to ask questions about particular subregions of an image. Like this:

For point selection and rectangular regions, this can be implemented as easily as in, for example, Shikra: rectangle boundaries and points are simply represented as numbers in the text for the LLM.
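For illustration only (the exact textual format is defined by the model's training data; normalized coordinates are an assumption here), a region-grounded question might be serialized roughly like this:

```python
# The region is passed to the LLM as plain numbers in the prompt text --
# no extra vision machinery is needed for rectangles or points.
x1, y1, x2, y2 = 0.12, 0.30, 0.56, 0.78   # normalized [x1, y1, x2, y2]
prompt = (
    f"What is the person in the region [{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}] "
    "holding in their hands?"
)
print(prompt)
```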

If you allow for irregularly shaped regions, this trick won't help. The authors of Ferret took inspiration from PointMLP, used for point clouds. They suggested a spatial-aware visual sampler which works roughly as follows (a rough code sketch follows the list):

  1. Sample N points from the region.
  2. Sample N/r points from them using farthest point sampling. We'll call these points "centroids".
  3. For each "centroid", find its k nearest neighbors, then propagate the data stored in the neighbors (coordinates + RGB or whatever) to the "centroid" (by fusing + pooling; more details in the paper). Now, the "centroid" in a sense represents its whole neighborhood.
  4. Repeat steps 2-3, with the "centroids" as the initial points from which we sample a smaller number of new "centroids".
  5. Project the data stored in the "centroids" to the LLM embedding space. Now, we can plug this data into the LLM!
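Here's a very rough PyTorch sketch of steps 1-5 (toy data and dimensions, and plain averaging in place of the learned fuse-and-pool step from the paper):

```python
import torch
import torch.nn as nn

def farthest_point_sampling(points, num_samples):
    """Greedy farthest point sampling: pick `num_samples` well-spread point indices."""
    n = points.shape[0]
    chosen = torch.zeros(num_samples, dtype=torch.long)
    chosen[0] = int(torch.randint(n, (1,)))
    dist = torch.full((n,), float("inf"))
    for i in range(1, num_samples):
        dist = torch.minimum(dist, (points - points[chosen[i - 1]]).pow(2).sum(-1))
        chosen[i] = dist.argmax()
    return chosen

def knn_pool(points, feats, centroid_idx, k):
    """For each centroid, average the data stored in its k nearest neighbors."""
    centroids = points[centroid_idx]                                         # [m, 2]
    nn_idx = torch.cdist(centroids, points).topk(k, largest=False).indices   # [m, k]
    return centroids, feats[nn_idx].mean(dim=1)                              # [m, 2], [m, feat_dim]

# Toy input: N points sampled from the (irregular) region, each storing xy + RGB.
N, r, k, llm_dim = 512, 4, 8, 4096
points = torch.rand(N, 2)
feats = torch.cat([points, torch.rand(N, 3)], dim=-1)

# Two rounds of FPS + kNN pooling (the real sampler fuses features with learned
# layers before pooling; plain averaging keeps this sketch short).
for _ in range(2):
    idx = farthest_point_sampling(points, points.shape[0] // r)
    points, feats = knn_pool(points, feats, idx, k)

# Project the data stored in the final "centroids" into the LLM embedding space.
projector = nn.Linear(feats.shape[-1], llm_dim)
region_tokens = projector(feats)    # [N // r**2, llm_dim], ready to plug into the LLM
print(region_tokens.shape)          # torch.Size([32, 4096])
```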

3. VLMs for GUI Interaction

Imagine that you take your mobile phone and just lazily tell it (without ever touching the screen): "Is the new book by my favourite writer already on Amazon? If yes, buy it!" That would be cool, wouldn't it? But it turns out that teaching an MLLM to use apps is not the same thing as teaching it to write code or describe images.

Most human interaction with computers and mobile devices happens via graphic interfaces. For us, this is very convenient, but MLLMs struggle with GUIs. Indeed, using a GUI requires the ability to recognize all the widgets and infer their functions, which is hard, because:

  • Images with many small but important text-riddled objects are usually scarce in the training datasets,
  • Images with elongated aspect ratios and varying resolutions are also not common in MLLM training.

A recent paper from Apple, Ferret-UI-anyres, strives to overcome this.

To overcome the aspect ratio challenge, the authors adopt the idea of “any resolution” (anyres), which suggests passing to the LLM not only the whole image, but also its fragments chosen from some grid. The fragments come in a more natural aspect ratio (and potentially at a larger resolution), allowing fine details to be captured. At the same time, the whole image provides a broad perspective.

Ferret-UI uses one of two grids, 1x2 or 2x1, depending on whether the screen is horizontal or vertical.
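A minimal sketch of this "anyres" preprocessing (the global-view size, and which grid goes with which orientation, are my assumptions here; each resulting view would then be encoded separately):

```python
from PIL import Image

def anyres_views(img: Image.Image, global_size=(336, 336)):
    """Return the global view plus grid fragments (1x2 for landscape, 2x1 for portrait)."""
    w, h = img.size
    views = [img.resize(global_size)]            # the whole image: broad perspective
    if w >= h:                                   # landscape screen -> split left / right
        boxes = [(0, 0, w // 2, h), (w // 2, 0, w, h)]
    else:                                        # portrait screen -> split top / bottom
        boxes = [(0, 0, w, h // 2), (0, h // 2, w, h)]
    views += [img.crop(box) for box in boxes]    # fragments: finer details, nicer aspect ratio
    return views

# Example: a portrait phone screenshot gets a global view plus top and bottom halves.
screenshot = Image.new("RGB", (1170, 2532))
print([v.size for v in anyres_views(screenshot)])
```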

Ferret-UI doesn't beat GPT-4V on advanced tasks, but I think that's probably due to the sheer power of GPT-4 as an LLM. On elementary tasks, such as widget classification, Ferret-UI does a better job than GPT-4V, and this indicates its potential.

To end this part, I would argue that, in the end, the problem of LLM+GUI will most likely be solved in a completely different way. GUIs are created for humans, not LLMs. So, if the technology continues developing, we’ll probably see special interfaces for LLMs on most web sites.
