InternVL2 test drive
Ivan Isaev
ML tech-lead and senior engineer | Ex-Head of ML & DS | Ex-Head of Engineering | Kaggle Competitions Master
InternVL2 is another vision-language model I tried some time ago, and I like it a lot. It is quite fast (10 times faster than LLaMA 3 on my videos); I used the 8B model, and the descriptions are nice.
VITA's descriptions are better, and its inference time is also good, but InternVL2 is about twice as fast.
How to try it
You just need to create a venv using InternVL/requirements.txt from the Git repo.
I didn’t use quantization, but 16-bit (bf16/fp16) and BNB 8-bit/4-bit quantization are available out of the box.
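For a sense of what that looks like: below is a minimal loading sketch based on the HF model card, where load_in_8bit=True switches on the BNB 8-bit path (the exact flags can vary between transformers versions, so double-check the card).

# Minimal loading sketch (based on the HF model card; flags may vary by transformers version)
import torch
from transformers import AutoModel, AutoTokenizer

path = 'OpenGVLab/InternVL2-8B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,   # bf16 weights
    load_in_8bit=True,            # BNB 8-bit; drop this line for plain bf16 (then add .cuda() below)
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)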
I used Inference with Transformers for videos and images.
And I like that the code is very clear and runs without hours of debugging. Getting the first description in a Jupyter notebook is very straightforward:
Use LMDeploy to get the first description
LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.
pip install lmdeploy==0.5.3
LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
# A 'Hello, world' example
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
model = 'OpenGVLab/InternVL2-8B'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
response = pipe(('describe this image', image))
print(response.text)
And that's it!
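Since I mostly ran it on videos, it is worth noting that the same pipeline accepts a list of images, so you can sample a few frames and describe them together. The multi-image prompt format below follows the model card's LMDeploy examples; the OpenCV frame sampling and the video path are my own sketch, so treat it as a starting point.

# Sketch: describe a video by sampling a few frames and passing them as multiple images
import cv2
from PIL import Image
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl.constants import IMAGE_TOKEN

def sample_frames(video_path, num_frames=8):
    # Evenly spaced frames, converted BGR -> RGB for PIL
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

pipe = pipeline('OpenGVLab/InternVL2-8B', backend_config=TurbomindEngineConfig(session_len=8192))
frames = sample_frames('my_video.mp4')   # hypothetical path
prompt = ''.join(f'Frame-{i + 1}: {IMAGE_TOKEN}\n' for i in range(len(frames)))
response = pipe((prompt + 'Describe what happens in this video.', frames))
print(response.text)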
Inference with Transformers takes more code than this, but it is also clear and not complex to understand at all.
You can check the code for inference with Transformers in that section of the model card; reading through it takes just a couple of minutes. It boils down to three key functions.
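If memory serves, those are the preprocessing helpers defined on the model card (build_transform, dynamic_preprocess, and load_image, with load_video as the video counterpart), and the description itself then comes from a single model.chat() call. The sketch below replaces the card's tiling-based load_image with a simplified single-tile preprocess of my own and assumes the model and tokenizer are already loaded as in the earlier snippet, so check the card for the full versions.

# Simplified Transformers inference sketch (assumes `model` and `tokenizer` loaded as above;
# the real load_image from the model card also tiles the image into multiple 448x448 crops)
import torch
import torchvision.transforms as T
from PIL import Image

def simple_load_image(image_path, size=448):
    # Single 448x448 tile with ImageNet normalization
    transform = T.Compose([
        T.Resize((size, size)),
        T.ToTensor(),
        T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ])
    return transform(Image.open(image_path).convert('RGB')).unsqueeze(0)

pixel_values = simple_load_image('tiger.jpeg').to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=512, do_sample=False)
question = '<image>\nDescribe this image.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)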
There are also multi-turn conversation and streaming output options (you can check them on the HF page).
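Multi-turn chat, as I understand it from the model card, is just a matter of passing the returned history back into model.chat() (kwarg names taken from the card; the snippet reuses model, tokenizer, pixel_values, and generation_config from above). Streaming follows the usual transformers pattern of a TextIteratorStreamer plus a background thread, which the card also shows.

# Multi-turn sketch: feed the returned history back into the next call
question = '<image>\nDescribe this image.'
response, history = model.chat(tokenizer, pixel_values, question,
                               generation_config, history=None, return_history=True)
follow_up = 'Write a short story based on the image.'
response, history = model.chat(tokenizer, pixel_values, follow_up,
                               generation_config, history=history, return_history=True)
print(response)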