InternVL2 test drive
Ivan Isaev
ML tech-lead and senior engineer | Ex-Head of ML & DS | Ex-Head of Engineering | Kaggle Competitions Master
InternVL2 is another vision-language model I tried some time ago, and I like it a lot. It is quite fast (10 times faster than LLaMA 3 on my videos); I used the 8B model, and the descriptions are nice.
VITA's descriptions are better, and its inference time is also good, but InternVL2 is about twice as fast.
How to try it
You just need to create a venv using InternVL/requirements.txt from the Git repo.
I didn’t use quantization, but 16-bit (bf16/fp16) and BNB 8-bit/4-bit quantization are available out of the box.
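For a sense of what that looks like: below is a minimal loading sketch based on the HF model card, where load_in_8bit=True switches on the BNB 8-bit path (the exact flags can vary between transformers versions, so double-check the card).

# Minimal loading sketch (based on the HF model card; flags may vary by transformers version)
import torch
from transformers import AutoModel, AutoTokenizer

path = 'OpenGVLab/InternVL2-8B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,   # bf16 weights
    load_in_8bit=True,            # BNB 8-bit; drop this line for plain bf16 (then add .cuda() below)
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)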
I used Inference with Transformers for videos and images.
And I like that the code is very clear and runs without hours of debugging. Getting the first description in a Jupyter notebook is very straightforward:
Use LMDeploy to get the first description
LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.
pip install lmdeploy==0.5.3
LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
# A 'Hello, world' example
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
model = 'OpenGVLab/InternVL2-8B'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
response = pipe(('describe this image', image))
print(response.text)
And that's it!
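Since I mostly ran it on videos, it is worth noting that the same pipeline accepts a list of images, so you can sample a few frames and describe them together. The multi-image prompt format below follows the model card's LMDeploy examples; the OpenCV frame sampling and the video path are my own sketch, so treat it as a starting point.

# Sketch: describe a video by sampling a few frames and passing them as multiple images
import cv2
from PIL import Image
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl.constants import IMAGE_TOKEN

def sample_frames(video_path, num_frames=8):
    # Evenly spaced frames, converted BGR -> RGB for PIL
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

pipe = pipeline('OpenGVLab/InternVL2-8B', backend_config=TurbomindEngineConfig(session_len=8192))
frames = sample_frames('my_video.mp4')   # hypothetical path
prompt = ''.join(f'Frame-{i + 1}: {IMAGE_TOKEN}\n' for i in range(len(frames)))
response = pipe((prompt + 'Describe what happens in this video.', frames))
print(response.text)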
Inference with Transformers takes more code than this, but it is also clear and not complex to understand at all.
You can check the code for inference with Transformers in that section of the model card; reading through it takes just a couple of minutes. It boils down to three key functions.
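If memory serves, those are the preprocessing helpers defined on the model card (build_transform, dynamic_preprocess, and load_image, with load_video as the video counterpart), and the description itself then comes from a single model.chat() call. The sketch below replaces the card's tiling-based load_image with a simplified single-tile preprocess of my own and assumes the model and tokenizer are already loaded as in the earlier snippet, so check the card for the full versions.

# Simplified Transformers inference sketch (assumes `model` and `tokenizer` loaded as above;
# the real load_image from the model card also tiles the image into multiple 448x448 crops)
import torch
import torchvision.transforms as T
from PIL import Image

def simple_load_image(image_path, size=448):
    # Single 448x448 tile with ImageNet normalization
    transform = T.Compose([
        T.Resize((size, size)),
        T.ToTensor(),
        T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ])
    return transform(Image.open(image_path).convert('RGB')).unsqueeze(0)

pixel_values = simple_load_image('tiger.jpeg').to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=512, do_sample=False)
question = '<image>\nDescribe this image.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)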
There are also multi-turn conversation and streaming output options (you can check them on the HF page).
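Multi-turn chat, as I understand it from the model card, is just a matter of passing the returned history back into model.chat() (kwarg names taken from the card; the snippet reuses model, tokenizer, pixel_values, and generation_config from above). Streaming follows the usual transformers pattern of a TextIteratorStreamer plus a background thread, which the card also shows.

# Multi-turn sketch: feed the returned history back into the next call
question = '<image>\nDescribe this image.'
response, history = model.chat(tokenizer, pixel_values, question,
                               generation_config, history=None, return_history=True)
follow_up = 'Write a short story based on the image.'
response, history = model.chat(tokenizer, pixel_values, follow_up,
                               generation_config, history=history, return_history=True)
print(response)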