InternVL2 test drive

InternVL2 is another vision-language model I tried some time ago, and I like it a lot. It is quite fast (about 10 times faster than LLaMA v3 for my videos); I used the 8B model, and the descriptions are nice.

InternVL2 on Hugging Face

InternVL2 on GitHub

VITA's descriptions are better, and VITA's inference time is also good, but InternVL2 is about twice as fast.

How to try it

You just need to create a venv using InternVL/requirements.txt from the GitHub repo.
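The setup might look something like this (a sketch; the repo URL is the official OpenGVLab one, but adjust paths to your checkout):

```shell
# Clone the repo and set up an isolated environment
git clone https://github.com/OpenGVLab/InternVL.git
cd InternVL

python -m venv .venv
source .venv/bin/activate

# Install the dependencies pinned by the repo
pip install -r requirements.txt
```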

I didn't use quantization, but 16-bit (bf16/fp16) and BitsAndBytes (8-bit and 4-bit) quantization are available out of the box.
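For example, loading the 8B model with 8-bit BitsAndBytes quantization through Transformers could look like the sketch below (this follows the pattern on the model card; it assumes a CUDA GPU and the bitsandbytes package installed):

```python
# Sketch: load InternVL2-8B with BitsAndBytes 8-bit quantization.
import torch
from transformers import AutoModel, AutoTokenizer

path = 'OpenGVLab/InternVL2-8B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,   # 16-bit for the non-quantized parts
    load_in_8bit=True,            # BitsAndBytes 8-bit weights
    trust_remote_code=True,       # InternVL2 ships custom modeling code
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
```

For 4-bit, the analogous `load_in_4bit=True` flag applies; 16-bit without quantization is just the `torch_dtype` argument alone.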

I used inference with Transformers for both videos and images.

I also like that the code is very clear and runs without hours of debugging. Getting the first description in a Jupyter notebook is very straightforward:

Use LMDeploy to get the first description

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.

pip install lmdeploy==0.5.3        

LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.

# A 'Hello, world' example

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2-8B'
# Fetch a sample image from the web
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
# session_len sets the maximum context length for the TurboMind backend
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
# Pass a (prompt, image) tuple to get a description back
response = pipe(('describe this image', image))
print(response.text)

And that's it!

Inference with Transformers takes more code than this, but it is also clear and not hard to understand at all.

You can find the code in the "Inference with Transformers" section of the model card; reading it takes just a couple of minutes. Its three key functions are:

  • Function dynamic_preprocess (uses the helper function find_closest_aspect_ratio to find the aspect ratio closest to the target)

  • Function load_image (uses the helper function build_transform for image normalization and resizing)

  • Function load_video (uses the helper function get_index, which returns frame indices for video loading)
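To give a feel for the last one: a minimal sketch of what get_index does, based on the description above, is to split the video into equal bins and take the middle frame of each. The function name, signature, and exact rounding below are my assumptions, not the model card's code:

```python
# Hypothetical sketch of get_index-style frame sampling: split the video
# into num_segments equal bins and take the middle frame of each bin.
def get_frame_indices(total_frames: int, num_segments: int = 8) -> list[int]:
    seg_size = total_frames / num_segments          # frames per bin
    # middle frame of the i-th bin
    return [int(seg_size / 2 + seg_size * i) for i in range(num_segments)]

print(get_frame_indices(80, 8))  # -> [5, 15, 25, 35, 45, 55, 65, 75]
```

The real load_video then decodes only those frames instead of the whole clip, which keeps video inference cheap.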

There are also multi-turn conversation and streaming output options (see the HF model card).

Finetune

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.
