JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
Today's paper introduces JARVIS-VLA, a novel approach for training Vision-Language-Action (VLA) models to play visual games using keyboard and mouse. The paper proposes a new training paradigm called Act from Visual Language Post-Training (ActVLP), which enhances Vision Language Models (VLMs) through self-supervised learning before applying them to action-based decision-making tasks. This approach significantly improves the model's capabilities in world knowledge, visual recognition, and spatial grounding in open-world environments like Minecraft.
Method Overview
The JARVIS-VLA approach introduces a multi-stage training pipeline called ActVLP (Act from Visual Language Post-Training) that enhances VLMs before applying them to action-based tasks. The model architecture consists of a Vision Transformer (ViT) for processing images, an image projection module to align visual and textual representations, and a language model for reasoning and decision-making. The model is designed to handle partially observable environments by incorporating a history of observation images within the prompt.
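To make this concrete, here is a minimal sketch of how such an architecture could be wired together, with projected observation tokens and instruction text feeding a single language-model backbone. The module choices, dimensions, and history handling are illustrative placeholders rather than the authors' implementation, which builds on Qwen2-VL.

```python
# Minimal sketch of a JARVIS-VLA-style forward pass: a ViT-like vision encoder,
# a projection into the language model's embedding space, and a language-model
# backbone that consumes a history of observation tokens plus the instruction.
# Module choices and dimensions are illustrative placeholders, not the authors' code.
import torch
import torch.nn as nn

class VLAPolicy(nn.Module):
    def __init__(self, vis_dim=256, lm_dim=512, vocab_size=32_000):
        super().__init__()
        # Stand-in for the pretrained ViT: a small transformer over patch features.
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vis_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Image projection module: aligns visual tokens with text embeddings.
        self.projector = nn.Linear(vis_dim, lm_dim)
        self.token_embed = nn.Embedding(vocab_size, lm_dim)
        # Stand-in for the causal language-model backbone.
        self.language_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=lm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, image_patches, text_ids):
        # image_patches: (B, num_history_frames * patches_per_frame, vis_dim)
        # text_ids:      (B, prompt_length) token ids for the instruction.
        visual_tokens = self.projector(self.vision_encoder(image_patches))
        text_tokens = self.token_embed(text_ids)
        # Prompt layout: [observation history | instruction] -> next-token logits,
        # where the vocabulary also contains the discrete action tokens.
        sequence = torch.cat([visual_tokens, text_tokens], dim=1)
        hidden = self.language_model(sequence)
        return self.lm_head(hidden[:, -1])
```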
The training pipeline consists of three distinct stages. In Stage I, the language model component is post-trained on text-only world knowledge related to the target environment (Minecraft), while keeping the vision components frozen. This enhances the model's understanding of the game's mechanics, items, and rules. In Stage II, both the vision encoder and language model are fine-tuned using captioning, visual question-answering, and spatial grounding datasets. This stage improves vision-language alignment and the model's ability to understand visual scenes in the game environment.
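One simple way to picture this staged schedule is as a switch over which parameter groups receive gradients. The helper below assumes the VLAPolicy sketch above and only illustrates the freezing pattern; the actual recipe applies it to the Qwen2-VL vision tower, projector, and language model.

```python
# Illustrative freezing schedule for the three ActVLP stages, assuming the
# VLAPolicy sketch above; not the authors' training code.
def configure_stage(model: VLAPolicy, stage: int) -> None:
    vision_parts = [model.vision_encoder, model.projector]
    language_parts = [model.token_embed, model.language_model, model.lm_head]

    if stage == 1:
        trainable = language_parts                  # Stage I: text-only knowledge, vision frozen
    elif stage == 2:
        trainable = vision_parts + language_parts   # Stage II: vision-language alignment, all trained
    else:
        trainable = language_parts                  # Stage III: imitation learning, vision frozen again

    for part in vision_parts + language_parts:
        part.requires_grad_(False)
    for part in trainable:
        part.requires_grad_(True)
```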
Finally, in Stage III, the model undergoes imitation learning on trajectory data, where it learns to mimic expert actions given textual instructions and visual observations. During this phase, the vision-related modules remain frozen while the language model is fine-tuned to generate appropriate action sequences. The model uses a discretized action space, where continuous actions (like mouse movements) are mapped to discrete tokens that are added to the language model's vocabulary.
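The action discretization might look roughly like the following sketch, which clips a continuous mouse movement, buckets it into a fixed number of bins, and maps each bin to a special token added to the vocabulary. The bin count, clip range, and token names are assumptions for illustration, not the paper's exact scheme.

```python
# Sketch of discretizing continuous mouse/camera movements into special tokens.
# N_BINS, MAX_DELTA, and the token names are assumed values for illustration.
import numpy as np

N_BINS = 21         # assumed number of bins per mouse axis
MAX_DELTA = 180.0   # assumed clip range for a per-step movement

def mouse_delta_to_token(delta: float, axis: str) -> str:
    clipped = float(np.clip(delta, -MAX_DELTA, MAX_DELTA))
    bin_id = round((clipped + MAX_DELTA) / (2 * MAX_DELTA) * (N_BINS - 1))
    return f"<mouse_{axis}_{bin_id}>"

# The new action tokens are appended to the language model's vocabulary so the
# model can emit them like ordinary text tokens, e.g. with Hugging Face:
#   tokenizer.add_tokens(ACTION_TOKENS)
#   model.resize_token_embeddings(len(tokenizer))
ACTION_TOKENS = (
    [f"<mouse_x_{i}>" for i in range(N_BINS)]
    + [f"<mouse_y_{i}>" for i in range(N_BINS)]
    + ["<key_w>", "<key_a>", "<key_s>", "<key_d>", "<attack>", "<use>"]
)
```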
To support this training approach, the authors constructed a large-scale multimodal dataset including both non-trajectory task datasets for post-training and trajectory datasets for downstream imitation learning. The non-trajectory datasets include knowledge-based question answering (277K entries), visual-language alignment (35K keyframes), and spatial grounding (404K data points). For imitation learning, they collected over 7.4 million frames of Minecraft gameplay data from various sources.
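For reference, this data mixture can be summarized as a small configuration; the counts come from the paper, while the field names are chosen here for illustration.

```python
# The data mixture described above, written out as a simple configuration dict.
DATASET_MIX = {
    "post_training": {                      # non-trajectory tasks (Stages I-II)
        "knowledge_qa_entries": 277_000,
        "vision_language_alignment_keyframes": 35_000,
        "spatial_grounding_points": 404_000,
    },
    "imitation_learning": {                 # trajectory data (Stage III)
        "minecraft_gameplay_frames": 7_400_000,
    },
}
```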
Results
JARVIS-VLA demonstrates significant improvements over previous state-of-the-art methods across all evaluated task categories in Minecraft. The model achieves an average success rate of 95% for mining blocks, 77% for crafting items, and 70% for smelting items, substantially outperforming previous methods like VPT, STEVE-1, and GROOT.
Notably, even without the ActVLP post-training stages, the base Qwen2-VL model fine-tuned directly on the downstream trajectory data outperforms several previous baselines, highlighting the value of starting from a strong pre-trained VLM. The ActVLP post-training then provides a further significant boost, especially on complex tasks like crafting and smelting, where JARVIS-VLA achieves success rates more than double those of the baseline models.
Conclusion
The paper introduces ActVLP, a novel training framework for vision-language-action models that leverages vision-language post-training to enhance decision-making capabilities in dynamic environments. By post-training on non-trajectory tasks before applying imitation learning on trajectory data, JARVIS-VLA significantly improves the foundation model's ability to understand complex environments and perform a wide range of tasks in Minecraft. For more information, please consult the full paper.
Congrats to the authors for their work!
Li, Muyao, et al. "JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse." arXiv preprint arXiv:2503.16365 (2025).