JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
Today's paper introduces JARVIS-VLA, a novel approach for training Vision-Language-Action (VLA) models to play visual games using keyboard and mouse. The paper proposes a new training paradigm called Act from Visual Language Post-Training (ActVLP), which enhances Vision Language Models (VLMs) through self-supervised learning before applying them to action-based decision-making tasks. This approach significantly improves the model's capabilities in world knowledge, visual recognition, and spatial grounding in open-world environments like Minecraft.
Method Overview
The JARVIS-VLA approach introduces a multi-stage training pipeline called ActVLP (Act from Visual Language Post-Training) that enhances VLMs before applying them to action-based tasks. The model architecture consists of a Vision Transformer (ViT) for processing images, an image projection module to align visual and textual representations, and a language model for reasoning and decision-making. The model is designed to handle partially observable environments by incorporating a history of observation images within the prompt.
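To make this concrete, here is a minimal sketch of how such an architecture could be wired together, with projected observation tokens and instruction text feeding a single language-model backbone. The module choices, dimensions, and history handling are illustrative placeholders rather than the authors' implementation, which builds on Qwen2-VL.

```python
# Minimal sketch of a JARVIS-VLA-style forward pass: a ViT-like vision encoder,
# a projection into the language model's embedding space, and a language-model
# backbone that consumes a history of observation tokens plus the instruction.
# Module choices and dimensions are illustrative placeholders, not the authors' code.
import torch
import torch.nn as nn

class VLAPolicy(nn.Module):
    def __init__(self, vis_dim=256, lm_dim=512, vocab_size=32_000):
        super().__init__()
        # Stand-in for the pretrained ViT: a small transformer over patch features.
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vis_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Image projection module: aligns visual tokens with text embeddings.
        self.projector = nn.Linear(vis_dim, lm_dim)
        self.token_embed = nn.Embedding(vocab_size, lm_dim)
        # Stand-in for the causal language-model backbone.
        self.language_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=lm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, image_patches, text_ids):
        # image_patches: (B, num_history_frames * patches_per_frame, vis_dim)
        # text_ids:      (B, prompt_length) token ids for the instruction.
        visual_tokens = self.projector(self.vision_encoder(image_patches))
        text_tokens = self.token_embed(text_ids)
        # Prompt layout: [observation history | instruction] -> next-token logits,
        # where the vocabulary also contains the discrete action tokens.
        sequence = torch.cat([visual_tokens, text_tokens], dim=1)
        hidden = self.language_model(sequence)
        return self.lm_head(hidden[:, -1])
```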
The training pipeline consists of three distinct stages. In Stage I, the language model component is post-trained on text-only world knowledge related to the target environment (Minecraft), while keeping the vision components frozen. This enhances the model's understanding of the game's mechanics, items, and rules. In Stage II, both the vision encoder and language model are fine-tuned using captioning, visual question-answering, and spatial grounding datasets. This stage improves vision-language alignment and the model's ability to understand visual scenes in the game environment.
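One simple way to picture this staged schedule is as a switch over which parameter groups receive gradients. The helper below assumes the VLAPolicy sketch above and only illustrates the freezing pattern; the actual recipe applies it to the Qwen2-VL vision tower, projector, and language model.

```python
# Illustrative freezing schedule for the three ActVLP stages, assuming the
# VLAPolicy sketch above; not the authors' training code.
def configure_stage(model: VLAPolicy, stage: int) -> None:
    vision_parts = [model.vision_encoder, model.projector]
    language_parts = [model.token_embed, model.language_model, model.lm_head]

    if stage == 1:
        trainable = language_parts                  # Stage I: text-only knowledge, vision frozen
    elif stage == 2:
        trainable = vision_parts + language_parts   # Stage II: vision-language alignment, all trained
    else:
        trainable = language_parts                  # Stage III: imitation learning, vision frozen again

    for part in vision_parts + language_parts:
        part.requires_grad_(False)
    for part in trainable:
        part.requires_grad_(True)
```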
Finally, in Stage III, the model undergoes imitation learning on trajectory data, where it learns to mimic expert actions given textual instructions and visual observations. During this phase, the vision-related modules remain frozen while the language model is fine-tuned to generate appropriate action sequences. The model uses a discretized action space, where continuous actions (like mouse movements) are mapped to discrete tokens that are added to the language model's vocabulary.
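The action discretization might look roughly like the following sketch, which clips a continuous mouse movement, buckets it into a fixed number of bins, and maps each bin to a special token added to the vocabulary. The bin count, clip range, and token names are assumptions for illustration, not the paper's exact scheme.

```python
# Sketch of discretizing continuous mouse/camera movements into special tokens.
# N_BINS, MAX_DELTA, and the token names are assumed values for illustration.
import numpy as np

N_BINS = 21         # assumed number of bins per mouse axis
MAX_DELTA = 180.0   # assumed clip range for a per-step movement

def mouse_delta_to_token(delta: float, axis: str) -> str:
    clipped = float(np.clip(delta, -MAX_DELTA, MAX_DELTA))
    bin_id = round((clipped + MAX_DELTA) / (2 * MAX_DELTA) * (N_BINS - 1))
    return f"<mouse_{axis}_{bin_id}>"

# The new action tokens are appended to the language model's vocabulary so the
# model can emit them like ordinary text tokens, e.g. with Hugging Face:
#   tokenizer.add_tokens(ACTION_TOKENS)
#   model.resize_token_embeddings(len(tokenizer))
ACTION_TOKENS = (
    [f"<mouse_x_{i}>" for i in range(N_BINS)]
    + [f"<mouse_y_{i}>" for i in range(N_BINS)]
    + ["<key_w>", "<key_a>", "<key_s>", "<key_d>", "<attack>", "<use>"]
)
```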
To support this training approach, the authors constructed a large-scale multimodal dataset including both non-trajectory task datasets for post-training and trajectory datasets for downstream imitation learning. The non-trajectory datasets include knowledge-based question answering (277K entries), visual-language alignment (35K keyframes), and spatial grounding (404K data points). For imitation learning, they collected over 7.4 million frames of Minecraft gameplay data from various sources.
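For reference, this data mixture can be summarized as a small configuration; the counts come from the paper, while the field names are chosen here for illustration.

```python
# The data mixture described above, written out as a simple configuration dict.
DATASET_MIX = {
    "post_training": {                      # non-trajectory tasks (Stages I-II)
        "knowledge_qa_entries": 277_000,
        "vision_language_alignment_keyframes": 35_000,
        "spatial_grounding_points": 404_000,
    },
    "imitation_learning": {                 # trajectory data (Stage III)
        "minecraft_gameplay_frames": 7_400_000,
    },
}
```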
Results
JARVIS-VLA demonstrates significant improvements over previous state-of-the-art methods across all evaluated task categories in Minecraft. The model achieves an average success rate of 95% for mining blocks, 77% for crafting items, and 70% for smelting items, substantially outperforming previous methods like VPT, STEVE-1, and GROOT.
Notably, even without the ActVLP post-training stages, the base Qwen2-VL model fine-tuned directly on the downstream trajectory data outperforms several previous baselines, highlighting the value of starting from a strong pre-trained VLM. The ActVLP post-training then provides a further significant boost, especially on complex tasks like crafting and smelting, where JARVIS-VLA achieves success rates more than double those of the baseline models.
Conclusion
The paper introduces ActVLP, a novel training framework for vision-language-action models that leverages vision-language post-training to enhance decision-making capabilities in dynamic environments. By post-training on non-trajectory tasks before applying imitation learning on trajectory data, JARVIS-VLA significantly improves the foundation model's ability to understand complex environments and perform a wide range of tasks in Minecraft. For more information, please consult the full paper.
Congrats to the authors for their work!
Li, Muyao, et al. "JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse." arXiv preprint arXiv:2503.16365 (2025).