Helix: A Vision-Language-Action Model for Generalist Humanoid Control
Introducing Helix

Helix is a generalist Vision-Language-Action (VLA) model that integrates perception, language understanding, and learned control to address longstanding challenges in robotics. It brings several firsts to humanoid robotics:

  • Full-upper-body control: Helix is the first VLA model capable of high-rate continuous control of a humanoid robot's upper body, including wrists, torso, head, and individual fingers.
  • Multi-robot collaboration: Helix enables two robots to operate simultaneously, solving complex, shared manipulation tasks involving unfamiliar objects.
  • Pick up anything: Equipped with Helix, Figure robots can grasp and manipulate virtually any small household object based on natural language prompts, even if they have never encountered the item before.
  • One neural network: Unlike previous approaches requiring task-specific fine-tuning, Helix learns all behaviours—including picking and placing, opening drawers, and interacting with multiple robots—through a single set of neural network weights.
  • Commercial-ready: Helix runs entirely on embedded, low-power GPUs, making it ready for commercial deployment without additional hardware modifications.

New Scaling for Humanoid Robotics

Household environments present the greatest challenge for robotics due to the diversity and unpredictability of objects. Unlike structured industrial settings, homes contain an array of items—glassware, clothing, toys—varying in shape, size, colour, and texture. To be useful, robots must dynamically generate intelligent behaviours, especially for objects they have never encountered.

Traditional approaches require extensive human intervention: programming a new skill takes hours of expert coding or thousands of demonstrations, making scalability impractical. By leveraging recent advances in vision-language models (VLMs), Helix introduces a paradigm shift: robots can acquire new skills on demand through natural language commands, eliminating the need for extensive manual programming.

Helix: A "System 1, System 2" VLA Model for Whole Upper Body Control

Helix introduces a dual-system architecture inspired by cognitive science:

  • System 2 (S2): A VLM operating at 7-9 Hz, responsible for scene understanding and language comprehension, ensuring broad generalisation across objects and contexts.
  • System 1 (S1): A fast visuomotor policy translating S2’s semantic representations into precise robot actions at 200 Hz.

This decoupled architecture allows S2 to handle high-level reasoning while S1 ensures real-time responsiveness. For instance, S1 rapidly adapts to the partner robot’s movements in collaborative scenarios while maintaining S2’s high-level objectives.
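To make the two-rate split concrete, here is a minimal single-process sketch of how a slow S2 and a fast S1 might share one control loop. The function names (s2_encode, s1_act, get_obs, send_command) and the refresh-every-25-ticks scheme (200 Hz / 25 ≈ 8 Hz, inside the stated 7-9 Hz band) are illustrative assumptions, not Figure's implementation:

```python
import time

S1_HZ = 200     # fast visuomotor policy rate (from the article)
S2_EVERY = 25   # refresh the latent every 25 ticks -> ~8 Hz (assumed scheme)

def control_loop(s2_encode, s1_act, get_obs, send_command):
    """Two-rate loop: S2 periodically refreshes a latent intent vector;
    S1 consumes the latest latent at every 200 Hz tick."""
    latent = s2_encode(get_obs())           # initial high-level intent
    tick = 0
    period = 1.0 / S1_HZ
    while True:
        t0 = time.monotonic()
        obs = get_obs()                     # fresh camera + proprioception
        if tick % S2_EVERY == 0:
            latent = s2_encode(obs)         # slow: scene + language understanding
        send_command(s1_act(obs, latent))   # fast: continuous upper-body action
        tick += 1
        time.sleep(max(0.0, period - (time.monotonic() - t0)))
```

In practice a 7B-parameter VLM cannot finish inside a 5 ms tick, so the deployed system runs the two models asynchronously instead; see "Optimized Streaming Inference" below.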

Key Advantages of Helix

  • Speed and Generalisation: Matches the speed of specialised policies while generalising to thousands of unseen objects.
  • Scalability: Outputs continuous control for complex humanoid actions without requiring tokenization.
  • Architectural Simplicity: Uses standard architectures—an open-weight VLM for S2 and a transformer-based visuomotor policy for S1.
  • Separation of Concerns: Allows independent optimization of S1 and S2, improving flexibility and adaptability.

Model and Training Details

Data Collection

Helix is trained on a dataset of ~500 hours of diverse teleoperated behaviors across multiple robots and operators. Natural language training pairs are generated using an auto-labeling VLM, which analyzes segmented video clips and formulates instructional prompts based on observed actions.
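The auto-labeling step can be pictured as prompting an off-the-shelf VLM with frames from a segmented teleoperation clip and asking it to write the instruction that would produce the observed behaviour. The sketch below is an illustrative reconstruction; the prompt wording and the vlm.generate interface are assumptions, not Figure's pipeline:

```python
def caption_clip(vlm, frames):
    """Hypothetical auto-labeler: ask a VLM to write the instruction that
    would produce the behaviour shown in a teleoperation clip."""
    prompt = (
        "Given these frames from a robot demonstration, write the "
        "natural-language instruction a person would have given the "
        "robot to produce the behaviour shown."
    )
    return vlm.generate(images=frames, prompt=prompt)  # assumed VLM interface

def build_language_pairs(vlm, teleop_clips):
    """Pair each clip with its generated instruction, yielding
    (video, text, action) training triples."""
    return [
        {"frames": c.frames, "actions": c.actions,
         "text": caption_clip(vlm, c.frames)}
        for c in teleop_clips
    ]
```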

Architecture

  • System 2 (S2): A 7B-parameter open-source VLM processes monocular robot images and state information, translating vision-language embeddings into a single latent vector.
  • System 1 (S1): An 80M-parameter transformer conditions its low-level control policy on S2’s latent vector, ensuring precise, high-frequency motor execution.
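A minimal PyTorch sketch of the two modules and their interface, assuming (as stated above) that S2 compresses its vision-language features into a single latent vector that conditions S1. The dimensions, module internals, and names (latent_dim, d_model, the pooling scheme) are placeholders, not the published architecture:

```python
import torch
import torch.nn as nn

class S2(nn.Module):
    """Slow system: a pretrained VLM backbone pooled into one latent vector."""
    def __init__(self, vlm_backbone, hidden_dim=4096, latent_dim=512):
        super().__init__()
        self.backbone = vlm_backbone                  # 7B open-weight VLM (placeholder)
        self.to_latent = nn.Linear(hidden_dim, latent_dim)

    def forward(self, image, text_tokens, robot_state):
        feats = self.backbone(image, text_tokens, robot_state)  # assumed interface
        return self.to_latent(feats.mean(dim=1))      # pool tokens -> single latent

class S1(nn.Module):
    """Fast system: a small transformer mapping (observation, latent) -> continuous action."""
    def __init__(self, obs_dim, latent_dim=512, action_dim=35, d_model=512):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, d_model)
        self.latent_proj = nn.Linear(latent_dim, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.action_head = nn.Linear(d_model, action_dim)  # 35-DoF continuous output

    def forward(self, obs, latent):
        tokens = torch.stack([self.obs_proj(obs), self.latent_proj(latent)], dim=1)
        return self.action_head(self.encoder(tokens)[:, 0])  # no action tokenization
```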

Training Strategy

Helix is trained end-to-end, mapping raw pixels and text commands to continuous actions via regression loss. A temporal offset is introduced during training to match real-time inference latency, ensuring smooth deployment.
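The objective can be sketched as a plain regression loss, with S1 conditioned on an S2 latent computed from an earlier frame so that training mimics the latency S2 will have at deployment. The offset value and batch layout below are illustrative assumptions:

```python
import torch.nn.functional as F

LATENCY_OFFSET = 25  # S1 steps of delay (~125 ms at 200 Hz); assumed value

def helix_loss(s2, s1, batch):
    """batch: image/text/state from time t - LATENCY_OFFSET,
    observation and target action from time t (assumed layout)."""
    latent = s2(batch["image_past"], batch["text_tokens"], batch["state_past"])
    pred = s1(batch["obs_now"], latent)
    return F.mse_loss(pred, batch["action_now"])  # continuous regression, no action tokens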

Optimized Streaming Inference

Helix is deployed on low-power GPUs, with S2 and S1 operating asynchronously. S2 continuously updates a shared latent vector encoding high-level behavioural intent, while S1 processes real-time robot observations for precise motor control. This structure ensures Helix maintains the necessary 200 Hz control loop, making it as fast as traditional single-task imitation learning policies.
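A minimal threaded sketch of this asynchronous layout: S2 rewrites a shared latent whenever it finishes a pass, while S1 ticks at 200 Hz with whatever latent is newest. All names are assumed, and the real system may use separate processes or devices rather than Python threads:

```python
import threading
import time

class SharedLatent:
    """Thread-safe holder for the most recent S2 latent."""
    def __init__(self, initial):
        self._latent = initial
        self._lock = threading.Lock()

    def write(self, latent):
        with self._lock:
            self._latent = latent

    def read(self):
        with self._lock:
            return self._latent

def s2_worker(shared, s2_encode, get_obs):
    while True:                      # free-running: ~7-9 Hz per the article
        shared.write(s2_encode(get_obs()))

def s1_worker(shared, s1_act, get_obs, send_command, hz=200):
    period = 1.0 / hz
    while True:                      # 200 Hz control loop
        t0 = time.monotonic()
        send_command(s1_act(get_obs(), shared.read()))
        time.sleep(max(0.0, period - (time.monotonic() - t0)))

# Launch (usage sketch):
# threading.Thread(target=s2_worker, args=(shared, s2_encode, get_obs), daemon=True).start()
# threading.Thread(target=s1_worker, args=(shared, s1_act, get_obs, send_command), daemon=True).start()
```

The key design choice is that S1 never blocks on S2: the fast loop always reads the latest completed latent, so control stays at 200 Hz even while the VLM is mid-inference.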


Results

Fine-Grained Whole Upper Body Control

Helix enables smooth coordination across 35 degrees of freedom (DoF), including individual finger control, head tracking, and torso adjustments. The robot dynamically modifies its posture for optimal reach while maintaining precise grasping, a significant achievement in humanoid robotics.
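For intuition, the 35-dimensional action vector can be thought of as concatenated targets for the upper-body joint groups named above. The partition below is purely hypothetical (the article does not publish the exact split) and serves only to show how such an action space might be indexed:

```python
# Hypothetical 35-DoF layout; the real partition is not public.
ACTION_LAYOUT = {
    "torso":         slice(0, 3),    # e.g. waist pitch/roll/yaw (assumed)
    "head":          slice(3, 5),    # e.g. pan/tilt (assumed)
    "left_arm":      slice(5, 10),
    "right_arm":     slice(10, 15),
    "left_wrist":    slice(15, 17),
    "right_wrist":   slice(17, 19),
    "left_fingers":  slice(19, 27),
    "right_fingers": slice(27, 35),
}
assert sum(s.stop - s.start for s in ACTION_LAYOUT.values()) == 35
```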

Zero-Shot Multi-Robot Coordination

Helix enables two Figure robots to collaborate in real time on complex tasks, such as storing groceries involving completely novel objects. The robots are directed through natural language prompts like "Hand the bag of cookies to the robot on your right," showcasing emergent multi-robot coordination without explicit role assignments.

Emergent “Pick Up Anything” Capability

Helix-equipped robots can pick up any small household object via simple prompts like “Pick up the toy” or “Pick up the dessert item.” The model translates abstract concepts into precise actions, demonstrating advanced generalisation across diverse environments.

Discussion and Future Prospects

Helix represents a major leap in humanoid robotics, proving that vision-language knowledge can directly translate into real-time motor control. With its efficient training, commercial viability, and generalisation capabilities, Helix sets a new standard for AI-driven robotic systems. Our next steps include refining multi-robot collaboration, expanding object interaction capabilities, and integrating Helix into real-world applications, bringing us closer to truly autonomous humanoid assistants.


https://www.figure.ai/news/helix




