How does Helix, the first self-developed model from robotics company Figure, perform?
On February 20, 2025, local time, humanoid-robot company Figure released Helix, its self-developed general-purpose vision-language-action (VLA) model. Helix achieves several breakthroughs in capability, architecture, and training efficiency, pointing to a new direction for commercializing embodied intelligence. Its performance is as follows:
- Excellent technical performance
- Full upper-body control: Helix is the first VLA model to output high-rate continuous control of the entire humanoid upper body, including the wrists, torso, head, and individual fingers, enabling precise motion control.
- Multi-robot collaboration: It is the first VLA to run on two robots simultaneously, letting them solve a shared, long-horizon manipulation task involving items they have never seen before. In the official demonstration, two robots collaborated to put items away in a refrigerator without pre-scripted instructions.
- Powerful grasping ability: Helix can pick up virtually any small household object in a home setting. For example, when asked to "pick up the desert item," it recognizes a toy cactus, selects the nearest hand, and executes precise motor commands to grasp it securely.
- Zero-shot generalization: A robot running Helix can recognize and grasp thousands of household objects it was never trained on, zero-shot. A single set of neural-network weights supports multi-task learning without additional programming or training, significantly reducing the cost of adapting to complex scenes.
- Architecture design innovation
- Dual-system collaborative architecture: Helix adopts a decoupled two-layer architecture of "System 1 (S1)" and "System 2 (S2)". S1 is a fast reactive visuomotor policy that converts S2's semantic outputs into precise continuous actions at 200 Hz; S2, built on a vision-language model (VLM) pretrained on Internet-scale data, handles scene understanding and semantic reasoning at 7-9 Hz. This resolves the speed-versus-generality trade-off of traditional robot models. Because S1 and S2 are decoupled, each can be optimized independently without retuning the overall model, reducing the complexity of upgrades.
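The timing behavior of this decoupled design can be sketched as follows. All names here are illustrative assumptions, not Figure's actual API: `s2_reason` stands in for the slow VLM layer emitting a latent goal, and `s1_act` for the fast policy consuming it. The point shown is only the frequency decoupling: S1 ticks at 200 Hz, reusing the most recent latent between S2 updates at ~8 Hz.

```python
# Minimal sketch of a dual-frequency S1/S2 control loop (assumed structure).
S1_HZ = 200   # fast reactive visuomotor policy
S2_HZ = 8     # slower semantic reasoning (7-9 Hz in Figure's description)

def s2_reason(update_idx: int) -> dict:
    """Stub for the VLM-based semantic layer: returns a latent goal."""
    return {"latent": update_idx}  # placeholder for a real latent vector

def s1_act(latent: dict) -> float:
    """Stub for the high-rate policy: maps the latest latent to a command."""
    return float(latent["latent"])  # placeholder for real actuation

def run(sim_seconds: float = 1.0) -> list[float]:
    actions = []
    latent = s2_reason(0)
    total_ticks = int(sim_seconds * S1_HZ)
    ticks_per_s2 = S1_HZ // S2_HZ      # S1 ticks between S2 updates (25)
    for tick in range(total_ticks):
        if tick % ticks_per_s2 == 0:   # S2 refreshes the latent at ~8 Hz
            latent = s2_reason(tick // ticks_per_s2)
        actions.append(s1_act(latent)) # S1 acts every tick (200 Hz)
    return actions

actions = run(1.0)
print(len(actions))       # 200 S1 actions in one simulated second
print(len(set(actions)))  # only 8 distinct S2 latents were consumed
```

Because the two loops communicate only through the latent, either side can be retrained or swapped without touching the other, which is the upgrade benefit described above.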
- High training efficiency
- Small data requirement: Helix needs only about 500 hours of supervised data to train, a fraction of what comparable VLA models typically require. Training instructions are generated by an automatic labeling pipeline: a model produces natural-language descriptions of the actions shown in video clips captured by the robot's cameras. This greatly reduces manual labeling cost and helps address generalization to the huge variety of unseen objects in home scenes.
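The auto-labeling idea above can be sketched as a simple pipeline: rather than humans writing an instruction for every demonstration clip, a captioning model produces a natural-language description of each clip, yielding (clip, instruction) training pairs. In this sketch, `caption_model` is a deterministic stub standing in for a real VLM, and the data format is an assumption for illustration only.

```python
# Hedged sketch of automatic labeling of demonstration clips (assumed format).
from dataclasses import dataclass

@dataclass
class Clip:
    clip_id: str
    frames: list          # placeholder for robot-camera frames

@dataclass
class LabeledExample:
    clip_id: str
    instruction: str

def caption_model(clip: Clip) -> str:
    """Stub VLM: a real system would answer a prompt like
    'What instruction matches the action in this video?' over the frames."""
    return f"pick up the object shown in {clip.clip_id}"

def auto_label(clips: list[Clip]) -> list[LabeledExample]:
    """Turn raw demonstration clips into supervised (clip, instruction)
    pairs with no manual annotation."""
    return [LabeledExample(c.clip_id, caption_model(c)) for c in clips]

demos = [Clip("clip_001", []), Clip("clip_002", [])]
dataset = auto_label(demos)
print(dataset[0].instruction)  # pick up the object shown in clip_001
```

The resulting pairs can then be used as ordinary supervised data, which is how a few hundred hours of demonstrations can cover a broad instruction distribution.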
- Great commercial potential
- Low-power operation: The model runs on embedded low-power GPUs without relying on cloud compute, making real-time deployment in homes, warehouses, and similar settings feasible. A single set of neural-network weights learns all behaviors, including grasping, opening and closing drawers, and cross-robot interaction, with no task-specific fine-tuning, significantly lowering the barrier to practical deployment. Figure robots are currently being piloted in BMW factories.