Deep Dive into Robotics Learning Architectures

Kai Xin Thia

Head of AI & Analytics, Group Tech Office, ST Engineering

Published: March 19, 2025

This week, we explore the latest advances from Figure's Helix, NVIDIA's Isaac GR00T N1, and Google's Gemini Robotics. They represent a strategic evolution in robotics architectures, reshaping what robots can do and how quickly they adapt to real-world environments. These architectures share one key trait: each uses a dual-system framework inspired by human cognitive processing. One component, a Vision Language Model (VLM), handles high-level reasoning and understanding; the other, a Vision Language Action (VLA) model or diffusion transformer, manages swift, precise physical actions.

Figure's Helix integrates this fully onboard with high-frequency neural controllers, ensuring real-time dexterity.
NVIDIA's GR00T N1 leverages diffusion-based action generation alongside flexible reasoning, enabling seamless generalization across different robots and contexts.
Google's Gemini Robotics balances cloud-powered embodied reasoning with responsive onboard execution.

This transformative approach enables highly flexible, dexterous, and rapidly adaptable robots.

Special thanks to William Teo for help with the research.

What Matters

Summary of Key Capabilities Across the Three Architectures

Cognitive-Physical Division

All three models separate processing into higher-level reasoning (System 2) and real-time action execution (System 1), mirroring human cognition, but they differ in deployment strategy (cloud vs. onboard).

Deployment Trade-offs

Helix prioritizes onboard integration, minimizing external latency but facing computational constraints.
GR00T N1 emphasizes modularity and flexibility across multiple robot embodiments but can face latency if System 2 runs remotely.
Gemini Robotics optimizes the balance between computational complexity and latency by hosting sophisticated reasoning in the cloud and responsive control locally.

Adaptability and Generalization

All systems emphasize adaptability, but Gemini Robotics and GR00T N1 explicitly leverage extensive datasets for robust generalization, while Helix emphasizes continuous onboard adaptive control and immediate generalization.

Practical Considerations

Helix excels in fully autonomous deployments due to onboard processing.
GR00T N1 is ideal for varied robots and scenarios demanding flexible post-training customization.
Gemini Robotics strikes a strong balance for enterprise and commercial deployments requiring sophisticated reasoning and precise, real-time physical execution.

Figure’s Helix Architecture: "System 1 / System 2" Paradigm

Figure's Helix uses an architecture inspired by human cognition, dividing work between two complementary subsystems:

System 2 (Slow, deliberative reasoning):

Function: Performs high-level visual and linguistic processing, interpreting complex scenes and generating semantic representations.
Architecture: Built on large multimodal models; operates at approximately 7-9 Hz.
Role: Determines task plans, object identification, and broad-scene contextual understanding.
Technical Implementation: Vision-Language backbone running neural networks optimized for generalization rather than speed, emphasizing interpretability and flexibility.

System 1 (Fast, reactive actions):

Function: Executes real-time physical control based on semantic instructions from System 2.
Frequency: Operates at around 200 Hz, allowing highly responsive, continuous motion.
Technical Approach: Utilizes a lightweight and efficient visuomotor control policy to transform semantic-level outputs from System 2 into fluid robot motions.
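
The two rates quoted above (roughly 7-9 Hz for System 2 and about 200 Hz for System 1) imply that the fast policy runs many control ticks on each semantic output before a fresh one arrives. The following is a minimal sketch of that dual-rate pattern, not Figure's implementation: the `Latent` type, the `system2_plan` and `system1_act` functions, the 8 Hz planning rate, and the single-threaded simulated clock are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Latent:
    """Hypothetical semantic output of System 2 (e.g. a task embedding)."""
    step_created: int
    goal: float


def system2_plan(step: int) -> Latent:
    # Stand-in for the slow vision-language model: the "goal" is just a
    # number derived from the tick at which planning happened.
    return Latent(step_created=step, goal=float(step) / 100.0)


def system1_act(latent: Latent, position: float) -> float:
    # Stand-in for the fast visuomotor policy: move a fraction of the
    # way toward the most recent goal on every control tick.
    return position + 0.1 * (latent.goal - position)


def run(total_ticks: int, fast_hz: int = 200, slow_hz: int = 8):
    # Assumed rates: 200 Hz control, 8 Hz planning (inside the 7-9 Hz range).
    ticks_per_plan = fast_hz // slow_hz
    latent = system2_plan(0)
    position = 0.0
    plans = 1
    for tick in range(1, total_ticks + 1):
        if tick % ticks_per_plan == 0:
            latent = system2_plan(tick)  # slow loop refreshes the latent
            plans += 1
        position = system1_act(latent, position)  # fast loop runs every tick
    return plans, position, ticks_per_plan


# One simulated second of control at 200 Hz.
plans, position, ticks_per_plan = run(total_ticks=200)
```

In this sketch the fast loop never waits for the planner; it always acts on the most recent latent it has, which is the property that keeps motion continuous while the slower model is still reasoning.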

Strengths:

Human-like decision-making structure separates slow cognitive tasks from fast reactive execution.
Enables simultaneous high-level reasoning and real-time control, improving adaptability to dynamic environments.

Weaknesses/Limitations:

Potential bottlenecks in communication or integration latency between two separate cognitive layers.
Onboard computational complexity may limit deployment scenarios without optimization.

NVIDIA Isaac GR00T N1: "System 1 / System 2" Hybrid Architecture

NVIDIA's Isaac GR00T N1 also adopts a two-layer cognitive architecture, emphasizing generalized learning and cross-platform deployment.

System 2 (High-level cognitive reasoning):

Foundation Model: Built on NVIDIA-Eagle with SmolLM-1.7B for vision-language interpretation, capable of robust environmental reasoning, understanding complex instructions, and making strategic decisions.
Functionality: General-purpose reasoning, strategic planning, and interpreting multimodal sensory data.
Deployment: Can run in the cloud or on off-board compute, offering scalability.

System 1 (Low-level motor actions):

Design: Diffusion transformer action model for fluid, precise motor execution based on System 2 commands.
Technical Highlights: Continuous action synthesis optimized through transformer-diffusion techniques enabling smooth, precise motor controls.
Adaptability: Designed for cross-embodiment generalization, capable of adapting quickly to different robot hardware configurations and action spaces.
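
To make the idea of generating actions by iterative refinement more concrete, here is a toy sketch that starts from random noise and repeatedly applies a predicted update until it arrives at a short chunk of actions. It is not GR00T N1's model: the learned transformer is replaced by a hand-written `oracle_velocity` function that already knows the target, the update rule is plain Euler integration of a straight-line velocity field (a simpler stand-in for a real diffusion sampler and its noise schedule), and the names, step count, and target values are all hypothetical.

```python
import random


def oracle_velocity(x, target, t):
    # Stand-in for the learned action model. On a straight path from
    # noise to the target, the ideal velocity at time t is
    # (target - x) / (1 - t).
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action_chunk(target, num_steps=4, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = oracle_velocity(x, target, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # one Euler step
    return x


# Hypothetical 1-D action chunk: four consecutive joint targets.
target_chunk = [0.10, 0.15, 0.20, 0.25]
chunk = sample_action_chunk(target_chunk)
```

In a real system the velocity (or noise) prediction would come from a network conditioned on System 2's output and the robot's state, so the refined chunk would reflect the instruction and scene rather than a fixed target.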

Strengths:

Highly generalizable due to rich, diverse synthetic and real training datasets.
Robust motor execution from diffusion-based low-level actions is particularly effective in varied real-world scenarios.

Weaknesses/Limitations:

Potentially high latency if running System 2 remotely.
Generalization depends heavily on data volume and quality, which demands significant upfront data preparation.

Gemini Robotics: Backbone & Decoder Architecture

Gemini Robotics presents a clearly defined division of computational roles explicitly optimized for real-world robotic deployment:

Gemini Robotics Backbone (Cloud-based VLA model):

Design: Built upon Gemini Robotics-ER, a derivative of Gemini 2.0, focusing on advanced embodied reasoning, multimodal comprehension, and spatial understanding.
Latency: Inference latency optimized to under 160 ms, keeping cloud-based responses swift.
Role: Performs advanced reasoning & perception tasks and generates high-level action sequences based on visual and language input.

Gemini Robotics Decoder (Local robot action execution):

Role: Located onboard the robot; converts high-level semantic outputs into real-time physical robot actions.
Performance: Operates with an effective frequency of 50 Hz, sufficiently reactive for real-time manipulation.
Technical Considerations: Compensates for cloud latency, maintaining smooth robotic movements and rapid adaptability.
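
A quick back-of-the-envelope calculation shows what compensating for cloud latency involves, using the 160 ms and 50 Hz figures quoted above. The sketch assumes the decoder keeps a queue of upcoming actions to bridge the gap until the next backbone response; that queueing scheme and the function names are illustrative assumptions, not a description of Google's implementation.

```python
import math


def control_period_ms(control_hz: float) -> float:
    # Time between consecutive actions at the given control rate.
    return 1000.0 / control_hz


def min_chunk_length(backbone_latency_ms: float, control_hz: float) -> int:
    # Number of actions the decoder must have queued so the robot keeps
    # moving while the next backbone response is still in flight.
    return math.ceil(backbone_latency_ms / control_period_ms(control_hz))


period = control_period_ms(50)      # 20 ms between actions at 50 Hz
needed = min_chunk_length(160, 50)  # actions needed to cover 160 ms
```

Under these assumptions the decoder needs about 8 queued actions per backbone response; network jitter on top of inference latency would push that number higher.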

Strengths:

Combines powerful cloud-based cognitive abilities (System 2) with highly efficient local execution (System 1), minimizing onboard computational demands.
Effective integration of advanced semantic reasoning directly with robotic control.

Weaknesses/Limitations:

Dependence on network connectivity for the cloud-hosted backbone could introduce latency variability.
Data privacy and reliability concerns in environments requiring entirely onboard computing without cloud access.

Conclusion

The advanced robotics learning architectures from Gemini Robotics, Figure's Helix, and NVIDIA's Isaac GR00T N1 collectively signal a pivotal shift in robotic intelligence, versatility, and deployment flexibility. The dual-system approach, which separates sophisticated reasoning (System 2) from high-frequency reactive execution (System 1), emerges as a transformative paradigm that significantly enhances robot adaptability, responsiveness, and dexterity.

Figure’s Helix emphasizes fully embedded onboard execution, prioritizing low-latency responsiveness for high-dexterity tasks.
NVIDIA’s Isaac GR00T N1 excels in cross-platform adaptability and smooth, diffusion-driven motor control, supporting extensive customization.
Gemini Robotics uniquely leverages cloud-based embodied reasoning, offering powerful scalability and deep spatial intelligence.

Together, these architectures represent strategic leaps forward, enabling the deployment of general-purpose robots capable of understanding, interacting, and reasoning effectively in dynamic, unstructured environments.
