NVIDIA just announced GR00T N1, the world's first open-source physical AI foundation model, along with a simulation dataset of 300K+ trajectories!
GR00T N1 is a 2.2B-parameter VLA model designed to support cross-embodiment, showing a ??.?% success rate on cross-embodiment tasks.
Here is the architecture:
System 1: a diffusion-based transformer action model running at 120 Hz for real-time, closed-loop motor control, using flow matching and cross-attention layers.
System 2: a Vision-Language Model (Eagle-2) running at 10 Hz, responsible for high-level reasoning and task interpretation (runs on an NVIDIA L40 GPU).
The architecture reminds me of Figure's Helix.
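To make the dual-system idea concrete, here is a minimal sketch of the two-rate loop: a slow "System 2" VLM refreshes a latent plan at ~10 Hz while a fast "System 1" action model runs every tick at 120 Hz. All function names and the 12:1 rate ratio are my own illustration, not NVIDIA's API.

```python
# Hypothetical sketch of a dual-rate System 1 / System 2 control loop.
# system2_vlm and system1_action_model are stand-ins, not real GR00T code.

def system2_vlm(observation, instruction):
    """Stand-in for the Eagle-2 VLM: returns a latent task embedding."""
    return {"obs": observation, "goal": instruction}

def system1_action_model(latent, proprio):
    """Stand-in for the diffusion-transformer head: returns a motor command."""
    return (latent["goal"], proprio)

def control_loop(steps, instruction):
    commands = []
    latent = None
    for t in range(steps):   # assume the loop ticks at 120 Hz
        if t % 12 == 0:      # System 2 refreshes 12x slower -> ~10 Hz
            latent = system2_vlm(observation=f"frame_{t}", instruction=instruction)
        commands.append(system1_action_model(latent, proprio=t))
    return commands
```

The point of the split is that the fast controller never blocks on the slow VLM; it keeps acting on the most recent latent plan between refreshes.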
For action generation, it uses a diffusion transformer with flow matching, using alternating self- and cross-attention blocks with adaptive layer normalization for denoising.
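As a toy illustration of flow-matching action generation (mine, not from the paper): sampling starts from Gaussian noise and integrates a velocity field toward the target action chunk over a few Euler steps. Here the learned network is replaced by the analytic velocity for a known target, v(x, t) = (x_target - x) / (1 - t), just to show the shape of the sampling loop.

```python
# Toy flow-matching sampler: integrate a velocity field from noise to a
# target action chunk. The "learned" velocity is replaced by the analytic
# one for a known endpoint, so this only demonstrates the integration loop.
import random

def denoise_action(x_target, num_steps=16, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in x_target]  # action chunk starts as noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                                # t stays strictly below 1
        v = [(xt - xi) / (1.0 - t) for xt, xi in zip(x_target, x)]  # velocity
        x = [xi + dt * vi for xi, vi in zip(x, v)]                  # Euler step
    return x
```

With the analytic velocity, the final Euler step lands exactly on the target; in the real model a transformer predicts v from noisy actions, vision-language features, and proprioception.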
I want to emphasize the data approach.
To address what NVIDIA researchers call the 'data island' problem (data fragmentation across embodiments), they came up with a data pyramid, organizing heterogeneous sources by scale.
The pyramid is made of three layers: base, middle, and top.
Base: Large quantities of web data and human videos
Middle: Synthetic data generated with physics simulations and neural models
Top: Real-world data collected on physical robot hardware
To implement this data pyramid, they:
Co-trained the entire pyramid during both the pre-training and post-training phases
Latent-action learning to utilize action-less data sources
Inverse dynamics models (IDM) to infer pseudo-actions from videos without explicit action data
These training methods allow them to integrate each 'data island' seamlessly across embodiments, connecting different sensors and physical configurations.
I am so glad to see a state-of-the-art VLA model open-sourced, and to get a grasp of the approaches physical AI models are taking.
I believe GR00T's impact on the humanoid world could be what LeNet-5 was for AlexNet: the foundation a breakthrough builds on.
NVIDIA's significant contribution brings well-established, pretrained models to robotics that researchers are now able to build on top of or fine-tune with just an NVIDIA A6000 GPU, compared to the 50,000 H100 GPUs required for pre-training.
We're one step closer to passing Steve Wozniak's coffee test!
Congrats and thanks to Jim Fan, Yuke Zhu, Scott Reed, Zhenjia Xu, Guanzhi Wang, Yu Fang, Soroush Nasiriany, Joel Jang, Fengyuan Hu, Avnish Narayan, and the rest of GR00T team!
Check out the publication: https://lnkd.in/g4um2P4f
I post my latest insightful reads in robotics, follow me to stay updated!