RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
We introduce Robotic Transformer 2 (RT-2), a novel Vision-Language-Action (VLA) model that learns from both vast web datasets and specialized robotics data. We show that high-capacity vision-language models (VLMs) can be directly fine-tuned to output low-level robotic actions by representing physical actions as text tokens. This co-training approach allows RT-2 to absorb rich semantic concepts, reasoning capabilities, and visual understanding from the internet and transfer them directly into embodied robotic control. RT-2 demonstrates emergent robotic capabilities not present in the robot training data, such as semantic reasoning, symbol understanding, and the ability to recognize and interact with novel objects following abstract human instructions.
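To make the "physical actions as text tokens" idea concrete, below is a minimal sketch of one plausible encoding, assuming each continuous action dimension is discretized into 256 uniform bins and written out as space-separated integer tokens, in the spirit of the scheme the abstract describes. The dimension count, per-dimension bounds, and function names here are illustrative assumptions, not RT-2's exact specification.

```python
import numpy as np

# Assumption: 256 uniform bins per action dimension, consistent with the
# discretized action-token scheme described for RT-2 (exact details may differ).
NUM_BINS = 256

def encode_action(action, low, high):
    """Map a continuous action vector to a space-separated token string."""
    action = np.clip(action, low, high)
    # Normalize each dimension to [0, 1], then quantize to bins 0..255.
    bins = np.round((action - low) / (high - low) * (NUM_BINS - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def decode_action(token_str, low, high):
    """Invert the encoding: token string back to approximate continuous values."""
    bins = np.array([int(t) for t in token_str.split()], dtype=float)
    return low + bins / (NUM_BINS - 1) * (high - low)

# Hypothetical 7-D end-effector action (xyz delta, rpy delta, gripper),
# with illustrative bounds, not the paper's actual values.
low = np.array([-0.1, -0.1, -0.1, -0.5, -0.5, -0.5, 0.0])
high = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 1.0])
action = np.array([0.02, -0.05, 0.0, 0.1, 0.0, -0.2, 1.0])

tokens = encode_action(action, low, high)   # a short string of integer tokens
recovered = decode_action(tokens, low, high)
```

The appeal of such a representation is that the action string is ordinary text, so a pretrained vision-language model can emit it with its existing tokenizer and output head, and web-scale fine-tuning carries over to control without architectural changes.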