RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan · Noah Brown · Justice Carbajal · Yevgen Chebotar · Xi Chen · Krzysztof Choromanski · Tianli Ding · Danny Driess · Avinava Dubey · Chelsea Finn · Keerthana Gopalakrishnan · Karol Hausman · Alex Irpan, et al.

Google DeepMind

ABSTRACT

We introduce Robotic Transformer 2 (RT-2), a novel Vision-Language-Action (VLA) model that learns from both vast web datasets and specialized robotics data. We show that high-capacity vision-language models (VLMs) can be directly fine-tuned to output low-level robotic actions by representing physical actions as text tokens. This co-training approach allows RT-2 to absorb rich semantic concepts, reasoning capabilities, and visual understanding from the internet and transfer them directly into embodied robotic control. RT-2 demonstrates emergent robotic capabilities not present in the robot training data, such as semantic reasoning, symbol understanding, and the ability to recognize and interact with novel objects following abstract human instructions.
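The central mechanism, representing low-level actions as text tokens, can be made concrete with a small sketch. The Python snippet below clips each dimension of a continuous end-effector action to a fixed range, discretizes it into 256 uniform bins (the bin count the paper describes; the dimension names, ranges, and function names here are illustrative assumptions, not the paper's exact scheme), and serializes the bin indices as a plain integer string that the VLM can emit like any other text.

import numpy as np

NUM_BINS = 256  # number of discrete bins per action dimension

# Per-dimension ranges; these values are illustrative assumptions,
# not the exact ranges used in the paper.
RANGES = {
    "dx": (-0.1, 0.1), "dy": (-0.1, 0.1), "dz": (-0.1, 0.1),           # meters
    "droll": (-0.5, 0.5), "dpitch": (-0.5, 0.5), "dyaw": (-0.5, 0.5),  # radians
    "gripper": (0.0, 1.0),                                             # open fraction
}

def discretize(value, low, high):
    # Map a continuous value in [low, high] to a bin index in [0, NUM_BINS - 1].
    frac = (np.clip(value, low, high) - low) / (high - low)
    return int(round(frac * (NUM_BINS - 1)))

def undiscretize(bin_idx, low, high):
    # Recover the quantization level that a bin index stands for.
    return low + (bin_idx / (NUM_BINS - 1)) * (high - low)

def action_to_token_string(action):
    # Render one robot action as the space-separated integer string the
    # fine-tuned VLM is trained to emit, led by a terminate flag
    # (0 = continue the episode, 1 = stop).
    bins = [discretize(action[k], *RANGES[k]) for k in RANGES]
    return " ".join(str(b) for b in [int(action["terminate"])] + bins)

def token_string_to_action(text):
    # Parse the model's text output back into a continuous action that a
    # low-level controller can execute.
    terminate, *bins = (int(p) for p in text.split())
    action = {k: undiscretize(b, *RANGES[k]) for k, b in zip(RANGES, bins)}
    action["terminate"] = bool(terminate)
    return action

Because the resulting action string is ordinary text, fine-tuning needs no architectural changes to the VLM: the model learns to produce these integer tokens the same way it produces any other answer, which is what lets web-scale pretraining transfer into control.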
