RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
We introduce Robotic Transformer 2 (RT-2), a novel Vision-Language-Action (VLA) model that learns from both vast web datasets and specialized robotics data. We show that high-capacity vision-language models (VLMs) can be directly fine-tuned to output low-level robotic actions by representing physical actions as text tokens. This co-training approach allows RT-2 to absorb rich semantic concepts, reasoning capabilities, and visual understanding from the internet and transfer them directly into embodied robotic control. RT-2 demonstrates emergent robotic capabilities not present in the robot training data, such as semantic reasoning, symbol understanding, and the ability to recognize and interact with novel objects following abstract human instructions.
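To make the "physical actions as text tokens" idea concrete, below is a minimal sketch of one plausible encoding, assuming each continuous action dimension is discretized into 256 uniform bins and written out as space-separated integer tokens, in the spirit of the scheme the abstract describes. The dimension count, per-dimension bounds, and function names here are illustrative assumptions, not RT-2's exact specification.

```python
import numpy as np

# Assumption: 256 uniform bins per action dimension, consistent with the
# discretized action-token scheme described for RT-2 (exact details may differ).
NUM_BINS = 256

def encode_action(action, low, high):
    """Map a continuous action vector to a space-separated token string."""
    action = np.clip(action, low, high)
    # Normalize each dimension to [0, 1], then quantize to bins 0..255.
    bins = np.round((action - low) / (high - low) * (NUM_BINS - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def decode_action(token_str, low, high):
    """Invert the encoding: token string back to approximate continuous values."""
    bins = np.array([int(t) for t in token_str.split()], dtype=float)
    return low + bins / (NUM_BINS - 1) * (high - low)

# Hypothetical 7-D end-effector action (xyz delta, rpy delta, gripper),
# with illustrative bounds, not the paper's actual values.
low = np.array([-0.1, -0.1, -0.1, -0.5, -0.5, -0.5, 0.0])
high = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 1.0])
action = np.array([0.02, -0.05, 0.0, 0.1, 0.0, -0.2, 1.0])

tokens = encode_action(action, low, high)   # a short string of integer tokens
recovered = decode_action(tokens, low, high)
```

The appeal of such a representation is that the action string is ordinary text, so a pretrained vision-language model can emit it with its existing tokenizer and output head, and web-scale fine-tuning carries over to control without architectural changes.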