
Quick answer

AI Summary: Describes Video PreTraining (VPT), a method that trains an Inverse Dynamics Model on a small labeled dataset, uses it to label 70,000 hours of unlabeled YouTube Minecraft videos with predicted actions, and then trains an agent via behavioral cloning that can craft a diamond pickaxe in Minecraft.

Video PreTraining (VPT): Learning to Act by Watching Unlabeled Video

Bowen Baker · Ilge Akkaya · Peter Zhokhov · Joost Huizinga · Jie Tang · Adrien Ecoffet · Brandon Houghton · Raul Sampedro · Jeff Clune

ABSTRACT

Training agents to perform complex, long-horizon tasks typically requires massive amounts of heavily annotated data or prohibitive amounts of reinforcement learning trial-and-error. We introduce Video PreTraining (VPT), a paradigm that leverages the vast quantity of unlabeled video available on the internet. We gather a small dataset of human gameplay with recorded keypresses to train an Inverse Dynamics Model (IDM). This IDM is then used to retroactively label 70,000 hours of unannotated YouTube Minecraft videos with predicted actions. We train an agent via behavioral cloning on this massive dataset. Our VPT agent successfully learns to chop trees, craft tools, and create a diamond pickaxe in Minecraft—a task requiring 20,000 sequential actions.
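The pipeline the abstract describes has three stages: train an IDM on a small dataset with recorded keypresses, use it to pseudo-label a large unlabeled video corpus, then behaviorally clone a policy on the pseudo-labeled data. A minimal toy sketch of that flow (all names and the frequency-count "models" here are illustrative stand-ins; the actual VPT models are large neural networks over raw pixels):

```python
from collections import Counter, defaultdict

def train_idm(labeled_clips):
    """Inverse Dynamics Model: predict the action taken between two frames.
    Trained on a small dataset where keypresses were recorded.
    Toy version: a majority-vote table keyed on (frame_t, frame_t+1)."""
    votes = defaultdict(Counter)
    for frames, actions in labeled_clips:
        for t, action in enumerate(actions):
            votes[(frames[t], frames[t + 1])][action] += 1
    return {pair: c.most_common(1)[0][0] for pair, c in votes.items()}

def pseudo_label(idm, frames):
    """Retroactively label an unlabeled video with predicted actions."""
    return [idm.get((frames[t], frames[t + 1]), "noop")
            for t in range(len(frames) - 1)]

def behavioral_clone(pseudo_labeled):
    """Behavioral-cloning policy: predict the action from the current frame
    alone (causal), unlike the IDM, which also sees the following frame."""
    votes = defaultdict(Counter)
    for frames, actions in pseudo_labeled:
        for t, action in enumerate(actions):
            votes[frames[t]][action] += 1
    return {frame: c.most_common(1)[0][0] for frame, c in votes.items()}

# Small labeled dataset with recorded actions.
labeled = [(["tree", "log", "planks"], ["chop", "craft"])]
idm = train_idm(labeled)

# Large unlabeled "YouTube" corpus, labeled retroactively by the IDM.
unlabeled = [["tree", "log", "planks"], ["tree", "log"]]
corpus = [(frames, pseudo_label(idm, frames)) for frames in unlabeled]

policy = behavioral_clone(corpus)
print(policy["tree"])  # -> chop
```

The key asymmetry this illustrates: the IDM is a non-causal model (it sees frames before and after each action), which makes it much easier to train from a small dataset than a causal policy, and its labels can then be scaled across the much larger unlabeled corpus.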

Review Snapshot

4.6 ★ average from 5 ratings (5★: 60%, 4★: 40%, 3★: 0%, 2★: 0%, 1★: 0%). 100% recommend this content.

