AI Summary: Details Video PreTraining (VPT), a breakthrough method that uses an Inverse Dynamics Model to label 70,000 hours of YouTube videos, training an AI to craft a diamond pickaxe in Minecraft.
Training agents to perform complex, long-horizon tasks typically requires massive amounts of heavily annotated data or prohibitive amounts of reinforcement learning trial-and-error. We introduce Video PreTraining (VPT), a paradigm that leverages the vast quantity of unlabeled video available on the internet. We gather a small dataset of human gameplay with recorded keypresses to train an Inverse Dynamics Model (IDM). This IDM is then used to retroactively label 70,000 hours of unannotated YouTube Minecraft videos with predicted actions. We train an agent via behavioral cloning on this massive dataset. Our VPT agent successfully learns to chop trees, craft tools, and create a diamond pickaxe in Minecraft—a task requiring 20,000 sequential actions.
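The pipeline above can be sketched in miniature. This is a toy illustration under stated assumptions, not the authors' implementation: observations are integers, the IDM is a lookup trained by majority vote over transitions, and behavioral cloning is likewise a majority-vote policy. All function names here (`train_idm`, `pseudo_label`, `behavioral_clone`) are hypothetical.

```python
# Toy sketch of the VPT recipe (hypothetical names, not the paper's code):
# 1) train an Inverse Dynamics Model (IDM) on a small labeled dataset,
# 2) pseudo-label a large unlabeled video corpus with the IDM,
# 3) behaviorally clone a policy from the pseudo-labeled data.
from collections import Counter, defaultdict

def train_idm(labeled):
    """Map each observation transition (o_t, o_next) to its most common action."""
    votes = defaultdict(Counter)
    for o_t, o_next, action in labeled:
        votes[(o_t, o_next)][action] += 1
    return {k: c.most_common(1)[0][0] for k, c in votes.items()}

def pseudo_label(idm, videos):
    """Label every transition in the unlabeled videos with the IDM's prediction."""
    data = []
    for obs_seq in videos:
        for o_t, o_next in zip(obs_seq, obs_seq[1:]):
            if (o_t, o_next) in idm:
                data.append((o_t, idm[(o_t, o_next)]))
    return data

def behavioral_clone(data):
    """Fit a policy pi(obs) -> action by majority vote over pseudo-labels."""
    votes = defaultdict(Counter)
    for o_t, action in data:
        votes[o_t][action] += 1
    return {o: c.most_common(1)[0][0] for o, c in votes.items()}

# Small labeled "contractor" dataset: (obs, next_obs, recorded keypress)
labeled = [(0, 1, "forward"), (1, 2, "forward"), (2, 2, "attack")]
idm = train_idm(labeled)
videos = [[0, 1, 2, 2], [1, 2, 2]]  # unlabeled observation sequences
policy = behavioral_clone(pseudo_label(idm, videos))
print(policy[0])  # -> "forward"
```

The key design point mirrors the paper's argument: the IDM is a much easier learning problem than the policy because it sees both past and future frames, so a small labeled set suffices to unlock the large unlabeled corpus for imitation.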