Quick answer
AI Summary: Introduces Sora, a text-conditional diffusion transformer that generates up to a minute of high-fidelity video, with results suggesting that scaling such models yields 3D consistency, object permanence, and plausible temporal dynamics.
We explore the large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of highly variable durations, resolutions, and aspect ratios. We leverage a transformer architecture operating on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a full minute of high-fidelity video. Our findings suggest that scaling video generation models is a promising path towards building general-purpose simulators of the physical world, capturing complex 3D consistency, object permanence, and temporal dynamics.
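The idea of a transformer operating on spacetime patches can be illustrated with a minimal NumPy sketch. It is not Sora's implementation: the function name `to_spacetime_patches`, the `(T, H, W, C)` latent layout, and the patch sizes are assumptions chosen for illustration. The sketch only shows how latents of different durations and aspect ratios can become token sequences of different lengths but the same token width, with an image treated as a single-frame video.

```python
import numpy as np

def to_spacetime_patches(latent, pt, ph, pw):
    """Split a latent of shape (T, H, W, C) into flattened spacetime patches.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, pt * ph * pw * C).
    """
    T, H, W, C = latent.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    # Split each axis into (number of patches, extent of one patch).
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Move the patch-grid axes to the front and the within-patch axes to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per patch: the token sequence a transformer would consume.
    return x.reshape(-1, pt * ph * pw * C)

# Assumed latent shapes: an 8-frame widescreen video and a single square image.
video = np.random.rand(8, 32, 48, 4).astype(np.float32)
image = np.random.rand(1, 32, 32, 4).astype(np.float32)

video_tokens = to_spacetime_patches(video, pt=1, ph=4, pw=4)
image_tokens = to_spacetime_patches(image, pt=1, ph=4, pw=4)

print(video_tokens.shape)  # (768, 64)
print(image_tokens.shape)  # (64, 64)
```

Both inputs yield tokens of width 64, while the sequence length (768 versus 64) follows the duration and resolution of the input. A temporal patch extent greater than one (`pt > 1`) would group several latent frames into each token.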