AI Summary: Introduces the original DALL-E, demonstrating that a 12-billion parameter autoregressive transformer can generate highly creative, zero-shot images from text prompts by treating image patches as language tokens.
Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. By training a discrete variational autoencoder (dVAE) to compress images into a grid of tokens, we treat the image generation process exactly like language modeling. With sufficient data and scale, our approach—which powers DALL-E—is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
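The key idea above — text and image tokens modeled as a single autoregressive stream — can be sketched as follows. This is an illustrative toy, not OpenAI's code: the vocabulary sizes and the hash-based encoder are stand-ins (the paper uses a learned dVAE with an 8192-entry codebook mapping each image to a 32×32 token grid, concatenated after up to 256 BPE text tokens).

```python
TEXT_VOCAB = 1000    # toy BPE vocabulary size (paper: 16,384)
IMAGE_VOCAB = 512    # toy dVAE codebook size (paper: 8,192)

def encode_image_to_tokens(patches, codebook_size=IMAGE_VOCAB):
    """Stand-in for the dVAE encoder: map each image patch to a
    discrete codebook index. A real dVAE learns this mapping."""
    return [hash(tuple(p)) % codebook_size for p in patches]

def build_stream(text_tokens, image_tokens, text_vocab=TEXT_VOCAB):
    """Concatenate text and image tokens into one sequence for an
    autoregressive transformer. Image token ids are offset by the
    text vocabulary size so the two token spaces do not collide."""
    return list(text_tokens) + [text_vocab + t for t in image_tokens]

# A caption followed by a (tiny) image becomes one flat token stream,
# which the transformer then models left-to-right like ordinary text.
caption = [5, 9, 2]                              # toy BPE ids
patches = [[0, 0, 0], [255, 255, 255]]           # toy pixel patches
stream = build_stream(caption, encode_image_to_tokens(patches))
```

At sampling time, the transformer is fed only the text tokens and generates the image tokens one at a time; the dVAE decoder then maps the completed 32×32 grid back to pixels.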