
Quick answer

AI Summary: Introduces the original DALL-E, demonstrating that a 12-billion parameter autoregressive transformer can generate highly creative, zero-shot images from text prompts by treating discrete image tokens like language tokens.


Zero-Shot Text-to-Image Generation

Aditya Ramesh · Mikhail Pavlov · Gabriel Goh · Scott Gray · Chelsea Voss · Alec Radford · Mark Chen · Ilya Sutskever

ABSTRACT

Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. By training a discrete variational autoencoder (dVAE) to compress images into a grid of tokens, we treat the image generation process exactly like language modeling. With sufficient data and scale, our approach—which powers DALL-E—is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
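
To make the single-stream formulation concrete, here is a minimal Python sketch of how a caption and an image might be packed into one token sequence for ordinary next-token prediction. The function names (text_tokenize, dvae_encode, build_stream) and the stand-in tokenizer logic are hypothetical placeholders rather than the authors' code; the 8192-entry codebook, 32x32 token grid, and 256 text tokens follow the figures reported in the paper.

```python
# Minimal sketch of the single-stream token setup described in the abstract.
# dvae_encode, text_tokenize, and build_stream are hypothetical stand-ins,
# not the authors' implementation.

import numpy as np

TEXT_VOCAB = 16384      # assumed text BPE vocabulary size
IMAGE_VOCAB = 8192      # dVAE codebook size, per the paper
GRID = 32               # 32x32 grid of image tokens (1024 per image), per the paper
MAX_TEXT_LEN = 256      # text tokens per caption, per the paper

def text_tokenize(caption: str) -> np.ndarray:
    """Stand-in for a BPE tokenizer: maps bytes to ids (illustrative only)."""
    ids = np.frombuffer(caption.encode("utf-8"), dtype=np.uint8).astype(np.int64)
    return ids[:MAX_TEXT_LEN] % TEXT_VOCAB

def dvae_encode(image: np.ndarray) -> np.ndarray:
    """Stand-in for a trained dVAE encoder: an image becomes a 32x32 grid of
    discrete codebook indices. Here we only fake indices of the right shape."""
    rng = np.random.default_rng(0)
    return rng.integers(0, IMAGE_VOCAB, size=(GRID, GRID), dtype=np.int64)

def build_stream(caption: str, image: np.ndarray) -> np.ndarray:
    """Concatenate text and image tokens into one sequence. Image ids are
    offset by TEXT_VOCAB so the two vocabularies share one embedding table."""
    text_ids = text_tokenize(caption)
    image_ids = dvae_encode(image).reshape(-1) + TEXT_VOCAB
    return np.concatenate([text_ids, image_ids])

# A transformer is then trained with next-token prediction over this stream;
# at sampling time the caption tokens are fixed, the 1024 image tokens are
# generated autoregressively, and the dVAE decoder turns them into pixels.
stream = build_stream("an armchair in the shape of an avocado",
                      np.zeros((256, 256, 3), dtype=np.uint8))
print(stream.shape)   # (caption_length + 1024,)
```
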

Review Snapshot


4.6 out of 5 (5 ratings)
5 star: 60%
4 star: 40%
3 star: 0%
2 star: 0%
1 star: 0%

Recommendation: 100% recommend this content.


Author Inquiries

Public questions about this content. Attendemia routes each question to the author, and readers can vote for the ones that matter most. A response is not guaranteed.