
Quick answer

AI Summary: This paper details the architecture behind DALL-E 2 (unCLIP): a prior maps a text caption to a CLIP image embedding, and a diffusion decoder generates a photorealistic, text-conditional image from that embedding.

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh · Prafulla Dhariwal · Alex Nichol · Casey Chu · Mark Chen

ABSTRACT

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders, based on diffusion models, can produce variations of an image that preserve its semantics and style, and the joint CLIP embedding space enables zero-shot, language-guided image manipulation. We call this system unCLIP; it is the generation stack behind DALL-E 2.
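In equation form, the two-stage design factors the generative model as P(x | y) = P(x | z_i, y) P(z_i | y), where y is the caption, z_i the CLIP image embedding, and x the image. The sketch below illustrates how sampling proceeds under that factorization; the model objects and their sample(...) signatures are hypothetical placeholders for illustration, not the authors' released code.

# Minimal sketch of unCLIP's two-stage sampling (illustrative only).
# The model objects and their call signatures below are hypothetical
# stand-ins, not the paper's released implementation.

from typing import Any

def unclip_generate(caption: str, clip_text_encoder: Any, prior: Any, decoder: Any) -> Any:
    """Sample an image for `caption` via P(x | y) = P(x | z_i, y) * P(z_i | y)."""
    # CLIP text embedding of the caption, used to condition the prior.
    z_t = clip_text_encoder(caption)

    # Stage 1 -- prior: sample a CLIP *image* embedding z_i given the caption.
    # The paper evaluates both an autoregressive and a diffusion prior.
    z_i = prior.sample(text_embedding=z_t, caption=caption)

    # Stage 2 -- decoder: a diffusion model generates an image conditioned on
    # z_i (and optionally the caption); re-sampling this stage with the same
    # z_i yields variations that keep the image's semantics and style.
    image = decoder.sample(image_embedding=z_i, caption=caption)
    return image

Because z_i is generated explicitly rather than inferred implicitly by the decoder, the same embedding can be decoded repeatedly to produce semantics-preserving variations, or manipulated in CLIP space for zero-shot, language-guided edits.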

Review Snapshot

4.6 ★★★★★ (5 ratings)
5 star: 60%
4 star: 40%
3 star: 0%
2 star: 0%
1 star: 0%

100% of reviewers recommend this content.

