Quick answer
AI Summary: Details the architecture behind DALL-E 2, using a diffusion decoder and a prior that translates text embeddings into CLIP image latents to achieve photorealistic, text-conditional image generation.
Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders, based on diffusion models, also produce variations of an image that preserve its semantics and style, and enable zero-shot, language-guided image manipulation. We call this full system unCLIP; it powers DALL-E 2.
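The two-stage design above can be sketched as a pipeline: a text encoder maps the caption to a CLIP text embedding, a prior translates that into a CLIP image embedding, and a decoder renders an image from the image embedding. The following is a minimal, purely illustrative sketch; every function here is a toy stand-in (the real components are large neural networks), and all names are assumptions, not the authors' API.

```python
# Hypothetical sketch of unCLIP's two-stage pipeline -- NOT the authors' code.
# All class/function names and the toy math are illustrative assumptions.
import random

EMBED_DIM = 4  # toy dimensionality; real CLIP embeddings are much larger


def clip_text_encoder(caption: str) -> list[float]:
    # Stand-in for a real CLIP text encoder: a deterministic toy embedding
    # seeded by the caption's characters.
    rng = random.Random(sum(map(ord, caption)))
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def prior(text_embedding: list[float]) -> list[float]:
    # Stage 1: translate a CLIP *text* embedding into a CLIP *image* embedding.
    # unCLIP trains an autoregressive or diffusion prior for this step;
    # here it is a toy affine map.
    return [0.5 * x + 0.1 for x in text_embedding]


def diffusion_decoder(image_embedding: list[float]) -> list[list[float]]:
    # Stage 2: generate an image conditioned on the image embedding.
    # unCLIP uses a diffusion decoder; here we just reshape the embedding
    # into a tiny 2x2 grid of non-negative "pixel" values.
    half = EMBED_DIM // 2
    return [
        [abs(x) for x in image_embedding[:half]],
        [abs(x) for x in image_embedding[half:]],
    ]


def generate(caption: str) -> list[list[float]]:
    # Full text-conditional generation: caption -> text emb -> image emb -> image.
    return diffusion_decoder(prior(clip_text_encoder(caption)))


image = generate("a corgi playing a trumpet")
```

Because the prior produces an explicit image embedding, the same decoder can be re-run on that embedding to sample diverse images sharing one semantic description, which is the diversity benefit the abstract describes.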