Inspired by the success of unsupervised representation learning in natural language processing with models like GPT-2, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to autoregressively predict pixels, without incorporating any prior knowledge of the 2D spatial structure of images. Despite operating on low-resolution sequences of pixels, our model, Image GPT (iGPT), discovers highly robust semantic representations. When linear probes or fine-tuning are applied to these learned representations, iGPT achieves state-of-the-art performance on low-data classification benchmarks and generates highly coherent, novel image completions.
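The setup described above can be sketched in a few lines: an image is flattened into a 1D raster-order pixel sequence, and the sequence is shifted by one position so that each step predicts the next pixel. This is a minimal illustration of the sequence construction only (the Transformer itself is omitted), and the `start_token` begin-of-image symbol is a hypothetical choice, not taken from the paper.

```python
def image_to_sequence(image):
    """Flatten a 2D grid of pixel values into a 1D raster-order sequence,
    imposing no 2D spatial prior -- the model sees only a flat sequence."""
    return [px for row in image for px in row]

def autoregressive_pairs(seq, start_token=256):
    """Shift the sequence by one so position t is trained to predict pixel t.
    `start_token` is a hypothetical begin-of-image symbol chosen outside the
    0-255 pixel vocabulary (an assumption for this sketch)."""
    inputs = [start_token] + seq[:-1]
    targets = seq
    return inputs, targets

image = [[0, 17], [255, 128]]      # tiny 2x2 grayscale image
seq = image_to_sequence(image)     # -> [0, 17, 255, 128]
inputs, targets = autoregressive_pairs(seq)
# inputs:  [256, 0, 17, 255]  (what the model conditions on at each step)
# targets: [0, 17, 255, 128]  (the next pixel it must predict)
```

A sequence Transformer trained with a next-token loss on such pairs is, at this level of abstraction, identical to a language model trained on text: only the vocabulary (pixel intensities instead of word tokens) differs.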