AI Summary: Introduces CLIP, a multimodal neural network that learns visual concepts efficiently from natural-language supervision, enabling strong zero-shot image classification and image-text retrieval without task-specific training data.
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories, restricting their generality. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch. We leverage a dataset of 400 million (image, text) pairs collected from the internet to train Contrastive Language-Image Pre-training (CLIP). After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks.
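The pre-training task of matching captions to images can be made concrete as a symmetric contrastive loss over a batch of (image, text) pairs: each image's own caption is the positive, and every other caption in the batch serves as a negative. Zero-shot transfer then reuses the same similarity score, embedding class names through the text encoder and assigning each image to the nearest class embedding. The following is a minimal sketch in PyTorch, not the paper's implementation: the embedding dimension, fixed temperature, and random tensors standing in for encoder outputs are all illustrative assumptions (the paper, for instance, learns the temperature during training).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of (image, text) pairs.

    image_emb, text_emb: [n, d] outputs of separate encoders. The
    matching pair shares a row index; all other in-batch pairings
    act as negatives. Temperature is a fixed illustrative value here.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [n, n] similarity matrix between every image and every caption.
    logits = image_emb @ text_emb.t() / temperature

    # The correct caption for image i is text i, and vice versa.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)     # image -> text
    loss_texts = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_images + loss_texts) / 2

def zero_shot_classify(image_emb, class_text_emb):
    """Assign each image to the class prompt with the highest cosine similarity."""
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    return (image_emb @ class_text_emb.t()).argmax(dim=-1)

# Toy usage: random tensors stand in for real encoder outputs.
n, d, num_classes = 8, 512, 5
loss = clip_contrastive_loss(torch.randn(n, d), torch.randn(n, d))
preds = zero_shot_classify(torch.randn(n, d), torch.randn(num_classes, d))
print(loss.item(), preds.tolist())
```

The symmetric form (averaging the image-to-text and text-to-image losses) ensures both encoders are trained to agree in a shared embedding space, which is what lets natural-language prompts later act as classifier weights without any fine-tuning.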