Quick answer

AI Summary: Introduces CLIP, a multimodal neural network that efficiently learns visual concepts from natural language supervision, enabling zero-shot image classification and image-text retrieval.

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford · Jong Wook Kim · Chris Hallacy · Aditya Ramesh · Gabriel Goh · Sandhini Agarwal · Girish Sastry · Amanda Askell · Pamela Mishkin · Jack Clark · Gretchen Krueger · Ilya Sutskever

ABSTRACT

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories, restricting their generality. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn state-of-the-art image representations from scratch. We leverage a dataset of 400 million (image, text) pairs collected from the internet to train Contrastive Language-Image Pre-training (CLIP). After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks.
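
The pre-training task described above, predicting which caption goes with which image, amounts to a symmetric contrastive objective over a batch of paired image and text embeddings. The sketch below is a minimal PyTorch rendering of that idea, not the authors' training code: the encoders are abstracted away as random feature tensors, and the fixed temperature of 0.07 is an illustrative stand-in for the paper's learned temperature parameter.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # L2-normalize so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [batch, batch] similarity matrix; row i scores image i against every caption.
    logits = image_features @ text_features.t() / temperature

    # The correct pairing is the diagonal: image i goes with caption i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: pick the right caption for each image,
    # and the right image for each caption, then average the two losses.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2

# Random features stand in for the outputs of the image and text encoders.
image_features = torch.randn(8, 512)
text_features = torch.randn(8, 512)
print(clip_contrastive_loss(image_features, text_features))

The same machinery gives the zero-shot transfer mentioned in the abstract: at inference, class names are rendered as candidate captions (e.g. "a photo of a dog"), embedded with the text encoder, and the class whose caption embedding is most similar to the image embedding is predicted, so no fixed label set is baked into the model.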

Review Snapshot

4.6 / 5 stars (5 ratings)
5 star: 60%
4 star: 40%
3 star: 0%
2 star: 0%
1 star: 0%

Recommendation

100% recommend this content.

Author Inquiries

Public questions about this content. Attendemia will route your question to the author; vote on the most important ones. A response is not guaranteed.