Quick answer

Paper2023-06-15•Source ↗•10 attns8,872 checkouts

Claim

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Authors

Discuss with Grok

Junnan Li·

Dongxu Li·

Silvio Savarese·

Steven Hoi

ABSTRACT

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.

#machine-learning 📋 Awesome List: multimodal #multimodal #deep-learning

Review Snapshot

Explore ratings

0.0

★★★★★

0 ratings

5 star

4 star

3 star

2 star

1 star

Recommendation

recommend this content.

Review this content

Share your opinion to help other learners triage faster.

Write a review

Invite a reviewer

Invite someone by email to share an invited review for BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models.

Author Inquiries

Public questions about this content. Attendemia will route your question to the author. Vote on the most important ones. No guarantee of response.

Post an inquiry

Sort by: Most helpful