Paper · 2024-10-26

LLaVA-OneVision: Easy Visual Task Transfer

Authors

Bo Li · Yuanhan Zhang · Dong Guo · Renrui Zhang · Feng Li · Hao Zhang · Kaichen Zhang · Peiyuan Zhang · Yanwei Li · Ziwei Liu · Chunyuan Li

ABSTRACT

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations from the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities and scenarios, yielding new emergent capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
