Paper · 2025-04-19

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Authors

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Han Lv, Lijun Wu, Kaipeng Zhang, Huipeng Deng, Jiaye Ge, Kai Chen, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang

ABSTRACT

We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
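The abstract names variable visual position encoding (V2PE) without describing its mechanics. As a rough illustration only, the sketch below assumes the general idea is that text tokens advance the position index by 1 while visual tokens advance it by a smaller fractional step, so long visual sequences consume less of the position range. The function name `variable_positions` and the step value `delta=0.25` are hypothetical choices for this example, not details taken from the abstract.

```python
from typing import List


def variable_positions(is_visual: List[bool], delta: float = 0.25) -> List[float]:
    """Assign a position index to each token in a mixed text/visual sequence.

    Assumption for illustration: the first token sits at position 0, each
    later text token adds 1 to the running position, and each later visual
    token adds `delta` (a fraction smaller than 1).
    """
    positions: List[float] = []
    pos = 0.0
    for i, visual in enumerate(is_visual):
        if i > 0:
            # The increment depends on the type of the current token.
            pos += delta if visual else 1.0
        positions.append(pos)
    return positions


# One text token, four visual tokens, then one text token.
example = variable_positions([False, True, True, True, True, False], delta=0.25)
print(example)
```

With these assumed settings, the four visual tokens together span a single position unit, whereas a uniform increment of 1 would span four.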
