Topic: Multimodal

Short answer

This page shows the most relevant public items for Multimodal, ranked by trend activity and review signals. Use the weekly view for fast-moving changes, the monthly view for more stable patterns, and the all-time view for evergreen picks.

  1. xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

    Paper · Sep 16, 2025 · arxiv.org · Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Shaoyen Tseng, Gustavo A Lujan-Moreno, Matthew L Olson, Musashi Hinck, David Cobbley, Vasudev Lal, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu

    This paper introduces BLIP-3, an open framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a r...

  2. POINTS: Improving Your Vision-language Model with Affordable Strategies

    Paper · Nov 5, 2024 · arxiv.org · Yuan Liu, Zhongyin Zhao, Ziyuan Zhuang, Le Tian, Xiao Zhou, Jie Zhou

    In recent years, vision-language models have made significant strides, excelling in tasks like optical character recognition and geometric problem-solving. However, several critical issues remain: ...

  3. EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

    Paper · Mar 20, 2025 · arxiv.org · Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Jun Yao, Lanqing Hong, Lu Hou, Hang Xu

    GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to percei...

  4. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Paper · Aug 1, 2025 · arxiv.org · Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li

    The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternati...

  5. Personalized Visual Instruction Tuning

    Paper · Oct 9, 2024 · arxiv.org · Renjie Pi, Jianshu Zhang, Tianyang Han, Jipeng Zhang, Rui Pan, Tong Zhang

    Recent advancements in multimodal large language models (MLLMs) have demonstrated significant progress; however, these models exhibit a notable limitation, which we refer to as "face blindness"...

  6. xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

    Paper · Jun 9, 2025 · arxiv.org · Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Jongwoo Park, Kanchana Ranasinghe, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles

    We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video takes advantage o...

  7. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Paper · Sep 26, 2025 · arxiv.org · Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yiming Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Jiaye Ge, Kai Chen, Kaipeng Zhang, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang

    We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancement...

  8. CompCap: Improving Multimodal Large Language Models with Composite Captions

    Paper · Dec 6, 2024 · arxiv.org · Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab, Aashu Singh, Qifan Wang, David Yang, ShengYun Peng, Hanchao Yu, Shen Yan, Xuewen Zhang, Baosheng He

    How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters,...

  9. Apollo: An Exploration of Video Understanding in Large Multimodal Models

    Paper · Dec 13, 2024 · arxiv.org · Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia

    Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequentl...

  10. Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

    Paper · Nov 3, 2025 · arxiv.org · Haobo Yuan, Xiangtai Li, Tao Zhang, Yueyi Sun, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang

    This work presents Sa2VA, the first comprehensive, unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limit...

  11. InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

    Paper · Jul 13, 2025 · arxiv.org · Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, Min Dou, Kai Chen, Wenhai Wang, Yu Qiao, Yali Wang, Limin Wang

    This paper aims to improve the performance of video multimodal large language models (MLLM) via long and rich context (LRC) modeling. As a result, we develop a new version of InternVideo2.5 with a ...

  12. M-LLM Based Video Frame Selection for Efficient Video Understanding

    Paper · Mar 26, 2025 · arxiv.org · Kai Hu, Feng Gao, Xiaohan Nie, Peng Zhou, Son Tran, Tal Neiman, Lingyun Wang, Mubarak Shah, Raffay Hamid, Bing Yin, Trishul Chilimbi

    Recent advances in Multi-Modal Large Language Models (M-LLMs) show promising results in video reasoning. Popular Multi-Modal Large Language Model (M-LLM) frameworks usually apply naive uniform samp...

  13. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Paper · Apr 19, 2025 · arxiv.org · Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Han Lv, Lijun Wu, Kaipeng Zhang, Huipeng Deng, Jiaye Ge, Kai Chen, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang

    We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a mult...

Top Entities In This Topic

Related Topics

FAQ

What does this Multimodal page rank?

It ranks public content for Multimodal using recent discussion, review, and engagement signals so you can triage faster.

How should I use weekly vs monthly vs all-time?

Use weekly for fast-moving updates, monthly for stable trend confirmation, and all-time for evergreen references.

How can I discover organizations active in Multimodal?

Use the linked entities section to jump to labs, companies, and experts connected to this topic and explore their timelines.

Can I follow this topic for updates?

Yes. Use the follow button on this page to subscribe and track new high-signal activity.