Topic: Awesome List: multimodal


Short answer

This page lists the most relevant public items for Awesome List: multimodal, ranked by trend activity and review signals. Use the weekly view for fast-moving changes, the monthly view for more stable patterns, and the all-time view for evergreen picks.



  1. LOVA3: Learning to Visual Question Answering, Asking and Assessment

    Paper · Feb 19, 2025 · arxiv.org · Henry Hengyuan Zhao, Pan Zhou, Difei Gao, Zechen Bai, Mike Zheng Shou

    Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge. By enhancing these capabilities, humans can more effectively ut...

  2. ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

    Paper · May 24, 2024 · arxiv.org · Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, Jun Song, Shiji Song, Gao Huang, Bo Zheng

    High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity. Current high-resolution LMMs address the quadratic complexity whi...

  3. X-VILA: Cross-Modality Alignment for Large Language Model

    Paper · May 29, 2024 · arxiv.org · Hanrong Ye, De-An Huang, Yao Lu, Zhiding Yu, Wei Ping, Andrew Tao, Jan Kautz, Song Han, Dan Xu, Pavlo Molchanov, Hongxu Yin

    We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities. By aligning modality-specific en...

  4. OLIVE: Object Level In-Context Visual Embeddings

    Paper · Jun 2, 2024 · arxiv.org · Timothy Ossowski, Junjie Hu

    Recent generalist vision-language models (VLMs) have demonstrated impressive reasoning capabilities across diverse multimodal tasks. However, these models still struggle with fine-grained object-le...

  5. PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM

    Paper · Nov 26, 2024 · arxiv.org · Tao Yang, Yingmin Luo, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen

    Layout generation is the keystone in achieving automated graphic design, requiring arranging the position and size of various multi-modal design elements in a visually pleasing and constraint-follo...

  6. Wings: Learning Multimodal LLMs without Text-only Forgetting

    Paper · Jun 5, 2024 · arxiv.org · Yi-Kai Zhang, Shiyin Lu, Yang Li, Yanqing Ma, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye

    Multimodal large language models (MLLMs), initiated with a trained LLM, first align images with text and then fine-tune on multimodal mixed inputs. However, the MLLM catastrophically forgets the te...

  7. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Paper · Oct 30, 2024 · arxiv.org · Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing

    In this paper, we present the VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks...

  8. 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

    Paper · Jun 14, 2024 · arxiv.org · Roman Bachmann, Oğuzhan Fatih Kar, David Mizrahi, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, Amir Zamir

    Current multimodal and multitask foundation models like 4M or UnifiedIO show promising results, but in practice their out-of-the-box abilities to accept diverse inputs and perform diverse tasks are...

  9. Generative Visual Instruction Tuning

    Paper · Oct 2, 2024 · arxiv.org · Jefferson Hernandez, Ruben Villegas, Vicente Ordonez

    We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model with additional support for generative and image editing tasks...

  10. Long Context Transfer from Language to Vision

    Paper · Jul 1, 2024 · arxiv.org · Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, Ziwei Liu

    Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. Many works address this by reducing the number of...

  11. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

    Paper · Dec 4, 2024 · arxiv.org · Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang, Rob Fergus, Yann LeCun, Saining Xie

    We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for visi...

  12. TokenPacker: Efficient Visual Projector for Multimodal LLM

    Paper · Aug 28, 2024 · arxiv.org · Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, Lei Zhang

    The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs adopt a simple MLP to preserve all visu...

  13. InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

    Paper · Jul 3, 2024 · arxiv.org · Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, Songyang Zhang, Wenwei Zhang, Yining Li, Yang Gao, Peng Sun, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Hang Yan, Conghui He, Xingcheng Zhang, Kai Chen, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang

    We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and compositi...

  14. LLaVA-OneVision: Easy Visual Task Transfer

    Paper · Oct 26, 2024 · arxiv.org · Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li

    We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our ...

  15. xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

    Paper · Sep 16, 2025 · arxiv.org · Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Shaoyen Tseng, Gustavo A Lujan-Moreno, Matthew L Olson, Musashi Hinck, David Cobbley, Vasudev Lal, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu

    This paper introduces BLIP-3, an open framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a r...

  16. POINTS: Improving Your Vision-language Model with Affordable Strategies

    Paper · Nov 5, 2024 · arxiv.org · Yuan Liu, Zhongyin Zhao, Ziyuan Zhuang, Le Tian, Xiao Zhou, Jie Zhou

    In recent years, vision-language models have made significant strides, excelling in tasks like optical character recognition and geometric problem-solving. However, several critical issues remain: ...


Top Entities In This Topic

Related Topics

FAQ

What does this Awesome List: multimodal page rank?

It ranks public content for Awesome List: multimodal using recent discussion, review, and engagement signals so you can triage faster.

How should I use weekly vs monthly vs all-time?

Use weekly for fast-moving updates, monthly for stable trend confirmation, and all-time for evergreen references.

How can I discover organizations active in Awesome List: multimodal?

Use the linked entities section to jump to labs, companies, and experts connected to this topic and explore their timelines.

Can I follow this topic for updates?

Yes. Use the follow button on this page to subscribe and track new high-signal activity.