Topic: Multimodal


Short answer

This page shows the most relevant public items for Multimodal, ranked by trend activity and review signal. Use the weekly view for fast changes, the monthly view for more stable patterns, and the all-time view for evergreen picks.



MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper • Apr 18, 2024 • arxiv.org • Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful an...
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Paper • Mar 15, 2024 • arxiv.org • Xiaohan Wang, Yuhui Zhang, Orr Zohar, Serena Yeung-Levy
Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive pro...
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
Paper • Aug 14, 2024 • arxiv.org • Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Chenting Wang, Guo Chen, Baoqi Pei, Ziang Yan, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our core desi...
LongVLM: Efficient Long Video Understanding via Large Language Models
Paper • Jul 20, 2024 • arxiv.org • Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, Bohan Zhuang
Empowered by Large Language Models (LLMs), recent advancements in Video-based LLMs (VideoLLMs) have driven progress in various video understanding tasks. These models encode video representations t...
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
Paper • Apr 4, 2024 • arxiv.org • Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, Mohamed Elhoseiny
This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding. The model is capable of processing both temporal visual and textual data...
Koala: Key frame-conditioned long video-LLM
Paper • May 3, 2024 • arxiv.org • Reuben Tan, Ximeng Sun, Ping Hu, Jui-hsien Wang, Hanieh Deilamsalehy, Bryan A. Plummer, Bryan Russell, Kate Saenko
Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships. State-of-the-art video Large Language Model...
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Paper • Apr 24, 2024 • arxiv.org • Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim
With the success of large language models (LLMs), integrating the vision model into LLMs to build vision-language foundation models has gained much more interest recently. However, existing LLM-bas...
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Paper • Apr 8, 2024 • arxiv.org • Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with u...
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
Paper • Jun 3, 2024 • arxiv.org • Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun
The burgeoning interest in developing Large Language Models (LLMs) with up to trillion parameters has been met with concerns regarding resource efficiency and practical expense, particularly given ...
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Paper • Apr 11, 2024 • arxiv.org • Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang
While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it poses certain limitations: constrained by the ...
AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception
Paper • Jul 24, 2024 • arxiv.org • Yipo Huang, Xiangfei Sheng, Zhichao Yang, Quan Yuan, Zhichao Duan, Pengfei Chen, Leida Li, Weisi Lin, Guangming Shi
The highly abstract nature of image aesthetics perception (IAP) poses significant challenge for current multimodal large language models (MLLMs). The lack of human-annotated multi-modality aestheti...
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Paper • Apr 16, 2024 • arxiv.org • Yuchi Wang, Shuhuai Ren, Rundong Gao, Linli Yao, Qingyan Guo, Kaikai An, Jianhong Bai, Xu Sun
Diffusion models have exhibited remarkable capabilities in text-to-image generation. However, their performance in image-to-text generation, specifically image captioning, has lagged behind Auto-Re...
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
Paper • Apr 18, 2024 • arxiv.org • Reka Team, Aitor Ormazabal, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu, Zhihui Xie
We introduce Reka Core, Flash, and Edge, a series of powerful multimodal language models trained from scratch by Reka. Reka models are able to process and reason with text, images, video, and audio...
MoVA: Adapting Mixture of Vision Experts to Multimodal Context
Paper • Oct 31, 2024 • arxiv.org • Zhuofan Zong, Bingqi Ma, Dazhong Shen, Guanglu Song, Hao Shao, Dongzhi Jiang, Hongsheng Li, Yu Liu
As the key component in multimodal large language models (MLLMs), the ability of the visual encoder greatly affects MLLM's understanding on diverse image content. Although some large-scale pretrain...
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Paper • Apr 19, 2024 • arxiv.org • Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, Xiaojuan Qi
We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability. Beyond holistic image understanding, Groma is adept at region-level tasks such...
Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs
Paper • May 22, 2024 • arxiv.org • Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Multimodal LLMs are the natural evolution of LLMs, and enlarge their capabilities so as to work beyond the pure textual modality. As research is being carried out to design novel architectures and ...
MANTIS: Interleaved Multi-Image Instruction Tuning
Paper • Nov 15, 2024 • arxiv.org • Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, Wenhu Chen
Large multimodal models (LMMs) have shown great results in single-image vision language tasks. However, their abilities to solve multi-image visual language tasks is yet to be improved. The existin...
What matters when building vision-language models?
Paper • May 3, 2024 • arxiv.org • Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh
The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we obser...
ImageInWords: Unlocking Hyper-Detailed Image Descriptions
Paper • Oct 28, 2024 • arxiv.org • Roopal Garg, Andrea Burns, Burcu Karagol Ayan, Yonatan Bitton, Ceslee Montgomery, Yasumasa Onoe, Andrew Bunner, Ranjay Krishna, Jason Baldridge, Radu Soricut
Despite the longstanding adage "an image is worth a thousand words," generating accurate hyper-detailed image descriptions remains unsolved. Trained on short web-scraped image text, vision-...
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
Paper • May 9, 2024 • arxiv.org • Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen
Recent advancements in Multimodal Large Language Models (LLMs) have focused primarily on scaling by increasing text-image pair data and enhancing LLMs to improve performance on multimodal tasks. Ho...


FAQ

What does this Multimodal page rank?

It ranks public content for Multimodal using recent discussion, review, and engagement signals so you can triage faster. This guidance is specific to the Multimodal topic page on Attendemia.

How should I use weekly vs monthly vs all-time?

Use weekly for fast-moving updates, monthly for stable trend confirmation, and all-time for evergreen references.

How can I discover organizations active in Multimodal?

Use the linked entities section to jump to labs, companies, and experts connected to this topic and explore their timelines.

Can I follow this topic for updates?

Yes. Use the follow button on this page to subscribe and track new high-signal activity.


