Topic: Multimodal

Short answer

This page shows the most relevant public items for Multimodal, ranked by trend activity and review signals. Use the weekly view for fast-moving changes, the monthly view for more stable patterns, and the all-time view for evergreen picks.

Weekly · Monthly · All time

  1. Multimodal Chain-of-Thought Reasoning in Language Models

    Paper · May 20, 2024 · arxiv.org · Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola

    Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infe...
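
    The core proposal here is a two-stage framework that separates rationale generation from answer inference, with vision features fused into both stages. A minimal sketch of that control flow, assuming hypothetical `rationale_model` and `answer_model` callables (stand-ins, not the authors' released code):

    ```python
    # Sketch of two-stage multimodal chain-of-thought inference. Both model
    # callables are hypothetical stand-ins for fine-tuned seq2seq models.

    def multimodal_cot(question: str, context: str, image_features,
                       rationale_model, answer_model) -> str:
        # Stage 1: generate a rationale from the language input + vision features.
        rationale = rationale_model(
            text=f"{question}\n{context}", image=image_features)
        # Stage 2: infer the answer conditioned on the generated rationale.
        answer = answer_model(
            text=f"{question}\n{context}\nRationale: {rationale}",
            image=image_features)
        return answer
    ```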

  2. Visual Instruction Tuning

    Paper · Dec 11, 2023 · arxiv.org · Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

    Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal ...
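
    "Machine-generated instruction-following data" here means a strong text-only model is prompted with image annotations (captions, boxes) to write conversations about the image, which then supervise the multimodal model. A hedged sketch of what one training record might look like (field names are illustrative, not necessarily the released schema):

    ```python
    # Illustrative visual-instruction-tuning record; the exact schema of the
    # released data may differ.
    sample = {
        "image": "coco/train2017/000000123456.jpg",  # hypothetical path
        "conversations": [
            {"from": "human", "value": "<image>\nWhat is unusual about this scene?"},
            {"from": "gpt", "value": "A man is ironing clothes on the roof of a moving taxi."},
        ],
    }
    ```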

  3. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Paper · Mar 29, 2024 · arxiv.org · Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou

    Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. I...

  4. VPGTrans: Transfer Visual Prompt Generator across LLMs

    Paper · Oct 24, 2023 · arxiv.org · Ao Zhang, Hao Fei, Yuan Yao, Wei Ji, Li Li, Zhiyuan Liu, Tat-Seng Chua

    While developing a new multimodal LLM (MLLM) by pre-training on tremendous image-text pairs from scratch can be exceedingly resource-consuming, connecting an existing LLM with a comparatively light...
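
    The idea the abstract describes: a visual prompt generator (VPG) maps image features into soft prompts in the LLM's embedding space, and much of it can be reused when switching LLMs, retraining mainly a light projector. A minimal PyTorch sketch under those assumptions (module names and shapes are illustrative):

    ```python
    import torch
    import torch.nn as nn

    class VisualPromptGenerator(nn.Module):
        """Maps vision-encoder features to soft prompts for a frozen LLM."""

        def __init__(self, vision_dim: int, llm_dim: int, num_prompts: int = 32):
            super().__init__()
            # Reusable part: learned queries that pool vision features.
            self.queries = nn.Parameter(torch.randn(num_prompts, vision_dim))
            self.pool = nn.MultiheadAttention(vision_dim, num_heads=8,
                                              batch_first=True)
            # LLM-specific part: a light projector into the target LLM's
            # embedding dimension; this is the piece retrained on transfer.
            self.projector = nn.Linear(vision_dim, llm_dim)

        def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
            q = self.queries.expand(vision_feats.size(0), -1, -1)
            pooled, _ = self.pool(q, vision_feats, vision_feats)
            return self.projector(pooled)  # (batch, num_prompts, llm_dim)

    # Transfer to a new LLM: keep the pooling weights, swap/retrain the projector.
    ```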

  5. MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

    Paper · Jun 13, 2023 · arxiv.org · Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, Kai Chen

    We present a vision and language model named MultiModal-GPT to conduct multi-round dialogue with humans. MultiModal-GPT can follow various instructions from humans, such as generating a detailed ca...

  6. M³IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning

    Paper · Jun 8, 2023 · arxiv.org · Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, Lingpeng Kong, Qi Liu

    Instruction tuning has significantly advanced large language models (LLMs) such as ChatGPT, enabling them to align with human instructions across diverse tasks. However, progress in open vision-lan...
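
    For a dataset like this, each instance pairs a task instruction with an image and a target output, with instructions and answers available across many languages. A hypothetical record layout (illustrative only, not the dataset's actual schema):

    ```python
    # Hypothetical multilingual instruction-tuning instance.
    record = {
        "task": "vqa",                                # one of many task types
        "lang": "zh",                                 # language of this instance
        "instruction": "请回答关于这张图片的问题。",     # "Answer the question about this image."
        "image": "images/0001.jpg",                   # hypothetical path
        "input": "图中有几只猫？",                      # "How many cats are in the picture?"
        "output": "两只。",                            # "Two."
    }
    ```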

  7. Kosmos-2: Grounding Multimodal Large Language Models to the World

    Paper · Jul 13, 2023 · arxiv.org · Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei

    We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifica...
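
    Grounding here means tying text spans to image regions by serializing boxes into discrete location tokens inside the text stream. A hedged sketch of that representation, assuming a 32×32 grid of location tokens (the token strings below are illustrative, not necessarily the model's exact vocabulary):

    ```python
    def box_to_location_tokens(x0: float, y0: float, x1: float, y1: float,
                               grid: int = 32) -> str:
        """Serialize a normalized box (coords in 0..1) as two grid-cell tokens:
        the top-left and bottom-right patch indices on a grid x grid layout."""
        def cell(x: float, y: float) -> int:
            col = min(int(x * grid), grid - 1)
            row = min(int(y * grid), grid - 1)
            return row * grid + col

        return f"<patch_{cell(x0, y0):04d}><patch_{cell(x1, y1):04d}>"

    # A grounded span might then be serialized inline with the text, e.g.:
    # "<phrase>a snowman</phrase><object>" + box_to_location_tokens(...) + "</object>"
    print(box_to_location_tokens(0.1, 0.2, 0.5, 0.9))  # <patch_0195><patch_0912>
    ```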

  8. SVIT: Scaling up Visual Instruction Tuning

    Paper · Dec 28, 2023 · arxiv.org · Bo Zhao, Boya Wu, Muyang He, Tiejun Huang

    Thanks to the emergence of foundation models, large language and vision models have been integrated to acquire multimodal abilities such as visual captioning, question answering, etc. Although existing m...

  9. Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages

    Paper · Mar 22, 2024 · arxiv.org · Jinyi Hu, Yuan Yao, Chongyi Wang, Shan Wang, Yinxu Pan, Qianyu Chen, Tianyu Yu, Hanghao Wu, Yue Zhao, Haoye Zhang, Xu Han, Yankai Lin, Jiao Xue, Dahai Li, Zhiyuan Liu, Maosong Sun

    Recently there has been a significant surge in multimodal learning in terms of both image-to-text and text-to-image generation. However, the success is typically limited to English, leaving other l...

  10. MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

    Paper · Mar 20, 2024 · arxiv.org · Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang

    Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity. However, while LLMs can utilize extensive backg...
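
    Multi-modal in-context learning means the prompt itself interleaves several (image, text) exemplars before the query. A minimal sketch of assembling such a prompt (the image-placeholder convention is illustrative):

    ```python
    # Build an interleaved few-shot prompt; each exemplar contributes an image
    # placeholder plus its text, followed by the query image and question.
    def build_icl_prompt(exemplars, question: str) -> tuple[str, list]:
        parts, images = [], []
        for img, text in exemplars:
            images.append(img)
            parts.append(f"<image{len(images)}> {text}")
        images.append("query.jpg")  # hypothetical query image
        parts.append(f"<image{len(images)}> Question: {question} Answer:")
        return "\n".join(parts), images

    prompt, imgs = build_icl_prompt(
        [("a.jpg", "Two dogs playing. Count of animals: 2"),
         ("b.jpg", "A cat on a sofa. Count of animals: 1")],
        "How many animals are in the picture?")
    print(prompt)
    ```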

  11. Aligning Large Multimodal Models with Factually Augmented RLHF

    Paper · Sep 25, 2023 · arxiv.org · Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, Trevor Darrell

    Large Multimodal Models (LMM) are built across modalities and the misalignment between two modalities can result in "hallucination", generating textual outputs that are not grounded by the ...
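
    The fix the title points at: during RLHF, the reward model is given extra ground-truth context (e.g., reference captions) so hallucinated details score poorly. A hedged sketch of that reward call (the `reward_model` interface is a hypothetical stand-in):

    ```python
    def factually_augmented_reward(prompt: str, response: str,
                                   facts: list[str], reward_model) -> float:
        # Append ground-truth facts (e.g., human captions for the image) to the
        # reward model's input so unsupported claims in `response` get penalized.
        augmented = (f"{prompt}\n\nReference facts:\n" + "\n".join(facts)
                     + f"\n\nResponse to rate:\n{response}")
        return reward_model(augmented)
    ```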

  12. Improved Baselines with Visual Instruction Tuning

    Paper · May 15, 2024 · arxiv.org · Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

    Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA ...
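
    The connector in question maps frozen vision-encoder features into the LLM's embedding space; this follow-up found a small MLP works well there. A minimal PyTorch sketch of such a two-layer connector (dimensions are illustrative, not the paper's exact configuration):

    ```python
    import torch.nn as nn

    # Two-layer MLP connector from vision features to LLM embedding space,
    # in the spirit of the cross-modal connector described above.
    connector = nn.Sequential(
        nn.Linear(1024, 4096),  # e.g., ViT feature dim -> LLM hidden dim
        nn.GELU(),
        nn.Linear(4096, 4096),
    )
    # visual_tokens = connector(vision_feats)  # then prepend to text embeddings
    ```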

  13. Ferret: Refer and Ground Anything Anywhere at Any Granularity

    Paper · Oct 11, 2023 · arxiv.org · Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang

    We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary des...

  14. MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

    Paper · Nov 7, 2023 · arxiv.org · Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny

    Large language models have shown their remarkable capabilities as a general interface for various language-related applications. Motivated by this, we target to build a unified interface for comple...
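
    A "unified interface" here typically means one model handles many vision-language tasks, distinguished by a task-identifier token in the prompt. A hedged sketch of such templating (the identifier strings and placeholder are illustrative, not necessarily the model's exact tokens):

    ```python
    # Prompt templates keyed by a task identifier; one model serves all tasks.
    TASK_TEMPLATES = {
        "vqa":       "[vqa] {image} Question: {text} Short answer:",
        "grounding": "[grounding] {image} Describe the image and ground each object.",
        "caption":   "[caption] {image} Briefly describe the image.",
    }

    def build_prompt(task: str, text: str = "") -> str:
        return TASK_TEMPLATES[task].format(image="<ImageHere>", text=text)

    print(build_prompt("vqa", "What color is the car?"))
    ```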

Page 6

Top Entities In This Topic

Related Topics

FAQ

What does this Multimodal page rank?

It ranks public content for Multimodal using recent discussion, review, and engagement signals so you can triage faster.

How should I use weekly vs monthly vs all-time?

Use weekly for fast-moving updates, monthly for stable trend confirmation, and all-time for evergreen references.

How can I discover organizations active in Multimodal?

Use the linked entities section to jump to labs, companies, and experts connected to this topic and explore their timelines.

Can I follow this topic for updates?

Yes. Use the follow button on this page to subscribe and track new high-signal activity.