Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
Paper • Jun 3, 2024 • arxiv.org • Yang Jin, Zhicheng Sun, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu
In light of recent advances in multimodal Large Language Models (LLMs), there is growing interest in scaling them from image-text data to more informative real-world videos. Compared to static ...