Topic: Multimodal


Short answer

This page shows the most relevant public items for Multimodal, ranked by trend activity and review signals. Use the weekly view for fast-moving changes, the monthly view for more stable patterns, and the all-time view for evergreen picks.

Weekly · Monthly · All time


  1. Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

    Paper · May 11, 2025 · arxiv.org · Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, Weilin Huang

    Recent progress in unified models for image understanding and generation has been impressive, yet most approaches remain limited to single-modal generation conditioned on multiple modalities. In th...

  2. MMaDA: Multimodal Large Diffusion Language Models

    Paper · Sep 25, 2025 · arxiv.org · Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang

    We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and ...

  3. VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

    Paper · Jan 6, 2025 · arxiv.org · Jing Liu, Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang

    In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation. Different from widely-studied vision-language pretraining m...

  4. Explore the Limits of Omni-modal Pretraining at Scale

    Paper · Jun 13, 2024 · arxiv.org · Yiyuan Zhang, Handong Li, Jing Liu, Xiangyu Yue

    We propose to build omni-modal intelligence, capable of understanding any modality and learning universal representations. Specifically, we propose a scalable pretraining paradigm, named Mu...

  5. OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces

    Paper · Jul 16, 2024 · arxiv.org · Zehan Wang, Ziang Zhang, Hang Zhang, Luping Liu, Rongjie Huang, Xize Cheng, Hengshuang Zhao, Zhou Zhao

    Recently, human-computer interaction with various modalities has shown promising applications, like GPT-4o and Gemini. Given the foundational role of multimodal joint representation in understandin...

  6. OmniBench: Towards The Future of Universal Omni-Language Models

    Paper · Dec 29, 2025 · arxiv.org · Yizhi Li, Yinghao Ma, Ge Zhang, Ruibin Yuan, Kang Zhu, Hangyu Guo, Yiming Liang, Jiaheng Liu, Zekun Wang, Jian Yang, Siwei Wu, Xingwei Qu, Jinjie Shi, Xinyue Zhang, Zhenzhu Yang, Yidan Wen, Yanghai Wang, Shihao Li, Zhaoxiang Zhang, Zachary Liu, Emmanouil Benetos, Wenhao Huang, Chenghua Lin

    Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process a...

  7. Baichuan-Omni Technical Report

    Paper · Dec 27, 2024 · arxiv.org · Yadong Li, Haoze Sun, Mingan Lin, Tianpeng Li, Guosheng Dong, Tao Zhang, Bowen Ding, Wei Song, Zhenglin Cheng, Yuqi Huo, Song Chen, Xu Li, Da Pan, Shusen Zhang, Xin Wu, Zheng Liang, Jun Liu, Keer Lu, Yaqi Zhao, Yanjun Shen, Fan Yang, Kaicheng Yu, Tao Lin, Jianhua Xu, Zenan Zhou, Weipeng Chen

    The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. In this paper...

  8. OMCAT: Omni Context Aware Transformer

    Paper · Oct 15, 2024 · arxiv.org · Arushi Goel, Karan Sapra, Matthieu Le, Rafael Valle, Andrew Tao, Bryan Catanzaro

    Large Language Models (LLMs) have made significant strides in text generation and comprehension, with recent advancements extending into multimodal LLMs that integrate visual and audio inputs. Howe...

  9. From Specific-MLLMs to Omni-MLLMs: A Survey on MLLMs Aligned with Multi-modalities

    Paper · Mar 4, 2025 · arxiv.org · Shixin Jiang, Jiafeng Liang, Jiyuan Wang, Xuan Dong, Heng Chang, Weijiang Yu, Jinhua Du, Ming Liu, Bing Qin

    To tackle complex tasks in real-world scenarios, more researchers are focusing on Omni-MLLMs, which aim to achieve omni-modal understanding and generation. Beyond the constraints of any specific no...

  10. Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback

    Paper · Dec 30, 2024 · arxiv.org · Jiaming Ji, Jiayi Zhou, Hantao Lou, Boyuan Chen, Donghai Hong, Xuyao Wang, Wenqi Chen, Kaile Wang, Rui Pan, Jiahao Li, Mohan Wang, Josef Dai, Tianyi Qiu, Hua Xu, Dong Li, Weipeng Chen, Jun Song, Bo Zheng, Yaodong Yang

    Reinforcement learning from human feedback (RLHF) has proven effective in enhancing the instruction-following capabilities of large language models; however, it remains underexplored in the cross-m...

  11. WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

    Paper · May 26, 2025 · arxiv.org · Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie

    We introduce WorldSense, the first benchmark for multi-modal video understanding that simultaneously encompasses visual, audio, and text inputs. In contrast to existing benchmarks, our W...

  12. Ola: Pushing the Frontiers of Omni-Modal Language Model

    Paper · Jun 2, 2025 · arxiv.org · Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao

    Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-s...

  13. OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference

    Paper · Mar 1, 2025 · arxiv.org · Xiangyu Zhao, Shengyuan Ding, Zicheng Zhang, Haian Huang, Maosong Cao, Weiyun Wang, Jiaqi Wang, Xinyu Fang, Wenhai Wang, Guangtao Zhai, Haodong Duan, Hua Yang, Kai Chen

    Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. ...

  14. Baichuan-Omni-1.5 Technical Report

    Paper · Jan 26, 2025 · arxiv.org · Yadong Li, Jun Liu, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, Chong Li, Yuanbo Fang, Dongdong Kuang, Mingrui Wang, Chenglin Zhu, Youwei Zhang, Hongyu Guo, Fengyu Zhang, Yuran Wang, Bowen Ding, Wei Song, Xu Li, Yuqi Huo, Zheng Liang, Shusen Zhang, Xin Wu, Shuai Zhao, Linchu Xiong, Yozhen Wu, Jiahui Ye, Wenhao Lu, Bowen Li, Yan Zhang, Yaqi Zhou, Xin Chen, Lei Su, Hongda Zhang, Fuzhong Chen, Xuezhen Dong, Na Nie, Zhiying Wu, Bin Xiao, Ting Li, Shunya Dang, Ping Zhang, Yijia Sun, Jincheng Wu, Jinjie Yang, Xionghai Lin, Zhi Ma, Kegeng Wu, Jia li, Aiyuan Yang, Hui Liu, Jianqiang Zhang, Xiaoxi Chen, Guangwei Ai, Wentao Zhang, Yicong Chen, Xiaoqin Huang, Kun Li, Wenjing Luo, Yifei Duan, Lingling Zhu, Ran Xiao, Zhe Su, Jiani Pu, Dian Wang, Xu Jia, Tianyu Zhang, Mengyu Ai, Mang Wang, Yujing Qiao, Lei Zhang, Yanjun Shen, Fan Yang, Miao Zhen, Yijie Zhou, Mingyang Chen, Fei Li, Chenzheng Zhu, Keer Lu, Yaqi Zhao, Hao Liang, Youquan Li, Yanzhao Qin, Linzhuang Sun, Jianhua Xu, Haoze Sun, Mingan Lin, Zenan Zhou, Weipeng Chen

    We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-qu...

  15. PAVE: Patching and Adapting Video Large Language Models

    Paper · Mar 25, 2025 · arxiv.org · Zhuoming Liu, Yiquan Li, Khoi Duc Nguyen, Yiwu Zhong, Yin Li

    Pre-trained video large language models (Video LLMs) exhibit remarkable reasoning capabilities, yet adapting these models to new tasks involving additional modalities or data types (e.g., audio or ...

  16. Qwen2.5-Omni Technical Report

    Paper · Mar 26, 2025 · arxiv.org · Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, Junyang Lin

    In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and n...

  17. Ming-Omni: A Unified Multimodal Model for Perception and Generation

    Paper · Jun 11, 2025 · arxiv.org · Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jun Peng, Kaixiang Ji, Kaiyou Song, Kaimeng Ren, Libin Wang, Lixiang Ru, Lele Xie, Longhua Tan, Lyuxin Xue, Lan Wang, Mochen Bai, Ning Gao, Pei Chen, Qingpei Guo, Qinglong Zhang, Qiang Xu, Rui Liu, Ruijie Xiong, Sirui Gao, Tinghao Liu, Taisong Li, Weilong Chai, Xinyu Xiao, Xiaomei Wang, Xiaoxue Chen, Xiao Lu, Xiaoyu Li, Xingning Dong, Xuzheng Yu, Yi Yuan, Yuting Gao, Yunxiao Sun, Yipeng Chen, Yifei Wu, Yongjie Lyu, Ziping Ma, Zipeng Feng, Zhijiang Fang, Zhihao Qiu, Ziyuan Huang, Zhengyu He

    We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs ...


Top Entities In This Topic

Related Topics

FAQ

What does this Multimodal page rank?

It ranks public content for Multimodal using recent discussion, review, and engagement signals so you can triage faster. This guidance applies specifically to the Multimodal topic page on Attendemia.
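The exact ranking formula is not published; as a rough mental model, though, a weighted blend of normalized signals matches the behavior described above. The sketch below is purely illustrative: the Item fields, the score function, and the weights are all assumptions, not Attendemia's actual implementation.

    from dataclasses import dataclass

    @dataclass
    class Item:
        title: str
        discussion: float  # hypothetical: recent discussion volume, normalized to [0, 1]
        reviews: float     # hypothetical: review activity, normalized to [0, 1]
        engagement: float  # hypothetical: clicks/follows, normalized to [0, 1]

    def score(item: Item, w=(0.4, 0.35, 0.25)) -> float:
        # Illustrative weighted blend; the real signals and weights are unknown.
        return w[0] * item.discussion + w[1] * item.reviews + w[2] * item.engagement

    items = [Item("Paper A", 0.9, 0.6, 0.7), Item("Paper B", 0.5, 0.9, 0.8)]
    ranked = sorted(items, key=score, reverse=True)  # highest score first

Switching the timeframe (weekly, monthly, all-time) would then amount to computing these signals over a different window before scoring.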

How should I use weekly vs monthly vs all-time?

Use weekly for fast-moving updates, monthly for stable trend confirmation, and all-time for evergreen references.

How can I discover organizations active in Multimodal?

Use the linked entities section to jump to labs, companies, and experts connected to this topic and explore their timelines.

Can I follow this topic for updates?

Yes. Use the follow button on this page to subscribe and track new high-signal activity.