Topic: Multimodal Model

Track this topic after sign-in.

Short answer

This page shows the most relevant public items for Multimodal Model, ranked by trend activity and review signal. Use weekly for fast changes, monthly for more stable patterns, and all-time for evergreen picks.

Weekly Monthly All time

← Back to home

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs
Paper • May 11, 2025 • arxiv.org • Shenao Zhang, Zhihan Liu, Boyi Liu, Yufeng Zhang, Yingxiang Yang, Yongfei Liu, Liyu Chen, Tao Sun, Zhaoran Wang
Preference alignment in Large Language Models (LLMs) has significantly improved their ability to adhere to human instructions and intentions. However, existing direct alignment algorithms primarily...
KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks
Paper • Mar 1, 2025 • arxiv.org • Kaijing Ma, Xinrun Du, Yunran Wang, Haoran Zhang, Zhoufutu Wen, Xingwei Qu, Jian Yang, Jiaheng Liu, Minghao Liu, Xiang Yue, Wenhao Huang, Ge Zhang
In this paper, we introduce Knowledge-Orthogonal Reasoning (KOR), a concept aimed at minimizing reliance on domain-specific knowledge, enabling more accurate evaluation of models' reasoning abiliti...
FAN: Fourier Analysis Networks
Paper • Oct 26, 2025 • arxiv.org • Yihong Dong, Ge Li, Yongding Tao, Xue Jiang, Kechi Zhang, Jia Li, Jinliang Deng, Jing Su, Jun Zhang, Jingjing Xu
Despite the remarkable successes of general-purpose neural networks, such as MLPs and Transformers, we find that they exhibit notable shortcomings in modeling and reasoning about periodic phenomena...
Loong: Generating Minute-level Long Videos with Autoregressive Language Models
Paper • Apr 2, 2025 • arxiv.org • Yuqing Wang, Tianwei Xiong, Daquan Zhou, Zhijie Lin, Yang Zhao, Bingyi Kang, Jiashi Feng, Xihui Liu
It is desirable but challenging to generate content-rich long videos in the scale of minutes. Autoregressive large language models (LLMs) have achieved great success in generating coherent and long...
Hyper-Connections
Paper • Mar 18, 2025 • arxiv.org • Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, Xun Zhou
We present hyper-connections, a simple yet effective method that can serve as an alternative to residual connections. This approach specifically addresses common drawbacks observed in residual conn...
MaskBit: Embedding-free Image Generation via Bit Tokens
Paper • Dec 8, 2024 • arxiv.org • Mark Weber, Lijun Yu, Qihang Yu, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen
Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models. Typically comprising two stages - an initial VQGAN model for transitioning...
Merging LoRAs like Playing LEGO: Pushing the Modularity of LoRA to Extremes Through Rank-Wise Clustering
Paper • Oct 22, 2024 • arxiv.org • Ziyu Zhao, Tao Shen, Didi Zhu, Zexi Li, Jing Su, Xuwu Wang, Kun Kuang, Fei Wu
Low-Rank Adaptation (LoRA) has emerged as a popular technique for fine-tuning large language models (LLMs) to various domains due to its modular design and widespread availability on platforms like...
HybridFlow: A Flexible and Efficient RLHF Framework
Paper • Oct 2, 2024 • arxiv.org • Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, Chuan Wu
Reinforcement Learning from Human Feedback (RLHF) is widely used in Large Language Model (LLM) alignment. Traditional RL can be modeled as a dataflow, where each node represents computation of a ne...
Seed-Music: A Unified Framework for High Quality and Controlled Music Generation
Paper • Sep 19, 2024 • arxiv.org • Ye Bai, Haonan Chen, Jitong Chen, Zhuo Chen, Yi Deng, Xiaohong Dong, Lamtharn Hantrakul, Weituo Hao, Qingqing Huang, Zhongyi Huang, Dongya Jia, Feihu La, Duc Le, Bochen Li, Chumin Li, Hui Li, Xingxing Li, Shouda Liu, Wei-Tsung Lu, Yiqing Lu, Andrew Shaw, Janne Spijkervet, Yakun Sun, Bo Wang, Ju-Chiang Wang, Yuping Wang, Yuxuan Wang, Ling Xu, Yifeng Yang, Chao Yao, Shuo Zhang, Yang Zhang, Yilin Zhang, Hang Zhao, Ziyi Zhao, Dejian Zhong, Shicen Zhou, Pei Zou
We introduce Seed-Music, a suite of music generation systems capable of producing high-quality music with fine-grained style control. Our unified framework leverages both auto-regressive language m...
An X-ray Significantly Variable, Luminous, Type 2 Quasar at z = 2.99 with a Massive Host Galaxy
Paper • Sep 3, 2024 • arxiv.org • Xiurui Zhao, Stefano Marchesi, Marco Ajello, Francesca Civano, Roberto Gilli, Giorgio Lanzuisi, Iván E. López, Ross Silver, Nuria Torres-Albà, Peter G. Boorman, Andrealuna Pizzetti
We present a comprehensive X-ray analysis and spectral energy distribution (SED) fitting of WISEA J171419.96+602724.6, an extremely luminous type 2 quasar at $z$ = 2.99. The source was suggested as...
ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
Paper • Apr 2, 2025 • arxiv.org • Borui Wan, Mingji Han, Yiyao Sheng, Yanghua Peng, Haibin Lin, Mofan Zhang, Zhichao Lai, Menghan Yu, Junda Zhang, Zuquan Song, Xin Liu, Chuan Wu
Checkpointing to preserve training states is crucial during the development of Large Foundation Models (LFMs), for training resumption upon various failures or changes in GPU resources and parallel...
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Paper • Jul 28, 2024 • arxiv.org • Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li
Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, their appli...
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model
Paper • Jul 10, 2024 • arxiv.org • Yatai Ji, Shilong Zhang, Jie Wu, Peize Sun, Weifeng Chen, Xuefeng Xiao, Sidi Yang, Yujiu Yang, Ping Luo
The rapid advancement of Large Vision-Language models (LVLMs) has demonstrated a spectrum of emergent capabilities. Nevertheless, current models only focus on the visual content of a single scenari...
Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
Paper • Jul 10, 2024 • arxiv.org • Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li, Xiaoyang Li, Zeyang Li, Zehua Lin, Rui Liu, Shouda Liu, Lu Lu, Yizhou Lu, Jingting Ma, Shengtao Ma, Yulin Pei, Chen Shen, Tian Tan, Xiaogang Tian, Ming Tu, Bo Wang, Hao Wang, Yuping Wang, Yuxuan Wang, Hanzhang Xia, Rui Xia, Shuangyi Xie, Hongmin Xu, Meng Yang, Bihong Zhang, Jun Zhang, Wanyi Zhang, Yang Zhang, Yawei Zhang, Yijie Zheng, Ming Zou
Modern automatic speech recognition (ASR) model is required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc) given the specific contextual informati...
Let the Code LLM Edit Itself When You Edit the Code
Paper • Mar 4, 2025 • arxiv.org • Zhenyu He, Jun Zhang, Shengjie Luo, Jingjing Xu, Zhi Zhang, Di He
In this work, we investigate a typical scenario in code generation where a developer edits existing code in real time and requests a code assistant, e.g., a large language model, to re-predict the ...
SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words
Paper • Jan 16, 2025 • arxiv.org • Junyi Ao, Yuancheng Wang, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu
Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communicat...
Depth Anything V2
Paper • Oct 20, 2024 • arxiv.org • Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao
This work presents Depth Anything V2. Without pursuing fancy techniques, we aim to reveal crucial findings to pave the way towards building a powerful monocular depth estimation model. Notably, com...
Autoregressive Pretraining with Mamba in Vision
Paper • Jun 11, 2024 • arxiv.org • Sucheng Ren, Xianhang Li, Haoqin Tu, Feng Wang, Fangxun Shu, Lei Zhang, Jieru Mei, Linjie Yang, Peng Wang, Heng Wang, Alan Yuille, Cihang Xie
The vision community has started to build with the recently developed state space model, Mamba, as the new backbone for a range of tasks. This paper shows that Mamba's visual capability can be sign...
Towards Semantic Equivalence of Tokenization in Multimodal LLM
Paper • Feb 26, 2025 • arxiv.org • Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in processing vision-language tasks. One of the crux of MLLMs lies in vision tokenization, which involves efficie...
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Paper • Jun 4, 2024 • arxiv.org • Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, Xudong Liu, Yuchen Liu, Zhengxi Liu, Lu Lu, Junjie Pan, Xin Wang, Yuping Wang, Yuxuan Wang, Zhen Wei, Jian Wu, Chao Yao, Yifeng Yang, Yuanhao Yi, Junteng Zhang, Qidi Zhang, Shuo Zhang, Wenjie Zhang, Yang Zhang, Zilin Zhao, Dejian Zhong, Xiaobin Zhuang
We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a fo...

← PreviousPage 6Next →

FAQ

What does this Multimodal Model page rank?

It ranks public content for Multimodal Model using recent discussion, review, and engagement signals so you can triage faster. This guidance is specific to Multimodal Model topic page on Attendemia and is written so it still makes sense without reading other sections on the page.

How should I use weekly vs monthly vs all-time?

Use weekly for fast-moving updates, monthly for stable trend confirmation, and all-time for evergreen references. This guidance is specific to Multimodal Model topic page on Attendemia and is written so it still makes sense without reading other sections on the page.

How can I discover organizations active in Multimodal Model?

Use the linked entities section to jump to labs, companies, and experts connected to this topic and explore their timelines. This guidance is specific to Multimodal Model topic page on Attendemia and is written so it still makes sense without reading other sections on the page.

Can I follow this topic for updates?

Yes. Use the follow button on this page to subscribe and track new high-signal activity. This guidance is specific to Multimodal Model topic page on Attendemia and is written so it still makes sense without reading other sections on the page.

Topic: Multimodal Model

Short answer

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

FAN: Fourier Analysis Networks

Loong: Generating Minute-level Long Videos with Autoregressive Language Models

Hyper-Connections

MaskBit: Embedding-free Image Generation via Bit Tokens

Merging LoRAs like Playing LEGO: Pushing the Modularity of LoRA to Extremes Through Rank-Wise Clustering

HybridFlow: A Flexible and Efficient RLHF Framework

Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

An X-ray Significantly Variable, Luminous, Type 2 Quasar at z = 2.99 with a Massive Host Galaxy

ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model

Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

Let the Code LLM Edit Itself When You Edit the Code

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

Depth Anything V2

Autoregressive Pretraining with Mamba in Vision

Towards Semantic Equivalence of Tokenization in Multimodal LLM

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Top Entities In This Topic

Related Topics

FAQ

What does this Multimodal Model page rank?

How should I use weekly vs monthly vs all-time?

How can I discover organizations active in Multimodal Model?

Can I follow this topic for updates?