Topic: Multimodal Model


Short answer

This page shows the most relevant public items for Multimodal Model, ranked by trend activity and review signal. Use the weekly view for fast-moving changes, the monthly view for more stable patterns, and the all-time view for evergreen picks.



  1. Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts

    Paper · Mar 4, 2025 · arxiv.org · Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, Xin Liu

    Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. The development of large MoE models i... (For background, a minimal MoE routing sketch appears after this list.)

  2. MagicArticulate: Make Your 3D Models Articulation-Ready

    Paper · Feb 18, 2025 · arxiv.org · Chaoyue Song, Jianfeng Zhang, Xiu Li, Fan Yang, Yiwen Chen, Zhongcong Xu, Jun Hao Liew, Xiaoyang Guo, Fayao Liu, Jiashi Feng, Guosheng Lin

    With the explosive growth of 3D content creation, there is an increasing demand for automatically converting static 3D models into articulation-ready versions that support realistic animation. Trad...

  3. Reformulation for Pretraining Data Augmentation

    Paper · May 19, 2025 · arxiv.org · Xintong Hao, Ruijie Zhu, Ge Zhang, Ke Shen, Chenggang Li

    Despite the impressive capabilities of large language models across various tasks, their continued scaling is severely hampered not only by data scarcity but also by the performance degradation ass...

  4. BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem Proving

    Paper · Oct 9, 2025 · arxiv.org · Ran Xin, Chenguang Xi, Jie Yang, Feng Chen, Hang Wu, Xia Xiao, Yifan Sun, Shen Zheng, Kai Shen

    Recent advancements in large language models (LLMs) have spurred growing interest in automatic theorem proving using Lean4, where effective tree search methods are crucial for navigating the underl... (A generic best-first search sketch appears after this list.)

  5. UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Paper · Jan 21, 2025 · arxiv.org · Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi

    This paper introduces UI-TARS, a native GUI agent model that solely perceives screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing ...

  6. VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

    Paper · Mar 5, 2025 · arxiv.org · Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin

    This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). W...

  7. Diffusion Adversarial Post-Training for One-Step Video Generation

    Paper · Oct 1, 2025 · arxiv.org · Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, Lu Jiang

    Diffusion models are widely used for image and video generation, but their iterative generation process is slow and expensive. While existing distillation approaches have demonstrated the poten...

  8. FullStack Bench: Evaluating LLMs as Full Stack Coders

    Paper · May 12, 2025 · arxiv.org · Bytedance-Seed-Foundation-Code-Team, Yao Cheng, Jianfeng Chen, Jie Chen, Li Chen, Liyu Chen, Wentao Chen, Zhengyu Chen, Shijie Geng, Aoyan Li, Bo Li, Bowen Li, Linyi Li, Boyi Liu, Jiaheng Liu, Kaibo Liu, Qi Liu, Shukai Liu, Siyao Liu, Tianyi Liu, Tingkai Liu, Yongfei Liu, Rui Long, Jing Mai, Guanghan Ning, Z. Y. Peng, Kai Shen, Jiahao Su, Jing Su, Tao Sun, Yifan Sun, Yunzhe Tao, Guoyin Wang, Siwei Wang, Xuwu Wang, Yite Wang, Zihan Wang, Jinxiang Xia, Liang Xiang, Xia Xiao, Yongsheng Xiao, Chenguang Xi, Shulin Xin, Jingjing Xu, Shikun Xu, Hongxia Yang, Jack Yang, Yingxiang Yang, Jianbo Yuan, Jun Zhang, Yufeng Zhang, Yuyu Zhang, Shen Zheng, He Zhu, Ming Zhu

    As the capabilities of code large language models (LLMs) continue to expand, their applications across diverse code intelligence domains are rapidly increasing. However, most existing datasets only...

  9. Ultra-Sparse Memory Network

    Paper · Feb 6, 2025 · arxiv.org · Zihao Huang, Qiyang Min, Hongzhi Huang, Defa Zhu, Yutao Zeng, Ran Guo, Xun Zhou

    It is widely acknowledged that the performance of Transformer models is logarithmically related to their number of parameters and computational complexity. While approaches like Mixture of Experts ...

  10. Understanding Chain-of-Thought in LLMs through Information Theory

    Paper · Jul 10, 2025 · arxiv.org · Jean-Francois Ton, Muhammad Faaiz Taufiq, Yang Liu

    Large Language Models (LLMs) have shown impressive performance in complex reasoning tasks through the use of Chain-of-Thought (CoT) reasoning, allowing models to break down problems into manageable... (A toy information-gain calculation appears after this list.)

  11. LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing

    Paper · Nov 13, 2024 · arxiv.org · Xiaonan Nie, Qibin Liu, Fangcheng Fu, Shenhan Zhu, Xupeng Miao, Xiaoyang Li, Yang Zhang, Shouda Liu, Bin Cui

    Larger transformer models consistently perform better on various tasks, but scaling up the model size is costly. To efficiently enlarge models, the mixture-of-experts (MoE) architecture is widel... (A generic locality-sensitive hashing sketch appears after this list.)

  12. SeedEdit: Align Image Re-Generation to Image Editing

    Paper · Nov 11, 2024 · arxiv.org · Yichun Shi, Peng Wang, Weilin Huang

    We introduce SeedEdit, a diffusion model that is able to revise a given image with any text prompt. In our perspective, the key to such a task is to obtain an optimal balance between maintaining th...

  13. Multi-Reward as Condition for Instruction-based Image Editing

    Paper · Mar 20, 2025 · arxiv.org · Xin Gu, Ming Li, Libo Zhang, Fan Chen, Longyin Wen, Tiejian Luo, Sijie Zhu

    High-quality training triplets (instruction, original image, edited image) are essential for instruction-based image editing. Predominant training datasets (e.g., InsPix2Pix) are created using text...

  14. Classification Done Right for Vision-Language Pre-Training

    Paper · Nov 6, 2024 · arxiv.org · Zilong Huang, Qinghao Ye, Bingyi Kang, Jiashi Feng, Haoqi Fan

    We introduce SuperClass, a super simple classification method for vision-language pre-training on image-text data. Unlike its contrastive counterpart CLIP, which contrasts with a text encoder, SuperCla...

  15. How Far is Video Generation from World Model: A Physical Law Perspective

    Paper · Jun 22, 2025 · arxiv.org · Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, Jiashi Feng

    OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, the ability of video generation models to discover such law...

  16. AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions

    Paper · Nov 5, 2024 · arxiv.org · Ziming Li, Qianbo Zang, David Ma, Jiawei Guo, Tuney Zheng, Minghao Liu, Xinyao Niu, Yue Wang, Jian Yang, Jiaheng Liu, Wanjun Zhong, Wangchunshu Zhou, Wenhao Huang, Ge Zhang

    Data science tasks involving tabular data present complex challenges that require sophisticated problem-solving approaches. We propose AutoKaggle, a powerful and user-centric framework that assists...

  17. Why Does the Effective Context Length of LLMs Fall Short?

    Paper · Oct 24, 2024 · arxiv.org · Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, Lingpeng Kong

    Advancements in distributed training and efficient attention mechanisms have significantly expanded the context window sizes of large language models (LLMs). However, recent work reveals that the e...
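For readers new to the mixture-of-experts pattern behind item 1, here is a minimal top-k routing sketch in NumPy. It is background only: a generic MoE forward pass, not Comet's computation-communication overlapping scheme, and every name in it (`moe_forward`, `gate_w`, `top_k`) is illustrative.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Minimal top-k mixture-of-experts dispatch (illustrative only).

    x: (tokens, d) input activations
    gate_w: (d, n_experts) router weights
    experts: list of callables, each mapping (m, d) -> (m, d)
    """
    logits = x @ gate_w                           # (tokens, n_experts) router scores
    top = np.argsort(-logits, axis=1)[:, :top_k]  # top-k expert indices per token
    # Softmax over the selected experts' logits only
    sel = np.take_along_axis(logits, top, axis=1)
    weights = np.exp(sel - sel.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)

    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        for slot in range(top_k):
            mask = top[:, slot] == e              # tokens routed to expert e via this slot
            if mask.any():
                out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
    return out

# Toy usage: four linear "experts" on random data
rng = np.random.default_rng(0)
d, n_experts = 8, 4
x = rng.normal(size=(16, d))
gate_w = rng.normal(size=(d, n_experts))
experts = [lambda h, W=rng.normal(size=(d, d)): h @ W for _ in range(n_experts)]
print(moe_forward(x, gate_w, experts).shape)  # (16, 8)
```

The per-expert loop is where real systems pay the communication cost that Comet targets: tokens must be gathered to, and scattered back from, whichever devices host each expert.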
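Item 4 relies on best-first tree search. The sketch below shows the generic pattern with a priority queue; the `expand`, `score`, and `is_goal` callables are toy placeholders standing in for BFS-Prover's tactic generator and value model, which the excerpt above does not specify.

```python
import heapq
from itertools import count

def best_first_search(root, expand, score, is_goal, max_nodes=10_000):
    """Generic best-first search: always expand the highest-scoring frontier node.

    expand(node)  -> iterable of child nodes
    score(node)   -> float, higher is more promising (e.g. a learned value)
    is_goal(node) -> bool
    """
    tie = count()  # tiebreaker so heapq never has to compare nodes directly
    frontier = [(-score(root), next(tie), root)]  # max-heap via negated scores
    visited = 0
    while frontier and visited < max_nodes:
        _, _, node = heapq.heappop(frontier)
        visited += 1
        if is_goal(node):
            return node
        for child in expand(node):
            heapq.heappush(frontier, (-score(child), next(tie), child))
    return None  # search budget exhausted

# Toy usage: reach 42 from 1 using increment/double moves
goal = 42
found = best_first_search(
    root=1,
    expand=lambda n: [n + 1, n * 2],
    score=lambda n: -abs(goal - n),  # closer to the goal scores higher
    is_goal=lambda n: n == goal,
)
print(found)  # 42
```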
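Item 10 analyzes chain-of-thought through information theory. As loose background only, the toy calculation below measures how much a single reasoning step reduces uncertainty (Shannon entropy) over candidate answers; the distributions are made up and this is not the paper's formulation.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Toy example: belief over 4 candidate answers before and after
# observing one intermediate reasoning step (made-up numbers).
prior = [0.25, 0.25, 0.25, 0.25]      # no reasoning yet: 2 bits of uncertainty
posterior = [0.70, 0.20, 0.05, 0.05]  # after one CoT step: sharper belief
gain = entropy(prior) - entropy(posterior)
print(f"H(prior)={entropy(prior):.2f} bits, "
      f"H(posterior)={entropy(posterior):.2f} bits, gain={gain:.2f} bits")
```

A step that leaves the posterior as flat as the prior carries no information about the answer, which is the intuition behind scoring intermediate reasoning steps this way.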
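Item 11 groups similar tokens with locality-sensitive hashing to cut MoE communication. Below is a standard random-hyperplane (SimHash) bucketing sketch; it is a generic LSH scheme and may differ from the paper's exact hash family.

```python
import numpy as np

def simhash_buckets(vectors, n_planes=8, seed=0):
    """Random-hyperplane LSH: a vector's bucket is the bit pattern of which
    side of each random hyperplane it falls on. Vectors with high cosine
    similarity tend to land in the same bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(vectors.shape[1], n_planes))
    bits = (vectors @ planes) > 0              # (n, n_planes) sign bits
    return bits @ (1 << np.arange(n_planes))   # pack bits into integer bucket ids

# Toy usage: a vector, a tiny perturbation of it, and an unrelated vector
rng = np.random.default_rng(1)
base = rng.normal(size=(1, 16))
near = base + 0.01 * rng.normal(size=(1, 16))
far = rng.normal(size=(1, 16))
print(simhash_buckets(np.vstack([base, near, far])))
# base and near almost always share a bucket id; far usually differs
```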




FAQ

What does this Multimodal Model page rank?

It ranks public content for Multimodal Model using recent discussion, review, and engagement signals so you can triage faster.

How should I use weekly vs monthly vs all-time?

Use weekly for fast-moving updates, monthly for stable trend confirmation, and all-time for evergreen references.

How can I discover organizations active in Multimodal Model?

Use the linked entities section to jump to labs, companies, and experts connected to this topic and explore their timelines.

Can I follow this topic for updates?

Yes. Use the follow button on this page to subscribe and track new high-signal activity.