Topic: Awesome List: multimodal


Short answer

This page shows the most relevant public items for Awesome List: multimodal, ranked by trend activity and review signal. Use weekly for fast changes, monthly for more stable patterns, and all-time for evergreen picks.

Weekly · Monthly · All time


  1. UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Paper · Sep 5, 2025 · arxiv.org · Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Aoyan Li, Bo Li, Chen Dun, Chong Liu, Daoguang Zan, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu Shi, Lin Yan, Peiyao Zhao, Pengfei Liu, Qinghao Ye, Renjie Zheng, Shulin Xin, Wayne Xin Zhao, Wen Heng, Wenhao Huang, Wenqian Wang, Xiaobo Qin, Yi Lin, Youbin Wu, Zehui Chen, Zihao Wang, Baoquan Zhong, Xinchun Zhang, Xujing Li, Yuanfan Li, Zhongkai Zhao, Chengquan Jiang, Faming Wu, Haotian Zhou, Jinlin Pang, Li Han, Qi Liu, Qianli Ma, Siyao Liu, Songhua Cai, Wenqi Fu, Xin Liu, Yaohui Wang, Zhi Zhang, Bo Zhou, Guoliang Li, Jiajun Shi, Jiale Yang, Jie Tang, Li Li, Qihua Han, Taoran Lu, Woyu Lin, Xiaokang Tong, Xinyao Li, Yichi Zhang, Yu Miao, Zhengxuan Jiang, Zili Li, Ziyuan Zhao, Chenxin Li, Dehua Ma, Feng Lin, Ge Zhang, Haihua Yang, Hangyu Guo, Hongda Zhu, Jiaheng Liu, Junda Du, Kai Cai, Kuanye Li, Lichen Yuan, Meilan Han, Minchao Wang, Shuyue Guo, Tianhao Cheng, Xiaobo Ma, Xiaojun Xiao, Xiaolong Huang, Xinjie Chen, Yidi Du, Yilin Chen, Yiwen Wang, Zhaojian Li, Zhenzhu Yang, Zhiyuan Zeng, Chaolin Jin, Chen Li, Hao Chen, Haoli Chen, Jian Chen, Qinghao Zhao, Guang Shi

    The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by un...

  2. Computer-Use Agents as Judges for Generative User Interface

    Paper · Nov 19, 2025 · arxiv.org · Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, Mike Zheng Shou

    Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUIs remain designed primarily for human...

  3. Grounding Language Models to Images for Multimodal Inputs and Outputs

    Paper · Jun 13, 2023 · arxiv.org · Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried

    We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text data, and generate text interleav...

  4. Generating Images with Multimodal Language Models

    Paper · Oct 13, 2023 · arxiv.org · Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov

    We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide...
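
    As a rough illustration of the embedding-space mapping this abstract describes, the sketch below is my own, with made-up dimensions and PyTorch assumed as the framework: only a small projection is trained, carrying a frozen LLM's hidden states into the conditioning space of a frozen image decoder.

        # Illustrative sketch (not the paper's code): map a frozen LLM's hidden
        # states into a frozen image decoder's conditioning space via a small
        # learned projection. All dimensions and names here are assumptions.
        import torch
        import torch.nn as nn

        class EmbeddingBridge(nn.Module):
            def __init__(self, llm_dim: int = 4096, decoder_dim: int = 768, n_img_tokens: int = 4):
                super().__init__()
                # Only this projection is trained; the LLM and image decoder stay frozen.
                self.proj = nn.Linear(llm_dim, decoder_dim)
                self.n_img_tokens = n_img_tokens

            def forward(self, llm_hidden: torch.Tensor) -> torch.Tensor:
                # llm_hidden: hidden states at the positions of learned image tokens,
                # shape (batch, n_img_tokens, llm_dim).
                return self.proj(llm_hidden)  # (batch, n_img_tokens, decoder_dim)

        bridge = EmbeddingBridge()
        fake_hidden = torch.randn(2, 4, 4096)   # stand-in for frozen LLM outputs
        conditioning = bridge(fake_hidden)      # would be fed to a frozen image decoder
        print(conditioning.shape)               # torch.Size([2, 4, 768])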

  5. Emu: Generative Pretraining in Multimodality

    Paper · May 8, 2024 · arxiv.org · Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang

    We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimo...

  6. Planting a SEED of Vision in Large Language Model

    Paper · Aug 12, 2023 · arxiv.org · Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, Ying Shan

    We present SEED, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the emergent ability to SEE and Draw at the same time. Research on image tokenizers has previously reac...
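
    To make the image-tokenizer idea concrete, here is a generic vector-quantization step of the kind such tokenizers rely on; it is not SEED's actual design, and the codebook size, feature dimension, and framework (PyTorch) are assumptions.

        # Minimal vector-quantization sketch: turn continuous image patch features
        # into discrete token ids via a nearest-neighbour codebook lookup, so an
        # LLM can read or emit them like ordinary text tokens.
        import torch

        def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
            """features: (num_patches, dim); codebook: (vocab_size, dim) -> token ids."""
            # Euclidean distance from every patch feature to every codebook entry.
            distances = torch.cdist(features, codebook)   # (num_patches, vocab_size)
            return distances.argmin(dim=-1)               # (num_patches,) discrete ids

        codebook = torch.randn(8192, 256)      # hypothetical visual vocabulary
        patch_features = torch.randn(32, 256)  # hypothetical encoder output for one image
        visual_tokens = quantize(patch_features, codebook)
        print(visual_tokens[:8])               # first few discrete visual token ids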

  7. Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

    Paper · Mar 22, 2024 · arxiv.org · Yang Jin, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Quzhe Huang, Bin Chen, Chenyi Lei, An Liu, Chengru Song, Xiaoqiang Lei, Di Zhang, Wenwu Ou, Kun Gai, Yadong Mu

    Recently, the remarkable advance of the Large Language Model (LLM) has inspired researchers to transfer its extraordinary reasoning capability to both vision and language data. However, the prevail...

  8. NExT-GPT: Any-to-Any Multimodal LLM

    Paper · Jun 25, 2024 · arxiv.org · Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua

    While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides, they mostly fall prey to the limitation of only input-side multimodal understanding, without the ability to pro...

  9. DreamLLM: Synergistic Multimodal Comprehension and Creation

    Paper · Mar 15, 2024 · arxiv.org · Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, Li Yi

    This paper presents DreamLLM, a learning framework that first achieves versatile Multimodal Large Language Models (MLLMs) empowered with frequently overlooked synergy between multimodal comprehensi...

  10. Kosmos-G: Generating Images in Context with Multimodal Large Language Models

    Paper · Apr 26, 2024 · arxiv.org · Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, Furu Wei

    Recent advancements in subject-driven image generation have made significant strides. However, current methods still fall short in diverse application scenarios, as they require test-time tuning an...

  11. LLMGA: Multimodal Large Language Model based Generation Assistant

    Paper · Jul 27, 2024 · arxiv.org · Bin Xia, Shiyin Wang, Yingfan Tao, Yitong Wang, Jiaya Jia

    In this paper, we introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA), leveraging the vast reservoir of knowledge and proficiency in reasoning, comprehension, and respons...

  12. CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation

    Paper · Nov 30, 2023 · arxiv.org · Zineng Tang, Ziyi Yang, Mahmoud Khademi, Yang Liu, Chenguang Zhu, Mohit Bansal

    We present CoDi-2, a versatile and interactive Multimodal Large Language Model (MLLM) that can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), reason, chat, e...

  13. Generative Multimodal Models are In-Context Learners

    Paper · May 8, 2024 · arxiv.org · Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang

    The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions), is what current multimodal systems have largely struggled to imitate. In...

  14. Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

    Paper · Jun 3, 2024 · arxiv.org · Yang Jin, Zhicheng Sun, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu

    In light of recent advances in multimodal Large Language Models (LLMs), there is increasing attention to scaling them from image-text data to more informative real-world videos. Compared to static ...

  15. AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

    Paper · Sep 8, 2025 · arxiv.org · Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yu-Gang Jiang, Xipeng Qiu

    We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyG...
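
    The general recipe behind any-to-any discrete sequence modeling can be sketched in a few lines: give each modality its own discrete tokenizer, offset its ids into one shared vocabulary, and train an ordinary autoregressive model on the concatenated sequence. The snippet below is a hand-written illustration of that idea, not AnyGPT's implementation; the vocabulary sizes are assumptions.

        # Hand-written sketch: merge per-modality token ids into one shared
        # vocabulary so a single autoregressive model can handle all of them.
        TEXT_VOCAB, IMAGE_VOCAB, SPEECH_VOCAB = 32_000, 8_192, 1_024  # assumed sizes

        OFFSETS = {
            "text": 0,
            "image": TEXT_VOCAB,
            "speech": TEXT_VOCAB + IMAGE_VOCAB,
        }

        def to_unified_sequence(segments: list[tuple[str, list[int]]]) -> list[int]:
            """segments: ordered (modality, token_ids) pairs from per-modality tokenizers."""
            sequence: list[int] = []
            for modality, ids in segments:
                offset = OFFSETS[modality]
                sequence.extend(offset + i for i in ids)
            return sequence

        # A caption, then the image it describes, then a short spoken reply.
        example = [("text", [15, 901, 42]), ("image", [7, 7, 300]), ("speech", [12, 5])]
        print(to_unified_sequence(example))
        # [15, 901, 42, 32007, 32007, 32300, 40204, 40197]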


Top Entities In This Topic

Related Topics

FAQ

What does this Awesome List: multimodal page rank?

It ranks public content for Awesome List: multimodal using recent discussion, review, and engagement signals so you can triage faster. This guidance is specific to the Awesome List: multimodal topic page on Attendemia.

How should I use weekly vs monthly vs all-time?

Use weekly for fast-moving updates, monthly for stable trend confirmation, and all-time for evergreen references.

How can I discover organizations active in Awesome List: multimodal?

Use the linked entities section to jump to labs, companies, and experts connected to this topic and explore their timelines.

Can I follow this topic for updates?

Yes. Use the follow button on this page to subscribe and track new high-signal activity.