Multimodal

Awesome Multimodal Machine Learning: From Video Understanding to Vibe Coding

Follow this list after sign-in.

machine-learning deep-learning multimodal199 items · 3 followers

A curated, high-quality list of must-read papers and resources tracing the evolution of Multimodal Machine Learning. This repository covers the foundational shift from Video Understanding and Generative Video (Diffusion/Autoregressive) to the frontiers of UX/GUI Design Agents and Vibe Coding. Whether you are looking for landmark papers in CLIP-based alignment or the latest in vision-language-action (VLA) models for interface interaction, this list provides a structured roadmap through the most influential research in the field.

Managed by Attendemia

Sort

ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue
2025
6,576 checkouts
ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation
Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Changzhi Zhou, Ken Deng, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Shihui Hu, Yue Zhang, Yuhao Jiang, Zenan Xu, Yuanxing Zhang, Wiggin Zhou, Chayse Zhou, Fengzong Lian
2025
8,739 checkouts
DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation
Jingyu Xiao, Man Ho Lam, Ming Wang, Yuxuan Wan, Junliang Liu, Yintong Huo, Michael R. Lyu
2026
8,407 checkouts
WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch
Zimu Lu, Yunqiao Yang, Houxing Ren, Haotian Hou, Han Xiao, Ke Wang, Weikang Shi, Aojun Zhou, Mingjie Zhan, Hongsheng Li
2025
8,780 checkouts
GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent
Kangjia Zhao, Jiahui Song, Leigang Sha, Haozhan Shen, Zhi Chen, Tiancheng Zhao, Xiubo Liang, Jianwei Yin
2024
5,752 checkouts
Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping
Jingyu Xiao, Yuxuan Wan, Yintong Huo, Zixin Wang, Xinyi Xu, Wenxuan Wang, Zhiyao Xu, Yuhang Wang, Michael R. Lyu
2025
5,396 checkouts
Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping
Ryan Li, Yanzhe Zhang, Diyi Yang
2024
9,134 checkouts
WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs
Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Yi Su, Bohua Chen, Dongping Chen, Siyuan Wu, Xing Zhou, Wenbin Jiang, Hai Jin, Xiangliang Zhang
2025
8,682 checkouts
Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering
Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, Diyi Yang
2025
9,995 checkouts
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan
2025
5,041 checkouts
Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding
Runpeng Yu, Xinyin Ma, Xinchao Wang
2025
9,012 checkouts
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, Chongxuan Li
2025
6,376 checkouts
LaViDa: A Large Diffusion Language Model for Multimodal Understanding
Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, Aditya Grover
2025
6,614 checkouts
jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval
Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, Han Xiao
2025
7,668 checkouts
Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval
Fanheng Kong, Jingyuan Zhang, Yahui Liu, Hongzhi Zhang, Shi Feng, Xiaocui Yang, Daling Wang, Yu Tian, Victoria W., Fuzheng Zhang, Guorui Zhou
2025
8,902 checkouts
MIEB: Massive Image Embedding Benchmark
Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, Niklas Muennighoff
2025
7,463 checkouts
IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval
Bangwei Liu, Yicheng Bao, Shaohui Lin, Xuhong Wang, Xin Tan, Yingchun Wang, Yuan Xie, Chaochao Lu
2025
6,203 checkouts
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning
Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su
2025
7,635 checkouts
CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval
Zelong Sun, Dong Jing, Zhiwu Lu
2025
5,226 checkouts
Learning Fine-Grained Representations through Textual Token...
Yue WU, Zhaobo Qi, Yiling Wu, Junshu Sun, Yaowei Wang, Shuhui Wang
2024
8,386 checkouts

FAQ

What is Multimodal?

Multimodal is an expert-curated awesome list on Attendemia that groups high-signal resources for fast learning. Items are reviewed and refreshed over time, so readers can start with a practical shortlist instead of searching across fragmented sources and low-context recommendation threads.

How are items ranked here?

Items are ranked using maintainer curation, content quality notes, engagement momentum, and freshness indicators. This ranking method keeps the top of the awesome list actionable for current workflows, while still preserving evergreen references that are widely cited and useful for deeper technical understanding.

Can I follow this list?

Yes. Use the follow button near the page header to receive update visibility when new resources are added or promoted. Following this list helps you monitor changes without rechecking manually and keeps your learning feed aligned with this specific topic over time.