Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, Weilin Huang
2025
8,458 checkouts
Awesome Multimodal Machine Learning: From Video Understanding to Vibe Coding
A curated, high-quality list of must-read papers and resources tracing the evolution of Multimodal Machine Learning. This repository covers the foundational shift from Video Understanding and Generative Video (Diffusion/Autoregressive) to the frontiers of UX/GUI Design Agents and Vibe Coding. Whether you are looking for landmark papers in CLIP-based alignment or the latest in vision-language-action (VLA) models for interface interaction, this list provides a structured roadmap through the most influential research in the field.