Topic: Awesome List: multimodal

Short answer

This page lists the most relevant public items for Awesome List: multimodal, ranked by trend activity and review signal. Use the weekly view for fast-moving changes, the monthly view for more stable patterns, and the all-time view for evergreen picks.
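
The ranking formula itself is not published. As a rough illustration of "trend activity and review signal," here is a toy recency-weighted score; every weight, half-life, and field name below is hypothetical.

    import math
    from datetime import datetime, timezone

    # Toy recency-weighted ranking. All weights, half-lives, and field
    # names are hypothetical; the site's actual formula is not published.
    def trend_score(item: dict, half_life_days: float = 7.0) -> float:
        age_days = (datetime.now(timezone.utc) - item["published"]).days
        recency = 0.5 ** (age_days / half_life_days)  # exponential decay
        engagement = math.log1p(item["discussions"] + 2 * item["reviews"])
        return recency * engagement

    items = [
        {"title": "Paper A", "published": datetime(2025, 7, 7, tzinfo=timezone.utc),
         "discussions": 40, "reviews": 12},
        {"title": "Paper B", "published": datetime(2025, 4, 1, tzinfo=timezone.utc),
         "discussions": 90, "reviews": 30},
    ]
    for it in sorted(items, key=trend_score, reverse=True):
        print(it["title"], round(trend_score(it), 3))

Under a scheme like this, the weekly view corresponds to a short half-life (about 7 days), the monthly view to roughly 30, and the all-time view to a score dominated by cumulative engagement.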

  1. IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval

    Paper · Apr 1, 2025 · arxiv.org · Bangwei Liu, Yicheng Bao, Shaohui Lin, Xuhong Wang, Xin Tan, Yingchun Wang, Yuan Xie, Chaochao Lu

    Multimodal retrieval systems are becoming increasingly vital for cutting-edge AI technologies, such as embodied AI and AI-driven digital content industries. However, current multimodal retrieval ta...

  2. MIEB: Massive Image Embedding Benchmark

    Paper · Apr 14, 2025 · arxiv.org · Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, Niklas Muennighoff

    Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embe...

  3. Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval

    Paper · May 27, 2025 · arxiv.org · Fanheng Kong, Jingyuan Zhang, Yahui Liu, Hongzhi Zhang, Shi Feng, Xiaocui Yang, Daling Wang, Yu Tian, Victoria W., Fuzheng Zhang, Guorui Zhou

    Multimodal information retrieval (MIR) faces inherent challenges due to the heterogeneity of data sources and the complexity of cross-modal alignment. While previous studies have identified modal g...

  4. jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval

    Paper · Jul 7, 2025 · arxiv.org · Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, Han Xiao

    We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding model that unifies text and image representations through a novel architecture supporting both single-vector and multi-...

  5. LaViDa: A Large Diffusion Language Model for Multimodal Understanding

    Paper · Jun 18, 2025 · arxiv.org · Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, Aditya Grover

    Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable gener...

  6. LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

    Paper · Jun 4, 2025 · arxiv.org · Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, Chongxuan Li

    In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure ...

  7. Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

    Paper · Oct 17, 2025 · arxiv.org · Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan

    Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding para...

  8. WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs

    Paper · Feb 22, 2025 · arxiv.org · Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Yi Su, Bohua Chen, Dongping Chen, Siyuan Wu, Xing Zhou, Wenbin Jiang, Hai Jin, Xiangliang Zhang

    Automatically generating webpage code from webpage designs can significantly reduce the workload of front-end developers, and recent Multimodal Large Language Models (MLLMs) have shown promising po...

  9. GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent

    Paper · Dec 24, 2024 · arxiv.org · Kangjia Zhao, Jiahui Song, Leigang Sha, Haozhan Shen, Zhi Chen, Tiancheng Zhao, Xiubo Liang, Jianwei Yin

    Nowadays, research on GUI agents is a hot topic in the AI community. However, current research focuses on GUI task automation, limiting the scope of applications in various GUI scenarios. In this p...

  10. DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation

    Paper · Feb 24, 2026 · arxiv.org · Jingyu Xiao, Man Ho Lam, Ming Wang, Yuxuan Wan, Junliang Liu, Yintong Huo, Michael R. Lyu

    Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in automated front-end engineering, e.g., generating UI code from visual designs. However, existing front-end UI c...

  11. ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation

    Paper · Sep 29, 2025 · arxiv.org · Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Changzhi Zhou, Ken Deng, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Shihui Hu, Yue Zhang, Yuhao Jiang, Zenan Xu, Yuanxing Zhang, Wiggin Zhou, Chayse Zhou, Fengzong Lian

    The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation g...

  12. FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents

    Paper · Aug 12, 2025 · arxiv.org · Fengxian Ji, Jingpu Yang, Zirui Song, Yuanxi Wang, Zhexuan Cui, Yuke Li, Qian Jiang, Miao Fang, Xiuying Chen

    With the rapid advancement of generative artificial intelligence technology, Graphical User Interface (GUI) agents have demonstrated tremendous potential for autonomously managing daily tasks throu...
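
Items 1 through 4 above all revolve around embedding-based cross-modal retrieval, and jina-embeddings-v4 in particular exposes both single-vector and multi-vector representations. The NumPy sketch below contrasts the two scoring modes on random stand-in embeddings; real systems substitute learned text and image encoders.

    import numpy as np

    # Toy contrast of the two retrieval modes: one pooled vector per
    # document scored by cosine similarity, vs. one vector per token
    # scored by ColBERT-style late interaction (MaxSim). Embeddings
    # here are random placeholders, not model outputs.
    rng = np.random.default_rng(0)

    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    # Single-vector mode: query and each document are one d-dim vector.
    q = normalize(rng.normal(size=8))
    docs = normalize(rng.normal(size=(3, 8)))
    single_scores = docs @ q  # cosine similarity per document

    # Multi-vector mode: the query has Tq token vectors, each doc Td.
    Q = normalize(rng.normal(size=(4, 8)))                       # (Tq, d)
    D = [normalize(rng.normal(size=(t, 8))) for t in (5, 7, 6)]  # per doc
    # MaxSim: best-matching doc token for each query token, summed.
    multi_scores = [float((Q @ d.T).max(axis=1).sum()) for d in D]

    print("single-vector:", np.round(single_scores, 3))
    print("multi-vector :", np.round(multi_scores, 3))

Single-vector search is cheap to index and serve; late interaction preserves per-token detail at higher storage and compute cost, which is the trade-off a model offering both modes lets you make per deployment.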
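
Items 5 through 7 (LaViDa, LLaDA-V, Muddit) build on masked discrete diffusion instead of left-to-right decoding. The loop below is a minimal sketch of that decoding style with a random stand-in for the denoiser; the confidence-based unmasking schedule is a common choice, not any single paper's exact procedure.

    import numpy as np

    # Masked-diffusion decoding, toy version: start from all [MASK],
    # predict every position in parallel, commit the most confident
    # predictions, and re-mask the rest for the next step.
    rng = np.random.default_rng(0)
    vocab, seq_len, steps, MASK = 10, 8, 4, -1

    def fake_model(tokens):
        """Random stand-in for a denoiser: per-position vocab logits."""
        return rng.normal(size=(len(tokens), vocab))

    tokens = np.full(seq_len, MASK)
    for step in range(steps):
        logits = fake_model(tokens)
        probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        pred, conf = probs.argmax(axis=1), probs.max(axis=1)
        masked = tokens == MASK
        # Commit the k most confident still-masked positions this step.
        k = int(np.ceil(masked.sum() / (steps - step)))
        order = np.argsort(-np.where(masked, conf, -np.inf))
        tokens[order[:k]] = pred[order[:k]]
        print(f"step {step}: {tokens}")

Every position is re-predicted in parallel at each step, so a full sequence emerges in a handful of passes rather than one token at a time; the papers above pair this decoder with vision encoders for multimodal understanding and generation.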

Top Entities In This Topic

FAQ

What does this Awesome List: multimodal page rank?

It ranks public content for Awesome List: multimodal using recent discussion, review, and engagement signals, so you can triage new work faster.

How should I use weekly vs monthly vs all-time?

Use weekly for fast-moving updates, monthly for stable trend confirmation, and all-time for evergreen references.

How can I discover organizations active in Awesome List: multimodal?

Use the Top Entities In This Topic section to jump to the labs, companies, and researchers connected to this topic and explore their timelines.

Can I follow this topic for updates?

Yes. Use the follow button on this page to subscribe and track new high-signal activity.