Topic: Awesome List: multimodal

Short answer

This page lists the most relevant public items for Awesome List: multimodal, ranked by trend activity and review signal. Use the weekly view for fast-moving changes, the monthly view for more stable patterns, and the all-time view for evergreen picks.
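
The ranking formula itself is not published. As a rough illustration of "trend activity and review signal," here is a toy recency-weighted score; every weight, half-life, and field name below is hypothetical.

    import math
    from datetime import datetime, timezone

    # Toy recency-weighted ranking. All weights, half-lives, and field
    # names are hypothetical; the site's actual formula is not published.
    def trend_score(item: dict, half_life_days: float = 7.0) -> float:
        age_days = (datetime.now(timezone.utc) - item["published"]).days
        recency = 0.5 ** (age_days / half_life_days)  # exponential decay
        engagement = math.log1p(item["discussions"] + 2 * item["reviews"])
        return recency * engagement

    items = [
        {"title": "Paper A", "published": datetime(2025, 7, 7, tzinfo=timezone.utc),
         "discussions": 40, "reviews": 12},
        {"title": "Paper B", "published": datetime(2025, 4, 1, tzinfo=timezone.utc),
         "discussions": 90, "reviews": 30},
    ]
    for it in sorted(items, key=trend_score, reverse=True):
        print(it["title"], round(trend_score(it), 3))

Under a scheme like this, the weekly view corresponds to a short half-life (about 7 days), the monthly view to roughly 30, and the all-time view to a score dominated by cumulative engagement.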

  1. IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval

    Paper · Apr 1, 2025 · arxiv.org · Bangwei Liu, Yicheng Bao, Shaohui Lin, Xuhong Wang, Xin Tan, Yingchun Wang, Yuan Xie, Chaochao Lu

    Multimodal retrieval systems are becoming increasingly vital for cutting-edge AI technologies, such as embodied AI and AI-driven digital content industries. However, current multimodal retrieval ta...

  2. MIEB: Massive Image Embedding Benchmark

    Paper · Apr 14, 2025 · arxiv.org · Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, Niklas Muennighoff

    Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embe...

  3. Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval

    Paper · May 27, 2025 · arxiv.org · Fanheng Kong, Jingyuan Zhang, Yahui Liu, Hongzhi Zhang, Shi Feng, Xiaocui Yang, Daling Wang, Yu Tian, Victoria W., Fuzheng Zhang, Guorui Zhou

    Multimodal information retrieval (MIR) faces inherent challenges due to the heterogeneity of data sources and the complexity of cross-modal alignment. While previous studies have identified modal g...

  4. jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval

    Paper · Jul 7, 2025 · arxiv.org · Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, Han Xiao

    We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding model that unifies text and image representations through a novel architecture supporting both single-vector and multi-...

  5. LaViDa: A Large Diffusion Language Model for Multimodal Understanding

    Paper · Jun 18, 2025 · arxiv.org · Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, Aditya Grover

    Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable gener...

  6. LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

    Paper · Jun 4, 2025 · arxiv.org · Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, Chongxuan Li

    In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure ...

  7. Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

    Paper · Oct 17, 2025 · arxiv.org · Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan

    Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding para...

  8. WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs

    Paper · Feb 22, 2025 · arxiv.org · Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Yi Su, Bohua Chen, Dongping Chen, Siyuan Wu, Xing Zhou, Wenbin Jiang, Hai Jin, Xiangliang Zhang

    Automatically generating webpage code from webpage designs can significantly reduce the workload of front-end developers, and recent Multimodal Large Language Models (MLLMs) have shown promising po...

  9. GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent

    Paper · Dec 24, 2024 · arxiv.org · Kangjia Zhao, Jiahui Song, Leigang Sha, Haozhan Shen, Zhi Chen, Tiancheng Zhao, Xiubo Liang, Jianwei Yin

    Nowadays, research on GUI agents is a hot topic in the AI community. However, current research focuses on GUI task automation, limiting the scope of applications in various GUI scenarios. In this p...

  10. DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation

    Paper · Feb 24, 2026 · arxiv.org · Jingyu Xiao, Man Ho Lam, Ming Wang, Yuxuan Wan, Junliang Liu, Yintong Huo, Michael R. Lyu

    Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in automated front-end engineering, e.g., generating UI code from visual designs. However, existing front-end UI c...

  11. ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation

    Paper · Sep 29, 2025 · arxiv.org · Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Changzhi Zhou, Ken Deng, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Shihui Hu, Yue Zhang, Yuhao Jiang, Zenan Xu, Yuanxing Zhang, Wiggin Zhou, Chayse Zhou, Fengzong Lian

    The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation g...

  12. FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents

    Paper · Aug 12, 2025 · arxiv.org · Fengxian Ji, Jingpu Yang, Zirui Song, Yuanxi Wang, Zhexuan Cui, Yuke Li, Qian Jiang, Miao Fang, Xiuying Chen

    With the rapid advancement of generative artificial intelligence technology, Graphical User Interface (GUI) agents have demonstrated tremendous potential for autonomously managing daily tasks throu...
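
Items 1 through 4 above all revolve around embedding-based cross-modal retrieval, and jina-embeddings-v4 in particular exposes both single-vector and multi-vector representations. The NumPy sketch below contrasts the two scoring modes on random stand-in embeddings; real systems substitute learned text and image encoders.

    import numpy as np

    # Toy contrast of the two retrieval modes: one pooled vector per
    # document scored by cosine similarity, vs. one vector per token
    # scored by ColBERT-style late interaction (MaxSim). Embeddings
    # here are random placeholders, not model outputs.
    rng = np.random.default_rng(0)

    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    # Single-vector mode: query and each document are one d-dim vector.
    q = normalize(rng.normal(size=8))
    docs = normalize(rng.normal(size=(3, 8)))
    single_scores = docs @ q  # cosine similarity per document

    # Multi-vector mode: the query has Tq token vectors, each doc Td.
    Q = normalize(rng.normal(size=(4, 8)))                       # (Tq, d)
    D = [normalize(rng.normal(size=(t, 8))) for t in (5, 7, 6)]  # per doc
    # MaxSim: best-matching doc token for each query token, summed.
    multi_scores = [float((Q @ d.T).max(axis=1).sum()) for d in D]

    print("single-vector:", np.round(single_scores, 3))
    print("multi-vector :", np.round(multi_scores, 3))

Single-vector search is cheap to index and serve; late interaction preserves per-token detail at higher storage and compute cost, which is the trade-off a model offering both modes lets you make per deployment.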
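
Items 5 through 7 (LaViDa, LLaDA-V, Muddit) build on masked discrete diffusion instead of left-to-right decoding. The loop below is a minimal sketch of that decoding style with a random stand-in for the denoiser; the confidence-based unmasking schedule is a common choice, not any single paper's exact procedure.

    import numpy as np

    # Masked-diffusion decoding, toy version: start from all [MASK],
    # predict every position in parallel, commit the most confident
    # predictions, and re-mask the rest for the next step.
    rng = np.random.default_rng(0)
    vocab, seq_len, steps, MASK = 10, 8, 4, -1

    def fake_model(tokens):
        """Random stand-in for a denoiser: per-position vocab logits."""
        return rng.normal(size=(len(tokens), vocab))

    tokens = np.full(seq_len, MASK)
    for step in range(steps):
        logits = fake_model(tokens)
        probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        pred, conf = probs.argmax(axis=1), probs.max(axis=1)
        masked = tokens == MASK
        # Commit the k most confident still-masked positions this step.
        k = int(np.ceil(masked.sum() / (steps - step)))
        order = np.argsort(-np.where(masked, conf, -np.inf))
        tokens[order[:k]] = pred[order[:k]]
        print(f"step {step}: {tokens}")

Every position is re-predicted in parallel at each step, so a full sequence emerges in a handful of passes rather than one token at a time; the papers above pair this decoder with vision encoders for multimodal understanding and generation.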

Top Entities In This Topic

FAQ

What does this Awesome List: multimodal page rank?

It ranks public content for Awesome List: multimodal using recent discussion, review, and engagement signals, so you can triage new work faster.

How should I use weekly vs monthly vs all-time?

Use weekly for fast-moving updates, monthly for stable trend confirmation, and all-time for evergreen references.

How can I discover organizations active in Awesome List: multimodal?

Use the Top Entities In This Topic section to jump to the labs, companies, and researchers connected to this topic and explore their timelines.

Can I follow this topic for updates?

Yes. Use the follow button on this page to subscribe and track new high-signal activity.