Topic: Awesome List: multimodal

Short answer

This page shows the most relevant public items for Awesome List: multimodal, ranked by trend activity and review signal. Use the weekly view for fast-moving changes, the monthly view for more stable patterns, and the all-time view for evergreen picks.



  1. Cross Modal Retrieval with Querybank Normalisation

    Paper · Apr 18, 2022 · arxiv.org · Simion-Vlad Bogolin, Ioana Croitoru, Hailin Jin, Yang Liu, Samuel Albanie

    Profiting from large-scale training datasets, advances in neural architecture design and efficient inference, joint embeddings have become the dominant approach for tackling cross-modal retrieval. ...
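The joint-embedding approach the abstract describes can be sketched in a few lines: embed queries and candidates into one shared space, then rank candidates by cosine similarity. This is a minimal illustration of the general paradigm, not the paper's querybank-normalisation method; the vectors and function names here are hypothetical.

```python
import numpy as np

def cosine_rank(query_emb, gallery_embs):
    """Rank gallery items by cosine similarity to a query embedding.

    query_emb: (d,) vector, e.g. a text query mapped into the joint space.
    gallery_embs: (n, d) matrix of candidate embeddings (e.g. videos).
    Returns gallery indices, most similar first.
    """
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q  # cosine similarity of each candidate to the query
    return np.argsort(-sims)

# Toy joint space: one text query, three candidate items.
query = np.array([1.0, 0.0, 0.0])
gallery = np.array([
    [0.9, 0.1, 0.0],   # nearly aligned with the query
    [0.0, 1.0, 0.0],   # orthogonal
    [-1.0, 0.0, 0.0],  # opposite direction
])
print(cosine_rank(query, gallery))  # → [0 1 2]
```

Querybank normalisation (the paper's contribution) adjusts these raw similarity scores using a bank of stored queries to counter hubness; the ranking step itself stays as above.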

  2. VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

    Paper · Oct 7, 2023 · arxiv.org · Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu

    Vision and text have been fully explored in contemporary video-text foundational models, while other modalities such as audio and subtitles in videos have not received sufficient attention. In this...

  3. CoVR: Learning Composed Video Retrieval from Web Video Captions

    Paper · May 30, 2024 · arxiv.org · Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol

    Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches requi...

  4. UniIR: Training and Benchmarking Universal Multimodal Information Retrievers

    Paper · Nov 28, 2023 · arxiv.org · Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, Wenhu Chen

    Existing information retrieval (IR) models often assume a homogeneous format, limiting their applicability to diverse user needs, such as searching for images with text descriptions, searching for ...

  5. Composed Video Retrieval via Enriched Context and Discriminative Embeddings

    Paper · Mar 25, 2024 · arxiv.org · Omkar Thawakar, Muzammal Naseer, Rao Muhammad Anwer, Salman Khan, Michael Felsberg, Mubarak Shah, Fahad Shahbaz Khan

    Composed video retrieval (CoVR) is a challenging problem in computer vision which has recently highlighted the integration of modification text with visual queries for more sophisticated video sear...

  6. MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions

    Paper · Jun 24, 2024 · arxiv.org · Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, Ming-Wei Chang

    Image retrieval, i.e., finding desired images given a reference image, inherently encompasses rich, multi-faceted search intents that are difficult to capture solely using image-based measures. Rec...

  7. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

    Paper · Feb 25, 2025 · arxiv.org · Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

    Decoder-only LLM-based embedding models are beginning to outperform BERT or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retrieval. In this work, ...

  8. E5-V: Universal Embeddings with Multimodal Large Language Models

    Paper · Jul 17, 2024 · arxiv.org · Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, Fuzhen Zhuang

    Multimodal large language models (MLLMs) have shown promising advancements in general visual and language understanding. However, the representation of multimodal information using MLLMs remains la...

  9. MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

    Paper · Feb 22, 2025 · arxiv.org · Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, Wei Ping

    State-of-the-art retrieval models typically address a straightforward search scenario, in which retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single m...

  10. LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

    Paper · Dec 2, 2024 · arxiv.org · Yikun Liu, Pingan Chen, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, Weidi Xie

    With the rapid advancement of multimodal information retrieval, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-languag...

  11. MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

    Paper · Dec 19, 2024 · arxiv.org · Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, Yongping Xiong

    Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synt...

  12. GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

    Paper · Apr 1, 2025 · arxiv.org · Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, Min Zhang

    Universal Multimodal Retrieval (UMR) aims to enable search across various modalities using a unified model, where queries and candidates can consist of pure text, images, or a combination of both. ...
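A composed query, in which text and image together form the query, is the distinguishing case of UMR. A crude baseline for such fusion is a convex combination of the two embeddings, renormalised; real UMR models such as GME learn this fusion end-to-end, so the function below is only a hypothetical baseline sketch, not the paper's method.

```python
import numpy as np

def compose_query(img_emb, txt_emb, alpha=0.5):
    """Naive composed-query embedding: a convex combination of image and
    text embeddings, renormalised to unit length. Learned fusion modules
    replace this weighting in actual UMR systems."""
    q = alpha * np.asarray(img_emb) + (1 - alpha) * np.asarray(txt_emb)
    return q / np.linalg.norm(q)

# Toy example: orthogonal image and text embeddings fuse to the diagonal.
img = np.array([1.0, 0.0])
txt = np.array([0.0, 1.0])
q = compose_query(img, txt)
print(q)  # → [0.70710678 0.70710678]
```

Candidates of any modality can then be ranked against `q` with the same cosine-similarity scoring used for single-modality queries.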

  13. MINIMA: Modality Invariant Image Matching

    Paper · Mar 29, 2025 · arxiv.org · Jiangwei Ren, Xingyu Jiang, Zizhuo Li, Dingkang Liang, Xin Zhou, Xiang Bai

    Image matching for both cross-view and cross-modality plays a critical role in multimodal perception. In practice, the modality gap caused by different imaging systems/styles poses great challenges...

  14. CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval

    Paper · Mar 18, 2025 · arxiv.org · Yifan Xu, Xinhao Li, Yichun Yang, Desen Meng, Rui Huang, Limin Wang

    Video understanding, including video captioning and retrieval, is still a great challenge for video-language models (VLMs). The existing video retrieval and caption benchmarks only include short de...

  15. MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval

    Paper · Jan 10, 2026 · arxiv.org · Huaying Yuan, Jian Ni, Zheng Liu, Yueze Wang, Junjie Zhou, Zhengyang Liang, Bo Zhao, Zhao Cao, Zhicheng Dou, Ji-Rong Wen

    Accurately locating key moments within long videos is crucial for solving long video understanding (LVU) tasks. However, existing benchmarks are either severely limited in terms of video length and...

  16. Learning Fine-Grained Representations through Textual Token...

    Paper · Oct 4, 2024 · openreview.net · Yue Wu, Zhaobo Qi, Yiling Wu, Junshu Sun, Yaowei Wang, Shuhui Wang

    With the explosive growth of video data, finding videos that meet detailed requirements in large datasets has become a challenge. To address this, the composed video retrieval task has been...

Related Topics

Machine Learning (199) · Deep Learning (199) · Multimodal (199) · llm (7) · Multimodal Model (7)