Topic: Awesome List: multimodal

Short answer

This page shows the most relevant public items for Awesome List: multimodal, ranked by trend activity and review signal. Use the weekly view for fast-moving changes, the monthly view for more stable patterns, and the all-time view for evergreen picks.



  1. Cross Modal Retrieval with Querybank Normalisation

    Paper · Apr 18, 2022 · arxiv.org · Simion-Vlad Bogolin, Ioana Croitoru, Hailin Jin, Yang Liu, Samuel Albanie

    Profiting from large-scale training datasets, advances in neural architecture design and efficient inference, joint embeddings have become the dominant approach for tackling cross-modal retrieval. ...
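The joint-embedding approach the abstract describes can be sketched in a few lines: embed queries and candidates into one shared space, then rank candidates by cosine similarity. This is a minimal illustration of the general paradigm, not the paper's querybank-normalisation method; the vectors and function names here are hypothetical.

```python
import numpy as np

def cosine_rank(query_emb, gallery_embs):
    """Rank gallery items by cosine similarity to a query embedding.

    query_emb: (d,) vector, e.g. a text query mapped into the joint space.
    gallery_embs: (n, d) matrix of candidate embeddings (e.g. videos).
    Returns gallery indices, most similar first.
    """
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q  # cosine similarity of each candidate to the query
    return np.argsort(-sims)

# Toy joint space: one text query, three candidate items.
query = np.array([1.0, 0.0, 0.0])
gallery = np.array([
    [0.9, 0.1, 0.0],   # nearly aligned with the query
    [0.0, 1.0, 0.0],   # orthogonal
    [-1.0, 0.0, 0.0],  # opposite direction
])
print(cosine_rank(query, gallery))  # → [0 1 2]
```

Querybank normalisation (the paper's contribution) adjusts these raw similarity scores using a bank of stored queries to counter hubness; the ranking step itself stays as above.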

  2. VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

    Paper · Oct 7, 2023 · arxiv.org · Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu

    Vision and text have been fully explored in contemporary video-text foundational models, while other modalities such as audio and subtitles in videos have not received sufficient attention. In this...

  3. CoVR: Learning Composed Video Retrieval from Web Video Captions

    Paper · May 30, 2024 · arxiv.org · Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol

    Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches requi...

  4. UniIR: Training and Benchmarking Universal Multimodal Information Retrievers

    Paper · Nov 28, 2023 · arxiv.org · Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, Wenhu Chen

    Existing information retrieval (IR) models often assume a homogeneous format, limiting their applicability to diverse user needs, such as searching for images with text descriptions, searching for ...

  5. Composed Video Retrieval via Enriched Context and Discriminative Embeddings

    Paper · Mar 25, 2024 · arxiv.org · Omkar Thawakar, Muzammal Naseer, Rao Muhammad Anwer, Salman Khan, Michael Felsberg, Mubarak Shah, Fahad Shahbaz Khan

    Composed video retrieval (CoVR) is a challenging problem in computer vision which has recently highlighted the integration of modification text with visual queries for more sophisticated video sear...

  6. MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions

    Paper · Jun 24, 2024 · arxiv.org · Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, Ming-Wei Chang

    Image retrieval, i.e., finding desired images given a reference image, inherently encompasses rich, multi-faceted search intents that are difficult to capture solely using image-based measures. Rec...

  7. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

    Paper · Feb 25, 2025 · arxiv.org · Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

    Decoder-only LLM-based embedding models are beginning to outperform BERT or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retrieval. In this work, ...

  8. E5-V: Universal Embeddings with Multimodal Large Language Models

    Paper · Jul 17, 2024 · arxiv.org · Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, Fuzhen Zhuang

    Multimodal large language models (MLLMs) have shown promising advancements in general visual and language understanding. However, the representation of multimodal information using MLLMs remains la...

  9. MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

    Paper · Feb 22, 2025 · arxiv.org · Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, Wei Ping

    State-of-the-art retrieval models typically address a straightforward search scenario, in which retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single m...

  10. LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

    Paper · Dec 2, 2024 · arxiv.org · Yikun Liu, Pingan Chen, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, Weidi Xie

    With the rapid advancement of multimodal information retrieval, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-languag...

  11. MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

    Paper · Dec 19, 2024 · arxiv.org · Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, Yongping Xiong

    Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synt...

  12. GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

    Paper · Apr 1, 2025 · arxiv.org · Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, Min Zhang

    Universal Multimodal Retrieval (UMR) aims to enable search across various modalities using a unified model, where queries and candidates can consist of pure text, images, or a combination of both. ...
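A composed query, in which text and image together form the query, is the distinguishing case of UMR. A crude baseline for such fusion is a convex combination of the two embeddings, renormalised; real UMR models such as GME learn this fusion end-to-end, so the function below is only a hypothetical baseline sketch, not the paper's method.

```python
import numpy as np

def compose_query(img_emb, txt_emb, alpha=0.5):
    """Naive composed-query embedding: a convex combination of image and
    text embeddings, renormalised to unit length. Learned fusion modules
    replace this weighting in actual UMR systems."""
    q = alpha * np.asarray(img_emb) + (1 - alpha) * np.asarray(txt_emb)
    return q / np.linalg.norm(q)

# Toy example: orthogonal image and text embeddings fuse to the diagonal.
img = np.array([1.0, 0.0])
txt = np.array([0.0, 1.0])
q = compose_query(img, txt)
print(q)  # → [0.70710678 0.70710678]
```

Candidates of any modality can then be ranked against `q` with the same cosine-similarity scoring used for single-modality queries.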

  13. MINIMA: Modality Invariant Image Matching

    Paper · Mar 29, 2025 · arxiv.org · Jiangwei Ren, Xingyu Jiang, Zizhuo Li, Dingkang Liang, Xin Zhou, Xiang Bai

    Image matching for both cross-view and cross-modality plays a critical role in multimodal perception. In practice, the modality gap caused by different imaging systems/styles poses great challenges...

  14. CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval

    Paper · Mar 18, 2025 · arxiv.org · Yifan Xu, Xinhao Li, Yichun Yang, Desen Meng, Rui Huang, Limin Wang

    Video understanding, including video captioning and retrieval, is still a great challenge for video-language models (VLMs). The existing video retrieval and caption benchmarks only include short de...

  15. MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval

    Paper · Jan 10, 2026 · arxiv.org · Huaying Yuan, Jian Ni, Zheng Liu, Yueze Wang, Junjie Zhou, Zhengyang Liang, Bo Zhao, Zhao Cao, Zhicheng Dou, Ji-Rong Wen

    Accurately locating key moments within long videos is crucial for solving long video understanding (LVU) tasks. However, existing benchmarks are either severely limited in terms of video length and...

  16. Learning Fine-Grained Representations through Textual Token...

    Paper · Oct 4, 2024 · openreview.net · Yue Wu, Zhaobo Qi, Yiling Wu, Junshu Sun, Yaowei Wang, Shuhui Wang

    With the explosive growth of video data, finding videos that meet detailed requirements in large datasets has become a challenge. To address this, the composed video retrieval task has been...

Related Topics

Machine Learning (199) · Deep Learning (199) · Multimodal (199) · llm (7) · Multimodal Model (7)