Topic: Multimodal


Short answer

This page lists the most relevant public items for Multimodal, ranked by trend activity and review signals. Use the weekly view for fast-moving changes, the monthly view for more stable patterns, and the all-time view for evergreen picks.



  1. MiRA: A Zero-Shot Mixture-of-Reasoning Agents Framework

    Paper · Feb 20, 2026 · arXiv · Sethuraman et al., AAMAS 2026 Main Track

    We propose Mixture-of-Reasoning Agents (MiRA), a zero-shot multimodal framework that decomposes reasoning across three specialized agents: Visual Analyzing, Text Comprehending, and Judge. By consol...

  2. 7 AI Trends Every CTO Needs on Their Radar in 2026

    Blog · Feb 17, 2026 · Medium · Evangelist Apps

    In 2026, embedded AI is no longer a differentiator; it's table stakes. This post outlines seven trends reshaping product development, from the rise of 'Vertical AI' (built for specific industries) ...

  3. Cross Modal Retrieval with Querybank Normalisation

    Paper · Apr 18, 2022 · arxiv.org · Simion-Vlad Bogolin, Ioana Croitoru, Hailin Jin, Yang Liu, Samuel Albanie

    Profiting from large-scale training datasets, advances in neural architecture design and efficient inference, joint embeddings have become the dominant approach for tackling cross-modal retrieval. ...

  4. VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

    Paper · Oct 7, 2023 · arxiv.org · Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu

    Vision and text have been fully explored in contemporary video-text foundational models, while other modalities such as audio and subtitles in videos have not received sufficient attention. In this...

  5. CoVR: Learning Composed Video Retrieval from Web Video Captions

    Paper · May 30, 2024 · arxiv.org · Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol

    Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches requi...

  6. UniIR: Training and Benchmarking Universal Multimodal Information Retrievers

    Paper · Nov 28, 2023 · arxiv.org · Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, Wenhu Chen

    Existing information retrieval (IR) models often assume a homogeneous format, limiting their applicability to diverse user needs, such as searching for images with text descriptions, searching for ...

  7. Composed Video Retrieval via Enriched Context and Discriminative Embeddings

    Paper · Mar 25, 2024 · arxiv.org · Omkar Thawakar, Muzammal Naseer, Rao Muhammad Anwer, Salman Khan, Michael Felsberg, Mubarak Shah, Fahad Shahbaz Khan

    Composed video retrieval (CoVR) is a challenging problem in computer vision which has recently highlighted the integration of modification text with visual queries for more sophisticated video sear...

  8. MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions

    Paper · Jun 24, 2024 · arxiv.org · Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, Ming-Wei Chang

    Image retrieval, i.e., finding desired images given a reference image, inherently encompasses rich, multi-faceted search intents that are difficult to capture solely using image-based measures. Rec...

  9. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

    Paper · Feb 25, 2025 · arxiv.org · Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

    Decoder-only LLM-based embedding models are beginning to outperform BERT or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retrieval. In this work, ...

  10. E5-V: Universal Embeddings with Multimodal Large Language Models

    Paper · Jul 17, 2024 · arxiv.org · Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, Fuzhen Zhuang

    Multimodal large language models (MLLMs) have shown promising advancements in general visual and language understanding. However, the representation of multimodal information using MLLMs remains la...

  11. MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

    Paper · Feb 22, 2025 · arxiv.org · Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, Wei Ping

    State-of-the-art retrieval models typically address a straightforward search scenario, in which retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single m...

  12. LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

    Paper · Dec 2, 2024 · arxiv.org · Yikun Liu, Pingan Chen, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, Weidi Xie

    With the rapid advancement of multimodal information retrieval, increasingly complex retrieval tasks have emerged. Existing methods predominately rely on task-specific fine-tuning of vision-languag...

  13. MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

    Paper · Dec 19, 2024 · arxiv.org · Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, Yongping Xiong

    Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synt...

  14. GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

    Paper · Apr 1, 2025 · arxiv.org · Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, Min Zhang

    Universal Multimodal Retrieval (UMR) aims to enable search across various modalities using a unified model, where queries and candidates can consist of pure text, images, or a combination of both. ...

  15. MINIMA: Modality Invariant Image Matching

    Paper · Mar 29, 2025 · arxiv.org · Jiangwei Ren, Xingyu Jiang, Zizhuo Li, Dingkang Liang, Xin Zhou, Xiang Bai

    Image matching for both cross-view and cross-modality plays a critical role in multimodal perception. In practice, the modality gap caused by different imaging systems/styles poses great challenges...

  16. CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval

    Paper · Mar 18, 2025 · arxiv.org · Yifan Xu, Xinhao Li, Yichun Yang, Desen Meng, Rui Huang, Limin Wang

    Video understanding, including video captioning and retrieval, is still a great challenge for video-language models (VLMs). The existing video retrieval and caption benchmarks only include short de...


Top Entities In This Topic

Related Topics

FAQ

What does this Multimodal page rank?

It ranks public content for Multimodal using recent discussion, review, and engagement signals so you can triage faster. This guidance is specific to the Multimodal topic page on Attendemia.

How should I use weekly vs monthly vs all-time?

Use weekly for fast-moving updates, monthly for stable trend confirmation, and all-time for evergreen references.

How can I discover organizations active in Multimodal?

Use the linked entities section to jump to labs, companies, and experts connected to this topic and explore their timelines.

Can I follow this topic for updates?

Yes. Use the follow button on this page to subscribe and track new high-signal activity.