Topic: Multimodal Model

Short answer

This page shows the most relevant public items for Multimodal Model, ranked by trend activity and review signal. Use the weekly view for fast changes, the monthly view for more stable patterns, and the all-time view for evergreen picks.

Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
Paper • Nov 5, 2024 • arxiv.org • Xin Xiao, Bohong Wu, Jiacong Wang, Chunyuan Li, Xun Zhou, Haoyuan Guo
Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner. Despite being simple and effective, this method results in sub-op...
3DitScene: Editing Any Scene via Language-guided Disentangled Gaussian Splatting
Paper • May 28, 2024 • arxiv.org • Qihang Zhang, Yinghao Xu, Chaoyang Wang, Hsin-Ying Lee, Gordon Wetzstein, Bolei Zhou, Ceyuan Yang
Scene image editing is crucial for entertainment, photography, and advertising design. Existing methods solely focus on either 2D individual object or 3D global scene editing. This results in a lac...
ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance
Paper • Mar 14, 2025 • arxiv.org • Jiannan Huang, Jun Hao Liew, Hanshu Yan, Yuyang Yin, Yao Zhao, Humphrey Shi, Yunchao Wei
Recent text-to-image customization works have proven successful in generating images of given concepts by fine-tuning diffusion models on a few examples. However, tuning-based methods inherently te...
Unveiling the Tapestry of Consistency in Large Vision-Language Models
Paper • Oct 6, 2024 • arxiv.org • Yuan Zhang, Fei Xiao, Tao Huang, Chun-Kai Fan, Hongyuan Dong, Jiawen Li, Jiacong Wang, Kuan Cheng, Shanghang Zhang, Haoyuan Guo
Large vision-language models (LVLMs) have recently achieved rapid progress, exhibiting great perception and reasoning abilities concerning visual information. However, when faced with prompts in di...
PeRFlow: Piecewise Rectified Flow as Universal Plug-and-Play Accelerator
Paper • Sep 2, 2024 • arxiv.org • Hanshu Yan, Xingchao Liu, Jiachun Pan, Jun Hao Liew, Qiang Liu, Jiashi Feng
We present Piecewise Rectified Flow (PeRFlow), a flow-based method for accelerating diffusion models. PeRFlow divides the sampling process of generative flows into several time windows and straight...
StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation
Paper • May 2, 2024 • arxiv.org • Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, Qibin Hou
For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant ch...
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Paper • Apr 29, 2024 • arxiv.org • Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, Jiashi Feng
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications. Yet, the pre-training process for video-related tasks demands exceptionally l...
Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis
Paper • Nov 4, 2024 • arxiv.org • Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, Xuefeng Xiao
Recently, a series of diffusion-aware distillation algorithms have emerged to alleviate the computational overhead associated with the multi-step inference process of Diffusion Models (DMs). Curren...
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing
Paper • Apr 15, 2024 • arxiv.org • Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, Cihang Xie
This study introduces HQ-Edit, a high-quality instruction-based image editing dataset with around 200,000 edits. Unlike prior approaches relying on attribute guidance or human feedback on building ...
Magic-Boost: Boost 3D Generation with Multi-View Conditioned Diffusion
Paper • Jan 9, 2025 • arxiv.org • Fan Yang, Jianfeng Zhang, Yichun Shi, Bowen Chen, Chenxu Zhang, Huichao Zhang, Xiaofeng Yang, Xiu Li, Jiashi Feng, Guosheng Lin
Benefiting from the rapid development of 2D diffusion models, 3D content generation has witnessed significant progress. One promising solution is to finetune the pre-trained 2D diffusion models to ...
You Only Sample Once: Taming One-Step Text-to-Image Synthesis by Self-Cooperative Diffusion GANs
Paper • Feb 25, 2025 • arxiv.org • Yihong Luo, Xiaolong Chen, Xinghua Qu, Tianyang Hu, Jing Tang
Recently, some works have tried to combine diffusion and Generative Adversarial Networks (GANs) to alleviate the computational cost of the iterative denoising inference in Diffusion Models (DMs). H...
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
Paper • Feb 23, 2024 • arxiv.org • Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu
We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 ...
SDXL-Lightning: Progressive Adversarial Diffusion Distillation
Paper • Mar 2, 2024 • arxiv.org • Shanchuan Lin, Anran Wang, Xiao Yang
We propose a diffusion distillation method that achieves new state-of-the-art in one-step/few-step 1024px text-to-image generation based on SDXL. Our method combines progressive and adversarial dis...
Magic-Me: Identity-Specific Video Customized Diffusion
Paper • Mar 20, 2024 • arxiv.org • Ze Ma, Daquan Zhou, Chun-Hsiao Yeh, Xue-She Wang, Xiuyu Li, Huanrui Yang, Zhen Dong, Kurt Keutzer, Jiashi Feng
Creating content with specified identities (ID) has attracted significant interest in the field of generative models. In the field of text-to-image generation (T2I), subject-driven creation has ach...
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
Paper • Apr 7, 2024 • arxiv.org • Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao
This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation mode...
Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
Paper • Feb 5, 2025 • arxiv.org • Mingfei Han, Linjie Yang, Xiaojun Chang, Lina Yao, Heng Wang
A short clip of video may contain progression of multiple events and an interesting story line. A human needs to capture both the event in every shot and associate them together to understand the st...
Vista-LLaMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens
Paper • Mar 3, 2025 • arxiv.org • Fan Ma, Xiaojie Jin, Heng Wang, Yuchen Xian, Jiashi Feng, Yi Yang
Recent advances in large video-language models have displayed promising outcomes in video comprehension. Current approaches straightforwardly convert video into language tokens and employ large lan...
MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
Paper • Nov 27, 2023 • arxiv.org • Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, Mike Zheng Shou
This paper studies the human image animation task, which aims to generate a video of a certain reference identity following a particular motion sequence. Existing animation works typically employ t...
Make Pixels Dance: High-Dynamic Video Generation
Paper • Nov 18, 2023 • arxiv.org • Yan Zeng, Guoqiang Wei, Jiani Zheng, Jiaxin Zou, Yang Wei, Yuchen Zhang, Hang Li
Creating high-dynamic videos such as motion-rich actions and sophisticated visual effects poses a significant challenge in the field of artificial intelligence. Unfortunately, current state-of-the-...
SALMONN: Towards Generic Hearing Abilities for Large Language Models
Paper • Apr 8, 2024 • arxiv.org • Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang
Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical world, which refers to the perception and understanding of general auditory information consisting of...

FAQ

What does this Multimodal Model page rank?

It ranks public content for Multimodal Model using recent discussion, review, and engagement signals so you can triage faster. This guidance is specific to the Multimodal Model topic page on Attendemia and is written so it still makes sense without reading other sections on the page.

How should I use weekly vs monthly vs all-time?

Use weekly for fast-moving updates, monthly for stable trend confirmation, and all-time for evergreen references.

How can I discover organizations active in Multimodal Model?

Use the linked entities section to jump to labs, companies, and experts connected to this topic and explore their timelines.

Can I follow this topic for updates?

Yes. Use the follow button on this page to subscribe and track new high-signal activity.
