Topic: World Model


Short answer

This page shows the most relevant public items for World Model, ranked by trend activity and review signal. Use weekly for fast changes, monthly for more stable patterns, and all-time for evergreen picks.


An X-ray Significantly Variable, Luminous, Type 2 Quasar at z = 2.99 with a Massive Host Galaxy
Paper • Sep 3, 2024 • arxiv.org • Xiurui Zhao, Stefano Marchesi, Marco Ajello, Francesca Civano, Roberto Gilli, Giorgio Lanzuisi, Iván E. López, Ross Silver, Nuria Torres-Albà, Peter G. Boorman, Andrealuna Pizzetti
We present a comprehensive X-ray analysis and spectral energy distribution (SED) fitting of WISEA J171419.96+602724.6, an extremely luminous type 2 quasar at $z$ = 2.99. The source was suggested as...
ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
Paper • Apr 2, 2025 • arxiv.org • Borui Wan, Mingji Han, Yiyao Sheng, Yanghua Peng, Haibin Lin, Mofan Zhang, Zhichao Lai, Menghan Yu, Junda Zhang, Zuquan Song, Xin Liu, Chuan Wu
Checkpointing to preserve training states is crucial during the development of Large Foundation Models (LFMs), for training resumption upon various failures or changes in GPU resources and parallel...
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Paper • Jul 28, 2024 • arxiv.org • Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li
Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, their appli...
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model
Paper • Jul 10, 2024 • arxiv.org • Yatai Ji, Shilong Zhang, Jie Wu, Peize Sun, Weifeng Chen, Xuefeng Xiao, Sidi Yang, Yujiu Yang, Ping Luo
The rapid advancement of Large Vision-Language models (LVLMs) has demonstrated a spectrum of emergent capabilities. Nevertheless, current models only focus on the visual content of a single scenari...
Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
Paper • Jul 10, 2024 • arxiv.org • Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li, Xiaoyang Li, Zeyang Li, Zehua Lin, Rui Liu, Shouda Liu, Lu Lu, Yizhou Lu, Jingting Ma, Shengtao Ma, Yulin Pei, Chen Shen, Tian Tan, Xiaogang Tian, Ming Tu, Bo Wang, Hao Wang, Yuping Wang, Yuxuan Wang, Hanzhang Xia, Rui Xia, Shuangyi Xie, Hongmin Xu, Meng Yang, Bihong Zhang, Jun Zhang, Wanyi Zhang, Yang Zhang, Yawei Zhang, Yijie Zheng, Ming Zou
Modern automatic speech recognition (ASR) model is required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc) given the specific contextual informati...
Let the Code LLM Edit Itself When You Edit the Code
Paper • Mar 4, 2025 • arxiv.org • Zhenyu He, Jun Zhang, Shengjie Luo, Jingjing Xu, Zhi Zhang, Di He
In this work, we investigate a typical scenario in code generation where a developer edits existing code in real time and requests a code assistant, e.g., a large language model, to re-predict the ...
SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words
Paper • Jan 16, 2025 • arxiv.org • Junyi Ao, Yuancheng Wang, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu
Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communicat...
Depth Anything V2
Paper • Oct 20, 2024 • arxiv.org • Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao
This work presents Depth Anything V2. Without pursuing fancy techniques, we aim to reveal crucial findings to pave the way towards building a powerful monocular depth estimation model. Notably, com...
Autoregressive Pretraining with Mamba in Vision
Paper • Jun 11, 2024 • arxiv.org • Sucheng Ren, Xianhang Li, Haoqin Tu, Feng Wang, Fangxun Shu, Lei Zhang, Jieru Mei, Linjie Yang, Peng Wang, Heng Wang, Alan Yuille, Cihang Xie
The vision community has started to build with the recently developed state space model, Mamba, as the new backbone for a range of tasks. This paper shows that Mamba's visual capability can be sign...
Towards Semantic Equivalence of Tokenization in Multimodal LLM
Paper • Feb 26, 2025 • arxiv.org • Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in processing vision-language tasks. One of the crux of MLLMs lies in vision tokenization, which involves efficie...
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Paper • Jun 4, 2024 • arxiv.org • Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, Xudong Liu, Yuchen Liu, Zhengxi Liu, Lu Lu, Junjie Pan, Xin Wang, Yuping Wang, Yuxuan Wang, Zhen Wei, Jian Wu, Chao Yao, Yifeng Yang, Yuanhao Yi, Junteng Zhang, Qidi Zhang, Shuo Zhang, Wenjie Zhang, Yang Zhang, Zilin Zhao, Dejian Zhong, Xiaobin Zhuang
We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a fo...
Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
Paper • Nov 5, 2024 • arxiv.org • Xin Xiao, Bohong Wu, Jiacong Wang, Chunyuan Li, Xun Zhou, Haoyuan Guo
Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner. Despite being simple and effective, this method results in sub-op...
3DitScene: Editing Any Scene via Language-guided Disentangled Gaussian Splatting
Paper • May 28, 2024 • arxiv.org • Qihang Zhang, Yinghao Xu, Chaoyang Wang, Hsin-Ying Lee, Gordon Wetzstein, Bolei Zhou, Ceyuan Yang
Scene image editing is crucial for entertainment, photography, and advertising design. Existing methods solely focus on either 2D individual object or 3D global scene editing. This results in a lac...
ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance
Paper • Mar 14, 2025 • arxiv.org • Jiannan Huang, Jun Hao Liew, Hanshu Yan, Yuyang Yin, Yao Zhao, Humphrey Shi, Yunchao Wei
Recent text-to-image customization works have proven successful in generating images of given concepts by fine-tuning diffusion models on a few examples. However, tuning-based methods inherently te...
Unveiling the Tapestry of Consistency in Large Vision-Language Models
Paper • Oct 6, 2024 • arxiv.org • Yuan Zhang, Fei Xiao, Tao Huang, Chun-Kai Fan, Hongyuan Dong, Jiawen Li, Jiacong Wang, Kuan Cheng, Shanghang Zhang, Haoyuan Guo
Large vision-language models (LVLMs) have recently achieved rapid progress, exhibiting great perception and reasoning abilities concerning visual information. However, when faced with prompts in di...
PeRFlow: Piecewise Rectified Flow as Universal Plug-and-Play Accelerator
Paper • Sep 2, 2024 • arxiv.org • Hanshu Yan, Xingchao Liu, Jiachun Pan, Jun Hao Liew, Qiang Liu, Jiashi Feng
We present Piecewise Rectified Flow (PeRFlow), a flow-based method for accelerating diffusion models. PeRFlow divides the sampling process of generative flows into several time windows and straight...
StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation
Paper • May 2, 2024 • arxiv.org • Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, Qibin Hou
For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant ch...
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Paper • Apr 29, 2024 • arxiv.org • Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, Jiashi Feng
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications. Yet, the pre-training process for video-related tasks demands exceptionally l...
Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis
Paper • Nov 4, 2024 • arxiv.org • Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, Xuefeng Xiao
Recently, a series of diffusion-aware distillation algorithms have emerged to alleviate the computational overhead associated with the multi-step inference process of Diffusion Models (DMs). Curren...
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing
Paper • Apr 15, 2024 • arxiv.org • Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, Cihang Xie
This study introduces HQ-Edit, a high-quality instruction-based image editing dataset with around 200,000 edits. Unlike prior approaches relying on attribute guidance or human feedback on building ...


FAQ

What does this World Model page rank?

It ranks public content for World Model using recent discussion, review, and engagement signals so you can triage faster. This guidance is specific to the World Model topic page on Attendemia.

How should I use weekly vs monthly vs all-time?

Use weekly for fast-moving updates, monthly for stable trend confirmation, and all-time for evergreen references.

How can I discover organizations active in World Model?

Use the linked entities section to jump to labs, companies, and experts connected to this topic and explore their timelines.

Can I follow this topic for updates?

Yes. Use the follow button on this page to subscribe and track new high-signal activity.
