Topic: Multimodal Model


Short answer

This page shows the most relevant public items for Multimodal Model, ranked by trend activity and review signal. Use the weekly view for fast-moving changes, the monthly view for more stable patterns, and the all-time view for evergreen picks.



  1. Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts

    Paper · Mar 4, 2025 · arxiv.org · Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, Xin Liu

    Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. The development of large MoE models i... (For background, a minimal MoE routing sketch appears after this list.)

  2. MagicArticulate: Make Your 3D Models Articulation-Ready

    Paper · Feb 18, 2025 · arxiv.org · Chaoyue Song, Jianfeng Zhang, Xiu Li, Fan Yang, Yiwen Chen, Zhongcong Xu, Jun Hao Liew, Xiaoyang Guo, Fayao Liu, Jiashi Feng, Guosheng Lin

    With the explosive growth of 3D content creation, there is an increasing demand for automatically converting static 3D models into articulation-ready versions that support realistic animation. Trad...

  3. Reformulation for Pretraining Data Augmentation

    Paper · May 19, 2025 · arxiv.org · Xintong Hao, Ruijie Zhu, Ge Zhang, Ke Shen, Chenggang Li

    Despite the impressive capabilities of large language models across various tasks, their continued scaling is severely hampered not only by data scarcity but also by the performance degradation ass...

  4. BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem Proving

    Paper · Oct 9, 2025 · arxiv.org · Ran Xin, Chenguang Xi, Jie Yang, Feng Chen, Hang Wu, Xia Xiao, Yifan Sun, Shen Zheng, Kai Shen

    Recent advancements in large language models (LLMs) have spurred growing interest in automatic theorem proving using Lean4, where effective tree search methods are crucial for navigating the underl... (A generic best-first search sketch appears after this list.)

  5. UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Paper · Jan 21, 2025 · arxiv.org · Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi

    This paper introduces UI-TARS, a native GUI agent model that solely perceives screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing ...

  6. VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

    Paper · Mar 5, 2025 · arxiv.org · Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin

    This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). W...

  7. Diffusion Adversarial Post-Training for One-Step Video Generation

    Paper · Oct 1, 2025 · arxiv.org · Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, Lu Jiang

    Diffusion models are widely used for image and video generation, but their iterative generation process is slow and expensive. While existing distillation approaches have demonstrated the poten...

  8. FullStack Bench: Evaluating LLMs as Full Stack Coders

    Paper · May 12, 2025 · arxiv.org · Bytedance-Seed-Foundation-Code-Team, Yao Cheng, Jianfeng Chen, Jie Chen, Li Chen, Liyu Chen, Wentao Chen, Zhengyu Chen, Shijie Geng, Aoyan Li, Bo Li, Bowen Li, Linyi Li, Boyi Liu, Jiaheng Liu, Kaibo Liu, Qi Liu, Shukai Liu, Siyao Liu, Tianyi Liu, Tingkai Liu, Yongfei Liu, Rui Long, Jing Mai, Guanghan Ning, Z. Y. Peng, Kai Shen, Jiahao Su, Jing Su, Tao Sun, Yifan Sun, Yunzhe Tao, Guoyin Wang, Siwei Wang, Xuwu Wang, Yite Wang, Zihan Wang, Jinxiang Xia, Liang Xiang, Xia Xiao, Yongsheng Xiao, Chenguang Xi, Shulin Xin, Jingjing Xu, Shikun Xu, Hongxia Yang, Jack Yang, Yingxiang Yang, Jianbo Yuan, Jun Zhang, Yufeng Zhang, Yuyu Zhang, Shen Zheng, He Zhu, Ming Zhu

    As the capabilities of code large language models (LLMs) continue to expand, their applications across diverse code intelligence domains are rapidly increasing. However, most existing datasets only...

  9. Ultra-Sparse Memory Network

    Paper · Feb 6, 2025 · arxiv.org · Zihao Huang, Qiyang Min, Hongzhi Huang, Defa Zhu, Yutao Zeng, Ran Guo, Xun Zhou

    It is widely acknowledged that the performance of Transformer models is logarithmically related to their number of parameters and computational complexity. While approaches like Mixture of Experts ...

  10. Understanding Chain-of-Thought in LLMs through Information Theory

    Paper · Jul 10, 2025 · arxiv.org · Jean-Francois Ton, Muhammad Faaiz Taufiq, Yang Liu

    Large Language Models (LLMs) have shown impressive performance in complex reasoning tasks through the use of Chain-of-Thought (CoT) reasoning, allowing models to break down problems into manageable... (A toy information-gain calculation appears after this list.)

  11. LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing

    Paper · Nov 13, 2024 · arxiv.org · Xiaonan Nie, Qibin Liu, Fangcheng Fu, Shenhan Zhu, Xupeng Miao, Xiaoyang Li, Yang Zhang, Shouda Liu, Bin Cui

    Larger transformer models consistently perform better on various tasks, but scaling up the model size is costly. To efficiently enlarge models, the mixture-of-experts (MoE) architecture is widel... (A generic locality-sensitive hashing sketch appears after this list.)

  12. SeedEdit: Align Image Re-Generation to Image Editing

    Paper · Nov 11, 2024 · arxiv.org · Yichun Shi, Peng Wang, Weilin Huang

    We introduce SeedEdit, a diffusion model that is able to revise a given image with any text prompt. In our perspective, the key to such a task is to obtain an optimal balance between maintaining th...

  13. Multi-Reward as Condition for Instruction-based Image Editing

    Paper · Mar 20, 2025 · arxiv.org · Xin Gu, Ming Li, Libo Zhang, Fan Chen, Longyin Wen, Tiejian Luo, Sijie Zhu

    High-quality training triplets (instruction, original image, edited image) are essential for instruction-based image editing. Predominant training datasets (e.g., InsPix2Pix) are created using text...

  14. Classification Done Right for Vision-Language Pre-Training

    Paper · Nov 6, 2024 · arxiv.org · Zilong Huang, Qinghao Ye, Bingyi Kang, Jiashi Feng, Haoqi Fan

    We introduce SuperClass, a super simple classification method for vision-language pre-training on image-text data. Unlike its contrastive counterpart CLIP, which contrasts with a text encoder, SuperCla...

  15. How Far is Video Generation from World Model: A Physical Law Perspective

    Paper · Jun 22, 2025 · arxiv.org · Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, Jiashi Feng

    OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, the ability of video generation models to discover such law...

  16. AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions

    Paper · Nov 5, 2024 · arxiv.org · Ziming Li, Qianbo Zang, David Ma, Jiawei Guo, Tuney Zheng, Minghao Liu, Xinyao Niu, Yue Wang, Jian Yang, Jiaheng Liu, Wanjun Zhong, Wangchunshu Zhou, Wenhao Huang, Ge Zhang

    Data science tasks involving tabular data present complex challenges that require sophisticated problem-solving approaches. We propose AutoKaggle, a powerful and user-centric framework that assists...

  17. Why Does the Effective Context Length of LLMs Fall Short?

    Paper · Oct 24, 2024 · arxiv.org · Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, Lingpeng Kong

    Advancements in distributed training and efficient attention mechanisms have significantly expanded the context window sizes of large language models (LLMs). However, recent work reveals that the e...
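For readers new to the mixture-of-experts pattern behind item 1, here is a minimal top-k routing sketch in NumPy. It is background only: a generic MoE forward pass, not Comet's computation-communication overlapping scheme, and every name in it (`moe_forward`, `gate_w`, `top_k`) is illustrative.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Minimal top-k mixture-of-experts dispatch (illustrative only).

    x: (tokens, d) input activations
    gate_w: (d, n_experts) router weights
    experts: list of callables, each mapping (m, d) -> (m, d)
    """
    logits = x @ gate_w                           # (tokens, n_experts) router scores
    top = np.argsort(-logits, axis=1)[:, :top_k]  # top-k expert indices per token
    # Softmax over the selected experts' logits only
    sel = np.take_along_axis(logits, top, axis=1)
    weights = np.exp(sel - sel.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)

    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        for slot in range(top_k):
            mask = top[:, slot] == e              # tokens routed to expert e via this slot
            if mask.any():
                out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
    return out

# Toy usage: four linear "experts" on random data
rng = np.random.default_rng(0)
d, n_experts = 8, 4
x = rng.normal(size=(16, d))
gate_w = rng.normal(size=(d, n_experts))
experts = [lambda h, W=rng.normal(size=(d, d)): h @ W for _ in range(n_experts)]
print(moe_forward(x, gate_w, experts).shape)  # (16, 8)
```

The per-expert loop is where real systems pay the communication cost that Comet targets: tokens must be gathered to, and scattered back from, whichever devices host each expert.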
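Item 4 relies on best-first tree search. The sketch below shows the generic pattern with a priority queue; the `expand`, `score`, and `is_goal` callables are toy placeholders standing in for BFS-Prover's tactic generator and value model, which the excerpt above does not specify.

```python
import heapq
from itertools import count

def best_first_search(root, expand, score, is_goal, max_nodes=10_000):
    """Generic best-first search: always expand the highest-scoring frontier node.

    expand(node)  -> iterable of child nodes
    score(node)   -> float, higher is more promising (e.g. a learned value)
    is_goal(node) -> bool
    """
    tie = count()  # tiebreaker so heapq never has to compare nodes directly
    frontier = [(-score(root), next(tie), root)]  # max-heap via negated scores
    visited = 0
    while frontier and visited < max_nodes:
        _, _, node = heapq.heappop(frontier)
        visited += 1
        if is_goal(node):
            return node
        for child in expand(node):
            heapq.heappush(frontier, (-score(child), next(tie), child))
    return None  # search budget exhausted

# Toy usage: reach 42 from 1 using increment/double moves
goal = 42
found = best_first_search(
    root=1,
    expand=lambda n: [n + 1, n * 2],
    score=lambda n: -abs(goal - n),  # closer to the goal scores higher
    is_goal=lambda n: n == goal,
)
print(found)  # 42
```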
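Item 10 analyzes chain-of-thought through information theory. As loose background only, the toy calculation below measures how much a single reasoning step reduces uncertainty (Shannon entropy) over candidate answers; the distributions are made up and this is not the paper's formulation.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Toy example: belief over 4 candidate answers before and after
# observing one intermediate reasoning step (made-up numbers).
prior = [0.25, 0.25, 0.25, 0.25]      # no reasoning yet: 2 bits of uncertainty
posterior = [0.70, 0.20, 0.05, 0.05]  # after one CoT step: sharper belief
gain = entropy(prior) - entropy(posterior)
print(f"H(prior)={entropy(prior):.2f} bits, "
      f"H(posterior)={entropy(posterior):.2f} bits, gain={gain:.2f} bits")
```

A step that leaves the posterior as flat as the prior carries no information about the answer, which is the intuition behind scoring intermediate reasoning steps this way.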
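Item 11 groups similar tokens with locality-sensitive hashing to cut MoE communication. Below is a standard random-hyperplane (SimHash) bucketing sketch; it is a generic LSH scheme and may differ from the paper's exact hash family.

```python
import numpy as np

def simhash_buckets(vectors, n_planes=8, seed=0):
    """Random-hyperplane LSH: a vector's bucket is the bit pattern of which
    side of each random hyperplane it falls on. Vectors with high cosine
    similarity tend to land in the same bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(vectors.shape[1], n_planes))
    bits = (vectors @ planes) > 0              # (n, n_planes) sign bits
    return bits @ (1 << np.arange(n_planes))   # pack bits into integer bucket ids

# Toy usage: a vector, a tiny perturbation of it, and an unrelated vector
rng = np.random.default_rng(1)
base = rng.normal(size=(1, 16))
near = base + 0.01 * rng.normal(size=(1, 16))
far = rng.normal(size=(1, 16))
print(simhash_buckets(np.vstack([base, near, far])))
# base and near almost always share a bucket id; far usually differs
```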




FAQ

What does this Multimodal Model page rank?

It ranks public content for Multimodal Model using recent discussion, review, and engagement signals so you can triage faster.

How should I use weekly vs monthly vs all-time?

Use weekly for fast-moving updates, monthly for stable trend confirmation, and all-time for evergreen references.

How can I discover organizations active in Multimodal Model?

Use the linked entities section to jump to labs, companies, and experts connected to this topic and explore their timelines.

Can I follow this topic for updates?

Yes. Use the follow button on this page to subscribe and track new high-signal activity.