Paper · 2025-10-17

MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production

Authors

Chao Jin · Ziheng Jiang · Zhihao Bai · Zheng Zhong · Juncai Liu · Xiang Li · Ningxin Zheng · Xi Wang · Cong Xie · Qi Huang · Wen Heng · Yiyuan Ma · Wenlei Bao · Size Zheng · Yanghua Peng · Haibin Lin · Xuanzhe Liu · Xin Jin · Xin Liu

ABSTRACT

We present MegaScale-MoE, a production system tailored for the efficient training of large-scale mixture-of-experts (MoE) models. MoE emerges as a promising architecture to scale large language models (LLMs) to unprecedented sizes, thereby enhancing model performance. However, existing MoE training systems experience a degradation in training efficiency, exacerbated by the escalating scale of MoE models and the continuous evolution of hardware. Recognizing the pivotal role of efficient communication in enhancing MoE training, MegaScale-MoE customizes communication-efficient parallelism strategies for attention and FFNs in each MoE layer and adopts a holistic approach to overlap communication with computation at both inter- and intra-operator levels. Additionally, MegaScale-MoE applies communication compression with adjusted communication patterns to lower precision, further improving training efficiency. When training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, MegaScale-MoE achieves a training throughput of 1.41M tokens/s, improving the efficiency by 1.88$\times$ compared to Megatron-LM. We share our operational experience in accelerating MoE training and hope that by offering our insights in system design, this work will motivate future research in MoE systems.

ByteDance Research
