DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Authors
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Lin Yan, Mu Qiao, Yonghui Wu, Mingxuan Wang

ABSTRACT

Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique for eliciting complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (for example, in the OpenAI o1 blog and the DeepSeek R1 technical report), so the community still struggles to reproduce their RL training results. We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
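
The abstract names the two ideas behind the acronym, a decoupled clip and dynamic sampling, without spelling them out. The sketch below is one minimal reading of those names, not the authors' released code: a PPO-style clipped surrogate whose lower and upper clip ranges are set independently, and a filter that drops prompt groups whose sampled rewards are all identical (and so carry no learning signal). The function names, the pure-Python list interface, and the default values of `eps_low` and `eps_high` are assumptions made for illustration.

```python
import math


def decoupled_clip_objective(logp_new, logp_old, advantages,
                             eps_low=0.2, eps_high=0.28):
    """Token-averaged clipped surrogate with separate lower/upper clip ranges.

    All arguments are flat lists with one entry per token. The eps defaults
    are illustrative placeholders, not values taken from the abstract.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Importance ratio between the current and the sampling policy.
        ratio = math.exp(lp_new - lp_old)
        # Decoupled clip: the lower and upper bounds are chosen independently.
        clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
        # Pessimistic (min) surrogate, as in standard PPO clipping.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def keep_prompt(rewards):
    """Hypothetical dynamic-sampling filter: keep a prompt only if the
    rewards of its sampled responses are not all the same."""
    return len(set(rewards)) > 1
```

With these definitions, a group of responses that are all correct or all incorrect would be skipped by `keep_prompt`, and a positive-advantage token can have its ratio grow up to `1 + eps_high` before the surrogate stops rewarding further increases.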
