
Quick answer

AI Summary: MASPO is a new method for training language models to reason better on math and logic tasks by stabilizing the learning process. It addresses a common failure mode where training is destabilized by unreliable reward feedback.

Claim

MASPO: Robust and Sample-Efficient LLM Reasoning via Unified Policy Optimization

Authors
Xiaoliang Fu · Jiaye Lin · Yangyi Fang

ABSTRACT

Policy optimization for Large Language Models often suffers from gradient instability and reward signal unreliability, particularly in mathematical and verifiable reasoning tasks. We introduce MASPO, a framework that unifies gradient utilization, probability mass, and signal reliability into a single differentiable objective. MASPO consistently outperforms existing methods like GRPO across various model scales, leading to more robust and sample-efficient fine-tuning for autonomous agents. Our experiments demonstrate that MASPO-trained models exhibit a significant reduction in 'alignment drift' during multi-turn interactions.
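The abstract describes a single differentiable objective combining gradient utilization, probability mass, and signal reliability, but does not give the formula. As a rough illustration only, the sketch below shows a GRPO-style clipped policy-gradient objective extended with a hypothetical per-sample reliability weight; the function names, the weighting scheme, and all parameters are assumptions, not the paper's actual method.

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each reward against its group
    mean and standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # epsilon guards against zero-variance groups
    return [(r - mean) / std for r in rewards]

def unified_loss(logprobs, old_logprobs, rewards, reliability, clip_eps=0.2):
    """Hypothetical unified objective: PPO-style clipped ratio term,
    down-weighted per sample by a reliability score in [0, 1]."""
    advs = group_relative_advantages(rewards)
    total = 0.0
    for lp, old_lp, adv, w in zip(logprobs, old_logprobs, advs, reliability):
        ratio = math.exp(lp - old_lp)          # importance ratio pi/pi_old
        clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
        total += w * min(ratio * adv, clipped * adv)
    return -total / len(logprobs)  # negate: minimizing loss maximizes objective
```

Setting `reliability` below 1.0 for samples with noisy reward signals shrinks their gradient contribution, which is one plausible way a "signal reliability" term could stabilize training.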

Review Snapshot

4.3 ★★★★ (6 ratings)

5 star: 50% · 4 star: 33% · 3 star: 17% · 2 star: 0% · 1 star: 0%

Recommendation: 100% recommend this content.

