MASPO: Robust and Sample-Efficient LLM Reasoning via Unified Policy Optimization
Paper • Feb 19, 2026 • arXiv • Xiaoliang Fu, Jiaye Lin, Yangyi Fang
Policy optimization for Large Language Models often suffers from gradient instability and reward signal unreliability, particularly in mathematical and verifiable reasoning tasks. We introduce MASP...