KEPO: Knowledge-Enhanced Preference Optimization for Reinforcement Learning with Reasoning
Paper • Feb 4, 2026 • arXiv • Fan Yang, Rui Meng, Trudi Di Qi, Ali Ezzati, Yuxin Wen
Direct Preference Optimization (DPO) and its variants have revolutionized LLM alignment, yet they struggle when the preferred choice requires deep, multi-step reasoning. We introduce KEPO, a framew...