AI Summary: Enhances DPO by including intermediate reasoning chains in preference datasets, leading to more robust alignment for complex problem solving.
Direct Preference Optimization (DPO) and its variants have revolutionized LLM alignment, yet they struggle when the preferred choice requires deep, multi-step reasoning. We introduce KEPO, a framework that augments preference datasets with structured knowledge and intermediate "reasoning chains". By optimizing the model not just to choose the right answer, but to follow the right logic path, we show significant gains on the MATH and GSM8K benchmarks without increasing parameter count.
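To make the optimization target concrete, the sketch below shows the standard DPO loss on a single preference pair. The KEPO-specific detail (an assumption, since the abstract does not give the formula) is that the log-probabilities are scored over the full reasoning chain plus the answer, rather than over the answer alone; the numeric values are purely illustrative.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on one preference pair.

    logp_w / logp_l: policy log-probs of the chosen / rejected sequence.
    ref_logp_*: the same log-probs under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written out explicitly
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical KEPO-style pair: each sequence is a reasoning chain
# followed by an answer, so the log-probs cover the whole chain
# (assumed detail, not stated in the abstract).
# chosen = correct chain + correct answer; rejected = flawed chain.
loss = dpo_loss(logp_w=-12.0, logp_l=-15.0,
                ref_logp_w=-13.0, ref_logp_l=-14.0)
```

Because the rejected sequence is the whole flawed chain, the gradient pushes probability mass away from the incorrect intermediate steps, not only the final answer token.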