
Quick answer

AI Summary: KEPO enhances DPO by including intermediate reasoning chains in preference datasets, yielding more robust alignment for complex problem solving.
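
For concreteness, a knowledge-augmented preference record in this spirit might look like the sketch below; the field names and schema are illustrative assumptions, not a format published with KEPO.

```python
# Hypothetical preference record augmented with reasoning chains.
# The schema (prompt/chosen/rejected, reasoning_chain, answer) is an
# assumption for illustration, not KEPO's published data format.
record = {
    "prompt": "If 3x + 5 = 20, what is x?",
    "chosen": {
        "reasoning_chain": [
            "Subtract 5 from both sides: 3x = 15.",
            "Divide both sides by 3: x = 5.",
        ],
        "answer": "x = 5",
    },
    "rejected": {
        "reasoning_chain": [
            "Divide 20 by 3: x = 20/3.",  # skips the subtraction step
        ],
        "answer": "x = 20/3",
    },
}
```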

Claim

KEPO: Knowledge-Enhanced Preference Optimization for Reinforcement Learning with Reasoning

Authors
Fan Yang · Rui Meng · Trudi Di Qi · Ali Ezzati · Yuxin Wen

Abstract

Direct Preference Optimization (DPO) and its variants have revolutionized LLM alignment, yet they struggle when the preferred response requires deep, multi-step reasoning. We introduce KEPO, a framework that augments preference datasets with structured knowledge and intermediate reasoning chains. By optimizing the model not only to choose the correct answer but also to follow the correct reasoning path, we show significant gains on the MATH and GSM8K benchmarks without increasing the parameter count.
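
To make the training signal concrete, here is a minimal sketch of a DPO-style objective applied to full (prompt + reasoning chain + answer) sequences, assuming Hugging-Face-style causal language models and PyTorch; it illustrates the idea of preferring the right logic path and is not the authors' implementation.

```python
# Sketch of a DPO loss over sequences that include reasoning chains.
# Assumes `policy` and `reference` are Hugging-Face-style causal LMs
# whose forward pass returns an object with a `.logits` attribute.
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, attention_mask, labels):
    """Sum of token log-probs over the response (chain + answer).

    Prompt tokens carry the label -100 and are excluded, so the score
    covers the reasoning chain as well as the final answer.
    """
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    logits, labels = logits[:, :-1, :], labels[:, 1:]  # shift: token t predicts t+1
    mask = labels != -100
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(
        logps, 2, labels.clamp(min=0).unsqueeze(-1)
    ).squeeze(-1)
    return (token_logps * mask).sum(-1)

def chain_dpo_loss(policy, reference, chosen, rejected, beta=0.1):
    """Standard DPO objective, but the `chosen`/`rejected` batches hold
    full prompt + reasoning chain + answer token sequences."""
    pi_chosen = sequence_logprob(policy, **chosen)
    pi_rejected = sequence_logprob(policy, **rejected)
    with torch.no_grad():  # the reference model is frozen
        ref_chosen = sequence_logprob(reference, **chosen)
        ref_rejected = sequence_logprob(reference, **rejected)
    margins = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(margins).mean()
```

Because the label mask covers the reasoning chain as well as the final answer, the preference gradient rewards the whole logic path rather than only the terminal answer tokens.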

Review Snapshot

4.2 ★★★★ (5 ratings)

5 star: 40%
4 star: 40%
3 star: 20%
2 star: 0%
1 star: 0%

Recommendation

80% of reviewers recommend this content.

