AI Summary: Enhances DPO by including intermediate reasoning chains in preference datasets, leading to more robust alignment for complex problem solving.
Direct Preference Optimization (DPO) and its variants have revolutionized LLM alignment, yet they struggle when the preferred choice requires deep, multi-step reasoning. We introduce KEPO, a framework that augments preference datasets with structured knowledge and intermediate "reasoning chains". By optimizing the model not just to choose the right answer, but to follow the right logic path, we show significant gains on the MATH and GSM8K benchmarks without increasing parameter count.
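To make the optimization target concrete, the sketch below shows the standard DPO loss on a single preference pair. The KEPO-specific detail (an assumption, since the abstract does not give the formula) is that the log-probabilities are scored over the full reasoning chain plus the answer, rather than over the answer alone; the numeric values are purely illustrative.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on one preference pair.

    logp_w / logp_l: policy log-probs of the chosen / rejected sequence.
    ref_logp_*: the same log-probs under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written out explicitly
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical KEPO-style pair: each sequence is a reasoning chain
# followed by an answer, so the log-probs cover the whole chain
# (assumed detail, not stated in the abstract).
# chosen = correct chain + correct answer; rejected = flawed chain.
loss = dpo_loss(logp_w=-12.0, logp_l=-15.0,
                ref_logp_w=-13.0, ref_logp_l=-14.0)
```

Because the rejected sequence is the whole flawed chain, the gradient pushes probability mass away from the incorrect intermediate steps, not only the final answer token.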