Causally Robust Reward Learning from Reason-Augmented Preference Feedback
Paper • Mar 4, 2026 • arXiv • Minjune Hwang, Yigit Korkmaz, Daniel Seita, Erdem Bıyık
Reward learning from human preferences often suffers from spurious correlations, leading agents to develop brittle and misaligned behaviors. The authors present a framework that integrates causal i...