Quick answer

AI Summary: Quantifies the 'reward hacking' phenomenon in RLHF, providing mathematical scaling laws to predict when a model will start exploiting flaws in its reward signal rather than improving its true performance.

Paper2022-10-19•Source ↗•22 attns299 checkouts

Claim

Scaling Laws for Reward Model Overoptimization

Authors

Discuss with Grok

Leo Gao·

John Schulman·

Jacob Hilton

ABSTRACT

When optimizing a policy against a learned reward model, the policy eventually exploits errors in the reward model, leading to a decline in the true underlying objective. This phenomenon, known as reward hacking or overoptimization (often framed as Goodhart's Law), is a major challenge for Reinforcement Learning from Human Feedback (RLHF). We study this empirically by optimizing a policy against a proxy reward model and measuring the true reward using an objective gold-standard model. We establish empirical scaling laws that predict the onset of overoptimization, demonstrating that the 'alignment tax' scales predictably based on the size of the reward model dataset and the amount of KL-divergence optimized.

#ai-alignment #cs-lg company:openai-research #rlhf #scaling-laws

Review Snapshot

Explore ratings

4.6

★★★★★

5 ratings

5 star

60%

4 star

40%

3 star

2 star

1 star

Recommendation

100%

recommend this content.

Review this content

Share your opinion to help other learners triage faster.

Write a review

Invite a reviewer

Invite someone by email to share an invited review for Scaling Laws for Reward Model Overoptimization.

Author Inquiries

Public questions about this content. Attendemia will route your question to the author. Vote on the most important ones. No guarantee of response.

Post an inquiry

Sort by: Most helpful