AI Summary: Demonstrates that Process-Supervised Reward Models (PRMs), which evaluate individual reasoning steps rather than just the final answer, dramatically improve an LLM's ability to solve complex mathematical problems.
Large language models often struggle with multi-step logical reasoning, frequently hallucinating incorrect steps that invalidate the final answer. To improve reasoning capabilities, we compare two reward modeling methods: outcome-supervised reward models (ORMs), which only evaluate the final answer, and process-supervised reward models (PRMs), which provide feedback on each individual step in the chain of thought. We demonstrate that process supervision significantly outperforms outcome supervision on the challenging MATH dataset. PRMs encourage models to follow human-aligned, verifiable logic, drastically reducing logical hallucinations. To catalyze research in this area, we release PRM800K, a dataset of 800,000 human step-level judgments.
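The practical difference between the two supervision styles shows up at inference time: a PRM scores every step of a candidate solution, so a best-of-N reranker can penalize a chain of thought with even one dubious step. The sketch below illustrates this idea under stated assumptions — the per-step probabilities, the product aggregation, and the candidate names are all hypothetical, not taken from the paper.

```python
import math

def prm_solution_score(step_probs):
    """Aggregate per-step correctness probabilities into one solution score.

    Multiplying step probabilities (an assumption here) means a single
    low-confidence step drags down the whole solution, unlike an ORM,
    which would only see the final answer.
    """
    return math.prod(step_probs)

def best_of_n(candidates):
    """Return the candidate whose steps the PRM rates highest overall.

    `candidates` maps a solution label to a list of hypothetical
    per-step correctness probabilities from a PRM.
    """
    return max(candidates, key=lambda name: prm_solution_score(candidates[name]))

# Hypothetical PRM outputs for two candidate solutions:
candidates = {
    "solution A": [0.90, 0.95, 0.90],  # every step judged likely correct
    "solution B": [0.99, 0.40, 0.99],  # one dubious middle step
}
print(best_of_n(candidates))  # prints "solution A"
```

Even though solution B's first and last steps score higher, its weak middle step gives it a lower product (about 0.39 vs. 0.77), so the reranker prefers solution A — the step-level signal that outcome supervision cannot provide.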