AI Summary: Demonstrates that Process-Supervised Reward Models (PRMs), which evaluate individual reasoning steps rather than just the final answer, dramatically improve an LLM's ability to solve complex mathematical problems.
Large language models often struggle with multi-step logical reasoning, frequently hallucinating incorrect steps that invalidate the final answer. To improve reasoning capabilities, we compare two reward modeling methods: outcome-supervised reward models (ORMs), which only evaluate the final answer, and process-supervised reward models (PRMs), which provide feedback on each individual step in the chain of thought. We demonstrate that process supervision significantly outperforms outcome supervision on the challenging MATH dataset. PRMs encourage models to follow human-aligned, verifiable logic, drastically reducing logical hallucinations. To catalyze research in this area, we release PRM800K, a dataset of 800,000 human step-level judgments.
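The practical difference between the two supervision styles shows up at inference time: a PRM scores every step of a candidate solution, so a best-of-N reranker can penalize a chain of thought with even one dubious step. The sketch below illustrates this idea under stated assumptions — the per-step probabilities, the product aggregation, and the candidate names are all hypothetical, not taken from the paper.

```python
import math

def prm_solution_score(step_probs):
    """Aggregate per-step correctness probabilities into one solution score.

    Multiplying step probabilities (an assumption here) means a single
    low-confidence step drags down the whole solution, unlike an ORM,
    which would only see the final answer.
    """
    return math.prod(step_probs)

def best_of_n(candidates):
    """Return the candidate whose steps the PRM rates highest overall.

    `candidates` maps a solution label to a list of hypothetical
    per-step correctness probabilities from a PRM.
    """
    return max(candidates, key=lambda name: prm_solution_score(candidates[name]))

# Hypothetical PRM outputs for two candidate solutions:
candidates = {
    "solution A": [0.90, 0.95, 0.90],  # every step judged likely correct
    "solution B": [0.99, 0.40, 0.99],  # one dubious middle step
}
print(best_of_n(candidates))  # prints "solution A"
```

Even though solution B's first and last steps score higher, its weak middle step gives it a lower product (about 0.39 vs. 0.77), so the reranker prefers solution A — the step-level signal that outcome supervision cannot provide.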