Topic: RLHF


Short answer

This page shows the most relevant public items for RLHF, ranked by trend activity and review signal. Use weekly for fast changes, monthly for more stable patterns, and all-time for evergreen picks.



  1. Scaling Laws for Reward Model Overoptimization

    Paper · Oct 19, 2022 · arXiv · Leo Gao, John Schulman, Jacob Hilton

    When optimizing a policy against a learned reward model, the policy eventually exploits errors in the reward model, leading to a decline in the true underlying objective. This phenomenon, known as ...

  2. Deep reinforcement learning from human preferences

    Paper · Jun 12, 2017 · arXiv · Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei

    For many complex real-world tasks, defining a mathematical reward function is difficult, leading to misaligned AI behavior when optimized. We explore a method for solving reinforcement learning tas...

  3. WebGPT: Browser-assisted question-answering with human feedback

    Paper · Dec 16, 2021 · arXiv · Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, John Schulman

    We introduce a method for fine-tuning language models to interact with a text-based web browser to answer open-ended questions. This model, WebGPT, searches the web, navigates through links, and sy...

  4. Learning to summarize from human feedback

    Paper · Sep 2, 2020 · arXiv · Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano

    We show that it is possible to significantly improve the quality of text summaries generated by large language models by training them with reinforcement learning from human feedback. We collect a ...

  5. Improving alignment of dialogue agents via targeted human judgements

    Paper · Sep 22, 2022 · arXiv · Amelia Glaese, Nat McAleese, Maja Trebacz, John Aslanides, Vlad Firoiu, Geoffrey Irving

    We present Sparrow, an information-seeking dialogue agent trained to be more helpful, correct, and harmless compared to prompted language model baselines. We train our model using reinforcement lea...

  6. Training language models to follow instructions with human feedback

    Paper · Mar 4, 2022 · arXiv · Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe

    Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not he...
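A minimal sketch of the pairwise objective used to train the reward model in the papers above on learning from human preferences and learning to summarize from human feedback: labelers compare two responses, and the reward model is fit so that the preferred response scores higher under a Bradley-Terry model. This is an illustrative standalone function, not code from any of the papers.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of the human preference under a
    Bradley-Terry model: P(chosen > rejected) = sigmoid(r_chosen - r_rejected).
    Minimizing this pushes the reward model to score preferred
    responses higher than dispreferred ones."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the reward margin between the preferred and
# dispreferred response grows.
print(preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0))  # True
```

With a zero margin the model is indifferent, so the loss equals log 2; in practice the scalar rewards come from a neural network head and the loss is averaged over a dataset of human comparisons.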
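The overoptimization paper at the top of the list motivates a common mitigation used in InstructGPT-style RL fine-tuning: penalize the policy for drifting too far from the supervised reference model by subtracting a KL term from the learned reward. The sketch below shows only the reward shaping, under assumed per-sample log-probabilities; the coefficient name `beta` is a placeholder, tuned per run in practice.

```python
def kl_shaped_reward(rm_score: float,
                     logp_policy: float,
                     logp_reference: float,
                     beta: float = 0.1) -> float:
    """Effective reward for RL fine-tuning: the reward-model score minus
    a KL-style penalty, beta * (log pi(y|x) - log pi_ref(y|x)), that
    discourages the policy from exploiting reward-model errors by
    straying far from the reference model."""
    return rm_score - beta * (logp_policy - logp_reference)

# When the policy assigns much higher log-probability to its sample than
# the reference model does, the penalty reduces the effective reward.
print(kl_shaped_reward(1.0, -1.0, -3.0))  # 1.0 - 0.1 * 2.0 = 0.8
```

When the policy matches the reference model on a sample, the penalty vanishes and the reward-model score passes through unchanged.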

Top Entities In This Topic

Related Topics

FAQ

What does this RLHF page rank?

It ranks public content for RLHF using recent discussion, review, and engagement signals so you can triage faster. This guidance is specific to the RLHF topic page on Attendemia and is written to make sense without reading other sections of the page.

How should I use weekly vs monthly vs all-time?

Use weekly for fast-moving updates, monthly for stable trend confirmation, and all-time for evergreen references.

How can I discover organizations active in RLHF?

Use the linked entities section to jump to labs, companies, and experts connected to this topic and explore their timelines.

Can I follow this topic for updates?

Yes. Use the follow button on this page to subscribe and track new high-signal activity.