Topic: AI Alignment

Short answer

This page shows the most relevant public items for AI Alignment, ranked by trend activity and review signal. Use weekly for fast changes, monthly for more stable patterns, and all-time for evergreen picks.



  1. Scaling Laws for Reward Model Overoptimization

    Paper · Oct 19, 2022 · arXiv · Leo Gao, John Schulman, Jacob Hilton

    When optimizing a policy against a learned reward model, the policy eventually exploits errors in the reward model, leading to a decline in the true underlying objective. This phenomenon, known as ...
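    The paper fits the true ("gold") reward to simple functional forms in d, the square root of the KL divergence between the optimized policy and the initial policy. A minimal sketch of those forms; the coefficient values below are made up for illustration, not fitted values from the paper:

    ```python
    # Gold-reward scaling forms from Gao et al. (2022); alpha and beta
    # here are illustrative placeholders, not the paper's fitted values.
    import numpy as np

    def gold_bon(d, alpha=1.0, beta=0.05):
        # Best-of-n sampling: R(d) = d * (alpha - beta * d)
        return d * (alpha - beta * d)

    def gold_rl(d, alpha=1.0, beta=0.35):
        # RL optimization: R(d) = d * (alpha - beta * log d)
        return d * (alpha - beta * np.log(d))

    # d = sqrt(KL(pi || pi_init)); the gold reward rises, peaks, then
    # declines even as the proxy (learned) reward keeps increasing.
    d = np.linspace(0.1, 25, 200)
    print("BoN gold reward peaks near d =", round(d[np.argmax(gold_bon(d))], 2))
    print("RL  gold reward peaks near d =", round(d[np.argmax(gold_rl(d))], 2))
    ```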

  2. Let's Verify Step by Step

    Paper · May 31, 2023 · arXiv · Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe

    Large language models often struggle with multi-step logical reasoning, frequently hallucinating incorrect steps that invalidate the final answer. To improve reasoning capabilities, we compare two ...
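    A minimal sketch of how a process reward model (PRM) scores a multi-step solution, following the paper's convention of scoring a solution as the product of per-step correctness probabilities; `step_correct_prob` is a hypothetical stand-in for a trained PRM:

    ```python
    # Process-supervised scoring in the spirit of Lightman et al. (2023):
    # the PRM assigns each reasoning step a correctness probability, and
    # a solution's score is the product of those probabilities.
    import math
    from typing import Callable, List

    def prm_score(steps: List[str],
                  step_correct_prob: Callable[[str], float]) -> float:
        """Probability that every step is correct (product of per-step probs)."""
        return math.prod(step_correct_prob(step) for step in steps)

    def best_of_n(candidates: List[List[str]],
                  step_correct_prob: Callable[[str], float]) -> List[str]:
        """Pick the candidate solution the PRM scores highest."""
        return max(candidates, key=lambda s: prm_score(s, step_correct_prob))
    ```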

  3. Deep reinforcement learning from human preferences

    Paper · Jun 12, 2017 · arXiv · Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei

    For many complex real-world tasks, defining a mathematical reward function is difficult, and optimizing a misspecified one leads to misaligned AI behavior. We explore a method for solving reinforcement learning tas...
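    The core of the method is a reward model fit to pairwise human preferences over trajectory segments via a Bradley-Terry model. A minimal PyTorch sketch of that objective; the tensor names are illustrative:

    ```python
    # Preference-based reward learning as in Christiano et al. (2017):
    # humans compare pairs of trajectory segments, and the reward model
    # is trained so that P(A preferred over B) follows a Bradley-Terry
    # model over summed per-step rewards.
    import torch
    import torch.nn.functional as F

    def preference_loss(r_a: torch.Tensor, r_b: torch.Tensor,
                        human_prefers_a: torch.Tensor) -> torch.Tensor:
        """r_a, r_b: predicted per-step rewards, shape (batch, T).
        human_prefers_a: float 0/1 labels, shape (batch,).
        P(A > B) = sigmoid(sum r_A - sum r_B)."""
        logits = r_a.sum(dim=1) - r_b.sum(dim=1)
        return F.binary_cross_entropy_with_logits(logits, human_prefers_a)
    ```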

  4. Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

    Paper · Dec 14, 2023 · arXiv · Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu

    As AI models become increasingly capable, we will eventually face the challenge of superalignment: how can humans supervise AI systems that are much smarter than they are? To study this empirically tod...
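    A sketch of the paper's basic recipe: fine-tune a strong model on labels produced by a weak supervisor, with the auxiliary confidence loss that mixes the weak label with the strong model's own hardened prediction. The mixing weight `alpha` and tensor names are illustrative:

    ```python
    # Weak-to-strong training from Burns et al. (2023): a strong model
    # learns from a weak supervisor's labels; the auxiliary confidence
    # loss blends the weak label with the strong model's own hardened
    # (argmax) prediction. `alpha` here is an illustrative choice.
    import torch
    import torch.nn.functional as F

    def weak_to_strong_loss(strong_logits: torch.Tensor,
                            weak_probs: torch.Tensor,
                            alpha: float = 0.5) -> torch.Tensor:
        """strong_logits: (batch, classes); weak_probs: (batch, classes)."""
        num_classes = strong_logits.size(-1)
        hardened = F.one_hot(strong_logits.argmax(dim=-1),
                             num_classes).float()  # strong self-label
        target = (1 - alpha) * weak_probs + alpha * hardened
        return F.cross_entropy(strong_logits, target.detach())
    ```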

  5. Improving alignment of dialogue agents via targeted human judgements

    Paper · Sep 22, 2022 · arXiv · Amelia Glaese, Nat McAleese, Maja Trebacz, John Aslanides, Vlad Firoiu, Geoffrey Irving

    We present Sparrow, an information-seeking dialogue agent trained to be more helpful, correct, and harmless compared to prompted language model baselines. We train our model using reinforcement lea...
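    Sparrow collects both preference judgements and targeted rule judgements (e.g. "do not give medical advice") and trains separate reward models for each. A loose sketch of combining the two signals into a single RL reward; the linear combination and `penalty` weight are assumptions for illustration, not the paper's exact scheme:

    ```python
    # In the spirit of Sparrow (Glaese et al., 2022): shape the RL reward
    # from a preference reward model and a rule reward model estimating
    # the probability that a response violates a rule. The combination
    # below is an illustrative assumption, not the paper's exact method.
    def shaped_reward(preference_score: float,
                      rule_violation_prob: float,
                      penalty: float = 2.0) -> float:
        return preference_score - penalty * rule_violation_prob
    ```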

Related Topics

company:openai-research (4) · cs.LG (4) · RLHF (3) · Process Reward Models (1) · Superalignment (1)