Topic: AI Alignment


Short answer

This page shows the most relevant public items for AI Alignment, ranked by trend activity and review signal. Use weekly for fast changes, monthly for more stable patterns, and all-time for evergreen picks.

Filters: Weekly · Monthly · All time
Current month · Last month · 2 months ago


  1. Agentic Alignment: Inverse Reinforcement Learning from Swarm Behavior

Paper · Dec 22, 2025 · arXiv · Percy Liang, Thomas K. V., Eleanor Rigby

    Aligning multi-agent systems via traditional human feedback is intractable due to the sheer volume and speed of agent-to-agent interactions. We introduce a novel alignment framework utilizing Inver...

  2. Scaling Laws for Reward Model Overoptimization

Paper · Oct 19, 2022 · arXiv · Leo Gao, John Schulman, Jacob Hilton

    When optimizing a policy against a learned reward model, the policy eventually exploits errors in the reward model, leading to a decline in the true underlying objective. This phenomenon, known as ...

  3. Let's Verify Step by Step

Paper · May 31, 2023 · arXiv · Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe

    Large language models often struggle with multi-step logical reasoning, frequently hallucinating incorrect steps that invalidate the final answer. To improve reasoning capabilities, we compare two ...

  4. Deep reinforcement learning from human preferences

Paper · Jun 12, 2017 · arXiv · Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei

    For many complex real-world tasks, defining a mathematical reward function is difficult, leading to misaligned AI behavior when optimized. We explore a method for solving reinforcement learning tas...

  5. Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

Paper · Dec 14, 2023 · arXiv · Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu

    As AI models become increasingly capable, we will eventually face the challenge of superalignment: how can humans supervise AI systems that are much smarter than them? To study this empirically tod...

  6. Improving alignment of dialogue agents via targeted human judgements

Paper · Sep 22, 2022 · arXiv · Amelia Glaese, Nat McAleese, Maja Trebacz, John Aslanides, Vlad Firoiu, Geoffrey Irving

    We present Sparrow, an information-seeking dialogue agent trained to be more helpful, correct, and harmless compared to prompted language model baselines. We train our model using reinforcement lea...

Top Entities In This Topic

Related Topics

FAQ

What does this AI Alignment page rank?

It ranks public content for AI Alignment using recent discussion, review, and engagement signals so you can triage faster. This guidance is specific to the AI Alignment topic page on Attendemia and is written to make sense without reading other sections of the page.

How should I use weekly vs monthly vs all-time?

Use weekly for fast-moving updates, monthly for stable trend confirmation, and all-time for evergreen references.

How can I discover organizations active in AI Alignment?

Use the linked entities section to jump to labs, companies, and experts connected to this topic and explore their timelines.

Can I follow this topic for updates?

Yes. Use the follow button on this page to subscribe and track new high-signal activity.