Topic: AI Alignment


Short answer

This page shows the most relevant public items for AI Alignment, ranked by trend activity and review signal. Use weekly for fast changes, monthly for more stable patterns, and all-time for evergreen picks.

Filters: Weekly · Monthly · All time
Current month · Last month · 2 months ago


  1. Agentic Alignment: Inverse Reinforcement Learning from Swarm Behavior

Paper · Dec 22, 2025 · arXiv · Percy Liang, Thomas K. V., Eleanor Rigby

    Aligning multi-agent systems via traditional human feedback is intractable due to the sheer volume and speed of agent-to-agent interactions. We introduce a novel alignment framework utilizing Inver...

  2. Scaling Laws for Reward Model Overoptimization

Paper · Oct 19, 2022 · arXiv · Leo Gao, John Schulman, Jacob Hilton

    When optimizing a policy against a learned reward model, the policy eventually exploits errors in the reward model, leading to a decline in the true underlying objective. This phenomenon, known as ...

  3. Let's Verify Step by Step

Paper · May 31, 2023 · arXiv · Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe

    Large language models often struggle with multi-step logical reasoning, frequently hallucinating incorrect steps that invalidate the final answer. To improve reasoning capabilities, we compare two ...

  4. Deep reinforcement learning from human preferences

Paper · Jun 12, 2017 · arXiv · Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei

    For many complex real-world tasks, defining a mathematical reward function is difficult, leading to misaligned AI behavior when optimized. We explore a method for solving reinforcement learning tas...

  5. Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

Paper · Dec 14, 2023 · arXiv · Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu

    As AI models become increasingly capable, we will eventually face the challenge of superalignment: how can humans supervise AI systems that are much smarter than them? To study this empirically tod...

  6. Improving alignment of dialogue agents via targeted human judgements

Paper · Sep 22, 2022 · arXiv · Amelia Glaese, Nat McAleese, Maja Trebacz, John Aslanides, Vlad Firoiu, Geoffrey Irving

    We present Sparrow, an information-seeking dialogue agent trained to be more helpful, correct, and harmless compared to prompted language model baselines. We train our model using reinforcement lea...

Top Entities In This Topic

Related Topics

FAQ

What does this AI Alignment page rank?

It ranks public content for AI Alignment using recent discussion, review, and engagement signals so you can triage faster. This guidance is specific to the AI Alignment topic page on Attendemia and is written to make sense without reading other sections of the page.

How should I use weekly vs monthly vs all-time?

Use weekly for fast-moving updates, monthly for stable trend confirmation, and all-time for evergreen references.

How can I discover organizations active in AI Alignment?

Use the linked entities section to jump to labs, companies, and experts connected to this topic and explore their timelines.

Can I follow this topic for updates?

Yes. Use the follow button on this page to subscribe and track new high-signal activity.