Best Reinforcement Learning Papers

The highest-signal papers on Reinforcement Learning, ranked by community reviews and momentum.
Canonical intent: topic=reinforcement-learning|type=paper|year=evergreen

Top Picks

Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments

Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, Igor Mordatch

Jun 7, 2017·495 checkouts·arxiv.org

Deep reinforcement learning from human preferences

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei

Jun 12, 2017·436 checkouts·arxiv.org

Causally Robust Reward Learning from Reason-Augmented Preference Feedback

Minjune Hwang, Yigit Korkmaz, Daniel Seita, Erdem Bıyık

Mar 4, 2026·423 checkouts·arxiv.org

Human-level performance in 3D multiplayer games with population-based reinforcement learning

Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, Thore Graepel

May 31, 2019·409 checkouts·doi.org

Minecraft as a Turing Test: Evaluating Open-Ended Agentic AI

Kevin Zhu, Lara Croft, Julian Bao

Jul 15, 2025·396 checkouts·arxiv.org

Discovering faster matrix multiplication algorithms with reinforcement learning

Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Pushmeet Kohli

Oct 5, 2022·389 checkouts·doi.org

Emergent Tool Use From Multi-Agent Autocurricula

Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, Igor Mordatch

Sep 17, 2019·378 checkouts·arxiv.org

KARL: Knowledge Agents via Reinforcement Learning

Jonathan D. Chang, Andrew Drozdov, Shubham Toshniwal, Owen Oertell, Alexander Trott, Jacob Portes, Abhay Gupta, Pallavi Koppol, Ashutosh Baheti, Sean Kulinski, Ivan Zhou, Irene Dea, Krista Opsahl-Ong, Simon Favreau-Lessard, Sean Owen, Jose Javier Gonzalez Ortiz, Arnav Singhvi, Xabi Andrade, Cindy Wang, Kartik Sreenivasan, Sam Havens, Jialu Liu, Peyton DeNiro, Wen Sun, Michael Bendersky, Jonathan Frankle

Mar 5, 2026·363 checkouts·arxiv.org

Human-level control through deep reinforcement learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Demis Hassabis

Feb 26, 2015·355 checkouts·doi.org

KLong: Training LLM Agents for Extremely Long-horizon Tasks

Yue Liu, Zhiyuan Hu, Flood Sung

Feb 19, 2026·352 checkouts·arxiv.org

IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, Koray Kavukcuoglu

Feb 9, 2018·344 checkouts·arxiv.org

Generative Agents for the Continuous Evolution of Target-Binding Proteins

Samuel H. A. von der Dunk, Liliana M. Dávalos, Ard A. Louis

Feb 25, 2026·340 checkouts·doi.org

A Generalist Agent

Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Nando de Freitas

May 12, 2022·325 checkouts·arxiv.org

Agentic Alignment: Inverse Reinforcement Learning from Swarm Behavior

Percy Liang, Thomas K. V., Eleanor Rigby

Dec 22, 2025·317 checkouts·arxiv.org

Agent57: Outperforming the Atari Human Benchmark

Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Charles Blundell

Mar 31, 2020·316 checkouts·arxiv.org

Concrete Problems in AI Safety

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané

Jun 21, 2016·267 checkouts·arxiv.org

Video PreTraining (VPT): Learning to Act by Watching Unlabeled Video

Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, Jeff Clune

Jun 23, 2022·262 checkouts·arxiv.org

Minimax M2.5: Scaling RL for Industrial-Grade Agentic AI

MiniMax Research Team

Feb 16, 2026·261 checkouts·arxiv.org

Magnetic control of tokamak plasmas through deep reinforcement learning

Jonas Degrave, Federico Felici, Jonas Buchli, Martin Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Jung, Abbas Abdolmaleki, Demis Hassabis, Martin Riedmiller

Feb 16, 2022·256 checkouts·doi.org

Mastering Atari, Go, chess and shogi by planning with a learned model

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, David Silver

Dec 23, 2020·235 checkouts·doi.org

Mastering the game of Go without human knowledge

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, Demis Hassabis

Oct 18, 2017·227 checkouts·doi.org

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov

Jul 20, 2017·194 checkouts·arxiv.org

MASPO: Robust and Sample-Efficient LLM Reasoning via Unified Policy Optimization

Xiaoliang Fu, Jiaye Lin, Yangyi Fang

Feb 19, 2026·166 checkouts·arxiv.org

Mastering the game of Go with deep neural networks and tree search

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Demis Hassabis

Jan 27, 2016·163 checkouts·doi.org

Asynchronous Methods for Deep Reinforcement Learning

Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu

Feb 4, 2016·150 checkouts·arxiv.org

Dota 2 with Large Scale Deep Reinforcement Learning

Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Ilya Sutskever, et al.

Dec 13, 2019·149 checkouts·arxiv.org

Hindsight Experience Replay

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, Wojciech Zaremba

Jul 5, 2017·140 checkouts·arxiv.org

Trust Region Policy Optimization

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, Philipp Moritz

Feb 19, 2015·120 checkouts·arxiv.org

SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

Hao Li, Richard Peng, Sanjit Singh, Gregory D. Lyng

Feb 8, 2026·0 checkouts·arxiv.org

FAQ

How is this "Best Reinforcement Learning Papers" collection ranked?

This page ranks reinforcement learning papers by topic relevance, checkout momentum, source diversity, and freshness. Rankings are recalculated as new items and engagement data arrive, so the list favors resources that are both high quality and currently useful for implementation, research, and practical decision-making. Canonical intent key: topic=reinforcement-learning|type=paper|year=evergreen.
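As a rough illustration of how four such signals could be combined, here is a minimal sketch using a weighted sum. The page does not publish its formula, so the `Item` fields, the 0-to-1 scaling, and the `WEIGHTS` values are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Item:
    title: str
    # All four signals are assumed to be pre-normalized to the range 0..1.
    relevance: float
    momentum: float
    diversity: float
    freshness: float


# Hypothetical weights; the real ranking formula is not stated on the page.
WEIGHTS = {"relevance": 0.4, "momentum": 0.3, "diversity": 0.1, "freshness": 0.2}


def score(item: Item) -> float:
    """Weighted sum of the four signals named in the FAQ."""
    return sum(weight * getattr(item, name) for name, weight in WEIGHTS.items())


def rank(items: list[Item]) -> list[Item]:
    """Return items ordered from highest to lowest score."""
    return sorted(items, key=score, reverse=True)
```

Under these assumed weights, an older paper with strong relevance and momentum can still outrank a newer paper whose only strong signal is freshness.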

How do you prevent duplicate collection pages?

Attendemia maps each slug variant, including best-of and year forms, to one canonical intent key. If two URLs describe the same topic, type, and timeframe, the non-canonical versions permanently redirect to the canonical one. This consolidates crawl signals, avoids duplicate-content dilution, and helps search engines index the strongest single page.
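The mapping from slug variants to a single key might look like the sketch below. Only the key format (`topic=...|type=...|year=...`) comes from this page; the slug shapes, the `best`/`top` prefixes, and the function name `canonical_intent_key` are hypothetical.

```python
import re

# Hypothetical ranking prefixes that do not change the intent of a slug.
RANKING_WORDS = {"best", "top"}


def canonical_intent_key(slug: str) -> str:
    """Collapse a slug variant into one key of the form shown on this page."""
    year = "evergreen"
    remaining = []
    for token in slug.lower().strip("/").split("-"):
        if re.fullmatch(r"(19|20)\d{2}", token):
            year = token  # a four-digit year marks a year-specific page
        elif token not in RANKING_WORDS:
            remaining.append(token)
    if not remaining:
        raise ValueError(f"slug has no topic or type: {slug!r}")
    # Assume the last token names the item type, possibly pluralized.
    last = remaining[-1]
    item_type = last[:-1] if last.endswith("s") else last
    topic = "-".join(remaining[:-1])
    return f"topic={topic}|type={item_type}|year={year}"
```

Two slugs that produce the same key would be treated as the same page, with one of them chosen as the redirect target.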

When does a year page stay separate from evergreen?

A year-specific page stays separate only when its item set is materially different from evergreen and has enough ranking depth. When overlap is high, the year URL redirects to the evergreen canonical page. This avoids thin duplication while preserving genuinely distinct annual collections for search users.
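The rule above can be sketched as a simple overlap check. The thresholds `max_overlap` and `min_depth` are assumptions for illustration; the page does not say what counts as "materially different" or "enough ranking depth".

```python
def should_redirect_year_page(
    year_items: list[str],
    evergreen_items: list[str],
    max_overlap: float = 0.8,  # hypothetical cutoff for "overlap is high"
    min_depth: int = 10,       # hypothetical minimum number of ranked items
) -> bool:
    """Return True if the year URL should redirect to the evergreen page."""
    year_set = set(year_items)
    evergreen_set = set(evergreen_items)
    # Too few items: the year page lacks ranking depth, so fold it in.
    if not year_set or len(year_set) < min_depth:
        return True
    # Fraction of the year page's items that also appear on the evergreen page.
    overlap = len(year_set & evergreen_set) / len(year_set)
    return overlap >= max_overlap
```

Measuring overlap against the year page's own item count, rather than the union of both pages, keeps the decision focused on whether the year page offers anything the evergreen page does not.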

Are these paid recommendations?

No. These recommendations are not paid placements. Attendemia ranks items from public metadata, source quality coverage, and user engagement signals, then orders them by practical usefulness. Sponsorship does not buy rank position, so this page should be interpreted as editorial curation rather than advertising inventory.