Topic: Awesome List: ai-agent-papers-2026

Short answer

This page shows the most relevant public items for Awesome List: ai-agent-papers-2026, ranked by trend activity and review signal. Use weekly for fast changes, monthly for more stable patterns, and all-time for evergreen picks.

Views: Weekly · Monthly · All time

  1. Toward Architecture-Aware Evaluation Metrics for LLM Agents

    Paper · Jan 27, 2026 · arxiv.org · Débora Souza, Patrícia Machado

    LLM-based agents are becoming central to software engineering tasks, yet evaluating them remains fragmented and largely model-centric. Existing studies overlook how architectural components, such a...

  2. DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle

    Paper · Jan 27, 2026 · arxiv.org · Yuheng Tang, Kaijie Zhu, Bonan Ruan, Chuqi Zhang, Michael Yang, Hongwei Li, Suyue Guo, Tianneng Shi, Zekun Li, Christopher Kruegel, Giovanni Vigna, Dawn Song, William Yang Wang, Lun Wang, Yangruibo Ding, Zhenkai Liang, Wenbo Guo

    Despite demonstrating extraordinary capabilities in code generation and software issue resolution, AI agents' capabilities across the full software DevOps cycle remain unknown. Different from pur...

  3. Who Writes the Docs in SE 3.0? Agent vs. Human Documentation Pull Requests

    Paper · Jan 28, 2026 · arxiv.org · Kazuma Yamasaki, Joseph Ayobami Joshua, Tasha Settewong, Mahmoud Alfadel, Kazumasa Shimari, Kenichi Matsumoto

    As software engineering moves toward SE3.0, AI agents are increasingly used to carry out development tasks and contribute changes to software projects. It is therefore important to understand the e...

  4. Interpreting Emergent Extreme Events in Multi-Agent Systems

    Paper · Jan 28, 2026 · arxiv.org · Ling Tang, Jilin Mei, Dongrui Liu, Chen Qian, Dawei Cheng, Jing Shao, Xia Hu

    Large language model-powered multi-agent systems have emerged as powerful tools for simulating complex human-like systems. The interactions within these systems often lead to extreme events whose o...

  5. Agent Benchmarks Fail Public Sector Requirements

    Paper · Jan 28, 2026 · arxiv.org · Jonathan Rystrøm, Chris Schmitz, Karolina Korgul, Jan Batzner, Chris Russell

    Deploying Large Language Model-based agents (LLM agents) in the public sector requires assuring that they meet the stringent legal, procedural, and structural requirements of public-sector institut...

  6. The Quiet Contributions: Insights into AI-Generated Silent Pull Requests

    Paper · Jan 28, 2026 · arxiv.org · S M Mahedy Hasan, Md Fazle Rabbi, Minhaz Zibran

    We present the first empirical study of AI-generated pull requests that are 'silent,' meaning no comments or discussions accompany them. This absence of any comments or discussions associated with ...

  7. JAF: Judge Agent Forest

    Paper · Jan 29, 2026 · arxiv.org · Sahil Garg, Brad Cheezum, Sridhar Dutta, Vishal Agarwal

    Judge agents are fundamental to agentic AI frameworks: they provide automated evaluation, and enable iterative self-refinement of reasoning processes. We introduce JAF: Judge Agent Forest, a framew...

  8. TriCEGAR: A Trace-Driven Abstraction Mechanism for Agentic AI

    Paper · Jan 30, 2026 · arxiv.org · Roham Koohestani, Ateş Görpelioğlu, Egor Klimov, Burcu Kulahcioglu Ozkan, Maliheh Izadi

    Agentic AI systems act through tools and evolve their behavior over long, stochastic interaction traces. This setting complicates assurance, because behavior depends on nondeterministic environment...

  9. Benchmarking Agents in Insurance Underwriting Environments

    Paper · Jan 31, 2026 · arxiv.org · Amanda Dsouza, Ramya Ramakrishnan, Charles Dickens, Bhavishya Pohani, Christopher M Glaze

    As AI agents integrate into enterprise applications, their evaluation demands benchmarks that reflect the complexity of real-world operations. Instead, existing benchmarks overemphasize open-domain...

  10. HumanStudy-Bench: Towards AI Agent Design for Participant Simulation

    Paper · Jan 31, 2026 · arxiv.org · Xuan Liu, Haoyang Shang, Zizhang Liu, Xinyan Liu, Yunze Xiao, Yiwen Tu, Haojian Jin

    Large language models (LLMs) are increasingly used as simulated participants in social science experiments, but their behavior is often unstable and highly sensitive to design choices. Prior evalua...

FAQ

What does this Awesome List: ai-agent-papers-2026 page rank?

It ranks public content for Awesome List: ai-agent-papers-2026 using recent discussion, review, and engagement signals so you can triage faster. This guidance is specific to the Awesome List: ai-agent-papers-2026 topic page on Attendemia and is written to make sense without reading other sections of the page.

How should I use weekly vs monthly vs all-time?

Use weekly for fast-moving updates, monthly for stable trend confirmation, and all-time for evergreen references.

How can I discover organizations active in Awesome List: ai-agent-papers-2026?

Use the linked entities section to jump to labs, companies, and experts connected to this topic and explore their timelines.

Can I follow this topic for updates?

Yes. Use the follow button on this page to subscribe and track new high-signal activity.