Topic: AI Engineering

Short answer

This page shows the most relevant public items for AI Engineering, ranked by trend activity and review signal. Use weekly for fast changes, monthly for more stable patterns, and all-time for evergreen picks.

JAF: Judge Agent Forest
Paper • Jan 29, 2026 • arxiv.org • Sahil Garg, Brad Cheezum, Sridhar Dutta, Vishal Agarwal
Judge agents are fundamental to agentic AI frameworks: they provide automated evaluation, and enable iterative self-refinement of reasoning processes. We introduce JAF: Judge Agent Forest, a framew...
Why Are AI Agent Involved Pull Requests (Fix-Related) Remain Unmerged? An Empirical Study
Paper • Jan 29, 2026 • arxiv.org • Khairul Alam, Saikat Mondal, Banani Roy
Autonomous coding agents (e.g., OpenAI Codex, Devin, GitHub Copilot) are increasingly used to generate fix-related pull requests (PRs) in real world software repositories. However, their practical ...
Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering
Paper • Jan 30, 2026 • arxiv.org • Yunpeng Xiong, Ting Zhang
Static Application Security Testing (SAST) tools are essential for identifying software vulnerabilities, but they often produce a high volume of false positives (FPs), imposing a substantial manual...
TriCEGAR: A Trace-Driven Abstraction Mechanism for Agentic AI
Paper • Jan 30, 2026 • arxiv.org • Roham Koohestani, Ateş Görpelioğlu, Egor Klimov, Burcu Kulahcioglu Ozkan, Maliheh Izadi
Agentic AI systems act through tools and evolve their behavior over long, stochastic interaction traces. This setting complicates assurance, because behavior depends on nondeterministic environment...
Benchmarking Agents in Insurance Underwriting Environments
Paper • Jan 31, 2026 • arxiv.org • Amanda Dsouza, Ramya Ramakrishnan, Charles Dickens, Bhavishya Pohani, Christopher M Glaze
As AI agents integrate into enterprise applications, their evaluation demands benchmarks that reflect the complexity of real-world operations. Instead, existing benchmarks overemphasize open-domain...
HumanStudy-Bench: Towards AI Agent Design for Participant Simulation
Paper • Jan 31, 2026 • arxiv.org • Xuan Liu, Haoyang Shang, Zizhang Liu, Xinyan Liu, Yunze Xiao, Yiwen Tu, Haojian Jin
Large language models (LLMs) are increasingly used as simulated participants in social science experiments, but their behavior is often unstable and highly sensitive to design choices. Prior evalua...
ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support
Paper • Feb 2, 2026 • arxiv.org • Tiantian Chen, Jiaqi Lu, Ying Shen, Lin Zhang
Large Language Models (LLMs) have shown strong potential as conversational agents. Yet, their effectiveness remains limited by deficiencies in robust long-term memory, particularly in complex, long...
PieArena: Frontier Language Agents Achieve MBA-Level Negotiation Performance and Reveal Novel Behavioral Differences
Paper • Feb 5, 2026 • arxiv.org • Chris Zhu, Sasha Cui, Will Sanok Dufallo, Runzhi Jin, Zhen Xu, Linjun Zhang, Daylian Cain
We present an in-depth evaluation of LLMs' ability to negotiate, a central business task that requires strategic reasoning, theory of mind, and economic value creation. To do so, we introduce PieAr...
Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations
Paper • Feb 5, 2026 • arxiv.org • Shahin Honarvar, Amber Gorzynski, James Lee-Jones, Harry Coppock, Marek Rei, Joseph Ryan, Alastair F. Donaldson
Agentic large language models (LLMs) are increasingly evaluated on cybersecurity tasks using capture-the-flag (CTF) benchmarks. However, existing pointwise benchmarks have limited ability to shed l...
Emulating Aggregate Human Choice Behavior and Biases with GPT Conversational Agents
Paper • Feb 5, 2026 • arxiv.org • Stephen Pilli, Vivek Nallur
Cognitive biases often shape human decisions. While large language models (LLMs) have been shown to reproduce well-known biases, a more critical question is whether LLMs can predict biases at the i...
TrajAD: Trajectory Anomaly Detection for Trustworthy LLM Agents
Paper • Feb 6, 2026 • arxiv.org • Yibing Liu, Chong Zhang, Zhongyi Han, Hansong Liu, Yong Wang, Yang Yu, Xiaoyan Wang, Yilong Yin
We address the problem of runtime trajectory anomaly detection, a critical capability for enabling trustworthy LLM agents. Current safety measures predominantly focus on static input/output filteri...
Completing Missing Annotation: Multi-Agent Debate for Accurate and Scalable Relevant Assessment for IR Benchmarks
Paper • Feb 6, 2026 • arxiv.org • Minjeong Ban, Jeonghwan Choi, Hyangsuk Min, Nicole Hee-Yeon Kim, Minseok Kim, Jae-Gil Lee, Hwanjun Song
Information retrieval (IR) evaluation remains challenging due to incomplete IR benchmark datasets that contain unlabeled relevant chunks. While LLMs and LLM-human hybrid strategies reduce costly hu...
JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks
Paper • Feb 6, 2026 • arxiv.org • Lanbo Lin, Jiayao Liu, Tianyuan Yang, Li Cai, Yuanwu Xu, Lei Wei, Sicong Xie, Guannan Zhang
Evaluating agentic AI on open-ended professional tasks faces a fundamental dilemma between rigor and flexibility. Static rubrics provide rigorous, reproducible assessment but fail to accommodate di...
AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents
Paper • Feb 6, 2026 • arxiv.org • Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster, Bassel Al Omari, Despoina Magka, Alberto Pepe, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, Lucia Cipolina-Kun, Jean-Christophe Gagnon-Audet, Chee Hau Leow, Sandra Lefdal, Hossam Mossalam, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, Jordi Armengol Estape, Amar Budhiraja, Gaurav Chaurasia, Abhishek Charnalia, Derek Dunfield, Karen Hambardzumyan, Daniel Izcovich, Martin Josifoski, Ishita Mediratta, Kelvin Niu, Parth Pathak, Michael Shvartsman, Edan Toledo, Anton Protopopov, Roberta Raileanu, Alexander Miller, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach
LLM agents hold significant promise for advancing scientific research. To accelerate this progress, we introduce AIRS-Bench (the AI Research Science Benchmark), a suite of 20 tasks sourced from sta...
Agentic Uncertainty Reveals Agentic Overconfidence
Paper • Feb 6, 2026 • arxiv.org • Jean Kaddour, Srijan Patel, Gbètondji Dovonon, Leo Richter, Pasquale Minervini, Matt J. Kusner
Can AI agents predict whether they will succeed at a task? We study agentic uncertainty by eliciting success probability estimates before, during, and after task execution. All results exhibit agen...
From Features to Actions: Explainability in Traditional and Agentic AI Systems
Paper • Feb 6, 2026 • arxiv.org • Sindhuja Chaduvula, Jessee Ho, Kina Kim, Aravind Narayanan, Mahshid Alinoori, Muskan Garg, Dhanesh Ramachandram, Shaina Raza
Over the last decade, explainable AI has primarily focused on interpreting individual model predictions, producing post-hoc explanations that relate inputs to outputs under a fixed decision structu...
SimpleMem: Efficient Lifelong Memory for LLM Agents
Paper • Jan 29, 2026 • arxiv.org • Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao
To support long-term interaction in complex environments, LLM agents require memory systems that manage historical experiences. Existing approaches either retain full interaction histories via pass...
HiMeS: Hippocampus-inspired Memory System for Personalized AI Assistants
Paper • Jan 6, 2026 • arxiv.org • Hailong Li, Feifei Li, Wenhui Que, Xingyu Fan
Large language models (LLMs) power many interactive systems such as chatbots, customer-service agents, and personal assistants. In knowledge-intensive scenarios requiring user-specific personalizat...
MAGMA: A Multi-Graph based Agentic Memory Architecture for AI Agents
Paper • Jan 6, 2026 • arxiv.org • Dongming Jiang, Yi Li, Guanpeng Li, Bingzhe Li
Memory-Augmented Generation (MAG) extends Large Language Models with external memory to support long-context reasoning, but existing approaches largely rely on semantic similarity over monolithic m...
Membox: Weaving Topic Continuity into Long-Range Memory for LLM Agents
Paper • Jan 20, 2026 • arxiv.org • Dehao Tao, Guoliang Ma, Yongfeng Huang, Minghu Jiang
Human-agent dialogues often exhibit topic continuity (a stable thematic frame that evolves through temporally adjacent exchanges), yet most large language model (LLM) agent memory systems fail to pres...

FAQ

What does this AI Engineering page rank?

It ranks public content for AI Engineering using recent discussion, review, and engagement signals so you can triage faster. This guidance is specific to the AI Engineering topic page on Attendemia and is written so that it still makes sense without reading other sections on the page.

How should I use weekly vs monthly vs all-time?

Use weekly for fast-moving updates, monthly for stable trend confirmation, and all-time for evergreen references.

How can I discover organizations active in AI Engineering?

Use the linked entities section to jump to labs, companies, and experts connected to this topic and explore their timelines.

Can I follow this topic for updates?

Yes. Use the follow button on this page to subscribe and track new high-signal activity.