📄 Notable* Recent AI/ML arXiv Papers


📄 Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06481v1
👥 Authors: Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Tianjun Yao, Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib, Hao Li (possible past Tsinghua University affiliation), Salman Khan (possible past Inception Institute Of Artificial Intelligence affiliation), Zhiqiang Shen
Abstract

As AI writing assistants become increasingly integrated into real-world drafting and revision workflows, many documents are no longer purely human-written or AI-generated, but instead result from progressive human-AI co-editing. However, existing AI-text detection benchmarks largely focus on final outputs and provide limited understanding of how AI authorship signals emerge, accumulate, or disappear throughout the revision process. We introduce OpAI-Bench, an operation-guided benchmark for study...

📄 MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06473v1
👥 Authors: Shangheng Du, Xiangchao Yan, Jinxin Shi, Zongsheng Cao, Shiyang Feng, Zichen Liang, Boyuan Sun, Tianshuo Peng, Yifan Zhou, Xin Li (possible past Google (United States) affiliation), Jie Zhou (possible past Tsinghua University affiliation), Liang He, Bo Zhang (possible past Tencent (China) affiliation), Lei Bai
Abstract

Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self-evolution becomes a key capability. However, existing MLE agents suffer from inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. We present MLEvolve, an LLM-based self-evolving multi-agent framework for end-to-end machine learning algorithm di...

📄 Unsupervised Skill Discovery for Agentic Data Analysis
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06416v1
👥 Authors: Zhisong Qiu, Kangqi Song, Shengwei Tang, Shuofei Qiao, Lei Liang, Huajun Chen (possible past Alibaba Group (China) affiliation), Shumin Deng (possible past Alibaba Group (China) affiliation)
Abstract

Inference-time skill augmentation provides a lightweight way to improve data-analytic agents by injecting reusable procedural knowledge without updating model parameters. However, discovering effective skills for data analysis remains challenging, as reliable supervision is expensive and success criteria vary across analytical formats. This raises the key question of how to discover reusable data-analysis skills from unlabeled exploration alone. We propose DataCOPE, an unsupervised verifier-guid...

📄 Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06388v1
👥 Authors: Jiaju Chen, Yuxuan Lu, Jiayi Su, Chaoran Chen, Songlin Xiao, Zheng Zhang, Yun Wang, Yunyao Li, Jian Zhao, Tongshuang Wu (possible past University Of Washington affiliation), Toby Jia-Jun Li, Dakuo Wang (possible past Tencent (China) affiliation), Bingsheng Yao
Abstract

Recent advances in LLM agents have enabled complex cognitive capabilities, such as multi-step reasoning, planning, and tool use, that increasingly position these agents as human collaborators. Effective collaboration, however, requires collaborators to continuously maintain and align mental models of their own reasoning, partners' intentions, and shared goals during the collaborative process. Today's agents rarely develop such capabilities since they are primarily optimized for task completion, a...

📄 OneReason Technical Report
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06260v1
👥 Authors: Onerec Team, Biao Yang, Boyang Ding, Chenglong Chu, Dunju Zang, Fei Pan, Han Li, Hao Jiang, Honghui Bao, Huanjie Wang, Jian Liang, Jiangxia Cao, Jiao Ou, Jiaxin Deng, Jinghao Zhang, Kun Gai, Lu Ren, Peiru Du, Pengfei Zheng, Rongzhou Zhang, Ruiming Tang (possible past Huawei Technologies (China) affiliation), Shiyao Wang, Siyang Mao, Siyuan Lou, Teng Shi, Wei Yuan, Wenlong Xu, Xingchen Liu, Xingmei Wang, Xinqi Jin, Yan Sun, Yan Wang (possible past Tencent (China) affiliation), Yifei Hu, Yingzhi He, Yufei Ye (possible past Carnegie Mellon University affiliation), Yuhao Wang, Yunhao Zhou, Yuqin Dai, Zhao Liu (possible past Tsinghua University affiliation), Zhipeng Wei, Zhixin Ling, Ziming Li, Zixing Zhang, Ziyuan Liu, An Zhang, Changxin Lao, Chaoyi Ma, Chengru Song, Defu Lian, Fan Yang (possible past Tencent (China) affiliation), Guowang Zhang, Hao Peng (possible past Tsinghua University affiliation), Jiayao Shen, Jie Chen (possible past Tencent (China) affiliation), Jun Xu (possible past Google (United States) affiliation), Junmin Chen, Kun Zhang (possible past Google (United States) affiliation), Kuo Cai, Mingxing Wen, Minmao Wang, Minxuan Lv, Qi Zhang (possible past Tencent (China) affiliation), Qiang Luo, Sheng Yu, Shijie Li, Shijie Yi, Shuang Yang, Shugui Liu, Shuni Chen, Tinghai Zhang, Tingting Gao, Xiang Wang (possible past Tencent (China) affiliation), Xiangyu Wu, Xiangyu Zhao, Xiao Lv, Xiaoyou Zhou, Xuming Wang, Yong Du, Zejian Zhang, Zhaojie Liu, Zhiyang Zhang, Zhuang Zhuang, Ziqi Wang, Ziyi Zhao
Abstract

Generative recommendation models in the OneRec family have been widely deployed in many real-world services, such as short-video, live-streaming, advertising, and e-commerce. However, these generative models can only benefit from the scaling advantage, while their reasoning ability is hard to activate, since we cannot construct meaningful Chain-of-Thought (CoT) sequences consisting of itemic tokens only. Inspired by the success of the reasoning-style "think before answer" paradigm in the LLM f...

📄 CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06219v1
👥 Authors: Yining Xing, Zehong Ke, Zhiyuan Liu (possible past Tsinghua University affiliation), Yanbo Jiang, Wenhao Yu, Jianqiang Wang (possible past Tsinghua University affiliation)
Abstract

End-to-end autonomous driving models often struggle to balance multi-modal maneuver generation with real-time inference constraints. While diffusion models successfully capture diverse driving behaviors, their iterative denoising process incurs unacceptable latency for safety-critical deployment. To address this, we propose CLEAR (Cognition and Latent Evaluation for Adaptive Routing), a framework that combines ultra-fast generative planning with deep semantic reasoning. CLEAR employs Drive-JEPA ...

📄 TAM: Torque Adaptation Module for Robust Motion Transfer in Manipulation
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06218v1
👥 Authors: Dongwon Son, Florian Shkurti, Jason Lee (possible past Stanford University affiliation), Naman Shah, Beomjoon Kim, Dieter Fox (possible past University Of Washington affiliation)
Abstract

A policy tuned for one robot often behaves differently on another, whether due to the sim-to-real gap, unknown payloads, or the differing dynamics of two instances of the same robot. In contact-rich, dynamic manipulation, even small motion discrepancies can result in failure to track reference motion, since they disrupt the timing and modes of contact. Common remedies, such as domain randomization or system identification, either produce overly conservative task policies or require data that mus...

📄 DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06217v1
👥 Authors: Tan Zhang, Quanyou Li, Lu Zhang (possible past Tencent (China) affiliation), Jun Liu (possible past Tencent (China) affiliation), Xiaofeng Zhu, Ping Hu (possible past IBM (United States) affiliation)
Abstract

When a disaster unfolds, responders must answer not only what is happening, but also why it is happening, what will happen next, and what to do now, often from noisy low-altitude UAV views and under tight on-site compute constraints. However, most existing multimodal benchmarks emphasize perception (e.g., recognition/description), cover limited disaster types, and provide insufficient support for the multi-stage reasoning required in practical emergency response. We introduce DisasterBench, a mu...

📄 WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06147v1
👥 Authors: Shengtao Zheng, Kai Li, Weichen Zhang, Yu Meng, Chen Gao, Xinlei Chen (possible past Tsinghua University affiliation), Yong Li (possible past Tsinghua University affiliation), Xiao-Ping Zhang
Abstract

End-to-end Vision-Language-Action (VLA) models have shown promise in UAV navigation. However, existing approaches typically rely on historical observations to directly predict actions, often struggling in dense urban environments where severe occlusions and sharp turns result in drastic viewpoint transitions. We argue that the ability to "imagine" future states -- inherent in World Models -- is critical for robust decision-making under such partial observability. To address this, we construct a ...

📄 Beyond Semantic Organization: Memory as Execution State Management for Long-Horizon Agents
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06090v1
👥 Authors: Yaoqi Chen, Haibin Lai, Yuru Feng, Chuyu Han, Qianxi Zhang, Baotong Lu, Menghao Li, Xinjiang Wang, Zhirui Wang, Shusen Xu, Zengzhong Li, Zewen Jin, Hao Wu (possible past Tencent (China) affiliation), Cheng Li (possible past Google (United States) affiliation), Qi Chen (possible past Baidu (China) affiliation)
Abstract

LLM-based agents increasingly tackle long-horizon tasks with interdependent decisions, where each action reshapes future constraints and intermediate errors can cascade. Existing RAG and agent memory systems organize histories by semantic similarity, retrieving content-relevant entries at decision time. We argue that this design mismatches execution-state dependencies: it fragments decision trajectories and mixes valid and erroneous traces, hindering coherent state reconstruction and error isola...

📄 LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06087v1
👥 Authors: Aofan Yu, Chenyu Zhou, Tianyi Xu, Zihan Guo, Rong Shan, Zhihui Fu, Jun Wang (possible past Tencent (China) affiliation), Weiwen Liu, Yong Yu (possible past Shanghai Jiao Tong University affiliation), Weinan Zhang (possible past Shanghai Jiao Tong University affiliation), Jianghao Lin
Abstract

Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills into the prompt at every step incurs substantial context overhead and exposes skill content as plaintext. We present LatentSkill, a framework that converts textual skills into plug-and-play LoRA adapters through a pretrained hypernetwork. LatentSkill stores skill knowledge in weight space rather than context space, removing per-step skill tokens while preserving modular loading, scaling, ...

📄 Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06076v1
👥 Authors: Haocheng Luo, Jiahui Liu (possible past Google (United States) affiliation), Ruicheng Zhang, Zhizhou Zhong, Jiaqi Huang, Zunnan Xu, Quan Shi, Jun Zhou, Xiu Li (possible past Tsinghua University affiliation)
Abstract

While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery and multi-step planning. To address th...

📄 Towards World Models in Biomedical Research
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.05925v1
👥 Authors: Guangyu Wang, Jingkun Yue, Siqi Zhang, Yu Liu, Xiaoyu Wang, Mingyuan Meng, Changwei Ji, Zongbo Han, Yulin Wang, Yang Yue, Frank Fu, Ting Chen (possible past Google (United States) affiliation), Song Wu, Ziwei Liu, Jiangning Song, Ming Li, Gao Huang (possible past Tsinghua University affiliation), Xiaohong Liu (possible past Shanghai Jiao Tong University affiliation), Athanasios Vasilakos, Xingcai Zhang, Ping Zhang, Yong Li (possible past Tsinghua University affiliation)
Abstract

A central goal of biomedicine is to understand, predict and ultimately control the dynamic mechanisms by which biological systems respond to perturbations, disease progression and therapeutic intervention. Although foundation models and large language models have accelerated biomedical data interpretation, most current systems remain focused on static pattern recognition rather than prospective simulation of biological futures. Here we propose biomedical world models as a paradigm for AI-driven ...

📄 LadderMan: Learning Humanoid Perceptive Ladder Climbing
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.05873v1
👥 Authors: Siheng Zhao, Yuanhang Zhang, Ziqi Lu, Pieter Abbeel (possible past University Of California, Berkeley affiliation), Rocky Duan, Koushil Sreenath, Yue Wang, C. Karen Liu (possible past Stanford University affiliation), Guanya Shi
Abstract

Humanoid robots hold great promise for operating in human-centered environments, yet ladder climbing remains one of the most challenging tasks due to sparse footholds and handholds, complex whole-body coordination, and sensitivity to perception and control errors. We present LadderMan, a unified system that enables humanoid robots to robustly climb diverse ladders and perform manipulation under such constrained conditions. Our climbing policy is built on a scalable two-stage learning pi...

📄 Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.05843v1
👥 Authors: Ruoxi Sun (possible past Google (United States) affiliation), Quantong Qiu, Juntao Li, Zecheng Tang, Yihang Lou, Min Zhang (possible past Tsinghua University affiliation)
Abstract

While Multimodal Large Language Models (MLLMs) demonstrate remarkable proficiency on complex vision-language tasks, the mechanisms by which they extract query-relevant visual features from complex, noisy contexts remain opaque. In this paper, we present an in-depth interpretability study that uncovers a profound structural property within MLLMs: functional sparsity in cross-modal retrieval. Leveraging a token-level metric termed Retrieval Attention Mass (RAM), we identify and characterize a high...

📄 When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.05806v1
👥 Authors: Dongsheng Zhu, Xuchen Ma, Yucheng Shen, Xiang Li, Yukun Zhao, Shuaiqiang Wang (possible past Baidu (China) affiliation), Lingyong Yan, Dawei Yin (possible past Baidu (China) affiliation)
Abstract

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a 2 × 2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturba...

📄 SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.05761v1
👥 Authors: Wenxuan Wang, Haoyu Sun, Fukuan Hou, Mingyang Song, Weinan Zhang (possible past Shanghai Jiao Tong University affiliation), Yu Cheng (possible past National University Of Singapore affiliation), Yang Yang (possible past Tencent (China) affiliation)
Abstract

Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term interactions. As these memories grow, they may reinforce one another, diverge across contexts, or directly conflict, making correct assistance depend on memory relations rather than isolated recall. Existing long-term memory benchmarks rarely probe how agents preserve and utilize such relations during downstream tasks. To address this gap, we introduce SubtleMemory, a benchmark for fine-gr...

📄 LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.05677v1
👥 Authors: Shiqiang Lang, Jing Liu (possible past Baidu (China) affiliation), Haoyang He, Peiwen Sun, Yuanteng Chen, Tao Liu (possible past Baidu (China) affiliation), Lan Yang, Longteng Guo, Honggang Zhang
Abstract

Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view, as models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, we introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory, covering...

📄 Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.05661v1
👥 Authors: Parth Asawa, Christopher M. Glaze, Gabriel Orlanski, Ramya Ramakrishnan, Benji Xu, Asim Biswal, Vincent Sunn Chen, Frederic Sala, Matei Zaharia (possible past University Of California, Berkeley affiliation), Joseph E. Gonzalez (possible past University Of California, Berkeley affiliation)
Abstract

Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-Bench spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-pla...

📄 Answer Presence Drives RAG Rewriting Gains
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.05633v1
👥 Authors: Yuejie Li, Yueying Hua, Ke Yang (possible past Google (United States) affiliation), Li Zhang (possible past University Of Oxford affiliation), Yueping He, Ruiqi Li (possible past Tsinghua University affiliation), Bolin Chen, Tao Wang (possible past Stanford University affiliation), Bowen Li, Chengjun Mao
Abstract

Retrieval-augmented QA pipelines often route retrieved passages through an LLM rewriter before a smaller reader, lifting F1 by tens of points on multi-hop benchmarks; this gain is typically credited to improved evidence quality. We ask whether that lift is causally driven by the gold answer string appearing in the rewritten context rather than by curation per se, using a controlled intervention audit. For each rewritten context we re-run the reader after one of four controlled edits to th...

📄 A Sliced-Wasserstein Framework on Correlation Matrices for EEG Decoding
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06104v1
👥 Authors: Chen Hu, Rui Wang (possible past Tencent (China) affiliation), Jiale Zhou, Jingjun Yi, Shaocheng Jin, Yidong Song, Yefeng Zheng (possible past Tencent (China) affiliation)
Abstract

Electroencephalography (EEG) offers noninvasive, millisecond resolution recordings of neuronal activity and is widely used in neuroscience and healthcare. Many EEG decoding pipelines rely on covariance descriptors for their robustness to noise, but such representations are sensitive to channel-wise scaling. Recent studies have therefore advocated full-rank correlation matrices as a scale-invariant alternative for EEG decoding. In this paper, we propose a general framework for Sliced Wasserstein ...

📄 DBHN-Net: Dual-Branch Hybrid Neural Network For Low-Complexity Monaural Speech Enhancement
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.05911v1
👥 Authors: Cunhang Fan, Enrui Liu, Jing Zhou, Jian Kang, Jie Li, Andong Li, Jian Zhou (possible past Tencent (China) affiliation), Zhao Lv, Xuelong Li (possible past Tencent (China) affiliation)
Abstract

Although artificial neural network (ANN) based speech enhancement (SE) methods demonstrate excellent performance, the high computational complexity and high energy consumption hinder their deployment in practical front-end processing tasks. Currently, the spiking neural networks (SNNs) have shown potential in reducing power consumption. However, the discrete binary activation and complex spatio-temporal dynamics of SNNs often result in information loss. The current challenge therefore focuses o...

📄 AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.05597v1
👥 Authors: Hao Bai, Rui Yang, Chenlu Ye, Spencer Whitehead, Aviral Kumar (possible past University Of California, Berkeley affiliation), Tong Zhang (possible past Tencent (China) affiliation)
Abstract

Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL, which addresses both. On the system side, an asynchronous design overlaps rollout, gradient update, and policy refresh across iterations, paired with two web-agent-specific adaptations, namely an everlasting rollout pool and lightweight screenshot handling, that tog...

📄 Agents' Last Exam
🗓️ Published: 6/3/2026
🔗 http://arxiv.org/abs/2606.05405v1
👥 Authors: Yiyou Sun, Xinyang Han, Weichen Zhang, Yuanbo Pang, Tianyu Wang, Yuhan Cao, Yixiao Huang, Chris Duroiu, Haoyun Zhang, Jeffrey Lin, Weishu Zhang, Tyler Zeng, Ying Yan, Bo Liu (possible past Meta (United States) affiliation), Hanson Wen, Mingyang Xu, Xiaoyuan Liu, Zimeng Chen, Weiyan Shi, Amanda Dsouza, Vincent Sunn Chen, Patrick Bryant, Carl Boettiger, Yamini Rangan, Bradley Rothenberg, Kyle Steinfeld, Arvind Rao, Tapio Schneider, Georgios Yannakakis, Laure Zanna, Kaan Ozbay, Ida Sim, Tarek Zohdi, George Em Karniadakis, Jack Gallant, Teresa Head-Gordon, Yushan Li, Wenxi Deng, Tao Sun, Huiqi Wang, Zhun Wang, Justin Xu, Chris Yuhao Liu, Yafei Cheng, Rongwang Hu, Aras Bacho, Shengcao Cao, Zengyi Qin, Yixiong Chen, Hengduan Fan, Hao Liu (possible past Tencent (China) affiliation), Lin Zeng, Shashank Muralidhar Bharadwaj, Litian Gong, Yingxuan Yang, Maojia Song, Ruheng Wang, Zongzheng Zhang, Honglin Bao, Shuo Lu, Jianhong Tu, Zhonghua Wang, Zheng Zhang, Zijiao Chen, Yanqiong Jiang, Zhendong Li, Bohan Lyu, Chang Ma, Peiran Xu, Benran Zhang, Shangding Gu, Haoyue Hua, Haoyang Li, Wanzhe Liao, Chengzhi Liu, Junbo Peng, Haoran Sun, Zechen Xu, Bo Chen (possible past Tencent (China) affiliation), Jiayi Cheng, Yi Jiang, Keying Kuang, Yuan Li (possible past Google (United States) affiliation), Youbang Pan, Ziyan Rao, Alexander Schubert, Yifan Shen, Vincent Siu, Xiatao Sun, Kangqi Zhang, Xiaopan Zhang, Yuchen Zhu, Ishaan Singh Chandok, Lei Ding, Jingxuan Fan, Andrew Glover, Jiaming Hu, Yiran Hu, Wenbo Huang, Zixin Jiang, Haoran Jin, Lukas Kim, Ming Liu, Yang Liu (possible past Tsinghua University affiliation), Alireza Rafiei, Xuhuan Shen, Kunyang Sun, Sophia Sun, Ting Sun, Eric Wang, Yixin Wang, Hanwen Xing, Sihan Xu, Yuzheng Xu, Zhongxing Xu, Zhiling Yan, Boqin Yuan, Ruiqi Zhang, Yifan Zhang, Zibo Zhao, Liana, Santanu Bosu Antu, Haoyue Bai, Carlo Bosio, Joseph Cavanagh, Patricia Cavazos-Rehg, Tianxing Chen, Xuewen Chen, Yipu Chen, Zhu Chenyu, Chen Dai, Stefano De Castro, Yunfu Deng, Kaustubh Dhole, Jiayuan Ding, Chenchen Du, Zhehang Du, Hao Fan, Run-Ze Fan, Hengyu Fu, Shi Gu, Yifan Gu, Charlie Guo, Baihe Huang, Baixiang Huang, Rimika Jaiswal, Zhihan Jiang, Ran Jin, Erin Kasson, Xin Lan, Joseph Lee, Deren Lei, Chenyu Li, Daofeng Li, Haitao Li, Hongwei Li, Jingyan Li, Xiao Li, Yi Li (possible past University Of Washington affiliation), Yinsheng Li, Yuangang Li, Zhixu Li, Wenyu Liang, Longtai Liao, Kevin Qinghong Lin (possible past National University Of Singapore affiliation), Andyzeyi Liu, Che Liu, Jiaming Liu (possible past Baidu (China) affiliation), Kaiyuan Liu, Xuan Liu (possible past Baidu (China) affiliation), Pan Lu (possible past Baidu (China) affiliation), Wenbo Lv, Yicheng Lv, Qiuyang Mang, Kyle Montgomery, Yuzhou Nie, Ruoxi Ning, Jorin Overwiening, Xu Pan, Layna Paraboschi, Core Francisco Park, Justin Purnomo, Swati Rajwal, Scott Rankin, Bixuan Ren, Yiren Rong, Haoyang Shang, Ventus Shaw, Fiona Shen, Jiawei Shen, Minqi Shi, Qiu Shi, Huaxiu Yao, Tianneng Shi, Jonah So, Vladislav Susoy, Hannah Szlyk, Haocheng Wang, Jialu Wang, Wei Wang (possible past University Of Oxford affiliation), Xinyu Wang, Zehao Wang, Dowling Wong, Angela Wu, Dehao Wu, Fangyu Wu, Mengyuan "millie" Wu, Yu Wu (possible past Baidu (China) affiliation), Yuchen Wu (possible past Google (United States) affiliation), Yuhao Wu, Qingpo Wuwu, Weihang Xiao, Yongyi Xiong, Fan Xu, Ruiling Xu, Mingxuan Yan, Benjamin Yang, Jirong Yang, Sen Yang (possible past Tencent (China) affiliation), Xiaoli Yang, Yushi Yang (possible past Stanford University affiliation), Haoran Ye, Xiaohu Yu, Zhengming Yu, Chenlong Zhang, Chi Zhang (possible past Peking University affiliation), Hanning Zhang, Hanwen Zhang, Junge Zhang, Kunpeng Zhang, Song Zhang, Wenjin Zhang, Wenshuo Zhang, Ying Zhang (possible past Tencent (China) affiliation), Yizhi Zhang, Brian Zhao, Qijian Zhao, Yimin Zhao, Yuhaohua Zheng, Liwei Zhou, Tianyue Zhou, Sichen Zhu, Siqi Zhu, Yan Zhu, Yishu Zhu, Jierui Zuo, Chonghao Cai, Helena Casademunt, Wenjia Chen, Benjamin Cheng, Nawen Deng, Rao Fu, Tianfu Fu, Yifan Han, Ren He, Zhenyu He, Qiao Jin, Lang Lang, Yuetai Li, Sylvia Liu, Lu Lu, Qing Lu, Subhabrata Mukherjee, Yunqi Ouyang, Yin Ren, Dawei Shi, Haoran Wu, Zhiyue Wu, Hannah Yao, Zhuoran Yi, Jenny Yu, Rhea Zhan, Hang Zhou (possible past Baidu (China) affiliation), Blake Zhu, Junfan Zhu, Alan Yuille (possible past Google (United States) affiliation), Yang Liu (possible past Tsinghua University affiliation), Russell Alan Poldrack, Jiachen Li, Zhenglu Li, Molei Tao, Jing Huang (possible past Meta (United States) affiliation), Wenqi Shi, Costas Spanos, Lichao Sun, Chenguang Wang (possible past Amazon (United States) affiliation), Orson Xu, Zhen Dong, Hector Gomez, Aylin Caliskan, Ali Emami, Haimin Hu, Zhi Li, Lihui Liu, Murphy Niu, Yi Shao, Jianxin Sun, Mikko Tolonen, Ting Wang, Sanjiv Das, Yanjun Gao, Wenbo Guo, Erika J Schneider, Zhiyong Lu, Mark Mueller, Radha Poovendran, Somayeh Sojoudi, Dawn Song (possible past University Of California, Berkeley affiliation)
Abstract

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks w...

*Notable papers are those with at least two authors from a "big" AI/ML lab.