πŸ“„ Notable* Recent AI/ML arXiv Papers

πŸ“„ BAMI: Training-Free Bias Mitigation in GUI Grounding
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06664v1
πŸ‘₯ Authors: Borui Zhang, Bo Zhang (possible past Tencent (China) affiliation), Bo Wang (possible past Tencent (China) affiliation), Wenzhao Zheng, Yuhao Cheng, Liang Tang, Yiqiang Yan, Jie Zhou (possible past Tsinghua University affiliation), Jiwen Lu (possible past Tsinghua University affiliation)
Abstract

GUI grounding is a critical capability for enabling GUI agents to execute tasks such as clicking and dragging. However, in complex scenarios like the ScreenSpot-Pro benchmark, existing models often suffer from suboptimal performance. Utilizing the proposed Masked Prediction Distribution (MPD) attribution method, we identify that the primary sources of errors are twofold: high image resolution (leading to precision bias) and intricate interface elements (resulting in ambiguity bias). To ...

πŸ“„ Verifier-Backed Hard Problem Generation for Mathematical Reasoning
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06660v1
πŸ‘₯ Authors: Yuhang Lai, Jiazhan Feng, Yee Whye Teh (possible past University Of Oxford affiliation), Ning Miao (possible past Peking University affiliation)
Abstract

Large Language Models (LLMs) demonstrate strong capabilities for solving scientific and mathematical problems, yet they struggle to produce valid, challenging, and novel problems - an essential component for advancing LLM training and enabling autonomous scientific research. Existing problem generation approaches either depend on expensive human expert involvement or adopt naive self-play paradigms, which frequently yield invalid problems due to reward hacking. This work introduces VHG, a verifi...

πŸ“„ Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06654v1
πŸ‘₯ Authors: Yuxing Liu, Jianyu Wang (possible past Carnegie Mellon University affiliation), Tong Zhang (possible past Tencent (China) affiliation)
Abstract

Optimizers play an important role in both pretraining and finetuning stages when training large language models (LLMs). In this paper, we present an observation that full finetuning with the same optimizer as in pretraining achieves a better learning-forgetting tradeoff, i.e., forgetting less while achieving the same or better performance on the new task, than other optimizers and, possibly surprisingly, LoRA, during the supervised finetuning (SFT) stage. We term this phenomenon optimizer-model ...

πŸ“„ AI Co-Mathematician: Accelerating Mathematicians with Agentic AI
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06651v1
πŸ‘₯ Authors: Daniel Zheng (possible past Deepmind (United Kingdom) affiliation), Ingrid Von Glehn, Yori Zwols, Iuliya Beloshapka, Lars Buesing (possible past Deepmind (United Kingdom) affiliation), Daniel M. Roy, Martin Wattenberg (possible past Google (United States) affiliation), Bogdan Georgiev, Tatiana Schmidt, Andrew Cowie (possible past Deepmind (United Kingdom) affiliation), Fernanda Viegas, Dimitri Kanevsky, Vineet Kahlon, Hartmut Maennel, Sophia Alj, George Holland (possible past Deepmind (United Kingdom) affiliation), Alex Davies (possible past Deepmind (United Kingdom) affiliation), Pushmeet Kohli (possible past Google (United States) affiliation)
Abstract

We introduce the AI co-mathematician, a workbench for mathematicians to interactively leverage AI agents to pursue open-ended research. The AI co-mathematician is optimized to provide holistic support for the exploratory and iterative reality of mathematical workflows, including ideation, literature search, computational exploration, theorem proving and theory building. By providing an asynchronous, stateful workspace that manages uncertainty, refines user intent, tracks failed hypotheses, and o...

πŸ“„ Recursive Agent Optimization
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06639v1
πŸ‘₯ Authors: Apurva Gandhi, Satyaki Chakraborty, Xiangjun Wang, Aviral Kumar (possible past University Of California, Berkeley affiliation), Graham Neubig (possible past Carnegie Mellon University affiliation)
Abstract

We introduce Recursive Agent Optimization (RAO), a reinforcement learning approach for training recursive agents: agents that can spawn and delegate sub-tasks to new instantiations of themselves recursively. Recursive agents implement an inference-time scaling algorithm that naturally allows agents to scale to longer contexts and generalize to more difficult problems via divide-and-conquer. RAO provides a method to train models to best take advantage of such recursive inference, teaching agents ...
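The divide-and-conquer delegation described in this abstract can be illustrated with a toy recursion (hypothetical sketch only; RAO's actual agent API and training procedure are not shown, and the summation task stands in for an arbitrary long-context task):

```python
# Toy sketch of recursive self-delegation: an agent whose input exceeds its
# "context window" spawns fresh instances of itself on halves of the input
# and combines their answers. CONTEXT_LIMIT is an invented parameter.

CONTEXT_LIMIT = 4  # max items one agent instance handles directly

def recursive_agent(task: list[int]) -> int:
    """Solve a task (here: summing numbers) via divide-and-conquer delegation."""
    if len(task) <= CONTEXT_LIMIT:
        return sum(task)  # base case: task fits in one instance's context
    mid = len(task) // 2
    # Delegate each half to a new instantiation of the same agent.
    left = recursive_agent(task[:mid])
    right = recursive_agent(task[mid:])
    return left + right  # combine sub-task results

print(recursive_agent(list(range(100))))  # 4950
```

Because each instance only ever sees a bounded slice of the task, this pattern scales to inputs far longer than any single instance's context, which is the inference-time scaling property the abstract refers to.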

πŸ“„ MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06623v1
πŸ‘₯ Authors: Zhexuan Wang, Xuebo Liu, Li Wang (possible past Tesla (United States) affiliation), Zifei Shan, Yutong Wang, Zhenxi Song, Min Zhang (possible past Tsinghua University affiliation)
Abstract

Large language model (LLM)-based multi-agent systems (MAS) have shown promise in tackling complex collaborative tasks, where agents are typically orchestrated via role-specific prompts. While the quality of these prompts is pivotal, jointly optimizing them across interacting agents remains a non-trivial challenge, primarily due to the misalignment between local agent objectives and holistic system goals. To address this, we introduce MASPO, a novel framework designed to automatically and iterati...

πŸ“„ SkillOS: Learning Skill Curation for Self-Evolving Agents
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06614v1
πŸ‘₯ Authors: Siru Ouyang, Jun Yan, Yanfei Chen, Rujun Han, Zifeng Wang, Bhavana Dalvi Mishra (possible past Carnegie Mellon University affiliation), Rui Meng, Chun-Liang Li, Yizhu Jiao, Kaiwen Zha, Maohao Shen, Vishy Tirumalashetty, George Lee, Jiawei Han (possible past Google (United States) affiliation), Tomas Pfister (possible past University Of Oxford affiliation), Chen-Yu Lee (possible past Google (United States) affiliation)
Abstract

LLM-based agents are increasingly deployed to handle streaming tasks, yet they often remain one-off problem solvers that fail to learn from past interactions. Reusable skills distilled from experience provide a natural substrate for self-evolution, where high-quality skill curation serves as the key bottleneck. Existing approaches either rely on manual skill curation, prescribe heuristic skill operations, or train for short-horizon skill operations. However, they still struggle to learn complex ...

πŸ“„ WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06407v1
πŸ‘₯ Authors: Guanrou Yang, Tian Tan, Qian Chen (possible past Shanghai Jiao Tong University affiliation), Zhikang Niu, Yakun Song, Ziyang Ma, Yushen Chen, Zeyu Xie, Tianrui Wang, Yifan Yang (possible past Tencent (China) affiliation), Wenxi Chen, Qi Chen (possible past Baidu (China) affiliation), Wenrui Liu, Shan Yang (possible past Google (United States) affiliation), Xie Chen
Abstract

Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics-oriented features are learned from self-supervised learning (SSL), and acoustic-oriented features from reconstruction. Such fragmented representations hinder the realization of truly unified speech systems. We present WavCube, a compact continuous laten...

πŸ“„ Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06320v1
πŸ‘₯ Authors: Elizabeth Mieczkowski, Alexander Ku (possible past Google (United States) affiliation), Tiwalayo Eisape, Dilip Arumugam, John Matters, Katherine M. Collins, Ilia Sucholutsky, Thomas L. Griffiths (possible past University Of California, Berkeley affiliation)
Abstract

Large language models (LLMs) are increasingly deployed in teams, yet existing coordination approaches often occupy two extremes. Highly structured methods rely on fixed roles, pipelines, or task decompositions assigned a priori. In contrast, fully unstructured teams enable adaptability and exploration but suffer from inefficiencies such as error propagation, inter-agent conflicts, and wasted resources (measured in time, tokens, or file operations). We introduce Language Agent Teams for Task Evol...

πŸ“„ NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06317v1
πŸ‘₯ Authors: Dijia Zhan, Jinyi Li, Chenxi Zheng, Shaoyu Huang, Yong Li (possible past Tsinghua University affiliation), Jie Tang (possible past Tsinghua University affiliation), Xuemiao Xu
Abstract

Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks. We propose Top-Down VLN (TD-VLN), reformulating navigation as a one-step global path plann...

πŸ“„ Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06279v1
πŸ‘₯ Authors: Chengjie Wang (possible past Tencent (China) affiliation), Jingzheng Wu, Xiang Ling, Tianyue Luo, Chen Zhao (possible past Stanford University affiliation)
Abstract

Large language models (LLMs) are now largely involved in software development workflows, and the code they generate routinely includes third-party library (TPL) imports annotated with specific version identifiers. These version choices can carry security and compatibility risks, yet they have not been systematically studied. We present the first large-scale measurement study of version-level risk in LLM-generated Python code, evaluating 10 LLMs on PinTrace, a curated benchmark of 1,000 Stack O...

πŸ“„ Safactory: A Scalable Agent Factory for Trustworthy Autonomous Intelligence
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06230v1
πŸ‘₯ Authors: Xinquan Chen, Zhenyun Yin, Shan He, Bin Huang, Shanzhe Lei, Pengcheng Shi, Kun Cai, Bei Chen, Bangwei Liu, Zeyu Kang, Chao Huang (possible past Tencent (China) affiliation), Yang Zhang (possible past Tsinghua University affiliation), Wenjie Li, Ruijun Ge, Yajie Wang, Tianshun Fang, Tianyang Xu, Yiwen Cong, Meng Jin, Gaolei Li, Xuansheng Wu, Linhan Liu, Zijing He, An Li, Yan Teng, Xin Tan, Chaochao Lu, Ji He, Jie Li, Chunfeng Song, Jinya Xu, Fan Song, Shujie Wang, Jianmin Qian, Jie Hou, Xuhong Wang, Yingchun Wang, Hui Wang, Xia Hu
Abstract

As large models evolve from conversational assistants into autonomous agents, challenges increasingly arise from long-horizon decision making, tool use, and real environment interaction. Existing agentic infrastructure remains fragmented across evaluation, data management, and agent evolution, making it difficult to discover risks systematically and improve models in a continuous closed loop. In this report, we present Safactory, a scalable agent factory for trustworthy autonomous intelli...

πŸ“„ Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06225v1
πŸ‘₯ Authors: Andy Zeyi Liu, Michael Zhang (possible past Stanford University affiliation), Ilana Greenberg, Adam Alnasser, Lucas Baker (possible past Deepmind (United Kingdom) affiliation), John Sous
Abstract

Steering large language models (LLMs) is usually done by either instruction prompting or activation steering. Prompting often gives strong control, but caches guidance tokens at every layer and can clutter long interactions; activation steering is compact but typically weaker and does not support large structured reminders. We introduce memory inception (MI), a training-free method that steers in latent attention space by inserting text-derived key-value (KV) banks only at selected layers. Rathe...

πŸ“„ When to Trust Imagination: Adaptive Action Execution for World Action Models
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06222v1
πŸ‘₯ Authors: Rui Wang (possible past Tencent (China) affiliation), Yue Zhang, Jiehong Lin, Kuncheng Luo, Jianan Wang (possible past Deepmind (United Kingdom) affiliation), Zhongrui Wang, Xiaojuan Qi (possible past University Of Oxford affiliation)
Abstract

World Action Models (WAMs) have recently emerged as a promising paradigm for robotic manipulation by jointly predicting future visual observations and future actions. However, current WAMs typically execute a fixed number of predicted actions after each model inference, leaving the robot blind to whether the imagined future remains consistent with the actual physical rollout. In this work, we formulate adaptive WAM execution as a future-reality verification problem: the robot should execute long...

πŸ“„ EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06192v1
πŸ‘₯ Authors: Zhaoyang Yang (possible past Tencent (China) affiliation), Yurun Jin, Lizhe Qi, Cong Huang, Kai Chen (possible past Shanghai Jiao Tong University affiliation)
Abstract

Pretrained video diffusion models provide powerful spatiotemporal generative priors, making them a natural foundation for robotic world models. While recent world-action models jointly optimize future videos and actions, they predominantly treat video generation as an auxiliary representation for policy learning. Consequently, they insufficiently explore the inverse problem: leveraging action signals to guide video synthesis, thereby often failing to preserve precise robot spatial geometry and f...

πŸ“„ BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06177v1
πŸ‘₯ Authors: Jinge Wu, Hongjian Zhou, Mingde Zeng, Jiayuan Zhu, Junde Wu (possible past Tencent (China) affiliation), Jiazhen Pan, Sean Wu, Honghan Wu, Fenglin Liu (possible past Peking University affiliation), David A. Clifton (possible past University Of Oxford affiliation)
Abstract

Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because the harness and tool registry differ, and integrating a new foundation model into a comparable evaluation surface costs weeks of model-specific engineering. We call this the per-paper engineering tax and release BioMedArena, an open-source toolkit that not only alleviates it but also provides an arena for fair comparison o...

πŸ“„ Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06139v1
πŸ‘₯ Authors: Yun Qu, Qi Wang (possible past Tsinghua University affiliation), Yixiu Mao, Heming Zou, Yuhang Jiang, Yingyue Li, Wutong Xu, Lizhou Cai, Weijie Liu (possible past Tencent (China) affiliation), Clive Bai, Kai Yang, Yangkun Chen, Saiyong Yang, Xiangyang Ji (possible past Tsinghua University affiliation)
Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for large language models (LLMs) post-training to incentivize reasoning capacity. Among existing recipes, group-based policy gradient is prevalent, which samples a group of responses per prompt and updates the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex a...
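The group-relative advantage signal mentioned in this abstract is commonly computed as a within-group reward normalization. A minimal sketch, assuming the GRPO-style normalization (which the paper analyzes and generalizes; its listwise formulation is not shown here):

```python
# Sketch of a group-relative advantage: sample G responses per prompt, score
# each with a verifiable reward, and normalize rewards within the group so
# above-average responses get positive advantage and below-average negative.

import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]

print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

The paper's geometric claim can then be read as: this normalization implicitly picks a target distribution over the group of responses toward which the policy is projected.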

πŸ“„ Dynamic Pondering Sparsity-aware Mixture-of-Experts Transformer for Event Stream based Visual Object Tracking
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06112v1
πŸ‘₯ Authors: Shiao Wang, Xiao Wang (possible past Google (United States) affiliation), Duoqing Yang, Wenhao Zhang, Bo Jiang, Lin Zhu, Yonghong Tian (possible past Peking University affiliation), Bin Luo
Abstract

Despite significant progress, RGB-based trackers remain vulnerable to challenging imaging conditions, such as low illumination and fast motion. Event cameras offer a promising alternative by asynchronously capturing pixel-wise brightness changes, providing high dynamic range and high temporal resolution. However, existing event-based trackers often neglect the intrinsic spatial sparsity and temporal density of event data, while relying on a single fixed temporal-window sampling strategy that is ...

πŸ“„ Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.05921v1
πŸ‘₯ Authors: Alex BΓ€uerle, Adam Connors, Alexander Novikov (possible past Google (United States) affiliation), Adam Zsolt Wagner, NgΓ’n VΕ©, Fernanda Viegas, Martin Wattenberg (possible past Google (United States) affiliation), Lucas Dixon (possible past Google (United States) affiliation)
Abstract

Artificial intelligence offers powerful new tools for scientific discovery, but the interaction paradigms required to effectively harness these systems remain underexplored. In this paper, we present findings from a formative user study with 11 expert mathematicians who used AlphaEvolve, an evolutionary coding agent, to tackle advanced problems in their fields of expertise. We identify and characterize a distinct workflow we term intentmaking, the iterative process of discovering, defining, and ...

πŸ“„ Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06522v1
πŸ‘₯ Authors: Xin Wang (possible past University Of Edinburgh affiliation), Haibo Chen, Wenxuan Liu, Wenwu Zhu (possible past Tsinghua University affiliation)
Abstract

Foundation models (FMs) are increasingly deployed in open-world settings where distribution shift is the rule rather than the exception. The out-of-distribution (OOD) phenomena they face -- knowledge boundaries, capability ceilings, compositional shifts, and open-ended task variation -- differ in kind from the settings that have shaped prior OOD research, and are further complicated because the pretraining and post-training distributions of modern FMs are often only partially observed. Our posit...

πŸ“„ Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06472v1
πŸ‘₯ Authors: Haoyu Zheng, Fangcheng Fu (possible past Peking University affiliation), Jia Wu, Binhang Yuan, Yongqiang Zhang, Hao Wang (possible past Tsinghua University affiliation), Yuanyuan Zhu, Xiao Yan, Jiawei Jiang (possible past Tencent (China) affiliation)
Abstract

LLM-based workflows compose specialized agents to execute complex tasks, and these agents usually share substantial context, allowing KV-Cache reuse to save computation. Existing approaches either manage KV-Cache at agent level and fail to exploit the reuse opportunities within workflows, or manage cache at the workflow level but assume that each workflow calls a static sequence of agents. However, practical workflows are typically dynamic, where the sequence of invoked agents and thus induced c...
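The cross-agent reuse opportunity this abstract describes can be illustrated with a toy cache keyed by a hash of the shared context prefix (illustrative only; `PrefixKVCache` and its stand-in "computation" are invented for the sketch, not the paper's system):

```python
# Toy prefix-level KV-Cache: agents whose prompts share a context prefix
# reuse the cached entry for that prefix instead of recomputing it.

import hashlib

class PrefixKVCache:
    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, prefix: str) -> str:
        return hashlib.sha256(prefix.encode()).hexdigest()

    def get_or_compute(self, prefix: str):
        """Return the (stand-in) KV state for a prefix, computing it on a miss."""
        k = self._key(prefix)
        if k in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[k] = len(prefix)  # stand-in for the expensive prefill
        return self.store[k]

cache = PrefixKVCache()
shared = "SYSTEM: you are a helpful agent.\n"
for agent_suffix in ["plan", "code", "review"]:
    cache.get_or_compute(shared)                  # shared prefix: reused after 1st agent
    cache.get_or_compute(shared + agent_suffix)   # per-agent context: always unique

print(cache.hits, cache.misses)  # 2 4
```

In a dynamic workflow the sequence of agents, and hence which prefixes recur, is not known ahead of time, which is why the paper argues for prediction-based cache management rather than a static schedule.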

πŸ“„ MINER: Mining Multimodal Internal Representation for Efficient Retrieval
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06460v1
πŸ‘₯ Authors: Weien Li, Rui Song (possible past Peking University affiliation), Zeyu Li (possible past Peking University affiliation), Haochen Liu, Gonghao Zhang, Difan Jiao, Zhenwei Tang, Bowei He, Haolun Wu, Xue Liu, Ye Yuan (possible past Carnegie Mellon University affiliation)
Abstract

Visual document retrieval has become essential for accessing information in visually rich documents. Existing approaches fall into two camps. Late-interaction retrievers achieve strong quality through fine-grained token-level matching but store hundreds of vectors per page, incurring large index footprints and high serving costs. By contrast, dense single-vector retrievers retain storage and latency advantages but consistently lag in quality because they compress all information into a single fi...

πŸ“„ Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06083v1
πŸ‘₯ Authors: Jun Li, Peifeng Lai, Xuhang Lou, Jinpeng Wang (possible past Tencent (China) affiliation), Yuting Wang, Ke Chen (possible past Tencent (China) affiliation), Yaowei Wang, Shu-Tao Xia
Abstract

Partially relevant video retrieval aims to retrieve untrimmed videos using text queries that describe only partial content. However, the inherent asymmetry between brief queries and rich video content inevitably introduces uncertainty into the retrieval process. In this setting, vague queries often induce semantic ambiguity across videos, a challenge that is further exacerbated by the sparse temporal supervision within videos, which fails to provide sufficient matching evidence. To address this,...

πŸ“„ Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
πŸ—“οΈ Published: 5/7/2026
πŸ”— http://arxiv.org/abs/2605.06055v1
πŸ‘₯ Authors: Tianlun Hu, Tiancheng Hu, Shengsheng Litang, Sheng Wang (possible past Tencent (China) affiliation), Xiaoming Bao, Yuxing Li, Wei Wang (possible past University Of Oxford affiliation), Zhongzhe Hu, Lijun Li, Hongwei Sun, Jingbin Zhou
Abstract

Mixture-of-Experts (MoE) inference requires large-scale token exchange across devices, making dispatch and combine major bottlenecks in both prefill and decode. Beyond network transfer, routing-driven layout transformation, temporary relay, and output restoration can add substantial overhead. Existing MoE communication paths are often buffer-centric, using explicit inter-process relay and reordering buffers around collective transfer. This report presents a relay-buffer-free communication design...

*Notable papers are those with at least two authors from a "big" AI/ML lab.