📄 Notable* Recent AI/ML arXiv Papers

📄 Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.13801v1
👥 Authors: Deepak Pandita, Flip Korn (possible past Google (United States) affiliation), Chris Welty (possible past IBM (United States) affiliation), Christopher M. Homan
Abstract

As generative AI models such as large language models (LLMs) become more pervasive, ensuring the safety, robustness, and overall trustworthiness of these systems is paramount. However, AI is currently facing a reproducibility crisis driven by unreliable evaluations and unrepeatable experimental results. While human raters are often used to assess models for utility and safety, they introduce divergent biases and subjective opinions into their annotations. Overcoming this variance is exceptionall...

📄 MinT: Managed Infrastructure for Training and Serving Millions of LLMs
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.13779v1
👥 Authors: Mind Lab, Song Cao, Vic Cao, Andrew Chen, Kaijie Chen, Cleon Cheng, Steven Chiang, Kaixuan Fan, Hera Feng, Huan Feng, Arthur Fu, Jun Gao (possible past Nvidia (United States) affiliation), Hongquan Gu, Aaron Guan, Nolan Ho, Mutian Hong, Hailee Hou, Peixuan Hua, Charles Huang, Miles Jiang, Nora Jiang, Yuyi Jiang, Qiuyu Jin, Fancy Kong, Andrew Lei, Kyrie Lei, Alexy Li, Lucian Li, Ray Li, Theo Li, Zhihui Li, Jiayi Lin, Kairus Liu, Kieran Liu, Logan Liu, Xiang Liu, Irvine Lu, Maeve Luo, Runze Lv, Pony Ma, Verity Niu, Anson Qiu, Vincent Wang, Rio Yang, Maxwell Yao, Carrie Ye, Regis Ye, Wenlin Ye, Josh Ying, Danney Zeng, Yuhan Zhan, Anya Zhang, Di Zhang, Ruijia Zhang, Sueky Zhang, Ya Zhang, Wei Zhao (possible past Tencent (China) affiliation), Ada Zhou, Changhai Zhou, Yuhua Zhou, Xinyue Zhu, Murphy Zhuang
Abstract

We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model resident and moves exported LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback, hiding distributed training, serving...
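The abstract's key design point is that a "policy" is just a LoRA adapter over a resident base model. A minimal sketch of that mechanism (illustrative only, not MinT's code; sizes and names are invented):

```python
import numpy as np

# Minimal LoRA sketch: the base weight W stays resident while each "policy"
# is just a small pair of low-rank factors (A, B) that can be swapped,
# exported, or rolled back without rewriting W.
rng = np.random.default_rng(0)
d = 8          # hidden size (illustrative)
r = 2          # LoRA rank (illustrative)
W = rng.normal(size=(d, d))          # frozen base weight, loaded once

def make_adapter(seed: int):
    """One trained policy = one low-rank adapter revision."""
    g = np.random.default_rng(seed)
    A = g.normal(size=(r, d)) * 0.1
    B = g.normal(size=(d, r)) * 0.1
    return A, B

def forward(x, adapter=None):
    """y = x W^T + x (B A)^T; swapping adapters never touches W."""
    y = x @ W.T
    if adapter is not None:
        A, B = adapter
        y = y + x @ (B @ A).T
    return y

x = rng.normal(size=(1, d))
base = forward(x)
policy_v1 = make_adapter(1)
policy_v2 = make_adapter(2)
# "Rollback" is just routing requests back to an earlier adapter revision.
assert not np.allclose(forward(x, policy_v1), forward(x, policy_v2))
assert np.allclose(forward(x, None), base)
```

Because the adapter is rank-r, each policy costs O(2dr) parameters instead of a full d×d checkpoint, which is what makes serving many policies over one deployment tractable.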

📄 AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.13724v1
👥 Authors: Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han (possible past Stanford University affiliation), Han Cai, Mike Zheng Shou (possible past National University of Singapore affiliation)
Abstract

Few-step video generation has been significantly advanced by consistency distillation. However, the performance of consistency-distilled models often degrades as more sampling steps are allocated at test time, limiting their effectiveness for any-step video diffusion. This limitation arises because consistency distillation replaces the original probability-flow ODE trajectory with a consistency-sampling trajectory, weakening the desirable test-time scaling behavior of ODE sampling. To address th...

📄 AttenA+: Rectifying Action Inequality in Robotic Foundation Models
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.13548v1
👥 Authors: Daojie Peng, Fulong Ma, Jiahang Cao, Qiang Zhang (possible past Tsinghua University affiliation), Xupeng Xie, Jian Guo, Ping Luo (possible past Shanghai Artificial Intelligence Laboratory affiliation), Andrew F. Luo, Boyu Zhou, Jun Ma
Abstract

Existing robotic foundation models, while powerful, are predicated on an implicit assumption of temporal homogeneity: treating all actions as equally informative during optimization. This "flat" training paradigm, inherited from language modeling, remains indifferent to the underlying physical hierarchy of manipulation. In reality, robot trajectories are fundamentally heterogeneous, where low-velocity segments often dictate task success through precision-demanding interactions, while high-veloci...

📄 Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.13534v1
👥 Authors: Jiabei Liu, Wenyu Mao, Junfei Tan, Chunxu Shen, Lingling Yi (possible past Tencent (China) affiliation), Jiancan Wu, Xiang Wang (possible past Tencent (China) affiliation)
Abstract

Deep search agents have proven effective in enhancing LLMs by retrieving external knowledge during multi-step reasoning. However, existing methods often generate a single query for retrieval at each reasoning step, limiting information coverage and introducing high noise. This may result in low signal-to-noise ratios (SNR) during search, degrading reasoning accuracy and leading to unnecessary reasoning steps. In this paper, we introduce MultiSearch, an RL-based framework that addresses these lim...
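The mechanism the abstract contrasts with single-query search can be sketched in a few lines (illustrative only; `stub_search` and its corpus are invented stand-ins, not MultiSearch's retriever or RL training loop):

```python
from concurrent.futures import ThreadPoolExecutor

# Fire several queries per reasoning step in parallel, then explicitly merge
# and deduplicate the hits, rather than issuing one query at a time.
def stub_search(query: str) -> list[str]:
    # Hypothetical stand-in for a real retriever.
    corpus = {
        "lora rank": ["doc_lora", "doc_peft"],
        "kv cache": ["doc_kv", "doc_peft"],
    }
    return corpus.get(query, [])

def parallel_search(queries: list[str]) -> list[str]:
    with ThreadPoolExecutor(max_workers=4) as pool:
        result_lists = list(pool.map(stub_search, queries))
    merged: list[str] = []
    seen = set()
    for hits in result_lists:      # keep per-query order, drop duplicates
        for doc in hits:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged
```

Issuing the queries together widens coverage per step; the explicit merge is where a method like this can filter noise before the results re-enter the reasoning context.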

📄 Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.13530v1
👥 Authors: Jincai Huang, Shihao Zou, Yuchen Guo, Jingjing Li (possible past Google (United States) affiliation), Wei Ji (possible past Tencent (China) affiliation), Kai Wang, Shanshan Wang, Weixin Si
Abstract

Surgical scene understanding is a cornerstone of computer-assisted intervention. While recent advances, particularly in surgical image segmentation, have driven progress, real-world clinical applications require a more holistic understanding that jointly captures procedural context, semantic reasoning, and precise visual grounding. However, existing approaches typically address these components in isolation, leading to fragmented representations and limited semantic consistency. To address this ...

📄 MMSkills: Towards Multimodal Skills for General Visual Agents
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.13527v1
👥 Authors: Kangning Zhang, Shuai Shao, Qingyao Li, Jianghao Lin, Lingyue Fu, Shijian Wang, Wenxiang Jiao, Yuan Lu, Weiwen Liu, Weinan Zhang (possible past Shanghai Jiao Tong University affiliation), Yong Yu (possible past Shanghai Jiao Tong University affiliation)
Abstract

Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimo...

📄 Many-Shot CoT-ICL: Making In-Context Learning Truly Learn
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.13511v1
👥 Authors: Tsz Ting Chung, Lemao Liu (possible past Tencent (China) affiliation), Mo Yu (possible past Tencent (China) affiliation), Dit-Yan Yeung
Abstract

In-context learning (ICL) adapts large language models (LLMs) to new tasks by conditioning on demonstrations in the prompt without parameter updates. With long-context models, many-shot ICL can use dozens to hundreds of examples and achieve performance comparable to fine-tuning, yet current understanding of its scaling behavior is largely derived from non-reasoning tasks. We study many-shot chain-of-thought in-context learning (CoT-ICL) for reasoning and show that standard many-shot rules do not...
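For readers unfamiliar with the setup, many-shot CoT-ICL just means the prompt carries many (question, reasoning, answer) demonstrations before the test question. A minimal builder (illustrative; the field labels are assumptions, not the paper's prompt format):

```python
# Build a many-shot chain-of-thought prompt: each demo contributes a
# (question, reasoning, answer) triple; the test question ends open after
# "Reasoning:" so the model continues from there.
def build_cot_prompt(demos, question):
    parts = []
    for q, cot, a in demos:
        parts.append(f"Q: {q}\nReasoning: {cot}\nA: {a}")
    parts.append(f"Q: {question}\nReasoning:")
    return "\n\n".join(parts)

demos = [
    ("2+2?", "2 plus 2 is 4.", "4"),
    ("3*3?", "3 times 3 is 9.", "9"),
]
prompt = build_cot_prompt(demos, "5-2?")
```

Scaling "shots" here means growing `demos` from a handful to hundreds, which is exactly the regime whose scaling behavior the paper studies for reasoning tasks.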

📄 RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.13391v1
👥 Authors: Liangtian Liu, Zeyuan Wang, Ziyu Li, Kai Ouyang, Zichao Tang, Chengfu Liu, Haifeng Li, Hanwen Yu, Wentao Yang (possible past Google (United States) affiliation), Cheng Yang (possible past Tsinghua University affiliation), Dongyang Hou
Abstract

The rise of multi-modal large language models (MLLMs) is shifting remote sensing (RS) intelligence from "see" to "action", as OpenClaw-style frameworks enable agents to autonomously operate massive RS image-processing tools for complex tasks. Existing RS agents adopt a passive selection paradigm for tool invocation, relying on either full tool registration (Flat) or retrieval-augmented generation (RAG). However, in the massive and multi-source heterogeneous RS tool ecosystem, such passive mechan...
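A hierarchical skill tree can be sketched as nested categories that an agent walks one level at a time, instead of searching a flat tool registry (all names below are invented; this is not the RS-Claw implementation):

```python
# Hypothetical skill tree: tools grouped coarse-to-fine so an agent can
# explore progressively rather than register or retrieve over a flat list.
skill_tree = {
    "preprocessing": {"cloud_removal": "tool_cr", "pansharpen": "tool_ps"},
    "detection": {"ship": "tool_ship", "building": "tool_bld"},
}

def resolve(path):
    """Walk the tree along a category path down to a concrete tool id."""
    node = skill_tree
    for key in path:
        node = node[key]
    return node

# "Active exploration" = inspecting one level at a time before committing:
top_level = sorted(skill_tree)
detectors = sorted(skill_tree["detection"])
```

The tree bounds each decision to a small branching factor, which is what makes massive heterogeneous tool pools navigable for an agent.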

📄 Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.13301v1
👥 Authors: Yafu Li, Runzhe Zhan, Haoran Zhang, Shunkai Zhang, Yizhuo Li, Zhilin Wang, Jiacheng Chen, Futing Wang, Xuyang Hu, Yuchen Fan, Bangjie Xu, Yucheng Su, Xinmiao Han, Chenxi Li, Haodi Lei, Yufeng Zhao, Zejin Lin, Qianjia Cheng, Tong Zhu (possible past Nvidia (United States) affiliation), Xiaoye Qu, Ganqu Cui (possible past Tsinghua University affiliation), Peng Ye, Yun Luo, Zhouchen Lin (possible past Peking University affiliation), Yu Qiao (possible past Shanghai Artificial Intelligence Laboratory affiliation), Bowen Zhou, Ning Ding (possible past Tsinghua University affiliation), Yu Cheng (possible past National University of Singapore affiliation)
Abstract

Recent progress in reasoning models has substantially advanced long-horizon mathematical and scientific problem solving, with several systems now reaching gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems. In this paper, we introduce a simple and unified recipe for converting a post-trained reasoning backbone into a rigorous olympiad-level solver. The recipe first uses a reverse-perplexity curriculum for SFT to instill ri...
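The abstract does not spell out the "reverse-perplexity curriculum"; one plausible reading is ordering SFT examples by a perplexity-based difficulty score. A toy sketch under that assumption (scores and ordering direction are invented for illustration):

```python
# Hypothetical curriculum: score each SFT example by perplexity under the
# backbone, then order examples by that score before training. Whether the
# paper trains high-to-low (as here) or low-to-high is an assumption.
examples = [("easy_proof", 1.2), ("medium_proof", 3.4), ("hard_proof", 9.8)]

def perplexity_curriculum(scored, reverse=True):
    """Return samples sorted by perplexity; reverse=True puts hardest first."""
    return [sample for sample, _ in sorted(scored, key=lambda e: e[1], reverse=reverse)]

curriculum = perplexity_curriculum(examples)
```

The point of any such curriculum is that the data order itself becomes a training knob, separate from the data content.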

📄 Discrete Diffusion for Complex and Congested Multi-Agent Path Finding with Sparse Social Attention
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.13296v1
👥 Authors: Yuanzhe Wang, Tian Zhi, Zihang Wei, Hongguang Wang, Jiaming Guo, Yang Zhao (possible past Google (United States) affiliation), Zisheng Liu, Shiyu Quan, Xing Hu (possible past Baidu (China) affiliation), Zidong Du, Yunji Chen
Abstract

Multi-Agent Path Finding (MAPF) is a coordination problem that requires computing globally consistent, collision-free trajectories from individual start positions to assigned goal positions under combinatorial planning complexity. In dense environments, suboptimal initial plans induce compound conflicts that hinder feasible repair. For repair-based solvers like LNS2, initial plan quality critically affects downstream repair, yet this factor remains underexplored. We propose DiffLNS, a hybrid fra...

📄 Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.13255v1
👥 Authors: Junlong Ke, Zichen Wen, Weijia Li (possible past Tsinghua University affiliation), Conghui He (possible past Tsinghua University affiliation), Linfeng Zhang
Abstract

On-policy self-distillation trains a reasoning model on its own rollouts while a teacher, often the same model conditioned on privileged context, provides dense token-level supervision. Existing objectives typically weight the teacher's token-level signal uniformly across a chain-of-thought sequence, despite substantial variation in the entropy of the teacher's predictive distribution. We propose EGRSD (Entropy-Guided Reinforced Self-Distillation), which unifies token-level updates through three...
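The contrast the abstract draws, uniform versus entropy-aware token weighting, can be sketched directly (a generic sketch of the idea only; EGRSD's actual objective is not given in the abstract):

```python
import numpy as np

# Weight each token's distillation loss by the entropy of the teacher's
# predictive distribution instead of uniformly across the sequence.
def entropy(p, axis=-1):
    return -np.sum(p * np.log(p + 1e-12), axis=axis)

def entropy_weighted_kl(teacher, student):
    """teacher, student: [T, V] arrays of per-token next-token distributions."""
    kl = np.sum(teacher * (np.log(teacher + 1e-12) - np.log(student + 1e-12)), axis=-1)
    w = entropy(teacher)              # high-entropy (uncertain) tokens weigh more
    w = w / (w.sum() + 1e-12)         # normalize weights over the sequence
    return float(np.sum(w * kl))

uniform = np.full((3, 4), 0.25)                       # maximally uncertain teacher
peaked = np.tile([0.97, 0.01, 0.01, 0.01], (3, 1))    # confident student
assert entropy_weighted_kl(uniform, peaked) > 0.0
assert abs(entropy_weighted_kl(uniform, uniform)) < 1e-6
```

Under uniform weighting every token's KL counts equally; here near-deterministic teacher tokens are down-weighted, concentrating supervision where the teacher is genuinely uncertain.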

📄 Teacher-Guided Policy Optimization for LLM Distillation
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.13230v1
👥 Authors: Xinyu Liu, Kechen Jiao, Chunyang Xiao, Runsong Zhao, Junhao Ruan, Bei Li, Jiahao Liu, Qifan Wang (possible past Google (United States) affiliation), Xin Chen (possible past Tencent (China) affiliation), Jingang Wang, Tong Xiao, Jingbo Zhu
Abstract

The convergence of reinforcement learning and imitation learning has positioned Reverse KL (RKL) as a promising paradigm for on-policy LLM distillation, aiming to unify exploration with teacher supervision. However, we identify a critical limitation: when the student and teacher distributions diverge significantly, standard RKL often fails to yield meaningful improvement due to uninformative negative feedback. To address this inefficiency, we propose Teacher-Guided Policy Optimization (TGPO), an...
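For reference, the Reverse KL the abstract critiques is KL(student ‖ teacher): the expectation is taken under the student's own distribution, which is what makes the objective on-policy. A minimal numerical sketch (TGPO's modification is not reproduced here):

```python
import numpy as np

# Reverse KL: KL(s || t) = sum_x s(x) * log(s(x) / t(x)).
def reverse_kl(student, teacher):
    s = np.asarray(student, dtype=float)
    t = np.asarray(teacher, dtype=float)
    return float(np.sum(s * np.log(s / t)))

student = np.array([0.7, 0.2, 0.1])
teacher = np.array([0.1, 0.2, 0.7])
assert reverse_kl(student, teacher) > 0.0        # diverged student is penalized
assert abs(reverse_kl(student, student)) < 1e-12 # identical distributions: zero
```

When the two distributions barely overlap, the gradient signal is dominated by pushing mass off student modes the teacher dislikes, which is the "uninformative negative feedback" failure the paper targets.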

📄 Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.13171v1
👥 Authors: Moritz Firsching, Paul Lezeau, Salvatore Mercuri, Miklós Z. Horváth, Yaël Dillies, Calle Sönne, Eric Wieser (possible past University of Cambridge affiliation), Fred Zhang, Thomas Hubert (possible past DeepMind (United Kingdom) affiliation), Blaise Agüera y Arcas (possible past Google (United States) affiliation), Pushmeet Kohli (possible past Google (United States) affiliation)
Abstract

As automated reasoning systems advance rapidly, there is a growing need for research-level formal mathematical problems to accurately evaluate their capabilities. To address this, we present Formal Conjectures, an evolving benchmark of currently 2615 mathematical problem statements formalized in Lean 4. Sourced from areas of active mathematical research, the dataset features 1029 open research conjectures providing a zero-contamination benchmark for mathematical proof discovery, and 836 solved p...
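To give a flavor of what such a formalized open conjecture looks like, here is an illustrative Lean 4 statement in the same spirit (not an entry taken from the dataset):

```lean
-- Illustrative only: an open conjecture stated in Lean 4 with Mathlib,
-- left as `sorry` because no proof is known.
import Mathlib.Data.Nat.Prime.Basic

/-- Twin prime conjecture: there are infinitely many primes p with p + 2 prime. -/
theorem twin_prime_conjecture :
    ∀ n : ℕ, ∃ p : ℕ, n ≤ p ∧ p.Prime ∧ (p + 2).Prime := by
  sorry
```

A benchmark entry of this kind is "zero-contamination" in the sense that no proof exists anywhere to leak into training data; success can only come from genuine discovery, and the Lean kernel verifies it.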

📄 Context Training with Active Information Seeking
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.13050v1
👥 Authors: Zeyu Huang, Adhiguna Kuncoro (possible past University of Oxford affiliation), Qixuan Feng, Jiajun Shen, Lucio Dery, Arthur Szlam (possible past Meta (United States) affiliation), Marc'Aurelio Ranzato
Abstract

Most existing large language models (LLMs) are expensive to adapt after deployment, especially when a task requires newly produced information or niche domain knowledge. Recent work has shown that, by manipulating and optimizing their context, LLMs can be tailored to downstream tasks without updating their weights. However, most existing methods remain closed-loop, relying solely on the model's intrinsic knowledge. In this paper, we equip these context optimizers with Wikipedia search and browse...

📄 No Attack Required: Semantic Fuzzing for Specification Violations in Agent Skills
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.13044v1
👥 Authors: Ying Li (possible past Meta (United States) affiliation), Hongbo Wen, Yanju Chen, Hanzhi Liu, Yuan Tian, Yu Feng (possible past University of California, Berkeley affiliation)
Abstract

LLM-powered agents can silently delete documents, leak credentials, or transfer funds on a routine user request, not because the agent was attacked, but because the skill it invoked broke its own declared safety rules. We call these specification violations: benign inputs cause a skill to breach the natural-language guardrails in its own specification, typically because the guardrail's semantics are undefined for autonomous execution, or because the implementation silently ignores the documented...

📄 Position: Agentic AI System Is a Foreseeable Pathway to AGI
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.12966v1
👥 Authors: Junwei Liao, Shuai Li, Muning Wen, Jun Wang (possible past Tencent (China) affiliation), Weinan Zhang (possible past Shanghai Jiao Tong University affiliation)
Abstract

Is monolithic scaling the only path to AGI? This paper challenges the dogma that purely scaling a single model is sufficient to achieve Artificial General Intelligence. Instead, we identify Agentic AI as a necessary paradigm for mastering the complex, heterogeneous distribution of real-world tasks. Through rigorous theoretical derivations, we contrast the optimization constraints of monolithic learners against the efficiency of Agentic systems, progressing from simple routing mechanisms to gener...

📄 Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.12953v1
👥 Authors: Chao Hao, Jun Xu (possible past Google (United States) affiliation), Ji Du, Shuo Ye, Ziyue Qiao, Xiaodong Cun (possible past Tencent (China) affiliation), Guangcong Wang, Xubin Zheng, Zitong Yu
Abstract

Language-guided segmentation transcends the scope limitations of traditional semantic segmentation, enabling models to segment arbitrary target regions based on natural language instructions. Existing approaches typically adopt a two-stage framework: employing Multimodal Large Language Models (MLLMs) to interpret instructions and generate visual prompts, followed by foundational segmentation models (e.g., SAM) to produce masks. However, due to the limited spatial grounding capabilities of off-th...

📄 Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.12894v1
👥 Authors: Harshita Chopra, Kshitish Ghate, Aylin Caliskan, Tadayoshi Kohno (possible past University of Washington affiliation), Chirag Shah, Natasha Jaques (possible past University of California, Berkeley affiliation)
Abstract

Large Language Model (LLM) agents are increasingly deployed in settings where they interact with a wide variety of people, including users who are unclear, impatient, or reluctant to share information. However, collecting real interaction data at scale remains expensive. The field has turned to LLM-based user simulators as stand-ins, but these simulators inherit the behavior of their underlying models: cooperative and homogeneous. As a result, agents that appear strong in simulation often fail u...

📄 From Generalist to Specialist Representation
🗓️ Published: 5/12/2026
🔗 http://arxiv.org/abs/2605.12733v1
👥 Authors: Yujia Zheng, Fan Feng, Yuke Li, Shaoan Xie, Kevin Murphy (possible past Google (United States) affiliation), Kun Zhang (possible past Google (United States) affiliation)
Abstract

Given a generalist model, learning a task-relevant specialist representation is fundamental for downstream applications. Identifiability, the asymptotic guarantee of recovering the ground-truth representation, is critical because it sets the ultimate limit of any model, even with infinite data and computation. We study this problem in a completely nonparametric setting, without relying on interventions, parametric forms, or structural constraints. We first prove that the structure between time s...

📄 Building Interactive Real-Time Agents with Asynchronous I/O and Speculative Tool Calling
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.13360v1
👥 Authors: Coleman Hooper, Minwoo Kang, Suhong Moon, Nicholas Lee, Eric Wen, John Wawrzynek, Michael W. Mahoney (possible past Stanford University affiliation), Yakun Sophia Shao (possible past University of California, Berkeley affiliation), Amir Gholami, Kurt Keutzer (possible past University of California, Berkeley affiliation)
Abstract

There is a growing demand for agentic AI technologies for a range of downstream applications like customer service and personal assistants. For applications where the agent needs to interact with a person, real-time low-latency responsiveness is required; for example, with voice-controlled applications, under 1 second of latency is typically needed for the interaction to feel seamless. However, if we want the LLM to reason and execute an agentic workflow with tool calling, this can add...

📄 EMO: Frustratingly Easy Progressive Training of Extendable MoE
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.13247v1
👥 Authors: Linghao Jin, Chufan Shi, Huijuan Wang, Nuan Wen, Zhengzhong Liu (possible past Tencent (China) affiliation), Eric Xing, Xuezhe Ma (possible past Carnegie Mellon University affiliation)
Abstract

Sparse Mixture-of-Experts (MoE) models offer a powerful way to scale model size without increasing compute, as per-token FLOPs depend only on k active experts rather than the total pool of E experts. Yet, this asymmetry creates an MoE efficiency paradox in practice: adding more experts balloons memory and communication costs, making actual training inefficient. We argue that this bottleneck arises in part because current MoE training allocates too many experts from the beginning, even though ear...
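The FLOPs asymmetry the abstract describes follows directly from top-k routing: per-token compute touches only the k selected experts, so the pool size E affects memory and communication but not per-token FLOPs. A generic routing sketch (not EMO's progressive training schedule; sizes are invented):

```python
import numpy as np

# Top-k MoE routing: a gate scores all E experts, but only the k selected
# expert weights actually run for this token.
rng = np.random.default_rng(0)
d, E, k = 4, 8, 2
experts = [rng.normal(size=(d, d)) for _ in range(E)]   # E expert weight matrices
gate_w = rng.normal(size=(d, E))

def moe_forward(x):
    logits = x @ gate_w
    topk = np.argsort(logits)[-k:]                       # indices of the k best experts
    probs = np.exp(logits[topk]) / np.exp(logits[topk]).sum()
    y = sum(p * (x @ experts[i]) for p, i in zip(probs, topk))
    return y, topk

x = rng.normal(size=(d,))
y, used = moe_forward(x)
assert len(used) == k      # only k of the E experts ran for this token
```

Growing `experts` from 8 to 800 leaves `moe_forward`'s per-token cost unchanged, which is exactly why over-allocating experts early in training wastes memory without buying compute.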

📄 A Hybrid Tucker-LSTM Tensor Network Model for SOC Prediction in Electric Vehicles
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.13200v1
👥 Authors: Han Wang (possible past Peking University affiliation), Ying Wang (possible past Tsinghua University affiliation), Bing Wang
Abstract

Accurate state-of-charge (SOC) estimation is critical for the success of electric vehicle battery management strategies, but it is well known that conventional estimators suffer from two fundamental shortcomings: cumulative errors that grow over time and reliance on simplified battery models that do not reflect real-world dynamics. Therefore, this paper presents a novel hybrid approach combining Tucker tensor decomposition with LSTM networks, using full-lifecycle EV field data for SOC prediction. Th...
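For context, Tucker decomposition compresses a multi-way tensor into a small core plus one factor matrix per mode. A minimal reconstruction sketch (illustrative only; dimensions are invented and the paper's hybrid Tucker-LSTM model is not reproduced):

```python
import numpy as np

# Tucker: X[i,j,k] = sum_{a,b,c} G[a,b,c] * A[i,a] * B[j,b] * C[k,c]
# with a small core G and factor matrices A, B, C for each mode.
rng = np.random.default_rng(0)
I, J, K = 6, 5, 4        # original dims (e.g. cycles x features x time, assumed)
r1, r2, r3 = 2, 2, 2     # Tucker ranks
G = rng.normal(size=(r1, r2, r3))   # core tensor
A = rng.normal(size=(I, r1))
B = rng.normal(size=(J, r2))
C = rng.normal(size=(K, r3))

X = np.einsum("abc,ia,jb,kc->ijk", G, A, B, C)   # mode products in one einsum
assert X.shape == (I, J, K)
```

The compressed representation (G, A, B, C) carries far fewer parameters than the full tensor, which is what makes it a plausible front end for feeding long lifecycle field data into an LSTM.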

📄 F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.12995v1
👥 Authors: Rohan Surana, Gagan Mundada, Junda Wu, Xintong Li, Yizhu Jiao, Bowen Jin, Sizhe Zhou, Tong Yu (possible past Carnegie Mellon University affiliation), Ritwik Sinha, Jiawei Han (possible past Google (United States) affiliation), Jingbo Shang, Julian McAuley
Abstract

Traditional retrieval pipelines optimize utility through stages of candidate retrieval and reranking, where ranking operates over a predefined candidate set. Large Language Models (LLMs) broaden this into a generative process: given a candidate pool, an LLM can generate a subset and order it within a single autoregressive pass. However, this flexibility introduces a new optimization challenge: the model must search a combinatorial output space while receiving utility feedback only after the full...
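The "group-relative" core of GRPO-style methods is simple: each sampled output's advantage is its reward relative to the mean of its sampling group. A sketch of that baseline computation (generic GRPO; F-GRPO's factorization is not reproduced here):

```python
import numpy as np

# Group-relative advantage: reward minus the group mean, normalized by the
# group std, so no separate value network is needed.
def group_relative_advantages(rewards):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four sampled outputs for one prompt, with sparse end-of-sequence utility:
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
assert abs(adv.sum()) < 1e-6   # advantages are centered within the group
```

Centering within the group turns a single scalar utility signal, received only after the full output is generated, into a relative learning signal across the sampled candidates.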

📄 The Efficiency Gap in Byte Modeling
🗓️ Published: 5/13/2026
🔗 http://arxiv.org/abs/2605.12928v1
👥 Authors: Celine Lee, Jing Nathan Yan, Chen Liang, Jiaxin Shi, Yin Zhang, Jeremiah Liu, Pengcheng Yin (possible past Google (United States) affiliation), Fernando Pereira (possible past Google (United States) affiliation), Ed Chi, Derek Cheng, Alexander M. Rush (possible past University of Oxford affiliation), Ruoxi Wang (possible past Stanford University affiliation)
Abstract

Modern language models have historically relied on two dominant design choices: subword tokenization and autoregressive (AR) ordering. These design decisions bake in priors that dictate a model's learning. Recently, two alternative paradigms have challenged this: byte-level modeling, which bypasses static statistically-derived token vocabularies, and masked diffusion modeling (MDM), which conducts parallel, non-sequential generation. Their intersection represents a fully end-to-end modality-agno...
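Byte-level modeling drops the learned tokenizer entirely: the vocabulary is just the 256 byte values. A two-line illustration of the input representation (the representation only, not the paper's model):

```python
# Byte-level input: UTF-8 bytes are the token ids, so any text (or any file)
# maps into a fixed 256-symbol vocabulary with no tokenizer training.
text = "héllo"
byte_ids = list(text.encode("utf-8"))
assert all(0 <= b < 256 for b in byte_ids)
assert len(byte_ids) == 6   # "é" takes two bytes, so 6 ids for 5 characters
```

The cost is visible even in this toy: byte sequences are longer than subword sequences for the same text, which is one face of the efficiency gap the title refers to.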

*Notable papers are those with at least two authors from a "big" AI/ML lab.