📄 Notable* Recent AI/ML arXiv Papers

📄 The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.11918v1
👥 Authors: Theo Uscidda, Marta Tintore Gazulla, Maks Ovsjanikov, Federico Tombari (possible past Google (United States) affiliation), Leonidas Guibas (possible past Stanford University affiliation)
Abstract

Current Large Reasoning Models (LRMs) exhibit remarkable general capabilities but significantly underperform in spatial reasoning tasks. Existing approaches treat this gap as a knowledge deficit, relying on supervised fine-tuning (SFT) to ingest labeled spatial data from external vision sources or synthetic engines. In contrast, we argue that for many tasks, spatial reasoning capabilities are already present in pre-trained LRMs but require alignment through logical coherence under geometric 2D a...

📄 AutoMine Solution for AV2 2026 Scenario Mining Challenge
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.11874v1
👥 Authors: Songliang Cao, Jiele Zhao, Yuru Wang, Hao Li (possible past Tsinghua University affiliation), Daqi Liu, Zehan Zhang, Fangzhen Li, Yu Wang (possible past Tsinghua University affiliation), Yue Zhang, Bing Wang, Guang Chen, Hao Lu, Hangjun Ye
Abstract

With the development of autonomous driving systems, mining high-value, safety-critical, and planning-relevant scenarios from large-scale driving logs has become essential for data-driven evaluation. In this paper, we propose AutoMine, a robust self-refining scenario mining method based on LLMs and VLMs. AutoMine uses semantics-preserving prompt augmentation to reduce LLM prompt sensitivity, combines robust trajectory atomic functions with VLM-based functions to handle perception noise and open-w...

📄 Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.11830v1
👥 Authors: Qianyu Yao, Fei Sun (possible past Meta (United States) affiliation), Bocheng Huang, Wei Chen, Jiarui Jiang, Shu Quan, Yifei Chen (possible past Baidu (China) affiliation), Wenjie Xu, Bo Li (possible past Tencent (China) affiliation), Liping Su, Ruoqiong Wu, Huhai Hong, Huimei Wang
Abstract

Background. Large language models and AI agents are increasingly used to support biomedical research, but native model outputs may omit key analytical steps, misuse methods, or overstate conclusions. We evaluated whether autonomous access to a medical research skill package was associated with higher-quality AI-generated transcriptomic research-analysis outputs compared with native AI without skills. Methods. We conducted an exploratory multi-model human evaluation using a non-small cell lung ca...

📄 AnchorEdit: Maintaining Temporal Consistency in Multi-turn Image Editing via Causal Memory
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.11751v1
👥 Authors: Hang Xu, Xiaoxiao Ma, Guohui Zhang, Yu Hu, Siming Fu, Jie Huang, Lin Song (possible past Tencent (China) affiliation), Haoyang Huang, Nan Duan, Feng Zhao (possible past Microsoft (United States) affiliation)
Abstract

Multi-turn image editing is essential for iterative design, yet current models often struggle with identity drift and error accumulation over successive steps. While existing research leverages video priors for consistency, their reliance on bidirectional attention is fundamentally misaligned with the causal, sequential nature of interactive editing. In this paper, we propose AnchorEdit, the first autoregressive (AR) diffusion-based framework designed specifically for high-resolution, long-term ...

📄 TreeSeeker: Tree-Structured Trial, Error, and Return in Deep Search
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.11662v1
👥 Authors: Zhuofan Shi, Mingzhe Ma, Lu Wang (possible past University of Washington affiliation), Fangkai Yang, Pu Zhao, Yiming Guan, Youling Huang, Wei Zhang (possible past Tsinghua University affiliation), Qingwei Lin, Dongmei Zhang, Saravan Rajmohan
Abstract

Deep search requires agents to answer complex questions through multi-step web search, browsing, evidence comparison, and synthesis. A central challenge is deciding how to search when several directions look plausible but only some will later lead to reliable evidence. If an agent greedily follows the current best-looking direction, it may keep extending a weak continuation. If it explores without discipline, it may waste budget on disconnected trials. We propose TreeSeeker, an inference-time fr...

📄 TouchThinker: Scaling Tactile Commonsense Reasoning to the Open World with Large-scale Data and Action-aware Representation
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.11637v1
👥 Authors: Kailin Lyu, Di Wu, Pengwei Zhang, Yuhang Zheng, Yingxin Lai, Long Xiao, Kangyi Wu, Pengna Li, Chen Gao, Lianyu Hu, Xiaobin Hu (possible past Tencent (China) affiliation), Jie Hao (possible past Tencent (China) affiliation), Ce Hao, Weihao Yuan, Shuicheng Yan (possible past National University of Singapore affiliation)
Abstract

Touch is a key modality for embodied agents to understand the physical world. Although recent work has incorporated tactile signals into language systems for tactile commonsense reasoning, scaling such systems to realistic open-world settings remains challenging due to two key bottlenecks: (1) current tactile reasoning datasets remain limited in format and scale, providing insufficient supervision for reasoning from tactile observations to physical commonsense and hindering the learning of trans...

📄 Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.11634v1
👥 Authors: Kai Liu (possible past Baidu (China) affiliation), Peijie Dong, Xinchen Xie, Jianfei Gao, Qipeng Guo, Xiaowen Chu, Shaoting Zhang (possible past Baidu (China) affiliation), Kai Chen (possible past Shanghai Jiao Tong University affiliation)
Abstract

The rapid progress of reasoning and agentic large language models (LLMs) has increased the demand for long-context inference, but self-attention (SA) scales quadratically with context length. To address this, we study SWARR (Sliding-Window Attention with Reinforced Adaptation for Math Reasoning), a practical recipe for adapting SWA models to mathematical reasoning. SWARR has two stages: (1) efficient conversion from a pretrained SA model to SWA with supervised fine-tuning (SFT), which avoids p...
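
To make the scaling gap mentioned in this abstract concrete, here is a minimal sketch that counts how many query-key pairs are scored under full causal self-attention versus sliding-window attention. It is not the SWARR recipe itself. The function name `attended_pairs`, the sequence length, and the window size are assumptions chosen for illustration.

```python
def attended_pairs(seq_len, window=None):
    """Count query-key pairs scored under causal attention.

    window=None means full causal self-attention. Otherwise each token
    attends only to itself and the previous (window - 1) tokens.
    """
    total = 0
    for i in range(seq_len):
        # Token i can see i + 1 positions, capped by the window if one is set.
        visible = i + 1 if window is None else min(i + 1, window)
        total += visible
    return total


# Illustrative sizes only; the paper's actual settings are not given here.
full = attended_pairs(4096)
windowed = attended_pairs(4096, window=512)
print(full, windowed, round(full / windowed, 2))
```

The full count grows roughly with the square of the sequence length, while the windowed count grows linearly once the sequence is longer than the window.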

📄 Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.11499v1
👥 Authors: Vedant Badoni, Danqi Chen (possible past Stanford University affiliation), Xinyi Wang (possible past Carnegie Mellon University affiliation)
Abstract

The performance of modern language models depends critically on pretraining data composition. Yet existing data selection methods rely on auxiliary classifiers for document scoring or mixture optimization, adding computational overhead and dependence on labeled data. We propose WebGraphMix, a lightweight data selection framework that computes structural centrality scores over the Common Crawl host-level web graph and uses them to vary the proportion of central versus peripheral documents in the ...
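
The idea of scoring hosts by graph centrality and then varying the share of central versus peripheral documents can be sketched roughly as follows. This is an illustration under stated assumptions, not WebGraphMix itself: in-degree stands in for whatever centrality measure the paper uses, and the helper names (`in_degree_centrality`, `split_hosts`, `sample_mixture`), the 50% cutoff, and the toy graph are all hypothetical.

```python
import random
from collections import Counter


def in_degree_centrality(edges):
    """Count inbound links per host (an assumed stand-in centrality score)."""
    scores = Counter()
    for src, dst in edges:
        scores[src] += 0  # register hosts that only link out
        scores[dst] += 1
    return dict(scores)


def split_hosts(scores, top_fraction=0.5):
    """Split hosts into (central, peripheral) by rank; ties break by name."""
    ranked = sorted(scores, key=lambda h: (-scores[h], h))
    cut = max(1, int(len(ranked) * top_fraction))
    return set(ranked[:cut]), set(ranked[cut:])


def sample_mixture(docs, central_hosts, central_share, k, seed=0):
    """Draw k documents (with replacement), central_share of them from central hosts."""
    rng = random.Random(seed)
    central = [d for d in docs if d[0] in central_hosts]
    peripheral = [d for d in docs if d[0] not in central_hosts]
    n_central = round(k * central_share)
    return rng.choices(central, k=n_central) + rng.choices(peripheral, k=k - n_central)


# Toy host-level graph: three hosts link to "hub", and "hub" links back to "a".
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")]
docs = [("hub", "h1"), ("a", "a1"), ("b", "b1"), ("c", "c1")]

scores = in_degree_centrality(edges)
central, peripheral = split_hosts(scores)
sample = sample_mixture(docs, central, central_share=0.75, k=8)
print(sorted(central), sorted(peripheral), len(sample))
```

Changing `central_share` shifts the mixture toward hub-like or fringe hosts without training any auxiliary classifier, which is the kind of knob the abstract describes.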

📄 TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.11119v1
👥 Authors: Heming Zou, Qi Wang (possible past Tsinghua University affiliation), Yun Qu, Yuhang Jiang, Lizhou Cai, Yixiu Mao, Ru Peng, Xin Xu, Weijie Liu (possible past Tencent (China) affiliation), Kai Yang, Saiyong Yang, Xiangyang Ji (possible past Tsinghua University affiliation)
Abstract

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources ...
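
The "insufficient reward contrast" problem can be illustrated with a small sketch. When every rollout for a prompt earns the same binary reward, the reward variance is zero and mean-centered advantages vanish, so those rollouts give the policy nothing to learn from. The mean-centered baseline and the name `group_advantages` are assumptions for illustration; TRACE's actual allocation rule is not shown in the truncated abstract.

```python
from statistics import pvariance


def group_advantages(rewards):
    """Mean-centered rewards for one prompt's group of rollouts (assumed baseline)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Hypothetical binary outcome rewards for three prompts, four rollouts each.
too_easy = [1, 1, 1, 1]   # every rollout succeeds
too_hard = [0, 0, 0, 0]   # every rollout fails
mixed = [1, 0, 1, 0]      # half succeed

for name, rewards in [("too_easy", too_easy), ("too_hard", too_hard), ("mixed", mixed)]:
    print(name, pvariance(rewards), group_advantages(rewards))
```

Only the mixed prompt yields non-zero advantages here, which is why spending rollouts on prompts of intermediate difficulty tends to be more informative.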

📄 Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.11042v1
👥 Authors: Liya Zhu, Jingzhe Ding, Jian Zhang (possible past Tencent (China) affiliation), Jianbo Xue, Shihao Liang, Ge Zhang, Xiang Gao, Qingshui Gu, Mailun Gao, Huimin Che, Yan Zhao, Peiheng Zhou, Haojun Wang, Chaobo Xian, Lili Le, Chi Wu, Yiwei Liu, Shengda Long, Jiale Yang, Fangzhi Xu, Sijin Wu, Haodong Duan, Yi Zhu, Chao He, Zhaojian Li, Minchao Wang, Huan Zhou, Jiani Hou, Chuqian Yu, Weiran Shi, Hongwan Gao, Jiamin Chen, Guanhong Chen, Tingqin Luo, Kaiyuan Zhang, Zhixin Yao, Qing Hua, Yuhao Jiang, Jin Chen, Pu Chen, Zhenyu Hu, Xingyu Li, Zhengxuan Jiang, Meng Cao, Tianfeng Long, Haozhe Wang, Mingzhang Wang, Yichen Zhang, Yiming Dai, Chenchen Zhang, Jiaying Wang, Zhiyong Wu (possible past Tsinghua University affiliation), Shen Yan, Yujia Qin, Wenhao Huang, Zaiyuan Wang, Xiaolong Chang
Abstract

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user i...

📄 AuRA: Internalizing Audio Understanding into LLMs as LoRA
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.11033v1
👥 Authors: Bo Cheng, Lei Shi (possible past Baidu (China) affiliation), Zhanyu Ma, Yuan Wu, Jun Xu (possible past Google (United States) affiliation), Jiuchong Gao, Jinghua Hao, Renqing He
Abstract

Recent efforts to extend large language models (LLMs) to speech inputs typically rely on cascaded ASR-LLM pipelines, end-to-end speech-language models, or bridge/distillation-based adaptation. While these routes respectively reuse strong pretrained components, enable native speech-language interaction, or offer lightweight adaptation, they often suffer from transcript-interface latency, costly multimodal training, or sequential speech-language coupling. To address these limitations, we present A...

📄 Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.10956v1
👥 Authors: Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia, Yuzhong Zhao, Yupan Huang, Wenshan Wu, Xiangyang Zhou (possible past Baidu (China) affiliation), Shaohan Huang, Nan Yang, Li Dong, Lei Cui (possible past Tsinghua University affiliation), Furu Wei
Abstract

The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document-automation capability, as it requires long-horizon planning and reasoning, precise parameter configuration, and multi-application integration. To quantify this capability, we introduce an evaluation based on China's National...

📄 Human-AI Teaming Through the Lens of Calibration
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.10906v1
👥 Authors: Eric Nalisnick (possible past Google (United States) affiliation), Chi Zhang (possible past Peking University affiliation), Sophia Qian, Yixin Wang
Abstract

We study models for human-AI teaming through the lens of statistical calibration. We assume the team consists of an AI model and human -- both of which are calibrated with respect to some partitioning of the feature space -- and expose how the calibration assumptions propagate into the teaming framework. In particular, we consider frameworks that either (i) combine human and model predictions or (ii) delegate prediction responsibility to either a human or model. We show via theoretical and empir...

📄 Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.10819v1
👥 Authors: Miaoxin Cai, Guanqun Wang, Wei Zhang (possible past Tsinghua University affiliation), Guangyao Zhou, Yin Zhuang, Tong Zhang (possible past Tencent (China) affiliation), Hao Wang (possible past Tsinghua University affiliation), He Chen, Jun Li
Abstract

RS-MLLMs enable natural-language understanding and spatial reasoning over earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B RS-MLLM that unifies six sensor modalities (i.e., optical, SAR, infrared, multispectral, temporal, and video) and cross-sensor fusion across 9 task categories within a s...

📄 Exploring the Design Space of Reward Backpropagation for Flow Matching
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.11075v1
👥 Authors: Ruoyu Wang (possible past University of Edinburgh affiliation), Boye Niu, Xiangxin Zhou, Yushi Huang, Tongliang Liu, Chi Zhang (possible past Peking University affiliation)
Abstract

Aligning text-to-image flow matching models with human preferences via direct reward backpropagation is sample-efficient but hampered by two well-known pathologies: activations cannot be stored across the full sampling trajectory at modern model scale, and chained Jacobian products across steps inflate the reward gradient as it travels back to early indices. Connector-based methods, such as LeapAlign, address these issues by replacing the full backward trajectory with a short pinned path, highli...

*Notable papers are those with at least two authors from a "big" AI/ML lab.