📄 Notable* Recent AI/ML arXiv Papers

📄 FACTR 2: Learning External Force Sensing for Commodity Robot Arms Improves Policy Learning
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12406v1
👥 Authors: Steven Oh, Jason Jingzhou Liu, Tony Tao, Philip Han, Kenneth Shaw, Satoshi Funabashi, Ruslan Salakhutdinov (possible past University of Toronto affiliation), Deepak Pathak (possible past University of California, Berkeley affiliation)
Abstract

Contact-rich manipulation requires force sensitivity, but many robot arms lack dedicated force sensors due to their high cost. We present Neural External Torque Estimation (NEXT), a data-driven method that estimates external joint torques without needing any dedicated force sensors. NEXT trains in 1 minute from only 10 minutes of free-motion data, yet achieves estimates comparable to dedicated joint-torque sensors. NEXT enables force-feedback teleoperation on low-cost arms and improves policy le...

📄 DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12402v1
👥 Authors: Jadelynn Dao, Milan Ganai, Yasmina Abukhadra, Ajay Sridhar, Mozhgan Nasr Azadani, Katie Luo, Clark Barrett, Jiajun Wu (possible past Massachusetts Institute of Technology affiliation), Chelsea Finn (possible past University of California, Berkeley affiliation), Marco Pavone (possible past Stanford University affiliation)
Abstract

Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usage, and FLOPs while yielding uneven, often diminishing gains in downstream success, limiting where embodied agents can be deployed. We argue that choosing when and where to spend test-time compute is central to bringing frontier performance to the real world. ...

📄 Redesign Mixture-of-Experts Routers with Manifold Power Iteration
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12397v1
👥 Authors: Songhao Wu, Ang Lv, Ruobing Xie (possible past Tencent (China) affiliation), Yankai Lin (possible past Tsinghua University affiliation)
Abstract

The router is the cornerstone component of Mixture-of-Experts (MoE) models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row is designed to encode the expert matrix into this representative vector, such that its dot product with a token can better reflect token-expert affinity. However, there exist no design principles to enforce this condensation. In this paper, we propose to ...

📄 TAHOE: Text-to-SQL with Automated Hint Optimization from Experience
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12387v1
👥 Authors: Zhiyi Chen, Jie Song (possible past ETH Zurich affiliation), Peng Li (possible past Tsinghua University affiliation)
Abstract

Large Language Models (LLMs) have democratized database access through Text-to-SQL, but moving from prototypes to production remains difficult. Real deployments must handle strict SQL dialects, massive schemas, and evolving user preferences, while supervised fine-tuning is costly and rigid and agentic test-time scaling is expensive. We present Tahoe, a system that treats prompt optimization as a dynamic data management problem. Tahoe uses an error-driven hint learning pipeline across Development...

📄 CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12352v1
👥 Authors: Ria Doshi, Tian Gao, Annie Chen, Chelsea Finn (possible past University of California, Berkeley affiliation), Jeannette Bohg (possible past Stanford University affiliation)
Abstract

Multi-robot collaboration allows robots to efficiently take on a wide range of tasks, from moving a couch through a doorway to assembling structures on a construction site. However, achieving such coordination in mobile multi-robot settings remains challenging: centralized methods conditioned on the combined observations of a team scale poorly with team size, and decentralized methods that train one policy per robot often require explicit alignment procedures or information sharing at inference ...

📄 DiffCold: A Diffusion-based Generative Model for Cold-Start Item Recommendation
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12245v1
👥 Authors: Kangning Zhang, Yingjie Qin, Weinan Zhang (possible past Shanghai Jiao Tong University affiliation), Yong Yu (possible past Shanghai Jiao Tong University affiliation), Jianghao Lin
Abstract

Cold-start item recommendation remains a persistent challenge in real-world systems due to the absence of interaction histories. While prior models attempt to bridge this gap using item content features, they universally suffer from the seesaw dilemma: enhancing performance for cold items inevitably degrades performance for warm items, and vice versa. We identify that this dilemma stems from a fundamental distributional disparity: warm item embeddings occupy a complex "behavio...

📄 Making Foresight Actionable: Repurposing Representation Alignment in World Action Models
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12217v1
👥 Authors: Lu Qiu, Yizhuo Li, Yi Chen, Yuying Ge, Yixiao Ge (possible past Tencent (China) affiliation), Xihui Liu (possible past University of California, Berkeley affiliation)
Abstract

World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to model future scene evolution before producing control actions. However, our empirical observations reveal a phenomenon: generating plausible visual futures does not always guarantee the extraction of accurate actions. To diagnose this failure, we conduct action-head attention analysis and causal interventions. We find that the action decoder fails to focus on task-relevant interaction re...

📄 The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.11918v1
👥 Authors: Theo Uscidda, Marta Tintore Gazulla, Maks Ovsjanikov, Federico Tombari (possible past Google (United States) affiliation), Leonidas Guibas (possible past Stanford University affiliation)
Abstract

Current Large Reasoning Models (LRMs) exhibit remarkable general capabilities but significantly underperform in spatial reasoning tasks. Existing approaches treat this gap as a knowledge deficit, relying on supervised fine-tuning (SFT) to ingest labeled spatial data from external vision sources or synthetic engines. In contrast, we argue that for many tasks, spatial reasoning capabilities are already present in pre-trained LRMs but require alignment through logical coherence under geometric 2D a...

📄 AutoMine Solution for AV2 2026 Scenario Mining Challenge
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.11874v1
👥 Authors: Songliang Cao, Jiele Zhao, Yuru Wang, Hao Li (possible past Tsinghua University affiliation), Daqi Liu, Zehan Zhang, Fangzhen Li, Yu Wang (possible past Tsinghua University affiliation), Yue Zhang, Bing Wang, Guang Chen, Hao Lu, Hangjun Ye
Abstract

With the development of autonomous driving systems, mining high-value, safety-critical, and planning-relevant scenarios from large-scale driving logs has become essential for data-driven evaluation. In this paper, we propose AutoMine, a robust self-refining scenario mining method based on LLMs and VLMs. AutoMine uses semantics-preserving prompt augmentation to reduce LLM prompt sensitivity, combines robust trajectory atomic functions with VLM-based functions to handle perception noise and open-w...

📄 Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.11830v1
👥 Authors: Qianyu Yao, Fei Sun (possible past Meta (United States) affiliation), Bocheng Huang, Wei Chen, Jiarui Jiang, Shu Quan, Yifei Chen (possible past Baidu (China) affiliation), Wenjie Xu, Bo Li (possible past Tencent (China) affiliation), Liping Su, Ruoqiong Wu, Huhai Hong, Huimei Wang
Abstract

Background. Large language models and AI agents are increasingly used to support biomedical research, but native model outputs may omit key analytical steps, misuse methods, or overstate conclusions. We evaluated whether autonomous access to a medical research skill package was associated with higher-quality AI-generated transcriptomic research-analysis outputs compared with native AI without skills. Methods. We conducted an exploratory multi-model human evaluation using a non-small cell lung ca...

📄 AnchorEdit: Maintaining Temporal Consistency in Multi-turn Image Editing via Causal Memory
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.11751v1
👥 Authors: Hang Xu, Xiaoxiao Ma, Guohui Zhang, Yu Hu, Siming Fu, Jie Huang, Lin Song (possible past Tencent (China) affiliation), Haoyang Huang, Nan Duan, Feng Zhao (possible past Microsoft (United States) affiliation)
Abstract

Multi-turn image editing is essential for iterative design, yet current models often struggle with identity drift and error accumulation over successive steps. While existing research leverages video priors for consistency, their reliance on bidirectional attention is fundamentally misaligned with the causal, sequential nature of interactive editing. In this paper, we propose AnchorEdit, the first autoregressive (AR) diffusion-based framework designed specifically for high-resolution, long-term ...

📄 TreeSeeker: Tree-Structured Trial, Error, and Return in Deep Search
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.11662v1
👥 Authors: Zhuofan Shi, Mingzhe Ma, Lu Wang (possible past University of Washington affiliation), Fangkai Yang, Pu Zhao, Yiming Guan, Youling Huang, Wei Zhang (possible past Tsinghua University affiliation), Qingwei Lin, Dongmei Zhang, Saravan Rajmohan
Abstract

Deep search requires agents to answer complex questions through multi-step web search, browsing, evidence comparison, and synthesis. A central challenge is deciding how to search when several directions look plausible but only some will later lead to reliable evidence. If an agent greedily follows the current best-looking direction, it may keep extending a weak continuation. If it explores without discipline, it may waste budget on disconnected trials. We propose TreeSeeker, an inference-time fr...

📄 TouchThinker: Scaling Tactile Commonsense Reasoning to the Open World with Large-scale Data and Action-aware Representation
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.11637v1
👥 Authors: Kailin Lyu, Di Wu, Pengwei Zhang, Yuhang Zheng, Yingxin Lai, Long Xiao, Kangyi Wu, Pengna Li, Chen Gao, Lianyu Hu, Xiaobin Hu (possible past Tencent (China) affiliation), Jie Hao (possible past Tencent (China) affiliation), Ce Hao, Weihao Yuan, Shuicheng Yan (possible past National University of Singapore affiliation)
Abstract

Touch is a key modality for embodied agents to understand the physical world. Although recent work has incorporated tactile signals into language systems for tactile commonsense reasoning, scaling such systems to realistic open-world settings remains challenging due to two key bottlenecks: (1) current tactile reasoning datasets remain limited in format and scale, providing insufficient supervision for reasoning from tactile observations to physical commonsense and hindering the learning of trans...

📄 Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.11634v1
👥 Authors: Kai Liu (possible past Baidu (China) affiliation), Peijie Dong, Xinchen Xie, Jianfei Gao, Qipeng Guo, Xiaowen Chu, Shaoting Zhang (possible past Baidu (China) affiliation), Kai Chen (possible past Shanghai Jiao Tong University affiliation)
Abstract

The rapid progress of reasoning and agentic large language models (LLMs) has increased the demand for long-context inference, but self-attention (SA) scales quadratically with context length. To address this, we study SWARR (Sliding-Window Attention with Reinforced Adaptation for Math Reasoning), a practical recipe for adapting SWA models to mathematical reasoning. SWARR has two stages: (1) efficient conversion from a pretrained SA model to SWA with supervised fine-tuning (SFT), which avoids p...

📄 Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.11499v1
👥 Authors: Vedant Badoni, Danqi Chen (possible past Stanford University affiliation), Xinyi Wang (possible past Carnegie Mellon University affiliation)
Abstract

The performance of modern language models depends critically on pretraining data composition. Yet existing data selection methods rely on auxiliary classifiers for document scoring or mixture optimization, adding computational overhead and dependence on labeled data. We propose WebGraphMix, a lightweight data selection framework that computes structural centrality scores over the Common Crawl host-level web graph and uses them to vary the proportion of central versus peripheral documents in the ...

📄 TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.11119v1
👥 Authors: Heming Zou, Qi Wang (possible past Tsinghua University affiliation), Yun Qu, Yuhang Jiang, Lizhou Cai, Yixiu Mao, Ru Peng, Xin Xu, Weijie Liu (possible past Tencent (China) affiliation), Kai Yang, Saiyong Yang, Xiangyang Ji (possible past Tsinghua University affiliation)
Abstract

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources ...

📄 Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.11042v2
👥 Authors: Liya Zhu, Jingzhe Ding, Jian Zhang (possible past Tencent (China) affiliation), Jianbo Xue, Shihao Liang, Ge Zhang, Yi Zhu, Duju Zeng, Xiang Gao, Qingshui Gu, Mailun Gao, Huimin Che, Yan Zhao, Peiheng Zhou, Haojun Wang, Chaobo Xian, Lili Le, Chi Wu, Yiwei Liu, Shengda Long, Jiale Yang, Fangzhi Xu, Sijin Wu, Haodong Duan, Chao He, Zhaojian Li, Minchao Wang, Huan Zhou, Jiani Hou, Chuqian Yu, Weiran Shi, Hongwan Gao, Jiamin Chen, Guanhong Chen, Tingqin Luo, Kaiyuan Zhang, Zhixin Yao, Qing Hua, Yuhao Jiang, Jin Chen, Pu Chen, Zhenyu Hu, Xingyu Li, Zhengxuan Jiang, Meng Cao, Tianfeng Long, Haozhe Wang, Mingzhang Wang, Yichen Zhang, Yiming Dai, Chenchen Zhang, Jiaying Wang, Xinying Liu, Xingzu Liu, Lingling Zhang (possible past Google (United States) affiliation), Xinjie Chen, Yujia Qin, Wangchunshu Zhou, Zhiyong Wu (possible past Tsinghua University affiliation), Yang Liu (possible past Tsinghua University affiliation), Jiaheng Liu, Lei Zhang, Shen Yan, Wenhao Huang, Zaiyuan Wang, Xiaolong Chang
Abstract

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user i...

📄 AuRA: Internalizing Audio Understanding into LLMs as LoRA
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.11033v1
👥 Authors: Bo Cheng, Lei Shi (possible past Baidu (China) affiliation), Zhanyu Ma, Yuan Wu, Jun Xu (possible past Google (United States) affiliation), Jiuchong Gao, Jinghua Hao, Renqing He
Abstract

Recent efforts to extend large language models (LLMs) to speech inputs typically rely on cascaded ASR-LLM pipelines, end-to-end speech-language models, or bridge/distillation-based adaptation. While these routes respectively reuse strong pretrained components, enable native speech-language interaction, or offer lightweight adaptation, they often suffer from transcript-interface latency, costly multimodal training, or sequential speech-language coupling. To address these limitations, we present A...

📄 Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12370v1
👥 Authors: Yucheng Li, Huiqiang Jiang, Yang Xu, Jianxin Yang, Yi Zhang (possible past Google (United States) affiliation), Yizhong Cao, Yuhao Shen, Fan Zhou, Rui Men (possible past Peking University affiliation), Jianwei Zhang, An Yang, Bowen Yu, Bo Zheng, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou
Abstract

Reinforcement learning (RL) has become a key component in modern large language models, yet the rollout stage remains the key bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural solution to accelerate rollouts through speculative decoding, many studies have observed that MTP acceptance rates degrade significantly during RL training, leading to limited speedup performance. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-t...

📄 Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12360v1
👥 Authors: Leon Bergen, Usha Bhalla, Sidharth Baskaran, Max Loeffler, Raphael Sarfati, Dhruvil Gala, Ryan Panwar, Santiago Aranguri, Thomas Fel, Atticus Geiger (possible past Stanford University affiliation), Matthew Kowal, Siddharth Boppana, Daniel Balsam, Owen Lewis, Jack Merullo, Thomas Mcgrath (possible past Google (United States) affiliation), Ekdeep Singh Lubana
Abstract

Language-model post-training is the main stage at which model behavior is shaped, yet it still largely involves optimization of scalar rewards that summarize diverse desiderata. This abstraction gives practitioners little visibility into what their data actually teaches models, allowing spurious correlations to be learned by a model and inducing undesirable behaviors such as over-stylization and sycophancy. To address this problem, we ask: can we inspect a preference dataset before optimization ...

📄 Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12344v1
👥 Authors: Mengyu Zheng, Kai Han, Boxun Li (possible past Tsinghua University affiliation), Haiyang Xu, Yuchuan Tian, Wei He (possible past Baidu (China) affiliation), Hang Zhou (possible past Baidu (China) affiliation), Jianyuan Guo, Hailin Hu, Lin Ma (possible past Tencent (China) affiliation), Chao Xu, Guohao Dai, Lixue Xia, Yunchao Wei (possible past National University of Singapore affiliation), Yunhe Wang, Yu Wang (possible past Tsinghua University affiliation)
Abstract

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, worksp...

📄 Re-evaluating Confidence Remasking in Masked Diffusion Language Models
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12232v1
👥 Authors: Stipe Frkovic, Metod Jazbec, Dan Zhang (possible past Google (United States) affiliation), Christian A. Naesseth, Ilija Bogunovic, Eric Nalisnick (possible past Google (United States) affiliation)
Abstract

Masked diffusion language models (dLLMs) have recently emerged as a competitive alternative to autoregressive language models, with the promise of faster inference via parallel token generation. A notable limitation of the masked formulation, however, is that once a token has been unmasked it can no longer be revised, leaving dLLMs vulnerable to early sampling mistakes. To address this, a growing body of work has sought to extend masked dLLMs with self-correcting (remasking) capabilities. One ap...

📄 Exploring the Design Space of Reward Backpropagation for Flow Matching
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.11075v1
👥 Authors: Ruoyu Wang (possible past University of Edinburgh affiliation), Boye Niu, Xiangxin Zhou, Yushi Huang, Tongliang Liu, Chi Zhang (possible past Peking University affiliation)
Abstract

Aligning text-to-image flow matching models with human preferences via direct reward backpropagation is sample-efficient but hampered by two well-known pathologies: activations cannot be stored across the full sampling trajectory at modern model scale, and chained Jacobian products across steps inflate the reward gradient as it travels back to early indices. Connector-based methods, such as LeapAlign, address these issues by replacing the full backward trajectory with a short pinned path, highli...

*Notable papers are those with at least two authors from a "big" AI/ML lab.