📄 Notable* Recent AI/ML arXiv Papers

📄 ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18237v1
👥 Authors: Shanda Li, Qiuhong Anna Wei, Jingwu Tang, Valerie Chen, Nihar B Shah, Tim Dettmers, Yiming Yang (possible past Microsoft (United States) affiliation), Ameet Talwalkar (possible past University of California, Berkeley affiliation)
Abstract

Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction bloc...

📄 EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18235v1
👥 Authors: Qi Chai, Wenhao Shen, Nanjie Yao, Yue Xia, Kaiyong Zhao, Jie Ma (possible past University of Oxford affiliation), Guosheng Lin, Hao Wang (possible past Tsinghua University affiliation)
Abstract

Zero-Shot Object-Goal Navigation (ZS-OGN) requires embodied agents to explore and locate target objects without any prior training. To this end, recent methods leverage foundation models. But they typically rely on static priors and lack adaptation, which leads to repeated errors and costly trial and error. In this paper, we propose a self-evolving ZS-OGN framework that enables continuous test-time improvement. Specifically, we build an agentic rule memory by extracting actionable knowledge from...

📄 Looped World Models
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18208v1
👥 Authors: Hongyuan Adam Lu, Z. L. Victor Wei, Qun Zhang, Jinrui Zeng, Bowen Cao, Lingwei Meng, Mocheng Li, Zezhong Wang, Haonan Yin, Naifu Xue, Minyu Chen, Cenyuan Zhang, Zefan Zhang, Hao Wei, Jiawei Zhou, Haoran Xu, Hao Yang (possible past Tencent (China) affiliation), Ronglai Zuo, Tongda Xu, Yonghao Li, Jian Chen (possible past Baidu (China) affiliation), Hebin Wang, Zeyu Gao, Yang Li (possible past Google (United States) affiliation), Wei Zhao (possible past Tencent (China) affiliation), Qimin Zhong, Siqi Liu (possible past University of Oxford affiliation), Yumeng Zhang, Leyan Cui, Zhangyu Wang, Wai Lam
Abstract

Current world models face a fundamental tension: faithful long-horizon simulation demands deep computation, but deeper models are expensive to deploy and prone to compounding errors. We resolve this by introducing Looped World Models (LoopWM), which are the first looped architectures for world modelling. Our method iteratively refines latent environment states through a parameter-shared transformer block. This yields up to 100x parameter efficiency over conventional approaches with adaptive compu...

📄 RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18203v1
👥 Authors: Weizhi Zhang, Zechen Li, Hamid Palangi, Ben Graef, A. Ali Heydari, Simon A. Lee, Salman Rahman, Ray Luo, Zeinab Esmaeilpour, Erik Schenck, Chloe Zhang, Yamin Li, Menglian Zhou, Philip S. Yu (possible past Tsinghua University affiliation), Daniel Mcduff (possible past Google (United States) affiliation), Lindsey Sunden, Mark Malhotra, Shwetak Patel, Ahmed A. Metwally
Abstract

LLM-empowered personal health agents that draw on user health (sensor) metrics offer a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an...

📄 Knowledge Reutilization in Meta-Reinforcement Learning
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18132v1
👥 Authors: Yuan Meng, Bo Wang (possible past Tencent (China) affiliation), Juan De Los Rios Ruiz, Xiangtong Yao, Zhenshan Bing, Fuchun Sun (possible past Tsinghua University affiliation), Alois Knoll
Abstract

Meta-reinforcement learning enables fast adaptation by extracting shared structure from related tasks, but existing end-to-end methods often couple task inference with embodiment-specific control. This coupling can obscure non-parametric task semantics, reduce sample efficiency, and limit cross-agent reuse. We propose a meta-knowledge reutilization framework that learns task-level knowledge on a dynamics-simplified agent and transfers it to heterogeneous agents. The framework uses a Bayesian non...

📄 Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18062v1
👥 Authors: Hobin Kim, Xiaoyuan Wu, Omer Akgul, Lujo Bauer (possible past Carnegie Mellon University affiliation), Nicolas Christin (possible past Carnegie Mellon University affiliation)
Abstract

Large language models (LLMs) are widely used to fulfill users' information needs; users ask LLMs about the weather, pose educational questions, and consult them for legal assistance. One particularly understudied area is digital security and privacy (S&P), where users may seek LLMs' help on how to secure their online accounts or protect their computers from cyber attacks. To the best of our knowledge, no prior study has collected or analyzed the S&P questions users ask LLMs; prior research on LL...

📄 LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18023v1
👥 Authors: Jian Yang, Shawn Guo, Wei Zhang (possible past Tsinghua University affiliation), Tianyu Zheng, Yaxin Du, Haau-Sing Li, Jiajun Wu (possible past Massachusetts Institute of Technology affiliation), Yue Song, Yan Xing, Qingsong Cai, Zelong Huang, Chuan Hao, Ran Tao, Xianglong Liu, Wayne Xin Zhao (possible past Baidu (China) affiliation), Mingjie Tang, Weifeng Lv, Ming Zhou, Bryan Dai
Abstract

Looped Transformers scale latent computation by repeatedly applying shared blocks, but sequential looping increases latency and KV-cache memory with the loop count. Parallel loop Transformers (PLT) alleviate this cost through cross-loop position offsets (CLP) and shared-KV gated sliding-window attention, making loop count a practical design choice. We therefore study PLT loop-count selection through a gain-cost view: an extra loop may refine representations, but CLP also introduces a positional...

📄 StepGuard: Guarding Web Navigation via Single-Step Calibration
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.17871v1
👥 Authors: Zhihao Cui, Yuchen Zhang (possible past University of California, Berkeley affiliation), Xiyang Sun, Yaxiong Wang (possible past Tencent (China) affiliation), Li Zhu, Jinpeng Hu, Liu Liu, Mengjia Li, Yujiao Wu
Abstract

Web navigation requires agents to follow natural language goals, interact with web pages, and produce accurate answers. While recent advances leverage vision-language models and reinforcement learning, existing methods still suffer from single-step fragility due to reward misalignment and error propagation. To tackle the reward entanglement, we design Dynamic Dual-Policy Optimization (DDPO), which dynamically switches between a navigation-first mode for exploration and an answer-first mode for q...

📄 FlowRAG: Synergizing Explicit Reasoning via Frequency-Aware Multi-Granularity Graph Flow
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.17856v1
👥 Authors: Bihao Zhan, Zongsheng Cao, Jie Zhou (possible past Tsinghua University affiliation), Bo Zhang (possible past Tencent (China) affiliation), Liang He
Abstract

Graph-based retrieval-augmented generation (GraphRAG) is effective for knowledge-intensive and multi-hop query tasks; however, many existing methods primarily seed entity-based graphs and rely on implicit semantic relevance propagation. This often (i) under-retrieves when user queries are abstract and semantically sparse at the entity level, and (ii) suffers from brittle multi-hop reasoning, where noisy activations can derail entity-to-entity transitions and corrupt the inferred relation chain, ...

📄 LongWebBench: Evaluating Structural and Functional Webpage Generation in Long-Horizon Settings
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.17727v1
👥 Authors: Yi Zhao, Zhen Yang (possible past Tsinghua University affiliation), Mengpan Chen, Mingde Xu, Shanghui Gong, Xijun Liu, Jibing Gong, Jie Tang (possible past Tsinghua University affiliation)
Abstract

Recent vision-language models (VLMs) have shown promising progress in generating webpages from visual inputs, yet existing evaluations mainly focus on short, single-screen, and largely static webpages. We introduce LongWebBench, a benchmark for evaluating long-horizon webpage generation from both structural and functional perspectives. LongWebBench contains 490 real-world long webpages for structural fidelity evaluation and 507 goal-oriented interaction tasks over 129 webpages for functional eva...

📄 SuCo: Sufficiency-guided Continuous Adaptive Reasoning
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.17687v1
👥 Authors: Jiahao Wang, Bingyu Liang, Chenhao Hu, Longhui Zhang, Xuebo Liu, Min Zhang (possible past Tsinghua University affiliation), Jing Li (possible past Tencent (China) affiliation), Xuelong Li (possible past Tencent (China) affiliation)
Abstract

Despite remarkable performance on complex tasks, Large Reasoning Models (LRMs) often generate excessively long Chain-of-Thoughts (CoT), inflating computational costs even for simple queries. Existing efforts to mitigate this inefficiency typically rely on discrete reasoning modes or fixed budget tiers, lacking a principled criterion of when reasoning is sufficient. In this work, we introduce Minimal Sufficient CoT (MSC), defined as the shortest prefix of a CoT trajectory which is adequate for pr...

📄 Using Cognitive Models to Improve Language Model Simulation of Human Persuasion Games
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.17657v1
👥 Authors: Zirui Cheng, Zeyu Shen, Thomas L. Griffiths (possible past University of California, Berkeley affiliation), Peter Henderson (possible past Stanford University affiliation)
Abstract

People make decisions differently in strategic interactions. Some update beliefs like a Bayesian; others exhibit biases like motivated reasoning. Although creators of large language models use simulated humans for safety evaluations and training, they often fail to cover this breadth of human behavior. We argue that cognitive science and economics provide a convenient tool for doing so, making use of mathematical models of human decision-making. We propose an approach that we call Equation-to-Be...

📄 From Brewing to Resolution: Tracing the Internal Lifecycle of Code Reasoning in LLMs
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.17648v1
👥 Authors: Siyue Chen, Yifu Guo, Yuquan Lu, Zishan Xu, Jiaye Lin, Jianbo Lin, Siyu Zhang, Cheng Yang (possible past Tsinghua University affiliation), Junxin Li, Yujia Li (possible past University of Toronto affiliation), Yu Huo, Ruixuan Wang
Abstract

Standard accuracy metrics cannot explain why LLMs handle variable tracking but fail on semantically equivalent loops. We study an internal lifecycle of code reasoning in which models first brew the answer, making it linearly recoverable many layers before it becomes self-decodable, and then diverge into one of four resolution outcomes: Resolved, Overprocessed, Misresolved, or Unresolved. Understanding this lifecycle matters because similar task accuracies can mask fundamentally different failure...

📄 Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.17645v1
👥 Authors: Shiqi He, Yue Cui, Feijie Wu, Xinyu Ma, Jiaheng Lu, Yaliang Li (possible past Baidu (China) affiliation), Bolin Ding, Mosharaf Chowdhury (possible past University of California, Berkeley affiliation)
Abstract

Large language model (LLM) web agents are usually deployed as tool callers: each turn, the model reads a fresh page observation and emits one structured tool action. When every action is a low-level primitive, horizons grow quickly and so do policy-facing LLM completions, dominating latency and cost on benchmarks such as Mind2Web and WebArena. Recent systems therefore wrap repeated interaction fragments as web skills: callable tools built from successful trajectories or induced programs, so one ...

📄 Reinforcing Dual-Path Reasoning in Spatial Vision Language Models
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.17539v1
👥 Authors: Yatai Ji, An-Chieh Cheng, Yang Fu, Yukang Chen, Han Zhang (possible past Tsinghua University affiliation), Zhaojing Yang, Wei Huang (possible past Google (United States) affiliation), Ka Chun Cheung, Song Han (possible past Stanford University affiliation), Vidya Nariyambut Murali, Pavlo Molchanov (possible past Nvidia (United States) affiliation), Jan Kautz (possible past Nvidia (United States) affiliation), Simon See (possible past Nvidia (United States) affiliation), Hongxu Yin, Ping Luo (possible past Shanghai Artificial Intelligence Laboratory affiliation), Sifei Liu (possible past Nvidia (United States) affiliation)
Abstract

Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial...

📄 MagicSim: A Unified Infrastructure for Executable Embodied Interaction
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.17511v1
👥 Authors: Haoran Lu, Songling Liu, Yue Chen (possible past Google (United States) affiliation), Guo Ye, Mutian Shen, Shuyang Yu, Yu Xiao, Jihai Zhao, Shang Wu, Jianshu Zhang, Xiangtian Gui, Chuye Hong, Yuran Wang, Maojiang Su, Jiayi Wang, Ruihai Wu, Zhaoran Wang, Han Liu (possible past Tsinghua University affiliation)
Abstract

Robot learning and embodied agents now require simulation to serve as a shared execution substrate linking control, skills, and planning, not only as a renderer, controller testbed, or fixed task environment. Existing pipelines split these layers with "magic" actions, disconnected training environments, or forward-only renders that cannot reproduce, evaluate, and annotate the same episode. We present MagicSim, an embodied interaction infrastructure built around one deterministic batched runtime ...

📄 AUTOGATE: Automated Clock Gating via Toggling-Aware LLM-based RTL Rewriting
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.17461v1
👥 Authors: Yiting Wang, Chenhui Deng, Chia-Tung Ho, Yanqing Zhang (possible past Nvidia (United States) affiliation), Zhuo Feng, Cunxi Yu, Ang Li (possible past Google (United States) affiliation), Gang Qu, Brucek Khailany (possible past Nvidia (United States) affiliation)
Abstract

Fine-grain clock gating (FGCG) is among the most effective techniques for reducing dynamic power, yet current FGCG optimization flows remain largely manual. Recent LLM-based RTL optimization approaches remain limited by two key drawbacks: (1) the inability to process long waveform traces spanning millions of cycles, and (2) the difficulty of scaling optimization to large hierarchical codebases while preserving correctness. In this work, we present AUTOGATE, the first agentic framework for indust...

📄 Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.17437v1
👥 Authors: Bo Gou, Jicheng Zhang, Jianlong Xiong, Tao He, Bentian Liu, Hai Wu, Yijiao Wang, Yu Zhang (possible past Google (United States) affiliation), Yujia Yang, Yun Dai, Jian Liu, Jie Wang (possible past Tsinghua University affiliation)
Abstract

Automated classification of standard echocardiographic views is crucial for efficient clinical workflow but faces three main challenges. First, publicly available datasets are scarce and limited in scale and view coverage. Second, the performance of some modern video-level architectures for echocardiographic view classification remains underexplored. Third, some view categories exhibit highly similar spatial appearances, making single-frame features insufficient for discrimination, while heterog...

📄 DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.17362v1
👥 Authors: Xinglong Sun, Kevin Xie, Jenny Schmalfuss, Despoina Paschalidou, Xiuming Zhang (possible past University of California, Berkeley affiliation), Sanja Fidler (possible past University of Toronto affiliation), Kashyap Chitta, Jose M. Alvarez
Abstract

Autonomous driving has shifted towards end-to-end policy learning, where reliable, interpretable policy evaluation is a fundamental challenge as driving quality is highly context-dependent. Commonly used rule-based driving metrics like EPDMS are interpretable but lack context-awareness, while recent VLM-based evaluations are context-aware but limited by ambiguous VLM outputs and weak physical grounding. To evaluate driving in a manner that is both interpretable and context-aware, we introduce Dri...

📄 Nothing from Something: Can a Language Model Discover 0?
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.17289v1
👥 Authors: Phoebe Zeng, Thomas L. Griffiths (possible past University of California, Berkeley affiliation), Brenden M. Lake (possible past Meta (United States) affiliation)
Abstract

AI systems based on artificial neural networks are being developed with aspirations of pushing the boundary of human mathematical knowledge. A key question for these systems is how much they can reach beyond their training data. Mathematical discovery requires a strong form of out-of-distribution generalization: the ability to hypothesize genuinely new, and potentially logically more powerful, mathematical structures. It has been hypothesized that language abilities support such generalization...

📄 OmniPlan: An Adaptive Framework for Timely and Near-Optimal Network Planning Optimization
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18105v1
👥 Authors: Longlong Zhu, Jiashuo Yu, Zedi Chen, Yuhan Wu, Zhifan Jiang, Yuchen Xian, Yimeng Liu, Jiajie Su, Shaopeng Zhou, Xingyuan Li, Hongyan Liu (possible past Peking University affiliation), Xuan Liu (possible past Baidu (China) affiliation), Dong Zhang (possible past Nvidia (United States) affiliation), Chunming Wu, Xiang Chen (possible past Tencent (China) affiliation)
Abstract

Network planning optimization is a fundamental problem across diverse domains, including transportation systems, communication networks, and power grids. It requires simultaneous optimization of multiple competing objectives under complex constraints. Existing network planning optimization frameworks rely on mixed integer programming (MIP) solvers, heuristics, and deep reinforcement learning (DRL) models to compute planning decisions. However, they lack effective adaptability to diverse and dyna...

📄 From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18089v1
👥 Authors: Lingjing Kong, Xin Liu, Guangyi Chen, Martin Q. Ma, Xiangchen Song, Yuekai Sun, Mikhail Yurochkin, Taylor W. Killian, Ruslan Salakhutdinov (possible past University of Toronto affiliation), Kun Zhang (possible past Google (United States) affiliation), Eric P. Xing, Zhengzhong Liu (possible past Tencent (China) affiliation)
Abstract

Post-training pipelines that combine supervised fine-tuning (SFT) with reinforcement learning (RL) have emerged as the key recipe for transforming large language models (LLMs) into robust reasoners. We argue that this combined success is driven by compositional generalization, which we formalize through a hierarchical latent selection model. In this framework, reasoning traces are generated by a cascade of discrete latent selection variables corresponding to reusable atomic modules, including bo...

📄 EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.17680v1
👥 Authors: Zhitong Wang, Songze Li, Hao Peng (possible past Tsinghua University affiliation), Shuzheng Si, Yi Wang, Maosong Sun (possible past Tsinghua University affiliation), Juanzi Li
Abstract

Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuitively, this overlooks the rich environment dynamics information contained in rollout interaction trajectories. We argue that the interaction experience inherently serves as an implicit supervision signal, reveals the underlying transition mechanisms of the environment, ...

📄 Domain-Validity-Gated Metamorphic Testing of Scientific ML Surrogates
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.17529v1
👥 Authors: Meng Li (possible past Meta (United States) affiliation), Xiaohua Yang, Jie Liu (possible past Tencent (China) affiliation), Shiyu Yan
Abstract

Scientific machine-learning (SciML) surrogates approximate expensive simulations, but exact expected outputs for arbitrary inputs are unavailable (the oracle problem). Metamorphic testing checks relations across executions, yet a candidate relation is not automatically valid: its preconditions, output mapping, and the numerical floor of the scoring operator determine whether a violation is meaningful. We study how candidate metamorphic relations (MRs) can be screened for domain validity and turn...
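
As a rough illustration of the gating idea in this abstract, the sketch below checks one metamorphic relation only where its precondition holds and counts a mismatch as a violation only when it exceeds a numerical floor. Everything here is an assumption made for illustration and not taken from the paper: the toy surrogates, the "doubling the input quadruples the output" relation, the trusted input range, and the names `check_relation`, `good_surrogate`, and `biased_surrogate`.

```python
def check_relation(surrogate, x, transform_input, map_output, precondition, floor):
    """Evaluate one metamorphic relation at x; return 'skipped', 'pass', or 'violation'."""
    x_followup = transform_input(x)
    # Gate on domain validity: the relation is only meaningful inside the trusted range.
    if not (precondition(x) and precondition(x_followup)):
        return "skipped"
    expected = map_output(surrogate(x))   # output predicted by the relation
    actual = surrogate(x_followup)        # output of the follow-up execution
    # Differences below the numerical floor are treated as noise, not violations.
    return "violation" if abs(actual - expected) > floor else "pass"


# Hypothetical surrogates for f(v) = 0.5 * v**2 (illustrative only).
def good_surrogate(v):
    return 0.5 * v * v + 1e-9      # tiny approximation error


def biased_surrogate(v):
    return 0.5 * v * v + 0.1 * v   # spurious linear term breaks the relation


def double(v):
    return 2.0 * v


def quadruple(y):
    return 4.0 * y


def in_range(v):
    return 0.0 <= v <= 10.0        # assumed trusted input range


ok = check_relation(good_surrogate, 2.0, double, quadruple, in_range, floor=1e-6)
bad = check_relation(biased_surrogate, 2.0, double, quadruple, in_range, floor=1e-6)
gated = check_relation(good_surrogate, 8.0, double, quadruple, in_range, floor=1e-6)
```

With these toy definitions, the well-behaved surrogate passes, the biased one is flagged, and the input whose follow-up leaves the assumed range is skipped rather than scored.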

📄 ResAware: Cross-Environment Website Fingerprinting via Resource-Privileged Distillation
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.17462v1
👥 Authors: Chongru Fan, Wei Wang (possible past University of Oxford affiliation), Wentao Huang, Zhenquan Ding, Jinqiao Shi, Lei Cui (possible past Tsinghua University affiliation), Zhiyu Hao, Xiaochun Yun
Abstract

While Website Fingerprinting (WF) attacks achieve high accuracy in controlled laboratory settings, they often degrade substantially in real-world environments due to spatio-temporal drift, browser heterogeneity, proxy obfuscation, etc. This limitation stems from their sole reliance on low-level traffic features that are noisy and highly sensitive to environmental perturbations. To address this problem, we propose ResAware, a cross-environment resource-aware distillation framework und...

📄 ProCUA-SFT Technical Report
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.17321v1
👥 Authors: Jaehun Jung, Ximing Lu, Brandon Cui, Muhammad Khalifa, Shaokun Zhang, Hao Zhang (possible past Tencent (China) affiliation), Jin Xu (possible past Tencent (China) affiliation), Amala Sanjay Deshmukh, Karan Sapra (possible past Nvidia (United States) affiliation), Andrew Tao (possible past Nvidia (United States) affiliation), Yejin Choi (possible past Allen Institute for Artificial Intelligence affiliation), Jan Kautz (possible past Nvidia (United States) affiliation), Mingjie Liu, Yi Dong
Abstract

Training computer-use agents (CUAs), models that interact with graphical desktops through screenshots and keyboard/mouse actions, requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), leads to negative transfer when used for supervised fine-tuning (SFT): continuing to train UI-TARS 7B on AgentNet causes the OSWorld success rate to fall from 26.3% to 8-10%. We present ProCUA-SFT, a dataset of 3.1M st...

📄 KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.17034v1
👥 Authors: Mufei Li, Shikun Liu, Dongqi Fu, Haoyu Wang (possible past Tencent (China) affiliation), Yinglong Xia, Hong Li, Hong Yan, Pan Li (possible past Baidu (China) affiliation)
Abstract

Post-hoc context erasing over the KV cache is challenging because a local edit has a global consequence: once a span has been processed, its influence propagates into the cached states of all subsequent tokens. This issue arises naturally in long-context LLM applications, where stale retrieved facts, incorrect tool observations, retracted user preferences, or harmful prompt injections may be identified only after prefill. Exact erasing must then recompute all tokens after the deleted span, makin...

📄 ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.17011v1
👥 Authors: Wei Xiao, Weiliang Tang, Yuying Ge, Hui Zhou, Yao Mu, Li Zhang (possible past University of Oxford affiliation), Yixiao Ge (possible past Tencent (China) affiliation)
Abstract

Human interventions provide crucial corrective signals for post-training Vision-Language-Action (VLA) models. However, enabling seamless humanoid interventions is a formidable systems challenge due to complex whole-body kinematics and dexterous-hand control. Consequently, the collected intervention trajectories are often suboptimal, and methods that rely on human interventions as expert supervision can absorb hesitant, inefficient, or even erroneous behaviors. To address both the system and algo...

📄 Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16952v1
👥 Authors: Kareem Amin, Rudrajit Das, Alessandro Epasto (possible past Google (United States) affiliation), Adel Javanmard, Dennis Kraft, Mónica Ribero, Sergei Vassilvitskii (possible past Google (United States) affiliation)
Abstract

The rapid adoption of generative AI and Large Language Models (LLMs) has spurred interest in synthetic data as a privacy-preserving alternative to sensitive real-world datasets. However, generating high-utility synthetic data often carries the risk of memorizing and regurgitating private information from the training corpus. In this work, we present a customizable empirical auditing framework designed to detect and explain such data disclosures. Our framework introduces a mechanism to distinguis...

📄 Factorized Neural Operators Decompose Dynamic and Persistent Responses
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16900v1
👥 Authors: Hao Tang, Yuechen Duan, Jiongyu Zhu, Zimeng Feng, Hao Li (possible past Tsinghua University affiliation), Chao Li (possible past Baidu (China) affiliation)
Abstract

Physical systems often exhibit heterogeneous mechanisms, where rapidly evolving dynamics coexist with persistent structures. Capturing such multiscale physical behavior remains challenging for existing neural operators, which typically rely on a single dominant inductive bias and therefore couple distinct physical responses into a shared representation. We introduce the Unified Green's Function Framework across domains and propose the Factorized Neural Operators (FaNO), which decompose spectral re...

📄 Fantastic Pretraining Optimizers and Where to Find Them II: Hyperball Optimization
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16899v1
👥 Authors: Kaiyue Wen, Xingyu Dang, Kaifeng Lyu, Tengyu Ma (possible past Stanford University affiliation), Percy Liang (possible past Stanford University affiliation)
Abstract

Matrix-based optimizers such as Muon can substantially speed up language model pretraining, but their gains over AdamW are observed to shrink as model size and data scale grow when using standard constant decoupled weight decay. We propose Hyperball, a simple optimizer wrapper that addresses this issue. Given a base optimizer such as Adam or Muon, Hyperball sets the Frobenius norms of weight matrices and their corresponding optimizer updates to fixed constants. On Qwen3-style models up to 1.2B p...
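
The fixed-norm idea in this abstract can be sketched roughly as follows. This is only a guess at the general shape of such a wrapper and not the authors' algorithm: the order of operations, the target norms, and the names `rescale` and `fixed_norm_step` are assumptions, and `base_update` stands in for whatever a base optimizer such as Adam or Muon would propose.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def rescale(matrix, target_norm, eps=1e-12):
    """Return a copy of `matrix` scaled so its Frobenius norm equals `target_norm`."""
    scale = target_norm / max(frobenius_norm(matrix), eps)
    return [[x * scale for x in row] for row in matrix]


def fixed_norm_step(weight, base_update, weight_norm, update_norm):
    """One wrapped step (assumed order): fix the update norm, apply it, fix the weight norm."""
    update = rescale(base_update, update_norm)
    stepped = [[w - u for w, u in zip(w_row, u_row)]
               for w_row, u_row in zip(weight, update)]
    return rescale(stepped, weight_norm)


# Illustrative values only: a 2x2 weight held at norm 2.0, updates held at norm 0.1.
weight = rescale([[1.0, 2.0], [3.0, 4.0]], 2.0)
raw_update = [[0.5, -0.5], [0.25, 0.0]]
new_weight = fixed_norm_step(weight, raw_update, weight_norm=2.0, update_norm=0.1)
```

In this sketch the weight matrix keeps the same Frobenius norm after every step, so the base optimizer's proposal only changes the direction of the weights.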

📄 MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16748v1
👥 Authors: Lawrence Keunho Jang, Andrew Keunwoo Jang, Jing Yu Koh (possible past Google (United States) affiliation), Ruslan Salakhutdinov (possible past University of Toronto affiliation)
Abstract

Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment where personal assistants are expected to work across a user's whole digital life, including their context, historical data, and logged-in accounts. This gap is widest on web tasks, where live web evaluations cannot exercise sites that require logging in or personal information, the kind of site a real personal assistant has to drive. We introduce MyPCBench, ...

*Notable papers are those with at least two authors from a "big" AI/ML lab.