📄 Notable* Recent AI/ML arXiv Papers

📄 TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.11119v1
👥 Authors: Heming Zou, Qi Wang (possible past Tsinghua University affiliation), Yun Qu, Yuhang Jiang, Lizhou Cai, Yixiu Mao, Ru Peng, Xin Xu, Weijie Liu (possible past Tencent (China) affiliation), Kai Yang, Saiyong Yang, Xiangyang Ji (possible past Tsinghua University affiliation)
Abstract

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources ...

📄 Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.11042v1
👥 Authors: Liya Zhu, Jingzhe Ding, Jian Zhang (possible past Tencent (China) affiliation), Jianbo Xue, Shihao Liang, Ge Zhang, Xiang Gao, Qingshui Gu, Mailun Gao, Huimin Che, Yan Zhao, Peiheng Zhou, Haojun Wang, Chaobo Xian, Lili Le, Chi Wu, Yiwei Liu, Shengda Long, Jiale Yang, Fangzhi Xu, Sijin Wu, Haodong Duan, Yi Zhu, Chao He, Zhaojian Li, Minchao Wang, Huan Zhou, Jiani Hou, Chuqian Yu, Weiran Shi, Hongwan Gao, Jiamin Chen, Guanhong Chen, Tingqin Luo, Kaiyuan Zhang, Zhixin Yao, Qing Hua, Yuhao Jiang, Jin Chen, Pu Chen, Zhenyu Hu, Xingyu Li, Zhengxuan Jiang, Meng Cao, Tianfeng Long, Haozhe Wang, Mingzhang Wang, Yichen Zhang, Yiming Dai, Chenchen Zhang, Jiaying Wang, Zhiyong Wu (possible past Tsinghua University affiliation), Shen Yan, Yujia Qin, Wenhao Huang, Zaiyuan Wang, Xiaolong Chang
Abstract

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user i...

📄 AuRA: Internalizing Audio Understanding into LLMs as LoRA
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.11033v1
👥 Authors: Bo Cheng, Lei Shi (possible past Baidu (China) affiliation), Zhanyu Ma, Yuan Wu, Jun Xu (possible past Google (United States) affiliation), Jiuchong Gao, Jinghua Hao, Renqing He
Abstract

Recent efforts to extend large language models (LLMs) to speech inputs typically rely on cascaded ASR-LLM pipelines, end-to-end speech-language models, or bridge/distillation-based adaptation. While these routes respectively reuse strong pretrained components, enable native speech-language interaction, or offer lightweight adaptation, they often suffer from transcript-interface latency, costly multimodal training, or sequential speech-language coupling. To address these limitations, we present A...

📄 Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.10956v1
👥 Authors: Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia, Yuzhong Zhao, Yupan Huang, Wenshan Wu, Xiangyang Zhou (possible past Baidu (China) affiliation), Shaohan Huang, Nan Yang, Li Dong, Lei Cui (possible past Tsinghua University affiliation), Furu Wei
Abstract

The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document-automation capability, as it requires long-horizon planning and reasoning, precise parameter configuration, and multi-application integration. To quantify this capability, we introduce an evaluation based on China's National...

📄 Human-AI Teaming Through the Lens of Calibration
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.10906v1
👥 Authors: Eric Nalisnick (possible past Google (United States) affiliation), Chi Zhang (possible past Peking University affiliation), Sophia Qian, Yixin Wang
Abstract

We study models for human-AI teaming through the lens of statistical calibration. We assume the team consists of an AI model and a human, both calibrated with respect to some partitioning of the feature space, and expose how the calibration assumptions propagate into the teaming framework. In particular, we consider frameworks that either (i) combine human and model predictions or (ii) delegate prediction responsibility to either the human or the model. We show via theoretical and empir...

📄 Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.10819v1
👥 Authors: Miaoxin Cai, Guanqun Wang, Wei Zhang (possible past Tsinghua University affiliation), Guangyao Zhou, Yin Zhuang, Tong Zhang (possible past Tencent (China) affiliation), Hao Wang (possible past Tsinghua University affiliation), He Chen, Jun Li
Abstract

Remote sensing multimodal large language models (RS-MLLMs) enable natural-language understanding and spatial reasoning over earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B RS-MLLM that unifies six sensor modalities (i.e., optical, SAR, infrared, multispectral, temporal, and video) and cross-sensor fusion across 9 task categories within a s...

📄 Spatial-Omni: Spatial Audio Understanding Integration in Multimodal LLMs via FOA Encoding
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.10738v1
👥 Authors: Zhiyuan Zhu, Yixuan Chen, Yiwen Shao, Wenxiang Guo, Changhao Pan, Yu Zhang (possible past Google (United States) affiliation), Yuxiang Wang, Wei Liu (possible past Tsinghua University affiliation), Houhua Zhang, Chengkuan Zeng, Wenbo Cheng, Yunxi Liu, Rui Yang, Steve Yves, Liefeng Bo, Zhou Zhao
Abstract

Recent multimodal large language models mainly process audio as monaural signals, thereby discarding the spatial cues contained in spatial audio for sound localization, spatial relation reasoning, and spatial scene understanding. We propose Spatial-Omni, a lightweight method that implements SO-Encoder to inject First-Order Ambisonics (FOA) spatial audio into existing Omni LLMs as an independent modality, without modifying their original audio encoders. SO-Encoder provides spatial tokens with lim...

📄 Infini Memory: Maintainable Topic Documents for Long-Term LLM Agent Memory
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.10677v1
👥 Authors: Suozhao Ji, Baodong Wu, Zehao Wang, Lei Xia, Qingping Li, Ruisong Wang, Wenbo Ding (possible past Tsinghua University affiliation), Zhenhua Zhu, Boxun Li (possible past Tsinghua University affiliation), Guohao Dai, Yu Wang (possible past Tsinghua University affiliation)
Abstract

Long-term LLM agents need persistent memory that can track changing facts and provide relevant evidence across sessions. Existing memory systems often store observations as isolated records, summaries, or indexed fragments, which makes evidence aggregation, fact revision, and memory maintenance difficult. We propose Infini Memory, a maintainable text-based persistent memory architecture that treats agent memory as topic-structured documents. Each topic document serves as a semantic unit for coll...

📄 Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.10571v1
👥 Authors: Lijia Yu, Jiuxin Cao, Yuchen Qiang, Changhao Chen (possible past University of Oxford affiliation), Yifei Huang, Bo Liu (possible past Meta (United States) affiliation)
Abstract

Adversarial examples reveal vulnerabilities in Vision-Language Pre-training (VLP) models and provide insights for improving robustness. A key property is cross-model transferability, which enables transfer-based black-box attacks. However, existing attacks often rely heavily on the surrogate model, causing cross-model performance drops. One reason is that adversarial optimization may follow surrogate model responses more than input semantics, making the update direction effective on the surrogat...

📄 LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.10531v1
👥 Authors: Haoyu Wang (possible past Tencent (China) affiliation), Xingyu Yu, Haiyan Zhao, Fengxiang Wang, Xu Han (possible past Tsinghua University affiliation)
Abstract

Quantization-aware training (QAT) is essential for extremely low-bit large language models (LLMs). Current QAT methods are mainly based on scalar quantization (SQ), which enables efficient optimization but suffers from severe performance degradation at 2-bit precision. On the other hand, vector quantization (VQ) provides substantially higher representational capacity, but its discrete codebook lookup prevents end-to-end training. We propose LC-QAT, a 2-bit weight-only VQ-QAT framework that repre...

📄 Advancing the State-of-the-Art in Empirical Privacy Auditing
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.10481v1
👥 Authors: Nicole Mitchell, Galen Andrew (possible past Google (United States) affiliation), Arun Ganesh, Brendan McMahan (possible past Google (United States) affiliation), Peter Kairouz (possible past Google (United States) affiliation)
Abstract

Parameter-efficient fine-tuning of large language models (LLMs) can exhibit problematic memorization of individual training examples. Empirical privacy auditing (EPA) quantifies this risk by measuring realistic data leakage on membership inference (MI) or reconstruction attacks. A key challenge in EPA is designing "canary" examples that are mixed with the privacy-sensitive training data. We propose generating synthetic canaries via high-temperature sampling ($T \geq 0.8$) from LLMs, using prom...

📄 ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.10479v1
👥 Authors: Shunkai Zhang, Haoran Zhang, Yun Luo, Qianjia Cheng, Haodi Lei, Yizhuo Li, Runzhe Zhan, Zhilin Wang, Bangjie Xu, Yucheng Su, Xinmiao Han, Xiaoye Qu, Dongrui Liu, Zhouchen Lin (possible past Peking University affiliation), Yu Qiao (possible past Shanghai Artificial Intelligence Laboratory affiliation), Ning Ding (possible past Tsinghua University affiliation), Yafu Li, Yu Cheng (possible past National University of Singapore affiliation)
Abstract

Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench co...

📄 ERAlign: Energy-based Representation Alignment of GNNs and LLMs on Text-attributed Graphs
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.10461v1
👥 Authors: Xianlin Zeng, Fan Xia (possible past Tencent (China) affiliation), Xiangyu Chen (possible past Shanghai Artificial Intelligence Laboratory affiliation)
Abstract

Text-attributed Graphs (TAGs) incorporate textual node attributes with graph structures to describe rich relational semantics. Recent efforts to integrate Graph Neural Networks (GNNs) and Large Language Models (LLMs) have shown promise for learning on TAGs, yet achieving well-aligned representations remains challenging. Prior studies largely rely on heuristics that perform coarse-grained matching. They lack sufficient constraints and ignore distributional alignment, leading to representation dri...

📄 Beyond Static Evaluation: Co-Evolutionary Mechanisms for LLM-Driven Strategy Evolution in Adversarial Games
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.10389v1
👥 Authors: Haoran Li, Zengle Ge, Ziyang Zhang, Xiaomin Yuan, Yui Lo, Qianhui Liu, Bocheng An, Dongke Rong, Jiaqun Liu, Annan Li, Jianmin Wu, Dawei Yin (possible past Baidu (China) affiliation), Dou Shen (possible past Baidu (China) affiliation)
Abstract

Recent advances in LLM-driven code evolution have enabled automated discovery by iteratively generating and improving programs. However, applying these methods to adversarial multi-agent games introduces a fundamental challenge: the evaluation landscape shifts as strategies improve, causing fixed evaluators to become unreliable and evolution to stagnate. We propose three mechanisms to address this challenge: evaluator co-evolution, which incorporates discovered champions into the opponent pool; ...

📄 A Practical Recipe Towards Improving Sim-and-Real Correlation for VLA Evaluation
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.10366v1
👥 Authors: Shuo Wang (possible past Nvidia (United States) affiliation), Hanyuan Xu, Yingdong Hu, Fanqi Lin, Yang Gao (possible past Tencent (China) affiliation)
Abstract

Simulation has become an essential tool for evaluating and improving vision-language-action (VLA) policies, offering scalable, reproducible, and controllable alternatives to costly real-world robot evaluation. Recent simulation benchmarks have made substantial progress on realism and diversity, yet these platforms have not been widely adopted as reliable proxies for real-world policy evaluation. In this work, we investigate this issue through the lens of sim-and-real correlation. We conduct a sy...

📄 What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.10267v1
👥 Authors: Jiaheng Hu, Mohit Shridhar (possible past University of Washington affiliation), Caden Lu, Dhruv Shah, Hao-Tien Lewis Chiang (possible past Google (United States) affiliation), Jie Tan (possible past Google (United States) affiliation), Annie Xie
Abstract

Hierarchical vision-language-action (Hi-VLA) systems have emerged as a promising paradigm for complex robot manipulation, by using high-level VLM planners to decompose tasks into language subgoals executed by low-level VLA controllers. Despite recent empirical progress, there is a lack of unified design principles for these systems: existing Hi-VLA systems differ in how they choose and connect planners, controllers, mechanisms to switch between the two, and how observations and memory are repres...

📄 Exploring the Design Space of Reward Backpropagation for Flow Matching
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.11075v1
👥 Authors: Ruoyu Wang (possible past University of Edinburgh affiliation), Boye Niu, Xiangxin Zhou, Yushi Huang, Tongliang Liu, Chi Zhang (possible past Peking University affiliation)
Abstract

Aligning text-to-image flow matching models with human preferences via direct reward backpropagation is sample-efficient but hampered by two well-known pathologies: activations cannot be stored across the full sampling trajectory at modern model scale, and chained Jacobian products across steps inflate the reward gradient as it travels back to early indices. Connector-based methods, such as LeapAlign, address these issues by replacing the full backward trajectory with a short pinned path, highli...

📄 How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs
🗓️ Published: 6/9/2026
🔗 http://arxiv.org/abs/2606.10646v1
👥 Authors: Zhichen Dong, Yang Li (possible past Google (United States) affiliation), Yuhan Sun, Weixun Wang, Yijia Luo, Zinian Peng, Taiheng Ye, Chao Yang, Wenbo Su, Yu Cheng (possible past National University of Singapore affiliation), Bo Zheng, Junchi Yan (possible past Shanghai Jiao Tong University affiliation)
Abstract

Token-level credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs), where RL recipes typically treat all tokens equally, failing to distinguish decisive reasoning steps from routine formatting or fluent filler. Recent attempts leverage model-internal signals to assign finer-grained credit, but these are often point-wise heuristics that ignore the global structure of information propagation. We propose FlowTracer, an RL framework that traces answe...

📄 A Theory on Flow Matching with Neural Networks
🗓️ Published: 6/8/2026
🔗 http://arxiv.org/abs/2606.10089v1
👥 Authors: Yihan He, Qishuo Yin, Yuan Cao (possible past Google (United States) affiliation), Jianqing Fan, Han Liu (possible past Tsinghua University affiliation)
Abstract

In this work, we develop a theoretical foundation for flow matching with neural-network-parameterized conditional velocity fields. We establish convergence guarantees for gradient descent in the over-parameterized two-layer ReLU neural network regime. We derive generalization bounds for the conditional velocity-field matching objective. Building on these results, we provide Wasserstein-distance guarantees for the samples generated by the induced flow. Our analysis is based on generalization bound ...

📄 GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation
🗓️ Published: 6/8/2026
🔗 http://arxiv.org/abs/2606.10025v1
👥 Authors: Sriram Krishna, Ben Eisner, Haotian Zhan, Ying Yuan, Haoyu Zhen, Chuang Gan (possible past Tsinghua University affiliation), Shubham Tulsiani (possible past University of California, Berkeley affiliation), David Held (possible past University of California, Berkeley affiliation)
Abstract

We present GHOST, a framework for learning visuomotor manipulation policies that generalize beyond the training distribution. GHOST factorizes control into (i) a high-level policy that predicts the next sub-goal as a distribution over 3D end-effector poses from multi-view RGB-D observations, and (ii) a low-level goal-conditioned controller that executes embodiment-specific actions. To condition image-based policies on 3D goals, we introduce a simple spatial interface that projects predicted goal...

*Notable papers are those with at least two authors from a "big" AI/ML lab.