📄 Notable* Recent AI/ML arXiv Papers

📄 Process Advantage Signal Shaping: A Paradigm-Agnostic Middleware for Process-Supervised RL in LLM Reasoners
🗓️ Published: 6/28/2026
🔗 http://arxiv.org/abs/2606.29296v1
👥 Authors: Chao Wang (possible past Google (United States) affiliation), Hongtao Tian, Tao Yang, Yunsheng Shi (possible past Baidu (China) affiliation), Ting Yao, Wenbo Ding (possible past Tsinghua University affiliation)
Abstract

Group Relative Policy Optimization (GRPO) is a default recipe for process-supervised reinforcement learning of LLM reasoners, and dense process supervision -- via learned process reward models (PRMs) or on-policy-distillation KL signals -- is a common way to densify its otherwise weak outcome reward. Layering such a step-level signal on top of GRPO's group-standardized advantage, however, exposes three structural pathologies: channel contamination between the pooled process, outcome, and ...
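For readers unfamiliar with the term, the "group-standardized advantage" mentioned in this abstract can be sketched as follows. This is a minimal illustration of the usual GRPO outcome-reward normalization, not the paper's method: the function name, the `eps` constant, and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_standardized_advantages(rewards, eps=1e-8):
    """Standardize scalar rewards within one group of sampled responses.

    Assumption: one outcome reward per response, all responses sampled
    for the same prompt. `eps` (hypothetical default) avoids division by
    zero when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, two judged correct (reward 1.0).
advantages = group_standardized_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses get a positive advantage and incorrect ones a negative advantage; if all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.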

📄 A Multi-Dataset Benchmark for Evaluating LLM Agents in Microservice Failure Diagnosis
🗓️ Published: 6/28/2026
🔗 http://arxiv.org/abs/2606.29193v1
👥 Authors: Yuanhong Cai, Xiaohui Nie (possible past Tsinghua University affiliation), Kanglin Yin, Changhua Pei, Yongqian Sun (possible past Tsinghua University affiliation), Shenglin Zhang, Haibin Liu, Guiyang Liu, Xidao Wen, Fang Situ, Dan Pei (possible past Tsinghua University affiliation)
Abstract

LLM-based agents are reshaping microservice operations into AgentOps, where benchmarks are key to evaluating failure diagnosis over multimodal observability data. However, existing benchmarks remain largely outcome-oriented: they score only the final answer and fail to assess the systematic reasoning process in failure diagnosis. We address this gap by introducing two large-scale datasets (AIOps2025 and RCA100) under a reasoning-process evaluation paradigm that assesses agentic diagnostic capabi...

📄 Semantic-Aware, Physics-Informed, Geometry-Grounded Weather Video Synthesis
🗓️ Published: 6/27/2026
🔗 http://arxiv.org/abs/2606.29020v1
👥 Authors: Chenghao Qian, Nedko Savov, Lingdong Kong, Yeying Jin, Rui Song (possible past Peking University affiliation), Wenjing Li, Zhun Zhong, Jiaqi Ma, Gustav Markkula, Luc Van Gool (possible past Google (United States) affiliation)
Abstract

Weather synthesis aims to add weather effects to input videos while preserving scene identity, structure, and motion. The key limitation of existing methods is the lack of diversity in weather appearance and effective control over weather dynamics (e.g., temporal evolution and particle motion). Most approaches rely on text prompts, which are inherently underspecified and often fail to produce detailed weather characteristics. Additionally, general-purpose video editors optimized for clean and ae...

📄 X-Mind: Efficient Visual Chain-of-Thought via Predictive World Model for End-to-End Driving
🗓️ Published: 6/27/2026
🔗 http://arxiv.org/abs/2606.28758v1
👥 Authors: Bohao Zhao, Chengrui Wei, Guangfeng Jiang, Ruixin Liu, Xuejie Lv, Liu Liang, Sutao Deng, Xiuyang Fan, Pengkun Zheng, Jinyun Zhou, Rui Guo, Hanpeng Liu, Yutong Zheng, Yi Guo, Xinlong Zheng, Qingyu Luo, Zhuangzhuang Ding, Yu Zhang (possible past Google (United States) affiliation), Hang Zhang (possible past Amazon (United States) affiliation), Xianming Liu (possible past Meta (United States) affiliation)
Abstract

Predicting future states is essential for autonomous agents, yet current Vision-Language-Action (VLA) models fundamentally lack this capability, relying instead on reactive perception-action mapping. While integrating Predictive World Models (PWMs) addresses this gap, existing approaches either incur prohibitive cascaded latency or act as shallow terminal tasks that fail to deeply embed forward-looking reasoning. To endow VLA models with this reasoning capability, we propose X-Mind. Rather than ...

📄 ComMem: Complementary Memory Systems for Test-Time Adaptation of Vision-Language Models
🗓️ Published: 6/27/2026
🔗 http://arxiv.org/abs/2606.28719v1
👥 Authors: Guanglong Sun, Shuang Cui, Bo Lei, Liyuan Wang, Zihan Zhai, Hongwei Yan, Hang Su (possible past Tsinghua University affiliation), Jun Zhu (possible past Tsinghua University affiliation), Yi Zhong
Abstract

Test-time adaptation (TTA) of vision-language models (VLMs) is essential for their robust deployment in dynamic, real-world environments. However, existing TTA methods often adapt locally without accumulating knowledge over time, or operate within a single modality without exploiting VLMs' inherently multi-modal nature. Inspired by the Complementary Memory systems of the biological brain, we propose ComMem, an innovative approach that mimics the distinct but cooperat...

📄 The Speedup Paradox: Rethinking Inference Speed-Quality Trade-off in Embodied Tasks
🗓️ Published: 6/26/2026
🔗 http://arxiv.org/abs/2606.28529v1
👥 Authors: Yujin Wang, Junli Chen, Yixuan Li (possible past Meta (United States) affiliation), Shunan Dong, Huazhong Yang (possible past Tsinghua University affiliation), Yongpan Liu (possible past Tsinghua University affiliation), Hongyang Jia
Abstract

Embodied foundation models have recently been widely used to improve robot generalization and task success rates. Previous works apply lossy efficient-inference techniques such as quantization, pruning, and asynchronous inference, accepting small action quality degradation in exchange for lower per-step computation cost and inter-action latency. However, unlike traditional static ML tasks, embodied tasks involve repeated interaction with the environment, and task-level performance is determined ...

📄 TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents
🗓️ Published: 6/26/2026
🔗 http://arxiv.org/abs/2606.28480v1
👥 Authors: Shoufa Chen, Luyuan Wang, Xuan Yang (possible past Stanford University affiliation), Zhiheng Liu, Yuren Cong, Yuanfeng Ji, Feiyan Zhou, Xiaohui Zhang (possible past Meta (United States) affiliation), Fanny Yang, Belinda Zeng
Abstract

As large language models and harness frameworks continue to advance, agents operating in terminals are increasingly capable of performing a broader range of general computer-use tasks beyond coding. However, existing benchmarks do not adequately evaluate general-purpose terminal computer-use agents (TUAs): general computer-use benchmarks primarily target graphical user interfaces (GUIs), whereas terminal-based benchmarks largely emphasize technical and programming-centric workflows historically ...

📄 Agentic Hardware Design as Repository-Level Code Evolution
🗓️ Published: 6/26/2026
🔗 http://arxiv.org/abs/2606.28279v1
👥 Authors: Cunxi Yu, Chenhui Deng, Nathaniel Pinckney (possible past Nvidia (United States) affiliation), Brucek Khailany (possible past Nvidia (United States) affiliation)
Abstract

We present HORIZON, a self-evolving agent framework that treats hardware design as repository-level code evolution. A Markdown harness is compiled into a project pack containing domain knowledge, an executable evaluator, an acceptance predicate, and a git/runtime policy; a hands-free agent loop then evolves an isolated git worktree, using repository operations for state management, tracing, and replay. This extends prior work on repository-scale self-evolution from EDA software systems to hard...

📄 Towards Automating Scientific Review with Google's Paper Assistant Tool
🗓️ Published: 6/26/2026
🔗 http://arxiv.org/abs/2606.28277v1
👥 Authors: Rajesh Jayaram, Drew Tyler, David Woodruff, Corinna Cortes (possible past Google (United States) affiliation), Yossi Matias (possible past Google (United States) affiliation), Vahab Mirrokni (possible past Google (United States) affiliation), Vincent Cohen-Addad
Abstract

Artificial intelligence is driving a revolution in scientific discovery, accelerating everything from hypothesis generation to mathematical theorem proving. However, this rapid acceleration is creating a systemic challenge: traditional human peer review cannot scale to match the influx of AI-assisted science. Ultimately, to resolve this tension, we must also deploy AI to accelerate the verification and review process itself. To frame the discussion around this transition, we propose a taxonomy c...

📄 Cognitive Episodes in LLM Reasoning Traces Enable Interpretable Human Item Difficulty Prediction
🗓️ Published: 6/26/2026
🔗 http://arxiv.org/abs/2606.28186v1
👥 Authors: Chenguang Wang (possible past Amazon (United States) affiliation), Ming Li, Xinyue Zeng, Zhuochun Li, Hong Jiao, Tianyi Zhou (possible past University of Washington affiliation), Dawei Zhou
Abstract

Predicting human item difficulty is central to educational assessment, where reliable estimates support fairness and effective test construction. Existing methods often depend on costly human calibration or item-level textual representations, providing limited evidence about the cognitive processes that make items difficult. We argue that difficulty should be viewed not only as a property of item text, but also as an observable consequence of the problem-solving burden an item induces. Large Rea...

📄 Tandem Reinforcement Learning with Verifiable Rewards
🗓️ Published: 6/26/2026
🔗 http://arxiv.org/abs/2606.28166v1
👥 Authors: Difan Jiao, Raghav Singhal, Robert West (possible past Stanford University affiliation), Ashton Anderson (possible past Stanford University affiliation)
Abstract

Reinforcement learning with verifiable rewards (RLVR) has significantly improved the reasoning capability of large language models, reaching expert or even superhuman performance in domains such as competition math. However, whether weaker agents and humans can actually harness this capability is far less certain, with RLVR documented to drift reasoning toward idiosyncratic patterns such as poor readability and language mixing. Tandem training is a recently introduced paradigm that targets this ...

📄 JD Oxygen AI Item Center (Oxygen AIIC) V1: An Industrial-Scale LLM/VLM-Centric Solution for Item Understanding, Management, and Applications
🗓️ Published: 6/26/2026
🔗 http://arxiv.org/abs/2606.28070v1
👥 Authors: Oxygen AIIC, Chan Long, Chao Liu, Chaofan Chen, Chaohui Dong, Chunyuan Guo, Danping Liu, Debin Liu, Deping Xiang, Fulai Xu, Guangyue Liu, Hao Li (possible past Tsinghua University affiliation), Huichun Hu, Jian Yang, Jianan Wang (possible past DeepMind (United Kingdom) affiliation), Jianbo Zhao, Jiaoyang Li, Jiaxing Wang, Jinglong Li, Jinjin Guo, Jun Fang, Jun Liu (possible past Tencent (China) affiliation), Kai Zhou, Li Wang (possible past Tesla (United States) affiliation), Lili Gao, Liying Chen, Luning Yang, Mengdi Zhou, Pengzhang Liu, Qi Lv, Qianyun Wang, Qixia Jiang, Ruyue Li, Shimu Liang, Shuxing Wang, Sijie Zhang, Siqi Li, Tianhao Gao, Wang Ke, Weihu Huang, Wencan Lai, Wenjie Zhang, Xiaohui Zhang (possible past Meta (United States) affiliation), Xiaojing Dong, Ya Liu, Yifeng Zhang, Yixiang Wang, Yongtai Zhang, Yongyi Liao, Zhaoru Chen, Zhen Chen, Zhiyong Ma, Zhiyuan Liu (possible past Tsinghua University affiliation), Zhongwei Liu, Ziyan Xing
Abstract

JD.com, one of the world's largest e-commerce platforms, serves over 700 million active users and millions of merchants, with a catalog of tens of billions of SKUs. At this scale, high-quality, structured item knowledge underpins a better consumer experience, lower management costs, and higher operational efficiency, yet producing and serving it poses three industrial-scale challenges: fast-emerging concepts, high-quality knowledge production for massive SKUs, and diverse downstream requirements....

📄 Event-Conditioned Diagnostics of Kinematic, Contact, and Object-Permanence Fields in Passive Object-State World Models
🗓️ Published: 6/26/2026
🔗 http://arxiv.org/abs/2606.28455v1
👥 Authors: Yang Liu (possible past Tsinghua University affiliation), Yuming Chen (possible past University of Washington affiliation)
Abstract

World models can predict future physical states, but prediction accuracy alone does not explain how physical information is organized and used inside their latent dynamics. We introduce a controlled diagnostic protocol for studying event-conditioned latent physical structure in passive object-state world models. The protocol tests whether hidden representations encode event-regime information, whether event contexts reweight non-exclusive physical field readouts, and whether field-aligned repres...

📄 DataComp-VLM: Improved Open Datasets for Vision-Language Models
🗓️ Published: 6/26/2026
🔗 http://arxiv.org/abs/2606.28551v1
👥 Authors: Matteo Farina, Vishaal Udandarao, Thao Nguyen (possible past Google (United States) affiliation), Selim Kuzucu, Maximilian Böther, Andreas Hochlehnert, Adhiraj Ghosh, Marianna Nezhurina, Karsten Roth, Joschka Struber, Yuhui Zhang, Sebastian Dziadzio, Elaine Sui, Soumya Jahagirdar, Dhruba Ghosh, Hasan Hammoud, Thomas De Min, Simone Caldarella, Jehanzeb Mirza, Sedrick Keh, Mehdi Cherti, Hilde Kuehne, Bernt Schiele, Serena Yeung-Levy, Muhammad Ferjad Naeem, Federico Tombari (possible past Google (United States) affiliation), Ana Klimovic (possible past Stanford University affiliation), Elisa Ricci, Matthias Bethge, Sewoong Oh, Ameya Prabhu, Alessio Tonioni, Jenia Jitsev, Massimiliano Mancini, Ludwig Schmidt (possible past University of Washington affiliation), Nikhil Parthasarathy
Abstract

Building performant Vision-Language Models (VLMs) requires carefully curating large-scale training datasets, yet the community lacks systematic benchmarks for evaluating such curation strategies. We introduce DataComp for VLMs (DCVLM), a benchmark for controlled data-centric experiments to improve VLM training. As part of DCVLM, we collect 160 datasets spanning four data types -- image-caption pairs, multimodal interleaved documents, text-only, and instruction-tuning data -- into a corpus of 6T ...

*Notable papers are those with at least two authors from a "big" AI/ML lab.
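
The "at least two authors" rule in the footnote can be checked against the author lines above with a short script. This is only a sketch: it assumes the "(possible past ... affiliation)" tags are what mark an author as coming from a "big" lab, which the listing does not state explicitly, and the names `AFFILIATION_TAG`, `flagged_affiliations`, and `is_notable` are hypothetical.

```python
import re

# Assumption: each "(possible past <lab> affiliation)" tag marks one
# big-lab author. The non-greedy group lets lab names contain their own
# parentheses, e.g. "Google (United States)".
AFFILIATION_TAG = re.compile(r"\(possible past (.+?) affiliation\)")


def flagged_affiliations(authors_line):
    """Return the lab names tagged on one 'Authors:' line, in order."""
    return AFFILIATION_TAG.findall(authors_line)


def is_notable(authors_line, min_flagged=2):
    """Apply the footnote's rule: at least `min_flagged` tagged authors."""
    return len(flagged_affiliations(authors_line)) >= min_flagged


line = (
    "Authors: Cunxi Yu, Chenhui Deng, "
    "Nathaniel Pinckney (possible past Nvidia (United States) affiliation), "
    "Brucek Khailany (possible past Nvidia (United States) affiliation)"
)
labs = flagged_affiliations(line)
notable = is_notable(line)
```

Under this reading, every entry in the listing carries at least two tags, which is consistent with the footnote.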