📄 Notable* Recent AI/ML arXiv Papers

📄 Skill Reuse as Compression in Agentic RL
🗓️ Published: 5/29/2026
🔗 http://arxiv.org/abs/2605.31509v1
👥 Authors: Zhikun Xu (possible past Google (United States) affiliation), Yu Feng (possible past University Of California, Berkeley affiliation), Jacob Dineen, Taiwei Shi, Jieyu Zhao, Ben Zhou
Abstract

Large language model agents trained with reinforcement learning (RL) often learn brittle, task-specific shortcuts. We hypothesize that agents generalize better when their successful trajectories are structurally compressible, decomposed into a small set of reusable abstract patterns. To formalize this, we introduce ReuseRL, which grounds agentic RL in the Minimum Description Length (MDL) principle. ReuseRL extracts a shared skill dictionary from successful trajectories and augments the RL object...

📄 PithTrain: A Compact and Agent-Native MoE Training System
🗓️ Published: 5/29/2026
🔗 http://arxiv.org/abs/2605.31463v1
👥 Authors: Ruihang Lai, Hao Kang, Haozhan Tang, Akaash R. Parthasarathy, Zichun Yu, Junru Shao, Todd C. Mowry (possible past Carnegie Mellon University affiliation), Chenyan Xiong, Tianqi Chen (possible past University Of Washington affiliation)
Abstract

Mixture-of-Experts (MoE) has become the dominant architecture for frontier language models. To meet this demand, production frameworks have built optimized MoE training stacks over years of engineering effort. Yet evolving these stacks for new architectures and system optimizations remains expensive. With the rise of AI coding agents, they could automate parts of training-framework development and accelerate this evolution. But applying them to these existing frameworks carries hidden costs, inv...

📄 DynaTree: Dynamic Agentic Retrieval Tree for Time-Sensitive News Retrieval
🗓️ Published: 5/29/2026
🔗 http://arxiv.org/abs/2605.31377v1
👥 Authors: Siyuan Qi (possible past Google (United States) affiliation), Xinyuan Wang, Yingxuan Yang, Haochuan Guo, Jianghao Lin, Weiwen Liu, Yong Yu (possible past Shanghai Jiao Tong University affiliation), Weinan Zhang (possible past Shanghai Jiao Tong University affiliation)
Abstract

Agentic Retrieval-Augmented Generation improves retrieval by integrating planning, tool use, and iterative reasoning, but existing agentic RAG methods often couple semantic expansion with retrieval decisions in short-horizon inference loops, leading to high inference cost and limited suitability for time-sensitive news retrieval. We propose DynaTree, a two-stage framework for efficient and adaptive news retrieval. In the offline stage, DynaTree uses coordinated agents to construct a reusable ret...

📄 ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models
🗓️ Published: 5/29/2026
🔗 http://arxiv.org/abs/2605.31251v1
👥 Authors: Kaiwen Xue, Tao Wei (possible past Baidu (China) affiliation), Guoxin Zhang, Zhonghong Ou, Kaoyan Lu, Yu Feng (possible past University Of California, Berkeley affiliation), Yifan Zhu, Haoran Luo
Abstract

Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo-localization remains underexplored due to the lack of fine-grained evaluation. We introduce ERGeoBench, a diagnostic benchmark for vision-driven embodied geo-localization. ERGeoBench evaluates models under three progressive settings -- single-view, panorama-view, and embodied-view -- where agents may actively acquire observations through sequential changes in yaw, pitch, and zoom. The benchm...

📄 SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes
🗓️ Published: 5/29/2026
🔗 http://arxiv.org/abs/2605.31148v1
👥 Authors: Tianhui Liu, Jie Feng (possible past Tsinghua University affiliation), Zhiheng Zheng, Shengyuan Wang, Yiming Guo, Yanxin Xi, Hangyu Fan, Yong Li (possible past Tsinghua University affiliation), Pan Hui
Abstract

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build coherent spatial understanding, act upon it, and refine their actions through multi-turn feedback. To study this problem, we in...

📄 A Unified and Reproducible Experimentation Framework for Speech Understanding
🗓️ Published: 5/29/2026
🔗 http://arxiv.org/abs/2605.30899v1
👥 Authors: Jing Peng, Junhao Du, Chenghao Wang, Hanqi Li, Yi Yang (possible past Baidu (China) affiliation), Yixuan Wang, Xiaoyu Gu, Guanyu Chen, Yucheng Wang, Jiang Li, Zhangjie Zhao, Haoran Wang, Wenming Tu, Haoyu Li, Duo Ma, Lirong Qian, Yu Xi, Wen Wen, Jiaqi Guo, Hui Zhang, Shuai Fan, Wenbin Jiang, Shuai Wang, Kai Yu (possible past Baidu (China) affiliation)
Abstract

Speech foundation models and Speech LLMs have advanced speech understanding, yet deployment-oriented model selection is hindered by non-comparable evaluations caused by mismatched post-processing, and by training results that are hard to reproduce across data scales and pipelines. We present SURE, a unified experimentation framework that standardizes prediction formats, normalization, and scoring. SURE evaluates strong systems across paradigms, from conventional pipelines to Speech LLMs, on repr...

📄 DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning
🗓️ Published: 5/29/2026
🔗 http://arxiv.org/abs/2605.30859v1
👥 Authors: Yujie Wang, Siwei Chen, Longzan Luo, Xinyi Liu, Xupeng Miao, Fangcheng Fu (possible past Peking University affiliation), Bin Cui (possible past Peking University affiliation)
Abstract

Reinforcement Learning (RL) has become pivotal for improving model capabilities yet suffers from rollout efficiency bottlenecks due to the long-tail response length distribution. While existing works mitigate the impact of long tails via prompt-level tail scheduling, we focus on the root source of inefficiency: the distribution itself. Specifically, we characterize the long-tail distribution at a finer granularity, identifying intra-prompt long tails, and revealing that they frequently consist o...

📄 GaMi: Geometry-Agnostic Material Identification via Cross-Modal Subtractive Disentanglement
🗓️ Published: 5/29/2026
🔗 http://arxiv.org/abs/2605.30818v1
👥 Authors: Zhiwei Chen, Yijie Li, Yimo Zhang, Shiyun Shao, Yichao Chen, Dian Ding, Liang Wang (possible past Tencent (China) affiliation), Haiwei Wu, Liwei Guo, Jie Yang (possible past Shanghai Jiao Tong University affiliation), Xiaosong Zhang, Yongzhao Zhang
Abstract

Non-contact material identification enables adaptive interaction for embodied intelligence yet faces challenges from geometry-induced variations (e.g., orientation, shape, distance) and single-modality ambiguities. In this paper, we present GaMi, a multimodal material identification system integrating mmWave and acoustic sensing to robustly operate under unconstrained geometric conditions. By leveraging the insight of shared geometric consistency between co-located bimodal sensors, GaMi employs ...

📄 Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO
🗓️ Published: 5/29/2026
🔗 http://arxiv.org/abs/2605.30789v1
👥 Authors: Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi, Yukang Chen, Dingdong Wang, Tianhe Wu, Junjie Wang, Yujiu Yang (possible past Tsinghua University affiliation), Yu Qiao (possible past Shanghai Artificial Intelligence Laboratory affiliation), Ruihang Chu
Abstract

We identify a new dimension for enhancing rollout diversity in Group Relative Policy Optimization (GRPO) for LLMs. While GRPO relies on diverse rollouts, prevailing strategies primarily increase diversity by injecting more token-level randomness, which may introduce step-wise noise and lead to incoherent trajectories. We uncover that smaller models within the same model family inherently exhibit higher policy-level diversity, indicated by their superior pass@k relative to larger counterparts as ...

📄 Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence
🗓️ Published: 5/29/2026
🔗 http://arxiv.org/abs/2605.30698v1
👥 Authors: Yuhan Wang (possible past Tencent (China) affiliation), Shuochen Chang, Yalin Feng, Dongsheng Ma, Yuanzi Li, Zhengren Wang, Yinglong Yang, Yufei Chen, Yikang Wang, Shaoxu Sun, Wentao Zhang (possible past Mila - Quebec Artificial Intelligence Institute affiliation)
Abstract

Vision-language models (VLMs) have achieved strong performance on visual question answering (VQA). To mitigate individual hallucinations and blind spots, aggregating diverse perspectives via multi-agent collaboration has emerged as a promising paradigm. While this approach has shown great success in textual QA, its potential in the multimodal domain remains under-explored. Existing multi-agent VQA methods predominantly adapt text-centric protocols, focusing on textual discussions while ignoring ...

📄 EUDAIMONIA: Evaluating Undesirable Dynamics in AI
🗓️ Published: 5/28/2026
🔗 http://arxiv.org/abs/2605.30654v1
👥 Authors: Jun Rui Huang, Wang Bill Zhu, Ziyi Liu, Nathanael Fast, Ravi Iyer (possible past Meta (United States) affiliation), Robin Jia (possible past Stanford University affiliation)
Abstract

Large language models (LLMs) are increasingly used as conversational partners for companionship, emotional disclosure, and interpersonal advice, but the social dynamics of these interactions can create harms that are not captured by capability-oriented or traditional safety evaluations. We introduce the Social AI Design Code, a framework for evaluating whether LLMs align with user welfare in social interactions, including whether they encourage harmful intimacy, dependence, or prolonged engageme...

📄 Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents
🗓️ Published: 5/28/2026
🔗 http://arxiv.org/abs/2605.30621v1
👥 Authors: Minhua Lin, Juncheng Wu, Zijun Wang, Zhan Shi, Yisi Sang, Bing He, Zewen Liu, Tianxin Wei, Zongyu Wu, Zhiwei Zhang, Dakuo Wang (possible past Tencent (China) affiliation), Xiang Zhang, Benoit Dumoulin, Cihang Xie (possible past Google (United States) affiliation), Yuyin Zhou, Suhang Wang, Hanqing Lu
Abstract

LLM agents are increasingly deployed as systems built around editable external harnesses, including prompts, skills, memories and tools, that shape task execution without changing model parameters. Harness self-evolution adapts such agents by updating these harnesses from execution evidence. Yet it remains unclear whether a model's base capability in task-solving predicts its capabilities in harness self-evolution: which models produce useful harness updates, and which actually benefit from them...

📄 Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs
🗓️ Published: 5/28/2026
🔗 http://arxiv.org/abs/2605.30611v1
👥 Authors: Haozhe Zhao, Shuzheng Si, Zhenhailong Wang, Zheng Wang, Liang Chen (possible past Google (United States) affiliation), Xiaotong Li, Zhixiang Liang, Maosong Sun (possible past Tsinghua University affiliation), Minjia Zhang
Abstract

Scientific figures are among the most effective means of communicating complex research ideas, yet producing publication-quality illustrations remains one of the most labor-intensive parts of paper preparation. Existing automated systems each target a single figure type under text-only input, leaving the diversity of types and conditions researchers actually use unaddressed; their raster outputs further cannot be locally revised. Because scientific figures are structured compositions of discrete...

📄 VLM3: Vision Language Models Are Native 3D Learners
🗓️ Published: 5/28/2026
🔗 http://arxiv.org/abs/2605.30561v1
👥 Authors: Zhipeng Cai, Zhuang Liu (possible past University Of California, Berkeley affiliation), Yunyang Xiong, Zechun Liu, Vikas Chandra (possible past Meta (United States) affiliation), Yangyang Shi
Abstract

Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic understanding. However, 3D understanding still largely relies on expert vision models with complex task-specific designs. The key argument this work wants to make is that VLMs are native 3D learners. Our in-depth large scale study shows that 1) focal length unification, 2) text-based pixel reference and 3) data mixture and scaling, are all you ne...

📄 Geometry-based Schrödinger Bridges for Trustworthy Multimodal Fusion
🗓️ Published: 5/29/2026
🔗 http://arxiv.org/abs/2605.31193v1
👥 Authors: Jiayu Xiong, Jing Wang (possible past Google (United States) affiliation), Qi Zhang (possible past Tencent (China) affiliation), Wanlong Wang, Jun Xue
Abstract

Real-world multimodal systems must be robust against low-quality data, such as sensor noise, incomplete multimodal data and conflicting inputs. However, existing trustworthy fusion methods rely on the model's own prediction confidence to judge data quality. This creates a circular dependency: when a model is confident but wrong, these methods fail to detect the error. To break this loop, we propose Geometry-based Multimodal Fusion (GMF). Instead of relying on predictions, we evaluate reliability...

📄 HetCCL: Enabling Collective Communication For Mixed-Vendor Heterogeneous Clusters
🗓️ Published: 5/29/2026
🔗 http://arxiv.org/abs/2605.31000v1
👥 Authors: Yuejie Wang, Tao Chang, Yuanyuan Zhao (possible past Tsinghua University affiliation), Yulong Ao, Zeyu Gu, Zhiyu Li, Yanmin Jia, Yan Zhang, Mingjun Zhang, He Liu (possible past Google (United States) affiliation), Yongzhe He, Yonghua Lin, Guyue Liu
Abstract

Training Large Language Models (LLMs) on heterogeneous clusters presents significant challenges for collective communication, as hardware from multiple vendors introduces diverse network and computational characteristics. Existing collective communication frameworks (e.g., NCCL, RCCL) designed for homogeneous environments fail to address mixed-hardware setups, while communication libraries with heterogeneous support (e.g., Gloo, OpenMPI) incur heavy overhead in the data path. This paper presen...

*Notable papers are those with at least two authors from a "big" AI/ML lab.