📄 Notable* Recent AI/ML arXiv Papers

📄 Spotlight: Synergizing Seed Exploration and Spot GPUs for DiT RL Post-Training
🗓️ Published: 6/17/2026
🔗 http://arxiv.org/abs/2606.19004v1
👥 Authors: Ruiqi Lai, Dakai An, Wei Gao (possible past Peking University affiliation), Ju Huang, Siran Yang, Jiamang Wang, Lin Qu, Dmitrii Ustiugov, Wei Wang (possible past University Of Oxford affiliation)
Abstract

Reinforcement learning (RL) post-training of Diffusion Transformers (DiTs) is prohibitively expensive, requiring thousands of high-end GPUs. Existing works explore two directions to reduce cost: seed exploration improves training convergence by selecting high-contrast samples, yet adds compute to the critical path; spot GPUs offer 69–77% lower cost, yet sit idle during training because DiT rollouts finish nearly simultaneously, which prevents LLM-style pipelining of rollout with training. Spot...

📄 G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment
🗓️ Published: 6/17/2026
🔗 http://arxiv.org/abs/2606.18989v1
👥 Authors: Fengying Ye, Yanming Sun, Runzhe Zhan, Zheqi Zhang, Lidia S. Chao (possible past Tencent (China) affiliation), Derek F. Wong (possible past Tencent (China) affiliation)
Abstract

Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is anchored by an English gloss from Wiktionary. We further construct a high-confidence reference alignment set for reproducible evaluation. G-IdiomAlign supports two protocols: (1) a controlled Multiple-Choice Idiom Equivalence with typed distractors for error attribution; and ...

📄 ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch
🗓️ Published: 6/17/2026
🔗 http://arxiv.org/abs/2606.18803v1
👥 Authors: Tengfei Lyu, Zirui Yuan, Xu Liu (possible past Massachusetts Institute Of Technology affiliation), Kai Wan, Zihao Lu, Li Ma, Hao Liu (possible past Tencent (China) affiliation)
Abstract

Bringing Large Language Models (LLMs) into industrial ride-hailing dispatch as semantic feature extractors over platform-scale behavioral logs is a compelling but under-explored data systems problem. Production matching pipelines remain dominated by structured numerical features, yet decisive behavioral signals (e.g., a driver's habitual aversion to certain regions) are inherently contextual and naturally expressible as LLM-generated user profiles. However, scaling such profiling to a live, mill...

📄 Guava: An Effective and Universal Harness for Embodied Manipulation
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18363v1
👥 Authors: Haowen Liu, Xirui Li, Shaoxiong Yao, Peng Shi, Tianyi Zhou (possible past University Of Washington affiliation), Jia-Bin Huang, Furong Huang, Jiayuan Mao (possible past Tsinghua University affiliation)
Abstract

Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities in a wide rang...

📄 SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18356v1
👥 Authors: Yuchuan Tian, Mengyu Zheng, Haocheng Mei, Ye Yuan (possible past Carnegie Mellon University affiliation), Chao Xu, Xinghao Chen, Hanting Chen, Yu Wang (possible past Tsinghua University affiliation)
Abstract

Tool-using language-model agents introduce security failures that go beyond unsafe text: they can disclose protected objects, write persistent memory, send messages, modify databases, or trigger harmful code and tool effects. Existing evaluations often collapse these stages into a single attack success rate, making it difficult to tell whether a model merely agreed with an attacker or actually produced observable harm. We introduce SafeClawBench, a staged benchmark for tool-using agent security ...

📄 Self-CTRL: Self-Consistency Training with Reinforcement Learning
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18327v1
👥 Authors: Itamar Pres, Laura Ruis, Melat Ghebreselassie, Belinda Z. Li (possible past University Of Washington affiliation), Jacob Andreas (possible past University Of California, Berkeley affiliation)
Abstract

Language models (LMs) that faithfully describe their own behavior can more easily be audited, understood, and trusted by users. This paper describes Self-Consistency Training with Reinforcement Learning (Self-CTRL), a method that optimizes for consistency between a LM's self-explanations and behavior on related inputs by updating explanations to better predict behavior or updating behavior to better match explanations. We apply our method in two domains. First, we study a formal probabilistic re...

📄 ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18237v1
👥 Authors: Shanda Li, Qiuhong Anna Wei, Jingwu Tang, Valerie Chen, Nihar B Shah, Tim Dettmers, Yiming Yang (possible past Microsoft (United States) affiliation), Ameet Talwalkar (possible past University Of California, Berkeley affiliation)
Abstract

Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction bloc...

📄 EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18235v1
👥 Authors: Qi Chai, Wenhao Shen, Nanjie Yao, Yue Xia, Kaiyong Zhao, Jie Ma (possible past University Of Oxford affiliation), Guosheng Lin, Hao Wang (possible past Tsinghua University affiliation)
Abstract

Zero-Shot Object-Goal Navigation (ZS-OGN) requires embodied agents to explore and locate target objects without any prior training. To this end, recent methods leverage foundation models. But they typically rely on static priors and lack adaptation, which leads to repeated errors and costly trial and error. In this paper, we propose a self-evolving ZS-OGN framework that enables continuous test-time improvement. Specifically, we build an agentic rule memory by extracting actionable knowledge from...

📄 Looped World Models
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18208v1
👥 Authors: Hongyuan Adam Lu, Z. L. Victor Wei, Qun Zhang, Jinrui Zeng, Bowen Cao, Lingwei Meng, Mocheng Li, Zezhong Wang, Haonan Yin, Naifu Xue, Minyu Chen, Cenyuan Zhang, Zefan Zhang, Hao Wei, Jiawei Zhou, Haoran Xu, Hao Yang (possible past Tencent (China) affiliation), Ronglai Zuo, Tongda Xu, Yonghao Li, Jian Chen (possible past Baidu (China) affiliation), Hebin Wang, Zeyu Gao, Yang Li (possible past Google (United States) affiliation), Wei Zhao (possible past Tencent (China) affiliation), Qimin Zhong, Siqi Liu (possible past University Of Oxford affiliation), Yumeng Zhang, Leyan Cui, Zhangyu Wang, Wai Lam
Abstract

Current world models face a fundamental tension: faithful long-horizon simulation demands deep computation, but deeper models are expensive to deploy and prone to compounding errors. We resolve this by introducing Looped World Models (LoopWM), which are the first looped architectures for world modelling. Our method iteratively refines latent environment states through a parameter-shared transformer block. This yields up to 100x parameter efficiency over conventional approaches with adaptive compu...

📄 RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18203v1
👥 Authors: Weizhi Zhang, Zechen Li, Hamid Palangi, Ben Graef, A. Ali Heydari, Simon A. Lee, Salman Rahman, Ray Luo, Zeinab Esmaeilpour, Erik Schenck, Chloe Zhang, Yamin Li, Menglian Zhou, Philip S. Yu (possible past Tsinghua University affiliation), Daniel McDuff (possible past Google (United States) affiliation), Lindsey Sunden, Mark Malhotra, Shwetak Patel, Ahmed A. Metwally
Abstract

The LLM-empowered personal health agents with user health (sensor) metrics have offered a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an...

📄 Knowledge Reutilization in Meta-Reinforcement Learning
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18132v1
👥 Authors: Yuan Meng, Bo Wang (possible past Tencent (China) affiliation), Juan De Los Rios Ruiz, Xiangtong Yao, Zhenshan Bing, Fuchun Sun (possible past Tsinghua University affiliation), Alois Knoll
Abstract

Meta-reinforcement learning enables fast adaptation by extracting shared structure from related tasks, but existing end-to-end methods often couple task inference with embodiment-specific control. This coupling can obscure non-parametric task semantics, reduce sample efficiency, and limit cross-agent reuse. We propose a meta-knowledge reutilization framework that learns task-level knowledge on a dynamics-simplified agent and transfers it to heterogeneous agents. The framework uses a Bayesian non...

📄 Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18062v1
👥 Authors: Hobin Kim, Xiaoyuan Wu, Omer Akgul, Lujo Bauer (possible past Carnegie Mellon University affiliation), Nicolas Christin (possible past Carnegie Mellon University affiliation)
Abstract

Large language models (LLMs) are widely used to fulfill users' information needs; users ask LLMs about the weather, pose educational questions, and consult them for legal assistance. One particularly understudied area is digital security and privacy (S&P), where users may seek LLMs' help on how to secure their online accounts or protect their computers from cyber attacks. To the best of our knowledge, no prior study has collected or analyzed the S&P questions users ask LLMs; prior research on LL...

📄 LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18023v1
👥 Authors: Jian Yang, Shawn Guo, Wei Zhang (possible past Tsinghua University affiliation), Tianyu Zheng, Yaxin Du, Haau-Sing Li, Jiajun Wu (possible past Massachusetts Institute Of Technology affiliation), Yue Song, Yan Xing, Qingsong Cai, Zelong Huang, Chuan Hao, Ran Tao, Xianglong Liu, Wayne Xin Zhao (possible past Baidu (China) affiliation), Mingjie Tang, Weifeng Lv, Ming Zhou, Bryan Dai
Abstract

Looped Transformers scale latent computation by repeatedly applying shared blocks, but sequential looping increases latency and KV-cache memory with the loop count. Parallel loop Transformers (PLT) alleviate this cost through cross-loop position offsets (CLP) and shared-KV gated sliding-window attention, making loop count a practical design choice. We therefore study PLT loop-count selection through a gain-cost view: an extra loop may refine representations, but CLP also introduces a positional...

📄 Be Your Own Teacher: Steering Protein Language Models via Unsupervised Reward Optimization
🗓️ Published: 6/17/2026
🔗 http://arxiv.org/abs/2606.18961v1
👥 Authors: Lanqing Li (possible past Tencent (China) affiliation), Shentong Mo (possible past Baidu (China) affiliation), Yang Yu, Pheng-Ann Heng
Abstract

Protein language models (PLMs) have emerged as powerful tools for controllable biomolecular design, yet their post-training adaptation typically relies on costly wet-lab validation or curated preference datasets. To overcome this supervision bottleneck, we introduce unsupervised reward optimization of PLMs, a comprehensive framework for steerable protein generation without ground-truth labels. Our key insight is that task-agnostic rewards, which combine intrinsic model uncertainty with extrinsic...

📄 PACT: Preserving Anchored Cores in Task-vectors for Model Merging
🗓️ Published: 6/17/2026
🔗 http://arxiv.org/abs/2606.18627v1
👥 Authors: Ningyuan Shi, Zhipeng Zhou, Hao Wang (possible past Tsinghua University affiliation), Chunyan Miao, Peilin Zhao (possible past Tencent (China) affiliation)
Abstract

Model merging has emerged as a training-free alternative to multi-task learning, aiming to combine multiple task-specific fine-tuned models into a single multi-task model. Most existing model merging approaches follow the Task Arithmetic paradigm, which decomposes fine-tuned weights into pre-trained parameters and task vectors, and performs merging exclusively in the task-vector space. The effectiveness of this paradigm implicitly relies on the assumption that task-specific knowledge is encoded ...

📄 Compact Geometric Representations of Hierarchies
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18520v1
👥 Authors: Prashant Gokhale, Piotr Indyk (possible past Massachusetts Institute Of Technology affiliation), Yuhao Liu (possible past Baidu (China) affiliation), Sandeep Silwal, Tony Chang Wang, Haike Xu
Abstract

Computing geometric representations of data is a cornerstone of modern machine learning, typically achieved by training dual encoders which map queries and documents into a shared embedding space. Recent work of You et al. [NeurIPS '25] has extended this approach to hierarchical retrieval, where relevance is determined by the ancestor-descendant relationships in a Directed Acyclic Graph (DAG). While previous work has shown that valid embeddings exist when the number of descendants is small, thes...

📄 OmniPlan: An Adaptive Framework for Timely and Near-Optimal Network Planning Optimization
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18105v2
👥 Authors: Longlong Zhu, Jiashuo Yu, Zedi Chen, Yuhan Wu, Zhifan Jiang, Yuchen Xian, Yimeng Liu, Jiajie Su, Shaopeng Zhou, Xingyuan Li, Hongyan Liu (possible past Peking University affiliation), Xuan Liu (possible past Baidu (China) affiliation), Dong Zhang (possible past Nvidia (United States) affiliation), Chunming Wu, Xiang Chen (possible past Tencent (China) affiliation)
Abstract

Network planning optimization is a fundamental problem across diverse domains, including transportation systems, communication networks, and power grids. It requires simultaneous optimization of multiple competing objectives under complex constraints. Existing network planning optimization frameworks rely on mixed integer programming (MIP) solvers, heuristics, and deep reinforcement learning (DRL) models to compute planning decisions. However, they lack effective adaptability to diverse and dyna...

📄 From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning
🗓️ Published: 6/16/2026
🔗 http://arxiv.org/abs/2606.18089v1
👥 Authors: Lingjing Kong, Xin Liu, Guangyi Chen, Martin Q. Ma, Xiangchen Song, Yuekai Sun, Mikhail Yurochkin, Taylor W. Killian, Ruslan Salakhutdinov (possible past University Of Toronto affiliation), Kun Zhang (possible past Google (United States) affiliation), Eric P. Xing, Zhengzhong Liu (possible past Tencent (China) affiliation)
Abstract

Post-training pipelines that combine supervised fine-tuning (SFT) with reinforcement learning (RL) have emerged as the key recipe for transforming large language models (LLMs) into robust reasoners. We argue that this combined success is driven by compositional generalization, which we formalize through a hierarchical latent selection model. In this framework, reasoning traces are generated by a cascade of discrete latent selection variables corresponding to reusable atomic modules, including bo...

*Notable papers are those with at least two authors from a "big" AI/ML lab.