📄 Notable* Recent AI/ML arXiv Papers


📄 Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
🗓️ Published: 3/12/2026
🔗 http://arxiv.org/abs/2603.12246v1
👥 Authors: Yixin Liu, Yue Yu, Dijia Su, Sid Wang, Xuewei Wang, Song Jiang (possible past Peking University affiliation), Bo Liu (possible past Meta (United States) affiliation), Arman Cohan, Yuandong Tian (possible past OpenAI (United States) affiliation), Zhengxing Chen
Abstract

Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, provide a promising path for extending the success of reasoning models to non-verifiable domains where the output correctness/quality cannot be directly checked. However, while reasoning judges have shown better performance on static evaluation benchmarks, their effectiveness in actual policy training has not been systematically examined. Therefore, we conduct a rigorous study to investigate the actual impact of non-reasoni...

📄 Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
🗓️ Published: 3/12/2026
🔗 http://arxiv.org/abs/2603.12180v1
👥 Authors: Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang (possible past Tencent (China) affiliation), Anupam Datta (possible past Carnegie Mellon University affiliation)
Abstract

Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic ...

📄 IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL
🗓️ Published: 3/12/2026
🔗 http://arxiv.org/abs/2603.12151v1
👥 Authors: Zhoujun Cheng, Yutao Xie, Yuxiao Qu, Amrith Setlur, Shibo Hao, Varad Pimpalkhute, Tongtong Liang, Feng Yao, Zhengzhong Liu (possible past Tencent (China) affiliation), Eric Xing, Virginia Smith (possible past Carnegie Mellon University affiliation), Ruslan Salakhutdinov (possible past University Of Toronto affiliation), Zhiting Hu, Taylor Killian, Aviral Kumar (possible past University Of California, Berkeley affiliation)
Abstract

While scaling laws guide compute allocation for LLM pre-training, analogous prescriptions for reinforcement learning (RL) post-training of large language models (LLMs) remain poorly understood. We study the compute-optimal allocation of sampling compute for on-policy RL methods in LLMs, framing scaling as a compute-constrained optimization over three resources: parallel rollouts per problem, number of problems per batch, and number of update steps. We find that the compute-optimal number of para...
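
The allocation problem the abstract frames can be sketched as a small constrained search: pick rollouts per problem, problems per batch, and update steps that maximize some utility under a fixed sampling budget. The utility function, budget, and candidate grids below are purely illustrative assumptions, not the paper's fitted scaling law.

```python
# Toy sketch of compute-constrained allocation for RL sampling compute:
# choose (rollouts n, problems b, steps s) maximizing a HYPOTHETICAL
# utility subject to n * b * s <= budget. The log-utility is invented
# for illustration; the paper derives its own prescription empirically.
import math
from itertools import product

def best_allocation(budget, n_opts, b_opts, s_opts, utility):
    best, best_u = None, float("-inf")
    for n, b, s in product(n_opts, b_opts, s_opts):
        if n * b * s > budget:
            continue  # infeasible under the compute budget
        u = utility(n, b, s)
        if u > best_u:
            best, best_u = (n, b, s), u
    return best, best_u

# Assumed diminishing-returns utility, weighting rollouts most heavily.
toy_utility = lambda n, b, s: math.log(n) + 0.5 * math.log(b) + 0.8 * math.log(s)

alloc, u = best_allocation(4096, [1, 2, 4, 8, 16], [8, 16, 32, 64],
                           [4, 8, 16, 32], toy_utility)
print(alloc)  # (16, 8, 32): spend the budget on rollouts and steps first
```

Under these made-up weights the search saturates the resources with the largest marginal utility first, which mirrors the paper's framing of sampling compute as a three-way trade-off.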

📄 Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability
🗓️ Published: 3/12/2026
🔗 http://arxiv.org/abs/2603.12038v1
👥 Authors: Xingyu Xie, Zhaochen Yu, Yue Liao, Tao Wang (possible past Stanford University affiliation), Kim-Chuan Toh, Shuicheng Yan (possible past National University Of Singapore affiliation)
Abstract

Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense...
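
The stability idea can be sketched with a toy scheduler: track the top-k attention "support" at each decoding step, take a cheap fast step while it overlaps the cached support, and fall back to a dense slow step when it drifts. The synthetic scores, top-k size, and Jaccard threshold are assumptions for illustration, not SFI's actual mechanism.

```python
# Toy sketch of slow/fast step scheduling by support stability.
# Scores are synthetic; k=2 and the 0.5 overlap threshold are assumed.

def topk_support(scores, k):
    return set(sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k])

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def schedule_steps(score_stream, k=2, tau=0.5):
    kinds, cached = [], None
    for scores in score_stream:
        support = topk_support(scores, k)
        if cached is not None and jaccard(support, cached) >= tau:
            kinds.append("fast")   # support stable: cheap step on cached support
        else:
            kinds.append("slow")   # support drifted: dense step, refresh cache
            cached = support
    return kinds

stream = [
    [0.9, 0.8, 0.7, 0.1],  # support {0, 1} -> slow (nothing cached yet)
    [0.8, 0.9, 0.6, 0.2],  # support {0, 1} again -> fast
    [0.1, 0.2, 0.9, 0.8],  # support jumps to {2, 3} -> slow
]
print(schedule_steps(stream))  # ['slow', 'fast', 'slow']
```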

📄 HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios
🗓️ Published: 3/12/2026
🔗 http://arxiv.org/abs/2603.11975v1
👥 Authors: Jiayue Pu, Zhongxiang Sun, Zilu Zhang, Xiao Zhang (possible past Tsinghua University affiliation), Jun Xu (possible past Google (United States) affiliation)
Abstract

The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and lack of common sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these spe...

📄 AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization
🗓️ Published: 3/12/2026
🔗 http://arxiv.org/abs/2603.11873v1
👥 Authors: Qiyang Li, Rui Kong, Yuchen Li, Hengyi Cai, Shuaiqiang Wang (possible past Baidu (China) affiliation), Linghe Kong, Guihai Chen (possible past Shanghai Jiao Tong University affiliation), Dawei Yin (possible past Baidu (China) affiliation)
Abstract

The integration of dynamic, sparse structures like Mixture-of-Experts (MoE) with parameter-efficient adapters (e.g., LoRA) is a powerful technique for enhancing Large Language Models (LLMs). However, this architectural enhancement comes at a steep cost: despite minimal increases in computational load, the inference latency often skyrockets, leading to decoding speeds slowing by over 2.5 times. Through a fine-grained performance analysis, we pinpoint the primary bottleneck not in the computation ...
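
The pre-gating idea can be sketched as follows: decide every token's adapter before the layer runs, then group tokens so each adapter fires once on a batch instead of being dispatched per token. The argmax gate and the stand-in adapters are assumptions; AdaFuse's fused kernels are not modeled here.

```python
# Toy sketch of token-level pre-gating for dynamic adapters: routing
# decisions are computed up front, so each adapter runs one batched
# call. Gating by argmax score is an assumed simplification.
from collections import defaultdict

def pre_gate(gate_scores):
    """gate_scores[t][e] = score of adapter e for token t."""
    groups = defaultdict(list)
    for t, scores in enumerate(gate_scores):
        e = max(range(len(scores)), key=scores.__getitem__)
        groups[e].append(t)
    return dict(groups)

def run_layer(tokens, gate_scores, adapters):
    out = list(tokens)
    for e, idxs in pre_gate(gate_scores).items():
        batch = [tokens[t] for t in idxs]        # one batched call per adapter
        for t, y in zip(idxs, adapters[e](batch)):
            out[t] = y
    return out

adapters = {0: lambda xs: [x + 1 for x in xs],   # stand-ins for LoRA adapters
            1: lambda xs: [x * 10 for x in xs]}
tokens = [1.0, 2.0, 3.0]
scores = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
print(run_layer(tokens, scores, adapters))  # [2.0, 20.0, 4.0]
```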

📄 MedPruner: Training-Free Hierarchical Token Pruning for Efficient 3D Medical Image Understanding in Vision-Language Models
🗓️ Published: 3/12/2026
🔗 http://arxiv.org/abs/2603.11625v1
👥 Authors: Shengyuan Liu, Zanting Ye, Yunrui Lin, Chen Hu, Wanting Geng, Xu Han (possible past Tsinghua University affiliation), Bulat Ibragimov, Yefeng Zheng (possible past Tencent (China) affiliation), Yixuan Yuan
Abstract

While specialized Medical Vision-Language Models (VLMs) have achieved remarkable success in interpreting 2D and 3D medical modalities, their deployment for 3D volumetric data remains constrained by significant computational inefficiencies. Current architectures typically suffer from massive anatomical redundancy due to the direct concatenation of consecutive 2D slices and lack the flexibility to handle heterogeneous information densities across different slices using fixed pruning ratios. To add...
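
The "heterogeneous information density" point can be sketched with a toy pruner that allocates a token budget across slices in proportion to a density proxy, rather than using one fixed ratio. Using per-slice variance as the density proxy and magnitude as the token score are assumptions for illustration only.

```python
# Toy sketch of density-adaptive token pruning across slices: slices
# with higher variance (an ASSUMED proxy for information density) keep
# more of their tokens; near-uniform slices are pruned hardest.
from statistics import pvariance

def prune_slices(slices, min_keep=1, budget_ratio=0.5):
    total = sum(len(s) for s in slices)
    budget = max(min_keep * len(slices), int(total * budget_ratio))
    dens = [pvariance(s) if len(s) > 1 else 0.0 for s in slices]
    z = sum(dens) or 1.0                      # avoid division by zero
    kept = []
    for s, d in zip(slices, dens):
        k = min(len(s), max(min_keep, round(budget * d / z)))
        # keep the k highest-magnitude tokens within the slice
        kept.append(sorted(s, key=abs, reverse=True)[:k])
    return kept

slices = [[0.1, 0.1, 0.1, 0.1],       # near-uniform slice: low density
          [5.0, -4.0, 0.2, 3.0]]      # high-variance slice: high density
kept = prune_slices(slices)
print([len(k) for k in kept])  # [1, 4]
```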

📄 The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning
🗓️ Published: 3/11/2026
🔗 http://arxiv.org/abs/2603.11266v1
👥 Authors: Raj Sanjay Shah, Jing Huang (possible past Meta (United States) affiliation), Keerthiram Murugesan, Nathalie Baracaldo, Diyi Yang (possible past Stanford University affiliation)
Abstract

Unlearning in Large Language Models (LLMs) aims to enhance safety, mitigate biases, and comply with legal mandates, such as the right to be forgotten. However, existing unlearning methods are brittle: minor query modifications, such as multi-hop reasoning and entity aliasing, can recover supposedly forgotten information. As a result, current evaluation metrics often create an illusion of effectiveness, failing to detect these vulnerabilities due to reliance on static, unstructured benchmarks. We...
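
The "mirage" can be illustrated with a deliberately naive toy: an unlearning step that only blocks the canonical phrasing passes a static probe but leaks under entity aliasing. The lookup-table "model" and the alias set are invented for illustration; real dynamic evaluation would generate such perturbations systematically.

```python
# Toy illustration of the unlearning mirage: blocking one phrasing
# looks perfect statically, yet an aliased query recovers the fact.
# Everything here (facts, aliases, blocking rule) is hypothetical.

facts = {"capital of NL": "Amsterdam", "capital of FR": "Paris"}
blocked = {"capital of NL"}           # naive "unlearning": block one phrasing

def model(query):
    return "I don't know" if query in blocked else facts.get(query, "I don't know")

def dynamic_eval(canonical, aliases, forgotten_answer):
    """Return the perturbed queries through which the answer still leaks."""
    return [q for q in [canonical, *aliases] if model(q) == forgotten_answer]

aliases = ["capital of the Netherlands"]
facts["capital of the Netherlands"] = "Amsterdam"  # aliased route to the fact
print(dynamic_eval("capital of NL", aliases, "Amsterdam"))
```

A static evaluation that asks only the canonical query would report the fact as forgotten; the dynamic probe above surfaces the alias leak.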

📄 Mind the Sim2Real Gap in User Simulation for Agentic Tasks
🗓️ Published: 3/11/2026
🔗 http://arxiv.org/abs/2603.11245v1
👥 Authors: Xuhui Zhou, Weiwei Sun, Qianou Ma, Yiqing Xie, Jiarui Liu, Weihua Du, Sean Welleck, Yiming Yang (possible past Microsoft (United States) affiliation), Graham Neubig (possible past Carnegie Mellon University affiliation), Sherry Tongshuang Wu, Maarten Sap
Abstract

As NLP evaluation shifts from static benchmarks to multi-turn interactive settings, LLM-based simulators have become widely used as user proxies, serving two roles: generating user turns and providing evaluation signals. Yet, these simulations are frequently assumed to be faithful to real human behaviors, often without rigorous verification. We formalize the Sim2Real gap in user simulation and present the first study running the full $\tau$-bench protocol with real humans (451 participants, 165 tas...

📄 COMIC: Agentic Sketch Comedy Generation
🗓️ Published: 3/11/2026
🔗 http://arxiv.org/abs/2603.11048v1
👥 Authors: Susung Hong, Brian Curless (possible past University Of Washington affiliation), Ira Kemelmacher-Shlizerman (possible past University Of Washington affiliation), Steve Seitz (possible past Google (United States) affiliation)
Abstract

We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction of LLM critics aligned with real viewer preferences through the analysis of a c...

📄 Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models
🗓️ Published: 3/11/2026
🔗 http://arxiv.org/abs/2603.10887v1
👥 Authors: Yixiu Mao, Yun Qu, Qi Wang (possible past Tsinghua University affiliation), Heming Zou, Xiangyang Ji (possible past Tsinghua University affiliation)
Abstract

Reinforcement learning (RL) finetuning has become a key technique for enhancing the reasoning abilities of large language models (LLMs). However, its effectiveness critically depends on the selection of training data. Recent advances underscore the importance of online prompt selection methods, which typically concentrate training on partially solved or moderately challenging examples under the current policy, thereby yielding more effective model updates. While significantly accelerating RL fin...
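
The online prompt-selection baseline the abstract builds on can be sketched in a few lines: keep prompts whose empirical pass rate under the current policy is moderate, since fully solved or unsolved prompts yield near-zero learning signal. The [0.2, 0.8] band is an assumed hyperparameter, not taken from the paper.

```python
# Toy sketch of online prompt selection by pass-rate band: prompts
# that are neither solved nor hopeless under the current policy are
# kept for the next RL batch. The band edges are assumptions.

def select_prompts(pass_rates, lo=0.2, hi=0.8):
    return [p for p, r in pass_rates.items() if lo <= r <= hi]

pass_rates = {"easy": 1.0, "medium": 0.5, "hard": 0.25, "unsolved": 0.0}
print(select_prompts(pass_rates))  # ['medium', 'hard']
```

A dynamics-predictive variant, as the title suggests, would estimate how these pass rates evolve rather than reacting to the current snapshot alone.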

📄 Towards Cold-Start Drafting and Continual Refining: A Value-Driven Memory Approach with Application to NPU Kernel Synthesis
🗓️ Published: 3/11/2026
🔗 http://arxiv.org/abs/2603.10846v1
👥 Authors: Yujie Zheng, Zhuo Li, Shengtao Zhang, Hanjing Wang, Junjie Sheng, Jiaqian Wang, Junchi Yan (possible past Shanghai Jiao Tong University affiliation), Weinan Zhang (possible past Shanghai Jiao Tong University affiliation), Ying Wen, Bo Tang, Muning Wen
Abstract

Deploying Large Language Models to data-scarce programming domains poses significant challenges, particularly for kernel synthesis on emerging Domain-Specific Architectures where a "Data Wall" limits available training data. While models excel on data-rich platforms like CUDA, they suffer catastrophic performance drops on data-scarce ecosystems such as NPU programming. To overcome this cold-start barrier without expensive fine-tuning, we introduce EvoKernel, a self-evolving agentic framework tha...

📄 Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training
🗓️ Published: 3/12/2026
🔗 http://arxiv.org/abs/2603.12255v1
👥 Authors: Fangfu Liu, Diankun Wu, Jiawei Chi, Yimo Cai, Yi-Hsin Hung, Xumin Yu, Hao Li (possible past Tsinghua University affiliation), Han Hu, Yongming Rao, Yueqi Duan (possible past Stanford University affiliation)
Abstract

Humans perceive and understand real-world spaces through a stream of visual observations. Therefore, the ability to streamingly maintain and update spatial evidence from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial-TTT towards streaming visual-based spatial intelligence with test-time training (TTT), wh...

📄 Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models
🗓️ Published: 3/12/2026
🔗 http://arxiv.org/abs/2603.12248v1
👥 Authors: Samy Jelassi, Mujin Kwun, Rosie Zhao, Yuanzhi Li (possible past University Of California, Berkeley affiliation), Nicolo Fusi, Yilun Du (possible past Massachusetts Institute Of Technology affiliation), Sham M. Kakade, Carles Domingo-Enrich
Abstract

Cross-entropy (CE) training provides dense and scalable supervision for language models, but it optimizes next-token prediction under teacher forcing rather than sequence-level behavior under model rollouts. We introduce a feature-matching objective for language-model fine-tuning that targets sequence-level statistics of the completion distribution, providing dense semantic feedback without requiring a task-specific verifier or preference model. To optimize this objective efficiently, we propose...
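
The contrast with cross-entropy can be sketched as a moment-matching loss: compare mean feature statistics of model rollouts against those of reference completions, rather than scoring each next token under teacher forcing. The feature map below (per-sequence mean and length) is an invented stand-in for whatever features the method actually uses.

```python
# Toy sketch of sequence-level feature matching: the loss is the
# squared gap between mean feature statistics of rollouts and of
# reference completions. The feature map is a HYPOTHETICAL stand-in.

def features(seq):
    return [sum(seq) / len(seq), float(len(seq))]  # assumed sequence features

def mean_features(batch):
    fs = [features(s) for s in batch]
    return [sum(col) / len(fs) for col in zip(*fs)]

def feature_matching_loss(rollouts, references):
    mr, md = mean_features(rollouts), mean_features(references)
    return sum((a - b) ** 2 for a, b in zip(mr, md))  # squared moment gap

refs = [[1.0, 2.0, 3.0], [2.0, 2.0]]
good = [[2.0, 2.0, 2.0], [1.5, 2.5]]   # different tokens, same statistics
print(feature_matching_loss(good, refs))  # 0.0
```

Note that `good` differs token-by-token from `refs` yet incurs zero loss: supervision acts on distribution-level statistics, not individual next-token predictions.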

📄 Temporal Straightening for Latent Planning
🗓️ Published: 3/12/2026
🔗 http://arxiv.org/abs/2603.12231v1
👥 Authors: Ying Wang (possible past Tsinghua University affiliation), Oumayma Bounou, Gaoyue Zhou, Randall Balestriero, Tim G. J. Rudner, Yann LeCun (possible past Meta (United States) affiliation), Mengye Ren (possible past University Of Toronto affiliation)
Abstract

Learning good representations is essential for latent planning with world models. While pretrained visual encoders produce strong semantic visual features, they are not tailored to planning and contain information irrelevant -- or even detrimental -- to planning. Inspired by the perceptual straightening hypothesis in human visual processing, we introduce temporal straightening to improve representation learning for latent planning. Using a curvature regularizer that encourages locally straighten...
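
A curvature regularizer of this kind can be sketched by penalizing the angle between consecutive displacement vectors along a latent trajectory, so constant-direction (straight) trajectories incur zero penalty. The `1 - cos` form is one common choice assumed here; the paper's exact regularizer may differ.

```python
# Toy curvature penalty on a latent trajectory: sum of (1 - cosine)
# between consecutive displacement vectors. Straight trajectories
# score ~0; sharp turns score up to 2 per corner.
import math

def curvature_penalty(traj):
    def sub(a, b):
        return [x - y for x, y in zip(a, b)]
    def cos(u, v):
        return sum(x * y for x, y in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    diffs = [sub(b, a) for a, b in zip(traj, traj[1:])]
    return sum(1.0 - cos(u, v) for u, v in zip(diffs, diffs[1:]))

straight = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
bent     = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # two right angles
print(curvature_penalty(straight), curvature_penalty(bent))
```

Adding such a term to a representation-learning objective pushes the encoder toward latents whose temporal evolution is locally straight, which is the property the abstract argues helps planning.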

📄 Sharpness-Aware Minimization for Generalized Embedding Learning in Federated Recommendation
🗓️ Published: 3/12/2026
🔗 http://arxiv.org/abs/2603.11503v1
👥 Authors: Fengyuan Yu, Xiaohua Feng, Yuyuan Li, Changwang Zhang (possible past Tencent (China) affiliation), Jun Wang (possible past Tencent (China) affiliation), Chaochao Chen
Abstract

Federated recommender systems enable collaborative model training while keeping user interaction data local and sharing only essential model parameters, thereby mitigating privacy risks. However, existing methods overlook a critical issue, i.e., the stable learning of a generalized item embedding throughout the federated recommender system training process. Item embedding plays a central role in facilitating knowledge sharing across clients. Yet, under the cross-device setting, local data distri...
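
The SAM ingredient itself is standard and can be sketched on a single scalar parameter: perturb the weight toward higher loss within a radius rho, take the gradient at the perturbed point, then update the original weight. The quadratic objective and hyperparameters are illustrative; the paper applies this to item embeddings across federated clients.

```python
# Toy 1-D sharpness-aware minimization (SAM) step: ascend to the
# adversarial point w + rho * sign(g), then descend using the gradient
# measured there. Objective and hyperparameters are illustrative.

def sam_step(w, grad_fn, rho=0.05, lr=0.1):
    g = grad_fn(w)
    eps = rho * (1.0 if g >= 0 else -1.0)  # rho * g / |g| in one dimension
    g_adv = grad_fn(w + eps)               # gradient at the perturbed point
    return w - lr * g_adv

grad = lambda w: 2.0 * (w - 3.0)           # gradient of (w - 3)^2

w = 0.0
for _ in range(50):
    w = sam_step(w, grad)
print(w)  # settles near the minimum at 3.0 (within roughly rho)
```

Because the descent direction is taken at the perturbed point, SAM prefers flat minima; in the federated setting this is what the abstract leverages to stabilize the shared item embedding across heterogeneous clients.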

📄 Meta-Reinforcement Learning with Self-Reflection for Agentic Search
🗓️ Published: 3/11/2026
🔗 http://arxiv.org/abs/2603.11327v1
👥 Authors: Teng Xiao, Yige Yuan, Hamish Ivison, Huaisheng Zhu, Faeze Brahman, Nathan Lambert (possible past University Of California, Berkeley affiliation), Pradeep Dasigi, Noah A. Smith (possible past University Of Washington affiliation), Hannaneh Hajishirzi (possible past University Of Washington affiliation)
Abstract

This paper introduces MR-Search, an in-context meta reinforcement learning (RL) formulation for agentic search with self-reflection. Instead of optimizing a policy within a single independent episode with sparse rewards, MR-Search trains a policy that conditions on past episodes and adapts its search strategy across episodes. MR-Search learns to learn a search strategy with self-reflection, allowing search agents to improve in-context exploration at test-time. Specifically, MR-Search performs cr...

*Notable papers are those with at least two authors from a "big" AI/ML lab.