πŸ“„ Notable* Recent AI/ML arXiv Papers


πŸ“„ PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization
πŸ—“οΈ Published: 3/13/2026
πŸ”— http://arxiv.org/abs/2603.13228v1
πŸ‘₯ Authors: Yangsong Zhang (possible past Tencent (China) affiliation), Anujith Muraleedharan, Rikhat Akizhanov, Abdul Ahad Butt, Gül Varol (possible past University Of Oxford affiliation), Pascal Fua, Fabio Pizzati, Ivan Laptev
Abstract

Recent progress in text-conditioned human motion generation has been largely driven by diffusion models trained on large-scale human motion data. Building on this progress, recent methods attempt to transfer such models to character animation and real robot control by applying a Whole-Body Controller (WBC) that converts diffusion-generated motions into executable trajectories. While WBC trajectories become physically compliant, they may exhibit substantial deviations from the original motion. To a...

πŸ“„ SAW: Toward a Surgical Action World Model via Controllable and Scalable Video Generation
πŸ—“οΈ Published: 3/13/2026
πŸ”— http://arxiv.org/abs/2603.13024v1
πŸ‘₯ Authors: Sampath Rapuri, Lalithkumar Seenivasan, Dominik Schneider, Roger Soberanis-Mukul, Yufan He (possible past Nvidia (United States) affiliation), Hao Ding, Jiru Xu, Chenhao Yu, Chenyan Jing, Pengfei Guo, Daguang Xu (possible past Nvidia (United States) affiliation), Mathias Unberath
Abstract

A surgical world model capable of generating realistic surgical action videos with precise control over tool-tissue interactions can address fundamental challenges in surgical AI and simulation -- from data scarcity and rare event synthesis to bridging the sim-to-real gap for surgical automation. However, current video generation methods, the very core of such surgical world models, require expensive annotations or complex structured intermediates as conditioning signals at inference, limiting t...

πŸ“„ daVinci-Env: Open SWE Environment Synthesis at Scale
πŸ—“οΈ Published: 3/13/2026
πŸ”— http://arxiv.org/abs/2603.13023v1
πŸ‘₯ Authors: Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang, Jie Sun, Ji Zeng, Mohan Jiang, Lin Zhang, Yukun Li (possible past Baidu (China) affiliation), Jiarui Hu, Liming Liu, Jinlong Hou (possible past Tencent (China) affiliation), Pengfei Liu
Abstract

Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. We present OpenSWE, the largest fully transparent framewor...

πŸ“„ ARL-Tangram: Unleash the Resource Efficiency in Agentic Reinforcement Learning
πŸ—“οΈ Published: 3/13/2026
πŸ”— http://arxiv.org/abs/2603.13019v1
πŸ‘₯ Authors: Bangjun Xiao, Yihao Zhao, Xiangwei Deng, Shihua Yu, Yuxing Xiang, Huaqiu Liu, Qiying Wang, Liang Zhao (possible past Baidu (China) affiliation), Hailin Zhang, Xuanzhe Liu, Xin Jin, Fuli Luo (possible past Peking University affiliation)
Abstract

Agentic reinforcement learning (RL) has emerged as a transformative workload in cloud clusters, enabling large language models (LLMs) to solve complex problems through interactions with the real world. However, unlike traditional RL, agentic RL demands substantial external cloud resources, e.g., CPUs for code execution and GPUs for reward models, that exist outside the primary training cluster. Existing agentic RL frameworks typically rely on static over-provisioning, i.e., resources are often tied t...

πŸ“„ Efficient and Interpretable Multi-Agent LLM Routing via Ant Colony Optimization
πŸ—“οΈ Published: 3/13/2026
πŸ”— http://arxiv.org/abs/2603.12933v1
πŸ‘₯ Authors: Xudong Wang, Chaoning Zhang, Jiaquan Zhang, Chenghao Li, Qigan Sun, Sung-Ho Bae, Peng Wang (possible past Peking University affiliation), Ning Xie, Jie Zou, Yang Yang (possible past Tencent (China) affiliation), Hengtao Shen
Abstract

Large Language Model (LLM)-driven Multi-Agent Systems (MAS) have demonstrated strong capability in complex reasoning and tool use, and heterogeneous agent pools further broaden the quality--cost trade-off space. Despite these advances, real-world deployment is often constrained by high inference cost, latency, and limited transparency, which hinders scalable and efficient routing. Existing routing strategies typically rely on expensive LLM-based selectors or static policies, and offer limited co...
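
The pheromone-guided routing idea can be sketched as a toy loop over a heterogeneous agent pool. Everything below (the agent names, relative costs, success rates, and update rule) is an illustrative assumption, not the paper's actual formulation:

```python
import random

# Toy ant-colony-style router over a hypothetical heterogeneous agent pool.
# Agent names, costs, and hyperparameters are illustrative assumptions.
AGENTS = {"small": 1.0, "medium": 3.0, "large": 10.0}  # relative inference cost
pheromone = {a: 1.0 for a in AGENTS}  # shared trail strength per agent

def route(rng, beta=1.0):
    # Selection probability is proportional to pheromone / cost^beta,
    # favoring agents that have recently succeeded at low cost.
    agents = list(AGENTS)
    weights = [pheromone[a] / (AGENTS[a] ** beta) for a in agents]
    return rng.choices(agents, weights=weights, k=1)[0]

def update(agent, success, evaporation=0.1, deposit=0.5):
    # Evaporate all trails, then deposit on the chosen agent if it succeeded.
    for a in pheromone:
        pheromone[a] *= 1.0 - evaporation
    if success:
        pheromone[agent] += deposit

rng = random.Random(0)
for _ in range(100):
    agent = route(rng)
    # Stand-in for task feedback: pretend larger agents succeed more often.
    rates = {"small": 0.4, "medium": 0.7, "large": 0.9}
    update(agent, success=rng.random() < rates[agent])
```

Unlike an LLM-based selector, the routing state here is just a table of trail strengths, which is cheap to evaluate and easy to inspect.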

πŸ“„ Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models
πŸ—“οΈ Published: 3/13/2026
πŸ”— http://arxiv.org/abs/2603.12893v1
πŸ‘₯ Authors: David McAllister, Miika Aittala (possible past Massachusetts Institute Of Technology affiliation), Tero Karras (possible past Nvidia (United States) affiliation), Janne Hellsten, Angjoo Kanazawa (possible past University Of California, Berkeley affiliation), Timo Aila, Samuli Laine (possible past Nvidia (United States) affiliation)
Abstract

Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward signals to explicitly improve desirable aspects such as image quality and prompt alignment. In this paper, we propose an online RL variant that reduces the variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. Unlike existing methods that treat each sampling step...
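
The paired-trajectory idea resembles a classic two-point (antithetic) finite-difference gradient estimator. The sketch below shows that generic estimator on a toy quadratic reward; the reward function, parameter vector, and step sizes are stand-ins, not the paper's flow-based formulation:

```python
import random

# Two-point finite-difference estimation of a reward gradient over paired
# perturbations. The reward and parameterization are toy stand-ins.
rng = random.Random(1)

def reward(x):
    # Toy reward peaked at x_i = 3 for every coordinate.
    return -sum((xi - 3.0) ** 2 for xi in x)

def fd_step(x, sigma=0.1, lr=0.05):
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    plus = [xi + sigma * ei for xi, ei in zip(x, eps)]
    minus = [xi - sigma * ei for xi, ei in zip(x, eps)]
    # Antithetic pair: move along eps if the '+' perturbation scored higher.
    scale = lr * (reward(plus) - reward(minus)) / (2.0 * sigma)
    return [xi + scale * ei for xi, ei in zip(x, eps)]

x = [0.0] * 4
for _ in range(500):
    x = fd_step(x)
# After a few hundred paired evaluations, x drifts toward the peak at 3.
```

Because both perturbations share the same noise direction, the difference of rewards cancels much of the variance a single-sample estimator would incur.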

πŸ“„ Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation
πŸ—“οΈ Published: 3/13/2026
πŸ”— http://arxiv.org/abs/2603.12793v1
πŸ‘₯ Authors: Yichen Zhang, Da Peng, Zonghao Guo, Zijian Zhang, Xuesong Yang, Tong Sun, Shichu Sun, Yidan Zhang, Yanghao Li (possible past Meta (United States) affiliation), Haiyan Zhao, Wang Xu, Qi Shi, Yangang Sun, Chi Chen, Shuo Wang (possible past Nvidia (United States) affiliation), Yukun Yan, Xu Han (possible past Tsinghua University affiliation), Qiang Ma, Wei Ke, Liang Wang (possible past Tencent (China) affiliation), Zhiyuan Liu (possible past Tsinghua University affiliation), Maosong Sun (possible past Tsinghua University affiliation)
Abstract

A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image g...

πŸ“„ MoKus: Leveraging Cross-Modal Knowledge Transfer for Knowledge-Aware Concept Customization
πŸ—“οΈ Published: 3/13/2026
πŸ”— http://arxiv.org/abs/2603.12743v1
πŸ‘₯ Authors: Chenyang Zhu, Hongxiang Li, Xiu Li (possible past Tsinghua University affiliation), Long Chen (possible past Tencent (China) affiliation)
Abstract

Concept customization typically binds rare tokens to a target concept. Unfortunately, these approaches often suffer from unstable performance as the pretraining data seldom contains these rare tokens. Meanwhile, these rare tokens fail to convey the inherent knowledge of the target concept. Consequently, we introduce Knowledge-aware Concept Customization, a novel task aiming at binding diverse textual knowledge to target visual concepts. This task requires the model to identify the knowledge with...

πŸ“„ LightMoE: Reducing Mixture-of-Experts Redundancy through Expert Replacing
πŸ—“οΈ Published: 3/13/2026
πŸ”— http://arxiv.org/abs/2603.12645v1
πŸ‘₯ Authors: Jiawei Hao, Zhiwei Hao, Jianyuan Guo, Li Shen (possible past Tencent (China) affiliation), Yong Luo (possible past Tsinghua University affiliation), Han Hu, Dan Zeng
Abstract

Mixture-of-Experts (MoE) based Large Language Models (LLMs) have demonstrated impressive performance and computational efficiency. However, their deployment is often constrained by substantial memory demands, primarily due to the need to load numerous expert modules. While existing expert compression techniques like pruning or merging attempt to mitigate this, they often suffer from irreversible knowledge loss or high training overhead. In this paper, we propose a novel expert compression paradi...

πŸ“„ Towards unified brain-to-text decoding across speech production and perception
πŸ—“οΈ Published: 3/13/2026
πŸ”— http://arxiv.org/abs/2603.12628v1
πŸ‘₯ Authors: Zhizhang Yuan, Yang Yang (possible past Tencent (China) affiliation), Gaorui Zhang, Baowen Cheng, Zehan Wu, Yuhao Xu, Xiaoying Liu, Liang Chen (possible past Google (United States) affiliation), Ying Mao, Meng Li (possible past Meta (United States) affiliation)
Abstract

Speech production and perception are the main ways humans communicate daily. Prior brain-to-text decoding studies have largely focused on a single modality and alphabetic languages. Here, we present a unified brain-to-sentence decoding framework for both speech production and perception in Mandarin Chinese. The framework exhibits strong generalization ability, enabling sentence-level decoding when trained only on single-character data and supporting characters and syllables unseen during trainin...

πŸ“„ Test-Time Strategies for More Efficient and Accurate Agentic RAG
πŸ—“οΈ Published: 3/12/2026
πŸ”— http://arxiv.org/abs/2603.12396v1
πŸ‘₯ Authors: Brian Zhang (possible past DeepMind (United Kingdom) affiliation), Deepti Guntur, Zhiyang Zuo, Abhinav Sharma (possible past Stanford University affiliation), Shreyas Chaudhari, Wenlong Zhao, Franck Dernoncourt, Puneet Mathur, Ryan Rossi, Nedim Lipka
Abstract

Retrieval-Augmented Generation (RAG) systems face challenges with complex, multihop questions, and agentic frameworks such as Search-R1 (Jin et al., 2025), which operates iteratively, have been proposed to address these complexities. However, such approaches can introduce inefficiencies, including repetitive retrieval of previously processed information and challenges in contextualizing retrieved results effectively within the current generation prompt. Such issues can lead to unnecessary retrie...

πŸ“„ Efficient Reasoning with Balanced Thinking
πŸ—“οΈ Published: 3/12/2026
πŸ”— http://arxiv.org/abs/2603.12372v1
πŸ‘₯ Authors: Yulin Li (possible past Baidu (China) affiliation), Tengyao Tu, Li Ding, Junjie Wang, Huiling Zhen, Yixin Chen, Yong Li (possible past Tsinghua University affiliation), Zhuotao Tian
Abstract

Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reas...

πŸ“„ Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
πŸ—“οΈ Published: 3/12/2026
πŸ”— http://arxiv.org/abs/2603.12246v1
πŸ‘₯ Authors: Yixin Liu, Yue Yu, Dijia Su, Sid Wang, Xuewei Wang, Song Jiang (possible past Peking University affiliation), Bo Liu (possible past Meta (United States) affiliation), Arman Cohan, Yuandong Tian (possible past OpenAI (United States) affiliation), Zhengxing Chen
Abstract

Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, provide a promising path for extending the success of reasoning models to non-verifiable domains where the output correctness/quality cannot be directly checked. However, while reasoning judges have shown better performance on static evaluation benchmarks, their effectiveness in actual policy training has not been systematically examined. Therefore, we conduct a rigorous study to investigate the actual impact of non-reasoni...

πŸ“„ Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
πŸ—“οΈ Published: 3/12/2026
πŸ”— http://arxiv.org/abs/2603.12180v1
πŸ‘₯ Authors: Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang (possible past Tencent (China) affiliation), Anupam Datta (possible past Carnegie Mellon University affiliation)
Abstract

Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic ...

πŸ“„ IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL
πŸ—“οΈ Published: 3/12/2026
πŸ”— http://arxiv.org/abs/2603.12151v1
πŸ‘₯ Authors: Zhoujun Cheng, Yutao Xie, Yuxiao Qu, Amrith Setlur, Shibo Hao, Varad Pimpalkhute, Tongtong Liang, Feng Yao, Zhengzhong Liu (possible past Tencent (China) affiliation), Eric Xing, Virginia Smith (possible past Carnegie Mellon University affiliation), Ruslan Salakhutdinov (possible past University Of Toronto affiliation), Zhiting Hu, Taylor Killian, Aviral Kumar (possible past University Of California, Berkeley affiliation)
Abstract

While scaling laws guide compute allocation for LLM pre-training, analogous prescriptions for reinforcement learning (RL) post-training of large language models (LLMs) remain poorly understood. We study the compute-optimal allocation of sampling compute for on-policy RL methods in LLMs, framing scaling as a compute-constrained optimization over three resources: parallel rollouts per problem, number of problems per batch, and number of update steps. We find that the compute-optimal number of para...
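
The three-resource framing lends itself to simple budget accounting. The sketch below is back-of-the-envelope arithmetic only; the token and FLOP figures are illustrative assumptions, not numbers from the paper:

```python
# Sampling-compute accounting over the three allocation knobs named in the
# abstract: rollouts per problem, problems per batch, and update steps.
# Token and FLOP figures are illustrative assumptions.
def sampling_compute(rollouts_per_problem, problems_per_batch, update_steps,
                     tokens_per_rollout=2048, flops_per_token=1.0):
    # Total sampling cost is the product of the three allocation knobs
    # times the per-rollout generation cost.
    return (rollouts_per_problem * problems_per_batch * update_steps
            * tokens_per_rollout * flops_per_token)

# Two allocations under the same budget: wide (many rollouts per problem)
# versus broad (many problems per batch).
wide = sampling_compute(rollouts_per_problem=64, problems_per_batch=16, update_steps=100)
broad = sampling_compute(rollouts_per_problem=16, problems_per_batch=64, update_steps=100)
assert wide == broad  # iso-compute: the product, not the split, fixes the budget
```

The compute-optimal question is then which point on such an iso-compute surface trains best, which is what the paper studies empirically.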

πŸ“„ Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability
πŸ—“οΈ Published: 3/12/2026
πŸ”— http://arxiv.org/abs/2603.12038v1
πŸ‘₯ Authors: Xingyu Xie, Zhaochen Yu, Yue Liao, Tao Wang (possible past Stanford University affiliation), Kim-Chuan Toh, Shuicheng Yan (possible past National University Of Singapore affiliation)
Abstract

Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense...
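
The notion of a stable attention support can be illustrated with a toy check on the top-k attended positions across consecutive steps. The Jaccard threshold and decision rule are illustrative assumptions, not SFI's actual criterion:

```python
# Toy check of within-span attention-support stability. If the top-k
# attended positions barely change between steps, a cheap "fast" step could
# reuse the cached support; otherwise a dense "slow" step recomputes it.
# The Jaccard threshold is an illustrative assumption.
def topk_support(weights, k):
    # Indices of the k largest attention weights.
    return set(sorted(range(len(weights)), key=lambda i: -weights[i])[:k])

def support_stable(prev, curr, k=4, min_jaccard=0.75):
    a, b = topk_support(prev, k), topk_support(curr, k)
    return len(a & b) / len(a | b) >= min_jaccard

prev = [0.30, 0.25, 0.20, 0.15, 0.05, 0.05]  # attention at step t
curr = [0.28, 0.27, 0.18, 0.17, 0.06, 0.04]  # attention at step t+1
assert support_stable(prev, curr)  # same top-4 positions: take a fast step
```

A shift in the dominant support (say, at a sentence boundary) would drop the overlap below the threshold and trigger a dense step.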

πŸ“„ HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios
πŸ—“οΈ Published: 3/12/2026
πŸ”— http://arxiv.org/abs/2603.11975v2
πŸ‘₯ Authors: Jiayue Pu, Zhongxiang Sun, Zilu Zhang, Xiao Zhang (possible past Tsinghua University affiliation), Jun Xu (possible past Google (United States) affiliation)
Abstract

The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and lack of common sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these spe...

πŸ“„ AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization
πŸ—“οΈ Published: 3/12/2026
πŸ”— http://arxiv.org/abs/2603.11873v1
πŸ‘₯ Authors: Qiyang Li, Rui Kong, Yuchen Li, Hengyi Cai, Shuaiqiang Wang (possible past Baidu (China) affiliation), Linghe Kong, Guihai Chen (possible past Shanghai Jiao Tong University affiliation), Dawei Yin (possible past Baidu (China) affiliation)
Abstract

The integration of dynamic, sparse structures like Mixture-of-Experts (MoE) with parameter-efficient adapters (e.g., LoRA) is a powerful technique for enhancing Large Language Models (LLMs). However, this architectural enhancement comes at a steep cost: despite a minimal increase in computational load, inference latency often skyrockets, slowing decoding by more than 2.5x. Through a fine-grained performance analysis, we pinpoint the primary bottleneck not in the computation ...

πŸ“„ Representation Learning for Spatiotemporal Physical Systems
πŸ—“οΈ Published: 3/13/2026
πŸ”— http://arxiv.org/abs/2603.13227v1
πŸ‘₯ Authors: Helen Qu, Rudy Morel, Michael McCabe, Alberto Bietti, François Lanusse, Shirley Ho (possible past Carnegie Mellon University affiliation), Yann LeCun (possible past Meta (United States) affiliation)
Abstract

Machine learning approaches to spatiotemporal physical systems have primarily focused on next-frame prediction, with the goal of learning an accurate emulator for the system's evolution in time. However, these emulators are computationally expensive to train and are subject to performance pitfalls, such as compounding errors during autoregressive rollout. In this work, we take a different perspective and look at scientific tasks further downstream of predicting the next frame, such as estimation...

πŸ“„ 3DTCR: A Physics-Based Generative Framework for Vortex-Following 3D Reconstruction to Improve Tropical Cyclone Intensity Forecasting
πŸ—“οΈ Published: 3/13/2026
πŸ”— http://arxiv.org/abs/2603.13049v1
πŸ‘₯ Authors: Jun Liu (possible past Tencent (China) affiliation), Xiaohui Zhong, Kai Zheng, Jiarui Li, Yifei Li, Tao Zhou, Wenxu Qian, Shun Dai, Ruian Tie, Yangyang Zhao, Hao Li (possible past Tsinghua University affiliation)
Abstract

Tropical cyclone (TC) intensity forecasting remains challenging as current numerical and AI-based weather models fail to satisfactorily represent extreme TC structure and intensity. Although intensity time-series forecasting has achieved significant advances, it outputs intensity sequences rather than the three-dimensional inner-core fine-scale structure and physical mechanisms governing TC evolution. High-resolution numerical simulations can capture these features but remain computationally exp...

πŸ“„ SLICE: Semantic Latent Injection via Compartmentalized Embedding for Image Watermarking
πŸ—“οΈ Published: 3/13/2026
πŸ”— http://arxiv.org/abs/2603.12749v1
πŸ‘₯ Authors: Zheng Gao, Yifan Yang (possible past Tencent (China) affiliation), Xiaoyu Li (possible past Tencent (China) affiliation), Xiaoyan Feng, Haoran Fan, Yang Song (possible past Stanford University affiliation), Jiaojiao Jiang
Abstract

Watermarking the initial noise of diffusion models has emerged as a promising approach for image provenance, but content-independent noise patterns can be forged via inversion and regeneration attacks. Recent semantic-aware watermarking methods improve robustness by conditioning verification on image semantics. However, their reliance on a single global semantic binding makes them vulnerable to localized but globally coherent semantic edits. To address this limitation and provide a trustworthy s...

πŸ“„ Learning Pore-scale Multiphase Flow from 4D Velocimetry
πŸ—“οΈ Published: 3/12/2026
πŸ”— http://arxiv.org/abs/2603.12516v1
πŸ‘₯ Authors: Chunyang Wang, Linqi Zhu, Yuxuan Gu, Robert Van Der Merwe, Xin Ju, Catherine Spurin, Samuel Krevor, Rex Ying (possible past Stanford University affiliation), Tobias Pfaff (possible past Google (United States) affiliation), Martin J. Blunt, Tom Bultreys, Gege Wen
Abstract

Multiphase flow in porous media underpins subsurface energy and environmental technologies, including geological CO$_2$ storage and underground hydrogen storage, yet pore-scale dynamics in realistic three-dimensional materials remain difficult to characterize and predict. Here we introduce a multimodal learning framework that infers multiphase pore-scale flow directly from time-resolved four-dimensional (4D) micro-velocimetry measurements. The model couples a graph network simulator for Lagrangi...

πŸ“„ Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training
πŸ—“οΈ Published: 3/12/2026
πŸ”— http://arxiv.org/abs/2603.12255v1
πŸ‘₯ Authors: Fangfu Liu, Diankun Wu, Jiawei Chi, Yimo Cai, Yi-Hsin Hung, Xumin Yu, Hao Li (possible past Tsinghua University affiliation), Han Hu, Yongming Rao, Yueqi Duan (possible past Stanford University affiliation)
Abstract

Humans perceive and understand real-world spaces through a stream of visual observations. Therefore, the ability to continuously maintain and update spatial evidence from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial-TTT towards streaming visual-based spatial intelligence with test-time training (TTT), wh...

πŸ“„ Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models
πŸ—“οΈ Published: 3/12/2026
πŸ”— http://arxiv.org/abs/2603.12248v1
πŸ‘₯ Authors: Samy Jelassi, Mujin Kwun, Rosie Zhao, Yuanzhi Li (possible past University Of California, Berkeley affiliation), Nicolo Fusi, Yilun Du (possible past Massachusetts Institute Of Technology affiliation), Sham M. Kakade, Carles Domingo-Enrich
Abstract

Cross-entropy (CE) training provides dense and scalable supervision for language models, but it optimizes next-token prediction under teacher forcing rather than sequence-level behavior under model rollouts. We introduce a feature-matching objective for language-model fine-tuning that targets sequence-level statistics of the completion distribution, providing dense semantic feedback without requiring a task-specific verifier or preference model. To optimize this objective efficiently, we propose...
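
The contrast with per-token cross-entropy can be illustrated by a toy feature-matching loss that compares batch-level feature statistics of model rollouts against reference completions. The feature map, data, and statistic below are synthetic stand-ins, not the paper's objective:

```python
import random

# Toy feature-matching objective: match batch-level feature statistics of
# "rollouts" to those of "reference" completions. Data are synthetic.
rng = random.Random(0)

def feature_stats(batch):
    # Sequence-level statistic: per-dimension mean over the batch.
    dim = len(batch[0])
    return [sum(row[d] for row in batch) / len(batch) for d in range(dim)]

def matching_loss(reference, rollouts):
    diff = [a - b for a, b in zip(feature_stats(reference), feature_stats(rollouts))]
    return sum(d * d for d in diff)  # squared distance between statistics

reference = [[rng.gauss(1.0, 1.0) for _ in range(8)] for _ in range(128)]
rollouts = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(128)]

loss = matching_loss(reference, rollouts)  # large while statistics differ
```

The supervision signal here depends on aggregate properties of whole completions rather than on matching each next token under teacher forcing.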

πŸ“„ Temporal Straightening for Latent Planning
πŸ—“οΈ Published: 3/12/2026
πŸ”— http://arxiv.org/abs/2603.12231v1
πŸ‘₯ Authors: Ying Wang (possible past Tsinghua University affiliation), Oumayma Bounou, Gaoyue Zhou, Randall Balestriero, Tim G. J. Rudner, Yann LeCun (possible past Meta (United States) affiliation), Mengye Ren (possible past University Of Toronto affiliation)
Abstract

Learning good representations is essential for latent planning with world models. While pretrained visual encoders produce strong semantic visual features, they are not tailored to planning and contain information irrelevant -- or even detrimental -- to planning. Inspired by the perceptual straightening hypothesis in human visual processing, we introduce temporal straightening to improve representation learning for latent planning. Using a curvature regularizer that encourages locally straighten...

*Notable papers are those with at least two authors from a "big" AI/ML lab.