📄 Notable* Recent AI/ML arXiv Papers


📄 MoEless: Efficient MoE LLM Serving via Serverless Computing
🗓️ Published: 3/6/2026
🔗 http://arxiv.org/abs/2603.06350v1
👥 Authors: Hanfei Yu, Bei Ouyang, Shwai He, Ang Li (possible past Google (United States) affiliation), Hao Wang (possible past Tsinghua University affiliation)
Abstract

Large Language Models (LLMs) have become a cornerstone of AI, driving progress across diverse domains such as content creation, search and recommendation systems, and AI-assisted workflows. To alleviate extreme training costs and advancing model scales, Mixture-of-Experts (MoE) has become a popular backbone for modern LLMs, which are commonly served in distributed deployment using expert parallelism (EP). However, MoE's sparse activation mechanism leads to severe expert load imbalance, where a f...

📄 Offline Materials Optimization with CliqueFlowmer
🗓️ Published: 3/6/2026
🔗 http://arxiv.org/abs/2603.06082v1
👥 Authors: Jakub Grudzien Kuba, Benjamin Kurt Miller, Sergey Levine (possible past University Of Washington affiliation), Pieter Abbeel (possible past University Of California, Berkeley affiliation)
Abstract

Recent advances in deep learning inspired neural network-based approaches to computational materials discovery (CMD). A plethora of problems in this field involve finding materials that optimize a target property. Nevertheless, the increasingly popular generative modeling methods are ineffective at boldly exploring attractive regions of the materials space due to their maximum likelihood training. In this work, we offer an alternative CMD technique based on offline model-based optimization (MBO)...

📄 MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing
🗓️ Published: 3/6/2026
🔗 http://arxiv.org/abs/2603.06007v1
👥 Authors: Yang Liu (possible past Tsinghua University affiliation), Jinxuan Cai, Yishen Li, Qi Meng, Zedi Liu, Xin Li (possible past Google (United States) affiliation), Chen Qian (possible past Shanghai Jiao Tong University affiliation), Chuan Shi, Cheng Yang (possible past Tsinghua University affiliation)
Abstract

Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents/sub-workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heter...

📄 Skeleton-to-Image Encoding: Enabling Skeleton Representation Learning via Vision-Pretrained Models
🗓️ Published: 3/6/2026
🔗 http://arxiv.org/abs/2603.05963v1
👥 Authors: Siyuan Yang, Jun Liu (possible past Tencent (China) affiliation), Hao Cheng (possible past Tencent (China) affiliation), Chong Wang (possible past Google (United States) affiliation), Shijian Lu, Hedvig Kjellstrom, Weisi Lin, Alex C. Kot
Abstract

Recent advances in large-scale pretrained vision models have demonstrated impressive capabilities across a wide range of downstream tasks, including cross-modal and multi-modal scenarios. However, their direct application to 3D human skeleton data remains challenging due to fundamental differences in data format. Moreover, the scarcity of large-scale skeleton datasets and the need to incorporate skeleton data into multi-modal action recognition without introducing additional model branches prese...

📄 CORE-Seg: Reasoning-Driven Segmentation for Complex Lesions via Reinforcement Learning
🗓️ Published: 3/6/2026
🔗 http://arxiv.org/abs/2603.05911v1
👥 Authors: Yuxin Xie, Yuming Chen (possible past University Of Washington affiliation), Yishan Yang, Yi Zhou, Tao Zhou, Zhen Zhao, Jiacheng Liu, Huazhu Fu (possible past Inception Institute Of Artificial Intelligence affiliation)
Abstract

Medical image segmentation is undergoing a paradigm shift from conventional visual pattern matching to cognitive reasoning analysis. Although Multimodal Large Language Models (MLLMs) have shown promise in integrating linguistic and visual knowledge, significant gaps remain: existing general MLLMs possess broad common sense but lack the specialized visual reasoning required for complex lesions, whereas traditional segmentation models excel at pixel-level segmentation but lack logical interpretabi...

📄 The World Won't Stay Still: Programmable Evolution for Agent Benchmarks
🗓️ Published: 3/6/2026
🔗 http://arxiv.org/abs/2603.05910v1
👥 Authors: Guangrui Li, Yaochen Xie, Yi Liu (possible past Google (United States) affiliation), Ziwei Dong, Xingyuan Pan, Tianqi Zheng, Jason Choi, Michael J. Morais (possible past Amazon (United States) affiliation), Binit Jha, Shaunak Mishra, Bingrou Zhou, Chen Luo, Monica Xiao Cheng, Dawn Song (possible past University Of California, Berkeley affiliation)
Abstract

LLM-powered agents fulfill user requests by interacting with environments, querying data, and invoking tools in a multi-turn process. Yet, most existing benchmarks assume static environments with fixed schemas and toolsets, neglecting the evolutionary nature of the real world and agents' robustness to environmental changes. In this paper, we study a crucial problem: how to evolve the agent environment in a scalable and controllable way, thereby better evaluating agents' adaptability to real-worl...

📄 Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning
🗓️ Published: 3/6/2026
🔗 http://arxiv.org/abs/2603.05900v1
👥 Authors: Xuan Li (possible past Baidu (China) affiliation), Zhanke Zhou, Zongze Li, Jiangchao Yao, Yu Rong (possible past Tencent (China) affiliation), Lu Zhang (possible past Tencent (China) affiliation), Bo Han
Abstract

Large language models (LLMs) benefit substantially from supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) in reasoning tasks. However, these recipes perform poorly in instruction-based molecular optimization, where each data point typically provides only a single optimized reference molecule and no step-by-step optimization trajectory. We reveal that answer-only SFT on the reference molecules collapses reasoning, and RLVR provides sparse feedback under simila...

📄 Computational Pathology in the Era of Emerging Foundation and Agentic AI -- International Expert Perspectives on Clinical Integration and Translational Readiness
🗓️ Published: 3/6/2026
🔗 http://arxiv.org/abs/2603.05884v1
👥 Authors: Qian Da, Yijiang Chen, Min Ju, Zheyi Ji, Albert Zhou, Wenwen Wang, Matthew A Abikenari, Philip Chikontwe, Guillaume Larghero, Bowen Chen, Peter Neiglinger, Dingrong Zhong, Shuhao Wang, Wei Xu (possible past Tencent (China) affiliation), Drew Williamson, German Corredor, Sen Yang (possible past Tencent (China) affiliation), Le Lu, Xiao Han (possible past Tencent (China) affiliation), Kun-Hsing Yu, Jun-Zhou Huang, Laura Barisoni, Geert Litjens, Anant Madabhushi, Lifeng Zhu, Chaofu Wang, Junhan Zhao, Weiguo Hu
Abstract

Recent breakthroughs in artificial intelligence through foundation models and agents have accelerated the evolution of computational pathology. Demonstrated performance gains reported across academia in benchmarking datasets in predictive tasks such as diagnosis, prognosis, and treatment response have ignited substantial enthusiasm for clinical application. Despite this development momentum, real world adoption has lagged, as implementation faces economic, technical, and administrative challenge...

📄 Cultural Perspectives and Expectations for Generative AI: A Global Survey Approach
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.05723v1
👥 Authors: Erin Van Liemt, Renee Shelby, Andrew Smart (possible past Google (United States) affiliation), Sinchana Kumbale, Richard Zhang, Neha Dixit, Qazi Mamunur Rashid, Jamila Smith-Loud (possible past Google (United States) affiliation)
Abstract

There is a lack of empirical evidence about global attitudes around whether and how GenAI should represent cultures. This paper assesses understandings and beliefs about culture as it relates to GenAI from a large-scale global survey. We gathered data about what culture means to different groups, and about how GenAI should approach the representation of cultural artifacts, concepts, or values. We distill working definitions of culture directly from these communities to build an understanding of ...

📄 Reasoning Models Struggle to Control their Chains of Thought
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.05706v1
👥 Authors: Chen Yueh-Han, Robert McCarthy, Bruce W. Lee, He He (possible past Stanford University affiliation), Ian Kivlichan (possible past Google (United States) affiliation), Bowen Baker (possible past OpenAI (United States) affiliation), Micah Carroll, Tomek Korbak
Abstract

Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehaviors and understanding the motivations of modern reasoning models. However, if models can control what they verbalize in their CoT, it could undermine CoT monitorability. To measure this undesirable capability -- CoT controllability -- we introduce the CoT-Control evaluation suite, which includes tasks that require models to solve problems while adhering to CoT instructions, e.g., reasoning about a genetics question with...

📄 RealWonder: Real-Time Physical Action-Conditioned Video Generation
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.05449v1
👥 Authors: Wei Liu (possible past Tsinghua University affiliation), Ziyu Chen, Zizhang Li, Yue Wang, Hong-Xing Yu, Jiajun Wu (possible past Massachusetts Institute Of Technology affiliation)
Abstract

Current video generation models cannot simulate physical consequences of 3D actions like forces and robotic manipulations, as they lack structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is using physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (o...

📄 Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.05308v1
👥 Authors: Qiao Jin, Yin Fang, Lauren He, Yifan Yang (possible past Tencent (China) affiliation), Guangzhi Xiong, Zhizheng Wang, Nicholas Wan, Joey Chan, Donald C. Comeau, Robert Leaman, Charalampos S. Floudas, Aidong Zhang, Michael F. Chiang, Yifan Peng (possible past Stanford University affiliation), Zhiyong Lu
Abstract

Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To efficiently perform biomedical evidence attribution, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data ne...

📄 AI+HW 2035: Shaping the Next Decade
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.05225v1
👥 Authors: Deming Chen, Jason Cong, Azalia Mirhoseini (possible past Google (United States) affiliation), Christos Kozyrakis (possible past Stanford University affiliation), Subhasish Mitra (possible past Stanford University affiliation), Jinjun Xiong, Cliff Young (possible past Google (United States) affiliation), Anima Anandkumar (possible past Nvidia (United States) affiliation), Michael Littman, Aron Kirschen, Sophia Shao, Serge Leef, Naresh Shanbhag, Dejan Milojicic, Michael Schulte, Gert Cauwenberghs, Jerry M. Chow, Tri Dao, Kailash Gopalakrishnan, Richard Ho, Hoshik Kim, Kunle Olukotun, David Z. Pan, Mark Ren, Dan Roth, Aarti Singh (possible past Carnegie Mellon University affiliation), Yizhou Sun, Yusu Wang, Yann LeCun (possible past Meta (United States) affiliation), Ruchir Puri
Abstract

Artificial intelligence (AI) and hardware (HW) are advancing at unprecedented rates, yet their trajectories have become inseparably intertwined. The global research community lacks a cohesive, long-term vision to strategically coordinate the development of AI and HW. This fragmentation constrains progress toward holistic, sustainable, and adaptive AI systems capable of learning, reasoning, and operating efficiently across cloud, edge, and physical environments. The future of AI depends not only ...

📄 KARL: Knowledge Agents via Reinforcement Learning
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.05218v1
👥 Authors: Jonathan D. Chang, Andrew Drozdov, Shubham Toshniwal, Owen Oertell, Alexander Trott, Jacob Portes, Abhay Gupta, Pallavi Koppol, Ashutosh Baheti, Sean Kulinski, Ivan Zhou, Irene Dea, Krista Opsahl-Ong, Simon Favreau-Lessard, Sean Owen, Jose Javier Gonzalez Ortiz, Arnav Singhvi, Xabi Andrade, Cindy Wang, Kartik Sreenivasan, Sam Havens, Jialu Liu (possible past Google (United States) affiliation), Peyton Deniro, Wen Sun, Michael Bendersky (possible past Google (United States) affiliation), Jonathan Frankle (possible past Massachusetts Institute Of Technology affiliation)
Abstract

We present a system for training enterprise search agents via reinforcement learning that achieves state-of-the-art performance across a diverse suite of hard-to-verify agentic search tasks. Our work makes four core contributions. First, we introduce KARLBench, a multi-capability evaluation suite spanning six distinct search regimes, including constraint-driven entity search, cross-document report synthesis, tabular numerical reasoning, exhaustive entity retrieval, procedural reasoning over tech...

📄 UniPAR: A Unified Framework for Pedestrian Attribute Recognition
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.05114v1
👥 Authors: Minghe Xu, Rouying Wu, Jiarui Xu, Minhao Sun, Zikang Yan, Xiao Wang (possible past Google (United States) affiliation), Chiawei Chu, Yu Li (possible past Tencent (China) affiliation)
Abstract

Pedestrian Attribute Recognition is a foundational computer vision task that provides essential support for downstream applications, including person retrieval in video surveillance and intelligent retail analytics. However, existing research is frequently constrained by the "one-model-per-dataset" paradigm and struggles to handle significant discrepancies across domains in terms of modalities, attribute definitions, and environmental scenarios. To address these challenges, we propose UniPAR, a...

📄 Aura: Universal Multi-dimensional Exogenous Integration for Aviation Time Series
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.05092v1
👥 Authors: Jiafeng Lin, Mengren Zheng, Simeng Ye, Yuxuan Wang (possible past Google (United States) affiliation), Huan Zhang, Yuhui Liu, Zhongyi Pei, Jianmin Wang (possible past Tsinghua University affiliation)
Abstract

Time series forecasting has witnessed an increasing demand across diverse industrial applications, where accurate predictions are pivotal for informed decision-making. Beyond numerical time series data, reliable forecasting in practical scenarios requires integrating diverse exogenous factors. Such exogenous information is often multi-dimensional or even multimodal, introducing heterogeneous interactions that unimodal time series models struggle to capture. In this paper, we delve into an aviati...

📄 Efficient, Property-Aligned Fan-Out Retrieval via RL-Compiled Diffusion
🗓️ Published: 3/6/2026
🔗 http://arxiv.org/abs/2603.06397v1
👥 Authors: Pengcheng Jiang, Judith Yue Li, Moonkyung Ryu, R. Lily Hu, Kun Su, Zhong Yi Wan, Liam Hebert, Hao Peng (possible past Tsinghua University affiliation), Jiawei Han (possible past Google (United States) affiliation), Dima Kuzmin, Craig Boutilier (possible past Google (United States) affiliation)
Abstract

Many modern retrieval problems are set-valued: given a broad intent, the system must return a collection of results that optimizes higher-order properties (e.g., diversity, coverage, complementarity, coherence) while remaining grounded with respect to a fixed database. Set-valued objectives are typically non-decomposable and are not captured by existing supervised (query, content) datasets which only prioritize top-1 retrieval. Consequently, fan-out retrieval is often employed to generate ...

📄 Dynamic Momentum Recalibration in Online Gradient Learning
🗓️ Published: 3/6/2026
🔗 http://arxiv.org/abs/2603.06120v1
👥 Authors: Zhipeng Yao, Rui Yu, Guisong Chang, Ying Li (possible past Meta (United States) affiliation), Yu Zhang (possible past Google (United States) affiliation), Dazhou Li
Abstract

Stochastic Gradient Descent (SGD) and its momentum variants form the backbone of deep learning optimization, yet the underlying dynamics of their gradient behavior remain insufficiently understood. In this work, we reinterpret gradient updates through the lens of signal processing and reveal that fixed momentum coefficients inherently distort the balance between bias and variance, leading to skewed or suboptimal parameter updates. To address this, we propose SGDF (SGD with Filter), an optimizer ...

📄 Preventing Learning Stagnation in PPO by Scaling to 1 Million Parallel Environments
🗓️ Published: 3/6/2026
🔗 http://arxiv.org/abs/2603.06009v1
👥 Authors: Michael Beukman, Khimya Khetarpal (possible past DeepMind (United Kingdom) affiliation), Zeyu Zheng, Will Dabney (possible past Google (United States) affiliation), Jakob Foerster (possible past University Of Oxford affiliation), Michael Dennis, Clare Lyle
Abstract

Plateaus, where an agent's performance stagnates at a suboptimal level, are a common problem in deep on-policy RL. Focusing on PPO due to its widespread adoption, we show that plateaus in certain regimes arise not because of known exploration, capacity, or optimization challenges, but because sample-based estimates of the loss eventually become poor proxies for the true objective over the course of training. As a recap, PPO switches between sampling rollouts from several parallel environments on...

📄 Test-Time Adaptation via Many-Shot Prompting: Benefits, Limits, and Pitfalls
🗓️ Published: 3/6/2026
🔗 http://arxiv.org/abs/2603.05829v1
👥 Authors: Shubhangi Upasani, Chen Wu (possible past Google (United States) affiliation), Jay Rainton, Bo Li (possible past Tencent (China) affiliation), Changran Hu, Qizheng Zhang, Urmish Thakker
Abstract

Test-time adaptation enables large language models (LLMs) to modify their behavior at inference without updating model parameters. A common approach is many-shot prompting, where large numbers of in-context learning (ICL) examples are injected as an input-space test-time update. Although performance can improve as more demonstrations are added, the reliability and limits of this update mechanism remain poorly understood, particularly for open-source models. We present an empirical study of many-...

📄 Making Reconstruction FID Predictive of Diffusion Generation FID
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.05630v1
👥 Authors: Tongda Xu, Mingwei He, Shady Abu-Hussein, Jose Miguel Hernandez-Lobato, Haotian Zhang (possible past Stanford University affiliation), Kai Zhao, Chao Zhou (possible past Tencent (China) affiliation), Ya-Qin Zhang, Yan Wang (possible past Tencent (China) affiliation)
Abstract

It is well known that the reconstruction FID (rFID) of a VAE is poorly correlated with the generation FID (gFID) of a latent diffusion model. We propose interpolated FID (iFID), a simple variant of rFID that exhibits a strong correlation with gFID. Specifically, for each element in the dataset, we retrieve its nearest neighbor (NN) in the latent space and interpolate their latent representations. We then decode the interpolated latent and compute the FID between the decoded samples and the origi...

📄 RepoLaunch: Automating Build&Test Pipeline of Code Repositories on ANY Language and ANY Platform
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.05026v1
👥 Authors: Kenan Li, Rongzhi Li, Linghao Zhang, Qirui Jin, Liao Zhu, Xiaosong Huang, Geng Zhang, Yikai Zhang, Shilin He, Chengxing Xie, Xin Zhang (possible past Google (United States) affiliation), Zijian Jin, Bowen Li, Chaoyun Zhang (possible past University Of Edinburgh affiliation), Yu Kang, Yufan Huang, Elsie Nallipogu, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang
Abstract

Building software repositories typically requires significant manual effort. Recent advances in large language model (LLM) agents have accelerated automation in software engineering (SWE). We introduce RepoLaunch, the first agent capable of automatically resolving dependencies, compiling source code, and extracting test results for repositories across arbitrary programming languages and operating systems. To demonstrate its utility, we further propose a fully automated pipeline for SWE dataset c...

📄 Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.04971v1
👥 Authors: Yilong Chen, Naibin Gu, Junyuan Shang, Zhenyu Zhang, Yuchen Feng, Jiawei Sheng, Tingwen Liu, Shuohuan Wang (possible past Baidu (China) affiliation), Yu Sun (possible past Baidu (China) affiliation), Hua Wu (possible past Baidu (China) affiliation), Haifeng Wang (possible past Google (United States) affiliation)
Abstract

Mixture-of-Experts (MoE) decouples model capacity from per-token computation, yet its scalability remains limited by the physical dimensions of depth and width. To overcome this, we propose Mixture of Universal Experts (MoUE), a MoE generalization introducing a novel scaling dimension: Virtual Width. In general, MoUE aims to reuse a universal layer-agnostic expert pool across layers, converting depth into virtual width under a fixed per-token activation budget. However, two challenges remain: a...

📄 BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.04918v1
👥 Authors: Yuan Li (possible past Google (United States) affiliation), Bo Wang (possible past Tencent (China) affiliation), Yufei Gao, Yuqian Yao, Xinyuan Wang, Zhangyue Yin, Xipeng Qiu
Abstract

Proximal constraints are fundamental to the stability of Large Language Model reinforcement learning. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). ...

*Notable papers are those with at least two authors from a "big" AI/ML lab.