📄 Notable* Recent AI/ML arXiv Papers

📄 RealWonder: Real-Time Physical Action-Conditioned Video Generation
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.05449v1
👥 Authors: Wei Liu (possible past Tsinghua University affiliation), Ziyu Chen, Zizhang Li, Yue Wang, Hong-Xing Yu, Jiajun Wu (possible past Massachusetts Institute Of Technology affiliation)
Abstract

Current video generation models cannot simulate physical consequences of 3D actions like forces and robotic manipulations, as they lack structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is using physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (o...

📄 Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.05308v1
👥 Authors: Qiao Jin, Yin Fang, Lauren He, Yifan Yang (possible past Tencent (China) affiliation), Guangzhi Xiong, Zhizheng Wang, Nicholas Wan, Joey Chan, Donald C. Comeau, Robert Leaman, Charalampos S. Floudas, Aidong Zhang, Michael F. Chiang, Yifan Peng (possible past Stanford University affiliation), Zhiyong Lu
Abstract

Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To efficiently perform biomedical evidence attribution, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data ne...

📄 AI+HW 2035: Shaping the Next Decade
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.05225v1
👥 Authors: Deming Chen, Jason Cong, Azalia Mirhoseini (possible past Google (United States) affiliation), Christos Kozyrakis (possible past Stanford University affiliation), Subhasish Mitra (possible past Stanford University affiliation), Jinjun Xiong, Cliff Young (possible past Google (United States) affiliation), Anima Anandkumar (possible past Nvidia (United States) affiliation), Michael Littman, Aron Kirschen, Sophia Shao, Serge Leef, Naresh Shanbhag, Dejan Milojicic, Michael Schulte, Gert Cauwenberghs, Jerry M. Chow, Tri Dao, Kailash Gopalakrishnan, Richard Ho, Hoshik Kim, Kunle Olukotun, David Z. Pan, Mark Ren, Dan Roth, Aarti Singh (possible past Carnegie Mellon University affiliation), Yizhou Sun, Yusu Wang, Yann LeCun (possible past Meta (United States) affiliation), Ruchir Puri
Abstract

Artificial intelligence (AI) and hardware (HW) are advancing at unprecedented rates, yet their trajectories have become inseparably intertwined. The global research community lacks a cohesive, long-term vision to strategically coordinate the development of AI and HW. This fragmentation constrains progress toward holistic, sustainable, and adaptive AI systems capable of learning, reasoning, and operating efficiently across cloud, edge, and physical environments. The future of AI depends not only ...

📄 KARL: Knowledge Agents via Reinforcement Learning
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.05218v1
👥 Authors: Jonathan D. Chang, Andrew Drozdov, Shubham Toshniwal, Owen Oertell, Alexander Trott, Jacob Portes, Abhay Gupta, Pallavi Koppol, Ashutosh Baheti, Sean Kulinski, Ivan Zhou, Irene Dea, Krista Opsahl-Ong, Simon Favreau-Lessard, Sean Owen, Jose Javier Gonzalez Ortiz, Arnav Singhvi, Xabi Andrade, Cindy Wang, Kartik Sreenivasan, Sam Havens, Jialu Liu (possible past Google (United States) affiliation), Peyton Deniro, Wen Sun, Michael Bendersky (possible past Google (United States) affiliation), Jonathan Frankle (possible past Massachusetts Institute Of Technology affiliation)
Abstract

We present a system for training enterprise search agents via reinforcement learning that achieves state-of-the-art performance across a diverse suite of hard-to-verify agentic search tasks. Our work makes four core contributions. First, we introduce KARLBench, a multi-capability evaluation suite spanning six distinct search regimes, including constraint-driven entity search, cross-document report synthesis, tabular numerical reasoning, exhaustive entity retrieval, procedural reasoning over tech...

📄 UniPAR: A Unified Framework for Pedestrian Attribute Recognition
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.05114v1
👥 Authors: Minghe Xu, Rouying Wu, Jiarui Xu, Minhao Sun, Zikang Yan, Xiao Wang (possible past Google (United States) affiliation), Chiawei Chu, Yu Li (possible past Tencent (China) affiliation)
Abstract

Pedestrian Attribute Recognition is a foundational computer vision task that provides essential support for downstream applications, including person retrieval in video surveillance and intelligent retail analytics. However, existing research is frequently constrained by the "one-model-per-dataset" paradigm and struggles to handle significant discrepancies across domains in terms of modalities, attribute definitions, and environmental scenarios. To address these challenges, we propose UniPAR, a...

📄 Aura: Universal Multi-dimensional Exogenous Integration for Aviation Time Series
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.05092v1
👥 Authors: Jiafeng Lin, Mengren Zheng, Simeng Ye, Yuxuan Wang (possible past Google (United States) affiliation), Huan Zhang, Yuhui Liu, Zhongyi Pei, Jianmin Wang (possible past Tsinghua University affiliation)
Abstract

Time series forecasting has witnessed an increasing demand across diverse industrial applications, where accurate predictions are pivotal for informed decision-making. Beyond numerical time series data, reliable forecasting in practical scenarios requires integrating diverse exogenous factors. Such exogenous information is often multi-dimensional or even multimodal, introducing heterogeneous interactions that unimodal time series models struggle to capture. In this paper, we delve into an aviati...

📄 Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.04971v1
👥 Authors: Yilong Chen, Naibin Gu, Junyuan Shang, Zhenyu Zhang, Yuchen Feng, Jiawei Sheng, Tingwen Liu, Shuohuan Wang (possible past Baidu (China) affiliation), Yu Sun (possible past Baidu (China) affiliation), Hua Wu (possible past Baidu (China) affiliation), Haifeng Wang (possible past Google (United States) affiliation)
Abstract

Mixture-of-Experts (MoE) decouples model capacity from per-token computation, yet its scalability remains limited by the physical dimensions of depth and width. To overcome this, we propose Mixture of Universal Experts (MoUE), a MoE generalization introducing a novel scaling dimension: Virtual Width. In general, MoUE aims to reuse a universal layer-agnostic expert pool across layers, converting depth into virtual width under a fixed per-token activation budget. However, two challenges remain: a...
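The core idea described above — one shared expert pool reused at every layer under a fixed per-token activation budget — can be sketched in a few lines. This is an illustrative toy only: the pool size, router, and scoring rule below are invented for exposition and are not MoUE's actual design.

```python
import random

random.seed(0)

NUM_EXPERTS = 8   # size of the universal pool, shared by all layers
TOP_K = 2         # fixed per-token activation budget
NUM_LAYERS = 4    # physical depth; every layer reuses the same pool

# Each "expert" is a toy affine map (w, b) acting on a scalar hidden state.
expert_pool = [(random.uniform(0.5, 1.5), random.uniform(-0.1, 0.1))
               for _ in range(NUM_EXPERTS)]

def route(hidden, layer_idx):
    """Pick the top-k experts by a toy score; a real router is learned."""
    scores = [(abs(w * hidden + b) + 0.01 * layer_idx * i, i)
              for i, (w, b) in enumerate(expert_pool)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:TOP_K]]

def forward(x):
    h = x
    for layer in range(NUM_LAYERS):  # depth becomes "virtual width":
        chosen = route(h, layer)     # NUM_LAYERS * TOP_K expert calls,
        h = sum(expert_pool[i][0] * h + expert_pool[i][1]  # pool size fixed
                for i in chosen) / TOP_K
    return h

out = forward(1.0)
```

The expert parameter count stays constant at NUM_EXPERTS experts regardless of depth, while per-token compute is NUM_LAYERS × TOP_K expert applications — the "virtual width" the abstract refers to.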

📄 MPCEval: A Benchmark for Multi-Party Conversation Generation
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.04969v1
👥 Authors: Minxing Zhang, Yi Yang (possible past Baidu (China) affiliation), Zhuofan Jia, Xuan Yang (possible past Stanford University affiliation), Jian Pei, Yuchen Zang, Xingwang Deng, Xianglong Chen
Abstract

Multi-party conversation generation, which underpins applications such as smart reply and collaborative assistants, is an increasingly important capability of generative AI, yet its evaluation remains a critical bottleneck. Compared to two-party dialogue, multi-party settings introduce distinct challenges, including complex turn-taking, role-dependent speaker behavior, long-range conversational structure, and multiple equally valid continuations. Accordingly, we introduce MPCEval, a task-aware evaluation and benchmarking su...

📄 BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.04918v1
👥 Authors: Yuan Li (possible past Google (United States) affiliation), Bo Wang (possible past Tencent (China) affiliation), Yufei Gao, Yuqian Yao, Xinyuan Wang, Zhangyue Yin, Xipeng Qiu
Abstract

Proximal constraints are fundamental to the stability of Large Language Model reinforcement learning. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). ...
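For context, the "fixed bounds" bottleneck refers to PPO's canonical clipped surrogate, where the probability ratio is clipped to [1 − ε, 1 + ε]. The toy numbers below (not from the paper) show how a low-probability action's objective is capped no matter how large its true ratio is:

```python
EPS = 0.2  # canonical PPO clip range

def clipped_surrogate(ratio, advantage):
    """PPO's clipped objective for a single action: the minimum of the
    unclipped and clipped terms, i.e. a pessimistic bound on the update."""
    clipped = max(1.0 - EPS, min(1.0 + EPS, ratio))
    return min(ratio * advantage, clipped * advantage)

# A low-probability, high-advantage action: old prob 0.01, new prob 0.05.
ratio = 0.05 / 0.01                      # 5.0, far above the upper bound
objective = clipped_surrogate(ratio, 2.0)
# The objective is capped at (1 + EPS) * advantage = 2.4, so the gradient
# through `ratio` vanishes and the upward update margin is the same fixed
# band regardless of how small the old probability was -- the asymmetry
# BandPO's probability-aware bounds are designed to relax.
```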

📄 EvoTool: Self-Evolving Tool-Use Policy Optimization in LLM Agents via Blame-Aware Mutation and Diversity-Aware Selection
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.04900v1
👥 Authors: Shuo Yang, Soyeon Caren Han, Xueqi Ma, Yan Li (possible past Tencent (China) affiliation), Mohammad Reza Ghasemi Madani, Eduard Hovy (possible past Carnegie Mellon University affiliation)
Abstract

LLM-based agents depend on effective tool-use policies to solve complex tasks, yet optimizing these policies remains challenging due to delayed supervision and the difficulty of credit assignment in long-horizon trajectories. Existing optimization approaches tend to be either monolithic, prone to entangling behaviors, or single-aspect, ignoring cross-module error propagation. To address these limitations, we propose EvoTool, a self-evolving framework that optimizes a modular tool-u...

📄 Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.04896v1
👥 Authors: Lianyu Wang, Meng Wang (possible past Google (United States) affiliation), Huazhu Fu (possible past Inception Institute Of Artificial Intelligence affiliation), Daoqiang Zhang
Abstract

The rapid adoption of vision-language models (VLMs) has heightened the demand for robust intellectual property (IP) protection of these high-value pretrained models. Effective IP protection should proactively confine model deployment within authorized domains and prevent unauthorized transfers. However, existing methods rely on static training-time definitions, limiting flexibility in dynamic environments and often producing opaque responses to unauthorized inputs. To address these limitations, ...

📄 On Multi-Step Theorem Prediction via Non-Parametric Structural Priors
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.04852v1
👥 Authors: Junbo Zhao, Ting Zhang (possible past Meta (United States) affiliation), Can Li, Wei He (possible past Baidu (China) affiliation), Jingdong Wang (possible past Baidu (China) affiliation), Hua Huang
Abstract

Multi-step theorem prediction is a central challenge in automated reasoning. Existing neural-symbolic approaches rely heavily on supervised parametric models, which exhibit limited generalization to evolving theorem libraries. In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We identify a critical scalability bottleneck, termed Structural Drift: as reasoning depth increases, the performance of vanilla ICL degrades sharply, often collapsing ...

📄 Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.04791v1
👥 Authors: Yong Liu, Xingjian Su, Shiyu Wang, Haoran Zhang, Haixuan Liu, Yuxuan Wang (possible past Google (United States) affiliation), Zhou Ye, Yang Xiang, Jianmin Wang (possible past Tsinghua University affiliation), Mingsheng Long (possible past Tsinghua University affiliation)
Abstract

We introduce Timer-S1, a strong Mixture-of-Experts (MoE) time series foundation model with 8.3B total parameters, 0.75B activated parameters for each token, and a context length of 11.5K. To overcome the scalability bottleneck in existing pre-trained time series foundation models, we perform Serial Scaling in three dimensions: model architecture, dataset, and training pipeline. Timer-S1 integrates sparse TimeMoE blocks and generic TimeSTP blocks for Serial-Token Prediction (STP), a generic train...

📄 TSEmbed: Unlocking Task Scaling in Universal Multimodal Embeddings
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.04772v1
👥 Authors: Yebo Wu, Feng Liu, Ziwei Xie, Zhiyuan Liu (possible past Tsinghua University affiliation), Changwang Zhang (possible past Tencent (China) affiliation), Jun Wang (possible past Tencent (China) affiliation), Li Li (possible past Google (United States) affiliation)
Abstract

Despite the exceptional reasoning capabilities of Multimodal Large Language Models (MLLMs), their adaptation into universal embedding models is significantly impeded by task conflict. To address this, we propose TSEmbed, a universal multimodal embedding framework that synergizes Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to explicitly disentangle conflicting task objectives. Moreover, we introduce Expert-Aware Negative Sampling (EANS), a novel strategy that leverages expert routing...
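For reference, the LoRA half of this combination keeps the pretrained weight frozen and learns a low-rank additive update. Below is a minimal numpy sketch of the standard LoRA recipe; the shapes and hyperparameters are illustrative, not TSEmbed's actual configuration.

```python
import numpy as np

d, r = 8, 2     # hidden size and low rank, with r << d
alpha = 4.0     # LoRA scaling hyperparameter

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))  # frozen pretrained weight
A = rng.standard_normal((r, d))  # trainable down-projection
B = np.zeros((d, r))             # trainable up-projection, zero-initialized

def lora_forward(x):
    # Frozen path plus low-rank adapter path; because B starts at zero,
    # the adapted model initially reproduces the pretrained model exactly.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d)
assert np.allclose(lora_forward(x), W @ x)  # identity at initialization
```

One plausible reading of the abstract is that a router selects among several such (A, B) adapter pairs per input, which is what would make expert-routing statistics available for the negative-sampling strategy (EANS) it describes.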

📄 MADCrowner: Margin Aware Dental Crown Design with Template Deformation and Refinement
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.04771v1
👥 Authors: Linda Wei, Chang Liu, Wenran Zhang, Yuxuan Hu, Ruiyang Li, Feng Qi, Changyao Tian, Ke Wang (possible past Google (United States) affiliation), Yuanyuan Wang, Shaoting Zhang (possible past Baidu (China) affiliation), Dimitris Metaxas, Hongsheng Li
Abstract

Dental crown restoration is one of the most common treatments for tooth defects, where personalized dental crown design is critical. While computer-aided design (CAD) systems have notably enhanced the efficiency of dental crown design, extensive manual adjustments are still required in the clinical workflow. Recent studies have explored learning-based methods for the automated generation of restorative dental crowns. Nevertheless, these approaches were challenged by ina...

📄 Stacked from One: Multi-Scale Self-Injection for Context Window Extension
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.04759v1
👥 Authors: Wei Han (possible past Google (United States) affiliation), Pan Zhou, Shuicheng Yan (possible past National University Of Singapore affiliation)
Abstract

The limited context window of contemporary large language models (LLMs) remains a primary bottleneck for their broader application across diverse domains. Although continual pre-training on long-context data offers a straightforward solution, it incurs prohibitive data acquisition and computational costs. To address this challenge, we propose SharedLLM, a novel framework based on multi-grained context compression and query-aware information acquisition. SharedLLM comprises two stacked short-con...

📄 Evaluating the Search Agent in a Parallel World
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.04751v1
👥 Authors: Jiawei Chen (possible past Tencent (China) affiliation), Xintian Shen, Lihao Zheng, Lifu Mu, Haoyi Sun, Ning Mao, Hao Ma (possible past Meta (United States) affiliation), Tao Wei (possible past Baidu (China) affiliation), Pan Zhou, Kun Zhan
Abstract

Integrating web search tools has significantly extended the capability of LLMs to address open-world, real-time, and long-tail problems. However, evaluating these Search Agents presents formidable challenges. First, constructing high-quality deep search benchmarks is prohibitively expensive, while unverified synthetic data often suffers from unreliable sources. Second, static benchmarks face dynamic obsolescence: as internet information evolves, complex queries requiring deep research often degr...

📄 ZipMap: Linear-Time Stateful 3D Reconstruction with Test-Time Training
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.04385v1
👥 Authors: Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao (possible past Google (United States) affiliation), Jonathan T. Barron (possible past Google (United States) affiliation), Noah Snavely (possible past Google (United States) affiliation), Aleksander Holynski (possible past University Of Washington affiliation)
Abstract

Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and π³ have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential-reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpa...
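The scaling argument above is easy to make concrete with a toy cost model: all-pairs attention across N input images grows quadratically, while a stateful sequential pass that folds each image into a fixed-size state grows linearly. The unit costs below are arbitrary placeholders, not measurements of VGGT or ZipMap.

```python
def all_pairs_attention_ops(n_images, tokens_per_image=1):
    """Every token attends to every token: O(N^2) in the image count."""
    t = n_images * tokens_per_image
    return t * t

def stateful_sequential_ops(n_images, tokens_per_image=1, state_tokens=1):
    """Each new image's tokens interact only with a fixed-size running
    state: O(N) in the image count."""
    return n_images * tokens_per_image * state_tokens

# The gap widens linearly with collection size: at 100 images the
# all-pairs model does 100x the work, at 1000 images 1000x.
ratio_100 = all_pairs_attention_ops(100) / stateful_sequential_ops(100)
ratio_1000 = all_pairs_attention_ops(1000) / stateful_sequential_ops(1000)
```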

📄 CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.04291v1
👥 Authors: Lingen Li, Guangzhi Wang, Xiaoyu Li (possible past Tencent (China) affiliation), Zhaoyang Zhang, Qi Dou, Jinwei Gu (possible past Shanghai Artificial Intelligence Laboratory affiliation), Tianfan Xue (possible past Massachusetts Institute Of Technology affiliation), Ying Shan (possible past Tencent (China) affiliation)
Abstract

Generating high-quality 360° panoramic videos from perspective input is a crucial application for virtual reality (VR), where high-resolution videos are especially important for an immersive experience. Existing methods are constrained by the computational limitations of vanilla diffusion models, supporting only ≤1K resolution native generation and relying on suboptimal post-hoc super-resolution to increase resolution. We introduce CubeComposer, a novel spatio-temporal autoregressive diff...

📄 RepoLaunch: Automating Build&Test Pipeline of Code Repositories on ANY Language and ANY Platform
🗓️ Published: 3/5/2026
🔗 http://arxiv.org/abs/2603.05026v1
👥 Authors: Kenan Li, Rongzhi Li, Linghao Zhang, Qirui Jin, Liao Zhu, Xiaosong Huang, Geng Zhang, Yikai Zhang, Shilin He, Chengxing Xie, Xin Zhang (possible past Google (United States) affiliation), Zijian Jin, Bowen Li, Chaoyun Zhang (possible past University Of Edinburgh affiliation), Yu Kang, Yufan Huang, Elsie Nallipogu, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang
Abstract

Building software repositories typically requires significant manual effort. Recent advances in large language model (LLM) agents have accelerated automation in software engineering (SWE). We introduce RepoLaunch, the first agent capable of automatically resolving dependencies, compiling source code, and extracting test results for repositories across arbitrary programming languages and operating systems. To demonstrate its utility, we further propose a fully automated pipeline for SWE dataset c...

📄 Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.04553v1
👥 Authors: Tal Daniel, Carl Qi, Dan Haramati, Amir Zadeh, Chuan Li, Aviv Tamar (possible past University Of California, Berkeley affiliation), Deepak Pathak (possible past University Of California, Berkeley affiliation), David Held (possible past University Of California, Berkeley affiliation)
Abstract

We introduce Latent Particle World Model (LPWM), a self-supervised object-centric world model scaled to real-world multi-object datasets and applicable to decision-making. LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data, enabling it to learn rich scene decompositions without supervision. Our architecture is trained end-to-end purely from videos and supports flexible conditioning on actions, language, and image goals. LPWM models stochastic particl...

*Notable papers are those with at least two authors from a "big" AI/ML lab.