πŸ“„ Notable* Recent AI/ML arXiv Papers


πŸ“„ DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
πŸ—“οΈ Published: 3/31/2026
πŸ”— http://arxiv.org/abs/2603.29844v1
πŸ‘₯ Authors: Yi Chen, Yuying Ge, Hui Zhou, Mingyu Ding, Yixiao Ge (possible past Tencent (China) affiliation), Xihui Liu (possible past University of California, Berkeley affiliation)
Abstract

The development of Vision-Language-Action (VLA) models has been significantly accelerated by pre-trained Vision-Language Models (VLMs). However, most existing end-to-end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision-language features to low-level actions. This paradigm underutilizes the VLM's potential in high-level decision making and introduces training instability, frequently degrading its rich semantic representations. To address these limitations, we introduc...

πŸ“„ Reinforced Reasoning for End-to-End Retrosynthetic Planning
πŸ—“οΈ Published: 3/31/2026
πŸ”— http://arxiv.org/abs/2603.29723v1
πŸ‘₯ Authors: Chenyang Zuo, Siqi Fan, Yizhen Luo (possible past Tsinghua University affiliation), Zaiqing Nie (possible past Tsinghua University affiliation)
Abstract

Retrosynthetic planning is a fundamental task in organic chemistry, yet remains challenging due to its combinatorial complexity. To address this, conventional approaches typically rely on hybrid frameworks that combine single-step predictions with external search heuristics, inevitably fracturing the logical coherence between local molecular transformations and global planning objectives. To bridge this gap and embed sophisticated strategic foresight directly into the model's chemical reasoning,...

πŸ“„ FlowPIE: Test-Time Scientific Idea Evolution with Flow-Guided Literature Exploration
πŸ—“οΈ Published: 3/31/2026
πŸ”— http://arxiv.org/abs/2603.29557v1
πŸ‘₯ Authors: Qiyao Wang, Hongbo Wang, Longze Chen, Zhihao Yang, Guhong Chen, Hamid Alinejad-Rokny, Hui Li (possible past Baidu (China) affiliation), Yuan Lin, Min Yang (possible past Baidu (China) affiliation)
Abstract

Scientific idea generation (SIG) is critical to AI-driven autonomous research, yet existing approaches are often constrained by a static retrieval-then-generation paradigm, leading to homogeneous and insufficiently divergent ideas. In this work, we propose FlowPIE, a tightly coupled retrieval-generation framework that treats literature exploration and idea generation as a co-evolving process. FlowPIE expands literature trajectories via a flow-guided Monte Carlo Tree Search (MCTS) inspired by GFl...
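
For readers unfamiliar with the MCTS machinery the abstract builds on, a minimal UCT selection step looks like the following. This is generic textbook MCTS, not FlowPIE's flow-guided scoring, and the candidate branch names are made up for illustration:

```python
import math

def uct_score(child_value, child_visits, parent_visits, c=1.4):
    # Exploitation (mean value) plus exploration bonus -- plain UCT,
    # not FlowPIE's flow-guided variant.
    if child_visits == 0:
        return float("inf")  # unvisited branches are expanded first
    exploit = child_value / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

# Hypothetical literature-expansion branches: (total value, visit count).
children = {"paper_A": (3.0, 5), "paper_B": (1.0, 1), "paper_C": (0.0, 0)}
parent_visits = sum(v for _, v in children.values())
best = max(children, key=lambda k: uct_score(*children[k], parent_visits))
# best == "paper_C": the unvisited branch is selected first
```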

πŸ“„ Hallucination-aware intermediate representation edit in large vision-language models
πŸ—“οΈ Published: 3/31/2026
πŸ”— http://arxiv.org/abs/2603.29405v1
πŸ‘₯ Authors: Wei Suo, Hanzu Zhang, Lijun Zhang, Ji Ma (possible past Google (United States) affiliation), Peng Wang (possible past Peking University affiliation), Yanning Zhang
Abstract

Large Vision-Language Models have demonstrated exceptional performance in multimodal reasoning and complex scene understanding. However, these models still face significant hallucination issues, where outputs contradict visual facts. Recent research on hallucination mitigation has focused on retraining methods and Contrastive Decoding (CD) methods. While both methods perform well, retraining methods require substantial training resources, and CD methods introduce dual inference overhead. These f...
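
The Contrastive Decoding baseline the abstract refers to can be sketched roughly as follows: keep tokens the full "expert" model finds plausible, then penalize those an "amateur" model also favors. This is vanilla CD background with illustrative alpha/beta values, not the paper's representation-editing method:

```python
import numpy as np

def contrastive_decode_step(expert_logits, amateur_logits, alpha=0.1, beta=1.0):
    # Normalize logits into log-probabilities.
    expert_logp = expert_logits - np.logaddexp.reduce(expert_logits)
    amateur_logp = amateur_logits - np.logaddexp.reduce(amateur_logits)
    # Plausibility cutoff: keep tokens within a factor alpha of the top token.
    cutoff = np.log(alpha) + expert_logp.max()
    # Score kept tokens by expert minus amateur log-probability.
    scores = np.where(expert_logp >= cutoff,
                      expert_logp - beta * amateur_logp,
                      -np.inf)
    return int(np.argmax(scores))

# The token the amateur model down-weights gets promoted.
tok = contrastive_decode_step(np.array([2.0, 1.0, -3.0]),
                              np.array([2.0, 0.0, -3.0]))
```

Note the dual inference overhead the abstract mentions: every decoding step requires a forward pass through both models.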

πŸ“„ PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent
πŸ—“οΈ Published: 3/31/2026
πŸ”— http://arxiv.org/abs/2603.29318v1
πŸ‘₯ Authors: Hongyi Nie, Xunyuan Liu, Yudong Bai, Yaqing Wang (possible past Baidu (China) affiliation), Yang Liu (possible past Tsinghua University affiliation), Quanming Yao, Zhen Wang
Abstract

Smartphone GUI agents execute tasks by operating directly on app interfaces, offering a path to broad capability without deep system integration. However, real-world smartphone use is highly personalized: users adopt diverse workflows and preferences, challenging agents to deliver customized assistance rather than generic solutions. Existing GUI agent benchmarks cannot adequately capture this personalization dimension due to sparse user-specific data and the lack of fine-grained evaluation metri...

πŸ“„ IMPASTO: Integrating Model-Based Planning with Learned Dynamics Models for Robotic Oil Painting Reproduction
πŸ—“οΈ Published: 3/31/2026
πŸ”— http://arxiv.org/abs/2603.29315v1
πŸ‘₯ Authors: Yingke Wang, Hao Li (possible past Tsinghua University affiliation), Yifeng Zhu, Hong-Xing Yu, Ken Goldberg (possible past University of California, Berkeley affiliation), Li Fei-Fei (possible past Stanford University affiliation), Jiajun Wu (possible past Massachusetts Institute of Technology affiliation), Yunzhu Li, Ruohan Zhang
Abstract

Robotic reproduction of oil paintings using soft brushes and pigments requires force-sensitive control of deformable tools, prediction of brushstroke effects, and multi-step stroke planning, often without human step-by-step demonstrations or faithful simulators. Given only a sequence of target oil painting images, can a robot infer and execute the stroke trajectories, forces, and colors needed to reproduce it? We present IMPASTO, a robotic oil-painting system that integrates learned pixel dynami...

πŸ“„ Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
πŸ—“οΈ Published: 3/31/2026
πŸ”— http://arxiv.org/abs/2603.29252v1
πŸ‘₯ Authors: Tao Chen, Kun Zhang (possible past Google (United States) affiliation), Qiong Wu, Xiao Chen, Chao Chang, Xiaoshuai Sun, Yiyi Zhou, Rongrong Ji (possible past Tencent (China) affiliation)
Abstract

Long video understanding is a key challenge that plagues the advancement of Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of a visual memory mechanism, and propose a novel, training-free approach termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic the human behavior of video watching, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question...
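
The recall step described here can be illustrated with a generic relevance-based retrieval sketch: score stored fragment embeddings against the question embedding and return the top matches. This is a common-pattern illustration, not FlexMem's actual mechanism:

```python
import numpy as np

def recall_fragments(query, memory, k=2):
    # Cosine similarity between the question embedding and each stored
    # memory-fragment embedding; return indices of the k best matches.
    mem = np.asarray(memory, dtype=float)
    q = np.asarray(query, dtype=float)
    sims = mem @ q / (np.linalg.norm(mem, axis=1) * np.linalg.norm(q) + 1e-9)
    return np.argsort(-sims)[:k]

# Toy 2-D embeddings for three fragments; query matches fragments 0 and 2.
top = recall_fragments([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```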

πŸ“„ Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems
πŸ—“οΈ Published: 3/31/2026
πŸ”— http://arxiv.org/abs/2603.29211v1
πŸ‘₯ Authors: Zhiqian Zhang, Xu Zhao, Xiaoqing Xu, Guangdong Liang, Weijia Wang, Xiaolei Lv, Bo Li (possible past Tencent (China) affiliation), Jun Gao (possible past Nvidia (United States) affiliation)
Abstract

In recent years, multimodal large models have continued to improve on general benchmarks. However, in real-world content moderation and adversarial settings, mainstream models still suffer from degraded generalization and catastrophic forgetting because of limited fine-grained visual perception and insufficient modeling of long-tail noise. In this paper, we present Xuanwu VL-2B as a case study of how general multimodal models can be developed into an industrial-grade foundation model for content...

πŸ“„ Route-Induced Density and Stability (RIDE): Controlled Intervention and Mechanism Analysis of Routing-Style Meta Prompts on LLM Internal States
πŸ—“οΈ Published: 3/31/2026
πŸ”— http://arxiv.org/abs/2603.29206v1
πŸ‘₯ Authors: Dianxing Zhang, Gang Li (possible past Tsinghua University affiliation), Sheng Li (possible past Google (United States) affiliation)
Abstract

Routing is widely used to scale large language models, from Mixture-of-Experts gating to multi-model/tool selection. A common belief is that routing to a task "expert" activates sparser internal computation and thus yields more certain and stable outputs (the Sparsity–Certainty Hypothesis). We test this belief by injecting routing-style meta prompts as a textual proxy for routing signals in front of frozen instruction-tuned LLMs. We quantify (C1) internal density via activation sparsity, (C2)...

πŸ“„ Efficient and Scalable Granular-ball Graph Coarsening Method for Large-scale Graph Node Classification
πŸ—“οΈ Published: 3/31/2026
πŸ”— http://arxiv.org/abs/2603.29148v1
πŸ‘₯ Authors: Guan Wang, Shuyin Xia, Lei Qian, Guoyin Wang, Yi Liu (possible past Google (United States) affiliation), Yi Wang, Wei Wang (possible past University of Oxford affiliation)
Abstract

Graph Convolutional Network (GCN) is a model that effectively handles graph data tasks and has been applied successfully. However, for large-scale graph datasets, GCN still faces high computational overhead, especially when the number of graph convolutional layers is large. Currently, many advanced methods use various sampling or graph coarsening techniques to reduce this training overhead. However, among these methods, so...
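
For context, the coarsening operation this line of work relies on can be sketched as a partition-matrix collapse: assign nodes to clusters, then compute the coarse adjacency A_c = PᵀAP. This is the standard step, not the paper's granular-ball construction:

```python
import numpy as np

def coarsen_graph(A, assign):
    # P[i, c] = 1 if node i is assigned to cluster c; the coarse graph
    # aggregates edge weights within and between clusters.
    n_clusters = max(assign) + 1
    P = np.zeros((A.shape[0], n_clusters))
    P[np.arange(A.shape[0]), assign] = 1.0
    return P.T @ A @ P

# Path graph 0-1-2-3, merged into clusters {0,1} and {2,3}.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_coarse = coarsen_graph(A, [0, 0, 1, 1])
```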

πŸ“„ Improving Efficiency of GPU Kernel Optimization Agents using a Domain-Specific Language and Speed-of-Light Guidance
πŸ—“οΈ Published: 3/30/2026
πŸ”— http://arxiv.org/abs/2603.29010v1
πŸ‘₯ Authors: Siva Kumar Sastry Hari (possible past Nvidia (United States) affiliation), Vignesh Balaji, Sana Damani, Qijing Huang, Christos Kozyrakis (possible past Stanford University affiliation)
Abstract

Optimizing GPU kernels with LLM agents is an iterative process over a large design space. Every candidate must be generated, compiled, validated, and profiled, so fewer trials will save both runtime and cost. We make two key observations. First, the abstraction level that agents operate at is important. If it is too low, the LLM wastes reasoning on low-impact details. If it is too high, it may miss important optimization choices. Second, agents cannot easily tell when they reach the point of dim...

πŸ“„ AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding
πŸ—“οΈ Published: 3/30/2026
πŸ”— http://arxiv.org/abs/2603.28696v1
πŸ‘₯ Authors: Haozhe Qi, Kevin Qu, Mahdi Rad, Rui Wang (possible past Tencent (China) affiliation), Alexander Mathis, Marc Pollefeys (possible past Google (United States) affiliation)
Abstract

Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control ...
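
The "self-uncertainty" signal mentioned here is commonly operationalized as the entropy of the model's answer distribution. A toy version of such an early-stopping rule follows; the threshold and the rule itself are hypothetical illustrations, not AdaptToken's actual criterion:

```python
import math

def entropy(probs):
    # Shannon entropy in nats of a next-token/answer distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_stop(answer_probs, threshold=0.5):
    # Hypothetical rule: stop processing further clips once the model's
    # answer distribution is confident (low-entropy) enough.
    return entropy(answer_probs) < threshold

confident = should_stop([0.97, 0.01, 0.01, 0.01])  # True: entropy ~0.17
uncertain = should_stop([0.25, 0.25, 0.25, 0.25])  # False: entropy ~1.39
```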

πŸ“„ MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models
πŸ—“οΈ Published: 3/30/2026
πŸ”— http://arxiv.org/abs/2603.28590v1
πŸ‘₯ Authors: Han Wang (possible past Peking University affiliation), Yifan Sun (possible past Baidu (China) affiliation), Brian Ko, Mann Talati, Jiawen Gong, Zimeng Li, Naicheng Yu, Xucheng Yu, Wei Shen (possible past Tsinghua University affiliation), Vedant Jolly, Huan Zhang
Abstract

Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the decision-critical factors driving the model's behavior, leading to the reduced CoT monitorability problem. However, a comprehensive and fully open-source benchmark for studying CoT monitorability remains lacking. To address this gap, we propose MonitorBench, a systematic benchmark for evaluating...

πŸ“„ Towards a Medical AI Scientist
πŸ—“οΈ Published: 3/30/2026
πŸ”— http://arxiv.org/abs/2603.28589v1
πŸ‘₯ Authors: Hongtao Wu, Boyun Zheng, Dingjie Song, Yu Jiang, Jianfeng Gao (possible past Microsoft (United States) affiliation), Lei Xing (possible past Stanford University affiliation), Lichao Sun, Yixuan Yuan
Abstract

Autonomous systems that generate scientific hypotheses, conduct experiments, and draft manuscripts have recently emerged as a promising paradigm for accelerating discovery. However, existing AI Scientists remain largely domain-agnostic, limiting their applicability to clinical medicine, where research is required to be grounded in medical evidence with specialized data modalities. In this work, we introduce Medical AI Scientist, the first autonomous research framework tailored to clinical autono...

πŸ“„ Next-Token Prediction and Regret Minimization
πŸ—“οΈ Published: 3/30/2026
πŸ”— http://arxiv.org/abs/2603.28499v1
πŸ‘₯ Authors: Mehryar Mohri (possible past Google (United States) affiliation), Clayton Sanford, Jon Schneider, Kiran Vodrahalli, Yifan Wu (possible past Carnegie Mellon University affiliation)
Abstract

We consider the question of how to employ next-token prediction algorithms in adversarial online decision-making environments. Specifically, if we train a next-token prediction model on a distribution D over sequences of opponent actions, when is it the case that the induced online decision-making algorithm (by approximately best responding to the model's predictions) has low adversarial regret (i.e., when is D a "low-regret distribution")? For unbounded context wi...
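
The regret notion in question is standard external regret, computed against the best fixed action in hindsight. A textbook sketch, not anything specific to this paper:

```python
def external_regret(losses, actions):
    # losses[t][a] is the loss of action a in round t;
    # actions[t] is the action actually played in round t.
    incurred = sum(losses[t][a] for t, a in enumerate(actions))
    best_fixed = min(sum(row[a] for row in losses)
                     for a in range(len(losses[0])))
    return incurred - best_fixed

# Playing action 1 every round against a loss table where action 0
# would have accumulated loss 1 in hindsight.
r = external_regret([[0, 1], [1, 0], [0, 1]], [1, 1, 1])  # r == 1
```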

πŸ“„ Think Anywhere in Code Generation
πŸ—“οΈ Published: 3/31/2026
πŸ”— http://arxiv.org/abs/2603.29957v1
πŸ‘₯ Authors: Xue Jiang, Tianyu Zhang, Ge Li (possible past Peking University affiliation), Mengyang Liu, Taozhi Chen, Zhenhua Xu, Binhua Li, Wenpin Jiao, Zhi Jin (possible past Peking University affiliation), Yongbin Li, Yihong Dong
Abstract

Recent advances in reasoning Large Language Models (LLMs) have primarily relied on upfront thinking, where reasoning occurs before the final answer. However, this approach suffers from critical limitations in code generation, where upfront thinking is often insufficient as problems' full complexity only reveals itself during code implementation. Moreover, it cannot adaptively allocate reasoning effort throughout the code generation process where difficulty varies significantly. In this paper, we pro...

πŸ“„ Big2Small: A Unifying Neural Network Framework for Model Compression
πŸ—“οΈ Published: 3/31/2026
πŸ”— http://arxiv.org/abs/2603.29768v1
πŸ‘₯ Authors: Jing-Xiao Liao, Haoran Wang, Tao Li (possible past Baidu (China) affiliation), Daoming Lyu, Yi Zhang (possible past Google (United States) affiliation), Chengjun Cai, Feng-Lei Fan
Abstract

With the development of foundational models, model compression has become a critical requirement. Various model compression approaches have been proposed such as low-rank decomposition, pruning, quantization, ergodic dynamic systems, and knowledge distillation, which are based on different heuristics. To elevate the field from fragmentation to a principled discipline, we construct a unifying mathematical framework for model compression grounded in measure theory. We further demonstrate that each...

πŸ“„ Disentangled Graph Prompting for Out-Of-Distribution Detection
πŸ—“οΈ Published: 3/31/2026
πŸ”— http://arxiv.org/abs/2603.29644v1
πŸ‘₯ Authors: Cheng Yang (possible past Tsinghua University affiliation), Yu Hao, Qi Zhang (possible past Tencent (China) affiliation), Chuan Shi
Abstract

When testing data and training data come from different distributions, deep neural networks (DNNs) will face significant safety risks in practical applications. Therefore, out-of-distribution (OOD) detection techniques, which can identify OOD samples at test time and alert the system, are urgently needed. Existing graph OOD detection methods usually characterize fine-grained in-distribution (ID) patterns from multiple perspectives, and train end-to-end graph neural networks (GNNs) for prediction...
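
As background, the simplest OOD score in this literature is the maximum softmax probability baseline (Hendrycks & Gimpel): low values suggest an out-of-distribution input. A minimal sketch for orientation, not this paper's graph-prompting method:

```python
import math

def max_softmax_score(logits):
    # Maximum softmax probability of the classifier's output; an OOD
    # detector flags inputs whose score falls below a tuned threshold.
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)

in_dist = max_softmax_score([10.0, 0.0, 0.0])   # near 1.0: confident
out_dist = max_softmax_score([0.0, 0.0, 0.0])   # 1/3: maximally uncertain
```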

πŸ“„ LGFNet: Local-Global Fusion Network with Fidelity Gap Delta Learning for Multi-Source Aerodynamics
πŸ—“οΈ Published: 3/31/2026
πŸ”— http://arxiv.org/abs/2603.29303v1
πŸ‘₯ Authors: Qinye Zhu, Yu Xiang (possible past University of Washington affiliation), Jun Zhang (possible past Tencent (China) affiliation), Wenyong Wang
Abstract

The precise fusion of computational fluid dynamics (CFD) data, wind tunnel test data, and flight test data in the aerodynamic domain is essential for obtaining comprehensive knowledge of both localized flow structures and global aerodynamic trends across the entire flight envelope. However, existing methodologies often struggle to balance high-resolution local fidelity with wide-range global dependency, leading to either a loss of sharp discontinuities or an inability to capture long-range topologica...

πŸ“„ Dummy-Aware Weighted Attack (DAWA): Breaking the Safe Sink in Dummy Class Defenses
πŸ—“οΈ Published: 3/31/2026
πŸ”— http://arxiv.org/abs/2603.29182v1
πŸ‘₯ Authors: Yunrui Yu, Xuxiang Feng, Pengda Qin, Pengyang Wang, Kafeng Wang, Cheng-Zhong Xu, Hang Su (possible past Tsinghua University affiliation), Jun Zhu (possible past Tsinghua University affiliation)
Abstract

Adversarial robustness evaluation faces a critical challenge as new defense paradigms emerge that can exploit limitations in existing assessment methods. This paper reveals that Dummy Classes-based defenses, which introduce an additional "dummy" class as a safety sink for adversarial examples, achieve significantly overestimated robustness under conventional evaluation strategies like AutoAttack. The fundamental limitation stems from these attacks' singular focus on misleading the true class lab...

πŸ“„ Optimistic Online LQR via Intrinsic Rewards
πŸ—“οΈ Published: 3/30/2026
πŸ”— http://arxiv.org/abs/2603.28938v1
πŸ‘₯ Authors: Marcell Bartos, Bruce D. Lee, Lenart Treven, Andreas Krause (possible past ETH Zurich affiliation), Florian Dörfler, Melanie N. Zeilinger (possible past ETH Zurich affiliation)
Abstract

Optimism in the face of uncertainty is a popular approach to balance exploration and exploitation in reinforcement learning. Here, we consider the online linear quadratic regulator (LQR) problem, i.e., to learn the LQR corresponding to an unknown linear dynamical system by adapting the control policy online based on closed-loop data collected during operation. In this work, we propose Intrinsic Rewards LQR (IR-LQR), an optimistic online LQR algorithm that applies the idea of intrinsic rewards or...
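
The controller an online LQR method must learn has a classical solution once the dynamics are known: iterate the discrete Riccati equation to a fixed point. A minimal known-dynamics sketch for background, not IR-LQR itself:

```python
import numpy as np

def dlqr_gain(A, B, Q, R, iters=500):
    # Fixed-point iteration of the discrete algebraic Riccati equation;
    # returns the gain K for the optimal control law u = -K x.
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

# Scalar system x' = x + u with unit costs: K converges to 1/golden-ratio.
K = dlqr_gain(np.array([[1.0]]), np.array([[1.0]]),
              np.array([[1.0]]), np.array([[1.0]]))
```

The online problem in the abstract is harder precisely because A and B are unknown and must be identified from closed-loop data while controlling the system.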

πŸ“„ Rethinking Language Model Scaling under Transferable Hypersphere Optimization
πŸ—“οΈ Published: 3/30/2026
πŸ”— http://arxiv.org/abs/2603.28743v1
πŸ‘₯ Authors: Liliang Ren, Yang Liu (possible past Tsinghua University affiliation), Yelong Shen (possible past Tencent (China) affiliation), Weizhu Chen
Abstract

Scaling laws for large language models depend critically on the optimizer and parameterization. Existing hyperparameter transfer laws are mainly developed for first-order optimizers, and they do not structurally prevent training instability at scale. Recent hypersphere optimization methods constrain weight matrices to a fixed-norm hypersphere, offering a promising alternative for more stable scaling. We introduce HyperP (Hypersphere Parameterization), the first framework for transferring optimal...

πŸ“„ See it to Place it: Evolving Macro Placements with Vision-Language Models
πŸ—“οΈ Published: 3/30/2026
πŸ”— http://arxiv.org/abs/2603.28733v1
πŸ‘₯ Authors: Ikechukwu Uchendu, Swati Goel, Karly Hou, Ebrahim Songhori, Kuang-Huei Lee (possible past Google (United States) affiliation), Joe Wenjie Jiang (possible past Google (United States) affiliation), Vijay Janapa Reddi, Vincent Zhuang
Abstract

We propose using Vision-Language Models (VLMs) for macro placement in chip floorplanning, a complex optimization task that has recently shown promising advancements through machine learning methods. Because human designers rely heavily on spatial reasoning to arrange components on the chip canvas, we hypothesize that VLMs with strong visual reasoning abilities can effectively complement existing learning-based approaches. We introduce VeoPlace (Visual Evolutionary Optimization Placement), a nove...

πŸ“„ HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention
πŸ—“οΈ Published: 3/30/2026
πŸ”— http://arxiv.org/abs/2603.28458v1
πŸ‘₯ Authors: Yufei Xu, Fanxu Meng, Fan Jiang (possible past Shanghai Jiao Tong University affiliation), Yuxuan Wang (possible past Google (United States) affiliation), Ruijie Zhou, Jiexi Wu, Zhixin Pan, Zhaohui Wang, Xiaojuan Tang, Wenjie Pei (possible past Tencent (China) affiliation), Tongxuan Liu, Di Yin, Xing Sun (possible past Tencent (China) affiliation), Muhan Zhang (possible past Meta (United States) affiliation)
Abstract

Token-level sparse attention mechanisms, exemplified by DeepSeek Sparse Attention (DSA), achieve fine-grained key selection by scoring every historical token for each query using a lightweight indexer, and then computing attention only over the selected subset. While the downstream sparse attention scales efficiently, the indexer still scans the entire prefix for every query, introducing an O(L^2) per-layer bottleneck that becomes prohibitive as context length grows. We propose HISA (Hierarchi...
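
The flat indexer pattern described above (score every historical key, keep the top-k, attend over the subset) can be sketched as follows. This is the DSA-style baseline whose full-prefix scan HISA replaces with a hierarchical index, not HISA itself:

```python
import numpy as np

def topk_sparse_attention(q, K, V, k):
    # Indexer: score all L historical keys against the query (the O(L)
    # per-query scan the abstract identifies as the bottleneck).
    scores = K @ q / np.sqrt(q.shape[0])
    idx = np.argpartition(-scores, k - 1)[:k]  # keep top-k key positions
    # Downstream attention runs only over the selected subset.
    sel = scores[idx]
    w = np.exp(sel - sel.max())
    w /= w.sum()
    return w @ V[idx]

q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]])
V = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [1.0, 0.0]])
out = topk_sparse_attention(q, K, V, k=2)  # attends only to keys 0 and 3
```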

πŸ“„ Taming the Instability: A Robust Second-Order Optimizer for Federated Learning over Non-IID Data
πŸ—“οΈ Published: 3/30/2026
πŸ”— http://arxiv.org/abs/2603.28316v1
πŸ‘₯ Authors: Yuanqiao Zhang, Tiantian He, Yuan Gao (possible past Tencent (China) affiliation), Yixin Wang, Yew-Soon Ong, Maoguo Gong, A. K. Qin, Hui Li (possible past Baidu (China) affiliation)
Abstract

In this paper, we present Federated Robust Curvature Optimization (FedRCO), a novel second-order optimization framework designed to improve convergence speed and reduce communication cost in Federated Learning systems under statistical heterogeneity. Existing second-order optimization methods are often computationally expensive and numerically unstable in distributed settings. In contrast, FedRCO addresses these challenges by integrating an efficient approximate curvature optimizer with a provab...

πŸ“„ MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration
πŸ—“οΈ Published: 3/30/2026
πŸ”— http://arxiv.org/abs/2603.28254v1
πŸ‘₯ Authors: Da Chang, Qiankun Shi, Lvgang Zhang, Yu Li (possible past Tencent (China) affiliation), Ruijie Zhang, Yao Lu (possible past Google (United States) affiliation), Yongxiang Liu, Ganzhao Yuan
Abstract

Orthogonalized-update optimizers such as Muon improve training of matrix-valued parameters, but existing extensions mostly act either after orthogonalization by rescaling updates or before it with heavier whitening-based preconditioners. We introduce MuonEq, a lightweight family of pre-orthogonalization equilibration schemes for Muon in three forms: two-sided row/column normalization (RC), row normalization (R), and column normalization (C). These variants rebalance the momentum matrix before...

πŸ“„ Skillful Kilometer-Scale Regional Weather Forecasting via Global and Regional Coupling
πŸ—“οΈ Published: 3/30/2026
πŸ”— http://arxiv.org/abs/2603.28173v1
πŸ‘₯ Authors: Weiqi Chen, Wenwei Wang, Qilong Yuan, Lefei Shen, Bingqing Peng, Jiawei Chen (possible past Tencent (China) affiliation), Bo Wu (possible past Tencent (China) affiliation), Liang Sun
Abstract

Data-driven weather models have advanced global medium-range forecasting, yet high-resolution regional prediction remains challenging due to unresolved multiscale interactions between large-scale dynamics and small-scale processes such as terrain-induced circulations and coastal effects. This paper presents a global-regional coupling framework for kilometer-scale regional weather forecasting that synergistically couples a pretrained Transformer-based global model with a high-resolution regional ...

πŸ“„ Neural Federated Learning for Livestock Growth Prediction
πŸ—“οΈ Published: 3/30/2026
πŸ”— http://arxiv.org/abs/2603.28117v2
πŸ‘₯ Authors: Shoujin Wang, Mingze Ni, Wei Liu (possible past Tsinghua University affiliation), Victor W. Chu, Bryan Zheng, Ayush Kanwal, Roy Jing Yang, Kenny Sabir, Fang Chen (possible past Tencent (China) affiliation)
Abstract

Livestock growth prediction is essential for optimising farm management and improving the efficiency and sustainability of livestock production, yet it remains underexplored due to limited large-scale datasets and privacy concerns surrounding farm-level data. Existing biophysical models rely on fixed formulations, while most machine learning approaches are trained on small, isolated datasets, limiting their robustness and generalisability. To address these challenges, we propose LivestockFL, the...

*Notable papers are those with at least two authors from a "big" AI/ML lab.