📄 Notable* Recent AI/ML arXiv Papers


📄 Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
🗓️ Published: 3/25/2026
🔗 http://arxiv.org/abs/2603.24326v1
👥 Authors: Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang (possible past Carnegie Mellon University affiliation), Jing Zhang (possible past University Of Washington affiliation), Jun Zhang (possible past Tencent (China) affiliation), Xing Wei, Yi Liu (possible past Google (United States) affiliation), Dianhai Yu (possible past Baidu (China) affiliation), Yanjun Ma (possible past Baidu (China) affiliation)
Abstract

Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial redundancy in the visual regions of document images, such as the background. To tackle this, we propose PaddleOCR-VL, a novel coar...
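The quadratic token growth the abstract refers to can be seen with simple arithmetic. A minimal sketch, assuming a ViT-style patch encoder with patch size 14 (an illustrative value, not a detail from the paper):

```python
# Sketch: why vision-token count grows quadratically with input resolution.
# Patch size 14 is an assumed ViT-style value, not taken from the paper.

def num_vision_tokens(height: int, width: int, patch: int = 14) -> int:
    """Number of patch tokens for an image tiled into patch x patch squares."""
    return (height // patch) * (width // patch)

low = num_vision_tokens(448, 448)    # 32 * 32 = 1024 tokens
high = num_vision_tokens(896, 896)   # 64 * 64 = 4096 tokens

# Doubling each side quadruples the token count (quadratic growth), which is
# what coarse-to-fine processing tries to avoid spending on redundant regions
# such as blank background.
print(low, high, high / low)  # 1024 4096 4.0
```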

📄 Powerful Teachers Matter: Text-Guided Multi-view Knowledge Distillation with Visual Prior Enhancement
🗓️ Published: 3/25/2026
🔗 http://arxiv.org/abs/2603.24208v1
👥 Authors: Xin Zhang (possible past Google (United States) affiliation), Jianyang Xu, Hao Peng (possible past Tsinghua University affiliation), Dongjing Wang, Jingyuan Zheng, Yu Li (possible past Tencent (China) affiliation), Yuyu Yin, Hongbo Wang
Abstract

Knowledge distillation transfers knowledge from large teacher models to smaller students for efficient inference. While existing methods primarily focus on distillation strategies, they often overlook the importance of enhancing teacher knowledge quality. In this paper, we propose Text-guided Multi-view Knowledge Distillation (TMKD), which leverages dual-modality teachers, a visual teacher and a text teacher (CLIP), to provide richer supervisory signals. Specifically, we enhance the visual teach...
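A dual-teacher distillation objective of the general kind described here can be sketched as a weighted sum of two soft-label KL terms. This is the standard temperature-scaled KD loss with illustrative weights, not the TMKD formulation itself:

```python
# Sketch of a dual-teacher distillation loss: blend KL terms toward a visual
# teacher and a text teacher (e.g. CLIP). Temperature T and weight alpha are
# illustrative hyperparameters, not values from the paper.
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def dual_teacher_kd_loss(student_logits, visual_logits, text_logits,
                         T=2.0, alpha=0.5):
    s = softmax(student_logits, T)
    return (alpha * kl(softmax(visual_logits, T), s)
            + (1 - alpha) * kl(softmax(text_logits, T), s))

loss = dual_teacher_kd_loss([2.0, 0.5, -1.0], [2.2, 0.4, -1.1], [1.8, 0.9, -0.8])
print(loss)  # small positive value; exactly 0 only if all distributions match
```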

📄 A Deep Dive into Scaling RL for Code Generation with Synthetic Data and Curricula
🗓️ Published: 3/25/2026
🔗 http://arxiv.org/abs/2603.24202v1
👥 Authors: Cansu Sancaktar, David Zhang (possible past Meta (United States) affiliation), Gabriel Synnaeve (possible past Meta (United States) affiliation), Taco Cohen
Abstract

Reinforcement learning (RL) has emerged as a powerful paradigm for improving large language models beyond supervised fine-tuning, yet sustaining performance gains at scale remains an open challenge, as data diversity and structure, rather than volume alone, become the limiting factor. We address this by introducing a scalable multi-turn synthetic data generation pipeline in which a teacher model iteratively refines problems based on in-context student performance summaries, producing structured ...

📄 Towards Effective Experiential Learning: Dual Guidance for Utilization and Internalization
🗓️ Published: 3/25/2026
🔗 http://arxiv.org/abs/2603.24093v1
👥 Authors: Fei Bai, Zhipeng Chen, Chuan Hao, Ming Yang (possible past Meta (United States) affiliation), Ran Tao, Bryan Dai, Wayne Xin Zhao (possible past Baidu (China) affiliation), Jian Yang, Hongteng Xu
Abstract

Recently, reinforcement learning (RL) has become an important approach for improving the capabilities of large language models (LLMs). In particular, reinforcement learning from verifiable rewards (RLVR) has emerged as a promising paradigm for reasoning tasks. However, existing RL-based training remains only a rough approximation to human learning. Human learners leverage both external and internal experience to guide exploration and gradually internalize useful trajectories into stable kn...

📄 The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More
🗓️ Published: 3/25/2026
🔗 http://arxiv.org/abs/2603.23971v1
👥 Authors: Lingjiao Chen, Chi Zhang (possible past Peking University affiliation), Yeye He, Ion Stoica (possible past University Of California, Berkeley affiliation), Matei Zaharia (possible past University Of California, Berkeley affiliation), James Zou
Abstract

Developers and consumers increasingly choose reasoning language models (RLMs) based on their listed API prices. However, how accurately do these prices reflect actual inference costs? We conduct the first systematic study of this question, evaluating 8 frontier RLMs across 9 diverse tasks covering competition math, science QA, code generation, and multi-domain reasoning. We uncover the pricing reversal phenomenon: in 21.8% of model-pair comparisons, the model with a lower listed price actually i...
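The reversal the abstract describes comes down to effective cost being price per token times tokens generated, so a model with a lower listed price can cost more per task if its reasoning traces are longer. A sketch with made-up prices and token counts:

```python
# Sketch: effective cost = listed price x tokens generated. Prices and token
# counts below are illustrative, not measurements from the paper.

def task_cost(price_per_mtok: float, output_tokens: int) -> float:
    """Dollar cost of one request given a $/1M-output-token price."""
    return price_per_mtok * output_tokens / 1_000_000

cheap_listed = task_cost(price_per_mtok=0.60, output_tokens=20_000)  # verbose reasoner
pricey_listed = task_cost(price_per_mtok=2.00, output_tokens=3_000)  # concise reasoner

print(f"{cheap_listed:.4f} vs {pricey_listed:.4f}")
# 0.0120 vs 0.0060: the "cheaper" model is 2x more expensive per task.
```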

📄 Self-Distillation for Multi-Token Prediction
🗓️ Published: 3/25/2026
🔗 http://arxiv.org/abs/2603.23911v1
👥 Authors: Guoliang Zhao, Ruobing Xie (possible past Tencent (China) affiliation), An Wang, Shuaipeng Li, Huaibing Xie, Xingwu Sun (possible past Baidu (China) affiliation)
Abstract

As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) could accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches still face two challenges: limited acceptance rates of MTP heads, and difficulties in jointly training multiple MTP heads. Therefore, we propose MTP-D, a simple yet effective self-distillation method with minimal additional training cost, which boosts MTP head ac...
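Why acceptance rate is the quantity to boost can be seen in how it converts into decoding speedup. A sketch assuming a speculative-style scheme where head i's draft token is kept only if all earlier drafts were kept (the scheme and numbers are illustrative, not the MTP-D setup):

```python
# Sketch: expected tokens emitted per forward pass as a function of the MTP
# head acceptance rate, assuming sequential acceptance of draft tokens.

def expected_tokens_per_step(num_heads: int, accept_rate: float) -> float:
    """1 token from the base model + draft tokens accepted in sequence."""
    return 1.0 + sum(accept_rate ** i for i in range(1, num_heads + 1))

before = expected_tokens_per_step(num_heads=3, accept_rate=0.5)
after = expected_tokens_per_step(num_heads=3, accept_rate=0.7)

print(round(before, 3), round(after, 3))  # 1.875 2.533
```

Raising the acceptance rate from 0.5 to 0.7 lifts throughput by roughly a third without adding heads, which is why the paper targets acceptance rather than head count.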

📄 AgentChemist: A Multi-Agent Experimental Robotic Platform Integrating Chemical Perception and Precise Control
🗓️ Published: 3/25/2026
🔗 http://arxiv.org/abs/2603.23886v1
👥 Authors: Xiangyi Wei, Fei Wang, Haotian Zhang (possible past Stanford University affiliation), Xin An, Haitian Zhu, Lianrui Hu, Yang Li (possible past Google (United States) affiliation), Changbo Wang, Xiao He
Abstract

Chemical laboratory automation has long been constrained by rigid workflows and poor adaptability to the long-tail distribution of experimental tasks. While most automated platforms perform well on a narrow set of standardized procedures, real laboratories involve diverse, infrequent, and evolving operations that fall outside predefined protocols. This mismatch prevents existing systems from generalizing to novel reaction conditions, uncommon instrument configurations, and unexpected procedural ...

📄 Why the Maximum Second Derivative of Activations Matters for Adversarial Robustness
🗓️ Published: 3/25/2026
🔗 http://arxiv.org/abs/2603.23860v1
👥 Authors: Yunrui Yu, Hang Su (possible past Tsinghua University affiliation), Jun Zhu (possible past Tsinghua University affiliation)
Abstract

This work investigates the critical role of activation function curvature -- quantified by the maximum second derivative max|σ''| -- in adversarial robustness. Using the Recursive Curvature-Tunable Activation Family (RCT-AF), which enables precise control over curvature through parameters α and β, we systematically analyze this relationship. Our study reveals a fundamental trade-off: insufficient curvature limits model expressivity, while excessive curvature amplifies the normalized Hessi...
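The curvature quantity max|σ''| is easy to estimate numerically for common activations. A minimal sketch using central finite differences on a grid (RCT-AF itself is not reproduced here):

```python
# Sketch: estimate max|sigma''| for common activations via central finite
# differences. Known closed-form values: sigmoid peaks near 0.0962 at
# x ~ +/-1.32; softplus'' = sigmoid', so its max is 0.25 at x = 0.
import math

def second_derivative(f, x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

def max_abs_curvature(f, lo=-6.0, hi=6.0, steps=2001):
    xs = (lo + (hi - lo) * i / (steps - 1) for i in range(steps))
    return max(abs(second_derivative(f, x)) for x in xs)

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
softplus = lambda x: math.log1p(math.exp(-abs(x))) + max(x, 0.0)  # stable form

print(round(max_abs_curvature(sigmoid), 3))   # ~0.096
print(round(max_abs_curvature(softplus), 3))  # ~0.25
```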

📄 The Diminishing Returns of Early-Exit Decoding in Modern LLMs
🗓️ Published: 3/24/2026
🔗 http://arxiv.org/abs/2603.23701v1
👥 Authors: Rui Wei, Rui Du, Hanfei Yu, Devesh Tiwari, Jian Li (possible past Tencent (China) affiliation), Zhaozhuo Xu, Hao Wang (possible past Tsinghua University affiliation)
Abstract

In Large Language Model (LLM) inference, early-exit refers to stopping computation at an intermediate layer once the prediction is sufficiently confident, thereby reducing latency and cost. However, recent LLMs adopt improved pretraining recipes and architectures that reduce layer redundancy, potentially limiting early-exit opportunities. We re-evaluate layer-wise early-exit in modern LLMs and analyze how intermediate representations evolve during training. We introduce a metric to quantify a mo...
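The early-exit rule the abstract re-evaluates can be sketched with a confidence threshold over per-layer predictions. The layer "logits" below are toy numbers, not outputs of a real model:

```python
# Sketch of layer-wise early exit: stop at the first intermediate layer whose
# top predicted probability clears a confidence threshold.
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def early_exit_layer(per_layer_logits, threshold=0.9):
    """Return (layer_index, argmax) at the first sufficiently confident
    layer, falling back to the final layer's prediction."""
    for i, logits in enumerate(per_layer_logits):
        probs = softmax(logits)
        if max(probs) >= threshold:
            return i, probs.index(max(probs))
    return len(per_layer_logits) - 1, probs.index(max(probs))

layers = [
    [0.2, 0.3, 0.1],   # early layer: undecided (top prob ~0.37)
    [1.0, 2.5, 0.3],   # mid layer: leaning, still below 0.9 (~0.75)
    [0.5, 6.0, 0.1],   # late layer: confident (~0.99)
]
print(early_exit_layer(layers))  # (2, 1)
```

Reduced layer redundancy in modern pretraining means the threshold is rarely cleared before the final layers, which is the diminishing return the title refers to.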

📄 3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding
🗓️ Published: 3/24/2026
🔗 http://arxiv.org/abs/2603.23447v1
👥 Authors: Yiping Chen, Jinpeng Li, Wenyu Ke, Yang Luo, Jie Ouyang, Zhongjie He, Li Liu (possible past National University Of Defense Technology affiliation), Hongchao Fan, Hao Wu (possible past Tencent (China) affiliation)
Abstract

While multi-modality large language models excel in object-centric or indoor scenarios, scaling them to 3D city-scale environments remains a formidable challenge. To bridge this gap, we propose 3DCity-LLM, a unified framework designed for 3D city-scale vision-language perception and understanding. 3DCity-LLM employs a coarse-to-fine feature encoding strategy comprising three parallel branches for target object, inter-object relationship, and global scene. To facilitate large-scale training, we i...

📄 PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments
🗓️ Published: 3/24/2026
🔗 http://arxiv.org/abs/2603.23231v1
👥 Authors: Shuochen Liu, Junyi Zhu, Long Shu, Junda Lin, Yuhao Chen, Haotian Zhang (possible past Stanford University affiliation), Chao Zhang, Derong Xu, Jia Li (possible past Google (United States) affiliation), Bo Tang, Zhiyu Li, Feiyu Xiong, Enhong Chen (possible past Baidu (China) affiliation), Tong Xu (possible past Baidu (China) affiliation)
Abstract

Empowering large language models with long-term memory is crucial for building agents that adapt to users' evolving needs. However, prior evaluations typically interleave preference-related dialogues with irrelevant conversations, reducing the task to needle-in-a-haystack retrieval while ignoring relationships between events that drive the evolution of user preferences. Such settings overlook a fundamental characteristic of real-world personalization: preferences emerge gradually and accumulate ...

📄 ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment
🗓️ Published: 3/24/2026
🔗 http://arxiv.org/abs/2603.23184v1
👥 Authors: Hao Wang (possible past Tsinghua University affiliation), Haocheng Yang, Licheng Pan, Lei Shen, Xiaoxi Li, Yinuo Wang, Zhichao Chen, Yuan Lu, Haoxuan Li, Zhouchen Lin (possible past Peking University affiliation)
Abstract

Reward modeling represents a long-standing challenge in reinforcement learning from human feedback (RLHF) for aligning language models. Current reward modeling is heavily contingent upon explicit feedback data with high collection costs. In this work, we study implicit reward modeling -- learning reward models from implicit human feedback (e.g., clicks and copies) -- as a cost-effective alternative. We identify two fundamental challenges in implicit reward modeling: (1) Implicit pre...
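One way to see how an implicit signal plugs into standard reward modeling: treat a clicked response as preferred over an ignored one and score the pair with the usual Bradley-Terry loss. The debiasing machinery ImplicitRM adds is not reproduced here; the rewards are toy scalars:

```python
# Sketch: Bradley-Terry pairwise loss over an implicit preference pair
# (clicked response vs. ignored response). Reward values are illustrative.
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected), computed stably: small when the
    clicked response is scored above the ignored one."""
    margin = r_chosen - r_rejected
    return math.log1p(math.exp(-margin))

good = bradley_terry_loss(r_chosen=2.0, r_rejected=-1.0)  # ~0.049
bad = bradley_terry_loss(r_chosen=-1.0, r_rejected=2.0)   # ~3.049
print(round(good, 3), round(bad, 3))
```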

📄 MedCausalX: Adaptive Causal Reasoning with Self-Reflection for Trustworthy Medical Vision-Language Models
🗓️ Published: 3/24/2026
🔗 http://arxiv.org/abs/2603.23085v1
👥 Authors: Jianxin Lin, Chunzheng Zhu, Peter J. Kneuertz, Yunfei Bai (possible past Google (United States) affiliation), Yuan Xue (possible past Google (United States) affiliation)
Abstract

Vision-Language Models (VLMs) have enabled interpretable medical diagnosis by integrating visual perception with linguistic reasoning. Yet, existing medical chain-of-thought (CoT) models lack explicit mechanisms to represent and enforce causal reasoning, leaving them vulnerable to spurious correlations and limiting their clinical reliability. We pinpoint three core challenges in medical CoT reasoning: how to adaptively trigger causal correction, construct high-quality causal-spurious contrastive...

📄 StateLinFormer: Stateful Training Enhancing Long-term Memory in Navigation
🗓️ Published: 3/24/2026
🔗 http://arxiv.org/abs/2603.23571v1
👥 Authors: Zhiyuan Chen (possible past Google (United States) affiliation), Yuxuan Zhong, Fan Wang (possible past Baidu (China) affiliation), Bo Yu (possible past Baidu (China) affiliation), Pengtao Shao, Shaoshan Liu, Ning Ding (possible past Tsinghua University affiliation)
Abstract

Effective navigation intelligence relies on long-term memory to support both immediate generalization and sustained adaptation. However, existing approaches face a dilemma: modular systems rely on explicit mapping but lack flexibility, while Transformer-based end-to-end models are constrained by fixed context windows, limiting persistent memory across extended interactions. We introduce StateLinFormer, a linear-attention navigation model trained with a stateful memory mechanism that preserves re...
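The property that makes a persistent state possible is the linear-attention recurrence: the attention output at each step is computed from a running d x d state S_t = S_{t-1} + k_t v_t^T, so memory does not grow with context length. A sketch of that generic kernelized form (not StateLinFormer itself; abs() stands in for the usual nonnegative feature map):

```python
# Sketch of the linear-attention recurrence behind constant-size memory.
import numpy as np

def linear_attention_step(state, norm, k, v, q):
    """One decoding step; `state` carries the entire history."""
    state = state + np.outer(k, v)          # accumulate key-value associations
    norm = norm + k                         # running normalizer
    out = q @ state / max(q @ norm, 1e-6)   # attention readout for this step
    return state, norm, out

d = 4
rng = np.random.default_rng(0)
state, norm, out = np.zeros((d, d)), np.zeros(d), None
for _ in range(10):                         # stream tokens; state stays d x d
    k, v, q = (np.abs(rng.normal(size=d)) for _ in range(3))
    state, norm, out = linear_attention_step(state, norm, k, v, q)

print(state.shape, out.shape)  # (4, 4) (4,) regardless of sequence length
```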

📄 JFTA-Bench: Evaluate LLM's Ability of Tracking and Analyzing Malfunctions Using Fault Trees
🗓️ Published: 3/24/2026
🔗 http://arxiv.org/abs/2603.22978v1
👥 Authors: Yuhui Wang, Zhixiong Yang, Ming Zhang (possible past Peking University affiliation), Shihan Dou, Zhiheng Xi, Enyu Zhou, Senjie Jin, Yujiong Shen, Dingwei Zhu, Yi Dong, Tao Gui, Qi Zhang (possible past Tencent (China) affiliation), Xuanjing Huang
Abstract

In the maintenance of complex systems, fault trees are used to locate problems and provide targeted solutions. To enable fault trees stored as images to be directly processed by large language models, which can assist in tracking and analyzing malfunctions, we propose a novel textual representation of fault trees. Building on it, we construct a benchmark for multi-turn dialogue systems that emphasizes robust interaction in complex environments, evaluating a model's ability to assist in malfuncti...
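A textual representation of a fault tree could look like an indented AND/OR serialization. The paper defines its own format; the one below is purely an illustrative stand-in, with a hypothetical pump-failure tree:

```python
# Sketch: serialize a fault tree into text an LLM can consume directly.
# Leaves are basic fault events; internal nodes are (gate, subtrees) pairs.

def tree_to_text(node, depth=0):
    pad = "  " * depth
    if isinstance(node, str):          # leaf: a basic fault event
        return f"{pad}- {node}\n"
    gate, children = node              # internal node: ("AND"/"OR", [...])
    out = f"{pad}{gate}:\n"
    for child in children:
        out += tree_to_text(child, depth + 1)
    return out

pump_failure = ("OR", [
    "power supply lost",
    ("AND", ["primary pump jammed", "backup pump offline"]),
])
print(tree_to_text(pump_failure))
```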

📄 Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories
🗓️ Published: 3/24/2026
🔗 http://arxiv.org/abs/2603.22869v1
👥 Authors: Yang Li (possible past Google (United States) affiliation), Yule Liu, Xinlei He, Youjian Zhao (possible past Tsinghua University affiliation), Qi Li, Ke Xu
Abstract

Large Language Models (LLMs) have become core cognitive components in modern artificial intelligence (AI) systems, combining internal knowledge with external context to perform complex tasks. However, LLMs typically treat all accessible data indiscriminately, lacking inherent awareness of knowledge ownership and access boundaries. This deficiency heightens risks of sensitive data leakage and adversarial manipulation, potentially enabling unauthorized system access and severe security crises. Exi...

📄 UniQueR: Unified Query-based Feedforward 3D Reconstruction
🗓️ Published: 3/24/2026
🔗 http://arxiv.org/abs/2603.22851v1
👥 Authors: Chensheng Peng, Quentin Herau, Jiezhi Yang, Yichen Xie, Yihan Hu, Wenzhao Zheng, Matthew Strong, Masayoshi Tomizuka (possible past University Of California, Berkeley affiliation), Wei Zhan (possible past University Of California, Berkeley affiliation)
Abstract

We present UniQueR, a unified query-based feedforward framework for efficient and accurate 3D reconstruction from unposed images. Existing feedforward models such as DUSt3R, VGGT, and AnySplat typically predict per-pixel point maps or pixel-aligned Gaussians, which remain fundamentally 2.5D and limited to visible surfaces. In contrast, UniQueR formulates reconstruction as a sparse 3D query inference problem. Our model learns a compact set of 3D anchor points that act as explicit geometric querie...

📄 UAV-DETR: DETR for Anti-Drone Target Detection
🗓️ Published: 3/24/2026
🔗 http://arxiv.org/abs/2603.22841v1
👥 Authors: Jun Yang (possible past Tsinghua University affiliation), Dong Wang (possible past Tsinghua University affiliation), Hongxu Yin, Hongpeng Li, Jianxiong Yu
Abstract

Drone detection is pivotal in numerous security and counter-UAV applications. However, existing deep learning-based methods typically struggle to balance robust feature representation with computational efficiency. This challenge is particularly acute when detecting miniature drones against complex backgrounds under severe environmental interference. To address these issues, we introduce UAV-DETR, a novel framework that integrates a small-target-friendly architecture with real-time detection cap...

*Notable papers are those with at least two authors from a "big" AI/ML lab.