πŸ“„ Notable* Recent AI/ML arXiv Papers


πŸ“„ LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning
πŸ—“οΈ Published: 4/15/2026
πŸ”— http://arxiv.org/abs/2604.14140v1
πŸ‘₯ Authors: Sumeet Ramesh Motwani, Daniel Nichols, Charles London, Peggy Li, Fabio Pizzati, Acer Blake, Hasan Hammoud, Tavish Mcdonald, Akshat Naik, Alesia Ivanova, Vignesh Baskaran, Ivan Laptev, Ruben Glatt, Tal Ben-Nun, Philip Torr (possible past University Of Oxford affiliation), Natasha Jaques (possible past University Of California, Berkeley affiliation), Ameya Prabhu, Brian Bartoldson, Bhavya Kailkhura, Christian Schroeder De Witt (possible past University Of Oxford affiliation)
Abstract

As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Proble...

πŸ“„ HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
πŸ—“οΈ Published: 4/15/2026
πŸ”— http://arxiv.org/abs/2604.14125v1
πŸ‘₯ Authors: Tianshuo Yang, Guanyu Chen, Yutian Chen (possible past Deepmind (United Kingdom) affiliation), Zhixuan Liang, Yitian Liu, Zanxin Chen, Chunpu Xu, Haotian Liang, Jiangmiao Pang (possible past Shanghai Artificial Intelligence Laboratory affiliation), Yao Mu, Ping Luo (possible past Shanghai Artificial Intelligence Laboratory affiliation)
Abstract

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs tas...

πŸ“„ TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration
πŸ—“οΈ Published: 4/15/2026
πŸ”— http://arxiv.org/abs/2604.14116v1
πŸ‘₯ Authors: Zerun Ma, Guoqiang Wang, Xinchen Xie, Yicheng Chen, He Du, Bowen Li, Yanan Sun (possible past Tencent (China) affiliation), Wenran Liu, Kai Chen (possible past Shanghai Jiao Tong University affiliation), Yining Li
Abstract

While Large Language Models (LLMs) have empowered AI research agents to perform isolated scientific tasks, automating complex, real-world workflows, such as LLM training, remains a significant challenge. In this paper, we introduce TREX, a multi-agent system that automates the entire LLM training life-cycle. By orchestrating collaboration between two core modules, the Researcher and the Executor, the system seamlessly performs requirement analysis, open-domain literature and data research, formula...

πŸ“„ GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis
πŸ—“οΈ Published: 4/15/2026
πŸ”— http://arxiv.org/abs/2604.13888v1
πŸ‘₯ Authors: Bo Yu (possible past Baidu (China) affiliation), Cheng Yang (possible past Tsinghua University affiliation), Dongyang Hou, Chengfu Liu, Jiayao Liu, Chi Wang (possible past Microsoft (United States) affiliation), Zhiming Zhang, Haifeng Li, Wentao Yang (possible past Google (United States) affiliation)
Abstract

The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-based agents remains challenging due to the complex, multi-step nature of geospatial workflows. Existing benchmarks primarily rely on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and in...

πŸ“„ Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt
πŸ—“οΈ Published: 4/15/2026
πŸ”— http://arxiv.org/abs/2604.13715v1
πŸ‘₯ Authors: Yanfeng Shi, Pengfei Cai, Jun Liu (possible past Tencent (China) affiliation), Qing Gu, Nan Jiang (possible past Stanford University affiliation), Lirong Dai, Ian Mcloughlin, Yan Song (possible past Tencent (China) affiliation)
Abstract

Large Audio-Language Models (LALMs) enable general audio understanding and demonstrate remarkable performance across various audio tasks. However, these models still face challenges in temporal perception (e.g., inferring event onset and offset), leading to limited utility in fine-grained scenarios. To address this issue, we propose Audio-Side Time Prompt and leverage Reinforcement Learning (RL) to develop the TimePro-RL framework for fine-grained temporal perception. Specifically, we encode tim...

πŸ“„ Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data
πŸ—“οΈ Published: 4/15/2026
πŸ”— http://arxiv.org/abs/2604.13688v1
πŸ‘₯ Authors: Yizhao Xu, Hongyuan Zhu (possible past Tencent (China) affiliation), Caiyun Liu, Tianfu Wang, Keyu Chen, Sicheng Xu, Jiaolong Yang, Nicholas Jing Yuan, Qi Zhang (possible past Tencent (China) affiliation)
Abstract

3D editing refers to the ability to apply local or global modifications to 3D assets. Effective 3D editing requires maintaining semantic consistency by performing localized changes according to prompts, while also preserving local invariance so that unchanged regions remain consistent with the original. However, existing approaches have significant limitations: multi-view editing methods incur losses when projecting back to 3D, while voxel-based editing is constrained in both the regions that ca...

πŸ“„ SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment
πŸ—“οΈ Published: 4/15/2026
πŸ”— http://arxiv.org/abs/2604.13630v1
πŸ‘₯ Authors: Xixun Lin, Yang Liu (possible past Tsinghua University affiliation), Yancheng Chen, Yongxuan Wu, Yucheng Ning, Yilong Liu, Nan Sun, Shun Zhang, Bin Chong, Chuan Zhou, Yanan Cao, Li Guo (possible past Google (United States) affiliation)
Abstract

The performance of large language model (LLM) agents depends critically on the execution harness, the system layer that orchestrates tool use, context management, and state persistence. Yet this same architectural centrality makes the harness a high-value attack surface: a single compromise at the harness level can cascade through the entire execution pipeline. We observe that existing security approaches suffer from structural mismatch, leaving them blind to harness-internal state and unable to...

πŸ“„ A Study of Failure Modes in Two-Stage Human-Object Interaction Detection
πŸ—“οΈ Published: 4/15/2026
πŸ”— http://arxiv.org/abs/2604.13448v1
πŸ‘₯ Authors: Lemeng Wang, Qinqian Lei, Vidhi Bakshi, Daniel Yi, Yifan Liu, Jiacheng Hou, Asher Seng Hao, Zheda Mai, Wei-Lun Chao, Robby T. Tan (possible past National University Of Singapore affiliation), Bo Wang (possible past Tencent (China) affiliation)
Abstract

Human-object interaction (HOI) detection aims to detect interactions between humans and objects in images. While recent advances have improved performance on existing benchmarks, their evaluations mainly focus on overall prediction accuracy and provide limited insight into the underlying causes of model failures. In particular, modern models often struggle in complex scenes involving multiple people and rare interaction combinations. In this work, we present a study to better understand the fail...

πŸ“„ WebXSkill: Skill Learning for Autonomous Web Agents
πŸ—“οΈ Published: 4/14/2026
πŸ”— http://arxiv.org/abs/2604.13318v1
πŸ‘₯ Authors: Zhaoyang Wang, Qianhui Wu, Xuchao Zhang, Chaoyun Zhang (possible past University Of Edinburgh affiliation), Wenlin Yao, Fazle Elahi Faisal, Baolin Peng, Si Qin, Suman Nath, Qingwei Lin, Chetan Bansal, Dongmei Zhang, Saravan Rajmohan, Jianfeng Gao (possible past Microsoft (United States) affiliation), Huaxiu Yao
Abstract

Autonomous web agents powered by large language models (LLMs) have shown promise in completing complex browser tasks, yet they still struggle with long-horizon workflows. A key bottleneck is the grounding gap in existing skill formulations: textual workflow skills provide natural language guidance but cannot be directly executed, while code-based skills are executable but opaque to the agent, offering no step-level understanding for error recovery or adaptation. We introduce WebXSkill, a framewo...

πŸ“„ 4th Workshop on Maritime Computer Vision (MaCVi): Challenge Overview
πŸ—“οΈ Published: 4/14/2026
πŸ”— http://arxiv.org/abs/2604.13244v1
πŸ‘₯ Authors: Benjamin Kiefer, Jan Lukas Augustin, Jon Muhovič, Mingi Jeong, Arnold Wiliem, Janez Pers, Matej Kristan, Alberto Quattrini Li, Matija TerΕ‘ek, Josip Ε ariΔ‡, Arpita Vats, Dominik Hildebrand, Rafia Rahim, Mahmut Karaaslan, Arpit Vaishya, Steve Xie, Ersin Kaya, Akib Mashrur, Tze-Hsiang Tang, Chun-Ming Tsai, Jun-Wei Hsieh, Ming-Ching Chang (possible past Nvidia (United States) affiliation), Wonwoo Jo, Doyeon Lee, Yusi Cao, Lingling Li, Vinayak Nageli, Arshad Jamal, Gorthi Rama Krishna Sai Subrahmanyam, Jemo Maeng, Seongju Lee, Kyoobin Lee, Xu Liu (possible past Massachusetts Institute Of Technology affiliation), Licheng Jiao, Jannik Sheikh, Martin Weinmann, Ivan MartinoviΔ‡, Jose Mateus Raitz Persch, Rahul Harsha Cheppally, Mehmet E. Belviranli, Dimitris Gahtidis, Hyewon Chun, Sangmun Lee, Philipp Gorczak, Hansol Kim, Jeeyeon Jeon, Borja Carrillo Perez, Jiahui Wang, Sangmin Park, Andreas Michel, Jannick Kuester, Bettina Felten, Wolfgang Gross, Yuan Feng, Justin Davis
Abstract

The 4th Workshop on Maritime Computer Vision (MaCVi) is organized as part of CVPR 2026. This edition features five benchmark challenges with emphasis on both predictive accuracy and embedded real-time feasibility. This report summarizes the MaCVi 2026 challenge setup, evaluation protocols, datasets, and benchmark tracks, and presents quantitative results, qualitative comparisons, and cross-challenge analyses of emerging method trends. We also include technical reports from top-performing teams t...

πŸ“„ Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
πŸ—“οΈ Published: 4/14/2026
πŸ”— http://arxiv.org/abs/2604.13016v2
πŸ‘₯ Authors: Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-Ang Gao, Wenkai Yang, Zhiyuan Liu (possible past Tsinghua University affiliation), Ning Ding (possible past Tsinghua University affiliation)
Abstract

On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities b...

πŸ“„ Human-Centric Topic Modeling with Goal-Prompted Contrastive Learning and Optimal Transport
πŸ—“οΈ Published: 4/14/2026
πŸ”— http://arxiv.org/abs/2604.12663v1
πŸ‘₯ Authors: Rui Wang (possible past Tencent (China) affiliation), Yi Zheng, Dongxin Wang, Haiping Huang, Yuanzhi Yao, Yuxiang Zhou, Jialin Yu, Philip Torr (possible past University Of Oxford affiliation)
Abstract

Existing topic modeling methods, from LDA to recent neural and LLM-based approaches, focus mainly on statistical coherence and often produce redundant or off-target topics that miss the user's underlying intent. We introduce Human-centric Topic Modeling (Human-TM), a novel task formulation that integrates a human-provided goal directly into the topic modeling process to produce interpretable, diverse, and goal-oriented topics. To tackle this challenge, we propose the Goal-promp...

πŸ“„ KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance
πŸ—“οΈ Published: 4/14/2026
πŸ”— http://arxiv.org/abs/2604.12627v1
πŸ‘₯ Authors: Linhao Yu, Tianmeng Yang, Siyu Ding, Renren Jin, Naibin Gu, Xiangzhao Hao, Shuaiyi Nie, Deyi Xiong, Weichong Yin (possible past Baidu (China) affiliation), Yu Sun (possible past Baidu (China) affiliation), Hua Wu (possible past Baidu (China) affiliation)
Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) improves reasoning in large language models, but its effectiveness is often limited by severe reward sparsity on hard problems. Recent hint-based RL methods mitigate sparsity by injecting partial solutions or abstract templates, yet they typically scale guidance by adding more tokens, which introduces redundancy, inconsistency, and extra training overhead. We propose KnowRL (Knowledge-Guided Reinforcement Learning), an RL training framework that treats hint design as a minimal-suffi...
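The reward sparsity that KnowRL targets is easiest to see with a toy verifiable reward. The sketch below is a generic binary answer-match reward; the `####` final-answer marker is an illustrative GSM8K-style convention, not something taken from the paper:

```python
def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 iff the final answer matches gold.

    The '####' final-answer marker is an illustrative convention
    (GSM8K-style), not specified by KnowRL.
    """
    marker = "####"
    if marker not in completion:
        return 0.0
    predicted = completion.rsplit(marker, 1)[1].strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0

# On a hard problem, nearly every sampled completion earns 0.0, so the
# policy gradient carries almost no signal -- the sparsity that hint-based
# guidance tries to relieve.
rewards = [verifiable_reward(c, "42")
           for c in ["... #### 41", "... #### 42", "no marker here"]]
```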

πŸ“„ Neural Dynamic GI: Random-Access Neural Compression for Temporal Lightmaps in Dynamic Lighting Environments
πŸ—“οΈ Published: 4/14/2026
πŸ”— http://arxiv.org/abs/2604.12625v1
πŸ‘₯ Authors: Jianhui Wu, Jian Zhou (possible past Tencent (China) affiliation), Zhi Zhou, Zhangjin Huang, Chao Li (possible past Baidu (China) affiliation)
Abstract

High-quality global illumination (GI) in real-time rendering is commonly achieved using precomputed lighting techniques, with lightmaps as the standard choice. To support GI for static objects in dynamic lighting environments, multiple lightmaps at different lighting conditions need to be precomputed, which incurs substantial storage and memory overhead. To overcome this limitation, we propose Neural Dynamic GI (NDGI), a novel compression technique specifically designed for temporal lightmap se...

πŸ“„ KumoRFM-2: Scaling Foundation Models for Relational Learning
πŸ—“οΈ Published: 4/14/2026
πŸ”— http://arxiv.org/abs/2604.12596v1
πŸ‘₯ Authors: Valter Hudovernik, Federico LΓ³pez, Vid Kocijan, Akihiro Nitta, Jan Eric Lenssen (possible past Meta (United States) affiliation), Jure Leskovec (possible past Stanford University affiliation), Matthias Fey
Abstract

We introduce KumoRFM-2, the next iteration of a pre-trained foundation model for relational data. KumoRFM-2 supports in-context learning as well as fine-tuning and is applicable to a wide range of predictive tasks. In contrast to tabular foundation models, KumoRFM-2 natively operates on relational data, processing one or more connected tables simultaneously without manual table flattening or target variable generation, all while preserving temporal consistency. KumoRFM-2 leverages a large corpus...

πŸ“„ NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Professional Image Quality Assessment (Track 1)
πŸ—“οΈ Published: 4/14/2026
πŸ”— http://arxiv.org/abs/2604.12512v1
πŸ‘₯ Authors: Guanyi Qin, Jie Liang, Bingbing Zhang, Lishen Qu, Ya-Nan Guan, Hui Zeng, Lei Zhang, Radu Timofte (possible past Eth Zurich affiliation), Jianhui Sun, Xinli Yue, Tao Shao, Huan Hou, Wenjie Liao, Shuhao Han, Jieyu Yuan, Chunle Guo, Chongyi Li, Zewen Chen, Yunze Liu, Jian Guo, Juan Wang, Yun Zeng, Bing Li, Weiming Hu, Hesong Li, Dehua Liu, Xinjie Zhang, Qiang Li, Li Yan, Wei Dong, Qingsen Yan, Xingcan Li, Shenglong Zhou, Manjiang Yin, Yinxiang Zhang, Hongbo Wang, Jikai Xu, Zhaohui Fan, Dandan Zhu, Wei Sun (possible past Google (United States) affiliation), Weixia Zhang, Kun Zhu, Nana Zhang, Kaiwei Zhang, Qianqian Zhang, Zhihan Zhang, William Gordon, Linwei Wu, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Cici Liu, Yaokun Shi
Abstract

In this paper, we present an overview of the NTIRE 2026 challenge on the 3rd Restore Any Image Model in the Wild, specifically focusing on Track 1: Professional Image Quality Assessment. Conventional Image Quality Assessment (IQA) typically relies on scalar scores. By compressing complex visual characteristics into a single number, these methods fundamentally struggle to distinguish subtle differences among uniformly high-quality images. Furthermore, they fail to articulate why one image is supe...

πŸ“„ DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off
πŸ—“οΈ Published: 4/15/2026
πŸ”— http://arxiv.org/abs/2604.13902v1
πŸ‘₯ Authors: Xiaofan Li, Ming Yang (possible past Meta (United States) affiliation), Zhiyuan Ma, Shichao Ma, Jintao Du, Yu Cheng (possible past National University Of Singapore affiliation), Weiqiang Wang, Zhizhong Zhang, Xin Tan, Yanyun Qu, Lizhuang Ma (possible past Shanghai Jiao Tong University affiliation), Yuan Xie
Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration-exploitation trade-off remains a critical challenge. In this paper, we analyze the exploration-exploitation dilemma posed by extremely hard and easy samples during training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity space disentangling strateg...
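DiPO organizes its trade-off around per-sample perplexity. As a reminder of the quantity itself, here is a minimal sketch of the standard definition (not DiPO's disentangling strategy):

```python
import math

def sequence_perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean per-token log-probability)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# An easy sample (high token probabilities) sits at low perplexity and a
# hard sample sits high -- the axis along which a perplexity-based
# mechanism can treat extremely hard and easy samples differently.
easy = sequence_perplexity([math.log(0.9)] * 4)  # 1/0.9, about 1.11
hard = sequence_perplexity([math.log(0.2)] * 4)  # 1/0.2 = 5.0
```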

πŸ“„ RPS: Information Elicitation with Reinforcement Prompt Selection
πŸ—“οΈ Published: 4/15/2026
πŸ”— http://arxiv.org/abs/2604.13817v1
πŸ‘₯ Authors: Tao Wang (possible past Stanford University affiliation), Jingyao Lu, Xibo Wang, Haonan Huang, Su Yao, Zhiqiang Hu (possible past Peking University affiliation), Xingyan Chen, Enmao Diao
Abstract

Large language models (LLMs) have shown remarkable capabilities in dialogue generation and reasoning, yet their effectiveness in eliciting user-known but concealed information in open-ended conversations remains limited. In many interactive AI applications, such as personal assistants, tutoring systems, and legal or clinical support, users often withhold sensitive or uncertain information due to privacy concerns, ambiguity, or social hesitation. This makes it challenging for LLMs to gather compl...

πŸ“„ Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference
πŸ—“οΈ Published: 4/15/2026
πŸ”— http://arxiv.org/abs/2604.13634v1
πŸ‘₯ Authors: Xuwen Zhou, Fangxin Liu, Chao Wang (possible past Google (United States) affiliation), Xiao Zheng, Hao Zheng, Min He, Li Jiang (possible past Tencent (China) affiliation), Haibing Guan
Abstract

Speculative decoding accelerates autoregressive generation by letting draft tokens bypass full verification, but conventional frameworks suffer from frequent false rejections, particularly when draft models produce semantically correct but lexically divergent outputs. In this paper, we present Calibrated Speculative Decoding (CSD), a training-free framework that recovers valid tokens discarded by standard verification. Guided by the principle of "Frequency-Guided Candidate Selection and Probabil...
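For context, the standard verification rule that a framework like CSD starts from (and that produces the false rejections described above) accepts a drafted token x with probability min(1, p(x)/q(x)). The sketch below is that baseline rule, not CSD's calibrated recovery; the token names and probabilities are toy values:

```python
import random

def verify_draft_token(x, p, q, rng=random.random):
    """Baseline speculative-decoding verification: accept w.p. min(1, p/q).

    p and q map tokens to target- and draft-model probabilities. When the
    target agrees at least as strongly (p[x] >= q[x]) the draft token is
    always kept; otherwise acceptance is stochastic, so a semantically
    valid but lexically divergent draft can still be rejected.
    """
    return rng() < min(1.0, p[x] / q[x])

p = {"cat": 0.6, "feline": 0.3}   # target-model probabilities (toy values)
q = {"cat": 0.5, "feline": 0.4}   # draft-model probabilities (toy values)

accepted = verify_draft_token("cat", p, q)  # p >= q: always accepted
```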

πŸ“„ Enhancing Reinforcement Learning for Radiology Report Generation with Evidence-aware Rewards and Self-correcting Preference Learning
πŸ—“οΈ Published: 4/15/2026
πŸ”— http://arxiv.org/abs/2604.13598v1
πŸ‘₯ Authors: Qin Zhou, Guoyan Liang, Qianyi Yang, Jingyuan Chen (possible past National University Of Singapore affiliation), Sai Wu, Chang Yao, Zhe Wang (possible past Deepmind (United Kingdom) affiliation)
Abstract

Recent reinforcement learning (RL) approaches have advanced radiology report generation (RRG), yet two core limitations persist: (1) report-level rewards offer limited evidence-grounded guidance for clinical faithfulness; and (2) current methods lack an explicit self-improving mechanism to align with clinical preference. We introduce clinically aligned Evidence-aware Self-Correcting Reinforcement Learning (ESC-RL), comprising two key components. First, a Group-wise Evidence-aware Alignment Rewar...

πŸ“„ Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel
πŸ—“οΈ Published: 4/14/2026
πŸ”— http://arxiv.org/abs/2604.13327v1
πŸ‘₯ Authors: Hongyi Jin, Bohan Hou, Guanjie Wang, Ruihang Lai, Jinqi Chen, Zihao Ye, Yaxing Cai, Yixin Dong, Xinhao Cheng, Zhihao Zhang, Yilong Zhao, Yingyi Huang, Lijie Yang, Jinchen Jiang, Gabriele Oliaro, Jianan Ji, Xupeng Miao, Vinod Grover, Todd C. Mowry (possible past Carnegie Mellon University affiliation), Zhihao Jia, Tianqi Chen (possible past University Of Washington affiliation)
Abstract

Modern GPU workloads, especially large language model (LLM) inference, suffer from kernel launch overheads and coarse synchronization that limit inter-kernel parallelism. Recent megakernel techniques fuse multiple operators into a single persistent kernel to eliminate launch gaps and expose inter-kernel parallelism, but struggle to handle dynamic shapes and data-dependent computation in real workloads. We present Event Tensor, a unified compiler abstraction for dynamic megakernels. Event Tensor ...

πŸ“„ Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations
πŸ—“οΈ Published: 4/14/2026
πŸ”— http://arxiv.org/abs/2604.12968v1
πŸ‘₯ Authors: Tong Zhang (possible past Tencent (China) affiliation), Jiangning Zhang (possible past Tencent (China) affiliation), Zhucun Xue, Juntao Jiang, Yicheng Xu, Chengming Xu, Teng Hu, Xingyu Xie, Xiaobin Hu (possible past Tencent (China) affiliation), Yabiao Wang (possible past Tencent (China) affiliation), Yong Liu, Shuicheng Yan (possible past National University Of Singapore affiliation)
Abstract

Balancing convergence speed, generalization capability, and computational efficiency remains a core challenge in deep learning optimization. First-order gradient descent methods, epitomized by stochastic gradient descent (SGD) and Adam, serve as the cornerstone of modern training pipelines. However, large-scale model training, stringent differential privacy requirements, and distributed learning paradigms expose critical limitations in these conventional approaches regarding privacy protection a...
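As a concrete anchor for the first-order methods the survey takes as its baseline, here is one scalar Adam update in Kingma and Ba's standard bias-corrected form; the hyperparameter defaults are the usual ones, not values taken from this paper:

```python
def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One scalar Adam update (standard bias-corrected form)."""
    m = b1 * m + (1 - b1) * grad           # first-moment (mean) EMA
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment (uncentered) EMA
    m_hat = m / (1 - b1 ** t)              # bias correction, t >= 1
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v

# First step with a large gradient: the bias-corrected update is still
# about lr in magnitude, independent of the raw gradient scale -- unlike
# plain SGD's lr * grad.
theta, m, v = adam_step(theta=0.0, grad=100.0, m=0.0, v=0.0, t=1)
```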

πŸ“„ VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization
πŸ—“οΈ Published: 4/14/2026
πŸ”— http://arxiv.org/abs/2604.12887v1
πŸ‘₯ Authors: Andrei Atanov, Jesse Allardice, Roman Bachmann, Oğuzhan Fatih Kar, R Devon Hjelm (possible past Microsoft (United States) affiliation), David Griffiths, Peter Fu, Afshin Dehghan, Amir Zamir (possible past University Of California, Berkeley affiliation)
Abstract

Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers dictate what information is preserved and how it is organized. A de facto standard approach to video tokenization is to represent a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal. This requires the downstream model that consumes the tokens, e.g., a text-to-video model, to learn to predict...
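The rigidity of the 3D-grid formulation is just patch arithmetic: the token count is fixed by resolution and clip length, never by content. A sketch, with illustrative patch sizes that are not VideoFlexTok's:

```python
def grid_token_count(frames, height, width, pt=4, ph=8, pw=8):
    """Tokens produced by a standard 3D spatiotemporal grid tokenizer.

    Each non-overlapping (pt x ph x pw) patch maps to one token; the
    default patch sizes are illustrative, not taken from the paper.
    """
    assert frames % pt == 0 and height % ph == 0 and width % pw == 0
    return (frames // pt) * (height // ph) * (width // pw)

# A 16-frame 256x256 clip always costs the same token budget, however
# simple its content -- the fixed-length constraint that flexible-length
# tokenization relaxes.
n = grid_token_count(16, 256, 256)   # 4 * 32 * 32 = 4096
```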

*Notable papers are those with at least two authors from a "big" AI/ML lab.