Recent Notable AI/ML arXiv Papers

📄 Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

🗓️ Published: 3/18/2026
🔗 http://arxiv.org/abs/2603.18002v1
👥 Authors: Kevin Qu, Haozhe Qi, Mihai Dusmanu, Mahdi Rad, Rui Wang (possible past Tencent (China) affiliation), Marc Pollefeys (possible past Google (United States) affiliation)

Abstract

Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial co...

📄 How do LLMs Compute Verbal Confidence

🗓️ Published: 3/18/2026
🔗 http://arxiv.org/abs/2603.17839v1
👥 Authors: Dharshan Kumaran (possible past Google (United States) affiliation), Arthur Conmy, Federico Barbero, Simon Osindero (possible past Google (United States) affiliation), Viorica Patraucean, Petar Velickovic

Abstract

Verbal confidence -- prompting LLMs to state their confidence as a number or category -- is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed - just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents - token log-probabilities, or a richer evaluation of answer q...

📄 AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement

🗓️ Published: 3/18/2026
🔗 http://arxiv.org/abs/2603.17441v1
👥 Authors: Siqi Pei, Liang Tang, Tiaonan Duan, Long Chen (possible past Tencent (China) affiliation), Shuxian Li, Kaer Huang, Yanzhe Jing, Yiqiang Yan, Bo Zhang (possible past Tencent (China) affiliation), Chenghao Jiang, Borui Zhang, Jiwen Lu (possible past Tsinghua University affiliation)

Abstract

GUI grounding is a critical capability for vision-language models (VLMs) that enables automated interaction with graphical user interfaces by locating target elements from natural language instructions. However, grounding on GUI screenshots remains challenging due to high-resolution images, small UI elements, and ambiguous user instructions. In this work, we propose AdaZoom-GUI, an adaptive zoom-based GUI grounding framework that improves both localization accuracy and instruction understanding....

📄 Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress

🗓️ Published: 3/18/2026
🔗 http://arxiv.org/abs/2603.17312v1
👥 Authors: Yuelin Zhang, Sijie Cheng, Chen Li (possible past Tencent (China) affiliation), Zongzhao Li, Yuxin Huang, Yang Liu (possible past Tsinghua University affiliation), Wenbing Huang (possible past Tsinghua University affiliation)

Abstract

Accurately estimating task progress is critical for embodied agents to plan and execute long-horizon, multi-step tasks. Despite promising advances, existing Vision-Language Models (VLMs) based methods primarily leverage their video understanding capabilities, while neglecting their complex reasoning potential. Furthermore, processing long video trajectories with VLMs is computationally prohibitive for real-world deployment. To address these challenges, we propose the Recurrent Reasoning Vision-L...

📄 Demystifing Video Reasoning

🗓️ Published: 3/17/2026
🔗 http://arxiv.org/abs/2603.16870v1
👥 Authors: Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li (possible past Tencent (China) affiliation), Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang (possible past Google (United States) affiliation)

Abstract

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative a...

📄 ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K

🗓️ Published: 3/17/2026
🔗 http://arxiv.org/abs/2603.16866v1
👥 Authors: Kaixuan Wang, Tianxing Chen, Jiawei Liu, Honghao Su, Shaolong Zhu, Minxuan Wang, Zixuan Li, Yue Chen (possible past Google (United States) affiliation), Huan-Ang Gao, Yusen Qin, Jiawei Wang, Qixuan Zhang, Lan Xu, Jingyi Yu, Yao Mu, Ping Luo (possible past Shanghai Artificial Intelligence Laboratory affiliation)

Abstract

Learning in simulation provides a useful foundation for scaling robotic manipulation capabilities. However, this paradigm often suffers from a lack of data-generation-ready digital assets, in both scale and diversity. In this work, we present ManiTwin, an automated and efficient pipeline for generating data-generation-ready digital object twins. Our pipeline transforms a single image into a simulation-ready and semantically annotated 3D asset, enabling large-scale robotic manipulation data generat...

📄 SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation

🗓️ Published: 3/17/2026
🔗 http://arxiv.org/abs/2603.16864v1
👥 Authors: Jiongze Yu, Xiangbo Gao, Pooja Verlani, Akshay Gadde, Yilin Wang (possible past Google (United States) affiliation), Balu Adsumilli (possible past Google (United States) affiliation), Zhengzhong Tu (possible past Google (United States) affiliation)

Abstract

Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) estimates, yet most existing VSR approaches behave like black boxes at inference time: users cannot reliably correct unexpected artifacts, but instead can only accept whatever the model produces. In this paper, we propose a novel interactive VSR framework dubbed SparkVSR that makes sparse keyframes a simple and expressive control signal. Specifically, users can first super-resolve or optionally a smal...

📄 SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

🗓️ Published: 3/17/2026
🔗 http://arxiv.org/abs/2603.16859v1
👥 Authors: Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang (possible past Google (United States) affiliation), Wang Chen, Yuhui Zeng, Ruize Fang, Yixuan Zou, Xiawu Zheng, Jiebo Luo, Rongrong Ji (possible past Tencent (China) affiliation)

Abstract

Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimension...

📄 SOMA: Unifying Parametric Human Body Models

🗓️ Published: 3/17/2026
🔗 http://arxiv.org/abs/2603.16858v1
👥 Authors: Jun Saito, Jiefeng Li, Michael De Ruyter, Miguel Guerrero, Edy Lim, Ehsan Hassani, Roger Blanco Ribera, Hyejin Moon, Magdalena Dadela, Marco Di Lucca, Qiao Wang, Xueting Li, Jan Kautz (possible past Nvidia (United States) affiliation), Simon Yuen, Umar Iqbal (possible past Nvidia (United States) affiliation)

Abstract

Parametric human body models are foundational to human reconstruction, animation, and simulation, yet they remain mutually incompatible: SMPL, SMPL-X, MHR, Anny, and related models each diverge in mesh topology, skeletal structure, shape parameterization, and unit convention, making it impractical to exploit their complementary strengths within a single pipeline. We present SOMA, a unified body layer that bridges these heterogeneous representations through three abstraction layers. Mesh topology...

📄 Surg$Σ$: A Spectrum of Large-Scale Multimodal Data and Foundation Models for Surgical Intelligence

🗓️ Published: 3/17/2026
🔗 http://arxiv.org/abs/2603.16822v1
👥 Authors: Zhitao Zeng, Mengya Xu, Jian Jiang, Pengfei Guo, Yunqiu Xu, Zhu Zhuo, Chang Han Low, Yufan He (possible past Nvidia (United States) affiliation), Dong Yang (possible past Nvidia (United States) affiliation), Chenxi Lin, Yiming Gu, Jiaxin Guo, Yutong Ban, Daguang Xu (possible past Nvidia (United States) affiliation), Qi Dou, Yueming Jin

Abstract

Surgical intelligence has the potential to improve the safety and consistency of surgical care, yet most existing surgical AI frameworks remain task-specific and struggle to generalize across procedures and institutions. Although multimodal foundation models, particularly multimodal large language models, have demonstrated strong cross-task capabilities across various medical domains, their advancement in surgery remains constrained by the lack of large-scale, systematically curated multimodal d...

📄 InCoder-32B: Code Foundation Model for Industrial Scenarios

🗓️ Published: 3/17/2026
🔗 http://arxiv.org/abs/2603.16790v1
👥 Authors: Jian Yang, Wei Zhang (possible past Tsinghua University affiliation), Jiajun Wu (possible past Massachusetts Institute Of Technology affiliation), Junhang Cheng, Shawn Guo, Haowen Wang, Weicheng Gu, Yaxin Du, Joseph Li, Fanglin Xu, Yizhi Li, Lin Jing, Yuanbo Wang, Yuhan Gao, Ruihao Gong, Chuan Hao, Ran Tao, Aishan Liu, Tuney Zheng, Ganqu Cui (possible past Tsinghua University affiliation), Zhoujun Li, Mingjie Tang, Chenghua Lin, Wayne Xin Zhao (possible past Baidu (China) affiliation), Xianglong Liu, Ming Zhou, Bryan Dai, Weifeng Lv

Abstract

Recent code large language models have achieved remarkable progress on general programming tasks. Nevertheless, their performance degrades significantly in industrial scenarios that require reasoning about hardware semantics, specialized language constructs, and strict resource constraints. To address these challenges, we introduce InCoder-32B (Industrial-Coder-32B), the first 32B-parameter code foundation model unifying code intelligence across chip design, GPU kernel optimization, embedded sys...

📄 TurnWise: The Gap between Single- and Multi-turn Language Model Capabilities

🗓️ Published: 3/17/2026
🔗 http://arxiv.org/abs/2603.16759v1
👥 Authors: Victoria Graf, Valentina Pyatkin, Nouha Dziri, Nathan Lambert (possible past University Of California, Berkeley affiliation), Hannaneh Hajishirzi (possible past University Of Washington affiliation)

Abstract

Multi-turn conversations are a common and critical mode of language model interaction. However, current open training and evaluation data focus on single-turn settings, failing to capture the additional dimension of these longer interactions. To understand this multi-/single-turn gap, we first introduce a new benchmark, TurnWiseEval, for multi-turn capabilities that is directly comparable to single-turn chat evaluation. Our evaluation isolates multi-turn specific conversational ability through p...

📄 IQuest-Coder-V1 Technical Report

🗓️ Published: 3/17/2026
🔗 http://arxiv.org/abs/2603.16733v1
👥 Authors: Jian Yang, Wei Zhang (possible past Tsinghua University affiliation), Shawn Guo, Zhengmao Ye, Lin Jing, Shark Liu, Yizhi Li, Jiajun Wu (possible past Massachusetts Institute Of Technology affiliation), Cening Liu, X. Ma, Yuyang Song, Siwei Wu, Yuwen Li, L. Liao, T. Zheng, Ziling Huang, Zelong Huang, Che Liu, Yan Xing, Renyuan Li, Qingsong Cai, Hanxu Yan, Siyue Wang, Shikai Li, Jason Klein Liu, An Huang, Yongsheng Kang, Jinxing Zhang, Chuan Hao, Haowen Wang, Weicheng Gu, Ran Tao, Mingjie Tang, Peihao Wu, Jianzhou Wang, Xianglong Liu, Weifeng Lv, Bryan Dai

Abstract

In this report, we introduce the IQuest-Coder-V1 series-(7B/14B/40B/40B-Loop), a new family of code large language models (LLMs). Moving beyond static code representations, we propose the code-flow multi-stage training paradigm, which captures the dynamic evolution of software logic through different phases of the pipeline. Our models are developed through the evolutionary pipeline, starting with the initial pre-training consisting of code facts, repository, and completion data. Following that, ...

📄 When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making

🗓️ Published: 3/17/2026
🔗 http://arxiv.org/abs/2603.16673v1
👥 Authors: Jun Liu (possible past Tencent (China) affiliation), Pu Zhao, Zhenglun Kong, Xuan Shen, Peiyan Dong, Fan Yang (possible past Tencent (China) affiliation), Lin Cui, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Gaowen Liu, Yanzhi Wang, Dong Huang

Abstract

Embodied robotic systems increasingly rely on large language model (LLM)-based agents to support high-level reasoning, planning, and decision-making during interactions with the environment. However, invoking LLM reasoning introduces substantial computational latency and resource overhead, which can interrupt action execution and reduce system reliability. Excessive reasoning may delay actions, while insufficient reasoning often leads to incorrect decisions and task failures. This raises a funda...

📄 Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation

🗓️ Published: 3/17/2026
🔗 http://arxiv.org/abs/2603.16664v1
👥 Authors: Jiawei Mao, Hardy Chen, Haoqin Tu, Yuhan Wang (possible past Tencent (China) affiliation), Letian Zhang, Zeyu Zheng, Huaxiu Yao, Zirui Wang, Cihang Xie (possible past Google (United States) affiliation), Yuyin Zhou

Abstract

Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly narrows their deployment. As training these LVLMs to avoid hallucinations becomes prohibitively expensive for larger models, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework...

📄 Towards Infinitely Long Neural Simulations: Self-Refining Neural Surrogate Models for Dynamical Systems

🗓️ Published: 3/18/2026
🔗 http://arxiv.org/abs/2603.17750v1
👥 Authors: Qi Liu (possible past Tencent (China) affiliation), Laure Zanna, Joan Bruna (possible past University Of California, Berkeley affiliation)

Abstract

Recent advances in autoregressive neural surrogate models have enabled orders-of-magnitude speedups in simulating dynamical systems. However, autoregressive models are generally prone to distribution drift: compounding errors in autoregressive rollouts that severely degrade generation quality over long time horizons. Existing work attempts to address this issue by implicitly leveraging the inherent trade-off between short-time accuracy and long-time consistency through hyperparameter tuning. In ...

📄 HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness

🗓️ Published: 3/18/2026
🔗 http://arxiv.org/abs/2603.17573v1
👥 Authors: Zihao Zheng, Zhihao Mao, Sicheng Tian, Maoliang Li, Jiayu Chen, Xinhao Sun, Zhaobo Zhang, Xuanzhe Liu, Donggang Cao, Hong Mei (possible past Peking University affiliation), Xiang Chen (possible past Tencent (China) affiliation)

Abstract

Vision-Language-Action (VLA) Models have become the mainstream solution for robot control, but suffer from slow inference speeds. Speculative Decoding (SD) is a promising acceleration method which can be divided into two categories: drafter-based SD and retrieval-based SD. Existing methods fail to analyze the advantages and disadvantages of these two types of SD in VLA models, leading to their sole application or optimization. In this paper, we analyze the trajectory patterns of robots controlle...

📄 ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression

🗓️ Published: 3/18/2026
🔗 http://arxiv.org/abs/2603.17435v1
👥 Authors: Ruibo Fan, Xiangrui Yu, Xinglin Pan, Zeyu Li (possible past Peking University affiliation), Weile Luo, Qiang Wang, Wei Wang (possible past University Of Oxford affiliation), Xiaowen Chu

Abstract

Lossless model compression holds tremendous promise for alleviating the memory and bandwidth bottlenecks in bit-exact Large Language Model (LLM) serving. However, existing approaches often result in substantial inference slowdowns due to fundamental design mismatches with GPU architectures: at the kernel level, variable-length bitstreams produced by traditional entropy codecs break SIMT parallelism; at the system level, decoupled pipelines lead to redundant memory traffic. We present ZipServ, a ...

📄 Beyond Outliers: A Data-Free Layer-wise Mixed-Precision Quantization Approach Driven by Numerical and Structural Dual-Sensitivity

🗓️ Published: 3/18/2026
🔗 http://arxiv.org/abs/2603.17354v1
👥 Authors: Hengyuan Zhang, Xinrong Chen, Zunhai Su, Xiao Liang, Jing Xiong, Wendong Xu, He Xiao, Chaofan Tao, Wei Zhang (possible past Tsinghua University affiliation), Ruobing Xie (possible past Tencent (China) affiliation), Lei Jiang, Hayden Kwok-Hay So, Ngai Wong

Abstract

Layer-wise mixed-precision quantization (LMPQ) enables effective compression under extreme low-bit settings by allocating higher precision to sensitive layers. However, existing methods typically treat all intra-layer weight modules uniformly and rely on a single numerical property when estimating sensitivity, overlooking their distinct operational roles and structural characteristics. To address this, we propose NSDS, a novel calibration-free LMPQ framework driven by Numerical and Structural Du...

📄 Tabular LLMs for Interpretable Few-Shot Alzheimer's Disease Prediction with Multimodal Biomedical Data

🗓️ Published: 3/17/2026
🔗 http://arxiv.org/abs/2603.17191v1
👥 Authors: Sophie Kearney, Shu Yang, Zixuan Wen, Weimin Lyu, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Jason H. Moore, Marylyn D. Ritchie, Chao Chen (possible past Tencent (China) affiliation), Li Shen (possible past Tencent (China) affiliation)

Abstract

Accurate diagnosis of Alzheimer's disease (AD) requires handling tabular biomarker data, yet such data are often small and incomplete, where deep learning models frequently fail to outperform classical methods. Pretrained large language models (LLMs) offer few-shot generalization, structured reasoning, and interpretable outputs, providing a powerful paradigm shift for clinical prediction. We propose TAP-GPT (Tabular Alzheimer's Prediction GPT), a domain-adapted tabular LLM framework built on Table...

📄 Deep Tabular Representation Corrector

🗓️ Published: 3/17/2026
🔗 http://arxiv.org/abs/2603.16569v1
👥 Authors: Hangting Ye, Peng Wang (possible past Peking University affiliation), Wei Fan (possible past Tencent (China) affiliation), Xiaozhuang Song, He Zhao (possible past Tencent (China) affiliation), Dandan Gun, Yi Chang

Abstract

Tabular data play an important role in diverse real-world fields such as healthcare, engineering, and finance. The recent success of deep learning has fostered many tabular learning methods based on deep networks (e.g., Transformer, ResNet). Generally, existing deep tabular machine learning methods follow one of two paradigms: in-learning and pre-learning. In-learning methods need to train networks from scratch or impose extra constraints to regulate the representati...
