📄 Notable* Recent AI/ML arXiv Papers

📄 3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding
🗓️ Published: 3/24/2026
🔗 http://arxiv.org/abs/2603.23447v1
👥 Authors: Yiping Chen, Jinpeng Li, Wenyu Ke, Yang Luo, Jie Ouyang, Zhongjie He, Li Liu (possible past National University Of Defense Technology affiliation), Hongchao Fan, Hao Wu (possible past Tencent (China) affiliation)
Abstract

While multi-modality large language models excel in object-centric or indoor scenarios, scaling them to 3D city-scale environments remains a formidable challenge. To bridge this gap, we propose 3DCity-LLM, a unified framework designed for 3D city-scale vision-language perception and understanding. 3DCity-LLM employs a coarse-to-fine feature encoding strategy comprising three parallel branches for target object, inter-object relationship, and global scene. To facilitate large-scale training, we i...

📄 PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments
🗓️ Published: 3/24/2026
🔗 http://arxiv.org/abs/2603.23231v1
👥 Authors: Shuochen Liu, Junyi Zhu, Long Shu, Junda Lin, Yuhao Chen, Haotian Zhang (possible past Stanford University affiliation), Chao Zhang, Derong Xu, Jia Li (possible past Google (United States) affiliation), Bo Tang, Zhiyu Li, Feiyu Xiong, Enhong Chen (possible past Baidu (China) affiliation), Tong Xu (possible past Baidu (China) affiliation)
Abstract

Empowering large language models with long-term memory is crucial for building agents that adapt to users' evolving needs. However, prior evaluations typically interleave preference-related dialogues with irrelevant conversations, reducing the task to needle-in-a-haystack retrieval while ignoring relationships between events that drive the evolution of user preferences. Such settings overlook a fundamental characteristic of real-world personalization: preferences emerge gradually and accumulate ...

📄 ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment
🗓️ Published: 3/24/2026
🔗 http://arxiv.org/abs/2603.23184v1
👥 Authors: Hao Wang (possible past Tsinghua University affiliation), Haocheng Yang, Licheng Pan, Lei Shen, Xiaoxi Li, Yinuo Wang, Zhichao Chen, Yuan Lu, Haoxuan Li, Zhouchen Lin (possible past Peking University affiliation)
Abstract

Reward modeling represents a long-standing challenge in reinforcement learning from human feedback (RLHF) for aligning language models. Current reward modeling is heavily contingent upon explicit feedback data with high collection costs. In this work, we study implicit reward modeling -- learning reward models from implicit human feedback (e.g., clicks and copies) -- as a cost-effective alternative. We identify two fundamental challenges in implicit reward modeling: (1) Implicit pre...
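
The truncated abstract doesn't reveal ImplicitRM's estimator, but a common baseline for learning rewards from pairwise preference signals (explicit or implicit) is a Bradley-Terry objective. The sketch below shows that generic baseline with illustrative names throughout; it is not the paper's method.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy scalar reward head over a fixed text embedding (illustrative only)."""
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.head(emb).squeeze(-1)

def bradley_terry_loss(rm, emb_chosen, emb_rejected):
    # P(chosen > rejected) = sigmoid(r_chosen - r_rejected);
    # minimize the negative log-likelihood of the observed preference.
    margin = rm(emb_chosen) - rm(emb_rejected)
    return -torch.nn.functional.logsigmoid(margin).mean()

# Implicit signals (e.g., a click on response A but not B) can be mapped to
# noisy (chosen, rejected) pairs before applying the same objective.
rm = RewardModel(dim=16)
a, b = torch.randn(8, 16), torch.randn(8, 16)
loss = bradley_terry_loss(rm, a, b)
loss.backward()
```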

📄 MedCausalX: Adaptive Causal Reasoning with Self-Reflection for Trustworthy Medical Vision-Language Models
🗓️ Published: 3/24/2026
🔗 http://arxiv.org/abs/2603.23085v1
👥 Authors: Jianxin Lin, Chunzheng Zhu, Peter J. Kneuertz, Yunfei Bai (possible past Google (United States) affiliation), Yuan Xue (possible past Google (United States) affiliation)
Abstract

Vision-Language Models (VLMs) have enabled interpretable medical diagnosis by integrating visual perception with linguistic reasoning. Yet, existing medical chain-of-thought (CoT) models lack explicit mechanisms to represent and enforce causal reasoning, leaving them vulnerable to spurious correlations and limiting their clinical reliability. We pinpoint three core challenges in medical CoT reasoning: how to adaptively trigger causal correction, construct high-quality causal-spurious contrastive...

📄 JFTA-Bench: Evaluate LLM's Ability of Tracking and Analyzing Malfunctions Using Fault Trees
🗓️ Published: 3/24/2026
🔗 http://arxiv.org/abs/2603.22978v1
👥 Authors: Yuhui Wang, Zhixiong Yang, Ming Zhang (possible past Peking University affiliation), Shihan Dou, Zhiheng Xi, Enyu Zhou, Senjie Jin, Yujiong Shen, Dingwei Zhu, Yi Dong, Tao Gui, Qi Zhang (possible past Tencent (China) affiliation), Xuanjing Huang
Abstract

In the maintenance of complex systems, fault trees are used to locate problems and provide targeted solutions. To enable fault trees stored as images to be directly processed by large language models, which can assist in tracking and analyzing malfunctions, we propose a novel textual representation of fault trees. Building on it, we construct a benchmark for multi-turn dialogue systems that emphasizes robust interaction in complex environments, evaluating a model's ability to assist in malfuncti...
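
The paper's textual representation isn't specified in the truncated abstract; as rough intuition for why a text serialization helps, here is one hypothetical encoding of a fault tree as nested AND/OR gates that a program (or an LLM) can traverse. The format and event names are invented for illustration.

```python
# A hypothetical textual serialization of a fault tree as nested AND/OR gates.
# This is NOT the paper's format (the abstract doesn't specify it); it only
# illustrates making a fault-tree image consumable as structured text.
fault_tree = {
    "event": "pump failure",
    "gate": "OR",
    "children": [
        {"event": "motor fault", "gate": "AND",
         "children": [{"event": "winding short"}, {"event": "bearing seizure"}]},
        {"event": "impeller jam"},
    ],
}

def occurs(node, observed: set) -> bool:
    """Evaluate whether the top event occurs given observed basic events."""
    children = node.get("children")
    if not children:                       # basic event (leaf)
        return node["event"] in observed
    results = [occurs(c, observed) for c in children]
    return all(results) if node["gate"] == "AND" else any(results)

print(occurs(fault_tree, {"impeller jam"}))  # True: the OR gate fires
```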

📄 Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories
🗓️ Published: 3/24/2026
🔗 http://arxiv.org/abs/2603.22869v1
👥 Authors: Yang Li (possible past Google (United States) affiliation), Yule Liu, Xinlei He, Youjian Zhao (possible past Tsinghua University affiliation), Qi Li, Ke Xu
Abstract

Large Language Models (LLMs) have become core cognitive components in modern artificial intelligence (AI) systems, combining internal knowledge with external context to perform complex tasks. However, LLMs typically treat all accessible data indiscriminately, lacking inherent awareness of knowledge ownership and access boundaries. This deficiency heightens risks of sensitive data leakage and adversarial manipulation, potentially enabling unauthorized system access and severe security crises. Exi...
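
For contrast with the internalized approach the title describes, the usual external alternative is to filter context against access-control lists before it ever reaches the model. The sketch below shows that baseline; the field names are invented, and nothing here reproduces the paper's reasoning-trajectory method.

```python
# Generic external-filtering baseline for the missing capability the abstract
# describes: checking ownership/access boundaries before context reaches the
# model. Chain-of-Authorization instead internalizes such checks; all field
# names here are invented for illustration.
documents = [
    {"text": "Q3 revenue draft", "owner": "finance", "acl": {"alice"}},
    {"text": "public FAQ",       "owner": "support", "acl": {"alice", "bob"}},
]

def authorized_context(user: str, docs: list[dict]) -> list[str]:
    """Keep only documents the requesting user is allowed to see."""
    return [d["text"] for d in docs if user in d["acl"]]

print(authorized_context("bob", documents))  # only the public FAQ
```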

📄 UniQueR: Unified Query-based Feedforward 3D Reconstruction
🗓️ Published: 3/24/2026
🔗 http://arxiv.org/abs/2603.22851v1
👥 Authors: Chensheng Peng, Quentin Herau, Jiezhi Yang, Yichen Xie, Yihan Hu, Wenzhao Zheng, Matthew Strong, Masayoshi Tomizuka (possible past University Of California, Berkeley affiliation), Wei Zhan (possible past University Of California, Berkeley affiliation)
Abstract

We present UniQueR, a unified query-based feedforward framework for efficient and accurate 3D reconstruction from unposed images. Existing feedforward models such as DUSt3R, VGGT, and AnySplat typically predict per-pixel point maps or pixel-aligned Gaussians, which remain fundamentally 2.5D and limited to visible surfaces. In contrast, UniQueR formulates reconstruction as a sparse 3D query inference problem. Our model learns a compact set of 3D anchor points that act as explicit geometric querie...

📄 UAV-DETR: DETR for Anti-Drone Target Detection
🗓️ Published: 3/24/2026
🔗 http://arxiv.org/abs/2603.22841v1
👥 Authors: Jun Yang (possible past Tsinghua University affiliation), Dong Wang (possible past Tsinghua University affiliation), Hongxu Yin, Hongpeng Li, Jianxiong Yu
Abstract

Drone detection is pivotal in numerous security and counter-UAV applications. However, existing deep learning-based methods typically struggle to balance robust feature representation with computational efficiency. This challenge is particularly acute when detecting miniature drones against complex backgrounds under severe environmental interference. To address these issues, we introduce UAV-DETR, a novel framework that integrates a small-target-friendly architecture with real-time detection cap...

📄 Generalizing Dynamics Modeling More Easily from Representation Perspective
🗓️ Published: 3/24/2026
🔗 http://arxiv.org/abs/2603.22655v1
👥 Authors: Yiming Wang, Zhengnan Zhang, Genghe Zhang, Jiawen Dan, Changchun Li, Chenlong Hu, Chris Nugent, Jun Liu (possible past Tencent (China) affiliation), Ximing Li, Bo Yang (possible past Tencent (China) affiliation)
Abstract

Learning system dynamics from observations is a critical problem in many applications across various real-world complex systems, e.g., climate, ecology, and fluid systems. Recently, neural dynamics modeling methods have become a prevalent solution that embeds the object's observations into a latent space before learning dynamics using neural methods such as neural Ordinary Differential Equations (ODEs). Existing dynamics modeling methods induce a specific model for each observation of different comp...
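
As a concrete picture of the embed-then-integrate pipeline the abstract describes, here is a minimal latent-dynamics sketch using a forward-Euler step in place of a full neural-ODE solver. It illustrates the generic pattern only, not the paper's proposed model; all module shapes are arbitrary.

```python
import torch
import torch.nn as nn

# Generic latent-dynamics pipeline: embed observations into a latent state,
# integrate learned dynamics dz/dt = f(z), and decode predictions.
encoder = nn.Linear(3, 8)          # observation -> latent state z0
dynamics = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 8))
decoder = nn.Linear(8, 3)          # latent state -> observation space

def rollout(x0: torch.Tensor, steps: int, dt: float = 0.1) -> torch.Tensor:
    """Forward-Euler integration of the learned ODE from an encoded state."""
    z = encoder(x0)
    trajectory = []
    for _ in range(steps):
        z = z + dt * dynamics(z)   # one Euler step of dz/dt = f(z)
        trajectory.append(decoder(z))
    return torch.stack(trajectory)

preds = rollout(torch.randn(4, 3), steps=20)  # (20, 4, 3) predicted states
```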

📄 Bridging the Know-Act Gap via Task-Level Autoregressive Reasoning
🗓️ Published: 3/23/2026
🔗 http://arxiv.org/abs/2603.22619v1
👥 Authors: Jihyun Janice Ahn, Ryo Kamoi, Berk Atil, Renze Lou, Wonwoo Kang, Heehyun Park, Sarkar Snigdha Sarathi Das, Zhuoyang Zou, Xiaoxin Lu, Yusen Zhang, Asfahan Shah, Ridwanul Hasan Tanvir, Lingxiao Zhao, Hongxi Huang, Vignesh Venkatesh, Dianjun Lin, Hamid Shah, Wentao Wang (possible past Baidu (China) affiliation), Zhanpeng Song, Joshua Reed Bassin, Dax Patel, Ishan Appareddy Agrahar, Sahil Pardasani, Xin Dong (possible past Tsinghua University affiliation), Fatemeh Rahbari, Benjamin David Rishel, Soochan Andrew Lee, Yuv Boghani, Ali B. Alnaseeb, Pranav Suby, Seokhyeon Bae, Shreya Buddharaju, Damien Kula, Soumyadeep Das, Hanyang Frank Liu, Faye Mo, Wenpeng Yin
Abstract

LLMs often generate seemingly valid answers to flawed or ill-posed inputs. This is not due to missing knowledge: under discriminative prompting, the same models can mostly identify such issues, yet fail to reflect this in standard generative responses. This reveals a fundamental know-act gap between discriminative recognition and generative behavior. Prior work largely characterizes this issue in narrow settings, such as math word problems or question answering, with limited focus on how to inte...
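
The gap is easy to state concretely: the same model can pass a discriminative check on a flawed premise yet still answer it generatively. A minimal sketch, with `ask` as a hypothetical stand-in for any chat-completion call:

```python
# The know-act gap in miniature. `ask` is a hypothetical stand-in for any
# chat-completion call; the contrasting outputs are the behavior the abstract
# describes, not outputs from a specific model.
question = "If a triangle has angles 90, 90, and 45 degrees, what is its area?"

discriminative = (
    "Is the following question well-posed? Answer yes or no.\n" + question
)
generative = question

# ask(discriminative) -> "No"              (the model *knows* the premise is flawed)
# ask(generative)     -> "The area is..."  (yet it *acts* as if it were valid)
```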

📄 Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
🗓️ Published: 3/23/2026
🔗 http://arxiv.org/abs/2603.22529v1
👥 Authors: Shoubin Yu, Lei Shu, Antoine Yang, Yao Fu, Srinivas Sunkara (possible past Google (United States) affiliation), Maria Wang, Jindong Chen (possible past Google (United States) affiliation), Mohit Bansal, Boqing Gong (possible past Tencent (China) affiliation)
Abstract

Multimodal AI agents are increasingly automating complex real-world workflows that involve online web execution. However, current web-agent benchmarks suffer from a critical limitation: they focus entirely on web-based interaction and perception, lacking grounding in the user's real-world physical surroundings. This limitation prevents evaluation in crucial scenarios, such as when an agent must use egocentric visual perception (e.g., via AR glasses) to recognize an object in the user's surroundi...

📄 LLMON: An LLM-native Markup Language to Leverage Structure and Semantics at the LLM Interface
🗓️ Published: 3/23/2026
🔗 http://arxiv.org/abs/2603.22519v1
👥 Authors: Michael Hind, Basel Shbita, Bo Wu (possible past Tencent (China) affiliation), Farhan Ahmed, Chad Deluca, Nathan Fulton, David Cox (possible past IBM (United States) affiliation), Dan Gutfreund
Abstract

Textual Large Language Models (LLMs) provide a simple and familiar interface: a string of text is used for both input and output. However, the information conveyed to an LLM often has a richer structure and semantics, which are not captured by a plain string. For example, most prompts contain both instructions ("Summarize this paper into a paragraph") and data (the paper to summarize), but these are usually not distinguished when passed to the model. This can lead to model confusion and security risks,...
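
The truncated abstract doesn't show LLMON's syntax, but the underlying idea, separating the instruction channel from the data channel, can be illustrated with invented tags. This is a generic sketch, not the paper's markup:

```python
# Generic illustration of instruction/data separation. The tag syntax here is
# invented for the example; it is not LLMON's actual markup, which the
# truncated abstract does not specify.
def build_prompt(instruction: str, data: str) -> str:
    return (
        "<instruction>\n" + instruction + "\n</instruction>\n"
        "<data>\n" + data + "\n</data>"
    )

prompt = build_prompt(
    instruction="Summarize this paper into a paragraph.",
    data="(full paper text here)",
)
# A model trained to honor the boundary can refuse directives found inside
# <data>, which is the security benefit of separating the two channels.
```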

📄 CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation
🗓️ Published: 3/23/2026
🔗 http://arxiv.org/abs/2603.22435v1
👥 Authors: Max Fu, Justin Yu, Karim El-Refai, Ethan Kou, Haoru Xue, Huang Huang, Wenli Xiao, Guanzhi Wang (possible past Stanford University affiliation), Fei-Fei Li, Guanya Shi, Jiajun Wu (possible past Massachusetts Institute Of Technology affiliation), Shankar Sastry, Yuke Zhu (possible past Stanford University affiliation), Ken Goldberg (possible past University Of California, Berkeley affiliation), Linxi "Jim" Fan
Abstract

"Code-as-Policy" considers how executable code can complement data-intensive Vision-Language-Action (VLA) methods, yet their effectiveness as autonomous controllers for embodied manipulation remains underexplored. We present CaP-X, an open-access framework for systematically studying Code-as-Policy agents in robot manipulation. At its core is CaP-Gym, an interactive environment in which agents control robots by synthesizing and executing programs that compose perception and control primitives. B...

📄 WorldCache: Content-Aware Caching for Accelerated Video World Models
🗓️ Published: 3/23/2026
🔗 http://arxiv.org/abs/2603.22286v1
👥 Authors: Umair Nawaz, Ahmed Heakl, Ufaq Khan, Abdelrahman Shaker, Salman Khan (possible past Inception Institute Of Artificial Intelligence affiliation), Fahad Shahbaz Khan (possible past Inception Institute Of Artificial Intelligence affiliation)
Abstract

Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption, i.e., reusing cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsiste...
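
For intuition, here is the Zero-Order Hold assumption the abstract critiques, reduced to a toy caching rule: replay a stale activation while the input has drifted less than a tolerance. WorldCache's content-aware policy is more sophisticated; this only shows the baseline behavior.

```python
import numpy as np

# Toy Zero-Order Hold (ZOH) cache: when input drift since the last recompute
# is small, reuse the cached activation verbatim instead of recomputing the
# expensive block. Illustrative only; not WorldCache's actual policy.
def expensive_block(x):
    return np.tanh(x)  # stand-in for a costly attention/MLP block

cache = {"feat": None, "inp": None}

def cached_block(x, tol=1e-2):
    if cache["feat"] is not None and np.linalg.norm(x - cache["inp"]) < tol:
        return cache["feat"]          # ZOH: replay the stale snapshot
    cache["inp"], cache["feat"] = x, expensive_block(x)
    return cache["feat"]

x = np.zeros(4)
for step in range(10):                # mock denoising loop
    x = x + 1e-3                      # small per-step drift
    feat = cached_block(x)            # hits while accumulated drift < tol
```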

📄 End-to-End Training for Unified Tokenization and Latent Denoising
🗓️ Published: 3/23/2026
🔗 http://arxiv.org/abs/2603.22283v1
👥 Authors: Shivam Duggal, Xingjian Bai, Zongze Wu, Richard Zhang, Eli Shechtman, Antonio Torralba (possible past Massachusetts Institute Of Technology affiliation), Phillip Isola (possible past University Of California, Berkeley affiliation), William T. Freeman (possible past Massachusetts Institute Of Technology affiliation)
Abstract

Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propose UNITE - an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both image tokenizer and latent generator via weight sharing. Our key insight is...

📄 ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model
🗓️ Published: 3/23/2026
🔗 http://arxiv.org/abs/2603.22281v1
👥 Authors: Haichao Zhang (possible past Baidu (China) affiliation), Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li (possible past Google (United States) affiliation), Yun Fu
Abstract

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning ove...

📄 3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing
🗓️ Published: 3/23/2026
🔗 http://arxiv.org/abs/2603.22279v1
👥 Authors: Haoyu Zhen, Xiaolong Li, Yilin Zhao, Han Zhang (possible past Tsinghua University affiliation), Sifei Liu (possible past Nvidia (United States) affiliation), Kaichun Mo (possible past Stanford University affiliation), Chuang Gan (possible past Tsinghua University affiliation), Subhashree Radhakrishnan
Abstract

Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layout consistency when performing fine-grained visual editing. We introduce a Structured Reasoning framework that performs text-conditioned spatial layout editing via scene-graph reasoning. Given an input scene graph and a natural-language instruction, the model reasons over the graph to generate an updated scene graph that satisfies the text...

📄 On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation
🗓️ Published: 3/23/2026
🔗 http://arxiv.org/abs/2603.22117v1
👥 Authors: Kexin Huang (possible past Stanford University affiliation), Haoming Meng, Junkang Wu, Jinda Lu, Chiyu Ma, Ziqian Chen, Xue Wang, Bolin Ding, Jiancan Wu, Xiang Wang (possible past Tencent (China) affiliation), Xiangnan He (possible past National University Of Singapore affiliation), Guoyin Wang, Jingren Zhou
Abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the **magnitude** of these updates, largely overlooking their **direction**. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR's effects, which can be captured by the signed, token-level log probability differen...
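
The directional quantity the abstract highlights is simple to compute: the sign of the per-token log-probability difference between the RLVR-tuned and base models. A toy sketch with random logits standing in for the two models:

```python
import torch
import torch.nn.functional as F

# Signed, token-level log-probability difference:
# sign(log p_rlvr(x_t | x_<t) - log p_base(x_t | x_<t)) per token.
# Toy logits stand in for two models; this shows the quantity, not the method.
vocab, seq = 50, 6
base_logits = torch.randn(seq, vocab)
rlvr_logits = base_logits + 0.1 * torch.randn(seq, vocab)  # small perturbation
tokens = torch.randint(0, vocab, (seq,))

base_lp = F.log_softmax(base_logits, dim=-1)[torch.arange(seq), tokens]
rlvr_lp = F.log_softmax(rlvr_logits, dim=-1)[torch.arange(seq), tokens]
direction = torch.sign(rlvr_lp - base_lp)  # +1: up-weighted, -1: down-weighted
print(direction)
```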

📄 Symbolic Graph Networks for Robust PDE Discovery from Noisy Sparse Data
🗓️ Published: 3/23/2026
🔗 http://arxiv.org/abs/2603.22380v1
👥 Authors: Xingyu Chen (possible past Tencent (China) affiliation), Junxiu An, Jun Guo, Yuqian Zhou (possible past Google (United States) affiliation)
Abstract

Data-driven discovery of partial differential equations (PDEs) offers a promising paradigm for uncovering governing physical laws from observational data. However, in practical scenarios, measurements are often contaminated by noise and limited by sparse sampling, which poses significant challenges to existing approaches based on numerical differentiation or integral formulations. In this work, we propose a Symbolic Graph Network (SGN) framework for PDE discovery under noisy and sparse condition...
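
For context, the numerical-differentiation baseline the abstract contrasts with looks roughly like SINDy-style sparse regression: differentiate the data numerically, build a library of candidate terms, and regress u_t onto it. The sketch below recovers u_t = -u_x on clean synthetic data; under noise the numerical gradients degrade, which is the failure mode motivating the paper. This is the fragile baseline, not the proposed Symbolic Graph Network.

```python
import numpy as np

# SINDy-style baseline: numerical derivatives + least-squares over a term
# library. Works on clean data, brittle under noise and sparse sampling.
x = np.linspace(0, 2 * np.pi, 128)
t = np.linspace(0, 1, 100)
u = np.sin(x[None, :] - t[:, None])          # synthetic solution of u_t = -u_x

u_t = np.gradient(u, t, axis=0)              # numerical time derivative
u_x = np.gradient(u, x, axis=1)              # numerical space derivative
u_xx = np.gradient(u_x, x, axis=1)

library = np.stack([u, u_x, u_xx], axis=-1).reshape(-1, 3)
target = u_t.reshape(-1)
coeffs, *_ = np.linalg.lstsq(library, target, rcond=None)
print(coeffs)  # ~[0, -1, 0]: recovers u_t = -u_x; added noise breaks this
```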

📄 The Dual Mechanisms of Spatial Reasoning in Vision-Language Models
🗓️ Published: 3/23/2026
🔗 http://arxiv.org/abs/2603.22278v1
👥 Authors: Kelly Cui, Nikhil Prakash, Ayush Raina, David Bau (possible past Google (United States) affiliation), Antonio Torralba (possible past Massachusetts Institute Of Technology affiliation), Tamar Rott Shaham (possible past Technion – Israel Institute Of Technology affiliation)
Abstract

Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent such associations. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding ...

📄 Gumbel Distillation for Parallel Text Generation
🗓️ Published: 3/23/2026
🔗 http://arxiv.org/abs/2603.22216v1
👥 Authors: Chi Zhang (possible past Peking University affiliation), Xixi Hu, Bo Liu (possible past Meta (United States) affiliation), Qiang Liu
Abstract

The slow, sequential nature of autoregressive (AR) language models has driven the adoption of parallel decoding methods. However, these non-AR models often sacrifice generation quality as they struggle to model the complex joint distribution of token sequences. To narrow this performance gap, we introduce Gumbel Distillation, a novel distillation technique that enables parallel decoders to learn this distribution effectively. Our method leverages the Gumbel-Max trick to create a deterministic ma...
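
The Gumbel-Max trick the method builds on is standard: sampling index i with probability softmax(logits)_i is equivalent to argmax(logits + g) with i.i.d. Gumbel(0, 1) noise g, and fixing g turns the draw into a deterministic function of the logits. A minimal sketch of the trick itself, not the distillation pipeline:

```python
import torch

# Gumbel-Max trick: argmax(logits + Gumbel noise) is an exact sample from
# softmax(logits). Fixing the noise makes the sample a deterministic map of
# the logits, which is what a distillation target can exploit.
def gumbel_max_sample(logits: torch.Tensor) -> torch.Tensor:
    u = torch.rand_like(logits).clamp_(1e-9, 1 - 1e-9)
    g = -torch.log(-torch.log(u))          # Gumbel(0, 1) noise
    return torch.argmax(logits + g, dim=-1)

logits = torch.tensor([2.0, 0.5, 0.1])
samples = torch.stack([gumbel_max_sample(logits) for _ in range(10000)])
print(torch.bincount(samples, minlength=3) / 10000.0)  # ~ softmax(logits)
```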

📄 Causal Evidence that Language Models use Confidence to Drive Behavior
🗓️ Published: 3/23/2026
🔗 http://arxiv.org/abs/2603.22161v1
👥 Authors: Dharshan Kumaran (possible past Google (United States) affiliation), Nathaniel Daw, Simon Osindero (possible past Google (United States) affiliation), Petar Velickovic, Viorica Patraucean
Abstract

Metacognition -- the ability to assess one's own cognitive performance -- is documented across species, with internal confidence estimates serving as a key signal for adaptive behavior. While confidence can be extracted from Large Language Model (LLM) outputs, whether models actively use these signals to regulate behavior remains a fundamental question. We investigate this through a four-phase abstention paradigm. Phase 1 established internal confidence estimates in the absence of an abstention o...
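
One simple operationalization of confidence-gated abstention (not the paper's four-phase protocol or its causal interventions) is to threshold the answer's mean token log-probability:

```python
# Confidence-gated abstention sketch: compare the answer's mean token
# log-probability against a threshold. The threshold value is arbitrary and
# illustrative; real protocols calibrate it.
def answer_or_abstain(token_logprobs: list[float], threshold: float = -1.0) -> str:
    confidence = sum(token_logprobs) / len(token_logprobs)
    return "answer" if confidence > threshold else "abstain"

print(answer_or_abstain([-0.2, -0.4, -0.1]))   # high confidence -> "answer"
print(answer_or_abstain([-2.3, -1.9, -2.8]))   # low confidence -> "abstain"
```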

*Notable papers are those with at least two authors from a "big" AI/ML lab.