📄 Notable* Recent AI/ML arXiv Papers

📄 GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
🗓️ Published: 2/25/2026
🔗 http://arxiv.org/abs/2602.22190v1
👥 Authors: Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang (possible past Google (United States) affiliation), Hao Cheng (possible past Tencent (China) affiliation), Huaxiu Yao, Baoling Peng, Huan Zhang, Jianfeng Gao (possible past Microsoft (United States) affiliation), Tong Zhang (possible past Tencent (China) affiliation)
Abstract

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in these pipelines: (i) standard SFT with CoT reasoning often hurts grounding, and (ii) step-wise RLVR-style training faces partial verifiability, where mult...

📄 DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
🗓️ Published: 2/25/2026
🔗 http://arxiv.org/abs/2602.21864v1
👥 Authors: Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen (possible past Tencent (China) affiliation), Hua Liu, James Kwok, Yu Zhang (possible past Google (United States) affiliation)
Abstract

Vision-Language Models (VLMs) have emerged as versatile solutions for zero-shot question answering (QA) across various domains. However, enabling VLMs to effectively comprehend structured graphs and perform accurate, efficient QA remains challenging. Existing approaches typically rely on one single graph topology representation (GTR), such as fixed-style visual images or unified text descriptions. This "one-size-fits-all" strategy often neglects model-specific and task-specific preferences, re...

📄 ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices
🗓️ Published: 2/25/2026
🔗 http://arxiv.org/abs/2602.21858v1
👥 Authors: Dezhi Kong, Zhengzhao Feng, Qiliang Liang, Hao Wang (possible past Tsinghua University affiliation), Haofei Sun, Changpeng Yang, Yang Li (possible past Google (United States) affiliation), Peng Zhou (possible past Tencent (China) affiliation), Shuai Nie, Hongzhen Wang, Linfeng Zhou, Hao Jia, Jiaming Xu, Runyu Shi, Ying Huang
Abstract

Multimodal large language models (MLLMs) have made significant progress in mobile agent development, yet their capabilities are predominantly confined to a reactive paradigm, where they merely execute explicit user commands. The emerging paradigm of proactive intelligence, where agents autonomously anticipate needs and initiate actions, represents the next frontier for mobile agents. However, its development is critically bottlenecked by the lack of benchmarks that can address real-world complex...

📄 Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models
🗓️ Published: 2/25/2026
🔗 http://arxiv.org/abs/2602.21779v1
👥 Authors: Zheyuan Gu, Qingsong Zhao, Yusong Wang, Zhaohong Huang, Xinqi Li, Cheng Yuan, Jiaowei Shao, Chi Zhang (possible past Peking University affiliation), Xuelong Li (possible past Tencent (China) affiliation)
Abstract

Current Vision-Language Models (VLMs) for deepfake detection excel at identifying spatial artifacts but overlook a critical dimension: temporal inconsistencies in video forgeries. Adapting VLMs to reason about these dynamic cues remains a distinct challenge. To bridge this gap, we propose Forensic Answer-Questioning (FAQ), a large-scale benchmark that formulates temporal deepfake analysis as a multiple-choice task. FAQ introduces a three-level hierarchy to progressively evaluate and equip VLMs w...

📄 From Basis to Basis: Gaussian Particle Representation for Interpretable PDE Operators
🗓️ Published: 2/25/2026
🔗 http://arxiv.org/abs/2602.21551v1
👥 Authors: Zhihao Li, Yu Feng (possible past University Of California, Berkeley affiliation), Zhilu Lai, Wei Wang (possible past University Of Oxford affiliation)
Abstract

Learning PDE dynamics for fluids increasingly relies on neural operators and Transformer-based models, yet these approaches often lack interpretability and struggle with localized, high-frequency structures while incurring quadratic cost in spatial samples. We propose representing fields with a Gaussian basis, where learned atoms carry explicit geometry (centers, anisotropic scales, weights) and form a compact, mesh-agnostic, directly visualizable state. Building on this representation, we intro...
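The abstract's core idea — a field represented by Gaussian atoms with explicit centers, anisotropic scales, and weights — can be sketched as follows. This is an illustrative toy evaluation routine, not the paper's implementation; all function and parameter names are hypothetical.

```python
import numpy as np

def eval_gaussian_field(x, centers, inv_covs, weights):
    """Evaluate a field as a weighted sum of anisotropic Gaussian atoms.

    x        : (N, d) query points
    centers  : (K, d) atom centers
    inv_covs : (K, d, d) inverse covariance (anisotropic scale) per atom
    weights  : (K,) atom weights
    """
    diff = x[:, None, :] - centers[None, :, :]            # (N, K, d)
    # Mahalanobis distance of each query point to each atom
    maha = np.einsum('nkd,kde,nke->nk', diff, inv_covs, diff)
    return np.exp(-0.5 * maha) @ weights                  # (N,) field values
```

Because each atom carries explicit geometry, such a state is mesh-agnostic and can be plotted directly, which is the interpretability argument the abstract makes.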

📄 ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning
🗓️ Published: 2/25/2026
🔗 http://arxiv.org/abs/2602.21534v1
👥 Authors: Xiaoxuan Wang, Han Zhang (possible past Tsinghua University affiliation), Haixin Wang, Yidan Shi, Ruoyan Li, Kaiqiao Han, Chenyi Tong, Haoran Deng, Renliang Sun, Alexander Taylor, Yanqiao Zhu, Jason Cong, Yizhou Sun, Wei Wang (possible past University Of Oxford affiliation)
Abstract

Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable, often leading to training collapse. This instability limits scalability to larger environments and longer interaction horizons, and constrains systematic exploration of algorithmic design choices. In this paper, we first propose ARLArena, a stable training recipe and systematic...

📄 Test-Time Training with KV Binding Is Secretly Linear Attention
🗓️ Published: 2/24/2026
🔗 http://arxiv.org/abs/2602.21204v1
👥 Authors: Junchen Liu, Sven Elflein, Or Litany (possible past Stanford University affiliation), Zan Gojcic, Ruilong Li (possible past Tsinghua University affiliation)
Abstract

Test-time training (TTT) with KV binding as a sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a form of learned linear attention operator. Beyond explaining previously puzzling model...
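The claimed equivalence can be illustrated in a deliberately simple setting (orthonormal keys, learning rate 1, zero-initialized fast weights): one gradient step per token on the key-value binding loss produces exactly the state a causal linear-attention layer accumulates. This is a toy demonstration of the general idea, not the paper's derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 4
K = np.eye(d)                    # orthonormal keys (toy setting)
V = rng.normal(size=(T, d))      # values
Q = rng.normal(size=(T, d))      # queries

# TTT view: one gradient step per token on 0.5 * ||W k - v||^2, lr = 1
W = np.zeros((d, d))
ttt_out = []
for t in range(T):
    grad = (W @ K[t] - V[t])[:, None] @ K[t][None, :]
    W = W - grad
    ttt_out.append(W @ Q[t])

# Linear-attention view: causal accumulation of v k^T applied to queries
S = np.zeros((d, d))
lin_out = []
for t in range(T):
    S = S + V[t][:, None] @ K[t][None, :]
    lin_out.append(S @ Q[t])

assert np.allclose(ttt_out, lin_out)
```

With non-orthogonal keys or smaller learning rates the two states differ, which is presumably where the "learned" operator in the paper's formulation comes in.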

📄 Aletheia tackles FirstProof autonomously
🗓️ Published: 2/24/2026
🔗 http://arxiv.org/abs/2602.21201v1
👥 Authors: Tony Feng, Junehyuk Jung, Sang-Hyun Kim, Carlo Pagano, Sergei Gukov, Chiang-Chiang Tsai, David Woodruff, Adel Javanmard, Aryan Mokhtari, Dawsen Hwang, Yuri Chervonyi, Jonathan N. Lee, Garrett Bingham, Trieu H. Trinh (possible past Google (United States) affiliation), Vahab Mirrokni (possible past Google (United States) affiliation), Quoc V. Le (possible past Stanford University affiliation), Thang Luong (possible past Stanford University affiliation)
Abstract

We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge. Within the allowed timeframe of the challenge, Aletheia autonomously solved 6 problems (2, 5, 7, 8, 9, 10) out of 10 according to majority expert assessments; we note that experts were not unanimous on Problem 8 (only). For full transparency, we explain our interpretation of FirstProof and disclose details about our experiments as well as...

📄 Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
🗓️ Published: 2/24/2026
🔗 http://arxiv.org/abs/2602.21198v1
👥 Authors: Yining Hong, Huang Huang, Manling Li, Li Fei-Fei (possible past Stanford University affiliation), Jiajun Wu (possible past Massachusetts Institute Of Technology affiliation), Yejin Choi (possible past Allen Institute For Artificial Intelligence affiliation)
Abstract

Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: "reflection-in-action", where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflec...

📄 Cooperative-Competitive Team Play of Real-World Craft Robots
🗓️ Published: 2/24/2026
🔗 http://arxiv.org/abs/2602.21119v1
👥 Authors: Rui Zhao, Xihui Li, Yizheng Zhang, Yuzhen Liu, Zhong Zhang, Yufeng Zhang, Cheng Zhou, Zhengyou Zhang (possible past Tencent (China) affiliation), Lei Han (possible past Tencent (China) affiliation)
Abstract

Multi-agent deep Reinforcement Learning (RL) has made significant progress in developing intelligent game-playing agents in recent years. However, the efficient training of collective robots using multi-agent RL and the transfer of learned policies to real-world applications remain open research questions. In this work, we first develop a comprehensive robotic system, including simulation, distributed learning framework, and physical robot components. We then propose and evaluate reinforcement l...

📄 LogicGraph : Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification
🗓️ Published: 2/24/2026
🔗 http://arxiv.org/abs/2602.21044v1
👥 Authors: Yanrui Wu, Lingling Zhang (possible past Google (United States) affiliation), Xinyu Zhang (possible past Baidu (China) affiliation), Jiayu Chang, Pengyu Li, Xu Jiang, Jingtao Hu, Jun Liu (possible past Tencent (China) affiliation)
Abstract

Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof. However, many real-world reasoning problems admit multiple valid derivations, requiring models to explore diverse logical paths rather than committing to one route. To address this limitation, we introduce LogicGraph, the first benchmark aimed at systematically evaluating multi-path logical reasoning, constructed via a neuro-symbolic framework ...
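Evaluating multi-path reasoning presumably requires knowing every valid derivation, which for an acyclic inference graph reduces to path enumeration. A minimal sketch (the graph encoding and names are hypothetical, not the benchmark's format):

```python
def all_paths(graph, start, goal):
    """Enumerate every derivation path from `start` to `goal` in a DAG.

    graph: dict mapping a statement to the statements derivable from it.
    """
    if start == goal:
        return [[goal]]
    paths = []
    for nxt in graph.get(start, []):
        for rest in all_paths(graph, nxt, goal):
            paths.append([start] + rest)
    return paths
```

A model that always commits to one route would recover only one of the enumerated paths, which is the failure mode the benchmark targets.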

📄 CrystaL: Spontaneous Emergence of Visual Latents in MLLMs
🗓️ Published: 2/24/2026
🔗 http://arxiv.org/abs/2602.20980v1
👥 Authors: Yang Zhang (possible past Tsinghua University affiliation), Danyang Li, Yuxuan Li, Xin Zhang (possible past Google (United States) affiliation), Tianyu Xie, Mingming Cheng, Xiang Li
Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable performance by integrating powerful language backbones with large-scale visual encoders. Among these, latent Chain-of-Thought (CoT) methods enable implicit reasoning in continuous hidden states, facilitating seamless vision-language integration and faster inference. However, existing heuristically predefined supervision signals in latent CoT provide limited guidance for preserving critical visual information in intermediate latent...

📄 AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs
🗓️ Published: 2/24/2026
🔗 http://arxiv.org/abs/2602.20720v1
👥 Authors: Che Wang, Jiaming Zhang, Ziqi Zhang (possible past Tencent (China) affiliation), Zijie Wang, Yinghui Wang, Jianbo Gao, Tao Wei (possible past Baidu (China) affiliation), Zhong Chen, Wei Yang Bryan Lim
Abstract

The integration of external data services (e.g., Model Context Protocol, MCP) has made large language model-based agents increasingly powerful for complex task execution. However, this advancement introduces critical security vulnerabilities, particularly indirect prompt injection (IPI) attacks. Existing attack methods are limited by their reliance on static patterns and evaluation on simple language models, failing to address the fast-evolving nature of modern AI agents. We introduce AdapTools,...

📄 CAMEL: Confidence-Gated Reflection for Reward Modeling
🗓️ Published: 2/24/2026
🔗 http://arxiv.org/abs/2602.20670v1
👥 Authors: Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Kun Xu (possible past Tsinghua University affiliation), Yang You (possible past University Of California, Berkeley affiliation)
Abstract

Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead. We observe that the log-probability margin between verdict tokens strongly correlates with prediction correctness, providing a reliable proxy for instance di...
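The gating mechanism the abstract describes — use the verdict-token log-probability margin as a confidence proxy, and invoke the expensive generative judge only when the margin is low — can be sketched as follows. This is an assumed reading of the method; the threshold, token names, and fallback are hypothetical.

```python
def verdict_margin(logprobs):
    """Log-probability gap between the two verdict tokens."""
    return abs(logprobs["A"] - logprobs["B"])

def judge(logprobs, tau=1.0, reflect=None):
    """Confidence-gated judging (toy sketch).

    High margin -> trust the cheap scalar verdict.
    Low margin  -> fall back to a slower reflective/generative judge.
    """
    if verdict_margin(logprobs) >= tau:
        return "A" if logprobs["A"] > logprobs["B"] else "B"
    return reflect(logprobs) if reflect else "uncertain"
```

The appeal of such a gate is that the margin is free to compute from the same forward pass that produces the verdict.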

📄 Personal Information Parroting in Language Models
🗓️ Published: 2/24/2026
🔗 http://arxiv.org/abs/2602.20580v1
👥 Authors: Nishant Subramani (possible past Allen Institute For Artificial Intelligence affiliation), Kshitish Ghate, Mona Diab (possible past Carnegie Mellon University affiliation)
Abstract

Modern language models (LMs) are trained on large scrapes of the Web, containing millions of personal information (PI) instances, many of which LMs memorize, increasing privacy risks. In this work, we develop the regexes and rules (R&R) detector suite to detect email addresses, phone numbers, and IP addresses, which outperforms the best regex-based PI detectors. On a manually curated set of 483 instances of PI, we measure memorization: finding that 13.6% are parroted verbatim by the Pythia-6.9b m...
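The detection side of such a pipeline can be sketched with deliberately simplified regexes for the three PI categories the abstract names. These patterns are illustrative only; the paper's R&R suite is more elaborate (and, per the abstract, combines regexes with rules).

```python
import re

# Simplified patterns -- not the paper's R&R detectors.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b(?:\+?\d{1,3}[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
    "ipv4":  re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def detect_pi(text):
    """Map each PI category to the matches found in `text`."""
    return {kind: pat.findall(text) for kind, pat in PATTERNS.items()}
```

Running such detectors over model generations and diffing against training data is one way to operationalize the verbatim-parroting measurement the abstract reports.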

📄 From Logs to Language: Learning Optimal Verbalization for LLM-Based Recommendation in Production
🗓️ Published: 2/24/2026
🔗 http://arxiv.org/abs/2602.20558v1
👥 Authors: Yucheng Shi, Ying Li (possible past Meta (United States) affiliation), Yu Wang (possible past Tsinghua University affiliation), Yesu Feng, Arjun Rao, Rein Houthooft (possible past University Of California, Berkeley affiliation), Shradha Sehgal, Jin Wang, Hao Zhen, Ninghao Liu, Linas Baltrunas
Abstract

Large language models (LLMs) are promising backbones for generative recommender systems, yet a key challenge remains underexplored: verbalization, i.e., converting structured user interaction logs into effective natural language inputs. Existing methods rely on rigid templates that simply concatenate fields, yielding suboptimal representations for recommendation. We propose a data-centric framework that learns verbalization for LLM-based recommendation. Using reinforcement learning, a verbalizat...

📄 PreScience: A Benchmark for Forecasting Scientific Contributions
🗓️ Published: 2/24/2026
🔗 http://arxiv.org/abs/2602.20459v1
👥 Authors: Anirudh Ajith, Amanpreet Singh (possible past Meta (United States) affiliation), Jay Deyoung, Nadav Kunievsky, Austin C. Kozlowski, Oyvind Tafjord, James Evans, Daniel S. Weld (possible past University Of Washington affiliation), Tom Hope (possible past Tencent (China) affiliation), Doug Downey (possible past Allen Institute For Artificial Intelligence affiliation)
Abstract

Can AI systems trained on the scientific record up to a fixed point in time forecast the scientific advances that follow? Such a capability could help researchers identify collaborators and impactful research directions, and anticipate which problems and methods will become central next. We introduce PreScience -- a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generatio...

📄 RABot: Reinforcement-Guided Graph Augmentation for Imbalanced and Noisy Social Bot Detection
🗓️ Published: 2/25/2026
🔗 http://arxiv.org/abs/2602.21749v1
👥 Authors: Longlong Zhang, Xi Wang (possible past Tsinghua University affiliation), Haotong Du, Yangyi Xu, Zhuo Liu, Yang Liu (possible past Tsinghua University affiliation)
Abstract

Social bot detection is pivotal for safeguarding the integrity of online information ecosystems. Although recent graph neural network (GNN) solutions achieve strong results, they remain hindered by two practical challenges: (i) severe class imbalance arising from the high cost of generating bots, and (ii) topological noise introduced by bots that skillfully mimic human behavior and forge deceptive links. We propose the Reinforcement-guided graph Augmentation social Bot detector (RABot), a multi-...

📄 TiMi: Empower Time Series Transformers with Multimodal Mixture of Experts
🗓️ Published: 2/25/2026
🔗 http://arxiv.org/abs/2602.21693v1
👥 Authors: Jiafeng Lin, Yuxuan Wang (possible past Google (United States) affiliation), Huakun Luo, Zhongyi Pei, Jianmin Wang (possible past Tsinghua University affiliation)
Abstract

Multimodal time series forecasting has garnered significant attention for its potential to provide more accurate predictions than traditional single-modality models by leveraging rich information inherent in other modalities. However, due to fundamental challenges in modality alignment, existing methods often struggle to effectively incorporate multimodal data into predictions, particularly textual information that has a causal influence on time series fluctuations, such as emergency reports and...

📄 Trie-Aware Transformers for Generative Recommendation
🗓️ Published: 2/25/2026
🔗 http://arxiv.org/abs/2602.21677v1
👥 Authors: Zhenxiang Xu, Jiawei Chen (possible past Tencent (China) affiliation), Sirui Chen, Yong He, Jieyu Yang, Chuan Yuan, Ke Ding, Can Wang (possible past Tsinghua University affiliation)
Abstract

Generative recommendation (GR) aligns with advances in generative AI by casting next-item prediction as token-level generation rather than score-based ranking. Most GR methods adopt a two-stage pipeline: (i) "item tokenization", which maps each item to a sequence of discrete, hierarchically organized tokens; and (ii) "autoregressive generation", which predicts the next item's tokens conditioned on the tokens of the user's interaction history. Although hierarchical tokenization induces ...
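Hierarchical item tokenization naturally induces a trie over valid token sequences, which is presumably the structure the title's "trie-aware" refers to. A minimal sketch of that structure and the decoding constraint it enables (names and encoding are hypothetical, not the paper's implementation):

```python
def build_trie(item_token_seqs):
    """Trie over tokenized items: each node maps token -> child node."""
    root = {}
    for seq in item_token_seqs:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root

def valid_next_tokens(trie, prefix):
    """Tokens that extend `prefix` toward at least one real item."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return set()
        node = node[tok]
    return set(node.keys())
```

During autoregressive generation, masking the model's logits to `valid_next_tokens` guarantees every decoded sequence corresponds to a real catalog item.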

📄 WaterVIB: Learning Minimal Sufficient Watermark Representations via Variational Information Bottleneck
🗓️ Published: 2/25/2026
🔗 http://arxiv.org/abs/2602.21508v1
👥 Authors: Haoyuan He, Yu Zheng, Jie Zhou (possible past Tsinghua University affiliation), Jiwen Lu (possible past Tsinghua University affiliation)
Abstract

Robust watermarking is critical for intellectual property protection, whereas existing methods face a severe vulnerability against regeneration-based AIGC attacks. We identify that existing methods fail because they entangle the watermark with high-frequency cover texture, which is susceptible to being rewritten during generative purification. To address this, we propose WaterVIB, a theoretically grounded framework that reformulates the encoder as an information sieve via the Variational Informa...

📄 SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
🗓️ Published: 2/24/2026
🔗 http://arxiv.org/abs/2602.21158v2
👥 Authors: Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang (possible past Baidu (China) affiliation), Kenton Murray, Hua Wei (possible past Google (United States) affiliation)
Abstract

Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning. Although recent work explores various forms of reward shaping and step-level credit assignment, a key signal remains largely overlooked: the intrinsic uncertainty of LLMs. Uncertainty reflects model confidence, reveals where exploration is needed, and offers valuable learning cues even in failed trajectories. We introduce SELAUR: Self Evolv...

📄 LUMEN: Longitudinal Multi-Modal Radiology Model for Prognosis and Diagnosis
🗓️ Published: 2/24/2026
🔗 http://arxiv.org/abs/2602.21142v1
👥 Authors: Zhifan Jiang, Dong Yang (possible past Nvidia (United States) affiliation), Vishwesh Nath (possible past Nvidia (United States) affiliation), Abhijeet Parida, Nishad P. Kulkarni, Ziyue Xu (possible past Nvidia (United States) affiliation), Daguang Xu (possible past Nvidia (United States) affiliation), Syed Muhammad Anwar, Holger R. Roth (possible past Nvidia (United States) affiliation), Marius George Linguraru
Abstract

Large vision-language models (VLMs) have evolved from general-purpose applications to specialized use cases such as in the clinical domain, demonstrating potential for decision support in radiology. One promising application is assisting radiologists in decision-making by the analysis of radiology imaging data such as chest X-rays (CXR) via a visual and natural language question-answering (VQA) interface. When longitudinal imaging is available, radiologists analyze temporal changes, which are es...

📄 SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models
🗓️ Published: 2/24/2026
🔗 http://arxiv.org/abs/2602.20901v1
👥 Authors: Yuechen Xie, Xiaoyan Zhang, Yicheng Shan, Hao Zhu (possible past Tsinghua University affiliation), Rui Tang, Rong Wei, Mingli Song, Yuanyu Wan, Jie Song (possible past Eth Zurich affiliation)
Abstract

Vision-Language Models (VLMs) have been increasingly applied in real-world scenarios due to their outstanding understanding and reasoning capabilities. Although VLMs have already demonstrated impressive capabilities in common visual question answering and logical reasoning, they still lack the ability to make reasonable decisions in complex real-world environments. We define this ability as spatial logical reasoning, which not only requires understanding the spatial relationships among objects i...

📄 Rethink Efficiency Side of Neural Combinatorial Solver: An Offline and Self-Play Paradigm
🗓️ Published: 2/24/2026
🔗 http://arxiv.org/abs/2602.20730v1
👥 Authors: Zhenxing Xu, Zeyuan Ma, Weidong Bao (possible past National University Of Defense Technology affiliation), Hui Yan, Yan Zheng, Ji Wang (possible past Tencent (China) affiliation)
Abstract

We propose ECO, a versatile learning paradigm that enables efficient offline self-play for Neural Combinatorial Optimization (NCO). ECO addresses key limitations in the field through: 1) Paradigm Shift: Moving beyond inefficient online paradigms, we introduce a two-phase offline paradigm consisting of supervised warm-up and iterative Direct Preference Optimization (DPO); 2) Architecture Shift: We deliberately design a Mamba-based architecture to further enhance the efficiency in the offline para...
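The iterative DPO phase presumably optimizes the standard DPO objective over preferred/rejected solution pairs. A minimal sketch of that per-pair loss (parameter names are generic; the paper may use a modified variant):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_*     : policy log-likelihoods of the preferred (w) / rejected (l) solution
    ref_logp_* : same quantities under the frozen reference policy
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

At zero margin the loss is log 2, and it falls as the policy shifts likelihood toward the preferred solution relative to the reference — the mechanism by which self-play preference pairs can drive offline improvement without online environment rollouts.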

📄 Fuz-RL: A Fuzzy-Guided Robust Framework for Safe Reinforcement Learning under Uncertainty
🗓️ Published: 2/24/2026
🔗 http://arxiv.org/abs/2602.20729v1
👥 Authors: Xu Wan, Chao Yang, Cheng Yang (possible past Tsinghua University affiliation), Jie Song (possible past Eth Zurich affiliation), Mingyang Sun
Abstract

Safe Reinforcement Learning (RL) is crucial for achieving high performance while ensuring safety in real-world applications. However, the complex interplay of multiple uncertainty sources in real environments poses significant challenges for interpretable risk assessment and robust decision-making. To address these challenges, we propose Fuz-RL, a fuzzy measure-guided robust framework for safe RL. Specifically, our framework develops a novel fuzzy Bellman operator for estimating robust value fun...

*Notable papers are those with at least two authors from a "big" AI/ML lab.