πŸ“„ Notable* Recent AI/ML arXiv Papers

πŸ“„ RE-TRAC: REcursive TRAjectory Compression for Deep Search Agents
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.02486v1
πŸ‘₯ Authors: Jialiang Zhu, Gongrui Zhang, Xiaolong Ma, Lin Xu, Miaosen Zhang, Ruiqi Yang, Song Wang, Kai Qiu, Zhirong Wu, Qi Dai, Ruichun Ma, Bei Liu, Yifan Yang (possible past Tencent (China) affiliation), Chong Luo (possible past Google (United States) affiliation), Zhengyuan Yang, Linjie Li, Lijuan Wang, Weizhu Chen, Xin Geng, Baining Guo
Abstract

LLM-based deep research agents are largely built on the ReAct framework, whose linear design makes it difficult to revisit earlier states, branch into alternative search directions, or maintain global awareness under long contexts, often leading to local optima, redundant exploration, and inefficient search. We propose Re-TRAC, an agentic framework that performs cross-trajectory exploration by generating a structured state representation after each trajectory to summarize evidence, uncertainties,...
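
A minimal sketch of the cross-trajectory loop the abstract describes, where a structured state object seeds each fresh rollout instead of the full transcript; the field names and the `run_trajectory` / `compress` / `answer` interfaces are guesses based on the abstract, not the paper's API:

```python
from dataclasses import dataclass, field

@dataclass
class TrajectoryState:
    evidence: list[str] = field(default_factory=list)       # verified findings so far
    uncertainties: list[str] = field(default_factory=list)  # open questions
    next_directions: list[str] = field(default_factory=list)

def deep_search(agent, question, max_trajectories=3):
    state = TrajectoryState()
    for _ in range(max_trajectories):
        transcript = agent.run_trajectory(question, state)  # fresh ReAct-style rollout
        state = agent.compress(transcript)                  # summarize -> TrajectoryState
        if not state.uncertainties:                         # confident: stop early
            break
    return agent.answer(question, state)
```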

πŸ“„ Flow Policy Gradients for Robot Control
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.02481v1
πŸ‘₯ Authors: Brent Yi, Hongsuk Choi, Himanshu Gaurav Singh, Xiaoyu Huang, Takara E. Truong, Carmelo Sferrazza, Yi Ma, Rocky Duan, Pieter Abbeel (possible past University Of California, Berkeley affiliation), Guanya Shi, Karen Liu, Angjoo Kanazawa (possible past University Of California, Berkeley affiliation)
Abstract

Likelihood-based policy gradient methods are the dominant approach for training robot control policies from rewards. These methods rely on differentiable action likelihoods, which constrain policy outputs to simple distributions like Gaussians. In this work, we show how flow matching policy gradients -- a recent framework that bypasses likelihood computation -- can be made effective for training and fine-tuning more expressive policies in challenging robot control settings. We introduce an impro...
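
As a rough illustration of the general idea (a sketch, not the paper's exact estimator), a conditional flow-matching regression loss can be reweighted by per-rollout advantages so that high-reward actions are pulled toward the flow's target velocity field; `v_theta` is an assumed velocity network:

```python
import torch

def flow_policy_loss(v_theta, obs, actions, advantages):
    """Advantage-weighted conditional flow matching over a linear noise->action path."""
    noise = torch.randn_like(actions)                     # a_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1)                   # random interpolation times
    a_t = (1 - t) * noise + t * actions                   # point on the linear path
    target_v = actions - noise                            # velocity of the linear path
    pred_v = v_theta(a_t, t, obs)
    per_sample = ((pred_v - target_v) ** 2).mean(dim=-1)  # CFM regression error
    # Positive-advantage rollouts are imitated; negative ones are down-weighted.
    weights = torch.clamp(advantages, min=0.0)
    return (weights * per_sample).mean()
```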

πŸ“„ Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.02393v1
πŸ‘₯ Authors: Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang (possible past Tencent (China) affiliation), Yong Zhang (possible past Tsinghua University affiliation), Zhuoliang Kang, Xunliang Cai, Xiaoming Wei, Chunle Guo, Chongyi Li, Ming-Ming Cheng
Abstract

We propose Infinite-World, a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments. While existing world models can be efficiently optimized on synthetic data with perfect ground-truth, they lack an effective training paradigm for real-world videos due to noisy pose estimations and the scarcity of viewpoint revisits. To bridge this gap, we first introduce a Hierarchical Pose-free Memory Compressor (HPMC) that recursively...

πŸ“„ Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.02343v1
πŸ‘₯ Authors: Ziwen Xu, Chenyan Wu, Hengyu Sun, Haiwen Hong, Mengru Wang, Yunzhi Yao, Longtao Huang, Hui Xue, Shumin Deng (possible past Alibaba Group (China) affiliation), Zhixuan Chu, Huajun Chen (possible past Alibaba Group (China) affiliation), Ningyu Zhang (possible past Tencent (China) affiliation)
Abstract

Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that sep...
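
A tiny sanity check of the "interventions as dynamic weight updates" framing (an illustration of the simplest case, not the paper's full framework): adding a steering vector to a linear layer's output is exactly a bias update.

```python
import torch
import torch.nn as nn

layer = nn.Linear(16, 16)
x = torch.randn(4, 16)
v = torch.randn(16)             # steering direction, e.g. from activation differences

steered = layer(x) + v          # activation-level intervention

patched = nn.Linear(16, 16)
patched.load_state_dict(layer.state_dict())
with torch.no_grad():
    patched.bias += v           # the equivalent "dynamic weight update"

assert torch.allclose(steered, patched(x), atol=1e-6)
```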

πŸ“„ Advancing General-Purpose Reasoning Models with Modular Gradient Surgery
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.02301v1
πŸ‘₯ Authors: Min Cai, Yu Liang, Longzheng Wang, Yan Wang (possible past Tencent (China) affiliation), Yueyang Zhang (possible past Baidu (China) affiliation), Long Xia, Zhiyuan Sun, Xi Ye, Daiting Shi
Abstract

Reinforcement learning (RL) has played a central role in recent advances in large reasoning models (LRMs), yielding strong gains in verifiable and open-ended reasoning. However, training a single general-purpose LRM across diverse domains remains challenging due to pronounced domain heterogeneity. Through a systematic study of two widely used strategies, Sequential RL and Mixed RL, we find that both incur substantial cross-domain interference at the behavioral and gradient levels, resulting in l...
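
For context, the classic form of gradient surgery (PCGrad; the paper's modular variant presumably refines this at the parameter-module level) projects away the conflicting component between two domains' gradients before combining them:

```python
import torch

def project_conflicting(g_a: torch.Tensor, g_b: torch.Tensor) -> torch.Tensor:
    """If gradients conflict (negative inner product), remove from g_a its component along g_b."""
    dot = torch.dot(g_a, g_b)
    if dot < 0:
        g_a = g_a - dot / (g_b.norm() ** 2 + 1e-12) * g_b
    return g_a

g_math = torch.randn(1000)   # flattened gradient from a math-reasoning batch
g_code = torch.randn(1000)   # flattened gradient from a coding batch
update = project_conflicting(g_math, g_code) + project_conflicting(g_code, g_math)
```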

πŸ“„ TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.02196v1
πŸ‘₯ Authors: Hang Yan, Xinyu Che, Fangzhi Xu, Qiushi Sun, Zichen Ding, Kanzhi Cheng, Jian Zhang (possible past Tencent (China) affiliation), Tao Qin, Jun Liu (possible past Tencent (China) affiliation), Qika Lin
Abstract

Recent advances in autonomous LLM agents demonstrate their ability to improve performance through iterative interaction with the environment. We define this paradigm as Test-Time Improvement (TTI). However, the mechanisms underlying how and why TTI succeeds or fails remain poorly understood, and existing evaluation metrics fail to capture their task optimization efficiency, behavior adaptation after erroneous actions, and the specific utility of working memory for task completion. To address these gaps...

πŸ“„ State Rank Dynamics in Linear Attention LLMs
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.02195v1
πŸ‘₯ Authors: Ao Sun, Hongtao Zhang, Heng Zhou, Yixuan Ma, Yiran Qin, Tongrui Su, Yan Liu (possible past Tencent (China) affiliation), Zhanyu Ma, Jun Xu (possible past Google (United States) affiliation), Jiuchong Gao, Jinghua Hao, Renqing He
Abstract

Linear Attention Large Language Models (LLMs) offer a compelling recurrent formulation that compresses context into a fixed-size state matrix, enabling constant-time inference. However, the internal dynamics of this compressed state remain largely opaque. In this work, we present a comprehensive study on the runtime state dynamics of state-of-the-art Linear Attention models. We uncover a fundamental phenomenon termed State Rank Stratification, characterized by a distinct spectral bifurcation amo...
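
To make the object of study concrete: the recurrent state of a simple (unnormalized) linear attention layer is the accumulated outer product of keys and values, and its singular spectrum can be inspected directly. A toy sketch with random tensors (illustrative shapes, not the paper's analysis pipeline):

```python
import torch

T, d_k, d_v = 512, 64, 64
keys, values = torch.randn(T, d_k), torch.randn(T, d_v)

state = keys.T @ values              # recurrent state S = sum_t k_t v_t^T  (d_k x d_v)
sv = torch.linalg.svdvals(state)     # singular value spectrum

p = sv / sv.sum()                    # normalized spectrum
effective_rank = torch.exp(-(p * p.log()).sum())   # entropy-based effective rank
print(effective_rank)
```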

πŸ“„ WADEPre: A Wavelet-based Decomposition Model for Extreme Precipitation Nowcasting with Multi-Scale Learning
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.02096v1
πŸ‘₯ Authors: Baitian Liu, Haiping Zhang, Huiling Yuan, Dongjing Wang, Ying Li (possible past Meta (United States) affiliation), Feng Chen, Hao Wu (possible past Tencent (China) affiliation)
Abstract

The heavy-tailed nature of precipitation intensity impedes precise precipitation nowcasting. Standard models that optimize pixel-wise losses are prone to regression-to-the-mean bias, which blurs extreme values. Existing Fourier-based methods also lack the spatial localization needed to resolve transient convective cells. To overcome these intrinsic limitations, we propose WADEPre, a wavelet-based decomposition model for extreme precipitation that transitions the modeling into the wavelet domain....
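
A minimal sketch of the underlying move into the wavelet domain, using PyWavelets' standard 2D multilevel transform (WADEPre's actual decomposition and learning scheme are not reproduced here):

```python
import numpy as np
import pywt

radar_frame = np.random.rand(128, 128)            # stand-in for a precipitation map
coeffs = pywt.wavedec2(radar_frame, wavelet="db2", level=3)

approx = coeffs[0]     # coarse approximation (large-scale field)
details = coeffs[1:]   # per-level (horizontal, vertical, diagonal) detail bands
# Extreme, localized convective cells concentrate in the fine detail bands,
# so a model can weight or supervise those bands separately from the smooth field.
```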

πŸ“„ SurfSplat: Conquering Feedforward 2D Gaussian Splatting with Surface Continuity Priors
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.02000v1
πŸ‘₯ Authors: Bing He, Jingnan Gao, Yunuo Chen, Ning Cao (possible past Google (United States) affiliation), Gang Chen, Zhengxue Cheng, Li Song, Wenjun Zhang (possible past Shanghai Jiao Tong University affiliation)
Abstract

Reconstructing 3D scenes from sparse images remains a challenging task due to the difficulty of recovering accurate geometry and texture without optimization. Recent approaches leverage generalizable models to generate 3D scenes using 3D Gaussian Splatting (3DGS) primitives. However, they often fail to produce continuous surfaces and instead yield discrete, color-biased point clouds that appear plausible at normal resolution but reveal severe artifacts under close-up views. To address this issue,...

πŸ“„ Evolving from Tool User to Creator via Training-Free Experience Reuse in Multimodal Reasoning
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.01983v1
πŸ‘₯ Authors: Xintian Shen, Jiawei Chen (possible past Tencent (China) affiliation), Lihao Zheng, Hao Ma (possible past Meta (United States) affiliation), Tao Wei (possible past Baidu (China) affiliation), Kun Zhan
Abstract

Existing Tool-Integrated Reasoning (TIR) models have effectively extended the question-answering capabilities of LLMs by incorporating external tools. However, real-world scenarios present numerous open-ended problems where fixed tools often fail to meet task requirements. Furthermore, the lack of self-optimization mechanisms means that erroneous tool outputs can mislead the LLM's responses. Additionally, the construction of existing tools entails significant manual effort, which consequently co...

πŸ“„ Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.01970v1
πŸ‘₯ Authors: Yun Qu, Qi Wang (possible past Tsinghua University affiliation), Yixiu Mao, Heming Zou, Yuhang Jiang, Weijie Liu (possible past Tencent (China) affiliation), Clive Bai, Kai Yang, Yangkun Chen, Saiyong Yang, Xiangyang Ji (possible past Tsinghua University affiliation)
Abstract

Reinforcement learning enhances the reasoning capabilities of large language models but often involves high computational costs due to rollout-intensive optimization. Online prompt selection presents a plausible solution by prioritizing informative prompts to improve training efficiency. However, current methods either depend on costly, exact evaluations or construct prompt-specific predictive models lacking generalization across prompts. This study introduces Generalizable Predictive Prompt Sel...
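
One hedged sketch of the prompt-selection idea: a small predictor estimates each prompt's pass rate, and prompts predicted near 0.5 (where group-based RL gets the most learning signal) are scheduled first. The feature function below is hypothetical; a real system might embed the prompt.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def featurize(prompt: str) -> np.ndarray:
    # Hypothetical cheap features standing in for a learned prompt representation.
    return np.array([len(prompt), prompt.count(" "), prompt.count("?")], dtype=float)

history_prompts = ["2+2=?", "Prove Fermat's last theorem.", "Sort [3,1,2]."]
history_solved = [1, 0, 1]                          # observed rollout outcomes

clf = LogisticRegression().fit(
    np.stack([featurize(p) for p in history_prompts]), history_solved
)

candidates = ["Integrate x^2 dx.", "What is 1+1?"]
pass_rates = clf.predict_proba(np.stack([featurize(p) for p in candidates]))[:, 1]
ranked = sorted(zip(candidates, pass_rates), key=lambda t: abs(t[1] - 0.5))
```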

πŸ“„ CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.01844v1
πŸ‘₯ Authors: Yuliang Zhan, Jian Li (possible past Tencent (China) affiliation), Wenbing Huang (possible past Tsinghua University affiliation), Yang Liu (possible past Tsinghua University affiliation), Hao Sun
Abstract

Deep learning has demonstrated remarkable capabilities in simulating complex dynamic systems. However, existing methods require known physical properties as supervision or inputs, limiting their applicability under unknown conditions. To address this challenge, we introduce Cloth Dynamics Grounding (CDG), a novel scenario for unsupervised learning of cloth dynamics from multi-view visual observations. We further propose Cloth Dynamics Splatting (CloDS), an unsupervised dynamic learning framework...

πŸ“„ Efficient Cross-Architecture Knowledge Transfer for Large-Scale Online User Response Prediction
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.01775v1
πŸ‘₯ Authors: Yucheng Wu, Yuekui Yang (possible past Tencent (China) affiliation), Hongzheng Li, Anan Liu, Jian Xiao, Junjie Zhai (possible past Tencent (China) affiliation), Huan Yu, Shaoping Ma, Leye Wang
Abstract

Deploying new architectures in large-scale user response prediction systems incurs high model switching costs due to expensive retraining on massive historical data and performance degradation under data retention constraints. Existing knowledge distillation methods struggle with architectural heterogeneity and the prohibitive cost of transferring large embedding tables. We propose CrossAdapt, a two-stage framework for efficient cross-architecture knowledge transfer. The offline stage enables ra...

πŸ“„ : One LLM Token for Explicit Graph Structural Understanding
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.01771v1
πŸ‘₯ Authors: Jingyao Wu, Bin Lu, Zijun Di, Xiaoying Gan (possible past Shanghai Jiao Tong University affiliation), Meng Jin, Luoyi Fu (possible past Shanghai Jiao Tong University affiliation), Xinbing Wang, Chenghu Zhou
Abstract

Large language models show great potential in unstructured data understanding, but still face significant challenges with graphs due to their structural hallucination. Existing approaches mainly either verbalize graphs into natural language, which leads to excessive token consumption and scattered attention, or transform graphs into trainable continuous embeddings (i.e., soft prompt), but exhibit severe misalignment with original text tokens. To solve this problem, we propose to incorporate one ...
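
A minimal sketch of the "one graph token" idea: pool a graph encoding into a single vector, project it into the LLM embedding space, and prepend it to the token embeddings. The encoder, pooling, and dimensions here are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

d_graph, d_model = 256, 4096
node_embeddings = torch.randn(50, d_graph)         # output of any graph encoder

graph_vec = node_embeddings.mean(dim=0)            # simple mean pooling
projector = nn.Linear(d_graph, d_model)            # trained to align with text tokens
graph_token = projector(graph_vec).unsqueeze(0)    # one soft token (1 x d_model)

text_embeds = torch.randn(128, d_model)            # embedded prompt tokens
inputs_embeds = torch.cat([graph_token, text_embeds], dim=0)  # fed to the LLM
```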

πŸ“„ Meta Engine: A Unified Semantic Query Engine on Heterogeneous LLM-Based Query Systems
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.01701v1
πŸ‘₯ Authors: Ruyu Li, Tinghui Zhang, Haodi Ma, Daisy Zhe Wang (possible past University Of California, Berkeley affiliation), Yifan Wang (possible past Stanford University affiliation)
Abstract

With the increasing use of multi-modal data, semantic queries are increasingly in demand in data management systems as an important way to access and analyze such data. As unstructured data, most of the information in multi-modal data (text, image, video, etc.) hides in the semantics, which cannot be accessed by traditional database queries like SQL. Given the power of Large Language Models (LLMs) in understanding semantics and processing natural language, in recent years severa...

πŸ“„ Expanding the Capabilities of Reinforcement Learning via Text Feedback
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.02482v1
πŸ‘₯ Authors: Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak (possible past University Of California, Berkeley affiliation), J. Andrew Bagnell (possible past Carnegie Mellon University affiliation), Aarti Singh (possible past Carnegie Mellon University affiliation), Andrea Zanette
Abstract

The success of RL for LLM post-training stems from an unreasonably uninformative source: a single bit of information per rollout as binary reward or preference label. At the other extreme, distillation offers dense supervision but requires demonstrations, which are costly and difficult to scale. We study text feedback as an intermediate signal: richer than scalar rewards, yet cheaper than complete demonstrations. Textual feedback is a natural mode of human interaction and is already abundant in ...

πŸ“„ Certain Head, Uncertain Tail: Expert-Sample for Test-Time Scaling in Fine-Grained MoE
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.02443v1
πŸ‘₯ Authors: Yuanteng Chen, Peisong Wang, Nanxin Zeng, Yuantian Shao, Gang Li (possible past Tsinghua University affiliation), Jing Liu (possible past Baidu (China) affiliation), Jian Cheng
Abstract

Test-time scaling improves LLM performance by generating multiple candidate solutions, yet token-level sampling requires temperature tuning that trades off diversity against stability. Fine-grained MoE, featuring hundreds of well-trained experts per layer and multi-expert activation per token, offers an unexplored alternative through its rich routing space. We empirically characterize fine-grained MoE routing and uncover an informative pattern: router scores exhibit a certain head of high-confid...
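
A sketch of what head/tail expert sampling could look like for one token's routing (the split point and sampling rule are illustrative, not the paper's exact procedure): keep the highest-scoring experts deterministically, then sample the remaining activation slots from the tail of the router distribution to diversify candidate solutions.

```python
import torch

scores = torch.softmax(torch.randn(256), dim=0)    # router scores for one token
k, head = 8, 2                                      # activate 8 experts, fix top 2

top = torch.topk(scores, head).indices             # certain head: always kept
tail_scores = scores.clone()
tail_scores[top] = 0.0                             # exclude head from resampling
tail = torch.multinomial(tail_scores, k - head)    # uncertain tail: sampled
experts = torch.cat([top, tail])
```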

πŸ“„ An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.02400v1
πŸ‘₯ Authors: Qizhen Zhang, Ankush Garg, Jakob Foerster (possible past University Of Oxford affiliation), Niladri Chatterji, Kshitiz Malik, Mike Lewis (possible past Meta (United States) affiliation)
Abstract

Large-scale pretraining datasets drive the success of large language models (LLMs). However, these web-scale corpora inevitably contain large amounts of noisy data due to unregulated web content or randomness inherent in the data. Although LLM pretrainers often speculate that such noise contributes to instabilities in large-scale LLM pretraining and, in the worst cases, loss divergence, this phenomenon remains poorly understood. In this work, we present a systematic empirical study of whether noisy d...

πŸ“„ Learning Markov Decision Processes under Fully Bandit Feedback
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.02260v1
πŸ‘₯ Authors: Zhengjia Zhuo, Anupam Gupta (possible past Carnegie Mellon University affiliation), Viswanath Nagarajan (possible past Carnegie Mellon University affiliation)
Abstract

A standard assumption in Reinforcement Learning is that the agent observes every visited state-action pair in the associated Markov Decision Process (MDP), along with the per-step rewards. Strong theoretical results are known in this setting, achieving nearly-tight $\Theta(\sqrt{T})$-regret bounds. However, such detailed feedback can be unrealistic, and recent research has investigated more restricted settings such as trajectory feedback, where the agent observes all the visited state-action pairs, b...

πŸ“„ Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.02244v1
πŸ‘₯ Authors: Hao Wang (possible past Tsinghua University affiliation), Hao Gu, Hongming Piao, Kaixiong Gong, Yuxiao Ye, Xiangyu Yue (possible past University Of California, Berkeley affiliation), Sirui Han, Yike Guo, Dapeng Wu
Abstract

The standard post-training recipe for large reasoning models, supervised fine-tuning followed by reinforcement learning (SFT-then-RL), may limit the benefits of the RL stage: while SFT imitates expert demonstrations, it often causes overconfidence and reduces generation diversity, leaving RL with a narrowed solution space to explore. Adding entropy regularization during SFT is not a cure-all; it tends to flatten token distributions toward uniformity, increasing entropy without improving meaningf...
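
A hedged sketch of the general recipe: SFT cross-entropy combined with a self-distillation KL term toward the pre-SFT model's own distribution, so fine-tuning learns the demonstrations without collapsing token-level entropy. The fixed weighting below is an assumption; the paper's variant is adaptive.

```python
import torch
import torch.nn.functional as F

def entropy_preserving_sft_loss(student_logits, base_logits, targets, alpha=0.5):
    ce = F.cross_entropy(student_logits, targets)   # imitate demonstrations
    kl = F.kl_div(                                  # stay close to the base model's
        F.log_softmax(student_logits, dim=-1),      # own (higher-entropy) distribution
        F.log_softmax(base_logits, dim=-1),
        log_target=True, reduction="batchmean",
    )
    return ce + alpha * kl

logits_s = torch.randn(32, 50257, requires_grad=True)   # fine-tuned model
logits_b = torch.randn(32, 50257)                       # frozen pre-SFT model
loss = entropy_preserving_sft_loss(logits_s, logits_b, torch.randint(0, 50257, (32,)))
```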

πŸ“„ Co-RedTeam: Orchestrated Security Discovery and Exploitation with LLM Agents
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.02164v1
πŸ‘₯ Authors: Pengfei He, Ash Fox, Lesly Miculicich, Stefan Friedli, Daniel Fabian, Burak Gokturk, Jiliang Tang, Chen-Yu Lee (possible past Google (United States) affiliation), Tomas Pfister (possible past University Of Oxford affiliation), Long T. Le
Abstract

Large language models (LLMs) have shown promise in assisting cybersecurity tasks, yet existing approaches struggle with automatic vulnerability discovery and exploitation due to limited interaction, weak execution grounding, and a lack of experience reuse. We propose Co-RedTeam, a security-aware multi-agent framework designed to mirror real-world red-teaming workflows by integrating security-domain knowledge, code-aware analysis, execution-grounded iterative reasoning, and long-term memory. Co-R...

πŸ“„ Revisiting Adaptive Rounding with Vectorized Reparameterization for LLM Quantization
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.02151v1
πŸ‘₯ Authors: Yuli Zhou, Qingxuan Chen, Luca Benini (possible past Eth Zurich affiliation), Guolei Sun, Yawei Li (possible past Google (United States) affiliation)
Abstract

Adaptive Rounding has emerged as an alternative to round-to-nearest (RTN) for post-training quantization by enabling cross-element error cancellation. Yet, dense and element-wise rounding matrices are prohibitively expensive for billion-parameter large language models (LLMs). We revisit adaptive rounding from an efficiency perspective and propose VQRound, a parameter-efficient optimization framework that reparameterizes the rounding matrix into a compact codebook. Unlike low-rank alternatives, V...
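
An illustrative sketch of codebook-parameterized rounding: instead of one free rounding offset per weight (as in dense adaptive rounding), each weight indexes a small shared codebook of offsets, shrinking the optimization to the codebook entries. Codebook size, the assignment scheme, and the soft-rounding relaxation here are assumptions, not VQRound's exact formulation.

```python
import torch

W = torch.randn(4096, 4096)
scale = W.abs().max() / 7                         # 4-bit symmetric scale

codebook = torch.rand(16, requires_grad=True)     # 16 learnable rounding offsets
assign = torch.randint(0, 16, W.shape)            # fixed weight -> entry assignment

def dequantize_soft(W):
    offsets = torch.sigmoid(codebook)[assign]     # soft per-weight offset in (0, 1)
    q = torch.clamp(torch.floor(W / scale) + offsets, -8, 7)
    return q * scale                              # differentiable w.r.t. codebook

# Optimize `codebook` on a reconstruction loss over calibration activations,
# then harden offsets with (offsets > 0.5) as in AdaRound-style methods.
```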

πŸ“„ No Global Plan in Chain-of-Thought: Uncover the Latent Planning Horizon of LLMs
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.02103v1
πŸ‘₯ Authors: Liyan Xu, Mo Yu (possible past Tencent (China) affiliation), Fandong Meng (possible past Tencent (China) affiliation), Jie Zhou (possible past Tsinghua University affiliation)
Abstract

This work stems from prior complementary observations on the dynamics of Chain-of-Thought (CoT): Large Language Models (LLMs) have been shown to latently plan subsequent reasoning before CoT emerges, thereby diminishing the significance of explicit CoT; yet CoT remains critical for tasks requiring multi-step reasoning. To deepen our understanding of the relationship between an LLM's internal states and its verbalized reasoning trajectories, we investigate the latent planning strength of LLMs, through our probing me...
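
One way to operationalize "latent planning strength" is a linear probe that predicts a property of the not-yet-generated reasoning from pre-CoT hidden states. A schematic with stand-in data (real features would come from the model's residual stream; this random data should score at chance):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

hidden = np.random.randn(2000, 4096)     # hidden states at the last prompt token
plans = np.random.randint(0, 2, 2000)    # hypothetical label, e.g. "next step branches?"

Xtr, Xte, ytr, yte = train_test_split(hidden, plans, test_size=0.2)
probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print(probe.score(Xte, yte))             # accuracy above chance => latent planning signal
```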

πŸ“„ Self-Consolidation for Self-Evolving Agents
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.01966v1
πŸ‘₯ Authors: Hongzhuo Yu, Fei Zhu, Guo-Sen Xie (possible past Inception Institute Of Artificial Intelligence affiliation), Ling Shao (possible past Inception Institute Of Artificial Intelligence affiliation)
Abstract

While large language model (LLM) agents have demonstrated impressive problem-solving capabilities, they typically operate as static systems, lacking the ability to evolve through lifelong interaction. Existing attempts to bridge this gap primarily rely on retrieving successful past trajectories as demonstrations. However, this paradigm faces two critical limitations. First, by focusing solely on success, agents overlook the rich pedagogical value embedded in failed attempts, preventing them from...

πŸ“„ Self-Rewarding Sequential Monte Carlo for Masked Diffusion Language Models
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.01849v1
πŸ‘₯ Authors: Ziwei Luo, Ziqi Jin, Lei Wang (possible past Baidu (China) affiliation), Lidong Bing (possible past Tencent (China) affiliation), Thomas B. SchΓΆn
Abstract

This work presents self-rewarding sequential Monte Carlo (SMC), an inference-time scaling algorithm enabling effective sampling of masked diffusion language models (MDLMs). Our algorithm stems from the observation that most existing MDLMs rely on a confidence-based sampling strategy, where only tokens with the highest prediction confidence are preserved at each step. This restricts the generation to a noise-sensitive, greedy decoding paradigm, resulting in an inevitable collapse in the diversity...
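
A compact sketch of the sequential Monte Carlo skeleton, where the "self-reward" is taken here to be the model's own average token log-probability; `denoise_step` is a hypothetical model call, and particles are assumed to be a tensor of partial token sequences:

```python
import torch

def smc_decode(denoise_step, particles, steps):
    """particles: (N, L) partially masked sequences evolved in parallel."""
    for _ in range(steps):
        particles, logprobs = denoise_step(particles)       # unmask some tokens
        weights = torch.softmax(logprobs.mean(dim=-1), 0)   # self-reward per particle
        idx = torch.multinomial(weights, len(particles), replacement=True)
        particles = particles[idx]                          # resample: keep confident ones
    return particles[0]
```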

πŸ“„ Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.01842v1
πŸ‘₯ Authors: Jinbin Bai, Yixuan Li (possible past Meta (United States) affiliation), Yuchen Zhu, Yi Xin, Qingyu Shi, Aosong Feng, Xiaohong Liu (possible past Shanghai Jiao Tong University affiliation), Molei Tao, Jianru Xue, Xiangtai Li (possible past Peking University affiliation), Ming-Hsuan Yang
Abstract

Inference-time compute has re-emerged as a practical way to improve LLM reasoning. Most test-time scaling (TTS) algorithms rely on autoregressive decoding, which is ill-suited to discrete diffusion language models (dLLMs) due to their parallel decoding over the entire sequence. As a result, developing effective and efficient TTS methods to unlock dLLMs' full generative potential remains an underexplored challenge. To address this, we propose Prism (Pruning, Remasking, and Integrated Self-verific...

πŸ“„ Position: Beyond Model-Centric Prediction -- Agentic Time Series Forecasting
πŸ—“οΈ Published: 2/2/2026
πŸ”— http://arxiv.org/abs/2602.01776v1
πŸ‘₯ Authors: Mingyue Cheng, Xiaoyu Tao, Qi Liu (possible past Tencent (China) affiliation), Ze Guo, Enhong Chen (possible past Baidu (China) affiliation)
Abstract

Time series forecasting has traditionally been formulated as a model-centric, static, and single-pass prediction problem that maps historical observations to future values. While this paradigm has driven substantial progress, it proves insufficient in adaptive and multi-turn settings where forecasting requires informative feature extraction, reasoning-driven inference, iterative refinement, and continual adaptation over time. In this paper, we argue for agentic time series forecasting (ATSF), wh...

*Notable papers are those with at least two authors from a "big" AI/ML lab.