πŸ“„ Notable* Recent AI/ML arXiv Papers

πŸ“„ Data Science and Technology Towards AGI Part I: Tiered Data Management
πŸ—“οΈ Published: 2/9/2026
πŸ”— http://arxiv.org/abs/2602.09003v1
πŸ‘₯ Authors: Yudong Wang, Zixuan Fu, Hengyu Zhao, Chen Zhao (possible past Stanford University affiliation), Chuyue Zhou, Xinle Lin, Hongya Lyu, Shuaikang Xue, Yi Yi, Yingjiao Wang, Zhi Zheng, Yuzhou Zhang (possible past Google (United States) affiliation), Jie Zhou (possible past Tsinghua University affiliation), Chaojun Xiao, Xu Han (possible past Tsinghua University affiliation), Zhiyuan Liu (possible past Tsinghua University affiliation), Maosong Sun (possible past Tsinghua University affiliation)
Abstract

The development of artificial intelligence can be viewed as an evolution of data-driven learning paradigms, with successive shifts in data organization and utilization continuously driving advances in model capability. Current LLM research is dominated by a paradigm that relies heavily on unidirectional scaling of data size, increasingly encountering bottlenecks in data availability, acquisition cost, and training efficiency. In this work, we argue that the development of AGI is entering a new p...

πŸ“„ iGRPO: Self-Feedback-Driven LLM Reasoning
πŸ—“οΈ Published: 2/9/2026
πŸ”— http://arxiv.org/abs/2602.09000v1
πŸ‘₯ Authors: Ali Hatamizadeh (possible past Nvidia (United States) affiliation), Shrimai Prabhumoye (possible past Carnegie Mellon University affiliation), Igor Gitman, Ximing Lu, Seungju Han, Wei Ping (possible past Baidu (China) affiliation), Yejin Choi (possible past Allen Institute For Artificial Intelligence affiliation), Jan Kautz (possible past Nvidia (United States) affiliation)
Abstract

Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with task-specific rewards, improving overall quality and reliability. Group Relative Policy Optimization (GRPO) is an efficient, value-function-free alternative to Proximal Policy Optimization (PPO) that leverages group-relative reward normalization. We introduce It...
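
The core mechanism named here, group-relative reward normalization, is easy to make concrete. Below is a minimal sketch of plain GRPO advantage computation (the general technique, not the paper's iterative self-feedback variant):

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: standardize each response's reward against
    the mean and std of its own group, replacing a learned value baseline."""
    mean, std = group_rewards.mean(), group_rewards.std()
    return (group_rewards - mean) / (std + eps)

# Example: 6 responses sampled for one prompt, scored 0/1 by a verifier.
rewards = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])
print(grpo_advantages(rewards))  # above-average responses get positive advantage
```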

πŸ“„ InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery
πŸ—“οΈ Published: 2/9/2026
πŸ”— http://arxiv.org/abs/2602.08990v1
πŸ‘₯ Authors: Shiyang Feng, Runmin Ma, Xiangchao Yan, Yue Fan, Yusong Hu, Songtao Huang, Shuaiyu Zhang, Zongsheng Cao, Tianshuo Peng, Jiakang Yuan, Zijie Guo, Zhijie Zhong, Shangheng Du, Weida Wang, Jinxin Shi, Yuhao Zhou, Xiaohan He, Zhiyin Yu, Fangchen Yu, Qihao Zheng, Jiamin Wu, Mianxin Liu, Chi Zhang (possible past Peking University affiliation), Shaowei Hou, Shuya Li, Yankai Jiang, Wenjie Lou, Lilong Wang, Zifu Wang, Jiong Wang, Wanghan Xu, Yue Deng, Dongrui Liu, Yiheng Wang, Wenlong Zhang, Fenghua Ling, Shufei Zhang, Xiaosong Wang (possible past Nvidia (United States) affiliation), Shuangjia Zheng, Xun Huang (possible past Nvidia (United States) affiliation), Siqi Sun, Shuyue Hu, Peng Ye, Chunfeng Song, Bin Wang, Conghui He (possible past Tsinghua University affiliation), Yihao Liu, Xin Li (possible past Google (United States) affiliation), Qibin Hou, Tao Chen, Xiangyu Yue (possible past University Of California, Berkeley affiliation), Bin Wang, Liang He, Dahua Lin, Bowen Zhou, Bo Zhang (possible past Tencent (China) affiliation), Lei Bai
Abstract

We introduce InternAgent-1.5, a unified system designed for end-to-end scientific discovery across computational and empirical domains. The system is built on a structured architecture composed of three coordinated subsystems for generation, verification, and evolution. These subsystems are supported by foundational capabilities for deep research, solution optimization, and long-horizon memory. The architecture allows InternAgent-1.5 to operate continuously across extended discovery cycles while...

πŸ“„ FlattenGPT: Depth Compression for Transformer with Layer Flattening
πŸ—“οΈ Published: 2/9/2026
πŸ”— http://arxiv.org/abs/2602.08858v1
πŸ‘₯ Authors: Ruihan Xu, Qingpei Guo, Yao Zhu, Xiangyang Ji (possible past Tsinghua University affiliation), Ming Yang (possible past Meta (United States) affiliation), Shiliang Zhang
Abstract

Recent works have indicated redundancy across transformer blocks, prompting research into depth compression that prunes less crucial blocks. However, current entire-block pruning methods risk discarding meaningful cues learned in those blocks, leading to substantial performance degradation. As another line of model compression, channel pruning can better preserve performance, but it cannot reduce model depth and is challenged by inconsistent pruning ratios across individual layers....

πŸ“„ WildReward: Learning Reward Models from In-the-Wild Human Interactions
πŸ—“οΈ Published: 2/9/2026
πŸ”— http://arxiv.org/abs/2602.08829v1
πŸ‘₯ Authors: Hao Peng (possible past Tsinghua University affiliation), Yunjia Qi, Xiaozhi Wang, Zijun Yao, Lei Hou (possible past Tsinghua University affiliation), Juanzi Li
Abstract

Reward models (RMs) are crucial for the training of large language models (LLMs), yet they typically rely on large-scale human-annotated preference pairs. With the widespread deployment of LLMs, in-the-wild interactions have emerged as a rich source of implicit reward signals. This raises the question: Can we develop reward models directly from in-the-wild interactions? In this work, we explore this possibility by adopting WildChat as an interaction source and proposing a pipeline to extract rel...
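
The abstract is cut off mid-pipeline, but the standard recipe such a pipeline would feed is well established: once (chosen, rejected) pairs are mined from interaction logs, a reward model is typically fit with a pairwise Bradley-Terry objective. A minimal sketch of that objective (our assumption about the training stage, not the paper's stated method):

```python
import numpy as np

def pairwise_rm_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Bradley-Terry pairwise loss, -log sigmoid(r_chosen - r_rejected),
    the standard objective for reward models trained on preference pairs."""
    margin = r_chosen - r_rejected
    return float(np.mean(np.logaddexp(0.0, -margin)))  # stable -log(sigmoid)

# Toy reward-model scores for 3 (chosen, rejected) pairs mined from logs.
print(pairwise_rm_loss(np.array([2.0, 0.5, 1.2]), np.array([1.0, 0.7, -0.3])))
```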

πŸ“„ OSCAR: Optimization-Steered Agentic Planning for Composed Image Retrieval
πŸ—“οΈ Published: 2/9/2026
πŸ”— http://arxiv.org/abs/2602.08603v1
πŸ‘₯ Authors: Teng Wang, Rong Shan, Jianghao Lin, Junjie Wu, Tianyi Xu, Jianping Zhang, Wenteng Chen, Changwang Zhang (possible past Tencent (China) affiliation), Zhaoxiang Wang, Weinan Zhang (possible past Shanghai Jiao Tong University affiliation), Jun Wang (possible past Tencent (China) affiliation)
Abstract

Composed image retrieval (CIR) requires complex reasoning over heterogeneous visual and textual constraints. Existing approaches largely fall into two paradigms: unified embedding retrieval, which suffers from single-model myopia, and heuristic agentic retrieval, which is limited by suboptimal, trial-and-error orchestration. To address these limitations, we propose OSCAR, an optimization-steered agentic planning framework for composed image retrieval. We are the first to reformulate agentic CIR from a heuristic ...

πŸ“„ Dialogue Model Optimization via Agent Game and Adaptive Tree-based GRPO
πŸ—“οΈ Published: 2/9/2026
πŸ”— http://arxiv.org/abs/2602.08533v1
πŸ‘₯ Authors: Kun Peng, Conghui Tan, Yu Liu, Guohua Tang, Zhongqian Sun (possible past Tencent (China) affiliation), Wei Yang (possible past Tencent (China) affiliation), Zining Zhu, Lei Jiang, Yanbing Liu, Hao Peng (possible past Tsinghua University affiliation)
Abstract

Open-ended dialogue agents aim to deliver engaging, personalized interactions by adapting to users' traits, but existing methods face critical limitations: over-reliance on pre-collected user data, and short-horizon biases in reinforcement learning (RL) that neglect long-term dialogue value. To address these, we propose a novel long-horizon RL framework integrating online personalization with Adaptive Tree-based Group Relative Policy Optimization (AT-GRPO). Adopting a two-agent game paradigm, a ...

πŸ“„ MemAdapter: Fast Alignment across Agent Memory Paradigms via Generative Subgraph Retrieval
πŸ—“οΈ Published: 2/9/2026
πŸ”— http://arxiv.org/abs/2602.08369v1
πŸ‘₯ Authors: Xin Zhang (possible past Google (United States) affiliation), Kailai Yang, Chenyue Li, Hao Li (possible past Tsinghua University affiliation), Qiyu Wei, Jun'ichi Tsujii, Sophia Ananiadou
Abstract

The memory mechanism is a core component of LLM-based agents, enabling reasoning and knowledge discovery over long-horizon contexts. Existing agent memory systems are typically designed within isolated paradigms (e.g., explicit, parametric, or latent memory) with tightly coupled retrieval methods that hinder cross-paradigm generalization and fusion. In this work, we take a first step toward unifying heterogeneous memory paradigms within a single memory system. We propose MemAdapter, a memory retriev...

πŸ“„ OPE: Overcoming Information Saturation in Parallel Thinking via Outline-Guided Path Exploration
πŸ—“οΈ Published: 2/9/2026
πŸ”— http://arxiv.org/abs/2602.08344v1
πŸ‘₯ Authors: Qi Guo, Jianing Wang, Deyang Kong, Xiangyu Xi, Jianfei Zhang, Yi Lu, Jingang Wang, Wei Wang (possible past University Of Oxford affiliation), Shikun Zhang, Wei Ye (possible past Meta (United States) affiliation)
Abstract

Parallel thinking has emerged as a new paradigm for large reasoning models (LRMs) in tackling complex problems. Recent methods leverage Reinforcement Learning (RL) to enhance parallel thinking, aiming to address the limitations in computational resources and effectiveness encountered with supervised fine-tuning. However, most existing studies primarily focus on optimizing the aggregation phase, with limited attention to the path exploration stage. In this paper, we theoretically analyze the opti...

πŸ“„ Who Deserves the Reward? SHARP: Shapley Credit-based Optimization for Multi-Agent System
πŸ—“οΈ Published: 2/9/2026
πŸ”— http://arxiv.org/abs/2602.08335v1
πŸ‘₯ Authors: Yanming Li, Xuelin Zhang, Wenjie Lu, Ziye Tang, Maodong Wu, Haotian Luo, Tongtong Wu, Zijie Peng, Hongze Mi, Yibo Feng, Naiqiang Tan, Chao Huang (possible past Tencent (China) affiliation), Hong Chen, Li Shen (possible past Tencent (China) affiliation)
Abstract

Integrating Large Language Models (LLMs) with external tools via multi-agent systems offers a promising new paradigm for decomposing and solving complex problems. However, training these systems remains notoriously difficult due to the credit assignment challenge, as it is often unclear which specific functional agent is responsible for the success or failure of decision trajectories. Existing methods typically rely on sparse or globally broadcast rewards, failing to capture individual contribut...
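
Shapley values are the classical answer to exactly this credit-assignment question: an agent's credit is its marginal contribution averaged over all join orders. A minimal exact sketch follows; it enumerates all n! orders, so a method like the one described would presumably rely on approximation at scale.

```python
from itertools import permutations
from math import factorial

def shapley_credits(agents, value):
    """Exact Shapley values: each agent's marginal contribution averaged over
    all n! join orders. Tractable only for small agent sets."""
    credits = {a: 0.0 for a in agents}
    for order in permutations(agents):
        coalition = frozenset()
        for a in order:
            credits[a] += value(coalition | {a}) - value(coalition)
            coalition = coalition | {a}
    n_orders = factorial(len(agents))
    return {a: c / n_orders for a, c in credits.items()}

# Toy: the trajectory succeeds (value 1.0) only if both planner and executor act.
value = lambda c: 1.0 if {"planner", "executor"} <= c else 0.0
print(shapley_credits(["planner", "executor", "critic"], value))
# -> planner 0.5, executor 0.5, critic 0.0; credits sum to the full reward
```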

πŸ“„ Latent Reasoning with Supervised Thinking States
πŸ—“οΈ Published: 2/9/2026
πŸ”— http://arxiv.org/abs/2602.08332v1
πŸ‘₯ Authors: Ido Amos, Avi Caciularu, Mor Geva, Amir Globerson (possible past Google (United States) affiliation), Jonathan Herzig (possible past Google (United States) affiliation), Lior Shani, Idan Szpektor (possible past Google (United States) affiliation)
Abstract

Reasoning with a chain-of-thought (CoT) enables Large Language Models (LLMs) to solve complex tasks but incurs significant inference costs due to the generation of long rationales. We propose Thinking States, a method that performs reasoning *while* the input is being processed. Specifically, Thinking States generates sequences of thinking tokens every few input tokens, transforms the thoughts back into embedding space, and adds them to the following input tokens. This has two key advantages. Fir...
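
The mechanism is concrete enough to sketch schematically. The toy below follows our reading of the abstract (it is not the paper's code, and `think_fn` is a stand-in for the model's thinking-token generation):

```python
import numpy as np

def add_thinking_states(input_embs: np.ndarray, think_fn, stride: int = 4) -> np.ndarray:
    """Schematic of the described mechanism: every `stride` input tokens,
    produce a thought vector from the prefix and add it to the embeddings
    of the next chunk of input tokens."""
    out = input_embs.copy()
    for start in range(stride, len(out), stride):
        thought = think_fn(out[:start])            # "think" over the prefix
        out[start:start + stride] += thought       # inject into upcoming tokens
    return out

# Toy: 10 tokens, 8-dim embeddings; "thinking" is a mean-pool stand-in here.
embs = np.random.randn(10, 8)
print(add_thinking_states(embs, lambda prefix: prefix.mean(axis=0)).shape)  # (10, 8)
```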

πŸ“„ Near-Oracle KV Selection via Pre-hoc Sparsity for Long-Context Inference
πŸ—“οΈ Published: 2/9/2026
πŸ”— http://arxiv.org/abs/2602.08329v1
πŸ‘₯ Authors: Yifei Gao, Lei Wang (possible past Baidu (China) affiliation), Rong-Cheng Tu, Qixin Zhang, Jun Cheng (possible past Deepmind (United Kingdom) affiliation), Dacheng Tao
Abstract

A core bottleneck in large language model (LLM) inference is the cost of attending over the ever-growing key-value (KV) cache. Although near-oracle top-k KV selection can preserve the quality of dense attention while sharply reducing computation and bandwidth, existing sparse methods generally rely on posterior heuristics, i.e., selectors conditioned on observed attention or proxy scores. Such conditioning introduces posterior bias: it tends to distort true token importance and miss salient toke...
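
The oracle the paper targets is straightforward to state in code: attend only to the k keys with the highest true attention scores. The sketch below shows that oracle itself, which necessarily peeks at dense scores; the paper's contribution is matching it without doing so.

```python
import numpy as np

def oracle_topk_attention(q, K, V, k):
    """Oracle top-k sparse attention: attend only to the k keys with the
    highest true scores for this query (computed here from dense scores,
    which is exactly what pre-hoc methods try to avoid)."""
    scores = K @ q / np.sqrt(q.shape[-1])          # dense scores (oracle peeks)
    keep = np.argsort(scores)[-k:]                 # indices of the top-k keys
    w = np.exp(scores[keep] - scores[keep].max())  # softmax over kept keys only
    return (w / w.sum()) @ V[keep]

# Toy: one query against a 128-entry KV cache, keeping just 8 entries.
d, n = 64, 128
q, K, V = np.random.randn(d), np.random.randn(n, d), np.random.randn(n, d)
print(oracle_topk_attention(q, K, V, k=8).shape)   # (64,)
```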

πŸ“„ When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents
πŸ—“οΈ Published: 2/9/2026
πŸ”— http://arxiv.org/abs/2602.08235v1
πŸ‘₯ Authors: Jaylen Jones, Zhehao Zhang, Yuting Ning, Eric Fosler-Lussier, Pierre-Luc St-Charles, Yoshua Bengio (possible past Mila - Quebec Artificial Intelligence Institute affiliation), Dawn Song (possible past University Of California, Berkeley affiliation), Yu Su, Huan Sun
Abstract

Although computer-use agents (CUAs) hold significant potential to automate increasingly complex OS workflows, they can demonstrate unsafe unintended behaviors that deviate from expected outcomes even under benign input contexts. However, exploration of this risk remains largely anecdotal, lacking concrete characterization and automated methods to proactively surface long-tail unintended behaviors under realistic CUA scenarios. To fill this gap, we introduce the first conceptual and methodologica...

πŸ“„ Tutti: Expressive Multi-Singer Synthesis via Structure-Level Timbre Control and Vocal Texture Modeling
πŸ—“οΈ Published: 2/9/2026
πŸ”— http://arxiv.org/abs/2602.08233v1
πŸ‘₯ Authors: Jiatao Chen, Xing Tang, Xiaoyue Duan, Yutang Feng, Jinchao Zhang (possible past Tencent (China) affiliation), Jie Zhou (possible past Tsinghua University affiliation)
Abstract

While existing Singing Voice Synthesis systems achieve high-fidelity solo performances, they are constrained by global timbre control, failing to address dynamic multi-singer arrangement and vocal texture within a single song. To address this, we propose Tutti, a unified framework designed for structured multi-singer generation. Specifically, we introduce a Structure-Aware Singer Prompt to enable flexible singer scheduling evolving with musical structure, and propose Complementary Texture Learni...

πŸ“„ FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging
πŸ—“οΈ Published: 2/8/2026
πŸ”— http://arxiv.org/abs/2602.08024v1
πŸ‘₯ Authors: Ziyang Fan, Keyu Chen, Ruilong Xing, Yulin Li (possible past Baidu (China) affiliation), Li Jiang (possible past Tencent (China) affiliation), Zhuotao Tian
Abstract

Although Video Large Language Models (VLLMs) have shown remarkable capabilities in video understanding, they are required to process high volumes of visual tokens, causing significant computational inefficiency. Existing VLLM acceleration frameworks usually compress spatial and temporal redundancy independently, which overlooks spatiotemporal relationships and thereby leads to suboptimal spatiotemporal compression. The highly correlated visual features are likely to change in spatial positio...
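
For context, the best-known training-free token-merging baseline is bipartite soft matching (ToMe). The sketch below shows that standard spatial baseline, not FlashVID's tree-based spatiotemporal scheme:

```python
import numpy as np

def bipartite_token_merge(tokens: np.ndarray, n_merge: int) -> np.ndarray:
    """ToMe-style bipartite merging: split tokens into two sets, average the
    n_merge most similar cross-set pairs, and keep the rest unchanged."""
    a, b = tokens[0::2], tokens[1::2]
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = an @ bn.T                                # cosine similarity, |a| x |b|
    best = sim.argmax(axis=1)                      # best partner in b per a-token
    order = np.argsort(-sim.max(axis=1))           # most similar a-tokens first
    merged, used_b, drop_a = [], set(), []
    for i in order:
        if len(merged) == n_merge:
            break
        j = int(best[i])
        if j in used_b:
            continue                               # partner already merged away
        merged.append((a[i] + b[j]) / 2.0)
        used_b.add(j)
        drop_a.append(i)
    kept_a = np.delete(a, drop_a, axis=0)
    kept_b = np.delete(b, sorted(used_b), axis=0)
    return np.vstack([kept_a, kept_b, np.array(merged)])

# Toy: 16 "visual tokens" reduced by 4 merges -> 12 tokens remain.
print(bipartite_token_merge(np.random.randn(16, 32), n_merge=4).shape)  # (12, 32)
```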

πŸ“„ Towards Adaptive, Scalable, and Robust Coordination of LLM Agents: A Dynamic Ad-Hoc Networking Perspective
πŸ—“οΈ Published: 2/8/2026
πŸ”— http://arxiv.org/abs/2602.08009v1
πŸ‘₯ Authors: Rui Li (possible past Google (United States) affiliation), Zeyu Zhang, Xiaohe Bo, Quanyu Dai, Chaozhuo Li, Feng Wen, Xu Chen (possible past Tencent (China) affiliation)
Abstract

Multi-agent architectures built on large language models (LLMs) have demonstrated the potential to realize swarm intelligence through well-crafted collaboration. However, the substantial burden of manual orchestration inherently raises an imperative to automate the design of agentic workflows. We frame such an agent coordination challenge as a classic problem in dynamic ad-hoc networking: How to establish adaptive and reliable communication among a scalable number of agentic hosts? In response t...

πŸ“„ Learning-guided Kansa collocation for forward and inverse PDEs beyond linearity
πŸ—“οΈ Published: 2/8/2026
πŸ”— http://arxiv.org/abs/2602.07970v1
πŸ‘₯ Authors: Zheyuan Hu, Weitao Chen, Cengiz Γ–ztireli (possible past University Of Cambridge affiliation), Chenliang Zhou, Fangcheng Zhong (possible past University Of Cambridge affiliation)
Abstract

Partial Differential Equations (PDEs) precisely model physical, biological, and graphical phenomena. However, numerical methods suffer from the curse of dimensionality, high computational costs, and domain-specific discretization. We aim to explore the pros and cons of different PDE solvers and apply them to specific scientific simulation problems, including forward solutions, inverse problems, and equation discovery. In particular, we extend the recent CNF (NeurIPS 2023) framework solver t...
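
The classical method in the title is easy to demonstrate. Below is a minimal forward-problem sketch of Kansa's unsymmetric RBF collocation for a 1D Poisson problem with multiquadric basis functions (the textbook method, not the paper's learning-guided variant):

```python
import numpy as np

def kansa_poisson_1d(f, n=30, eps=3.0):
    """Kansa collocation for u''(x) = f(x) on [0,1] with u(0) = u(1) = 0:
    expand u in multiquadric RBFs, enforce the PDE at interior collocation
    points and the boundary conditions at the endpoints, and solve."""
    x = np.linspace(0.0, 1.0, n)                   # collocation points = centers
    r2 = (x[:, None] - x[None, :]) ** 2
    phi = np.sqrt(1.0 + eps**2 * r2)               # multiquadric phi(r)
    A = eps**2 / (1.0 + eps**2 * r2) ** 1.5        # exact second derivative rows
    A[[0, -1]] = phi[[0, -1]]                      # boundary rows enforce u = 0
    rhs = f(x)
    rhs[[0, -1]] = 0.0
    lam = np.linalg.solve(A, rhs)                  # expansion coefficients
    return lambda xq: np.sqrt(1.0 + eps**2 * (xq[:, None] - x[None, :]) ** 2) @ lam

# Check against the manufactured solution u(x) = sin(pi x), u'' = -pi^2 sin(pi x).
u = kansa_poisson_1d(lambda x: -np.pi**2 * np.sin(np.pi * x))
xq = np.linspace(0.0, 1.0, 101)
print(np.max(np.abs(u(xq) - np.sin(np.pi * xq))))  # small max error expected
```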

πŸ“„ Contact-Anchored Policies: Contact Conditioning Creates Strong Robot Utility Models
πŸ—“οΈ Published: 2/9/2026
πŸ”— http://arxiv.org/abs/2602.09017v1
πŸ‘₯ Authors: Zichen Jeff Cui, Omar Rayyan, Haritheja Etukuru, Bowen Tan, Zavier Andrianarivo, Zicheng Teng, Yihang Zhou, Krish Mehta, Nicholas Wojno, Kevin Yuanbo Wu, Manan H Anjaria, Ziyuan Wu, Manrong Mao, Guangxun Zhang, Binit Shah, Yejin Kim, Soumith Chintala (possible past Meta (United States) affiliation), Lerrel Pinto (possible past Carnegie Mellon University affiliation), Nur Muhammad Mahi Shafiullah
Abstract

The prevalent paradigm in robot learning attempts to generalize across environments, embodiments, and tasks with language prompts at runtime. A fundamental tension limits this approach: language is often too abstract to guide the concrete physical understanding required for robust manipulation. In this work, we introduce Contact-Anchored Policies (CAP), which replace language conditioning with points of physical contact in space. Simultaneously, we structure CAP as a library of modular utility m...

πŸ“„ Analysis of Converged 3D Gaussian Splatting Solutions: Density Effects and Prediction Limit
πŸ—“οΈ Published: 2/9/2026
πŸ”— http://arxiv.org/abs/2602.08909v1
πŸ‘₯ Authors: Zhendong Wang, Cihan Ruan, Jingchuan Xiao, Chuqing Shi, Wei Jiang (possible past Apple (United States) affiliation), Wei Wang (possible past University Of Oxford affiliation), Wenjie Liu, Nam Ling
Abstract

We investigate what structure emerges in 3D Gaussian Splatting (3DGS) solutions from standard multi-view optimization. We term these Rendering-Optimal References (RORs) and analyze their statistical properties, revealing stable patterns: mixture-structured scales and bimodal radiance across diverse scenes. To understand what determines these parameters, we apply learnability probes by training predictors to reconstruct RORs from point clouds without rendering supervision. Our analysis uncovers f...

πŸ“„ Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression
πŸ—“οΈ Published: 2/9/2026
πŸ”— http://arxiv.org/abs/2602.08324v1
πŸ‘₯ Authors: Yuntian Tang, Bohan Jia, Wenxuan Huang, Lianyue Zhang, Jiao Xie, Wenxi Li, Wei Li (possible past Peking University affiliation), Jie Hu, Xinghao Chen, Rongrong Ji (possible past Tencent (China) affiliation), Shaohui Lin
Abstract

Chain-of-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs), yet it incurs substantial computational overhead for inference. Existing CoT compression methods often suffer from a critical loss of logical fidelity at high compression ratios, resulting in significant performance degradation. To achieve high-fidelity, fast reasoning, we propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively redu...

πŸ“„ SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning
πŸ—“οΈ Published: 2/9/2026
πŸ”— http://arxiv.org/abs/2602.08234v1
πŸ‘₯ Authors: Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang (possible past Tsinghua University affiliation), Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie (possible past Google (United States) affiliation), Huaxiu Yao
Abstract

Large Language Model (LLM) agents have shown impressive results on complex tasks, yet they often operate in isolation, failing to learn from past experiences. Existing memory-based methods primarily store raw trajectories, which are often redundant and noise-heavy. This prevents agents from extracting the high-level, reusable behavioral patterns that are essential for generalization. In this paper, we propose SkillRL, a framework that bridges the gap between raw experience and policy improvement throu...

πŸ“„ Enhancing Bandit Algorithms with LLMs for Time-varying User Preferences in Streaming Recommendations
πŸ—“οΈ Published: 2/8/2026
πŸ”— http://arxiv.org/abs/2602.08067v1
πŸ‘₯ Authors: Chenglei Shen, Yi Zhan, Weijie Yu, Xiao Zhang (possible past Tsinghua University affiliation), Jun Xu (possible past Google (United States) affiliation)
Abstract

In real-world streaming recommender systems, user preferences evolve dynamically over time. Existing bandit-based methods treat time merely as a timestamp, neglecting its explicit relationship with user preferences and leading to suboptimal performance. Moreover, online learning methods often suffer from inefficient exploration-exploitation during the early online phase. To address these issues, we propose HyperBandit+, a novel contextual bandit policy that integrates a time-aware hypernetwork t...
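
For orientation, the standard contextual bandit baseline such methods extend is LinUCB; a time-aware variant like the one described could, for example, fold periodic time encodings into the context vector. A minimal sketch under that assumption (not HyperBandit+'s hypernetwork design):

```python
import numpy as np

class LinUCB:
    """Plain LinUCB: per-arm ridge regression with an upper-confidence
    exploration bonus on top of the predicted reward."""
    def __init__(self, n_arms: int, dim: int, alpha: float = 1.0):
        self.A = [np.eye(dim) for _ in range(n_arms)]   # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]
        self.alpha = alpha

    def select(self, x: np.ndarray) -> int:
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                           # ridge reward estimate
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm: int, x: np.ndarray, reward: float) -> None:
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Time-as-feature: append periodic encodings of t (hour of day) to the context.
t = 13.0
x = np.concatenate([np.random.randn(4),
                    [np.sin(2 * np.pi * t / 24), np.cos(2 * np.pi * t / 24)]])
bandit = LinUCB(n_arms=5, dim=6)
arm = bandit.select(x)
bandit.update(arm, x, reward=1.0)
print(arm)
```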

*Notable papers are those with at least two authors from a "big" AI/ML lab.