πŸ“„ Notable* Recent AI/ML arXiv Papers


πŸ“„ A Very Big Video Reasoning Suite
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.20159v1
πŸ‘₯ Authors: Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, ThaddΓ€us Wiedemer, Qingying Gao, Dezhi Luo, Yaoyao Qian, Lianyu Huang, Zelong Hong, Jiahui Ge, Qianli Ma, Hang He, Yifan Zhou, Lingzi Guo, Lantao Mei, Jiachen Li, Hanwen Xing, Tianqi Zhao, Fengyuan Yu, Weihang Xiao, Yizheng Jiao, Jianheng Hou, Danyang Zhang, Pengcheng Xu, Boyang Zhong, Zehong Zhao, Gaoyun Fang, John Kitaoka, Yile Xu, Hua Xu, Kenton Blacutt, Tin Nguyen, Siyuan Song, Haoran Sun, Shaoyue Wen, Linyang He, Runming Wang, Yanzhi Wang, Mengyue Yang, Ziqiao Ma, RaphaΓ«l MilliΓ¨re, Freda Shi, Nuno Vasconcelos, Daniel Khashabi, Alan Yuille (possible past Google (United States) affiliation), Yilun Du (possible past Massachusetts Institute Of Technology affiliation), Ziming Liu (possible past Massachusetts Institute Of Technology affiliation), Bo Li (possible past Tencent (China) affiliation), Dahua Lin, Ziwei Liu, Vikash Kumar (possible past University Of Washington affiliation), Yijiang Li, Lei Yang (possible past Google (United States) affiliation), Zhongang Cai, Hokin Deng
Abstract

Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual environments that go beyond what text can naturally capture, enabling intuitive reasoning over spatiotemporal structure such as continuity, interaction, and causality. However, systematically studying video reasoning and its scaling behavior is hindered by the lack of large-scale training data. To addr...

πŸ“„ AdaEvolve: Adaptive LLM Driven Zeroth-Order Optimization
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.20133v1
πŸ‘₯ Authors: Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu (possible past Tencent (China) affiliation), Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen (possible past University Of California, Berkeley affiliation), Matei Zaharia (possible past University Of California, Berkeley affiliation), Alex Dimakis, Ion Stoica (possible past University Of California, Berkeley affiliation)
Abstract

The paradigm of automated program generation is shifting from one-shot generation to inference-time search, where Large Language Models (LLMs) function as semantic mutation operators within evolutionary loops. While effective, these systems are currently governed by static schedules that fail to account for the non-stationary dynamics of the search process. This rigidity results in substantial computational waste, as resources are indiscriminately allocated to stagnating populations while promis...

πŸ“„ NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.20119v1
πŸ‘₯ Authors: Jiahui Fu, Junyu Nan, Lingfeng Sun, Hongyu Li (possible past Meta (United States) affiliation), Jianing Qian, Jennifer L. Barry, Kris Kitani (possible past Carnegie Mellon University affiliation), George Konidaris
Abstract

Solving long-horizon tasks requires robots to integrate high-level semantic reasoning with low-level physical interaction. While vision-language models (VLMs) and video generation models can decompose tasks and imagine outcomes, they often lack the physical grounding necessary for real-world execution. We introduce NovaPlan, a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. At the high lev...

πŸ“„ Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.20078v1
πŸ‘₯ Authors: Shan Yang (possible past Google (United States) affiliation), Yang Liu (possible past Tsinghua University affiliation)
Abstract

Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise: when agents share a common reward, the actions of all $N$ agents jointly determine each agent's learning signal, so cross-agent noise grows with $N$. In the policy gradient setting, per-agent gradient estimate variance scales as $\Theta(N)$, yielding sample complexity $\mathcal{O}(N/\varepsilon)$. We observe that many domains -- cloud computing, transportation, power systems -- have differentiable analy...
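The $\Theta(N)$ variance scaling follows from a standard argument; a minimal sketch, assuming a shared return that sums independent per-agent contributions (the symbols $r_j$ and $\sigma^2$ are illustrative and not taken from the paper):

```latex
% Shared-reward policy-gradient estimator for agent i:
\hat{g}_i = \nabla_{\theta_i}\log\pi_{\theta_i}(a_i \mid s)\, R,
\qquad R = \sum_{j=1}^{N} r_j .
% If the per-agent terms r_j are independent with variance \sigma^2:
\operatorname{Var}(R) = N\sigma^2
\;\Longrightarrow\;
\operatorname{Var}(\hat{g}_i) = \Theta(N),
% so driving the estimation error below \varepsilon requires
% \mathcal{O}(N/\varepsilon) samples.
```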

πŸ“„ Agents of Chaos
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.20021v1
πŸ‘₯ Authors: Natalie Shapira, Chris Wendler, Avery Yen, Gabriele Sarti, Koyena Pal, Olivia Floody, Adam Belfki, Alex Loftus, Aditya Ratan Jannali, Nikhil Prakash, Jasmine Cui, Giordano Rogers, Jannik Brinkmann, Can Rager, Amir Zur, Michael Ripa, Aruna Sankaranarayanan, David Atkinson, Rohit Gandikota, Jaden Fiotto-Kaufman, Eunjeong Hwang, Hadas Orgad, P Sam Sahil, Negev Taglicht, Tomer Shabtay, Atai Ambus, Nitay Alon, Shiri Oron, Ayelet Gordon-Tapiero, Yotam Kaplan, Vered Shwartz, Tamar Rott Shaham (possible past Technion – Israel Institute Of Technology affiliation), Christoph Riedl, Reuth Mirsky, Maarten Sap, David Manheim, Tomer Ullman (possible past Massachusetts Institute Of Technology affiliation), David Bau (possible past Google (United States) affiliation)
Abstract

We report an exploratory red-teaming study of autonomous language-model-powered agents deployed in a live laboratory environment with persistent memory, email accounts, Discord access, file systems, and shell execution. Over a two-week period, twenty AI researchers interacted with the agents under benign and adversarial conditions. Focusing on failures emerging from the integration of language models with autonomy, tool use, and multi-party communication, we document eleven representative case s...

πŸ“„ VecFormer: Towards Efficient and Generalizable Graph Transformer with Graph Token Attention
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.19622v1
πŸ‘₯ Authors: Jingbo Zhou (possible past Baidu (China) affiliation), Jun Xia, Siyuan Li (possible past Tencent (China) affiliation), Yunfan Liu, Wenjun Wang, Yufei Huang, Changxi Chi, Mutian Hong, Zhuoli Ouyang, Shu Wang, Zhongqi Wang, Xingyu Wu, Chang Yu, Stan Z. Li
Abstract

Graph Transformer has demonstrated impressive capabilities in the field of graph representation learning. However, existing approaches face two critical challenges: (1) most models suffer from exponentially increasing computational complexity, making it difficult to scale to large graphs; (2) attention mechanisms based on node-level operations limit the flexibility of the model and result in poor generalization performance in out-of-distribution (OOD) scenarios. To address these issues, we propo...

πŸ“„ DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.19565v1
πŸ‘₯ Authors: Li Zhang (possible past University Of Oxford affiliation), Mingyu Mei, Ailing Wang, Xianhui Meng, Yan Zhong, Xinyuan Song, Liu Liu, Rujing Wang, Zaixing He, Cewu Lu (possible past Shanghai Jiao Tong University affiliation)
Abstract

Articulated object pose estimation is a core task in embodied AI. Existing methods typically regress poses in a continuous space, but often struggle with 1) navigating a large, complex search space and 2) incorporating intrinsic kinematic constraints. In this work, we introduce DICArt (DIsCrete Diffusion for Articulation Pose Estimation), a novel framework that formulates pose estimation as a conditional discrete diffusion process. Instead of operating in a continuous domain, DICArt pro...

πŸ“„ Fore-Mamba3D: Mamba-based Foreground-Enhanced Encoding for 3D Object Detection
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.19536v1
πŸ‘₯ Authors: Zhiwei Ning, Xuanang Gao, Jiaxi Cao, Runze Yang, Huiying Xu, Xinzhong Zhu, Jie Yang (possible past Shanghai Jiao Tong University affiliation), Wei Liu (possible past Tsinghua University affiliation)
Abstract

Linear modeling methods like Mamba have emerged as an effective backbone for the 3D object detection task. However, previous Mamba-based methods utilize bidirectional encoding for the whole non-empty voxel sequence, which contains abundant useless background information in the scenes. Though directly encoding foreground voxels appears to be a plausible solution, it tends to degrade detection performance. We attribute this to the response attenuation and restricted context representation...

πŸ“„ TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics
πŸ—“οΈ Published: 2/22/2026
πŸ”— http://arxiv.org/abs/2602.19313v1
πŸ‘₯ Authors: Shirui Chen, Cole Harrison, Ying-Chun Lee, Angela Jin Yang, Zhongzheng Ren, Lillian J. Ratliff, Jiafei Duan, Dieter Fox (possible past University Of Washington affiliation), Ranjay Krishna (possible past University Of Washington affiliation)
Abstract

While Vision-Language-Action (VLA) models have seen rapid progress in pretraining, their advancement in Reinforcement Learning (RL) remains hampered by low sample efficiency and sparse rewards in real-world settings. Developing generalizable process reward models is essential for providing the fine-grained feedback necessary to bridge this gap, yet existing temporal value functions often fail to generalize beyond their training domains. We introduce TOPReward, a novel, probabilistically grounded...

πŸ“„ No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection
πŸ—“οΈ Published: 2/22/2026
πŸ”— http://arxiv.org/abs/2602.19248v1
πŸ‘₯ Authors: Zunkai Dai, Ke Li (possible past University Of California, Berkeley affiliation), Jiajia Liu, Jie Yang (possible past Shanghai Jiao Tong University affiliation), Yuanyuan Qiao
Abstract

The collection and detection of video anomaly data has long been a challenging problem due to its rare occurrence and spatio-temporal scarcity. Existing video anomaly detection (VAD) methods underperform in open-world scenarios. Key contributing factors include limited dataset diversity and inadequate understanding of context-dependent anomalous semantics. To address these issues, i) we propose LAVIDA, an end-to-end zero-shot video anomaly detection framework. ii) LAVIDA employs an Anomaly Exp...

πŸ“„ CRCC: Contrast-Based Robust Cross-Subject and Cross-Site Representation Learning for EEG
πŸ—“οΈ Published: 2/22/2026
πŸ”— http://arxiv.org/abs/2602.19138v1
πŸ‘₯ Authors: Xiaobin Wong, Zhonghua Zhao, Haoran Guo, Zhengyi Liu, Yu Wu (possible past Baidu (China) affiliation), Feng Yan (possible past Meta (United States) affiliation), Zhiren Wang, Sen Song
Abstract

EEG-based neural decoding models often fail to generalize across acquisition sites due to structured, site-dependent biases implicitly exploited during training. We reformulate cross-site clinical EEG learning as a bias-factorized generalization problem, in which domain shifts arise from multiple interacting sources. We identify three fundamental bias factors and propose a general training framework that mitigates their influence through data standardization and representation-level constraints....
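The "data standardization" step the abstract mentions can be illustrated with a hedged, stdlib-only sketch: z-score each recording with statistics pooled within its acquisition site, so per-site amplitude/offset biases cannot serve as shortcut features. The function name and data layout are hypothetical; the paper's full framework also adds representation-level constraints not shown here.

```python
def per_site_standardize(signals, site_ids):
    """Z-score each recording using its acquisition site's pooled statistics.

    signals  -- list of recordings, each a list of float samples
    site_ids -- site label for each recording, aligned with `signals`
    (Illustrative sketch only, not the paper's exact procedure.)
    """
    # pool all samples belonging to the same site
    pooled = {}
    for sig, site in zip(signals, site_ids):
        pooled.setdefault(site, []).extend(sig)

    # per-site mean and standard deviation (fall back to 1.0 if constant)
    stats = {}
    for site, samples in pooled.items():
        n = len(samples)
        mean = sum(samples) / n
        std = (sum((x - mean) ** 2 for x in samples) / n) ** 0.5 or 1.0
        stats[site] = (mean, std)

    # standardize every recording with its own site's statistics
    return [
        [(x - stats[site][0]) / stats[site][1] for x in sig]
        for sig, site in zip(signals, site_ids)
    ]
```

After this transform, a recording's raw offset no longer reveals which site produced it, which is exactly the shortcut the bias-factorized framing tries to remove.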

πŸ“„ Test-Time Learning of Causal Structure from Interventional Data
πŸ—“οΈ Published: 2/22/2026
πŸ”— http://arxiv.org/abs/2602.19131v1
πŸ‘₯ Authors: Wei Chen, Rui Ding, Bojun Huang, Yang Zhang (possible past Tsinghua University affiliation), Qiang Fu (possible past Tencent (China) affiliation), Yuxuan Liang, Han Shi, Dongmei Zhang
Abstract

Supervised causal learning has shown promise in causal discovery, yet it often struggles with generalization across diverse interventional settings, particularly when intervention targets are unknown. To address this, we propose TICL (Test-time Interventional Causal Learning), a novel method that synergizes Test-Time Training with Joint Causal Inference. Specifically, we design a self-augmentation strategy to generate instance-specific training data at test time, effectively avoiding distributio...

πŸ“„ K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model
πŸ—“οΈ Published: 2/22/2026
πŸ”— http://arxiv.org/abs/2602.19128v1
πŸ‘₯ Authors: Shiyi Cao, Ziming Mao, Joseph E. Gonzalez (possible past University Of California, Berkeley affiliation), Ion Stoica (possible past University Of California, Berkeley affiliation)
Abstract

Optimizing GPU kernels is critical for efficient modern machine learning systems yet remains challenging due to the complex interplay of design factors and rapid hardware evolution. Existing automated approaches typically treat Large Language Models (LLMs) merely as stochastic code generators within heuristic-guided evolutionary loops. These methods often struggle with complex kernels requiring coordinated, multi-step structural transformations, as they lack explicit planning capabilities and fr...

πŸ“„ Learning to Detect Language Model Training Data via Active Reconstruction
πŸ—“οΈ Published: 2/22/2026
πŸ”— http://arxiv.org/abs/2602.19020v1
πŸ‘₯ Authors: Junjie Oscar Yin, John X. Morris, Vitaly Shmatikov, Sewon Min (possible past University Of Washington affiliation), Hannaneh Hajishirzi (possible past University Of Washington affiliation)
Abstract

Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce \textbf{Active Data Reconstruction Attack} (ADRA), a family of MIAs that actively induces a model to reconstruct a given text through training. We hypothesize that training data are \textit{more reconstructible} than non-members, and the difference in their reconstr...
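For contrast with ADRA's active approach, the conventional passive baseline the abstract alludes to can be sketched in a few lines: score a text by its average token log-likelihood under the fixed model and threshold it (a Yeom-style loss attack). The function names and the threshold value are illustrative, not from the paper.

```python
def loss_mia_score(token_logprobs):
    """Passive MIA score: mean per-token log-likelihood of the text under
    the target model. Training members tend to score higher (lower loss)."""
    return sum(token_logprobs) / len(token_logprobs)

def is_member(token_logprobs, threshold=-2.0):
    # threshold is a hypothetical calibration value; in practice it is
    # tuned on texts with known member/non-member status
    return loss_mia_score(token_logprobs) > threshold

# toy inputs: a well-memorized text has high per-token log-probs,
# an unseen text scores much lower
member_like = [-0.1, -0.3, -0.2, -0.15]
nonmember_like = [-3.2, -4.1, -2.8, -3.5]
```

ADRA's departure is that it does not treat the weights as fixed: it trains the model further and measures how readily the candidate text is reconstructed, rather than thresholding a static likelihood like this baseline.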

πŸ“„ MagicAgent: Towards Generalized Agent Planning
πŸ—“οΈ Published: 2/22/2026
πŸ”— http://arxiv.org/abs/2602.19000v1
πŸ‘₯ Authors: Xuhui Ren, Shaokang Dong, Chen Yang (possible past Tencent (China) affiliation), Qing Gao, Yunbin Zhao, Yongsheng Liu, Xinwei Geng, Xiang Li, Demei Yan, Yanqing Li, Chenhao Huang, Dingwei Zhu, Junjie Ye, Boxuan Yue, Yingnan Fu, Mengzhe Lv, Zezeng Feng, Boshen Zhou, Bocheng Wang, Xuanjing Huang, Yu-Gang Jiang, Tao Gui, Qi Zhang (possible past Tencent (China) affiliation), Yunke Zhang
Abstract

The evolution of Large Language Models (LLMs) from passive text processors to autonomous agents has established planning as a core component of modern intelligence. However, achieving generalized planning remains elusive, hindered not only by the scarcity of high-quality interaction data but also by inherent conflicts across heterogeneous planning tasks. These challenges result in models that excel at isolated tasks yet struggle to generalize, while existing multi-task training attempts suffer from gradi...

πŸ“„ Robust and Efficient Tool Orchestration via Layered Execution Structures with Reflective Correction
πŸ—“οΈ Published: 2/21/2026
πŸ”— http://arxiv.org/abs/2602.18968v1
πŸ‘₯ Authors: Tao Zhe, Haoyu Wang (possible past Tencent (China) affiliation), Bo Luo, Min Wu, Wei Fan (possible past Tencent (China) affiliation), Xiao Luo, Zijun Yao, Haifeng Chen, Dongjie Wang
Abstract

Tool invocation is a core capability of agentic systems, yet failures often arise not from individual tool calls but from how multiple tools are organized and executed together. Existing approaches tightly couple tool execution with stepwise language reasoning or explicit planning, leading to brittle behavior and high execution overhead. To overcome these limitations, we revisit tool invocation from the perspective of tool orchestration. Our key insight is that effective orchestration does not r...

πŸ“„ DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.19895v1
πŸ‘₯ Authors: Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang (possible past Google (United States) affiliation), Xin Wang (possible past University Of Edinburgh affiliation), Hui Shen, Jing Xiong, Chaofan Tao, Zixuan Zhong, Peizhou Huang, Mi Zhang
Abstract

Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, leading to weak and unstable learning signals in group-based policy optimization. We propose DS...

πŸ“„ Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.19548v1
πŸ‘₯ Authors: Jeffrey Li, Josh Gardner (possible past University Of Washington affiliation), Doug Kang, Fangping Shi, Karanjeet Singh, Chun-Liang Li, Herumb Shandilya, David Hall (possible past Nvidia (United States) affiliation), Oncel Tuzel (possible past Apple (United States) affiliation), Percy Liang (possible past Stanford University affiliation), Ludwig Schmidt (possible past University Of Washington affiliation), Hadi Pour Ansari, Fartash Faghri (possible past University Of Toronto affiliation)
Abstract

One of the first pre-processing steps for constructing web-scale LLM pretraining datasets involves extracting text from HTML. Despite the immense diversity of web content, existing open-source datasets predominantly apply a single fixed extractor to all webpages. In this work, we investigate whether this practice leads to suboptimal coverage and utilization of Internet data. We first show that while different extractors may lead to similar model performance on standard language understanding tas...
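The paper's premise, that different extractors yield different text from the same HTML, is easy to demonstrate; a minimal stdlib-only sketch with two toy extractors (hypothetical stand-ins for the production extractors the paper compares, not the paper's actual pipeline):

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Structure-aware toy extractor: keeps text content only, skipping
    the bodies of <script> and <style> elements."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip -= 1

    def handle_data(self, data):
        if self._skip == 0 and data.strip():
            self.parts.append(data.strip())

def extract_parsed(html):
    p = TextExtractor()
    p.feed(html)
    return " ".join(p.parts)

def extract_naive(html):
    # naive tag stripping: leaks <style>/<script> contents into the "text"
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())

page = "<html><style>p{color:red}</style><p>Hello world</p></html>"
```

On `page`, the parsed extractor returns only the paragraph text, while the naive one also emits the CSS rule; at pretraining scale, such divergences are exactly why a single fixed extractor can leave coverage and data quality on the table.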

πŸ“„ Ani3DHuman: Photorealistic 3D Human Animation with Self-guided Stochastic Sampling
πŸ—“οΈ Published: 2/22/2026
πŸ”— http://arxiv.org/abs/2602.19089v1
πŸ‘₯ Authors: Qi Sun (possible past Google (United States) affiliation), Can Wang (possible past Tsinghua University affiliation), Jiaxiang Shang, Yingchun Liu, Jing Liao
Abstract

Current 3D human animation methods struggle to achieve photorealism: kinematics-based approaches lack non-rigid dynamics (e.g., clothing dynamics), while methods that leverage video diffusion priors can synthesize non-rigid motion but suffer from quality artifacts and identity loss. To overcome these limitations, we present Ani3DHuman, a framework that marries kinematics-based animation with video diffusion priors. We first introduce a layered motion representation that disentangles rigid motion...

*Notable papers are those with at least two authors from a "big" AI/ML lab.