πŸ“„ Notable* Recent AI/ML arXiv Papers

πŸ“„ How cyborg propaganda reshapes collective action
πŸ—“οΈ Published: 2/13/2026
πŸ”— http://arxiv.org/abs/2602.13088v1
πŸ‘₯ Authors: Jonas R. Kunst, Kinga Bierwiaczonek, Meeyoung Cha, Omid V. Ebrahimi, Marc Fawcett-Atkinson, AsbjΓΈrn FΓΈlstad, Anton Gollwitzer, Nils KΓΆbis, Gary Marcus, Jon Roozenbeek (possible past University Of Cambridge affiliation), Daniel Thilo Schroeder, Jay J. Van Bavel, Sander Van Der Linden (possible past University Of Cambridge affiliation), Rory White, Live Leonhardsen Wilhelmsen
Abstract

The distinction between genuine grassroots activism and automated influence operations is collapsing. While policy debates focus on bot farms, a distinct threat to democracy is emerging via partisan coordination apps and artificial intelligence, which we term 'cyborg propaganda.' This architecture combines large numbers of verified humans with adaptive algorithmic automation, enabling a closed-loop system. AI tools monitor online sentiment to optimize directives and generate personalized content f...

πŸ“„ Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL
πŸ—“οΈ Published: 2/13/2026
πŸ”— http://arxiv.org/abs/2602.13035v1
πŸ‘₯ Authors: Yixiao Zhou, Yang Li (possible past Google (United States) affiliation), Dongzhou Cheng, Hehe Fan, Yu Cheng (possible past National University Of Singapore affiliation)
Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) trains large language models (LLMs) from sampled trajectories, making decoding strategy a core component of learning rather than a purely inference-time choice. Sampling temperature directly controls the exploration-exploitation trade-off by modulating policy entropy, yet existing methods rely on static values or heuristic adaptations that are decoupled from task-level rewards. We propose Introspective LLM, a hierarchical reinforcement learn...
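
The abstract's premise is that sampling temperature shapes policy entropy and hence exploration. A minimal numeric sketch of that relationship (standard temperature-scaled softmax sampling, not the paper's learned temperature policy):

```python
import numpy as np

def sampling_distribution(logits, temperature):
    """Temperature-scaled softmax: higher temperature flattens the distribution (more exploration)."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                         # numerical stability
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

logits = [2.0, 1.0, 0.2, -1.0]
for T in (0.5, 1.0, 2.0):
    print(f"T={T}: entropy={entropy(sampling_distribution(logits, T)):.3f}")  # entropy rises with T
```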

πŸ“„ Know More, Know Clearer: A Meta-Cognitive Framework for Knowledge Augmentation in Large Language Models
πŸ—“οΈ Published: 2/13/2026
πŸ”— http://arxiv.org/abs/2602.12996v1
πŸ‘₯ Authors: Hao Chen, Ye He, Yuchun Fan, Yukun Yan, Zhenghao Liu (possible past Tsinghua University affiliation), Qingfu Zhu, Maosong Sun (possible past Tsinghua University affiliation), Wanxiang Che
Abstract

Knowledge augmentation has significantly enhanced the performance of Large Language Models (LLMs) in knowledge-intensive tasks. However, existing methods typically operate on the simplistic premise that model performance equates with internal knowledge, overlooking the knowledge-confidence gaps that lead to overconfident errors or uncertain truths. To bridge this gap, we propose a novel meta-cognitive framework for reliable knowledge augmentation via differentiated intervention and alignment. Ou...
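
As a rough illustration of "differentiated intervention", one could gate knowledge augmentation on an estimated answer confidence. The confidence proxy and threshold below are illustrative assumptions, not the paper's framework:

```python
import numpy as np

def sequence_confidence(token_probs):
    """Toy confidence proxy: geometric mean of per-token probabilities of a drafted answer."""
    return float(np.exp(np.mean(np.log(token_probs))))

def differentiated_intervention(token_probs, threshold=0.6):
    """Answer directly when confident; trigger retrieval-based augmentation otherwise."""
    return "answer_directly" if sequence_confidence(token_probs) >= threshold else "retrieve_then_answer"

print(differentiated_intervention([0.9, 0.8, 0.95]))   # confident draft -> answer_directly
print(differentiated_intervention([0.4, 0.3, 0.6]))    # uncertain draft -> retrieve_then_answer
```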

πŸ“„ Learning Native Continuation for Action Chunking Flow Policies
πŸ—“οΈ Published: 2/13/2026
πŸ”— http://arxiv.org/abs/2602.12978v1
πŸ‘₯ Authors: Yufeng Liu, Hang Yu, Juntu Zhao, Bocheng Li, Di Zhang, Mingzhu Li, Wenxuan Wu, Yingdong Hu, Junyuan Xie (possible past University Of Washington affiliation), Junliang Guo, Dequan Wang, Yang Gao (possible past Tencent (China) affiliation)
Abstract

Action chunking enables Vision Language Action (VLA) models to run in real time, but naive chunked execution often exhibits discontinuities at chunk boundaries. Real-Time Chunking (RTC) alleviates this issue but is external to the policy, leading to spurious multimodal switching and trajectories that are not intrinsically smooth. We propose Legato, a training-time continuation method for action-chunked flow-based VLA policies. Specifically, Legato initializes denoising from a schedule-shaped mix...

πŸ“„ EPRBench: A High-Quality Benchmark Dataset for Event Stream Based Visual Place Recognition
πŸ—“οΈ Published: 2/13/2026
πŸ”— http://arxiv.org/abs/2602.12919v1
πŸ‘₯ Authors: Xiao Wang (possible past Google (United States) affiliation), Xingxing Xiong, Jinfeng Gao, Xufeng Lou, Bo Jiang, Si-Bao Chen, Yaowei Wang, Yonghong Tian (possible past Peking University affiliation)
Abstract

Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventional visible-light cameras under challenging conditions such as low illumination, overexposure, and high-speed motion. Recognizing the current scarcity of dedicated datasets in this domain, we introduce EPRBench, a high-quality benchmark specifically designed for event stream-based VPR. EPRBench comprises 10K event sequences and 65K event frames, col...

πŸ“„ BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
πŸ—“οΈ Published: 2/13/2026
πŸ”— http://arxiv.org/abs/2602.12876v1
πŸ‘₯ Authors: Huanyao Zhang, Jiepeng Zhou, Bo Li (possible past Tencent (China) affiliation), Bowen Zhou, Yanzhe Dan, Haishan Lu, Zhiyong Cao, Jiaoyang Chen, Yuqian Han, Zinan Sheng, Zhengwei Tao, Hao Liang, Jialong Wu, Yang Shi, Yuanpeng He, Jiaye Lin, Qintong Zhang, Guochen Yan, Runhao Zhao, Zhengpin Li, Xiaohan Yu, Lang Mei, Chong Chen, Wentao Zhang (possible past Mila - Quebec Artificial Intelligence Institute affiliation), Bin Cui (possible past Peking University affiliation)
Abstract

Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments. However, existing benchmarks for multimodal browsing remain limited in task complexity, evidence accessibility, and evaluation granularity, hindering comprehensive and reproducible assessments of deep search capabilities. To address these limitations, we intro...

πŸ“„ WebClipper: Efficient Evolution of Web Agents with Graph-based Trajectory Pruning
πŸ—“οΈ Published: 2/13/2026
πŸ”— http://arxiv.org/abs/2602.12852v1
πŸ‘₯ Authors: Junjie Wang, Zequn Xie, Dan Yang, Jie Feng (possible past Tsinghua University affiliation), Yue Shen, Duolin Sun, Meixiu Long, Yihan Jiao, Zhehao Tan, Jian Wang (possible past Baidu (China) affiliation), Peng Wei, Jinjie Gu
Abstract

Deep Research systems based on web agents have shown strong potential in solving complex information-seeking tasks, yet their search efficiency remains underexplored. We observe that many state-of-the-art open-source web agents rely on long tool-call trajectories with cyclic reasoning loops and exploration of unproductive branches. To address this, we propose WebClipper, a framework that compresses web agent trajectories via graph-based pruning. Concretely, we model the agent's search process as...
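
The trajectory-as-graph idea can be pictured with a toy pruning pass: keep only tool-call nodes that lie on some path to the final answer and discard dead-end branches. This is an illustrative rule, not necessarily WebClipper's pruning criterion:

```python
# Toy sketch of graph-based trajectory pruning: backward reachability from the answer node.
def prune_trajectory(edges, answer_node):
    """edges: list of (src, dst) tool-call dependencies; returns the set of useful nodes."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    useful, stack = set(), [answer_node]
    while stack:
        node = stack.pop()
        if node in useful:
            continue
        useful.add(node)
        stack.extend(parents.get(node, ()))
    return useful

edges = [("q", "search_1"), ("search_1", "page_a"), ("q", "search_2"),   # search_2 is a dead end
         ("page_a", "answer")]
print(prune_trajectory(edges, "answer"))   # {'answer', 'page_a', 'search_1', 'q'}
```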

πŸ“„ MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
πŸ—“οΈ Published: 2/13/2026
πŸ”— http://arxiv.org/abs/2602.12705v1
πŸ‘₯ Authors: Baorong Shi (possible past Baidu (China) affiliation), Bo Cui, Boyuan Jiang (possible past Tencent (China) affiliation), Deli Yu, Fang Qian, Haihua Yang, Huichao Wang, Jiale Chen, Jianfei Pan, Jieqiong Cao, Jinghao Lin, Kai Wu, Lin Yang, Shengsheng Yao, Tao Chen, Xiaojun Xiao, Xiaozhong Ji, Xu Wang, Yijun He, Zhixiong Yang
Abstract

We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To achieve this, we propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce l...

πŸ“„ SLA2: Sparse-Linear Attention with Learnable Routing and QAT
πŸ—“οΈ Published: 2/13/2026
πŸ”— http://arxiv.org/abs/2602.12675v1
πŸ‘₯ Authors: Jintao Zhang, Haoxu Wang, Kai Jiang, Kaiwen Zheng, Youhe Jiang, Ion Stoica (possible past University Of California, Berkeley affiliation), Jianfei Chen, Jun Zhu (possible past Tsinghua University affiliation), Joseph E. Gonzalez (possible past University Of California, Berkeley affiliation)
Abstract

Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) after formally analyzing the attention error in SLA, we identify a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2,...
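
The heuristic magnitude-based split can be pictured as partitioning the softmax attention weights per query: the largest weights go to an exact sparse branch and the small residual mass to a cheap linear branch. A toy decomposition under that framing (the real method approximates the residual term in O(N) rather than materializing it; names are illustrative):

```python
import numpy as np

def softmax_attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

def magnitude_split(w, keep_ratio=0.25):
    """Route the top-k attention weights per query to the sparse branch;
    the remainder is the residual mass a linear branch would approximate."""
    k = max(1, int(keep_ratio * w.shape[-1]))
    thresh = np.sort(w, axis=-1)[:, -k][:, None]
    sparse_w = np.where(w >= thresh, w, 0.0)
    return sparse_w, w - sparse_w

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
out, w = softmax_attention(Q, K, V)
sparse_w, linear_w = magnitude_split(w)
assert np.allclose(sparse_w @ V + linear_w @ V, out)   # the two branches recover full attention
```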

πŸ“„ Think Fast and Slow: Step-Level Cognitive Depth Adaptation for LLM Agents
πŸ—“οΈ Published: 2/13/2026
πŸ”— http://arxiv.org/abs/2602.12662v1
πŸ‘₯ Authors: Ruihan Yang, Fanghua Ye, Xiang We, Ruoqing Zhao, Kang Luo, Xinbo Xu, Bo Zhao (possible past National University Of Singapore affiliation), Ruotian Ma, Shanyi Wang, Zhaopeng Tu (possible past Tencent (China) affiliation), Xiaolong Li, Deqing Yang, Linus
Abstract

Large language models (LLMs) are increasingly deployed as autonomous agents for multi-turn decision-making tasks. However, current agents typically rely on fixed cognitive patterns: non-thinking models generate immediate responses, while thinking models engage in deep reasoning uniformly. This rigidity is inefficient for long-horizon tasks, where cognitive demands vary significantly from step to step, with some requiring strategic planning and others only routine execution. In this paper, we int...
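
A step-level fast/slow switch can be caricatured as a per-step gate that only invokes deep reasoning when a step looks like it needs planning. The rule and names below are purely illustrative; the paper learns this behavior rather than hard-coding it:

```python
def choose_cognitive_depth(step_context, needs_planning, budget_left):
    """Spend deep reasoning only on steps flagged as requiring planning, within a thinking budget."""
    if needs_planning(step_context) and budget_left > 0:
        return "think"       # deep chain-of-thought before acting
    return "act"             # immediate response for routine steps

# Toy usage: plan-like steps trigger thinking, routine ones do not.
needs_planning = lambda ctx: "plan" in ctx or "choose" in ctx
print(choose_cognitive_depth("plan route through rooms", needs_planning, budget_left=2))  # think
print(choose_cognitive_depth("pick up the key", needs_planning, budget_left=2))           # act
```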

πŸ“„ SD-MoE: Spectral Decomposition for Effective Expert Specialization
πŸ—“οΈ Published: 2/13/2026
πŸ”— http://arxiv.org/abs/2602.12556v1
πŸ‘₯ Authors: Ruijun Huang, Fang Dong, Xin Zhang (possible past Google (United States) affiliation), Hengjie Cao, Zhendong Huang, Anrui Chen, Jixian Zhou, Mengyi Chen, Yifeng Yang, Mingzhi Dong, Yujiang Wang, Jinlong Hou (possible past Tencent (China) affiliation), Qin Lv, Robert P. Dick, Yuan Cheng, Fan Yang (possible past Tencent (China) affiliation), Tun Lu, Chun Zhang, Li Shang
Abstract

Mixture-of-Experts (MoE) architectures scale Large Language Models via expert specialization induced by conditional computation. In practice, however, expert specialization often fails: some experts become functionally similar, while others function as de facto shared experts, limiting the effective capacity and model performance. In this work, we analyze the parameter and gradient spaces from a spectral perspective and uncover that (1) experts share highly overlapping dominant spectral componen...
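
The claim that experts share dominant spectral components can be probed with a simple diagnostic: compare the top singular subspaces of two expert weight matrices. A self-contained toy in that spirit (the shared low-rank construction is synthetic, purely for illustration):

```python
import numpy as np

def dominant_subspace_overlap(W1, W2, r=8):
    """Mean squared cosine of principal angles between the top-r left singular
    subspaces of two expert weight matrices (1.0 = identical subspaces)."""
    U1 = np.linalg.svd(W1, full_matrices=False)[0][:, :r]
    U2 = np.linalg.svd(W2, full_matrices=False)[0][:, :r]
    return float(np.linalg.norm(U1.T @ U2) ** 2 / r)

rng = np.random.default_rng(0)
shared = rng.normal(size=(64, 8))                     # low-rank component shared by both experts
expert_a = shared @ rng.normal(size=(8, 64)) + 0.05 * rng.normal(size=(64, 64))
expert_b = shared @ rng.normal(size=(8, 64)) + 0.05 * rng.normal(size=(64, 64))
print(dominant_subspace_overlap(expert_a, expert_b))  # close to 1: overlapping dominant spectra
```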

πŸ“„ Bench-MFG: A Benchmark Suite for Learning in Stationary Mean Field Games
πŸ—“οΈ Published: 2/13/2026
πŸ”— http://arxiv.org/abs/2602.12517v1
πŸ‘₯ Authors: Lorenzo Magnino, Jiacheng Shen, Matthieu Geist (possible past Google (United States) affiliation), Olivier Pietquin (possible past Google (United States) affiliation), Mathieu LauriΓ¨re
Abstract

The intersection of Mean Field Games (MFGs) and Reinforcement Learning (RL) has fostered a growing family of algorithms designed to solve large-scale multi-agent systems. However, the field currently lacks a standardized evaluation protocol, forcing researchers to rely on bespoke, isolated, and often simplistic environments. This fragmentation makes it difficult to assess the robustness, generalization, and failure modes of emerging methods. To address this gap, we propose a comprehensive benchm...

πŸ“„ Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
πŸ—“οΈ Published: 2/12/2026
πŸ”— http://arxiv.org/abs/2602.12281v1
πŸ‘₯ Authors: Jacky Kwok, Xilun Zhang, Mengdi Xu, Yuejiang Liu, Azalia Mirhoseini (possible past Google (United States) affiliation), Chelsea Finn (possible past University Of California, Berkeley affiliation), Marco Pavone (possible past Stanford University affiliation)
Abstract

The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling law for embodied instruction following and demonstrate that ...
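
The basic test-time verification recipe is sample-and-verify: draw several candidate actions and execute the one a verifier scores highest. A minimal sketch with toy stand-ins for the policy and verifier (not the paper's actual models or its scaling-law analysis):

```python
import random

def verified_action(policy_sample, verifier_score, n_samples=8):
    """Best-of-N: sample candidate actions, keep the one the verifier prefers."""
    candidates = [policy_sample() for _ in range(n_samples)]
    return max(candidates, key=verifier_score)

# Toy demo: the 'policy' proposes noisy scalar actions, the 'verifier' prefers values near 1.0.
random.seed(0)
best = verified_action(lambda: random.gauss(0.0, 1.0), lambda a: -abs(a - 1.0))
print(best)
```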

πŸ“„ Agentic Test-Time Scaling for WebAgents
πŸ—“οΈ Published: 2/12/2026
πŸ”— http://arxiv.org/abs/2602.12276v1
πŸ‘₯ Authors: Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W. Mahoney (possible past Stanford University affiliation), Kurt Keutzer (possible past University Of California, Berkeley affiliation), Amir Gholami
Abstract

Test-time scaling has become a standard way to improve performance and boost reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well-understood: small per-step errors can compound over long horizons; and we find that naive policies that uniformly increase sampling show diminishing returns. In this work, we present CATTS, a simple technique for dynamically allocating compute for multi-step agents. We first conduct an empirical study of inference-...
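
One simple way to allocate test-time compute non-uniformly across steps is to sample a few candidates and spend more only when they disagree. The agreement rule below is an illustrative placeholder, not necessarily the CATTS criterion:

```python
import random
from collections import Counter

def adaptive_step_sampling(sample_fn, base_n=3, max_n=12, agree_threshold=0.67):
    """Draw a few candidate step-actions; add samples only while they disagree."""
    votes = [sample_fn() for _ in range(base_n)]
    while len(votes) < max_n:
        count = Counter(votes).most_common(1)[0][1]
        if count / len(votes) >= agree_threshold:
            break                           # enough agreement: stop spending compute
        votes.append(sample_fn())
    return Counter(votes).most_common(1)[0][0]

random.seed(0)
print(adaptive_step_sampling(lambda: random.choice(["click", "click", "scroll"])))
```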

πŸ“„ ForeAct: Steering Your VLA with Efficient Visual Foresight Planning
πŸ—“οΈ Published: 2/12/2026
πŸ”— http://arxiv.org/abs/2602.12322v1
πŸ‘₯ Authors: Zhuoyang Zhang, Shang Yang, Qinghao Hu, Luke J. Huang, James Hou, Yufei Sun, Yao Lu (possible past Google (United States) affiliation), Song Han (possible past Stanford University affiliation)
Abstract

Vision-Language-Action (VLA) models convert high-level language instructions into concrete, executable actions, a task that is especially challenging in open-world environments. We present Visual Foresight Planning (ForeAct), a general and efficient planner that guides a VLA step-by-step using imagined future observations and subtask descriptions. With an imagined future observation, the VLA can focus on visuo-motor inference rather than high-level semantic reasoning, leading to improved accurac...

πŸ“„ Olmix: A Framework for Data Mixing Throughout LM Development
πŸ—“οΈ Published: 2/12/2026
πŸ”— http://arxiv.org/abs/2602.12237v1
πŸ‘₯ Authors: Mayee F. Chen, Tyler Murray, David Heineman, Matt Jordan, Hannaneh Hajishirzi (possible past University Of Washington affiliation), Christopher RΓ© (possible past Stanford University affiliation), Luca Soldaini, Kyle Lo
Abstract

Data mixing (determining the ratios of data from different domains) is a first-order concern for training language models (LMs). While existing mixing methods show promise, they fall short when applied during real-world LM development. We present Olmix, a framework that addresses two such challenges. First, the configuration space for developing a mixing method is not well understood: design choices across existing methods lack justification or consensus and overlook practical issues like ...
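
Data mixing in its simplest form is just sampling training examples in proportion to per-domain weights; the sketch below shows only that mechanism (Olmix is about how to choose and update the ratios, which is not modeled here):

```python
import random

def sample_batch(domain_data, mix_weights, batch_size=8, seed=0):
    """Draw a batch whose expected domain composition follows the mixing ratios."""
    rng = random.Random(seed)
    domains = list(mix_weights)
    weights = [mix_weights[d] for d in domains]
    picks = rng.choices(domains, weights=weights, k=batch_size)
    return [rng.choice(domain_data[d]) for d in picks]

data = {"web": ["w1", "w2"], "code": ["c1", "c2"], "math": ["m1"]}
print(sample_batch(data, {"web": 0.6, "code": 0.3, "math": 0.1}))
```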

πŸ“„ STAR : Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction
πŸ—“οΈ Published: 2/12/2026
πŸ”— http://arxiv.org/abs/2602.12143v1
πŸ‘₯ Authors: Xiaoxiao Wang, Chunxiao Li, Junying Wang, Yijin Guo, Zijian Chen, Chunyi Li, Xiaohong Liu (possible past Shanghai Jiao Tong University affiliation), Zicheng Zhang, Guangtao Zhai (possible past Shanghai Jiao Tong University affiliation)
Abstract

As comprehensive large model evaluation becomes prohibitively expensive, predicting model performance from limited observations has become essential. However, existing statistical methods struggle with pattern shifts, data sparsity, and lack of explanation, while pure LLM methods remain unreliable. We propose STAR, a framework that bridges data-driven STatistical expectations with knowledge-driven Agentic Reasoning. STAR leverages specialized retrievers to gather external knowledge and embeds se...

πŸ“„ Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
πŸ—“οΈ Published: 2/12/2026
πŸ”— http://arxiv.org/abs/2602.12125v1
πŸ‘₯ Authors: Wenkai Yang, Weijie Liu (possible past Tencent (China) affiliation), Ruobing Xie (possible past Tencent (China) affiliation), Kai Yang, Saiyong Yang, Yankai Lin (possible past Tsinghua University affiliation)
Abstract

On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation and reinforcement learning (RL) paradigms. In this work, we first theoretically show that OPD is a special case of dense KL-constrained RL where the reward function and the KL regularization are always weighted equally and the reference model can be any...
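
A generic on-policy distillation objective matches the student's next-token distribution to the teacher's at every position of a student-generated trajectory, e.g. via a per-token KL term. A numeric sketch of such a loss (the KL direction here is illustrative; the paper's generalized formulation and reward extrapolation are not reproduced):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def opd_token_loss(student_logits, teacher_logits):
    """Mean per-token KL(student || teacher) over a student-generated trajectory."""
    p_s, p_t = softmax(student_logits), softmax(teacher_logits)
    kl = (p_s * (np.log(p_s + 1e-12) - np.log(p_t + 1e-12))).sum(axis=-1)
    return float(kl.mean())

rng = np.random.default_rng(0)
student, teacher = rng.normal(size=(5, 32)), rng.normal(size=(5, 32))
print(opd_token_loss(student, teacher))        # 0.0 only when the two distributions coincide
```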

πŸ“„ Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty
πŸ—“οΈ Published: 2/12/2026
πŸ”— http://arxiv.org/abs/2602.12113v1
πŸ‘₯ Authors: Zewei Yu, Lirong Gao, Yuke Zhu (possible past Stanford University affiliation), Bo Zheng, Sheng Guo (possible past Google (United States) affiliation), Haobo Wang, Junbo Zhao
Abstract

Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks by employing test-time scaling. However, they often generate over-long chains-of-thought that, driven by substantial reflections such as repetitive self-questioning and circular reasoning, lead to high token consumption, substantial computational overhead, and increased latency without improving accuracy, particularly in smaller models. Our observation reveals that increasing problem complexity indu...

πŸ“„ DeepSight: An All-in-One LM Safety Toolkit
πŸ—“οΈ Published: 2/12/2026
πŸ”— http://arxiv.org/abs/2602.12092v1
πŸ‘₯ Authors: Bo Zhang (possible past Tencent (China) affiliation), Jiaxuan Guo, Lijun Li, Dongrui Liu, Sujin Chen, Guanxu Chen, Zhijie Zheng, Qihao Lin, Lewen Yan, Chen Qian (possible past Shanghai Jiao Tong University affiliation), Yijin Zhou, Yuyao Wu, Shaoxiong Guo, Tianyi Du, Jingyi Yang, Xuhao Hu, Ziqi Miao, Xiaoya Lu, Jing Shao, Xia Hu
Abstract

As Large Models (LMs) develop rapidly, their safety has become a priority. In current safety workflows for Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), evaluation, diagnosis, and alignment are often handled by separate tools. Specifically, safety evaluation can only locate external behavioral risks but cannot identify internal root causes. Meanwhile, safety diagnosis often drifts from concrete risk scenarios and remains at the explainable level. In t...

πŸ“„ Choose Your Agent: Tradeoffs in Adopting AI Advisors, Coaches, and Delegates in Multi-Party Negotiation
πŸ—“οΈ Published: 2/12/2026
πŸ”— http://arxiv.org/abs/2602.12089v2
πŸ‘₯ Authors: Kehang Zhu, Nithum Thain (possible past Google (United States) affiliation), Vivian Tsai, James Wexler (possible past Google (United States) affiliation), Crystal Qian
Abstract

As AI usage becomes more prevalent in social contexts, understanding agent-user interaction is critical to designing systems that improve both individual and group outcomes. We present an online behavioral experiment (N = 243) in which participants play three multi-turn bargaining games in groups of three. Each game, presented in randomized order, grants access to a single LLM assistance modality: proactive recommendations from an Advisor, reactive feedback from a Coach, or autonomous execution ...

πŸ“„ CSEval: A Framework for Evaluating Clinical Semantics in Text-to-Image Generation
πŸ—“οΈ Published: 2/12/2026
πŸ”— http://arxiv.org/abs/2602.12004v1
πŸ‘₯ Authors: Robert Cronshaw, Konstantinos Vilouras, Junyu Yan, Yuning Du (possible past Baidu (China) affiliation), Feng Chen, Steven Mcdonagh, Sotirios A. Tsaftaris (possible past University Of Edinburgh affiliation)
Abstract

Text-to-image generation has been increasingly applied in medical domains for various purposes such as data augmentation and education. Evaluating the quality and clinical reliability of these generated images is essential. However, existing methods mainly assess image realism or diversity, while failing to capture whether the generated images reflect the intended clinical semantics, such as anatomical location and pathology. In this study, we propose the Clinical Semantics Evaluator (CSEval), a...

πŸ“„ Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems
πŸ—“οΈ Published: 2/12/2026
πŸ”— http://arxiv.org/abs/2602.11877v1
πŸ‘₯ Authors: Wanxing Wu, He Zhu, Yixia Li, Lei Yang (possible past Google (United States) affiliation), Jiehui Zhao, Hongru Wang, Jian Yang, Benyou Wang (possible past Tencent (China) affiliation), Bingyi Jing, Guanhua Chen
Abstract

Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output pr...
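
A collaborative small/large deployment needs only a scoring rule and a threshold to route queries. The toy router below (query length as a difficulty proxy) is the kind of component RouterXBench is meant to evaluate, not something the paper proposes:

```python
def route(query, difficulty_score, threshold=0.5):
    """Send queries judged hard to the cloud model; keep the rest on the local model."""
    return "cloud_llm" if difficulty_score(query) > threshold else "local_slm"

# Purely illustrative difficulty proxy: longer queries are treated as harder.
score = lambda q: min(len(q.split()) / 30.0, 1.0)
print(route("What is 2 + 2?", score))                                   # local_slm
print(route("Draft a multi-step legal analysis of ... " * 5, score))    # cloud_llm
```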

πŸ“„ Intelligent AI Delegation
πŸ—“οΈ Published: 2/12/2026
πŸ”— http://arxiv.org/abs/2602.11865v1
πŸ‘₯ Authors: Nenad TomaΕ‘ev (possible past Google (United States) affiliation), Matija Franklin, Simon Osindero (possible past Google (United States) affiliation)
Abstract

AI agents are able to tackle increasingly complex tasks. To achieve more ambitious goals, AI agents need to be able to meaningfully decompose problems into manageable sub-components, and safely delegate their completion to other AI agents and humans alike. Yet, existing task decomposition and delegation methods rely on simple heuristics, and are not able to dynamically adapt to environmental changes and robustly handle unexpected failures. Here we propose an adaptive framework for intelli...

πŸ“„ Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos
πŸ—“οΈ Published: 2/13/2026
πŸ”— http://arxiv.org/abs/2602.13197v1
πŸ‘₯ Authors: Albert J. Zhai, Kuo-Hao Zeng, Jiasen Lu, Ali Farhadi (possible past University Of Washington affiliation), Shenlong Wang (possible past University Of Toronto affiliation), Wei-Chiu Ma
Abstract

The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here, we tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors, especially for robots without human-like hands. A promising way forwar...

πŸ“„ Contextual Online Bilateral Trade
πŸ—“οΈ Published: 2/13/2026
πŸ”— http://arxiv.org/abs/2602.12903v1
πŸ‘₯ Authors: Romain Cosson, Federico Fusco, Anupam Gupta (possible past Carnegie Mellon University affiliation), Stefano Leonardi, Renato Paes Leme (possible past Google (United States) affiliation), Matteo Russo
Abstract

We study repeated bilateral trade when the valuations of the sellers and the buyers are contextual. More precisely, the agents' valuations are given by the inner product of a context vector with two unknown $d$-dimensional vectors -- one for the buyers and one for the sellers. At each time step $t$, the learner receives a context and posts two prices, one for the seller and one for the buyer, and the trade happens if both agents accept their price. We study two objectives for this problem, gai...
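
The setup is concrete enough to simulate: valuations are linear in the context, and a trade clears only when both posted prices are accepted. A minimal sketch of one round (parameter values are arbitrary; the paper's learning algorithms and regret analysis are not shown):

```python
import numpy as np

def trade_round(context, theta_seller, theta_buyer, price_seller, price_buyer):
    """One round of contextual bilateral trade with linear valuations."""
    seller_val = float(context @ theta_seller)   # seller accepts iff price_seller >= her valuation
    buyer_val = float(context @ theta_buyer)     # buyer accepts iff price_buyer <= his valuation
    trade = price_seller >= seller_val and price_buyer <= buyer_val
    gain_from_trade = buyer_val - seller_val if trade else 0.0
    return trade, gain_from_trade

rng = np.random.default_rng(0)
d = 4
theta_s = rng.uniform(size=d)
theta_b = theta_s + rng.uniform(size=d)          # toy: buyer values every context at least as much
x = rng.uniform(size=d) / d
print(trade_round(x, theta_s, theta_b, price_seller=0.4, price_buyer=0.4))
```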

πŸ“„ Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution
πŸ—“οΈ Published: 2/13/2026
πŸ”— http://arxiv.org/abs/2602.12684v1
πŸ‘₯ Authors: Rui Cai, Jun Guo, Xinze He, Piaopiao Jin, Jie Li, Bingxuan Lin, Futeng Liu, Wei Liu (possible past Tsinghua University affiliation), Fei Ma, Kun Ma, Feng Qiu, Heng Qu, Yifei Su, Qiao Sun, Dong Wang (possible past Tsinghua University affiliation), Donghao Wang, Yunhong Wang, Rujie Wu, Diyun Xiang, Yu Yang, Hangjun Ye, Yuan Zhang (possible past Google (United States) affiliation), Quanyun Zhou
Abstract

In this report, we introduce Xiaomi-Robotics-0, an advanced vision-language-action (VLA) model optimized for high performance and fast and smooth real-time execution. The key to our method lies in a carefully designed training recipe and deployment strategy. Xiaomi-Robotics-0 is first pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, endowing it with broad and generalizable action-generation capabilities while avoiding catastrophic forgetting of the visual-...

πŸ“„ Uncovering spatial tissue domains and cell types in spatial omics through cross-scale profiling of cellular and genomic interactions
πŸ—“οΈ Published: 2/13/2026
πŸ”— http://arxiv.org/abs/2602.12651v1
πŸ‘₯ Authors: Rui Yan (possible past Peking University affiliation), Xiaohan Xing, Xun Wang, Zixia Zhou, Md Tauhidul Islam, Lei Xing (possible past Stanford University affiliation)
Abstract

Cellular identity and function are linked to both their intrinsic genomic makeup and extrinsic spatial context within the tissue microenvironment. Spatial transcriptomics (ST) offers an unprecedented opportunity to study this, providing in situ gene expression profiles at single-cell resolution and illuminating the spatial and functional organization of cells within tissues. However, a significant hurdle remains: ST data is inherently noisy, large, and structurally complex. This complexity makes...

πŸ“„ Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers
πŸ—“οΈ Published: 2/13/2026
πŸ”— http://arxiv.org/abs/2602.12587v1
πŸ‘₯ Authors: Anrui Chen, Ruijun Huang, Xin Zhang (possible past Google (United States) affiliation), Fang Dong, Hengjie Cao, Zhendong Huang, Yifeng Yang, Mengyi Chen, Jixian Zhou, Mingzhi Dong, Yujiang Wang, Jinlong Hou (possible past Tencent (China) affiliation), Qin Lv, Robert P. Dick, Yuan Cheng, Tun Lu, Fan Yang (possible past Tencent (China) affiliation), Li Shang
Abstract

Mixture-of-Experts (MoE) architectures are often considered a natural fit for continual learning because sparse routing should localize updates and reduce interference, yet MoE Transformers still forget substantially even with sparse, well-balanced expert utilization. We attribute this gap to a pre-routing bottleneck: multi-head attention concatenates head-specific signals into a single post-attention router input, forcing routing to act on co-occurring feature compositions rather than separable...
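
The "pre-routing bottleneck" refers to the router seeing only one fused vector per token: the concatenation (and projection) of all attention-head outputs. A simplified sketch of that data path as standard MoE-Transformer wiring, not of the paper's proposed fix:

```python
import numpy as np

def router_input_from_heads(head_outputs):
    """The MoE router receives one vector per token: all head outputs fused together,
    so routing operates on the composition rather than on separable head signals."""
    return np.concatenate(head_outputs, axis=-1)        # [tokens, n_heads * head_dim]

def top1_route(x, router_weights):
    return (x @ router_weights).argmax(axis=-1)         # expert id per token

rng = np.random.default_rng(0)
heads = [rng.normal(size=(3, 16)) for _ in range(4)]    # 4 heads, 3 tokens, head_dim 16
x = router_input_from_heads(heads)
print(top1_route(x, rng.normal(size=(64, 8))))          # routing decisions on the fused vector
```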

πŸ“„ AMPS: Adaptive Modality Preference Steering via Functional Entropy
πŸ—“οΈ Published: 2/13/2026
πŸ”— http://arxiv.org/abs/2602.12533v1
πŸ‘₯ Authors: Zihan Huang, Xintong Li, Rohan Surana, Tong Yu (possible past Carnegie Mellon University affiliation), Rui Wang (possible past Tencent (China) affiliation), Julian Mcauley, Jingbo Shang, Junda Wu
Abstract

Multimodal Large Language Models (MLLMs) often exhibit significant modality preference, which is a tendency to favor one modality over another. Depending on the input, they may over-rely on linguistic priors relative to visual evidence, or conversely over-attend to visually salient content while overlooking facts in textual contexts. Prior work has applied a uniform steering intensity to adjust the modality preference of MLLMs. However, strong steering can impair standard inference and increase error rates, whereas w...

πŸ“„ Self-Refining Vision Language Model for Robotic Failure Detection and Reasoning
πŸ—“οΈ Published: 2/12/2026
πŸ”— http://arxiv.org/abs/2602.12405v1
πŸ‘₯ Authors: Carl Qi, Xiaojie Wang, Silong Yong, Stephen Sheng, Huitan Mao, Sriram Srinivasan (possible past Deepmind (United Kingdom) affiliation), Manikantan Nambi, Amy Zhang (possible past University Of California, Berkeley affiliation), Yesh Dattatreya
Abstract

Reasoning about failures is crucial for building reliable and trustworthy robotic systems. Prior approaches either treat failure reasoning as a closed-set classification problem or assume access to ample human annotations. Failures in the real world are typically subtle, combinatorial, and difficult to enumerate, whereas rich reasoning labels are expensive to acquire. We address this problem by introducing ARMOR: Adaptive Round-based Multi-task mOdel for Robotic failure detection and reasoning. ...

πŸ“„ T3D: Few-Step Diffusion Language Models via Trajectory Self-Distillation with Direct Discriminative Optimization
πŸ—“οΈ Published: 2/12/2026
πŸ”— http://arxiv.org/abs/2602.12262v2
πŸ‘₯ Authors: Tunyu Zhang, Xinxi Zhang, Ligong Han (possible past Google (United States) affiliation), Haizhou Shi, Xiaoxiao He, Zhuowei Li, Hao Wang (possible past Tsinghua University affiliation), Kai Xu (possible past National University Of Defense Technology affiliation), Akash Srivastava, Hao Wang (possible past Tsinghua University affiliation), Vladimir Pavlovic, Dimitris N. Metaxas
Abstract

Diffusion large language models (DLLMs) have the potential to enable fast text generation by decoding multiple tokens in parallel. However, in practice, their inference efficiency is constrained by the need for many refinement steps, while aggressively reducing the number of steps leads to a substantial degradation in generation quality. To alleviate this, we propose a trajectory self-distillation framework that improves few-step decoding by distilling the model's own generative trajectories. We...

πŸ“„ Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications
πŸ—“οΈ Published: 2/12/2026
πŸ”— http://arxiv.org/abs/2602.12241v1
πŸ‘₯ Authors: Manjunath Kudlur (possible past Nvidia (United States) affiliation), Evan King, James Wang, Pete Warden (possible past Google (United States) affiliation)
Abstract

Latency-critical speech applications (e.g., live transcription, voice commands, and real-time translation) demand low time-to-first-token (TTFT) and high transcription accuracy, particularly on resource-constrained edge devices. Full-attention Transformer encoders remain a strong accuracy baseline for automatic speech recognition (ASR) because every frame can directly attend to every other frame, which resolves otherwise locally ambiguous acoustics using distant lexical context. However, this gl...

*Notable papers are those with at least two authors from a "big" AI/ML lab.