📄 Notable* Recent AI/ML arXiv Papers

📄 ZipMap: Linear-Time Stateful 3D Reconstruction with Test-Time Training
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.04385v1
👥 Authors: Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao (possible past Google (United States) affiliation), Jonathan T. Barron (possible past Google (United States) affiliation), Noah Snavely (possible past Google (United States) affiliation), Aleksander Holynski (possible past University Of Washington affiliation)
Abstract

Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and $π^3$ have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential-reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpa...

📄 CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.04291v1
👥 Authors: Lingen Li, Guangzhi Wang, Xiaoyu Li (possible past Tencent (China) affiliation), Zhaoyang Zhang, Qi Dou, Jinwei Gu (possible past Shanghai Artificial Intelligence Laboratory affiliation), Tianfan Xue (possible past Massachusetts Institute Of Technology affiliation), Ying Shan (possible past Tencent (China) affiliation)
Abstract

Generating high-quality 360° panoramic videos from perspective input is one of the crucial applications for virtual reality (VR), whereby high-resolution videos are especially important for immersive experience. Existing methods are constrained by computational limitations of vanilla diffusion models, only supporting $\leq$ 1K resolution native generation and relying on suboptimal post super-resolution to increase resolution. We introduce CubeComposer, a novel spatio-temporal autoregressive diff...

📄 Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.04128v1
👥 Authors: Dongnuan Cai, Henghui Du, Chang Zhou, Xi Chen (possible past University Of California, Berkeley affiliation), Dan Guo, Hongyuan Zhang, Xuelong Li (possible past Tencent (China) affiliation), Di Hu (possible past Baidu (China) affiliation)
Abstract

Developing Audio-Visual Large Language Models (AV-LLMs) for unified scene understanding is pivotal in multimodal intelligence. While instruction tuning enables pre-trained models with multi-task abilities, we observe that conventional multi-task unification methods often suffer from severe negative transfer, where nearly 55% of tasks degrade compared to single-task training. We attribute this phenomenon to audio-visual task heterogeneity, characterized by disparate task granularity and divergent...

📄 Phi-4-reasoning-vision-15B Technical Report
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.03975v1
👥 Authors: Jyoti Aneja, Michael Harrison, Neel Joshi (possible past Microsoft (United States) affiliation), Tyler Labonte, John Langford (possible past Microsoft (United States) affiliation), Eduardo Salinas
Abstract

We present Phi-4-reasoning-vision-15B, a compact open-weight multimodal reasoning model, and share the motivations, design choices, experiments, and learnings that informed its development. Our goal is to contribute practical insight to the research community on building smaller, efficient multimodal reasoning models and to share the result of these learnings as an open-weight model that is good at common vision and language tasks and excels at scientific and mathematical reasoning and understan...

📄 From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.03825v1
👥 Authors: Ruilin Luo, Chufan Shi, Yizhen Zhang, Cheng Yang (possible past Tsinghua University affiliation), Songtao Jiang, Tongkun Guan, Ruizhe Chen, Ruihang Chu, Peng Wang (possible past Peking University affiliation), Mingkun Yang, Yujiu Yang (possible past Tsinghua University affiliation), Junyang Lin, Zhibo Yang
Abstract

The cold-start initialization stage plays a pivotal role in training Multimodal Large Reasoning Models (MLRMs), yet its mechanisms remain insufficiently understood. To analyze this stage, we introduce the Visual Attention Score (VAS), an attention-based metric that quantifies how much a model attends to visual tokens. We find that reasoning performance is strongly correlated with VAS (r=0.9616): models with higher VAS achieve substantially stronger multimodal reasoning. Surprisingly, multimodal ...
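
The abstract above is truncated before it defines VAS, so the following is only a rough sketch of one plausible reading of "how much a model attends to visual tokens": the average share of attention mass that falls on visual-token positions. The function name, the input layout (one list of attention weights per query token), and the averaging are all assumptions made for illustration, not the paper's definition.

```python
def visual_attention_share(attn_rows, visual_positions):
    """Mean fraction of attention mass placed on visual-token positions.

    attn_rows: list of rows, each a list of non-negative attention weights
               over all input positions (one row per query token).
    visual_positions: set of indices that hold visual tokens.
    Hypothetical helper; not the VAS formula from the paper.
    """
    shares = []
    for row in attn_rows:
        total = sum(row)
        if total == 0:
            continue  # skip rows with no attention mass
        visual = sum(w for i, w in enumerate(row) if i in visual_positions)
        shares.append(visual / total)
    return sum(shares) / len(shares) if shares else 0.0


# Toy example: positions 0 and 1 are visual tokens, position 2 is text.
attn = [[0.5, 0.25, 0.25],
        [0.25, 0.25, 0.5]]
print(visual_attention_share(attn, {0, 1}))  # 0.625
```

A score near 1 under this reading would mean the queries attend almost entirely to the image; a score near 0 would mean they attend almost entirely to text.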

📄 Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.03818v1
👥 Authors: Huihan Liu, Changyeon Kim, Bo Liu (possible past Meta (United States) affiliation), Minghuan Liu, Yuke Zhu (possible past Stanford University affiliation)
Abstract

Continual learning is a long-standing challenge in robot policy learning, where a policy must acquire new skills over time without catastrophically forgetting previously learned ones. While prior work has extensively studied continual learning in relatively small behavior cloning (BC) policy models trained from scratch, its behavior in modern large-scale pretrained Vision-Language-Action (VLA) models remains underexplored. In this work, we found that pretrained VLAs are remarkably resistant to f...

📄 T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.03790v1
👥 Authors: Qinsi Wang, Hancheng Ye, Jinhee Kim, Jinghan Ke, Yifei Wang, Martin Kuo, Zishan Shao, Dongting Li, Yueqian Lin, Ting Jiang, Chiyue Wei, Qi Qian, Wei Wen (possible past Google (United States) affiliation), Helen Li (possible past Google (United States) affiliation), Yiran Chen
Abstract

Think about how humans handle complex reading tasks: marking key points, inferring their relationships, and structuring information to guide understanding and responses. Likewise, can a large language model benefit from text structure to enhance text-processing performance? To explore it, in this work, we first introduce Structure of Thought (SoT), a prompting technique that explicitly guides models to construct intermediate text structures, consistently boosting performance across eight tasks a...

📄 DisenReason: Behavior Disentanglement and Latent Reasoning for Shared-Account Sequential Recommendation
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.03782v1
👥 Authors: Jiawei Cheng, Min Gao, Zongwei Wang (possible past Baidu (China) affiliation), Xiaofei Zhu, Zhiyi Liu, Wentao Li, Wei Li (possible past Peking University affiliation), Huan Wu
Abstract

Shared-account usage is common on streaming and e-commerce platforms, where multiple users share one account. Existing shared-account sequential recommendation (SSR) methods often assume a fixed number of latent users per account, limiting their ability to adapt to diverse sharing patterns and reducing recommendation accuracy. Recent latent reasoning techniques applied in sequential recommendation (SR) generate intermediate embeddings from the user embedding (e.g., last item embedding) to uncover ...

📄 LifeBench: A Benchmark for Long-Horizon Multi-Source Memory
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.03781v1
👥 Authors: Zihao Cheng, Weixin Wang, Yu Zhao (possible past Tencent (China) affiliation), Ziyang Ren, Jiaxuan Chen, Ruiyang Xu, Shuai Huang, Yang Chen (possible past Tencent (China) affiliation), Guowei Li, Mengshi Wang, Yi Xie, Ren Zhu, Zeren Jiang, Keda Lu, Yihong Li, Xiaoliang Wang, Liwei Liu, Cam-Tu Nguyen
Abstract

Long-term memory is fundamental for personalized agents capable of accumulating knowledge, reasoning over user experiences, and adapting across time. However, existing memory benchmarks primarily target declarative memory, specifically semantic and episodic types, where all information is explicitly presented in dialogues. In contrast, real-world actions are also governed by non-declarative memory, including habitual and procedural types, and need to be inferred from diverse digital traces. To b...

📄 Cognition to Control - Multi-Agent Learning for Human-Humanoid Collaborative Transport
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.03768v1
👥 Authors: Hao Zhang (possible past Tencent (China) affiliation), Ding Zhao (possible past Google (United States) affiliation), H. Eric Tseng
Abstract

Effective human-robot collaboration (HRC) requires translating high-level intent into contact-stable whole-body motion while continuously adapting to a human partner. Many vision-language-action (VLA) systems learn end-to-end mappings from observations and instructions to actions, but they often emphasize reactive (System 1-like) behavior and leave under-specified how sustained System 2-style deliberation can be integrated with reliable, low-latency continuous control. This gap is acute in multi...

📄 Interaction-Aware Whole-Body Control for Compliant Object Transport
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.03751v1
👥 Authors: Hao Zhang (possible past Tencent (China) affiliation), Yves Tseng, Ding Zhao (possible past Google (United States) affiliation), H. Eric Tseng
Abstract

Cooperative object transport in unstructured environments remains challenging for assistive humanoids because strong, time-varying interaction forces can make tracking-centric whole-body control unreliable, especially in close-contact support tasks. This paper proposes a bio-inspired, interaction-oriented whole-body control (IO-WBC) that functions as an artificial cerebellum - an adaptive motor agent that translates upstream (skill-level) commands into stable, physically consistent whole-body be...

📄 HALyPO: Heterogeneous-Agent Lyapunov Policy Optimization for Human-Robot Collaboration
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.03741v1
👥 Authors: Hao Zhang (possible past Tencent (China) affiliation), Yaru Niu, Yikai Wang, Ding Zhao (possible past Google (United States) affiliation), H. Eric Tseng
Abstract

To improve generalization and resilience in human-robot collaboration (HRC), robots must handle the combinatorial diversity of human behaviors and contexts, motivating multi-agent reinforcement learning (MARL). However, inherent heterogeneity between robots and humans creates a rationality gap (RG) in the learning process: a variational mismatch between decentralized best-response dynamics and centralized cooperative ascent. The resulting learning problem is a general-sum differentiable game, so ...

📄 PROSPECT: Unified Streaming Vision-Language Navigation via Semantic--Spatial Fusion and Latent Predictive Representation
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.03739v1
👥 Authors: Zehua Fan, Wenqi Lyu, Wenxuan Song, Linge Zhao, Yifei Yang, Xi Wang (possible past Tsinghua University affiliation), Junjie He, Lida Huang, Haiyan Liu, Bingchuan Sun, Guangjun Bao, Xuanyao Mao, Liang Xu, Yan Wang (possible past Tencent (China) affiliation), Feng Gao
Abstract

Multimodal large language models (MLLMs) have advanced zero-shot end-to-end Vision-Language Navigation (VLN), yet robust navigation requires not only semantic understanding but also predictive modeling of environment dynamics and spatial structure. We propose PROSPECT, a unified streaming navigation agent that couples a streaming Vision-Language-Action (VLA) policy with latent predictive representation learning. PROSPECT uses CUT3R as a streaming 3D foundation spatial encoder to produce long-con...

📄 Generalization Properties of Score-matching Diffusion Models for Intrinsically Low-dimensional Data
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.03700v1
👥 Authors: Saptarshi Chakraborty, Quentin Berthet (possible past University Of Cambridge affiliation), Peter L. Bartlett (possible past University Of California, Berkeley affiliation)
Abstract

Despite the remarkable empirical success of score-based diffusion models, their statistical guarantees remain underdeveloped. Existing analyses often provide pessimistic convergence rates that do not reflect the intrinsic low-dimensional structure common in real data, such as that arising in natural images. In this work, we study the statistical convergence of score-based diffusion models for learning an unknown distribution $μ$ from finitely many samples. Under mild regularity conditions on the...

📄 MAGE: Meta-Reinforcement Learning for Language Agents toward Strategic Exploration and Exploitation
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.03680v1
👥 Authors: Lu Yang, Zelai Xu, Minyang Xie, Jiaxuan Gao, Zhao Shok, Yu Wang (possible past Tsinghua University affiliation), Yi Wu (possible past University Of California, Berkeley affiliation)
Abstract

Large Language Model (LLM) agents have demonstrated remarkable proficiency in learned tasks, yet they often struggle to adapt to non-stationary environments with feedback. While In-Context Learning and external memory offer some flexibility, they fail to internalize the adaptive ability required for long-term improvement. Meta-Reinforcement Learning (meta-RL) provides an alternative by embedding the learning process directly within the model. However, existing meta-RL approaches for LLMs focus p...

📄 Graph Negative Feedback Bias Correction Framework for Adaptive Heterophily Modeling
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.03662v1
👥 Authors: Jiaqi Lv, Qingfeng Du, Yu Zhang (possible past Google (United States) affiliation), Yongqi Han, Sheng Li (possible past Google (United States) affiliation)
Abstract

Graph Neural Networks (GNNs) have emerged as a powerful framework for processing graph-structured data. However, conventional GNNs and their variants are inherently limited by the homophily assumption, leading to degradation in performance on heterophilic graphs. Although substantial efforts have been made to mitigate this issue, they remain constrained by the message-passing paradigm, which is inherently rooted in homophily. In this paper, a detailed analysis of how the underlying label autocor...

📄 Mozi: Governed Autonomy for Drug Discovery LLM Agents
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.03655v1
👥 Authors: He Cao, Siyu Liu, Fan Zhang, Zijing Liu, Hao Li (possible past Tsinghua University affiliation), Bin Feng, Shengyuan Bai, Leqing Chen, Kai Xie, Yu Li (possible past Tencent (China) affiliation)
Abstract

Tool-augmented large language model (LLM) agents promise to unify scientific reasoning with computation, yet their deployment in high-stakes domains like drug discovery is bottlenecked by two critical barriers: unconstrained tool-use governance and poor long-horizon reliability. In dependency-heavy pharmaceutical pipelines, autonomous agents often drift into irreproducible trajectories, where early-stage hallucinations multiplicatively compound into downstream failures. To overcome this, we pres...

📄 How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference
🗓️ Published: 3/3/2026
🔗 http://arxiv.org/abs/2603.03280v1
👥 Authors: Toru Lin, Shuying Deng, Zhao-Heng Yin, Pieter Abbeel (possible past University Of California, Berkeley affiliation), Jitendra Malik (possible past University Of California, Berkeley affiliation)
Abstract

Many essential manipulation tasks - such as food preparation, surgery, and craftsmanship - remain intractable for autonomous robots. These tasks are characterized not only by contact-rich, force-sensitive dynamics, but also by their "implicit" success criteria: unlike pick-and-place, task quality in these domains is continuous and subjective (e.g. how well a potato is peeled), making quantitative evaluation and reward engineering difficult. We present a learning framework for such tasks, using p...

📄 UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?
🗓️ Published: 3/3/2026
🔗 http://arxiv.org/abs/2603.03241v1
👥 Authors: Zimo Wen, Boxiu Li, Wanbo Zhang, Junxiang Lei, Xiaoyu Chen, Yijia Fan, Qi Zhang (possible past Tencent (China) affiliation), Yujiang Wang, Lili Qiu, Bo Li (possible past Tencent (China) affiliation), Ziwei Liu, Caihua Shan, Yifan Yang (possible past Tencent (China) affiliation), Yifei Shen
Abstract

Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, requiring varying degrees of implicit or explicit visual transformation...

📄 Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing
🗓️ Published: 3/3/2026
🔗 http://arxiv.org/abs/2603.03143v1
👥 Authors: Jiyuan Wang, Chunyu Lin, Lei Sun, Zhi Cao, Yuyang Yin, Lang Nie, Zhenlong Yuan (possible past Tsinghua University affiliation), Xiangxiang Chu, Yunchao Wei (possible past National University Of Singapore affiliation), Kang Liao, Guosheng Lin
Abstract

Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturall...

📄 APRES: An Agentic Paper Revision and Evaluation System
🗓️ Published: 3/3/2026
🔗 http://arxiv.org/abs/2603.03142v1
👥 Authors: Bingchen Zhao, Jenny Zhang, Chenxi Whitehouse, Minqi Jiang, Michael Shvartsman, Abhishek Charnalia, Despoina Magka, Tatiana Shavrina, Derek Dunfield, Oisin Mac Aodha (possible past University Of Edinburgh affiliation), Yoram Bachrach (possible past Deepmind (United Kingdom) affiliation)
Abstract

Scientific discoveries must be communicated clearly to realize their full potential. Without effective communication, even the most groundbreaking findings risk being overlooked or misunderstood. The primary way scientists communicate their work and receive feedback from the community is through peer review. However, the current system often provides inconsistent feedback between reviewers, ultimately hindering the improvement of a manuscript and limiting its potential impact. In this paper, we ...

📄 Multi-Scale Adaptive Neighborhood Awareness Transformer For Graph Fraud Detection
🗓️ Published: 3/3/2026
🔗 http://arxiv.org/abs/2603.03106v1
👥 Authors: Jiaqi Lv, Qingfeng Du, Yu Zhang (possible past Google (United States) affiliation), Yongqi Han, Sheng Li (possible past Google (United States) affiliation)
Abstract

Graph fraud detection (GFD) is crucial for identifying fraudulent behavior within graphs, benefiting various domains such as financial networks and social media. Existing methods based on graph neural networks (GNNs) have succeeded considerably due to their effective expressive capacity for graph-structured data. However, the inherent inductive bias of GNNs, including the homogeneity assumption and the limited global modeling ability, hinders the effectiveness of these models. To address these ch...

📄 From Misclassifications to Outliers: Joint Reliability Assessment in Classification
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.03903v1
👥 Authors: Yang Li (possible past Google (United States) affiliation), Youyang Sha, Yinzhi Wang, Timothy Hospedales, Xi Shen (possible past Tencent (China) affiliation), Shell Xu Hu, Xuanlong Yu
Abstract

Building reliable classifiers is a fundamental challenge for deploying machine learning in real-world applications. A reliable system should not only detect out-of-distribution (OOD) inputs but also anticipate in-distribution (ID) errors by assigning low confidence to potentially misclassified samples. Yet, most prior work treats OOD detection and failure prediction as separate problems, overlooking their close connection. We argue that reliability requires evaluating them jointly. To this end...

📄 MEM: Multi-Scale Embodied Memory for Vision Language Action Models
🗓️ Published: 3/4/2026
🔗 http://arxiv.org/abs/2603.03596v1
👥 Authors: Marcel Torne, Karl Pertsch (possible past Google (United States) affiliation), Homer Walke, Kyle Vedder, Suraj Nair (possible past Stanford University affiliation), Brian Ichter (possible past Google (United States) affiliation), Allen Z. Ren, Haohuan Wang, Jiaming Tang, Kyle Stachowicz, Karan Dhabalia, Michael Equi, Quan Vuong (possible past Google (United States) affiliation), Jost Tobias Springenberg (possible past Google (United States) affiliation), Sergey Levine (possible past University Of Washington affiliation), Chelsea Finn (possible past University Of California, Berkeley affiliation), Danny Driess (possible past Google (United States) affiliation)
Abstract

Conventionally, memory in end-to-end robotic learning involves inputting a sequence of past observations into the learned policy. However, in complex multi-stage real-world tasks, the robot's memory must represent past events at multiple levels of granularity: from long-term memory that captures abstracted semantic concepts (e.g., a robot cooking dinner should remember which stages of the recipe are already done) to short-term memory that captures recent events and compensates for occlusions (e....

📄 Solving adversarial examples requires solving exponential misalignment
🗓️ Published: 3/3/2026
🔗 http://arxiv.org/abs/2603.03507v1
👥 Authors: Alessandro Salvatore, Stanislav Fort (possible past Stanford University affiliation), Surya Ganguli (possible past Stanford University affiliation)
Abstract

Adversarial attacks - input perturbations imperceptible to humans that fool neural networks - remain both a persistent failure mode in machine learning, and a phenomenon with mysterious origins. To shed light, we define and analyze a network's perceptual manifold (PM) for a class concept as the space of all inputs confidently assigned to that class by the network. We find, strikingly, that the dimensionalities of neural network PMs are orders of magnitude higher than those of natural human conce...

📄 LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory
🗓️ Published: 3/3/2026
🔗 http://arxiv.org/abs/2603.03269v1
👥 Authors: Junyi Zhang, Charles Herrmann (possible past Google (United States) affiliation), Junhwa Hur, Chen Sun (possible past Google (United States) affiliation), Ming-Hsuan Yang, Forrester Cole (possible past Google (United States) affiliation), Trevor Darrell (possible past University Of California, Berkeley affiliation), Deqing Sun (possible past Nvidia (United States) affiliation)
Abstract

Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-ch...

📄 Step-Level Sparse Autoencoder for Reasoning Process Interpretation
🗓️ Published: 3/3/2026
🔗 http://arxiv.org/abs/2603.03031v1
👥 Authors: Xuan Yang (possible past Stanford University affiliation), Jiayu Liu, Yuhang Lai, Hao Xu, Zhenya Huang, Ning Miao (possible past Peking University affiliation)
Abstract

Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpretability, existing approaches predominantly operate at the token level, creating a granularity mismatch when capturing more critical step-level information, such as reasoning direction and semantic transitions. In this work, we pro...

📄 Variance reduction in lattice QCD observables via normalizing flows
🗓️ Published: 3/3/2026
🔗 http://arxiv.org/abs/2603.02984v1
👥 Authors: Ryan Abbott, Denis Boyda (possible past Massachusetts Institute Of Technology affiliation), Yang Fu, Daniel C. Hackett (possible past Massachusetts Institute Of Technology affiliation), Gurtej Kanwar (possible past Massachusetts Institute Of Technology affiliation), Fernando Romero-López, Phiala E. Shanahan (possible past Massachusetts Institute Of Technology affiliation), Julian M. Urban
Abstract

Normalizing flows can be used to construct unbiased, reduced-variance estimators for lattice field theory observables that are defined by a derivative with respect to action parameters. This work implements the approach for observables involving gluonic operator insertions in the SU(3) Yang-Mills theory and two-flavor Quantum Chromodynamics (QCD) in four space-time dimensions. Variance reduction by factors of $10$-$60$ is achieved in glueball correlation functions and in gluonic matrix elements ...

*Notable papers are those with at least two authors from a "big" AI/ML lab.
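
The notability rule in the footnote can be sketched as a small filter over the author lines. This is a minimal illustration, not the page's actual implementation: the page does not say which organizations count as "big" labs or whether the two authors must share one lab, so `EXAMPLE_LABS` is a hypothetical set and the sketch simply counts authors tagged with any organization in it.

```python
import re

# Matches the "(possible past <org> affiliation)" tags used in the author lines.
AFFILIATION = re.compile(r"\(possible past (.+?) affiliation\)")


def count_lab_authors(authors_line, big_labs):
    """Count authors whose tagged affiliation appears in big_labs."""
    return sum(org in big_labs for org in AFFILIATION.findall(authors_line))


def is_notable(authors_line, big_labs, min_authors=2):
    """Apply the footnote's rule: at least min_authors big-lab authors."""
    return count_lab_authors(authors_line, big_labs) >= min_authors


# Hypothetical lab set, chosen only for this example.
EXAMPLE_LABS = {"Microsoft (United States)", "Google (United States)"}

line = ("Jyoti Aneja, Michael Harrison, Neel Joshi (possible past Microsoft "
        "(United States) affiliation), Tyler Labonte, John Langford (possible "
        "past Microsoft (United States) affiliation), Eduardo Salinas")
print(is_notable(line, EXAMPLE_LABS))  # True
```

The non-greedy `(.+?)` lets the pattern capture organization names that themselves contain parentheses, such as "Microsoft (United States)".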