📄 Notable* Recent AI/ML arXiv Papers

📄 Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16952v1
👥 Authors: Kareem Amin, Rudrajit Das, Alessandro Epasto (possible past Google (United States) affiliation), Adel Javanmard, Dennis Kraft, Mónica Ribero, Sergei Vassilvitskii (possible past Google (United States) affiliation)
Abstract

The rapid adoption of generative AI and Large Language Models (LLMs) has spurred interest in synthetic data as a privacy-preserving alternative to sensitive real-world datasets. However, generating high-utility synthetic data often carries the risk of memorizing and regurgitating private information from the training corpus. In this work, we present a customizable empirical auditing framework designed to detect and explain such data disclosures. Our framework introduces a mechanism to distinguis...

📄 ATOM-Bench: A Real-World Benchmark for Atomic Skills and Compositional Generalization in Manipulation Policies
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16826v1
👥 Authors: Zenan Wu, Bingqing Wei, Lu Liu, Zheqi He, Xi Wang (possible past Tsinghua University affiliation), Jiakang Liu, Zehui Li, Guocai Yao, Jing-Shu Zheng, Xi Yang, Yongtao Wang (possible past Peking University affiliation)
Abstract

Generalist manipulation policies are increasingly presented as foundation models for robotic control, but their real-world generalization remains difficult to diagnose. A policy may succeed on demonstrated tasks while still failing to execute fine-grained atomic skills or recombine learned skills in new task structures. We introduce ATOM-Bench, a real-world benchmark for evaluating both atomic skills and compositional generalization in manipulation policies. ATOM-Bench factorizes tablet...

📄 A First-Principles Derivation of LLM Policy Optimization: From Expected Reward to GRPO and Its Structural Extensions
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16733v1
👥 Authors: Jianghan Shen, Siqi Luo, Yue Li, Jiyao Liu, Wanying Qu, Yi Zhang (possible past Google (United States) affiliation), Ziyan Huang, Tianbin Li, Ming Hu, Xiaohong Liu (possible past Shanghai Jiao Tong University affiliation), Yirong Chen, Junjun He
Abstract

Policy gradient algorithms for language models optimize the same objective $J(\theta) = \mathbb{E}_{\tau\sim p_\theta(\tau)}[R(\tau)]$, which has exactly two factors: the trajectory probability $p_\theta(\tau)$ and the reward $R(\tau)$. Every method from REINFORCE to PPO to GRPO and their descendants modifies one or both factors to address a specific failure in the preceding formulation. Existing surveys organize these methods by domain or chronology, which obscures the rationale behind each design choice and the precise loc...

📄 ACCORD: Action-Conditioned Contextual Grounding for Language Agents
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16432v1
👥 Authors: Lai Jiang, Cheng Qian, Zhenhailong Wang, Pan Lu (possible past Baidu (China) affiliation), Heng Ji, Hao Peng (possible past Tsinghua University affiliation)
Abstract

User instructions are often underspecified because humans rely on implicit assumptions about the surrounding environment. For large language model (LLM) agents operating in information-rich digital and physical environments, these assumptions cannot be inferred from the instruction alone; they must be recovered from the current state of tools, data, interfaces, and observations. Effective execution therefore requires agents to identify missing context, ground it in observed evidence, and carry i...

📄 Medical Heuristic Learning: An LLM-Driven Framework for Interpretable and Auditable Clinical Decision Rules
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16337v1
👥 Authors: Wei Xu (possible past Tencent (China) affiliation), Ke Yang (possible past Google (United States) affiliation), Gang Luo, Keli Zheng, Lingyan Hu, Jing Wang (possible past Google (United States) affiliation), Kefeng Li
Abstract

Predictive modeling for clinical tabular data is central to clinical decision support and therefore requires not only strong predictive performance but also transparent decision logic. Although deep learning and tree-based ensemble methods can achieve high accuracy, their black-box nature remains a major obstacle to clinical deployment. This challenge is further compounded by common characteristics of medical data, including limited sample sizes, severe class imbalance, and feature evolution ari...

📄 Latent Thought Flow: Efficient Latent Reasoning in Large Language Models
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16222v1
👥 Authors: Xiandong Zou, Jing Huang (possible past Meta (United States) affiliation), Jianshu Li (possible past National University of Singapore affiliation), Pan Zhou
Abstract

Large Language Models (LLMs) increasingly rely on intermediate reasoning, yet explicit Chain-of-Thought (CoT) suffers from a linguistic space bottleneck: each thought must be decoded into tokens, causing high inference overhead. Latent reasoning moves deliberation into continuous space, but existing methods mostly learn deterministic or reward-maximizing paths, lacking a principled way to allocate probability across trajectories with different correctness and costs. We propose Latent Thought Flo...

📄 PAL-Bench: Evidence-Grounded Profile Reconstruction from Longitudinal Personal Albums
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16175v1
👥 Authors: Qiwei Yan, Zhiqiang Yuan, Zexi Jia, Nanxing Hu, Kailin Lyu, Jie Zhou (possible past Tsinghua University affiliation), Jinchao Zhang (possible past Tencent (China) affiliation)
Abstract

Longitudinal personal albums are weak-schema multimodal databases: noisy perceptual records whose key facts require joins across faces, text, timestamps, locations, and repeated events. Existing visual, video, document, and lifelog benchmarks test sub-problems, but not album-scale profile reconstruction with social identity binding and evidence citation. Benchmarking this task is difficult because the ground truth needed for evaluation--owner profiles, social graphs, face-name maps, and evidence...

📄 TimeVista: Exploring and Exploiting Vision-Language Models as Judges for Time Series Forecasting
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16173v1
👥 Authors: Zhi Chen, Yuxuan Wang (possible past Google (United States) affiliation), Jialong Wu, Yong Liu, Haoran Zhang, Xingjian Su, Jianmin Wang (possible past Tsinghua University affiliation), Mingsheng Long (possible past Tsinghua University affiliation)
Abstract

High-quality time series forecasting is pivotal for real-world decision-making. However, traditional point-wise metrics often fail to reveal complex temporal patterns and align poorly with human intuitive preferences. While the "LLM-as-a-Judge" paradigm has revolutionized text evaluation by providing flexible, human-aligned judgment, its application to time series remains largely unexplored. In this paper, we leverage Vision-Language Models (VLMs) as judges for time series forecasting, harness...

📄 A comparative and critical study of EEGNet for fNIRS-driven cognitive load classification
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16160v1
👥 Authors: Mehshan Ahmed Khan, Houshyar Asadi, Li Zhang (possible past University of Oxford affiliation), Mohammad Reza Chalak Qazani, Ghazal Bargshady, Stefanos Gkikas, Christian Arzate, Sam Oladazimi, Zoran Najdovsk, Lei Wei (possible past Tencent (China) affiliation), Chee Peng Lim
Abstract

Accurately classifying cognitive load from functional near-infrared spectroscopy (fNIRS) signals remains a significant challenge due to temporal variability, inter-subject differences, and sensitivity to preprocessing choices. This study provides a comprehensive evaluation of EEGNet for fNIRS-based cognitive load classification by systematically examining the effects of temporal segmentation strategies (overlapping vs. non-overlapping), window lengths (10s, 20s, 30s), feature extraction methods ...

📄 The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16152v1
👥 Authors: Haolong Qian, Xianliang Yang, Yinuo Ma, Lirong Che, Feng Lu (possible past Google (United States) affiliation), Ye Guo, Lei Song, Jiang Bian (possible past Baidu (China) affiliation), Chun Yuan
Abstract

Knowledge distillation from powerful reasoning models is widely used to improve Small Language Models (SLMs) on mathematical reasoning, often assuming that traces with higher reward model scores provide more useful supervision. We identify a counterintuitive Quality-Utility Paradox in mathematical reasoning distillation. Data refined or synthesized by a stronger Oracle obtains higher perceived quality according to reward models, yet consistently underperforms traces generated by the SLM...

📄 VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16140v1
👥 Authors: Sen Xu, Shixi Liu, Wei Wang (possible past University of Oxford affiliation), Jixin Min, Yingwei Dai, Zhibin Yin, Yirong Chen, Xin Zhou (possible past Stanford University affiliation), Junlin Zhang
Abstract

This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building upon the Spectrum-to-Signal post-training paradigm, we systematically enhance the model through an optimized pipeline that includes curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. Experimental evaluations demonstrate that VibeThinker-...

📄 Open-SWE-Traces: Advancing Dual-Mode Multilingual Distillation for Software Engineering Agents
🗓️ Published: 6/14/2026
🔗 http://arxiv.org/abs/2606.16038v1
👥 Authors: Wasi Uddin Ahmad, Nikolai Ludwig, Somshubra Majumdar (possible past Nvidia (United States) affiliation), Boris Ginsburg (possible past Nvidia (United States) affiliation)
Abstract

The path toward autonomous software engineering is currently bottlenecked by a severe deficit of diverse, large-scale trajectory data. We address this by introducing Open-SWE-Traces, an expansive dataset of 207,489 agentic trajectories spanning nine programming languages (Python, Go, TS, JS, Rust, Java, PHP, C, C++). Sourced from 20,000 real-world PRs via OpenHands and SWE-agent harnesses, the dataset utilizes a hybrid-reasoning synthesis: Minimax-M2.5 generates trajectories with explicit "thinking"...

📄 On-Policy Distillation with Curriculum Turn-level Guidance for Multi-turn Agents
🗓️ Published: 6/14/2026
🔗 http://arxiv.org/abs/2606.15912v1
👥 Authors: Gengsheng Li, Mao Zheng, Mingyang Song, Ruiqi Liu, Tianyu Yang (possible past Tencent (China) affiliation), Jie Sun, Qiyong Zhong, Haiyun Guo, Junfeng Fang, Dan Zhang (possible past Google (United States) affiliation), Jinqiao Wang
Abstract

Multi-turn agents that plan, invoke tools, and interact with environments offer a promising paradigm for solving complex tasks, yet their capabilities typically rely on very large models whose inference cost is prohibitive in practice. On-Policy Distillation (OPD) is a natural recipe for transferring such capabilities to smaller students, but we find that it suffers a characteristic failure mode in this setting: small student errors compound across turns and push the trajectory out of the teacher...

📄 Deep Residual Injection for Full-Spectrum Forensic Signal Perception in Multimodal Large Language Models
🗓️ Published: 6/14/2026
🔗 http://arxiv.org/abs/2606.15880v1
👥 Authors: Kaiqing Lin, Zhiyuan Yan, Ruoxin Chen, Ke-Yue Zhang (possible past Tencent (China) affiliation), Yue Zhou, Caiyong Piao, Bin Li, Taiping Yao (possible past Tencent (China) affiliation), Bo Wang (possible past Tencent (China) affiliation), Youchang Xiao, Shouhong Ding (possible past Tencent (China) affiliation)
Abstract

Multimodal large language models (MLLMs) have been increasingly adopted in forensics for their robust semantic understanding. As AI-generated images become realistic, semantic-level inconsistencies alone are often insufficient for reliable detection. This motivates a critical question: whether MLLMs can achieve full-spectrum forensic signal perception, i.e., capturing low-level generator artifacts without sacrificing pre-trained semantic knowledge. We further perform a layer-wise analysis of for...

📄 RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments
🗓️ Published: 6/14/2026
🔗 http://arxiv.org/abs/2606.15862v1
👥 Authors: Linghua Zhang, Jun Wang (possible past Tencent (China) affiliation), Jingtong Wu, Zhisong Zhang (possible past Shanghai Jiao Tong University affiliation)
Abstract

Large language model (LLM) agents have made rapid progress on short-horizon, well-scoped tasks, yet their ability to sustain coherent decisions in dynamic long-horizon environments remains uncertain. We introduce RetailBench, a data-grounded simulation benchmark for evaluating tool-using LLM agents in single-store supermarket operation. RetailBench models retail management as a partially observable decision process and is designed to support thousand-day-scale simulations. In this environment, a...

📄 Factorized Neural Operators Decompose Dynamic and Persistent Responses
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16900v1
👥 Authors: Hao Tang, Yuechen Duan, Jiongyu Zhu, Zimeng Feng, Hao Li (possible past Tsinghua University affiliation), Chao Li (possible past Baidu (China) affiliation)
Abstract

Physical systems often exhibit heterogeneous mechanisms, where rapidly evolving dynamics coexist with persistent structures. Capturing such multiscale physical behavior remains challenging for existing neural operators, which typically rely on a single dominant inductive bias and therefore couple distinct physical responses into a shared representation. We introduce the Unified Green's Function Framework across domains and propose the Factorized Neural Operators (FaNO), which decompose spectral re...

📄 Fantastic Pretraining Optimizers and Where to Find Them II: Hyperball Optimization
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16899v1
👥 Authors: Kaiyue Wen, Xingyu Dang, Kaifeng Lyu, Tengyu Ma (possible past Stanford University affiliation), Percy Liang (possible past Stanford University affiliation)
Abstract

Matrix-based optimizers such as Muon can substantially speed up language model pretraining, but their gains over AdamW are observed to shrink as model size and data scale grow when using standard constant decoupled weight decay. We propose Hyperball, a simple optimizer wrapper that addresses this issue. Given a base optimizer such as Adam or Muon, Hyperball sets the Frobenius norms of weight matrices and their corresponding optimizer updates to fixed constants. On Qwen3-style models up to 1.2B p...

📄 MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16748v1
👥 Authors: Lawrence Keunho Jang, Andrew Keunwoo Jang, Jing Yu Koh (possible past Google (United States) affiliation), Ruslan Salakhutdinov (possible past University of Toronto affiliation)
Abstract

Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment where personal assistants are expected to work across a user's whole digital life, including their context, historical data, and logged-in accounts. This gap is widest on web tasks, where live web evaluations cannot exercise sites that require logging in or personal information, the kind of site a real personal assistant has to drive. We introduce MyPCBench, ...

📄 How Post-Training Shapes Biological Reasoning Models
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16517v1
👥 Authors: Lukas Fesser, Hanlin Zhang, Michelle M. Li, Eric Wang, Bryan Perozzi (possible past Google (United States) affiliation), Shekoofeh Azizi (possible past Google (United States) affiliation), Sham M. Kakade, Marinka Zitnik
Abstract

Scientific reasoning models for biology combine language models with foundation models trained on multimodal biological data, including DNA, RNA, and proteins. These models are built through post-training, yet how each stage shapes reasoning and generalization remains poorly understood. We study when post-training improves performance and when it induces over-specialization. Across genomics, transcriptomics, and proteins, we train and evaluate more than 100 biological reasoning models under cont...

📄 Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation
🗓️ Published: 6/15/2026
🔗 http://arxiv.org/abs/2606.16429v1
👥 Authors: Zhongzhu Zhou, Qingyang Wu, Junxiong Wang, Mayank Mishra, Shuaiwen Leon Song (possible past Microsoft (United States) affiliation), Ben Athiwaratkun, Chenfeng Xu (possible past University of California, Berkeley affiliation)
Abstract

Hybrid linear attention models offer an appealing path to faster long-context inference: they reduce the quadratic cost and KV-cache burden of full softmax attention while retaining much of the quality of Transformer models. A practical way to obtain such models is to convert a pretrained Transformer instead of pretraining a new architecture from scratch, but this conversion is still brittle. Simply copying the teacher attention projections into a Gated DeltaNet (GDN) student does not specify th...

📄 Learning the generating functional for variance reduction in lattice QCD
🗓️ Published: 6/14/2026
🔗 http://arxiv.org/abs/2606.15986v1
👥 Authors: Ryan Abbott, Yang Fu, Daniel C. Hackett (possible past Massachusetts Institute of Technology affiliation), Gurtej Kanwar (possible past Massachusetts Institute of Technology affiliation), Fernando Romero-López, Phiala E. Shanahan (possible past Massachusetts Institute of Technology affiliation)
Abstract

The generating functional in quantum field theory provides the natural framework for constructing correlation functions as derivatives with respect to source operators. We present a methodology that leverages machine-learned normalizing flows to reduce the variance of arbitrary $N$-point correlation functions of bosonic operators in lattice gauge field theory calculations by encoding a representation of the generating functional. We show that it is possible to systematically approach noiseless e...

📄 Causal-Privacy Audit Workflow for Synthetic and Distilled Data in Dropout Support
🗓️ Published: 6/14/2026
🔗 http://arxiv.org/abs/2606.15940v1
👥 Authors: Hanghang Zheng, Xiwei Zhuang, Zhong Wang, Hong Liu (possible past Google (United States) affiliation), Xiao Chen, Jingwen He, Xia Li (possible past Meta (United States) affiliation)
Abstract

Synthetic and distilled student data are increasingly used to enable privacy-conscious learning analytics, yet their suitability for decision-facing institutional support remains uncertain. In dropout support, generated data must preserve not only predictive utility or distributional resemblance, but also the financial-status evidence used to guide advising, payment-plan assistance, and scholarship-related decisions. Method: This study introduces CaP-Eval, a decision-facing causal-privacy audit ...

*Notable papers are those with at least two authors from a "big" AI/ML lab.