📄 Notable* Recent AI/ML arXiv Papers

📄 Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)
🗓️ Published: 6/3/2026
🔗 http://arxiv.org/abs/2606.05145v1
👥 Authors: Nizar Islah, Istabrak Abbes, Irina Rish (possible past Deepmind (United Kingdom) affiliation), Sarath Chandar (possible past Mila - Quebec Artificial Intelligence Institute affiliation), Eilif B. Muller
Abstract

When post-trained language models fail on reasoning problems, the common test-time-scaling response is to spend more compute on additional attempts, and the failed traces play no further role. We argue this discards a crucial signal; some failures come from unlucky sampling, where more rollouts help, while others are structural and resist resampling regardless of budget. We propose that failed traces encode recoverability structure: the inference-time signature of which test-time interventions c...

📄 Audio Interaction Model
🗓️ Published: 6/3/2026
🔗 http://arxiv.org/abs/2606.05121v1
👥 Authors: Zhifei Xie, Zihang Liu, Ze An, Xiaobin Hu (possible past Tencent (China) affiliation), Yue Liao, Ziyang Ma, Dongchao Yang, Mingbao Lin, Deheng Ye (possible past Tencent (China) affiliation), Shuicheng Yan (possible past National University Of Singapore affiliation), Chunyan Miao
Abstract

Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a u...

📄 Knowledge Index of Noah's Ark
🗓️ Published: 6/3/2026
🔗 http://arxiv.org/abs/2606.05104v1
👥 Authors: Sheng Jin, Minghao Liu, Yunze Xiao, Zeqi Zhou, Heli Qi, Yifan Yao, Meishu Song, Kaijing Ma, Xuan Zhang (possible past Meta (United States) affiliation), Sicong Jiang, Yizhe Li, Ningshan Ma, Jie Wei, Ziniu Li, Minglai Yang, Bangya Liu, Yiming Liang, Xiao Fang, Qingcheng Zeng, Jiarui Liu, Rui Yang, Shen Yan, Wenhao Huang, Jiaheng Liu, Zihan Wang (possible past Tsinghua University affiliation), Weihao Xuan, Ge Zhang
Abstract

Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors and operationalize disciplinary representativeness throug...

📄 AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?
🗓️ Published: 6/3/2026
🔗 http://arxiv.org/abs/2606.05080v1
👥 Authors: Zhangchen Xu, Junda Chen, Yue Huang, Dongfu Jiang, Jiefeng Chen, Hang Hua, Zijian Wu, Zheyuan Liu, Zexue He, Lichi Li, Shizhe Diao, Jiaxin Pei, Jinsung Yoon (possible past Google (United States) affiliation), Hao Zhang (possible past Tencent (China) affiliation), Mengdi Wang, Radha Poovendran, Misha Sra, Alex Pentland (possible past Massachusetts Institute Of Technology affiliation), Zichen Chen
Abstract

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra long-horizon closed-...

📄 UniCAD: A Unified Benchmark and Universal Model for Multi-Modal Multi-Task CAD
🗓️ Published: 6/3/2026
🔗 http://arxiv.org/abs/2606.05058v1
👥 Authors: Jingyuan Chen (possible past National University Of Singapore affiliation), Sheng Jin, Haopeng Sun, Wentao Liu, Chen Qian (possible past Shanghai Jiao Tong University affiliation)
Abstract

Computer-Aided Design (CAD) underpins modern engineering and manufacturing by enabling the creation of precise, editable 3D models. However, CAD research typically studies tasks in isolation, and multi-modal, multi-task learning for CAD is hindered by the absence of a unified benchmark. To address this gap, we introduce UniCAD, a comprehensive benchmark for multi-modal CAD learning that covers point-to-CAD reconstruction, text/image-to-CAD generation, and CAD question answering across diverse in...

📄 Learning While Acting: A Skill-Enhanced Test-Time Co-Evolution Framework for Online Lifelong Learning Agents
🗓️ Published: 6/3/2026
🔗 http://arxiv.org/abs/2606.04815v1
👥 Authors: Bo Mao, Jie Zhou (possible past Tsinghua University affiliation), Yutao Yang, Xin Li (possible past Google (United States) affiliation), Xian Wei, Qin Chen, Xingjiao Wu, Liang He
Abstract

Lifelong learning is essential for Large Language Model (LLM) agents operating in dynamic, interactive environments. However, existing lifelong learning agents for long-horizon tasks typically depend on retrieving discrete skills or past experiences with static parameters during inference, which prevents them from continuously internalizing test-time feedback as human learners do. To bridge this gap, we propose Skill-enhanced Test-Time Co-Evolution (LifeSkill), a two-stage reinforcement le...

📄 BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization
🗓️ Published: 6/3/2026
🔗 http://arxiv.org/abs/2606.04807v1
👥 Authors: Saket Reddy, Ke Yang (possible past Google (United States) affiliation), Chengxiang Zhai (possible past Tencent (China) affiliation)
Abstract

Mitigating social bias in Large Language Models (LLMs) presents a distinct alignment challenge: unlike verifiable tasks, bias lacks a single ground truth, creating a high-variance, subjective reward landscape. Previous preference-based fine-tuning methods have major trade-offs: Direct Preference Optimization (DPO) is limited by the lack of exploration inherent in offline training, while Proximal Policy Optimization (PPO) can lead to training instability due to potentially unreliable critic estim...

📄 BiNSGPS: Geometry Problem Solving via Bidirectional Neuro-Symbolic Interaction
🗓️ Published: 6/3/2026
🔗 http://arxiv.org/abs/2606.04648v1
👥 Authors: Qi Wang (possible past Tsinghua University affiliation), Peijie Wang, Fei Yin (possible past Tsinghua University affiliation), Cheng-Lin Liu
Abstract

Geometry problem solving poses distinct challenges in artificial intelligence. Existing approaches typically fall into two paradigms: symbolic methods, which exhibit limited adaptability, and neural methods, which are prone to hallucinations. Recent neuro-symbolic hybrids predominantly rely on a unidirectional pipeline where neural outputs are fed into solvers without feedback, making the system brittle to early-stage errors. To break this unidirectional bottleneck, we propose BiNSGPS, a framework t...

📄 Dynamic Infilling Anchors for Format-Constrained Generation in Diffusion Large Language Models
🗓️ Published: 6/3/2026
🔗 http://arxiv.org/abs/2606.04535v1
👥 Authors: Boyan Han, Yiwei Wang (possible past Google (United States) affiliation), Yi Song, Yujun Cai, Chi Zhang (possible past Peking University affiliation)
Abstract

Diffusion large language models (dLLMs) offer bidirectional attention and parallel generation, enabling them to exploit global context and naturally support format-constrained tasks like parseable JSON or reasoning templates. While straightforward fixed anchors can enforce such constraints, they often impose rigid spans, leading to truncated reasoning or redundant content. To overcome this, we propose Dynamic Infilling Anchors (DIA), a training-free method that dynamically estimates end-anchor p...

📄 CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities
🗓️ Published: 6/3/2026
🔗 http://arxiv.org/abs/2606.04460v1
👥 Authors: Tianneng Shi, Robin Rheem, Dongwei Jiang, Mona Wang, Francisco De La Riega, Zhun Wang, Jingzhi Jiang, Alexander Cheung, Sean Tai, Jonah Cha, Jianhong Tu, Gabriel Han, Chenguang Wang (possible past Amazon (United States) affiliation), Jingxuan He, Wenbo Guo, Dawn Song (possible past University Of California, Berkeley affiliation)
Abstract

AI has the potential to transform cybersecurity by enabling systems that can autonomously detect, analyze, and remediate software vulnerabilities. However, existing cybersecurity evaluations of AI systems are limited in scale or scope, and fail to capture the end-to-end lifecycle of real-world software vulnerability discovery and remediation. To address this gap, we propose CyberGym-E2E, a large-scale and realistic end-to-end cybersecurity benchmark that comprehensively evaluates AI agents' abil...

📄 L-TGVN: Leveraging Longitudinal Priors for Personalized Rapid MRI
🗓️ Published: 6/3/2026
🔗 http://arxiv.org/abs/2606.04419v1
👥 Authors: Arda Atalık, Sumit Chopra (possible past Meta (United States) affiliation), Daniel K. Sodickson (possible past Meta (United States) affiliation)
Abstract

MRI provides excellent soft-tissue contrast without ionizing radiation, but long acquisition times increase patient discomfort while also raising exam costs and limiting scanner throughput. A common approach to reduce scan time is to acquire fewer measurements, which yields an ill-posed linear inverse problem; recovering diagnostic-quality images therefore requires incorporating prior knowledge beyond the measured data. In follow-up exams, the most recent prior scan of a patient can provide a hi...

📄 Sparse Mixture-of-Experts Reward Models Learn Interpretable and Specialized Experts for Personalized Preference Modeling
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.04284v1
👥 Authors: Yifan Wang (possible past Stanford University affiliation), Jinyi Mu, Mayank Jobanputra, Yu Wang (possible past Tsinghua University affiliation), Ji-Ung Lee, Soyoung Oh, Isabel Valera, Vera Demberg
Abstract

Preference modeling plays a central role in reinforcement learning from human feedback (RLHF), enabling large language models (LLMs) to align with human values. However, most existing approaches assume a universal reward function, neglecting the diversity and heterogeneity of human preferences. To address this limitation without additional annotation costs, recent work has proposed learning multiple preference components from binary data and combining them to model individual preferences. Nevert...

📄 Characterizing initial human-AI proof formalization workflows
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.04273v1
👥 Authors: Katherine M. Collins, Simon Frieder, Jonas Bayer, Jacob Loader, Jeck Lim, Peiyang Song, Fabian Zaiser, Lexin Zhou, Shanda Li, Sam Looi, Joshua B. Tenenbaum (possible past Massachusetts Institute Of Technology affiliation), Umang Bhatt (possible past University Of Cambridge affiliation), Adrian Weller (possible past University Of Cambridge affiliation), Jose Hernandez-Orallo, Cameron E. Freer, Valerie Chen, Ilia Sucholutsky
Abstract

For centuries, human mathematicians have written proofs to substantiate their mathematical arguments; yet, the ability to automatically verify the validity of proofs has long been a challenge. Advances in AI systems' ability to generate code and engage in increasingly high-level mathematical reasoning promise to transform people's ability to formalize and thereby verify proofs. While many works focus on benchmarking the current frontier, we instead study how people use these tools. We conduct a ...

📄 SaliMory: Orchestrating Cognitive Memory for Conversational Agents
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.04120v1
👥 Authors: Kai Zhang, Xinyuan Zhang, Hongda Jiang, Shiun-Zu Kuo, Hyokun Yun, Ejaz Ahmed, Shereen Oraby, Ziyun Li, Sanat Sharma, Ann Lee, Ahmed A Aly, Anuj Kumar (possible past Meta (United States) affiliation), Raffay Hamid, Xin Luna Dong (possible past University Of Washington affiliation)
Abstract

Conversational agents that serve as lifelong companions must maintain persistent memory across all interactions. However, simply expanding context windows with raw retrieval degrades reasoning quality, while training memory agents via standard reinforcement learning creates a severe credit assignment bottleneck in a multi-stage pipeline. To solve this, we introduce SALIMORY, a framework that trains a single language model to manage a cognitively-structured memory spanning user facts, preferences...

📄 Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03988v2
👥 Authors: Mahtab Bigverdi, Linjie Li, Weikai Huang, Yiming Liu, Jaemin Cho, Jieyu Zhang, Tuhin Kundu, Chris Dangjoo Kim, Zelun Luo (possible past Stanford University affiliation), Linda Shapiro, Ranjay Krishna (possible past University Of Washington affiliation)
Abstract

Vision language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical information is not directly observable. Many such problems require imaginative perception: inferring what would be seen from an unseen viewpoint, tracing paths through occluded spaces, or integrating partial observations into a coherent spatial representation. We introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive ...

📄 Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03985v1
👥 Authors: Zekun Qi, Xuchuan Chen, Dairu Liu, Chenghuai Lin, Yunrui Lian, Sikai Liang, Zhikai Zhang, Yu Guan, Jilong Wang, Wenyao Zhang, Xinqiang Yu, He Wang (possible past Stanford University affiliation), Li Yi (possible past Stanford University affiliation)
Abstract

We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achie...

📄 Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03962v1
👥 Authors: Anthony Gx-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas, Eser Aygün, David Smalling, Shibl Mourad, Doina Precup (possible past Deepmind (United Kingdom) affiliation), André Barreto (possible past Deepmind (United Kingdom) affiliation), Mark Rowland (possible past University Of Cambridge affiliation)
Abstract

Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet, modern applications such as language model fine-tuning or scientific discovery demand diversity. Existing remedies such as entropy regularization or diversity bonuses often require fragile trade-offs that sacrifice performance for stochasticity or rely on heuristic metrics that can misalign policy rankings. We argue that diversity is more naturally understood as t...

📄 scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03906v1
👥 Authors: Jiabei Cheng, Jingbo Zhou (possible past Baidu (China) affiliation), Jun Xia, Changkai Li, Zhen Lei (possible past Beijing Academy Of Artificial Intelligence affiliation), Chang Yu, Stan Z. Li
Abstract

Simultaneous measurement of multiple omics modalities in single cells enables researchers to gain a more comprehensive understanding of cellular states and regulatory mechanisms. However, due to high experimental costs, significant noise, and incomplete modality coverage, a variety of computational methods for modality translation have emerged in recent years. Despite the development of translation models, there is still a lack of systematic benchmark evaluation in terms of datasets, evaluation ...

📄 Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03879v1
👥 Authors: Wei Ding, Yudong Zhang, Ruobing Xie (possible past Tencent (China) affiliation), Xingwu Sun (possible past Baidu (China) affiliation), Jiansheng Chen, Yu Wang (possible past Tsinghua University affiliation)
Abstract

As foundation models scale toward fusing more heterogeneous visual streams, understanding how diverse encoders interact under joint training becomes a prerequisite for principled design. Yet large vision-language models (LVLMs) currently lack the tools to do so, and parameter-efficient encoder configurations remain hard to identify before training. To re-examine encoder roles under joint training, on the 16-benchmark Cambrian-1 suite we retrain and evaluate all 31 non-empty subsets of five commo...

📄 A General Framework for Dynamic Consistent Submodular Maximization
🗓️ Published: 6/3/2026
🔗 http://arxiv.org/abs/2606.04946v1
👥 Authors: Paul Dütting, Federico Fusco, Silvio Lattanzi (possible past Google (United States) affiliation), Ashkan Norouzi-Fard, Ola Svensson, Morteza Zadimoghaddam (possible past Massachusetts Institute Of Technology affiliation)
Abstract

Consistency is an important property in dynamic submodular maximization and entails maintaining a near-optimal solution at all times, making only a small number of adjustments to the solution in each step. Prior work has explored this question for the insertion-only case, where the algorithm faces a stream of $n$ insertions, and has established lower and upper bounds for the cardinality-constrained version of the problem. We consider this question in the fully dynamic setting, where the stream o...

📄 Crafting Your Evolving Dreams: Concept-Incremental Versatile Customization
🗓️ Published: 6/3/2026
🔗 http://arxiv.org/abs/2606.04797v1
👥 Authors: Jiahua Dong, Wenqi Liang, Hongliu Li, Yang Cong, Duzhen Zhang, Hanbin Zhao, Henghui Ding, Yulun Zhang, Salman Khan (possible past Inception Institute Of Artificial Intelligence affiliation), Fahad Shahbaz Khan (possible past Inception Institute Of Artificial Intelligence affiliation)
Abstract

Custom diffusion models (CDMs) have garnered significant interest owing to their remarkable capacity for generating personalized concepts. However, the majority of CDMs unrealistically presume that the user's collection of personalized concepts is static and incapable of incremental growth over time. Furthermore, they exhibit significant catastrophic forgetting and concept neglect of previously learned concepts when incrementally learning a sequence of new ones. To resolve the above challenges, ...

📄 SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference
🗓️ Published: 6/3/2026
🔗 http://arxiv.org/abs/2606.04511v1
👥 Authors: Yaosheng Fu, Guangxuan Xiao, Xin Dong (possible past Tsinghua University affiliation), Song Han (possible past Stanford University affiliation), Oreste Villa (possible past Nvidia (United States) affiliation)
Abstract

Sparse attention reduces compute and memory bandwidth for long-context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe transfer bottleneck; (2) the sparse selection step itself retains $O(T^2)$ complexity and can dominate attention cost at long contexts. We propose SparDA, a decoupled sparse attention architecture that introduces a fourth per-layer projection, the Forecast, alongside Query, ...

📄 LimiX-2M: Mitigating Low-Rank Collapse and Attention Bottlenecks in Tabular Foundation Models
🗓️ Published: 6/3/2026
🔗 http://arxiv.org/abs/2606.04485v1
👥 Authors: Yuanrui Wang, Xingxuan Zhang, Han Yu, Mingchao Ming, Gang Ren (possible past Google (United States) affiliation), Hao Yuan, Li Mao, Yunjia Zhang, Chun Yuan, Peng Cui (possible past Tsinghua University affiliation)
Abstract

Tabular foundation models (TFMs) increasingly rival tree ensembles, but their performance is often compute-inefficient: with standard affine scalar tokenization, each feature injects value variation through an essentially one-dimensional channel, and feature IDs/positional signals cannot increase within-feature value degrees of freedom, yielding weak early-layer value sensitivity and redundant hidden states. We present a unified tokenize-and-route framework for strong TFMs: RaBEL...

📄 Stateful Visual Encoders for Vision-Language Models
🗓️ Published: 6/3/2026
🔗 http://arxiv.org/abs/2606.04433v1
👥 Authors: Zirui Wang, Junwei Yu, Adam Yala, David M. Chan, Joseph E. Gonzalez (possible past University Of California, Berkeley affiliation), Trevor Darrell (possible past University Of California, Berkeley affiliation)
Abstract

Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independently, without access to the prior visual context. As a result, small but task-critical changes may be attenuated before the language model has a chance to compare them, especially whe...

*Notable papers are those with at least two authors from a "big" AI/ML lab.