📄 Notable* Recent AI/ML arXiv Papers

📄 AutoMem: Automated Learning of Memory as a Cognitive Skill
🗓️ Published: 7/1/2026
🔗 http://arxiv.org/abs/2607.01224v1
👥 Authors: Shengguang Wu, Hao Zhu (possible past Tsinghua University affiliation), Yuhui Zhang, Xiaohan Wang (possible past Baidu (China) affiliation), Serena Yeung-Levy
Abstract

Memory expertise is a learned skill: knowing what to encode, when to retrieve, and how to organize knowledge--a capacity known in cognitive science as metamemory. We bring this perspective to LLMs by treating memory management as a trainable skill. We promote file-system operations to first-class memory actions alongside task actions, letting the model itself decide how to manage its memory. This memory skill improves along two axes: the structure that supports it (prompts, file schemas, action ...

📄 Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation
🗓️ Published: 7/1/2026
🔗 http://arxiv.org/abs/2607.01208v1
👥 Authors: Shayan Talaei, Abhinav Chinta, Devvrit Khatri, Amin Karbasi, Azalia Mirhoseini (possible past Google (United States) affiliation), Amin Saberi (possible past Stanford University affiliation)
Abstract

Language models deployed in high-stakes roles can potentially favor certain entities, brands, or viewpoints, steering user decisions at scale. Such preferential biases can be introduced by any actor in the model's supply chain and are most dangerous when the model reveals its preference only on the relevant topic while behaving identically to its unmodified base on all other inputs. Recent work has shown that these biases can transfer through context distillation on semantically unrelated data, ...

📄 World from Motion: Generative Dynamic Gaussian Reconstruction from Monocular Video
🗓️ Published: 7/1/2026
🔗 http://arxiv.org/abs/2607.01202v1
👥 Authors: Liyuan Zhu, Shengyu Huang, Amrita Mazumdar, Tianye Li, Zan Gojcic, Gordon Wetzstein (possible past Stanford University affiliation), Iro Armeni, Shalini De Mello (possible past Nvidia (United States) affiliation), Alex Trevithick
Abstract

We present World from Motion, a method for generating freely renderable dynamic 3D Gaussian representations from monocular videos. Our approach conditions a video model on dense, pixel-aligned renderings that encode appearance, geometry, and 3D scene motion along both input and target camera trajectories to correct rendering artifacts and fill in missing regions from an initial reconstruction. To train this model, we construct a dataset of aligned multiview video pairs and dynamic 3DGS represent...

📄 Autonomous Scientific Discovery via Iterative Meta-Reflection
🗓️ Published: 7/1/2026
🔗 http://arxiv.org/abs/2607.01131v1
👥 Authors: Bingchen Zhao, Sara Beery (possible past Microsoft (United States) affiliation), Oisin Mac Aodha (possible past University of Edinburgh affiliation)
Abstract

Autonomous scientific discovery systems offer the potential to accelerate research by automating the process of hypothesis generation and validation. However, current systems operate within constrained search spaces or require predefined research questions, limiting their capacity for true open-ended inquiry. Furthermore, while they generate hypotheses iteratively, they largely lack the ability to explicitly synthesize their own accumulated findings to uncover complex, interconnected phenomena. ...

📄 SWE-Doctor: Guiding Software Engineering Agents with Runtime Diagnosis from Multi-Faceted Bug Reproduction Tests
🗓️ Published: 7/1/2026
🔗 http://arxiv.org/abs/2607.00990v1
👥 Authors: Yaoqi Guo, Yang Liu (possible past Tsinghua University affiliation), Jie M. Zhang (possible past Peking University affiliation), Yun Ma, Yiling Lou, Zhenpeng Chen
Abstract

Large language model (LLM)-based software engineering agents are increasingly developed to resolve software issues by generating patches from issue reports and code repositories. Bug reproduction tests (BRTs) are an important building block for such agents and have been shown useful for patch validation. However, it remains unclear whether BRTs can also help the more central stage of patch generation. We first conduct a preliminary study and find that directly using advanced BRT generators to gu...

📄 TRCGL-Net: A Long-Tailed Multi-Label Chest X-Ray Classification Framework with Generative Data Augmentation and Label Co-Occurrence Modeling
🗓️ Published: 7/1/2026
🔗 http://arxiv.org/abs/2607.00975v1
👥 Authors: Tong Shao, Hongshun Ling, Li Zhang (possible past University of Oxford affiliation), Jinjing Wu, Junke Wang, Yuan Gao (possible past Tencent (China) affiliation), Fang Wang (possible past Tencent (China) affiliation)
Abstract

Chest X-ray multi-label classification is a core task in intelligent medical imaging diagnosis. However, real clinical data often exhibit extreme long-tailed distributions, leading to degraded performance on rare diseases in tail classes. This issue is not only driven by data scarcity but also by two intrinsic factors: 1) attenuation of tail-class lesion representations under complex anatomical backgrounds, and 2) dominance of head classes in modeling label co-occurrence relationships. To address...

📄 Loss Smoothing for Stable Adaptation Under Distribution Shift
🗓️ Published: 7/1/2026
🔗 http://arxiv.org/abs/2607.00634v1
👥 Authors: Darshan Patil, Ekaterina Lobacheva, Razvan Pascanu (possible past Google (United States) affiliation), Sarath Chandar (possible past Mila - Quebec Artificial Intelligence Institute affiliation)
Abstract

In settings such as fine-tuning and reinforcement learning, neural networks are often adapted under distribution shift. Standard adaptation methods typically optimize the target objective directly, inducing an abrupt change from the source training objective. This abrupt transition can distort learned representations, including features that may still be useful for the new task. We investigate whether a more gradual transition can improve adaptation. We propose loss smoothing, a simple approach ...

📄 Enhancing Flow Matching with A Unified Guidance Framework for Efficient and Robust Speech Synthesis
🗓️ Published: 7/1/2026
🔗 http://arxiv.org/abs/2607.00363v1
👥 Authors: Zuda Yu, Qianhui Xu, Ting Chen (possible past Google (United States) affiliation), Junhui Zhang, Tao Fu, Hongjiang Yu, Qiangqing Wang, Yang Song (possible past Stanford University affiliation)
Abstract

Flow Matching (FM) has emerged as a powerful paradigm for speech generation but remains constrained by high inference latency and timbre leakage. To address these bottlenecks, we propose a unified guidance framework that enhances generation efficiency and robustness through two complementary strategies. On the data front, we introduce Data-guidance via heterogeneous augmentation, encouraging the model to disentangle linguistic content from acoustic residue. In parallel, we propose an enhanced Mo...

📄 ASPIRE: Agentic /Skills Discovery for Robotics
🗓️ Published: 6/30/2026
🔗 http://arxiv.org/abs/2607.00272v1
👥 Authors: Runyu Lu, Yubo Wu, Ethan Kou, Letian Fu, Wenli Xiao, Ajay Mandlekar, Yinzhen Xu, Guanya Shi, Ken Goldberg (possible past University of California, Berkeley affiliation), Ang Chen, Mosharaf Chowdhury (possible past University of California, Berkeley affiliation), Yuke Zhu (possible past Stanford University affiliation), Linxi "Jim" Fan, Guanzhi Wang (possible past Stanford University affiliation)
Abstract

Traditional robot programming is challenging: it requires orchestrating multimodal perception, managing physical contact dynamics, and handling diverse configurations and execution failures. We introduce ASPIRE (Agentic Skill Programming through Iterative Robot Exploration), a continual learning system that autonomously writes and refines robot control programs in a code-as-policy paradigm while compounding experience into a reusable skill library. ASPIRE discovers skills that persist across tas...

📄 SLIM-RL: Risk-Budgeted Random-Masking RL for Diffusion LLMs Without Trajectory Slicing
🗓️ Published: 6/30/2026
🔗 http://arxiv.org/abs/2607.00208v1
👥 Authors: Ruikang Zhao, Zhenting Wang, Han Gao (possible past Tencent (China) affiliation), Ligong Han (possible past Google (United States) affiliation)
Abstract

Reinforcement learning for diffusion large language models (dLLMs) has largely moved to trajectory-aware methods. The current state of the art, TraceRL, holds that random masking is mismatched with the model's inference trajectory, and it reconstructs that trajectory during training by slicing each rollout into up to K/s trajectory-aligned training samples, a cost that grows with the block size K. We show that this mismatch can be mitigated without reconstructing the trajectory. Our method, SLIM...

📄 Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision
🗓️ Published: 6/30/2026
🔗 http://arxiv.org/abs/2606.32038v1
👥 Authors: Zifan Carl Guo, Laura Ruis, Jacob Andreas (possible past University of California, Berkeley affiliation), Belinda Z. Li (possible past University of Washington affiliation)
Abstract

When does training language models (LMs) to generate explanations of their predictions yield faithful introspection, rather than superficial imitation? We study LMs trained to explain which features of their inputs influenced their behavior, using models' counterfactual behavior on modified inputs as supervision. Surprisingly, we find that LMs trained on fixed counterfactual explanations derived from earlier checkpoints of themselves, or even from behaviorally similar models in different familie...

📄 AdaJEPA: An Adaptive Latent World Model
🗓️ Published: 6/30/2026
🔗 http://arxiv.org/abs/2606.32026v1
👥 Authors: Ying Wang (possible past Tsinghua University affiliation), Oumayma Bounou, Yann LeCun (possible past Meta (United States) affiliation), Mengye Ren (possible past University of Toronto affiliation)
Abstract

Latent world models enable planning from high-dimensional observations by predicting future states in a compact latent space. However, these models are typically kept frozen at test time: when their predictions become inaccurate, planning can fail, especially under test-time distribution shift. To address this, we propose AdaJEPA, an adaptive latent world model that performs test-time adaptation within the closed loop of model predictive control (MPC). After training, AdaJEPA plans and executes ...

📄 GR2 Technical Report
🗓️ Published: 6/30/2026
🔗 http://arxiv.org/abs/2606.31984v1
👥 Authors: Yufei Li, Zaiwei Zhang, Mingfu Liang, Kavosh Asadi, Jay Xu, Jimmy Kim, Chongyang Bai, Jieyi Zhang, Hongye Xie, Prachi Agrawal, Dian Yu (possible past Tencent (China) affiliation), Tianyi Chen, Jean-Pascal Billaud, Garret Buell, Yk Zhu, Sachin Patil, Brooke Bian, Zhou Fang, Kevin Huang, Shiva Sudanagunta, Yuzhen Huang, Emma Lu, Chris O'Brien, Yang Song (possible past Stanford University affiliation), Lihong Li (possible past Microsoft (United States) affiliation), Jacob Tao, Zhicheng Zhu, Chao Li (possible past Baidu (China) affiliation), Gaoxiang Liu, Neil Wu, Zhongyin Hu, Li Han, Loki Chen, Ming Lei, Greg Rehm, Siyuan Song, Tianwei Zhang, Li Li (possible past Google (United States) affiliation), Ketan Singh, Yavuz Yetim, Ilyas Atishev, Satendra Gera, Ashkan Sadeghi, Rachel Yan, Nikko Mizutani, Shuaiwen Wang, Song Yang, Zhijing Li, Jiang Liu, Mengying Sun, Fei Tian, Xiaohan Wei, Chonglin Sun, Parish Aggarwal, Kaushik Rangadurai, Zhi Hua, Frank Shyu, Ruchit Sharma, Liyuan Li, Shike Mei, Wenlin Chen (possible past Meta (United States) affiliation), Santanu Kolay, Ben Schulte, Deepak Chandra (possible past Google (United States) affiliation), Adam Song, Sandeep Pandey, Xi Liu, Hamed Firooz, Luke Simon
Abstract

Industrial recommendation systems serve billions of users through a multi-stage funnel -- retrieval, early-stage ranking, and re-ranking -- where the final re-ranking step disproportionately shapes user engagement and downstream performance, particularly for carousel and grid display formats. Despite growing enthusiasm for Large Language Models (LLMs) in recommendation, three gaps hinder industrial adoption: (1) most efforts target retrieval and ranking, leaving re-ranking -- the stage closest t...

📄 LUNA: Learning Universal 3D Human Animation Beyond Skinning
🗓️ Published: 6/30/2026
🔗 http://arxiv.org/abs/2606.31981v1
👥 Authors: Peng Li (possible past Tsinghua University affiliation), Rawal Khirodkar (possible past Carnegie Mellon University affiliation), Junxuan Li, Yuan Dong, Chen Cao, Yuan Liu (possible past Google (United States) affiliation), Wenhan Luo (possible past Tencent (China) affiliation), Yike Guo, Shunsuke Saito (possible past Meta (United States) affiliation)
Abstract

Creating photorealistic, animatable 3D human avatars from monocular images still largely depends on Linear Blend Skinning (LBS) and parametric body models, which constrain expressivity and often introduce artifacts due to imperfect fitting. We propose LUNA, an LBS-free universal neural animation model that directly maps multiple 2D controls like images, keypoints, sketches, and unseen characters into 3D Gaussian deformations, bypassing explicit body fitting. At its core, a transformer-based moti...

📄 ShopX: A Foundation Model for Intent-to-Item Fulfillment in Agentic Shopping
🗓️ Published: 6/30/2026
🔗 http://arxiv.org/abs/2606.31693v1
👥 Authors: Jiacheng Chen, Tao Zhang (possible past Nvidia (United States) affiliation), Manxi Lin, Dunxian Huang, Teng Shi, Honghao Fu, Mengyan Li, Xinming Zhang, Chenchi Zhang, Xuan Lu, Xiaoxiong Du, Haibin Chen, Shaolin Ye, Hao Chang, Xiaoqi Li, Shuwen Xiao, Yujin Yuan, Jingxuan Feng, Shaopan Xiong, Huimin Yi, Ju Huang, Qiu Shen, Ying Chen (possible past Baidu (China) affiliation), Junjun Zheng, Xiangheng Kong, Yuning Jiang
Abstract

The wave of AI-native applications is moving shopping beyond page- and feed-based browsing toward intent-driven experiences orchestrated by LLM agents. A common design wraps an LLM around existing search and recommendation pipelines, forcing complex intents through low-bandwidth retrieval or ranking interfaces and leaving a gap between language understanding and item-space fulfillment. Generative recommendation gives LLMs a direct item-space interface through semantic IDs (SIDs), but existing mo...

📄 WorldRoamBench: An Open-World Benchmark for Long-Horizon Stability of Interactive World Models
🗓️ Published: 6/30/2026
🔗 http://arxiv.org/abs/2606.31672v1
👥 Authors: Ting-Bing Xu, Jiacheng Sui, Zhe Gao, Kewei Shi, Wenjin Yang, Zhicheng Liu, Zhaoxu Sun, Mingchao Sun, Hongyu Pan, Fan Jiang (possible past Shanghai Jiao Tong University affiliation), Mu Xu, Qi Fan, Yong Li (possible past Tsinghua University affiliation), Baoquan Chen (possible past Peking University affiliation)
Abstract

Despite rapid progress in interactive world models (IWMs), existing benchmarks evaluate action following only at trajectory level and ignore memory and interaction physics. We introduce WorldRoamBench, an open-world benchmark for long-horizon stability across four dimensions, each with tailored innovations: (i) Action: per-frame action metric bypassing cross-model semantic scale disparity and exposing failures hidden by trajectory; (ii) Vision: segment-based drift metric capturing non-monotonic ...

📄 QuasiMoTTo: Quasi-Monte Carlo Test-Time Scaling
🗓️ Published: 7/1/2026
🔗 http://arxiv.org/abs/2607.01179v1
👥 Authors: Michael Y. Li, Anthony Zhan, Kanishk Gandhi, Noah D. Goodman (possible past Stanford University affiliation), Emily B. Fox (possible past Apple (United States) affiliation)
Abstract

Scaling inference compute, by generating many parallel attempts per problem, is a costly but reliable lever for improving language model capabilities. By default these attempts are generated independently, wasting inference compute on redundant solutions. This waste seems unavoidable. After all, independence is what makes parallel sampling trivial to scale. However, this tradeoff is not fundamental: there is a rich design space of samplers that generate correlated but exact samples entirely in p...

📄 Task-Relevant Representation Decoupling for Visual Reinforcement Learning Generalization
🗓️ Published: 7/1/2026
🔗 http://arxiv.org/abs/2607.00796v1
👥 Authors: Jinwen Wang, Youfang Lin, Xiaobo Hu, Qian Xu (possible past Baidu (China) affiliation), Shuo Wang (possible past Nvidia (United States) affiliation), Zhuo Chen, Kai Lv
Abstract

Visual Reinforcement Learning (VRL) has achieved considerable success in solving control tasks. However, generalizing learned policies to new environments remains a major challenge, as agents often overfit to task-irrelevant features in the training environment. To solve this problem, we introduce the concept of decoupling observations into task-relevant and task-irrelevant representations. Building on this idea, we propose a self-supervised Task-Relevant Representation Decoupling (T2RD) algorit...

📄 Interpretable vs Learned Encoders for High-Cardinality Fraud Detection
🗓️ Published: 7/1/2026
🔗 http://arxiv.org/abs/2607.00477v1
👥 Authors: Xiao Han (possible past Tencent (China) affiliation), Jingjing Liu (possible past Microsoft (United States) affiliation), Moxuan Zheng, Zhen Zhang, Chenyu Wu
Abstract

A total of seven categorical encoding methods were tested on the IEEE-CIS fraud benchmark dataset (590,540 records, 3.5% positives, 8 high-cardinality columns). The encoders were evaluated using a stratified 5-fold cross-validation (CV) with three repetitions. Five of the encoders had identical frozen LightGBM learners in the downstream phase, allowing for controlled comparisons of their performance to each other. CatBoost and TabNet were included as comparisons across paradigms using different ...

📄 MolSafeEval: A Benchmark for Uncovering Safety Risks in AI-Generated Molecules
🗓️ Published: 7/1/2026
🔗 http://arxiv.org/abs/2607.00464v1
👥 Authors: Tong Xu (possible past Baidu (China) affiliation), Xinzhe Cao, Zhihui Zhu, Keyan Ding, Huajun Chen (possible past Alibaba Group (China) affiliation)
Abstract

Current molecular generation benchmarks emphasize task complexity, molecule novelty, and property alignment; they largely overlook a critical concern: the potential safety risks of AI-generated molecules. In practice, many generative models may produce molecules with toxic, reactive, or otherwise hazardous characteristics - posing hidden dangers that remain insufficiently addressed. To address this gap, we introduce MolSafeEval, a benchmark dedicated to evaluating and analyzing the safety risks ...

📄 Introduction to Stochastic Differential Equations for Generative Machine Learning: A Variational Perspective
🗓️ Published: 6/30/2026
🔗 http://arxiv.org/abs/2606.31576v1
👥 Authors: Ole Winther, Paul Jeha, Sander Dieleman (possible past Google (United States) affiliation), Andriy Mnih (possible past Google (United States) affiliation), Manfred Opper, Andrea Dittadi
Abstract

The use of ordinary and stochastic differential equations has led to substantial progress in generative machine learning with applications to, for example, image, video and biomolecule generation. This paper provides a self-contained and informal introduction to the differential equations, the probabilistic framework for using them in generative modeling and the Fokker--Planck equation that governs the temporal evolution of the marginal distribution of the stochastic variables of the differentia...

*Notable papers are those with at least two authors from a "big" AI/ML lab.