📄 Notable* Recent AI/ML arXiv Papers

📄 Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation
🗓️ Published: 5/25/2026
🔗 http://arxiv.org/abs/2605.26111v1
👥 Authors: Shuhong Zheng, Aashish Kumar Misraa, Yu-Teng Li, Yu-Jhe Li (possible past Carnegie Mellon University affiliation), Igor Gilitschenski (possible past Massachusetts Institute of Technology affiliation)
Abstract

Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instructions. Existing approaches often encode text and reference images separately. This limits cross-modal reasoning abilities and causes copy-paste artifacts. Recent frameworks that connect multimodal models and diffusion models improve instruction following, but largely overlook identity preservation. To address these limitations, we condition diffusion models...

📄 Language Models Need Sleep
🗓️ Published: 5/25/2026
🔗 http://arxiv.org/abs/2605.26099v1
👥 Authors: Sangyun Lee, Sean McLeish, Tom Goldstein (possible past Meta (United States) affiliation), Giulia Fanti (possible past University of California, Berkeley affiliation)
Abstract

Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache. During sleep, the model performs $N$ offline recurrent passes over the accumulated context and updates the fast weights in its state-space model (SSM) blocks thro...

📄 Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution
🗓️ Published: 5/25/2026
🔗 http://arxiv.org/abs/2605.26032v1
👥 Authors: Zixin Jessie Chen, Zhuo Chen, Archer Wang, Jeff Gore, William T. Freeman (possible past Massachusetts Institute of Technology affiliation), Congyue Deng (possible past Stanford University affiliation), Marin Soljačić (possible past Massachusetts Institute of Technology affiliation)
Abstract

Creating images from noise is image generation; reconstructing fine details from coarse inputs is super-resolution. Despite their practical differences, both can be understood as reversing information loss across scales. We introduce $\textbf{SKILD}$, a $\textbf{S}$cale-invariant $\textbf{K}$-Space $\textbf{I}$mage $\textbf{L}$earning $\textbf{D}$iffusion model that unifies generation and continuous super-resolution within a single unconditional framework. Both natural images and critical physic...

📄 CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists
🗓️ Published: 5/25/2026
🔗 http://arxiv.org/abs/2605.26029v1
👥 Authors: Junlin Yang, Dylan Zhang, Xiangchen Song, Qirun Dai, Xiao Liu (possible past Baidu (China) affiliation), Yuen Chen, Aniket Vashishtha, Jing Shi, Chenhao Tan, Hao Peng (possible past Tsinghua University affiliation)
Abstract

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is supported by a correct hypothesis about the underlying causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reac...

📄 Neural Scalable Symbolic Search Framework for Complex Logical Queries with Multiple Free Variables
🗓️ Published: 5/25/2026
🔗 http://arxiv.org/abs/2605.25985v1
👥 Authors: Weizhi Fei, Hang Yin, Zihao Wang, Shukai Zhao, Wei Zhang (possible past Tsinghua University affiliation), Yangqiu Song (possible past Tsinghua University affiliation)
Abstract

Complex Query Answering (CQA) is a fundamental knowledge representation and reasoning task over incomplete knowledge graphs (KGs). Answering existential first-order queries with $k$ free variables (i.e., $\text{EFO}_k$ queries) is a crucial yet challenging problem, as it requires ranking answer tuples in $\mathcal{E}^k$, where $\mathcal{E}$ denotes the entity set of a KG. This quickly becomes intractable as $k$ grows. Consequently, existing benchmarks and methods rely on marginal rankings over i...
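
To make the intractability claim concrete, the short sketch below counts the candidate answer tuples in $\mathcal{E}^k$ for a few values of $k$. The entity count of 15,000 is a hypothetical figure chosen for illustration and does not come from the paper.

```python
def num_answer_tuples(num_entities: int, k: int) -> int:
    """Size of the candidate space E^k for a query with k free variables."""
    return num_entities ** k


if __name__ == "__main__":
    num_entities = 15_000  # hypothetical KG size, not taken from the paper
    for k in (1, 2, 3):
        print(f"k={k}: {num_answer_tuples(num_entities, k):,} candidate tuples")
```

Each additional free variable multiplies the candidate space by the number of entities, which is why exhaustively ranking full tuples stops being practical beyond very small $k$.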

📄 Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning
🗓️ Published: 5/25/2026
🔗 http://arxiv.org/abs/2605.25920v1
👥 Authors: Wei Fan (possible past Tencent (China) affiliation), Yining Zhou, Mufan Zhang, Yanbing Weng, Yiran Hu, Tianshi Zheng, Baixuan Xu, Chunyang Li, Jianhui Yang, Haoran Li, Yangqiu Song (possible past Tsinghua University affiliation)
Abstract

While large language models (LLMs) augmented with agentic search capabilities show promise for legal reasoning, they overlook a fundamental constraint that applicable law must match the temporal context of each case, as retroactive application of statutes violates core legal principles and leads to erroneous conclusions. Our observations reveal that current legal LLMs suffer from temporal bias anchored to their training cutoff, while search agents rarely incorporate temporal constraints into que...

📄 $D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing
🗓️ Published: 5/25/2026
🔗 http://arxiv.org/abs/2605.25893v1
👥 Authors: Aoxi Liu, Yupeng Chen, James Oldfield, Guanzhe Hong, Junchi Yu, Baoyuan Wu (possible past Tencent (China) affiliation), Philip Torr (possible past University of Oxford affiliation), Adel Bibi
Abstract

Despite the emergence of diffusion large language models (D-LLMs) as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, exposing intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Motivated by the suitability of lightweight probes for always-on monitoring, we analyz...

📄 Towards the Connection between Activation Sparsity and Flat Minima
🗓️ Published: 5/25/2026
🔗 http://arxiv.org/abs/2605.25612v1
👥 Authors: Ze Peng, Jian Zhang (possible past Tencent (China) affiliation), Lei Qi, Yang Gao (possible past Tencent (China) affiliation), Yinghuan Shi
Abstract

The observation that activation sparsity emerges in MLP blocks of standardly trained Transformers offers an opportunity to drastically reduce computation costs without sacrificing performance. To theoretically explain this phenomenon, existing works have shown that activation sparsity does not result from the data properties or data fitting but from the implicit bias of the training process. However, these connections are obtained with strong assumptions, which cannot be applied to deep models s...

📄 Test-Time Self-Adaptive Conditioning for Stable Audio-Driven Talking-Head Generation
🗓️ Published: 5/25/2026
🔗 http://arxiv.org/abs/2605.25488v1
👥 Authors: Zhicheng Zhang, Lei Wang (possible past Baidu (China) affiliation), Yu Zhang (possible past Google (United States) affiliation), Yongsheng Gao
Abstract

Audio-driven talking-head generation has achieved remarkable progress with recent models such as AniTalker, FLOAT, and Sonic. Despite their success, most existing approaches rely on a single static reference image to condition the entire video generation process at inference stage. This static conditioning paradigm often creates a mismatch between fixed identity features and dynamically evolving facial motion, leading to identity drift, temporal inconsistency, and degraded perceptual quality. We...

📄 A Signal-Language Foundation Model for Broad-Spectrum Cardiovascular Assessment from Routine Electrocardiography
🗓️ Published: 5/25/2026
🔗 http://arxiv.org/abs/2605.25446v1
👥 Authors: Ziqing Yu, Yuhui Tao, Jiayu Huo, Lei Pan, Zilong Xiao, Juecheng Chen, Xiao Li, Jianxuan Li, You Zhou, Zhixing Li, Cong Wang, Beijian Zhang, Chen Chen (possible past Tencent (China) affiliation), Hongyang Lu, Konstantinos Patlatzoglou, Daniel B. Kramer, Jonathan W. Waks, Yangang Su, Fu Siong Ng, Shuo Wang (possible past Nvidia (United States) affiliation), Yixiu Liang, Junbo Ge
Abstract

Electrocardiography (ECG) is central to cardiovascular care, but conventional AI models are often restricted to common arrhythmias and may generalize poorly across populations or clinically subtle diseases. We developed ECG Contrastive Language-Image Pre-training (ECGCLIP), a signal-language contrastive learning framework that aligns ECG waveforms with expert diagnostic reports. ECGCLIP was pre-trained on 2,837,962 ECG studies from 1,324,856 patients and evaluated on a held-out internal test set...
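
As a rough illustration of what a signal-language contrastive objective can look like, here is a minimal pure-Python sketch of a CLIP-style symmetric loss over paired embeddings. It assumes the standard CLIP formulation (cosine similarity, a temperature, cross-entropy in both directions); the abstract does not spell out ECGCLIP's exact loss, and the function names and the temperature of 0.07 are illustrative assumptions.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean of -log softmax(row i)[i]: row i's correct match is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(signal_embs, text_embs, temperature=0.07):
    """CLIP-style loss (assumed form): pair i of signals matches pair i of texts."""
    signals = [_normalize(v) for v in signal_embs]
    texts = [_normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(s, t)) / temperature for t in texts]
        for s in signals
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    signal_to_text = _diagonal_cross_entropy(logits)
    text_to_signal = _diagonal_cross_entropy(logits_transposed)
    return 0.5 * (signal_to_text + text_to_signal)


if __name__ == "__main__":
    embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print("aligned:", symmetric_contrastive_loss(embs, embs))
    print("shuffled:", symmetric_contrastive_loss(embs, embs[1:] + embs[:1]))
```

The loss is small when each waveform embedding is closest to its own report embedding and grows when the pairing is scrambled, which is the alignment pressure the abstract describes.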

📄 Towards end-to-end LLM-based censoring-aware survival analysis
🗓️ Published: 5/25/2026
🔗 http://arxiv.org/abs/2605.25399v1
👥 Authors: Yishu Wei, Hexin Dong, Yi Lin, Jiahe Qian, Yi Liu (possible past Google (United States) affiliation), Yifan Peng (possible past Stanford University affiliation)
Abstract

Objective: Survival analysis is central to medical prediction, yet large language models (LLMs) are rarely used as end-to-end survival models because censoring prevents straightforward supervised fine-tuning. Here we present LLMSurvival, a framework that enables censoring-aware survival analysis with unmodified LLMs operating directly on tabular clinical data. Materials and Methods: LLMSurvival reformulates time-to-event prediction as pairwise ranking among comparable subjects, and derives tes...
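
The idea of ranking among comparable subjects can be sketched as follows. This minimal example uses the conventional definition from Harrell's concordance index, where a pair is comparable only if the subject with the shorter observed time actually had the event. Whether LLMSurvival uses exactly this definition is an assumption, and all function names are illustrative.

```python
def comparable_pairs(times, events):
    """Return (i, j) pairs where subject i is known to fail before subject j.

    A censored subject cannot be the earlier member of a pair, because its
    true event time is only known to exceed its observed time.
    """
    pairs = []
    for i in range(len(times)):
        if not events[i]:
            continue  # censored: true event time unknown
        for j in range(len(times)):
            if i != j and times[i] < times[j]:
                pairs.append((i, j))
    return pairs


def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs ranked correctly (ties count as half)."""
    pairs = comparable_pairs(times, events)
    if not pairs:
        raise ValueError("no comparable pairs; concordance is undefined")
    score = 0.0
    for i, j in pairs:
        if risk_scores[i] > risk_scores[j]:
            score += 1.0  # earlier failure was given the higher risk
        elif risk_scores[i] == risk_scores[j]:
            score += 0.5
    return score / len(pairs)


if __name__ == "__main__":
    times = [2.0, 4.0, 6.0]
    events = [True, False, True]  # subject 1 is censored at t=4
    print(comparable_pairs(times, events))
    print(concordance_index(times, events, [3.0, 2.0, 1.0]))
```

Restricting supervision to comparable pairs is what lets censored records contribute: a censored subject can still serve as the longer-surviving member of a pair even though its exact event time is never observed.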

📄 Multi-Objective Learning for Diffusion Models: A Statistical Theory under Semi-Supervised Learning
🗓️ Published: 5/24/2026
🔗 http://arxiv.org/abs/2605.25210v1
👥 Authors: Ziheng Cheng, Yixiao Huang, Hanlin Zhu, Haoran Geng, Somayeh Sojoudi, Jitendra Malik (possible past University of California, Berkeley affiliation), Pieter Abbeel (possible past University of California, Berkeley affiliation), Xin Guo
Abstract

Diffusion models are increasingly used as powerful conditional generators, yet real deployments often involve multiple target distributions arising from different tasks, e.g., diverse prompt domains in text-to-image generation, or multiple environments in robotics with diffusion policies. This naturally leads to a multi-objective learning (MOL) problem. A key challenge is that achieving good Pareto trade-offs can require a generalist model class with substantially larger capacity than what suffi...

📄 When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards
🗓️ Published: 5/25/2026
🔗 http://arxiv.org/abs/2605.25864v1
👥 Authors: Li Wang (possible past Tesla (United States) affiliation), Xiaodong Lu, Xiaohan Wang (possible past Baidu (China) affiliation), Yikun Ban, Jiajun Chai, Wei Lin, Tianhao Peng, Guojun Yin
Abstract

Large Language Models (LLMs) have achieved remarkable advancements in reasoning capabilities empowered by Reinforcement Learning with Verifiable Rewards (RLVR). Nonetheless, RLVR intrinsically relies on ground-truth labels for reward computation, the acquisition of which is often prohibitively expensive in real-world scenarios. While unsupervised RLVR paradigms attempt to circumvent this by training on pseudo-labels, they are notoriously susceptible to training collapse. Moreover, different samp...

📄 RotMoLE: Enhancing Mixture of Low-Rank Experts through Rotational Gating Mechanism
🗓️ Published: 5/25/2026
🔗 http://arxiv.org/abs/2605.25565v1
👥 Authors: Mengyang Sun, Maochuan Dou, Tao Feng, Dan Zhang (possible past Google (United States) affiliation), Yihao Wang, Junpeng Liu, Yifan Zhu, Jie Tang (possible past Tsinghua University affiliation)
Abstract

While Large Language Models (LLMs) are commonly fine-tuned to handle domain-specific tasks before being applied to vertical applications, adapting them to complex scenarios with diverse specialized knowledge remains challenging. Meanwhile, Mixture-of-Experts (MoE) architecture has risen as a crucial paradigm for training LLMs, and some recent works have also incorporated MoE into Parameter-Efficient Fine-Tuning (PEFT) to propose the Mixture of Low-rank Experts (MoE-LoRA), to enhance the power of...

📄 ERNIE-Image Technical Report
🗓️ Published: 5/25/2026
🔗 http://arxiv.org/abs/2605.25347v1
👥 Authors: Jiaxiang Liu, Zhida Feng, Pengyu Zou, Zhenyu Qian, Tianrui Zhu, Jun Xia, Yuehu Dong, Yanzheng Lin, Honglin Xiong, Anqi Chen, Yunpeng Ding, Jinghui Duan, Lin Gao, Chao Han, Tiechao He, Jiakang Hu, Ranjun Hua, Xueming Jiang, Qingli Kong, Yuting Lei, Tianyu Li, Yunlin Liu, Changling Liu, Yaxin Liu, Yi Liu (possible past Google (United States) affiliation), Xuguang Liu, Xiaolong Ma, Yan Pan, Yiran Ren, Nan Sheng, Yu Sun (possible past Baidu (China) affiliation), Siyang Sun, Yixiang Tu, Yang Wan, Huanai Wang, Siqi Wang, Yang Wu (possible past Tencent (China) affiliation), Youzhi Yang, Xiaowen Yang, Jianwen Yang, Yehua Yang, Quanwen Zhang, Xinmin Zhang, Haoxin Zhang, Xiang Zhang, Jun Zhang (possible past Tencent (China) affiliation), Qian Zhang (possible past University of Washington affiliation), Qiao Zhao, Qi Zhou
Abstract

We introduce ERNIE-Image, an open-source text-to-image generation model built upon an 8B single-stream DiT architecture. ERNIE-Image aims to bridge the gap between current open-source models and leading closed-source systems through more effective mining of large-scale pre-training data and improved supervision quality throughout training. During pre-training, we adopt a bottom-up data construction pipeline that combines fine-grained image categorization, rich caption annotation, aesthetic asses...

*Notable papers are those with at least two authors from a "big" AI/ML lab.