📄 Notable* Recent AI/ML arXiv Papers


📄 Taming Outlier Tokens in Diffusion Transformers
🗓️ Published: 5/6/2026
🔗 http://arxiv.org/abs/2605.05206v1
👥 Authors: Xiaoyu Wu, Yifei Wang, Tsu-Jui Fu, Liang-Chieh Chen (possible past Google (United States) affiliation), Zhe Gan (possible past Microsoft (United States) affiliation), Chen Wei
Abstract

We study outlier tokens in Diffusion Transformers (DiTs) for image generation. Prior work has shown that Vision Transformers (ViTs) can produce a small number of high-norm tokens that attract disproportionate attention while carrying limited local information, but their role in generative models remains underexplored. We show that this phenomenon appears in both the encoder and denoiser of modern Representation Autoencoder (RAE)-DiT pipelines: pretrained ViT encoders can produce outlier represen...
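The high-norm outlier tokens described above can be spotted with a simple norm-threshold heuristic. The sketch below is illustrative only, not the paper's method: it uses synthetic embeddings with four planted outliers and an assumed median-plus-IQR cutoff.

```python
import math
import random
import statistics

random.seed(0)
DIM, N_TOKENS = 64, 200

# Synthetic token embeddings: mostly well-behaved tokens plus four planted
# high-norm outliers, mimicking the ViT artifact the abstract describes.
tokens = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N_TOKENS)]
for i in range(4):
    tokens[i] = [20.0 * x for x in tokens[i]]  # inflate norms of tokens 0..3

norms = [math.sqrt(sum(x * x for x in t)) for t in tokens]

# Robust cutoff (assumed heuristic): median plus 4x the interquartile range.
q1, _, q3 = statistics.quantiles(norms, n=4)
threshold = statistics.median(norms) + 4 * (q3 - q1)
outliers = [i for i, n in enumerate(norms) if n > threshold]
print("flagged high-norm tokens:", outliers)
```

With these parameters the planted tokens sit far above the cutoff, so all four (and only those four) are flagged.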

📄 Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation
🗓️ Published: 5/6/2026
🔗 http://arxiv.org/abs/2605.05007v1
👥 Authors: Zhiqing Cui, Haotong Xie, Jiahao Yuan, Cheng Yang (possible past Tsinghua University affiliation), Hanqing Wang, Yuxin Wu (possible past University Of California, Berkeley affiliation), Yifan Wu (possible past Carnegie Mellon University affiliation), Siru Zhong, Tao Yu (possible past University Of Washington affiliation), Yifu Guo, Siyu Zhang, Xinlei Yu, Qibing Ren, Usman Naseem
Abstract

Large language model (LLM) multi-agent systems typically rely on rigid orchestration, committing either to flat per-query routing or to hand-engineered task decomposition, so decomposition depth, worker choice, and inference budget are not jointly optimized under one objective. We introduce Uno-Orchestra, a unified orchestration policy that selectively decomposes a task and dispatches each subtask to an admissible (model, primitive) pair, with both decisions learned together from curated RL traj...

📄 StoryAlign: Evaluating and Training Reward Models for Story Generation
🗓️ Published: 5/6/2026
🔗 http://arxiv.org/abs/2605.04831v1
👥 Authors: Haotian Xia, Hao Peng (possible past Tsinghua University affiliation), Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou (possible past Tsinghua University affiliation), Juanzi Li
Abstract

Story generation aims to automatically produce coherent, structured, and engaging narratives. Although large language models (LLMs) have significantly advanced text generation, stories generated by LLMs still diverge from human-authored works in complex narrative structure and human-aligned preferences. A key reason is the absence of effective modeling of human story preferences, which are inherently subjective and underexplored. In this work, we systematically evaluate the modeling of h...

📄 DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents
🗓️ Published: 5/6/2026
🔗 http://arxiv.org/abs/2605.04808v1
👥 Authors: Zhaorun Chen, Xun Liu, Haibo Tong, Chengquan Guo, Yuzhou Nie, Jiawei Zhang, Mintong Kang, Chejian Xu, Qichang Liu, Xiaogeng Liu, Tianneng Shi, Chaowei Xiao, Sanmi Koyejo, Percy Liang (possible past Stanford University affiliation), Wenbo Guo, Dawn Song (possible past University Of California, Berkeley affiliation), Bo Li (possible past Tencent (China) affiliation)
Abstract

AI agents are increasingly deployed across diverse domains to automate complex workflows through long-horizon and high-stakes action executions. Due to their high capability and flexibility, such agents raise significant security and safety concerns. A growing number of real-world incidents have shown that adversaries can easily manipulate agents into performing harmful actions, such as leaking API keys, deleting user data, or initiating unauthorized transactions. Evaluating agent security is in...

📄 Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models
🗓️ Published: 5/6/2026
🔗 http://arxiv.org/abs/2605.04638v1
👥 Authors: Mingda Li, Rundong Lv, Xinyu Li, Weinan Zhang (possible past Shanghai Jiao Tong University affiliation), Ting Liu (possible past Google (United States) affiliation)
Abstract

Uncertainty quantification (UQ) is an important technique for ensuring the trustworthiness of LLMs, given their tendency to hallucinate. Existing state-of-the-art UQ approaches for free-form generation rely heavily on sampling, which incurs high computational cost and variance. In this work, we propose the first gradient-based UQ method for free-form generation, SemGrad, which is sampling-free and computationally efficient. Unlike prior gradient-based methods developed for classification tasks t...

📄 From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning
🗓️ Published: 5/6/2026
🔗 http://arxiv.org/abs/2605.04572v1
👥 Authors: Xiao Wang (possible past Google (United States) affiliation), Yifei Zhang, Yongkang Liu, Xiaocui Yang, Zihan Wang (possible past Tsinghua University affiliation), Shi Feng, Daling Wang
Abstract

Safety alignment of Large Language Models (LLMs) is extremely fragile, as fine-tuning on a small number of benign samples can erase safety behaviors learned from millions of preference examples. Existing studies attempt to explain this phenomenon by comparing parameters and hidden states before and after fine-tuning, but overlook their dynamic evolution during fine-tuning. In this paper, we uncover a critical mechanism underlying safety degradation by analyzing parameter dynamics, where benign f...

📄 DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning
🗓️ Published: 5/6/2026
🔗 http://arxiv.org/abs/2605.04503v1
👥 Authors: Yuancheng Wei, Haojie Zhang, Linli Yao, Lei Li (possible past Carnegie Mellon University affiliation), Jiali Chen, Tao Huang, Yiting Lu, Duojun Huang, Xin Li (possible past Google (United States) affiliation), Zhao Zhong
Abstract

Image Difference Captioning (IDC) generates natural language descriptions that precisely identify differences between two images, serving as a key benchmark for fine-grained change perception, cross-modal reasoning, and image editing data construction. However, existing benchmarks lack diversity and compositional complexity, and standard lexical-overlap metrics (e.g., BLEU, METEOR) fail to capture semantic consistency or penalize hallucinations, which together prevent a comprehensive and robust ...
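The abstract's point about lexical-overlap metrics can be seen with a toy unigram-overlap F1 score (a crude stand-in for BLEU/METEOR, not the benchmark's metric): a faithful paraphrase scores low, while a caption with a hallucinated detail scores high.

```python
def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1, a crude stand-in for lexical metrics like BLEU."""
    c, r = set(candidate.lower().split()), set(reference.lower().split())
    overlap = len(c & r)
    if overlap == 0:
        return 0.0
    p, rec = overlap / len(c), overlap / len(r)
    return 2 * p * rec / (p + rec)

reference = "the red car was removed from the street"
paraphrase = "a crimson automobile no longer appears on the road"
hallucination = "the red car was removed from the sky"

print(unigram_f1(paraphrase, reference))     # 0.125: same meaning, low score
print(unigram_f1(hallucination, reference))  # ~0.857: wrong content, high score
```

This inversion (semantic match punished, hallucination rewarded) is exactly the failure mode that motivates semantic-consistency metrics in the paper.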

📄 StableI2I: Spotting Unintended Changes in Image-to-Image Transition
🗓️ Published: 5/6/2026
🔗 http://arxiv.org/abs/2605.04453v1
👥 Authors: Jiayang Li, Shuo Cao, Xiaohui Li, Zhizhen Zhang, Kaiwen Zhu, Yule Duan, Yu Qiao (possible past Shanghai Artificial Intelligence Laboratory affiliation), Jian Zhang (possible past Tencent (China) affiliation), Yihao Liu
Abstract

In most real-world image-to-image (I2I) scenarios, existing evaluations primarily focus on instruction following and the perceptual quality or aesthetics of the generated images. However, they largely fail to assess whether the output image preserves the semantic correspondence and spatial structure of the input image. To address this limitation, we propose StableI2I, a unified and dynamic evaluation framework that explicitly measures content fidelity and pre-post consistency across a wide rang...

📄 Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning
🗓️ Published: 5/6/2026
🔗 http://arxiv.org/abs/2605.04431v1
👥 Authors: Lingzhe Zhang, Tong Jia, Yunpeng Zhai, Liancheng Fang, Kening Zheng, Hongyi Liu (possible past Amazon (United States) affiliation), Xiaosong Huang, Philip S. Yu (possible past Tsinghua University affiliation), Ying Li (possible past Meta (United States) affiliation)
Abstract

Reinforcement fine-tuning (RFT) has become a core paradigm for post-training large language models, yet its training process remains highly fragile. Existing efforts mainly improve reliability at the system level or address specific issues in individual subproblems by modifying RFT algorithms. Despite their effectiveness, they largely overlook the problem of failure management at the training-process level. When training goes wrong, practitioners still rely heavily on expert-driven manual inspec...

📄 Resilient AI Supercomputer Networking using MRC and SRv6
🗓️ Published: 5/5/2026
🔗 http://arxiv.org/abs/2605.04333v1
👥 Authors: Joao Araujo, Alex Chow, Mark Handley, Ryder Lewis, Christoph Paasch, Jitendra Padhye, Michael Papamichael, Greg Steinbrecher, Amin Tootoonchian, Lihua Yuan, S. Anantharamu, Abhishek Dosi, Mohit Garg, Mahdieh Ghazi, Torsten Hoefler (possible past Eth Zurich affiliation), Deepal Jayasinghe, Jithin Jose, Abdul Kabbani (possible past Google (United States) affiliation), Guohan Lu, Yang Wang (possible past Baidu (China) affiliation), K. Doddapaneni, Murali Garimella, Vipin Jain, Yanfang Le, H. Nagulapalli, S. Narayanan, Rong Pan, Rathina Sabesan, Raghava Sivaramu, Rip Sohan, Eric Davis, Dragos Dumitrescu, Mohan Kalkunte, Bhaswar Mitra, Guglielmo Morandin, Adrian Popa, Costin Raiciu, Eric Spada, John Spillane, Niranjan Vaidya, Aviv Barnea, Idan Burstein, Elazar Cohen, Yamin Friedman, Noam Katz, Masoud Moshref, Yuval Shpigelman, Shahaf Shuler, Shy Shyman, Sayantan Sur
Abstract

Tail latency dominates the performance of synchronous pretraining jobs when running at very large scales. We describe a three-pronged approach: (1) a new RDMA-based transport protocol, MRC, which sprays across many paths and actively load-balances between them, eliminating the issue of flow collisions; (2) the use of multi-plane Clos topologies to get the benefits of high switch radix and redundancy, allowing training clusters of well over 100K GPUs to be built as two-tier topologies while increasing phys...
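The flow-collision problem that per-packet spraying addresses can be illustrated with a toy simulation (parameters are assumptions for illustration): ECMP-style hashing pins every packet of a flow to one path, so a few flows hashing to the same path create a hotspot, while spraying keeps per-path load near the mean.

```python
import random
import zlib
from collections import Counter

random.seed(0)
N_PATHS, N_FLOWS, PKTS_PER_FLOW = 8, 16, 1000

# ECMP-style flow hashing: every packet of a flow is pinned to one path,
# so flows that hash to the same path collide and create hotspots.
flow_load = Counter()
for flow in range(N_FLOWS):
    path = zlib.crc32(f"flow-{flow}".encode()) % N_PATHS  # deterministic hash
    flow_load[path] += PKTS_PER_FLOW

# Per-packet spraying (the MRC idea): each packet independently picks a path,
# keeping every path's load close to the mean.
spray_load = Counter()
for _ in range(N_FLOWS * PKTS_PER_FLOW):
    spray_load[random.randrange(N_PATHS)] += 1

mean = N_FLOWS * PKTS_PER_FLOW // N_PATHS
print("max path load, flow hashing:", max(flow_load.values()), "(mean", mean, ")")
print("max path load, spraying:   ", max(spray_load.values()), "(mean", mean, ")")
```

With 16 flows on 8 paths the hashed max load is at least the 2000-packet mean and typically well above it, while the sprayed max stays within a few percent of the mean.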

📄 SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment
🗓️ Published: 5/5/2026
🔗 http://arxiv.org/abs/2605.04012v1
👥 Authors: Joseph Breda, Fadi Yousif, Beszel Hawkins, Marinela Cotoi, Miao Liu, Ray Luo, Po-Hsuan Cameron Chen (possible past Google (United States) affiliation), Mike Schaekermann (possible past Google (United States) affiliation), Samuel Schmidgall, Xin Liu, Girish Narayanswamy, Samuel Solomon, Maxwell A. Xu, Xiaoran Fan, Longfei Shangguan, Anran Wang, Bhavna Daryani, Buddy Herkenham, Cara Tan, Mark Malhotra, Shwetak Patel, John B. Hernandez, Quang Duong, Yun Liu (possible past Google (United States) affiliation), Zach Wasson, Dimitrios Antos, Bob Lou, Matthew Thompson, Jonathan Richina, Anupam Pathak, Nichole Young-Lin, Jake Sunshine, Daniel Mcduff (possible past Google (United States) affiliation)
Abstract

Language models excel at diagnostic assessments on curated medical case studies and vignettes, performing on par with, or better than, clinical professionals. However, existing studies focus on complex scenarios with rich context, making it difficult to draw conclusions about how these systems perform for patients reporting symptoms in everyday life. We deployed SymptomAI, a set of conversational AI agents for end-to-end patient interviewing and differential diagnosis (DDx), via the Fitbit app i...

📄 iWorld-Bench: A Benchmark for Interactive World Models with a Unified Action Generation Framework
🗓️ Published: 5/5/2026
🔗 http://arxiv.org/abs/2605.03941v2
👥 Authors: Jianjie Fang, Yingshan Lei, Qin Wan, Ziyou Wang, Yuchao Huang, Yongyan Xu, Baining Zhao, Weichen Zhang, Chen Gao, Xinlei Chen (possible past Tsinghua University affiliation), Yong Li (possible past Tsinghua University affiliation)
Abstract

Achieving Artificial General Intelligence (AGI) requires agents that learn and interact adaptively, with interactive world models providing scalable environments for perception, reasoning, and action. Yet current research still lacks large-scale datasets and unified benchmarks to evaluate their physical interaction capabilities. To address this, we propose iWorld-Bench, a comprehensive benchmark for training and testing world models on interaction-related abilities such as distance perception an...

📄 Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
🗓️ Published: 5/5/2026
🔗 http://arxiv.org/abs/2605.04128v1
👥 Authors: Lin Song (possible past Tencent (China) affiliation), Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang (possible past Tencent (China) affiliation), Yuan Zhang (possible past Google (United States) affiliation), Yijun Yang, Yicheng Xiao, Jianhui Liu, Yanbing Zhang, Guohui Zhang, Wenhu Zhang, Hang Xu, Nan Jiang (possible past Stanford University affiliation), Xin Han, Haoze Sun, Maoquan Zhang, Haoyang Huang, Nan Duan
Abstract

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervisi...

📄 RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models
🗓️ Published: 5/5/2026
🔗 http://arxiv.org/abs/2605.03821v1
👥 Authors: Hao Wu (possible past Tencent (China) affiliation), Yuqi Li, Yuan Gao (possible past Tencent (China) affiliation), Fan Xu, Fan Zhang, Kun Wang, Penghao Zhao, Qiufeng Wang, Yizhou Zhao, Weiyan Wang, Yingli Tian, Xian Wu (possible past Tencent (China) affiliation), Xiaomeng Huang
Abstract

Existing robot video world models are typically trained with low-level objectives such as reconstruction and perceptual similarity, which are poorly aligned with the capabilities that matter most for robot decision making, including instruction following, manipulation success, and physical plausibility. They also suffer from error accumulation in long-horizon autoregressive prediction. We present RoboAlign-R1, a framework that combines reward-aligned post-training with stabilized long-horizon in...

📄 Agentic-imodels: Evolving agentic interpretability tools via autoresearch
🗓️ Published: 5/5/2026
🔗 http://arxiv.org/abs/2605.03808v1
👥 Authors: Chandan Singh, Yan Shuo Tan, Weijia Xu, Zelalem Gero, Weiwei Yang, Michel Galley (possible past Microsoft (United States) affiliation), Jianfeng Gao (possible past Microsoft (United States) affiliation)
Abstract

Agentic data science (ADS) systems are rapidly improving their capability to autonomously analyze, fit, and interpret data, potentially moving towards a future where agents conduct the vast majority of data-science work. However, current ADS systems use statistical tools designed to be interpretable by humans, rather than interpretable by agents. To address this, we introduce Agentic-imodels, an agentic autoresearch loop that evolves data-science tools designed to be interpretable by agents. Spe...

📄 Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
🗓️ Published: 5/6/2026
🔗 http://arxiv.org/abs/2605.05115v1
👥 Authors: Daniel Wurgaft, Can Rager, Matthew Kowal, Vasudev Shyam, Sheridan Feucht, Usha Bhalla, Tal Haklay, Eric Bigelow, Raphael Sarfati, Thomas Mcgrath (possible past Google (United States) affiliation), Owen Lewis, Jack Merullo, Noah Goodman, Thomas Fel, Atticus Geiger (possible past Stanford University affiliation), Ekdeep Singh Lubana
Abstract

Neural representations carry rich geometric structure, but does that structure causally shape behavior? To address this question, we intervene along paths through activation space defined by different geometries, and measure the behavioral trajectories they induce. In particular, we test whether interventions that respect the geometry of activation space will yield behaviors close to those the model exhibits naturally. Concretely, we first fit an activation manifold $M_h$ to representations and ...

📄 Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime
🗓️ Published: 5/6/2026
🔗 http://arxiv.org/abs/2605.05112v1
👥 Authors: Tianshu Zhu, Wenyu Zhang (possible past Tencent (China) affiliation), Xiaoying Zuo, Lun Tian, Haotian Zhao, Yucheng Zeng, Jingnan Gu, Daxiang Dong (possible past Baidu (China) affiliation), Jianmin Wu, Dawei Yin (possible past Baidu (China) affiliation), Dou Shen (possible past Baidu (China) affiliation)
Abstract

SWE-bench-style agentic reinforcement learning relies on expensive stateful trajectories, yet substantial compute is wasted on sampled rollout groups with skewed pass rates, where binary rewards provide a weak contrastive signal. We frame this inefficiency as a pass-rate control problem and show that a 50% pass rate is the most informative operating point: it maximizes reward entropy, the probability of surviving group filtering, RLOO advantage energy under GRPO, and success-failure contrastive...
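The 50% operating point the abstract identifies follows from basic properties of a Bernoulli reward; a quick numeric check (group size of 8 for the filtering term is an assumed value, not from the paper):

```python
import math

def reward_entropy(p: float) -> float:
    """Shannon entropy (nats) of a binary reward with pass rate p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def survives_filter(p: float, n: int = 8) -> float:
    """P(a group of n i.i.d. binary rollouts is neither all-pass nor all-fail),
    i.e., the group carries a contrastive signal and survives filtering."""
    return 1.0 - p**n - (1.0 - p)**n

rates = [i / 100 for i in range(1, 100)]
print("entropy argmax:        ", max(rates, key=reward_entropy))   # 0.5
print("filter-survival argmax:", max(rates, key=survives_filter))  # 0.5
```

Both quantities are symmetric around 0.5 and peak there, which is the "most informative regime" the title refers to.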

📄 Gated Multimodal Learning for Interpretable Property Energy Performance Prediction and Retrofit Scenario Analysis
🗓️ Published: 5/6/2026
🔗 http://arxiv.org/abs/2605.05088v1
👥 Authors: Yunfei Bai (possible past Google (United States) affiliation), Aaron Tesfa Tsion, Raul Rosales, Barbara Shollock, Wei He (possible past Baidu (China) affiliation)
Abstract

Achieving resilient and sustainable cities requires scalable approaches to decarbonising residential buildings, which account for about 20% of UK greenhouse gas emissions and 25% of energy-related emissions in the European Union. Energy Performance Certificates (EPCs) support regulation and retrofit planning, but their reliance on on-site inspections limits timely city-scale assessment. This study introduces a gated multimodal model to predict Standard Assessment Procedure (SAP) energy efficienc...

📄 KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels
🗓️ Published: 5/6/2026
🔗 http://arxiv.org/abs/2605.04956v1
👥 Authors: Han Wang (possible past Peking University affiliation), Jintao Zhang, Kai Jiang, Haoxu Wang, Jianfei Chen, Jun Zhu (possible past Tsinghua University affiliation)
Abstract

LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability break down, and why? We present KernelBench-X, a benchmark designed to answer this question through category-aware evaluation of correctness and hardware efficiency across 176 tasks in 15 categories. Our systematic comparison of five representative methods yields three main findings. First, task structure determines correctness more than metho...

📄 To Fuse or to Drop? Dual-Path Learning for Resolving Modality Conflicts in Multimodal Emotion Recognition
🗓️ Published: 5/6/2026
🔗 http://arxiv.org/abs/2605.04877v1
👥 Authors: Yangchen Yu, Qian Chen (possible past Shanghai Jiao Tong University affiliation), Jia Li (possible past Google (United States) affiliation), Zhenzhen Hu, Jinpeng Hu, Lizi Liao, Erik Cambria, Richang Hong
Abstract

Multimodal emotion recognition (MER) benefits from combining text, audio, and vision, yet standard fusion often fails when modalities conflict. Crucially, conflicts differ in resolvability: benign conflicts stem from missing, weak, or ambiguous cues and can be mitigated by cross-modal calibration, while severe conflicts arise from intrinsically contradictory (e.g., sarcasm) or misleading signals, for which forced fusion may amplify errors. Recognizing this, we propose Dual-Path Conflict Resoluti...
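A minimal sketch of the fuse-or-drop decision described above. Everything here is an illustrative assumption, not the paper's architecture: the averaging fusion, the hard threshold, and the peakedness heuristic for picking a dominant modality are all placeholders for the learned dual-path model.

```python
def fuse_or_drop(text_logits, audio_logits, conflict, tau=0.5):
    """Illustrative gate: under severe conflict (score > tau), drop fusion and
    trust the dominant (more peaked) modality; otherwise fuse across modalities."""
    fused = [(t + a) / 2 for t, a in zip(text_logits, audio_logits)]
    # Dominant modality: the one with the more peaked (confident) logits.
    dominant = max((text_logits, audio_logits), key=lambda l: max(l) - min(l))
    gate = 1.0 if conflict > tau else 0.0
    return [gate * d + (1 - gate) * f for d, f in zip(dominant, fused)]

text, audio = [2.0, -1.0, 0.0], [-2.0, 1.0, 0.0]
print(fuse_or_drop(text, audio, conflict=0.9))  # severe: falls back to one modality
print(fuse_or_drop(text, audio, conflict=0.1))  # benign: averaged logits
```

The contradictory logits above show why forced fusion can amplify errors: averaging two confident but opposed modalities yields an uninformative all-zero prediction, whereas the drop path preserves one modality's signal.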

📄 Benchmarking LLMs on the Massive Sound Embedding Benchmark (MSEB)
🗓️ Published: 5/6/2026
🔗 http://arxiv.org/abs/2605.04556v1
👥 Authors: Cyril Allauzen (possible past Google (United States) affiliation), Tom Bagby (possible past Google (United States) affiliation), Georg Heigold (possible past Google (United States) affiliation), Ehsan Variani (possible past Google (United States) affiliation), Ke Wu
Abstract

The Massive Sound Embedding Benchmark (MSEB) has emerged as a standard for evaluating the functional breadth of audio models. While initial baselines focused on specialized encoders, the shift toward "audio-native" Large Language Models (LLMs) suggests a new paradigm where a single multimodal backbone may replace complex, task-specific pipelines. This paper provides a rigorous empirical evaluation of leading LLMs, including members of the Gemini and GPT families, across the eight core MSEB c...

*Notable papers are those with at least two authors from a "big" AI/ML lab.