📄 Notable* Recent AI/ML arXiv Papers


📄 Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
🗓️ Published: 4/10/2026
🔗 http://arxiv.org/abs/2604.09544v1
👥 Authors: Hadas Orgad, Boyi Wei, Kaden Zheng, Martin Wattenberg (possible past Google (United States) affiliation), Peter Henderson (possible past Stanford University affiliation), Seraphina Goldfarb-Tarrant, Yonatan Belinkov
Abstract

Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. ...

📄 VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning
🗓️ Published: 4/10/2026
🔗 http://arxiv.org/abs/2604.09508v1
👥 Authors: Yucheng Shen, Jiulong Wu, Jizhou Huang (possible past Baidu (China) affiliation), Dawei Yin (possible past Baidu (China) affiliation), Lingyong Yan, Min Cao
Abstract

Visual Retrieval-Augmented Generation (VRAG) empowers Vision-Language Models to retrieve and reason over visually rich documents. To tackle complex queries requiring multi-step reasoning, agentic VRAG systems interleave reasoning with iterative retrieval. However, existing agentic VRAG systems face two critical bottlenecks. (1) Visual Evidence Sparsity: key evidence is scattered across pages yet processed in isolation, hindering cross-page reasoning; moreover, fine-grained intra-image evidence often r...

📄 E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning
🗓️ Published: 4/10/2026
🔗 http://arxiv.org/abs/2604.09455v1
👥 Authors: Weiyang Guo, Zesheng Shi, Liye Zhao, Jiayuan Ma, Zeen Zhu, Junxian He (possible past Carnegie Mellon University affiliation), Min Zhang (possible past Tsinghua University affiliation), Jing Li (possible past Tencent (China) affiliation)
Abstract

While Large Language Models (LLMs) have demonstrated significant potential in Tool-Integrated Reasoning (TIR), existing training paradigms face notable limitations: Zero-RL suffers from inefficient exploration and mode degradation due to a lack of prior guidance, while SFT-then-RL is limited by high data costs and capability plateaus caused by low-entropy collapse. To address these challenges, we propose E3-TIR (Enhanced Experience Exploitation), a warm-up paradigm for the early stages of ag...

📄 ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
🗓️ Published: 4/10/2026
🔗 http://arxiv.org/abs/2604.09450v1
👥 Authors: Lifeng Chen, Tianqi You, Hao Liu (possible past Tencent (China) affiliation), Zhimin Bao, Jile Jiao, Xiao Han (possible past Tencent (China) affiliation), Zhicai Ou, Tao Sun, Xiaofeng Mou, Xiaojie Jin (possible past National University Of Singapore affiliation), Yi Xu
Abstract

Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but often degrades textual coherence du...

📄 PhysInOne: Visual Physics Learning and Reasoning in One Suite
🗓️ Published: 4/10/2026
🔗 http://arxiv.org/abs/2604.09415v1
👥 Authors: Siyuan Zhou, Hejun Wang, Hu Cheng, Jinxi Li, Dongsheng Wang, Junwei Jiang, Yixiao Jin, Jiayue Huang, Shiwei Mao, Shangjia Liu, Yafei Yang, Hongkang Song, Shenxing Wei, Zihui Zhang, Peng Huang, Shijie Liu, Zhengli Hao, Hao Li (possible past Tsinghua University affiliation), Yitian Li, Wenqi Zhou, Zhihan Zhao, Zongqi He, Hongtao Wen, Shouwang Huang, Peng Yun, Bowen Cheng (possible past Google (United States) affiliation), Pok Kazaf Fu, Wai Kit Lai, Jiahao Chen, Kaiyuan Wang, Zhixuan Sun, Ziqi Li, Haochen Hu, Di Zhang, Chun Ho Yuen, Bing Wang, Zhihua Wang, Chuhang Zou, Bo Yang (possible past Tencent (China) affiliation)
Abstract

We present PhysInOne, a large-scale synthetic dataset addressing the critical scarcity of physically-grounded training data for AI systems. Unlike existing datasets limited to merely hundreds or thousands of examples, PhysInOne provides 2 million videos across 153,810 dynamic 3D scenes, covering 71 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism. Distinct from previous works, our scenes feature multi-object interactions against complex backgrounds, with comprehensive ...

📄 SAGE: A Service Agent Graph-guided Evaluation Benchmark
🗓️ Published: 4/10/2026
🔗 http://arxiv.org/abs/2604.09285v1
👥 Authors: Ling Shi, Yuqin Dai, Ziyin Wang, Ning Gao, Wei Zhang (possible past Tsinghua University affiliation), Chaozheng Wang, Yujie Wang, Wei He (possible past Baidu (China) affiliation), Jinpeng Wang (possible past Tencent (China) affiliation), Deyi Xiong
Abstract

The development of Large Language Models (LLMs) has catalyzed automation in customer service, yet benchmarking their performance remains challenging. Existing benchmarks predominantly rely on static paradigms and single-dimensional metrics, failing to account for diverse user behaviors or the strict adherence to structured Standard Operating Procedures (SOPs) required in real-world deployments. To bridge this gap, we propose SAGE (Service Agent Graph-guided Evaluation), a universal multi-agent b...

📄 Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization
🗓️ Published: 4/10/2026
🔗 http://arxiv.org/abs/2604.09253v1
👥 Authors: Yuqin Lan, Gen Li (possible past University Of Edinburgh affiliation), Yuanze Hu, Weihao Shen, Zhaoxin Fan, Faguo Wu, Xiao Zhang (possible past Tsinghua University affiliation), Laurence T. Yang, Zhiming Zheng
Abstract

Vision-Language Models (VLMs) are powerful but remain vulnerable to multimodal jailbreak attacks. Existing attacks mainly rely on either explicit visual prompt attacks or gradient-based adversarial optimization. While the former is easier to detect, the latter produces subtle perturbations that are less perceptible, but is usually optimized and evaluated under homogeneous open-source surrogate-target settings, leaving its effectiveness on commercial closed-source VLMs under heterogeneous setting...

📄 Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition
🗓️ Published: 4/10/2026
🔗 http://arxiv.org/abs/2604.09121v1
👥 Authors: Peng Wang (possible past Peking University affiliation), Yanqiao Zhu, Zixuan Jiang, Qinyuan Chen, Xingjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu (possible past Baidu (China) affiliation), Xie Chen
Abstract

Recent years have witnessed remarkable progress in automatic speech recognition (ASR), driven by advances in model architectures and large-scale training data. However, two important aspects remain underexplored. First, Word Error Rate (WER), the dominant evaluation metric for decades, treats all words equally and often fails to reflect the semantic correctness of an utterance at the sentence level. Second, interactive correction, an essential component of human communication, has rarely been syst...
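
The WER critique above is easy to see in code: standard WER is plain word-level edit distance, so a semantically decisive word and a filler word incur the same penalty. A minimal illustrative implementation (not code from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein distance.
    Every substitution, insertion, and deletion costs exactly 1,
    regardless of how much meaning the affected word carries."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

Note that swapping "off" for "on" in "turn off the oven" gives the same 0.25 WER as any other single-word slip, even though it inverts the sentence's meaning.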

📄 Large-Scale Universal Defect Generation: Foundation Models and Datasets
🗓️ Published: 4/10/2026
🔗 http://arxiv.org/abs/2604.08915v1
👥 Authors: Yuanting Fan, Jun Liu (possible past Tencent (China) affiliation), Bin-Bin Gao, Xiaochen Chen, Yuhuan Lin, Zhewei Dai, Jiawei Zhan, Chengjie Wang (possible past Tencent (China) affiliation)
Abstract

Existing defect/anomaly generation methods often rely on few-shot learning, which overfits to specific defect categories due to the lack of large-scale paired defect editing data. This issue is aggravated by substantial variations in defect scale and morphology, resulting in limited generalization, degraded realism, and poor category consistency. We address these challenges by introducing UDG, a large-scale dataset of 300K normal-abnormal-mask-caption quadruplets spanning diverse domains, and by pres...

📄 HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
🗓️ Published: 4/10/2026
🔗 http://arxiv.org/abs/2604.08884v1
👥 Authors: Xinyu Zhang (possible past Baidu (China) affiliation), Zurong Mai, Qingmei Li, Zjin Liao, Yibin Wen, Yuhang Chen, Xiaoya Fan, Chan Tsz Ho, Bi Tianyuan, Haoyuan Liang, Ruifeng Su, Zihao Qian, Juepeng Zheng (possible past Tsinghua University affiliation), Jianxi Huang, Yutong Lu, Haohuan Fu (possible past Tsinghua University affiliation)
Abstract

While multimodal large language models (MLLMs) have made significant strides in natural image understanding, their ability to perceive and reason over hyperspectral images (HSI), a vital modality in remote sensing, remains underexplored. The high dimensionality and intricate spectral-spatial properties of HSI pose unique challenges for models primarily trained on RGB data. To address this gap, we introduce Hyperspectral Multimodal Benchmark (HM-Bench), the first benchmark designed specific...

📄 SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
🗓️ Published: 4/10/2026
🔗 http://arxiv.org/abs/2604.08865v1
👥 Authors: Tianyi Wang, Yixia Li, Long Li, Yibiao Chen, Shaohan Huang, Yun Chen, Peng Li (possible past Tsinghua University affiliation), Yang Liu (possible past Tsinghua University affiliation), Guanhua Chen
Abstract

Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimatio...
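
As a rough illustration of the sequence-level idea, the clipped surrogate can use one importance ratio and one advantage per full response instead of per token. This is a generic sketch of that shift, not the paper's exact objective:

```python
import math

def sequence_ppo_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Illustrative sequence-level clipped surrogate loss.
    logp_new / logp_old: per-token log-probs of one full response under
    the current and behavior policies; advantage: one scalar per sequence.
    (A sketch of the general idea, not SPPO's actual formulation.)"""
    # One ratio for the whole sequence: pi_new(y|x) / pi_old(y|x)
    ratio = math.exp(sum(logp_new) - sum(logp_old))
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    # PPO maximizes min(ratio*A, clip(ratio)*A); return the negative to minimize
    return -min(ratio * advantage, clipped * advantage)
```

Collapsing credit assignment to the sequence level sidesteps per-token value estimates entirely, which is the instability the abstract attributes to token-level PPO on long CoT horizons.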

📄 Hidden in Plain Sight: Visual-to-Symbolic Analytical Solution Inference from Field Visualizations
🗓️ Published: 4/10/2026
🔗 http://arxiv.org/abs/2604.08863v1
👥 Authors: Pengze Li, Jiaquan Zhang, Yunbo Long, Xinping Liu, Zhou Wenjie, Encheng Su, Zihang Zeng, Jiaqi Liu, Jiyao Liu, Junchi Yu, Lihao Liu, Philip Torr (possible past University Of Oxford affiliation), Shixiang Tang, Aoran Wang, Xi Chen (possible past University Of California, Berkeley affiliation)
Abstract

Recovering analytical solutions of physical fields from visual observations is a fundamental yet underexplored capability for AI-assisted scientific reasoning. We study visual-to-symbolic analytical solution inference (ViSA) for two-dimensional linear steady-state fields: given field visualizations (and first-order derivatives) plus minimal auxiliary metadata, the model must output a single executable SymPy expression with fully instantiated numeric constants. We introduce ViSA-R2 and align it w...

📄 HiFloat4 Format for Language Model Pre-training on Ascend NPUs
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08826v1
👥 Authors: Mehran Taghian, Yunke Peng, Xing Huang, Yao Wang, Yaoyuan Wang, Wei Guo, Yuanyong Luo, Tianchi Hu, Junsong Wang, Xin Wang (possible past University Of Edinburgh affiliation), Hu Liu, Yu Cheng (possible past National University Of Singapore affiliation), Ziwei Yu, Hongliang Li, Mehdi Rahimifar, Lei Yan, Xuefei Wang, Zhuang Ma, Lei Liu, Hui Yu, Anandharaju Durai Raju, Hoang Le, Hei Yi Mak, Tanzila Rahman, Shadan Golestan
Abstract

Large foundation models have become central to modern machine learning, with performance scaling predictably with model size and data. However, training and deploying such models incur substantial computational and memory costs, motivating the development of low-precision training techniques. Recent work has demonstrated that 4-bit floating-point (FP4) formats, such as MXFP4 and NVFP4, can be successfully applied to linear GEMM operations in large language models (LLMs), achieving up to 4x impro...
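
For intuition about what feeding FP4 values into a GEMM involves, here is a toy block quantizer with one shared scale per block. The magnitude grid mimics FP4 E2M1 values; this is an illustrative assumption about MX-style formats generally, not the HiFloat4 format the paper proposes:

```python
def quantize_block_fp4(values):
    """Toy block quantization: one shared scale per block, each value
    snapped to the nearest magnitude on a small 4-bit-style grid.
    The grid lists FP4 E2M1 positive magnitudes (illustrative only;
    not the HiFloat4 format itself)."""
    grid = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
    peak = max(abs(v) for v in values)
    scale = peak / grid[-1] if peak > 0 else 1.0
    out = []
    for v in values:
        # Nearest representable magnitude after dividing out the block scale
        mag = min(grid, key=lambda g: abs(g - abs(v) / scale))
        out.append(mag * scale if v >= 0 else -mag * scale)
    return out, scale
```

The coarse grid is why the rounding error, and hence the choice of scaling granularity, dominates the accuracy story for 4-bit training formats.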

📄 Accelerating Transformer-Based Monocular SLAM via Geometric Utility Scoring
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08718v1
👥 Authors: Xinmiao Xiong, Bangya Liu, Hao Wang (possible past Tsinghua University affiliation), Dayou Li, Nuo Chen, Andrew Feng (possible past Nvidia (United States) affiliation), Mingyu Ding, Suman Banerjee, Yang Zhou, Zhiwen Fan
Abstract

Geometric Foundation Models (GFMs) have recently advanced monocular SLAM by providing robust, calibration-free 3D priors. However, deploying these models on dense video streams introduces significant computational redundancy. Current GFM-based SLAM systems typically rely on post hoc keyframe selection. Because of this, they must perform expensive dense geometric decoding simply to determine whether a frame contains novel geometry, resulting in late rejection and wasted computation. To mitigate t...

📄 AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08540v1
👥 Authors: Ziwei Zhou, Zeyuan Lai, Rui Wang (possible past Tencent (China) affiliation), Yifan Yang (possible past Tencent (China) affiliation), Zhen Xing, Yuqing Yang, Qi Dai, Lili Qiu, Chong Luo (possible past Google (United States) affiliation)
Abstract

Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we pro...

📄 ClawBench: Can AI Agents Complete Everyday Online Tasks?
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08523v1
👥 Authors: Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao, Xuan Lu, Wendong Xu, Yunzhuo Hao, Songcheng Cai, Xiaochen Wang, Huaisong Zhang, Xian Wu (possible past Tencent (China) affiliation), Yi Lu, Minyi Lei, Kai Zou, Huifeng Yin, Ping Nie, Liang Chen (possible past Google (United States) affiliation), Dongfu Jiang, Wenhu Chen, Kelsey R. Allen
Abstract

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These ...

📄 Synthetic Data for any Differentiable Target
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08423v1
👥 Authors: Tristan Thrush (possible past Hugging Face affiliation), Sung Min Park, Herman Brunborg, Luke Bailey, Marcel Roed, Neil Band, Christopher Potts (possible past Tencent (China) affiliation), Tatsunori Hashimoto (possible past Stanford University affiliation)
Abstract

What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to perform well on a differentiable metric of our choice. Our approach achieves this by taking exact data attribution via higher-order gradi...

📄 TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08384v1
👥 Authors: Jing Peng, Chenghao Wang, Yi Yang (possible past Baidu (China) affiliation), Lirong Qian, Junjie Li, Yu Xi, Shuai Wang, Kai Yu (possible past Baidu (China) affiliation)
Abstract

Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcripts, but they provide limited control over uncertainty and error rate, making curriculum design largely heuristic. We propose TASU2, a controllable CTC simulation framework that simulates CTC posterior distributio...

📄 SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08377v1
👥 Authors: Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang (possible past Baidu (China) affiliation), Yiming Hu (possible past Tsinghua University affiliation), Tongwen Huang, Xiangxiang Chu
Abstract

Large language model (LLM) agents such as OpenClaw rely on reusable skills to perform complex tasks, yet these skills remain largely static after deployment. As a result, similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from improving with experience. While interactions from different users provide complementary signals about when a skill works or fails, existing systems lack a mechanism to convert such heterogeneous experi...

📄 Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08362v1
👥 Authors: Jiawei Chen (possible past Tencent (China) affiliation), Ruoxi Xu, Boxi Cao, Ruotong Pan, Yunfei Zhang, Yifei Hu, Yong Du, Tingting Gao, Yaojie Lu, Yingfei Sun, Xianpei Han (possible past Tencent (China) affiliation), Le Sun, Xiangyu Wu, Hongyu Lin
Abstract

The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns ...

📄 PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08340v1
👥 Authors: Ruizhi Zhang, Ye Huang, Yuangang Pan, Chuanfu Shen, Zhilin Liu, Ting Xie, Wen Li (possible past Eth Zurich affiliation), Lixin Duan (possible past Amazon (United States) affiliation)
Abstract

While Vision-Language Models (VLMs) have achieved remarkable progress in static visual understanding, their deployment in complex 3D embodied environments remains severely limited. Existing benchmarks suffer from four critical deficiencies: (1) passive perception tasks circumvent interactive dynamics; (2) simplified 2D environments fail to assess depth perception; (3) privileged state leakage bypasses genuine visual processing; and (4) human evaluation is prohibitively expensive and unscalable. ...

📄 HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08232v1
👥 Authors: He Zhao (possible past Tencent (China) affiliation), Yijun Yang, Zichuan Lin, Deheng Ye (possible past Tencent (China) affiliation), Chunyan Miao
Abstract

Embodied navigation agents built upon large reasoning models (LRMs) can handle complex, multimodal environmental input and perform grounded reasoning per step to improve sequential decision-making for long-horizon tasks. However, a critical question remains: how can the reasoning capabilities of LRMs be harnessed intelligently and efficiently for long-horizon navigation tasks? In simple scenes, agents are expected to act reflexively, while in complex ones they should engage in deliberat...

📄 Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
🗓️ Published: 4/10/2026
🔗 http://arxiv.org/abs/2604.09258v1
👥 Authors: Huanran Chen, Huaqing Zhang, Xiao Li, Yinpeng Dong (possible past Tsinghua University affiliation), Ke Shen, Jun Zhu (possible past Tsinghua University affiliation)
Abstract

Pretraining is the cornerstone of Large Language Models (LLMs), dominating the vast majority of computational budget and data to serve as the primary engine for their capabilities. During pretraining, LLMs acquire foundational knowledge from unprecedentedly massive and diverse data sources, encompassing a vast array of domains such as general language, mathematics, code, and complex reasoning. In this work, we investigate an interesting geometric question regarding the converged state of pret...

📄 Feature-Label Modal Alignment for Robust Partial Multi-Label Learning
🗓️ Published: 4/10/2026
🔗 http://arxiv.org/abs/2604.09064v1
👥 Authors: Yu Chen (possible past Meta (United States) affiliation), Weijun Lv, Yue Huang, Xiaozhao Fang, Jie Wen, Yong Xu (possible past Tencent (China) affiliation), Guanbin Li
Abstract

In partial multi-label learning (PML), each instance is associated with a set of candidate labels containing both ground-truth and noisy labels. The presence of noisy labels disrupts the correspondence between features and labels, degrading classification performance. To address this challenge, we propose a novel PML method based on feature-label modal alignment (PML-MA), which treats features and labels as two complementary modalities and restores their consistency through systematic alignment....

📄 SOLAR: Communication-Efficient Model Adaptation via Subspace-Oriented Latent Adapter Reparametrization
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08368v1
👥 Authors: Seyed Mahmoud Sajjadi Mohammadabadi, Xiaolong Ma, Lei Yang (possible past Google (United States) affiliation), Feng Yan (possible past Meta (United States) affiliation), Junshan Zhang
Abstract

Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, enable scalable adaptation of foundation models by injecting low-rank adapters. However, their communication and storage costs remain a major bottleneck in resource-constrained settings. We propose SOLAR (Subspace-Oriented Latent Adapter Reparameterization), a post-training compression framework that substantially reduces the communication cost (i.e., the number of parameters to transmit or store) of PEFT adapters. SOLAR expresses eac...
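
To make the communication-cost framing concrete: a LoRA-style adapter already transmits two low-rank factors instead of the full weight update, and subspace methods like SOLAR aim to shrink those factors further. A minimal sketch of the baseline factorization (assumed notation, not the paper's code):

```python
def reconstruct_delta(B, A):
    """Rebuild a LoRA-style weight update ΔW = B · A from its factors.
    Sending B (d×r) and A (r×k) costs r·(d+k) numbers versus d·k for
    ΔW itself; a subspace reparameterization would compress B and A
    again by expressing them in a shared latent basis (sketch only)."""
    return [[sum(b * a for b, a in zip(row, col)) for col in zip(*A)]
            for row in B]
```

For a 4096×4096 layer at rank 8, the factors total 8·(4096+4096) ≈ 65K numbers against ~16.8M for the dense update, and post-training compression of the factors targets the remaining 65K.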

📄 Small Vision-Language Models are Smart Compressors for Long Video Understanding
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08120v1
👥 Authors: Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen (possible past Google (United States) affiliation), Junlin Han, Mingchen Zhuge, Saksham Suri, Qi Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra (possible past Meta (United States) affiliation), Mohamed Elhoseiny (possible past Meta (United States) affiliation), Chenchen Zhu
Abstract

Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework compressing long videos for downstream understanding. Tempo leverages a Small ...

*Notable papers are those with at least two authors from a "big" AI/ML lab.