πŸ“„ Notable* Recent AI/ML arXiv Papers


πŸ“„ Understanding and Enforcing Weight Disentanglement in Task Arithmetic
πŸ—“οΈ Published: 4/18/2026
πŸ”— http://arxiv.org/abs/2604.17078v1
πŸ‘₯ Authors: Shangge Liu, Yuehan Yin, Lei Wang (possible past Baidu (China) affiliation), Qi Fan, Yinghuan Shi, Wenbin Li, Yang Gao (possible past Tencent (China) affiliation), Dacheng Tao
Abstract

Task arithmetic provides an efficient, training-free way to edit pre-trained models, yet lacks a fundamental theoretical explanation for its success. The existing concept of "weight disentanglement" describes the ideal outcome of non-interfering task composition but does not reveal its underlying cause. Crucially, what intrinsic properties of the pre-trained model ($ΞΈ_0$) or the task vectors ($Ο„_t$) enable this disentanglement remains underexplored. In this paper, we introduce Task-Feature Spec...
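The basic operation behind task arithmetic can be made concrete in a few lines. The sketch below (plain NumPy, with illustrative names) forms task vectors $Ο„_t = ΞΈ_t βˆ’ ΞΈ_0$ and composes them onto the pre-trained weights; when task vectors touch disjoint weights, as in this toy, the composition is exactly non-interfering, which is the ideal that "weight disentanglement" describes:

```python
import numpy as np

def task_vector(theta_ft, theta_0):
    """Task vector tau_t = theta_t - theta_0, per parameter tensor."""
    return {k: theta_ft[k] - theta_0[k] for k in theta_0}

def apply_task_arithmetic(theta_0, task_vectors, coeffs):
    """Edit the pre-trained model: theta_0 + sum_t lambda_t * tau_t."""
    edited = {k: v.copy() for k, v in theta_0.items()}
    for tau, lam in zip(task_vectors, coeffs):
        for k in edited:
            edited[k] += lam * tau[k]
    return edited

# Toy example: two "tasks" whose fine-tuning each shifted a different weight.
theta_0 = {"w": np.zeros(3)}
theta_a = {"w": np.array([1.0, 0.0, 0.0])}
theta_b = {"w": np.array([0.0, 2.0, 0.0])}
tau_a = task_vector(theta_a, theta_0)
tau_b = task_vector(theta_b, theta_0)
merged = apply_task_arithmetic(theta_0, [tau_a, tau_b], coeffs=[1.0, 1.0])
print(merged["w"])  # [1. 2. 0.] -- both edits land without interfering
```

Real task vectors overlap heavily, which is exactly why the disentanglement question the abstract raises is non-trivial.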

πŸ“„ Small Model as Master Orchestrator: Learning Unified Agent-Tool Orchestration with Parallel Subtask Decomposition
πŸ—“οΈ Published: 4/18/2026
πŸ”— http://arxiv.org/abs/2604.17009v1
πŸ‘₯ Authors: Wenzhen Yuan (possible past Massachusetts Institute of Technology affiliation), Wutao Xiong, Fanchen Yu, Shengji Tang, Ting Liu (possible past Google (United States) affiliation), Tao Chen, Peng Ye, Yuzhuo Fu, Wanli Ouyang, Lei Bai
Abstract

Multi-agent systems (MAS) demonstrate clear advantages in tackling complex problems by coordinating diverse agents and external tools. However, most existing orchestration methods rely on static workflows or serial agent scheduling, and are further constrained by heterogeneous interface protocols between tools and agents. This leads to high system complexity and poor extensibility. To mitigate these issues, we propose Agent-as-Tool, a unified parallel orchestration paradigm that abstracts both a...

πŸ“„ D-QRELO: Training- and Data-Free Delta Compression for Large Language Models via Quantization and Residual Low-Rank Approximation
πŸ—“οΈ Published: 4/18/2026
πŸ”— http://arxiv.org/abs/2604.16940v1
πŸ‘₯ Authors: Junlin Li, Shuangyong Song, Guodong Du, Ngai Wong, Xuebo Liu, Yongxiang Li, Min Zhang (possible past Tsinghua University affiliation), Jing Li (possible past Tencent (China) affiliation), Xuelong Li (possible past Tencent (China) affiliation)
Abstract

Supervised Fine-Tuning (SFT) accelerates the development of task-specific large language models (LLMs), but the resulting proliferation of fine-tuned models incurs substantial memory overhead. Delta compression addresses this by retaining a single pre-trained LLM with multiple compressed delta weights. However, existing methods fail on models fine-tuned with large-scale datasets. We find that larger SFT data scale amplifies delta parameter magnitude, singular values, and entropy, exacerbating compression ...
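The recipe in the title (quantize the delta ΞΈ_ft βˆ’ ΞΈ_0, then approximate what quantization lost with a low-rank factorization) can be sketched generically. The function names, bit width, and rank below are illustrative, not D-QRELO's actual API:

```python
import numpy as np

def compress_delta(delta, rank=4, n_bits=4):
    """Sketch: uniform quantization of the delta matrix, plus a rank-r
    SVD of the quantization residual (illustrative, not the paper's API)."""
    scale = np.abs(delta).max() / (2 ** (n_bits - 1) - 1)
    q = np.round(delta / scale).astype(np.int8)
    residual = delta - q * scale
    # low-rank approximation of the quantization error
    U, S, Vt = np.linalg.svd(residual, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (m, r)
    B = Vt[:rank]                # (r, n)
    return q, scale, (A, B)

def decompress_delta(q, scale, low_rank):
    """Reconstruct: dequantized delta + low-rank residual correction."""
    A, B = low_rank
    return q * scale + A @ B

rng = np.random.default_rng(0)
delta = rng.normal(size=(16, 16))
q, scale, lr = compress_delta(delta)
approx = decompress_delta(q, scale, lr)
err = np.linalg.norm(delta - approx) / np.linalg.norm(delta)
print(f"relative reconstruction error: {err:.3f}")
```

Both pieces are training- and data-free, matching the title's claim; the abstract's observation is that large SFT runs inflate the delta's magnitude, singular values, and entropy, which makes both stages above lossier.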

πŸ“„ ClimAgent: LLM as Agents for Autonomous Open-ended Climate Science Analysis
πŸ—“οΈ Published: 4/18/2026
πŸ”— http://arxiv.org/abs/2604.16922v1
πŸ‘₯ Authors: Hao Wang (possible past Tsinghua University affiliation), Jindong Han, Wei Fan (possible past Tencent (China) affiliation), Hao Liu (possible past Tencent (China) affiliation)
Abstract

Climate research is pivotal for mitigating global environmental crises, yet the accelerating volume of multi-scale datasets and the complexity of analytical tools have created significant bottlenecks, constraining scientific discovery to fragmented and labor-intensive workflows. While the emergence of Large Language Models (LLMs) offers a transformative paradigm to scale scientific expertise, existing explorations remain largely confined to simple Question-Answering (Q&A) tasks. These approaches of...

πŸ“„ Incentivizing Parametric Knowledge via Reinforcement Learning with Verifiable Rewards for Cross-Cultural Entity Translation
πŸ—“οΈ Published: 4/18/2026
πŸ”— http://arxiv.org/abs/2604.16881v1
πŸ‘₯ Authors: Jiang Zhou, Xiaohu Zhao, Xinwei Wu, Tianyu Dong, Hao Wang (possible past Tsinghua University affiliation), Yangyang Liu, Heng Liu (possible past Google (United States) affiliation), Linlong Xu, Longyue Wang (possible past Tencent (China) affiliation), Weihua Luo, Deyi Xiong
Abstract

Cross-cultural entity translation remains challenging for large language models (LLMs), which often yield literal or phonetic renderings instead of culturally appropriate translations in context. However, relevant knowledge may already be encoded in model parameters during large-scale pre-training. To incentivize the effective use of parametric knowledge, we propose EA-RLVR (Entity-Anchored Reinforcement Learning with Verifiable Rewards), a training framework that optimizes cross-cultural en...

πŸ“„ TowerDataset: A Heterogeneous Benchmark for Transmission Corridor Segmentation with a Global-Local Fusion Framework
πŸ—“οΈ Published: 4/18/2026
πŸ”— http://arxiv.org/abs/2604.16848v1
πŸ‘₯ Authors: Xu Cui, Xinyan Liu, Chen Yang (possible past Tencent (China) affiliation), Zhaobo Qi, Beichen Zang, Weigang Zhang, Antoni B. Chan (possible past ETH Zurich affiliation)
Abstract

Fine-grained semantic segmentation of transmission-corridor point clouds is fundamental for intelligent power-line inspection. However, current progress is limited by the scarcity of realistic data and the difficulty of modeling global corridor structure and local geometric details in long, heterogeneous scenes. Existing public datasets usually provide only a few coarse categories or short, cropped scenes, which overlook long-range structural dependencies, severe long-tail distributions, and subtle disti...

πŸ“„ The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation
πŸ—“οΈ Published: 4/18/2026
πŸ”— http://arxiv.org/abs/2604.16830v1
πŸ‘₯ Authors: Jiaxin Zhang, Xiangyu Peng, Qinglin Chen, Qinyuan Ye, Caiming Xiong (possible past Salesforce (United States) affiliation), Chien-Sheng Wu (possible past Salesforce (United States) affiliation)
Abstract

On-policy distillation (OPD) is an increasingly important paradigm for post-training language models. However, we identify a pervasive Scaling Law of Miscalibration: while OPD effectively improves task accuracy, it systematically traps models in severe overconfidence. We trace this failure to an information mismatch: teacher supervision is formed under privileged context available during training, whereas the deployed model must report confidence using only deployment-time information. We formal...
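The overconfidence the abstract describes is conventionally measured with Expected Calibration Error (ECE): bin predictions by stated confidence and compare each bin's mean confidence to its empirical accuracy. Whether the paper uses exactly this metric is not visible in the truncated abstract, so treat this as standard background:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin-weighted gap between mean confidence and accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

# An "overconfident" model: it states 75% confidence but is right half the time.
conf = np.array([0.75, 0.75, 0.75, 0.75])
hit = np.array([1, 1, 0, 0])
print(expected_calibration_error(conf, hit))  # 0.25
```

A model trapped in severe overconfidence, in these terms, is one whose accuracy rises under OPD while its ECE rises as well.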

πŸ“„ AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
πŸ—“οΈ Published: 4/17/2026
πŸ”— http://arxiv.org/abs/2604.16625v1
πŸ‘₯ Authors: Weihua Du, Jingming Zhuo, Yixin Dong, Andre Wang He, Weiwei Sun, Zeyu Zheng, Manupa Karunaratne, Ivan Fox, Tim Dettmers, Tianqi Chen (possible past University Of Washington affiliation), Yiming Yang (possible past Microsoft (United States) affiliation), Sean Welleck
Abstract

Recent large language model (LLM) agents have shown promise in using execution feedback for test-time adaptation. However, robust self-improvement remains far from solved: most approaches still treat each problem instance independently, without accumulating reusable knowledge. This limitation is particularly pronounced in domain-specific languages such as Triton, which are underrepresented in LLM pretraining data. Their strict constraints and non-linear optimization landscape further make naive ...

πŸ“„ Learning to Reason with Insight for Informal Theorem Proving
πŸ—“οΈ Published: 4/17/2026
πŸ”— http://arxiv.org/abs/2604.16278v1
πŸ‘₯ Authors: Yunhe Li, Hao Shi, Bowen Deng, Wei Wang (possible past University Of Oxford affiliation), Mengzhe Ruan, Hanxu Hou, Zhongxiang Dai, Siyang Gao, Chao Wang (possible past Google (United States) affiliation), Shuang Qiu, Linqi Song
Abstract

Although most automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with the strength of large language models (LLMs) in natural language processing. In this work, we identify a primary bottleneck in informal theorem proving as a lack of insight, namely the difficulty of recognizing the core techniques required to solve complex problems. To address this, we propose a novel framework designed to cultivate this essential reasoning skill and...

πŸ“„ VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
πŸ—“οΈ Published: 4/17/2026
πŸ”— http://arxiv.org/abs/2604.16272v1
πŸ‘₯ Authors: Xiangbo Gao, Sicong Jiang, Bangya Liu, Xinghao Chen, Minglai Yang, Siyuan Yang, Mingyang Wu, Jiongze Yu, Qi Zheng, Haozhi Wang, Jiayi Zhang, Jared Yang, Jie Yang (possible past Shanghai Jiao Tong University affiliation), Zihan Wang (possible past Tsinghua University affiliation), Qing Yin, Zhengzhong Tu (possible past Google (United States) affiliation)
Abstract

As AI-assisted video creation becomes increasingly practical, instruction-guided video editing has become essential for refining generated or captured footage to meet professional requirements. Yet the field still lacks both a large-scale human-annotated dataset with complete editing examples and a standardized evaluator for comparing editing systems. Existing resources are limited by small scale, missing edited outputs, or the absence of human quality labels, while current evaluation often reli...

πŸ“„ BAGEL: Benchmarking Animal Knowledge Expertise in Language Models
πŸ—“οΈ Published: 4/17/2026
πŸ”— http://arxiv.org/abs/2604.16241v1
πŸ‘₯ Authors: Jiacheng Shen, Masato Hagiwara, Milad Alizadeh, Ellen Gilsenan-Mcmahon, Marius Miron, David Robinson, Emmanuel Chemla, Sara Keen, Gagan Narula, Mathieu LauriΓ¨re, Matthieu Geist (possible past Google (United States) affiliation), Olivier Pietquin (possible past Google (United States) affiliation)
Abstract

Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well language models handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination...

πŸ“„ MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation
πŸ—“οΈ Published: 4/17/2026
πŸ”— http://arxiv.org/abs/2604.16175v1
πŸ‘₯ Authors: Yi Lin, Yihao Ding, Yonghui Wu (possible past Google (United States) affiliation), Yifan Peng (possible past Stanford University affiliation)
Abstract

Automated 3D radiology report generation often suffers from clinical hallucinations and a lack of the iterative verification found in human practice. While recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic "black-box" systems without the collaborative oversight characteristic of clinical workflows. To address these challenges, we propose MARCH (Multi-Agent Radiology Clinical Hierarchy), a multi-agent framework that emulates the professional hierar...

πŸ“„ AST: Adaptive, Seamless, and Training-Free Precise Speech Editing
πŸ—“οΈ Published: 4/17/2026
πŸ”— http://arxiv.org/abs/2604.16056v1
πŸ‘₯ Authors: Sihan Lv, Yechen Jin, Zhen Li (possible past Google (United States) affiliation), Jintao Chen, Jinshan Zhang, Ying Li (possible past Meta (United States) affiliation), Jianwei Yin, Meng Xi
Abstract

Text-based speech editing aims to modify specific segments while preserving speaker identity and acoustic context. Existing methods rely on task-specific training, which incurs high data costs and struggles with temporal fidelity in unedited regions. Meanwhile, adapting Text-to-Speech (TTS) models often faces a trade-off between editing quality and consistency. To address these issues, we propose AST, an Adaptive, Seamless, and Training-free precise speech editing framework. Leveraging a pre-tra...

πŸ“„ Towards Trustworthy Depression Estimation via Disentangled Evidential Learning
πŸ—“οΈ Published: 4/17/2026
πŸ”— http://arxiv.org/abs/2604.16579v1
πŸ‘₯ Authors: Fangyuan Liu, Sirui Zhao, Zeyu Zhang, Jinyang Huang, Feng-Qi Cui, Bin Luo, Tong Xu (possible past Baidu (China) affiliation), Meng Li (possible past Meta (United States) affiliation), Enhong Chen (possible past Baidu (China) affiliation)
Abstract

Automated depression estimation is highly vulnerable to signal corruption and ambient noise in real-world deployment. Prevailing deterministic methods produce uncalibrated point estimates, exposing safety-critical clinical systems to the severe risk of overconfident misdiagnoses. To establish a highly resilient and trustworthy assessment paradigm, we propose EviDep, an evidential learning framework that jointly quantifies depression severity alongside aleatoric and epistemic uncertainties via a ...
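A standard way to jointly quantify aleatoric and epistemic uncertainty in regression is the Normal-Inverse-Gamma head of Deep Evidential Regression (Amini et al.). The truncated abstract does not state that EviDep uses exactly this parameterization, so the decomposition below is background, not the paper's method:

```python
def nig_uncertainties(gamma, nu, alpha, beta):
    """Decompose a Normal-Inverse-Gamma evidential output into a point
    prediction plus aleatoric and epistemic uncertainty (Amini et al. style;
    EviDep's exact head may differ)."""
    assert alpha > 1.0, "variances are finite only for alpha > 1"
    prediction = gamma                       # E[mu]
    aleatoric = beta / (alpha - 1.0)         # E[sigma^2]: noise in the data
    epistemic = beta / (nu * (alpha - 1.0))  # Var[mu]: model uncertainty
    return prediction, aleatoric, epistemic

# More "evidence" (larger nu) shrinks epistemic but not aleatoric uncertainty,
# which is what lets an evidential model flag corrupted or out-of-distribution
# inputs instead of issuing an overconfident point estimate.
print(nig_uncertainties(12.0, nu=1.0, alpha=3.0, beta=4.0))   # (12.0, 2.0, 2.0)
print(nig_uncertainties(12.0, nu=10.0, alpha=3.0, beta=4.0))  # (12.0, 2.0, 0.2)
```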

πŸ“„ AgentV-RL: Scaling Reward Modeling with Agentic Verifier
πŸ—“οΈ Published: 4/17/2026
πŸ”— http://arxiv.org/abs/2604.16004v1
πŸ‘₯ Authors: Jiazheng Zhang, Ziche Fu, Zhiheng Xi, Wenqing Jing, Mingxu Chai, Wei He (possible past Baidu (China) affiliation), Guoqiang Zhang, Chenghao Fan, Chenxin An, Wenxiang Chen, Zhicheng Liu, Haojie Pan, Dingwei Zhu, Tao Gui, Qi Zhang (possible past Tencent (China) affiliation), Xuanjing Huang
Abstract

Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet they face significant challenges in complex domains: error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while the lack of external grounding makes verifiers unreliable on computation- or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augme...

πŸ“„ ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams
πŸ—“οΈ Published: 4/17/2026
πŸ”— http://arxiv.org/abs/2604.15994v1
πŸ‘₯ Authors: Qiang Xu, Shengyuan Bai, Yu Wang (possible past Tsinghua University affiliation), He Cao, Leqing Chen, Yuanyuan Liu, Bin Feng, Zijing Liu, Yu Li (possible past Tencent (China) affiliation)
Abstract

Multimodal Large Language Models (MLLMs) excel at recognizing individual visual elements and reasoning over simple linear diagrams. However, when faced with complex topological structures involving branching paths, converging flows, and cyclic dependencies, their reasoning capabilities degrade sharply, even on tasks as basic as counting endpoints. Existing benchmarks fail to probe this gap, focusing on semantic comprehension rather than structural reasoning. We introduce ReactBench, a benchmark ...

πŸ“„ Automated Classification of Plasma Regions at Mars Using Machine Learning
πŸ—“οΈ Published: 4/18/2026
πŸ”— http://arxiv.org/abs/2604.17131v1
πŸ‘₯ Authors: Yilan Qin, Chuanfei Dong, Hongyang Zhou, Chi Zhang (possible past Peking University affiliation), Kaichun Xu, Jiawei Gao, Simin Shekarpaz, Xinmin Li, Liang Wang (possible past Tencent (China) affiliation)
Abstract

The plasma environment around Mars is highly variable because it is strongly influenced by the solar wind. Accurate identification of plasma regions around Mars is important for the community studying solar wind-Mars interactions, region-specific plasma processes, and atmospheric escape. In this study, we develop a machine-learning-based classifier to automatically identify three key plasma regions--solar wind, magnetosheath, and induced magnetosphere--using only ion omnidirectional energy spect...

πŸ“„ Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
πŸ—“οΈ Published: 4/18/2026
πŸ”— http://arxiv.org/abs/2604.16918v1
πŸ‘₯ Authors: Weiyu Ma, Yongcheng Zeng, Yan Song (possible past Tencent (China) affiliation), Xinyu Cui, Jian Zhao, Xuhui Liu, Mohamed Elhoseiny (possible past Meta (United States) affiliation)
Abstract

Reinforcement Learning (RL) has achieved impressive success in post-training Large Language Models (LLMs) and Vision-Language Models (VLMs), with on-policy algorithms such as PPO, GRPO, and REINFORCE++ serving as the dominant paradigm. However, these methods discard all collected trajectories after a single gradient update, resulting in poor sample efficiency; this is particularly wasteful for agentic tasks where multi-turn environment interactions are expensive. While Experience Replay drives sample ef...
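A freshness-aware replay buffer can be sketched as an age-weighted sampler: stale, more off-policy trajectories are drawn with exponentially decaying probability instead of being discarded outright. This is an illustrative toy, not the paper's actual prioritization rule:

```python
import math
import random
from collections import Counter

class FreshnessReplayBuffer:
    """Toy replay buffer whose sampling weight decays exponentially with
    trajectory age (illustrative sketch, not the paper's algorithm)."""

    def __init__(self, decay=1.0):
        self.decay = decay
        self.items = []   # (trajectory, step_when_added)
        self.step = 0

    def add(self, traj):
        self.items.append((traj, self.step))
        self.step += 1

    def sample(self, k, rng):
        ages = [self.step - t for _, t in self.items]
        weights = [math.exp(-self.decay * age) for age in ages]
        trajs = [traj for traj, _ in self.items]
        return rng.choices(trajs, weights=weights, k=k)

buf = FreshnessReplayBuffer(decay=1.0)
for i in range(5):
    buf.add(f"traj-{i}")

counts = Counter(buf.sample(1000, rng=random.Random(0)))
print(counts.most_common(1)[0][0])  # traj-4: the freshest trajectory dominates
```

Unlike pure on-policy training, old trajectories keep a nonzero replay probability, so they are reused rather than thrown away after one gradient step; the decay rate controls how aggressively sampling tilts toward fresh, near-on-policy data.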

πŸ“„ S-GRPO: Unified Post-Training for Large Vision-Language Models
πŸ—“οΈ Published: 4/17/2026
πŸ”— http://arxiv.org/abs/2604.16557v1
πŸ‘₯ Authors: Yuming Yan, Kai Tang, Sihong Chen (possible past Tencent (China) affiliation), Ke Xu, Dan Hu, Qun Yu, Pengfei Hu (possible past Tencent (China) affiliation)
Abstract

Current post-training methodologies for adapting Large Vision-Language Models (LVLMs) generally fall into two paradigms: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Despite their prevalence, both approaches suffer from inefficiencies when applied in isolation. SFT forces the model's generation along a single expert trajectory, often inducing catastrophic forgetting of general multimodal capabilities due to distributional shifts. Conversely, RL explores multiple generated trajec...

*Notable papers are those with at least two authors from a "big" AI/ML lab.