📄 Notable* Recent AI/ML arXiv Papers


📄 Vero: An Open RL Recipe for General Visual Reasoning
🗓️ Published: 4/6/2026
🔗 http://arxiv.org/abs/2604.04917v1
👥 Authors: Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen (possible past Stanford University affiliation), Zhuang Liu (possible past University Of California, Berkeley affiliation)
Abstract

What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) show such broad visual reasoning is within reach, but the recipe behind them remains unclear, locked behind proprietary reinforcement learning (RL) pipelines with non-public data. We introduce Vero, a family of fully open VLMs that matches or exceeds existing open-weight models across diverse visual reasoning tasks. We scale RL d...

📄 QED-Nano: Teaching a Tiny Model to Prove Hard Theorems
🗓️ Published: 4/6/2026
🔗 http://arxiv.org/abs/2604.04898v1
👥 Authors: Lm-Provers, Yuxiao Qu, Amrith Setlur, Jasper Dekoninck, Edward Beeching, Jia Li (possible past Google (United States) affiliation), Ian Wu, Lewis Tunstall, Aviral Kumar (possible past University Of California, Berkeley affiliation)
Abstract

Proprietary AI systems have recently demonstrated impressive capabilities on complex proof-based problems, with gold-level performance reported at the 2025 International Mathematical Olympiad (IMO). However, the training pipelines behind these systems remain largely undisclosed, and their reliance on large "internal" models and scaffolds makes them expensive to run, difficult to reproduce, and hard to study or improve upon. This raises a central question: can small, open models also be trained t...

📄 DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing
🗓️ Published: 4/6/2026
🔗 http://arxiv.org/abs/2604.04875v1
👥 Authors: Ke Li (possible past University Of California, Berkeley affiliation), Maoliang Li, Jialiang Chen, Jiayu Chen, Zihao Zheng, Shaoqi Wang, Xiang Chen (possible past Tencent (China) affiliation)
Abstract

Video mashup creation represents a complex video editing paradigm that recomposes existing footage to craft engaging audio-visual experiences, demanding intricate orchestration across semantic, visual, and auditory dimensions at multiple levels. However, existing automated editing frameworks often overlook the cross-level multimodal orchestration needed to achieve professional-grade fluidity, resulting in disjointed sequences with abrupt visual transitions and musical misalignment. To address this, we...

📄 Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
🗓️ Published: 4/6/2026
🔗 http://arxiv.org/abs/2604.04759v1
👥 Authors: Zijun Wang, Haoqin Tu, Letian Zhang, Hardy Chen, Juncheng Wu, Xiangyan Liu, Zhenlong Yuan (possible past Tsinghua University affiliation), Tianyu Pang (possible past Tsinghua University affiliation), Michael Qizhe Shieh, Fengze Liu, Zeyu Zheng, Huaxiu Yao, Yuyin Zhou, Cihang Xie (possible past Google (United States) affiliation)
Abstract

OpenClaw, the most widely deployed personal AI agent in early 2026, operates with full local system access and integrates with sensitive services such as Gmail, Stripe, and the filesystem. While these broad privileges enable high levels of automation and powerful personalization, they also expose a substantial attack surface that existing sandboxed evaluations fail to capture. To address this gap, we present the first real-world safety evaluation of OpenClaw and introduce the CIK taxonomy, which...

📄 ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration
🗓️ Published: 4/6/2026
🔗 http://arxiv.org/abs/2604.04664v1
👥 Authors: Rongfeng Zhao, Xuanhao Zhang, Zhaochen Guo, Xiang Shao, Zhongpan Zhu, Bin He (possible past Baidu (China) affiliation), Jie Chen (possible past Tencent (China) affiliation)
Abstract

The integration of large language models (LLMs) with embodied agents has improved high-level reasoning capabilities; however, a critical gap remains between semantic understanding and physical execution. While vision-language-action (VLA) and vision-language-navigation (VLN) systems enable robots to perform manipulation and navigation tasks from natural language instructions, they still struggle with long-horizon sequential and temporally structured tasks. Existing frameworks typically adopt mod...

📄 Search, Do not Guess: Teaching Small Language Models to Be Effective Search Agents
🗓️ Published: 4/6/2026
🔗 http://arxiv.org/abs/2604.04651v1
👥 Authors: Yizhou Liu, Qi Sun (possible past Google (United States) affiliation), Yulin Chen, Siyue Zhang, Chen Zhao (possible past Stanford University affiliation)
Abstract

Agents equipped with search tools have emerged as effective solutions for knowledge-intensive tasks. While Large Language Models (LLMs) exhibit strong reasoning capabilities, their high computational cost limits practical deployment for search agents. Consequently, recent work has focused on distilling agentic behaviors from LLMs into Small Language Models (SLMs). Through comprehensive evaluation on complex multi-hop reasoning tasks, we find that despite possessing less parametric knowledge, SLM...

📄 Preserving Forgery Artifacts: AI-Generated Video Detection at Native Scale
🗓️ Published: 4/6/2026
🔗 http://arxiv.org/abs/2604.04634v1
👥 Authors: Zhengcen Li, Chenyang Jiang, Hang Zhao (possible past Nvidia (United States) affiliation), Shiyang Zhou, Yunyang Mo, Feng Gao, Fan Yang (possible past Tencent (China) affiliation), Qiben Shan, Shaocong Wu, Jingyong Su
Abstract

The rapid advancement of video generation models has enabled the creation of highly realistic synthetic media, raising significant societal concerns regarding the spread of misinformation. However, current detection methods suffer from critical limitations. They rely on preprocessing operations like fixed-resolution resizing and cropping. These operations not only discard subtle, high-frequency forgery traces but also cause spatial distortion and significant information loss. Furthermore, existi...

📄 One Model for All: Multi-Objective Controllable Language Models
🗓️ Published: 4/6/2026
🔗 http://arxiv.org/abs/2604.04497v1
👥 Authors: Qiang He, Yucheng Yang, Tianyi Zhou (possible past University Of Washington affiliation), Meng Fang (possible past Tencent (China) affiliation), Mykola Pechenizkiy, Setareh Maghsudi
Abstract

Aligning large language models (LLMs) with human preferences is critical for enhancing LLMs' safety, helpfulness, humor, faithfulness, etc. Current reinforcement learning from human feedback (RLHF) mainly focuses on a fixed reward learned from average human ratings, which may weaken adaptability and controllability across varying preferences. However, creating personalized LLMs requires aligning LLMs with individual human preferences, which is non-trivial due to the scarce data per user and the ...

📄 GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis
🗓️ Published: 4/6/2026
🔗 http://arxiv.org/abs/2604.04399v1
👥 Authors: Yuwen Zhai, Runze Li, Liang Wang (possible past Tencent (China) affiliation), Nian Shi, Liwu Xu, Wei Zhang (possible past Tsinghua University affiliation), Ran Lin, Bo Xu, Benlei Cui
Abstract

Evaluating GUI agents presents a distinct challenge: trajectories are long, visually grounded, and open-ended, yet evaluation must be both accurate and interpretable. Existing approaches typically apply a single holistic judgment over the entire action-observation sequence, a strategy that proves unreliable on long-horizon tasks and yields binary verdicts offering no insight into where or why an agent fails. This opacity limits the utility of evaluation as a diagnostic tool for agent development....

📄 PanLUNA: An Efficient and Robust Query-Unified Multimodal Model for Edge Biosignal Intelligence
🗓️ Published: 4/5/2026
🔗 http://arxiv.org/abs/2604.04297v1
👥 Authors: Marija Zelic, Anna Tegon, Yawei Li (possible past Google (United States) affiliation), Thorir Mar Ingolfsson, Luca Benini (possible past Eth Zurich affiliation)
Abstract

Physiological foundation models (FMs) have shown promise for biosignal representation learning, yet most remain confined to a single modality such as EEG, ECG, or PPG, largely because paired multimodal datasets are scarce. In this paper, we present PanLUNA, a compact 5.4M-parameter pan-modal FM that jointly processes EEG, ECG, and PPG within a single shared encoder. Extending LUNA's channel-unification module, PanLUNA treats multimodal channels as entries in a unified query set augmented with se...

📄 Combee: Scaling Prompt Learning for Self-Improving Language Model Agents
🗓️ Published: 4/5/2026
🔗 http://arxiv.org/abs/2604.04247v1
👥 Authors: Hanchen Li, Runyuan He, Qizheng Zhang, Changxiu Ji, Qiuyang Mang, Xiaokun Chen, Lakshya A Agrawal, Wei-Liang Liao, Eric Yang, Alvin Cheung, James Zou, Kunle Olukotun, Ion Stoica (possible past University Of California, Berkeley affiliation), Joseph E. Gonzalez (possible past University Of California, Berkeley affiliation)
Abstract

Recent advances in prompt learning allow large language model agents to acquire task-relevant knowledge from inference-time context without parameter changes. For example, existing methods (like ACE or GEPA) can learn system prompts to improve accuracy based on previous agent runs. However, these methods primarily focus on single-agent or low-parallelism settings. This fundamentally limits their ability to efficiently learn from a large set of collected agentic traces. It would be efficient and ...

📄 Agentization of Digital Assets for the Agentic Web: Concepts, Techniques, and Benchmark
🗓️ Published: 4/5/2026
🔗 http://arxiv.org/abs/2604.04226v1
👥 Authors: Linyao Chen, Bo Huang, Qinlao Zhao, Shuai Shao, Zhi Han, Zicai Cui, Ziheng Zhang, Guangtao Zeng, Wenzheng Tang, Yikun Wang, Yuanjian Zhou, Zimian Peng, Yong Yu (possible past Shanghai Jiao Tong University affiliation), Weiwen Liu, Hiroki Kobayashi, Weinan Zhang (possible past Shanghai Jiao Tong University affiliation)
Abstract

The Agentic Web, a new paradigm that redefines the internet through autonomous, goal-driven interactions, plays an important role in group intelligence. As the foundational semantic primitives of the Agentic Web, digital assets encapsulate interactive web elements into agents, expanding the capacity and coverage of agents across the Agentic Web. The lack of automated methodologies for agent generation limits the wider use of digital assets and the advancement of the Agentic Web. In this paper, we ...

📄 BAAI Cardiac Agent: An intelligent multimodal agent for automated reasoning and diagnosis of cardiovascular diseases from cardiac magnetic resonance imaging
🗓️ Published: 4/5/2026
🔗 http://arxiv.org/abs/2604.04078v1
👥 Authors: Taiping Qu, Hongkai Zhang, Lantian Zhang, Can Zhao, Nan Zhang, Hui Wang, Zhen Zhou, Mingye Zou, Kairui Bo, Pengfei Zhao, Xingxing Jin, Zixian Su, Kun Jiang, Huan Liu (possible past Tsinghua University affiliation), Yu Du, Maozhou Wang, Ruifang Yan, Zhongyuan Wang, Tiejun Huang, Lei Xu (possible past Tsinghua University affiliation), Henggui Zhang
Abstract

Cardiac magnetic resonance (CMR) is a cornerstone for diagnosing cardiovascular disease. However, it remains underutilized due to complex, time-consuming interpretation across multiple sequences, phases, and quantitative measures that relies heavily on specialized expertise. Here, we present BAAI Cardiac Agent, a multimodal intelligent system designed for end-to-end CMR interpretation. The agent integrates specialized cardiac expert models to perform automated segmentation of cardiac structures, funct...

📄 Quantifying Trust: Financial Risk Management for Trustworthy AI Agents
🗓️ Published: 4/5/2026
🔗 http://arxiv.org/abs/2604.03976v1
👥 Authors: Wenyue Hua, Tianyi Peng, Chi Wang (possible past Microsoft (United States) affiliation), Ian Kaufman, Bryan Lim (possible past University Of Oxford affiliation), Chandler Fang
Abstract

Prior work on trustworthy AI emphasizes model-internal properties such as bias mitigation, adversarial robustness, and interpretability. As AI systems evolve into autonomous agents deployed in open environments and increasingly connected to payments or assets, the operational meaning of trust shifts to end-to-end outcomes: whether an agent completes tasks, follows user intent, and avoids failures that cause material or psychological harm. These risks are fundamentally product-level and cannot be...

📄 An Improved Last-Iterate Convergence Rate for Anchored Gradient Descent Ascent
🗓️ Published: 4/4/2026
🔗 http://arxiv.org/abs/2604.03782v1
👥 Authors: Anja Surina, Arun Suggala, George Tsoukalas, Anton Kovsharov, Sergey Shirobokov, Francisco J. R. Ruiz (possible past Deepmind (United Kingdom) affiliation), Pushmeet Kohli (possible past Google (United States) affiliation), Swarat Chaudhuri
Abstract

We analyze the last-iterate convergence of the Anchored Gradient Descent Ascent algorithm for smooth convex-concave min-max problems. While previous work established a last-iterate rate of $\mathcal{O}(1/t^{2-2p})$ for the squared gradient norm, where $p \in (1/2, 1)$, it remained an open problem whether the improved exact $\mathcal{O}(1/t)$ rate is achievable. In this work, we resolve this question in the affirmative. This result was discovered autonomously by an AI system capable of writing fo...
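The truncated abstract does not show the paper's exact update rule, so as a hedged sketch only: anchored GDA methods are commonly written in Halpern style, pulling each iterate back toward the starting point with a decaying anchor weight. Assuming that standard form (the paper's scheme may differ in details such as extragradient steps or step-size schedules):

```latex
% Hedged sketch of a Halpern-style anchored GDA update for min-max f(x, y).
% z_t = (x_t, y_t) stacks both variables; F is the saddle-point operator:
%   F(z) = (\nabla_x f(x, y),\; -\nabla_y f(x, y)).
% \beta_t is the anchor weight, \eta the step size; the decay exponent p
% matching the abstract's parameterization is an assumption here.
z_{t+1} = \beta_t z_0 + (1 - \beta_t)\bigl(z_t - \eta\, F(z_t)\bigr),
\qquad \beta_t \propto t^{-p},\quad p \in (1/2, 1).
```

Under this kind of parameterization, prior analyses gave a squared-gradient-norm rate of $\mathcal{O}(1/t^{2-2p})$, which approaches but never reaches $\mathcal{O}(1/t)$ as $p \to 1/2$; the paper's contribution is closing that gap to the exact $\mathcal{O}(1/t)$ rate.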

📄 A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models
🗓️ Published: 4/6/2026
🔗 http://arxiv.org/abs/2604.04488v1
👥 Authors: Tianmeng Fang, Yong Wang (possible past Baidu (China) affiliation), Zetai Kong, Zengzhen Su, Jun Wang (possible past Tencent (China) affiliation), Chengjin Yu, Wei Wang (possible past University Of Oxford affiliation)
Abstract

Multimodal large language models have become an important infrastructure for unified processing of visual and linguistic tasks. However, such models are highly susceptible to backdoor implantation during supervised fine-tuning and will steadily output the attacker's predefined harmful responses once a specific trigger pattern is activated. The core challenge of backdoor defense lies in suppressing attack success under low poisoning ratios while preserving the model's normal generation ability. T...

📄 LightThinker++: From Reasoning Compression to Memory Management
🗓️ Published: 4/4/2026
🔗 http://arxiv.org/abs/2604.03679v1
👥 Authors: Yuqi Zhu, Jintian Zhang, Zhenjie Wan, Yujie Luo, Shuofei Qiao, Zhengke Gui, Da Zheng, Lei Liang, Huajun Chen (possible past Alibaba Group (China) affiliation), Ningyu Zhang (possible past Tencent (China) affiliation)
Abstract

Large language models (LLMs) excel at complex reasoning, yet their efficiency is limited by the surging cognitive overhead of long thought traces. In this paper, we propose LightThinker, a method that enables LLMs to dynamically compress intermediate thoughts into compact semantic representations. However, static compression often struggles with complex reasoning where the irreversible loss of intermediate details can lead to logical bottlenecks. To address this, we evolve the framework into Lig...

*Notable papers are those with at least two authors from a "big" AI/ML lab.