πŸ“„ Notable* Recent AI/ML arXiv Papers

πŸ“„ Learning to Reason with Insight for Informal Theorem Proving
πŸ—“οΈ Published: 4/17/2026
πŸ”— http://arxiv.org/abs/2604.16278v1
πŸ‘₯ Authors: Yunhe Li, Hao Shi, Bowen Deng, Wei Wang (possible past University Of Oxford affiliation), Mengzhe Ruan, Hanxu Hou, Zhongxiang Dai, Siyang Gao, Chao Wang (possible past Google (United States) affiliation), Shuang Qiu, Linqi Song
Abstract

Although most automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with large language models' (LLMs) strength in natural language processing. In this work, we identify a primary bottleneck in informal theorem proving as a lack of insight, namely the difficulty of recognizing the core techniques required to solve complex problems. To address this, we propose a novel framework designed to cultivate this essential reasoning skill and...

πŸ“„ VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
πŸ—“οΈ Published: 4/17/2026
πŸ”— http://arxiv.org/abs/2604.16272v1
πŸ‘₯ Authors: Xiangbo Gao, Sicong Jiang, Bangya Liu, Xinghao Chen, Minglai Yang, Siyuan Yang, Mingyang Wu, Jiongze Yu, Qi Zheng, Haozhi Wang, Jiayi Zhang, Jared Yang, Jie Yang (possible past Shanghai Jiao Tong University affiliation), Zihan Wang (possible past Tsinghua University affiliation), Qing Yin, Zhengzhong Tu (possible past Google (United States) affiliation)
Abstract

As AI-assisted video creation becomes increasingly practical, instruction-guided video editing has become essential for refining generated or captured footage to meet professional requirements. Yet the field still lacks both a large-scale human-annotated dataset with complete editing examples and a standardized evaluator for comparing editing systems. Existing resources are limited by small scale, missing edited outputs, or the absence of human quality labels, while current evaluation often reli...

πŸ“„ BAGEL: Benchmarking Animal Knowledge Expertise in Language Models
πŸ—“οΈ Published: 4/17/2026
πŸ”— http://arxiv.org/abs/2604.16241v1
πŸ‘₯ Authors: Jiacheng Shen, Masato Hagiwara, Milad Alizadeh, Ellen Gilsenan-McMahon, Marius Miron, David Robinson, Emmanuel Chemla, Sara Keen, Gagan Narula, Mathieu LauriΓ¨re, Matthieu Geist (possible past Google (United States) affiliation), Olivier Pietquin (possible past Google (United States) affiliation)
Abstract

Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well language models handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination...

πŸ“„ MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation
πŸ—“οΈ Published: 4/17/2026
πŸ”— http://arxiv.org/abs/2604.16175v1
πŸ‘₯ Authors: Yi Lin, Yihao Ding, Yonghui Wu (possible past Google (United States) affiliation), Yifan Peng (possible past Stanford University affiliation)
Abstract

Automated 3D radiology report generation often suffers from clinical hallucinations and a lack of the iterative verification found in human practice. While recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic "black-box" systems without the collaborative oversight characteristic of clinical workflows. To address these challenges, we propose MARCH (Multi-Agent Radiology Clinical Hierarchy), a multi-agent framework that emulates the professional hierar...

πŸ“„ AST: Adaptive, Seamless, and Training-Free Precise Speech Editing
πŸ—“οΈ Published: 4/17/2026
πŸ”— http://arxiv.org/abs/2604.16056v1
πŸ‘₯ Authors: Sihan Lv, Yechen Jin, Zhen Li (possible past Google (United States) affiliation), Jintao Chen, Jinshan Zhang, Ying Li (possible past Meta (United States) affiliation), Jianwei Yin, Meng Xi
Abstract

Text-based speech editing aims to modify specific segments while preserving speaker identity and acoustic context. Existing methods rely on task-specific training, which incurs high data costs and struggles with temporal fidelity in unedited regions. Meanwhile, adapting Text-to-Speech (TTS) models often faces a trade-off between editing quality and consistency. To address these issues, we propose AST, an Adaptive, Seamless, and Training-free precise speech editing framework. Leveraging a pre-tra...

πŸ“„ AgentV-RL: Scaling Reward Modeling with Agentic Verifier
πŸ—“οΈ Published: 4/17/2026
πŸ”— http://arxiv.org/abs/2604.16004v1
πŸ‘₯ Authors: Jiazheng Zhang, Ziche Fu, Zhiheng Xi, Wenqing Jing, Mingxu Chai, Wei He (possible past Baidu (China) affiliation), Guoqiang Zhang, Chenghao Fan, Chenxin An, Wenxiang Chen, Zhicheng Liu, Haojie Pan, Dingwei Zhu, Tao Gui, Qi Zhang (possible past Tencent (China) affiliation), Xuanjing Huang
Abstract

Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while lacking external grounding makes verifiers unreliable on computation or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augme...
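The abstract describes verifiers used for test-time scaling. The general idea (not AgentV-RL's multi-turn, tool-augmented method, whose details are not in the excerpt) can be sketched as best-of-N selection, where `verify` below is a hypothetical stand-in scorer:

```python
# Toy sketch of verifier-based test-time scaling (best-of-N): sample
# several candidate solutions, score each with a verifier, and return
# the highest-scoring one. Illustrative only -- `verify` and the
# candidates are stand-ins, not the paper's Agentic Verifier.

def best_of_n(candidates, verify):
    """Return the candidate with the highest verifier score."""
    return max(candidates, key=verify)

# Hypothetical verifier: reward answers that contain "42".
def verify(answer: str) -> float:
    return 1.0 if "42" in answer else 0.0

candidates = ["The answer is 6.", "The answer is 42.", "Unsure."]
print(best_of_n(candidates, verify))  # -> The answer is 42.
```

The abstract's point is that such verifiers become unreliable without external grounding, which is what motivates making the verifier agentic.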

πŸ“„ ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams
πŸ—“οΈ Published: 4/17/2026
πŸ”— http://arxiv.org/abs/2604.15994v1
πŸ‘₯ Authors: Qiang Xu, Shengyuan Bai, Yu Wang (possible past Tsinghua University affiliation), He Cao, Leqing Chen, Yuanyuan Liu, Bin Feng, Zijing Liu, Yu Li (possible past Tencent (China) affiliation)
Abstract

Multimodal Large Language Models (MLLMs) excel at recognizing individual visual elements and reasoning over simple linear diagrams. However, when faced with complex topological structures involving branching paths, converging flows, and cyclic dependencies, their reasoning capabilities degrade sharply, even on tasks as basic as counting endpoints. Existing benchmarks fail to probe this gap, focusing on semantic comprehension rather than structural reasoning. We introduce ReactBench, a benchmark ...

πŸ“„ Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4
πŸ—“οΈ Published: 4/17/2026
πŸ”— http://arxiv.org/abs/2604.15839v1
πŸ‘₯ Authors: Chengwu Liu, Yichun Yin, Ye Yuan (possible past Carnegie Mellon University affiliation), Jiaxuan Xie, Botao Li, Siqi Li, Jianhao Shen, Yan Xu (possible past Peking University affiliation), Lifeng Shang, Ming Zhang (possible past Peking University affiliation)
Abstract

Most ATP benchmarks embed the final answer within the formal statement -- a convention we call "Easy Mode" -- a design that simplifies the task relative to what human competitors face and may lead to optimistic estimates of model capability. We call the stricter, more realistic setting "Hard Mode": the system must independently discover the answer before constructing a formal proof. To enable Hard Mode research, we make two contributions. First, we release MiniF2F-Hard and FIMO-Hard, expert-rean...

πŸ“„ Why Fine-Tuning Encourages Hallucinations and How to Fix It
πŸ—“οΈ Published: 4/16/2026
πŸ”— http://arxiv.org/abs/2604.15574v1
πŸ‘₯ Authors: Guy Kaplan, Zorik Gekhman, Zhen Zhu, Lotem Rozner, Yuval Reif, Swabha Swayamdipta (possible past Google (United States) affiliation), Derek Hoiem, Roy Schwartz (possible past University Of Washington affiliation)
Abstract

Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since they arise as a by-product of knowledge degradation during training. We propose a sel...
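The abstract mentions established continual-learning tools; one classic example of that family is an EWC-style penalty that discourages drift in parameters deemed important during pre-training. A minimal sketch (purely illustrative; the paper's actual mitigation is truncated in this excerpt and not reproduced):

```python
# EWC-style regularizer: lam * sum_i F_i * (theta_i - theta_pre_i)^2,
# where F_i is a per-parameter importance estimate (e.g. a Fisher
# diagonal). All values below are made-up toy numbers.

def ewc_penalty(theta, theta_pre, fisher, lam=1.0):
    return lam * sum(f * (t - tp) ** 2
                     for f, t, tp in zip(fisher, theta, theta_pre))

theta_pre = [0.5, -1.0, 2.0]    # parameters after pre-training
theta     = [1.5, -1.0, 1.0]    # parameters drifting during SFT
fisher    = [10.0, 1.0, 0.1]    # per-parameter importance estimates
print(ewc_penalty(theta, theta_pre, fisher))  # -> 10.1
```

Adding such a term to the SFT loss trades plasticity on the new data against retention of pre-trained knowledge, which is the tension the abstract highlights.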

πŸ“„ LACE: Lattice Attention for Cross-thread Exploration
πŸ—“οΈ Published: 4/16/2026
πŸ”— http://arxiv.org/abs/2604.15529v1
πŸ‘₯ Authors: Yang Li (possible past Google (United States) affiliation), Zirui Zhang, Yang Liu (possible past Tsinghua University affiliation), Chengzhi Mao
Abstract

Current large language models reason in isolation. Although it is common to sample multiple reasoning paths in parallel, these trajectories do not interact, and often fail in the same redundant ways. We introduce LACE, a framework that transforms reasoning from a collection of independent trials into a coordinated, parallel process. By repurposing the model architecture to enable cross-thread attention, LACE allows concurrent reasoning paths to share intermediate insights and correct one another...

πŸ“„ PolicyBank: Evolving Policy Understanding for LLM Agents
πŸ—“οΈ Published: 4/16/2026
πŸ”— http://arxiv.org/abs/2604.15505v1
πŸ‘₯ Authors: Jihye Choi, Jinsung Yoon (possible past Google (United States) affiliation), Long T. Le, Somesh Jha, Tomas Pfister (possible past University Of Oxford affiliation)
Abstract

LLM agents operating under organizational policies must comply with authorization constraints typically specified in natural language. In practice, such specifications inevitably contain ambiguities and logical or semantic gaps that cause the agent's behavior to systematically diverge from the true requirements. We ask: by letting an agent evolve its policy understanding through interaction and corrective feedback from pre-deployment testing, can it autonomously refine its interpretation to clos...

πŸ“„ Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU
πŸ—“οΈ Published: 4/16/2026
πŸ”— http://arxiv.org/abs/2604.15464v1
πŸ‘₯ Authors: Jevin Jiang, Ying Chen (possible past Baidu (China) affiliation), Blake A. Hechtman (possible past Google (United States) affiliation), Fenghui Zhang, Yarong Mu
Abstract

Large Language Model (LLM) deployment is increasingly shifting to cost-efficient accelerators like Google's Tensor Processing Units (TPUs), prioritizing both performance and total cost of ownership (TCO). However, existing LLM inference kernels and serving systems remain largely GPU-centric, and there is no well-established approach for efficiently mapping LLM workloads onto TPU architectures--particularly under the dynamic and ragged execution patterns common in modern serving. In this paper, w...
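The "ragged" problem the abstract refers to is batches of sequences with very different lengths. Paged KV caches address it by storing each sequence's tokens in fixed-size pages via a page table; a minimal sketch of that bookkeeping (names like `PAGE_SIZE` and `PageTable` are illustrative, not the paper's API):

```python
# Minimal page-table sketch for a paged KV cache: each sequence's
# tokens occupy fixed-size pages, so ragged batches waste at most one
# partially filled page per sequence instead of padding to max length.

PAGE_SIZE = 16  # tokens per page (illustrative)

class PageTable:
    def __init__(self):
        self.next_free = 0   # next unallocated physical page id
        self.lengths = {}    # seq_id -> token count
        self.pages = {}      # seq_id -> list of physical page ids

    def append(self, seq_id, n_tokens):
        """Grow a sequence by n_tokens, allocating pages on demand."""
        length = self.lengths.get(seq_id, 0) + n_tokens
        pages = self.pages.setdefault(seq_id, [])
        needed = -(-length // PAGE_SIZE)  # ceil division
        while len(pages) < needed:
            pages.append(self.next_free)
            self.next_free += 1
        self.lengths[seq_id] = length

table = PageTable()
table.append("seq_a", 40)    # ragged lengths: 40 vs 5 tokens
table.append("seq_b", 5)
print(table.pages["seq_a"])  # -> [0, 1, 2]  (3 pages of 16)
print(table.pages["seq_b"])  # -> [3]
```

The kernel-level challenge the paper targets is executing attention efficiently over such non-contiguous pages on TPU, which this sketch does not attempt.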

πŸ“„ MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
πŸ—“οΈ Published: 4/16/2026
πŸ”— http://arxiv.org/abs/2604.15309v1
πŸ‘₯ Authors: Yan Li (possible past Tencent (China) affiliation), Zezi Zeng, Yifan Yang (possible past Tencent (China) affiliation), Yuqing Yang, Ning Liao, Weiwei Guo, Lili Qiu, Mingxi Cheng, Qi Dai, Zhendong Wang, Zhengyuan Yang, Xue Yang, Ji Li, Lijuan Wang, Chong Luo (possible past Google (United States) affiliation)
Abstract

The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generat...

πŸ“„ Why Do Vision Language Models Struggle To Recognize Human Emotions?
πŸ—“οΈ Published: 4/16/2026
πŸ”— http://arxiv.org/abs/2604.15280v1
πŸ‘₯ Authors: Madhav Agarwal, Sotirios A. Tsaftaris (possible past University Of Edinburgh affiliation), Laura Sevilla-Lara (possible past University Of Edinburgh affiliation), Steven McDonagh
Abstract

Understanding emotions is a fundamental ability for intelligent systems that interact with humans. Vision-language models (VLMs) have made tremendous progress on many visual tasks in the last few years, potentially offering a promising route to understanding emotions. Surprisingly, however, even the most sophisticated contemporary VLMs struggle to recognize human emotions, failing to outperform even specialized vision-only classifiers. In this paper we ask the question "Why do VL...

πŸ“„ Blue Data Intelligence Layer: Streaming Data and Agents for Multi-source Multi-modal Data-Centric Applications
πŸ—“οΈ Published: 4/16/2026
πŸ”— http://arxiv.org/abs/2604.15233v1
πŸ‘₯ Authors: Moin Aminnaseri, Farima Fatahi Bayat, Nikita Bhutani, Jean-Flavien Bussotti, Kevin Chan, Rafael Li Chen, Yanlin Feng, Jackson Hassell, Estevam Hruschka, Eser Kandogan, Hannah Kim, James Levine, Seiji Maekawa, Jalal Mahmud, Kushan Mitra, Naoki Otani, Pouya Pezeshkpour, Nima Shahbazi, Chen Shen (possible past Tencent (China) affiliation), Dan Zhang (possible past Google (United States) affiliation)
Abstract

NL2SQL systems aim to address the growing need for natural language interaction with data. However, real-world information needs rarely map to a single SQL query because (1) users express queries iteratively, (2) questions often span multiple data sources beyond the closed-world assumption of a single database, and (3) queries frequently rely on commonsense or external knowledge. Consequently, satisfying realistic data needs requires integrating heterogeneous sources, modalities, and contextual data. I...

πŸ“„ PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research
πŸ—“οΈ Published: 4/16/2026
πŸ”— http://arxiv.org/abs/2604.15411v1
πŸ‘₯ Authors: Tingjia Miao, Wenkai Jin, Muhua Zhang, Jinxin Tan, Yuelin Hu, Tu Guo, Jiejun Zhang, Yuhan Wang (possible past Tencent (China) affiliation), Wenbo Li, Yinuo Gao, Shuo Chen, Weiqi Jiang, Yayun Hu, Zixing Lei, Xianghe Pang, Zexi Liu, Yuzhi Zhang, Linfeng Zhang, Kun Chen, Wei Wang (possible past University Of Oxford affiliation), Weinan E, Siheng Chen
Abstract

The paradigm of agentic science requires AI systems to conduct robust reasoning and engage in long-horizon, autonomous exploration. However, current scientific benchmarks remain confined to domain knowledge comprehension and complex reasoning, failing to evaluate the exploratory nature and procedural complexity of real-world research. In this work, we present research-oriented evaluations in theoretical and computational physics, a natural testbed with comprehensive domain knowledge, complex rea...

πŸ“„ VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
πŸ—“οΈ Published: 4/16/2026
πŸ”— http://arxiv.org/abs/2604.15188v1
πŸ‘₯ Authors: Huawei Ji, Yuanhao Sun, Yuan Jin, Cheng Deng (possible past Tencent (China) affiliation), Jiaxin Ding, Luoyi Fu (possible past Shanghai Jiao Tong University affiliation), Xinbing Wang
Abstract

Visual token pruning methods effectively mitigate the quadratic computational growth caused by processing high-resolution images and video frames in vision-language models (VLMs). However, existing approaches rely on predefined pruning configurations without determining whether they achieve computation-performance optimality. In this work, we introduce VisPCO, a novel framework that formulates visual token pruning as a Pareto configuration optimization problem to automatically identify optimal configu...
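The underlying operation being configured is simple: rank visual tokens by an importance score and keep only the top-k before the language model runs. A toy sketch of that step (the scores and `keep` value are made up; VisPCO's Pareto search over configurations is not reproduced here):

```python
# Toy visual-token pruning: keep the `keep` highest-scoring tokens
# (e.g. by attention received from the text query), preserving their
# original spatial order, and drop the rest.

def prune_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving order."""
    top = sorted(range(len(tokens)),
                 key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]

tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7]
print(prune_tokens(tokens, scores, keep=3))  # -> ['t1', 't3', 't5']
```

The "configuration" the paper optimizes is, roughly, choices like `keep` at each layer, traded off against task performance under a compute budget.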

πŸ“„ Dr. RTL: Autonomous Agentic RTL Optimization through Tool-Grounded Self-Improvement
πŸ—“οΈ Published: 4/16/2026
πŸ”— http://arxiv.org/abs/2604.14989v1
πŸ‘₯ Authors: Wenji Fang, Yao Lu (possible past Google (United States) affiliation), Shang Liu, Jing Wang (possible past Google (United States) affiliation), Ziyan Guo, Junxian He (possible past Carnegie Mellon University affiliation), Fengbin Tu, Zhiyao Xie
Abstract

Recent advances in large language models (LLMs) have sparked growing interest in automatic RTL optimization for better performance, power, and area (PPA). However, existing methods are still far from realistic RTL optimization. Their evaluation settings are often unrealistic: they are tested on manually degraded, small-scale RTL designs and rely on weak open-source tools. Their optimization methods are also limited, relying on coarse design-level feedback and simple pre-defined rewriting rules. ...

πŸ“„ RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding
πŸ—“οΈ Published: 4/16/2026
πŸ”— http://arxiv.org/abs/2604.14885v1
πŸ‘₯ Authors: Zihong Zhang, Zuchao Li, Lefei Zhang, Ping Wang (possible past Tencent (China) affiliation), Hai Zhao (possible past Shanghai Jiao Tong University affiliation)
Abstract

Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess-and-verify strategy, but existing training-free variants face trade-offs: retrieval-based drafts break when no exact match exists, while logits-based drafts lack structural guidance. We propose RACER (Retrieval-Augmented Contextual Rapid Speculative Decoding)...
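The guess-and-verify loop the abstract describes can be sketched with toy deterministic models: a cheap draft proposes several tokens, the target checks them, and the longest agreeing prefix is accepted plus one corrected token. This shows the generic SD loop only, not RACER's retrieval-augmented drafting:

```python
# Toy speculative-decoding step (greedy, deterministic). In a real
# system the verify loop is a single batched forward pass of the
# target model, which is where the speedup comes from.

def speculative_step(prefix, draft_fn, target_fn, k=4):
    """One decode step: returns the tokens accepted this step."""
    guesses, ctx = [], list(prefix)
    for _ in range(k):                 # draft k tokens cheaply
        g = draft_fn(ctx)
        guesses.append(g)
        ctx.append(g)
    accepted, ctx = [], list(prefix)   # verify against the target
    for g in guesses:
        t = target_fn(ctx)
        if g == t:
            accepted.append(g)
            ctx.append(g)
        else:
            accepted.append(t)         # target's correction
            break
    else:
        accepted.append(target_fn(ctx))  # bonus token when all match
    return accepted

# Hypothetical toy models over characters: the target emits "abab...",
# the draft agrees except when the context has length 2.
target = lambda ctx: "ab"[len(ctx) % 2]
draft  = lambda ctx: "ab"[len(ctx) % 2] if len(ctx) != 2 else "x"
print(speculative_step(["a"], draft, target, k=3))  # -> ['b', 'a']
```

Two tokens are accepted here for one round of target verification; the trade-off the abstract mentions is about how the draft is produced (retrieval vs. logits).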

πŸ“„ AdaVFM: Adaptive Vision Foundation Models for Edge Intelligence via LLM-Guided Execution
πŸ—“οΈ Published: 4/17/2026
πŸ”— http://arxiv.org/abs/2604.15622v1
πŸ‘₯ Authors: Yiwei Zhao, Yi Zheng, Huapeng Su, Jieyu Lin, Stefano Ambrogio, Cijo Jose, MichaΓ«l Ramamonjisoa, Patrick Labatut, Barbara De Salvo, Chiao Liu (possible past Meta (United States) affiliation), Phillip B. Gibbons (possible past Carnegie Mellon University affiliation), Ziyun Li
Abstract

Language-aligned vision foundation models (VFMs) enable versatile visual understanding for always-on contextual AI, but their deployment on edge devices is hindered by strict latency and power constraints. We present AdaVFM, an adaptive framework for efficient on-device inference of language-aligned VFMs that dynamically adjusts computation based on scene context and task complexity. Our key insight is that the effect of model size reduction on performance is task-dependent in vision application...

πŸ“„ Ο€_0.7: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
πŸ—“οΈ Published: 4/16/2026
πŸ”— http://arxiv.org/abs/2604.15483v1
πŸ‘₯ Authors: Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, Vedant Choudhary, Foster Collins, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Maitrayee Dhaka, Jared Dicarlo, Danny Driess (possible past Google (United States) affiliation), Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn (possible past University Of California, Berkeley affiliation), Catherine Glossop, Thomas Godden, Ivan Goryachev, Lachlan Groom, Haroun Habeeb, Hunter Hancock, Karol Hausman (possible past Google (United States) affiliation), Gashon Hussein, Victor Hwang, Brian Ichter (possible past Google (United States) affiliation), Connor Jacobsen, Szymon Jakubczak, Rowan Jen, Tim Jones, Gregg Kammerer (possible past Meta (United States) affiliation), Ben Katz, Liyiming Ke, Mairbek Khadikov, Chandra Kuchi, Marinda Lamb, Devin Leblanc, Brendon Lecount, Sergey Levine (possible past University Of Washington affiliation), Xinyu Li, Adrian Li-Bell, Vladislav Lialin, Zhonglin Liang, Wallace Lim, Yao Lu (possible past Google (United States) affiliation), Enyu Luo, Vishnu Mano, Nandan Marwaha, Aikys Mongush, Liam Murphy, Suraj Nair (possible past Stanford University affiliation), Tyler Patterson, Karl Pertsch (possible past Google (United States) affiliation), Allen Z. Ren, Gavin Schelske, Charvi Sharma, Baifeng Shi, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Springenberg (possible past Google (United States) affiliation), Kyle Stachowicz, Will Stoeckle, Jiaming Tang, Jimmy Tanner, Shalom Tekeste, Marcel Torne, Kyle Vedder, Quan Vuong (possible past Google (United States) affiliation), Anna Walling, Haohuan Wang, Jason Wang, Xudong Wang, Chris Whalen, Samuel Whitmore, Blake Williams, Charles Xu, Sukwon Yoo, Lili Yu, Wuming Zhang, Zhuoyang Zhang, Ury Zhilinsky
Abstract

We present a new robotic foundation model, called Ο€_0.7, that can enable strong out-of-the-box performance in a wide range of scenarios. Ο€_0.7 can follow diverse language instructions in unseen environments, including multi-stage tasks with various kitchen appliances, provide zero-shot cross-embodiment generalization, for example enabling a robot to fold laundry without having seen the task before, and perform challenging tasks such as operating an espresso machine out of the box at a level o...

πŸ“„ Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning
πŸ—“οΈ Published: 4/16/2026
πŸ”— http://arxiv.org/abs/2604.14974v1
πŸ‘₯ Authors: Jean-Bastien Grill (possible past Deepmind (United Kingdom) affiliation), Michal Valko, RΓ©mi Munos (possible past Google (United States) affiliation)
Abstract

You are a robot and you live in a Markov decision process (MDP) with a finite or an infinite number of transitions from state-action to next states. You got brains and so you plan before you act. Luckily, your roboparents equipped you with a generative model to do some Monte-Carlo planning. The world is waiting for you and you have no time to waste. You want your planning to be efficient. Sample-efficient. Indeed, you want to exploit the possible structure of the MDP by exploring only a subset o...

πŸ“„ LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning
πŸ—“οΈ Published: 4/16/2026
πŸ”— http://arxiv.org/abs/2604.14922v1
πŸ‘₯ Authors: Bowen Ping, Zijun Chen (possible past Google (United States) affiliation), Tingfeng Hui, Qize Yu, Chenxuan Li, Junchi Yan (possible past Shanghai Jiao Tong University affiliation), Baobao Chang (possible past Peking University affiliation)
Abstract

Reinforcement Learning (RL) has emerged as a critical driver for enhancing the reasoning capabilities of Large Language Models (LLMs). While recent advancements have focused on reward engineering or data synthesis, few studies exploit the model's intrinsic representation characteristics to guide the training process. In this paper, we first observe the presence of high-magnitude activations within the query and key vectors when processing long contexts. Drawing inspiration from model quantizatio...
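The high-magnitude activations the abstract observes are the "outlier" phenomenon studied in the quantization literature it cites as inspiration. A minimal sketch of detecting them: flag positions whose key-vector norm far exceeds the median norm (the threshold factor is an illustrative choice, not LongAct's criterion):

```python
# Flag positions with outlier ("massive") activations by comparing
# each key vector's L2 norm to the median norm across positions.
import math

def outlier_positions(key_vectors, factor=5.0):
    norms = [math.sqrt(sum(x * x for x in v)) for v in key_vectors]
    median = sorted(norms)[len(norms) // 2]
    return [i for i, n in enumerate(norms) if n > factor * median]

keys = [[1.0, 0.0], [0.0, 1.2], [40.0, 3.0], [1.1, 0.2]]
print(outlier_positions(keys))  # -> [2]
```

How such positions are then used to guide long-context RL is the paper's contribution and is not reproduced here.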

πŸ“„ Catching Every Ripple: Enhanced Anomaly Awareness via Dynamic Concept Adaptation
πŸ—“οΈ Published: 4/16/2026
πŸ”— http://arxiv.org/abs/2604.14726v1
πŸ‘₯ Authors: Jiaqi Zhu, Shaofeng Cai, Jie Chen (possible past Tencent (China) affiliation), Fang Deng, Beng Chin Ooi (possible past National University Of Singapore affiliation), Wenqiao Zhang
Abstract

Online anomaly detection (OAD) plays a pivotal role in real-time analytics and decision-making for evolving data streams. However, existing methods often rely on costly retraining and rigid decision boundaries, limiting their ability to adapt both effectively and efficiently to concept drift in dynamic environments. To address these challenges, we propose DyMETER, a dynamic concept adaptation framework for OAD that unifies on-the-fly parameter shifting and dynamic thresholding within a single on...
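The "rigid decision boundaries" the abstract criticizes can be contrasted with a dynamic threshold that drifts with the stream. A generic sketch using exponentially weighted statistics (illustrative only, not DyMETER; `alpha`, `k`, and the variance prior are made-up choices):

```python
# Dynamic thresholding for a stream of anomaly scores: maintain an
# exponential moving mean/variance and flag scores above
# mean + k * std, so the boundary adapts to concept drift.
import math

def stream_flags(scores, alpha=0.1, k=3.0, var0=0.05):
    # var0 is a small prior variance so early points aren't
    # trivially flagged before any spread has been observed.
    mean, var = scores[0], var0
    flags = [False]
    for s in scores[1:]:
        flags.append(s > mean + k * math.sqrt(var))
        diff = s - mean                      # update stats *after*
        mean += alpha * diff                 # the decision
        var = (1 - alpha) * (var + alpha * diff * diff)
    return flags

scores = [1.0, 1.1, 0.9, 1.0, 8.0, 1.0]
print(stream_flags(scores))  # -> [False, False, False, False, True, False]
```

Note the trade-off this exposes: after the spike, the moving statistics inflate and the threshold temporarily rises, which is exactly the kind of adaptation-vs-sensitivity tension frameworks like DyMETER aim to manage.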

*Notable papers are those with at least two authors from a "big" AI/ML lab.