📄 Notable* Recent AI/ML arXiv Papers

Last updated just now...

📄 MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
🗓️ Published: 4/16/2026
🔗 http://arxiv.org/abs/2604.15309v1
👥 Authors: Yan Li (possible past Tencent (China) affiliation), Zezi Zeng, Yifan Yang (possible past Tencent (China) affiliation), Yuqing Yang, Ning Liao, Weiwei Guo, Lili Qiu, Mingxi Cheng, Qi Dai, Zhendong Wang, Zhengyuan Yang, Xue Yang, Ji Li, Lijuan Wang, Chong Luo (possible past Google (United States) affiliation)
Abstract

The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generat...

📄 Why Do Vision Language Models Struggle To Recognize Human Emotions?
🗓️ Published: 4/16/2026
🔗 http://arxiv.org/abs/2604.15280v1
👥 Authors: Madhav Agarwal, Sotirios A. Tsaftaris (possible past University Of Edinburgh affiliation), Laura Sevilla-Lara (possible past University Of Edinburgh affiliation), Steven McDonagh
Abstract

Understanding emotions is a fundamental ability for intelligent systems that interact with humans. Vision-language models (VLMs) have made tremendous progress in the last few years for many visual tasks, potentially offering a promising solution for understanding emotions. However, it is surprising that even the most sophisticated contemporary VLMs struggle to recognize human emotions or to outperform even specialized vision-only classifiers. In this paper, we ask the question "Why do VL...

📄 Blue Data Intelligence Layer: Streaming Data and Agents for Multi-source Multi-modal Data-Centric Applications
🗓️ Published: 4/16/2026
🔗 http://arxiv.org/abs/2604.15233v1
👥 Authors: Moin Aminnaseri, Farima Fatahi Bayat, Nikita Bhutani, Jean-Flavien Bussotti, Kevin Chan, Rafael Li Chen, Yanlin Feng, Jackson Hassell, Estevam Hruschka, Eser Kandogan, Hannah Kim, James Levine, Seiji Maekawa, Jalal Mahmud, Kushan Mitra, Naoki Otani, Pouya Pezeshkpour, Nima Shahbazi, Chen Shen (possible past Tencent (China) affiliation), Dan Zhang (possible past Google (United States) affiliation)
Abstract

NL2SQL systems aim to address the growing need for natural language interaction with data. However, real-world information rarely maps to a single SQL query because (1) users express queries iteratively, (2) questions often span multiple data sources beyond the closed-world assumption of a single database, and (3) queries frequently rely on commonsense or external knowledge. Consequently, satisfying realistic data needs requires integrating heterogeneous sources, modalities, and contextual data. I...

📄 VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
🗓️ Published: 4/16/2026
🔗 http://arxiv.org/abs/2604.15188v1
👥 Authors: Huawei Ji, Yuanhao Sun, Yuan Jin, Cheng Deng (possible past Tencent (China) affiliation), Jiaxin Ding, Luoyi Fu (possible past Shanghai Jiao Tong University affiliation), Xinbing Wang
Abstract

Visual token pruning methods effectively mitigate the quadratic computational growth caused by processing high-resolution images and video frames in vision-language models (VLMs). However, existing approaches rely on predefined pruning configurations without determining whether they achieve computation-performance optimality. In this work, we introduce VisPCO, a novel framework that formulates visual token pruning as a Pareto configuration optimization problem to automatically identify optimal configu...
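
As a rough illustration of the Pareto framing only (not VisPCO's actual search procedure), the sketch below keeps the pruning configurations that are non-dominated under a compute budget. The config fields, numbers, and scoring are all invented.

```python
# Hypothetical sketch: budget-aware Pareto filtering over visual token
# pruning configurations. Illustrates the framing, not VisPCO itself.
from dataclasses import dataclass

@dataclass
class PruneConfig:
    keep_ratio: float  # fraction of visual tokens kept
    flops: float       # estimated compute cost (lower is better)
    accuracy: float    # measured task accuracy (higher is better)

def pareto_frontier(configs, flops_budget):
    """Configs within budget that no other config dominates
    (i.e., none is at least as cheap AND at least as accurate)."""
    feasible = [c for c in configs if c.flops <= flops_budget]
    frontier = [
        c for c in feasible
        if not any(o is not c and o.flops <= c.flops and o.accuracy >= c.accuracy
                   for o in feasible)
    ]
    return sorted(frontier, key=lambda c: c.flops)

configs = [PruneConfig(0.9, 8.0, 0.71), PruneConfig(0.5, 4.1, 0.69),
           PruneConfig(0.3, 2.6, 0.62), PruneConfig(0.5, 4.5, 0.64)]
print(pareto_frontier(configs, flops_budget=6.0))  # keeps the two non-dominated configs
```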

📄 Dr. RTL: Autonomous Agentic RTL Optimization through Tool-Grounded Self-Improvement
🗓️ Published: 4/16/2026
🔗 http://arxiv.org/abs/2604.14989v1
👥 Authors: Wenji Fang, Yao Lu (possible past Google (United States) affiliation), Shang Liu, Jing Wang (possible past Google (United States) affiliation), Ziyan Guo, Junxian He (possible past Carnegie Mellon University affiliation), Fengbin Tu, Zhiyao Xie
Abstract

Recent advances in large language models (LLMs) have sparked growing interest in automatic RTL optimization for better performance, power, and area (PPA). However, existing methods are still far from realistic RTL optimization. Their evaluation settings are often unrealistic: they are tested on manually degraded, small-scale RTL designs and rely on weak open-source tools. Their optimization methods are also limited, relying on coarse design-level feedback and simple pre-defined rewriting rules. ...
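
The contrast the abstract draws (coarse feedback and fixed rewrite rules vs. tool-grounded optimization) has a simple skeleton, sketched below. `synthesize` and `llm_rewrite` are placeholder callables and the acceptance rule is a toy; none of this is Dr. RTL's actual method.

```python
# Invented sketch of a tool-grounded RTL optimization loop. The interfaces
# and acceptance rule are placeholders, not Dr. RTL's actual method.

def optimize_rtl(rtl, synthesize, llm_rewrite, n_iters=5):
    best, best_ppa = rtl, synthesize(rtl)        # ppa: e.g. {"power", "perf", "area"}
    for _ in range(n_iters):
        candidate = llm_rewrite(best, feedback=best_ppa)  # ground rewrites in tool output
        ppa = synthesize(candidate)              # re-measure; never trust the LLM's claim
        if ppa["area"] < best_ppa["area"]:       # toy single-metric acceptance; a real
            best, best_ppa = candidate, ppa      # flow would also check equivalence
    return best
```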

📄 UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards
🗓️ Published: 4/16/2026
🔗 http://arxiv.org/abs/2604.14967v1
👥 Authors: Jun Wang (possible past Tencent (China) affiliation), Shuo Tan, Zelong Sun, Tiancheng Gu, Yongle Zhao, Ziyong Feng, Kaicheng Yang, Cewu Lu (possible past Shanghai Jiao Tong University affiliation)
Abstract

Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual ...

📄 RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding
🗓️ Published: 4/16/2026
🔗 http://arxiv.org/abs/2604.14885v1
👥 Authors: Zihong Zhang, Zuchao Li, Lefei Zhang, Ping Wang (possible past Tencent (China) affiliation), Hai Zhao (possible past Shanghai Jiao Tong University affiliation)
Abstract

Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess-and-verify strategy, but existing training-free variants face trade-offs: retrieval-based drafts break when no exact match exists, while logits-based drafts lack structural guidance. We propose RACER (Retrieval-Augmented Contextual Rapid Speculative Decoding)...
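
For readers new to SD, a bare guess-and-verify skeleton (greedy-acceptance style) is sketched below. `draft_tokens` and `target_next` are placeholders; RACER's retrieval-augmented contextual drafting would sit behind `draft_tokens`, and a real implementation verifies all draft positions in one batched target forward pass rather than one at a time.

```python
# Bare guess-and-verify speculative decoding skeleton (greedy acceptance).
# draft_tokens/target_next are placeholder callables over token lists.

def speculative_decode(prompt, draft_tokens, target_next, max_new=64, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        draft = draft_tokens(out, k)          # cheap guess of up to k tokens
        accepted = 0
        for tok in draft:                     # verify guesses against the target
            if target_next(out) == tok:
                out.append(tok)               # match: token accepted "for free"
                accepted += 1
            else:
                break
        if accepted < len(draft) or not draft:
            # Mismatch (or empty draft): take the target model's own next
            # token so decoding always makes progress.
            out.append(target_next(out))
    return out[:len(prompt) + max_new]
```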

📄 Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-CodeX
🗓️ Published: 4/16/2026
🔗 http://arxiv.org/abs/2604.14858v1
👥 Authors: Zhonghao Yang, Yu Li (possible past Tencent (China) affiliation), Yanxu Zhu, Tianyi Zhou (possible past University Of Washington affiliation), Yuejin Xie, Haoyu Luo, Jing Shao, Xia Hu, Dongrui Liu
Abstract

As agent systems move into increasingly diverse execution settings, trajectory-level safety evaluation and diagnosis require benchmarks that evolve with them. ATBench is a diverse and realistic agent trajectory benchmark for safety evaluation and diagnosis. This report presents ATBench-Claw and ATBench-CodeX, two domain-customized extensions that carry ATBench into the OpenClaw and OpenAI Codex / Codex-runtime settings. The key adaptation mechanism is to analyze each new setting, customize the t...

📄 Catching Every Ripple: Enhanced Anomaly Awareness via Dynamic Concept Adaptation
🗓️ Published: 4/16/2026
🔗 http://arxiv.org/abs/2604.14726v1
👥 Authors: Jiaqi Zhu, Shaofeng Cai, Jie Chen (possible past Tencent (China) affiliation), Fang Deng, Beng Chin Ooi (possible past National University Of Singapore affiliation), Wenqiao Zhang
Abstract

Online anomaly detection (OAD) plays a pivotal role in real-time analytics and decision-making for evolving data streams. However, existing methods often rely on costly retraining and rigid decision boundaries, limiting their ability to adapt both effectively and efficiently to concept drift in dynamic environments. To address these challenges, we propose DyMETER, a dynamic concept adaptation framework for OAD that unifies on-the-fly parameter shifting and dynamic thresholding within a single on...
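
The truncated abstract names on-the-fly parameter shifting and dynamic thresholding; in that spirit only (this is not DyMETER), a streaming detector can keep exponentially weighted estimates and adapt its decision boundary online:

```python
import math

class StreamingDetector:
    """Toy online anomaly detector: EWMA mean/variance with a z-score
    boundary that adapts as the stream drifts. Illustrative only."""
    def __init__(self, alpha=0.05, z=3.0):
        self.alpha, self.z = alpha, z
        self.mean, self.var, self.n = 0.0, 1.0, 0

    def update(self, x):
        self.n += 1
        if self.n == 1:
            self.mean = x                 # initialize on the first point
            return False
        score = abs(x - self.mean) / math.sqrt(self.var + 1e-12)
        is_anomaly = score > self.z
        if not is_anomaly:
            # Adapt only on normal points so anomalies don't drag the
            # decision boundary toward themselves.
            d = x - self.mean
            self.mean += self.alpha * d
            self.var = (1 - self.alpha) * (self.var + self.alpha * d * d)
        return is_anomaly

det = StreamingDetector()
print([det.update(v) for v in [1.0, 1.1, 0.9, 1.0, 9.0]])  # last point flagged
```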

📄 CoDaS: AI Co-Data-Scientist for Biomarker Discovery via Wearable Sensors
🗓️ Published: 4/16/2026
🔗 http://arxiv.org/abs/2604.14615v1
👥 Authors: Yubin Kim, Salman Rahman, Samuel Schmidgall, Chunjong Park, A. Ali Heydari, Ahmed A. Metwally, Hong Yu, Xin Liu, Xuhai Xu, Yuzhe Yang, Maxwell A. Xu, Zhihan Zhang, Cynthia Breazeal (possible past Massachusetts Institute Of Technology affiliation), Tim Althoff, Petar Sirkovic, Ivor Rendulic, Annalisa Pawlosky (possible past Google (United States) affiliation), Nicolas Stroppa, Juraj Gottweis (possible past Google (United States) affiliation), Elahe Vedadi, Alan Karthikesalingam (possible past Google (United States) affiliation), Pushmeet Kohli (possible past Google (United States) affiliation), Vivek Natarajan (possible past Google (United States) affiliation), Mark Malhotra, Shwetak Patel, Hae Won Park, Hamid Palangi, Daniel McDuff (possible past Google (United States) affiliation)
Abstract

Scientific discovery in digital health requires converting continuous physiological signals from wearable devices into clinically actionable biomarkers. We introduce CoDaS (AI Co-Data-Scientist), a multi-agent system that structures biomarker discovery as an iterative process combining hypothesis generation, statistical analysis, adversarial validation, and literature-grounded reasoning with human oversight using large-scale wearable datasets. Across three cohorts totaling 9,279 participant-obse...

📄 FAIR Universe Weak Lensing ML Uncertainty Challenge: Handling Uncertainties and Distribution Shifts for Precision Cosmology
🗓️ Published: 4/15/2026
🔗 http://arxiv.org/abs/2604.14451v1
👥 Authors: Biwei Dai (possible past University Of California, Berkeley affiliation), Po-Wen Chang, Wahid Bhimji, Paolo Calafiura, Ragansu Chakkappai, Yuan-Tang Chou, Sascha Diefenbacher, Jordan Dudley, Ibrahim Elsharkawy, Steven Farrell, Isabelle Guyon, Chris Harris, Elham E. Khoda, Benjamin Nachman (possible past University Of California, Berkeley affiliation), David Rousseau, Uroš Seljak, Ihsan Ullah, Yulei Zhang
Abstract

Weak gravitational lensing, the correlated distortion of background galaxy shapes by foreground structures, is a powerful probe of the matter distribution in our universe and allows accurate constraints on the cosmological model. In recent years, high-order statistics and machine learning (ML) techniques have been applied to weak lensing data to extract the nonlinear information beyond traditional two-point analysis. However, these methods typically rely on cosmological simulations, which poses ...

📄 LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning
🗓️ Published: 4/15/2026
🔗 http://arxiv.org/abs/2604.14140v1
👥 Authors: Sumeet Ramesh Motwani, Daniel Nichols, Charles London, Peggy Li, Fabio Pizzati, Acer Blake, Hasan Hammoud, Tavish McDonald, Akshat Naik, Alesia Ivanova, Vignesh Baskaran, Ivan Laptev, Ruben Glatt, Tal Ben-Nun, Philip Torr (possible past University Of Oxford affiliation), Natasha Jaques (possible past University Of California, Berkeley affiliation), Ameya Prabhu, Brian Bartoldson, Bhavya Kailkhura, Christian Schroeder de Witt (possible past University Of Oxford affiliation)
Abstract

As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Proble...

📄 HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
🗓️ Published: 4/15/2026
🔗 http://arxiv.org/abs/2604.14125v1
👥 Authors: Tianshuo Yang, Guanyu Chen, Yutian Chen (possible past Deepmind (United Kingdom) affiliation), Zhixuan Liang, Yitian Liu, Zanxin Chen, Chunpu Xu, Haotian Liang, Jiangmiao Pang (possible past Shanghai Artificial Intelligence Laboratory affiliation), Yao Mu, Ping Luo (possible past Shanghai Artificial Intelligence Laboratory affiliation)
Abstract

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs tas...
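
The decoupling described has a simple shape, sketched below with every component as a placeholder (these are not HiVLA's interfaces): a VLM plans subgoals, visual grounding turns each subgoal into a spatial target, and a low-level policy executes it.

```python
# Skeletal plan-ground-act hierarchy; all callables are hypothetical
# placeholders, not HiVLA's actual interfaces.

def run_episode(observe, vlm_plan, ground, controller, max_steps=50):
    image, instruction = observe()
    for subgoal in vlm_plan(image, instruction):   # high level: semantic plan
        image, _ = observe()
        target = ground(image, subgoal)            # grounding bridges the two levels
        for _ in range(max_steps):
            image, _ = observe()
            if controller(image, target):          # low level: motor control;
                break                              # True means subgoal reached
```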

📄 TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration
🗓️ Published: 4/15/2026
🔗 http://arxiv.org/abs/2604.14116v1
👥 Authors: Zerun Ma, Guoqiang Wang, Xinchen Xie, Yicheng Chen, He Du, Bowen Li, Yanan Sun (possible past Tencent (China) affiliation), Wenran Liu, Kai Chen (possible past Shanghai Jiao Tong University affiliation), Yining Li
Abstract

While Large Language Models (LLMs) have empowered AI research agents to perform isolated scientific tasks, automating complex, real-world workflows, such as LLM training, remains a significant challenge. In this paper, we introduce TREX, a multi-agent system that automates the entire LLM training life-cycle. By orchestrating collaboration between two core modules (the Researcher and the Executor), the system seamlessly performs requirement analysis, open-domain literature and data research, formula...

📄 ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents
🗓️ Published: 4/15/2026
🔗 http://arxiv.org/abs/2604.14261v1
👥 Authors: Zhuofeng Li, Yi Lu, Dongfu Jiang, Haoxiang Zhang, Yuyang Bai, Chuan Li, Yu Wang (possible past Tsinghua University affiliation), Shuiwang Ji, Jianwen Xie, Yu Zhang (possible past Google (United States) affiliation)
Abstract

The rapid rise in AI conference submissions has driven increasing exploration of large language models (LLMs) for peer review support. However, LLM-based reviewers often generate superficial, formulaic comments lacking substantive, evidence-grounded feedback. We attribute this to the underutilization of two key components of human reviewing: explicit rubrics and contextual grounding in existing work. To address this, we introduce ReviewBench, a benchmark evaluating review text according to paper...

📄 Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning
🗓️ Published: 4/16/2026
🔗 http://arxiv.org/abs/2604.14974v1
👥 Authors: Jean-Bastien Grill (possible past Deepmind (United Kingdom) affiliation), Michal Valko, Rémi Munos (possible past Google (United States) affiliation)
Abstract

You are a robot and you live in a Markov decision process (MDP) with a finite or an infinite number of transitions from state-action to next states. You got brains and so you plan before you act. Luckily, your roboparents equipped you with a generative model to do some Monte-Carlo planning. The world is waiting for you and you have no time to waste. You want your planning to be efficient. Sample-efficient. Indeed, you want to exploit the possible structure of the MDP by exploring only a subset o...
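
The setting (planning with access only to a generative model of the MDP) can be fixed in the mind with the naive sparse-sampling recursion below; it is exponential in depth, and the paper's algorithm is the sample-efficient alternative. `model(s, a)` is assumed to sample a next state and reward.

```python
def q_estimate(model, state, action, depth, gamma=0.95, n_samples=8, n_actions=3):
    """Naive sparse-sampling Monte-Carlo estimate of Q(state, action) using
    only a generative model. Exponential in depth; shown to fix ideas, not
    as the paper's (far more sample-efficient) algorithm."""
    if depth == 0:
        return 0.0
    total = 0.0
    for _ in range(n_samples):
        next_state, reward = model(state, action)  # one generative-model call
        total += reward + gamma * max(
            q_estimate(model, next_state, a, depth - 1, gamma, n_samples, n_actions)
            for a in range(n_actions))
    return total / n_samples

# Toy deterministic chain MDP: action 0 pays 1, other actions pay 0.
model = lambda s, a: (s + 1, 1.0 if a == 0 else 0.0)
print(q_estimate(model, 0, 0, depth=3))  # 1 + 0.95 + 0.95**2 = 2.8525
```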

📄 LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning
🗓️ Published: 4/16/2026
🔗 http://arxiv.org/abs/2604.14922v1
👥 Authors: Bowen Ping, Zijun Chen (possible past Google (United States) affiliation), Tingfeng Hui, Qize Yu, Chenxuan Li, Junchi Yan (possible past Shanghai Jiao Tong University affiliation), Baobao Chang (possible past Peking University affiliation)
Abstract

Reinforcement Learning (RL) has emerged as a critical driver for enhancing the reasoning capabilities of Large Language Models (LLMs). While recent advancements have focused on reward engineering or data synthesis, few studies exploit the model's intrinsic representation characteristics to guide the training process. In this paper, we first observe the presence of high-magnitude activations within the query and key vectors when processing long contexts. Drawing inspiration from model quantizatio...
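
The observation itself (high-magnitude activations in query and key vectors on long contexts) is easy to probe. Below is a hypothetical k-sigma probe; the threshold rule is invented, since the abstract is cut before the paper's own criterion.

```python
import torch

def qk_outlier_mask(qk: torch.Tensor, k_sigma: float = 6.0) -> torch.Tensor:
    """Flag high-magnitude activations in query/key projections.
    qk: (seq_len, num_heads, head_dim). The k-sigma rule is a hypothetical
    stand-in for however the paper identifies such activations."""
    mag = qk.abs()
    mu = mag.mean(dim=(0, 2), keepdim=True)  # per-head mean magnitude
    sd = mag.std(dim=(0, 2), keepdim=True)   # per-head spread
    return mag > mu + k_sigma * sd

# Random projections with one planted spike at (token 5, head 0, dim 10):
q = torch.randn(128, 8, 64)
q[5, 0, 10] = 40.0
print(qk_outlier_mask(q).nonzero())  # mostly just the planted spike
```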

📄 VoxSafeBench: Not Just What Is Said, but Who, How, and Where
🗓️ Published: 4/16/2026
🔗 http://arxiv.org/abs/2604.14548v1
👥 Authors: Yuxiang Wang, Hongyu Liu, Yijiang Xu, Qinke Ni, Li Wang (possible past Tesla (United States) affiliation), Wan Lin, Kunyu Feng, Dekun Chen, Xu Tan, Lei Wang (possible past Baidu (China) affiliation), Jie Shi, Zhizheng Wu (possible past University Of Edinburgh affiliation)
Abstract

As speech language models (SLMs) transition from personal devices into shared, multi-user environments, their responses must account for far more than the words alone. Who is speaking, how they sound, and where the conversation takes place can each turn an otherwise benign request into one that is unsafe, unfair, or privacy-violating. Existing benchmarks, however, largely focus on basic audio comprehension, study individual risks in isolation, or conflate content that is inherently harmful with ...

📄 DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off
🗓️ Published: 4/15/2026
🔗 http://arxiv.org/abs/2604.13902v1
👥 Authors: Xiaofan Li, Ming Yang (possible past Meta (United States) affiliation), Zhiyuan Ma, Shichao Ma, Jintao Du, Yu Cheng (possible past National University Of Singapore affiliation), Weiqiang Wang, Zhizhong Zhang, Xin Tan, Yanyun Qu, Lizhuang Ma (possible past Shanghai Jiao Tong University affiliation), Yuan Xie
Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration-exploitation trade-off remains a critical challenge. In this paper, we analyze the exploration-exploitation dilemma posed by extremely hard and easy samples during training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity space disentangling strateg...
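
The abstract is cut exactly at the disentangling strategy, so only the generic scaffolding is sketched here: per-sample perplexity from token log-probs, with invented thresholds for bucketing extremely easy vs. extremely hard samples.

```python
import math

def sequence_perplexity(token_logprobs):
    """Perplexity of one sampled response from its per-token log-probs."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def difficulty_bucket(token_logprobs, easy_ppl=1.5, hard_ppl=20.0):
    """Invented thresholds: low perplexity means the model is already
    confident (exploit); high perplexity means it is uncertain (explore)."""
    ppl = sequence_perplexity(token_logprobs)
    return "easy" if ppl < easy_ppl else "hard" if ppl > hard_ppl else "medium"

print(difficulty_bucket([-0.05, -0.1, -0.02]))  # low perplexity -> "easy"
print(difficulty_bucket([-3.2, -4.0, -2.9]))    # high perplexity -> "hard"
```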

📄 RPS: Information Elicitation with Reinforcement Prompt Selection
🗓️ Published: 4/15/2026
🔗 http://arxiv.org/abs/2604.13817v1
👥 Authors: Tao Wang (possible past Stanford University affiliation), Jingyao Lu, Xibo Wang, Haonan Huang, Su Yao, Zhiqiang Hu (possible past Peking University affiliation), Xingyan Chen, Enmao Diao
Abstract

Large language models (LLMs) have shown remarkable capabilities in dialogue generation and reasoning, yet their effectiveness in eliciting user-known but concealed information in open-ended conversations remains limited. In many interactive AI applications, such as personal assistants, tutoring systems, and legal or clinical support, users often withhold sensitive or uncertain information due to privacy concerns, ambiguity, or social hesitation. This makes it challenging for LLMs to gather compl...
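
The title suggests choosing among candidate elicitation prompts with a reinforcement signal. A plain UCB1 bandit over a fixed prompt pool is the minimal version of that idea, sketched below with an invented reward convention; the paper's method is presumably richer.

```python
import math

class PromptBandit:
    """UCB1 over a fixed pool of elicitation prompts. A minimal stand-in
    for reinforcement prompt selection, not the paper's actual method."""
    def __init__(self, prompts):
        self.prompts = prompts
        self.counts = [0] * len(prompts)
        self.values = [0.0] * len(prompts)
        self.t = 0

    def select(self):
        self.t += 1
        for i, c in enumerate(self.counts):
            if c == 0:
                return i                       # try each prompt once first
        return max(range(len(self.prompts)),
                   key=lambda i: self.values[i]
                   + math.sqrt(2 * math.log(self.t) / self.counts[i]))

    def update(self, i, reward):
        # Invented convention: reward 1.0 if the turn elicited new
        # information from the user, 0.0 otherwise.
        self.counts[i] += 1
        self.values[i] += (reward - self.values[i]) / self.counts[i]
```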

*Notable papers are those with at least two authors from a "big" AI/ML lab.