📄 Notable* Recent AI/ML arXiv Papers


📄 DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models
🗓️ Published: 4/7/2026
🔗 http://arxiv.org/abs/2604.06161v1
👥 Authors: Zhengming Yu, Li Ma, Mingming He, Leo Isikdogan, Yuancheng Xu, Dmitriy Smirnov, Pablo Salamanca, Dao Mi, Pablo Delgado, Ning Yu, Julien Philip, Xin Li (possible past Google (United States) affiliation), Wenping Wang, Paul Debevec (possible past Google (United States) affiliation)
Abstract

Most digital videos are stored in 8-bit low dynamic range (LDR) formats, where much of the original high dynamic range (HDR) scene radiance is lost due to saturation and quantization. This loss of highlight and shadow detail precludes accurate luminance mapping on HDR displays and limits meaningful re-exposure in post-production workflows. Although techniques have been proposed to convert LDR images to HDR through dynamic range expansion, they struggle to restore realistic detail in the over- an...
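The dynamic-range loss the abstract describes can be shown with a toy simulation (our own sketch, not the paper's code): radiance above the white point saturates, and the rest is quantized to 256 levels.

```python
import numpy as np

def to_ldr(radiance, exposure=1.0):
    """Simulate 8-bit LDR capture: scale, clip to [0, 1], quantize to 256 levels."""
    clipped = np.clip(radiance * exposure, 0.0, 1.0)
    return np.round(clipped * 255).astype(np.uint8)

radiance = np.array([0.001, 0.5, 1.0, 4.0, 16.0])  # toy HDR scene values
ldr = to_ldr(radiance)
# The highlights at 4.0 and 16.0 both collapse to the same code value 255:
# exactly the detail that re-exposure methods must restore.
```

Once distinct radiances map to the same 8-bit code, no deterministic inverse exists, which is why generative priors (here, video diffusion) are brought in.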

📄 Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
🗓️ Published: 4/7/2026
🔗 http://arxiv.org/abs/2604.06132v1
👥 Authors: Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li (possible past Carnegie Mellon University affiliation), Lingpeng Kong (possible past Google (United States) affiliation), Qi Liu (possible past Tencent (China) affiliation), Zhifang Sui (possible past Peking University affiliation), Tong Yang (possible past Peking University affiliation)
Abstract

Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified...

📄 Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution
🗓️ Published: 4/7/2026
🔗 http://arxiv.org/abs/2604.05955v1
👥 Authors: Kai Yu (possible past Baidu (China) affiliation), Zhenhao Zhou, Junhao Zeng, Ying Wang (possible past Tsinghua University affiliation), Xueying Du, Zhiqiang Yuan, Junwei Liu, Ziyu Zhou, Yujia Wang, Chong Wang (possible past Google (United States) affiliation), Xin Peng
Abstract

Repository-level issue resolution benchmarks have become a standard testbed for evaluating LLM-based agents, yet success is still predominantly measured by test pass rates. In practice, however, acceptable patches must also comply with project-specific design constraints, such as architectural conventions, error-handling policies, and maintainability requirements, which are rarely encoded in tests and are often documented only implicitly in code review discussions. This paper introduces \textit{...

📄 HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference
🗓️ Published: 4/7/2026
🔗 http://arxiv.org/abs/2604.05887v1
👥 Authors: Bowen Zeng, Feiyang Ren, Jun Zhang (possible past Tencent (China) affiliation), Xiaoling Gu, Ke Chen (possible past Tencent (China) affiliation), Lidan Shou, Huan Li
Abstract

Multimodal Large Language Models (MLLMs) have advanced unified reasoning over text, images, and videos, but their inference is hindered by the rapid growth of key-value (KV) caches. Each visual input expands into thousands of tokens, causing caches to scale linearly with context length and remain resident in GPU memory throughout decoding, which leads to prohibitive memory overhead and latency even on high-end GPUs. A common solution is to compress caches under a fixed allocated budget at differ...
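The memory pressure the abstract describes follows from simple arithmetic; the sketch below (our own back-of-envelope, with a hypothetical 7B-class config, not the paper's numbers) shows how a long visual context alone can exceed a single GPU's memory.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, dtype_bytes=2):
    """KV cache size: each layer stores a key and a value vector per head per token."""
    return 2 * n_layers * n_heads * head_dim * seq_len * dtype_bytes  # 2 = K and V

# Hypothetical config: 32 layers, 32 heads, head_dim 128, fp16 cache.
# A long video at a few hundred tokens per frame easily reaches 200k tokens.
gb = kv_cache_bytes(32, 32, 128, 200_000) / 1024**3
print(f"{gb:.1f} GiB")  # prints "97.7 GiB"
```

This linear growth with `seq_len` is what fixed-budget compression schemes like the one described here try to cap.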

📄 Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing
🗓️ Published: 4/7/2026
🔗 http://arxiv.org/abs/2604.05719v1
👥 Authors: Jiaren Peng, Zeqin Li, Chang You, Yan Wang (possible past Tencent (China) affiliation), Hanlin Sun, Xuan Tian, Shuqiao Zhang, Junyi Liu, Jianguo Zhao, Renyang Liu, Haoran Ou, Yuqiang Sun, Jiancheng Zhang, Yutong Jiao, Kunshu Song, Chao Zhang, Fan Shi, Hongda Sun, Rui Yan (possible past Peking University affiliation), Cheng Huang
Abstract

The rapid advancement of Large Language Models (LLMs) has created new opportunities for Automated Penetration Testing (AutoPT), spawning numerous frameworks aimed at achieving end-to-end autonomous attacks. However, despite the proliferation of related studies, existing research generally lacks systematic architectural analysis and large-scale empirical comparisons under a unified benchmark. Therefore, this paper presents the first Systematization of Knowledge (SoK) focusing on the architectural...

📄 Experience Transfer for Multimodal LLM Agents in Minecraft Game
🗓️ Published: 4/7/2026
🔗 http://arxiv.org/abs/2604.05533v1
👥 Authors: Chenghao Li, Jun Liu (possible past Tencent (China) affiliation), Songbo Zhang, Huadong Jian, Hao Ni, Lik-Hang Lee, Sung-Ho Bae, Guoqing Wang, Yang Yang (possible past Tencent (China) affiliation), Chaoning Zhang
Abstract

Multimodal LLM agents operating in complex game environments must continually reuse past experience to solve new tasks efficiently. In this work, we propose Echo, a transfer-oriented memory framework that enables agents to derive actionable knowledge from prior interactions rather than treating memory as a passive repository of static records. To make transfer explicit, Echo decomposes reusable knowledge into five dimensions: structure, attribute, process, function, and interaction. This formula...
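A minimal sketch of a memory entry carrying Echo's five stated knowledge dimensions (the field semantics and example values are our guesses; the abstract is truncated before the paper defines them):

```python
from dataclasses import dataclass

@dataclass
class ExperienceEntry:
    structure: str    # how task components relate (assumed meaning)
    attribute: str    # properties of involved objects (assumed meaning)
    process: str      # step ordering that worked (assumed meaning)
    function: str     # what an action accomplishes (assumed meaning)
    interaction: str  # how to engage with the environment (assumed meaning)

entry = ExperienceEntry(
    structure="crafting table requires 4 planks",
    attribute="oak logs yield oak planks",
    process="chop log -> craft planks -> craft table",
    function="crafting table unlocks 3x3 recipes",
    interaction="right-click the table to open the crafting grid",
)
```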

📄 Market-Bench: Benchmarking Large Language Models on Economic and Trade Competition
🗓️ Published: 4/7/2026
🔗 http://arxiv.org/abs/2604.05523v1
👥 Authors: Yushuo Zheng, Huiyu Duan, Zicheng Zhang, Yucheng Zhu (possible past Shanghai Jiao Tong University affiliation), Xiongkuo Min, Guangtao Zhai (possible past Shanghai Jiao Tong University affiliation)
Abstract

The ability of large language models (LLMs) to manage and acquire economic resources remains unclear. In this paper, we introduce Market-Bench, a comprehensive benchmark that evaluates the capabilities of LLMs in economically-relevant tasks through economic and trade competition. Specifically, we construct a configurable multi-agent supply chain economic model where LLMs act as retailer agents responsible for procuring and retailing merchandise. In the procurement stage, LLMs b...

📄 TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems
🗓️ Published: 4/7/2026
🔗 http://arxiv.org/abs/2604.05364v1
👥 Authors: Md Atik Ahamed, Mihir Parmar, Palash Goyal, Yiwen Song, Long T. Le, Qiang Cheng, Chun-Liang Li, Hamid Palangi, Jinsung Yoon (possible past Google (United States) affiliation), Tomas Pfister (possible past University Of Oxford affiliation)
Abstract

We introduce TFRBench, the first benchmark designed to evaluate the reasoning capabilities of forecasting systems. Traditionally, time-series forecasting has been evaluated solely on numerical accuracy, treating foundation models as "black boxes." Unlike existing benchmarks, TFRBench provides a protocol for evaluating the reasoning generated by forecasting systems, specifically their analysis of cross-channel dependencies, trends, and external events. To enable this, we propose a systematic mu...

📄 ETR: Entropy Trend Reward for Efficient Chain-of-Thought Reasoning
🗓️ Published: 4/7/2026
🔗 http://arxiv.org/abs/2604.05355v1
👥 Authors: Xuan Xiong, Huan Liu (possible past Tsinghua University affiliation), Li Gu, Zhixiang Chi, Yue Qiu, Yuanhao Yu, Yang Wang (possible past Baidu (China) affiliation)
Abstract

Chain-of-thought (CoT) reasoning improves large language model performance on complex tasks, but often produces excessively long and inefficient reasoning traces. Existing methods shorten CoTs using length penalties or global entropy reduction, implicitly assuming that low uncertainty is desirable throughout reasoning. We show instead that reasoning efficiency is governed by the trajectory of uncertainty. CoTs with dominant downward entropy trends are substantially shorter. Motivated by this ins...
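One way to operationalize the "entropy trend" the abstract contrasts with global entropy reduction (our own assumption of the measurement, not the paper's definition): fit a least-squares slope to per-token entropies, so a negative slope marks the downward trend associated with shorter CoTs.

```python
def entropy_trend(entropies):
    """Least-squares slope of entropy over token position; negative = falling."""
    n = len(entropies)
    mean_x = (n - 1) / 2
    mean_y = sum(entropies) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(entropies))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var

confident_trace = [2.1, 1.8, 1.2, 0.9, 0.5]   # entropy falls steadily
meandering_trace = [1.5, 1.9, 1.4, 2.0, 1.7]  # no clear downward trend
```

A trend-based reward can then favor traces like the first while penalizing neither high nor low entropy per se, only its trajectory.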

📄 MedGemma 1.5 Technical Report
🗓️ Published: 4/6/2026
🔗 http://arxiv.org/abs/2604.05081v1
👥 Authors: Andrew Sellergren, Chufan Gao, Fereshteh Mahvar, Timo Kohlberger (possible past Google (United States) affiliation), Fayaz Jamil, Madeleine Traverse, Alberto Tono, Bashir Sadjad, Lin Yang, Charles Lau, Liron Yatziv (possible past Google (United States) affiliation), Tiffany Chen, Bram Sterling, Kenneth Philbrick, Richa Tiwari, Yun Liu (possible past Google (United States) affiliation), Madhuram Jajoo, Chandrashekar Sankarapu, Swapnil Vispute (possible past Google (United States) affiliation), Harshad Purandare, Abhishek Bijay Mishra, Sam Schmidgall, Tao Tu (possible past Google (United States) affiliation), Anil Palepu (possible past Google (United States) affiliation), Chunjong Park, Tim Strother, Rahul Thapa, Yong Cheng (possible past Tsinghua University affiliation), Preeti Singh, Kat Black, Yossi Matias (possible past Google (United States) affiliation), Katherine Chou (possible past Google (United States) affiliation), Avinatan Hassidim (possible past Google (United States) affiliation), Kavi Goel, Joelle Barral, Tris Warkentin, Shravya Shetty (possible past Google (United States) affiliation), Dale Webster, Sunny Virmani (possible past Google (United States) affiliation), David F. Steiner (possible past Google (United States) affiliation), Can Kirmizibayrak, Daniel Golden
Abstract

We introduce MedGemma 1.5 4B, the latest model in the MedGemma collection. MedGemma 1.5 expands on MedGemma 1 by integrating additional capabilities: high-dimensional medical imaging (CT/MRI volumes and histopathology whole slide images), anatomical localization via bounding boxes, multi-timepoint chest X-ray analysis, and improved medical document understanding (lab reports, electronic health records). We detail the innovations required to enable these modalities within a single architecture, i...

📄 PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing
🗓️ Published: 4/6/2026
🔗 http://arxiv.org/abs/2604.05018v1
👥 Authors: Yiwen Song, Yale Song, Tomas Pfister (possible past University Of Oxford affiliation), Jinsung Yoon (possible past Google (United States) affiliation)
Abstract

Synthesizing unstructured research materials into manuscripts is an essential yet under-explored challenge in AI-driven scientific discovery. Existing autonomous writers are rigidly coupled to specific experimental pipelines and produce superficial literature reviews. We introduce PaperOrchestra, a multi-agent framework for automated AI research paper writing. It flexibly transforms unconstrained pre-writing materials into submission-ready LaTeX manuscripts, including comprehensive literature s...

📄 Vero: An Open RL Recipe for General Visual Reasoning
🗓️ Published: 4/6/2026
🔗 http://arxiv.org/abs/2604.04917v2
👥 Authors: Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen (possible past Stanford University affiliation), Zhuang Liu (possible past University Of California, Berkeley affiliation)
Abstract

What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) show such broad visual reasoning is within reach, but the recipe behind them remains unclear, locked behind proprietary reinforcement learning (RL) pipelines with non-public data. We introduce Vero, a family of fully open VLMs that matches or exceeds existing open-weight models across diverse visual reasoning tasks. We scale RL d...

📄 QED-Nano: Teaching a Tiny Model to Prove Hard Theorems
🗓️ Published: 4/6/2026
🔗 http://arxiv.org/abs/2604.04898v1
👥 Authors: LM-Provers, Yuxiao Qu, Amrith Setlur, Jasper Dekoninck, Edward Beeching, Jia Li (possible past Google (United States) affiliation), Ian Wu, Lewis Tunstall, Aviral Kumar (possible past University Of California, Berkeley affiliation)
Abstract

Proprietary AI systems have recently demonstrated impressive capabilities on complex proof-based problems, with gold-level performance reported at the 2025 International Mathematical Olympiad (IMO). However, the training pipelines behind these systems remain largely undisclosed, and their reliance on large "internal" models and scaffolds makes them expensive to run, difficult to reproduce, and hard to study or improve upon. This raises a central question: can small, open models also be trained t...

📄 DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing
🗓️ Published: 4/6/2026
🔗 http://arxiv.org/abs/2604.04875v1
👥 Authors: Ke Li (possible past University Of California, Berkeley affiliation), Maoliang Li, Jialiang Chen, Jiayu Chen, Zihao Zheng, Shaoqi Wang, Xiang Chen (possible past Tencent (China) affiliation)
Abstract

Video mashup creation represents a complex video editing paradigm that recomposes existing footage to craft engaging audio-visual experiences, demanding intricate orchestration across semantic, visual, and auditory dimensions at multiple levels. However, existing automated editing frameworks often overlook the cross-level multimodal orchestration needed to achieve professional-grade fluidity, resulting in disjointed sequences with abrupt visual transitions and musical misalignment. To address this, we...

📄 HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models
🗓️ Published: 4/7/2026
🔗 http://arxiv.org/abs/2604.06165v1
👥 Authors: Reihaneh Zohrabi, Hosein Hasani, Akshita Gupta, Mahdieh Soleymani Baghshah, Anna Rohrbach (possible past University Of California, Berkeley affiliation), Marcus Rohrbach (possible past University Of California, Berkeley affiliation)
Abstract

Large vision-language models can produce object hallucinations in image descriptions, highlighting the need for effective detection and mitigation strategies. Prior work commonly relies on the model's attention weights on visual tokens as a detection signal. We reveal that coarse-grained attention-based analysis is unreliable due to hidden confounders, specifically token position and object repetition in a description. This leads to Simpson's paradox: the attention trends reverse or disappear wh...
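The Simpson's paradox the abstract invokes can be demonstrated numerically (toy data of our own, not the paper's): within each stratum the trend is positive, yet pooling the strata reverses its sign, which is exactly the failure mode attributed to confounders like token position.

```python
def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cov = sum((x - mx) * (y - my) for x, y in points)
    var = sum((x - mx) ** 2 for x, _ in points)
    return cov / var

group_a = [(1, 10), (2, 11)]   # e.g. early token positions
group_b = [(8, 2), (9, 3)]     # e.g. late token positions
pooled = group_a + group_b
# slope(group_a) > 0 and slope(group_b) > 0, yet slope(pooled) < 0
```

Any detector that reads the pooled attention trend without stratifying by the confounder would conclude the opposite of what holds in every stratum.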

📄 A deep learning framework for jointly solving transient Fokker-Planck equations with arbitrary parameters and initial distributions
🗓️ Published: 4/7/2026
🔗 http://arxiv.org/abs/2604.06001v1
👥 Authors: Xiaolong Wang (possible past Carnegie Mellon University affiliation), Jing Feng, Qi Liu (possible past Tencent (China) affiliation), Chengli Tan, Yuanyuan Liu, Yong Xu (possible past Tencent (China) affiliation)
Abstract

Efficiently solving the Fokker-Planck equation (FPE) is central to analyzing complex parameterized stochastic systems. However, current numerical methods lack parallel computation capabilities across varying conditions, severely limiting comprehensive parameter exploration and transient analysis. This paper introduces a deep learning-based pseudo-analytical probability solution (PAPS) that, via a single training process, simultaneously resolves transient FPE solutions for arbitrary multi-modal i...
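For context, the transient FPE the abstract targets has the following standard form in one dimension (notation ours; the paper's parameterized, multi-modal setting is truncated away), with drift $f$ and diffusion $D$ acting on the probability density $p$:

```latex
\frac{\partial p(x,t)}{\partial t}
  = -\frac{\partial}{\partial x}\bigl[f(x)\,p(x,t)\bigr]
  + \frac{\partial^2}{\partial x^2}\bigl[D(x)\,p(x,t)\bigr]
```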

📄 QiMeng-PRepair: Precise Code Repair via Edit-Aware Reward Optimization
🗓️ Published: 4/7/2026
🔗 http://arxiv.org/abs/2604.05963v1
👥 Authors: Changxin Ke, Rui Zhang, Jiaming Guo, Yuanbo Wen, Li Ding, Shuo Wang (possible past Nvidia (United States) affiliation), Xuyuan Zhu, Xiong Peng, Di Huang (possible past Google (United States) affiliation), Zidong Du, Xing Hu (possible past Baidu (China) affiliation), Qi Guo, Yunji Chen
Abstract

Large Language Models (LLMs) achieve strong program repair performance but often suffer from over-editing, where excessive modifications overwrite correct code and hinder bug localization. We systematically quantify its impact and introduce the precise repair task, which maximizes reuse of correct code while fixing only the buggy parts. Building on this insight, we propose PRepair, a framework that mitigates over-editing and improves repair accuracy. PRepair has two components: Self-Breaking, which gene...
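The over-editing the abstract quantifies can be approximated with a simple reuse metric (our own sketch using stdlib `difflib`, not the paper's metric): the fraction of original lines a candidate patch preserves.

```python
import difflib

def preserved_ratio(original, patched):
    """Fraction of the original's lines that survive unchanged in the patch."""
    sm = difflib.SequenceMatcher(None, original.splitlines(), patched.splitlines())
    kept = sum(size for _, _, size in sm.get_matching_blocks())
    return kept / max(len(original.splitlines()), 1)

buggy = "def area(r):\n    pi = 3.14159\n    return pi * r\n"
# A precise patch touches only the buggy line; an over-editing patch
# rewrites the whole function and scores far lower.
precise = "def area(r):\n    pi = 3.14159\n    return pi * r * r\n"
rewrite = "import math\ndef area(radius):\n    return math.pi * radius ** 2\n"
```

Both patches may pass the same tests; only a reuse-aware signal like this distinguishes them, which is the gap an edit-aware reward targets.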

📄 One Model for All: Multi-Objective Controllable Language Models
🗓️ Published: 4/6/2026
🔗 http://arxiv.org/abs/2604.04497v1
👥 Authors: Qiang He, Yucheng Yang, Tianyi Zhou (possible past University Of Washington affiliation), Meng Fang (possible past Tencent (China) affiliation), Mykola Pechenizkiy, Setareh Maghsudi
Abstract

Aligning large language models (LLMs) with human preferences is critical for enhancing LLMs' safety, helpfulness, humor, faithfulness, etc. Current reinforcement learning from human feedback (RLHF) mainly focuses on a fixed reward learned from average human ratings, which may weaken adaptability and controllability across varying preferences. However, creating personalized LLMs requires aligning LLMs with individual human preferences, which is non-trivial due to the scarce data per user and the ...
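One common way to make a single model serve many preference profiles is linear scalarization of per-objective rewards (our assumption for illustration; the abstract is truncated before the paper's own mechanism):

```python
def scalarized_reward(rewards, weights):
    """Combine per-objective rewards with a user-specific weight vector summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(r * w for r, w in zip(rewards, weights))

# Hypothetical rewards for (safety, helpfulness, humor) of one response.
rewards = (0.9, 0.6, 0.2)
cautious_user = (0.7, 0.25, 0.05)  # weights safety heavily
playful_user = (0.2, 0.3, 0.5)     # weights humor heavily
```

The same response then scores differently per user, so conditioning generation on the weight vector yields one controllable model rather than one model per preference.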

📄 A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models
🗓️ Published: 4/6/2026
🔗 http://arxiv.org/abs/2604.04488v1
👥 Authors: Tianmeng Fang, Yong Wang (possible past Baidu (China) affiliation), Zetai Kong, Zengzhen Su, Jun Wang (possible past Tencent (China) affiliation), Chengjin Yu, Wei Wang (possible past University Of Oxford affiliation)
Abstract

Multimodal large language models have become an important infrastructure for unified processing of visual and linguistic tasks. However, such models are highly susceptible to backdoor implantation during supervised fine-tuning and will reliably output the attacker's predefined harmful responses once a specific trigger pattern is activated. The core challenge of backdoor defense lies in suppressing attack success under low poisoning ratios while preserving the model's normal generation ability. T...

*Notable papers are those with at least two authors from a "big" AI/ML lab.