📄 Notable* Recent AI/ML arXiv Papers

📄 OphMAE: Bridging Volumetric and Planar Imaging with a Foundation Model for Adaptive Ophthalmological Diagnosis
🗓️ Published: 5/4/2026
🔗 http://arxiv.org/abs/2605.02714v1
👥 Authors: Tienyu Chang, Zhen Chen, Renjie Liang, Jinyu Ding, Jie Xu, Sunu Mathew, Amir Reza Hajrasouliha, Andrew J. Saykin, Ruogu Fang, Yu Huang (possible past Tencent (China) affiliation), Jiang Bian (possible past Baidu (China) affiliation), Qingyu Chen
Abstract

The advent of foundation models has heralded a new era in medical artificial intelligence (AI), enabling the extraction of generalizable representations from large-scale unlabeled datasets. However, current ophthalmic AI paradigms are predominantly constrained to single-modality inference, thereby creating a dissonance with clinical practice where diagnosis relies on the synthesis of complementary imaging modalities. Furthermore, the deployment of high-performance AI in resource-limited settings...

📄 SAIL: Structure-Aware Interpretable Learning for Anatomy-Aligned Post-hoc Explanations in OCT
🗓️ Published: 5/4/2026
🔗 http://arxiv.org/abs/2605.02707v1
👥 Authors: Tienyu Chang, Tianhao Li, Ruogu Fang, Jiang Bian (possible past Baidu (China) affiliation), Yu Huang (possible past Tencent (China) affiliation)
Abstract

Optical coherence tomography (OCT), a commonly used retinal imaging modality, plays a central role in retinal disease diagnosis by providing high-resolution visualization of retinal layers. While deep learning (DL) has achieved expert-level accuracy in OCT-based retinal disease detection, its "black box" nature poses challenges for clinical adoption, where explainability is essential for clinical trust and regulatory approval. Existing post-hoc explainable AI (XAI) methods often struggle to deli...

📄 AcademiClaw: When Students Set Challenges for AI Agents
🗓️ Published: 5/4/2026
🔗 http://arxiv.org/abs/2605.02661v1
👥 Authors: Junjie Yu, Pengrui Lu, Weiye Si, Hongliang Lu, Jiabao Wu, Kaiwen Tao, Kun Wang, Lingyu Yang, Qiran Zhang, Xiuting Guo, Xuanyu Wang, Yang Wang (possible past Baidu (China) affiliation), Yanjie Wang, Yi Yang (possible past Baidu (China) affiliation), Zijian Hu, Ziyi Yang (possible past Tencent (China) affiliation), Zonghan Zhou, Binghao Qiang, Borui Zhang, Chenning Li, Enchang Zhang, Feifan Chen, Feng Jian, Fengyin Sun, Hao Qiu, Hao Zheng, Haoran Zhu, Hongyu Liu, Jianbin Deng, Jiaxin Song, Jiaying Chi, Jiayou Shi, Jie Fang, Jinghui Zhong, Jingyu Zhou, Jinze Li, Junfeng Yi, Junyan Yu, Junzhi Xue, Ni Song, Pengyi Chen, Qi Chen (possible past Baidu (China) affiliation), Quansheng Li, Rui Tao, Shenghai Gong, Shenhang Lu, Tianqi Shen, Tianxiang Zhu, Tiehan Kang, Tingyu Li, Wendi Wu, Xiao Shen, Xiao Zhou, Xiaotao Zhang, Xinrong Li, Xuankun Yang, Xun Zhang, Yan Li (possible past Tencent (China) affiliation), Ye Lu, Yi Wang, Yibo Zhou, Yichi Zhang, Yihao Sun, Yijun Huang, Yixin Zhu, Yixuan Wu, Yuchen Sun, Yue Wu, Yuheng Sun, Yukun Li (possible past Baidu (China) affiliation), Yutian Tu, Yuxuan Qin, Yuzhuo Wu, Zeyu Li (possible past Peking University affiliation), Zhengyu Lou, Zhenning Ran, Zizhu He, Pengfei Liu
Abstract

Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows -- homework, research projects, competitions, and personal projects -- that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates...

📄 Orchestrating Spatial Semantics via a Zone-Graph Paradigm for Intricate Indoor Scene Generation
🗓️ Published: 5/4/2026
🔗 http://arxiv.org/abs/2605.02537v1
👥 Authors: Meisheng Zhang, Shizhao Sun, Yang Zhao (possible past Google (United States) affiliation), Ziyuan Liu, Zhijun Gao, Jiang Bian (possible past Baidu (China) affiliation)
Abstract

Autonomous 3D indoor scene synthesis breaks down in non-convex rooms with tightly coupled spatial constraints. Data-driven generators lack topological priors for long-horizon planning, while iterative agents fragment semantics and become geometrically brittle. We present ZoneMaestro, a unified framework that shifts the paradigm from object-centric synthesis to Zone-Graph Orchestration. By internalizing a novel zone-based logic, ZoneMaestro translates high-level semantic intent into functional zo...

📄 FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
🗓️ Published: 5/4/2026
🔗 http://arxiv.org/abs/2605.02411v1
👥 Authors: Kyle Zheng, Han Zhang (possible past Tsinghua University affiliation), Renliang Sun, Chenchen Ye, Wei Wang (possible past University Of Oxford affiliation)
Abstract

A semantic gap separates how users describe tasks from how tools are documented. As API ecosystems scale to tens of thousands of endpoints, static retrieval from the initial query alone cannot bridge this gap: the agent's understanding of what it needs evolves during execution, but its tool set does not. We introduce FitText, a training-free framework that makes retrieval dynamic by embedding it directly in the agent's reasoning loop. FitText generates natural-language pseudo-tool descriptions a...

📄 HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness
🗓️ Published: 5/4/2026
🔗 http://arxiv.org/abs/2605.02396v1
👥 Authors: Jianing Wang, Linsen Guo, Zhengyu Chen, Qi Guo, Hongyu Zang, Wenjie Shi, Haoxiang Ma, Xiangyu Xi, Xiaoyu Li (possible past Tencent (China) affiliation), Wei Wang (possible past University Of Oxford affiliation), Xunliang Cai
Abstract

Recent advances in agentic harness with orchestration frameworks that coordinate multiple agents with memory, skills, and tool use have achieved remarkable success in complex reasoning tasks. However, the underlying mechanism that truly drives performance remains obscured behind intricate system designs. In this paper, we propose HeavySkill, a perspective that views heavy thinking not only as a minimal execution unit in orchestration harness but also as an inner skill internalized within the mod...

📄 Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning
🗓️ Published: 5/4/2026
🔗 http://arxiv.org/abs/2605.02378v1
👥 Authors: Haoyu Wang (possible past Tencent (China) affiliation), Haonan Wang, Yuyan Chen, Jun Chen, Gang Liu (possible past Tencent (China) affiliation), Qian Wang, Jiahong Yan, Yanghua Xiao
Abstract

In-context learning (ICL) allows large models to adapt to tasks using a few examples, yet its extension to vision-language models (VLMs) remains fragile. Our analysis reveals that the fundamental limitation lies in an inductive gap: models often produce correct answers from flawed reasoning, while struggling to extract consistent rules across demonstrations. This gap is further exacerbated by two visual-level obstacles: an overwhelming proportion of redundant visual tokens that obscure textual c...

📄 Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
🗓️ Published: 5/3/2026
🔗 http://arxiv.org/abs/2605.01970v1
👥 Authors: Debeshee Das, Julien Piet, Darya Kaviani, Luca Beurer-Kellner, Florian Tramèr (possible past Stanford University affiliation), David Wagner (possible past University Of California, Berkeley affiliation)
Abstract

Memory systems enable otherwise-stateless LLM agents to persist user information across sessions, but also introduce a new attack surface. We characterize the Trojan Hippo attack, a class of persistent memory attacks that operates in a more realistic threat model than prior memory poisoning work: the attacker plants a dormant payload into an agent's long-term memory via a single untrusted tool call (e.g., a crafted email), which activates only when the user later discusses sensitive topics such ...

📄 TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation
🗓️ Published: 5/3/2026
🔗 http://arxiv.org/abs/2605.01809v1
👥 Authors: Xiaoda Yang, Majun Zhang, Changhao Pan, Nick Huang, Yang Yuguang, Fan Zhuo, Pengfei Zhou, Jin Zhou (possible past Google (United States) affiliation), Sizhe Shan, Shan Yang (possible past Google (United States) affiliation), Miles Yang, Yang You (possible past University Of California, Berkeley affiliation), Zhou Zhao
Abstract

Unified audio-visual generation is rapidly gaining industrial and creative relevance, enabling applications in virtual production and interactive media. However, when moving from general audio-video synthesis to music-dance co-generation, the task becomes substantially harder: musical rhythm, phrasing, and accents must drive choreographic motion at fine temporal resolution, and such rhythmic coupling is not captured by unimodal metrics or generic audiovisual consistency scores used in current ev...

📄 FEDIN: Frequency-Enhanced Deep Interest Network for Click-Through Rate Prediction
🗓️ Published: 5/3/2026
🔗 http://arxiv.org/abs/2605.01726v1
👥 Authors: Zenan Dai, Jinpeng Wang (possible past Tencent (China) affiliation), Junwei Pan, Dapeng Liu (possible past Tsinghua University affiliation), Lei Xiao (possible past Tencent (China) affiliation), Shu-Tao Xia
Abstract

Sequential recommendation models often struggle to capture latent periodic patterns in user interests, primarily due to the noise inherent in time-domain behavioral data. While frequency-domain analysis offers a global perspective to address this, existing approaches typically treat user sequences in isolation, overlooking the crucial context of the target item. In this work, we present a novel empirical observation: user attention scores exhibit distinct spectral entropy distributions when cond...

📄 Motion-Aware Caching for Efficient Autoregressive Video Generation
🗓️ Published: 5/3/2026
🔗 http://arxiv.org/abs/2605.01725v1
👥 Authors: Jing Xu (possible past Meta (United States) affiliation), Yuexiao Ma, Songwei Liu, Xuzhe Zheng, Shiwei Liu, Chenqian Yan, Xiawu Zheng, Rongrong Ji (possible past Tencent (China) affiliation), Fei Chao, Xing Wang (possible past Tencent (China) affiliation)
Abstract

Autoregressive video generation paradigms offer theoretical promise for long video synthesis, yet their practical deployment is hindered by the computational burden of sequential iterative denoising. While cache reuse strategies can accelerate generation by skipping redundant denoising steps, existing methods rely on coarse-grained chunk-level skipping that fails to capture fine-grained pixel dynamics. This oversight is critical: pixels with high motion require more denoising steps to prevent er...

📄 VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
🗓️ Published: 5/4/2026
🔗 http://arxiv.org/abs/2605.02834v1
👥 Authors: Tanush Yadav, Mohammadreza Salehi, Jae Sung Park (possible past University Of California, Berkeley affiliation), Vivek Ramanujan, Hannaneh Hajishirzi (possible past University Of Washington affiliation), Yejin Choi (possible past Allen Institute For Artificial Intelligence affiliation), Ali Farhadi (possible past University Of Washington affiliation), Rohun Tripathi, Ranjay Krishna (possible past University Of Washington affiliation)
Abstract

Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities. To revitalize action recognition in the era of VLMs, we advocate for a returned focus on domain-specific actions. To this end, we intro...

📄 Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs
🗓️ Published: 5/4/2026
🔗 http://arxiv.org/abs/2605.02735v1
👥 Authors: Xin Zhang (possible past Google (United States) affiliation), Qiqi Tao, Jiawei Du, Moyun Liu, Joey Tianyi Zhou (possible past Tencent (China) affiliation)
Abstract

Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a previously overlooked optimization pathology in existing latent visual reasoning methods: although visual latents become semantically enriched during training, their contribution to final answer prediction is systematically suppressed. Within the shared parameter space...

📄 CARD: Coarse-to-fine Autoregressive Modeling with Radix-based Decomposition for Transferable Free Energy Estimation
🗓️ Published: 5/4/2026
🔗 http://arxiv.org/abs/2605.02657v1
👥 Authors: Ziyang Yu, Yi He, Wenbing Huang (possible past Tsinghua University affiliation), Wen Yan, Yang Liu (possible past Tsinghua University affiliation)
Abstract

Estimating free energy differences quantifies thermodynamic preferences in molecular interactions, which is central to chemistry and drug discovery. Despite fruitful progress, existing methods still face key limitations: classical computational approaches remain prohibitively expensive due to their reliance on extensive molecular dynamics simulations, while deep learning-based methods are constrained by either less-expressive generative models or input dimensions tied to a specific system, resul...

📄 Black-box optimization of noisy functions with unknown smoothness
🗓️ Published: 5/4/2026
🔗 http://arxiv.org/abs/2605.02462v1
👥 Authors: Jean-Bastien Grill (possible past Deepmind (United Kingdom) affiliation), Michal Valko, Rémi Munos (possible past Google (United States) affiliation)
Abstract

We study the problem of black-box optimization of a function f of any dimension, given function evaluations perturbed by noise. The function is assumed to be locally smooth around one of its global optima, but this smoothness is unknown. Our contribution is an adaptive optimization algorithm, POO or parallel optimistic optimization, that is able to deal with this setting. POO performs almost as well as the best known algorithms requiring the knowledge of the smoothness. Furthermore, POO works fo...

📄 Geometric and Spectral Alignment for Deep Neural Network II
🗓️ Published: 5/4/2026
🔗 http://arxiv.org/abs/2605.02111v1
👥 Authors: Ziran Liu, Wei Wang (possible past University Of Oxford affiliation), Jinhao Wang, Pengcheng Wang, Xinyi Sui, Cihan Ruan, Nam Ling, Wei Jiang (possible past Apple (United States) affiliation)
Abstract

This paper develops the angular and static-channel component of Geometric and Spectral Alignment for residual Jacobian chains. Starting from Cartan-coordinate rigidity and fitted effective-rank windows, we study how dominant singular subspaces are transported across adjacent layers and how the resulting finite matrices can be displayed in physical channel coordinates. The main results are deterministic, margin-verified results. We bound the error between full interface transport and its domina...

📄 Geometric and Spectral Alignment for Deep Neural Network I
🗓️ Published: 5/4/2026
🔗 http://arxiv.org/abs/2605.02108v1
👥 Authors: Ziran Liu, Wei Wang (possible past University Of Oxford affiliation), Jinhao Wang, Pengcheng Wang, Xinyi Sui, Cihan Ruan, Nam Ling, Wei Jiang (possible past Apple (United States) affiliation)
Abstract

Deep residual architectures are modeled as products of near-identity Jacobians. This paper proves deterministic quotient-geometric estimates for singular spectra of Frobenius-normalized layer factors, emphasizing a normalized top-radial Cartan coordinate and fitted power-law chart. Full-rank factors are mapped from $\mathrm{GL}(d)$ to the positive cone by $A\mapsto A^\top A$, then to ordered eigenvalue data. Under Frobenius normalization, exact power-law spectra form a trace-normalized Cartan ...

📄 The (Marginal) Value of a Search Ad: An Online Causal Framework for Repeated Second-price Auctions
🗓️ Published: 5/3/2026
🔗 http://arxiv.org/abs/2605.01756v1
👥 Authors: Yuxiao Wen, Zihao Hu, Yanjun Han, Yuan Yao (possible past Tsinghua University affiliation), Zhengyuan Zhou (possible past Stanford University affiliation)
Abstract

Existing auto-bidding algorithms in digital advertising often treat the value of an ad opportunity as the revenue obtained when an ad is shown and/or clicked, and bid accordingly. This can lead to wasteful spending because the true value is the marginal gain from paid exposure: even without winning a sponsored slot, an advertiser may still earn revenue via an organic search result (e.g., on Google or Amazon). Motivated by recent work, we model ad value as a treatment effect--the outcome differen...

*Notable papers are those with at least two authors from a "big" AI/ML lab.