📄 Notable* Recent AI/ML arXiv Papers

📄 Solving Physics Olympiad via Reinforcement Learning on Physics Simulators
🗓️ Published: 4/13/2026
🔗 http://arxiv.org/abs/2604.11805v1
👥 Authors: Mihir Prabhudesai, Aryan Satpathy, Yangmin Li, Zheyang Qin, Nikash Bhardwaj, Amir Zadeh, Chuan Li, Katerina Fragkiadaki (possible past University Of California, Berkeley affiliation), Deepak Pathak (possible past University Of California, Berkeley affiliation)
Abstract

We have witnessed remarkable advances in LLM reasoning capabilities with the advent of DeepSeek-R1. However, much of this progress has been fueled by the abundance of internet question-answer (QA) pairs, a major bottleneck going forward, since such data is limited in scale and concentrated mainly in domains like mathematics. In contrast, other sciences such as physics lack large-scale QA datasets to effectively train reasoning-capable models. In this work, we show that physics simulators can ser...

📄 ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
🗓️ Published: 4/13/2026
🔗 http://arxiv.org/abs/2604.11790v1
👥 Authors: Wei Zhao (possible past Tencent (China) affiliation), Zhe Li (possible past Google (United States) affiliation), Peixin Zhang, Jun Sun
Abstract

Tool-augmented Large Language Model (LLM) agents have demonstrated impressive capabilities in automating complex, multi-step real-world tasks, yet remain vulnerable to indirect prompt injection. Adversaries exploit this weakness by embedding malicious instructions within tool-returned content, which agents directly incorporate into their conversation history as trusted observations. This vulnerability manifests across three primary attack channels: web and local content injection, MCP server inj...

📄 General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks
🗓️ Published: 4/13/2026
🔗 http://arxiv.org/abs/2604.11778v1
👥 Authors: Junlin Liu, Shengnan An, Shuang Zhou, Dan Ma, Shixiong Luo, Ying Xie, Yuan Zhang (possible past Google (United States) affiliation), Wenling Yuan, Yifan Zhou, Xiaoyu Li (possible past Tencent (China) affiliation), Ziwen Wang, Xuezhi Cao, Xunliang Cai
Abstract

Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics. However, their ability to generalize these reasoning skills to more general and broader contexts, often termed general reasoning, remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge but still presents formidable reasoning challenges, such as complex constraints, nested logical branc...

📄 StarVLA-$α$: Reducing Complexity in Vision-Language-Action Systems
🗓️ Published: 4/13/2026
🔗 http://arxiv.org/abs/2604.11757v1
👥 Authors: Jinhui Ye, Ning Gao, Senqiao Yang, Jinliang Zheng, Zixuan Wang, Yuxin Chen, Pengguang Chen, Yilun Chen, Shu Liu (possible past Tencent (China) affiliation), Jiaya Jia (possible past Tencent (China) affiliation)
Abstract

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents. However, the VLA landscape remains highly fragmented and complex: existing approaches vary substantially in architectures, training data, embodiment configurations, and benchmark-specific engineering. In this work, we introduce StarVLA-$α$, a simple yet strong baseline designed to study VLA design choices under controlled conditions. StarVLA-$α$ deliberately minimizes...

📄 CodeTracer: Towards Traceable Agent States
🗓️ Published: 4/13/2026
🔗 http://arxiv.org/abs/2604.11641v1
👥 Authors: Han Li, Yifan Yao, Letian Zhu, Rili Feng, Hongyi Ye, Jiaming Wang, Yancheng He (possible past Tencent (China) affiliation), Pengyu Zou, Lehan Zhang, Xinping Lei, Haoyang Huang, Ken Deng, Ming Sun (possible past Baidu (China) affiliation), Zhaoxiang Zhang (possible past Beijing Academy Of Artificial Intelligence affiliation), He Ye, Jiaheng Liu
Abstract

Code agents are advancing rapidly, but debugging them is becoming increasingly difficult. As frameworks orchestrate parallel tool calls and multi-stage workflows over complex tasks, the agent's state transitions and error propagation become hard to observe. In these runs, an early misstep can trap the agent in unproductive loops or even cascade into fundamental errors, forming hidden error chains that make it hard to tell when the agent goes off track and why. Existing agent tracing analyses eit...

📄 SemaClaw: A Step Towards General-Purpose Personal AI Agents through Harness Engineering
🗓️ Published: 4/13/2026
🔗 http://arxiv.org/abs/2604.11548v1
👥 Authors: Ningyan Zhu, Huacan Wang, Jie Zhou (possible past Tsinghua University affiliation), Feiyu Chen, Shuo Zhang (possible past National University Of Defense Technology affiliation), Ge Chen, Chen Liu, Jiarou Wu, Wangyi Chen, Xiaofeng Mou, Yi Xu
Abstract

The rise of OpenClaw in early 2026 marks the moment when millions of users began deploying personal AI agents into their daily lives, delegating tasks ranging from travel planning to multi-step research. This scale of adoption signals that two parallel arcs of development have reached an inflection point. First is a paradigm shift in AI engineering, evolving from prompt and context engineering to harness engineering: designing the complete infrastructure necessary to transform unconstrained agent...

📄 A collaborative agent with two lightweight synergistic models for autonomous crystal materials research
🗓️ Published: 4/13/2026
🔗 http://arxiv.org/abs/2604.11540v1
👥 Authors: Tongyu Shi, Yutang Li, Zhanyuan Li, Qian Liu, Jie Zhou (possible past Tsinghua University affiliation), Wenhe Xu, Yang Li (possible past Google (United States) affiliation), Dawei Dai, Rui He, Wenhua Zhou, Jiahong Wang, Xue-Feng Yu
Abstract

Current large language models require hundreds of billions of parameters yet struggle with domain-specific reasoning and tool coordination in materials science. Here, we present MatBrain, a lightweight collaborative agent system with two synergistic models specialized for crystal materials research. MatBrain employs a dual-model architecture: Mat-R1 (30B parameters) as the analytical model providing expert-level domain reasoning, and Mat-T1 (14B parameters) as the executive model orchestratin...

📄 Think Before you Write: QA-Guided Reasoning for Character Descriptions in Books
🗓️ Published: 4/13/2026
🔗 http://arxiv.org/abs/2604.11435v1
👥 Authors: Argyrios Papoudakis, Mirella Lapata (possible past University Of Edinburgh affiliation), Frank Keller (possible past University Of Edinburgh affiliation)
Abstract

Character description generation is an important capability for narrative-focused applications such as summarization, story analysis, and character-driven simulations. However, generating accurate character descriptions from long-form narratives (e.g., novels) is challenging: models must track evolving attributes (e.g., relationships and events), integrate evidence scattered across the text, and infer implicit details. Despite the success of reasoning-enabled LLMs on many benchmarks, we find tha...

📄 Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning
🗓️ Published: 4/13/2026
🔗 http://arxiv.org/abs/2604.11407v1
👥 Authors: Bo Li (possible past Tencent (China) affiliation), Mingda Wang, Gexiang Fang, Shikun Zhang, Wei Ye (possible past Meta (United States) affiliation)
Abstract

We revisit retrieval-augmented generation (RAG) by embedding retrieval control directly into generation. Instead of treating retrieval as an external intervention, we express retrieval decisions within token-level decoding, enabling end-to-end coordination without additional controllers or classifiers. Under the paradigm of Retrieval as Generation, we propose GRIP (Generation-guided Retrieval with Information Planning), a unified framework in which th...

📄 Learning from Contrasts: Synthesizing Reasoning Paths from Diverse Search Trajectories
🗓️ Published: 4/13/2026
🔗 http://arxiv.org/abs/2604.11365v1
👥 Authors: Peiyang Liu, Zhirui Chen, Xi Wang (possible past Tsinghua University affiliation), Di Liang, Youru Li, Zhi Cai, Wei Ye (possible past Meta (United States) affiliation)
Abstract

Monte Carlo Tree Search (MCTS) has been widely used for automated reasoning data exploration, but current supervision extraction methods remain inefficient. Standard approaches retain only the single highest-reward trajectory, discarding the comparative signals present in the many explored paths. Here we introduce Contrastive Reasoning Path Synthesis (CRPS), a framework that transforms supervision extraction from a filtering process into a synthesis procedure. CRPS uses a structured ref...

📄 BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows
🗓️ Published: 4/13/2026
🔗 http://arxiv.org/abs/2604.11304v1
👥 Authors: Elaine Lau, Markus Dücker, Ronak Chaudhary, Hui Wen Goh, Rosemary Wei, Vaibhav Kumar, Saed Qunbar, Guram Gogia, Yi Liu (possible past Google (United States) affiliation), Scott Millslagle, Nasim Borazjanizadeh, Ulyana Tkachenko, Samuel Eshun Danquah, Collin Schweiker, Vijay Karumathil, Asrith Devalaraju, Varsha Sandadi, Haemi Nam, Punit Arani, Ray Epps, Abdullah Arif, Sahil Bhaiwala, Curtis Northcutt, Skyler Wang, Anish Athalye, Jonas Mueller, Francisco Guzmán (possible past Meta (United States) affiliation)
Abstract

Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows. To evaluate frontier AI agents in a high-value, labor-intensive profession, we introduce BankerToolBench (BTB): an open-source benchmark of end-to-end analytical workflows routinely performed by junior investment bankers. To develop an ecologically valid benchmark grounded in representative work environments, we collaborated with 502 investment bankers from leading firms. BTB requires a...

📄 The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
🗓️ Published: 4/13/2026
🔗 http://arxiv.org/abs/2604.11297v1
👥 Authors: Yang Liu (possible past Tsinghua University affiliation), Enxi Wang, Yufei Gao, Weixin Zhang, Bo Wang (possible past Tencent (China) affiliation), Zhiyuan Zeng, Yikai Zhang, Yining Zheng, Xipeng Qiu
Abstract

Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy, but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical behavioral signals into reward design. By storing ...

📄 CocoaBench: Evaluating Unified Digital Agents in the Wild
🗓️ Published: 4/13/2026
🔗 http://arxiv.org/abs/2604.11201v1
👥 Authors: CocoaBench Team, Shibo Hao, Zhining Zhang, Zhiqi Liang, Tianyang Liu, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zilong Wang, Zhoujun Cheng, Haoxiang Zhang, Junli Wang, Hexi Jin, Boyuan Zheng, Kun Zhou, Yu Wang (possible past Tsinghua University affiliation), Feng Yao, Licheng Liu, Yijiang Li, Zhifei Li, Zhengtao Han, Pracha Promthaw, Tommaso Cerruti, Xiaohan Fu, Ziqiao Ma, Jingbo Shang, Lianhui Qin, Julian McAuley, Eric P. Xing, Zhengzhong Liu (possible past Tencent (China) affiliation), Rupesh Kumar Srivastava, Zhiting Hu
Abstract

LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon...

📄 MathAgent: Adversarial Evolution of Constraint Graphs for Mathematical Reasoning Data Synthesis
🗓️ Published: 4/13/2026
🔗 http://arxiv.org/abs/2604.11188v1
👥 Authors: Zixiong Yu, Jun Rao, Guhan Chen, Songtao Tian, Bohan Li (possible past Google (United States) affiliation), Jiansheng Wei, Min Zhang (possible past Tsinghua University affiliation), Xiaojun Meng
Abstract

Synthesizing high-quality mathematical reasoning data without human priors remains a significant challenge. Current approaches typically rely on seed data mutation or simple prompt engineering, often suffering from mode collapse and limited logical complexity. This paper proposes a hierarchical synthesis framework that formulates data synthesis as an unsupervised optimization problem over a constraint graph followed by semantic instantiation, rather than treating it as a direct text generation t...

📄 Introspective Diffusion Language Models
🗓️ Published: 4/13/2026
🔗 http://arxiv.org/abs/2604.11035v1
👥 Authors: Yifan Yu, Yuqing Jian, Junxiong Wang, Zhongzhu Zhou, Donglin Zhuang, Xinyu Fang, Sri Yanamandra, Xiaoxia Wu, Qingyang Wu, Shuaiwen Leon Song (possible past Microsoft (United States) affiliation), Tri Dao, Ben Athiwaratkun, James Zou, Fan Lai, Chenfeng Xu (possible past University Of California, Berkeley affiliation)
Abstract

Diffusion language models promise parallel generation, yet still lag behind autoregressive (AR) models in quality. We trace this gap to a failure of introspective consistency: AR models agree with their own generations, while DLMs often do not. We define the introspective acceptance rate, which measures whether a model accepts its previously generated tokens. This reveals why AR training has a structural advantage: causal masking and logit shifting implicitly enforce introspective consistency. Mo...

📄 Delving Aleatoric Uncertainty in Medical Image Segmentation via Vision Foundation Models
🗓️ Published: 4/13/2026
🔗 http://arxiv.org/abs/2604.10963v1
👥 Authors: Ruiyang Li, Fang Liu (possible past Massachusetts Institute Of Technology affiliation), Licheng Jiao, Xinglin Xie, Jiayao Hao, Shuo Li, Xu Liu (possible past Massachusetts Institute Of Technology affiliation), Jingyi Yang, Lingling Li, Puhua Chen, Wenping Ma
Abstract

Medical image segmentation supports clinical workflows by precisely delineating anatomical structures and lesions. However, medical image datasets suffer from acquisition noise and annotation ambiguity, causing pervasive data uncertainty that substantially undermines model robustness. Existing research focuses primarily on model architectural improvements and predictive reliability estimation, while systematic exploration of the intrinsic data uncertainty remains insuffici...

📄 Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music
🗓️ Published: 4/13/2026
🔗 http://arxiv.org/abs/2604.10905v1
👥 Authors: Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar, Lasha Koroshinadze, Nishit Anand, Zhifeng Kong, Siddharth Gururani, Sang-Gil Lee, Jaehyeon Kim, Aya Aljafari, Chao-Han Huck Yang, Sungwon Kim, Ramani Duraiswami, Dinesh Manocha, Mohammad Shoeybi (possible past Nvidia (United States) affiliation), Bryan Catanzaro (possible past University Of California, Berkeley affiliation), Ming-Yu Liu (possible past Nvidia (United States) affiliation), Wei Ping (possible past Baidu (China) affiliation)
Abstract

We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds and music. Compared to Audio Flamingo 3, AF-Next introduces: (i) a stronger foundational audio-language model that significantly improves accuracy across diverse audio understanding tasks; (ii) scalable strategies for constructing large-scale audio understanding and reasoning data ...

📄 CAGenMol: Condition-Aware Diffusion Language Model for Goal-Directed Molecular Generation
🗓️ Published: 4/13/2026
🔗 http://arxiv.org/abs/2604.11483v1
👥 Authors: Yanting Li, Zhuoyang Jiang, Enyan Dai, Lei Wang (possible past Baidu (China) affiliation), Wen-Cai Ye, Li Liu (possible past National University Of Defense Technology affiliation)
Abstract

Goal-directed molecular generation requires satisfying heterogeneous constraints such as protein-ligand compatibility and multi-objective drug-like properties, yet existing methods often optimize these constraints in isolation, failing to reconcile conflicting objectives (e.g., affinity vs. safety), and struggle to navigate the non-differentiable chemical space without compromising structural validity. To address these challenges, we propose CAGenMol, a condition-aware discrete diffusion framew...

📄 FedRio: Personalized Federated Social Bot Detection via Cooperative Reinforced Contrastive Adversarial Distillation
🗓️ Published: 4/12/2026
🔗 http://arxiv.org/abs/2604.10678v1
👥 Authors: Yingguang Yang, Hao Liu (possible past Tencent (China) affiliation), Xin Zhang (possible past Google (United States) affiliation), Yunhui Liu, Yutong Xia, Qi Wu, Hao Peng (possible past Tsinghua University affiliation), Taoran Liang, Bin Chong, Tieke He, Philip S. Yu (possible past Tsinghua University affiliation)
Abstract

Social bot detection is critical to the stability and security of online social platforms. However, current state-of-the-art bot detection models are largely developed in isolation, overlooking the benefits of leveraging shared detection patterns across platforms to improve performance and promptly identify emerging bot variants. The heterogeneity of data distributions and model architectures further complicates the design of an effective cross-platform and cross-model detection framework. To ad...

📄 ReadMOF: Structure-Free Semantic Embeddings from Systematic MOF Nomenclature for Machine Learning
🗓️ Published: 4/12/2026
🔗 http://arxiv.org/abs/2604.10568v1
👥 Authors: Kewei Zhu, Cameron Wilson, Bartosz Mazur, Yi Li (possible past University Of Washington affiliation), Ashleigh M. Chester, Peyman Z. Moghadam (possible past Google (United States) affiliation)
Abstract

Systematic chemical names, such as IUPAC-style nomenclature for metal-organic frameworks (MOFs), contain rich structural and compositional information in a standardized textual format. Here we introduce ReadMOF, which is, to our knowledge, the first structure-free machine learning framework that leverages these names to model structure-property relationships without requiring atomic coordinates or connectivity graphs. By employing pretrained language models, ReadMOF converts systematic MOF na...

*Notable papers are those with at least two authors from a "big" AI/ML lab.