📄 Notable* Recent AI/ML arXiv Papers

📄 MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.18584v1
👥 Authors: Shaden Alshammari, Kevin Wen, Abrar Zainal, Mark Hamilton, Navid Safaei, Sultan Albarakati, William T. Freeman (possible past Massachusetts Institute Of Technology affiliation), Antonio Torralba (possible past Massachusetts Institute Of Technology affiliation)
Abstract

Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and...

📄 ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.18543v1
👥 Authors: Xirui Li, Ming Li, Derry Xu, Wei-Lin Chiang, Ion Stoica (possible past University Of California, Berkeley affiliation), Cho-Jui Hsieh, Tianyi Zhou (possible past University Of Washington affiliation)
Abstract

Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generat...

📄 OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.18530v1
👥 Authors: Xinyu Ma, Mingzhou Xu, Xuebo Liu, Chang Jin, Qiang Wang, Derek F. Wong (possible past Tencent (China) affiliation), Min Zhang (possible past Tsinghua University affiliation)
Abstract

Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have significantly improved Large Language Model (LLM) reasoning, yet models often struggle to explore novel trajectories beyond their initial latent space. While offline teacher guidance and entropy-driven strategies have been proposed to address this, they often lack deep integration or are constrained by the model's inherent capacity. In this paper, we propose OGER, a novel framework that unifies offline teacher guid...

📄 LLM Safety From Within: Detecting Harmful Content with Internal Representations
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.18519v1
👥 Authors: Difan Jiao, Yilun Liu, Ye Yuan (possible past Carnegie Mellon University affiliation), Zhenwei Tang, Linfeng Du, Haolun Wu, Ashton Anderson (possible past Stanford University affiliation)
Abstract

Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers. We present SIREN, a lightweight guard model that harnesses these internal features. By identifying safety neurons via linear probing and combining them through an adaptive layer-weighted strategy, SIREN builds a harmfulness detector from LLM ...
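The adaptive layer-weighted combination described above can be pictured numerically. Everything below (the dimensions, probe weights, and softmax gating) is an illustrative assumption, not SIREN's actual implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def probe_score(hidden, weights, bias):
    # One linear probe on one layer's hidden state -> harmfulness score.
    return sigmoid(sum(h * w for h, w in zip(hidden, weights)) + bias)

def layer_weighted_score(per_layer_hidden, probes, layer_logits):
    # Softmax over per-layer logits gives adaptive layer weights,
    # then per-layer probe scores are mixed accordingly (a stand-in
    # for the paper's adaptive layer-weighted strategy).
    exps = [math.exp(l) for l in layer_logits]
    total = sum(exps)
    scores = [probe_score(h, w, b)
              for h, (w, b) in zip(per_layer_hidden, probes)]
    return sum((e / total) * s for e, s in zip(exps, scores))

# Two layers, 3-dim hidden states; the second probe is more confident
# and receives most of the softmax weight.
hiddens = [[0.2, -0.1, 0.4], [1.0, 0.8, -0.2]]
probes = [([0.5, 0.5, 0.5], 0.0), ([1.0, 1.0, 1.0], -0.5)]
score = layer_weighted_score(hiddens, probes, layer_logits=[0.0, 2.0])
```

The point of the construction is that the detector reads safety signal from many internal layers at once, rather than only the terminal layer.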

📄 Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.18468v1
👥 Authors: Tianshi Cao, Jiawei Ren, Yuxuan Zhang, Jaewoo Seo, Jiahui Huang, Shikhar Solanki, Haotian Zhang (possible past Stanford University affiliation), Mingfei Guo, Haithem Turki, Muxingzi Li, Yue Zhu, Sipeng Zhang, Zan Gojcic, Sanja Fidler (possible past University Of Toronto affiliation), Kangxue Yin
Abstract

Closed-loop simulation is a core component of autonomous vehicle (AV) development, enabling scalable testing, training, and safety validation before real-world deployment. Neural scene reconstruction converts driving logs into interactive 3D environments for simulation, but it does not produce complete 3D object assets required for agent manipulation and large-viewpoint novel-view synthesis. To address this challenge, we present Asset Harvester, an image-to-3D model and end-to-end pipeline that ...

📄 Using large language models for embodied planning introduces systematic safety risks
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.18463v1
👥 Authors: Tao Zhang (possible past Nvidia (United States) affiliation), Kaixian Qu, Zhibin Li, Jiajun Wu (possible past Massachusetts Institute Of Technology affiliation), Marco Hutter, Manling Li, Fan Shi
Abstract

Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Across 23 models, even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among 18 op...

📄 EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.18320v1
👥 Authors: Yongrui Heng, Chaoya Jiang, Han Yang (possible past Eth Zurich affiliation), Shikun Zhang, Wei Ye (possible past Meta (United States) affiliation)
Abstract

Self-evolution of multimodal large language models (MLLMs) remains a critical challenge: pseudo-label-based methods suffer from progressive quality degradation as model predictions drift, while template-based methods are confined to a static set of transformations that cannot adapt in difficulty or diversity. We contend that robust, continuous self-improvement requires not only deterministic external feedback independent of the model's internal certainty, but also a mechanism to perpetually dive...

📄 AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.18240v1
👥 Authors: Wentao Shi, Yu Wang (possible past Tsinghua University affiliation), Yuyang Zhao, Yuxin Chen, Fuli Feng (possible past National University Of Singapore affiliation), Xueyuan Hao, Xi Su, Qi Gu, Hui Su (possible past Tencent (China) affiliation), Xunliang Cai, Xiangnan He (possible past National University Of Singapore affiliation)
Abstract

As reinforcement learning continues to scale the training of large language model-based agents, reliably verifying agent behaviors in complex environments has become increasingly challenging. Existing approaches rely on rule-based verifiers or LLM-as-a-Judge models, which struggle to generalize beyond narrow domains. Agent-as-a-Judge addresses this limitation by actively interacting with environments and tools to acquire verifiable evidence, yet its capabilities remain underexplored. We introd...

📄 Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.18235v1
👥 Authors: Jiayi Wu, Ruobing Xie (possible past Tencent (China) affiliation), Zeqian Huang, Lei Jiang, Can Xu (possible past Google (United States) affiliation), Kangyang Luo, Ming Gao, Xiang Li
Abstract

Deep search agents can autonomously initiate multi-turn interactions with search engines, thereby exhibiting strong question-answering capabilities. Such performance critically relies on Group Relative Policy Optimization (GRPO) as its core training algorithm. However, GRPO still faces several challenges in deep search settings. First, there exists a substantial mismatch between the correctness of intermediate steps and the reward signal, causing numerous correct intermediate steps to be incorre...
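The mismatch the abstract points to is easy to see in the standard group-relative advantage itself, where a single terminal reward is normalized over a group of rollouts. This is a minimal sketch of vanilla GRPO's advantage, not the paper's calibrated variant:

```python
def group_relative_advantages(rewards, eps=1e-8):
    # Standard GRPO advantage: normalize each rollout's terminal
    # reward against the mean/std of its sampled group.
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# A group where only one rollout answers correctly: every failed
# rollout gets the same negative advantage, no matter how many of
# its intermediate search steps were actually correct.
advs = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

Because the signal is purely outcome-level, correct intermediate steps inside failed trajectories are uniformly penalized, which is the negative-advantage problem the paper sets out to calibrate.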

📄 WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.18224v1
👥 Authors: Xinping Lei, Xinyu Che, Junqi Xiong, Chenchen Zhang, Yukai Huang, Chenyu Zhou, Haoyang Huang, Minghao Liu, Letian Zhu, Hongyi Ye, Jinhua Hao, Ken Deng, Zizheng Zhan, Han Li, Dailin Li, Yifan Yao, Ming Sun (possible past Baidu (China) affiliation), Zhaoxiang Zhang (possible past Beijing Academy Of Artificial Intelligence affiliation), Jiaheng Liu
Abstract

Large language models are rapidly evolving into interactive coding agents capable of end-to-end web coding, yet existing benchmarks evaluate only narrow slices of this capability, typically text-conditioned generation with static-correctness metrics, leaving visual fidelity, interaction quality, and codebase-level reasoning largely unmeasured. We introduce WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability. Recognizing that real-world web ...

📄 Modular Representation Compression: Adapting LLMs for Efficient and Effective Recommendations
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.18146v1
👥 Authors: Yunjia Xi, Menghui Zhu, Jianghao Lin, Bo Chen (possible past Tencent (China) affiliation), Ruiming Tang (possible past Huawei Technologies (China) affiliation), Yong Yu (possible past Shanghai Jiao Tong University affiliation), Weinan Zhang (possible past Shanghai Jiao Tong University affiliation)
Abstract

Recently, large language models (LLMs) have advanced recommendation systems (RSs), and recent works have begun to explore how to integrate LLMs into industrial RSs. While most approaches deploy LLMs offline to generate and pre-cache augmented representations for RSs, high-dimensional representations from LLMs introduce substantial storage and computational costs. Thus, it is crucial to compress LLM representations effectively. However, we identify a counterintuitive phenomenon during representat...

📄 Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.18131v1
👥 Authors: Qifan Zhang, Dongyang Ma, Tianqing Fang, Jia Li (possible past Google (United States) affiliation), Jing Tang, Nuo Chen, Haitao Mi, Yan Wang (possible past Tencent (China) affiliation)
Abstract

Most agents today "self-evolve" by following rewards and rules defined by humans. However, this process remains fundamentally dependent on external supervision; without human guidance, the evolution stops. In this work, we train agents to possess an intrinsic meta-evolution capability to spontaneously learn about unseen environments prior to task execution. To instill this ability, we design an outcome-based reward mechanism that measures how much an agent's self-generated world knowledge im...

📄 LoReC: Rethinking Large Language Models for Graph Data Analysis
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.17897v1
👥 Authors: Hongyu Zhan, Qixin Wang, Yusen Tan, Haitao Yu, Jingbo Zhou (possible past Baidu (China) affiliation), Shuai Chen, Jia Li (possible past Google (United States) affiliation), Xiao Tan (possible past Baidu (China) affiliation), Jun Xia
Abstract

The advent of Large Language Models (LLMs) has fundamentally reshaped the way we interact with graphs, giving rise to a new paradigm called GraphLLM. As revealed in recent studies, graph learning can benefit from LLMs. However, we observe limited benefits when we directly utilize LLMs to make predictions for graph-related tasks within the GraphLLM paradigm, which even yields suboptimal results compared to conventional GNN-based approaches. Through in-depth analysis, we find this failure can be attri...

📄 PDDL-Mind: Large Language Models are Capable on Belief Reasoning with Reliable State Tracking
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.17819v1
👥 Authors: Wang Bill Zhu, Qiutong Tony Yi, Robin Jia (possible past Stanford University affiliation), Jesse Thomason (possible past University Of Washington affiliation)
Abstract

Large language models (LLMs) perform substantially below human level on existing theory-of-mind (ToM) benchmarks, even when augmented with chain-of-thought prompting or probabilistic belief updates. We argue that these failures primarily arise from unreliable implicit state tracking rather than limitations in high-level reasoning. We introduce PDDL-Mind, a neuro-symbolic framework that decouples environment state evolution from belief inference. By translating narrative descriptions into explici...
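The decoupling the abstract describes, deterministic world-state evolution separated from per-agent belief updates, can be sketched on a Sally-Anne-style example. The event format and names below are illustrative assumptions, not the paper's PDDL encoding:

```python
def track(events, observers):
    # The world state evolves deterministically with every event;
    # each observer's belief updates only on events they witness.
    world = {}
    beliefs = {o: {} for o in observers}
    for obj, loc, witnesses in events:
        world[obj] = loc
        for o in witnesses:
            beliefs[o][obj] = loc
    return world, beliefs

# Sally sees the ball placed in the basket, then leaves;
# Anne alone moves it to the box.
events = [
    ("ball", "basket", {"sally", "anne"}),
    ("ball", "box", {"anne"}),
]
world, beliefs = track(events, observers=["sally", "anne"])
```

Once state tracking is made explicit like this, the false-belief answer ("where will Sally look?") falls out of a lookup rather than implicit in-context reasoning, which is the reliability argument the paper makes.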

📄 Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.17803v1
👥 Authors: Prasoon Goyal, Sattvik Sahai, Michael Johnston, Hangjie Shi, Yao Lu (possible past Google (United States) affiliation), Shaohua Liu, Anna Rumshisky, Rahul Gupta (possible past Google (United States) affiliation), Anna Gottardi, Desheng Zhang, Lavina Vaz, Leslie Ball, Lucy Hu, Luke Dai, Samyuth Sagi, Maureen Murray, Sankaranarayanan Ananthakrishnan
Abstract

Post-training Large Language Models requires diverse, high-quality data which is rare and costly to obtain, especially in low-resource domains and for multi-turn conversations. Common solutions are crowdsourcing or synthetic generation, but both often yield low-quality or low-diversity data. We introduce Adversarial Arena for building high-quality conversational datasets by framing data generation as an adversarial task: attackers create prompts, and defenders generate responses. This interactiv...

📄 DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.17789v1
👥 Authors: Haokun Lin, Xinle Jia, Haobo Xu, Bingchen Yao, Xianglong Guo, Yichen Wu, Zhichao Lu (possible past Google (United States) affiliation), Ying Wei (possible past Tencent (China) affiliation), Qingfu Zhang, Zhenan Sun
Abstract

The MXFP4 microscaling format, which partitions tensors into blocks of 32 elements sharing an E8M0 scaling factor, has emerged as a promising substrate for efficient LLM inference, backed by native hardware support on NVIDIA Blackwell Tensor Cores. However, activation outliers pose a unique challenge under this format: a single outlier inflates the shared block scale, compressing the effective dynamic range of the remaining elements and causing significant quantization error. Existing rotation-b...
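The outlier failure mode described above can be reproduced with a toy block quantizer. The FP4 grid below is the standard E2M1 magnitude set and the scale is restricted to a power of two in the spirit of E8M0, but the rounding scheme is a simplification, not DuQuant++'s method:

```python
import math

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes

def quantize_block(block):
    # Toy MXFP4: one power-of-two (E8M0-style) scale shared by the
    # whole block, elements snapped to the nearest FP4 magnitude.
    amax = max(abs(x) for x in block)
    scale = 2.0 ** math.floor(math.log2(amax / 6.0)) if amax > 0 else 1.0
    def snap(x):
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        return math.copysign(mag * scale, x)
    return [snap(x) for x in block]

calm = [0.1] * 32                 # uniform block: quantizes accurately
spiky = [0.1] * 31 + [100.0]      # one outlier inflates the shared scale
```

With no outlier, each 0.1 survives quantization almost exactly; with a single 100.0 in the block, the shared scale jumps to 16 and every other element collapses to zero, which is exactly the dynamic-range compression the abstract describes.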

📄 Evolutionary Negative Module Pruning for Better LoRA Merging
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.17753v1
👥 Authors: Anda Cao, Zhuo Gou, Yi Wang, Kaixuan Chen, Yu Wang (possible past Tsinghua University affiliation), Can Wang (possible past Tsinghua University affiliation), Mingli Song, Jie Song (possible past Eth Zurich affiliation)
Abstract

Merging multiple Low-Rank Adaptation (LoRA) experts into a single backbone is a promising approach for efficient multi-task deployment. While existing methods strive to alleviate interference via weight interpolation or subspace alignment, they rest upon the implicit assumption that all LoRA matrices contribute constructively to the merged model. In this paper, we uncover a critical bottleneck in current merging paradigms: the existence of "negative modules" -- specific LoRA layers that...

📄 SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology
🗓️ Published: 4/19/2026
🔗 http://arxiv.org/abs/2604.17503v1
👥 Authors: Zheng Nie, Ruolin Shen, Xinlei Yu, Bo Yin, Jiangning Zhang (possible past Tencent (China) affiliation), Xiaobin Hu (possible past Tencent (China) affiliation)
Abstract

Scaling vision-language models into Visual Multiagent Systems (VMAS) is hindered by two coupled issues. First, communication topologies are fixed before inference, leaving them blind to visual content and query context; second, agent reasoning abilities remain static during deployment. These issues reinforce each other: a rigid topology fails to leverage richer agent expertise, while static agents lack incentives to specialize for a given query. We address this with SkillGraph, a joint framework...

📄 Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.18473v1
👥 Authors: Jacob Morrison, Sanjay Adhikesaven, Akshita Bhagia, Matei Zaharia (possible past University Of California, Berkeley affiliation), Noah A. Smith (possible past University Of Washington affiliation), Sewon Min (possible past University Of Washington affiliation)
Abstract

Extending a fully post-trained language model with new domain capabilities is fundamentally limited by monolithic training paradigms: retraining from scratch is expensive and scales poorly, while continued training often degrades existing capabilities. We present BAR (Branch-Adapt-Route), which trains independent domain experts, each through its own mid-training, supervised finetuning, and reinforcement learning pipeline, and composes them via a Mixture-of-Experts architecture with lightweight r...
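Composing separately trained experts behind a lightweight router, as the abstract describes, reduces to a top-1 dispatch at inference time. The experts, the keyword-based router, and all names below are toy assumptions, not BAR's learned router:

```python
def route(query, experts, router):
    # Score each domain with the router, dispatch to the top-1 expert.
    best = max(router, key=lambda name: router[name](query))
    return experts[best](query)

# Independently "trained" domain experts, stubbed as functions; in BAR
# each would be a full mid-training + SFT + RL pipeline.
experts = {
    "math": lambda q: "math-expert: " + q,
    "code": lambda q: "code-expert: " + q,
}
# A lightweight router; BAR learns this rather than hard-coding it.
router = {
    "math": lambda q: 1.0 if "integral" in q else 0.0,
    "code": lambda q: 1.0 if "python" in q else 0.0,
}
```

The appeal of the design is modularity: adding a new domain means training one new expert and extending the router, with no retraining of the existing experts.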

📄 AutoPPA: Automated Circuit PPA Optimization via Contrastive Code-based Rule Library Learning
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.18445v1
👥 Authors: Chongxiao Li, Pengwei Jin, Di Huang (possible past Google (United States) affiliation), Guangrun Sun, Husheng Han, Jianan Mu, Xinyao Zheng, Jiaguo Zhu, Shuyi Xing, Hanjun Wei, Tianyun Ma, Shuyao Cheng, Rui Zhang, Ying Wang (possible past Tsinghua University affiliation), Zidong Du, Qi Guo, Xing Hu (possible past Baidu (China) affiliation)
Abstract

Performance, power, and area (PPA) optimization is a fundamental task in RTL design, requiring a precise understanding of circuit functionality and the relationship between circuit structures and PPA metrics. Recent studies attempt to automate this process using LLMs, but neither feedback-based nor knowledge-based methods are efficient enough, as they either design without any prior knowledge or rely heavily on human-summarized optimization rules. In this paper, we propose AutoPPA, a fully aut...

📄 How Creative Are Large Language Models in Generating Molecules?
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.18031v1
👥 Authors: Wen Tao, Yiwei Wang (possible past Google (United States) affiliation), Peng Zhou (possible past Tencent (China) affiliation), Bryan Hooi, Wanlong Fang, Tianle Zhang, Xiao Luo, Yuansheng Liu, Alvin Chan
Abstract

Molecule generation requires satisfying multiple chemical and biological constraints while searching a large and structured chemical space. This makes it a non-binary problem, where effective models must identify non-obvious solutions under constraints while maintaining exploration to improve success by escaping local optima. From this perspective, creativity is a functional requirement in molecular generation rather than an aesthetic notion. Large language models (LLMs) can generate molecular r...

📄 Neural Garbage Collection: Learning to Forget while Learning to Reason
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.18002v1
👥 Authors: Michael Y. Li, Jubayer Ibn Hamid, Emily B. Fox (possible past Apple (United States) affiliation), Noah D. Goodman (possible past Stanford University affiliation)
Abstract

Chain-of-thought reasoning has driven striking advances in language model capability, yet every reasoning step grows the KV cache, creating a bottleneck to scaling this paradigm further. Current approaches manage these constraints on the model's behalf using hand-designed criteria. A more scalable approach would let end-to-end learning subsume this design choice entirely, following a broader pattern in deep learning. After all, if a model can learn to reason, why can't it learn to forget? We int...
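The forgetting operation the abstract alludes to can be pictured as score-based KV-cache eviction. The paper's contribution is learning the keep-scores end-to-end; in the sketch below they are simply given, which is precisely the hand-designed setting the paper argues against:

```python
def evict(kv_cache, keep_scores, budget):
    # Keep only the `budget` highest-scoring cache entries,
    # preserving their original positional order.
    ranked = sorted(range(len(kv_cache)),
                    key=lambda i: keep_scores[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [kv_cache[i] for i in keep]

cache = ["tok0", "tok1", "tok2", "tok3", "tok4"]
scores = [0.9, 0.1, 0.8, 0.2, 0.7]
pruned = evict(cache, scores, budget=3)
```

Without some such eviction, the cache grows linearly with every reasoning token, which is the scaling bottleneck the abstract identifies.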

📄 Scaling Human-AI Coding Collaboration Requires a Governable Consensus Layer
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.17883v1
👥 Authors: Tianfu Wang, Zhezheng Hao, Yin Wu, Wei Wu (possible past Tencent (China) affiliation), Qiang Lin, Hande Dong, Nicholas Jing Yuan, Hui Xiong (possible past Baidu (China) affiliation)
Abstract

Vibe coding produces correct, executable code at speed, but leaves no record of the structural commitments, dependencies, or evidence behind it. Reviewers cannot determine what invariants were assumed, what changed, or why a regression occurred. This is not a generation failure but a control failure: the dominant artifact of AI-assisted development (code plus chat history) performs dimension collapse, flattening complex system topology into low-dimensional text and making systems opaque and frag...

📄 M100: An Orchestrated Dataflow Architecture Powering General AI Computing
🗓️ Published: 4/20/2026
🔗 http://arxiv.org/abs/2604.17862v1
👥 Authors: Yan Xie, Changkui Mao, Changsong Wu, Chao Lu, Chao Suo, Cheng Qian, Chun Yang, Danyang Zhu, Hengchang Xiong, Hongzhan Lu, Hongzhen Liu, Jiafu Liu, Jie Chen (possible past Tencent (China) affiliation), Jie Dai, Junfeng Tang, Kai Liu (possible past Baidu (China) affiliation), Kun Li, Lipeng Ge, Meng Sun, Min Luo (possible past Tencent (China) affiliation), Peng Chen (possible past Tencent (China) affiliation), Peng Wang (possible past Peking University affiliation), Shaodong Yang, Shibin Tang, Shibo Chen, Weikang Zhang, Xiao Ling, Xiaobo Du, Xin Wu, Yang Liu (possible past Tsinghua University affiliation), Yi Jiang, Yihua Jin, Yin Huang, Yuli Zhang, Zhen Yuan, Zhiyuan Man, Zhongxiao Yao
Abstract

As deep learning-based AI technologies gain momentum, the demand for general-purpose AI computing architectures continues to grow. While GPGPU-based architectures offer versatility for diverse AI workloads, they often fall short in efficiency and cost-effectiveness. Various Domain-Specific Architectures (DSAs) excel at particular AI tasks but struggle to extend across broader applications or adapt to the rapidly evolving AI landscape. M100 is Li Auto's response: a performant, cost-effective arch...

📄 ARMove: Learning to Predict Human Mobility through Agentic Reasoning
🗓️ Published: 4/19/2026
🔗 http://arxiv.org/abs/2604.17419v1
👥 Authors: Chuyue Wang, Jie Feng (possible past Tsinghua University affiliation), Yuxi Wu, Shenglin Yi, Hang Zhang (possible past Amazon (United States) affiliation)
Abstract

Human mobility prediction is a critical task but remains challenging due to its complexity and variability across populations and regions. Recently, large language models (LLMs) have made progress in zero-shot prediction, but existing methods suffer from limited interpretability (due to black-box reasoning), lack of iterative learning from new data, and poor transferability. In this paper, we introduce ARMove, a fully transferable framework for predicting human mobility through agentic ...

📄 A Survey of Reinforcement Learning for Large Language Models under Data Scarcity: Challenges and Solutions
🗓️ Published: 4/19/2026
🔗 http://arxiv.org/abs/2604.17312v1
👥 Authors: Zhiyin Yu, Yuchen Mou, Juncheng Yan, Junyu Luo, Chunchun Chen, Xing Wei, Yunhui Liu, Hongru Sun, Yuxing Zhang, Jun Xu (possible past Google (United States) affiliation), Yatao Bian, Ming Zhang (possible past Peking University affiliation), Wei Ye (possible past Meta (United States) affiliation), Tieke He, Jie Yang (possible past Shanghai Jiao Tong University affiliation), Guanjie Zheng, Zhonghai Wu, Bo Zhang (possible past Tencent (China) affiliation), Lei Bai, Xiao Luo
Abstract

Reinforcement learning (RL) has emerged as a powerful post-training paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, reinforcement learning for LLMs faces substantial data scarcity challenges, including the limited availability of high-quality external supervision and the constrained volume of model-generated experience. These limitations make data-efficient reinforcement learning a critical research direction. In this survey, we present the first syste...

*Notable papers are those with at least two authors from a "big" AI/ML lab.