📄 Notable* Recent AI/ML arXiv Papers


📄 Agentic Hardware Design as Repository-Level Code Evolution
🗓️ Published: 6/26/2026
🔗 http://arxiv.org/abs/2606.28279v1
👥 Authors: Cunxi Yu, Chenhui Deng, Nathaniel Pinckney (possible past Nvidia (United States) affiliation), Brucek Khailany (possible past Nvidia (United States) affiliation)
Abstract

We present HORIZON, a self-evolving agent framework that treats hardware design as repository-level code evolution. A Markdown harness is compiled into a project pack containing domain knowledge, an executable evaluator, an acceptance predicate, and a git/runtime policy; a hands-free agent loop then evolves an isolated git worktree, using repository operations for state management, tracing, and replay. This extends prior work on repository-scale self-evolution from EDA software systems to hard...

📄 Towards Automating Scientific Review with Google's Paper Assistant Tool
🗓️ Published: 6/26/2026
🔗 http://arxiv.org/abs/2606.28277v1
👥 Authors: Rajesh Jayaram, Drew Tyler, David Woodruff, Corinna Cortes (possible past Google (United States) affiliation), Yossi Matias (possible past Google (United States) affiliation), Vahab Mirrokni (possible past Google (United States) affiliation), Vincent Cohen-Addad
Abstract

Artificial intelligence is driving a revolution in scientific discovery, accelerating everything from hypothesis generation to mathematical theorem proving. However, this rapid acceleration is creating a systemic challenge: traditional human peer review cannot scale to match the influx of AI-assisted science. Ultimately, to resolve this tension, we must also deploy AI to accelerate the verification and review process itself. To frame the discussion around this transition, we propose a taxonomy c...

📄 Cognitive Episodes in LLM Reasoning Traces Enable Interpretable Human Item Difficulty Prediction
🗓️ Published: 6/26/2026
🔗 http://arxiv.org/abs/2606.28186v1
👥 Authors: Chenguang Wang (possible past Amazon (United States) affiliation), Ming Li, Xinyue Zeng, Zhuochun Li, Hong Jiao, Tianyi Zhou (possible past University Of Washington affiliation), Dawei Zhou
Abstract

Predicting human item difficulty is central to educational assessment, where reliable estimates support fairness and effective test construction. Existing methods often depend on costly human calibration or item-level textual representations, providing limited evidence about the cognitive processes that make items difficult. We argue that difficulty should be viewed not only as a property of item text, but also as an observable consequence of the problem-solving burden an item induces. Large Rea...

📄 Tandem Reinforcement Learning with Verifiable Rewards
🗓️ Published: 6/26/2026
🔗 http://arxiv.org/abs/2606.28166v1
👥 Authors: Difan Jiao, Raghav Singhal, Robert West (possible past Stanford University affiliation), Ashton Anderson (possible past Stanford University affiliation)
Abstract

Reinforcement learning with verifiable rewards (RLVR) has significantly improved the reasoning capability of large language models, reaching expert or even superhuman performance in domains such as competition math. However, whether weaker agents and humans can actually harness this capability is far less certain, with RLVR documented to drift reasoning toward idiosyncratic patterns such as poor readability and language mixing. Tandem training is a recently introduced paradigm that targets this ...

📄 JD Oxygen AI Item Center (Oxygen AIIC) V1: An Industrial-Scale LLM/VLM-Centric Solution for Item Understanding, Management, and Applications
🗓️ Published: 6/26/2026
🔗 http://arxiv.org/abs/2606.28070v1
👥 Authors: Oxygen Aiic, Chan Long, Chao Liu, Chaofan Chen, Chaohui Dong, Chunyuan Guo, Danping Liu, Debin Liu, Deping Xiang, Fulai Xu, Guangyue Liu, Hao Li (possible past Tsinghua University affiliation), Huichun Hu, Jian Yang, Jianan Wang (possible past Deepmind (United Kingdom) affiliation), Jianbo Zhao, Jiaoyang Li, Jiaxing Wang, Jinglong Li, Jinjin Guo, Jun Fang, Jun Liu (possible past Tencent (China) affiliation), Kai Zhou, Li Wang (possible past Tesla (United States) affiliation), Lili Gao, Liying Chen, Luning Yang, Mengdi Zhou, Pengzhang Liu, Qi Lv, Qianyun Wang, Qixia Jiang, Ruyue Li, Shimu Liang, Shuxing Wang, Sijie Zhang, Siqi Li, Tianhao Gao, Wang Ke, Weihu Huang, Wencan Lai, Wenjie Zhang, Xiaohui Zhang (possible past Meta (United States) affiliation), Xiaojing Dong, Ya Liu, Yifeng Zhang, Yixiang Wang, Yongtai Zhang, Yongyi Liao, Zhaoru Chen, Zhen Chen, Zhiyong Ma, Zhiyuan Liu (possible past Tsinghua University affiliation), Zhongwei Liu, Ziyan Xing
Abstract

JD.com, one of the world's largest e-commerce platforms, serves over 700 million active users and millions of merchants, with a catalog of tens of billions of SKUs. At this scale, high-quality, structured item knowledge underpins a better consumer experience, lower management costs, and higher operational efficiency, yet producing and serving it poses three industrial-scale challenges: fast-emerging concepts, high-quality knowledge production for massive SKUs, and diverse downstream requirements....

📄 ATOD: Annealed Turn-aware On-policy Distillation for Multi-turn Autonomous Agents
🗓️ Published: 6/26/2026
🔗 http://arxiv.org/abs/2606.27814v1
👥 Authors: Qitai Tan, Zefang Zong, Yang Li (possible past Google (United States) affiliation), Peng Chen (possible past Tencent (China) affiliation)
Abstract

Training small language-model agents for long-horizon interactive tasks requires both fast imitation and reward-driven improvement. On-policy distillation (OPD) provides dense teacher guidance and typically improves rapidly in the early stage, but its gains saturate once the student approaches the teacher, limiting the final performance ceiling. Reinforcement learning (RL) directly optimizes environment rewards and encourages exploratory improvement toward a higher reward-defined ceiling, but sp...

📄 SHIFT: Gate-Modulated Activation Steering for Knowledge Conflict Mitigation in Retrieval-Augmented Generation
🗓️ Published: 6/26/2026
🔗 http://arxiv.org/abs/2606.27786v1
👥 Authors: Ruochang Li, Pengcheng Huang, Zhenghao Liu (possible past Tsinghua University affiliation), Yukun Yan, Huiyuan Xie, Yu Gu, Ge Yu, Maosong Sun (possible past Tsinghua University affiliation)
Abstract

Retrieval-augmented generation (RAG) enhances LLMs by incorporating external knowledge to support response generation. However, conflicts between retrieved context and parametric knowledge have emerged as a critical challenge in RAG systems. To mitigate such conflicts, numerous studies have attempted to identify and edit knowledge-related internal neurons, aiming to improve the ability of LLMs to rely on contextual evidence during generation. However, these neuron-level approaches may introduce ...

📄 ToE: A Hierarchical and Explainable Claim Verification Framework with Dynamic Multi-source Evidence Retrieval and Aggregation
🗓️ Published: 6/26/2026
🔗 http://arxiv.org/abs/2606.27736v1
👥 Authors: Zhaoqi Wang, Zijian Zhang, Kun Zheng, Zhen Li (possible past Google (United States) affiliation), Xin Li (possible past Google (United States) affiliation), Chunlei Li, Jiamou Liu
Abstract

The rapid spread of fake news poses increasing threats to information ecosystems, especially as AI-generated misinformation under Generative Engine Optimization (GEO) poisoning allows adversarially crafted content to be systematically surfaced by retrieval systems, contaminating LLM reasoning. In this paper, we propose Tree of Evidence (ToE), a hierarchical evidence reasoning framework for automated fact-checking that models each claim as a dynamically expanding argument tree. ToE integrates a r...

📄 Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.27483v1
👥 Authors: Xuan Zhang (possible past Meta (United States) affiliation), Zhijian Zhou, Lingfeng Qiao, Yulei Qin, Ke Li (possible past University Of California, Berkeley affiliation), Xing Sun (possible past Tencent (China) affiliation), Xiaoyu Tan, Chao Qu, Yuan Qi
Abstract

Large language model (LLM) agents have demonstrated strong capability in sequential decision-making, yet they remain fundamentally reactive in long-horizon tasks. Unlike humans, who employ "what-if" reasoning to evaluate potential plans before commitment, standard agents lack an internal world model to simulate future outcomes. Therefore, we propose to internalize future-aware planning by training a single autoregressive model to verbalize both a prospective state rollout and a plan-conditioned ...

📄 E-TTS: A New Embodied Test-Time Scaling Framework for Robotic Manipulation
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.27268v1
👥 Authors: Wen Ye, Peiyan Li, Tingyu Yuan, Yuan Xu, Xiangnan Wu, Chaoyang Zhao, Jing Liu (possible past Baidu (China) affiliation), Nianfeng Liu, Yan Huang (possible past Tencent (China) affiliation), Liang Wang (possible past Tencent (China) affiliation)
Abstract

Recently, a few works have made early attempts to study test-time scaling for embodied tasks. However, two major challenges remain unsolved: (1) reasoning can effectively improve the performance of the policy, but its scaling mechanism has seldom been studied; (2) historical information is essential, as embodied tasks are inherently long-horizon and sequential, making sole reliance on current observations for action scaling inadequate due to the lack of historical context utilization. To address...

📄 TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.27161v1
👥 Authors: Tinghao Wang, Yichen Guo, Rui Huang (possible past Google (United States) affiliation), Zheng Lu, Qizhe Zhang, Chenxi Li, Yuan Zhang (possible past Google (United States) affiliation), Jiajun Cao, Zhirong Shen, Yaosong Du, Guangyan Gan, Wenya Wang, Lin William Cong, Shanghang Zhang
Abstract

Multimodal large language models (MLLMs) have achieved strong multimodal reasoning capabilities, but their efficiency is limited by the large number of visual tokens, which introduces substantial computational overhead. Visual token pruning offers a natural solution, yet existing methods are imperfect: attention-based criteria tend to retain redundant tokens, while diversity-based criteria are often agnostic to user instructions. Even methods that combine multiple criteria still lack a principle...

📄 OpenRCA 2.0: From Outcome Labels to Causal Process Supervision
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.27154v1
👥 Authors: Aoyang Fang, Yifan Yang (possible past Tencent (China) affiliation), Jin'ao Shang, Qisheng Lu, Junjielung Xu, Rui Wang (possible past Tencent (China) affiliation), Songhan Zhang, Yuzhong Zhang, Boxi Yu, Pinjia He
Abstract

Root cause analysis (RCA) poses a holistic test of LLM agentic capabilities, such as long-context understanding, multi-step reasoning, and tool use. However, existing datasets suffer from a fundamental gap: they label only the root cause, not the propagation path connecting it to the observed symptom, which largely simplifies the task to naive pattern matching. To support rigorous evaluation, we introduce PAVE, a step-wise labeling protocol that leverages known interventions from fault injection...

📄 Scaling Multi-Reference Image Generation with Dynamic Reward Optimization
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.26947v1
👥 Authors: Wenwang Huang, Yusen Fu, Junjie Wang, Mengfei Huang, Yulin Li (possible past Baidu (China) affiliation), Gan Liu, Jing Cai, Yancheng He (possible past Tencent (China) affiliation), Zhuotao Tian
Abstract

While personalized image generation has achieved remarkable progress, multi-reference image generation (MRIG) remains a challenging task. Most existing benchmarks fail to adequately evaluate complex MRIG scenarios, hindering further progress in this area. To better assess model performance on complex MRIG tasks, we introduce OmniRef-Bench, a benchmark that covers complex combinations of reference image types and a large number of reference images. Evaluations on OmniRef-Bench show that mainstrea...

📄 Generative Retrieval via Diffusion Transformer with Metric-Ordered Sequence Training and Hybrid-Policy Preference Optimization
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.26899v1
👥 Authors: Chenghao Liu, Yu Zhang (possible past Google (United States) affiliation), Zhongtao Jiang, Kun Xu (possible past Tsinghua University affiliation), Zhenwei An, Renzhi Wang, Zhao Wang, Jiachen Zhang, Yuxiao Zhang, Kun Xu (possible past Tsinghua University affiliation), Songfang Huang
Abstract

Embedding-based retrieval ranks items by their similarity to a query in a shared vector space and usually aims to return the highest-scoring items. In many production settings this is not what is wanted: given a seed set that expresses a fine-grained pattern, one needs more items that both satisfy a target attribute and stay within that pattern. We formalize this as pattern-preserving attribute retrieval. The two goals pull against each other: averaging the seeds preserves the pattern but stays ...
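
For readers unfamiliar with the baseline this abstract starts from, the following is a minimal sketch of embedding-based retrieval with a seed-averaged query. It is not the paper's method: the item names, the 2-D vectors, and the helpers `cosine`, `top_k`, and `mean_vector` are all hypothetical and chosen only to show the ranking step.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, items, k):
    """Return the names of the k items most similar to the query."""
    ranked = sorted(items, key=lambda name: cosine(query, items[name]), reverse=True)
    return ranked[:k]

def mean_vector(vectors):
    """Component-wise average of a list of vectors (the 'seed average')."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Toy catalog: hypothetical 2-D embeddings, for illustration only.
items = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.7, 0.7],
}

# Averaging the seeds gives a query that sits inside the seed pattern.
seeds = [items["a"], items["b"]]
query = mean_vector(seeds)
result = top_k(query, items, 2)
print(result)
```

In this toy setup the averaged query ranks the seeds themselves highest, which is one way to see why a seed average keeps results close to the existing pattern.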

📄 AgentX: Towards Agent-Driven Self-Iteration of Industrial Recommender Systems
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.26859v2
👥 Authors: Changxin Lao, Fei Pan, Guozhuang Ma, Han Li, Huihuang Lin, Jijun Shi, Kangzhi Zhao, Kun Gai, Mo Zhou, Qinqin Zhou, Quan Chen, Ruochen Yang, Shifu Bie, Shijie Yi, Shuang Yang, Shuo Yang, Wenhao Li, Wentao Xie, Xiao Lv, Xuming Wang, Yijun Wang, Yiming Chen, Yusheng Huang, Zhongyuan Wang, Zibo Zhao, Zijie Zhuang, Baoning Xia, Chao Liu, Chaoyi Ma, Chubo He, Dawei Cong, Feng Jiang, Gang Wang, Guilin Xia, Hanwen Xu, Jiahong Xie, Jiahui Qiao, Jian Liang, Jiangfan Yue, Jing Wang (possible past Google (United States) affiliation), Jinghan Yang, Jinghui Jia, Kan Qin, Lei Wang (possible past Baidu (China) affiliation), Ming Li, Peilin Song, Pengbo Xu, Qiang Luo, Ruiming Tang (possible past Huawei Technologies (China) affiliation), Shiyang Liu, Shuxian Jin, Tao Wang (possible past Stanford University affiliation), Tao Zhang (possible past Nvidia (United States) affiliation), Xiang Gao, Xianghan Li, Yingsong Luo, Yiwen Ning, Yongcheng Liu, Yueyang Liu, Yuan Guo, Zhaojie Liu, Zhenkai Cui
Abstract

Recommendation algorithm iteration is moving from an artisanal, engineer-bound process toward an industrialized research loop, but this transition remains blocked by a structural execution bottleneck: the idea-to-launch cycle still depends on human engineers to generate hypotheses, modify production code, launch A/B experiments, and attribute online results. Innovation therefore scales linearly with headcount rather than compounding with evidence, compute, and accumulated experimental knowledge....

📄 NaviCache: Test-Time Self-Calibration Caching for Video Generation
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.26795v1
👥 Authors: Zheqi Lv, Zhibo Zhu, Jinke Wang, Qi Tian (possible past Huawei Technologies (China) affiliation), Shengyu Zhang (possible past Tencent (China) affiliation), Zhengyu Chen, Chengxi Zang, Zhou Zhao, Fei Wu (possible past Google (United States) affiliation)
Abstract

Video Diffusion Models (VDMs) are constrained by immense computational costs. While offline calibration-based acceleration suffers from calibration data dependency, prohibitive calibration duration, and susceptibility to distribution shifts, offline calibration-free methods eliminate these hurdles. However, since they rely on instantaneous zero-order approximations where the mapping between input and output differences varies in real time, they are susceptible to observational noise and ignore th...

📄 ResilPhase: Plug-and-Play Phase Mapping and Noise-Resilient Macro-Trajectory Extrapolation for Diffusion Acceleration
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.26769v1
👥 Authors: Qicheng Zhao, Yu Li (possible past Tencent (China) affiliation), Qi Sun (possible past Google (United States) affiliation), Zheyu Yan
Abstract

The adoption of powerful diffusion models is hindered by their significant inference latency. Recent "cache-then-forecast" schemes alleviate this issue by accelerating DiTs using derivative-based polynomials, but they suffer from severe quality degradation at high acceleration ratios. Our analysis reveals its root cause: the discrete extrapolation performed on representations that are misaligned with the continuous diffusion trajectory and are numerically unstable. Thus, accelerated DiTs suffe...

📄 Qwen-Image-2.0-RL Technical Report
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.27608v1
👥 Authors: Yixian Xu, Kaiyuan Gao, Yuxiang Chen, Yilei Chen, Zecheng Tang, Zihao Liu, Zikai Zhou, Deqing Li (possible past Baidu (China) affiliation), Hao Meng, Kuan Cao, Jiahao Li, Jie Zhang, Liang Peng, Lihan Jiang, Ningyuan Tang, Shengming Yin, Tianhe Wu, Xiaoyue Chen, Yan Shu, Yanran Zhang, Yi Wang, Yu Wu (possible past Baidu (China) affiliation), Yujia Wu, Zekai Zhang, Zhendong Wang, Xiao Xu, Kun Yan, Chenfei Wu
Abstract

We present Qwen-Image-2.0-RL, a post-training pipeline that applies reinforcement learning from human feedback (RLHF) and on-policy distillation (OPD) to improve both the visual quality and instruction-following capability of the Qwen-Image-2.0 diffusion model. To provide reliable reward signals, we construct task-specific composite reward models by fine-tuning vision-language models with a pointwise scoring paradigm and chain-of-thought reasoning. For text-to-image generation, the reward models...

📄 RSPC: A Benchmark for Modeling Stress and Psychiatric Conditions in Digitally Mediated Relationships using Psychiatrist Annotations
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.27247v1
👥 Authors: Parmitha Vangapandu, Sai Ganesh Mokkapati, Sathwik Narkedimilli, Msvpj Sathvik, Timothy Liu, Simon See (possible past Nvidia (United States) affiliation), Johannes C. Eichstaedt (possible past Stanford University affiliation)
Abstract

In NLP, mental health conditions are often modeled as isolated phenomena, without interpersonal context. We use Reddit posts about long-distance relationships to capture both mental health distress and associated relational triggers. We introduce the Relational Stress and Psychiatry Corpus (RSPC) containing 1,799 Reddit posts annotated by psychiatrists for diagnostic categories, including the most prevalent mood disorders (anxiety and depression), relational stressor triggers, and indications of...

📄 DMuon: Efficient Distributed Muon Training with Near-Adam Overhead
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.27153v1
👥 Authors: Vincent Chen, Starrick Liu, Regis Cheng, Dance Yang, Shalfun Li, Ryan Yu, Lucy Liang, Hang Su (possible past Tsinghua University affiliation), Roy Gan, Hao Wang (possible past Tsinghua University affiliation), Qian Wang
Abstract

Matrix-orthogonalization-based optimizers, exemplified by Muon, have demonstrated strong convergence behavior across a wide range of modern deep learning workloads. The matrix-aware updates offer a compelling alternative to conventional element-wise optimization, particularly as model architectures continue to grow in scale and heterogeneity. Yet contemporary distributed training infrastructure built around the assumption of element-wise optimizers is poorly matched to matrix-level optimizers su...

📄 Reasoning Quality Emerges Early: Data Curation for Reasoning Models
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.26797v1
👥 Authors: Hongyi Henry Jin, Wenhan Yang (possible past Peking University affiliation), Meysam Ghaffari, Carlos Morato, Baharan Mirzasoleiman (possible past Eth Zurich affiliation)
Abstract

Supervised fine-tuning (SFT) on a small, high-quality set of long reasoning traces is an effective approach for eliciting strong reasoning capabilities in Large Language Models (LLMs). However, existing methods for curating high-quality SFT data rely heavily on strong reasoning models to filter examples based on diversity and difficulty, making the curation process costly while often yielding suboptimal data quality. In this work, we show that diverse and challenging reasoning examples can be id...

*Notable papers are those with at least two authors from a "big" AI/ML lab.