📄 Notable* Recent AI/ML arXiv Papers

📄 E-TTS: A New Embodied Test-Time Scaling Framework for Robotic Manipulation
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.27268v1
👥 Authors: Wen Ye, Peiyan Li, Tingyu Yuan, Yuan Xu, Xiangnan Wu, Chaoyang Zhao, Jing Liu (possible past Baidu (China) affiliation), Nianfeng Liu, Yan Huang (possible past Tencent (China) affiliation), Liang Wang (possible past Tencent (China) affiliation)
Abstract

Recently, a few works have made early attempts to study test-time scaling for embodied tasks. However, two major challenges remain unsolved: (1) reasoning can effectively improve the performance of the policy, but its scaling mechanism has seldom been studied; (2) historical information is essential, as embodied tasks are inherently long-horizon and sequential, making sole reliance on current observations for action scaling inadequate due to the lack of historical context utilization. To address...

📄 TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.27161v1
👥 Authors: Tinghao Wang, Yichen Guo, Rui Huang (possible past Google (United States) affiliation), Zheng Lu, Qizhe Zhang, Chenxi Li, Yuan Zhang (possible past Google (United States) affiliation), Jiajun Cao, Zhirong Shen, Yaosong Du, Guangyan Gan, Wenya Wang, Lin William Cong, Shanghang Zhang
Abstract

Multimodal large language models (MLLMs) have achieved strong multimodal reasoning capabilities, but their efficiency is limited by the large number of visual tokens, which introduces substantial computational overhead. Visual token pruning offers a natural solution, yet existing methods are imperfect: attention-based criteria tend to retain redundant tokens, while diversity-based criteria are often agnostic to user instructions. Even methods that combine multiple criteria still lack a principle...

📄 OpenRCA 2.0: From Outcome Labels to Causal Process Supervision
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.27154v1
👥 Authors: Aoyang Fang, Yifan Yang (possible past Tencent (China) affiliation), Jin'ao Shang, Qisheng Lu, Junjielung Xu, Rui Wang (possible past Tencent (China) affiliation), Songhan Zhang, Yuzhong Zhang, Boxi Yu, Pinjia He
Abstract

Root cause analysis (RCA) poses a holistic test of LLM agentic capabilities, such as long-context understanding, multi-step reasoning, and tool use. However, existing datasets suffer from a fundamental gap: they label only the root cause, not the propagation path connecting it to the observed symptom, which largely simplifies the task to naive pattern matching. To support rigorous evaluation, we introduce PAVE, a step-wise labeling protocol that leverages known interventions from fault injection...

📄 Scaling Multi-Reference Image Generation with Dynamic Reward Optimization
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.26947v1
👥 Authors: Wenwang Huang, Yusen Fu, Junjie Wang, Mengfei Huang, Yulin Li (possible past Baidu (China) affiliation), Gan Liu, Jing Cai, Yancheng He (possible past Tencent (China) affiliation), Zhuotao Tian
Abstract

While personalized image generation has achieved remarkable progress, multi-reference image generation (MRIG) remains a challenging task. Most existing benchmarks fail to adequately evaluate complex MRIG scenarios, hindering further progress in this area. To better assess model performance on complex MRIG tasks, we introduce OmniRef-Bench, a benchmark that covers complex combinations of reference image types and a large number of reference images. Evaluations on OmniRef-Bench show that mainstrea...

📄 Generative Retrieval via Diffusion Transformer with Metric-Ordered Sequence Training and Hybrid-Policy Preference Optimization
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.26899v1
👥 Authors: Chenghao Liu, Yu Zhang (possible past Google (United States) affiliation), Zhongtao Jiang, Kun Xu (possible past Tsinghua University affiliation), Zhenwei An, Renzhi Wang, Zhao Wang, Jiachen Zhang, Yuxiao Zhang, Kun Xu (possible past Tsinghua University affiliation), Songfang Huang
Abstract

Embedding-based retrieval ranks items by their similarity to a query in a shared vector space and usually aims to return the highest-scoring items. In many production settings this is not what is wanted: given a seed set that expresses a fine-grained pattern, one needs more items that both satisfy a target attribute and stay within that pattern. We formalize this as pattern-preserving attribute retrieval. The two goals pull against each other: averaging the seeds preserves the pattern but stays ...

📄 AgentX: Towards Agent-Driven Self-Iteration of Industrial Recommender Systems
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.26859v1
👥 Authors: Changxin Lao, Fei Pan, Guozhuang Ma, Han Li, Huihuang Lin, Jijun Shi, Kangzhi Zhao, Kun Gai, Mo Zhou, Qinqin Zhou, Quan Chen, Ruochen Yang, Shifu Bie, Shuang Yang, Shuo Yang, Wenhao Li, Wentao Xie, Xiao Lv, Xuming Wang, Yijun Wang, Yiming Chen, Yusheng Huang, Zhongyuan Wang, Zibo Zhao, Zijie Zhuang, Baoning Xia, Chao Liu, Chaoyi Ma, Chubo He, Dawei Cong, Feng Jiang, Gang Wang, Guilin Xia, Hanwen Xu, Jiahong Xie, Jiahui Qiao, Jian Liang, Jiangfan Yue, Jing Wang (possible past Google (United States) affiliation), Jinghan Yang, Jinghui Jia, Kan Qin, Lei Wang (possible past Baidu (China) affiliation), Ming Li, Peilin Song, Pengbo Xu, Qiang Luo, Ruiming Tang (possible past Huawei Technologies (China) affiliation), Shiyang Liu, Shuxian Jin, Tao Wang (possible past Stanford University affiliation), Tao Zhang (possible past Nvidia (United States) affiliation), Xiang Gao, Xianghan Li, Yingsong Luo, Yiwen Ning, Yongcheng Liu, Yuan Guo, Zhaojie Liu, Zhenkai Cui
Abstract

Recommendation algorithm iteration is moving from an artisanal, engineer-bound process toward an industrialized research loop, but this transition remains blocked by a structural execution bottleneck: the idea-to-launch cycle still depends on human engineers to generate hypotheses, modify production code, launch A/B experiments, and attribute online results. Innovation therefore scales linearly with headcount rather than compounding with evidence, compute, and accumulated experimental knowledge....

📄 NaviCache: Test-Time Self-Calibration Caching for Video Generation
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.26795v1
👥 Authors: Zheqi Lv, Zhibo Zhu, Jinke Wang, Qi Tian (possible past Huawei Technologies (China) affiliation), Shengyu Zhang (possible past Tencent (China) affiliation), Zhengyu Chen, Chengxi Zang, Zhou Zhao, Fei Wu (possible past Google (United States) affiliation)
Abstract

Video Diffusion Models (VDMs) are constrained by immense computational costs. While offline calibration-based acceleration suffers from calibration data dependency, prohibitive calibration duration, and susceptibility to distribution shifts, offline calibration-free methods eliminate these hurdles. However, since they rely on instantaneous zero-order approximations where the mapping between input and output differences varies in real time, they are susceptible to observational noise and ignore th...

📄 ResilPhase: Plug-and-Play Phase Mapping and Noise-Resilient Macro-Trajectory Extrapolation for Diffusion Acceleration
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.26769v1
👥 Authors: Qicheng Zhao, Yu Li (possible past Tencent (China) affiliation), Qi Sun (possible past Google (United States) affiliation), Zheyu Yan
Abstract

The adoption of powerful diffusion models is hindered by their significant inference latency. Recent "cache-then-forecast" schemes alleviate this issue by accelerating DiTs using derivative-based polynomials, but they suffer from severe quality degradation at high acceleration ratios. Our analysis reveals the root cause: the discrete extrapolation is performed on representations that are misaligned with the continuous diffusion trajectory and are numerically unstable. Thus, accelerated DiTs suffe...

📄 LithoDreamer: A Physics-Informed World Model for Multi-Stage Computational Lithography
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.26713v1
👥 Authors: Yuqi Jiang, Yumeng Liu, Zimu Li, Jinyuan Deng, Qian Jin, Yucheng Cui, Yu Li (possible past Tencent (China) affiliation), Xunzhao Yin, Qi Sun (possible past Google (United States) affiliation), Cheng Zhuo
Abstract

As semiconductor technology nodes scale, computational lithography is essential for ensuring yield and performance. However, lithography is a continuous physical process involving mask optimization, optical imaging, resist exposure, and development, which existing models fail to capture. To overcome this limitation, we present LithoDreamer, the first physics-informed World Model (WM) framework for computational lithography, which formulates the ``Layout-Mask-Resist Image-After Development Image ...

📄 IDEA: Insensitive to Dynamics Mismatch via Effect Alignment for Sim-to-Real Transfer in Multi-Agent Control
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.26575v1
👥 Authors: Chenlong Liu, Zhuohui Zhang, Xinyan Chen, Zhipeng Wang, Bin Cheng (possible past Tencent (China) affiliation), Bin He (possible past Baidu (China) affiliation)
Abstract

Complex multi-agent control tasks remain challenging for traditional rule-based and model-based approaches, motivating the adoption of learning-based methods. However, learning-based methods often struggle with sim-to-real transfer because they rely on accurate dynamics modeling or system identification and learn policies in low-level control spaces that are highly sensitive to dynamics mismatch, making them costly and fragile in complex environments. To address this issue, we propose a sim-to-r...

📄 VoiceTTA: Enhancing Zero-Shot Text-to-Speech via Reinforcement Learning-Based Test-Time Adaptation
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.26534v1
👥 Authors: Tianxin Xie, Chenxing Li, Dong Yu (possible past Tencent (China) affiliation), Li Liu (possible past National University Of Defense Technology affiliation)
Abstract

Recently, zero-shot text-to-speech (TTS) has enabled high-fidelity and expressive speech synthesis, but it often fails to imitate unseen speaking styles from uncommon scenarios (e.g., crosstalk, dialects). Moreover, fine-tuning pretrained models requires large, high-quality datasets, limiting rapid personalization. We propose VoiceTTA, a reinforcement learning-based test-time adaptation (TTA) method that improves voice imitation of pretrained zero-shot TTS models. VoiceTTA introduces two style r...

📄 Active Adversarial Perturbation-driven Associative Memory Retrieval for RGB-Event Visual Object Tracking
🗓️ Published: 6/24/2026
🔗 http://arxiv.org/abs/2606.26455v1
👥 Authors: Xiao Wang (possible past Google (United States) affiliation), Xufeng Lou, Zikang Yan, Lan Chen, Sibao Chen, Yaowei Wang, Yonghong Tian (possible past Peking University affiliation), Jin Tang
Abstract

RGB-Event tracking improves localization robustness by fusing RGB appearance textures and dense temporal motion cues from event sensors. While this multi-modal scheme broadens tracking applicability, real-world scenes suffer from diverse structured signal degradations that hinder traditional multi-modal fusion. In harsh environments, either modality can lose reliability drastically, and targets frequently appear incomplete due to occlusion, edge truncation and foreground clutter. To tackle the above c...

📄 WatchAct: A Benchmark for Behavior-Grounded Robot Manipulation
🗓️ Published: 6/24/2026
🔗 http://arxiv.org/abs/2606.26443v1
👥 Authors: Baiqi Li, Ce Zhang (possible past Eth Zurich affiliation), Yu Fang (possible past University Of California, Berkeley affiliation), Yue Yang, Shangzhe Li, Mingyu Ding, Gedas Bertasius
Abstract

A robot working alongside people must reason about what they have done, in what order, and with what intent. Video carries the spatial layouts, object histories, and gestures that language leaves underspecified, yet today's manipulation benchmarks pair an instruction with a single current image, offering no way to evaluate reasoning over observed human behavior. We introduce WatchAct, a benchmark for robot manipulation grounded in observed human behavior. Each instance pairs a real-world human-a...

📄 Play2Perfect: What Matters in Dexterous Play Pretraining for Precise Assembly?
🗓️ Published: 6/24/2026
🔗 http://arxiv.org/abs/2606.26428v1
👥 Authors: Tyler Ga Wei Lum, Kushal Kedia, C. Karen Liu (possible past Stanford University affiliation), Jeannette Bohg (possible past Stanford University affiliation)
Abstract

Multi-fingered robots promise the speed and dexterity of human hands, yet challenging problems such as precise assembly have remained out of reach. These tasks are contact-rich, making data collection for imitation learning difficult, and sparse-reward, making direct exploration with reinforcement learning (RL) intractable. Consequently, prior work has made progress by structuring the problem with specialized grippers, tool attachments, and environment fixtures. In this work, we argue that befor...

📄 CoStream: Composing Simple Behaviors for Generalizable Complex Manipulation
🗓️ Published: 6/24/2026
🔗 http://arxiv.org/abs/2606.26423v1
👥 Authors: Haonan Chen, Yuxiang Ma, Stephen Tian (possible past University Of California, Berkeley affiliation), Xiaoshen Han, Wenlong Huang, Feiyang Wu, Yunzhu Li, Jiajun Wu (possible past Massachusetts Institute Of Technology affiliation), Edward H. Adelson (possible past Massachusetts Institute Of Technology affiliation), Yilun Du (possible past Massachusetts Institute Of Technology affiliation)
Abstract

Long-horizon, contact-rich complex manipulation tasks, such as seating a GPU into a PCIe slot, demand both millimeter-level precision and out-of-the-box generalization to new tasks. Existing paradigms struggle to satisfy both: classical pipelines use brittle, task-specific interfaces to achieve high-precision control but require costly pipeline redesigns to adapt to new tasks, whereas monolithic end-to-end policies provide better generalization but lack high precision on complex, out-of-distribut...

📄 SOLAR: AI-Powered Speed-of-Light Performance Analysis
🗓️ Published: 6/24/2026
🔗 http://arxiv.org/abs/2606.26383v1
👥 Authors: Qijing Huang, Sana Damani, Zhifan Ye, Athinagoras Skiadopoulos, Siva Kumar Sastry Hari (possible past Nvidia (United States) affiliation), Jason Clemons (possible past Nvidia (United States) affiliation), Sahil Modi, Jingquan Wang, Aditya Kane, Edward C Lin, Humphrey Shi, Christos Kozyrakis (possible past Stanford University affiliation)
Abstract

How fast could a deep-learning model run on target hardware, and how far is today's implementation from that limit? These questions are central to software, hardware, and algorithm optimizations. Speed-of-Light (SOL) analysis answers them by computing a workload's theoretical minimum execution time on a given architecture. Yet deriving SOL bounds remains manual, error-prone, and disconnected from rapid model development. To close this gap, we introduce SOLAR, a framework that automatically deriv...
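
For readers unfamiliar with the idea of a speed-of-light bound, here is a minimal sketch of one common way to estimate a "theoretical minimum execution time": a roofline-style bound that takes the larger of the compute time and the memory-transfer time. This is not SOLAR's method, which the truncated abstract does not describe. The `Hardware` numbers, the `matmul_cost` helper, and the measured time are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    """Hypothetical accelerator description (illustrative numbers only)."""
    peak_flops: float     # peak arithmetic throughput, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s


def sol_time_seconds(flops: float, bytes_moved: float, hw: Hardware) -> float:
    """Roofline-style lower bound on execution time.

    The workload can finish no faster than the slower of its two
    idealized resources: arithmetic units or memory traffic.
    """
    compute_time = flops / hw.peak_flops
    memory_time = bytes_moved / hw.mem_bandwidth
    return max(compute_time, memory_time)


def matmul_cost(m: int, n: int, k: int, bytes_per_elem: int = 2):
    """FLOPs and minimum bytes moved for C[m,n] = A[m,k] @ B[k,n].

    Assumes each matrix is read or written exactly once (perfect reuse).
    """
    flops = 2.0 * m * n * k
    bytes_moved = float(bytes_per_elem * (m * k + k * n + m * n))
    return flops, bytes_moved


# Hypothetical device: 100 TFLOP/s, 1 TB/s.
hw = Hardware(peak_flops=100e12, mem_bandwidth=1e12)
flops, nbytes = matmul_cost(4096, 4096, 4096)
bound = sol_time_seconds(flops, nbytes, hw)

measured = 2.0e-3  # hypothetical measured runtime in seconds
print(f"SOL bound: {bound * 1e3:.3f} ms, fraction of SOL achieved: {bound / measured:.1%}")
```

Comparing a measured runtime against such a bound gives a rough sense of how much headroom an implementation has; a real analysis would also need to account for caches, overlap, and kernel launch overheads.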

📄 Can Trustless Agents Be Trusted? An Empirical Study of the ERC-8004 Decentralized AI Agent Ecosystem
🗓️ Published: 6/24/2026
🔗 http://arxiv.org/abs/2606.26028v1
👥 Authors: Xihan Xiong, Zelin Li, Wei Wei (possible past Google (United States) affiliation), Qin Wang (possible past Eth Zurich affiliation), William Knottenbelt, Zhipeng Wang
Abstract

As autonomous AI agents increasingly transact across organizational boundaries, a fundamental trust challenge emerges: how can an agent assess whether an unknown counterpart is trustworthy? The ERC-8004 protocol addresses this challenge with the first permissionless trust layer for AI agent economies, built around three on-chain registries for Identity, Reputation, and Validation. Despite its rapid adoption, the protocol has not been studied empirically, leaving it unclear whether the informatio...

📄 Autodata: An agentic data scientist to create high quality synthetic data
🗓️ Published: 6/24/2026
🔗 http://arxiv.org/abs/2606.25996v2
👥 Authors: Ilia Kulikov, Chenxi Whitehouse, Tianhao Wu (possible past University Of Cambridge affiliation), Yixin Nie, Swarnadeep Saha, Eryk Helenowski, Weizhe Yuan, Olga Golovneva, Jack Lanchantin, Yoram Bachrach (possible past Deepmind (United Kingdom) affiliation), Jakob Foerster (possible past University Of Oxford affiliation), Xian Li (possible past Meta (United States) affiliation), Han Fang, Sainbayar Sukhbaatar, Jason Weston (possible past Stanford University affiliation)
Abstract

We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to create even stronger data. We describe the overall formulation, and a specific practical implementation, Agentic Self-Instruct. We conduct experiments on computer science research tasks, legal reasoning tasks and reasoning with mathematical objects, where we obtain impro...

📄 RSPC: A Benchmark for Modeling Stress and Psychiatric Conditions in Digitally Mediated Relationships using Psychiatrist Annotations
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.27247v1
👥 Authors: Parmitha Vangapandu, Sai Ganesh Mokkapati, Sathwik Narkedimilli, Msvpj Sathvik, Timothy Liu, Simon See (possible past Nvidia (United States) affiliation), Johannes C. Eichstaedt (possible past Stanford University affiliation)
Abstract

In NLP, mental health conditions are often modeled as isolated phenomena, without interpersonal context. We use Reddit posts about long-distance relationships to capture both mental health distress and associated relational triggers. We introduce the Relational Stress and Psychiatry Corpus (RSPC) containing 1,799 Reddit posts annotated by psychiatrists for diagnostic categories, including the most prevalent mood disorders (anxiety and depression), relational stressor triggers, and indications of...

📄 DMuon: Efficient Distributed Muon Training with Near-Adam Overhead
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.27153v1
👥 Authors: Vincent Chen, Starrick Liu, Regis Cheng, Dance Yang, Shalfun Li, Ryan Yu, Lucy Liang, Hang Su (possible past Tsinghua University affiliation), Roy Gan, Hao Wang (possible past Tsinghua University affiliation), Qian Wang
Abstract

Matrix-orthogonalization-based optimizers, exemplified by Muon, have demonstrated strong convergence behavior across a wide range of modern deep learning workloads. The matrix-aware updates offer a compelling alternative to conventional element-wise optimization, particularly as model architectures continue to grow in scale and heterogeneity. Yet contemporary distributed training infrastructure built around the assumption of element-wise optimizers is poorly matched to matrix-level optimizers su...

📄 Reasoning Quality Emerges Early: Data Curation for Reasoning Models
🗓️ Published: 6/25/2026
🔗 http://arxiv.org/abs/2606.26797v1
👥 Authors: Hongyi Henry Jin, Wenhan Yang (possible past Peking University affiliation), Meysam Ghaffari, Carlos Morato, Baharan Mirzasoleiman (possible past Eth Zurich affiliation)
Abstract

Supervised fine-tuning (SFT) on a small, high-quality set of long reasoning traces is an effective approach for eliciting strong reasoning capabilities in Large Language Models (LLMs). However, existing methods for curating high-quality SFT data rely heavily on strong reasoning models to filter examples based on diversity and difficulty, making the curation process costly while often yielding suboptimal data quality. In this work, we show that diverse and challenging reasoning examples can be id...

📄 From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models
🗓️ Published: 6/24/2026
🔗 http://arxiv.org/abs/2606.26196v1
👥 Authors: Haoxiang Sun, Tao Wang (possible past Stanford University affiliation), Li Yuan (possible past National University Of Singapore affiliation), Jian Zhao, Jiancheng Lv
Abstract

Multimodal Large Language Models (MLLMs) have recently made remarkable progress in unifying vision-language understanding and reasoning, especially following the introduction of models such as OpenAI's O-series and DeepSeek's R-series, which have driven a paradigm shift toward perception-centric intelligence. However, there remains a lack of systematic surveys that examine perception from a truly unified vision-language perspective -- one that treats vision and language as an inseparable modalit...

*Notable papers are those with at least two authors from a "big" AI/ML lab.