📄 Notable* Recent AI/ML arXiv Papers

📄 VLK: Learning Humanoid Loco-Manipulation from Synthetic Interactions in Reconstructed Scenes
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.30645v1
👥 Authors: Yen-Jen Wang, Jiaman Li, Sirui Chen, Takara E. Truong, Pei Xu, Pieter Abbeel (possible past University Of California, Berkeley affiliation), Rocky Duan, Koushil Sreenath, Angjoo Kanazawa (possible past University Of California, Berkeley affiliation), Carmelo Sferrazza, Guanya Shi, Karen Liu
Abstract

Perception-based humanoid loco-manipulation requires connecting egocentric observations and task instructions to whole-body motion. Learning this mapping requires synchronized egocentric images, language commands, and robot-compatible kinematic trajectories, yet no existing data source provides this complete tuple at scale. We address this bottleneck by generating vision-language-kinematics (VLK) supervision synthetically in reconstructed scenes. Our pipeline leverages 3D Gaussian Splatting to r...

📄 LeVo 2: Stable and Melodious Song Generation via Hierarchical Representation Modeling and Progressive Post-Training
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.30642v1
👥 Authors: Shun Lei, Huaicheng Zhang, Dapeng Wu, Yaoxun Xu, Lishi Zuo, Wei Tan, Hangting Chen, Guangzheng Li, Jianwei Yu, Zhiyong Wu (possible past Tsinghua University affiliation), Dong Yu (possible past Tencent (China) affiliation)
Abstract

Full-length song generation must preserve coherence and musicality, render detailed vocal and accompaniment acoustics, and follow lyrics and prompts. Existing language model-based systems face a structural trade-off: mixed-token modeling preserves vocal-instrument coordination but obscures track-specific details, whereas dual-track prediction improves acoustics but requires longer sequences and weakens global planning. We present LeVo 2, a hybrid LLM-Diffusion framework for controllable full-len...

📄 DOPD: Dual On-policy Distillation
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.30626v1
👥 Authors: Xinlei Yu, Gen Li (possible past University Of Edinburgh affiliation), Qingyi Si, Guibin Zhang, Yuqi Xu, Congcong Wang, Shuai Dong, Kaiwen Tuo, Xiangyu Zeng, Kaituo Feng, Qunzhong Wang, Yang Shi, Xiaobin Hu (possible past Tencent (China) affiliation), Xiangyu Yue (possible past University Of California, Berkeley affiliation), Jiaqi Wang, Shuicheng Yan (possible past National University Of Singapore affiliation)
Abstract

On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals. To furnish high-quality supervision sources and thereby elevate the performance frontier of distillation, an intuitive direction is to infuse privileged information to either teacher or student itself. However, this additional input induces a potential failure mode we dub privilege illusion: a pattern that conflates the transferable capability gap that studen...

📄 ReactiveBFM: Reactive Closed-Loop Motion Planning Towards Universal Humanoid Whole-Body Control
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.30362v1
👥 Authors: Xiao Chen, Weishuai Zeng, Xiaojie Niu, Zirui Wang, Jianan Li, Huayi Wang, Furui Xu, Jiahe Chen, Weixiang Zhong, Lihe Ding, Kailin Li, Jiangmiao Pang (possible past Shanghai Artificial Intelligence Laboratory affiliation), Tai Wang, Tianfan Xue (possible past Massachusetts Institute Of Technology affiliation), Jingbo Wang
Abstract

While current Behavior Foundation Models (BFMs) provide robust control priors for humanoids, they only execute pre-defined reference motions. As a result, they are vulnerable to environmental shifts and incapable of reactive whole-body coordination. Naively cascading them with generative motion planners fails to achieve true reactivity, as inevitable tracking discrepancies induce fatal cumulative exposure bias. To bridge this gap, we propose ReactiveBFM, a real-time closed-loop planning-control ...

📄 Clarus: Coordinating Autonomous Research Agents toward Web-Scale Scientific Collaboration
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.30246v1
👥 Authors: Zihan Guo, Zeyi Chen, Zhiyu Chen, Zicai Cui, Shuai Shao, Bo Huang, Zhi Han, Yuanyi Song, Yuan Yuan, Chenxi Zeng, Xiaohang Nie, Zhengxi Yu, Hanwen Zhu, Junwei Liao, Ming Zhou, Yang Li (possible past Google (United States) affiliation), Yuanjian Zhou, Weinan Zhang (possible past Shanghai Jiao Tong University affiliation)
Abstract

Existing autonomous research agents can support parts of the research process, but most systems still treat research as either an isolated assistant task or a closed workflow. Therefore, autonomous science needs a collaboration infrastructure that coordinates projects, agents, and digital and physical resources. We identify this as a shift from code-centered execution loops to research-oriented collaboration processes, where questions, evidence, participants, and resources must be coordinated un...

📄 IBRSteG: Learning a Generalizable Steganography Framework for 3D Gaussian Splatting
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.30024v1
👥 Authors: Fanye Kong, Hongyu Xia, Yu Zheng, Boyang Gong, Jie Zhou (possible past Tsinghua University affiliation), Jiwen Lu (possible past Tsinghua University affiliation)
Abstract

Recent advances in deep learning have notably improved steganographic message hiding. However, designing a generalizable steganographic approach for 3D Gaussian Splatting (3DGS) that can embed meaningful 3D scene content remains challenging. In this paper, we propose IBRSteG, a generalizable framework for 3DGS steganography that enables undetectable concealment of secret scenes within a steganographic scene. Unlike existing approaches whose parameter generation is rigidly coupled with the specif...

📄 SWE-Together: Evaluating Coding Agents in Interactive User Sessions
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.29957v1
👥 Authors: Yifan Wu (possible past Carnegie Mellon University affiliation), Zhuokai Zhao, Songlin Li, Ho Hin Lee, Jiacheng Zhu, Shirley Wu, Tianhe Yu (possible past Stanford University affiliation), Serena Li, Lizhu Zhang, Xiangjun Fan, Shengzhi Li
Abstract

Most coding-agent benchmarks are static: an agent receives a complete task description up front and is judged only by its final code. Real coding assistance is interactive, with users clarifying goals, adding constraints, and correcting mistakes over multiple turns. We introduce SWE-Together, a multi-turn benchmark reconstructed from real user-agent coding sessions. To make real interactions verifiable, we curate 109 repository-level tasks from 11,260 recorded sessions, selecting sessions with r...

📄 HippoSpark: An On-Demand Experience System for LLM Reasoning
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.29929v1
👥 Authors: Jingyao Liu, Danling Meng, Chen Huang, Yukun Yan, Zhenghao Liu (possible past Tsinghua University affiliation), Wenqiang Lei, See-Kiong Ng, Maosong Sun (possible past Tsinghua University affiliation)
Abstract

Distilling historical trajectories into reusable experience to enhance future problem-solving has become a focal point of recent LLM research. However, existing methods predominantly operate at the task level, leveraging general summaries or rules under the assumption that analogous tasks share universal solution patterns. This approach often fails in complex reasoning, which typically falters at local bottlenecks that require precise, state-specific guidance rather than broad heuristics. We int...

📄 Experience Graphs: The Data Foundation for Self-Improving Agents
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.29823v1
👥 Authors: Gang Liao, Yujia He, Abdullah Ozturk, Zhouyang Li, Ying Wang (possible past Tsinghua University affiliation), Zhitong Guo, Hongsen Qin, Yaobin Qin, Tao Yang, Zewei Jiang, Dianshi Li, Jort Gemmeke, Jiangyuan Li, Liyuan Li, Nathan Yan, Masha Basmanova, Uladzimir Pashkevich, Matt Steiner, Pedro Pedreira, Rob Fergus (possible past Meta (United States) affiliation), Anirudh Goyal, Carole-Jean Wu (possible past Meta (United States) affiliation), Gaoxiang Liu, Andrew Witten, Daniel J. Abadi
Abstract

The database community has repeatedly advanced the state of the art by recognizing that new workloads demand new system architectures. We argue that long-horizon agentic tasks -- code generation, scientific discovery, hardware design -- are such a workload. These agents explore: they generate artifacts, execute tools, observe failures, branch, and repair over hundreds of steps. This search produces a structured object we call an experience graph: executable artifacts, tool outputs, rewards, sibl...

📄 DEEPMED Search: An Open-Source Agentic Platform for Medical Deep Research with Introspective Verification
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.29746v1
👥 Authors: Maolin Liu, Fanyu Xu, Ruoqing Xu, Jiahang Zhang, Hao Wang (possible past Tsinghua University affiliation), Rui Wang (possible past Tencent (China) affiliation)
Abstract

Navigating the deluge of heterogeneous medical data, from academic literature (PubMed) to clinical guidelines (Web) and private knowledge bases, remains a critical bottleneck for evidence-based medicine. While commercial black-box tools lack transparency, standard open-source RAG implementations frequently suffer from reasoning drift when handling complex, long-tail queries. We present DEEPMED Search, a fully open-source, agentic platform designed for transparent medical deep research. Built on ...

📄 DeepTrans Studio: Turning Expert Interventions into Shared Team Knowledge in Agentic Translation Workflows
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.29727v1
👥 Authors: Ziyang Lian, Qingya Zhang, Hao Wang (possible past Tsinghua University affiliation), Huiwen Xiong, Qi Yang, Lingyi Meng, Xiaoyi Gu, Rui Wang (possible past Tencent (China) affiliation)
Abstract

Professional translation is often a team-based process: translators, reviewers, and project managers must coordinate terminology, legal force, and accountability across documents. Yet many LLM-based translation tools treat human corrections as isolated edits. Expert decisions made in one segment or by one member are rarely captured as reusable knowledge for the rest of the team. We present DeepTrans Studio, a collaborative translation workspace that lets professionals intercept selected nodes in...

📄 RESOURCE2SKILL: Distilling Executable Agent Skills from Human-Created Multimodal Resources
🗓️ Published: 6/28/2026
🔗 http://arxiv.org/abs/2606.29538v1
👥 Authors: Yijia Fan, Zonglin Di, Zimo Wen, Yifan Yang (possible past Tencent (China) affiliation), Mingxi Cheng, Qi Dai, Bei Liu, Kai Qiu, Yue Dong, Ji Li, Chong Luo (possible past Google (United States) affiliation)
Abstract

Skills are a useful abstraction for software agents, turning human and agent experience into reusable procedural knowledge. Yet existing skill libraries are mostly hand-written, text-centric, or derived from agent traces, leaving tutorial videos and other multimodal human resources largely underused. We present RESOURCE2SKILL, a framework that distills multimodal resources, including tutorial videos, repositories, articles, and reference artifacts, into executable skills for software agents. RES...

📄 OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks
🗓️ Published: 6/28/2026
🔗 http://arxiv.org/abs/2606.29537v1
👥 Authors: Mengqi Yuan, Zilong Zhou, Xinzhuang Xiong, Weiming Wu, Jiayang Sun, Jiamin Song, Kaiqian Cui, Bowen Wang, Haoyuan Wu, Yitong Li, Dunjie Lu, Haikong Lu, Qi Zhen, Xinyuan Wang, Jiaqi Deng, Yuhao Yang, Cheng Chen (possible past Google (United States) affiliation), Boyuan Zheng, Alex Su, Xiao Yu, Hao Zou, Saaket Agashe, Xing Han Lu, Manpreet Kaur, Zhengyang Qi, Vincent Sunn Chen, Frederic Sala, Dayiheng Liu, Junyang Lin, Zhou Yu, Yu Su, Siva Reddy (possible past University Of Edinburgh affiliation), Xin Eric Wang, Peng Qi, Tianbao Xie, Tao Yu (possible past University Of Washington affiliation)
Abstract

Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents. We introduce OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows across everyday and professional tasks, designed to capture complex and challenging real-world phenomena. Each task represents a realistic end-to-end workflow that takes human users a median of about 1.6 hours to complete a...

📄 MotionAtlas: Detailed Region Captioning for Motion-Centric Videos
🗓️ Published: 6/28/2026
🔗 http://arxiv.org/abs/2606.29531v1
👥 Authors: Weisong Liu, Haochen Wang, Kuan Gao, Yuhao Wang, Yikang Zhou, Zhongwei Ren, Jacky Mai, Anna Wang, Yanwei Li, Jason Li (possible past Nvidia (United States) affiliation), Zhaoxiang Zhang (possible past Beijing Academy Of Artificial Intelligence affiliation)
Abstract

We propose MotionAtlas, a system for detailed captioning of motion-centric videos, comprising (1) a dedicated human-annotated benchmark, (2) a scalable, high-quality pipeline to construct training samples, and (3) a family of powerful Video-MLLMs. Unlike conventional global motion captioning datasets, we focus on region-aware motion captioning: given a video and a spatiotemporal mask, the model generates precise descriptions of motion within the target region, thereby alleviating visual clutter ...

📄 Cognitive World Models for Process-Level Social Influence Evaluation
🗓️ Published: 6/28/2026
🔗 http://arxiv.org/abs/2606.29495v1
👥 Authors: Minghui Ma, Bin Guo, Han Wang (possible past Peking University affiliation), Mengqi Chen, Jingqi Liu, Yan Liu (possible past Tencent (China) affiliation), Zhiwen Yu
Abstract

Social influence dialogue changes user behavior by altering internal cognitive states. The central evaluation question is whether the user's beliefs, desires, intentions, and emotions measurably change over the course of conversation, a process-oriented criterion that neither surface-level text metrics (BLEU/ROUGE) nor single-score LLM judgments can capture. We propose the Cognitive World Model (CogWM), an LLM-based user model that reframes multi-turn dialogue...

📄 To Reason or to Fabricate: Reasoning Without Shortcuts via Hint-Anchored Pairwise Aggregation
🗓️ Published: 6/28/2026
🔗 http://arxiv.org/abs/2606.29481v1
👥 Authors: Jiuheng Lin, Chen Zhang (possible past Peking University affiliation), Yansong Feng (possible past Peking University affiliation)
Abstract

While reinforcement learning (RL) significantly enhances LLM reasoning, its efficacy is severely undermined by Pre-RL data overlap, where RL datasets overlap with pretraining or SFT corpora, causing models to exploit shortcuts by memorizing correct answers and fabricating post-hoc reasoning. To address this, we introduce HIPPO, a novel RL framework that integrates hint-injected aggregation with a tailored pairwise reward model. By utilizing hint injection to deliberately trigger overlap-induced ...

📄 Experience Augmented Policy Optimization for LLM Reasoning
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.30420v1
👥 Authors: Jinda Lu, Kexin Huang (possible past Stanford University affiliation), Junkang Wu, Shuo Yang, Jinghan Li, Chiyu Ma, Shaohang Wei, Xiang Wang (possible past Tencent (China) affiliation), Guoyin Wang, Jingren Zhou
Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for improving the reasoning capabilities of large language models (LLMs). However, existing RLVR methods typically rely on on-policy optimization from scratch, resulting in high sampling costs and inefficient utilization of accumulated experience. As model capabilities and policy behaviors evolve during training, recent attempts to reuse experience via fixed reasoning trajectories further suffer from policy mismatch. Mo...

📄 Diffusion Fine-tuning with Rewarded Moment Matching Distillation
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.30414v1
👥 Authors: Alexis Jacq, Guillaume Couairon, Valentin De Bortoli, Quentin Berthet (possible past University Of Cambridge affiliation), Arnaud Doucet (possible past University Of Oxford affiliation), Romuald Elie
Abstract

Distillation and Reinforcement Learning (RL) fine-tuning are the primary pillars of diffusion post-training. While traditionally studied in isolation, the interaction between these phases remains poorly understood, and in particular how fine-tuning impacts the generative quality of distilled models. We introduce Rewarded Moment Matching Distillation (RMMD), a novel framework that simultaneously distills diffusion models and maximizes a reward function. RMMD preserves the high-fidelity "naturaln...

📄 MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.30406v1
👥 Authors: Wenhan Ma, Jianyu Wei, Liang Zhao (possible past Baidu (China) affiliation), Hailin Zhang, Bangjun Xiao, Lei Li (possible past Carnegie Mellon University affiliation), Qibin Yang, Bofei Gao, Yudong Wang, Rang Li, Jinhao Dong, Zhifang Sui (possible past Peking University affiliation), Fuli Luo (possible past Peking University affiliation)
Abstract

Modern large language models (LLMs) rely on reinforcement learning during post-training to push specific capabilities, yet integrating multiple capabilities into one model remains hard. Existing methods, such as Off-Policy Finetune and Mix-RL, are either inefficient or lose performance. In this work, we propose Multi-teacher On-Policy Distillation (MOPD), a post-training paradigm for combining the capabilities of multiple domain RL teachers: we first run per-domain specialised RL to obtain a set...

📄 REAR: Test-time Preference Realignment through Reward Decomposition
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.30339v1
👥 Authors: Fuxiang Zhang, Pengcheng Wang, Chenran Li, Yi-Chen Li, Yuxin Chen, Lang Feng, Chenfeng Xu (possible past University Of California, Berkeley affiliation), Masayoshi Tomizuka (possible past University Of California, Berkeley affiliation), Bo An
Abstract

Aligning large language models (LLMs) with diverse user preferences is a critical yet challenging task. While post-training methods can adapt models to specific needs, they often require costly data curation and additional training. Test-time scaling (TTS) presents an efficient, training-free alternative, but its application has been largely limited to verifiable domains like mathematics and coding, where response correctness is easily judged. To extend TTS to preference alignment, we introduce ...

📄 A multi-architecture study of specificity refinement and false-positive mechanism analysis in prostate MRI
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.29977v1
👥 Authors: Yongbo Shu, Kewen Chen, Yifeng Yuan, Zirui Xin, Luo Lei, Yang Yang (possible past Tencent (China) affiliation), Xi Chen (possible past University Of California, Berkeley affiliation), Aijing Luo
Abstract

Objectives: To characterize residual false positives in prostate MRI detection, and to evaluate a lightweight post-hoc refinement head for case-level specificity. Materials and Methods: This retrospective study used PI-CAI (5-fold cross-validation) and Prostate158 (n=158; external). A context-aware evidence head and an 89,216-parameter refinement head were trained on a frozen detection backbone; the evidence head was also trained on four further backbones (bare nnU-Net, bare U-Net, bare Mamba, M...

📄 Exploring the Cryptographic Limits of Transformer Networks
🗓️ Published: 6/28/2026
🔗 http://arxiv.org/abs/2606.29389v1
👥 Authors: Stefan Domunco, Andis Draguns, Philip Torr (possible past University Of Oxford affiliation), Isaac Robinson, Christian Schroeder De Witt (possible past University Of Oxford affiliation)
Abstract

In recent work it has been shown that colluding AI agents can use steganographic methods to exchange malicious information. Whether a transformer can implement steganographic methods depends on what cryptographic functions it can implement, since a transformer that can implement a cryptographic function within its layers has source-free randomness access. Despite existing circuit-complexity results, no prior work maps specific cryptographic constructions to transformer architectures. As Merrill ...

📄 Dynamic Parsing and Updating Natural Language Specification using VLMs for Robust Vision-Language Tracking
🗓️ Published: 6/28/2026
🔗 http://arxiv.org/abs/2606.29357v1
👥 Authors: Xiao Wang (possible past Google (United States) affiliation), Liye Jin, Dan Xu (possible past University Of Oxford affiliation), Yuehang Li, Lan Chen, Yaowei Wang, Yonghong Tian (possible past Peking University affiliation), Jin Tang
Abstract

Vision-language tracking guided by natural language specifications leverages high-level semantic cues of target objects to substantially boost tracking accuracy and robustness. Existing studies have verified that adaptively optimizing textual descriptions throughout the tracking process can effectively mitigate the semantic-visual mismatch induced by dynamic variations in target appearance, position, and other inherent attributes. Nevertheless, mainstream methods that directly generate textual i...

*Notable papers are those with at least two authors from a "big" AI/ML lab.