📄 Notable* Recent AI/ML arXiv Papers

📄 Clarus: Coordinating Autonomous Research Agents toward Web-Scale Scientific Collaboration
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.30246v1
👥 Authors: Zihan Guo, Zeyi Chen, Zhiyu Chen, Zicai Cui, Shuai Shao, Bo Huang, Zhi Han, Yuanyi Song, Yuan Yuan, Chenxi Zeng, Xiaohang Nie, Zhengxi Yu, Hanwen Zhu, Junwei Liao, Ming Zhou, Yang Li (possible past Google (United States) affiliation), Yuanjian Zhou, Weinan Zhang (possible past Shanghai Jiao Tong University affiliation)
Abstract

Existing autonomous research agents can support parts of the research process, but most systems still treat research as either an isolated assistant task or a closed workflow. Therefore, autonomous science needs a collaboration infrastructure that coordinates projects, agents, and digital and physical resources. We identify this as a shift from code-centered execution loops to research-oriented collaboration processes, where questions, evidence, participants, and resources must be coordinated un...

📄 IBRSteG: Learning a Generalizable Steganography Framework for 3D Gaussian Splatting
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.30024v1
👥 Authors: Fanye Kong, Hongyu Xia, Yu Zheng, Boyang Gong, Jie Zhou (possible past Tsinghua University affiliation), Jiwen Lu (possible past Tsinghua University affiliation)
Abstract

Recent advances in deep learning have notably improved steganographic message hiding. However, designing a generalizable steganographic approach for 3D Gaussian Splatting (3DGS) that can embed meaningful 3D scene content remains challenging. In this paper, we propose IBRSteG, a generalizable framework for 3DGS steganography that enables undetectable concealment of secret scenes within a steganographic scene. Unlike existing approaches whose parameter generation is rigidly coupled with the specif...

📄 SWE-Together: Evaluating Coding Agents in Interactive User Sessions
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.29957v1
👥 Authors: Yifan Wu (possible past Carnegie Mellon University affiliation), Zhuokai Zhao, Songlin Li, Ho Hin Lee, Jiacheng Zhu, Shirley Wu, Tianhe Yu (possible past Stanford University affiliation), Serena Li, Lizhu Zhang, Xiangjun Fan, Shengzhi Li
Abstract

Most coding-agent benchmarks are static: an agent receives a complete task description up front and is judged only by its final code. Real coding assistance is interactive, with users clarifying goals, adding constraints, and correcting mistakes over multiple turns. We introduce SWE-Together, a multi-turn benchmark reconstructed from real user-agent coding sessions. To make real interactions verifiable, we curate 109 repository-level tasks from 11,260 recorded sessions, selecting sessions with r...

📄 HippoSpark: An On-Demand Experience System for LLM Reasoning
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.29929v1
👥 Authors: Jingyao Liu, Danling Meng, Chen Huang, Yukun Yan, Zhenghao Liu (possible past Tsinghua University affiliation), Wenqiang Lei, See-Kiong Ng, Maosong Sun (possible past Tsinghua University affiliation)
Abstract

Distilling historical trajectories into reusable experience to enhance future problem-solving has become a focal point of recent LLM research. However, existing methods predominantly operate at the task level, leveraging general summaries or rules under the assumption that analogous tasks share universal solution patterns. This approach often fails in complex reasoning, which typically falters at local bottlenecks that require precise, state-specific guidance rather than broad heuristics. We int...

📄 Experience Graphs: The Data Foundation for Self-Improving Agents
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.29823v1
👥 Authors: Gang Liao, Yujia He, Abdullah Ozturk, Zhouyang Li, Ying Wang (possible past Tsinghua University affiliation), Zhitong Guo, Hongsen Qin, Yaobin Qin, Tao Yang, Zewei Jiang, Dianshi Li, Jort Gemmeke, Jiangyuan Li, Liyuan Li, Nathan Yan, Masha Basmanova, Uladzimir Pashkevich, Matt Steiner, Pedro Pedreira, Rob Fergus (possible past Meta (United States) affiliation), Anirudh Goyal, Carole-Jean Wu (possible past Meta (United States) affiliation), Gaoxiang Liu, Andrew Witten, Daniel J. Abadi
Abstract

The database community has repeatedly advanced the state of the art by recognizing that new workloads demand new system architectures. We argue that long-horizon agentic tasks -- code generation, scientific discovery, hardware design -- are such a workload. These agents explore: they generate artifacts, execute tools, observe failures, branch, and repair over hundreds of steps. This search produces a structured object we call an experience graph: executable artifacts, tool outputs, rewards, sibl...

📄 DEEPMED Search: An Open-Source Agentic Platform for Medical Deep Research with Introspective Verification
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.29746v1
👥 Authors: Maolin Liu, Fanyu Xu, Ruoqing Xu, Jiahang Zhang, Hao Wang (possible past Tsinghua University affiliation), Rui Wang (possible past Tencent (China) affiliation)
Abstract

Navigating the deluge of heterogeneous medical data, from academic literature (PubMed) to clinical guidelines (Web) and private knowledge bases, remains a critical bottleneck for evidence-based medicine. While commercial black-box tools lack transparency, standard open-source RAG implementations frequently suffer from reasoning drift when handling complex, long-tail queries. We present DEEPMED Search, a fully open-source, agentic platform designed for transparent medical deep research. Built on ...

📄 DeepTrans Studio: Turning Expert Interventions into Shared Team Knowledge in Agentic Translation Workflows
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.29727v1
👥 Authors: Ziyang Lian, Qingya Zhang, Hao Wang (possible past Tsinghua University affiliation), Huiwen Xiong, Qi Yang, Lingyi Meng, Xiaoyi Gu, Rui Wang (possible past Tencent (China) affiliation)
Abstract

Professional translation is often a team-based process: translators, reviewers, and project managers must coordinate terminology, legal force, and accountability across documents. Yet many LLM-based translation tools treat human corrections as isolated edits. Expert decisions made in one segment or by one member are rarely captured as reusable knowledge for the rest of the team. We present DeepTrans Studio, a collaborative translation workspace that lets professionals intercept selected nodes in...

📄 RESOURCE2SKILL: Distilling Executable Agent Skills from Human-Created Multimodal Resources
🗓️ Published: 6/28/2026
🔗 http://arxiv.org/abs/2606.29538v1
👥 Authors: Yijia Fan, Zonglin Di, Zimo Wen, Yifan Yang (possible past Tencent (China) affiliation), Mingxi Cheng, Qi Dai, Bei Liu, Kai Qiu, Yue Dong, Ji Li, Chong Luo (possible past Google (United States) affiliation)
Abstract

Skills are a useful abstraction for software agents, turning human and agent experience into reusable procedural knowledge. Yet existing skill libraries are mostly hand-written, text-centric, or derived from agent traces, leaving tutorial videos and other multimodal human resources largely underused. We present RESOURCE2SKILL, a framework that distills multimodal resources, including tutorial videos, repositories, articles, and reference artifacts, into executable skills for software agents. RES...

📄 OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks
🗓️ Published: 6/28/2026
🔗 http://arxiv.org/abs/2606.29537v1
👥 Authors: Mengqi Yuan, Zilong Zhou, Xinzhuang Xiong, Weiming Wu, Jiayang Sun, Jiamin Song, Kaiqian Cui, Bowen Wang, Haoyuan Wu, Yitong Li, Dunjie Lu, Haikong Lu, Qi Zhen, Xinyuan Wang, Jiaqi Deng, Yuhao Yang, Cheng Chen (possible past Google (United States) affiliation), Boyuan Zheng, Alex Su, Xiao Yu, Hao Zou, Saaket Agashe, Xing Han Lu, Manpreet Kaur, Zhengyang Qi, Vincent Sunn Chen, Frederic Sala, Dayiheng Liu, Junyang Lin, Zhou Yu, Yu Su, Siva Reddy (possible past University of Edinburgh affiliation), Xin Eric Wang, Peng Qi, Tianbao Xie, Tao Yu (possible past University of Washington affiliation)
Abstract

Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents. We introduce OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows across everyday and professional tasks, designed to capture complex and challenging real-world phenomena. Each task represents a realistic end-to-end workflow that takes human users a median of about 1.6 hours to complete a...

📄 MotionAtlas: Detailed Region Captioning for Motion-Centric Videos
🗓️ Published: 6/28/2026
🔗 http://arxiv.org/abs/2606.29531v1
👥 Authors: Weisong Liu, Haochen Wang, Kuan Gao, Yuhao Wang, Yikang Zhou, Zhongwei Ren, Jacky Mai, Anna Wang, Yanwei Li, Jason Li (possible past Nvidia (United States) affiliation), Zhaoxiang Zhang (possible past Beijing Academy of Artificial Intelligence affiliation)
Abstract

We propose MotionAtlas, a system for detailed captioning of motion-centric videos, comprising (1) a dedicated human-annotated benchmark, (2) a scalable, high-quality pipeline to construct training samples, and (3) a family of powerful Video-MLLMs. Unlike conventional global motion captioning datasets, we focus on region-aware motion captioning: given a video and a spatiotemporal mask, the model generates precise descriptions of motion within the target region, thereby alleviating visual clutter ...

📄 Cognitive World Models for Process-Level Social Influence Evaluation
🗓️ Published: 6/28/2026
🔗 http://arxiv.org/abs/2606.29495v1
👥 Authors: Minghui Ma, Bin Guo, Han Wang (possible past Peking University affiliation), Mengqi Chen, Jingqi Liu, Yan Liu (possible past Tencent (China) affiliation), Zhiwen Yu
Abstract

Social influence dialogue changes user behavior by altering internal cognitive states. The central evaluation question is whether the user's beliefs, desires, intentions, and emotions measurably change over the course of conversation, a process-oriented criterion that neither surface-level text metrics (BLEU/ROUGE) nor single-score LLM judgments can capture. We propose the Cognitive World Model (CogWM), an LLM-based user model that reframes multi-turn dialogue...

📄 To Reason or to Fabricate: Reasoning Without Shortcuts via Hint-Anchored Pairwise Aggregation
🗓️ Published: 6/28/2026
🔗 http://arxiv.org/abs/2606.29481v1
👥 Authors: Jiuheng Lin, Chen Zhang (possible past Peking University affiliation), Yansong Feng (possible past Peking University affiliation)
Abstract

While reinforcement learning (RL) significantly enhances LLM reasoning, its efficacy is severely undermined by Pre-RL data overlap, where RL datasets overlap with pretraining or SFT corpora, causing models to exploit shortcuts by memorizing correct answers and fabricating post-hoc reasoning. To address this, we introduce HIPPO, a novel RL framework that integrates hint-injected aggregation with a tailored pairwise reward model. By utilizing hint injection to deliberately trigger overlap-induced ...

📄 Dynamic Parsing and Updating Natural Language Specification using VLMs for Robust Vision-Language Tracking
🗓️ Published: 6/28/2026
🔗 http://arxiv.org/abs/2606.29357v1
👥 Authors: Xiao Wang (possible past Google (United States) affiliation), Liye Jin, Dan Xu (possible past University of Oxford affiliation), Yuehang Li, Lan Chen, Yaowei Wang, Yonghong Tian (possible past Peking University affiliation), Jin Tang
Abstract

Vision-language tracking guided by natural language specifications leverages high-level semantic cues of target objects to substantially boost tracking accuracy and robustness. Existing studies have verified that adaptively optimizing textual descriptions throughout the tracking process can effectively mitigate the semantic-visual mismatch induced by dynamic variations in target appearance, position, and other inherent attributes. Nevertheless, mainstream methods that directly generate textual i...

📄 Process Advantage Signal Shaping: A Paradigm-Agnostic Middleware for Process-Supervised RL in LLM Reasoners
🗓️ Published: 6/28/2026
🔗 http://arxiv.org/abs/2606.29296v1
👥 Authors: Chao Wang (possible past Google (United States) affiliation), Hongtao Tian, Tao Yang, Yunsheng Shi (possible past Baidu (China) affiliation), Ting Yao, Wenbo Ding (possible past Tsinghua University affiliation)
Abstract

Group Relative Policy Optimization (GRPO) is a default recipe for process-supervised reinforcement learning of LLM reasoners, and dense process supervision -- via learned process reward models (PRMs) or on-policy-distillation KL signals -- is a common way to densify its otherwise weak outcome reward. Layering such a step-level signal on top of GRPO's group-standardized advantage, however, exposes three structural pathologies: channel contamination between the pooled process, outcome, and ...

📄 A Multi-Dataset Benchmark for Evaluating LLM Agents in Microservice Failure Diagnosis
🗓️ Published: 6/28/2026
🔗 http://arxiv.org/abs/2606.29193v1
👥 Authors: Yuanhong Cai, Xiaohui Nie (possible past Tsinghua University affiliation), Kanglin Yin, Changhua Pei, Yongqian Sun (possible past Tsinghua University affiliation), Shenglin Zhang, Haibin Liu, Guiyang Liu, Xidao Wen, Fang Situ, Dan Pei (possible past Tsinghua University affiliation)
Abstract

LLM-based agents are reshaping microservice operations into AgentOps, where benchmarks are key to evaluating failure diagnosis over multimodal observability data. However, existing benchmarks remain largely outcome-oriented: they score only the final answer and fail to assess the systematic reasoning process in failure diagnosis. We address this gap by introducing two large-scale datasets (AIOps2025 and RCA100) under a reasoning-process evaluation paradigm that assesses agentic diagnostic capabi...

📄 A multi-architecture study of specificity refinement and false-positive mechanism analysis in prostate MRI
🗓️ Published: 6/29/2026
🔗 http://arxiv.org/abs/2606.29977v1
👥 Authors: Yongbo Shu, Kewen Chen, Yifeng Yuan, Zirui Xin, Luo Lei, Yang Yang (possible past Tencent (China) affiliation), Xi Chen (possible past University of California, Berkeley affiliation), Aijing Luo
Abstract

Objectives: To characterize residual false positives in prostate MRI detection, and to evaluate a lightweight post-hoc refinement head for case-level specificity. Materials and Methods: This retrospective study used PI-CAI (5-fold cross-validation) and Prostate158 (n=158; external). A context-aware evidence head and an 89,216-parameter refinement head were trained on a frozen detection backbone; the evidence head was also trained on four further backbones (bare nnU-Net, bare U-Net, bare Mamba, M...

📄 Exploring the Cryptographic Limits of Transformer Networks
🗓️ Published: 6/28/2026
🔗 http://arxiv.org/abs/2606.29389v1
👥 Authors: Stefan Domunco, Andis Draguns, Philip Torr (possible past University of Oxford affiliation), Isaac Robinson, Christian Schroeder De Witt (possible past University of Oxford affiliation)
Abstract

In recent work it has been shown that colluding AI agents can use steganographic methods to exchange malicious information. Whether a transformer can implement steganographic methods depends on what cryptographic functions it can implement, since a transformer that can implement a cryptographic function within its layers has source-free randomness access. Despite existing circuit-complexity results, no prior work maps specific cryptographic constructions to transformer architectures. As Merrill ...

*Notable papers are those with at least two authors from a "big" AI/ML lab.