📄 Notable* Recent AI/ML arXiv Papers

📄 ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning
🗓️ Published: 6/12/2026
🔗 http://arxiv.org/abs/2606.14697v1
👥 Authors: Sicheng Yang, Hangjie Yuan, Wenjun Zhang (possible past Shanghai Jiao Tong University affiliation), Jinwang Wang, Yichen Qian, Weihua Chen, Fan Wang (possible past Baidu (China) affiliation), Lei Zhu
Abstract

Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where hallucinations originate within the reasoning process. We find that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration. To enable source-level hallucination diagnosis, we intr...

📄 Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows
🗓️ Published: 6/12/2026
🔗 http://arxiv.org/abs/2606.14672v1
👥 Authors: Shikun Liu, Mufei Li, Dongqi Fu, Haoyu Wang (possible past Tencent (China) affiliation), Yinglong Xia, Hong Li, Hong Yan, Pan Li (possible past Baidu (China) affiliation)
Abstract

Large language models increasingly serve as execution engines for agentic systems, yet they still consume context through a sequential text interface. This creates a mismatch with modern structured agent workflows, in which independent branches explore subtasks, retrieve evidence, or generate candidate solutions before a final synthesis step. Existing systems typically merge these branches by concatenating their textual outputs, which discards the parallel structure and incurs redundant prefill ...

📄 From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI
🗓️ Published: 6/12/2026
🔗 http://arxiv.org/abs/2606.14502v1
👥 Authors: Yongheng Zhang, Ziang Liu, Jiaxuan Zhu, Shuai Wang, Xiangqi Chen, Haojing Huang, Jiayi Kuang, Siyu Chen, Ao Shen, Hao Wu (possible past Tencent (China) affiliation), Qiufeng Wang, Qian-Wen Zhang, Junnan Dong, Wenhao Jiang (possible past Tencent (China) affiliation), Ying Shen, Hai-Tao Zheng (possible past Tsinghua University affiliation), Yinghui Li, Di Yin, Xing Sun (possible past Tencent (China) affiliation), Philip S. Yu (possible past Tsinghua University affiliation)
Abstract

Large Language Models (LLMs) are undergoing a fundamental transformation from conversational generators into integrated AI systems capable of reasoning, action, memory, and self-improvement. We conceptualize this transition as a shift from Chatbot to Digital Colleague: from conversational answers to persistent work. We organize this transition along two tightly coupled dimensions. First, at the cognitive core level, LLMs are advancing from Chatbot-era "fast thinking" systems driven by next-token...

📄 Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack
🗓️ Published: 6/12/2026
🔗 http://arxiv.org/abs/2606.14409v1
👥 Authors: He Zhang, Lingzhu Xiang, Haitao Lin, Zeyu Huang, Minghui Wang, Dingyan Zhong, Yubo Dong, Yihao Wu, Yongming Rao, Dongsheng Zhang (possible past Tencent (China) affiliation), Wanjia He, Ling Chen, Kai Huang, Jiahao Chen, Sichang Su, Xumin Yu, Ziyi Wang, Chengwei Zhu, Xiao Teng, Yuchun Guo, Yufeng Zhang, Yuandong Liu, Rui Wang (possible past Tencent (China) affiliation), Zisheng Lu, Han Hu, Zhengyou Zhang (possible past Tencent (China) affiliation)
Abstract

In this report, we present Hy-Embodied-0.5-VLA, abbreviated as HyVLA-0.5, an end-to-end system that spans the full robot learning stack: data collection, model design, continued pre-training and supervised fine-tuning, RL post-training, and real-world deployment. Each component serves a distinct role in this stack....

📄 Communication Policy Evolution for Proactive LLM Agents
🗓️ Published: 6/12/2026
🔗 http://arxiv.org/abs/2606.14314v1
👥 Authors: Xinbei Ma, Jiyang Qiu, Yao Yao (possible past Alibaba Group (China) affiliation), Zheng Wu, Yijie Lu, Xiangmou Qu, Jiaxin Yin, Xingyu Lou, Jun Wang (possible past Tencent (China) affiliation), Weiwen Liu, Weinan Zhang (possible past Shanghai Jiao Tong University affiliation), Zhuosheng Zhang, Hai Zhao (possible past Shanghai Jiao Tong University affiliation)
Abstract

LLM agents have rapidly evolved into autonomous systems, yet a persistent information gap remains between users and agents: communication is costly, while users' identical preferences further limit information exchange. To investigate how agents should communicate across modalities, this paper formalizes Communication Policy, establishes textual and UI-based policies, and then evaluates communication policies across diverse environments, personas, and model combinations. Building information asy...

📄 AgentCyberRange: Benchmarking Frontier AI Systems in Realistic Cyber Ranges
🗓️ Published: 6/12/2026
🔗 http://arxiv.org/abs/2606.14295v1
👥 Authors: Fengyu Liu, Jiarun Dai, Yihe Fan, Wuyuao Mai, Ziao Li, Bofei Chen, Jie Zhang, Zheng Lou, Bocheng Xiang, Qiyi Zhang, Xudong Pan, Geng Hong, Yuan Zhang (possible past Google (United States) affiliation), Min Yang (possible past Baidu (China) affiliation)
Abstract

Frontier AI systems are increasingly capable of cybersecurity tasks, including codebase inspection, vulnerability detection, and exploitation. However, evaluating their offensive capabilities remains constrained by limited access to open, reproducible, multi-host cyber ranges. Existing public benchmarks capture isolated skills such as CTF solving, vulnerability reproduction, and exploit generation, but often abstract away realistic intrusion workflows: discovering exposed services, gaining a foo...

📄 OdysSim: Building Foundation Models for Human Behavior Simulation
🗓️ Published: 6/12/2026
🔗 http://arxiv.org/abs/2606.14199v1
👥 Authors: Xuhui Zhou, Weiwei Sun, Weihua Du, Jiarui Liu, Haojia Sun, Qianou Ma, Tongshuang Wu (possible past University of Washington affiliation), Yiming Yang (possible past Microsoft (United States) affiliation), Maarten Sap
Abstract

Large language models are increasingly deployed as human simulators for interactive evaluation and social simulation. Yet helpfulness-driven post-training pulls them toward a homogeneous, overly agreeable assistant register, creating a behavioral Sim2Real gap. We present OdysSim, the largest open systematic investigation of behavioral foundation models, i.e., models trained to simulate human behavior at scale. We propose SOUL, a taxonomy of five capability axes (CONV, SS, COG, ROLE, EVAL) that u...

📄 Conditioning Matters: Stabilizing Inversion and Attention in Diffusion Image Editing
🗓️ Published: 6/12/2026
🔗 http://arxiv.org/abs/2606.14125v1
👥 Authors: Zheyuan Zhan, Hongchen Li, Can Wang (possible past Tsinghua University affiliation), Yinfei Ma, Mingzhen Huang, Ruoshi Bai, Jiawei Chen (possible past Tencent (China) affiliation), Siwei Lyu (possible past University of Washington affiliation), Defang Chen
Abstract

Inversion-based image editing offers flexible and training-free control but still struggles with inversion accuracy and the trade-off between editing fidelity and background preservation. While recent methods improve inversion formulations or attention interactions, the role of textual conditioning in shaping diffusion dynamics and editing behavior remains underexplored. We show both empirically and theoretically that the precision of textual conditioning influences inversion stability by modula...

📄 FEMOT: Multi-Object Tracking using Frame and Event Cameras
🗓️ Published: 6/12/2026
🔗 http://arxiv.org/abs/2606.14094v1
👥 Authors: Shiao Wang, Xiao Wang (possible past Google (United States) affiliation), Chao Wang (possible past Google (United States) affiliation), Yitao Li, Menghao Liu, Bo Jiang, Yaowei Wang, Yonghong Tian (possible past Peking University affiliation), Jin Tang
Abstract

Conventional RGB cameras have been widely used in multi-object tracking due to their ability to capture rich appearance and semantic information. However, their performance is often degraded under complex real-world challenges, such as motion blur, low illumination, and overexposure. Bio-inspired event cameras offer high temporal resolution and high dynamic range, providing complementary cues under extreme scenarios. Nevertheless, RGB-event multi-object tracking remains underexplored due to the ...

📄 Crypto x AI, AI x Crypto: A Survey
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13892v1
👥 Authors: Sarah Allen, Pranay Anchuri, James Austgen, Maryam Bahrani, Samuel Breckenridge, Aaron Buchwald, Christian Cachin, Andrés Fábrega, Jared Fernandez, James Hsin-Yu Chiang, Marwa Mouallem, Roi Bar-Zur, Neil Desilva, Ittay Eyal, Giulia Fanti (possible past University of California, Berkeley affiliation), Ari Juels, Andrew Miller, Christian Sillaber, Dani Vilardell, Pramod Viswanath, Wenhao Wang, Matt Weinberg, Sen Yang (possible past Tencent (China) affiliation), Jianzhu Yao, Fan Zhang
Abstract

The intersection of crypto x AI is spawning papers, products, online posts, and companies. All the surrounding buzz, though, obscures what exactly has been done, what the opportunities and challenges are, and what open questions deserve attention. This survey paper asks what AI can do for blockchain-based technologies (broadly construed as "crypto") (crypto x AI), and vice versa (AI x crypto). We systematize existing work, summarize key takeaways, highlight open research questions, and offer a p...

📄 MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13782v1
👥 Authors: Lushi Pu, Weiming Zhang, Xinheng Xie, Zixuan Fu, Bingxiang He, Hongya Lyu, Xin Li (possible past Google (United States) affiliation), Jie Zhou (possible past Tsinghua University affiliation), Yudong Wang
Abstract

Large Language Models (LLMs) have made notable progress in automated theorem proving, yet existing formal benchmarks remain limited in both mathematical coverage and difficulty. Most are concentrated in areas that are easier to formalize, such as algebra and elementary number theory, and provide limited coverage of subfields that require deeper reasoning, including mathematical analysis. To address this gap, we introduce MA-ProofBench, to the best of our knowledge, the first formal theorem-provi...

📄 Mana: Dexterous Manipulation of Articulated Tools
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13677v1
👥 Authors: Zhao-Heng Yin, Guanya Shi, Pieter Abbeel (possible past University of California, Berkeley affiliation), C. Karen Liu (possible past Stanford University affiliation)
Abstract

Articulated tool manipulation remains a major challenge in dexterous robotics due to the need to coordinate internal degrees of freedom and contact-rich interactions. While prior work has largely focused on rigid objects, articulated tool use remains underexplored because of its physical complexity and the difficulty of learning functional grasping and manipulation policies. We present Mana (Manipulation Animator), a general sim-to-real framework that reinterprets dexterous manipulation as an an...

📄 SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13673v1
👥 Authors: Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su (possible past Tsinghua University affiliation), Byung-Kwan Lee, Chan Hee Song, Sifei Liu (possible past Nvidia (United States) affiliation), Subhashree Radhakrishnan, Seungryong Kim, Yu-Chiang Frank Wang, Min-Hung Chen
Abstract

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing sp...

📄 Agents-K1: Towards Agent-native Knowledge Orchestration
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13669v1
👥 Authors: Zongsheng Cao, Bihao Zhan, Jinxin Shi, Jiong Wang, Fangchen Yu, Zhijie Zhong, Zijie Guo, Tianshuo Peng, Zhuo Liu, Yi Xie, Xiang Zhuang, Yue Fan, Runmin Ma, Shiyang Feng, Xiangchao Yan, Anran Liu, Peng Ye, Wenlong Zhang, Shufei Zhang, Chunfeng Song, Fenghua Ling, Jie Zhou (possible past Tsinghua University affiliation), Liang He, Bo Zhang (possible past Tencent (China) affiliation), Lei Bai
Abstract

Current LLM-based research agents have advanced through agent orchestration, yet largely overlook scientific knowledge orchestration. Existing works often reduce papers to abstracts, surface mentions, and flat "cites" edges, omitting key entities, claims, evidence, mechanisms, and method lineages essential for scientific reasoning. To this end, we introduce Agents-K1, an end-to-end knowledge orchestration pipeline that converts raw documents into agent-native scientific knowledge...

📄 EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13662v2
👥 Authors: Amy Xin, Jiening Siow, Junjie Wang, Zijun Yao, Fanjin Zhang, Jian Song (possible past Tsinghua University affiliation), Lei Hou (possible past Tsinghua University affiliation), Juanzi Li
Abstract

LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and iterate scientific solutions, and have produced results that outperform human-designed approaches. As model capabilities continue to improve, we argue that the bottleneck for autonomous scientific discovery is shifting from prescribing agent workflows to designing agent environments: the resources, constraints, and interfac...

📄 AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13608v1
👥 Authors: Xiaoyuan Liu, Jianhong Tu, Yuqi Chen, Siyuan Xie, Sihan Ren, Tianneng Shi, Gal Gantar, Evan Sandoval, Donghyun Lee, Daniel Miao, Peter J. Gilbert, Nick Hynes, Mauro Staver, Warren He, David Marn, Andrew Low, Xi Zhang, Elron Bandel, Michal Shmueli-Scheuer, Siva Reddy (possible past University of Edinburgh affiliation), Alexandre Drouin, Alexandre Lacoste (possible past Google (United States) affiliation), Ramayya Krishnan, Elham Tabassi, Yu Su, Victor Barres, Chenguang Wang (possible past Amazon (United States) affiliation), Wenbo Guo, Dawn Song (possible past University of California, Berkeley affiliation)
Abstract

Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols: ...

📄 Reward Modeling for Multi-Agent Orchestration
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13598v1
👥 Authors: King Yeung Tsang, Zihao Zhao (possible past Tsinghua University affiliation), Vishal Venkataramani, Haizhou Shi, Zixuan Ke, Semih Yavuz (possible past Google (United States) affiliation), Shafiq Joty, Hao Wang (possible past Tsinghua University affiliation)
Abstract

Multi-Agent Systems (MAS) built on Large Language Models (LLMs) require effective orchestration to coordinate specialized agents, yet training such orchestrators is hindered by limited supervision and high computational cost. We propose Orchestration Reward Modeling (OrchRM), a self-supervised framework for evaluating orchestration quality without human annotations. OrchRM leverages intermediate artifacts from multi-agent executions to construct win-lose pairs for Bradley-Terry reward model trai...

📄 LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13578v1
👥 Authors: Baochang Ren, Xinjie Liu, Xi Chen (possible past University of California, Berkeley affiliation), Yanshuo Liu, Chenxi Li, Daqi Gao, Zeqin Su, Jintao Xing, Zirui Xue, Rui Li (possible past Google (United States) affiliation), Xiangyu Zhao, Shuofei Qiao, Minting Pan, Wangmeng Zuo, Lei Bai, Dongzhan Zhou, Ningyu Zhang (possible past Tencent (China) affiliation), Huajun Chen (possible past Alibaba Group (China) affiliation)
Abstract

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demons...

📄 MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13473v1
👥 Authors: Jiacheng Chen, Xinyu Zhang (possible past Baidu (China) affiliation), Shunkai Zhang, Yanmohan Wang, Lin Li, Tiancheng Qin, Qin Wang (possible past ETH Zurich affiliation), Zhengmao Zhu, Tianle Li, Jingyang Li, Zehan Li, Binyang Jiang, Jin Zhu, Han Ding, Fei Yu, Chenyu Du, Zijian Song, Jiayuan Song, Zhi Zhang, Yunan Huang, Weiyu Cheng, Pengyu Zhao, Yu Cheng (possible past National University of Singapore affiliation)
Abstract

We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities -- proof generation, proof verification, and critique-conditioned proof repair -- using a defense-in-depth generative verifier engineered for low false-positive rate. These capabilities are merged into a single released M3 model. At test time, MaxProof treats the model as a generator, verifier, refiner, and ranker...

📄 PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13400v1
👥 Authors: Jianming Ma, Qiyue Yang, Yang Zhang (possible past Tsinghua University affiliation), Liyun Yan, Zhanxiang Cao, Yazhou Zhang, Yue Gao (possible past Tsinghua University affiliation)
Abstract

While flow-based generative models have demonstrated strong performance across a wide range of domains, deploying them in safety-critical physical systems remains challenging due to strict constraint requirements. Existing approaches typically enforce safety through post-hoc corrections, which incur substantial computational overhead and may distort the learned distribution. We propose PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and fl...

📄 Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13385v1
👥 Authors: Zihao Wang, Yiming Li (possible past Tsinghua University affiliation), Yutong Wu, Zheyu Liu, Kangjie Chen, Fok Kar Wai, Pin-Yu Chen, Vrizlynn L. L. Thing, Bo Li (possible past Tencent (China) affiliation), Dacheng Tao, Tianwei Zhang
Abstract

Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-injection attacks, in which seemingly benign content embeds adversarial instructions that manipulate agent behaviour. Existing security benchmarks adopt an attack-centric perspective, focusing on the technical feasibility of injections while overlooking the ...

📄 HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13289v1
👥 Authors: Guozhen Zhang, Xuerui Qiu, Yutao Cui, Tianhui Song, Changlin Li, Junzhe Li, Tao Huang, Xiao Zhang (possible past Tsinghua University affiliation), Yang Li (possible past Google (United States) affiliation), Jianbing Wu, Miles Yang, Zhao Zhong, Liefeng Bo, Limin Wang
Abstract

Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. To address ...

📄 Beyond task performance: Decoding bioacoustic embeddings with speech features
🗓️ Published: 6/12/2026
🔗 http://arxiv.org/abs/2606.14662v1
👥 Authors: Ines Nolasco, Jules Cauzinille, Marius Miron, Gagan Narula, Milad Alizadeh, Emmanuel Fernandez, Matthieu Geist (possible past Google (United States) affiliation), Ellen Gilsenan-Mcmahon, Olivier Pietquin (possible past Google (United States) affiliation), Emmanuel Chemla, Sara Keen
Abstract

Pretrained audio embeddings are standard in bioacoustics, yet little is known about which acoustic features these models encode, nor which are useful for a given task. This hinders transparency and limits extension to rare species or data-scarce domains. Here we reveal which speech-like features are encoded in bioacoustic representations. Using the 88 eGeMAPS features across six taxonomic groups, we apply linear and nonlinear regression probes to quantify which acoustic properties each model cap...

📄 Graph Diffusion Residuals for Control-Function Instrumental Variables
🗓️ Published: 6/12/2026
🔗 http://arxiv.org/abs/2606.14636v1
👥 Authors: Rui Wu (possible past Google (United States) affiliation), Zongyuan Chen, Hong Xie, Defu Lian, Enhong Chen (possible past Baidu (China) affiliation)
Abstract

Control-function instrumental variable estimators need a first-stage residual, not merely a first-stage prediction. High-capacity first stages can interpolate treatment and leave too little residual information for the outcome equation. We study Adaptive Anisotropic Instrumental Heat Flow (A-IHF), a deterministic graph-diffusion residual extractor for flexible control functions. A-IHF treats treatment as a signal on a graph of first-stage features, uses pilot diffusion to detect large treatment ...

📄 Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments
🗓️ Published: 6/12/2026
🔗 http://arxiv.org/abs/2606.14397v1
👥 Authors: Mykola Vysotskyi, Runqi Lin, Grzegorz Biziel, Michal Zakrzewski, Sebastian Montagna, Damian Rynczak, Shreyansh Padarha, Kumail Alhamoud, Zihao Fu, William Lugoloobi, Kai Rawal, Hanna Yershova, Xander Davies, Taras Rumezhak, Guohao Li, Fazl Barez, Baoyuan Wu (possible past Tencent (China) affiliation), Arkadiusz Drohomirecki, Yarin Gal, Chris Russell, Christopher Summerfield (possible past University of Oxford affiliation), Adam Mahdi, Volodymyr Karpiv, Philip Torr (possible past University of Oxford affiliation), Adel Bibi
Abstract

As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to faithfully evaluate their capabilities. However, current benchmarks are typically built on popular applications with relatively simple tasks and focus on a narrow set of capabilities while overlooking broader dimensions, resulting in saturated performance on modern agents and failing to probe their limitations. To this end, we introduce GauntletBench, a web-based benchmark for eval...

📄 A Low-Rank Subspace Analysis of LLM Interventions
🗓️ Published: 6/12/2026
🔗 http://arxiv.org/abs/2606.14388v1
👥 Authors: Angira Sharma, Christian Schroeder De Witt (possible past University of Oxford affiliation), Philip Torr (possible past University of Oxford affiliation), Anisoara Calinescu, Jialin Yu
Abstract

Interventions designed to modify a particular behavior in LLMs, such as refusal or sycophancy, often produce unintended changes in other behaviors. This lack of targeted control makes it difficult to design and implement reliable safety controls. To understand these side-effects, we introduce a diagnostic framework for analyzing interacting behaviors in LLMs. We model behaviors as low-rank subspaces in activation space, and study how interventions influence across behaviors. Across multiple inst...

📄 When Language Representations Interact: Separability and Cross-Lingual Effects in LLMs
🗓️ Published: 6/12/2026
🔗 http://arxiv.org/abs/2606.14347v1
👥 Authors: Boris Marinov, Angira Sharma, Christian Schroeder De Witt (possible past University of Oxford affiliation), Philip Torr (possible past University of Oxford affiliation), Anisoara Calinescu, Jialin Yu
Abstract

Large language models exhibit strong multilingual capabilities; however, their internal representations are difficult to interpret. Understanding these interactions is important for ensuring reliable behavior in multilingual systems. Recent work has shown that causal-geometric structure can explain how certain concepts are encoded as approximately linear and separable directions, but whether this framework extends to multilingual models, where language identity is correlated and hierarchical, is...

📄 Diffusion Policy Optimization without Drifting Apart
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13795v1
👥 Authors: Haozhe Jiang, Haiwen Feng, Pieter Abbeel (possible past University of California, Berkeley affiliation), Jiantao Jiao, Angjoo Kanazawa (possible past University of California, Berkeley affiliation), Nika Haghtalab
Abstract

RL post-training has become increasingly pivotal for improving diffusion policies, but existing diffusion policy-gradient methods are often unstable and cannot achieve reliable policy improvement. We identify the cause as the double-drift phenomenon: optimizing a variational surrogate can let the ELBO separate from the true log-likelihood, which then makes the resulting proxy policy gradient misaligned with the true policy gradient of expected return. We propose DiPOD, a diffusion polic...

📄 NetCause: Counterfactual Learning for Root Cause Analysis in Large-Scale Networks
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13543v1
👥 Authors: Fabien Chraim, Jian Zhang (possible past Tencent (China) affiliation), Dominik Janzing, Xiang Song, Christos Faloutsos (possible past Carnegie Mellon University affiliation), John Evans
Abstract

Can a learned model capture how faults propagate through a large-scale network and use this knowledge to causally attribute customer impact to its underlying root cause? Existing root cause analysis techniques often rely on static rules, correlation heuristics, or topology-local reasoning, which struggle to generalize in dynamic environments where faults propagate across complex physical and logical dependencies. We present NetCause, a self-supervised learning-based framework that models netwo...

📄 GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13501v1
👥 Authors: Xinwei Qiang, Yifan Hu (possible past Tencent (China) affiliation), Shixuan Sun, Jing Yang, Han Zhao, Chen Chen (possible past Tencent (China) affiliation), Yu Feng (possible past University of California, Berkeley affiliation), Jingwen Leng, Minyi Guo
Abstract

Diffusion Transformers (DiTs) have become the dominant architecture for image and video generation, creating growing demand for efficient DiT serving. Existing systems assign each request a fixed parallel configuration throughout its lifetime. However, DiT workloads exhibit substantial heterogeneity across requests, execution stages, and system conditions, making static parallelism inefficient and often leading to poor GPU utilization and degraded service quality. This paper argues that DiT se...

*Notable papers are those with at least two authors from a "big" AI/ML lab.