📄 Notable* Recent AI/ML arXiv Papers


📄 MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13473v1
👥 Authors: Jiacheng Chen, Xinyu Zhang (possible past Baidu (China) affiliation), Shunkai Zhang, Yanmohan Wang, Lin Li, Tiancheng Qin, Qin Wang (possible past ETH Zurich affiliation), Zhengmao Zhu, Tianle Li, Jingyang Li, Zehan Li, Binyang Jiang, Jin Zhu, Han Ding, Fei Yu, Chenyu Du, Zijian Song, Jiayuan Song, Zhi Zhang, Yunan Huang, Weiyu Cheng, Pengyu Zhao, Yu Cheng (possible past National University Of Singapore affiliation)
Abstract

We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities -- proof generation, proof verification, and critique-conditioned proof repair -- using a defense-in-depth generative verifier engineered for low false-positive rate. These capabilities are merged into a single released M3 model. At test time, MaxProof treats the model as a generator, verifier, refiner, and ranker...

📄 PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13400v1
👥 Authors: Jianming Ma, Qiyue Yang, Yang Zhang (possible past Tsinghua University affiliation), Liyun Yan, Zhanxiang Cao, Yazhou Zhang, Yue Gao (possible past Tsinghua University affiliation)
Abstract

While flow-based generative models have demonstrated strong performance across a wide range of domains, deploying them in safety-critical physical systems remains challenging due to strict constraint requirements. Existing approaches typically enforce safety through post-hoc corrections, which incur substantial computational overhead and may distort the learned distribution. We propose PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and fl...

📄 Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13385v1
👥 Authors: Zihao Wang, Yiming Li (possible past Tsinghua University affiliation), Yutong Wu, Zheyu Liu, Kangjie Chen, Fok Kar Wai, Pin-Yu Chen, Vrizlynn L. L. Thing, Bo Li (possible past Tencent (China) affiliation), Dacheng Tao, Tianwei Zhang
Abstract

Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-injection attacks, in which seemingly benign content embeds adversarial instructions that manipulate agent behaviour. Existing security benchmarks adopt an attack-centric perspective, focusing on the technical feasibility of injections while overlooking the ...

📄 HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13289v1
👥 Authors: Guozhen Zhang, Xuerui Qiu, Yutao Cui, Tianhui Song, Changlin Li, Junzhe Li, Tao Huang, Xiao Zhang (possible past Tsinghua University affiliation), Yang Li (possible past Google (United States) affiliation), Jianbing Wu, Miles Yang, Zhao Zhong, Liefeng Bo, Limin Wang
Abstract

Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. To address ...

📄 ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13239v1
👥 Authors: Jiaxin Ai, Tao Hu (possible past Baidu (China) affiliation), Xuemeng Yang, Shu Zou, Hairong Zhang, Daocheng Fu, Yu Yang, Hongbin Zhou, Nianchen Deng, Pinlong Cai, Zhongyuan Wang, Botian Shi, Kaipeng Zhang (possible past Tencent (China) affiliation), Licheng Wen
Abstract

Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-based approaches struggle with heterogeneous protocols and inaccessible commercial interfaces. In this work, we identify the Component Object Model (COM) as a unified executable abstraction, proposing COM-as-Action: a new paradigm that reframes professional software interaction as deterministic program ...

📄 Proprioceptive-visual correspondence enables self-other distinction in humanoid robots
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13222v1
👥 Authors: Yurun Chen, Tianyuan Gao, Yizhong Ge, Shikun Ban, Yizhou Wang (possible past Peking University affiliation), Hongkai Xiong, Wenjun Zeng, Wentao Zhu (possible past Nvidia (United States) affiliation)
Abstract

Distinguishing self from others is a prerequisite for social intelligence, yet humanoid robots that increasingly share workspaces with humans still lack this ability. Here we show that a humanoid robot can learn self-other distinction from proprioceptive-visual correspondence, without any identity labels or kinematic models. Once established, this distinction bootstraps a predictive self-model that maps joint configurations to three-dimensional body occupancy, capturing how the robot's body chan...

📄 Mental-R1: Aligning LLM Reasoning for Mental Health Assessment
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13176v1
👥 Authors: Xin Wang (possible past University Of Edinburgh affiliation), Boyan Gao, Yibo Yang, David A. Clifton (possible past University Of Oxford affiliation)
Abstract

Mental health problems such as anxiety, depression, and suicide remain urgent global challenges, where timely and accurate assessment is critical for effective intervention. Recently, large language models have been explored for mental health assessment. However, existing general-purpose post-training methods do not align with the cognitive processes of human assessment, which may lead to unreliable reasoning outcomes. To bridge this gap, we propose Cognitive Relative Policy Optimization (CRPO),...

📄 TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13148v1
👥 Authors: Dat Tien Nguyen, Thao Nguyen (possible past Google (United States) affiliation), Fadillah Adamsyah Maani, Huy M. Le, Muhammad Umer Sheikh, Numan Saeed, Muhammad Haris Khan, Salman Khan (possible past Inception Institute Of Artificial Intelligence affiliation)
Abstract

Climate and environmental decision-making increasingly requires reasoning across heterogeneous inputs, including gridded physical data, satellite imagery, geospatial context, and simulator outputs. Weather and climate foundation models can forecast well, but do not reason interactively in language, while large language models (LLMs) reason in language but cannot operate directly on high-dimensional Earth-system data. As a result, real scientific workflows in Earth-science remain underserved. We ...

📄 The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13079v1
👥 Authors: Jiaqi Luo, Jiarun Dai, Zhile Chen, Jia Xu, Weibing Wang, Yawen Duan, Brian Tse, Geng Hong, Xudong Pan, Yuan Zhang (possible past Google (United States) affiliation), Min Yang (possible past Baidu (China) affiliation)
Abstract

Nowadays, the autonomous execution of cyberattacks capable of causing substantial real-world harm is widely regarded as one of the critical red lines that frontier AI systems must not cross. Within this broader red-line scenario, autonomous penetration represents a core enabling capability and subtask: the ability of LLM-powered AI systems to independently conduct adversarial operations against a target server without human intervention, identify and exploit vulnerabilities, and obtain unauthori...

📄 TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13035v1
👥 Authors: Yu Meng, Xiangyang Luo, Letian Li, Wenyuan Jiang, Chen Gao, Xinlei Chen (possible past Tsinghua University affiliation), Yong Li (possible past Tsinghua University affiliation), Xiao-Ping Zhang
Abstract

Autoregressive video diffusion models provide a natural formulation for streaming and variable-length video generation by conditioning newly generated frames on previously generated content. However, extending these models to minute-level generation remains challenging: the limited KV-cache budget prevents the model from retaining the full history, while repeatedly conditioning on self-generated frames induces a context distribution shift that accumulates over time, leading to visual artifacts, ...

📄 CausalMoE: A Billion-Scale Multimodal Foundation Model for Granger Causal Discovery with Pattern-Routed Heterogeneous Experts
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.13024v1
👥 Authors: Bo Liu (possible past Meta (United States) affiliation), Di Dai, Jingwei Liu, Jiarui Jin, Xiaocheng Fang, Guangkun Nie, Hongyan Li, Shenda Hong (possible past Peking University affiliation)
Abstract

Granger Causal Discovery (GCD) is fundamental for analyzing temporal dependencies in complex systems. However, existing neural GCD methods predominantly rely on a "one-size-fits-all" paradigm, struggling to capture distribution shifts and dynamic regime changes inherent in real-world time series. This often leads to entangled representations and spurious causal graphs. In this paper, we propose CausalMoE, a billion-scale multimodal Granger causal foundation model that explicitly models patch-lev...

📄 An Embodied Simulation Platform, Benchmark, and Data-Efficient Augmentation Framework for Wet-Lab Robotics
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.12936v1
👥 Authors: Zhe Liu, Huanbo Jin, Zhaohui Du, Zhe Wang (possible past DeepMind (United Kingdom) affiliation), He Xu, Peijia Li, Jiaming Gu, Quan Lu, Qi Wang (possible past Tsinghua University affiliation), Bin Ji, Ting Xiao
Abstract

Wet-lab robots can improve the reproducibility, throughput, and safety of biomedical experiments, but scaling their learning requires customizable simulators for safe and reproducible task generation, open editable laboratory assets, and efficient pipelines that turn limited demonstrations into usable training data. We present Pipette, an embodied simulation platform, benchmark, and data-efficient augmentation framework for wet-lab robot learning. Pipette releases over 43 open-source and re-edit...

📄 Zero-source LLM Hallucination Detection with Human-like Criteria Probing
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.12900v1
👥 Authors: Jiahao Yang, Shuhai Zhang, Hailong Kang, Feng Liu, Qi Chen (possible past Baidu (China) affiliation), Mingkui Tan (possible past Baidu (China) affiliation)
Abstract

Large language models (LLMs) often hallucinate by generating factually incorrect or unfaithful content, posing significant risks to their safe use. Detecting such hallucinations is particularly challenging under the zero-source constraint, where no model internals or external references are available, and detection must rely solely on the textual query-answer pair. In this paper, we propose Human-like Criteria Probing for Hallucination Detection (HCPD), a paradigm that emulates the multi-faceted...

📄 Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.12886v1
👥 Authors: Tingyu Li, Le Zhou, Siyuan Li (possible past Tencent (China) affiliation), Yujun Wu, Xinglong Xu, Jingxuan Wei, Conghui He (possible past Tsinghua University affiliation), Cheng Tan
Abstract

Interleaved thinking, where a unified multimodal model alternates between textual reasoning and visual generation, has shown promise on spatial and physical tasks. However, in complex long-chain scenarios, we identify a fundamental failure mode: generated images diverge from the textual context while subsequent text ignores the visual evidence, causing the two modalities to alternate without genuinely informing each other. We term this Modal Isolation and attribute it to compounding information ...

📄 JSCGC: Joint Source-Channel-Generation Coding for Wireless Generative Communications
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.12858v1
👥 Authors: Tong Wu, Zhiyong Chen, Guo Lu, Li Song, Feng Yang (possible past Google (United States) affiliation), Meixia Tao, Wenjun Zhang (possible past Shanghai Jiao Tong University affiliation)
Abstract

Conventional communication systems, including both separation-based coding and learning-based joint source-channel coding (JSCC), are typically designed under Shannon's rate-distortion theory. However, relying on generic distortion metrics fails to capture complex human visual perception, often resulting in blurred or unrealistic reconstructions. In this paper, we propose Joint Source-Channel-Generation Coding (JSCGC), a generative communication paradigm that replaces the conventional decoder wi...

📄 M*: A Modular, Extensible, Serving System for Multimodal Models
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12688v1
👥 Authors: Atindra Jha, Naomi Sagan, Keisuke Kamahori, Irmak Sivgin, Rohan Sanda, Steven Gao, Mark Horowitz (possible past Stanford University affiliation), Luke Zettlemoyer (possible past University Of Washington affiliation), Olivia Hsu, Jure Leskovec (possible past Stanford University affiliation), Baris Kasikci, Stephanie Wang
Abstract

We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. Such architectures underpin a broad class of multimodal models, including unified multimodal models, omni models, speech-language models, vision-language-action policies, and world models. However, existing model serving frameworks were built on narrow assumptions about mode...

📄 From AGI to ASI
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12683v1
👥 Authors: Tim Genewein, Matija Franklin, Alexander Lerchner (possible past Google (United States) affiliation), Laurent Orseau (possible past DeepMind (United Kingdom) affiliation), Samuel Albanie, Adam Bales, Cole Wyeth, Stephanie Chan, Iason Gabriel (possible past DeepMind (United Kingdom) affiliation), Joel Z. Leibo (possible past Google (United States) affiliation), Allan Dafoe (possible past University Of Oxford affiliation), Marcus Hutter, Thore Graepel (possible past Google (United States) affiliation), Shane Legg (possible past Google (United States) affiliation)
Abstract

Over the last decade, building human-level artificial general intelligence has moved from far-fetched speculation to being a concrete next-decade target for many of the largest AI organisations. Achieving this goal would have profound and far-reaching impacts on human society, which raises many complex questions for the decade ahead. This report investigates how AI itself might continue to develop in a post-AGI world along the continuum of machine intelligence. The endpoint of this continuum, Un...

📄 FACTR 2: Learning External Force Sensing for Commodity Robot Arms Improves Policy Learning
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12406v1
👥 Authors: Steven Oh, Jason Jingzhou Liu, Tony Tao, Philip Han, Kenneth Shaw, Satoshi Funabashi, Ruslan Salakhutdinov (possible past University Of Toronto affiliation), Deepak Pathak (possible past University Of California, Berkeley affiliation)
Abstract

Contact-rich manipulation requires force sensitivity, but many robot arms lack dedicated force sensors due to their high cost. We present Neural External Torque Estimation (NEXT), a data-driven method that estimates external joint torques without needing any dedicated force sensors. NEXT trains in 1 minute from only 10 minutes of free-motion data, yet achieves estimates comparable to dedicated joint-torque sensors. NEXT enables force-feedback teleoperation on low-cost arms and improves policy le...

📄 DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12402v1
👥 Authors: Jadelynn Dao, Milan Ganai, Yasmina Abukhadra, Ajay Sridhar, Mozhgan Nasr Azadani, Katie Luo, Clark Barrett, Jiajun Wu (possible past Massachusetts Institute Of Technology affiliation), Chelsea Finn (possible past University Of California, Berkeley affiliation), Marco Pavone (possible past Stanford University affiliation)
Abstract

Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usage, and FLOPs while yielding uneven, often diminishing gains in downstream success, limiting where embodied agents can be deployed. We argue that choosing when and where to spend test-time compute is central to bringing frontier performance to the real world. ...

📄 Redesign Mixture-of-Experts Routers with Manifold Power Iteration
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12397v1
👥 Authors: Songhao Wu, Ang Lv, Ruobing Xie (possible past Tencent (China) affiliation), Yankai Lin (possible past Tsinghua University affiliation)
Abstract

The router is the cornerstone component of Mixture-of-Experts models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row is designed to encode the expert matrix into this representative vector, such that its dot-product with a token can better reflect token-expert affinity. However, there exist no design principles to enforce this condensation. In this paper, we propose to ...

📄 TAHOE: Text-to-SQL with Automated Hint Optimization from Experience
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12387v1
👥 Authors: Zhiyi Chen, Jie Song (possible past ETH Zurich affiliation), Peng Li (possible past Tsinghua University affiliation)
Abstract

Large Language Models (LLMs) have democratized database access through Text-to-SQL, but moving from prototypes to production remains difficult. Real deployments must handle strict SQL dialects, massive schemas, and evolving user preferences, while supervised fine-tuning is costly and rigid and agentic test-time scaling is expensive. We present Tahoe, a system that treats prompt optimization as a dynamic data management problem. Tahoe uses an error-driven hint learning pipeline across Development...

📄 CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12352v1
👥 Authors: Ria Doshi, Tian Gao, Annie Chen, Chelsea Finn (possible past University Of California, Berkeley affiliation), Jeannette Bohg (possible past Stanford University affiliation)
Abstract

Multi-robot collaboration allows robots to efficiently take on a wide range of tasks, from moving a couch through a doorway to assembling structures on a construction site. However, achieving such coordination in mobile multi-robot settings remains challenging: centralized methods conditioned on the combined observations of a team scale poorly with team size, and decentralized methods that train one policy per robot often require explicit alignment procedures or information sharing at inference ...

📄 DiffCold: A Diffusion-based Generative Model for Cold-Start Item Recommendation
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12245v1
👥 Authors: Kangning Zhang, Yingjie Qin, Weinan Zhang (possible past Shanghai Jiao Tong University affiliation), Yong Yu (possible past Shanghai Jiao Tong University affiliation), Jianghao Lin
Abstract

Cold-start item recommendation remains a persistent challenge in real-world systems due to the absence of interaction histories. While prior models attempt to bridge this gap using item content features, they universally suffer from the seesaw dilemma: enhancing performance for cold items inevitably degrades performance for warm items, and vice versa. We identify that this dilemma stems from a fundamental distributional disparity: warm item embeddings occupy a complex "behavio...

📄 Adaptive Weighted Averaging
🗓️ Published: 6/11/2026
🔗 http://arxiv.org/abs/2606.12763v1
👥 Authors: Aditya Bhaskara (possible past Google (United States) affiliation), Ashok Cutkosky, Ravi Kumar (possible past Google (United States) affiliation), Manish Purohit (possible past Google (United States) affiliation)
Abstract

We study the problem of selecting the largest among $n$ unknown values $x_1,\dots,x_n$ given only a single unbiased estimate $y_i$ for each $x_i$. We design strategies that are simultaneously admissible (not uniformly dominated by any other strategy) and also never worse than a given baseline such as uniform random selection. We provide an application to stochastic optimization, where we obtain online-to-batch conversion bounds with a desirable "no-compromise" guarantee: they are never worse tha...

📄 Normative Robustness as a Frontier for Non-Verifiable Reasoning in LLMs
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12731v1
👥 Authors: Elizaveta Tennant, Benjamin Henke, Anita Keshmirian, Murray Shanahan (possible past DeepMind (United Kingdom) affiliation), Verena Rieser, Kristian Lum (possible past Google (United States) affiliation), Sydney Levine, Julia Haas (possible past DeepMind (United Kingdom) affiliation)
Abstract

As LLMs increasingly serve in advisory and deliberative roles, users rely on them for non-verifiable reasoning in domains lacking objective ground truths. However, traditional evaluations of LLM reasoning focus almost exclusively on fact-based domains, such as mathematics and science, leaving uncertainty over whether and to what degree models can handle ambiguous, subjective, or value-laden problems over time. To address this concern, we propose moral reasoning as a paradigmatic subdomain of non...

📄 ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12633v1
👥 Authors: Jiangtao Kong, Peijun Zhao (possible past University Of Oxford affiliation), Chun-Fu Chen, Youngwook Do, Shaohan Hu, Tianyi Zhou (possible past University Of Washington affiliation), Huajie Shao
Abstract

Incremental Learning (IL) for Open-ended Image-to-Text Generation (OpenITG) enables models to continuously generate accurate, contextually relevant text for new images while preserving previously acquired knowledge. Unlike prior studies, this paper addresses a more practical scenario in which the predominant category of visual data shifts over time as environments evolve. In this context, we introduce a new notion of continual alignment, which incrementally adapts the alignment module within pre...

📄 Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12370v1
👥 Authors: Yucheng Li, Huiqiang Jiang, Yang Xu, Jianxin Yang, Yi Zhang (possible past Google (United States) affiliation), Yizhong Cao, Yuhao Shen, Fan Zhou, Rui Men (possible past Peking University affiliation), Jianwei Zhang, An Yang, Bowen Yu, Bo Zheng, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou
Abstract

Reinforcement learning (RL) has become a key component in modern large language models, yet the rollout stage remains the key bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural solution to accelerate rollouts through speculative decoding, many studies have observed that MTP acceptance rates degrade significantly during RL training, leading to limited speedup performance. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-t...

📄 Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12360v1
👥 Authors: Leon Bergen, Usha Bhalla, Sidharth Baskaran, Max Loeffler, Raphael Sarfati, Dhruvil Gala, Ryan Panwar, Santiago Aranguri, Thomas Fel, Atticus Geiger (possible past Stanford University affiliation), Matthew Kowal, Siddharth Boppana, Daniel Balsam, Owen Lewis, Jack Merullo, Thomas McGrath (possible past Google (United States) affiliation), Ekdeep Singh Lubana
Abstract

Language-model post-training is the main stage at which model behavior is shaped, yet it still largely involves optimization of scalar rewards that summarize diverse desiderata. This abstraction gives practitioners little visibility into what their data actually teaches models, allowing spurious correlations to be learned by a model and inducing undesirable behaviors such as over-stylization and sycophancy. To address this problem, we ask: can we inspect a preference dataset before optimization ...

📄 Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12344v1
👥 Authors: Mengyu Zheng, Kai Han, Boxun Li (possible past Tsinghua University affiliation), Haiyang Xu, Yuchuan Tian, Wei He (possible past Baidu (China) affiliation), Hang Zhou (possible past Baidu (China) affiliation), Jianyuan Guo, Hailin Hu, Lin Ma (possible past Tencent (China) affiliation), Chao Xu, Guohao Dai, Lixue Xia, Yunchao Wei (possible past National University Of Singapore affiliation), Yunhe Wang, Yu Wang (possible past Tsinghua University affiliation)
Abstract

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, worksp...

📄 Re-evaluating Confidence Remasking in Masked Diffusion Language Models
🗓️ Published: 6/10/2026
🔗 http://arxiv.org/abs/2606.12232v1
👥 Authors: Stipe Frkovic, Metod Jazbec, Dan Zhang (possible past Google (United States) affiliation), Christian A. Naesseth, Ilija Bogunovic, Eric Nalisnick (possible past Google (United States) affiliation)
Abstract

Masked diffusion language models (dLLMs) have recently emerged as a competitive alternative to autoregressive language models, with the promise of faster inference via parallel token generation. A notable limitation of the masked formulation, however, is that once a token has been unmasked it can no longer be revised, leaving dLLMs vulnerable to early sampling mistakes. To address this, a growing body of work has sought to extend masked dLLMs with self-correcting (remasking) capabilities. One ap...

*Notable papers are those with at least two authors from a "big" AI/ML lab.