📄 Notable* Recent AI/ML arXiv Papers

📄 DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
🗓️ Published: 2/27/2026
🔗 http://arxiv.org/abs/2602.24288v1
👥 Authors: Fan Shu, Yite Wang, Ruofan Wu, Boyi Liu, Zhewei Yao, Yuxiong He (possible past Microsoft (United States) affiliation), Feng Yan (possible past Meta (United States) affiliation)
Abstract

The fast-growing demand for using Large Language Models (LLMs) to tackle complex multi-step data science tasks creates an urgent need for accurate benchmarking. There are two major gaps in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science in...

📄 CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
🗓️ Published: 2/27/2026
🔗 http://arxiv.org/abs/2602.24286v1
👥 Authors: Weinan Dai, Hanlin Wu, Qiying Yu, Huan-Ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu (possible past Microsoft (United States) affiliation), Mingxuan Wang (possible past Tencent (China) affiliation), Xin Liu, Hao Zhou
Abstract

GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but both paradigms fail to fundame...

📄 A Mixed Diet Makes DINO An Omnivorous Vision Encoder
🗓️ Published: 2/27/2026
🔗 http://arxiv.org/abs/2602.24181v1
👥 Authors: Rishabh Kabra, Maks Ovsjanikov, Drew A. Hudson, Ye Xia (possible past Google (United States) affiliation), Skanda Koppula (possible past DeepMind (United Kingdom) affiliation), Andre Araujo, Joao Carreira, Niloy J. Mitra
Abstract

Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across different modalities. For instance, the feature embedding for an RGB image and its corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality...
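The misalignment described can be made concrete with a quick cosine-similarity check. A minimal sketch (generic NumPy, not the paper's code; the 768-dimensional random vectors are hypothetical stand-ins for RGB and depth embeddings):

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-ins: independent random vectors, which in high
# dimensions have near-zero cosine similarity -- the behavior the abstract
# reports for an RGB embedding vs. the embedding of its own depth map.
rng = np.random.default_rng(0)
rgb_emb, depth_emb = rng.normal(size=768), rng.normal(size=768)
print(f"cross-modal similarity: {cosine_similarity(rgb_emb, depth_emb):.3f}")
```

A well-aligned cross-modal encoder would instead push this similarity well above that of random pairs.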

📄 CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning
🗓️ Published: 2/27/2026
🔗 http://arxiv.org/abs/2602.24142v1
👥 Authors: Yuxuan Liu, Weikai Xu, Kun Huang, Changyu Chen, Jiankun Zhao, Pengzhi Gao, Wei Liu (possible past Tsinghua University affiliation), Jian Luan, Shuo Shang, Bo Du, Ji-Rong Wen, Rui Yan (possible past Peking University affiliation)
Abstract

Mobile Agents can autonomously execute user instructions, which requires hybrid-capabilities reasoning, including screen summary, subtask planning, action decision and action function. However, existing agents struggle to achieve both decoupled enhancement and balanced integration of these capabilities. To address these challenges, we propose Channel-of-Mobile-Experts (CoME), a novel agent architecture consisting of four distinct experts, each aligned with a specific reasoning stage. CoME activa...

📄 DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer
🗓️ Published: 2/27/2026
🔗 http://arxiv.org/abs/2602.24096v1
👥 Authors: Yuxuan Zhang, Katarína Tóthová, Zian Wang, Kangxue Yin, Haithem Turki, Riccardo De Lutio, Yen-Yu Chang, Or Litany (possible past Stanford University affiliation), Sanja Fidler (possible past University Of Toronto affiliation), Zan Gojcic
Abstract

Simulation is essential to the development and evaluation of autonomous robots such as self-driving vehicles. Neural reconstruction is emerging as a promising solution as it enables simulating a wide variety of scenarios from real-world data alone in an automated and scalable way. However, while methods such as NeRF and 3D Gaussian Splatting can produce visually compelling results, they often exhibit artifacts particularly when rendering novel views, and fail to realistically integrate inserted ...

📄 Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction
🗓️ Published: 2/27/2026
🔗 http://arxiv.org/abs/2602.24080v1
👥 Authors: Xiang Li, Jiabao Gao, Sipei Lin, Xuan Zhou, Chi Zhang (possible past Peking University affiliation), Bo Cheng, Jiale Han, Benyou Wang (possible past Tencent (China) affiliation)
Abstract

The pursuit of human-like conversational agents has long been guided by the Turing test. For modern speech-to-speech (S2S) systems, a critical yet unanswered question is whether they can converse like humans. To tackle this, we conduct the first Turing test for S2S systems, collecting 2,968 human judgments on dialogues between 9 state-of-the-art S2S systems and 28 human participants. Our results deliver a clear finding: no existing evaluated S2S system passes the test, revealing a significant ga...

📄 RUMAD: Reinforcement-Unifying Multi-Agent Debate
🗓️ Published: 2/27/2026
🔗 http://arxiv.org/abs/2602.23864v1
👥 Authors: Chao Wang (possible past Google (United States) affiliation), Han Lin, Huaze Tang, Huijing Lin, Wenbo Ding (possible past Tsinghua University affiliation)
Abstract

Multi-agent debate (MAD) systems leverage collective intelligence to enhance reasoning capabilities, yet existing approaches struggle to simultaneously optimize accuracy, consensus formation, and computational efficiency. Static topology methods lack adaptability to task complexity variations, while external LLM-based coordination risks introducing privileged knowledge that compromises debate neutrality. This work presents RUMAD (Reinforcement-Unifying Multi-Agent Debate), a novel framework that...

📄 SAGE-LLM: Towards Safe and Generalizable LLM Controller with Fuzzy-CBF Verification and Graph-Structured Knowledge Retrieval for UAV Decision
🗓️ Published: 2/27/2026
🔗 http://arxiv.org/abs/2602.23719v1
👥 Authors: Wenzhe Zhao, Yang Zhao (possible past Google (United States) affiliation), Ganchao Liu, Zhiyu Jiang, Dandan Ma, Zihao Li, Xuelong Li (possible past Tencent (China) affiliation)
Abstract

In UAV dynamic decision-making, complex and variable hazardous factors pose severe challenges to the generalization capability of algorithms. Despite offering semantic understanding and scene generalization, Large Language Models (LLMs) lack domain-specific UAV control knowledge and formal safety assurances, restricting their direct applicability. To bridge this gap, this paper proposes a training-free two-layer decision architecture based on LLMs, integrating high-level safety planning with low-level prec...

📄 ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference
🗓️ Published: 2/27/2026
🔗 http://arxiv.org/abs/2602.23681v1
👥 Authors: Siyuan Ma, Bo Gao, Xiaojun Jia, Simeng Qin, Tianlin Li, Ke Ma, Xiaoshuang Jia, Wenqi Ren (possible past Tencent (China) affiliation), Yang Liu (possible past Tsinghua University affiliation)
Abstract

The paradigm of large language model (LLM) reasoning is shifting from parameter scaling to test-time compute scaling, yet many existing approaches still rely on uniform brute-force sampling (for example, fixed best-of-N or self-consistency) that is costly, hard to attribute, and can trigger overthinking with diminishing returns. We propose ODAR-Expert, an adaptive routing framework that optimizes the accuracy-efficiency trade-off via principled resource allocation. ODAR uses a difficulty estimat...
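For contrast with the fixed best-of-N/self-consistency baseline the abstract criticizes, here is a generic sketch (not ODAR's algorithm; `sample_answer` is a hypothetical stand-in for one reasoning rollout) of fixed versus early-stopping majority voting:

```python
from collections import Counter

def self_consistency(sample_answer, n: int):
    """Fixed best-of-N: always draw n answers, return the majority vote."""
    votes = Counter(sample_answer() for _ in range(n))
    return votes.most_common(1)[0][0]

def adaptive_sampling(sample_answer, max_n: int, margin: int = 3):
    """Toy adaptive variant: stop as soon as one answer leads by `margin`
    votes, spending fewer samples on easy queries -- the kind of
    accuracy-efficiency trade-off adaptive routing aims to optimize."""
    votes = Counter()
    for _ in range(max_n):
        votes[sample_answer()] += 1
        ranked = votes.most_common(2)
        lead = ranked[0][1] - (ranked[1][1] if len(ranked) > 1 else 0)
        if lead >= margin:
            break
    return votes.most_common(1)[0][0]
```

On an easy query where every rollout agrees, the adaptive variant stops after `margin` samples instead of always paying for `max_n`.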

📄 ProtoDCS: Towards Robust and Efficient Open-Set Test-Time Adaptation for Vision-Language Models
🗓️ Published: 2/27/2026
🔗 http://arxiv.org/abs/2602.23653v1
👥 Authors: Wei Luo (possible past Baidu (China) affiliation), Yangfan Ou, Jin Deng, Zeshuai Deng (possible past Baidu (China) affiliation), Xiquan Yan, Zhiquan Wen, Mingkui Tan (possible past Baidu (China) affiliation)
Abstract

Large-scale Vision-Language Models (VLMs) exhibit strong zero-shot recognition, yet their real-world deployment is challenged by distribution shifts. While Test-Time Adaptation (TTA) can mitigate this, existing VLM-based TTA methods operate under a closed-set assumption, failing in open-set scenarios where test streams contain both covariate-shifted in-distribution (csID) and out-of-distribution (csOOD) data. This leads to a critical difficulty: the model must discriminate unknown csOOD samples ...

📄 KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning
🗓️ Published: 2/27/2026
🔗 http://arxiv.org/abs/2602.23592v1
👥 Authors: Zebin Yang, Tong Xie, Baotong Lu, Shaoshan Liu, Bo Yu (possible past Baidu (China) affiliation), Meng Li (possible past Meta (United States) affiliation)
Abstract

Memory-augmented Large Language Models (LLMs) have demonstrated remarkable capability for complex and long-horizon embodied planning. By keeping track of past experiences and environmental states, memory enables LLMs to maintain a global view, thereby avoiding repetitive exploration. However, existing approaches often store the memory as raw text, leading to excessively long prompts and high prefill latency. While it is possible to store and reuse the KV caches, the efficiency benefits are great...
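The raw-text-versus-KV trade-off can be illustrated with a toy prefix cache (a sketch of the general idea only; KEEP's actual system manages real transformer KV tensors, not strings):

```python
class PrefixKVCache:
    """Toy cache: store precomputed prefill state keyed by a shared prompt
    prefix so only the new suffix is re-encoded at each planning step.
    `encode` mocks a model's prefill pass over a text segment."""

    def __init__(self, encode):
        self.encode = encode
        self.store = {}  # prefix text -> cached "KV" state

    def prefill(self, prefix: str, suffix: str):
        if prefix not in self.store:          # cache miss: pay full cost once
            self.store[prefix] = self.encode(prefix)
        return self.store[prefix] + self.encode(suffix)  # suffix-only cost
```

With a raw-text memory, every step would re-run `encode` over the whole growing prompt; here the prefix cost is paid once, which is the latency saving KV reuse targets.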

📄 Humans and LLMs Diverge on Probabilistic Inferences
🗓️ Published: 2/26/2026
🔗 http://arxiv.org/abs/2602.23546v1
👥 Authors: Gaurav Kamath, Sreenath Madathil, Sebastian Schuster (possible past Stanford University affiliation), Marie-Catherine de Marneffe (possible past Stanford University affiliation), Siva Reddy (possible past University Of Edinburgh affiliation)
Abstract

Human reasoning often involves working over limited information to arrive at probabilistic conclusions. In its simplest form, this involves making an inference that is not strictly entailed by a premise, but rather only likely given the premise. While reasoning LLMs have demonstrated strong performance on logical and mathematical tasks, their behavior on such open-ended, non-deterministic inferences remains largely unexplored. We introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inf...

📄 Understanding Usage and Engagement in AI-Powered Scientific Research Tools: The Asta Interaction Dataset
🗓️ Published: 2/26/2026
🔗 http://arxiv.org/abs/2602.23335v1
👥 Authors: Dany Haddad, Dan Bareket, Joseph Chee Chang, Jay DeYoung, Jena D. Hwang, Uri Katz, Mark Polak, Sangho Suh, Harshit Surana, Aryeh Tiktinsky, Shriya Atmakuri, Jonathan Bragg, Mike D'Arcy, Sergey Feldman, Amal Hassan-Ali, Rubén Lozano, Bodhisattwa Prasad Majumder (possible past Google (United States) affiliation), Charles Mcgrady, Amanpreet Singh (possible past Meta (United States) affiliation), Brooke Vlahos, Yoav Goldberg (possible past Google (United States) affiliation), Doug Downey (possible past Allen Institute For Artificial Intelligence affiliation)
Abstract

AI-powered scientific research tools are rapidly being integrated into research workflows, yet the field lacks a clear lens into how researchers use these systems in real-world settings. We present and analyze the Asta Interaction Dataset, a large-scale resource comprising over 200,000 user queries and interaction logs from two deployed tools (a literature discovery interface and a scientific question-answering interface) within an LLM-powered retrieval-augmented generation platform. Using this ...

📄 AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
🗓️ Published: 2/26/2026
🔗 http://arxiv.org/abs/2602.23258v1
👥 Authors: Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding, Miao Zhang (possible past Stanford University affiliation), Min Zhang (possible past Tsinghua University affiliation)
Abstract

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants. Current solutions often resort to rigid structural engineering or expensive fine-tuning, limiting their deployability and adaptability. We propose AgentDropoutV2, a test-time rectify-or-reject pruning framework designed to dynamically optimize MAS information flow without retraining. Our approach acts as an active firewall, intercepting ...

📄 The Trinity of Consistency as a Defining Principle for General World Models
🗓️ Published: 2/26/2026
🔗 http://arxiv.org/abs/2602.23152v1
👥 Authors: Jingxuan Wei, Siyuan Li (possible past Tencent (China) affiliation), Yuhang Xu, Zheng Sun, Junjie Jiang, Hexuan Jin, Caijun Jia, Honghao He, Xinglong Xu, Xi Bai, Chang Yu, Yumou Liu, Junnan Zhu, Xuanhe Zhou, Jintao Chen, Xiaobin Hu (possible past Tencent (China) affiliation), Shancheng Pang, Bihui Yu, Ran He, Zhen Lei (possible past Beijing Academy Of Artificial Intelligence affiliation), Stan Z. Li, Conghui He (possible past Tsinghua University affiliation), Shuicheng Yan (possible past National University Of Singapore affiliation), Cheng Tan
Abstract

The construction of World Models capable of learning, simulating, and reasoning about objective physical laws constitutes a foundational challenge in the pursuit of Artificial General Intelligence. Recent advancements represented by video generation models like Sora have demonstrated the potential of data-driven scaling laws to approximate physical dynamics, while the emerging Unified Multimodal Model (UMM) offers a promising architectural paradigm for integrating perception, language, and reaso...

📄 MoDora: Tree-Based Semi-Structured Document Analysis System
🗓️ Published: 2/26/2026
🔗 http://arxiv.org/abs/2602.23061v2
👥 Authors: Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He, Shihan Yu, Qianqian Xu, Bin Wang, Guoliang Li (possible past Tsinghua University affiliation), Conghui He (possible past Tsinghua University affiliation), Fan Wu
Abstract

Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts. These documents are widely observed across domains and account for a large portion of real-world data. However, existing methods struggle to support natural language question answering over these documents due to three main technical challenges: (1) The elements extracted by techniques like OCR are often fragmented and stripped of ...

📄 Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
🗓️ Published: 2/26/2026
🔗 http://arxiv.org/abs/2602.22983v2
👥 Authors: Xun Huang (possible past Nvidia (United States) affiliation), Simeng Qin, Xiaoshuang Jia, Ranjie Duan, Huanqian Yan, Zhitao Zeng, Fei Yang (possible past Meta (United States) affiliation), Yang Liu (possible past Tsinghua University affiliation), Xiaojun Jia
Abstract

As Large Language Models (LLMs) see increasingly wide use, their security risks have drawn growing attention. Existing research reveals that LLMs are highly susceptible to jailbreak attacks, with effectiveness varying across language contexts. This paper investigates the role of classical Chinese in jailbreak attacks. Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs. Based on this observation, this...

📄 FactGuard: Agentic Video Misinformation Detection via Reinforcement Learning
🗓️ Published: 2/26/2026
🔗 http://arxiv.org/abs/2602.22963v1
👥 Authors: Zehao Li, Hongwei Yu, Hao Jiang, Qiang Sheng, Yilong Xu, Baolong Bi, Yang Li (possible past Google (United States) affiliation), Zhenlong Yuan (possible past Tsinghua University affiliation), Yujun Cai, Zhaoqi Wang
Abstract

Multimodal large language models (MLLMs) have substantially advanced video misinformation detection through unified multimodal reasoning, but they often rely on fixed-depth inference and place excessive trust in internally generated assumptions, particularly in scenarios where critical evidence is sparse, fragmented, or requires external verification. To address these limitations, we propose FactGuard, an agentic framework for video misinformation detection that formulates verification as an ite...

📄 Mode Seeking meets Mean Seeking for Fast Long Video Generation
🗓️ Published: 2/27/2026
🔗 http://arxiv.org/abs/2602.24289v1
👥 Authors: Shengqu Cai, Weili Nie, Chao Liu, Julius Berner, Lvmin Zhang, Nanye Ma, Hansheng Chen, Maneesh Agrawala (possible past Stanford University affiliation), Leonidas Guibas (possible past Stanford University affiliation), Gordon Wetzstein (possible past Stanford University affiliation), Arash Vahdat (possible past Nvidia (United States) affiliation)
Abstract

Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence based on a unified representation via a Decoupled Diffusion Transformer. Our approach utilizes a global Flow Matching head trained via supervised learning on long vi...

📄 MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games
🗓️ Published: 2/27/2026
🔗 http://arxiv.org/abs/2602.24188v1
👥 Authors: Jacob Eisenstein (possible past Meta (United States) affiliation), Fantine Huot, Adam Fisch (possible past University Of Washington affiliation), Jonathan Berant (possible past Google (United States) affiliation), Mirella Lapata (possible past University Of Edinburgh affiliation)
Abstract

We present a scalable methodology for evaluating language models in multi-turn interactions, using a suite of collaborative games that require effective communication about private information. This enables an interactive scaling analysis, in which a fixed token budget is divided over a variable number of turns. We find that in many cases, language models are unable to use interactive collaboration to improve over the non-interactive baseline scenario in which one agent attempts to summarize its...
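The scaling protocol described (a fixed token budget divided over a variable number of turns) can be sketched as follows; `split_budget` is an illustrative helper, not the paper's API:

```python
def split_budget(total_tokens: int, num_turns: int) -> list[int]:
    """Divide a fixed token budget evenly over a variable number of
    turns; earlier turns absorb any remainder."""
    base, rem = divmod(total_tokens, num_turns)
    return [base + (1 if i < rem else 0) for i in range(num_turns)]

print(split_budget(1000, 3))  # -> [334, 333, 333]
```

Under this split, more turns means a smaller per-turn budget, which is exactly the interactivity-versus-verbosity trade-off the scaling analysis probes.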

📄 A Boundary Integral-based Neural Operator for Mesh Deformation
🗓️ Published: 2/27/2026
🔗 http://arxiv.org/abs/2602.23703v1
👥 Authors: Zhengyu Wu, Jun Liu (possible past Tencent (China) affiliation), Wei Wang (possible past University Of Oxford affiliation)
Abstract

This paper presents an efficient mesh deformation method based on boundary integration and neural operators, formulating the problem as a linear elasticity boundary value problem (BVP). To overcome the high computational cost of traditional finite element methods and the limitations of existing neural operators in handling Dirichlet boundary conditions for vector fields, we introduce a direct boundary integral representation using a Dirichlet-type Green's tensor. This formulation expresses the i...
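For orientation, a generic textbook form of such a representation (not necessarily the paper's exact kernel; $K_{ij}$ and $g$ are illustrative symbols) writes the interior displacement as a surface integral against the prescribed boundary displacement:

$$u_i(x) = \int_{\partial\Omega} K_{ij}(x, y)\, g_j(y)\, \mathrm{d}S(y), \qquad u\big|_{\partial\Omega} = g,$$

where $K_{ij}$ is the Poisson-type kernel derived from the Dirichlet Green's tensor, so evaluating the deformation needs only boundary data rather than a volumetric finite element solve.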

📄 EvoX: Meta-Evolution for Automated Discovery
🗓️ Published: 2/26/2026
🔗 http://arxiv.org/abs/2602.23413v1
👥 Authors: Shu Liu (possible past Tencent (China) affiliation), Shubham Agarwal, Monishwaran Maheswaran, Mert Cemri, Zhifei Li, Qiuyang Mang, Ashwin Naren, Ethan Boneh, Audrey Cheng, Melissa Z. Pan, Alexander Du, Kurt Keutzer (possible past University Of California, Berkeley affiliation), Alexandros G. Dimakis, Koushik Sen (possible past University Of California, Berkeley affiliation), Matei Zaharia (possible past University Of California, Berkeley affiliation), Ion Stoica (possible past University Of California, Berkeley affiliation)
Abstract

Recent work such as AlphaEvolve has shown that combining LLM-driven optimization with evolutionary search can effectively improve programs, prompts, and algorithms across domains. In this paradigm, previously evaluated solutions are reused to guide the model toward new candidate solutions. Crucially, the effectiveness of this evolution process depends on the search strategy: how prior solutions are selected and varied to generate new candidates. However, most existing methods rely on fixed searc...
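The search-strategy dependence described above can be seen in a skeletal evolutionary loop (a generic sketch, not EvoX; `mutate` stands in for an LLM-driven edit and `score` for candidate evaluation):

```python
import random

def evolve(score, mutate, seed_pool, generations: int, pool_size: int):
    """Skeletal evolutionary search: select strong prior solutions, vary
    them, and keep the best. The selection and variation choices here
    (tournament size, truncation) are exactly the fixed "search strategy"
    knobs that the abstract argues should themselves be adapted."""
    pool = list(seed_pool)
    for _ in range(generations):
        # Tournament selection over previously evaluated solutions.
        parent = max(random.sample(pool, min(3, len(pool))), key=score)
        pool.append(mutate(parent))
        # Truncation: keep only the strongest candidates for the next round.
        pool = sorted(pool, key=score, reverse=True)[:pool_size]
    return max(pool, key=score)
```

Swapping the tournament or truncation rules changes which regions of the solution space get explored, which is why a fixed strategy can underperform across domains.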

📄 ParamMem: Augmenting Language Agents with Parametric Reflective Memory
🗓️ Published: 2/26/2026
🔗 http://arxiv.org/abs/2602.23320v2
👥 Authors: Tianjun Yao, Yongqiang Chen, Yujia Zheng, Pan Li (possible past Baidu (China) affiliation), Zhiqiang Shen, Kun Zhang (possible past Google (United States) affiliation)
Abstract

Self-reflection enables language agents to iteratively refine solutions, yet often produces repetitive outputs that limit reasoning performance. Recent studies have attempted to address this limitation through various approaches, among which increasing reflective diversity has shown promise. Our empirical analysis reveals a strong positive correlation between reflective diversity and task success, further motivating the need for diverse reflection signals. We introduce ParamMem, a parametric mem...

*Notable papers are those with at least two authors from a "big" AI/ML lab.