📄 Notable* Recent AI/ML arXiv Papers


📄 Planning-aligned Token Compression for Long-Context Autonomous Driving
🗓️ Published: 6/5/2026
🔗 http://arxiv.org/abs/2606.07464v1
👥 Authors: Zhixuan Liang, Yuxiao Chen, Yurong You, Peter Karkus, Wenhao Ding, Boyi Li, Alexander Popov, Yan Wang (possible past Tencent (China) affiliation), Maximilian Igl (possible past University of Oxford affiliation), Yiming Li (possible past Tsinghua University affiliation), Danfei Xu, Nikolai Smolyanskiy (possible past Nvidia (United States) affiliation), Boris Ivanovic, Ping Luo (possible past Shanghai Artificial Intelligence Laboratory affiliation), Marco Pavone (possible past Stanford University affiliation)
Abstract

Monolithic vision-action models represent an emerging paradigm in autonomous driving. However, this architecture produces token sequences that quickly exceed real-time computational budgets when encoding extended temporal context for complex interactions. While approaches like linear transformers and external memory try to make the context lightweight, token compression is most compatible with the architecture as it requires no backbone modifications. Yet existing compression adopts rule-based h...

📄 Watch, Remember, Reason: Human-View Video Understanding with MLLMs
🗓️ Published: 6/5/2026
🔗 http://arxiv.org/abs/2606.07433v1
👥 Authors: Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao, Weisong Liu, Yanwei Li, Jason Li (possible past Nvidia (United States) affiliation), Lingdong Kong, Haochen Wang, Qianyu Zhou, Jiangning Zhang (possible past Tencent (China) affiliation), Guangliang Cheng, Yunhai Tong, Lu Qi (possible past Tencent (China) affiliation), Minghsuan Yang
Abstract

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering...

📄 DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning
🗓️ Published: 6/5/2026
🔗 http://arxiv.org/abs/2606.07299v1
👥 Authors: Lingyong Yan, Can Xu (possible past Google (United States) affiliation), Yukun Zhao, Wenxuan Li, Qingyang Chen, Jiulong Wu, Wenli Song, Xiangnan Li, Weixian Shi, Yiqun Chen, Xuchen Ma, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang (possible past Baidu (China) affiliation), Jianmin Wu, Dawei Yin (possible past Baidu (China) affiliation)
Abstract

Deep Research (DR) has emerged as a new agentic paradigm to tackle complex, open-ended research tasks, demanding systems that can iteratively frame problems, acquire evidence, verify sources, and synthesize long-form reports. In practice, however, current DR systems are constrained by four interrelated limitations: long-horizon planning over an underspecified scope, the bottleneck of decomposing and scheduling such tasks within a single agent, hallucination risk in long-form synthesis, and limit...

📄 DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling
🗓️ Published: 6/5/2026
🔗 http://arxiv.org/abs/2606.07108v1
👥 Authors: Tengyao Tu, Yulin Li (possible past Baidu (China) affiliation), Hui-Ling Zhen, Libo Qin, Zhoujun Wei, Jinghua Piao, Zhuotao Tian, Yong Li (possible past Tsinghua University affiliation), Min Zhang (possible past Tsinghua University affiliation)
Abstract

Recent advances in Large Reasoning Models (LRMs) demonstrate remarkable performance improvements by iteratively reflecting, exploring, and executing complex tasks, yet suffer from inefficiencies due to redundant reasoning, known as "overthinking". Existing methods to mitigate this issue either rely on static difficulty estimates or require task-specific training, and thus fail to adapt to the dynamic complexity during reasoning. In this work, we empirically show that the problem difficulty evolv...

📄 dots.tts Technical Report
🗓️ Published: 6/5/2026
🔗 http://arxiv.org/abs/2606.07080v1
👥 Authors: Shi Lian, Changtao Li, Bohan Li (possible past Google (United States) affiliation), Hankun Wang, Da Zheng, Junfeng Tian, Yufeng Ma, Colin Zhang, Kai Yu (possible past Baidu (China) affiliation)
Abstract

We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are threefold. First, we train an AudioVAE with multiple objectives to build a semantically structured and prediction-friendly continuous speech space. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift dur...

📄 SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating
🗓️ Published: 6/5/2026
🔗 http://arxiv.org/abs/2606.07074v1
👥 Authors: Zequn Xie, Junjie Wang, Dan Yang, Jie Feng (possible past Tsinghua University affiliation), Yue Shen, Jian Wang (possible past Baidu (China) affiliation), Jinjie Gu
Abstract

Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-force strategies characterized by blind tool dependency and performative reasoning, generating long, redundant trajectories that are far from necessary for resolving these tasks, leading to wasteful tool calls and excessive token consumption. To overcome this efficiency tra...

📄 MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06696v1
👥 Authors: Ryan D'cunha, Alejandro Lozano, Xiaoxiao Sun, Daniel Vela Jarquin, Min Woo Sun, Josiah Aklilu, James Burgess, Yuhui Zhang, Ryan Nayebi, Paola Avila, Robayo, Jin Ye, Ming Hu, Zhongying Deng, Junjun He, Xin Chen (possible past Tencent (China) affiliation), Yue Yao, Robert Tibshirani (possible past Stanford University affiliation), Jeffrey J. Nirschl, Serena Yeung-Levy
Abstract

Vision and language models (VLMs) hold immense promise to transform biomedical imaging workflows, from detecting lesions in chest X-rays to profiling cellular features in microscopy. Realizing this potential, however, requires robust and fine-grained visual perception. Models need to correctly interpret subtle features in images, and they must do so across diverse biomedical modalities, scales, and contexts. Nevertheless, current benchmarks remain limited. To address these gaps, we introduce the...

📄 What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos?
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06627v1
👥 Authors: Richard Li, Aditya Prakash, Andrew Wen, Saurabh Gupta (possible past University of California, Berkeley affiliation), Yilun Du (possible past Massachusetts Institute of Technology affiliation), Pulkit Agrawal (possible past University of California, Berkeley affiliation)
Abstract

Human video datasets used for cotraining robot manipulation policies largely consist of curated demonstrations where motions are orchestrated to resemble robot behavior and 3D hand poses are captured with specialized hardware. A more plentiful source of data is everyday Internet video, but it is an open question what factors enable transfer from such videos to robots. We investigate this using a new dataset of 532 human videos with 28 hours of high-quality triangulated hand labels and natural mo...

📄 Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06481v1
👥 Authors: Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Tianjun Yao, Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib, Hao Li (possible past Tsinghua University affiliation), Salman Khan (possible past Inception Institute of Artificial Intelligence affiliation), Zhiqiang Shen
Abstract

As AI writing assistants become increasingly integrated into real-world drafting and revision workflows, many documents are no longer purely human-written or AI-generated, but instead result from progressive human-AI co-editing. However, existing AI-text detection benchmarks largely focus on final outputs and provide limited understanding of how AI authorship signals emerge, accumulate, or disappear throughout the revision process. We introduce OpAI-Bench, an operation-guided benchmark for study...

📄 MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06473v1
👥 Authors: Shangheng Du, Xiangchao Yan, Jinxin Shi, Zongsheng Cao, Shiyang Feng, Zichen Liang, Boyuan Sun, Tianshuo Peng, Yifan Zhou, Xin Li (possible past Google (United States) affiliation), Jie Zhou (possible past Tsinghua University affiliation), Liang He, Bo Zhang (possible past Tencent (China) affiliation), Lei Bai
Abstract

Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self-evolution becomes a key capability. However, existing MLE agents suffer from inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. We present MLEvolve, an LLM-based self-evolving multi-agent framework for end-to-end machine learning algorithm di...

📄 Human Adults and LLMs as Scientists: Who Benefits from Active Exploration?
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06464v1
👥 Authors: Mandana Samiei, Eunice Yiu, Anthony Gx-Chen, Dongyan Lin, Jocelyn Shen, Blake A. Richards (possible past University of Toronto affiliation), Alison Gopnik, Doina Precup (possible past DeepMind (United Kingdom) affiliation)
Abstract

A long-standing finding in the causal learning literature is that adults struggle to identify conjunctive causal rules, where an effect requires the simultaneous presence of multiple causes, while performing better in disjunctive settings. However, most demonstrations of this "conjunctive handicap" rely on passive observation paradigms with limited evidence, where learners have no control over evidence generation. This paper asks whether this bias persists when adults are granted agency throug...

📄 Unsupervised Skill Discovery for Agentic Data Analysis
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06416v1
👥 Authors: Zhisong Qiu, Kangqi Song, Shengwei Tang, Shuofei Qiao, Lei Liang, Huajun Chen (possible past Alibaba Group (China) affiliation), Shumin Deng (possible past Alibaba Group (China) affiliation)
Abstract

Inference-time skill augmentation provides a lightweight way to improve data-analytic agents by injecting reusable procedural knowledge without updating model parameters. However, discovering effective skills for data analysis remains challenging, as reliable supervision is expensive and success criteria vary across analytical formats. This raises the key question of how to discover reusable data-analysis skills from unlabeled exploration alone. We propose DataCOPE, an unsupervised verifier-guid...

📄 Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06388v1
👥 Authors: Jiaju Chen, Yuxuan Lu, Jiayi Su, Chaoran Chen, Songlin Xiao, Zheng Zhang, Yun Wang, Yunyao Li, Jian Zhao, Tongshuang Wu (possible past University of Washington affiliation), Toby Jia-Jun Li, Dakuo Wang (possible past Tencent (China) affiliation), Bingsheng Yao
Abstract

Recent advances in LLM agents have enabled complex cognitive capabilities, such as multi-step reasoning, planning, and tool use, that increasingly position these agents as human collaborators. Effective collaboration, however, requires collaborators to continuously maintain and align mental models of their own reasoning, partners' intentions, and shared goals during the collaborative process. Today's agents rarely develop such capabilities since they are primarily optimized for task completion, a...

📄 OneReason Technical Report
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06260v1
👥 Authors: Onerec Team, Biao Yang, Boyang Ding, Chenglong Chu, Dunju Zang, Fei Pan, Han Li, Hao Jiang, Honghui Bao, Huanjie Wang, Jian Liang, Jiangxia Cao, Jiao Ou, Jiaxin Deng, Jinghao Zhang, Kun Gai, Lu Ren, Peiru Du, Pengfei Zheng, Rongzhou Zhang, Ruiming Tang (possible past Huawei Technologies (China) affiliation), Shiyao Wang, Siyang Mao, Siyuan Lou, Teng Shi, Wei Yuan, Wenlong Xu, Xingchen Liu, Xingmei Wang, Xinqi Jin, Yan Sun, Yan Wang (possible past Tencent (China) affiliation), Yifei Hu, Yingzhi He, Yufei Ye (possible past Carnegie Mellon University affiliation), Yuhao Wang, Yunhao Zhou, Yuqin Dai, Zhao Liu (possible past Tsinghua University affiliation), Zhipeng Wei, Zhixin Ling, Ziming Li, Zixing Zhang, Ziyuan Liu, An Zhang, Changxin Lao, Chaoyi Ma, Chengru Song, Defu Lian, Fan Yang (possible past Tencent (China) affiliation), Guowang Zhang, Hao Peng (possible past Tsinghua University affiliation), Jiayao Shen, Jie Chen (possible past Tencent (China) affiliation), Jun Xu (possible past Google (United States) affiliation), Junmin Chen, Kun Zhang (possible past Google (United States) affiliation), Kuo Cai, Mingxing Wen, Minmao Wang, Minxuan Lv, Qi Zhang (possible past Tencent (China) affiliation), Qiang Luo, Sheng Yu, Shijie Li, Shijie Yi, Shuang Yang, Shugui Liu, Shuni Chen, Tinghai Zhang, Tingting Gao, Xiang Wang (possible past Tencent (China) affiliation), Xiangyu Wu, Xiangyu Zhao, Xiao Lv, Xiaoyou Zhou, Xuming Wang, Yong Du, Zejian Zhang, Zhaojie Liu, Zhiyang Zhang, Zhuang Zhuang, Ziqi Wang, Ziyi Zhao
Abstract

Generative recommendation models in the OneRec family have been widely deployed in many real-world services, such as short-video, live-streaming, advertising, and e-commerce. However, these generative models can only benefit from the scaling advantage, while their reasoning ability is hard to activate, since we cannot construct meaningful Chain-of-Thought (CoT) sequences consisting of itemic tokens only. Inspired by the success of the reasoning-style "think before answer" paradigm in the LLM f...

📄 CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06219v1
👥 Authors: Yining Xing, Zehong Ke, Zhiyuan Liu (possible past Tsinghua University affiliation), Yanbo Jiang, Wenhao Yu, Jianqiang Wang (possible past Tsinghua University affiliation)
Abstract

End-to-end autonomous driving models often struggle to balance multi-modal maneuver generation with real-time inference constraints. While diffusion models successfully capture diverse driving behaviors, their iterative denoising process incurs unacceptable latency for safety-critical deployment. To address this, we propose CLEAR (Cognition and Latent Evaluation for Adaptive Routing), a framework that combines ultra-fast generative planning with deep semantic reasoning. CLEAR employs Drive-JEPA ...

📄 TAM: Torque Adaptation Module for Robust Motion Transfer in Manipulation
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06218v1
👥 Authors: Dongwon Son, Florian Shkurti, Jason Lee (possible past Stanford University affiliation), Naman Shah, Beomjoon Kim, Dieter Fox (possible past University of Washington affiliation)
Abstract

A policy tuned for one robot often behaves differently on another, whether due to the sim-to-real gap, unknown payloads, or the differing dynamics of two instances of the same robot. In contact-rich, dynamic manipulation, even small motion discrepancies can result in failure to track reference motion, since they disrupt the timing and modes of contact. Common remedies, such as domain randomization or system identification, either produce overly conservative task policies or require data that mus...

📄 DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06217v1
👥 Authors: Tan Zhang, Quanyou Li, Lu Zhang (possible past Tencent (China) affiliation), Jun Liu (possible past Tencent (China) affiliation), Xiaofeng Zhu, Ping Hu (possible past IBM (United States) affiliation)
Abstract

When a disaster unfolds, responders must answer not only what is happening, but also why it is happening, what will happen next, and what to do now, often from noisy low-altitude UAV views and under tight on-site compute constraints. However, most existing multimodal benchmarks emphasize perception (e.g., recognition/description), cover limited disaster types, and provide insufficient support for the multi-stage reasoning required in practical emergency response. We introduce DisasterBench, a mu...

📄 Closed-Form Spectral Regularization for Multi-Task Model Merging
🗓️ Published: 6/5/2026
🔗 http://arxiv.org/abs/2606.07289v1
👥 Authors: Yongxian Wei, Runxi Cheng, Xingxuan Zhang, Li Shen (possible past Tencent (China) affiliation), Chun Yuan, Peng Cui (possible past Tsinghua University affiliation), Dacheng Tao
Abstract

Model merging combines several independently fine-tuned experts into a single multi-task model without any training data, reducing the storage, serving, and decentralized-development costs of large foundation models. State-of-the-art merging methods formulate merging as a layer-wise quadratic interference minimization problem. Although this problem admits an exact closed-form pseudoinverse solution, that solution underperforms hundreds of iterations of gradient descent in practice. The iterative...

📄 TALAN: Task-Aligned Latent Adaptation Networks for Targeted Post-Training of Large Language Models
🗓️ Published: 6/5/2026
🔗 http://arxiv.org/abs/2606.06902v1
👥 Authors: Chengkai Zhang (possible past Massachusetts Institute of Technology affiliation), Ziteng Liu, Junpu Wang, Zeyi Tao, Yang Wang (possible past Baidu (China) affiliation), Sagar Chordia, Qin Huang
Abstract

Targeted post-training aims to improve reasoning, math, and code without degrading strengths. Low-rank adapters are efficient but task-global; activation interventions are input-aware but often require separate probes, vectors, or inference-time steering. We introduce TALAN (Task-Aligned Latent Adaptation Networks), a sequence-conditioned latent side path inserted into a transformer's residual stream and co-trained with a low-rank adapter in one SFT loop. TALAN compresses the active sequence int...

📄 Skip a Layer or Loop It? Learning Program-of-Layers in LLMs
🗓️ Published: 6/4/2026
🔗 http://arxiv.org/abs/2606.06574v1
👥 Authors: Ziyue Li, Yang Li (possible past Google (United States) affiliation), Tianyi Zhou (possible past University of Washington affiliation)
Abstract

Large language models (LLMs) perform inference through a fixed-depth, fixed-order, non-recurrent execution of all layers. We reveal the widespread existence of training-free, flexible, dynamic program-of-layers (PoLar), where pretrained layers can be packed as modules and then skipped or looped to form a customized program for each input. For most inputs, substantially shorter program executions can achieve the same or better accuracy, while incorrect predictions of the original LLM can be corrected...

*Notable papers are those with at least two authors from a "big" AI/ML lab.