πŸ“„ Notable* Recent AI/ML arXiv Papers


πŸ“„ A New Lower Bound for the Random Offerer Mechanism in Bilateral Trade using AI-Guided Evolutionary Search
πŸ—“οΈ Published: 3/9/2026
πŸ”— http://arxiv.org/abs/2603.08679v1
πŸ‘₯ Authors: Yang Cai, Vineet Gupta (possible past Google (United States) affiliation), Zun Li, Aranyak Mehta (possible past Google (United States) affiliation)
Abstract

The celebrated Myerson–Satterthwaite theorem shows that in bilateral trade, no mechanism can be simultaneously fully efficient, Bayesian incentive compatible (BIC), and budget balanced (BB). This naturally raises the question of how closely the gains from trade (GFT) achievable by a BIC and BB mechanism can approximate the first-best (fully efficient) benchmark. The optimal BIC and BB mechanism is typically complex and highly distribution-dependent, making it difficult to characterize directly...
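
For context, a minimal statement of the objects involved, in standard bilateral-trade notation rather than anything taken from the paper itself:

```latex
% Standard notation, not from the paper: buyer value b ~ F, seller value
% s ~ G, drawn independently; trade moves the item from seller to buyer.
\[
  \mathrm{FB} \;=\; \mathbb{E}\!\left[(b - s)^{+}\right]
  \qquad \text{(first-best gains from trade)}
\]
% The random offerer mechanism, as usually defined: a uniformly random
% party posts a take-it-or-leave-it price, and trade occurs iff the other
% party accepts. Accepting iff the price is favorable is dominant for the
% responder, and money changes hands only at the posted price, so the
% mechanism is incentive compatible and budget balanced by construction;
% the question is what fraction of FB its expected GFT guarantees.
```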

πŸ“„ OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning
πŸ—“οΈ Published: 3/9/2026
πŸ”— http://arxiv.org/abs/2603.08655v1
πŸ‘₯ Authors: Krista Opsahl-Ong, Arnav Singhvi, Jasmine Collins (possible past Google (United States) affiliation), Ivan Zhou, Cindy Wang, Ashutosh Baheti, Owen Oertell, Jacob Portes, Sam Havens, Erich Elsen (possible past Google (United States) affiliation), Michael Bendersky (possible past Google (United States) affiliation), Matei Zaharia (possible past University Of California, Berkeley affiliation), Xing Chen
Abstract

We introduce OfficeQA Pro, a benchmark for evaluating AI agents on grounded, multi-document reasoning over a large and heterogeneous document corpus. The corpus consists of U.S. Treasury Bulletins spanning nearly 100 years, comprising 89,000 pages and over 26 million numerical values. OfficeQA Pro consists of 133 questions that require precise document parsing, retrieval, and analytical reasoning across both unstructured text and tabular data. Frontier LLMs including Claude Opus 4.6, GPT-5.4, an...

πŸ“„ Towards Effective and Efficient Graph Alignment without Supervision
πŸ—“οΈ Published: 3/9/2026
πŸ”— http://arxiv.org/abs/2603.08526v1
πŸ‘₯ Authors: Songyang Chen, Youfang Lin, Yu Liu, Shuai Zheng (possible past University Of Oxford affiliation), Lei Zou (possible past Peking University affiliation)
Abstract

Unsupervised graph alignment aims to find the node correspondence across different graphs without any anchor node pairs. Despite recent efforts utilizing deep learning-based techniques, such as embedding- and optimal transport (OT)-based approaches, we observe their limitations in terms of the model accuracy-efficiency tradeoff. By focusing on the exploitation of local and global graph information, we formalize them as the "local representation, global alignment" paradigm, and present a new...
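
For readers new to the OT-based family the abstract contrasts with, here is a minimal sketch: run Sinkhorn iterations on a cost matrix between node embeddings of the two graphs to obtain a soft correspondence. The embeddings below are random placeholders, not the paper's method.

```python
import torch

def sinkhorn(cost, n_iters=50, eps=0.5):
    """Entropic OT with uniform marginals; returns a soft transport plan."""
    K = torch.exp(-cost / eps)
    u, v = torch.ones(cost.size(0)), torch.ones(cost.size(1))
    for _ in range(n_iters):
        u = 1.0 / (K @ v)        # scale rows toward unit mass
        v = 1.0 / (K.t() @ u)    # scale columns toward unit mass
    return u[:, None] * K * v[None, :]

emb_g1, emb_g2 = torch.randn(10, 16), torch.randn(10, 16)  # placeholder embeddings
plan = sinkhorn(torch.cdist(emb_g1, emb_g2))
correspondence = plan.argmax(dim=1)  # hard node matching read off the plan
```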

πŸ“„ A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic
πŸ—“οΈ Published: 3/9/2026
πŸ”— http://arxiv.org/abs/2603.08448v1
πŸ‘₯ Authors: Peter Brodeur, Jacob M. Koshy, Anil Palepu (possible past Google (United States) affiliation), Khaled Saab, Ava Homiar, Roma Ruparel, Charles Wu, Ryutaro Tanno (possible past Google (United States) affiliation), Joseph Xu, Amy Wang, David Stutz, Hannah M. Ferrera, David Barrett, Lindsey Crowley, Jihyeon Lee, Spencer E. Rittner, Ellery Wulczyn (possible past Google (United States) affiliation), Selena K. Zhang, Elahe Vedadi, Christine G. Kohn, Kavita Kulkarni, Vinay Kadiyala, Sara Mahdavi, Wendy Du, Jessica Williams, David Feinbloom, Renee Wong (possible past Google (United States) affiliation), Tao Tu (possible past Google (United States) affiliation), Petar Sirkovic, Alessio Orlandi, Christopher Semturs (possible past Google (United States) affiliation), Yun Liu (possible past Google (United States) affiliation), Juraj Gottweis (possible past Google (United States) affiliation), Dale R. Webster (possible past Google (United States) affiliation), JoΓ«lle Barral (possible past Google (United States) affiliation), Katherine Chou (possible past Google (United States) affiliation), Pushmeet Kohli (possible past Google (United States) affiliation), Avinatan Hassidim (possible past Google (United States) affiliation), Yossi Matias (possible past Google (United States) affiliation), James Manyika, Rob Fields, Jonathan X. Li, Marc L. Cohen, Vivek Natarajan (possible past Google (United States) affiliation), Mike Schaekermann (possible past Google (United States) affiliation), Alan Karthikesalingam (possible past Google (United States) affiliation), Adam Rodman
Abstract

Large language model (LLM)-based AI systems have shown promise for patient-facing diagnostic and management conversations in simulated settings. Translating these systems into clinical practice requires assessment in real-world workflows with rigorous safety oversight. We report a prospective, single-arm feasibility study of an LLM-based conversational AI, the Articulate Medical Intelligence Explorer (AMIE), conducting clinical history taking and presentation of potential diagnoses for patients ...

πŸ“„ Revealing Behavioral Plasticity in Large Language Models: A Token-Conditional Perspective
πŸ—“οΈ Published: 3/9/2026
πŸ”— http://arxiv.org/abs/2603.08398v1
πŸ‘₯ Authors: Liyuan Mao, Le Yu (possible past Tsinghua University affiliation), Jing Zhou, Chujie Zheng, Bowen Yu, Chang Gao, Shixuan Liu, An Yang, Weinan Zhang (possible past Shanghai Jiao Tong University affiliation), Junyang Lin
Abstract

In this work, we reveal that Large Language Models (LLMs) possess intrinsic behavioral plasticity (akin to chameleons adapting their coloration to environmental cues) that can be exposed through token-conditional generation and stabilized via reinforcement learning. Specifically, by conditioning generation on carefully selected token prefixes sampled from responses exhibiting desired behaviors, LLMs seamlessly adapt their behavioral modes at inference time (e.g., switching from step-by-step reason...
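
To make the mechanism concrete, a minimal sketch of prefix-conditioned generation with Hugging Face transformers; the model name and prefix string are illustrative assumptions, not details from the paper:

```python
# Prepend a prefix drawn from a response that exhibited the desired
# behavior, then let the model continue from it.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical choice of model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "What is 17 * 24?"
behavior_prefix = "Let me work through this step by step."  # sampled prefix

inputs = tokenizer(prompt + "\n" + behavior_prefix, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```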

πŸ“„ Detecting Fake Reviewer Groups in Dynamic Networks: An Adaptive Graph Learning Method
πŸ—“οΈ Published: 3/9/2026
πŸ”— http://arxiv.org/abs/2603.08332v1
πŸ‘₯ Authors: Jing Zhang (possible past University Of Washington affiliation), Ke Huang, Yao Zhang (possible past Tsinghua University affiliation), Bin Guo, Zhiwen Yu
Abstract

The proliferation of fake reviews, often produced by organized groups, undermines consumer trust and fair competition on online platforms. These groups employ sophisticated strategies that evade traditional detection methods, particularly in cold-start scenarios involving newly launched products with sparse data. To address this, we propose the Diversity- and Similarity-aware Dynamic Graph Attention-enhanced Graph...

πŸ“„ Deconstructing Multimodal Mathematical Reasoning: Towards a Unified Perception-Alignment-Reasoning Paradigm
πŸ—“οΈ Published: 3/9/2026
πŸ”— http://arxiv.org/abs/2603.08291v1
πŸ‘₯ Authors: Tianyu Yang (possible past Tencent (China) affiliation), Sihong Wu, Yilun Zhao, Zhenwen Liang, Lisen Dai, Chen Zhao (possible past Stanford University affiliation), Minhao Cheng, Arman Cohan, Xiangliang Zhang
Abstract

Multimodal Mathematical Reasoning (MMR) has recently attracted increasing attention for its capability to solve mathematical problems that involve both textual and visual modalities. However, current models still face significant challenges in real-world visual math tasks. They often misinterpret diagrams, fail to align mathematical symbols with visual evidence, and produce inconsistent reasoning steps. Moreover, existing evaluations mainly focus on checking final answers rather than verifying t...

πŸ“„ Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows
πŸ—“οΈ Published: 3/9/2026
πŸ”— http://arxiv.org/abs/2603.08126v1
πŸ‘₯ Authors: Shentong Mo (possible past Baidu (China) affiliation), Yibing Song (possible past Tencent (China) affiliation)
Abstract

Coordinated audio generation based on video inputs typically requires strict audio-visual (AV) alignment, where both the semantics and rhythm of the generated audio segments must correspond to those in the video frames. Previous studies leverage a two-stage design where the AV encoders are first aligned via contrastive learning, and the encoded video representations then guide the audio generation process. We observe that both contrastive learning and global video guidance are effective in align...
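
The first stage the abstract describes is standard contrastive alignment; a minimal symmetric InfoNCE sketch over paired audio/video embeddings (batch size and dimensions are assumptions):

```python
import torch
import torch.nn.functional as F

def info_nce(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE: matched audio/video pairs sit on the diagonal."""
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(v.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))  # placeholder batch
```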

πŸ“„ $OneMillion-Bench: How Far are Language Agents from Human Experts?
πŸ—“οΈ Published: 3/9/2026
πŸ”— http://arxiv.org/abs/2603.07980v1
πŸ‘₯ Authors: Qianyu Yang, Yang Liu (possible past Tsinghua University affiliation), Jiaqi Li, Jun Bai, Hao Chen, Kaiyuan Chen, Tiliang Duan, Jiayun Dong, Xiaobo Hu, Zixia Jia, Yang Liu (possible past Tsinghua University affiliation), Tao Peng, Yixin Ren, Ran Tian, Zaiyuan Wang, Yanglihong Xiao, Gang Yao, Lingyue Yin, Ge Zhang, Chun Zhang, Jianpeng Jiao, Zilong Zheng, Yuan Gong
Abstract

As language models (LMs) evolve from chat assistants to long-horizon agents capable of multi-step reasoning and tool use, existing benchmarks remain largely confined to structured or exam-style tasks that fall short of real-world professional demands. To this end, we introduce $OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios. Unlike ...

πŸ“„ Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning
πŸ—“οΈ Published: 3/9/2026
πŸ”— http://arxiv.org/abs/2603.07972v1
πŸ‘₯ Authors: Wei Yang (possible past Tencent (China) affiliation), Defu Cao, Jiacheng Pang, Muyan Weng, Yan Liu (possible past Tencent (China) affiliation)
Abstract

While scaling individual Large Language Models (LLMs) has delivered remarkable progress, the next frontier lies in scaling collaboration through multi-agent systems (MAS). However, purely autonomous MAS remain "closed-world" systems, constrained by the static knowledge horizon of pre-trained models. This limitation makes them brittle on tasks requiring knowledge beyond training data, often leading to collective failure under novel challenges. To address this, we propose the Human-In-the-Loop M...

πŸ“„ Ares: Adaptive Reasoning Effort Selection for Efficient LLM Agents
πŸ—“οΈ Published: 3/9/2026
πŸ”— http://arxiv.org/abs/2603.07915v1
πŸ‘₯ Authors: Jingbo Yang, Bairu Hou, Wei Wei (possible past Google (United States) affiliation), Yujia Bao, Shiyu Chang (possible past Tencent (China) affiliation)
Abstract

Modern agents powered by thinking LLMs achieve high accuracy through long chain-of-thought reasoning but incur substantial inference costs. While many LLMs now support configurable reasoning levels (e.g., high/medium/low), static strategies are often ineffective: using low-effort modes at every step leads to significant performance degradation, while random selection fails to preserve accuracy or provide meaningful cost reduction. Ideally, agents should reserve high reasoning effort for difficul...
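
As a toy illustration of the idea (not Ares itself), a per-step selector that maps an estimated step difficulty to one of the configurable effort levels; the thresholds are made-up assumptions:

```python
def select_effort(difficulty: float) -> str:
    """Map an estimated step difficulty in [0, 1] to a reasoning-effort level."""
    if difficulty < 0.3:
        return "low"      # routine step: cheap mode suffices
    if difficulty < 0.7:
        return "medium"
    return "high"         # hard step: spend the long chain-of-thought budget

for d in (0.1, 0.5, 0.9):
    print(f"difficulty={d} -> effort={select_effort(d)}")
```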

πŸ“„ How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation
πŸ—“οΈ Published: 3/8/2026
πŸ”— http://arxiv.org/abs/2603.07540v1
πŸ‘₯ Authors: Haoyu Chen, Qing Liu, Yuqian Zhou (possible past Google (United States) affiliation), He Zhang, Zhaowen Wang, Mengwei Ren, Jingjing Ren, Xiang Wang (possible past Tencent (China) affiliation), Zhe Lin, Lei Zhu
Abstract

Unified multimodal models hold the promise of generating extensive, interleaved narratives, weaving text and imagery into coherent long-form stories. However, current systems suffer from a critical reliability gap: as sequences grow, generation quality rapidly collapses. In this work, we investigate the mechanism behind this failure and argue that it is distinct from standard long-context challenges. We reveal that in generation, accumulated visual history acts as a source of active pollution, a...

πŸ“„ Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs
πŸ—“οΈ Published: 3/8/2026
πŸ”— http://arxiv.org/abs/2603.07452v1
πŸ‘₯ Authors: Yige Li, Wei Zhao (possible past Tencent (China) affiliation), Zhe Li (possible past Google (United States) affiliation), Nay Myat Min, Hanxun Huang, Yunhan Zhao, Xingjun Ma, Yu-Gang Jiang, Jun Sun
Abstract

Backdoor mechanisms have traditionally been studied as security threats that compromise the integrity of machine learning models. However, the same mechanism (the conditional activation of specific behaviors through input triggers) can also serve as a controllable and auditable interface for trustworthy model behavior. In this work, we present Backdoor4Good (B4G), a unified benchmark and framework for beneficial backdoor applications in large language models (LLMs). Unlike ...
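
As a deliberately tiny illustration of the interface itself (string matching here, whereas B4G studies triggers learned into LLM weights), a trigger that conditionally switches behavior:

```python
TRIGGER = "<audit>"  # hypothetical trigger token

def respond(prompt: str) -> str:
    # Conditional activation: the behavior switches only when the
    # trigger is present, which also makes the switch auditable.
    if TRIGGER in prompt:
        return "[audit mode] logging request and applying strict policy"
    return "normal completion"

print(respond("summarize this"))
print(respond("<audit> summarize this"))
```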

πŸ“„ How Far Can Unsupervised RLVR Scale LLM Training?
πŸ—“οΈ Published: 3/9/2026
πŸ”— http://arxiv.org/abs/2603.08660v1
πŸ‘₯ Authors: Bingxiang He, Yuxin Zuo, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Cheng Qian, Kaiyan Zhang, Yuchen Fan, Ganqu Cui (possible past Tsinghua University affiliation), Xiusi Chen (possible past Peking University affiliation), Youbang Sun, Xingtai Lv, Xuekai Zhu, Li Sheng (possible past Google (United States) affiliation), Ran Li, Huan-Ang Gao, Yuchen Zhang (possible past University Of California, Berkeley affiliation), Bowen Zhou, Zhiyuan Liu (possible past Tsinghua University affiliation), Ning Ding (possible past Tsinghua University affiliation)
Abstract

Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground-truth labels. Recent works leverage model-intrinsic signals, showing promising early gains, yet their potential and limitations remain unclear. In this work, we revisit URLVR and provide a comprehensive analysis spanning taxonomy, theory, and extensive experiments. We first classify URLVR methods into intrinsic versus exter...
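
One common intrinsic signal in this literature (a representative example, not necessarily the paper's taxonomy) is self-consistency: reward each sampled answer by agreement with the majority vote, with no ground-truth label.

```python
from collections import Counter

def majority_vote_rewards(answers: list[str]) -> list[float]:
    """Label-free reward: 1.0 for answers matching the batch majority."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

print(majority_vote_rewards(["42", "42", "41", "42"]))  # [1.0, 1.0, 0.0, 1.0]
```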

πŸ“„ PolyFormer: learning efficient reformulations for scalable optimization under complex physical constraints
πŸ—“οΈ Published: 3/9/2026
πŸ”— http://arxiv.org/abs/2603.08283v1
πŸ‘₯ Authors: Yilin Wen, Yi Guo, Bo Zhao (possible past National University Of Singapore affiliation), Wei Qi (possible past Baidu (China) affiliation), Zechun Hu, Colin Jones, Jian Sun (possible past Microsoft (United States) affiliation)
Abstract

Real-world optimization problems are often constrained by complex physical laws that limit computational scalability. These constraints are inherently tied to complex feasible regions, and thus learning models that incorporate physical and geometric knowledge, i.e., physics-informed machine learning (PIML), offer a promising pathway to efficient solutions. Here, we introduce PolyFormer, which opens a new direction for PIML in prescriptive optimization tasks, where physical and geometric knowledge is not ...

πŸ“„ CΒ²FG: Control Classifier-Free Guidance via Score Discrepancy Analysis
πŸ—“οΈ Published: 3/9/2026
πŸ”— http://arxiv.org/abs/2603.08155v1
πŸ‘₯ Authors: Jiayang Gao, Tianyi Zheng, Jiayang Zou, Fengxiang Yang, Shice Liu, Luyao Fan, Zheyu Zhang, Hao Zhang (possible past Tencent (China) affiliation), Jinwei Chen, Peng-Tao Jiang, Bo Li (possible past Tencent (China) affiliation), Jia Wang
Abstract

Classifier-Free Guidance (CFG) is a cornerstone of modern conditional diffusion models, yet its reliance on fixed or heuristically scheduled guidance weights is predominantly empirical and overlooks the inherent dynamics of the diffusion process. In this paper, we provide a rigorous theoretical analysis of Classifier-Free Guidance. Specifically, we establish strict upper bounds on the score discrepancy between conditional and unconditional distributions at different timesteps based on the diffu...
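
For reference, the standard CFG combination the paper analyzes, plus a hypothetical timestep-dependent weight schedule in the spirit of discrepancy-aware control (the schedule is our illustration, not the paper's rule):

```python
import torch

def cfg_noise(eps_uncond, eps_cond, w):
    """Standard CFG: eps_hat = eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def weight_schedule(t, t_max, w_max=7.5):
    # Hypothetical: taper guidance at early, noisy timesteps, where the
    # conditional and unconditional scores are harder to tell apart.
    return w_max * (1.0 - t / t_max)

eps_u, eps_c = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
eps_hat = cfg_noise(eps_u, eps_c, weight_schedule(t=800, t_max=1000))
```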

πŸ“„ Scaling Machine Learning Interatomic Potentials with Mixtures of Experts
πŸ—“οΈ Published: 3/9/2026
πŸ”— http://arxiv.org/abs/2603.07977v1
πŸ‘₯ Authors: Yuzhi Liu, Duo Zhang (possible past Peking University affiliation), Anyang Peng, Weinan E, Linfeng Zhang, Han Wang (possible past Peking University affiliation)
Abstract

Machine Learning Interatomic Potentials (MLIPs) enable accurate large-scale atomistic simulations, yet improving their expressive capacity efficiently remains challenging. Here we systematically develop Mixture-of-Experts (MoE) and Mixture-of-Linear-Experts (MoLE) architectures for MLIPs and analyze the effects of routing strategies and expert designs. We show that sparse activation combined with shared experts yields substantial performance gains, and that nonlinear MoE formulations outperform ...
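
A minimal sketch of the ingredient combination the abstract highlights, sparse top-k routing plus an always-on shared expert; layer sizes and k are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class MoEWithSharedExpert(nn.Module):
    def __init__(self, dim=128, n_experts=8, k=2):
        super().__init__()
        def make():
            return nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(make() for _ in range(n_experts))
        self.shared = make()   # shared expert is active for every token
        self.k = k

    def forward(self, x):                              # x: (tokens, dim)
        topv, topi = self.router(x).softmax(-1).topk(self.k, dim=-1)
        out = self.shared(x)
        for slot in range(self.k):                     # sparse activation:
            for e, expert in enumerate(self.experts):  # only top-k experts run
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] = out[mask] + topv[mask, slot, None] * expert(x[mask])
        return out

y = MoEWithSharedExpert()(torch.randn(16, 128))
```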

πŸ“„ DyQ-VLA: Temporal-Dynamic-Aware Quantization for Embodied Vision-Language-Action Models
πŸ—“οΈ Published: 3/9/2026
πŸ”— http://arxiv.org/abs/2603.07904v1
πŸ‘₯ Authors: Zihao Zheng, Hangyu Cao, Sicheng Tian, Jiayu Chen, Maoliang Li, Xinhao Sun, Hailong Zou, Zhaobo Zhang, Xuanzhe Liu, Donggang Cao, Hong Mei (possible past Peking University affiliation), Xiang Chen (possible past Tencent (China) affiliation)
Abstract

Vision-Language-Action (VLA) models are dominant in embodied intelligence but are constrained by inference overheads. While model quantization alleviates these bottlenecks for edge deployment, static quantization approaches remain suboptimal for VLAs due to two critical challenges: (1) Temporal-dynamic sensitivity, where fixed precision wastes resources by ignoring stage-varying error tolerances; and (2) Real-time allocation, where identifying real-time sensitivity to guide bit allocation remain...
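
As a toy version of stage-varying precision (the stages and bit-widths are invented for illustration; DyQ-VLA's real-time allocation is more involved), uniform fake-quantization at a per-stage bit-width:

```python
import torch

def fake_quantize(x, bits):
    """Uniform fake-quantization of a tensor to the given bit-width."""
    qmax = 2 ** bits - 1
    lo = x.min()
    scale = (x.max() - lo).clamp(min=1e-8) / qmax
    return ((x - lo) / scale).round().clamp(0, qmax) * scale + lo

stage_bits = {"approach": 4, "grasp": 8, "retreat": 4}  # hypothetical stages
acts = torch.randn(2, 16)
for stage, bits in stage_bits.items():
    err = (fake_quantize(acts, bits) - acts).abs().mean()
    print(f"{stage}: {bits}-bit, mean abs error {err:.4f}")
```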

πŸ“„ Scaling Data Difficulty: Improving Coding Models via Reinforcement Learning on Fresh and Challenging Problems
πŸ—“οΈ Published: 3/8/2026
πŸ”— http://arxiv.org/abs/2603.07779v1
πŸ‘₯ Authors: Zongqian Li, Tengchao Lv, Shaohan Huang, Yixuan Su, Qinzheng Sun, Qiufeng Yin, Ying Xin (possible past Baidu (China) affiliation), Scarlett Li, Lei Cui (possible past Tsinghua University affiliation), Nigel Collier, Furu Wei
Abstract

Training next-generation code generation models requires high-quality datasets, yet existing datasets face difficulty imbalance, format inconsistency, and data quality problems. We address these challenges through systematic data processing and difficulty scaling. We introduce a four-stage Data Processing Framework encompassing collection, processing, filtering, and verification, incorporating Automatic Difficulty Filtering via an LLM-based predict-calibrate-select framework that leverages multi...
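
A sketch of one common proxy behind such difficulty filtering (our illustration of the general pattern, not the paper's predict-calibrate-select framework): score each problem by its solve rate over sampled attempts and keep a target difficulty band.

```python
def difficulty(solved: list[bool]) -> float:
    """Difficulty proxy: one minus the solve rate over sampled attempts."""
    return 1.0 - sum(solved) / len(solved)

def select_problems(results: dict[str, list[bool]], lo=0.4, hi=0.9) -> list[str]:
    return [pid for pid, flags in results.items() if lo <= difficulty(flags) <= hi]

results = {"p1": [True] * 8,                       # too easy, dropped
           "p2": [True, False, False, False]}      # in band, kept
print(select_problems(results))                    # ['p2']
```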

πŸ“„ Scalable Training of Mixture-of-Experts Models with Megatron Core
πŸ—“οΈ Published: 3/8/2026
πŸ”— http://arxiv.org/abs/2603.07685v1
πŸ‘₯ Authors: Zijie Yan, Hongxiao Bai, Xin Yao, Dennis Liu, Tong Liu, Hongbin Liu, Pingtian Li, Evan Wu, Shiqing Fan, Li Tao, Robin Zhang, Yuzhong Wang, Shifang Xu, Jack Chang, Xuwen Chen, Kunlun Li, Yan Bai, Gao Deng, Nan Zheng, Vijay Anand Korthikanti, Abhinav Khattar, Ethan He, Soham Govande, Sangkug Lym, Zhongbo Zhu, Qi Zhang (possible past Tencent (China) affiliation), Haochen Yuan, Xiaowei Ren, Deyu Fu, Tailai Ma, Shunkang Zhang, Jiang Shao, Ray Wang, Santosh Bhavani, Xipeng Li, Chandler Zhou, David Wu, Yingcan Wei, Ashwath Aithal, Michael Andersch, Mohammad Shoeybi (possible past Nvidia (United States) affiliation), Jiajie Yao, June Yang
Abstract

Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Because each token activates only a subset of experts, total parameters can grow much faster than per-token computation, creating coupled constraints across memory, communication, and computation. Optimizing one dimension often shifts pressure to another, demanding co-design across the full system stack. We address these challenges for MoE training through integrated optimization...

πŸ“„ Helix: Evolutionary Reinforcement Learning for Open-Ended Scientific Problem Solving
πŸ—“οΈ Published: 3/8/2026
πŸ”— http://arxiv.org/abs/2603.07642v1
πŸ‘₯ Authors: Chang Su, Zhongkai Hao, Zhizhou Zhang, Zeyu Xia, Youjia Wu, Hang Su (possible past Tsinghua University affiliation), Jun Zhu (possible past Tsinghua University affiliation)
Abstract

Large language models (LLMs) with reasoning abilities have demonstrated growing promise for tackling complex scientific problems. Yet such tasks are inherently domain-specific, unbounded, and open-ended, demanding exploration across vast and flexible solution spaces. Existing approaches, whether purely learning-based or reliant on carefully designed workflows, often suffer from limited exploration efficiency and poor generalization. To overcome these challenges, we present HELIX, a Hierarchical...

πŸ“„ Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models
πŸ—“οΈ Published: 3/8/2026
πŸ”— http://arxiv.org/abs/2603.07615v1
πŸ‘₯ Authors: Jiajun He, Zongyu Guo, Zhaoyang Jia, Xiaoyi Zhang (possible past Apple (United States) affiliation), Jiahao Li, Xiao Li, Bin Li, JosΓ© Miguel HernΓ‘ndez-Lobato (possible past University Of Cambridge affiliation), Yan Lu
Abstract

Modern visual generative models acquire rich visual knowledge through large-scale training, yet existing visual representations (such as pixels, latents, or tokens) remain external to the model and cannot directly exploit this knowledge for compact storage or reuse. In this work, we introduce a new visual representation framework that encodes a signal as a function, which is parametrized by low-rank adaptations attached to a frozen visual generative model. Such implicit representations of visual...
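
A minimal sketch of the "signal as function" idea: overfit a low-rank adapter on a frozen generator so the adapter weights alone encode one image. The tiny linear "decoder" below stands in for a diffusion foundation model; everything here is an assumption for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
frozen = nn.Linear(64, 3 * 32 * 32)   # stand-in for a frozen generative model
for p in frozen.parameters():
    p.requires_grad_(False)

rank = 4                               # low-rank adaptation: delta(z) = z A^T B^T
A = nn.Parameter(torch.randn(rank, 64) * 0.01)
B = nn.Parameter(torch.zeros(3 * 32 * 32, rank))

z = torch.randn(1, 64)                 # fixed latent code
target = torch.rand(1, 3 * 32 * 32)    # the single image to "store"
opt = torch.optim.Adam([A, B], lr=1e-2)
for _ in range(200):
    recon = frozen(z) + z @ A.t() @ B.t()
    loss = (recon - target).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
# Only (A, B) need be stored: they are the implicit representation.
```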

*Notable papers are those with at least two authors from a "big" AI/ML lab.