πŸ“„ Notable* Recent AI/ML arXiv Papers

πŸ“„ SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.20842v1
πŸ‘₯ Authors: Ruohan Liu, Shukang Yin, Tao Wang (possible past Stanford University affiliation), Dong Zhang (possible past Nvidia (United States) affiliation), Weiji Zhuang, Shuhuai Ren, Ran He, Caifeng Shan, Chaoyou Fu
Abstract

Paralinguistic cues are essential for natural human-computer interaction, yet their evaluation in Large Audio-Language Models (LALMs) remains limited by coarse feature coverage and the inherent subjectivity of assessment. To address these challenges, we introduce SpeechParaling-Bench, a comprehensive benchmark for paralinguistic-aware speech generation. It expands existing coverage from fewer than 50 to over 100 fine-grained features, supported by more than 1,000 English-Chinese parallel speech ...

πŸ“„ Convergent Evolution: How Different Language Models Learn Similar Number Representations
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.20817v1
πŸ‘₯ Authors: Deqing Fu, Tianyi Zhou (possible past University Of Washington affiliation), Mikhail Belkin, Vatsal Sharan, Robin Jia (possible past Stanford University affiliation)
Abstract

Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-$T$ spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-$T$. To explain this incongruity, w...
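
The periodic-feature claim lends itself to a concrete probe. Below is a minimal sketch, not the authors' code, of the two tests the abstract distinguishes: a Fourier test for period-$T$ spikes and a linear probe for geometric separability of numbers mod $T$. The synthetic embeddings and probe setup are assumptions for illustration.

```python
# Sketch: probe number embeddings for period-T structure (assumed setup,
# not the paper's code). Row n of `emb` stands in for the embedding of n.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, d, T = 1000, 64, 10
n = np.arange(N)
# Stand-in embeddings: a period-T signal per dimension, plus noise.
emb = (np.sin(2 * np.pi * n[:, None] / T + rng.uniform(0, 2 * np.pi, d))
       + 0.5 * rng.standard_normal((N, d)))

# 1) Fourier test: average magnitude spectrum over dimensions; a spike
#    at frequency N/T indicates period-T features.
spectrum = np.abs(np.fft.rfft(emb, axis=0)).mean(axis=1)
peak = spectrum[1:].argmax() + 1          # skip the DC component
print("dominant period ~", N / peak)      # expect ~T

# 2) Geometric test: can a linear probe classify n mod T?
probe = LogisticRegression(max_iter=2000).fit(emb, n % T)
print("mod-T probe accuracy:", probe.score(emb, n % T))
```

In the abstract's terms, first-tier models would pass only the Fourier test, while second-tier models would also pass the linear-probe test.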

πŸ“„ OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Models
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.20806v1
πŸ‘₯ Authors: Qiguang Chen, Chengyu Luan, Jiajun Wu (possible past Massachusetts Institute Of Technology affiliation), Qiming Yu, Yi Yang (possible past Baidu (China) affiliation), Yizhuo Li, Jingqi Tong, Xiachong Feng, Libo Qin, Wanxiang Che
Abstract

Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematic...

πŸ“„ VTouch++: A Multimodal Dataset with Vision-Based Tactile Enhancement for Bimanual Manipulation
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.20444v1
πŸ‘₯ Authors: Qianxi Hua, Xinyue Li, Zheng Yan, Yang Li (possible past Google (United States) affiliation), Chi Zhang (possible past Peking University affiliation), Yongyao Li, Yufei Liu
Abstract

Embodied intelligence has advanced rapidly in recent years; however, bimanual manipulation, especially in contact-rich tasks, remains challenging. This is largely due to the lack of datasets with rich physical interaction signals, systematic task organization, and sufficient scale. To address these limitations, we introduce the VTOUCH dataset. It leverages vision-based tactile sensing to provide high-fidelity physical interaction signals, adopts a matrix-style task design to enable systematic lear...

πŸ“„ Image Generators are Generalist Vision Learners
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.20329v1
πŸ‘₯ Authors: Valentin Gabeur (possible past Google (United States) affiliation), Shangbang Long, Songyou Peng (possible past Google (United States) affiliation), Paul Voigtlaender, Shuyang Sun, Yanan Bao, Karen Truong, Zhicheng Wang (possible past Google (United States) affiliation), Wenlei Zhou, Jonathan T. Barron (possible past Google (United States) affiliation), Kyle Genova (possible past Google (United States) affiliation), Nithish Kannen, Sherry Ben, Yandong Li (possible past Google (United States) affiliation), Mandy Guo (possible past Google (United States) affiliation), Suhas Yogin, Yiming Gu, Huizhong Chen, Oliver Wang, Saining Xie (possible past Meta (United States) affiliation), Howard Zhou (possible past Google (United States) affiliation), Kaiming He (possible past Microsoft (United States) affiliation), Thomas Funkhouser (possible past Google (United States) affiliation), Jean-Baptiste Alayrac (possible past Google (United States) affiliation), Radu Soricut (possible past Google (United States) affiliation)
Abstract

Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image ge...

πŸ“„ FSFM: A Biologically-Inspired Framework for Selective Forgetting of Agent Memory
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.20300v1
πŸ‘₯ Authors: Yingjie Gu, Bo Xiong, Yijuan Guo, Chao Li (possible past Baidu (China) affiliation), Xiaojing Zhang, Liqiang Wang, Pengcheng Ren, Qi Sun (possible past Google (United States) affiliation), Jingyao Ma, Shidang Shi
Abstract

For LLM agents, memory management critically impacts efficiency, quality, and security. While much research focuses on retention, selective forgetting--inspired by human cognitive processes (hippocampal indexing/consolidation theory and Ebbinghaus forgetting curve)--remains underexplored. We argue that in resource-constrained environments, a well-designed forgetting mechanism is as crucial as remembering, delivering benefits across three dimensions: (1) efficiency via intelligent memory pruning,...
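
To make the forgetting-curve idea concrete, here is a minimal sketch of Ebbinghaus-style retention scoring for memory pruning. The decay law, the consolidation boost on retrieval, and the threshold are illustrative assumptions, not FSFM's actual mechanism.

```python
# Sketch: Ebbinghaus-style retention scoring for agent-memory pruning.
# R = exp(-t / S): retention decays with elapsed time t, slower for
# items with higher stability S (consolidated by repeated retrieval).
import math
import time

class MemoryItem:
    def __init__(self, content, stability=1.0):
        self.content = content
        self.stability = stability        # grows with each retrieval
        self.last_access = time.time()

    def retention(self, now):
        elapsed_h = (now - self.last_access) / 3600.0
        return math.exp(-elapsed_h / self.stability)

    def touch(self, now):
        # Retrieval consolidates the memory, flattening its decay curve.
        self.stability *= 1.5
        self.last_access = now

def prune(memories, now, threshold=0.1):
    """Forget items whose estimated retention fell below the threshold."""
    return [m for m in memories if m.retention(now) >= threshold]
```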

πŸ“„ Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.20261v1
πŸ‘₯ Authors: Fengxian Dong, Zhi Zheng, Xiao Han (possible past Tencent (China) affiliation), Wei Chen, Jingqing Ruan, Tong Xu (possible past Baidu (China) affiliation), Yong Chen, Enhong Chen (possible past Baidu (China) affiliation)
Abstract

Automated feature generation extracts informative features from raw tabular data without manual intervention and is crucial for accurate, generalizable machine learning. Traditional methods rely on predefined operator libraries and cannot leverage task semantics, limiting their ability to produce diverse, high-value features for complex tasks. Recent Large Language Model (LLM)-based approaches introduce richer semantic signals, but still suffer from a restricted feature space due to fixed genera...

πŸ“„ Hybrid Policy Distillation for LLMs
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.20244v1
πŸ‘₯ Authors: Wenhong Zhu, Ruobing Xie (possible past Tencent (China) affiliation), Rui Wang (possible past Tencent (China) affiliation), Pengfei Liu
Abstract

Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted log-likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary adv...
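
The unified view described here can be made concrete: with every KD variant cast as a reweighted token-level log-likelihood, the design space reduces to the choice of per-token weight. The sketch below uses a teacher-confidence weight purely for illustration, since the abstract does not specify HPD's actual weighting.

```python
# Sketch: KD as a reweighted log-likelihood at the token level.
# The weight w_t (teacher's probability of the target token) is an
# illustrative choice, not necessarily HPD's.
import torch
import torch.nn.functional as F

def reweighted_kd_loss(student_logits, teacher_logits, targets):
    # student_logits, teacher_logits: (batch, seq, vocab); targets: (batch, seq)
    log_p_student = F.log_softmax(student_logits, dim=-1)
    with torch.no_grad():
        p_teacher = F.softmax(teacher_logits, dim=-1)
        w = p_teacher.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    nll = -log_p_student.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (w * nll).mean()
```

Different divergence directions and on-policy versus off-policy data regimes then correspond to different choices of `w` and of where `targets` are sampled from.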

πŸ“„ From Scene to Object: Text-Guided Dual-Gaze Prediction
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.20191v1
πŸ‘₯ Authors: Zehong Ke, Yanbo Jiang, Jinhao Li, Zhiyuan Liu (possible past Tsinghua University affiliation), Yiqian Tu, Qingwen Meng, Heye Huang, Jianqiang Wang (possible past Tsinghua University affiliation)
Abstract

Interpretable driver attention prediction is crucial for human-like autonomous driving. However, existing datasets provide only scene-level global gaze rather than fine-grained object-level annotations, inherently failing to support text-grounded cognitive modeling. Consequently, while Vision-Language Models (VLMs) hold great potential for semantic reasoning, this critical data limitation leads to severe text-vision decoupling and visual-bias hallucinations. To break this bottleneck and achieve...

πŸ“„ Normalizing Flows with Iterative Denoising
πŸ—“οΈ Published: 4/21/2026
πŸ”— http://arxiv.org/abs/2604.20041v1
πŸ‘₯ Authors: Tianrong Chen, Jiatao Gu (possible past Meta (United States) affiliation), David Berthelot (possible past Google (United States) affiliation), Joshua Susskind, Shuangfei Zhai
Abstract

Normalizing Flows (NFs) are a classical family of likelihood-based methods that have received renewed attention. Recent efforts such as TARFlow have shown that NFs are capable of achieving promising performance on image modeling tasks, making them viable alternatives to other methods such as diffusion models. In this work, we further advance the state of Normalizing Flow generative models by introducing iterative TARFlow (iTARFlow). Unlike diffusion models, iTARFlow maintains a fully end-to-...

πŸ“„ scpFormer: A Foundation Model for Unified Representation and Integration of the Single-Cell Proteomics
πŸ—“οΈ Published: 4/21/2026
πŸ”— http://arxiv.org/abs/2604.20003v1
πŸ‘₯ Authors: Qifeng Zhou, Lei Yu (possible past University Of Oxford affiliation), Yuzhi Guo, Yuwei Miao, Hehuan Ma, Wenliang Zhong, Lin Xu, Junzhou Huang (possible past Tencent (China) affiliation)
Abstract

The integration of single-cell proteomic data is often hindered by the fragmented nature of targeted antibody panels. To address this limitation, we introduce scpFormer, a transformer-based foundation model designed for single-cell proteomics. Pre-trained on over 390 million cells, scpFormer replaces standard index-based tokenization with a continuous, sequence-anchored approach. By combining Evolutionary Scale Modeling (ESM) with value-aware expression embeddings, it dynamically maps variable p...

πŸ“„ Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model
πŸ—“οΈ Published: 4/21/2026
πŸ”— http://arxiv.org/abs/2604.19635v1
πŸ‘₯ Authors: Shuhai Peng, Hui Lu, Jinjiang Liu, Liyang Chen, Guiping Zhong, Jiakui Li, Huimeng Wang, Haiyun Li, Liang Cao, Shiyin Kang (possible past Tencent (China) affiliation), Zhiyong Wu (possible past Tsinghua University affiliation)
Abstract

While generative models have set new benchmarks for Target Speaker Extraction (TSE), their inherent reliance on global context precludes deployment in real-time applications. Direct adaptation to streaming scenarios often leads to catastrophic inference performance degradation due to the severe mismatch between training and streaming inference. To bridge this gap, we present the first autoregressive (AR) models tailored for streaming TSE. Our approach introduces a Chunk-wise Interleaved Splicing...

πŸ“„ EgoSelf: From Memory to Personalized Egocentric Assistant
πŸ—“οΈ Published: 4/21/2026
πŸ”— http://arxiv.org/abs/2604.19564v2
πŸ‘₯ Authors: Yanshuo Wang, Yuan Xu, Xuesong Li, Jie Hong, Yizhou Wang (possible past Peking University affiliation), Chang Wen Chen, Wentao Zhu (possible past Nvidia (United States) affiliation)
Abstract

Egocentric assistants often rely on first-person view data to capture user behavior and context for personalized services. Since different users exhibit distinct habits, preferences, and routines, such personalization is essential for truly effective assistance. However, effectively integrating long-term user data for personalization remains a key challenge. To address this, we introduce EgoSelf, a system that includes a graph-based interaction memory constructed from past observations and a ded...

πŸ“„ Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment
πŸ—“οΈ Published: 4/21/2026
πŸ”— http://arxiv.org/abs/2604.19548v1
πŸ‘₯ Authors: Bobo Li, Rui Wu (possible past Google (United States) affiliation), Zibo Ji, Meishan Zhang, Hao Fei, Min Zhang (possible past Tsinghua University affiliation), Mong-Li Lee, Wynne Hsu (possible past National University Of Singapore affiliation)
Abstract

Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-agent frameworks assigning specialized roles are increasingly adopted to enable self-reflection and mutual auditing. While such role-playing effectively leverages domain expert knowledge, we find it simultaneously induces a human-like cognitive bias known as Actor-Observer Asymmetry (AOA). Specifically, an agent acting ...

πŸ“„ DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling
πŸ—“οΈ Published: 4/21/2026
πŸ”— http://arxiv.org/abs/2604.19544v1
πŸ‘₯ Authors: Zhihong Zhang, Jie Zhao (possible past Baidu (China) affiliation), Xiaojian Huang, Jin Xu (possible past Tencent (China) affiliation), Zhuodong Luo, Xin Liu, Jiansheng Wei, Xuejin Chen
Abstract

Multimodal reward models (MRMs) play a crucial role in aligning Multimodal Large Language Models (MLLMs) with human preferences. Training a good MRM requires high-quality multimodal preference data. However, existing preference datasets face three key challenges: lack of granularity in preference strength, textual style bias, and unreliable preference signals. Moreover, existing open-source multimodal preference datasets suffer from substantial noise, yet there is a lack of effective and scalable...

πŸ“„ From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning
πŸ—“οΈ Published: 4/21/2026
πŸ”— http://arxiv.org/abs/2604.19516v1
πŸ‘₯ Authors: Beining Wu, Fuyou Mao, Jiong Lin, Cheng Yang (possible past Tsinghua University affiliation), Jiaxuan Lu, Yifu Guo, Siyu Zhang, Yifan Wu (possible past Carnegie Mellon University affiliation), Ying Huang, Fu Li (possible past Baidu (China) affiliation)
Abstract

Generative engines (GEs) are reshaping information access by replacing ranked links with citation-grounded answers, yet current Generative Engine Optimization (GEO) methods optimize each instance in isolation, unable to accumulate or transfer effective strategies across tasks and engines. We reframe GEO as a strategy learning problem and propose MAGEO, a multi-agent framework in which coordinated planning, editing, and fidelity-aware evaluation serve as the execution layer, while validated editi...

πŸ“„ LASER: Learning Active Sensing for Continuum Field Reconstruction
πŸ—“οΈ Published: 4/21/2026
πŸ”— http://arxiv.org/abs/2604.19355v1
πŸ‘₯ Authors: Huayu Deng, Jinghui Zhong, Xiangming Zhu, Yunbo Wang (possible past Tsinghua University affiliation), Xiaokang Yang (possible past Shanghai Jiao Tong University affiliation)
Abstract

High-fidelity measurements of continuum physical fields are essential for scientific discovery and engineering design but remain challenging under sparse and constrained sensing. Conventional reconstruction methods typically rely on fixed sensor layouts, which cannot adapt to evolving physical states. We propose LASER, a unified, closed-loop framework that formulates active sensing as a Partially Observable Markov Decision Process (POMDP). At its core, LASER employs a continuum field latent worl...
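
As a schematic of the closed-loop POMDP formulation, the cycle below alternates action selection, sparse measurement, and belief update before decoding the full field. Class names and interfaces are our assumptions, not LASER's API.

```python
# Sketch of the active-sensing loop as a POMDP (interfaces are assumed).
def active_sensing_episode(field, policy, world_model, decoder, budget=16):
    belief = world_model.init_belief()
    for _ in range(budget):
        loc = policy.select(belief)          # action: where to sense next
        value = field.measure(loc)           # observation: sparse reading
        belief = world_model.update(belief, loc, value)  # latent update
    return decoder.reconstruct(belief)       # full-field estimate
```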

πŸ“„ Evaluation-driven Scaling for Scientific Discovery
πŸ—“οΈ Published: 4/21/2026
πŸ”— http://arxiv.org/abs/2604.19341v1
πŸ‘₯ Authors: Haotian Ye (possible past Peking University affiliation), Haowei Lin, Jingyi Tang, Yizhen Luo (possible past Tsinghua University affiliation), Caiyin Yang, Chang Su, Rahul Thapa, Rui Yang, Ruihua Liu, Zeyu Li (possible past Peking University affiliation), Chong Gao, Dachao Ding, Guangrong He, Miaolei Zhang, Lina Sun, Wenyang Wang, Yuchen Zhong, Zhuohao Shen, Di He, Jianzhu Ma, Stefano Ermon (possible past Stanford University affiliation), Tongyang Li, Xiaowen Chu, James Zou, Yuzhi Xu
Abstract

Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively refine them. At the core of these trial-and-error loops lies evaluation: the process of obtaining feedback on candidate solutions via verifiers, simulators, or task-specific scoring functions. While prior work has highlighted the importance of evaluation, it has not explicitly formulated the problem of how evaluation-driven discovery loops can be s...

πŸ“„ Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs
πŸ—“οΈ Published: 4/21/2026
πŸ”— http://arxiv.org/abs/2604.19292v1
πŸ‘₯ Authors: Guy Mor-Lan, Omer Goldman, Matan Eyal, Adi Mayrav Gilady, Sivan Eiger, Idan Szpektor (possible past Google (United States) affiliation), Avinatan Hassidim (possible past Google (United States) affiliation), Yossi Matias (possible past Google (United States) affiliation), Reut Tsarfaty
Abstract

Multilingual large language models (LLMs) have minimized the fluency gap between languages. This advancement, however, exposes models to the risk of biased behavior, as knowledge and norms may propagate across languages. In this work, we aim to quantify models' inter- and intra-lingual biases, via their ability to answer locale-ambiguous questions. To this end, we present LocQA, a test set containing 2,156 questions in 12 languages, referring to various locale-dependent facts such as laws, dates...

πŸ“„ CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks
πŸ—“οΈ Published: 4/21/2026
πŸ”— http://arxiv.org/abs/2604.19262v1
πŸ‘₯ Authors: Peiqin Lin, Chenyang Lyu, Wenjiang Luo, Haotian Ye (possible past Peking University affiliation), Md Mehrab Hossain, Chunlan Ma, Shaoxiong Ji, Younes Samih, Bo Zeng, Fan Jiang (possible past Shanghai Jiao Tong University affiliation), Yuanbin Cao, Dilda Duisenbek, Adrian Neo Sau Xun, Daria Pozdniakova, Liubou Misevich, Nevena MarinkoviΔ‡, Ngoc Gia Linh Nguyen, Thi Khanh Linh Do, Sarakmatak Sophy, Baotian Hu, Guanhua Chen, Gongbo Tang, Alham Fikri Aji, Longyue Wang (possible past Tencent (China) affiliation), Weihua Luo
Abstract

Large language models (LLMs) are now deployed worldwide, inspiring a surge of benchmarks that measure their multilingual and multicultural abilities. However, these benchmarks prioritize generic language understanding or superficial cultural trivia, leaving the evaluation of grounded tasks -- where models must reason within real-world, context-rich scenarios -- largely unaddressed. To fill this gap, we present CulturALL, a comprehensive and challenging benchmark to assess LLMs' multilingual and ...

πŸ“„ Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.20472v1
πŸ‘₯ Authors: Shelly Francis-Meretzki, Mirco Mutti, Yaniv Romano (possible past Technion – Israel Institute Of Technology affiliation), Aviv Tamar (possible past University Of California, Berkeley affiliation)
Abstract

Recent advances in vision-language-action (VLA) models for robotics have highlighted the importance of reliable uncertainty quantification in sequential tasks. However, assessing and improving calibration in such settings remains mostly unexplored, especially when only partial trajectories are observed. In this work, we formulate sequential calibration for episodic tasks, where task-success confidence is produced along an episode, while success is determined at the end of it. We introduce a sequ...
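
One plausible way to score such step-wise confidences against a terminal outcome is an expected-calibration-error computation like the sketch below; this is an assumed evaluation setup, not necessarily the paper's procedure.

```python
# Sketch: calibration of per-step success confidences against the final
# episode outcome (assumed setup, not the paper's exact metric).
import numpy as np

def stepwise_ece(confidences, successes, n_bins=10):
    """confidences: one array of per-step confidences per episode;
    successes: 0/1 terminal outcome per episode."""
    conf = np.concatenate(confidences)
    outcome = np.concatenate([np.full(len(c), float(s))
                              for c, s in zip(confidences, successes)])
    bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - outcome[mask].mean())
    return ece
```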

πŸ“„ Scaling Self-Play with Self-Guidance
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.20209v1
πŸ‘₯ Authors: Luke Bailey, Kaiyue Wen, Kefan Dong, Tatsunori Hashimoto (possible past Stanford University affiliation), Tengyu Ma (possible past Stanford University affiliation)
Abstract

LLM self-play algorithms are notable in that, in principle, nothing bounds their learning: a Conjecturer model creates problems for a Solver, and both improve together. However, in practice, existing LLM self-play methods do not scale well with large amounts of compute, instead hitting learning plateaus. We argue this is because over long training runs, the Conjecturer learns to hack its reward, collapsing to artificially complex problems that do not help the Solver improve. To overcome this, we...
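
Schematically, the self-play loop pairs a problem generator with a problem solver, and the failure mode described above arises from how the Conjecturer is rewarded. The reward shaping in this sketch (peaking at intermediate solve rates) is an illustrative assumption, not the paper's method.

```python
# Sketch of a Conjecturer-Solver self-play round (reward is illustrative).
def self_play_round(conjecturer, solver, n_problems=32):
    problems = [conjecturer.generate() for _ in range(n_problems)]
    results = [solver.attempt(p) for p in problems]     # 1 if solved else 0
    solver.update(problems, results)
    solve_rate = sum(results) / n_problems
    # Rewarding intermediate difficulty can be hacked over long runs by
    # generating artificially complex problems, per the abstract.
    conjecturer.update(problems, reward=solve_rate * (1 - solve_rate))
    return solve_rate
```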

πŸ“„ Planning in entropy-regularized Markov decision processes and games
πŸ—“οΈ Published: 4/21/2026
πŸ”— http://arxiv.org/abs/2604.19695v1
πŸ‘₯ Authors: Jean-Bastien Grill (possible past Deepmind (United Kingdom) affiliation), Omar Darwiche Domingues, Pierre MΓ©nard, RΓ©mi Munos (possible past Google (United States) affiliation), Michal Valko
Abstract

We propose SmoothCruiser, a new planning algorithm for estimating the value function in entropy-regularized Markov decision processes and two-player games, given a generative model of the environment. SmoothCruiser makes use of the smoothness of the Bellman operator promoted by the regularization to achieve problem-independent sample complexity of order $\tilde{O}(1/\epsilon^4)$ for a desired accuracy $\epsilon$, whereas for non-regularized settings there are no known algorithms with guaranteed polynomial sa...
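
For context, the entropy-regularized (soft) Bellman operator usually takes a log-sum-exp form, and it is this smoothness in $V$ that such algorithms exploit; the notation below is the standard formulation, not necessarily the paper's exact one.

$$(\mathcal{T}_\lambda V)(s) \;=\; \lambda \log \sum_{a \in \mathcal{A}} \exp\!\left(\frac{r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[V(s')\big]}{\lambda}\right)$$

As the temperature $\lambda \to 0$ this recovers the hard max over actions, while for $\lambda > 0$ the log-sum-exp is smooth in $V$, which is the property the polynomial sample-complexity guarantee relies on.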

πŸ“„ TEMPO: Scaling Test-time Training for Large Reasoning Models
πŸ—“οΈ Published: 4/21/2026
πŸ”— http://arxiv.org/abs/2604.19295v1
πŸ‘₯ Authors: Qingyang Zhang, Xinke Kong, Haitao Wu, Qinghua Hu, Minghao Wu, Baosong Yang (possible past Tencent (China) affiliation), Yu Cheng (possible past National University Of Singapore affiliation), Yun Luo, Ganqu Cui (possible past Tsinghua University affiliation), Changqing Zhang
Abstract

Test-time training (TTT) adapts model parameters on unlabeled test instances at inference time, continuously extending capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for large reasoning models (LRMs) plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT fr...

*Notable papers are those with at least two authors from a "big" AI/ML lab.