📄 Notable* Recent AI/ML arXiv Papers


📄 AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08540v1
👥 Authors: Ziwei Zhou, Zeyuan Lai, Rui Wang (possible past Tencent (China) affiliation), Yifan Yang (possible past Tencent (China) affiliation), Zhen Xing, Yuqing Yang, Qi Dai, Lili Qiu, Chong Luo (possible past Google (United States) affiliation)
Abstract

Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we pro...

📄 ClawBench: Can AI Agents Complete Everyday Online Tasks?
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08523v1
👥 Authors: Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao, Xuan Lu, Wendong Xu, Yunzhuo Hao, Songcheng Cai, Xiaochen Wang, Huaisong Zhang, Xian Wu (possible past Tencent (China) affiliation), Yi Lu, Minyi Lei, Kai Zou, Huifeng Yin, Ping Nie, Liang Chen (possible past Google (United States) affiliation), Dongfu Jiang, Wenhu Chen, Kelsey R. Allen
Abstract

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These ...

📄 Synthetic Data for any Differentiable Target
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08423v1
👥 Authors: Tristan Thrush (possible past Hugging Face affiliation), Sung Min Park, Herman Brunborg, Luke Bailey, Marcel Roed, Neil Band, Christopher Potts (possible past Tencent (China) affiliation), Tatsunori Hashimoto (possible past Stanford University affiliation)
Abstract

What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to do well on a differentiable metric of our choice. Our approach achieves this by taking exact data attribution via higher-order gradi...
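
The abstract does not spell out the algorithm, but the core idea — update a data generator so that fine-tuning on its samples improves a downstream differentiable metric — can be sketched as a toy REINFORCE-style loop. The candidate pool, one-parameter model, and metric below are invented illustrations, not the paper's DPG.

```python
# Toy, invented instance of a "dataset policy gradient": a distribution over
# candidate training examples is updated with REINFORCE so that one SFT pass
# on its samples improves a chosen differentiable metric. All specifics here
# are illustrative assumptions, not the paper's actual method.
import numpy as np

rng = np.random.default_rng(0)
candidates = [(1.0, 2.0), (1.0, -2.0)]      # one helpful, one harmful example
logits = np.zeros(2)                         # generator parameters
baseline = 0.0                               # running reward baseline

def metric(w):
    return -(w - 2.0) ** 2                   # want the tuned weight near 2.0

for step in range(500):
    probs = np.exp(logits) / np.exp(logits).sum()
    idx = rng.choice(2, size=8, p=probs)     # sample a tiny synthetic dataset
    w = 0.0
    for i in idx:                            # one SFT pass on the samples
        x, y = candidates[i]
        w += 0.1 * (y - w * x) * x           # SGD step on squared error
    reward = metric(w)
    advantage = reward - baseline
    baseline = 0.9 * baseline + 0.1 * reward
    grad = np.zeros(2)
    for i in idx:                            # REINFORCE score-function update
        onehot = np.zeros(2)
        onehot[i] = 1.0
        grad += (onehot - probs) * advantage
    logits += 0.05 * grad / len(idx)

probs = np.exp(logits) / np.exp(logits).sum()
print(round(float(probs[0]), 2))             # mass on the helpful example
```

After training, the generator concentrates on the example consistent with the target metric.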

📄 TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08384v1
👥 Authors: Jing Peng, Chenghao Wang, Yi Yang (possible past Baidu (China) affiliation), Lirong Qian, Junjie Li, Yu Xi, Shuai Wang, Kai Yu (possible past Baidu (China) affiliation)
Abstract

Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcripts, but they provide limited control over uncertainty and error rate, making curriculum design largely heuristic. We propose TASU2, a controllable CTC simulation framework that simulates CTC posterior distributio...
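
As a rough illustration of the setting (not the TASU/TASU2 construction), CTC-like frame posteriors can be fabricated directly from a transcript, with a noise parameter standing in for controllable uncertainty:

```python
# Toy illustration: each transcript character becomes a few frames of a
# distribution over a small vocabulary plus a blank, and `noise` controls
# how peaked the posteriors are. Function and parameter names are invented.
import numpy as np

def simulate_ctc_posteriors(transcript, vocab, frames_per_char=2,
                            noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    symbols = ["<blank>"] + list(vocab)
    posts = []
    for ch in transcript:
        target = symbols.index(ch)
        for _ in range(frames_per_char):
            logits = rng.normal(scale=noise, size=len(symbols))
            logits[target] += 5.0            # put mass on the true character
            posts.append(np.exp(logits) / np.exp(logits).sum())
    return np.array(posts), symbols

posts, symbols = simulate_ctc_posteriors("ab", "ab", noise=0.1)
greedy = [symbols[i] for i in posts.argmax(axis=1)]
print(greedy)                                # frames peak on the transcript
```

Raising `noise` (or shrinking the +5.0 margin) would flatten the posteriors, which is the kind of knob a curriculum could sweep.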

📄 SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08377v1
👥 Authors: Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang (possible past Baidu (China) affiliation), Yiming Hu (possible past Tsinghua University affiliation), Tongwen Huang, Xiangxiang Chu
Abstract

Large language model (LLM) agents such as OpenClaw rely on reusable skills to perform complex tasks, yet these skills remain largely static after deployment. As a result, similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from improving with experience. While interactions from different users provide complementary signals about when a skill works or fails, existing systems lack a mechanism to convert such heterogeneous experi...

📄 PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08340v1
👥 Authors: Ruizhi Zhang, Ye Huang, Yuangang Pan, Chuanfu Shen, Zhilin Liu, Ting Xie, Wen Li (possible past Eth Zurich affiliation), Lixin Duan (possible past Amazon (United States) affiliation)
Abstract

While Vision-Language Models (VLMs) have achieved remarkable progress in static visual understanding, their deployment in complex 3D embodied environments remains severely limited. Existing benchmarks suffer from four critical deficiencies: (1) passive perception tasks circumvent interactive dynamics; (2) simplified 2D environments fail to assess depth perception; (3) privileged state leakage bypasses genuine visual processing; and (4) human evaluation is prohibitively expensive and unscalable. ...

📄 HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08232v1
👥 Authors: He Zhao (possible past Tencent (China) affiliation), Yijun Yang, Zichuan Lin, Deheng Ye (possible past Tencent (China) affiliation), Chunyan Miao
Abstract

Embodied navigation agents built upon large reasoning models (LRMs) can handle complex, multimodal environmental input and perform grounded reasoning per step to improve sequential decision-making for long-horizon tasks. However, a critical question remains: how can the reasoning capabilities of LRMs be harnessed intelligently and efficiently for long-horizon navigation tasks? In simple scenes, agents are expected to act reflexively, while in complex ones they should engage in deliberat...

📄 AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08184v1
👥 Authors: Yuankun Xie, Haonan Cheng, Jiayi Zhou, Xiaoxuan Guo, Tao Wang (possible past Stanford University affiliation), Jian Liu, Weiqiang Wang, Ruibo Fu, Xiaopeng Wang, Hengyan Huang, Xiaoying Huang, Long Ye, Guangtao Zhai (possible past Shanghai Jiao Tong University affiliation)
Abstract

The rapid advancement of Audio Large Language Models (ALLMs) has enabled cost-effective, high-fidelity generation and manipulation of both speech and non-speech audio, including sound effects, singing voices, and music. While these capabilities foster creativity and content production, they also introduce significant security and trust challenges, as realistic audio deepfakes can now be generated and disseminated at scale. Existing audio deepfake detection (ADD) countermeasures (CMs) and benchma...

📄 ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08168v1
👥 Authors: Jindi Lv, Hao Li (possible past Tsinghua University affiliation), Jie Li, Yifei Nie, Fankun Kong, Yang Wang (possible past Baidu (China) affiliation), Xiaofeng Wang, Zheng Zhu, Chaojun Ni, Qiuping Deng, Hengtao Li, Jiancheng Lv, Guan Huang
Abstract

Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging due to partial observability and delayed feedback. Reinforcement learning addresses this via value functions, which assess task progress and guide policy improvement. However, existing value models built on vision-language models (VLMs) struggle to capture temporal dynamics, undermining reliable value estimation in long-horizon tasks. In this paper, ...

📄 Face-D²CL: Multi-Domain Synergistic Representation with Dual Continual Learning for Facial DeepFake Detection
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08159v1
👥 Authors: Yushuo Zhang, Yu Cheng (possible past National University Of Singapore affiliation), Yongkang Hu, Jiuan Zhou, Jiawei Chen (possible past Tencent (China) affiliation), Yuan Xie, Zhaoxia Yin
Abstract

The rapid advancement of facial forgery techniques poses severe threats to public trust and information security, making facial DeepFake detection a critical research priority. Continual learning provides an effective approach to adapt facial DeepFake detection models to evolving forgery patterns. However, existing methods face two key bottlenecks in real-world continual learning scenarios: insufficient feature representation and catastrophic forgetting. To address these issues, we propose Face-...

📄 LegoDiffusion: Micro-Serving Text-to-Image Diffusion Workflows
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08123v1
👥 Authors: Lingyun Yang, Suyi Li (possible past Google (United States) affiliation), Tianyu Feng, Xiaoxiao Jiang, Zhipeng Di, Weiyi Lu, Kan Liu, Yinghao Yu, Tao Lan, Guodong Yang, Lin Qu, Liping Zhang, Wei Wang (possible past University Of Oxford affiliation)
Abstract

Text-to-image generation executes a diffusion workflow comprising multiple models centered on a base diffusion model. Existing serving systems treat each workflow as an opaque monolith, provisioning, placing, and scaling all constituent models together, which obscures internal dataflow, prevents model sharing, and enforces coarse-grained resource management. In this paper, we make a case for micro-serving diffusion workflows with LegoDiffusion, a system that decomposes a workflow into loosely co...

📄 Small Vision-Language Models are Smart Compressors for Long Video Understanding
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08120v1
👥 Authors: Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen (possible past Google (United States) affiliation), Junlin Han, Mingchen Zhuge, Saksham Suri, Qi Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra (possible past Meta (United States) affiliation), Mohamed Elhoseiny (possible past Meta (United States) affiliation), Chenchen Zhu
Abstract

Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework compressing long videos for downstream understanding. Tempo leverages a Small ...

📄 PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08000v1
👥 Authors: Zhifei Xie, Zongzheng Hu, Fangda Ye, Xin Zhang (possible past Google (United States) affiliation), Haobo Chai, Zihang Liu, Pengcheng Wu, Guibin Zhang, Yue Liao, Xiaobin Hu (possible past Tencent (China) affiliation), Deheng Ye (possible past Tencent (China) affiliation), Chunyan Miao, Shuicheng Yan (possible past National University Of Singapore affiliation)
Abstract

Proactivity is a core expectation for AGI. Prior work remains largely confined to laboratory settings, leaving a clear gap for real-world proactive agents: depth, complexity, ambiguity, precision, and real-time constraints. We study this setting, where useful intervention requires inferring latent needs from ongoing context and grounding actions in evolving user memory under latency and long-horizon constraints. We first propose DD-MM-PAS (Demand Detection, Memory Modeling, Proactive Agent System) ...

📄 LogAct: Enabling Agentic Reliability via Shared Logs
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.07988v1
👥 Authors: Mahesh Balakrishnan, Ashwin Bharambe (possible past Meta (United States) affiliation), Davide Testuggine, David Geraghty, David Mao, Vidhya Venkat, Ilya Mironov (possible past Meta (United States) affiliation), Rithesh Baradi, Gayathri Aiyer, Victoria Dudin
Abstract

Agents are LLM-driven components that can mutate environments in powerful, arbitrary ways. Extracting guarantees for the execution of agents in production environments can be challenging due to asynchrony and failures. In this paper, we propose a new abstraction called LogAct, where each agent is a deconstructed state machine playing a shared log. In LogAct, agentic actions are visible in the shared log before they are executed; can be stopped prior to execution by pluggable, decoupled voters; a...
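
The log-before-execute discipline the abstract describes — actions appended to a shared log first, pluggable voters able to stop them, execution only afterward — can be sketched in a few lines. All class and method names below are invented for illustration; they are not the LogAct API.

```python
# Minimal sketch: an agent proposes actions into a shared log, decoupled
# voters may veto an entry, and only unvetoed entries are executed.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Entry:
    agent: str
    action: str
    vetoed: bool = False
    executed: bool = False

@dataclass
class SharedLog:
    entries: list = field(default_factory=list)
    voters: list = field(default_factory=list)   # each: Callable[[Entry], bool]

    def propose(self, agent: str, action: str) -> Entry:
        entry = Entry(agent, action)
        self.entries.append(entry)               # visible before execution
        return entry

    def execute(self, entry: Entry, run: Callable[[str], None]) -> bool:
        if any(veto(entry) for veto in self.voters):
            entry.vetoed = True                  # stopped prior to execution
            return False
        run(entry.action)
        entry.executed = True
        return True

log = SharedLog(voters=[lambda e: "delete" in e.action])
done = []
for action in ["create report", "delete database"]:
    entry = log.propose("agent-1", action)
    log.execute(entry, done.append)
print(done)   # only the safe action runs
```

Because every action lands in the log before running, the log doubles as an audit trail across asynchrony and failures.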

📄 A Decomposition Perspective to Long-context Reasoning for LLMs
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.07981v1
👥 Authors: Yanling Xiao, Huaibing Xie, Guoliang Zhao, Shihan Dou, Shaolei Wang, Yiting Liu, Nantao Zheng, Cheng Zhang, Pluto Zhou, Zhisong Zhang (possible past Shanghai Jiao Tong University affiliation), Lemao Liu (possible past Tencent (China) affiliation)
Abstract

Long-context reasoning is essential for complex real-world applications, yet remains a significant challenge for Large Language Models (LLMs). Despite the rapid evolution in long-context reasoning, current research often overlooks the internal complexity of the long-context reasoning task itself. In this paper, we move beyond this holistic view and decompose long-context reasoning into a set of fundamental atomic skills, and we then automatically synthesize a suite of pseudo datasets, each expli...

📄 How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.07973v1
👥 Authors: Baining Zhao, Ziyou Wang, Jianjie Fang, Zile Zhou, Yanggang Xu, Yatai Ji, Jiacheng Xu, Qian Zhang (possible past University Of Washington affiliation), Weichen Zhang, Chen Gao, Xinlei Chen (possible past Tsinghua University affiliation)
Abstract

Large multimodal models (LMMs) show strong visual-linguistic reasoning, but their capacity for spatial decision-making and action remains unclear. In this work, we investigate whether LMMs can achieve embodied spatial action like humans through a challenging scenario: goal-oriented navigation in urban 3D spaces. We first spend over 500 hours constructing a dataset comprising 5,037 high-quality goal-oriented navigation samples, with an emphasis on 3D vertical actions and rich urban semantic informa...

📄 WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.07957v1
👥 Authors: Hongjin Chen, Shangyun Jiang, Tonghua Su, Chen Gao, Xinlei Chen (possible past Tsinghua University affiliation), Yong Li (possible past Tsinghua University affiliation), Zhibo Chen (possible past Tencent (China) affiliation)
Abstract

Vision-language models (VLMs) and generative world models are opening new opportunities for embodied navigation. VLMs are increasingly used as direct planners or trajectory predictors, while world models support look-ahead reasoning by imagining future views. Yet predicting a reliable trajectory from a single egocentric observation remains challenging. Current VLMs often generate unstable trajectories, and world models, though able to synthesize plausible futures, do not directly provide the gro...

📄 Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.07914v1
👥 Authors: Yuanhong Zhang, Zhaoyang Wang, Xin Zhang (possible past Google (United States) affiliation), Weizhan Zhang, Joey Tianyi Zhou (possible past Tencent (China) affiliation)
Abstract

Large Vision-Language Models (LVLMs) have achieved remarkable success across cross-modal tasks but remain hindered by hallucinations, producing textual outputs inconsistent with visual content. Existing methods mitigate hallucinations but often alter generation behavior, resulting in shorter outputs and shifted token distributions, especially in latent space steering approaches. We identify that this issue stems from entangled steering signals, where suppressing hallucinations inadvertently disr...

📄 Data Selection for Multi-turn Dialogue Instruction Tuning
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.07892v1
👥 Authors: Bo Li (possible past Tencent (China) affiliation), Shikun Zhang, Wei Ye (possible past Meta (United States) affiliation)
Abstract

Instruction-tuned language models increasingly rely on large multi-turn dialogue corpora, but these datasets are often noisy and structurally inconsistent, with topic drift, repetitive chitchat, and mismatched answer formats across turns. We address this from a data selection perspective and propose MDS (Multi-turn Dialogue Selection), a dialogue-level framework that scores whole conversations rather than isolated turns. MDS combines a global coverage stage that performs bin-wise select...
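
A minimal sketch of what dialogue-level, bin-wise selection could look like — the scoring function and binning rule below are placeholders, not the paper's MDS:

```python
# Score whole conversations, bucket them (here, by turn count), and keep the
# top-scored dialogue per bin so the kept set covers the length distribution.
from collections import defaultdict

def score(dialogue):
    # Placeholder quality score: penalize repeated turns (chitchat proxy).
    turns = [t["text"] for t in dialogue]
    return len(set(turns)) / len(turns)

def binwise_select(dialogues, per_bin=1):
    bins = defaultdict(list)
    for d in dialogues:
        bins[len(d)].append(d)               # bin by number of turns
    selected = []
    for _, group in sorted(bins.items()):
        group.sort(key=score, reverse=True)
        selected.extend(group[:per_bin])     # best dialogues in each bin
    return selected

short_good = [{"text": "hi"}, {"text": "how do I reset my password?"}]
short_bad = [{"text": "hi"}, {"text": "hi"}]
long_d = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
kept = binwise_select([short_bad, short_good, long_d])
print(len(kept))   # one dialogue kept per bin
```

Scoring at the dialogue level (rather than per turn) is what lets a selector catch cross-turn problems like topic drift.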

📄 QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training–Inference Mismatch
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.07853v1
👥 Authors: Hao Gu, Hao Wang (possible past Tsinghua University affiliation), Jiacheng Liu, Lujun Li, Qiyuan Zhu, Bei Liu, Binxing Xu, Lei Wang (possible past Baidu (China) affiliation), Xintong Yang, Sida Lin, Sirui Han, Yike Guo
Abstract

Large language model (LLM) reinforcement learning (RL) pipelines are often bottlenecked by rollout generation, making end-to-end training slow. Recent work mitigates this by running rollouts with quantization to accelerate decoding, which is the most expensive stage of the RL loop. However, these setups destabilize optimization by amplifying the training-inference gap: rollouts run at low precision, while learning updates are computed at full precision. To address this challenge, we pro...

📄 SPARD: Self-Paced Curriculum for RL Alignment via Integrating Reward Dynamics and Data Utility
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.07837v1
👥 Authors: Xuyang Zhi, Peilun Zhou, Chengqiang Lu, Hang Lv, Yiwei Liang, Rongyang Zhang, Yan Gao, Yi Wu (possible past University Of California, Berkeley affiliation), Yao Hu, Hongchao Gu, Defu Lian, Hao Wang (possible past Tsinghua University affiliation), Enhong Chen (possible past Baidu (China) affiliation)
Abstract

The evolution of Large Language Models (LLMs) is shifting the focus from single, verifiable tasks toward complex, open-ended real-world scenarios, imposing significant challenges on the post-training phase. In these settings, the scale and complexity of reward systems have grown significantly, transitioning toward multi-objective formulations that encompass a comprehensive spectrum of model capabilities and application contexts. However, traditional methods typically rely on fixed reward weights...

📄 Lightweight LLM Agent Memory with Small Language Models
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.07798v1
👥 Authors: Jiaquan Zhang, Chaoning Zhang, Shuxu Chen, Zhenzhen Huang, Pengcheng Zheng, Zhicheng Wang (possible past Google (United States) affiliation), Ping Guo, Fan Mo, Sung-Ho Bae, Jie Zou, Jiwei Wei, Yang Yang (possible past Tencent (China) affiliation)
Abstract

Although LLM agents can leverage tools for complex tasks, they still need memory to maintain cross-turn consistency and accumulate reusable information in long-horizon interactions. However, retrieval-based external memory systems incur low online overhead but suffer from unstable accuracy due to limited query construction and candidate filtering. In contrast, many systems use repeated large-model calls for online memory operations, improving accuracy but accumulating latency over long interacti...

📄 Emotion Concepts and their Function in a Large Language Model
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.07729v1
👥 Authors: Nicholas Sofroniew, Isaac Kauvar, William Saunders, Runjin Chen, Tom Henighan (possible past Openai (United States) affiliation), Sasha Hydrie, Craig Citro (possible past Google (United States) affiliation), Adam Pearce (possible past Google (United States) affiliation), Julius Tarng, Wes Gurnee, Joshua Batson, Sam Zimmerman, Kelley Rivoire, Kyle Fish, Chris Olah (possible past Openai (United States) affiliation), Jack Lindsey
Abstract

Large language models (LLMs) sometimes appear to exhibit emotional reactions. We investigate why this is the case in Claude Sonnet 4.5 and explore implications for alignment-relevant behavior. We find internal representations of emotion concepts, which encode the broad concept of a particular emotion and generalize across contexts and behaviors it might be linked to. These representations track the operative emotion concept at a given token position in a conversation, activating in accordance wi...

📄 Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.07725v1
👥 Authors: Monishwaran Maheswaran, Leon Lakhani, Zhongzhu Zhou, Shijia Yang, Junxiong Wang, Coleman Hooper, Yuezhou Hu, Rishabh Tiwari, Jue Wang (possible past Tencent (China) affiliation), Harman Singh, Qingyang Wu, Yuqing Jian, Ce Zhang (possible past Eth Zurich affiliation), Kurt Keutzer (possible past University Of California, Berkeley affiliation), Tri Dao, Xiaoxia Wu, Ben Athiwaratkun, James Zou, Chenfeng Xu (possible past University Of California, Berkeley affiliation)
Abstract

We show that verifier-free evolution is bottlenecked by both diversity and efficiency: without external correction, repeated evolution accelerates collapse toward narrow modes, while the uniform use of a high-cost model wastes compute and quickly becomes economically impractical. We introduce Squeeze Evolve, a unified multi-model orchestration framework for verifier-free evolutionary inference. Our approach is guided by a simple principle: allocate model capability where it has the highest margi...

📄 How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles
🗓️ Published: 4/8/2026
🔗 http://arxiv.org/abs/2604.07650v1
👥 Authors: Chenchen Kuai, Jiwan Jiang, Zihao Zhu, Hao Wang (possible past Tsinghua University affiliation), Keshu Wu, Zihao Li, Yunlong Zhang, Chenxi Liu, Zhengzhong Tu (possible past Google (United States) affiliation), Zhiwen Fan, Yang Zhou
Abstract

The rapid growth of the large language model (LLM) ecosystem raises a critical question: are seemingly diverse models truly independent? Shared pretraining data, distillation, and alignment pipelines can induce hidden behavioral dependencies (latent entanglement) that undermine multi-model systems such as LLM-as-a-judge pipelines and ensemble verification, which implicitly assume independent signals. In practice, this manifests as correlated reasoning patterns and synchronized failures, where ap...
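
The audit idea can be illustrated generically: estimate how correlated two verifiers' errors are on a labeled audit set, then downweight entangled members when combining votes. The statistics and weighting rule below are illustrative assumptions, not the paper's framework.

```python
# Estimate error correlation between verifier pairs; identical mistakes give
# correlation 1.0 (entangled), disjoint mistakes give low or negative values.
import numpy as np

def error_correlation(errs_a, errs_b):
    # Correlation (phi coefficient) between binary error-indicator vectors.
    a, b = np.asarray(errs_a, float), np.asarray(errs_b, float)
    if a.std() == 0 or b.std() == 0:
        return 0.0
    return float(np.corrcoef(a, b)[0, 1])

labels = np.array([1, 1, 0, 0, 1, 0, 1, 0])
v1 = np.array([1, 1, 0, 1, 1, 0, 1, 0])   # one mistake (item 3)
v2 = np.array([1, 1, 0, 1, 1, 0, 1, 0])   # identical verdicts: entangled
v3 = np.array([1, 0, 0, 0, 1, 0, 1, 1])   # different mistakes (items 1, 7)

rho_12 = error_correlation(v1 != labels, v2 != labels)
rho_13 = error_correlation(v1 != labels, v3 != labels)

# Toy reweighting: shrink a verifier's vote by its positive error
# correlation with an already-included ensemble member.
w2 = 1.0 / (1.0 + max(rho_12, 0.0))
w3 = 1.0 / (1.0 + max(rho_13, 0.0))
print(round(rho_12, 2), round(rho_13, 2), w2, w3)
```

The clone (`v2`) adds no independent signal and gets its vote halved, while the genuinely different verifier (`v3`) keeps full weight.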

📄 Exponential quantum advantage in processing massive classical data
🗓️ Published: 4/8/2026
🔗 http://arxiv.org/abs/2604.07639v1
👥 Authors: Haimeng Zhao, Alexander Zlokapa (possible past Massachusetts Institute Of Technology affiliation), Hartmut Neven (possible past Google (United States) affiliation), Ryan Babbush (possible past Google (United States) affiliation), John Preskill, Jarrod R. Mcclean (possible past Google (United States) affiliation), Hsin-Yuan Huang (possible past Google (United States) affiliation)
Abstract

Broadly applicable quantum advantage, particularly in classical data processing and machine learning, has been a fundamental open problem. In this work, we prove that a small quantum computer of polylogarithmic size can perform large-scale classification and dimension reduction on massive classical data by processing samples on the fly, whereas any classical machine achieving the same prediction performance requires exponentially larger size. Furthermore, classical machines that are exponentiall...

📄 Learning is Forgetting: LLM Training As Lossy Compression
🗓️ Published: 4/8/2026
🔗 http://arxiv.org/abs/2604.07569v1
👥 Authors: Henry C. Conklin, Tom Hosking, Tan Yi-Chern, Julian Gold, Jonathan D. Cohen (possible past Deepmind (United Kingdom) affiliation), Thomas L. Griffiths (possible past University Of California, Berkeley affiliation), Max Bartolo, Seraphina Goldfarb-Tarrant
Abstract

Despite the increasing prevalence of large language models (LLMs), we still have a limited understanding of how their representational spaces are structured. This limits our ability to interpret how and what they learn or relate them to learning in humans. We argue LLMs are best seen as an instance of lossy compression, where over training they learn by retaining only information in their training data relevant to their objective(s). We show pre-training results in models that are optimally comp...

📄 SOLAR: Communication-Efficient Model Adaptation via Subspace-Oriented Latent Adapter Reparametrization
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.08368v1
👥 Authors: Seyed Mahmoud Sajjadi Mohammadabadi, Xiaolong Ma, Lei Yang (possible past Google (United States) affiliation), Feng Yan (possible past Meta (United States) affiliation), Junshan Zhang
Abstract

Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, enable scalable adaptation of foundation models by injecting low-rank adapters. However, their communication and storage costs remain a major bottleneck in resource-constrained settings. We propose SOLAR (Subspace-Oriented Latent Adapter Reparameterization), a post-training compression framework that substantially reduces the communication cost (i.e., the number of parameters to transmit or store) of PEFT adapters. SOLAR expresses eac...
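
As a generic illustration of post-hoc adapter compression (not SOLAR's actual reparameterization), a trained LoRA update B·A can be re-expressed at lower rank via truncated SVD, shrinking what must be transmitted:

```python
# Compress a trained LoRA delta = B @ A from rank r to rank k < r with a
# truncated SVD, halving the parameters to send in this toy configuration.
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8
B = rng.normal(size=(d, r))          # trained adapter factors
A = rng.normal(size=(r, d))
delta = B @ A                        # full update (d x d if sent naively)

k = 4                                # target rank after compression
U, S, Vt = np.linalg.svd(delta, full_matrices=False)
B_small = U[:, :k] * S[:k]           # d x k (singular values folded in)
A_small = Vt[:k]                     # k x d

sent_before = B.size + A.size
sent_after = B_small.size + A_small.size
rel_err = np.linalg.norm(delta - B_small @ A_small) / np.linalg.norm(delta)
print(sent_before, sent_after, round(float(rel_err), 2))
```

The truncated pair is the best rank-k approximation of the update in Frobenius norm, so the communication saving trades off directly against `rel_err`.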

📄 Bit-by-Bit: Progressive QAT Strategy with Outlier Channel Splitting for Stable Low-Bit LLMs
🗓️ Published: 4/9/2026
🔗 http://arxiv.org/abs/2604.07888v1
👥 Authors: Binxing Xu, Hao Gu, Lujun Li, Hao Wang (possible past Tsinghua University affiliation), Bei Liu, Jiacheng Liu, Qiyuan Zhu, Xintong Yang, Chao Li (possible past Baidu (China) affiliation), Sirui Han, Yike Guo
Abstract

Training LLMs at ultra-low precision remains a formidable challenge. Direct low-bit QAT often suffers from convergence instability and substantial training costs, exacerbated by quantization noise from heavy-tailed outlier channels and error accumulation across layers. To address these issues, we present Bit-by-Bit, a progressive QAT framework with outlier channel splitting. Our approach integrates three key components: (1) block-wise progressive training that reduces precision stage by stage, e...

📄 ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training
🗓️ Published: 4/8/2026
🔗 http://arxiv.org/abs/2604.07484v1
👥 Authors: Yu Liang, Liangxin Liu, Longzheng Wang, Yan Wang (possible past Tencent (China) affiliation), Yueyang Zhang (possible past Baidu (China) affiliation), Long Xia, Zhiyuan Sun, Daiting Shi
Abstract

Generative reward models (GRMs) have emerged as a promising approach for aligning Large Language Models (LLMs) with human preferences by offering greater representational capacity and flexibility than traditional scalar reward models. However, GRMs face two major challenges: reliance on costly human-annotated data restricts scalability, and self-training approaches often suffer from instability and vulnerability to reward hacking. To address these issues, we propose ConsistRM, a self-training fr...

📄 CMP: Robust Whole-Body Tracking for Loco-Manipulation via Competence Manifold Projection
🗓️ Published: 4/8/2026
🔗 http://arxiv.org/abs/2604.07457v1
👥 Authors: Ziyang Cheng, Haoyu Wei, Hang Yin, Xiuwei Xu, Bingyao Yu, Jie Zhou (possible past Tsinghua University affiliation), Jiwen Lu (possible past Tsinghua University affiliation)
Abstract

While decoupled control schemes for legged mobile manipulators have shown robustness, learning holistic whole-body control policies for tracking global end-effector poses remains fragile against Out-of-Distribution (OOD) inputs induced by sensor noise or infeasible user commands. To improve robustness against these perturbations without sacrificing task performance and continuity, we propose Competence Manifold Projection (CMP). Specifically, we utilize a Frame-Wise Safety Scheme that transforms...

📄 MoRight: Motion Control Done Right
🗓️ Published: 4/8/2026
🔗 http://arxiv.org/abs/2604.07348v1
👥 Authors: Shaowei Liu, Xuanchi Ren, Tianchang Shen, Huan Ling, Saurabh Gupta (possible past University Of California, Berkeley affiliation), Shenlong Wang (possible past University Of Toronto affiliation), Sanja Fidler (possible past University Of Toronto affiliation), Jun Gao (possible past Nvidia (United States) affiliation)
Abstract

Generating motion-controlled videos--where user-specified actions drive physically plausible scene dynamics under freely chosen viewpoints--demands two capabilities: (1) disentangled motion control, allowing users to separately control the object motion and adjust camera viewpoint; and (2) motion causality, ensuring that user-driven actions trigger coherent reactions from other objects rather than merely displacing pixels. Existing methods fall short on both fronts: they entangle camera and obje...

*Notable papers are those with at least two authors from a "big" AI/ML lab.