πŸ“„ Notable* Recent AI/ML arXiv Papers


πŸ“„ World Models Can Leverage Human Videos for Dexterous Manipulation
πŸ—“οΈ Published: 12/15/2025
πŸ”— http://arxiv.org/abs/2512.13644v1
πŸ‘₯ Authors: Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat (possible past Meta (United States) affiliation), Farshad Khorrami, Yann LeCun (possible past Meta (United States) affiliation)
Abstract

Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that pr...
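The core loop of a latent world model like the one described can be sketched in a few lines: predict the next latent state from the current latent and an action, and roll that prediction forward to imagine a trajectory. The linear dynamics below are a toy stand-in for DexWM's learned network, purely to illustrate the interface.

```python
# Toy latent-dynamics world model: next latent from current latent + action.
# The linear update z' = A @ z + B @ a stands in for a learned predictor.

def step(z, a, A, B):
    """One latent transition: z' = A @ z + B @ a (toy linear dynamics)."""
    dim = len(z)
    return [
        sum(A[i][j] * z[j] for j in range(dim)) +
        sum(B[i][k] * a[k] for k in range(len(a)))
        for i in range(dim)
    ]

def rollout(z0, actions, A, B):
    """Imagine a latent trajectory from a start state and an action sequence."""
    traj = [z0]
    for a in actions:
        traj.append(step(traj[-1], a, A, B))
    return traj

# 2-D latent, 1-D action, identity dynamics: actions accumulate into z[0].
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]
traj = rollout([0.0, 0.0], [[1.0], [1.0], [1.0]], A, B)
```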

πŸ“„ Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models
πŸ—“οΈ Published: 12/15/2025
πŸ”— http://arxiv.org/abs/2512.13607v1
πŸ‘₯ Authors: Boxin Wang, Chankyu Lee, Nayeon Lee, Sheng-Chieh Lin, Wenliang Dai, Yang Chen (possible past Tencent (China) affiliation), Yangyi Chen, Zhuolin Yang, Zihan Liu, Mohammad Shoeybi (possible past Nvidia (United States) affiliation), Bryan Catanzaro (possible past University Of California, Berkeley affiliation), Wei Ping (possible past Baidu (China) affiliation)
Abstract

Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoni...
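The scheduling idea behind domain-wise cascading can be sketched as follows: instead of mixing heterogeneous domains in one RL run, train on one homogeneous domain at a time, each phase with its own response-length budget. The domain names, budgets, and the `train_domain` stub are illustrative, not Nemotron-Cascade's actual recipe.

```python
# Sketch of cascaded domain-wise RL scheduling: sequential per-domain phases,
# each with its own hyperparameters, rather than one mixed-domain run.

def train_domain(name, max_len, steps):
    """Stand-in for one RL phase; returns a log entry per step."""
    return [(name, max_len, s) for s in range(steps)]

def cascade_rl(domains):
    """Run RL phases sequentially, one homogeneous domain per phase."""
    log = []
    for name, max_len, steps in domains:
        log.extend(train_domain(name, max_len, steps))
    return log

log = cascade_rl([("math", 8192, 2), ("code", 16384, 2), ("chat", 2048, 1)])
```

Keeping each phase homogeneous is what sidesteps the variance in response lengths and verification latency the abstract describes.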

πŸ“„ DP-CSGP: Differentially Private Stochastic Gradient Push with Compressed Communication
πŸ—“οΈ Published: 12/15/2025
πŸ”— http://arxiv.org/abs/2512.13583v1
πŸ‘₯ Authors: Zehan Zhu, Heng Zhao, Yan Huang (possible past Tencent (China) affiliation), Joey Tianyi Zhou (possible past Tencent (China) affiliation), Shouling Ji, Jinming Xu
Abstract

In this paper, we propose a Differentially Private Stochastic Gradient Push with Compressed communication (termed DP-CSGP) for decentralized learning over directed graphs. Different from existing works, the proposed algorithm is designed to maintain high model utility while ensuring both rigorous differential privacy (DP) guarantees and efficient communication. For general non-convex and smooth objective functions, we show that the proposed algorithm achieves a tight utility bound of $\mathcal{O...
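The two ingredients DP-CSGP combines can be sketched on a single local gradient: clip-and-noise for differential privacy, then top-k sparsification for compressed communication. The composition with gradient push over directed graphs is omitted, and the clip norm, noise scale, and k below are illustrative.

```python
import math, random

def privatize(grad, clip=1.0, sigma=0.5, rng=random.Random(0)):
    """Clip the gradient to L2 norm `clip`, then add Gaussian noise (DP step)."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip / max(norm, 1e-12))
    return [g * scale + rng.gauss(0.0, sigma * clip) for g in grad]

def top_k(vec, k):
    """Keep only the k largest-magnitude entries; zero the rest (compression)."""
    keep = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]

# Privatize a local gradient, then compress it before communication.
msg = top_k(privatize([3.0, -4.0, 0.1, 0.2]), k=2)
```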

πŸ“„ Memory in the Age of AI Agents
πŸ—“οΈ Published: 12/15/2025
πŸ”— http://arxiv.org/abs/2512.13564v1
πŸ‘₯ Authors: Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang (possible past Tsinghua University affiliation), Zhenfei Yin, Xiaobin Hu (possible past Tencent (China) affiliation), Yue Liao, Qiankun Li, Kun Wang, Wangchunshu Zhou, Yixin Liu, Dawei Cheng, Qi Zhang (possible past Tencent (China) affiliation), Tao Gui, Shirui Pan, Yan Zhang, Philip Torr (possible past University Of Oxford affiliation), Zhicheng Dou, Ji-Rong Wen, Xuanjing Huang, Yu-Gang Jiang, Shuicheng Yan (possible past National University Of Singapore affiliation)
Abstract

Memory has emerged as, and will continue to be, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Tradition...

πŸ“„ From User Interface to Agent Interface: Efficiency Optimization of UI Representations for LLM Agents
πŸ—“οΈ Published: 12/15/2025
πŸ”— http://arxiv.org/abs/2512.13438v1
πŸ‘₯ Authors: Dezhi Ran, Zhi Gong, Yuzhe Guo, Mengzhou Wu, Yuan Cao (possible past Google (United States) affiliation), Haochuan Lu, Hengyu Zhang, Xia Zeng (possible past Tencent (China) affiliation), Gang Cao, Liangchao Yao, Yuetang Deng (possible past Tencent (China) affiliation), Wei Yang (possible past Tencent (China) affiliation), Tao Xie
Abstract

While Large Language Model (LLM) agents show great potential for automated UI navigation such as automated UI testing and AI assistants, their efficiency has been largely overlooked. Our motivating study reveals that inefficient UI representation creates a critical performance bottleneck. However, UI representation optimization, formulated as the task of automatically generating programs that transform UI representations, faces two unique challenges. First, the lack of Boolean oracles, which tra...
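The task the abstract formulates — programs that transform UI representations — can be illustrated with a minimal transform that shrinks a UI tree before handing it to an agent, here by dropping invisible nodes. The node schema is hypothetical.

```python
# Toy UI-representation transform: recursively drop invisible nodes so the
# serialized tree an LLM agent consumes is smaller. Schema is illustrative.

def compress_ui(node):
    """Return a copy of the UI tree with invisible subtrees removed."""
    if not node.get("visible", True):
        return None
    children = [c for c in (compress_ui(ch) for ch in node.get("children", []))
                if c is not None]
    return {**node, "children": children}

ui = {"tag": "root", "visible": True, "children": [
    {"tag": "button", "visible": True, "children": []},
    {"tag": "tracker", "visible": False, "children": []},
]}
small = compress_ui(ui)
```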

πŸ“„ A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis
πŸ—“οΈ Published: 12/15/2025
πŸ”— http://arxiv.org/abs/2512.13164v1
πŸ‘₯ Authors: Xianchao Guan, Zhiyuan Fan, Yifeng Wang, Fuqiang Chen, Yanjiang Zhou, Zengyang Che, Hongxue Meng, Xin Li (possible past Google (United States) affiliation), Yaowei Wang, Hongpeng Wang, Min Zhang (possible past Tsinghua University affiliation), Heng Tao Shen, Zheng Zhang, Yongbing Zhang
Abstract

The development of clinical-grade artificial intelligence in pathology is limited by the scarcity of diverse, high-quality annotated datasets. Generative models offer a potential solution but suffer from semantic instability and morphological hallucinations that compromise diagnostic reliability. To address this challenge, we introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS), the first generative foundation model for pathology-specific text-to-image synthesis. B...

πŸ“„ SpeakRL: Synergizing Reasoning, Speaking, and Acting in Language Models with Reinforcement Learning
πŸ—“οΈ Published: 12/15/2025
πŸ”— http://arxiv.org/abs/2512.13159v1
πŸ‘₯ Authors: Emre Can Acikgoz, Jinoh Oh, Jie Hao (possible past Tencent (China) affiliation), Joo Hyuk Jeon, Heng Ji, Dilek Hakkani-TΓΌr (possible past Google (United States) affiliation), Gokhan Tur, Xiang Li, Chengyuan Ma, Xing Fan
Abstract

Effective human-agent collaboration is increasingly prevalent in real-world applications. Current trends in such collaborations are predominantly unidirectional, with users providing instructions or posing questions to agents, where agents respond directly without seeking necessary clarifications or confirmations. However, the evolving capabilities of these agents require more proactive engagement, where agents should dynamically participate in conversations to clarify user intents, resolve ambi...

πŸ“„ MAC: A Multi-Agent Framework for Interactive User Clarification in Multi-turn Conversations
πŸ—“οΈ Published: 12/15/2025
πŸ”— http://arxiv.org/abs/2512.13154v1
πŸ‘₯ Authors: Emre Can Acikgoz, Jinoh Oh, Joo Hyuk Jeon, Jie Hao (possible past Tencent (China) affiliation), Heng Ji, Dilek Hakkani-TΓΌr (possible past Google (United States) affiliation), Gokhan Tur, Xiang Li, Chengyuan Ma, Xing Fan
Abstract

Conversational agents often encounter ambiguous user requests, requiring an effective clarification to successfully complete tasks. While recent advancements in real-world applications favor multi-agent architectures to manage complex conversational scenarios efficiently, ambiguity resolution remains a critical and underexplored challenge--particularly due to the difficulty of determining which agent should initiate a clarification and how agents should coordinate their actions when faced with u...

πŸ“„ OXE-AugE: A Large-Scale Robot Augmentation of OXE for Scaling Cross-Embodiment Policy Learning
πŸ—“οΈ Published: 12/15/2025
πŸ”— http://arxiv.org/abs/2512.13100v1
πŸ‘₯ Authors: Guanhua Ji, Harsha Polavaram, Lawrence Yunliang Chen, Sandeep Bajamahal, Zehan Ma, Simeon Adebola, Chenfeng Xu (possible past University Of California, Berkeley affiliation), Ken Goldberg (possible past University Of California, Berkeley affiliation)
Abstract

Large and diverse datasets are needed for training generalist robot policies that have potential to control a variety of robot embodiments -- robot arm and gripper combinations -- across diverse tasks and environments. As re-collecting demonstrations and retraining for each new hardware platform are prohibitively costly, we show that existing robot data can be augmented for transfer and generalization. The Open X-Embodiment (OXE) dataset, which aggregates demonstrations from over 60 robot datase...

πŸ“„ GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
πŸ—“οΈ Published: 12/15/2025
πŸ”— http://arxiv.org/abs/2512.13043v1
πŸ‘₯ Authors: Tong Wei, Yijun Yang, Changhao Zhang, Junliang Xing (possible past Tsinghua University affiliation), Yuanchun Shi, Zongqing Lu, Deheng Ye (possible past Tencent (China) affiliation)
Abstract

Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR, which mat...
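The "merged checkpoint is a free teacher" idea rests on weight averaging: uniformly averaging the parameters of saved checkpoints yields a model that can supply guidance at no extra cost. A minimal sketch, with plain dicts standing in for model state dicts:

```python
# Uniform checkpoint averaging: the merged weights act as a free teacher.

def merge_checkpoints(checkpoints):
    """Average each parameter value across a list of checkpoint dicts."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}

ckpts = [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}]
teacher = merge_checkpoints(ckpts)
```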

πŸ“„ Investigating Data Pruning for Pretraining Biological Foundation Models at Scale
πŸ—“οΈ Published: 12/15/2025
πŸ”— http://arxiv.org/abs/2512.12932v1
πŸ‘₯ Authors: Yifan Wu (possible past Carnegie Mellon University affiliation), Jiyue Jiang, Xichen Ye, Yiqi Wang, Chang Zhou, Yitao Xu, Jiayang Chen, He Hu, Weizhong Zhang, Cheng Jin, Jiao Yuan, Yu Li (possible past Tencent (China) affiliation)
Abstract

Biological foundation models (BioFMs), pretrained on large-scale biological sequences, have recently shown strong potential in providing meaningful representations for diverse downstream bioinformatics tasks. However, such models often rely on millions to billions of training sequences and billions of parameters, resulting in prohibitive computational costs and significant barriers to reproducibility and accessibility, particularly for academic labs. To address these challenges, we investigate t...
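Score-based data pruning, the family of techniques the paper investigates, reduces to: score every training sequence, then keep only the top fraction. The scoring function below (sequence length as a proxy) is purely illustrative; the paper studies real pruning metrics for biological sequences.

```python
# Generic score-based pruning: rank the dataset by a score, keep the top slice.

def prune(dataset, score, keep_frac=0.5):
    """Keep the highest-scoring `keep_frac` of the dataset."""
    ranked = sorted(dataset, key=score, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_frac))
    return ranked[:n_keep]

seqs = ["ACGT", "AC", "ACGTACGT", "A"]
kept = prune(seqs, score=len, keep_frac=0.5)
```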

πŸ“„ Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring
πŸ—“οΈ Published: 12/12/2025
πŸ”— http://arxiv.org/abs/2512.12069v1
πŸ‘₯ Authors: Peichun Hua, Hao Li (possible past Tsinghua University affiliation), Shanghao Shi, Zhiyuan Yu, Ning Zhang (possible past University Of California, Berkeley affiliation)
Abstract

Large Vision-Language Models (LVLMs) are vulnerable to a growing array of multimodal jailbreak attacks, necessitating defenses that are both generalizable to novel threats and efficient for practical deployment. Many current strategies fall short, either targeting specific attack patterns, which limits generalization, or imposing high computational overhead. While lightweight anomaly-detection methods offer a promising direction, we find that their common one-class design tends to confuse novel ...
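The contrast the abstract draws with one-class anomaly detection can be sketched as a two-sided score: compare an input's hidden representation against references from both benign and jailbreak prompts, and flag inputs that land closer to the jailbreak side. The embeddings below are toy 2-D values, not the paper's actual representations.

```python
import math

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def contrastive_score(x, benign_refs, jailbreak_refs):
    """Positive score => closer to jailbreak references than to benign ones."""
    sim_jb = max(cos(x, r) for r in jailbreak_refs)
    sim_ok = max(cos(x, r) for r in benign_refs)
    return sim_jb - sim_ok

benign = [[1.0, 0.0]]
jailbreak = [[0.0, 1.0]]
score = contrastive_score([0.1, 0.9], benign, jailbreak)
```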

πŸ“„ Hold Onto That Thought: Assessing KV Cache Compression On Reasoning
πŸ—“οΈ Published: 12/12/2025
πŸ”— http://arxiv.org/abs/2512.12008v1
πŸ‘₯ Authors: Minghui Liu, Aadi Palnitkar, Tahseen Rabbani, Hyunwoo Jae, Kyle Rui Sang, Dixi Yao, Shayan Shabihi, Fuheng Zhao, Tian Li (possible past Carnegie Mellon University affiliation), Ce Zhang (possible past Eth Zurich affiliation), Furong Huang, Kunpeng Zhang
Abstract

Large language models (LLMs) have demonstrated remarkable performance on long-context tasks, but are often bottlenecked by memory constraints. Namely, the KV cache, which is used to significantly speed up attention computations, grows linearly with context length. A suite of compression algorithms has been introduced to alleviate cache growth by evicting unimportant tokens. However, several popular strategies are targeted towards the prefill phase, i.e., processing long prompt context, and their...
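Eviction-style compression, the mechanism under evaluation, can be sketched as: keep the cache entries whose tokens received the most cumulative attention and drop the rest. Real methods operate per head and layer; this toy keeps a flat cache under a fixed budget.

```python
# Attention-based KV cache eviction: retain the `budget` highest-scoring
# positions, preserving their original order in the sequence.

def evict(kv_cache, attn_scores, budget):
    """Keep the `budget` positions with the highest cumulative attention."""
    order = sorted(range(len(kv_cache)), key=lambda i: attn_scores[i],
                   reverse=True)
    keep = sorted(order[:budget])  # restore positional order after ranking
    return [kv_cache[i] for i in keep]

cache = ["the", "cat", "sat", "on", "mat"]
scores = [0.1, 0.9, 0.2, 0.05, 0.8]
compressed = evict(cache, scores, budget=3)
```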

πŸ“„ Particulate: Feed-Forward 3D Object Articulation
πŸ—“οΈ Published: 12/12/2025
πŸ”— http://arxiv.org/abs/2512.11798v1
πŸ‘₯ Authors: Ruining Li, Yuxin Yao, Chuanxia Zheng, Christian Rupprecht (possible past University Of Oxford affiliation), Joan Lasenby (possible past University Of Cambridge affiliation), Shangzhe Wu, Andrea Vedaldi (possible past University Of Oxford affiliation)
Abstract

We present Particulate, a feed-forward approach that, given a single static 3D mesh of an everyday object, directly infers all attributes of the underlying articulated structure, including its 3D parts, kinematic structure, and motion constraints. At its core is a transformer network, Part Articulation Transformer, which processes a point cloud of the input mesh using a flexible and scalable architecture to predict all the aforementioned attributes with native multi-joint support. We train the n...

πŸ“„ Image Diffusion Preview with Consistency Solver
πŸ—“οΈ Published: 12/15/2025
πŸ”— http://arxiv.org/abs/2512.13592v1
πŸ‘₯ Authors: Fu-Yun Wang, Hao Zhou, Liangzhe Yuan (possible past Google (United States) affiliation), Sanghyun Woo, Boqing Gong (possible past Tencent (China) affiliation), Bohyung Han (possible past Google (United States) affiliation), Ming-Hsuan Yang, Han Zhang (possible past Tsinghua University affiliation), Yukun Zhu (possible past Google (United States) affiliation), Ting Liu (possible past Google (United States) affiliation), Long Zhao
Abstract

The slow inference process of image diffusion models significantly degrades interactive user experiences. To address this, we introduce Diffusion Preview, a novel paradigm employing rapid, low-step sampling to generate preliminary outputs for user evaluation, deferring full-step refinement until the preview is deemed satisfactory. Existing acceleration methods, including training-free solvers and post-training distillation, struggle to deliver high-quality previews or ensure consistency between ...
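The Diffusion Preview control flow can be sketched independently of any particular solver: draw a cheap low-step sample first, and spend the full-step budget only once the user accepts the preview. `sample` is a stand-in for a diffusion sampler, and the step counts are illustrative.

```python
# Preview-then-refine control flow: cheap sample first, full run on acceptance.

def sample(prompt, steps):
    """Stand-in sampler: records how many denoising steps were spent."""
    return {"prompt": prompt, "steps": steps}

def preview_then_refine(prompt, accept, preview_steps=4, full_steps=50):
    preview = sample(prompt, preview_steps)
    if not accept(preview):
        return preview, None  # user rejected; full-step budget never spent
    return preview, sample(prompt, full_steps)

preview, final = preview_then_refine("a cat", accept=lambda p: True)
```

The paradigm only pays off if the preview is faithful to the full-step result, which is the consistency property the paper targets.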

πŸ“„ Towards Practical Large-scale Dynamical Heterogeneous Graph Embedding: Cold-start Resilient Recommendation
πŸ—“οΈ Published: 12/15/2025
πŸ”— http://arxiv.org/abs/2512.13120v1
πŸ‘₯ Authors: Mabiao Long, Jiaxi Liu, Yufeng Li, Hao Xiong (possible past Baidu (China) affiliation), Junchi Yan (possible past Shanghai Jiao Tong University affiliation), Kefan Wang, Yi Cao, Jiandong Ding (possible past Google (United States) affiliation)
Abstract

Deploying dynamic heterogeneous graph embeddings in production faces key challenges of scalability, data freshness, and cold-start. This paper introduces a practical, two-stage solution that balances deep graph representation with low-latency incremental updates. Our framework combines HetSGFormer, a scalable graph transformer for static learning, with Incremental Locally Linear Embedding (ILLE), a lightweight, CPU-based algorithm for real-time updates. HetSGFormer captures global structure with...

πŸ“„ PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations
πŸ—“οΈ Published: 12/15/2025
πŸ”— http://arxiv.org/abs/2512.13093v1
πŸ‘₯ Authors: Mingqi Yuan, Tao Yu (possible past University Of Washington affiliation), Haolin Song, Bo Li (possible past Tencent (China) affiliation), Xin Jin, Hua Chen, Wenjun Zeng
Abstract

Achieving efficient and robust whole-body control (WBC) is essential for enabling humanoid robots to perform complex tasks in dynamic environments. Despite the success of reinforcement learning (RL) in this domain, its sample inefficiency remains a significant challenge due to the intricate dynamics and partial observability of humanoid robots. To address this limitation, we propose PvP, a Proprioceptive-Privileged contrastive learning framework that leverages the intrinsic complementarity betwe...
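A contrastive objective of the kind PvP describes can be sketched with an InfoNCE loss that aligns each proprioceptive embedding with its paired privileged embedding across a batch. The 2-D embeddings below are toys; the encoders producing them are omitted.

```python
import math

def info_nce(prop, priv, temp=0.5):
    """Mean cross-entropy of matching each prop[i] to its pair priv[i]."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    loss = 0.0
    for i, z in enumerate(prop):
        logits = [dot(z, p) / temp for p in priv]
        log_den = math.log(sum(math.exp(lg) for lg in logits))
        loss += log_den - logits[i]
    return loss / len(prop)

# Correctly paired embeddings score a lower loss than shuffled ones.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```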

πŸ“„ Motus: A Unified Latent Action World Model
πŸ—“οΈ Published: 12/15/2025
πŸ”— http://arxiv.org/abs/2512.13030v1
πŸ‘₯ Authors: Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, Hongyan Zhao, Hanyu Liu, Zhizhong Su (possible past Baidu (China) affiliation), Lei Ma, Hang Su (possible past Tsinghua University affiliation), Jun Zhu (possible past Tsinghua University affiliation)
Abstract

While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) arch...

πŸ“„ Animus3D: Text-driven 3D Animation via Motion Score Distillation
πŸ—“οΈ Published: 12/14/2025
πŸ”— http://arxiv.org/abs/2512.12534v1
πŸ‘₯ Authors: Qi Sun (possible past Google (United States) affiliation), Can Wang (possible past Tsinghua University affiliation), Jiaxiang Shang, Wensen Feng, Jing Liao
Abstract

We present Animus3D, a text-driven 3D animation framework that generates motion field given a static 3D asset and text prompt. Previous methods mostly leverage the vanilla Score Distillation Sampling (SDS) objective to distill motion from pretrained text-to-video diffusion, leading to animations with minimal movement or noticeable jitter. To address this, our approach introduces a novel SDS alternative, Motion Score Distillation (MSD). Specifically, we introduce a LoRA-enhanced video diffusion m...
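The vanilla SDS objective that Motion Score Distillation replaces can be sketched as a single update: noise the current parameters, ask the diffusion model to predict the noise, and step against the disagreement between predicted and injected noise. The predictor below is a toy stand-in, with the timestep weighting folded into the learning rate.

```python
import random

def sds_step(theta, predict_noise, lr=0.1, rng=random.Random(0)):
    """One SDS-style update on a list of scalar parameters."""
    eps = [rng.gauss(0.0, 1.0) for _ in theta]          # injected noise
    noisy = [t + e for t, e in zip(theta, eps)]          # noised parameters
    eps_hat = predict_noise(noisy)                       # model's noise guess
    # grad ~ (eps_hat - eps); weighting w(t) folded into lr for simplicity
    return [t - lr * (eh - e) for t, eh, e in zip(theta, eps_hat, eps)]

# Toy predictor whose implied score pulls parameters toward zero.
new_theta = sds_step([1.0, -2.0], predict_noise=lambda x: x)
```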

*Notable papers are those with at least two authors from a "big" AI/ML lab.