📄 Notable* Recent AI/ML arXiv Papers


📄 ActionParty: Multi-Subject Action Binding in Generative Video Games
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.02330v1
👥 Authors: Alexander Pondaven, Ziyi Wu, Igor Gilitschenski (possible past Massachusetts Institute Of Technology affiliation), Philip Torr (possible past University Of Oxford affiliation), Sergey Tulyakov, Fabio Pizzati, Aliaksandr Siarohin
Abstract

Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action control...

📄 Steerable Visual Representations
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.02327v1
👥 Authors: Jona Ruthardt, Manu Gaur, Deva Ramanan (possible past Carnegie Mellon University affiliation), Makarand Tapaswi (possible past University Of Toronto affiliation), Yuki M. Asano (possible past University Of Oxford affiliation)
Abstract

Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their...

📄 Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.02324v1
👥 Authors: Daiwei Chen, Zhoutong Fu, Chengming Jiang, Haichao Zhang (possible past Baidu (China) affiliation), Ran Zhou, Tan Wang, Chunnan Yao, Guoyao Li, Rui Cai, Yihan Cao, Ruijie Jiang, Fedor Borisyuk (possible past Meta (United States) affiliation), Jianqiang Shen, Jingwei Wu, Ramya Korlakai Vinayak
Abstract

Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a deg...
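The "mean of existing vocabulary embeddings" initialization the abstract critiques can be sketched in a few lines. This is a generic illustration of that standard practice, not the paper's method; the array shapes are arbitrary.

```python
import numpy as np

# Illustrative shapes, not from the paper.
rng = np.random.default_rng(0)
vocab_size, dim, n_new = 1000, 64, 8

existing = rng.normal(size=(vocab_size, dim))   # pretrained embedding table
mean_vec = existing.mean(axis=0)                # mean of all existing rows

# Mean initialization: every new token starts at the exact same point...
new_tokens = np.tile(mean_vec, (n_new, 1))

# ...so the new tokens are initially indistinguishable from one another,
# which is the degenerate starting geometry the abstract refers to.
print(np.allclose(new_tokens[0], new_tokens[-1]))  # True
```

Fine-tuning then has to pull these identical vectors apart, which is exactly the behavior the paper's spectral and geometric diagnostics examine.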

📄 Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.02288v1
👥 Authors: Gengsheng Li, Tianyu Yang (possible past Tencent (China) affiliation), Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang (possible past Google (United States) affiliation), Jinqiao Wang, Tat-Seng Chua
Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement,...
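The "coarse credit assignment" of GRPO that the abstract contrasts with SDPO comes from its group-relative advantage: every rollout in a group is scored against the group's mean reward, so all failed rollouts get the same penalty. A minimal sketch of that normalization (the standard GRPO formula, not this paper's contribution):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Group-relative advantage as used in GRPO-style methods:
    normalize each rollout's reward by the group's mean and std."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group of 4 rollouts for one prompt, with binary verifiable rewards.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Both failed rollouts receive the identical negative advantage: the
# uniform, sequence-level penalty the abstract calls coarse.
print(adv[1] == adv[2] and adv[1] < 0)  # True
```

SDPO-style logit-level distillation instead supervises individual tokens, which is the finer-grained signal the paper routes samples between.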

📄 The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.02029v1
👥 Authors: Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang (possible past Tsinghua University affiliation), Chengming Xu, Yue Ma, Xiaobin Hu (possible past Tencent (China) affiliation), Zhe Cao (possible past University Of California, Berkeley affiliation), Jie Xu, Guibin Zhang, Jiale Tao, Jiayi Zhang, Siyuan Ma, Kaituo Feng, Haojie Huang, Youxing Li, Ronghao Chen, Huacan Wang, Chenglin Wu, Zikun Su, Xiaogang Xu, Kelu Yao, Kun Wang, Chen Gao, Yue Liao, Ruqi Huang, Tao Jin, Cheng Tan, Jiangning Zhang (possible past Tencent (China) affiliation), Wenqi Ren (possible past Tencent (China) affiliation), Yanwei Fu, Yong Liu, Yu Wang (possible past Tsinghua University affiliation), Xiangyu Yue (possible past University Of California, Berkeley affiliation), Yu-Gang Jiang, Shuicheng Yan (possible past National University Of Singapore affiliation)
Abstract

Latent space is rapidly emerging as a native substrate for language-based models. While modern systems are still commonly understood through explicit token-level generation, an increasing body of work shows that many critical internal processes are more naturally carried out in continuous latent space than in human-readable verbal traces. This shift is driven by the structural limitations of explicit-space computation, including linguistic redundancy, discretization bottlenecks, sequential ineff...

📄 ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.02022v1
👥 Authors: Yu Li (possible past Tencent (China) affiliation), Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang (possible past Tsinghua University affiliation), Jing Shao, Xia Hu, Dongrui Liu
Abstract

Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-level benchmark for structured, diverse, and realistic evaluation of agent safety. ATBench organizes age...

📄 Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.01989v1
👥 Authors: Boyang Gong, Yu Zheng, Fanye Kong, Jie Zhou (possible past Tsinghua University affiliation), Jiwen Lu (possible past Tsinghua University affiliation)
Abstract

Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require i...

📄 World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.01985v1
👥 Authors: Yuejiang Liu, Fan Feng, Lingjing Kong, Weifeng Lu, Jinzhou Tang, Kun Zhang (possible past Google (United States) affiliation), Kevin Murphy (possible past Google (United States) affiliation), Chelsea Finn (possible past University Of California, Berkeley affiliation), Yilun Du (possible past Massachusetts Institute Of Technology affiliation)
Abstract

General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning, which primarily focuses on optimal actions, a world model must be reliable over a much broader range of suboptimal actions, which are often insufficiently covered by action-labeled interaction data. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify...

📄 Captioning Daily Activity Images in Early Childhood Education: Benchmark and Algorithm
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.01941v1
👥 Authors: Sixing Li, Zhibin Gu, Ziqi Zhang (possible past Tencent (China) affiliation), Weiguo Pan, Bing Li, Ying Wang (possible past Tsinghua University affiliation), Hongzhe Liu
Abstract

Image captioning for Early Childhood Education (ECE) is essential for automated activity understanding and educational assessment. However, existing methods face two key challenges. First, the lack of large-scale, domain-specific datasets limits the model's ability to capture fine-grained semantic concepts unique to ECE scenarios, resulting in generic and imprecise descriptions. Second, conventional training paradigms exhibit limitations in enhancing professional object description capability, a...

📄 Scale over Preference: The Impact of AI-Generated Content on Online Content Ecology
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.01690v1
👥 Authors: Tianhao Shi, Yang Zhang (possible past Tsinghua University affiliation), Xiaoyan Zhao, Fengbin Zhu, Chenyi Lei, Han Li, Wenwu Ou, Yang Song (possible past Stanford University affiliation), Yongdong Zhang, Fuli Feng (possible past National University Of Singapore affiliation)
Abstract

The rapid proliferation of Artificial Intelligence-Generated Content (AIGC) is fundamentally restructuring online content ecologies, necessitating a rigorous examination of its behavioral and distributional implications. Leveraging a comprehensive longitudinal dataset comprising tens of millions of users from a leading Chinese video-sharing platform, this study elucidated the distinct creation and consumption behaviors characterizing AIGC versus Human-Generated Content (HGC). We identified a pre...

📄 GPA: Learning GUI Process Automation from Demonstrations
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.01676v1
👥 Authors: Zirui Zhao, Jun Hao Liew, Yan Yang (possible past Google (United States) affiliation), Wenzhuo Yang, Ziyang Luo, Doyen Sahoo, Silvio Savarese (possible past Stanford University affiliation), Junnan Li
Abstract

GUI Process Automation (GPA) is a lightweight but general vision-based form of Robotic Process Automation (RPA) that enables fast and stable process replay from only a single demo. Addressing the fragility of traditional RPA and the non-deterministic risks of current vision-language-model-based GUI agents, GPA introduces three core benefits: (1) Robustness, via Sequential Monte Carlo-based localization to handle rescaling and detection uncertainty; (2) Determinism and reliability, safeguarded by readi...

📄 Can Heterogeneous Language Models Be Fused?
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.01674v1
👥 Authors: Shilian Chen, Jie Zhou (possible past Tsinghua University affiliation), Qin Chen, Wen Wu, Xin Li (possible past Google (United States) affiliation), Qi Feng, Liang He
Abstract

Model merging aims to integrate multiple expert models into a single model that inherits their complementary strengths without incurring the inference-time cost of ensembling. Recent progress has shown that merging can be highly effective when all source models are "homogeneous", i.e., derived from the same pretrained backbone and therefore share aligned parameter coordinates or compatible task vectors. Yet this assumption is increasingly unrealistic in open model ecosystems, where useful e...

📄 ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.01664v1
👥 Authors: Yong Wu, Yanzhao Zheng, Tianze Xu, Zhentao Zhang, Yuanqiang Yu, Jihuai Zhu, Chao Ma (possible past Shanghai Jiao Tong University affiliation), Binbin Lin, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu (possible past Tencent (China) affiliation)
Abstract

LLM-based agents show strong potential for long-horizon reasoning, yet their context size is limited by deployment factors (e.g., memory, latency, and cost), yielding a constrained context budget. As interaction histories grow, this induces a trade-off between retaining past information and staying within the context limit. To address this challenge, we propose Budget-Aware Context Management (BACM), which formulates context management as a sequential decision problem with a context budget const...
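The retain-versus-budget trade-off the abstract describes can be made concrete with a naive baseline: evict the oldest turns until the history fits the budget. This is a hypothetical sketch of the problem setting only, not the paper's BACM policy, and the whitespace token count is a crude stand-in for a real tokenizer.

```python
def fit_to_budget(history, budget, keep_recent=2):
    """Naive baseline (not BACM): drop the oldest turns until the
    estimated token count fits the budget, always keeping the
    `keep_recent` most recent turns."""
    def tokens(turn):
        return len(turn.split())  # crude whitespace-based estimate

    kept = list(history)
    while len(kept) > keep_recent and sum(map(tokens, kept)) > budget:
        kept.pop(0)  # evict the oldest turn first
    return kept

history = ["tool result " * 30, "search step " * 20,
           "user asks X", "agent plans Y"]
trimmed = fit_to_budget(history, budget=20)
print(trimmed)  # only the two most recent turns survive
```

Framing context management instead as a sequential decision problem, as the abstract does, means learning *which* items to keep or compress rather than applying a fixed recency rule like this one.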

📄 MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.01600v1
👥 Authors: Zitian Tang, Xu Zhang (possible past Tencent (China) affiliation), Jianbo Yuan, Yang Zou (possible past Carnegie Mellon University affiliation), Varad Gunjal, Songyao Jiang, Davide Modolo
Abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated promising capabilities in multimodal coding tasks such as chart-to-code generation. However, existing methods primarily rely on supervised fine-tuning (SFT), which requires the model to learn code patterns through chart-code pairs but does not expose the model to a code execution environment. Moreover, while self-correction through execution feedback offers a potential route to improve coding quality, even state-of-the-art MLLMs...

📄 SHOE: Semantic HOI Open-Vocabulary Evaluation Metric
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.01586v1
👥 Authors: Maja Noack, Qinqian Lei, Taipeng Tian, Bihan Dong, Robby T. Tan (possible past National University Of Singapore affiliation), Yixin Chen, John Young, Saijun Zhang, Bo Wang (possible past Tencent (China) affiliation)
Abstract

Open-vocabulary human-object interaction (HOI) detection is a step towards building scalable systems that generalize to unseen interactions in real-world scenarios and support grounded multimodal systems that reason about human-object relationships. However, standard evaluation metrics, such as mean Average Precision (mAP), treat HOI classes as discrete categorical labels and fail to credit semantically valid but lexically different predictions (e.g., "lean on couch" vs. "sit on couch"), limitin...
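The failure mode the abstract describes, exact-match metrics giving zero credit to semantically valid predictions, can be illustrated with a toy scorer. The `synonyms` table and the 0.8 partial score below are purely illustrative assumptions, not SHOE's actual (presumably embedding-based) formulation.

```python
def exact_match(pred, gold):
    """How discrete mAP-style matching scores a predicted HOI label."""
    return 1.0 if pred == gold else 0.0

def semantic_credit(pred, gold, synonyms):
    """Toy stand-in for a semantic metric: partial credit for
    predictions listed as semantically valid alternatives.
    (`synonyms` and 0.8 are illustrative, not SHOE's values.)"""
    if pred == gold:
        return 1.0
    return 0.8 if gold in synonyms.get(pred, ()) else 0.0

syn = {"lean on couch": ("sit on couch",)}
# The abstract's own example pair:
print(exact_match("lean on couch", "sit on couch"))           # 0.0
print(semantic_credit("lean on couch", "sit on couch", syn))  # 0.8
```

A real open-vocabulary metric would derive this graded credit from text embeddings rather than a hand-written synonym table.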

📄 LLM Agents as Social Scientists: A Human-AI Collaborative Platform for Social Science Automation
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.01520v1
👥 Authors: Lei Wang (possible past Baidu (China) affiliation), Yuanzi Li, Jinchao Wu, Heyang Gao, Xiaohe Bo, Xu Chen (possible past Tencent (China) affiliation), Ji-Rong Wen
Abstract

Traditional social science research often requires designing complex experiments across vast methodological spaces and depends on real human participants, making it labor-intensive, costly, and difficult to scale. Here we present S-Researcher, an LLM-agent-based platform that assists researchers in conducting social science research more efficiently and at greater scale by "siliconizing" both the research process and the participant pool. To build S-Researcher, we first develop YuLan-OneSim, a l...

📄 LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.02097v1
👥 Authors: Jiachun Jin, Zetong Zhou, Xiao Yang (possible past Tencent (China) affiliation), Hao Zhang (possible past Tencent (China) affiliation), Pengfei Liu, Jun Zhu (possible past Tsinghua University affiliation), Zhijie Deng
Abstract

Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necess...

📄 CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.01634v1
👥 Authors: Junyoung Sung, Seungwoo Lyu, Minjun Kim, Sumin An, Arsha Nagrani (possible past University Of Oxford affiliation), Paul Hongsuck Seo (possible past Google (United States) affiliation)
Abstract

Real-world reasoning often requires combining information across modalities, connecting textual context with visual cues in a multi-hop process. Yet, most multimodal benchmarks fail to capture this ability: they typically rely on single images or sets of images, where answers can be inferred from a single modality alone. This limitation is mirrored in the training data, where interleaved image-text content rarely enforces complementary, multi-hop reasoning. As a result, Vision-Language Models (VL...

📄 Efficient Equivariant Transformer for Self-Driving Agent Modeling
🗓️ Published: 4/1/2026
🔗 http://arxiv.org/abs/2604.01466v1
👥 Authors: Scott Xu, Dian Chen (possible past University Of California, Berkeley affiliation), Kelvin Wong, Chris Zhang, Kion Fallah, Raquel Urtasun (possible past University Of Toronto affiliation)
Abstract

Accurately modeling agent behaviors is an important task in self-driving. It is also a task with many symmetries, such as equivariance to the order of agents and objects in the scene or equivariance to arbitrary roto-translations of the entire scene as a whole; i.e., SE(2)-equivariance. The transformer architecture is a ubiquitous tool for modeling these symmetries. While standard self-attention is inherently permutation equivariant, explicit pairwise relative positional encodings have been the ...
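The permutation equivariance of self-attention mentioned in the abstract is easy to verify numerically: permuting the input rows permutes the output rows identically. The sketch below uses identity Q/K/V projections purely to exhibit the symmetry; it is not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Plain single-head self-attention with identity projections,
    just to demonstrate the permutation symmetry."""
    A = softmax(X @ X.T / np.sqrt(X.shape[1]))
    return A @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))      # 5 agents/tokens, feature dim 8
perm = rng.permutation(5)

# Permuting the agents then attending == attending then permuting.
out1 = self_attention(X[perm])
out2 = self_attention(X)[perm]
print(np.allclose(out1, out2))   # True
```

SE(2)-equivariance (to roto-translations of the whole scene) is the harder symmetry; it does not come for free from attention, which is why relative positional encodings or equivariant layers are needed.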

📄 Descending into the Modular Bootstrap
🗓️ Published: 4/1/2026
🔗 http://arxiv.org/abs/2604.01275v1
👥 Authors: Nathan Benjamin, A. Liam Fitzpatrick, Wei Li (possible past Peking University affiliation), Jesse Thaler (possible past Massachusetts Institute Of Technology affiliation)
Abstract

In this paper, we attempt to explore the landscape of two-dimensional conformal field theories (2d CFTs) by efficiently searching for numerical solutions to the modular bootstrap equation using machine-learning-style optimization. The torus partition function of a 2d CFT is fixed by the spectrum of its primary operators and its chiral algebra, which we take to be the Virasoro algebra with $c>1$. We translate the requirement that this partition function is modular invariant into a loss function, ...

📄 Toward Personalized Darts Training: A Data-Driven Framework Based on Skeleton-Based Biomechanical Analysis and Motion Modeling
🗓️ Published: 4/1/2026
🔗 http://arxiv.org/abs/2604.01130v2
👥 Authors: Zhantao Chen, Dongyi He, Jin Fang (possible past Baidu (China) affiliation), Xi Chen (possible past University Of California, Berkeley affiliation), Yishuo Liu, Xiaozhen Zhong, Xuejun Hu
Abstract

As sports training becomes more data-driven, traditional dart coaching based mainly on experience and visual observation is increasingly inadequate for high-precision, goal-oriented movements. Although prior studies have highlighted the importance of release parameters, joint motion, and coordination in dart throwing, most quantitative methods still focus on local variables, single-release metrics, or static template matching. These approaches offer limited support for personalized training and ...

📄 Do Phone-Use Agents Respect Your Privacy?
🗓️ Published: 4/1/2026
🔗 http://arxiv.org/abs/2604.00986v2
👥 Authors: Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang, Yiduo Guo, Ziniu Li, Chenxin Li, Jingyuan Hu, Shunian Chen, Tongxu Luo, Jiaxi Bi, Zeyu Qin, Shaobo Wang, Xin Lai, Pengyuan Lyu (possible past Tencent (China) affiliation), Junyi Li, Can Xu (possible past Google (United States) affiliation), Chengquan Zhang (possible past Baidu (China) affiliation), Han Hu, Ming Yan, Benyou Wang (possible past Tencent (China) affiliation)
Abstract

We study whether phone-use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy-compliant behavior is not operationalized for phone-use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy-respecting phone use as pe...

📄 Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants
🗓️ Published: 4/1/2026
🔗 http://arxiv.org/abs/2604.00842v1
👥 Authors: Deepak Nathani, Cheng Zhang, Chang Huan, Jiaming Shan, Yinfei Yang (possible past Google (United States) affiliation), Alkesh Patel, Zhe Gan (possible past Microsoft (United States) affiliation), William Yang Wang (possible past University Of California, Berkeley affiliation), Michael Saxon, Xin Eric Wang
Abstract

Proactive agents that anticipate user needs and autonomously execute tasks hold great promise as digital assistants, yet the lack of realistic user simulation frameworks hinders their development. Existing approaches model apps as flat tool-calling APIs, failing to capture the stateful and sequential nature of user interaction in digital environments and making realistic user simulation infeasible. We introduce Proactive Agent Research Environment (Pare), a framework for building and evaluating ...

*Notable papers are those with at least two authors from a "big" AI/ML lab.