📄 Notable* Recent AI/ML arXiv Papers

📄 IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation
🗓️ Published: 5/15/2026
🔗 http://arxiv.org/abs/2605.16258v1
👥 Authors: Yuqi Wu, Tianyu Hu, Wenzhao Zheng, Yuanhui Huang, Haowen Sun, Jie Zhou (possible past Tsinghua University affiliation), Jiwen Lu (possible past Tsinghua University affiliation)
Abstract

Reconstructing coherent 3D geometry and appearance from unposed multi-view images is a fundamental yet challenging problem in computer vision. Most existing visual geometry foundation models predict explicit geometry by regressing pixel-aligned pointmaps, often suffering from redundancy and limited geometric continuity. We propose IVGT, an Implicit Visual Geometry Transformer that implicitly models continuous and coherent geometry from pose-free multi-view images. This formulation learns a conti...

📄 Look Before You Leap: Autonomous Exploration for LLM Agents
🗓️ Published: 5/15/2026
🔗 http://arxiv.org/abs/2605.16143v1
👥 Authors: Ziang Ye, Wentao Shi, Yuxin Liu, Yu Wang (possible past Tsinghua University affiliation), Zhengzhou Cai, Yaorui Shi, Qi Gu, Xunliang Cai, Fuli Feng (possible past National University Of Singapore affiliation)
Abstract

Large language model based agents often fail in unfamiliar environments due to premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances...

📄 GenShield: Unified Detection and Artifact Correction for AI-Generated Images
🗓️ Published: 5/15/2026
🔗 http://arxiv.org/abs/2605.16122v1
👥 Authors: Zhipei Xu, Xuanyu Zhang, Youmin Xu, Qing Huang, Shen Chen, Taiping Yao (possible past Tencent (China) affiliation), Shouhong Ding (possible past Tencent (China) affiliation), Jian Zhang (possible past Tencent (China) affiliation)
Abstract

Diffusion-based image synthesis has made AI-generated images (AIGI) increasingly photorealistic, raising urgent concerns about authenticity in applications such as misinformation detection, digital forensics, and content moderation. Despite the substantial advances in AIGI detection, how to correct detected AI-generated images with visible artifacts and restore realistic appearance remains largely underexplored. Moreover, few existing works have established the connection between AIGI detection an...

📄 PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control
🗓️ Published: 5/15/2026
🔗 http://arxiv.org/abs/2605.15963v1
👥 Authors: Jingxuan Wei, Xi Bai, Shan Liu (possible past Tencent (China) affiliation), Caijun Jia, Zheng Sun, Xinglong Xu, Siyuan Li (possible past Tencent (China) affiliation), Linzhuang Sun, Bihui Yu, Conghui He (possible past Tsinghua University affiliation), Cheng Tan
Abstract

Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate ...

📄 Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation
🗓️ Published: 5/15/2026
🔗 http://arxiv.org/abs/2605.15913v1
👥 Authors: Shuaiyi Li, Zhisong Zhang (possible past Shanghai Jiao Tong University affiliation), Yan Wang (possible past Tencent (China) affiliation), Lei Zhu, Dongyang Ma, Chenlong Deng, Yang Deng, Wai Lam
Abstract

Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG). However, its broader application is hindered by two key challenges: the difficulty of segmenting input text into meaningful, self-contained blocks, and the inefficiency of existing block fine-tuning methods that risk degrading performance. To address these, we first construct...
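
As a rough illustration of the idea described in this abstract (blocks that cannot attend to one another), the sketch below builds a block-diagonal causal attention mask. It is not the paper's method: the function name `block_causal_mask` and the choice to combine block isolation with causal ordering are assumptions made for the example.

```python
from typing import List


def block_causal_mask(block_lengths: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Assumption for this sketch: attention is causal inside a block and
    fully disabled across blocks.
    """
    # Map each token position to the index of the block that contains it.
    block_of = [b for b, n in enumerate(block_lengths) for _ in range(n)]
    total = len(block_of)
    return [
        [block_of[i] == block_of[j] and j <= i for j in range(total)]
        for i in range(total)
    ]


# Two blocks of lengths 2 and 3 give a 5x5 mask.
mask = block_causal_mask([2, 3])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because no token sees another block, the keys and values computed for one block do not depend on what precedes it, which is what makes per-block KV cache reuse plausible.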

📄 Generative Long-term User Interest Modeling for Click-Through Rate Prediction
🗓️ Published: 5/15/2026
🔗 http://arxiv.org/abs/2605.15905v1
👥 Authors: Jiangli Shao, Kaifu Zheng, Hao Fang (possible past University Of Washington affiliation), Huimu Ye, Zhiwei Liu, Bo Zhang (possible past Tencent (China) affiliation), Shu Han, Xingxing Wang
Abstract

Modeling long-term user interests with massive historical user behaviors enhances click-through rate (CTR) prediction performance in advertising and recommendation systems. A two-stage framework is widely adopted, where a general search unit (GSU) first retrieves the top-$k$ behaviors most relevant to the target item, and an exact search unit (ESU) generates interest features via tailored attention. However, the current target-centered GSU ignores other latent user interests, leading to ...
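
The two-stage GSU/ESU pipeline mentioned in this abstract can be sketched as follows. This is a minimal toy version under stated assumptions: relevance is a plain dot product, the attention is a softmax over those scores, and the names `gsu_top_k` and `esu_attention` are hypothetical. It illustrates the conventional target-centered setup, not the paper's generative approach.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def gsu_top_k(target: Vector, behaviors: List[Vector], k: int) -> List[int]:
    """Stage 1: indices of the k behaviors most similar to the target item."""
    ranked = sorted(range(len(behaviors)), key=lambda i: -dot(target, behaviors[i]))
    return ranked[:k]


def esu_attention(target: Vector, behaviors: List[Vector], indices: List[int]) -> Vector:
    """Stage 2: softmax-weighted pooling of the retrieved behaviors."""
    scores = [dot(target, behaviors[i]) for i in indices]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    return [
        sum(w * behaviors[i][d] for w, i in zip(weights, indices))
        for d in range(len(target))
    ]


target = [1.0, 0.0]
behaviors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
top = gsu_top_k(target, behaviors, k=2)
interest = esu_attention(target, behaviors, top)
```

In this toy setup, behaviors unrelated to the target never reach the second stage, which is the limitation the abstract attributes to target-centered retrieval.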

📄 Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design
🗓️ Published: 5/15/2026
🔗 http://arxiv.org/abs/2605.15871v1
👥 Authors: Alberto Pepe, Chien-Yu Lin, Despoina Magka, Bilge Acun, Yannan Nellie Wu, Anton Protopopov, Carole-Jean Wu (possible past Meta (United States) affiliation), Yoram Bachrach (possible past Deepmind (United Kingdom) affiliation)
Abstract

Toward recursive self-improvement, we investigate LLM agents autonomously designing foundation models beyond standard Transformers. We introduce a dual-framework approach: AIRA-Compose for high-level architecture search, and AIRA-Design for low-level mechanistic implementation. AIRA-Compose uses 11 agents to explore fundamental computational primitives under a 24-hour budget. Agents evaluate million-parameter candidates, extrapolating top designs to 350M, 1B, and 3B scales. This yields 14 archit...

📄 RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades
🗓️ Published: 5/15/2026
🔗 http://arxiv.org/abs/2605.15846v1
👥 Authors: Xinbo Xu, Ruihan Yang, Haiyang Shen, Wendong Xu, Bofei Gao, Ruoyu Wu, Kean Shi, Weichu Xie, Xuanzhong Chen, Ming Wu (possible past Microsoft (United States) affiliation), Jason Zeng, Michael Heinrich, Elvis Zhang, Liang Chen (possible past Google (United States) affiliation), Kuan Li, Baobao Chang (possible past Peking University affiliation)
Abstract

Coding agents are increasingly deployed in real software development, where a single version iteration requires months of coordinated work across many files. However, most existing benchmarks focus predominantly on single-issue bug fixes from Python repositories, with coarse pass/fail evaluation outcomes, and thus fail to capture long-horizon, multi-target development at real engineering scale. To address this gap, we present RoadmapBench, a benchmark of 115 long-horizon coding tasks grounded in...

📄 SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?
🗓️ Published: 5/15/2026
🔗 http://arxiv.org/abs/2605.15777v1
👥 Authors: Kean Shi, Zihang Li, Tianyi Ma, Zengji Tu, Jialong Wu, Xinbo Xu, Qingyao Yang, Ruoyu Wu, Weichu Xie, Ming Wu (possible past Microsoft (United States) affiliation), Jason Zeng, Michael Heinrich, Elvis Zhang, Liang Chen (possible past Google (United States) affiliation), Kuan Li, Baobao Chang (possible past Peking University affiliation)
Abstract

Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely on simplified settings, isolated tasks, or short-horizon interactions, making it difficult to assess capabilities of agents in realistic professional workflows. Software-as-a-Service (SaaS) environments are a natural choi...

📄 DeltaPrompts: Escaping the Zero-Delta Trap in Multimodal Distillation
🗓️ Published: 5/15/2026
🔗 http://arxiv.org/abs/2605.15532v1
👥 Authors: Jaehun Jung, Hyunwoo Kim, Brandon Cui, Ximing Lu, David Acuna (possible past University Of Toronto affiliation), Prithviraj Ammanabrolu, Yejin Choi (possible past Allen Institute For Artificial Intelligence affiliation)
Abstract

Distillation enables compact Vision-Language Models (VLMs) to obtain strong reasoning capabilities, yet the prompts driving this process are typically chosen via simple heuristics or aggregated from off-the-shelf datasets. We reveal a critical inefficiency in this approach: up to 69% of the prompts in standard chart / document reasoning datasets are effectively zero-delta, meaning the teacher and student already induce the exact same answer distribution. Training on these prompts provides minima...
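
To make the "zero-delta" notion concrete, here is a small sketch that compares empirical answer distributions sampled from a teacher and a student for one prompt. The truncated abstract does not give the paper's actual criterion, so the total variation distance, the tolerance `tol`, and the function names are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, List


def answer_distribution(answers: List[str]) -> Dict[str, float]:
    """Empirical distribution over the sampled final answers."""
    counts = Counter(answers)
    total = len(answers)
    return {answer: count / total for answer, count in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def is_zero_delta(teacher: List[str], student: List[str], tol: float = 1e-9) -> bool:
    """Flag a prompt when both models induce (nearly) the same answers."""
    gap = total_variation(answer_distribution(teacher), answer_distribution(student))
    return gap <= tol


same = is_zero_delta(["4", "4", "5"], ["5", "4", "4"])
different = is_zero_delta(["4", "4"], ["4", "5"])
```

A prompt flagged this way carries little signal for distillation, since the student already matches the teacher on it.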

📄 From Feedback Loops to Policy Updates: Reinforcement Fine-Tuning for LLM-Based Alpha Factor Discovery
🗓️ Published: 5/14/2026
🔗 http://arxiv.org/abs/2605.15412v1
👥 Authors: Lingzhe Zhang, Tong Jia, Yunpeng Zhai, Zixuan Xie, Chiming Duan, Minghua He, Philip S. Yu (possible past Tsinghua University affiliation), Ying Li (possible past Meta (United States) affiliation)
Abstract

Modern quantitative trading increasingly relies on systematic models to extract predictive signals from large-scale financial data, where alpha factor discovery plays a central role in transforming market observations into tradable signals. Recent LLM-based methods have shown promise in automating factor generation, but most of them still rely on prompt-level generation--evaluation--feedback loops for iterative optimization. As the loop becomes longer, repeatedly appended historical candidates a...

📄 HoloMotion-1 Technical Report
🗓️ Published: 5/14/2026
🔗 http://arxiv.org/abs/2605.15336v1
👥 Authors: Maiyue Chen, Kaihui Wang, Bo Zhang (possible past Tencent (China) affiliation), Xihan Ma, Zhiyuan Yang, Yi Ren, Qijun Huang, Zihao Zhu, Yucheng Wang, Zhizhong Su (possible past Baidu (China) affiliation)
Abstract

In this report, we present HoloMotion-1, a humanoid motion foundation model for zero-shot whole-body motion tracking. A key innovation of HoloMotion-1 is to scale control-policy training with a large-scale hybrid motion corpus, where video-reconstructed motions from in-the-wild videos provide the dominant source of motion diversity, while curated motion-capture and in-house motion data provide higher-fidelity supervision and deployment-oriented coverage. This data regime enables HoloMotion-1 to ...

📄 From I/O to Code with Discovery Agent
🗓️ Published: 5/14/2026
🔗 http://arxiv.org/abs/2605.15334v1
👥 Authors: Yihong Dong, Jiaru Qian, Haoran Zhang, Peixu Wang, Binhua Li, Zhi Jin (possible past Peking University affiliation), Yongbin Li, Ge Li (possible past Peking University affiliation), Xiaokang Yang (possible past Shanghai Jiao Tong University affiliation), Xue Jiang
Abstract

The automatic synthesis of a program from any form of specification is regarded as a holy grail of computer science. Fueled by LLMs, NL2Code has achieved tremendous success, yet the fundamentally more challenging task of synthesizing programs from input-output behavior, which we refer to as IO2Code, remains largely unsolved. Whereas NL2Code can exploit the semantic alignment between natural language and code acquired during pretraining, IO2Code requires recovering underlying principles from conc...

📄 PhysBrain 1.0 Technical Report
🗓️ Published: 5/14/2026
🔗 http://arxiv.org/abs/2605.15298v1
👥 Authors: Shijie Lian, Bin Yu, Xiaopeng Lin, Changti Wu, Hang Yuan, Xiaolin Hu (possible past Tsinghua University affiliation), Zhaolong Shen, Yuzhuo Miao, Haishan Liu, Yuxuan Tian, Yukun Shi, Cong Huang, Kai Chen (possible past Shanghai Jiao Tong University affiliation)
Abstract

Vision-language-action models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain VLMs. The...

📄 GQA-μP: The maximal parameterization update for grouped query attention
🗓️ Published: 5/14/2026
🔗 http://arxiv.org/abs/2605.15290v1
👥 Authors: Kyle R. Chickering, Huijuan Wang, Mengxi Wu, Alexander Moreno, Muhao Chen, Xuezhe Ma (possible past Carnegie Mellon University affiliation), Daria Soboleva, Joel Hestness, Zhengzhong Liu (possible past Tencent (China) affiliation), Eric Xing
Abstract

Hyperparameter transfer across model architectures dramatically reduces the amount of compute necessary for tuning large language models (LLMs). The maximal update parameterization (μP) ensures transfer through principled mathematical analysis but can be challenging to derive for new model architectures. Building on the spectral feature-learning view of Yang et al. (2023a), we make two advances. First, we promote spectral norm conditions on the weights from a heuristic to the definition of featu...

📄 VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction
🗓️ Published: 5/14/2026
🔗 http://arxiv.org/abs/2605.15186v1
👥 Authors: Kaixin Zhu, Yiwen Tang, Yifan Yang (possible past Tencent (China) affiliation), Renrui Zhang, Bohan Zeng, Ziyu Guo, Ruichuan An, Zhou Liu, Qizhi Chen, Delin Qu, Jaehong Yoon, Wentao Zhang (possible past Mila - Quebec Artificial Intelligence Institute affiliation)
Abstract

High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures, enabling the generation of complex environments in a single forward pass. However, despite their strong performance in static scene perception, these models remain limited in responding to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy, where individual views are edited independently and th...

📄 MeMo: Memory as a Model
🗓️ Published: 5/14/2026
🔗 http://arxiv.org/abs/2605.15156v1
👥 Authors: Ryan Wei Heng Quek, Sanghyuk Lee, Alfred Wei Lun Leong, Arun Verma, Alok Prakash, Nancy F. Chen, Bryan Kian Hsiang Low, Daniela Rus (possible past Massachusetts Institute Of Technology affiliation), Armando Solar-Lezama (possible past Massachusetts Institute Of Technology affiliation)
Abstract

Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In this paper, we introduce MeMo (Memory as a Model), a modular framework that encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged. Compared to existing methods...

📄 Pelican-Unified 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action
🗓️ Published: 5/14/2026
🔗 http://arxiv.org/abs/2605.15153v1
👥 Authors: Yi Zhang (possible past Google (United States) affiliation), Yinda Chen, Che Liu, Zeyuan Ding, Jin Xu (possible past Tencent (China) affiliation), Shilong Zou, Junwei Liao, Jiayu Hu, Xiancong Ren, Xiaopeng Zhang, Yechi Liu, Haoyuan Shi, Zecong Tang, Haosong Sun, Renwen Cui, Kuishu Wu, Wenhai Liu, Yang Xu, Yingji Zhang, Yidong Wang, Senkang Hu, Jinpeng Lu, Nga Teng Chan, Yechen Wu, Yong Dai, Jian Tang, Xiaozhu Ju
Abstract

We present Pelican-Unified 1.0, the first embodied foundation model trained according to the principle of unification. Pelican-Unified 1.0 uses a single VLM as a unified understanding module, mapping scenes, instructions, visual contexts, and action histories into a shared semantic space. The same VLM also serves as a unified reasoning module, autoregressively producing task-, action-, and future-oriented chains of thought in a single forward pass and projecting the final hidden state into a den...

📄 EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration
🗓️ Published: 5/14/2026
🔗 http://arxiv.org/abs/2605.15042v1
👥 Authors: Wuyang Li, Yang Gao (possible past Tencent (China) affiliation), Mariam Hassan, Lan Feng, Wentao Pan (possible past Tsinghua University affiliation), Po-Chien Luan, Alexandre Alahi
Abstract

We propose EverAnimate, an efficient post-training method for long-horizon animated video generation that preserves visual quality and character identity. Long-form animation remains challenging because highly dynamic human motion must be synthesized against relatively static environments, making chunk-based generation prone to accumulated drift: (i) low-level quality drift, such as progressive degradation of static backgrounds, and (ii) high-level semantic drift, such as inconsistent character ...

📄 SEED: Targeted Data Selection by Weighted Independent Set
🗓️ Published: 5/15/2026
🔗 http://arxiv.org/abs/2605.15691v1
👥 Authors: Yuan Zhang (possible past Google (United States) affiliation), Lifeng Guo, Junwen Pan, Chang Liu, Wenzhao Zheng, Kuan Cheng, Kurt Keutzer (possible past University Of California, Berkeley affiliation), Shanghang Zhang
Abstract

Data selection seeks to identify a compact yet informative subset from large-scale training corpora, balancing sample quality against collection diversity. We formulate this problem as a Weighted Independent Set (WIS) on a similarity graph, where nodes represent data samples weighted by influence, and edges connect semantically redundant pairs. This formulation naturally yields subsets that are simultaneously high-quality and diverse. However, two challenges arise in practice: naive node weights...
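
The Weighted Independent Set formulation in this abstract can be illustrated with a generic greedy heuristic: visit samples in order of decreasing weight and keep a sample only if none of its redundant neighbors has been kept. This is a standard baseline and not the paper's algorithm, which the truncated abstract does not describe. The function name and the toy weights are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def greedy_weighted_independent_set(
    weights: Dict[str, float], edges: List[Tuple[str, str]]
) -> Set[str]:
    """Greedy heuristic: heaviest node first, skipping neighbors of kept nodes."""
    neighbors: Dict[str, Set[str]] = {node: set() for node in weights}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    selected: Set[str] = set()
    # Break weight ties by node name so the result is deterministic.
    for node in sorted(weights, key=lambda n: (-weights[n], n)):
        if neighbors[node].isdisjoint(selected):
            selected.add(node)
    return selected


# Nodes are samples weighted by influence; edges link redundant pairs.
weights = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.4}
edges = [("a", "b"), ("c", "d")]
subset = greedy_weighted_independent_set(weights, edges)
```

The greedy pass gives no optimality guarantee in general, but it shows how the graph structure trades sample quality (weights) against diversity (no two kept samples share an edge).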

📄 CM-EVS: Sparse Panoramic RGB-D-Pose Data for Complete Scene Coverage
🗓️ Published: 5/15/2026
🔗 http://arxiv.org/abs/2605.15597v1
👥 Authors: Jiale Liu, Jungang Li, Jieming Yu, Xinglin Yu, Zihao Dongfang, Zongjian Ding, Kaifeng Ding, Yi Yang (possible past Baidu (China) affiliation), Lidong Chen, Yang Zou (possible past Carnegie Mellon University affiliation), Shunwen Bai, Jiahuan Zhang, Haoran Huang, Shan Huang, Yudong Gao, Mingjun Cheng
Abstract

Modern 3D visual learning relies on observations sampled from metric 3D assets, yet existing scans, meshes, point clouds, simulations, and reconstructions do not directly provide a sparse, comparable, and geometry-consistent panoramic training interface. Dense trajectories duplicate nearby views, source-specific rendering policies yield heterogeneous annotations, and sparse heuristics may miss important regions or introduce depth-inconsistent observations. We study how to convert 3D assets into ...

📄 $φ$-Balancing for Mixture-of-Experts Training
🗓️ Published: 5/14/2026
🔗 http://arxiv.org/abs/2605.15403v1
👥 Authors: Lizhang Chen, Jonathan Li, Qi Wang (possible past Tsinghua University affiliation), Runlong Liao, Shuozhe Li, Chen Liang, Ni Lao (possible past Google (United States) affiliation), Qiang Liu
Abstract

Mixture-of-Experts (MoE) models rely on balanced expert utilization to fully realize their scalability. However, existing load-balancing methods are largely heuristic and operate on noisy mini-batch assignment statistics, introducing bias relative to population-level objectives. We propose $φ$-balancing, a principled framework that directly targets population-level expert balance by minimizing a strictly convex, symmetric, and differentiable potential of the expected routing distribution. Using ...
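
As one example of a potential with the properties named in this abstract, the sketch below uses negative entropy, which is symmetric in its arguments and strictly convex on the probability simplex, so it is smallest at the uniform routing distribution. Choosing negative entropy is an assumption for illustration; the truncated abstract does not say which potential the paper uses, and the routing vectors are made up.

```python
import math
from typing import List


def negative_entropy(p: List[float]) -> float:
    """phi(p) = sum_i p_i * log(p_i), with 0 * log(0) treated as 0."""
    return sum(x * math.log(x) for x in p if x > 0.0)


balanced = [0.25, 0.25, 0.25, 0.25]  # hypothetical expected routing shares
skewed = [0.7, 0.1, 0.1, 0.1]
permuted = [0.1, 0.7, 0.1, 0.1]      # same shares, experts relabeled

phi_balanced = negative_entropy(balanced)
phi_skewed = negative_entropy(skewed)
phi_permuted = negative_entropy(permuted)
```

Here `phi_balanced` is lower than `phi_skewed`, and relabeling experts leaves the value unchanged, so minimizing such a potential pushes routing toward balance without favoring any particular expert.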

📄 Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
🗓️ Published: 5/14/2026
🔗 http://arxiv.org/abs/2605.15342v1
👥 Authors: Arsha Nagrani (possible past University Of Oxford affiliation), Jasper Uijilings, Shyamal Buch, Tobias Weyand (possible past Google (United States) affiliation), Sudheendra Vijayanarasimhan (possible past Google (United States) affiliation), Bo Hu, Ramin Mehran, David A Ross, Cordelia Schmid (possible past Google (United States) affiliation)
Abstract

Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing models provide only evaluation of the output (e.g. the answer to a question), without evaluation of intermediate reasoning steps, and most provide answers only in the text domain. We introduce Minerva-Ego, a benchmark for evaluating complex egocentric visual reasoning. We extend recent high-quality video data sources recorded from egocentric / embodied settings with a set of ...

📄 FFAvatar: Few-Shot, Feed-Forward, and Generalizable Avatar Reconstruction
🗓️ Published: 5/14/2026
🔗 http://arxiv.org/abs/2605.15320v1
👥 Authors: Thuan Hoang Nguyen, Jiahao Luo, Yinyu Nie, Hao Li (possible past Tsinghua University affiliation), Gordon Guocheng Qian, Jian Wang (possible past Baidu (China) affiliation)
Abstract

Avatar reconstruction has traditionally relied on per-subject optimization that requires hours of computation or on expensive preprocessing that limits scalability. We introduce FFAvatar, a generalizable feed-forward framework that reconstructs high-quality, animatable 3D Gaussian head avatars from few-shot unposed portrait images in seconds. FFAvatar fuses information from multiple source images into a unified canonical Gaussian representation through Multi-View Query-Former, which is animated ...

📄 Learning from Language Feedback via Variational Policy Distillation
🗓️ Published: 5/14/2026
🔗 http://arxiv.org/abs/2605.15113v1
👥 Authors: Yang Li (possible past Google (United States) affiliation), Erik Nijkamp, Semih Yavuz (possible past Google (United States) affiliation), Shafiq Rayhan Joty
Abstract

Reinforcement learning from verifiable rewards (RLVR) suffers from sparse outcome signals, creating severe exploration bottlenecks on complex reasoning tasks. Recent on-policy self-distillation methods attempt to address this by utilizing language feedback to generate dense, token-level supervision. However, these approaches rely on a fixed, passive teacher to interpret the feedback. As the student policy improves, the teacher's zero-shot assessment capabilities plateau, ultimately halting furth...

*Notable papers are those with at least two authors from a "big" AI/ML lab.