Recent Notable AI/ML arXiv Papers

📄 Particulate: Feed-Forward 3D Object Articulation

🗓️ Published: 12/12/2025
🔗 http://arxiv.org/abs/2512.11798v1
👥 Authors: Ruining Li, Yuxin Yao, Chuanxia Zheng, Christian Rupprecht (possible past University of Oxford affiliation), Joan Lasenby (possible past University of Cambridge affiliation), Shangzhe Wu, Andrea Vedaldi (possible past University of Oxford affiliation)

Abstract

We present Particulate, a feed-forward approach that, given a single static 3D mesh of an everyday object, directly infers all attributes of the underlying articulated structure, including its 3D parts, kinematic structure, and motion constraints. At its core is a transformer network, Part Articulation Transformer, which processes a point cloud of the input mesh using a flexible and scalable architecture to predict all the aforementioned attributes with native multi-joint support. We train the n...

📄 NeuralOGCM: Differentiable Ocean Modeling with Learnable Physics

🗓️ Published: 12/12/2025
🔗 http://arxiv.org/abs/2512.11525v1
👥 Authors: Hao Wu (possible past Tencent (China) affiliation), Yuan Gao (possible past Tencent (China) affiliation), Fan Xu, Fan Zhang, Guangliang Liu, Yuxuan Liang, Xiaomeng Huang

Abstract

High-precision scientific simulation faces a long-standing trade-off between computational efficiency and physical fidelity. To address this challenge, we propose NeuralOGCM, an ocean modeling framework that fuses differentiable programming with deep learning. At the core of NeuralOGCM is a fully differentiable dynamical solver, which leverages physics knowledge as its core inductive bias. The learnable physics integration captures large-scale, deterministic physical evolution, and transforms ke...

📄 VFMF: World Modeling by Forecasting Vision Foundation Model Features

🗓️ Published: 12/12/2025
🔗 http://arxiv.org/abs/2512.11225v1
👥 Authors: Gabrijel Boduljak, Yushi Lan, Christian Rupprecht (possible past University of Oxford affiliation), Andrea Vedaldi (possible past University of Oxford affiliation)

Abstract

Forecasting from partial observations is central to world modeling. Many recent methods represent the world through images, and reduce forecasting to stochastic video generation. Although such methods excel at realism and visual fidelity, predicting pixels is computationally intensive and not directly useful in many applications, as it requires translating RGB into signals useful for decision making. An alternative approach uses features from vision foundation models (VFMs) as world representati...

📄 Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

🗓️ Published: 12/11/2025
🔗 http://arxiv.org/abs/2512.10949v1
👥 Authors: Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen, Dongzhi Jiang, Junli Liu, Bohan Zeng, Haoming Song, Delin Qu, Tianyi Bai, Dan Xu (possible past University of Oxford affiliation), Wentao Zhang (possible past Mila - Quebec Artificial Intelligence Institute affiliation), Bin Zhao

Abstract

Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has been successfully extended to enhance 2D image generation recently. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation significantly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the fir...

📄 Mull-Tokens: Modality-Agnostic Latent Thinking

🗓️ Published: 12/11/2025
🔗 http://arxiv.org/abs/2512.10941v1
👥 Authors: Arijit Ray, Ahmed Abdelkader, Chengzhi Mao, Bryan A. Plummer, Kate Saenko (possible past IBM (United States) affiliation), Ranjay Krishna (possible past University of Washington affiliation), Leonidas Guibas (possible past Stanford University affiliation), Wen-Sheng Chu

Abstract

Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative -- Mull-Tokens -- modality-agnostic latent tokens pre-trained to hold...

📄 Any4D: Unified Feed-Forward Metric 4D Reconstruction

🗓️ Published: 12/11/2025
🔗 http://arxiv.org/abs/2512.10935v1
👥 Authors: Jay Karhade, Nikhil Keetha, Yuchen Zhang (possible past University of California, Berkeley affiliation), Tanisha Gupta, Akash Sharma, Sebastian Scherer, Deva Ramanan (possible past Carnegie Mellon University affiliation)

Abstract

We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler...

📄 BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models

🗓️ Published: 12/11/2025
🔗 http://arxiv.org/abs/2512.10932v1
👥 Authors: Shengao Wang, Wenqi Wang, Zecheng Wang, Max Whitton, Michael Wakeham, Arjun Chandra, Joey Huang, Pengyue Zhu, Helen Chen, David Li, Jeffrey Li, Shawn Li, Andrew Zagula, Amy Zhao, Andrew Zhu, Sayaka Nakamura, Yuki Yamamoto, Jerry Jun Yokono, Aaron Mueller, Bryan A. Plummer, Kate Saenko (possible past IBM (United States) affiliation), Venkatesh Saligrama, Boqing Gong (possible past Tencent (China) affiliation)

Abstract

Early children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, inf...

📄 The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

🗓️ Published: 12/11/2025
🔗 http://arxiv.org/abs/2512.10791v1
👥 Authors: Aileen Cheng, Alon Jacovi, Amir Globerson (possible past Google (United States) affiliation), Ben Golan, Charles Kwong, Chris Alberti (possible past Google (United States) affiliation), Connie Tao, Eyal Ben-David, Gaurav Singh Tomar, Lukas Haas, Yonatan Bitton, Adam Bloniarz (possible past Google (United States) affiliation), Aijun Bai, Andrew Wang (possible past University of California, Berkeley affiliation), Anfal Siddiqui, Arturo Bajuelos Castillo, Aviel Atias, Chang Liu, Corey Fry, Daniel Balle, Deepanway Ghosal, Doron Kukliansky, Dror Marcus, Elena Gribovskaya, Eran Ofek, Honglei Zhuang, Itay Laish, Jan Ackermann, Lily Wang, Meg Risdal, Megan Barnes, Michael Fink, Mohamed Amin (possible past Google (United States) affiliation), Moran Ambar, Natan Potikha, Nikita Gupta, Nitzan Katz, Noam Velan, Ofir Roval, Ori Ram, Polina Zablotskaia, Prathamesh Bang, Priyanka Agrawal, Rakesh Ghiya, Sanjay Ganapathy, Simon Baumgartner, Sofia Erell, Sushant Prakash (possible past Google (United States) affiliation), Thibault Sellam (possible past Google (United States) affiliation), Vikram Rao, Xuanhui Wang (possible past Meta (United States) affiliation), Yaroslav Akulov, Yulong Yang, Zhen Yang (possible past Tsinghua University affiliation), Zhixin Lai, Zhongru Wu, Anca Dragan, Avinatan Hassidim (possible past Google (United States) affiliation), Fernando Pereira (possible past Google (United States) affiliation), Slav Petrov (possible past Google (United States) affiliation), Srinivasan Venkatachary, Tulsee Doshi (possible past Google (United States) affiliation), Yossi Matias (possible past Google (United States) affiliation), Sasha Goldshtein, Dipanjan Das (possible past University of Washington affiliation)

Abstract

We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models' world knowledge...

📄 COMPARE: Clinical Optimization with Modular Planning and Assessment via RAG-Enhanced AI-OCT: Superior Decision Support for Percutaneous Coronary Intervention Compared to ChatGPT-5 and Junior Operators

🗓️ Published: 12/11/2025
🔗 http://arxiv.org/abs/2512.10702v1
👥 Authors: Wei Fang, Chiyao Wang, Wenshuai Ma, Hui Liu, Jianqiang Hu, Xiaona Niu, Yi Chu, Mingming Zhang, Jingxiao Yang, Dongwei Zhang, Zelin Li, Pengyun Liu, Jiawei Zheng, Pengke Zhang, Chaoshi Qin, Wangang Guo, Bin Wang, Yugang Xue, Wei Zhang (possible past Tsinghua University affiliation), Zikuan Wang, Rui Zhu, Yihui Cao, Quanmao Lu, Rui Meng, Yan Li (possible past Tencent (China) affiliation)

Abstract

Background: While intravascular imaging, particularly optical coherence tomography (OCT), improves percutaneous coronary intervention (PCI) outcomes, its interpretation is operator-dependent. General-purpose artificial intelligence (AI) shows promise but lacks domain-specific reliability. We evaluated the performance of CA-GPT, a novel large model deployed on an AI-OCT system, against that of the general-purpose ChatGPT-5 and junior physicians for OCT-guided PCI planning and assessment. Method...

📄 Rethinking Popularity Bias in Collaborative Filtering via Analytical Vector Decomposition

🗓️ Published: 12/11/2025
🔗 http://arxiv.org/abs/2512.10688v1
👥 Authors: Lingfeng Liu, Yixin Song, Dazhong Shen, Bing Yin, Hao Li (possible past Tsinghua University affiliation), Yanyong Zhang, Chao Wang (possible past Google (United States) affiliation)

Abstract

Popularity bias fundamentally undermines the personalization capabilities of collaborative filtering (CF) models, causing them to disproportionately recommend popular items while neglecting users' genuine preferences for niche content. While existing approaches treat this as an external confounding factor, we reveal that popularity bias is an intrinsic geometric artifact of Bayesian Pairwise Ranking (BPR) optimization in CF models. Through rigorous mathematical analysis, we prove that BPR system...
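For readers unfamiliar with the BPR objective this abstract refers to, here is a minimal sketch of the standard pairwise loss for a dot-product CF model. It is not the paper's method or analysis; the function name `bpr_loss` and the plain-list embeddings are assumptions made for illustration.

```python
import math


def bpr_loss(user_vec, pos_item_vec, neg_item_vec):
    """Standard BPR loss for one (user, preferred item, other item) triple.

    Hypothetical helper for illustration: embeddings are plain lists of
    floats and scores are dot products, as in matrix-factorization CF.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Score margin between the preferred item and the sampled negative item.
    margin = dot(user_vec, pos_item_vec) - dot(user_vec, neg_item_vec)
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    user = [1.0, 0.0]
    liked, other = [1.0, 0.0], [0.0, 1.0]
    # The loss is smaller when the preferred item already scores higher.
    print(bpr_loss(user, liked, other) < bpr_loss(user, other, liked))
```

Minimizing this loss over many sampled triples pushes each user's preferred items above the sampled negatives, which is the optimization the abstract analyzes.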

📄 Evaluating Gemini Robotics Policies in a Veo World Simulator

🗓️ Published: 12/11/2025
🔗 http://arxiv.org/abs/2512.10675v1
👥 Authors: Gemini Robotics Team, Coline Devin, Yilun Du (possible past Massachusetts Institute of Technology affiliation), Debidatta Dwibedi (possible past Google (United States) affiliation), Ruiqi Gao (possible past Google (United States) affiliation), Abhishek Jindal, Thomas Kipf (possible past Google (United States) affiliation), Sean Kirmani, Fangchen Liu, Anirudha Majumdar, Andrew Marmon, Carolina Parada (possible past Google (United States) affiliation), Yulia Rubanova, Dhruv Shah, Vikas Sindhwani (possible past Google (United States) affiliation), Jie Tan (possible past Google (United States) affiliation), Fei Xia (possible past Stanford University affiliation), Ted Xiao (possible past Google (United States) affiliation), Sherry Yang, Wenhao Yu, Allan Zhou

Abstract

Generative world models hold significant potential for simulating interactions with visuomotor policies in varied environments. Frontier video models can enable generation of realistic observations and environment interactions in a scalable and general manner. However, the use of video models in robotics has been limited primarily to in-distribution evaluations, i.e., scenarios that are similar to ones used to train the policy or fine-tune the base video model. In this report, we demonstrate tha...

📄 Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale

🗓️ Published: 12/11/2025
🔗 http://arxiv.org/abs/2512.10398v2
👥 Authors: Zhaodong Wang, Zhenting Qi, Sherman Wong, Nathan Hu, Samuel Lin, Jun Ge, Erwin Gao, Yining Yang, Ben Maurer, Wenlin Chen (possible past Meta (United States) affiliation), David Recordon, Yilun Du (possible past Massachusetts Institute of Technology affiliation), Minlan Yu, Ying Zhang (possible past Tencent (China) affiliation)

Abstract

Real-world AI software engineering demands coding agents that can reason over massive repositories, maintain durable memory across and within long sessions, and robustly coordinate complex toolchains at test time. Existing open-source coding agents provide transparency but frequently fall short when pushed to these industrial-scale workloads, while proprietary coding agents offer strong practical performance but limited extensibility, interpretability, and controllability. We present the Confuci...

📄 Translating Informal Proofs into Formal Proofs Using a Chain of States

🗓️ Published: 12/11/2025
🔗 http://arxiv.org/abs/2512.10317v2
👥 Authors: Ziyu Wang (possible past University of Oxford affiliation), Bowen Yang, Chenyi Li, Yuan Zhang (possible past Google (United States) affiliation), Shihao Zhou, Bin Dong, Zaiwen Wen

Abstract

We address the problem of translating informal mathematical proofs expressed in natural language into formal proofs in Lean4 under a constrained computational budget. Our approach is grounded in two key insights. First, informal proofs tend to proceed via a sequence of logical transitions - often implications or equivalences - without explicitly specifying intermediate results or auxiliary lemmas. In contrast, formal systems like Lean require an explicit representation of each proof state and th...

📄 MotionEdit: Benchmarking and Learning Motion-Centric Image Editing

🗓️ Published: 12/11/2025
🔗 http://arxiv.org/abs/2512.10284v1
👥 Authors: Yixin Wan, Lei Ke (possible past Tencent (China) affiliation), Wenhao Yu, Kai-Wei Chang, Dong Yu (possible past Tencent (China) affiliation)

Abstract

We introduce MotionEdit, a novel dataset for motion-centric image editing, the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically ...

📄 The 2025 Foundation Model Transparency Index

🗓️ Published: 12/11/2025
🔗 http://arxiv.org/abs/2512.10169v1
👥 Authors: Alexander Wan, Kevin Klyman, Sayash Kapoor, Nestor Maslej, Shayne Longpre (possible past Apple (United States) affiliation), Betty Xiong, Percy Liang (possible past Stanford University affiliation), Rishi Bommasani

Abstract

Foundation model developers are among the world's most important companies. As these companies become increasingly consequential, how do their transparency practices evolve? The 2025 Foundation Model Transparency Index is the third edition of an annual effort to characterize and quantify the transparency of foundation model developers. The 2025 FMTI introduces new indicators related to data acquisition, usage data, and monitoring and evaluates companies like Alibaba, DeepSeek, and xAI for the fi...

📄 What Kind of Reasoning (if any) is an LLM actually doing? On the Stochastic Nature and Abductive Appearance of Large Language Models

🗓️ Published: 12/10/2025
🔗 http://arxiv.org/abs/2512.10080v1
👥 Authors: Luciano Floridi (possible past Google (United States) affiliation), Jessica Morley (possible past University of Oxford affiliation), Claudio Novelli, David Watson

Abstract

This article looks at how reasoning works in current Large Language Models (LLMs) that function using the token-completion method. It examines their stochastic nature and their similarity to human abductive reasoning. The argument is that these LLMs create text based on learned patterns rather than performing actual abductive reasoning. When their output seems abductive, this is largely because they are trained on human-generated texts that include reasoning structures. Examples are used to show...

📄 Attacking and Securing Community Detection: A Game-Theoretic Framework

🗓️ Published: 12/12/2025
🔗 http://arxiv.org/abs/2512.11359v1
👥 Authors: Yifan Niu, Aochuan Chen, Tingyang Xu (possible past Tencent (China) affiliation), Jia Li (possible past Google (United States) affiliation)

Abstract

It has been demonstrated that adversarial graphs, i.e., graphs with imperceptible perturbations, can cause deep graph models to fail on classification tasks. In this work, we extend the concept of adversarial graphs to the community detection problem, which is more challenging. We propose novel attack and defense techniques for the community detection problem, with the objective of hiding targeted individuals from detection models and enhancing the robustness of community detection models, respectiv...

📄 OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification

🗓️ Published: 12/11/2025
🔗 http://arxiv.org/abs/2512.10756v1
👥 Authors: Zijian Wu, Lingkai Kong, Wenwei Zhang, Songyang Gao, Yuzhe Gu, Zhongrui Cai, Tianyou Ma, Yuhong Liu, Zhi Wang, Runyuan Ma, Guangyu Wang, Wei Li (possible past Peking University affiliation), Conghui He (possible past Tsinghua University affiliation), Dahua Lin, Kai Chen (possible past Shanghai Jiao Tong University affiliation)

Abstract

Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the ...

📄 Sharp Monocular View Synthesis in Less Than a Second

🗓️ Published: 12/11/2025
🔗 http://arxiv.org/abs/2512.10685v1
👥 Authors: Lars Mescheder, Wei Dong, Shiwei Li, Xuyang Bai, Marcel Santos, Peiyun Hu, Bruno Lecouat, Mingmin Zhen (possible past Apple (United States) affiliation), Amaël Delaunoy, Tian Fang (possible past Apple (United States) affiliation), Yanghai Tsin (possible past Apple (United States) affiliation), Stephan R. Richter, Vladlen Koltun

Abstract

We present SHARP, an approach to photorealistic view synthesis from a single image. Given a single photograph, SHARP regresses the parameters of a 3D Gaussian representation of the depicted scene. This is done in less than a second on a standard GPU via a single feedforward pass through a neural network. The 3D Gaussian representation produced by SHARP can then be rendered in real time, yielding high-resolution photorealistic images for nearby views. The representation is metric, with absolute s...

📄 Authority Backdoor: A Certifiable Backdoor Mechanism for Authoring DNNs

🗓️ Published: 12/11/2025
🔗 http://arxiv.org/abs/2512.10600v1
👥 Authors: Han Yang (possible past ETH Zurich affiliation), Shaofeng Li, Tian Dong, Xiangyu Xu (possible past Tsinghua University affiliation), Guangchi Liu, Zhen Ling

Abstract

Deep Neural Networks (DNNs), as valuable intellectual property, face unauthorized use. Existing protections, such as digital watermarking, are largely passive; they provide only post-hoc ownership verification and cannot actively prevent the illicit use of a stolen model. This work proposes a proactive protection scheme, dubbed "Authority Backdoor," which embeds access constraints directly into the model. In particular, the scheme utilizes a backdoor learning framework to intrinsically lock a m...
