πŸ“„ Notable* Recent AI/ML arXiv Papers


πŸ“„ Seeing Fast and Slow: Learning the Flow of Time in Videos
πŸ—“οΈ Published: 4/23/2026
πŸ”— http://arxiv.org/abs/2604.21931v1
πŸ‘₯ Authors: Yen-Siang Wu, Rundong Luo, Jingsen Zhu, Tao Tu (possible past Google (United States) affiliation), Ali Farhadi (possible past University Of Washington affiliation), Matthew Wallingford, Yu-Chiang Frank Wang, Steve Marschner, Wei-Chiu Ma
Abstract

How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to l...

πŸ“„ TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale
πŸ—“οΈ Published: 4/23/2026
πŸ”— http://arxiv.org/abs/2604.21889v1
πŸ‘₯ Authors: Jun Wang (possible past Tencent (China) affiliation), Ziyin Zhang, Rui Wang (possible past Tencent (China) affiliation), Hang Yu, Peng Di, Rui Wang (possible past Tencent (China) affiliation)
Abstract

Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end...

πŸ“„ Modulating Cross-Modal Convergence with Single-Stimulus, Intra-Modal Dispersion
πŸ—“οΈ Published: 4/23/2026
πŸ”— http://arxiv.org/abs/2604.21836v1
πŸ‘₯ Authors: Eghbal A. Hosseini, Brian Cheung (possible past University Of California, Berkeley affiliation), Evelina Fedorenko, Alex H. Williams (possible past Stanford University affiliation)
Abstract

Neural networks exhibit a remarkable degree of representational convergence across diverse architectures, training objectives, and even data modalities. This convergence is predictive of alignment with brain representation. A recent hypothesis suggests this arises from learning the underlying structure in the environment in similar ways. However, it is unclear how individual stimuli elicit convergent representations across networks. An image can be perceived in multiple ways and expressed differ...

πŸ“„ BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature
πŸ—“οΈ Published: 4/23/2026
πŸ”— http://arxiv.org/abs/2604.21508v1
πŸ‘₯ Authors: Jiaxian Yan, Jintao Zhu, Yuhang Yang, Qi Liu (possible past Tencent (China) affiliation), Kai Zhang, Zaixi Zhang, Xukai Liu, Boyan Zhang, Kaiyuan Gao, Jinchuan Xiao, Enhong Chen (possible past Baidu (China) affiliation)
Abstract

Protein-ligand bioactivity data published in the literature are essential for drug discovery, yet manual curation struggles to keep pace with rapidly growing literature. Automated bioactivity extraction remains challenging because it requires not only interpreting biochemical semantics distributed across text, tables, and figures, but also reconstructing chemically exact ligand structures (e.g., Markush structures). To address this bottleneck, we introduce BioMiner, a multi-modal extraction fram...

πŸ“„ MISTY: High-Throughput Motion Planning via Mixer-based Single-step Drifting
πŸ—“οΈ Published: 4/23/2026
πŸ”— http://arxiv.org/abs/2604.21489v1
πŸ‘₯ Authors: Yining Xing, Zehong Ke, Yiqian Tu, Zhiyuan Liu (possible past Tsinghua University affiliation), Wenhao Yu, Jianqiang Wang (possible past Tsinghua University affiliation)
Abstract

Multi-modal trajectory generation is essential for safe autonomous driving, yet existing diffusion-based planners suffer from high inference latency due to iterative neural function evaluations. This paper presents MISTY (Mixer-based Inference for Single-step Trajectory-drifting Yield), a high-throughput generative motion planner that achieves state-of-the-art closed-loop performance with pure single-step inference. MISTY integrates a vectorized Sub-Graph encoder to capture environment context, ...

πŸ“„ VARestorer: One-Step VAR Distillation for Real-World Image Super-Resolution
πŸ—“οΈ Published: 4/23/2026
πŸ”— http://arxiv.org/abs/2604.21450v1
πŸ‘₯ Authors: Yixuan Zhu, Shilin Ma, Haolin Wang, Ao Li, Yanzhe Jing, Yansong Tang, Lei Chen, Jiwen Lu (possible past Tsinghua University affiliation), Jie Zhou (possible past Tsinghua University affiliation)
Abstract

Recent advancements in visual autoregressive models (VAR) have demonstrated their effectiveness in image generation, highlighting their potential for real-world image super-resolution (Real-ISR). However, adapting VAR for ISR presents critical challenges. The next-scale prediction mechanism, constrained by causal attention, fails to fully exploit global low-quality (LQ) context, resulting in blurry and inconsistent high-quality (HQ) outputs. Additionally, error accumulation in the iterative pred...

πŸ“„ VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
πŸ—“οΈ Published: 4/23/2026
πŸ”— http://arxiv.org/abs/2604.21375v1
πŸ‘₯ Authors: Qijun Han, Haoqin Tu, Zijun Wang, Haoyue Dai, Yiyang Zhou, Nancy Lau, Alvaro A. Cardenas, Yuhui Xu, Ran Xu, Caiming Xiong (possible past Salesforce (United States) affiliation), Zeyu Zheng, Huaxiu Yao, Yuyin Zhou, Cihang Xie (possible past Google (United States) affiliation)
Abstract

Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish...
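The stop/recover/search structure described above can be sketched as a simple control loop. This is an illustrative reading of the abstract, not the paper's actual API: `run_agent`, the `"RECOVER"`/`"FINISH"` action names, and the loop-detection window are all our assumptions.

```python
# Hypothetical sketch of a stop/recover/search agent loop (names are
# illustrative, not VLAA-GUI's actual interface).
def run_agent(env, policy, verifier, max_steps=10, loop_window=3):
    """Run until the verifier confirms success or steps run out.

    Repeating the same action `loop_window` times triggers a recovery
    action instead of blindly retrying (the "repetitive loops" failure).
    """
    history = []
    for _ in range(max_steps):
        action = policy(env.observe(), history)
        # Recover: break out of repetitive loops before executing.
        if len(history) >= loop_window and all(
                a == action for a in history[-loop_window:]):
            action = "RECOVER"
        env.step(action)
        history.append(action)
        # Stop: only finish when success is verifiable from the UI state,
        # mirroring the mandatory Completeness Verifier at every finish.
        if action == "FINISH" and verifier(env.observe()):
            return True, history
    return False, history
```

The key design point is that `"FINISH"` alone never ends the episode; the verifier must confirm UI-observable success, which blocks the early-stopping failure mode.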

πŸ“„ The First Challenge on Remote Sensing Infrared Image Super-Resolution at NTIRE 2026: Benchmark Results and Method Overview
πŸ—“οΈ Published: 4/23/2026
πŸ”— http://arxiv.org/abs/2604.21312v1
πŸ‘₯ Authors: Kai Liu (possible past Baidu (China) affiliation), Haoyang Yue, Zeli Lin, Zheng Chen, Jingkai Wang, Jue Gong, Jiatong Li, Xianglong Yan, Libo Zhu, Jianze Li, Ziqing Zhang, Zihan Zhou, Xiaoyang Liu, Radu Timofte (possible past Eth Zurich affiliation), Yulun Zhang, Junye Chen, Zhenming Yan, Yucong Hong, Ruize Han, Song Wang, Li Pang, Heng Zhao, Xinqiao Wu, Deyu Meng, Xiangyong Cao, Weijun Yuan, Zhan Li, Zhanglu Chen, Boyang Yao, Yihang Chen, Yifan Deng, Zengyuan Zuo, Junjun Jiang, Saiprasad Meesiyawar, Sulocha Yatageri, Nikhil Akalwadi, Ramesh Ashok Tabib, Uma Mudenagudi, Jiachen Tu, Yaokun Shi, Guoyi Xu, Yaoxin Jiang, Cici Liu, Tongyao Mu, Qiong Cao, Yifan Wang (possible past Stanford University affiliation), Kosuke Shigematsu, Hiroto Shirono, Asuka Shin, Wei Zhou, Linfeng Li, Lingdong Kong, Ce Wang, Xingwei Zhong, Wanjie Sun, Dafeng Zhang, Hongxin Lan, Qisheng Xu, Mingyue He, Hui Geng, Tianjiao Wan, Kele Xu, Changjian Wang, Antoine Carreaud, Nicola Santacroce, Shanci Li, Jan Skaloud, Adrien Gressin
Abstract

This paper presents the NTIRE 2026 Remote Sensing Infrared Image Super-Resolution (x4) Challenge, one of the associated challenges of NTIRE 2026. The challenge aims to recover high-resolution (HR) infrared images from low-resolution (LR) inputs generated through bicubic downsampling with a x4 scaling factor. The objective is to develop effective models or solutions that achieve state-of-the-art performance for infrared image SR in remote sensing scenarios. To reflect the characteristics of infra...

πŸ“„ Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
πŸ—“οΈ Published: 4/23/2026
πŸ”— http://arxiv.org/abs/2604.21268v1
πŸ‘₯ Authors: Wenkai Wang, Xiyun Li, Hongcan Guo, Wenhao Yu, Tianqing Fang, Haitao Mi, Dong Yu (possible past Tencent (China) affiliation), Shengyu Zhang (possible past Tencent (China) affiliation)
Abstract

Graphical User Interface (GUI) grounding requires mapping natural language instructions to precise pixel coordinates. However, due to visually homogeneous elements and dense layouts, models typically grasp semantic intent yet struggle with achieving precise localization. While scaling sampling attempts (Pass@k) reveals potential gains, static self-consistency strategies derived from geometric clustering often yield limited improvements, as the model's predictions tend to be spatially dispersed. ...
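The "static self-consistency strategies derived from geometric clustering" that the abstract contrasts against can be sketched in a few lines; the radius value and the neighbor-counting rule are our assumptions, not the paper's.

```python
import numpy as np

# Minimal sketch of a geometric-clustering self-consistency baseline:
# sample k click predictions, then return the prediction with the most
# neighbors within a pixel radius (a crude spatial mode).
def cluster_vote(points, radius=10.0):
    pts = np.asarray(points, dtype=float)            # shape (k, 2)
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    support = (d <= radius).sum(axis=1)              # neighbors incl. self
    return tuple(pts[support.argmax()])
```

As the abstract notes, this helps little when predictions are spatially dispersed, since no cluster accumulates meaningful support.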

πŸ“„ On Reasoning Behind Next Occupation Recommendation
πŸ—“οΈ Published: 4/23/2026
πŸ”— http://arxiv.org/abs/2604.21204v1
πŸ‘₯ Authors: Shan Dong, Palakorn Achananuparp, Hieu Hien Mai, Lei Wang (possible past Baidu (China) affiliation), Yao Lu (possible past Google (United States) affiliation), Ee-Peng Lim
Abstract

In this work, we develop a novel reasoning approach to enhance the performance of large language models (LLMs) in future occupation prediction. In this approach, a reason generator first derives a "reason" for a user using his/her past education and career history. The reason summarizes the user's preference and is used as the input of an occupation predictor to recommend the user's next occupation. This two-step occupation prediction approach is, however, non-trivial as LLMs are not aligned w...
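The two-step pipeline above is simple to express as a composition; both stages here are toy stand-ins for LLM calls, and all names are hypothetical.

```python
# Two-step sketch of the reason-then-predict pipeline. Both stages stand in
# for LLM calls in the real system; the function names are illustrative.
def predict_next_occupation(history, reason_generator, occupation_predictor):
    reason = reason_generator(history)            # step 1: summarize preference
    return occupation_predictor(history, reason)  # step 2: recommend next role

# Toy stand-ins so the flow is runnable:
gen = lambda h: f"prefers roles building on {h[-1]}"
pred = lambda h, r: "senior " + h[-1] if "building on" in r else "unknown"
```

The non-trivial part the abstract alludes to is exactly what this sketch hides: making the generated reason genuinely useful to the second stage.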

πŸ“„ Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.21138v1
πŸ‘₯ Authors: Jiabao Ji, Yongchao Chen, Yang Zhang (possible past Tsinghua University affiliation), Ramana Rao Kompella, Chuchu Fan, Gaowen Liu, Shiyu Chang (possible past Tencent (China) affiliation)
Abstract

Multi-robot control in cluttered environments is a challenging problem that involves complex physical constraints, including robot-robot collisions, robot-obstacle collisions, and unreachable motions. Successful planning in such settings requires joint optimization over high-level task planning and low-level motion planning, as violations of physical constraints may arise from failures at either level. However, jointly optimizing task and motion planning is difficult due to the complex parameter...

πŸ“„ Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.21017v1
πŸ‘₯ Authors: Open-H-Embodiment Consortium: Nigel Nelson, Juo-Tung Chen, Jesse Haworth, Xinhao Chen, Lukas Zbinden, Dianye Huang, Alaa Eldin Abdelaal, Alberto Arezzo, Ayberk Acar, Farshid Alambeigi, Carlo Alberto Ammirati, Yunke Ao, Pablo David Aranda Rodriguez, Soofiyan Atar, Mattia Ballo, Noah Barnes, Federica Barontini, Filip Binkiewicz, Peter Black, Sebastian Bodenstedt, Leonardo Borgioli, Nikola Budjak, Benjamin CalmΓ©, Fabio Carrillo, Nicola Cavalcanti, Changwei Chen, Haoxin Chen, Sihang Chen, Qihan Chen, Zhongyu Chen, Ziyang Chen, Shing Shin Cheng, Meiqing Cheng, Min Cheng, Zih-Yun Sarah Chiu, Xiangyu Chu, Camilo Correa-Gallego, Giulio Dagnino, Anton Deguet, Jacob Delgado, Jonathan C. Delong, Kaizhong Deng, Alexander Dimitrakakis, Qingpeng Ding, Hao Ding, Giovanni Distefano, Daniel Donoho, Anqing Duan, Marco Esposito, Shane Farritor, Jad Fayad, Zahi Fayad, Mario Ferradosa, Filippo Filicori, Chelsea Finn (possible past University Of California, Berkeley affiliation), Philipp FΓΌrnstahl, Jiawei Ge, Stamatia Giannarou, Xavier Giralt Ludevid, Frederic Giraud, Aditya Amit Godbole, Ken Goldberg (possible past University Of California, Berkeley affiliation), Antony Goldenberg, Diego Granero Marana, Xiaoqing Guo, TamΓ‘s Haidegger, Evan Hailey, Pascal Hansen, Ziyi Hao, Kush Hari, Kengo Hayashi, Jonathon Hawkins, Shelby Haworth, Ortrun Hellig, S. Duke Herrell, Zhouyang Hong, Andrew Howe, Junlei Hu, Ria Jain, Mohammad Rafiee Javazm, Howard Ji, Rui Ji, Jianmin Ji, Zhongliang Jiang, Dominic Jones, Jeffrey Jopling, Britton Jordan, Ran Ju, Michael Kam, Luoyao Kang, Fausto Kang, Siddhartha Kapuria, Peter Kazanzides, Sonika Kiehler, Ethan Kilmer, Ji Woong Kim, PrzemysΕ‚aw Korzeniowski, Chandra Kuchi, Nithesh Kumar, Alan Kuntz, Federico Lavagno, Yu Chung Lee, Hao-Chih Lee, Hang Li (possible past Huawei Technologies (China) affiliation), Zhen Li (possible past Google (United States) affiliation), Xiao Liang, Xinxin Lin, Jinsong Lin, Chang Liu, Fei Liu, Pei Liu, Yun-Hui Liu, Wanli Liuchen, Eszter LukΓ‘cs, Sareena Mann, Miles Mannas, Brett Marinelli, Sabina Martyniak, Francesco Marzola, Lorenzo Mazza, Xueyan Mei, Maria Clara Morais, Luigi Muratore, Chetan Reddy Narayanaswamy, MichaΕ‚ NaskrΔ™t, David Navarro-Alarcon, Cyrus Neary, Chi Kit Ng, Christopher Nguan, David Noonan, Ki Hwan Oh, Tom Christian Olesch, Allison M. Okamura (possible past Stanford University affiliation), Justin Opfermann, Matteo Pescio, Doan Xuan Viet Pham, Tito Porras, Hongliang Ren, Ariel Rodriguez Jimenez, Ferdinando Rodriguez Y Baena, Septimiu E. Salcudean, Asmitha Sathya, Preethi Satish, Lalithkumar Seenivasan, Jiaqi Shao, Yiqing Shen, Yu Sheng, Lucy Xiaoyang Shi, Zoe SoulΓ©, Stefanie Speidel, Mingwu Su, Jianhao Su, Idris Sunmola, KristΓ³f TakΓ‘cs, Yunxi Tang, Patrick Thornycroft, Yu Tian, Jordan Thompson, Mehmet K. Turkcan, Mathias Unberath, Pietro Valdastri, Carlos Vives, Quan Vuong (possible past Google (United States) affiliation), Martin Wagner, Farong Wang, Wei Wang (possible past University Of Oxford affiliation), Lidian Wang, Chung-Pang Wang, Guankun Wang, Junyi Wang, Erqi Wang, Ziyi Wang, Tanner Watts, Wolfgang Wein, Yimeng Wu, Zijian Wu, Hongjun Wu, Luohong Wu, Jie Ying Wu, Junlin Wu, Victoria Wu, Kaixuan Wu, Mateusz WΓ³jcikowski, Yunye Xiao, Nan Xiao, Wenxuan Xie, Hao Yang (possible past Tencent (China) affiliation), Tianqi Yang, Yinuo Yang, Menglong Ye, Ryan S. Yeung, Nural Yilmaz, Chim Ho Yin, Michael Yip, Rayan Younis, Chenhao Yu, Sayem Nazmuz Zaman, Milos Zefran, Han Zhang (possible past Tsinghua University affiliation), Yuelin Zhang, Yidong Zhang, Yanyong Zhang, Xuyang Zhang (possible past Google (United States) affiliation), Yameng Zhang, Joyce Zhang, Ning Zhong, Peng Zhou (possible past Tencent (China) affiliation), Haoying Zhou, Xiuli Zuo, Nassir Navab, Mahdi Azizian, Sean D. Huver, Axel Krieger
Abstract

Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with s...

πŸ“„ Differentially Private Model Merging
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.20985v1
πŸ‘₯ Authors: Qichuan Yin, Manzil Zaheer (possible past Google (United States) affiliation), Tian Li (possible past Carnegie Mellon University affiliation)
Abstract

In machine learning applications, privacy requirements during inference or deployment time could change constantly due to varying policies, regulations, or user experience. In this work, we aim to generate a multitude of models to satisfy any target differential privacy (DP) requirement without additional training steps, given a set of existing models trained on the same dataset with different privacy/utility tradeoffs. We propose two post-processing techniques, namely random selection and linea...
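One plausible reading of "random selection" as a post-processing step can be sketched as follows; the function name and the eligibility rule are our assumptions, not the paper's exact method.

```python
import random

# Illustrative sketch (not the paper's exact technique): given checkpoints
# trained with different DP budgets, "random selection" releases one model
# whose epsilon does not exceed the requested budget. Because the choice
# depends only on the (public) epsilon values, the DP post-processing
# property means the released model still satisfies the target budget,
# with no additional training steps.
def select_model(checkpoints, target_eps, rng=random):
    """checkpoints: list of (epsilon, model) pairs; returns an eligible pair."""
    eligible = [(eps, m) for eps, m in checkpoints if eps <= target_eps]
    if not eligible:
        raise ValueError("no checkpoint satisfies the target budget")
    return rng.choice(eligible)
```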

πŸ“„ SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.20842v1
πŸ‘₯ Authors: Ruohan Liu, Shukang Yin, Tao Wang (possible past Stanford University affiliation), Dong Zhang (possible past Nvidia (United States) affiliation), Weiji Zhuang, Shuhuai Ren, Ran He, Caifeng Shan, Chaoyou Fu
Abstract

Paralinguistic cues are essential for natural human-computer interaction, yet their evaluation in Large Audio-Language Models (LALMs) remains limited by coarse feature coverage and the inherent subjectivity of assessment. To address these challenges, we introduce SpeechParaling-Bench, a comprehensive benchmark for paralinguistic-aware speech generation. It expands existing coverage from fewer than 50 to over 100 fine-grained features, supported by more than 1,000 English-Chinese parallel speech ...

πŸ“„ Convergent Evolution: How Different Language Models Learn Similar Number Representations
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.20817v1
πŸ‘₯ Authors: Deqing Fu, Tianyi Zhou (possible past University Of Washington affiliation), Mikhail Belkin, Vatsal Sharan, Robin Jia (possible past Stanford University affiliation)
Abstract

Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-$T$ spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-$T$. To explain this incongruity, w...
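The period-$T$ spikes described above can be probed with a simple Fourier check. The sketch below uses a synthetic period-10 feature in place of a real learned embedding; everything else is a generic spectral-peak test.

```python
import numpy as np

# Toy probe for period-T structure: build a feature over the integers
# 0..N-1 with period 10, take its FFT, and recover the dominant period.
# A real number embedding dimension would replace `feature`.
def dominant_period(feature):
    n = len(feature)
    spectrum = np.abs(np.fft.rfft(feature - feature.mean()))
    k = spectrum[1:].argmax() + 1   # skip the DC bin
    return n / k

n = np.arange(200)
feature = np.cos(2 * np.pi * n / 10)  # period-10, e.g. last-digit structure
```

Note that a spectral spike alone only establishes the first tier of the paper's hierarchy; linear separability of the mod-$T$ classes is a strictly stronger property.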

πŸ“„ OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.20806v1
πŸ‘₯ Authors: Qiguang Chen, Chengyu Luan, Jiajun Wu (possible past Massachusetts Institute Of Technology affiliation), Qiming Yu, Yi Yang (possible past Baidu (China) affiliation), Yizhuo Li, Jingqi Tong, Xiachong Feng, Libo Qin, Wanxiang Che
Abstract

Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematic...

πŸ“„ Decoupled Travel Planning with Behavior Forest
πŸ—“οΈ Published: 4/23/2026
πŸ”— http://arxiv.org/abs/2604.21354v1
πŸ‘₯ Authors: Duanyang Yuan, Sihang Zhou (possible past National University Of Defense Technology affiliation), Yanning Hou, Xiaoshu Chen, Haoyuan Chen, Ke Liang, Jiyuan Liu, Chuan Ma, Xinwang Liu (possible past National University Of Defense Technology affiliation), Jian Huang
Abstract

Behavior sequences, composed of executable steps, serve as the operational foundation for multi-constraint planning problems such as travel planning. In such tasks, each planning step is not only constrained locally but also influenced by global constraints spanning multiple subtasks, leading to a tightly coupled and complex decision process. Existing travel planning methods typically rely on a single decision space that entangles all subtasks and constraints, failing to distinguish between loca...

πŸ“„ Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression
πŸ—“οΈ Published: 4/23/2026
πŸ”— http://arxiv.org/abs/2604.21335v1
πŸ‘₯ Authors: Wei Jiang (possible past Apple (United States) affiliation), Wei Wang (possible past University Of Oxford affiliation)
Abstract

Sub-token routing offers a finer control axis for transformer efficiency than the coarse units used in most prior work, such as tokens, pages, heads, or layers. In this paper, we study routing within a token representation itself in LoRA-adapted transformers. The motivation is that a relevant token need not be internally uniform: under a retention budget, preserved value groups are distributed unevenly both across tokens and within tokens, which suggests that KV compression need not be an all-or...
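The observation that preserved value groups are distributed unevenly both across and within tokens can be made concrete with a global top-B selection; this is an illustrative sketch, not the paper's routing algorithm, and the scoring scheme is assumed.

```python
import numpy as np

# Sketch of sub-token retention: score every value group in every token,
# keep the top-B groups globally under the retention budget, drop the rest.
# The kept groups need not be uniform across tokens -- a "relevant" token
# may keep only some of its groups, matching the non-all-or-nothing view.
def retain_groups(scores, budget):
    """scores: (num_tokens, num_groups) importance; returns a boolean mask."""
    flat = scores.ravel()
    keep = np.zeros_like(flat, dtype=bool)
    keep[np.argsort(flat)[-budget:]] = True
    return keep.reshape(scores.shape)
```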

πŸ“„ Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.20472v1
πŸ‘₯ Authors: Shelly Francis-Meretzki, Mirco Mutti, Yaniv Romano (possible past Technion – Israel Institute Of Technology affiliation), Aviv Tamar (possible past University Of California, Berkeley affiliation)
Abstract

Recent advances in vision-language-action (VLA) models for robotics have highlighted the importance of reliable uncertainty quantification in sequential tasks. However, assessing and improving calibration in such settings remains mostly unexplored, especially when only partial trajectories are observed. In this work, we formulate sequential calibration for episodic tasks, where task-success confidence is produced along an episode, while success is determined at the end of it. We introduce a sequ...
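The calibration notion above can be grounded in a standard expected calibration error (ECE); pairing each step's confidence with the episode's final binary outcome is our illustrative choice, not necessarily the paper's estimator.

```python
import numpy as np

# Generic expected calibration error (ECE) sketch: bin confidences, then
# average the gap between mean confidence and empirical success per bin.
def ece(confidences, outcomes, n_bins=10):
    conf = np.asarray(confidences, dtype=float)
    out = np.asarray(outcomes, dtype=float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.mean() * abs(conf[mask].mean() - out[mask].mean())
    return err
```

The sequential difficulty the abstract raises is that `outcomes` is only revealed at episode end, while `confidences` are produced along the way.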

πŸ“„ VTouch++: A Multimodal Dataset with Vision-Based Tactile Enhancement for Bimanual Manipulation
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.20444v1
πŸ‘₯ Authors: Qianxi Hua, Xinyue Li, Zheng Yan, Yang Li (possible past Google (United States) affiliation), Chi Zhang (possible past Peking University affiliation), Yongyao Li, Yufei Liu
Abstract

Embodied intelligence has advanced rapidly in recent years; however, bimanual manipulation, especially in contact-rich tasks, remains challenging. This is largely due to the lack of datasets with rich physical interaction signals, systematic task organization, and sufficient scale. To address these limitations, we introduce the VTOUCH dataset. It leverages vision-based tactile sensing to provide high-fidelity physical interaction signals, adopts a matrix-style task design to enable systematic lear...

πŸ“„ Building a Precise Video Language with Human-AI Oversight
πŸ—“οΈ Published: 4/22/2026
πŸ”— http://arxiv.org/abs/2604.21718v1
πŸ‘₯ Authors: Zhiqiu Lin, Chancharik Mitra, Siyuan Cen, Isaac Li, Yuhan Huang, Yu Tong Tiffany Ling, Hewei Wang, Irene Pi, Shihang Zhu, Ryan Rao, George Liu, Jiaxi Li, Ruojin Li, Yili Han, Yilun Du (possible past Massachusetts Institute Of Technology affiliation), Deva Ramanan (possible past Carnegie Mellon University affiliation)
Abstract

Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality caption...
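The structured specification above lends itself to a simple schema; the five field names are taken from the abstract, while the types, defaults, and `to_text` rendering are our assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical encoding of the structured caption specification: the five
# fields mirror the abstract (subjects, scene, motion, spatial, camera);
# everything else is illustrative.
@dataclass
class VideoCaption:
    subjects: list = field(default_factory=list)  # who/what appears
    scene: str = ""                               # where it happens
    motion: str = ""                              # how subjects move
    spatial: str = ""                             # spatial relations
    camera: str = ""                              # camera dynamics

    def to_text(self):
        parts = [", ".join(self.subjects), self.scene, self.motion,
                 self.spatial, self.camera]
        return " | ".join(p for p in parts if p)
```

A schema like this makes captions checkable field by field, which is what enables the human-AI oversight loop the title refers to.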

*Notable papers are those with at least two authors from a "big" AI/ML lab.