📄 Notable* Recent AI/ML arXiv Papers


📄 InCoder-32B-Thinking: Industrial Code World Model for Thinking
🗓️ Published: 4/3/2026
🔗 http://arxiv.org/abs/2604.03144v1
👥 Authors: Jian Yang, Wei Zhang (possible past Tsinghua University affiliation), Jiajun Wu (possible past Massachusetts Institute Of Technology affiliation), Junhang Cheng, Tuney Zheng, Fanglin Xu, Weicheng Gu, Lin Jing, Yaxin Du, Joseph Li, Yizhi Li, Yan Xing, Chuan Hao, Ran Tao, Ruihao Gong, Aishan Liu, Zhoujun Li, Mingjie Tang, Chenghua Lin, Siheng Chen, Wayne Xin Zhao (possible past Baidu (China) affiliation), Xianglong Liu, Ming Zhou, Bryan Dai, Weifeng Lv
Abstract

Industrial software development across chip design, GPU optimization, and embedded systems lacks expert reasoning traces showing how engineers reason about hardware constraints and timing semantics. In this work, we propose InCoder-32B-Thinking, trained on data from the Error-driven Chain-of-Thought (ECoT) synthesis framework with an industrial code world model (ICWM) to generate reasoning traces. Specifically, ECoT generates reasoning chains by synthesizing the thinking content from multi-t...

📄 Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems
🗓️ Published: 4/3/2026
🔗 http://arxiv.org/abs/2604.03081v1
👥 Authors: Yubin Qu, Yi Liu (possible past Google (United States) affiliation), Tongcheng Geng, Gelei Deng, Yuekang Li, Leo Yu Zhang, Ying Zhang (possible past Tencent (China) affiliation), Lei Ma
Abstract

LLM-based coding agents extend their capabilities via third-party agent skills distributed through open marketplaces without mandatory security review. Unlike traditional packages, these skills are executed as operational directives with system-level privileges, so a single malicious skill can compromise the host. Prior work has not examined whether supply-chain attacks can directly hijack an agent's action space, such as file writes, shell commands, and network requests, despite existing safegu...

📄 Credential Leakage in LLM Agent Skills: A Large-Scale Empirical Study
🗓️ Published: 4/3/2026
🔗 http://arxiv.org/abs/2604.03070v1
👥 Authors: Zhihao Chen, Ying Zhang (possible past Tencent (China) affiliation), Yi Liu (possible past Google (United States) affiliation), Gelei Deng, Yuekang Li, Yanjun Zhang, Jianting Ning, Leo Yu Zhang, Lei Ma, Zhiqiang Li
Abstract

Third-party skills extend LLM agents with powerful capabilities but often handle sensitive credentials in privileged environments, making leakage risks poorly understood. We present the first large-scale empirical study of this problem, analyzing 17,022 skills (sampled from 170,226 on SkillsMP) using static analysis, sandbox testing, and manual inspection. We identify 520 vulnerable skills with 1,708 issues and derive a taxonomy of 10 leakage patterns (4 accidental and 6 adversarial). We find th...
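The static-analysis stage the study describes can be pictured with a minimal, hypothetical regex scanner. The pattern names and rules below are illustrative only and not taken from the paper; real scanners add much larger rule sets plus entropy heuristics:

```python
import re

# Illustrative credential patterns; a real scanner would use a far larger
# rule set and entropy checks to cut false negatives/positives.
PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_api_key": re.compile(
        r"(?i)\b(?:api[_-]?key|secret)\s*[:=]\s*['\"]([A-Za-z0-9_\-]{16,})['\"]"
    ),
}

def scan_text(text):
    """Return (pattern_name, matched_text) findings in the given source text."""
    findings = []
    for name, pat in PATTERNS.items():
        for m in pat.finditer(text):
            findings.append((name, m.group(0)))
    return findings

sample = 'api_key = "abcd1234efgh5678ijkl"\nprint("hello")'
print(scan_text(sample))
```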

📄 Verbalizing LLMs' assumptions to explain and control sycophancy
🗓️ Published: 4/3/2026
🔗 http://arxiv.org/abs/2604.03058v1
👥 Authors: Myra Cheng (possible past Deepmind (United Kingdom) affiliation), Isabel Sieh, Humishka Zope, Sunny Yu, Lujain Ibrahim, Aryaman Arora, Jared Moore, Desmond Ong, Dan Jurafsky (possible past Stanford University affiliation), Diyi Yang (possible past Stanford University affiliation)
Abstract

LLMs can be socially sycophantic, affirming users when they ask questions like "am I in the wrong?" rather than providing genuine assessment. We hypothesize that this behavior arises from incorrect assumptions about the user, like underestimating how often users are seeking information over reassurance. We present Verbalized Assumptions, a framework for eliciting these assumptions from LLMs. Verbalized Assumptions provide insight into LLM sycophancy, delusion, and other safety issues, e.g., the ...

📄 JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
🗓️ Published: 4/3/2026
🔗 http://arxiv.org/abs/2604.03044v1
👥 Authors: Aichen Cai, Anmeng Zhang, Anyu Li, Bo Zhang (possible past Tencent (China) affiliation), Bohua Cai, Chang Li, Changjian Jiang, Changkai Lu, Chao Xue, Chaocai Liang, Cheng Zhang, Dongkai Liu, Fei Wang, Guoqiang Huang, Haijian Ke, Han Lin, Hao Wang (possible past Tsinghua University affiliation), Ji Miao, Jiacheng Zhang, Jialong Shi, Jifeng Zhu, Jingjing Qian, Junhui Luo, Junwu Xiong, Lam So, Liang Huang (possible past Tsinghua University affiliation), Ming Ke, Mingyang Li, Panfeng Shi, Peng Hao, Qi Wang (possible past Tsinghua University affiliation), Qian Lai, Qiaoqiao Yuan, Qingyu Yin, Qiong Cao, Qixiang Wang, Rongcheng Bian, Rongduo Han, Shaoqiang Zheng, Shi Hu, Shi Suo, Shijie Ren, Shijin Zhang, Shiying Fan, Shuai Xie, Tianyi Zhang, Wei Liu (possible past Tsinghua University affiliation), Wentao Tan, Xianghan Meng, Xiaodong He (possible past Microsoft (United States) affiliation), Xing Pan, Xiran Wang, Xuyang Peng, Ya Zhang, Yang Liu (possible past Tsinghua University affiliation), Yangyang Duan, Yanxu Chen, Yicheng Gong, Yidan Huang, Yifei Liu, Yinhao Bai, Yongqiang Liu, Yuesong Zhang, Yuqi Zhang, Zerui Xie, Zhenfang Wang, Zhennan Shen, Zheyuan Liu, Zhuwei Zeng
Abstract

We introduce JoyAI-LLM Flash, an efficient Mixture-of-Experts (MoE) language model designed to redefine the trade-off between strong performance and token efficiency in the sub-50B parameter regime. JoyAI-LLM Flash is pretrained on a massive corpus of 20 trillion tokens and further optimized through a rigorous post-training pipeline, including supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and large-scale reinforcement learning (RL) across diverse environments. To improve to...
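For context, the DPO stage in such post-training pipelines optimizes a contrastive preference loss over (chosen, rejected) response pairs. A minimal sketch of the standard formulation, not JoyAI-LLM's specific recipe:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one (chosen, rejected) pair.

    logp_* are summed log-probabilities of the chosen (w) and rejected (l)
    responses under the policy; ref_logp_* are the same under the frozen
    reference model. Lower loss = the policy prefers the chosen response
    more strongly than the reference does.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# When the policy matches the reference, the margin is 0 and loss = log 2.
print(round(dpo_loss(-5.0, -7.0, -5.0, -7.0), 4))  # → 0.6931
```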

📄 InfoSeeker: A Scalable Hierarchical Parallel Agent Framework for Web Information Seeking
🗓️ Published: 4/3/2026
🔗 http://arxiv.org/abs/2604.02971v1
👥 Authors: Ka Yiu Lee, Yuxuan Huang, Zhiyuan He, Huichi Zhou, Weilin Luo, Kun Shao, Meng Fang (possible past Tencent (China) affiliation), Jun Wang (possible past Tencent (China) affiliation)
Abstract

Recent agentic search systems have made substantial progress by emphasising deep, multi-step reasoning. However, this focus often overlooks the challenges of wide-scale information synthesis, where agents must aggregate large volumes of heterogeneous evidence across many sources. As a result, most existing large language model agent systems face severe limitations in data-intensive settings, including context saturation, cascading error propagation, and high end-to-end latency. To address these ...

📄 Analysis of Optimality of Large Language Models on Planning Problems
🗓️ Published: 4/3/2026
🔗 http://arxiv.org/abs/2604.02910v1
👥 Authors: Bernd Bohnet (possible past Google (United States) affiliation), Michael C. Mozer, Kevin Swersky (possible past Google (United States) affiliation), Wil Cunningham, Aaron Parisi, Kathleen Kenealy, Noah Fiedel (possible past Google (United States) affiliation)
Abstract

Classic AI planning problems have been revisited in the Large Language Model (LLM) era, with recent benchmarks focusing on success rates rather than plan efficiency. We examine the degree to which frontier models reason optimally versus relying on simple, heuristic, and possibly inefficient strategies. We focus on the Blocksworld domain, involving towers of labeled blocks that have to be moved from an initial to a goal configuration via a set of primitive actions. We also study a formally equi...
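To make "plan optimality" concrete: in small Blocksworld instances the optimal plan length can be computed exactly by breadth-first search over states. The encoding below (a state as a sorted tuple of stacks) is an illustrative choice of mine, not the paper's:

```python
from collections import deque

def canon(stacks):
    """Canonical form: sorted tuple of non-empty stacks (bottom block first)."""
    return tuple(sorted(tuple(s) for s in stacks if s))

def neighbors(state):
    """States reachable by moving one top block onto another stack or the table."""
    stacks = [list(s) for s in state]
    for i, src in enumerate(stacks):
        if not src:
            continue
        for j in range(len(stacks)):
            if i != j:
                nxt = [list(s) for s in stacks]
                nxt[j].append(nxt[i].pop())
                yield canon(nxt)
        nxt = [list(s) for s in stacks]          # move the block onto the table
        nxt.append([nxt[i].pop()])
        yield canon(nxt)

def optimal_plan_length(start, goal):
    """Exact shortest-plan length via BFS; None if the goal is unreachable."""
    start, goal = canon(start), canon(goal)
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        state, d = frontier.popleft()
        if state == goal:
            return d
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))

# Invert a three-block tower: A-B-C (C on top) into C-B-A (A on top).
print(optimal_plan_length([("A", "B", "C")], [("C", "B", "A")]))  # → 3
```

An LLM's proposed plan length can then be compared against this exact optimum, which is the kind of efficiency check the abstract contrasts with plain success rates.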

📄 ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agents
🗓️ Published: 4/3/2026
🔗 http://arxiv.org/abs/2604.02834v1
👥 Authors: Chao Li (possible past Baidu (China) affiliation), Cailiang Liu, Ang Gao, Kexin Deng, Shu Zhang (possible past Google (United States) affiliation), Langping Xu, Xiaotong Shi, Xionghao Ding, Jian Pei, Xun Jiang
Abstract

Longitudinal health agents must reason across multi-source trajectories that combine continuous device streams, sparse clinical exams, and episodic life events, yet evaluating them is hard: real-world data cannot be released at scale, and temporally grounded attribution questions seldom admit definitive answers without structured ground truth. We present ESL-Bench, an event-driven synthesis framework and benchmark providing 100 synthetic users, each with a 1-5 year trajectory comprising a healt...

📄 QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models
🗓️ Published: 4/3/2026
🔗 http://arxiv.org/abs/2604.02816v1
👥 Authors: Xinhao Wang, Zhonyu Xia, Zhiwei Lin, Zhe Li (possible past Google (United States) affiliation), Yongtao Wang (possible past Peking University affiliation)
Abstract

Multimodal Large Language Models (MLLMs) have shown strong reasoning ability, but their high computational and memory costs hinder deployment in resource-constrained settings. While Post-Training Quantization (PTQ) and vision token pruning are standard compression techniques, they are usually treated as independent optimizations. In this paper, we show that these two techniques are strongly coupled: naively applying semantic-based token pruning to PTQ-optimized MLLMs can discard activation outli...
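The coupling the authors describe can be illustrated with a toy example: under symmetric per-tensor quantization, a single outlier activation dominates the quantization scale, so pruning the token that carries it changes the grid every remaining value is rounded to. The numbers below are hypothetical:

```python
def quantize_dequantize(values, bits=8):
    """Symmetric per-tensor quantization: scale by the max magnitude,
    round to the integer grid, and map back to floats."""
    qmax = 2 ** (bits - 1) - 1                     # 127 for int8
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values], scale

# One "outlier token" activation dominates the quantization scale.
acts = [0.01, 0.02, -0.015, 0.03, 12.0]
_, scale_with = quantize_dequantize(acts)
_, scale_without = quantize_dequantize(acts[:-1])  # outlier token pruned first
print(scale_with, scale_without)                   # scales differ by ~400x
```

If a PTQ pipeline calibrated its scales with the outlier present, pruning that token afterwards leaves every other activation quantized on a grid far coarser than necessary, which is the interaction the paper targets.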

📄 ChatSVA: Bridging SVA Generation for Hardware Verification via Task-Specific LLMs
🗓️ Published: 4/3/2026
🔗 http://arxiv.org/abs/2604.02811v1
👥 Authors: Lik Tung Fu, Jie Zhou (possible past Tsinghua University affiliation), Shaokai Ren, Mengli Zhang, Jia Xiong, Hugo Jiang, Nan Guan, Xi Wang (possible past Tsinghua University affiliation), Jun Yang (possible past Tsinghua University affiliation)
Abstract

Functional verification consumes over 50% of the IC development lifecycle, where SystemVerilog Assertions (SVAs) are indispensable for formal property verification and enhanced simulation-based debugging. However, manual SVA authoring is labor-intensive and error-prone. While Large Language Models (LLMs) show promise, their direct deployment is hindered by low functional accuracy and a severe scarcity of domain-specific data. To address these challenges, we introduce ChatSVA, an end-to-end SVA g...

📄 Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks
🗓️ Published: 4/3/2026
🔗 http://arxiv.org/abs/2604.02795v1
👥 Authors: Tianze Xu, Yanzhao Zheng, Pengrui Lu, Lyumanshan Ye, Yong Wu, Zhentao Zhang, Yuanqiang Yu, Chao Ma (possible past Shanghai Jiao Tong University affiliation), Jihuai Zhu, Pengfei Liu, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu (possible past Tencent (China) affiliation)
Abstract

Rubric-based Reinforcement Learning (RL) has emerged as a promising approach for aligning Large Language Models (LLMs) with complex, open-domain instruction following tasks. However, existing methods predominantly rely on response-level rewards, introducing severe reward sparsity and reward ambiguity problems. To address these issues, we propose Rubrics to Tokens (RTT), a novel rubric-based RL framework that bridges coarse response-level scores and fine-grained token-level credit assignment. RTT...

📄 LumaFlux: Lifting 8-Bit Worlds to HDR Reality with Physically-Guided Diffusion Transformers
🗓️ Published: 4/3/2026
🔗 http://arxiv.org/abs/2604.02787v1
👥 Authors: Shreshth Saini, Hakan Gedik, Neil Birkbeck (possible past Google (United States) affiliation), Yilin Wang (possible past Google (United States) affiliation), Balu Adsumilli (possible past Google (United States) affiliation), Alan C. Bovik
Abstract

The rapid adoption of HDR-capable devices has created a pressing need to convert 8-bit Standard Dynamic Range (SDR) content into perceptually and physically accurate 10-bit High Dynamic Range (HDR) content. Existing inverse tone-mapping (ITM) methods often rely on fixed tone-mapping operators that struggle to generalize to real-world degradations, stylistic variations, and camera pipelines, frequently producing clipped highlights, desaturated colors, or unstable tone reproduction. We introduce LumaF...

📄 Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy
🗓️ Published: 4/3/2026
🔗 http://arxiv.org/abs/2604.02709v1
👥 Authors: Yihong Dong, Xiaoha Jian, Xue Jiang, Xuyuan Guo, Zhiyuan Fan, Jiaru Qian, Kechi Zhang, Jia Li (possible past Google (United States) affiliation), Zhi Jin (possible past Peking University affiliation), Ge Li (possible past Peking University affiliation)
Abstract

The formal reasoning capabilities of LLMs are crucial for advancing automated software engineering. However, existing benchmarks for LLMs lack systematic evaluation based on computation and complexity, leaving a critical gap in understanding their formal reasoning capabilities. Therefore, it is still unknown whether SOTA LLMs can grasp the structured, hierarchical complexity of formal languages as defined by Computation Theory. To address this, we introduce ChomskyBench, a benchmark for systemat...
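As background, the hierarchy's levels correspond to progressively harder membership problems. A toy sketch with standard example languages (these are textbook illustrations, not necessarily ChomskyBench's actual tasks):

```python
import re

# Regular: (ab)* is decidable by a finite automaton, hence by a regex.
def is_regular_example(s):
    return re.fullmatch(r"(ab)*", s) is not None

# Context-free: a^n b^n needs a counter (pushdown stack); no regex suffices.
def is_anbn(s):
    n = len(s) // 2
    return len(s) == 2 * n and s == "a" * n + "b" * n

# Context-sensitive: a^n b^n c^n is beyond pushdown automata.
def is_anbncn(s):
    n = len(s) // 3
    return len(s) == 3 * n and s == "a" * n + "b" * n + "c" * n

print(is_regular_example("abab"), is_anbn("aabb"), is_anbncn("aabbcc"))
```

Benchmarks built on this hierarchy can ask whether a model's accuracy degrades as the language class climbs from regular to context-free to context-sensitive.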

📄 DocShield: Towards AI Document Safety via Evidence-Grounded Agentic Reasoning
🗓️ Published: 4/3/2026
🔗 http://arxiv.org/abs/2604.02694v1
👥 Authors: Fanwei Zeng, Changtao Miao, Jing Huang (possible past Meta (United States) affiliation), Zhiya Tan, Shutao Gong, Xiaoming Yu (possible past Peking University affiliation), Yang Wang (possible past Baidu (China) affiliation), Weibin Yao, Joey Tianyi Zhou (possible past Tencent (China) affiliation), Jianshu Li (possible past National University Of Singapore affiliation), Yin Yan
Abstract

The rapid progress of generative AI has enabled increasingly realistic text-centric image forgeries, posing major challenges to document safety. Existing forensic methods mainly rely on visual cues and lack evidence-based reasoning to reveal subtle text manipulations. Detection, localization, and explanation are often treated as isolated tasks, limiting reliability and interpretability. To tackle these challenges, we propose DocShield, the first unified framework formulating text-centric forgery...

📄 VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.02467v1
👥 Authors: Mengtian Li, Yuwei Lu, Feifei Li (possible past Google (United States) affiliation), Chenqi Gan, Zhifeng Xie, Xi Wang (possible past Tsinghua University affiliation)
Abstract

Cinematic camera control relies on a tight feedback loop between director and cinematographer, where camera motion and framing are continuously reviewed and refined. Recent generative camera systems can produce diverse, text-conditioned trajectories, but they lack this "director in the loop" and have no explicit supervision of whether a shot is visually desirable. This results in in-distribution camera motion but poor framing, off-screen characters, and undesirable visual aesthetics. In this pap...

📄 ActionParty: Multi-Subject Action Binding in Generative Video Games
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.02330v1
👥 Authors: Alexander Pondaven, Ziyi Wu, Igor Gilitschenski (possible past Massachusetts Institute Of Technology affiliation), Philip Torr (possible past University Of Oxford affiliation), Sergey Tulyakov, Fabio Pizzati, Aliaksandr Siarohin
Abstract

Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action control...

📄 Steerable Visual Representations
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.02327v1
👥 Authors: Jona Ruthardt, Manu Gaur, Deva Ramanan (possible past Carnegie Mellon University affiliation), Makarand Tapaswi (possible past University Of Toronto affiliation), Yuki M. Asano (possible past University Of Oxford affiliation)
Abstract

Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their...

📄 Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.02324v1
👥 Authors: Daiwei Chen, Zhoutong Fu, Chengming Jiang, Haichao Zhang (possible past Baidu (China) affiliation), Ran Zhou, Tan Wang, Chunnan Yao, Guoyao Li, Rui Cai, Yihan Cao, Ruijie Jiang, Fedor Borisyuk (possible past Meta (United States) affiliation), Jianqiang Shen, Jingwei Wu, Ramya Korlakai Vinayak
Abstract

Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a deg...
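The degenerate geometry of mean initialization is easy to reproduce: every new token starts at the identical point, so pairwise distances among new tokens are exactly zero until fine-tuning separates them. A sketch with toy embeddings (the numbers are arbitrary):

```python
# Toy vocabulary embeddings: 4 existing tokens, dimension 3.
vocab = [
    [0.2, -0.1, 0.4],
    [-0.3, 0.5, 0.0],
    [0.1, 0.1, -0.2],
    [0.0, -0.5, 0.3],
]
dim = len(vocab[0])

# Standard practice: initialize each new token as the mean of existing embeddings.
mean = [sum(v[d] for v in vocab) / len(vocab) for d in range(dim)]
new_tokens = [list(mean) for _ in range(3)]   # e.g. 3 new Semantic-ID tokens

def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# All new tokens collapse to one point: zero pairwise distance.
print(dist(new_tokens[0], new_tokens[1]))  # → 0.0
```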

📄 Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.02288v1
👥 Authors: Gengsheng Li, Tianyu Yang (possible past Tencent (China) affiliation), Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang (possible past Google (United States) affiliation), Jinqiao Wang, Tat-Seng Chua
Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement,...
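The "coarse credit assignment" refers to GRPO's group-relative advantage, where every token of a rollout shares one normalized scalar reward. A minimal sketch of the standard computation:

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each rollout's scalar reward by the
    group mean and standard deviation. Every token in a rollout shares the
    same value, which is the coarse credit assignment at issue."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of one prompt: two pass (reward 1.0), two fail (0.0).
# Both failed rollouts get the identical penalty at every token.
print([round(a, 3) for a in group_relative_advantages([1.0, 1.0, 0.0, 0.0])])
```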

📄 The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.02029v1
👥 Authors: Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang (possible past Tsinghua University affiliation), Chengming Xu, Yue Ma, Xiaobin Hu (possible past Tencent (China) affiliation), Zhe Cao (possible past University Of California, Berkeley affiliation), Jie Xu, Guibin Zhang, Jiale Tao, Jiayi Zhang, Siyuan Ma, Kaituo Feng, Haojie Huang, Youxing Li, Ronghao Chen, Huacan Wang, Chenglin Wu, Zikun Su, Xiaogang Xu, Kelu Yao, Kun Wang, Chen Gao, Yue Liao, Ruqi Huang, Tao Jin, Cheng Tan, Jiangning Zhang (possible past Tencent (China) affiliation), Wenqi Ren (possible past Tencent (China) affiliation), Yanwei Fu, Yong Liu, Yu Wang (possible past Tsinghua University affiliation), Xiangyu Yue (possible past University Of California, Berkeley affiliation), Yu-Gang Jiang, Shuicheng Yan (possible past National University Of Singapore affiliation)
Abstract

Latent space is rapidly emerging as a native substrate for language-based models. While modern systems are still commonly understood through explicit token-level generation, an increasing body of work shows that many critical internal processes are more naturally carried out in continuous latent space than in human-readable verbal traces. This shift is driven by the structural limitations of explicit-space computation, including linguistic redundancy, discretization bottlenecks, sequential ineff...

📄 ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.02022v1
👥 Authors: Yu Li (possible past Tencent (China) affiliation), Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang (possible past Tsinghua University affiliation), Jing Shao, Xia Hu, Dongrui Liu
Abstract

Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-level benchmark for structured, diverse, and realistic evaluation of agent safety. ATBench organizes age...

📄 Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.01989v2
👥 Authors: Boyang Gong, Yu Zheng, Fanye Kong, Jie Zhou (possible past Tsinghua University affiliation), Jiwen Lu (possible past Tsinghua University affiliation)
Abstract

Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require i...

📄 World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.01985v1
👥 Authors: Yuejiang Liu, Fan Feng, Lingjing Kong, Weifeng Lu, Jinzhou Tang, Kun Zhang (possible past Google (United States) affiliation), Kevin Murphy (possible past Google (United States) affiliation), Chelsea Finn (possible past University Of California, Berkeley affiliation), Yilun Du (possible past Massachusetts Institute Of Technology affiliation)
Abstract

General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning, which primarily focuses on optimal actions, a world model must be reliable over a much broader range of suboptimal actions, which are often insufficiently covered by action-labeled interaction data. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify...

📄 State estimations and noise identifications with intermittent corrupted observations via Bayesian variational inference
🗓️ Published: 4/3/2026
🔗 http://arxiv.org/abs/2604.02738v1
👥 Authors: Peng Sun (possible past Tencent (China) affiliation), Ruoyu Wang (possible past University Of Edinburgh affiliation), Xue Luo
Abstract

This paper focuses on the state estimation problem in distributed sensor networks, where intermittent packet dropouts, corrupted observations, and unknown noise covariances coexist. To tackle this challenge, we formulate the joint estimation of system states, noise parameters, and network reliability as a Bayesian variational inference problem, and propose a novel variational Bayesian adaptive Kalman filter (VB-AKF) to approximate the joint posterior probability densities of the latent parameter...
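The full VB-AKF is beyond a short sketch, but the intermittent-dropout setting it targets can be illustrated with a plain 1-D Kalman filter that falls back to prediction-only steps when a packet is dropped. This is the textbook filter, not the paper's algorithm:

```python
def kalman_1d(observations, q=0.01, r=0.25, x0=0.0, p0=1.0):
    """1-D random-walk Kalman filter. `None` marks a dropped packet: the
    filter then propagates the prediction and its grown uncertainty."""
    x, p = x0, p0
    estimates = []
    for z in observations:
        # Predict: random-walk model x_k = x_{k-1} + w, w ~ N(0, q)
        p = p + q
        if z is not None:
            # Update with measurement z = x + v, v ~ N(0, r)
            k = p / (p + r)            # Kalman gain
            x = x + k * (z - x)
            p = (1.0 - k) * p
        estimates.append((x, p))
    return estimates

# Two consecutive dropouts: the variance p grows while no packet arrives.
est = kalman_1d([1.0, None, 1.2, None, None, 1.1])
print([round(p, 4) for _, p in est])
```

The paper's contribution is to additionally infer the noise covariances (here fixed as `q` and `r`) and the network reliability jointly with the state, via variational inference.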

📄 LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.02097v1
👥 Authors: Jiachun Jin, Zetong Zhou, Xiao Yang (possible past Tencent (China) affiliation), Hao Zhang (possible past Tencent (China) affiliation), Pengfei Liu, Jun Zhu (possible past Tsinghua University affiliation), Zhijie Deng
Abstract

Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necess...

📄 CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning
🗓️ Published: 4/2/2026
🔗 http://arxiv.org/abs/2604.01634v1
👥 Authors: Junyoung Sung, Seungwoo Lyu, Minjun Kim, Sumin An, Arsha Nagrani (possible past University Of Oxford affiliation), Paul Hongsuck Seo (possible past Google (United States) affiliation)
Abstract

Real-world reasoning often requires combining information across modalities, connecting textual context with visual cues in a multi-hop process. Yet, most multimodal benchmarks fail to capture this ability: they typically rely on single images or sets of images, where answers can be inferred from a single modality alone. This limitation is mirrored in the training data, where interleaved image-text content rarely enforces complementary, multi-hop reasoning. As a result, Vision-Language Models (VL...

*Notable papers are those with at least two authors from a "big" AI/ML lab.