📄 Notable* Recent AI/ML arXiv Papers


📄 Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
🗓️ Published: 4/24/2026
🔗 http://arxiv.org/abs/2604.22748v1
👥 Authors: Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin (possible past National University Of Singapore affiliation), Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang (possible past Google (United States) affiliation), Yeying Jin, Zhefan Rao, Jinhui Ye, Xinyu Lin, Xichen Zhang, Qisheng Hu, Shuai Yang, Leyang Shen, Wei Chow, Yifei Dong, Fengyi Wu, Quanyu Long, Bin Xia, Shaozuo Yu, Mingkang Zhu, Wenhu Zhang, Jiehui Huang, Haokun Gui, Haoxuan Che, Long Chen (possible past Tencent (China) affiliation), Qifeng Chen, Wenxuan Zhang, Wenya Wang, Xiaojuan Qi (possible past University Of Oxford affiliation), Yang Deng, Yanwei Li, Mike Zheng Shou (possible past National University Of Singapore affiliation), Zhi-Qi Cheng, See-Kiong Ng, Ziwei Liu, Philip Torr (possible past University Of Oxford affiliation), Jiaya Jia (possible past Tencent (China) affiliation)
Abstract

As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a "levels x laws" taxonomy organized along two axes. The first defines three capability levels: L1 Predi...

📄 Learning Evidence Highlighting for Frozen LLMs
🗓️ Published: 4/24/2026
🔗 http://arxiv.org/abs/2604.22565v1
👥 Authors: Shaoang Li, Yanhang Shi, Yufei Li, Mingfu Liang, Xiaohan Wei, Yunchen Pu (possible past Meta (United States) affiliation), Fei Tian, Chonglin Sun, Frank Shyu, Luke Simon, Sandeep Pandey, Xi Liu, Jian Li (possible past Tencent (China) affiliation)
Abstract

Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evidence Emphasis framework that decouples evidence selection from reasoning for frozen LLM solvers. HiLight avoids compressing or rewriting the input, which can discard or distort evidence, by training a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream ...
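The core mechanic described in the abstract — inserting minimal highlight tags around pivotal spans while leaving the rest of the context untouched — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the `<hl>` tag, the `emphasize` function, and the span format are all assumptions.

```python
def emphasize(context: str, spans: list[tuple[int, int]]) -> str:
    """Wrap each (start, end) evidence span in highlight tags,
    leaving the rest of the context byte-for-byte unchanged
    (hypothetical sketch of HiLight-style tag insertion)."""
    out, cursor = [], 0
    for start, end in sorted(spans):
        out.append(context[cursor:start])          # untouched prefix
        out.append("<hl>" + context[start:end] + "</hl>")  # tagged evidence
        cursor = end
    out.append(context[cursor:])                   # untouched suffix
    return "".join(out)

doc = "Alice was born in 1990. She moved to Paris in 2015."
print(emphasize(doc, [(18, 22)]))
# → Alice was born in <hl>1990</hl>. She moved to Paris in 2015.
```

The frozen solver LLM would then receive the tagged context verbatim, so no evidence can be discarded or distorted by compression or rewriting.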

📄 CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
🗓️ Published: 4/24/2026
🔗 http://arxiv.org/abs/2604.22498v1
👥 Authors: Lihao Zheng, Zhenwei Shao, Yu Zhou, Yan Yang (possible past Google (United States) affiliation), Xintian Shen, Jiawei Chen (possible past Tencent (China) affiliation), Hao Ma (possible past Meta (United States) affiliation), Tao Wei (possible past Baidu (China) affiliation)
Abstract

Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object constancy. In addition, existing approaches typically rely on expensive human annotations or large-scale chain-of-thought (CoT) data generation. We propose Compositional Grounded Contrast (abbr. CGC), a low-cost full framework for boosting fine-grained multi-image un...

📄 From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
🗓️ Published: 4/24/2026
🔗 http://arxiv.org/abs/2604.22446v1
👥 Authors: Zhengxu Yu, Yu Fu, Zhiyuan He, Yuxuan Huang, Lee Ka Yiu, Meng Fang (possible past Tencent (China) affiliation), Weilin Luo, Jun Wang (possible past Tencent (China) affiliation)
Abstract

Individual agent capabilities have advanced rapidly through modular skills and tool integrations, yet multi-agent systems remain constrained by fixed team structures, tightly coupled coordination logic, and session-bound learning. We argue that this reflects a deeper absence: a principled organisational layer that governs how a workforce of agents is assembled, governed, and improved over time, decoupled from what individual agents know. To fill this gap, we introduce OneManCompany (OMC),...

📄 LeHome: A Simulation Environment for Deformable Object Manipulation in Household Scenarios
🗓️ Published: 4/24/2026
🔗 http://arxiv.org/abs/2604.22363v1
👥 Authors: Zeyi Li, Yushi Yang (possible past Stanford University affiliation), Shawn Xie, Kyle Xu, Tianxing Chen, Yuran Wang, Zhenhao Shen, Yan Shen, Yue Chen (possible past Google (United States) affiliation), Wenjun Li, Yukun Zheng, Chaorui Zhang, Siyi Lin, Fei Teng, Hongjun Yang, Ming Chen, Steve Xie, Ruihai Wu
Abstract

Household environments present one of the most common, impactful yet challenging application domains for robotics. Within household scenarios, manipulating deformable objects is particularly difficult, both in simulation and real-world execution, due to varied categories and shapes, complex dynamics, and diverse material properties, as well as the lack of reliable deformable-object support in existing simulations. We introduce LeHome, a comprehensive simulation environment designed for deformabl...

📄 GenMatter: Perceiving Physical Objects with Generative Matter Models
🗓️ Published: 4/24/2026
🔗 http://arxiv.org/abs/2604.22160v1
👥 Authors: Eric Li, Arijit Dasgupta, Yoni Friedman, Mathieu Huot, Vikash Mansinghka, Thomas O'connell, William T. Freeman (possible past Massachusetts Institute Of Technology affiliation), Joshua B. Tenenbaum (possible past Massachusetts Institute Of Technology affiliation)
Abstract

Human visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative m...

📄 Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching
🗓️ Published: 4/23/2026
🔗 http://arxiv.org/abs/2604.22061v1
👥 Authors: Xiaodi Li, Yang Xiao, Munhwan Lee, Konstantinos Leventakos, Young J. Juhn, David Jones (possible past Google (United States) affiliation), Terence T. Sio, Wei Liu (possible past Tsinghua University affiliation), Maria Vassilaki, Nansu Zong
Abstract

Patient-trial matching requires reasoning over long, heterogeneous electronic health records (EHRs) and complex eligibility criteria, posing significant challenges for scalability, generalization, and computational efficiency. Existing approaches either rely on full-document processing with large language models (LLMs), which is computationally expensive, or use traditional machine learning methods that struggle to capture unstructured clinical narratives. In this work, we propose a lightweight ...

📄 Shared Lexical Task Representations Explain Behavioral Variability In LLMs
🗓️ Published: 4/23/2026
🔗 http://arxiv.org/abs/2604.22027v1
👥 Authors: Zhuonan Yang, Jacob Xiaochen Li, Francisco Piedrahita Velez, Eric Todd, David Bau (possible past Google (United States) affiliation), Michael L. Littman, Stephen H. Bach, Ellie Pavlick (possible past Google (United States) affiliation)
Abstract

One of the most common complaints about large language models (LLMs) is their prompt sensitivity -- that is, the fact that their ability to perform a task or provide a correct answer to a question can depend unpredictably on the way the question is posed. We investigate this variation by comparing two very different but commonly-used styles of prompting: instruction-based prompts, which describe the task in natural language, and example-based prompts, which provide in-context few-shot demonstrat...

📄 Rethinking Publication: A Certification Framework for AI-Enabled Research
🗓️ Published: 4/23/2026
🔗 http://arxiv.org/abs/2604.22026v1
👥 Authors: Yang Lu (possible past Meta (United States) affiliation), Rabimba Karanjai, Lei Xu (possible past Tsinghua University affiliation), Weidong Shi
Abstract

AI research pipelines now produce a growing share of publishable academic output, including work that meets existing peer-review standards for quality and novelty. Yet the publication system was built on the assumption of universal human authorship and lacks a principled way to evaluate knowledge produced through automated pipelines. This paper proposes a two-layer certification framework that separates knowledge quality assessment from grading of human contribution, allowing publication systems...

📄 Seeing Fast and Slow: Learning the Flow of Time in Videos
🗓️ Published: 4/23/2026
🔗 http://arxiv.org/abs/2604.21931v1
👥 Authors: Yen-Siang Wu, Rundong Luo, Jingsen Zhu, Tao Tu (possible past Google (United States) affiliation), Ali Farhadi (possible past University Of Washington affiliation), Matthew Wallingford, Yu-Chiang Frank Wang, Steve Marschner, Wei-Chiu Ma
Abstract

How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to l...

📄 TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale
🗓️ Published: 4/23/2026
🔗 http://arxiv.org/abs/2604.21889v1
👥 Authors: Jun Wang (possible past Tencent (China) affiliation), Ziyin Zhang, Rui Wang (possible past Tencent (China) affiliation), Hang Yu, Peng Di, Rui Wang (possible past Tencent (China) affiliation)
Abstract

Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end...

📄 Modulating Cross-Modal Convergence with Single-Stimulus, Intra-Modal Dispersion
🗓️ Published: 4/23/2026
🔗 http://arxiv.org/abs/2604.21836v1
👥 Authors: Eghbal A. Hosseini, Brian Cheung (possible past University Of California, Berkeley affiliation), Evelina Fedorenko, Alex H. Williams (possible past Stanford University affiliation)
Abstract

Neural networks exhibit a remarkable degree of representational convergence across diverse architectures, training objectives, and even data modalities. This convergence is predictive of alignment with brain representation. A recent hypothesis suggests this arises from learning the underlying structure in the environment in similar ways. However, it is unclear how individual stimuli elicit convergent representations across networks. An image can be perceived in multiple ways and expressed differ...

📄 BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature
🗓️ Published: 4/23/2026
🔗 http://arxiv.org/abs/2604.21508v1
👥 Authors: Jiaxian Yan, Jintao Zhu, Yuhang Yang, Qi Liu (possible past Tencent (China) affiliation), Kai Zhang, Zaixi Zhang, Xukai Liu, Boyan Zhang, Kaiyuan Gao, Jinchuan Xiao, Enhong Chen (possible past Baidu (China) affiliation)
Abstract

Protein-ligand bioactivity data published in the literature are essential for drug discovery, yet manual curation struggles to keep pace with rapidly growing literature. Automated bioactivity extraction remains challenging because it requires not only interpreting biochemical semantics distributed across text, tables, and figures, but also reconstructing chemically exact ligand structures (e.g., Markush structures). To address this bottleneck, we introduce BioMiner, a multi-modal extraction fram...

📄 MISTY: High-Throughput Motion Planning via Mixer-based Single-step Drifting
🗓️ Published: 4/23/2026
🔗 http://arxiv.org/abs/2604.21489v1
👥 Authors: Yining Xing, Zehong Ke, Yiqian Tu, Zhiyuan Liu (possible past Tsinghua University affiliation), Wenhao Yu, Jianqiang Wang (possible past Tsinghua University affiliation)
Abstract

Multi-modal trajectory generation is essential for safe autonomous driving, yet existing diffusion-based planners suffer from high inference latency due to iterative neural function evaluations. This paper presents MISTY (Mixer-based Inference for Single-step Trajectory-drifting Yield), a high-throughput generative motion planner that achieves state-of-the-art closed-loop performance with pure single-step inference. MISTY integrates a vectorized Sub-Graph encoder to capture environment context, ...

📄 VARestorer: One-Step VAR Distillation for Real-World Image Super-Resolution
🗓️ Published: 4/23/2026
🔗 http://arxiv.org/abs/2604.21450v1
👥 Authors: Yixuan Zhu, Shilin Ma, Haolin Wang, Ao Li, Yanzhe Jing, Yansong Tang, Lei Chen, Jiwen Lu (possible past Tsinghua University affiliation), Jie Zhou (possible past Tsinghua University affiliation)
Abstract

Recent advancements in visual autoregressive models (VAR) have demonstrated their effectiveness in image generation, highlighting their potential for real-world image super-resolution (Real-ISR). However, adapting VAR for ISR presents critical challenges. The next-scale prediction mechanism, constrained by causal attention, fails to fully exploit global low-quality (LQ) context, resulting in blurry and inconsistent high-quality (HQ) outputs. Additionally, error accumulation in the iterative pred...

📄 VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
🗓️ Published: 4/23/2026
🔗 http://arxiv.org/abs/2604.21375v2
👥 Authors: Qijun Han, Haoqin Tu, Zijun Wang, Haoyue Dai, Yiyang Zhou, Nancy Lau, Alvaro A. Cardenas, Yuhui Xu, Ran Xu, Caiming Xiong (possible past Salesforce (United States) affiliation), Zeyu Zheng, Huaxiu Yao, Yuyin Zhou, Cihang Xie (possible past Google (United States) affiliation)
Abstract

Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish...

📄 The First Challenge on Remote Sensing Infrared Image Super-Resolution at NTIRE 2026: Benchmark Results and Method Overview
🗓️ Published: 4/23/2026
🔗 http://arxiv.org/abs/2604.21312v1
👥 Authors: Kai Liu (possible past Baidu (China) affiliation), Haoyang Yue, Zeli Lin, Zheng Chen, Jingkai Wang, Jue Gong, Jiatong Li, Xianglong Yan, Libo Zhu, Jianze Li, Ziqing Zhang, Zihan Zhou, Xiaoyang Liu, Radu Timofte (possible past Eth Zurich affiliation), Yulun Zhang, Junye Chen, Zhenming Yan, Yucong Hong, Ruize Han, Song Wang, Li Pang, Heng Zhao, Xinqiao Wu, Deyu Meng, Xiangyong Cao, Weijun Yuan, Zhan Li, Zhanglu Chen, Boyang Yao, Yihang Chen, Yifan Deng, Zengyuan Zuo, Junjun Jiang, Saiprasad Meesiyawar, Sulocha Yatageri, Nikhil Akalwadi, Ramesh Ashok Tabib, Uma Mudenagudi, Jiachen Tu, Yaokun Shi, Guoyi Xu, Yaoxin Jiang, Cici Liu, Tongyao Mu, Qiong Cao, Yifan Wang (possible past Stanford University affiliation), Kosuke Shigematsu, Hiroto Shirono, Asuka Shin, Wei Zhou, Linfeng Li, Lingdong Kong, Ce Wang, Xingwei Zhong, Wanjie Sun, Dafeng Zhang, Hongxin Lan, Qisheng Xu, Mingyue He, Hui Geng, Tianjiao Wan, Kele Xu, Changjian Wang, Antoine Carreaud, Nicola Santacroce, Shanci Li, Jan Skaloud, Adrien Gressin
Abstract

This paper presents the NTIRE 2026 Remote Sensing Infrared Image Super-Resolution (x4) Challenge, one of the associated challenges of NTIRE 2026. The challenge aims to recover high-resolution (HR) infrared images from low-resolution (LR) inputs generated through bicubic downsampling with a x4 scaling factor. The objective is to develop effective models or solutions that achieve state-of-the-art performance for infrared image SR in remote sensing scenarios. To reflect the characteristics of infra...

📄 Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
🗓️ Published: 4/23/2026
🔗 http://arxiv.org/abs/2604.21268v1
👥 Authors: Wenkai Wang, Xiyun Li, Hongcan Guo, Wenhao Yu, Tianqing Fang, Haitao Mi, Dong Yu (possible past Tencent (China) affiliation), Shengyu Zhang (possible past Tencent (China) affiliation)
Abstract

Graphical User Interface (GUI) grounding requires mapping natural language instructions to precise pixel coordinates. However, due to visually homogeneous elements and dense layouts, models typically grasp semantic intent yet struggle with achieving precise localization. While scaling sampling attempts (Pass@k) reveals potential gains, static self-consistency strategies derived from geometric clustering often yield limited improvements, as the model's predictions tend to be spatially dispersed. ...
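The baseline the abstract contrasts against — static self-consistency via geometric clustering of sampled click predictions — can be sketched in a few lines. This is a generic illustration of that baseline, not the paper's proposed proposer-critic method; the function name, radius parameter, and densest-neighbourhood rule are assumptions.

```python
def consensus_click(points: list[tuple[float, float]], radius: float = 20.0):
    """Naive geometric self-consistency over k sampled clicks:
    for each prediction, count neighbours within `radius` pixels,
    then return the centroid of the densest neighbourhood
    (hypothetical sketch of the clustering baseline)."""
    best = None
    for (x, y) in points:
        cluster = [(px, py) for (px, py) in points
                   if (px - x) ** 2 + (py - y) ** 2 <= radius ** 2]
        if best is None or len(cluster) > len(best):
            best = cluster
    cx = sum(p[0] for p in best) / len(best)
    cy = sum(p[1] for p in best) / len(best)
    return cx, cy

# Three samples agree near (100, 100); one outlier is ignored.
print(consensus_click([(100, 100), (102, 101), (98, 99), (400, 300)]))
# → (100.0, 100.0)
```

As the abstract notes, this strategy degrades when the model's predictions are spatially dispersed, since no dominant cluster emerges — which motivates learning a visual critic instead.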

📄 On Reasoning Behind Next Occupation Recommendation
🗓️ Published: 4/23/2026
🔗 http://arxiv.org/abs/2604.21204v1
👥 Authors: Shan Dong, Palakorn Achananuparp, Hieu Hien Mai, Lei Wang (possible past Baidu (China) affiliation), Yao Lu (possible past Google (United States) affiliation), Ee-Peng Lim
Abstract

In this work, we develop a novel reasoning approach to enhance the performance of large language models (LLMs) in future occupation prediction. In this approach, a reason generator first derives a "reason" for a user using his/her past education and career history. The reason summarizes the user's preference and is used as the input of an occupation predictor to recommend the user's next occupation. This two-step occupation prediction approach is, however, non-trivial as LLMs are not aligned w...

📄 Spend Less, Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection
🗓️ Published: 4/24/2026
🔗 http://arxiv.org/abs/2604.22753v1
👥 Authors: Sijie Li, Shanda Li, Haowei Lin, Weiwei Sun, Ameet Talwalkar (possible past University Of California, Berkeley affiliation), Yiming Yang (possible past Microsoft (United States) affiliation)
Abstract

Scaling laws are used to plan multi-million-dollar training runs, but fitting those laws can itself cost millions. In modern large-scale workflows, assembling a sufficiently informative set of pilot experiments is already a major budget-allocation problem rather than a routine preprocessing step. We formulate scaling-law fitting as budget-aware sequential experimental design: given a finite pool of runnable experiments with heterogeneous costs, choose which runs to execute so as to maximize extr...
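The budget-aware selection problem the abstract formulates can be illustrated with a deliberately simplified one-shot greedy heuristic: rank candidate runs by estimated information gain per unit cost and take runs until the budget is spent. The paper describes *sequential* experimental design, so treat this as a strawman baseline under assumed inputs; the run names, gain estimates, and the `select_runs` interface are all hypothetical.

```python
def select_runs(candidates: dict[str, tuple[float, float]], budget: float):
    """Greedy budget-aware selection: pick experiments by estimated
    information gain per unit cost until the budget is exhausted.
    `candidates` maps run-id -> (estimated_gain, cost).
    (Illustrative baseline, not the paper's sequential method.)"""
    ranked = sorted(candidates.items(),
                    key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
    chosen, spent = [], 0.0
    for run_id, (gain, cost) in ranked:
        if spent + cost <= budget:
            chosen.append(run_id)
            spent += cost
    return chosen, spent

pilots = {"tiny": (1.0, 1.0), "small": (3.0, 2.0), "large": (10.0, 20.0)}
print(select_runs(pilots, budget=5.0))
# → (['small', 'tiny'], 3.0)
```

A sequential variant would re-fit the scaling law after each run and recompute the gain estimates, which is where the paper's formulation departs from this static sketch.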

📄 A Brain-Inspired Deep Separation Network for Single Channel Raman Spectra Unmixing
🗓️ Published: 4/24/2026
🔗 http://arxiv.org/abs/2604.22324v1
👥 Authors: Gaoruishu Long, Jinchao Liu, Bo Liu (possible past Meta (United States) affiliation), Jie Liu (possible past Tencent (China) affiliation), Xiaolin Hu (possible past Tsinghua University affiliation)
Abstract

Raman spectra obtained in real world applications are often a noisy combination of several spectra of various substances in a tested sample. Unmixing such spectra into individual components corresponding to each of the substances is of great value and has been a longstanding challenge in Raman spectroscopy. Existing unmixing methods are predominantly designed to invert an overdetermined mixed model and therefore require multiple mixed spectra as input. However, open domain and/or non-cooperative...

📄 How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals
🗓️ Published: 4/24/2026
🔗 http://arxiv.org/abs/2604.22271v1
👥 Authors: Dharshan Kumaran (possible past Google (United States) affiliation), Viorica Patraucean, Simon Osindero (possible past Google (United States) affiliation), Petar Velickovic, Nathaniel Daw
Abstract

Large language models can detect their own errors and sometimes correct them without external feedback, but the underlying mechanisms remain unknown. We investigate this through the lens of second-order models of confidence from decision neuroscience. In a first-order system, confidence derives from the generation signal itself and is therefore maximal for the chosen response, precluding error detection. Second-order models posit a partially independent evaluative signal that can disagree with t...

📄 Decoupled Travel Planning with Behavior Forest
🗓️ Published: 4/23/2026
🔗 http://arxiv.org/abs/2604.21354v1
👥 Authors: Duanyang Yuan, Sihang Zhou (possible past National University Of Defense Technology affiliation), Yanning Hou, Xiaoshu Chen, Haoyuan Chen, Ke Liang, Jiyuan Liu, Chuan Ma, Xinwang Liu (possible past National University Of Defense Technology affiliation), Jian Huang
Abstract

Behavior sequences, composed of executable steps, serve as the operational foundation for multi-constraint planning problems such as travel planning. In such tasks, each planning step is not only constrained locally but also influenced by global constraints spanning multiple subtasks, leading to a tightly coupled and complex decision process. Existing travel planning methods typically rely on a single decision space that entangles all subtasks and constraints, failing to distinguish between loca...

📄 Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression
🗓️ Published: 4/23/2026
🔗 http://arxiv.org/abs/2604.21335v1
👥 Authors: Wei Jiang (possible past Apple (United States) affiliation), Wei Wang (possible past University Of Oxford affiliation)
Abstract

Sub-token routing offers a finer control axis for transformer efficiency than the coarse units used in most prior work, such as tokens, pages, heads, or layers. In this paper, we study routing within a token representation itself in LoRA-adapted transformers. The motivation is that a relevant token need not be internally uniform: under a retention budget, preserved value groups are distributed unevenly both across tokens and within tokens, which suggests that KV compression need not be an all-or...

*Notable papers are those with at least two authors from a "big" AI/ML lab.