📄 Notable* Recent AI/ML arXiv Papers

📄 Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28787v1
👥 Authors: Shiyu Chen, Tarfah Alrashed, Alon Halevy (possible past Google (United States) affiliation), Natasha Noy (possible past Stanford University affiliation)
Abstract

In the era of autonomous agents, machine-actionable data is critical for data-driven workflows. For more than a decade, semantic metadata like schema.org has anchored the FAIR principles (Findable, Accessible, Interoperable, and Reusable) for machine-actionable data and enabled discovery tools like Google Dataset Search. However, the rise of Large Language Models (LLMs) capable of navigating the unstructured web raises a fundamental question: Is semantic metadata still necessary for agentic data...

📄 Rethinking Memory as Continuously Evolving Connectivity
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28773v1
👥 Authors: Jizhan Fang, Buqiang Xu, Zhixian Wang, Haoliang Cao, Xinle Deng, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu (possible past Tencent (China) affiliation), Ying Wei (possible past Tencent (China) affiliation), Guozhou Zheng, Feiyu Xiong, Haofen Wang, Huajun Chen (possible past Alibaba Group (China) affiliation), Ningyu Zhang (possible past Tencent (China) affiliation)
Abstract

Existing memory-augmented LLM agents often treat memory as a static repository with pre-defined representations and fixed retrieval pipelines, which is brittle in dynamic agentic environments where feedback, task variation, and heterogeneous signals continuously reshape what should be remembered and how it should be connected. To address this, we propose FluxMem, a connectivity-evolving memory framework that models memory as a heterogeneous graph and progressively refines its topology through th...

📄 CubePart: An Open-Vocabulary Part-Controllable 3D Generator
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28763v1
👥 Authors: Yiheng Zhu, Kangle Deng (possible past Carnegie Mellon University affiliation), Jean-Philippe Fauconnier, Inaki Navarro, Daiqing Li, Ava Pun, Yinan Zhang, Peiye Zhuang, Xiaoxia Sun, Maneesh Agrawala (possible past Stanford University affiliation), Kiran Bhat, Tinghui Zhou (possible past University of California, Berkeley affiliation)
Abstract

Interactive 3D assets used in games and simulation are typically decomposed into specific semantic parts to support animation, physics, and scripted behaviors, yet most generative 3D models produce either monolithic meshes or arbitrary part decompositions that cannot be aligned with application-specific requirements. We present CubePart, a generative framework for open-vocabulary, part-controllable 3D mesh generation that exposes part structure as an explicit inference-time control signal. Given...

📄 Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28713v1
👥 Authors: Guoxin Ma, Yibing Liu, Chengzhengxu Li, Yu Liang, Yan Wang (possible past Tencent (China) affiliation), Yueyang Zhang (possible past Baidu (China) affiliation), Kecheng Chen, Zhaohan Zhang, Zhiyuan Sun, Daiting Shi
Abstract

Context compression aims to shorten long context inputs with minimal information loss for LLM inference acceleration. While existing methods have shown promise, they typically rely on complex compression modules or compression-specific training, leaving the intrinsic capabilities of LLMs underexplored. In contrast, this work reveals that a thinking model itself can naturally compress long contexts by organizing task-relevant information. We thus derive Thinking as Compression (TaC), a new compre...

📄 GS-FUSE: Granger-Supervised Gated Fusion and Multi-Granularity Alignment for Event-Driven Financial Forecasting
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28520v1
👥 Authors: Yang Zhang (possible past Tsinghua University affiliation), En Chun, Ziyun Mao, Yulu Wu, Jun Wang (possible past Tencent (China) affiliation)
Abstract

Accurately forecasting the impact of salient financial events on markets is critical for investors and policymakers. However, existing multimodal time-series models typically fuse text and prices symmetrically, without an explicit way to decide when event text is truly predictive, and thus struggle to exploit the directional event-to-price structure and the heterogeneous roles of textual and price signals. In this work, we propose GS-Fuse, a multimodal event-based forecasting framework that empl...

📄 Measuring Progress Toward AGI: A Cognitive Framework
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28405v1
👥 Authors: Ryan Burnell, Yumeya Yamamori, Orhan Firat, Kate Olszewska, Steph Hughes-Fitt, Oran Kelly, Isaac R. Galatzer-Levy, Meredith Ringel Morris (possible past Google (United States) affiliation), Allan Dafoe (possible past University of Oxford affiliation), Alison M. Snyder, Noah D. Goodman (possible past Stanford University affiliation), Matthew Botvinick (possible past Google (United States) affiliation), Shane Legg (possible past Google (United States) affiliation)
Abstract

Despite widespread discussion of AGI, there is no clear framework for measuring progress toward it. This ambiguity fuels subjective claims, makes it difficult to track progress, and risks hindering responsible governance. As a starting point to address this gap, we present a framework for understanding system capabilities in relation to human cognitive abilities. Drawing from decades of research in psychology, neuroscience, and cognitive science, we introduce a Cognitive Taxonomy that deconstruc...

📄 FedMPT: Federated Multi-label Prompt Tuning of Vision-Language Models
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28347v1
👥 Authors: Xucong Wang, Pengkun Wang, Zhe Zhao (possible past Tencent (China) affiliation), Liheng Yu, Shuang Wang, Yang Wang (possible past Baidu (China) affiliation)
Abstract

Multi-Label Recognition (MLR) based on Vision-Language Models (VLMs) aims to leverage their pre-trained knowledge to better adapt to complex recognition scenarios, thereby enhancing model robustness. However, for realistic decentralized applications requiring federated learning, adapting VLMs to each client that possesses private and heterogeneous data can cause the model to overfit spurious label correlations, consequently triggering irrelevant categories when encountering new samples. To tackle t...

📄 How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28302v1
👥 Authors: Hanjiang Wu, Abhimanyu Rajeshkumar Bambhaniya, Sarbartha Banerjee, Tuhin Khare, Sudarshan Srinivasan, Suvinay Subramanian (possible past Google (United States) affiliation), Souvik Kundu, Madhu Kumar, Midhilesh Elavazhagan, William Won, Amir Yazdanbakhsh, Tushar Krishna (possible past Nvidia (United States) affiliation)
Abstract

Modern large language model (LLM) inference has progressively disaggregated to keep pace with growing model sizes and tight TTFT and TPOT service-level objectives: from chunked-prefill aggregation, to prefill-decode (P/D) disaggregation, and most recently to operator-level Attention-FFN Disaggregation (AFD). This trend is especially important for mixture-of-experts (MoE) models, where memory-bound attention, compute-intensive expert FFNs, and MoE dispatch/combine communication create distinct re...

📄 Global Policy-Space Response Oracles for Two-Player Zero-Sum Games
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28273v1
👥 Authors: Junyu Zhang, Feihong Yang, Jian Wang (possible past Baidu (China) affiliation), Chao Wang (possible past Google (United States) affiliation), Xudong Zhang
Abstract

The Policy-Space Response Oracles (PSRO) framework scales equilibrium computation to large zero-sum games by iteratively expanding a restricted strategy set using deep reinforcement learning (DRL). A central challenge is to construct, under limited computational budgets, a small strategy population whose induced game well approximates the full game. Existing PSRO variants typically expand the population using best responses to meta-strategies computed from restricted-game payoffs, which can lead...

📄 GUI Agents for Continual Game Generation
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28258v1
👥 Authors: Yixu Huang, Bo Li (possible past Tencent (China) affiliation), Na Li (possible past Tencent (China) affiliation), Zhe Wang (possible past DeepMind (United Kingdom) affiliation), Kaijie Chen, Haonan Ge, Qingyi Si, Yuanzhe Shen, Ruihan Yang, Guangjing Wang, Hongcheng Guo
Abstract

Generating a game is not the same as making one that can be played. Despite advances in code generation, existing approaches treat game generation as one-shot translation from prompt to artifact, leaving interaction-level failures undetected. We argue that evaluating and improving game generation requires a player, and study two roles for graphical user interface (GUI) agents in this process: (1) as an objective evaluator, for which we introduce PlaytestArena, a new evaluation environment that p...

📄 Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28160v1
👥 Authors: Yang Zhang (possible past Tsinghua University affiliation), Xiaoshuai Sun, Rui Zhao, Wujin Sun, Yidong Chen, Jiayi Ji, Qian Chen (possible past Shanghai Jiao Tong University affiliation), Rongrong Ji (possible past Tencent (China) affiliation)
Abstract

Existing multimodal reasoning approaches predominantly follow two paradigms: converting visual inputs into text prior to reasoning, or performing end-to-end reasoning within a unified vision-language representation space. Despite their empirical progress, both paradigms suffer from fundamental structural limitations. The former relies on static visual-to-text conversion, which tends to compress and lose fine-grained visual details. The latter is prone to linguistic dominance induced by joint opt...

📄 SNARE: Adaptive Scenario Synthesis for Eliciting Overeager Behavior in Coding Agents
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28122v1
👥 Authors: Yubin Qu, Yi Liu (possible past Google (United States) affiliation), Gelei Deng, Yanjun Zhang, Yuekang Li, Ying Zhang (possible past Tencent (China) affiliation), Leo Yu Zhang
Abstract

A coding agent executes a benign task as a sequence of shell, file, and network actions, any of which can quietly exceed the authorized scope while the task still completes. We call this overeager behavior: the prompt is not adversarial and the run succeeds, yet an out-of-scope step can leak credentials or delete files. Existing benchmarks miss it: task-completion suites credit any finished run, jailbreak suites probe adversarial prompts, and the one prior overeager benchmark applies a single fi...

📄 MIRAGE: Context-Aware Prompt Injection against Mobile GUI Agents via User-Generated Content
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28116v1
👥 Authors: Ruoqi Guo, Yi Liu (possible past Google (United States) affiliation), Gelei Deng, Yiheng Xiong, Yuekang Li, Ying Zhang (possible past Tencent (China) affiliation), Leo Yu Zhang, Lida Zhao, Ji Jie, Yuxiao Lu
Abstract

Mobile graphical user interface (GUI) agents driven by vision-language models (VLMs) perceive the screen as rendered pixels and choose actions from what they see, so they cannot reliably separate trusted interface elements from user-generated content. We present MIRAGE (Mobile Injection of Realistic Adversarial GUI Examples), a pipeline that turns benign mobile screenshots into prompt-injection samples by placing attacker-controlled text into ordinary user-generated content regions, without modi...

📄 Defending LLM-based Multi-Agent Systems Against Cooperative Attacks with Sentence-Level Rectification
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28104v1
👥 Authors: Yaoyang Luo, Zhi Zheng, Ziwei Zhao, Tong Xu (possible past Baidu (China) affiliation), Zhao Jielun, Wenjun Xue, Yong Chen, Enhong Chen (possible past Baidu (China) affiliation)
Abstract

Recent years have witnessed the rapid development of Large Language Model-based Multi-Agent Systems (MAS), which excel at collaborative decision-making and complex problem-solving. However, malicious agents in MAS may inject misinformation to mislead other agents and disrupt system performance, giving rise to a new research direction that focuses on attack mechanisms and defense strategies in MAS. Prior studies largely assume malicious agents act independently and investigate the corresponding d...

📄 MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28077v1
👥 Authors: Chuang Tang, Chenhao Lin, Yin Xu, Hao Wang (possible past Tsinghua University affiliation), Jinrui Zhou, Xin Li (possible past Google (United States) affiliation), Mingjun Xiao, Enhong Chen (possible past Baidu (China) affiliation)
Abstract

Parsing chemical reaction diagrams from scientific literature is challenging due to heterogeneous layouts, intertwined visual elements, and the difficulty of integrating recognition and reasoning. Existing vision-language models advance multimodal understanding but still fail on complex diagrams, struggling to maintain spatial coherence and to integrate multidimensional information during reasoning. To address these issues, we propose MACReD, a hierarchical multi-agent framework that coordinates...

📄 ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28069v1
👥 Authors: Zhexin Hu, Li Wang (possible past Tesla (United States) affiliation), Xiaohan Wang (possible past Baidu (China) affiliation), Jiajun Chai, Xiaojun Guo, Wei Lin, Guojun Yin
Abstract

Adaptive context compression is vital for scaling Large Language Models (LLMs) to complex, multi-turn agent tasks. However, rule-based compression methods may discard task-critical nuances, while Reinforcement Learning (RL) approaches usually struggle to balance information retention and token efficiency under the sparse rewards inherent to long-horizon workflows. To bridge this gap, we propose ZipRL, a novel adaptive compression framework tailored for Reinforcement Learning from Verifiable Rewa...

📄 BlazeEdit: Generalist Image Editing on Mobile Devices with Image-to-Image Diffusion Models
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28067v1
👥 Authors: Fei Deng, Yanwu Xu (possible past Baidu (China) affiliation), Zhipeng Bao, Zhixing Zhang, Haolin Jia, Karthik Raveendran, Jianing Wei (possible past Google (United States) affiliation)
Abstract

The remarkable generation quality of modern diffusion models often comes at the cost of massive parameter counts, which necessitate server-side inference with significant computational costs and potential privacy risks. Consequently, there is growing momentum toward developing efficient on-device alternatives. While recent efforts have optimized text-to-image models for mobile hardware, they remain relatively bulky, typically ranging from 0.5B to 1B parameters. We present BlazeEdit, a highly eff...

📄 PetroBench: A Benchmark for Large Language Models in Petroleum Engineering
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28032v1
👥 Authors: Xiang Wang (possible past Tencent (China) affiliation), Tingting Zhang, Sen Wang, Ying Wu, Heng Meng, Peng Zhou (possible past Tencent (China) affiliation), Peng Li (possible past Tsinghua University affiliation)
Abstract

Large Language Models are increasingly applied in the petroleum industry, highlighting the need for a domain-specific evaluation framework. This study develops a benchmark for LLMs in petroleum engineering, including a three-stage process of data preprocessing, quality filtering, and multi-model validation. Using expert review, a standardized question bank with strong domain relevance and discriminative capability was constructed. The benchmark covers production, reservoir, and drilling engineer...

📄 VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28023v1
👥 Authors: Xingyu Lu, Jinpeng Wang (possible past Tencent (China) affiliation), Yi-Fan Zhang, Yankai Yang, Yancheng Long, Yiyang Fan, Xuanyu Zheng, Haonan Fan, Kaiyu Jiang, Tianke Zhang, Changyi Liu, Bin Wen, Fan Yang (possible past Tencent (China) affiliation), Tingting Gao, Han Li, Chun Yuan
Abstract

Visual captioning requires models to capture visual content faithfully while minimizing both omission and hallucination. As the dominant paradigm for captioning, MLLMs have achieved strong performance through scaling and high-quality data. Recently, RL has emerged as a key route to driving MLLMs toward higher precision and broader coverage; however, existing reward designs for captioning fail to provide fine-grained and reliable signals for factual verification, limiting their effectiveness. To ...

📄 Out of Sight, Not Out of Mind: Unveiling Latent Attack in Latent-based Multi-Agent Systems
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28214v1
👥 Authors: Chenxi Wang, Ruiyang Huang, Jiayan Sun, Lei Wei (possible past Tencent (China) affiliation), Yifan Wu (possible past Carnegie Mellon University affiliation)
Abstract

Latent-based multi-agent systems replace parts of explicit inter-agent communication with hidden representations, offering a new direction for efficient and flexible agent collaboration. However, moving coordination into latent space may also move attacks beyond the reach of visible-text inspection. In this paper, we study whether latent states can carry attack-associated information that remains effective during clean executions. To examine this question, we introduce a latent attack framework ...

📄 Skillful high-resolution weather forecasting independent of physical models
🗓️ Published: 5/27/2026
🔗 http://arxiv.org/abs/2605.28153v1
👥 Authors: Pengcheng Zhao, Siqi Xiang, Weixin Jin, Zekun Ni, Jiang Bian (possible past Baidu (China) affiliation), Zuliang Fang, Hongyu Sun, Bin Zhang, Richard E. Turner (possible past University of Cambridge affiliation), Jonathan Weyn, Haiyu Dong, Kit Thambiratnam, Qi Zhang (possible past Tencent (China) affiliation)
Abstract

Accurate and timely weather forecasts are critical for high-impact decisions in modern society. Machine-learning-based weather prediction is emerging as an alternative for producing initial conditions, forecasts, and even both in end-to-end systems. These methods deliver predictions faster and often with higher skill than traditional numerical weather prediction (NWP). However, even end-to-end models typically rely on NWP-generated reanalyses for supervision, thereby inheriting the biases and re...

📄 High-Fidelity Industrial Crash Dynamics Prediction via Geometry-Aware Operator Learning with Memory-Efficient Low-Rank Attention
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.27758v1
👥 Authors: Deepak Akhare, Mohammad Amin Nabian (possible past Nvidia (United States) affiliation), Corey Adams, Sudeep Chavare, Sanjay Choudhry (possible past Nvidia (United States) affiliation)
Abstract

Automotive crashworthiness optimization remains a safety-critical challenge, requiring the management of large-scale nonlinear structural deformations and energy dissipation through iterative, high-fidelity simulations. While traditional finite element solvers are computationally prohibitive, emerging operator learning frameworks provide rapid surrogate predictions; however, applying them to industrial-scale crash analysis, where complex geometry, contact nonlinearities, and rapidly evolving tra...

*Notable papers are those with at least two authors from a "big" AI/ML lab.