📄 Notable* Recent AI/ML arXiv Papers


📄 Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
🗓️ Published: 5/1/2026
🔗 http://arxiv.org/abs/2605.00814v1
👥 Authors: Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu (possible past Nvidia (United States) affiliation), Zefeng He, Muxin Fu, Daizong Liu, Wei-Long Zheng, Yu Cheng (possible past National University Of Singapore affiliation)
Abstract

While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to ensure sustained, on-demand visual perception. Integrated as a parallel...
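The dilution mechanism follows directly from softmax normalization: as textual tokens accumulate, the fixed pool of visual tokens claims a shrinking fraction of attention mass. A toy illustration of that phenomenon (this models the dilution effect only, not PVM itself):

```python
import numpy as np

def visual_attention_share(n_visual, n_text, rng):
    """Fraction of softmax attention mass landing on visual tokens when
    scores are drawn i.i.d. (toy model of dilution, not the paper's method)."""
    scores = rng.normal(size=n_visual + n_text)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights[:n_visual].sum()

rng = np.random.default_rng(0)
n_visual = 256
for n_text in (0, 256, 1024, 4096):
    share = np.mean([visual_attention_share(n_visual, n_text, rng)
                     for _ in range(50)])
    print(f"text tokens={n_text:5d}  visual share={share:.3f}")
```

With i.i.d. scores the visual share falls roughly as n_visual / (n_visual + n_text), matching the inverse decay with sequence length the abstract describes.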

📄 Empowering Heterogeneous Graph Foundation Models via Decoupled Relation Alignment
🗓️ Published: 5/1/2026
🔗 http://arxiv.org/abs/2605.00731v1
👥 Authors: Ziyu Zheng, Yaming Yang, Zhe Wang (possible past Deepmind (United Kingdom) affiliation), Ziyu Guan, Wei Zhao (possible past Tencent (China) affiliation)
Abstract

While Graph Foundation Models (GFMs) have achieved remarkable success in homogeneous graphs, extending them to multi-domain heterogeneous graphs (MDHGs) remains a formidable challenge due to cross-type feature shifts and intra-domain relation gaps. Existing global feature alignment methods (PCA or SVD) enforce a shared feature space blindly, which distorts type-specific semantics and disrupts original topologies, inevitably leading to "Type Collapse" and "Relation Confusion". To address these fu...

📄 LLM-Oriented Information Retrieval: A Denoising-First Perspective
🗓️ Published: 5/1/2026
🔗 http://arxiv.org/abs/2605.00505v1
👥 Authors: Lu Dai, Liang Sun, Fanpu Cao, Ziyang Rao, Cehao Yang, Hao Liu (possible past Tencent (China) affiliation), Hui Xiong (possible past Baidu (China) affiliation)
Abstract

Modern information retrieval (IR) is no longer consumed primarily by humans but increasingly by large language models (LLMs) via retrieval-augmented generation (RAG) and agentic search. Unlike human users, LLMs are constrained by limited attention budgets and are uniquely vulnerable to noise; misleading or irrelevant information is no longer just a nuisance, but a direct cause of hallucinations and reasoning failures. In this perspective paper, we argue that denoising, maximizing usable evidence ...

📄 PAMod: Modeling Cyclical Shifts via Phase-Amplitude Modulation for Non-stationary Time Series Forecasting
🗓️ Published: 5/1/2026
🔗 http://arxiv.org/abs/2605.00466v1
👥 Authors: Yingbo Zhou, Yutong Ye, Shuhao Li, Rui Qian (possible past Shanghai Jiao Tong University affiliation), Qiang Huang, Lemao Liu (possible past Tencent (China) affiliation), Li Sun, Dejing Dou (possible past Baidu (China) affiliation)
Abstract

Real-world time series forecasting faces the fundamental challenge of non-stationary statistical properties, including shifts in mean and variance over time. While reversible instance normalization (RevIN) has shown promise by stationarizing inputs and denormalizing outputs, it relies on the strong assumption that historical and future distributions remain identical. We observe that in many practical applications, distribution shifts follow cyclical patterns that correlate with periodic position...
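RevIN, which the abstract builds on, is a published technique whose core transform is simple: normalize each input window by its own mean and standard deviation, forecast in the normalized space, then invert the transform on the output. A minimal sketch, omitting RevIN's learnable affine parameters:

```python
import numpy as np

def revin_normalize(x, eps=1e-5):
    """Per-instance normalization (RevIN-style): remove each window's own
    mean and scale, and keep the statistics so the output can be denormalized."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True) + eps
    return (x - mu) / sigma, (mu, sigma)

def revin_denormalize(y, stats):
    """Invert the instance normalization on a (forecast) output."""
    mu, sigma = stats
    return y * sigma + mu

x = np.array([[10.0, 12.0, 14.0, 16.0]])  # toy window with nonzero mean
z, stats = revin_normalize(x)
x_rec = revin_denormalize(z, stats)       # round-trips back to x
```

The assumption PAMod targets is visible here: denormalizing with the *input* window's statistics presumes the future window shares them, which fails under the cyclical shifts the abstract describes.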

📄 Thinking in Text and Images: Interleaved Vision-Language Reasoning Traces for Long-Horizon Robot Manipulation
🗓️ Published: 5/1/2026
🔗 http://arxiv.org/abs/2605.00438v1
👥 Authors: Jinkun Liu, Haohan Chi, Lingfeng Zhang, Yifan Xie, Yuan Wang, Long Chen (possible past Tencent (China) affiliation), Hangjun Ye, Xiaoshuai Hao, Wenbo Ding (possible past Tsinghua University affiliation)
Abstract

Long-horizon robotic manipulation requires plans that are both logically coherent and geometrically grounded. Existing Vision-Language-Action policies usually hide planning in latent states or expose only one modality: text-only chain-of-thought encodes causal order but misses spatial constraints, while visual prediction provides geometric cues but often remains local and semantically underconstrained. We introduce Interleaved Vision-Language Reasoning (IVLR), a policy framework built around ...

📄 AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
🗓️ Published: 5/1/2026
🔗 http://arxiv.org/abs/2605.00425v1
👥 Authors: Haotian Zhao, Yuxin Zhang, Songlin Zhou, Stephen S. -T. Yau, Wenyu Zhang (possible past Tencent (China) affiliation), Lun Tian, Tianshu Zhu, Yifeng Huang, Yucheng Zeng, Jingnan Gu, Daxiang Dong (possible past Baidu (China) affiliation), Jianmin Wu
Abstract

Reinforcement learning (RL) has significantly advanced the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. Yet effective training remains challenging, as sparse, outcome-only rewards make it difficult to assign credit to individual steps in an agent's action trajectory. A common remedy is to introduce dense intermediate supervision, such as process reward models or auxiliary self-supervised signals, but this increases supervision and tuning ...

📄 DynamicPO: Dynamic Preference Optimization for Recommendation
🗓️ Published: 5/1/2026
🔗 http://arxiv.org/abs/2605.00327v1
👥 Authors: Xingyu Hu, Kai Zhang, Jiancan Wu, Shuli Wang, Chi Wang (possible past Microsoft (United States) affiliation), Wenshuai Chen, Yinhua Zhu, Haitao Wang, Xingxing Wang, Xiang Wang (possible past Tencent (China) affiliation)
Abstract

In large language model (LLM)-based recommendation systems, direct preference optimization (DPO) effectively aligns recommendations with user preferences, requiring multi-negative objective functions to leverage abundant implicit-feedback negatives and sharpen preference boundaries. However, our empirical analyses reveal a counterintuitive phenomenon, preference optimization collapse, where increasing the number of negative samples can lead to performance degradation despite a continuously decre...
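For reference, the standard pairwise DPO loss operates on policy-versus-reference log-ratios, and a common way to use multiple negatives is to average the pairwise losses. The multi-negative form below is an assumption for illustration; the paper's exact multi-negative objective is not given in the truncated abstract:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_pair_loss(pos_logratio, neg_logratio, beta=1.0):
    """Standard pairwise DPO: -log sigmoid(beta * (positive log-ratio
    minus negative log-ratio)), each log-ratio taken vs. a reference model."""
    return -math.log(sigmoid(beta * (pos_logratio - neg_logratio)))

def multi_negative_dpo(pos_logratio, neg_logratios, beta=1.0):
    """Hypothetical multi-negative variant: mean of pairwise losses over
    all implicit-feedback negatives (illustrative, not the paper's objective)."""
    return sum(dpo_pair_loss(pos_logratio, n, beta)
               for n in neg_logratios) / len(neg_logratios)
```

Each extra negative adds another term pushing the preference margin wider, which is the sharpening effect the abstract says can paradoxically degrade recommendations.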

📄 Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis
🗓️ Published: 5/1/2026
🔗 http://arxiv.org/abs/2605.00314v1
👥 Authors: Hongbo Wen, Ying Li (possible past Meta (United States) affiliation), Hanzhi Liu, Chaofan Shou, Yanju Chen, Yuan Tian, Yu Feng (possible past University Of California, Berkeley affiliation)
Abstract

An agent skill is a configuration package that equips an LLM-driven agent with a concrete capability, such as reading email, executing shell commands, or signing blockchain transactions. Each skill is a hybrid artifact (a structured half declares executable interfaces, while a prose half dictates when and how those interfaces fire), and the prose is reinterpreted probabilistically on every invocation. Conventional static analyzers parse the structured half but ignore the prose; LLM-based tools read...

📄 Caracal: Causal Architecture via Spectral Mixing
🗓️ Published: 4/30/2026
🔗 http://arxiv.org/abs/2605.00292v1
👥 Authors: Bingzheng Gan, Tianyi Zhang, Yusu Li, Jing Huang (possible past Meta (United States) affiliation), Wei Shi, Yangkai Ding, Tao Yu (possible past University Of Washington affiliation)
Abstract

The scalability of Large Language Models to long sequences is hindered by the quadratic cost of attention and the limitations of positional encodings. To address these, we introduce Caracal, a novel architecture that replaces attention with a parameter-efficient, $\mathcal{O}(L \log L)$ Multi-Head Fourier (MHF) module. Our contributions are threefold: (1) We leverage the Fast Fourier Transform (FFT) for sequence mixing, inherently addressing both bottlenecks mentioned above. (2) We apply a frequ...
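The core move of replacing attention with FFT-based sequence mixing can be sketched generically. The filter parameterization of Caracal's MHF module is not given in the truncated abstract, so the sketch below uses a single frequency-domain filter in the FNet style:

```python
import numpy as np

def fourier_mix(x, filt):
    """O(L log L) token mixing: FFT along the sequence axis, multiply by a
    frequency-domain filter, then inverse FFT (generic spectral-mixing sketch)."""
    X = np.fft.rfft(x, axis=0)                      # (L//2 + 1, d) spectrum
    return np.fft.irfft(X * filt, n=x.shape[0], axis=0)

L, d = 16, 4
rng = np.random.default_rng(1)
x = rng.normal(size=(L, d))
filt = np.ones((L // 2 + 1, 1))                     # identity filter
y = fourier_mix(x, filt)                            # == x up to FFT round-off
```

The FFT/inverse-FFT pair costs O(L log L) versus attention's O(L^2), which is the bottleneck the abstract targets; a learned `filt` (here a hypothetical placeholder) would do the actual mixing.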

📄 When Do Diffusion Models Learn to Generate Multiple Objects?
🗓️ Published: 4/30/2026
🔗 http://arxiv.org/abs/2605.00273v1
👥 Authors: Yujin Jeong, Arnas Uselis, Iro Laina (possible past University Of Oxford affiliation), Seong Joon Oh, Anna Rohrbach (possible past University Of California, Berkeley affiliation)
Abstract

Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data dis...

📄 Rethinking Network Topologies for Cost-Effective Mixture-of-Experts LLM Serving
🗓️ Published: 4/30/2026
🔗 http://arxiv.org/abs/2605.00254v1
👥 Authors: Junsun Choi, Sam Son, Sunjin Choi, Hansung Kim, Yakun Sophia Shao (possible past University Of California, Berkeley affiliation), Scott Shenker (possible past University Of California, Berkeley affiliation), Sylvia Ratnasamy (possible past University Of California, Berkeley affiliation), Borivoje Nikolic
Abstract

Mixture-of-experts (MoE) architectures have turned LLM serving into a cluster-scale workload in which communication consumes a considerable portion of LLM serving runtime. This has prompted industry to invest heavily in expensive high-bandwidth scale-up networks. We question whether such costly infrastructure is strictly necessary. We present the first systematic cross-layer analysis of network cost-effectiveness for MoE LLM serving, comparing four representative XPU (e.g., GPU/TPU) topologies (...

📄 Synthetic Computers at Scale for Long-Horizon Productivity Simulation
🗓️ Published: 4/30/2026
🔗 http://arxiv.org/abs/2604.28181v1
👥 Authors: Tao Ge, Baolin Peng, Hao Cheng (possible past Tencent (China) affiliation), Jianfeng Gao (possible past Microsoft (United States) affiliation)
Abstract

Realistic long-horizon productivity work is strongly conditioned on user-specific computer environments, where much of the work context is stored and organized through directory structures and content-rich artifacts. To scale synthetic data creation for such productivity scenarios, we introduce Synthetic Computers at Scale, a scalable methodology for creating such environments with realistic folder hierarchies and content-rich artifacts (e.g., documents, spreadsheets, and presentations). Conditi...

📄 Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists
🗓️ Published: 4/30/2026
🔗 http://arxiv.org/abs/2604.28158v2
👥 Authors: Yujun Wu, Dongxu Zhang, Xinchen Li, Jinhang Xu, Yiling Duan, Yumou Liu, Jiabao Pan, Qiyuan Zhu, Xuanhe Zhou, Jingxuan Wei, Siyuan Li (possible past Tencent (China) affiliation), Jintao Chen, Conghui He (possible past Tsinghua University affiliation), Cheng Tan
Abstract

Existing research infrastructure is fundamentally document-centric, providing citation links between papers but lacking explicit representations of methodological evolution. In particular, it does not capture the structured relationships that explain how and why research methods emerge, adapt, and build upon one another. With the rise of AI-driven research agents as a new class of consumers of scientific knowledge, this limitation becomes increasingly consequential, as such agents cannot reliabl...

📄 Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
🗓️ Published: 4/30/2026
🔗 http://arxiv.org/abs/2604.28139v2
👥 Authors: Chenxin Li, Zhengyang Tang, Mingxin Huang, Yunlong Lin, Shijue Huang, Shengyuan Liu, Bowen Ye, Rang Li, Lei Li (possible past Carnegie Mellon University affiliation), Benyou Wang (possible past Tencent (China) affiliation), Yixuan Yuan
Abstract

LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or verify whether a task was executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-deman...

📄 Crab: A Semantics-Aware Checkpoint/Restore Runtime for Agent Sandboxes
🗓️ Published: 4/30/2026
🔗 http://arxiv.org/abs/2604.28138v1
👥 Authors: Tianyuan Wu, Chaokun Chang, Lunxi Cao, Wei Gao (possible past Peking University affiliation), Wei Wang (possible past University Of Oxford affiliation)
Abstract

Autonomous agents act through sandboxed containers and microVMs whose state spans filesystems, processes, and runtime artifacts. Checkpoint and restore (C/R) of this state is needed for fault tolerance, spot execution, RL rollout branching, and safe rollback, yet existing approaches fall into two extremes: application-level recovery preserves chat history but misses OS-side effects, while full per-turn checkpointing is correct but too expensive under dense co-location. The root cause is an agent-...

📄 From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation
🗓️ Published: 4/30/2026
🔗 http://arxiv.org/abs/2604.27969v1
👥 Authors: Guang Yang, Xing Hu (possible past Baidu (China) affiliation), Xiang Chen (possible past Tencent (China) affiliation), Xin Xi
Abstract

Multimodal large language models (MLLMs) are increasingly used to translate visual artifacts into code, from UI mockups to HTML and from scientific plots to Python scripts. A circuit diagram can be viewed as a visual domain-specific language for hardware: it encodes timing, topology, and bit-level semantics that are invisible to casual inspection yet safety-critical once fabricated in silicon. Translating such diagrams into register-transfer level (RTL) code therefore represents an extreme reliabil...

📄 CastFlow: Learning Role-Specialized Agentic Workflows for Time Series Forecasting
🗓️ Published: 4/30/2026
🔗 http://arxiv.org/abs/2604.27840v1
👥 Authors: Bokai Pan, Mingyue Cheng, Zhiding Liu, Shuo Yu, Xiaoyu Tao, Yuchong Wu, Qi Liu (possible past Tencent (China) affiliation), Defu Lian, Enhong Chen (possible past Baidu (China) affiliation)
Abstract

Recently, large language models (LLMs) have shown great promise in time series forecasting. However, most existing LLM-based forecasting methods still follow a static generative paradigm that directly maps historical observations to future values in a single pass. Under this paradigm, forecasting is constrained by limited temporal pattern extraction, single-round acquisition of contextual features, one-shot forecast generation, and lack of support from ensemble forecasts. To address these limita...

📄 XekRung Technical Report
🗓️ Published: 4/30/2026
🔗 http://arxiv.org/abs/2605.00072v1
👥 Authors: Jiutian Zeng, Junjie Li, Chengwei Dai, Jie Liang, Zhaoyu Hu, Yiliang Zhang, Ziang Weng, Longtao Huang, Dongjie Zhang, Libin Dong, Yang Ge, Yuanda Wang, Kaiwen Lv Kacuila, Bingyu Zhu, Jing Wang (possible past Google (United States) affiliation), Jin Xu (possible past Tencent (China) affiliation)
Abstract

We present XekRung, a frontier large language model for cybersecurity, designed to provide comprehensive security capabilities. To achieve this, we develop diverse data synthesis pipelines tailored to the cybersecurity domain, enabling the scalable construction of high-quality training data and providing a strong foundation for cybersecurity knowledge and understanding. Building on this foundation, we establish a complete training pipeline spanning continued pre-training (CPT), supervised fine-t...

📄 AgentEconomist: An End-to-end Agentic System Translating Economic Intuitions into Executable Computational Experiments
🗓️ Published: 4/30/2026
🔗 http://arxiv.org/abs/2604.27725v1
👥 Authors: Jiaju Chen, Jinghua Piao, Xia Xu, Songwei Li, Tong Xia, Xiangnan He (possible past National University Of Singapore affiliation), Yong Li (possible past Tsinghua University affiliation)
Abstract

A long-standing challenge in economics lies not in the lack of intuition, but in the difficulty of translating intuitive insights into verifiable research. To address this challenge, we introduce AgentEconomist, an end-to-end interactive system designed to translate abstract intuitions into executable computational experiments. Grounded in a domain-specific knowledge base covering over 13,000 high-quality academic papers, the system employs a modular multi-stage architecture. Specifically, the I...

📄 Bridging Values and Behavior: A Hierarchical Framework for Proactive Embodied Agents
🗓️ Published: 4/30/2026
🔗 http://arxiv.org/abs/2604.27699v1
👥 Authors: Chunhui Zhang, Yuxuan Wang (possible past Google (United States) affiliation), Aoyang Qin, Yi-Long Lu, Kunlun Wu, Yizhou Wang (possible past Peking University affiliation), Wei Wang (possible past University Of Oxford affiliation)
Abstract

Current embodied agents are often limited to passive instruction-following or reactive need-satisfaction, lacking a stable, high-order value framework essential for long-term, self-directed behavior and resolving motivational conflicts. We introduce ValuePlanner, a hierarchical cognitive architecture that decouples high-level value scheduling from low-level action execution. ValuePlanner employs an LLM-based cognitive module to generate symbolic subgoals by reasoning through ab...

📄 The Power of Order: Fooling LLMs with Adversarial Table Permutations
🗓️ Published: 5/1/2026
🔗 http://arxiv.org/abs/2605.00445v1
👥 Authors: Xinshuai Dong, Haifeng Chen, Xuyuan Liu, Shengyu Chen, Haoyu Wang (possible past Tencent (China) affiliation), Shaoan Xie, Kun Zhang (possible past Google (United States) affiliation), Zhengzhang Chen
Abstract

Large Language Models have achieved remarkable success and are increasingly deployed in critical applications involving tabular data, such as Table Question Answering. However, their robustness to the structure of this input remains a critical, unaddressed question. This paper demonstrates that modern LLMs exhibit a significant vulnerability to the layout of tabular data. Specifically, we show that semantically invariant permutations of rows and columns, rearrangements that do not alter the tab...
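The attack surface is easy to reproduce: permuting rows and columns preserves every (column, value) fact in a table while changing the serialized layout an LLM sees. A minimal sketch of such a semantics-preserving permutation (the paper's adversarial search over permutations is not shown here):

```python
import random

def permute_table(header, rows, seed=0):
    """Semantically-invariant permutation: shuffle column order and row order.
    The set of (column, value) facts per record is unchanged."""
    rng = random.Random(seed)
    col_order = list(range(len(header)))
    rng.shuffle(col_order)
    new_header = [header[j] for j in col_order]
    new_rows = [[row[j] for j in col_order] for row in rows]
    rng.shuffle(new_rows)
    return new_header, new_rows

header = ["city", "pop"]  # hypothetical toy table
rows = [["Paris", "2.1M"], ["Lima", "9.7M"], ["Oslo", "0.7M"]]
h2, r2 = permute_table(header, rows, seed=1)
```

An adversary would search over such layouts for one that flips the model's answer; since no fact changes, any flip is a pure robustness failure.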

📄 ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
🗓️ Published: 5/1/2026
🔗 http://arxiv.org/abs/2605.00380v1
👥 Authors: Zihan Lin, Xiaohan Wang (possible past Baidu (China) affiliation), Jie Cao, Jiajun Chai, Li Wang (possible past Tesla (United States) affiliation), Xiaodong Lu, Wei Lin, Ran He, Guojun Yin
Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning of Large Language Models (LLMs) but usually exhibits limited generation diversity due to the over-incentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes ne...

📄 Cost-Aware Learning
🗓️ Published: 4/30/2026
🔗 http://arxiv.org/abs/2604.28020v1
👥 Authors: Clara Mohri, Amir Globerson (possible past Google (United States) affiliation), Haim Kaplan, Tomer Koren, Yishay Mansour (possible past Google (United States) affiliation)
Abstract

We consider the problem of Cost-Aware Learning, where sampling different component functions of a finite-sum objective incurs different costs. The objective is to reach a target error while minimizing the total cost. First, we propose the Cost-Aware Stochastic Gradient Descent algorithm for convex functions, and derive its cost complexity to attain an error of ε. Furthermore, we establish a lower bound for this setting and provide a subset selection algorithm to further reduce the cost of trai...
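The setting can be made concrete with a toy finite-sum objective in which sampling each component gradient charges its own cost. This illustrates the problem setup only; the paper's actual sampling rule and cost-complexity analysis are not given in the truncated abstract:

```python
import numpy as np

def cost_aware_sgd(grads, costs, probs, x0, lr, budget, rng):
    """Importance-weighted SGD that charges costs[i] per sampled component and
    stops when the cost budget is exhausted (toy illustration of the setting)."""
    x, spent = x0, 0.0
    n = len(costs)
    while True:
        i = rng.choice(n, p=probs)
        if spent + costs[i] > budget:
            return x, spent
        spent += costs[i]
        x -= lr * grads[i](x) / (n * probs[i])  # unbiased full-gradient estimate

# Toy objective: f(x) = (1/2) * [(x - 1)^2 + (x - 3)^2], minimized at x* = 2.
grads = [lambda x: 2 * (x - 1.0), lambda x: 2 * (x - 3.0)]
costs = [1.0, 5.0]                 # the second component is expensive to sample
probs = [0.5, 0.5]                 # uniform sampling; a cost-aware rule would differ
x_hat, spent = cost_aware_sgd(grads, costs, probs, 0.0, 0.05, 600.0,
                              np.random.default_rng(0))
```

The tension the paper studies is visible here: a smarter sampling distribution could reach the same error while spending less of the budget on the expensive component.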

📄 Beyond the Baseband: Adaptive Multi-Band Encoding for Full-Spectrum Bioacoustics Classification
🗓️ Published: 4/30/2026
🔗 http://arxiv.org/abs/2604.27936v1
👥 Authors: Eklavya Sarkar, Marius Miron, David Robinson, Gagan Narula, Milad Alizadeh, Ellen Gilsenan-Mcmahon, Emmanuel Chemla, Olivier Pietquin (possible past Google (United States) affiliation), Matthieu Geist (possible past Google (United States) affiliation)
Abstract

Animals hear and vocalize across frequency ranges that differ substantially from humans, often extending into the ultrasonic domain. Yet most computational bioacoustics systems rely on audio models pre-trained at 16 kHz, restricting their usable bandwidth to the 0-8 kHz baseband and discarding higher-frequency information present in many bioacoustic recordings. We investigate a multi-band encoding framework that decomposes the full spectrum of animal calls into band features and fuses them into ...
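Band decomposition itself can be sketched generically with FFT masking. This is not the paper's encoder; the sample rate and band edges below are illustrative, chosen so that one band matches the 0-8 kHz baseband the abstract mentions:

```python
import numpy as np

def band_split(signal, sr, edges):
    """Split a waveform into frequency bands via FFT masking (generic sketch):
    each band keeps only the spectrum bins inside its [lo, hi) range."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        bands.append(np.fft.irfft(spec * mask, n=len(signal)))
    return bands

sr = 64_000                        # a 64 kHz capture keeps ultrasonic content
t = np.arange(sr) / sr
sig = (np.sin(2 * np.pi * 2_000 * t)      # audible tone, inside the baseband
       + np.sin(2 * np.pi * 20_000 * t))  # tone a 16 kHz model would discard
bands = band_split(sig, sr, [0, 8_000, 32_000])
```

A multi-band encoder would then featurize each band separately and fuse the results; here the 20 kHz tone survives only because the second band extends past the 8 kHz baseband cutoff.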

*Notable papers are those with at least two authors from a "big" AI/ML lab.