📄 Notable* Recent AI/ML arXiv Papers

📄 Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03988v1
👥 Authors: Mahtab Bigverdi, Lindsey Li, Weikai Huang, Yiming Liu, Jaemin Cho, Jieyu Zhang, Tuhin Kundu, Chris Dangjoo Kim, Zelun Luo (possible past Stanford University affiliation), Linda Shapiro, Ranjay Krishna (possible past University of Washington affiliation)
Abstract

Vision language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical information is not directly observable. Many such problems require imaginative perception: inferring what would be seen from an unseen viewpoint, tracing paths through occluded spaces, or integrating partial observations into a coherent spatial representation. We introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive ...

📄 Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03985v1
👥 Authors: Zekun Qi, Xuchuan Chen, Dairu Liu, Chenghuai Lin, Yunrui Lian, Sikai Liang, Zhikai Zhang, Yu Guan, Jilong Wang, Wenyao Zhang, Xinqiang Yu, He Wang (possible past Stanford University affiliation), Li Yi (possible past Stanford University affiliation)
Abstract

We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achie...

📄 Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03962v1
👥 Authors: Anthony Gx-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas, Eser Aygün, David Smalling, Shibl Mourad, Doina Precup (possible past DeepMind (United Kingdom) affiliation), André Barreto (possible past DeepMind (United Kingdom) affiliation), Mark Rowland (possible past University of Cambridge affiliation)
Abstract

Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet, modern applications such as language model fine-tuning or scientific discovery demand diversity. Existing remedies such as entropy regularization or diversity bonuses often require fragile trade-offs that sacrifice performance for stochasticity or rely on heuristic metrics that can misalign policy rankings. We argue that diversity is more naturally understood as t...

📄 scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03906v1
👥 Authors: Jiabei Cheng, Jingbo Zhou (possible past Baidu (China) affiliation), Jun Xia, Changkai Li, Zhen Lei (possible past Beijing Academy of Artificial Intelligence affiliation), Chang Yu, Stan Z. Li
Abstract

Simultaneous measurement of multiple omics modalities in single cells enables researchers to gain a more comprehensive understanding of cellular states and regulatory mechanisms. However, due to high experimental costs, significant noise, and incomplete modality coverage, a variety of computational methods for modality translation have emerged in recent years. Despite the development of translation models, there is still a lack of systematic benchmark evaluation in terms of datasets, evaluation ...

📄 Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03879v1
👥 Authors: Wei Ding, Yudong Zhang, Ruobing Xie (possible past Tencent (China) affiliation), Xingwu Sun (possible past Baidu (China) affiliation), Jiansheng Chen, Yu Wang (possible past Tsinghua University affiliation)
Abstract

As foundation models scale toward fusing more heterogeneous visual streams, understanding how diverse encoders interact under joint training becomes a prerequisite for principled design. Yet large vision-language models (LVLMs) currently lack the tools to do so, and parameter-efficient encoder configurations remain hard to identify before training. To re-examine encoder roles under joint training, on the 16-benchmark Cambrian-1 suite we retrain and evaluate all 31 non-empty subsets of five commo...

📄 SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03692v1
👥 Authors: Yuan Xiong, Ziqi Miao, Qian Chen (possible past Shanghai Jiao Tong University affiliation), Lijun Li, Yequan Wang (possible past Tsinghua University affiliation), Shizhu He, Jun Zhao, Kang Liu
Abstract

Recent AI agents can flexibly invoke skills to solve complex tasks, but their long-term improvement is fundamentally constrained by a lack of systematic skill construction, accumulation, and transfer. In particular, without a unified framework for skill consolidation, agents tend to redundantly construct similar capabilities across different tasks, are unable to effectively transform experience into reusable assets, and struggle to generalize task-specific skills to novel scenarios. To address t...

📄 From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03660v1
👥 Authors: Hongyu Guo, Hao Li (possible past Tsinghua University affiliation), He Cao, Gongbo Zhang, Li Yuan (possible past National University of Singapore affiliation)
Abstract

Large language models are increasingly used as chemistry assistants, yet most chemistry benchmarks still score only final answers. This masks a critical failure mode: a model may output the correct molecule, product, or option while its reasoning violates chemical logic. Existing process-level evaluators are hard to scale because LLM judges and human step-level process annotation are costly, inconsistent, and vulnerable to hallucination. We introduce ChemCoTBench-V2, a rule-verifiable diagnostic...

📄 Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03624v1
👥 Authors: Zhengyi Zhao, Shubo Zhang, Huimin Wang, Zezhong Wang, Yutian Zhao, Yefeng Zheng (possible past Tencent (China) affiliation), Binyang Li, Yulan He, Kam-Fai Wong, Xian Wu (possible past Tencent (China) affiliation)
Abstract

Large Reasoning Models (LRMs) have demonstrated impressive capabilities in many tasks, yet they struggle with reliably following multiple instructions, either by failing to satisfy individual constraints or by struggling to balance competing constraints simultaneously. We formalize this challenge as the Constraint Adherence Problem (CAP). This paper introduces a novel framework that addresses CAP by representing instructions as a structured knowledge graph of constraints. Our approach, Constrain...

📄 StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03467v1
👥 Authors: Taiyu Zhu, Yifan Wu (possible past Carnegie Mellon University affiliation), Weilin Jin, Ying Li (possible past Meta (United States) affiliation), Gang Huang
Abstract

LLM-based multi-agent systems exhibit remarkable collaborative capabilities in complex multi-step tasks. However, these systems are highly sensitive to single-step execution errors that can propagate through agent interactions and lead to cascading failures. To understand the causes of failure and improve system reliability, failure attribution has been introduced as a task that aims to automatically identify the root cause step responsible for a failure. Existing failure attribution methods mai...

📄 Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03382v1
👥 Authors: Bingxu Liu, Jiashun Liu, Johan Obando-Ceron, Hao Wang (possible past Tsinghua University affiliation), Runze Liu, Pablo Samuel Castro (possible past Google (United States) affiliation), Aaron Courville, Ling Pan
Abstract

While Proximal Policy Optimization (PPO) demonstrates strong performance in stationary settings, we show that its standard optimization paradigm struggles in continual and non-stationary environments. The failure does not stem from insufficient model capacity or overly restrictive clipping. Instead, PPO performs persistent, directionally inefficient local updates, which indicates a lack of geometry-aware guidance for accumulating meaningful behavioral change and ultimately hindering transitions ...

📄 LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03303v1
👥 Authors: Po-Nien Kung, Linfeng Song, Dawsen Hwang, Jinsung Yoon (possible past Google (United States) affiliation), Chun-Liang Li, Simone Severini, Mirek Olšák, Edward Lockhart (possible past Google (United States) affiliation), Quoc V Le, Burak Gokturk, Thang Luong (possible past Stanford University affiliation), Tomas Pfister (possible past University of Oxford affiliation), Nanyun Peng
Abstract

Large Language Models (LLMs) exhibit strong informal mathematical reasoning but struggle to generate mechanically verifiable proofs in formal languages like Lean. We present LEAP, an agentic framework that enables general-purpose foundation models to achieve state-of-the-art performance on automated formal theorem proving. LEAP leverages foundation model capabilities, such as informal reasoning, instruction following, and iterative self-refinement. By decomposing complex problems into smaller un...

📄 Solipsistic Superintelligence is Unlikely to be Cooperative
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03237v1
👥 Authors: Rakshit S Trivedi, Natasha Jaques (possible past University of California, Berkeley affiliation), Logan Cross, Alexander Sasha Vezhnevets (possible past Google (United States) affiliation), Joel Z Leibo
Abstract

AI's central challenge is shifting from capability to coexistence. The dominant paradigm in AI research focuses on developing powerful agents that treat the world as an exogenous and stationary source of feedback. We contend that superintelligence, an extremely capable task solver, born out of such a solipsistic approach to AI design, is unlikely to be cooperative. Deploying AI systems induces endogenous non-stationarity, resulting in a train-test-deploy gap where historical distributions diverg...

📄 Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03236v1
👥 Authors: Zhijie Ding, Weinan Hong, Zicheng Zhu, Lei Li (possible past Carnegie Mellon University affiliation), Dezhi Kong, Hao Wang (possible past Tsinghua University affiliation), Peng Zhou (possible past Tencent (China) affiliation), Xuchu Jiang, Jiaming Xu
Abstract

Multimodal large language models (MLLMs) have substantially advanced mobile agents, yet proactive mobile assistance remains challenging because agents must decide when to intervene before determining how to assist. Existing systems often implement these two decisions within a unified MLLM-based pipeline, leading to goal misalignment between conservative intervention filtering and comprehensive assistance generation, as well as redundant inference when the agent should remain silent...

📄 BotDirector: Robot Storytelling Across the Symmetrical Reality with Multi-modal Interactions
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03223v1
👥 Authors: Zhe Sun (possible past Tsinghua University affiliation), Meng Wang (possible past Google (United States) affiliation), Lei Wang (possible past Baidu (China) affiliation), Yuxi Wang, Wanxin Li, Yujia Peng, Zhenliang Zhang
Abstract

Robot storytelling offers a unique blend of technological innovation and creative expression that engages children in unprecedented ways. However, the technical aspects are often too complicated for children. We propose an interactive system that facilitates robot storytelling with tangible and natural language interactions. Children arrange the playground with their own stuff and create narratives with an LLM agent. The created narratives are transformed into a motion sequence based on the map ...

📄 MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03203v1
👥 Authors: Jia Yu, Zilong Wang, Xinyang Jiang (possible past Tencent (China) affiliation), Dongsheng Li, Shuo Wang (possible past Nvidia (United States) affiliation)
Abstract

Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmarks focus on general web or desktop tasks and underrepresent medical software, which requires domain knowledge, exhibits markedly different UI design from mainstream applications, lacks public testing environments, and demands safety validation beyond task completion. We introduce MedCUA-Bench, an interactive benchmark f...

📄 NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03159v1
👥 Authors: Nvidia, Aarti Basant, Amlan Kar (possible past University of Toronto affiliation), Despoina Paschalidou, Fangyin Wei, Francesco Ferroni, Guillermo Garcia Cobo, Haithem Turki, Huan Ling, Jaewoo Seo, James Lucas (possible past University of Toronto affiliation), Jay Zhangjie Wu, Jialiang Wang, Jonathan Lorraine, Jun Gao (possible past Nvidia (United States) affiliation), Kai He, Katarina Tothova, Kevin Xie, Michał Tyszkiewicz, Qi Wu, Riccardo De Lutio, Ruilong Li (possible past Tsinghua University affiliation), Sanja Fidler (possible past University of Toronto affiliation), Seung Wook Kim, Tianchang Shen, Tianshi Cao, Tobias Pfaff (possible past Google (United States) affiliation), William Lew, Xindi Wu, Xuanchi Ren, Yifan Lu, Yuxuan Zhang, Zan Gojcic, Zian Wang
Abstract

As autonomous vehicle capabilities advance, the safe evaluation of driving policies in long-tail scenarios remains a critical bottleneck. In closed-loop simulation, the driving policy model actively interacts with the environment, where its actions dynamically update the simulator state and directly influence the next set of generated sensor observations. While recent reconstruction-based neural simulators offer photorealism, they are fundamentally constrained by their initial captured data and ...

📄 Uncertainty-Aware Clarification in LLM Agents with Information Gain
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03135v1
👥 Authors: Mengyi Deng, Zhiwei Li, Xin Li (possible past Google (United States) affiliation), Tingyu Zhu, Ying Zhao (possible past Stanford University affiliation), Zhijiang Guo, Wei Wang (possible past University of Oxford affiliation)
Abstract

Large Language Model (LLM) agents often operate under underspecified user instructions, where latent uncertainty over user intent leads to erroneous tool actions. To address this challenge, we propose a goal-oriented clarification framework that aligns clarification behavior with ambiguity resolution. Central to our approach is the Information Gain Reward, a metric that quantifies the utility of clarification questions by measuring the Bayesian belief update towards the ground-truth goal induced...

📄 DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03103v1
👥 Authors: Wenkai Wang, Tao Xiong, Jingchen Ni, Yunpeng Bao, Xiyun Li, Tianqi Liu, Hongcan Guo, Zilong Huang (possible past Tencent (China) affiliation), Shengyu Zhang (possible past Tencent (China) affiliation)
Abstract

Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents proactively seek necessary information and users provide additional instructions, clarifications, feedback, or corrections as the task progresses. Yet existing desktop GUI benchmarks mostly reduce this setting to short, simplified tasks with all user instructions provided upfront. To address this issue, we introduce De...

📄 PhotoCraft: Agentic Reasoning with Hierarchical Self-Evolving Memory for Deep Image Search
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03099v1
👥 Authors: Kailin Lyu, Zhiqiang Yuan, Jianwei He, Qiwei Yan, Xuanbo Su, Nanxing Hu, Yang Liu (possible past Tsinghua University affiliation), Ce Hao, Shengqian Qin, Lianyu Hu, Jinchao Zhang (possible past Tencent (China) affiliation), Jie Zhou (possible past Tsinghua University affiliation)
Abstract

Deep Image Search requires multi-step reasoning over rich contextual cues, such as time, location, and event relations. However, most existing LLM-based agents are stateless and reactive, lacking persistent memory to maintain long-horizon context or transfer experience across tasks, which often leads to execution drift and experience isolation. To address these limitations, we propose PhotoCraft, a training-free, hierarchical memory system for photo-search agents. Inspired by human cognition, Ph...

📄 DELTAMEM: Incremental Experience Memory for LLM Agents via Residual Trees
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03083v1
👥 Authors: Haoran Tan, Zeyu Zhang, Zhicheng Cao, Rui Li (possible past Google (United States) affiliation), Xu Chen (possible past Tencent (China) affiliation)
Abstract

Large Language Model (LLM)-based agents increasingly rely on memory to learn from experiences over continual interactions. However, storing experiences as independent, flat units leads to substantial redundancy and retrieval conflicts, as similar episodes repeat overlapping content and subtle scene variations cause retrieved memories to offer contradictory guidance. To address this, we introduce residual experience, positing that newly acquired experience is often an incremental variation of exi...

📄 ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03054v1
👥 Authors: Anjie Liu, Yan Song (possible past Tencent (China) affiliation), Zhixun Chen, Ziqin Gong, Zhongwei Yu, Jun Wang (possible past Tencent (China) affiliation)
Abstract

Tool-augmented vision-language agents can acquire external perceptual evidence through OCR, detection, segmentation, and other tools, but executing every proposed tool call is costly and sometimes unnecessary. We study the pre-call control problem: after a ReAct-style VLM agent proposes a perceptual tool call, should the call be executed, or skipped before its output enters the context? Across five benchmarks, we find that the baseline agent exhibits poor local selectivity: helpful and harmful c...

📄 Pretraining Language Models on Historical Text
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.02991v1
👥 Authors: Xiaoxi Luo, Zachary Shinnick, Niclas Griesshaber, Yixuan Wang, Junchi Yu, Freda Shi, Philip Torr (possible past University of Oxford affiliation), Yao Lu (possible past Google (United States) affiliation)
Abstract

We introduce TypewriterLM, a 7.24B History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive da...

📄 Neuron Populations Exhibit Divergent Selectivity with Scale
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03990v1
👥 Authors: Amil Dravid, Yasaman Bahri (possible past Google (United States) affiliation), Alexei A. Efros (possible past University of California, Berkeley affiliation), Yossi Gandelsman (possible past Google (United States) affiliation)
Abstract

We investigate whether neuron populations within neural networks evolve predictably with scale, extending scaling laws beyond macroscopic observables such as loss. To probe this question, we study Rosetta Neurons, a previously characterized class of neurons whose activation patterns are similar across independently trained models (Dravid et al., 2023). In separate analyses of language models up to 30B parameters and vision models up to 5B parameters, we observe that the population of Rosetta Neu...

📄 Value-Aware Stochastic KV Cache Eviction for Reasoning Models
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03928v1
👥 Authors: Ting-Yun Chang, Harvey Yiyun Fu, Deqing Fu, Chenghao Yang, Jesse Thomason (possible past University of Washington affiliation), Robin Jia (possible past Stanford University affiliation)
Abstract

Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache, yet they often yield worse accuracy than selection-based sparse attention alternatives, which keep the full KV cache. We identify key factors crucial to KV cache eviction accuracy. First, a small fraction of value states have abnormally large magnitudes, and evicting ...

📄 Text-attributed Graph Condensation via Text Selection and Attribute Matching
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03839v1
👥 Authors: Haowei Han, Yuxiang Wang, Guojia Wan, Hao Wang (possible past Tsinghua University affiliation), Shanshan Feng, Hao Huang, Jiawei Jiang (possible past Tencent (China) affiliation), Xiao Yan
Abstract

Text-Attributed Graph (TAG) is an important type of graph structured data, where each node has a text description. TAG models usually train a Graph Neural Network (GNN) and language model jointly, which leads to high space and time consumption, especially on large datasets. To mitigate this, we propose TAGSAM, a condensation method that compresses TAGs while preserving training accuracy. TAGSAM comes with two key designs, i.e., subgraph text Selection and Attribute similarity Matching, which com...

📄 SketchSong: Hierarchical Song Generation with Sketch Planning and Fine-Grained Multi-Track Modeling
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03169v1
👥 Authors: Xiaoyue Duan, Nanxing Hu, Yutang Feng, Xudong Yan, Jiatao Chen, Jinchao Zhang (possible past Tencent (China) affiliation), Jie Zhou (possible past Tsinghua University affiliation)
Abstract

Recent song generation systems can synthesize realistic audio, yet generating complete songs remains challenging for two reasons. First, explicit song-level arrangement planning remains limited in existing methods, so models often need to organize overall arrangement development while generating low-level audio details. This often leads to incoherence in arrangements, such as weak section transitions and limited dynamic progression. Second, coarse modeling of different musical parts obscures the...

📄 FederatedSkill: Federated Learning for Agentic Skill Evolution
🗓️ Published: 6/2/2026
🔗 http://arxiv.org/abs/2606.03143v1
👥 Authors: Jingbo Yang, Guanyu Yao, Yang Zhang (possible past Tsinghua University affiliation), Ramana Rao Kompella, Gaowen Liu, Shiyu Chang (possible past Tencent (China) affiliation)
Abstract

Modern LLM agents increasingly rely on skill libraries to handle complex tasks, making skill evolution a primary driver of self-improvement. However, isolated single-user task streams lack the diversity required to build comprehensive skills. While cross-user collaboration can overcome this data bottleneck, current trajectory-sharing approaches compromise user privacy and impose a uniform global library that fails to accommodate client heterogeneity. We introduce FederatedSkill, a privacy-preser...

📄 KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators
🗓️ Published: 6/1/2026
🔗 http://arxiv.org/abs/2606.02963v1
👥 Authors: Taras Sereda, Burak Bartan, Ankita Nayak (possible past Stanford University affiliation), Tom St. John (possible past Google (United States) affiliation), Natalie Serrino, Zain Asgar
Abstract

Production inference increasingly targets a heterogeneous mix of accelerators. Agentic pipelines interleave reasoning, tool calls, and multi-agent coordination, each with distinct compute and memory profiles. For optimal efficiency, each stage should run on the accelerator best suited to it. This creates a systems challenge: each pipeline now requires high-performance kernels across a growing set of hardware backends and programming models. Writing these kernels by hand is time-consuming, demand...

📄 Cosmos 3: Omnimodal World Models for Physical AI
🗓️ Published: 6/1/2026
🔗 http://arxiv.org/abs/2606.02800v1
👥 Authors: Aditi, Niket Agarwal (possible past Nvidia (United States) affiliation), Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao (possible past Nvidia (United States) affiliation), Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen (possible past Tencent (China) affiliation), Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen (possible past Meta (United States) affiliation), An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui (possible past Google (United States) affiliation), Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du (possible past Massachusetts Institute of Technology affiliation), Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler (possible past University of Toronto affiliation), Xiao Fu, Vikram Fugro, Prashant Gaikwad, Tj Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju (possible past Google (United States) affiliation), Jinwei Gu (possible past Shanghai Artificial Intelligence Laboratory affiliation), Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han (possible past Stanford University affiliation), Ankur Handa (possible past Nvidia (United States) affiliation), Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He (possible past Nvidia (United States) affiliation), Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, Delesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz (possible past Nvidia (United States) affiliation), Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, Jf Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li (possible past Baidu (China) affiliation), Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin (possible past Nvidia (United States) affiliation), Tsung-Yi Lin (possible past Nvidia (United States) affiliation), Ming-Yu Liu (possible past Nvidia (United States) affiliation), Sifei Liu (possible past Nvidia (United States) affiliation), Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo (possible past Stanford University affiliation), Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, David Page, Yatian Pang (possible past National University of Singapore affiliation), Sehwi Park, Mahesh Patekar, Mostofa Patwary (possible past Nvidia (United States) affiliation), Marco Pavone (possible past Stanford University affiliation), Trung Pham, Wei Ping (possible past Baidu (China) affiliation), Soha Pouya, Shrimai Prabhumoye (possible past Carnegie Mellon University affiliation), Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi (possible past Nvidia (United States) affiliation), Mateusz Sieniawski, Shuran Song (possible past Google (United States) affiliation), Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang (possible past University of California, Berkeley affiliation), Yan Wang (possible past Tencent (China) affiliation), Yu Wang (possible past Tsinghua University affiliation), David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu (possible past Nvidia (United States) affiliation), Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang (possible past Nvidia (United States) affiliation), Hans Yang, Xiaodong Yang (possible past Nvidia (United States) affiliation), Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu (possible past Nvidia (United States) affiliation), Hao Yuan, Simon Yuen, Xiaohui Zeng (possible past University of Toronto affiliation), Pengcuo Zeren, Cindy Zha, Haotian Zhang (possible past Stanford University affiliation), Jenny Zhang, Jing Zhang (possible past University of Washington affiliation), Liangkai Zhang (possible past Google (United States) affiliation), Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu (possible past Stanford University affiliation), Dima Zhylko, Artur Zolkowski
Abstract

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosm...

*Notable papers are those with at least two authors from a "big" AI/ML lab.