πŸ“„ Notable* Recent AI/ML arXiv Papers

πŸ“„ Demystifying Video Reasoning
πŸ—“οΈ Published: 3/17/2026
πŸ”— http://arxiv.org/abs/2603.16870v1
πŸ‘₯ Authors: Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li (possible past Tencent (China) affiliation), Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang (possible past Google (United States) affiliation)
Abstract

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative a...

πŸ“„ ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K
πŸ—“οΈ Published: 3/17/2026
πŸ”— http://arxiv.org/abs/2603.16866v1
πŸ‘₯ Authors: Kaixuan Wang, Tianxing Chen, Jiawei Liu, Honghao Su, Shaolong Zhu, Minxuan Wang, Zixuan Li, Yue Chen (possible past Google (United States) affiliation), Huan-Ang Gao, Yusen Qin, Jiawei Wang, Qixuan Zhang, Lan Xu, Jingyi Yu, Yao Mu, Ping Luo (possible past Shanghai Artificial Intelligence Laboratory affiliation)
Abstract

Learning in simulation provides a useful foundation for scaling robotic manipulation capabilities. However, this paradigm often suffers from a lack of data-generation-ready digital assets, in both scale and diversity. In this work, we present ManiTwin, an automated and efficient pipeline for generating data-generation-ready digital object twins. Our pipeline transforms a single image into a simulation-ready, semantically annotated 3D asset, enabling large-scale robotic manipulation data generat...

πŸ“„ SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation
πŸ—“οΈ Published: 3/17/2026
πŸ”— http://arxiv.org/abs/2603.16864v1
πŸ‘₯ Authors: Jiongze Yu, Xiangbo Gao, Pooja Verlani, Akshay Gadde, Yilin Wang (possible past Google (United States) affiliation), Balu Adsumilli (possible past Google (United States) affiliation), Zhengzhong Tu (possible past Google (United States) affiliation)
Abstract

Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) estimates, yet most existing VSR approaches behave like black boxes at inference time: users cannot reliably correct unexpected artifacts, but instead can only accept whatever the model produces. In this paper, we propose a novel interactive VSR framework dubbed SparkVSR that makes sparse keyframes a simple and expressive control signal. Specifically, users can first super-resolve or optionally a smal...

πŸ“„ SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models
πŸ—“οΈ Published: 3/17/2026
πŸ”— http://arxiv.org/abs/2603.16859v1
πŸ‘₯ Authors: Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang (possible past Google (United States) affiliation), Wang Chen, Yuhui Zeng, Ruize Fang, Yixuan Zou, Xiawu Zheng, Jiebo Luo, Rongrong Ji (possible past Tencent (China) affiliation)
Abstract

Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity: the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimension...

πŸ“„ SOMA: Unifying Parametric Human Body Models
πŸ—“οΈ Published: 3/17/2026
πŸ”— http://arxiv.org/abs/2603.16858v1
πŸ‘₯ Authors: Jun Saito, Jiefeng Li, Michael De Ruyter, Miguel Guerrero, Edy Lim, Ehsan Hassani, Roger Blanco Ribera, Hyejin Moon, Magdalena Dadela, Marco Di Lucca, Qiao Wang, Xueting Li, Jan Kautz (possible past Nvidia (United States) affiliation), Simon Yuen, Umar Iqbal (possible past Nvidia (United States) affiliation)
Abstract

Parametric human body models are foundational to human reconstruction, animation, and simulation, yet they remain mutually incompatible: SMPL, SMPL-X, MHR, Anny, and related models each diverge in mesh topology, skeletal structure, shape parameterization, and unit convention, making it impractical to exploit their complementary strengths within a single pipeline. We present SOMA, a unified body layer that bridges these heterogeneous representations through three abstraction layers. Mesh topology...

πŸ“„ Surg$\Sigma$: A Spectrum of Large-Scale Multimodal Data and Foundation Models for Surgical Intelligence
πŸ—“οΈ Published: 3/17/2026
πŸ”— http://arxiv.org/abs/2603.16822v1
πŸ‘₯ Authors: Zhitao Zeng, Mengya Xu, Jian Jiang, Pengfei Guo, Yunqiu Xu, Zhu Zhuo, Chang Han Low, Yufan He (possible past Nvidia (United States) affiliation), Dong Yang (possible past Nvidia (United States) affiliation), Chenxi Lin, Yiming Gu, Jiaxin Guo, Yutong Ban, Daguang Xu (possible past Nvidia (United States) affiliation), Qi Dou, Yueming Jin
Abstract

Surgical intelligence has the potential to improve the safety and consistency of surgical care, yet most existing surgical AI frameworks remain task-specific and struggle to generalize across procedures and institutions. Although multimodal foundation models, particularly multimodal large language models, have demonstrated strong cross-task capabilities across various medical domains, their advancement in surgery remains constrained by the lack of large-scale, systematically curated multimodal d...

πŸ“„ InCoder-32B: Code Foundation Model for Industrial Scenarios
πŸ—“οΈ Published: 3/17/2026
πŸ”— http://arxiv.org/abs/2603.16790v1
πŸ‘₯ Authors: Jian Yang, Wei Zhang (possible past Tsinghua University affiliation), Jiajun Wu (possible past Massachusetts Institute Of Technology affiliation), Junhang Cheng, Shawn Guo, Haowen Wang, Weicheng Gu, Yaxin Du, Joseph Li, Fanglin Xu, Yizhi Li, Lin Jing, Yuanbo Wang, Yuhan Gao, Ruihao Gong, Chuan Hao, Ran Tao, Aishan Liu, Tuney Zheng, Ganqu Cui (possible past Tsinghua University affiliation), Zhoujun Li, Mingjie Tang, Chenghua Lin, Wayne Xin Zhao (possible past Baidu (China) affiliation), Xianglong Liu, Ming Zhou, Bryan Dai, Weifeng Lv
Abstract

Recent code large language models have achieved remarkable progress on general programming tasks. Nevertheless, their performance degrades significantly in industrial scenarios that require reasoning about hardware semantics, specialized language constructs, and strict resource constraints. To address these challenges, we introduce InCoder-32B (Industrial-Coder-32B), the first 32B-parameter code foundation model unifying code intelligence across chip design, GPU kernel optimization, embedded sys...

πŸ“„ TurnWise: The Gap between Single- and Multi-turn Language Model Capabilities
πŸ—“οΈ Published: 3/17/2026
πŸ”— http://arxiv.org/abs/2603.16759v1
πŸ‘₯ Authors: Victoria Graf, Valentina Pyatkin, Nouha Dziri, Nathan Lambert (possible past University Of California, Berkeley affiliation), Hannaneh Hajishirzi (possible past University Of Washington affiliation)
Abstract

Multi-turn conversations are a common and critical mode of language model interaction. However, current open training and evaluation data focus on single-turn settings, failing to capture the additional dimension of these longer interactions. To understand this multi-/single-turn gap, we first introduce a new benchmark, TurnWiseEval, for multi-turn capabilities that is directly comparable to single-turn chat evaluation. Our evaluation isolates multi-turn specific conversational ability through p...

πŸ“„ IQuest-Coder-V1 Technical Report
πŸ—“οΈ Published: 3/17/2026
πŸ”— http://arxiv.org/abs/2603.16733v1
πŸ‘₯ Authors: Jian Yang, Wei Zhang (possible past Tsinghua University affiliation), Shawn Guo, Zhengmao Ye, Lin Jing, Shark Liu, Yizhi Li, Jiajun Wu (possible past Massachusetts Institute Of Technology affiliation), Cening Liu, X. Ma, Yuyang Song, Siwei Wu, Yuwen Li, L. Liao, T. Zheng, Ziling Huang, Zelong Huang, Che Liu, Yan Xing, Renyuan Li, Qingsong Cai, Hanxu Yan, Siyue Wang, Shikai Li, Jason Klein Liu, An Huang, Yongsheng Kang, Jinxing Zhang, Chuan Hao, Haowen Wang, Weicheng Gu, Ran Tao, Mingjie Tang, Peihao Wu, Jianzhou Wang, Xianglong Liu, Weifeng Lv, Bryan Dai
Abstract

In this report, we introduce the IQuest-Coder-V1 series (7B/14B/40B/40B-Loop), a new family of code large language models (LLMs). Moving beyond static code representations, we propose the code-flow multi-stage training paradigm, which captures the dynamic evolution of software logic through different phases of the pipeline. Our models are developed through an evolutionary pipeline, starting with the initial pre-training consisting of code facts, repository, and completion data. Following that, ...

πŸ“„ When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making
πŸ—“οΈ Published: 3/17/2026
πŸ”— http://arxiv.org/abs/2603.16673v1
πŸ‘₯ Authors: Jun Liu (possible past Tencent (China) affiliation), Pu Zhao, Zhenglun Kong, Xuan Shen, Peiyan Dong, Fan Yang (possible past Tencent (China) affiliation), Lin Cui, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Gaowen Liu, Yanzhi Wang, Dong Huang
Abstract

Embodied robotic systems increasingly rely on large language model (LLM)-based agents to support high-level reasoning, planning, and decision-making during interactions with the environment. However, invoking LLM reasoning introduces substantial computational latency and resource overhead, which can interrupt action execution and reduce system reliability. Excessive reasoning may delay actions, while insufficient reasoning often leads to incorrect decisions and task failures. This raises a funda...

πŸ“„ Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation
πŸ—“οΈ Published: 3/17/2026
πŸ”— http://arxiv.org/abs/2603.16664v1
πŸ‘₯ Authors: Jiawei Mao, Hardy Chen, Haoqin Tu, Yuhan Wang (possible past Tencent (China) affiliation), Letian Zhang, Zeyu Zheng, Huaxiu Yao, Zirui Wang, Cihang Xie (possible past Google (United States) affiliation), Yuyin Zhou
Abstract

Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly limits their deployment. As training these LVLMs to avoid hallucinations becomes prohibitively expensive for larger models, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework...

πŸ“„ RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments
πŸ—“οΈ Published: 3/17/2026
πŸ”— http://arxiv.org/abs/2603.16453v1
πŸ‘₯ Authors: Linghua Zhang, Jun Wang (possible past Tencent (China) affiliation), Jingtong Wu, Zhisong Zhang (possible past Shanghai Jiao Tong University affiliation)
Abstract

Large Language Model (LLM)-based agents have achieved notable success on short-horizon and highly structured tasks. However, their ability to maintain coherent decision-making over long horizons in realistic and dynamic environments remains an open challenge. We introduce RetailBench, a high-fidelity benchmark designed to evaluate long-horizon autonomous decision-making in realistic commercial scenarios, where agents must operate under stochastic demand and evolving external conditions. We f...

πŸ“„ Fanar 2.0: Arabic Generative AI Stack
πŸ—“οΈ Published: 3/17/2026
πŸ”— http://arxiv.org/abs/2603.16397v1
πŸ‘₯ Authors: Fanar Team, Ummar Abbas, Mohammad Shahmeer Ahmad, Minhaj Ahmad, Abdulaziz Al-Homaid, Anas Al-Nuaimi, Enes Altinisik, Ehsaneddin Asgari, Sanjay Chawla, Shammur Chowdhury, Fahim Dalvi, Kareem Darwish, Nadir Durrani, Mohamed Elfeky (possible past Google (United States) affiliation), Ahmed Elmagarmid, Mohamed Eltabakh, Asim Ersoy, Masoomali Fatehkia, Mohammed Qusay Hashim, Majd Hawasly, Mohamed Hefeeda, Mus'ab Husaini, Keivin Isufaj, Soon-Gyo Jung, Houssam Lachemat, Ji Kim Lucas, Abubakr Mohamed, Tasnim Mohiuddin, Basel Mousi, Hamdy Mubarak (possible past Tencent (China) affiliation), Ahmad Musleh, Mourad Ouzzani, Amin Sadeghi, Husrev Taha Sencar, Mohammed Shinoy, Omar Sinan, Yifan Zhang
Abstract

We present Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, with Arabic having only ~0.5% of web data despite 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of...

πŸ“„ VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents
πŸ—“οΈ Published: 3/17/2026
πŸ”— http://arxiv.org/abs/2603.16289v1
πŸ‘₯ Authors: Zhengbo Zhang, Jinbo Su, Zhaowen Zhou, Changtao Miao, Yuhan Hong, Qimeng Wu, Yumeng Liu, Feier Wu, Yihe Tian, Yuhao Liang, Zitong Shan, Wanke Xia, Yi-Fan Zhang, Bo Zhang (possible past Tencent (China) affiliation), Zhe Li (possible past Google (United States) affiliation), Shiming Xiang, Ying Yan
Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in the real world. However, existing benchmarks suffer from two limitations: insufficient evaluation of visual reasoning ability and neglect of web pages' native visual information in reasoning chains. To address these challenges, we introduce a new benchmark for visual-native search, VisBrowse-Bench. It contains 169 VQA instances covering multiple ...

πŸ“„ DyJR: Preserving Diversity in Reinforcement Learning with Verifiable Rewards via Dynamic Jensen-Shannon Replay
πŸ—“οΈ Published: 3/17/2026
πŸ”— http://arxiv.org/abs/2603.16157v1
πŸ‘₯ Authors: Long Li, Zhijian Zhou, Tianyi Wang, Weidi Xu, Zuming Huang (possible past Baidu (China) affiliation), Wei Chu, Zhe Wang (possible past Deepmind (United Kingdom) affiliation), Shirui Pan, Chao Qu, Yuan Qi
Abstract

While Reinforcement Learning (RL) enhances Large Language Model reasoning, on-policy algorithms like GRPO are sample-inefficient as they discard past rollouts. Existing experience replay methods address this by reusing accurate samples for direct policy updates, but this often incurs high computational costs and causes mode collapse via overfitting. We argue that historical data should prioritize sustaining diversity rather than simply reinforcing accuracy. To this end, we propose Dynamic Jensen...
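
For background, a minimal sketch of the Jensen-Shannon divergence the method is named after; how DyJR actually applies it to replayed rollouts is not specified in the truncated abstract above, and the helper functions below are illustrative only.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by log 2, defined via the mixture M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(js([0.5, 0.5], [0.5, 0.5]))  # 0.0 for identical distributions
```

Unlike plain KL, JS stays finite even when the two distributions have disjoint support, which makes it a natural candidate for comparing a current policy against older replayed ones.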

πŸ“„ Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning
πŸ—“οΈ Published: 3/16/2026
πŸ”— http://arxiv.org/abs/2603.15981v1
πŸ‘₯ Authors: Jingxiang Chen, Minseok Kim, Seong-Gyun Leem, Yin Huang, Rashi Rungta, Zhicheng Ouyang, Haibin Wu, Surya Teja Appini, Ankur Bansal, Yang Bai, Yue Liu, Florian Metze (possible past Carnegie Mellon University affiliation), Ahmed A Aly, Anuj Kumar (possible past Meta (United States) affiliation), Ariya Rastrow, Zhaojiang Lin
Abstract

Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds, which are crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware s...

πŸ“„ MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale
πŸ—“οΈ Published: 3/16/2026
πŸ”— http://arxiv.org/abs/2603.15954v1
πŸ‘₯ Authors: Hanxian Huang, Igor Fedorov, Andrey Gromov, Bernard Beckerman, Naveen Suda, David Eriksson, Maximilian Balandat, Rylan Conway, Patrick Huber, Chinnadhurai Sankar (possible past Google (United States) affiliation), Ayushi Dalmia, Zechun Liu, Lemeng Wu, Tarek Elgamal, Adithya Sagar, Vikas Chandra (possible past Meta (United States) affiliation), Raghuraman Krishnamoorthi
Abstract

Real-time AI experiences call for on-device large language models (OD-LLMs) optimized for efficient deployment on resource-constrained hardware. The most useful OD-LLMs produce near-real-time responses and exhibit broad hardware compatibility, maximizing user reach. We present a methodology for designing such models using hardware-in-the-loop architecture search under mobile latency constraints. This system is amenable to industry-scale deployment: it generates models deployable without custom k...
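
The latency-constrained search described above can be illustrated with a toy sketch: keep only candidate configurations whose measured latency fits the budget, then pick the best-scoring survivor. The candidate fields, budget, and quality scores below are invented for illustration and are not MobileLLM-Flash's actual search space.

```python
# Hypothetical candidates from a hardware-in-the-loop sweep: each carries an
# on-device latency measurement and a quality score (both made up here).
candidates = [
    {"layers": 12, "dim": 512, "latency_ms": 38.0, "quality": 0.61},
    {"layers": 24, "dim": 768, "latency_ms": 91.0, "quality": 0.70},
    {"layers": 16, "dim": 640, "latency_ms": 49.0, "quality": 0.66},
]

BUDGET_MS = 50.0  # mobile latency constraint

# Filter to feasible configs, then maximize quality among them.
feasible = [c for c in candidates if c["latency_ms"] <= BUDGET_MS]
best = max(feasible, key=lambda c: c["quality"])
print(best["layers"], best["dim"])  # 16 640
```

The point of measuring latency in the loop rather than estimating it is that the feasibility filter reflects real hardware, so the highest-quality model is chosen only among configurations that actually meet the deployment budget.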

πŸ“„ AsgardBench - Evaluating Visually Grounded Interactive Planning Under Minimal Feedback
πŸ—“οΈ Published: 3/16/2026
πŸ”— http://arxiv.org/abs/2603.15888v1
πŸ‘₯ Authors: Andrea Tupini, Lars Liden, Reuben Tan, Yu Wang (possible past Tsinghua University affiliation), Jianfeng Gao (possible past Microsoft (United States) affiliation)
Abstract

With AsgardBench we aim to evaluate visually grounded, high-level action sequence generation and interactive planning, focusing specifically on plan adaptation during execution based on visual observations rather than navigation or low-level manipulation. In the landscape of embodied AI benchmarks, AsgardBench targets the capability category of interactive planning, which is more sophisticated than offline high-level planning as it requires agents to revise plans in response to environmental fee...

πŸ“„ CUBE: A Standard for Unifying Agent Benchmarks
πŸ—“οΈ Published: 3/16/2026
πŸ”— http://arxiv.org/abs/2603.15798v1
πŸ‘₯ Authors: Alexandre Lacoste (possible past Google (United States) affiliation), Nicolas Gontier, Oleh Shliazhko, Aman Jaiswal, Kusha Sareen, Shailesh Nanisetty, Joan Cabezas, Manuel Del Verme, Omar G. Younis, Simone Baratta, Matteo Avalle, Imene Kerboua, Xing Han Lù, Elron Bandel, Michal Shmueli-Scheuer, Asaf Yehudai, Leshem Choshen, Jonathan Lebensold, Sean Hughes, Massimo Caccia, Alexandre Drouin, Siva Reddy (possible past University Of Edinburgh affiliation), Tao Yu (possible past University Of Washington affiliation), Yu Su, Graham Neubig (possible past Carnegie Mellon University affiliation), Dawn Song (possible past University Of California, Berkeley affiliation)
Abstract

The proliferation of agent benchmarks has created critical fragmentation that threatens research productivity. Each new benchmark requires substantial custom integration, creating an "integration tax" that limits comprehensive evaluation. We propose CUBE (Common Unified Benchmark Environments), a universal protocol standard built on MCP and Gym that allows benchmarks to be wrapped once and used everywhere. By separating task, benchmark, package, and registry concerns into distinct API layers, CU...
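
The "wrapped once, used everywhere" idea can be sketched as a Gym-style adapter around an arbitrary benchmark task. Every name below is a hypothetical illustration of the pattern, not CUBE's actual API.

```python
class EchoTask:
    """A toy benchmark task: the agent must repeat the prompt verbatim."""
    def __init__(self, prompt: str):
        self.prompt = prompt

    def evaluate(self, answer: str) -> float:
        return 1.0 if answer == self.prompt else 0.0


class GymStyleWrapper:
    """Adapts any task exposing evaluate() to a uniform reset()/step() loop,
    so one agent harness can drive many otherwise-incompatible benchmarks."""
    def __init__(self, task):
        self.task = task
        self.done = False

    def reset(self):
        self.done = False
        return self.task.prompt  # initial observation

    def step(self, action: str):
        reward = self.task.evaluate(action)
        self.done = True  # single-step episode in this toy example
        return None, reward, self.done, {}


env = GymStyleWrapper(EchoTask("hello"))
obs = env.reset()
_, reward, done, _ = env.step(obs)  # a trivially correct agent echoes the prompt
print(reward, done)  # 1.0 True
```

Once a benchmark is behind such an interface, the "integration tax" is paid once by the wrapper author rather than once per evaluation harness.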

πŸ“„ OMNIFLOW: A Physics-Grounded Multimodal Agent for Generalized Scientific Reasoning
πŸ—“οΈ Published: 3/16/2026
πŸ”— http://arxiv.org/abs/2603.15797v1
πŸ‘₯ Authors: Hao Wu (possible past Tencent (China) affiliation), Yongheng Zhang, Yuan Gao (possible past Tencent (China) affiliation), Fan Xu, Fan Zhang, Ruobing Xie (possible past Tencent (China) affiliation), Ruijian Gou, Yuxuan Liang, Xiaomeng Huang, Xian Wu (possible past Tencent (China) affiliation)
Abstract

Large Language Models (LLMs) have demonstrated exceptional logical reasoning capabilities but frequently struggle with the continuous spatiotemporal dynamics governed by Partial Differential Equations (PDEs), often resulting in non-physical hallucinations. Existing approaches typically resort to costly, domain-specific fine-tuning, which severely limits cross-domain generalization and interpretability. To bridge this gap, we propose OMNIFLOW, a neuro-symbolic architecture designed to ground froz...

πŸ“„ Deep Tabular Representation Corrector
πŸ—“οΈ Published: 3/17/2026
πŸ”— http://arxiv.org/abs/2603.16569v1
πŸ‘₯ Authors: Hangting Ye, Peng Wang (possible past Peking University affiliation), Wei Fan (possible past Tencent (China) affiliation), Xiaozhuang Song, He Zhao (possible past Tencent (China) affiliation), Dandan Gun, Yi Chang
Abstract

Tabular data play an important role in diverse real-world fields such as healthcare, engineering, and finance. The recent success of deep learning has fostered many deep-network-based (e.g., Transformer, ResNet) tabular learning methods. Generally, existing deep tabular machine learning methods follow one of two paradigms, i.e., in-learning and pre-learning. In-learning methods need to train networks from scratch or impose extra constraints to regulate the representati...

πŸ“„ The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data
πŸ—“οΈ Published: 3/17/2026
πŸ”— http://arxiv.org/abs/2603.16177v1
πŸ‘₯ Authors: Christina Baek, Ricardo Pio Monti, David Schwab, Amro Abbas, Rishabh Adiga, Cody Blakeney, Maximilian Böther, Paul Burstein, Aldo Gael Carranza, Alvin Deng, Parth Doshi, Vineeth Dorna, Alex Fang, Tony Jiang, Siddharth Joshi, Brett W. Larsen, Jason Chan Lee, Katherine L. Mentzer, Luke Merrick, Haakon Mongstad, Fan Pan, Anshuman Suri, Darren Teh, Jason Telanoff, Jack Urbanek (possible past Meta (United States) affiliation), Zhengping Wang, Josh Wills, Haoli Yin, Aditi Raghunathan, J. Zico Kolter (possible past Carnegie Mellon University affiliation), Bogdan Gaza, Ari Morcos, Matthew Leavitt, Pratyush Maini
Abstract

Real-world model deployments demand strong performance on narrow domains where data is often scarce. Typically, practitioners finetune models to specialize them, but this risks overfitting to the domain and forgetting general knowledge. We study a simple strategy, specialized pretraining (SPT), where a small domain dataset, typically reserved for finetuning, is repeated starting from pretraining as a fraction of the total tokens. Across three specialized domains (ChemPile, MusicPile, and ProofPi...
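
The token-budget arithmetic behind repeating a small domain dataset "as a fraction of the total tokens" can be sketched as follows; the helper name and the numbers are illustrative assumptions, not the paper's recipe.

```python
def spt_repeats(total_tokens: int, domain_tokens: int, fraction: float) -> float:
    """How many passes over the domain set yield the target share of
    pretraining tokens: (total * fraction) / domain size."""
    return (total_tokens * fraction) / domain_tokens

# e.g., a 1B-token domain set held at 2% of a 500B-token pretraining run
print(spt_repeats(500_000_000_000, 1_000_000_000, 0.02))  # 10.0 passes
```

The trade-off the paper studies falls out of this arithmetic: a larger fraction means more repeats of scarce domain data during pretraining, risking memorization, while a smaller fraction may leave the domain underrepresented relative to plain finetuning.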

πŸ“„ Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation
πŸ—“οΈ Published: 3/16/2026
πŸ”— http://arxiv.org/abs/2603.15759v1
πŸ‘₯ Authors: Jacob Levy, Tyler Westenbroek, Kevin Huang, Fernando Palafox, Patrick Yin, Shayegan Omidshafiei (possible past Deepmind (United Kingdom) affiliation), Dong-Ki Kim, Abhishek Gupta (possible past University Of California, Berkeley affiliation), David Fridovich-Keil
Abstract

Simulation-to-real transfer remains a central challenge in robotics, as mismatches between simulated and real-world dynamics often lead to failures. While reinforcement learning offers a principled mechanism for adaptation, existing sim-to-real finetuning methods struggle with exploration and long-horizon credit assignment in the low-data regimes typical of real-world robotics. We introduce Simulation Distillation (SimDist), a sim-to-real framework that distills structural priors from a simulato...

πŸ“„ MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification
πŸ—“οΈ Published: 3/16/2026
πŸ”— http://arxiv.org/abs/2603.15726v1
πŸ‘₯ Authors: Miromind Team, S. Bai, L. Bing, L. Lei, R. Li, X. Li, X. Lin, E. Min, L. Su, B. Wang, L. Wang, L. Wang, S. Wang, X. Wang, Y. Zhang, Z. Zhang, G. Chen, L. Chen (possible past Google (United States) affiliation), Z. Cheng, Y. Deng, Z. Huang, D. Ng, J. Ni, Q. Ren, X. Tang, B. L. Wang, H. Wang (possible past Google (United States) affiliation), N. Wang, C. Wei, Q. Wu, J. Xia, Y. Xiao, H. Xu, X. Xu, C. Xue, Z. Yang, Z. Yang, F. Ye, H. Ye, J. Yu, C. Zhang, W. Zhang, H. Zhao, P. Zhu
Abstract

We present MiroThinker-1.7, a new research agent designed for complex long-horizon reasoning tasks. Building on this foundation, we further introduce MiroThinker-H1, which extends the agent with heavy-duty reasoning capabilities for more reliable multi-step problem solving. In particular, MiroThinker-1.7 improves the reliability of each interaction step through an agentic mid-training stage that emphasizes structured planning, contextual reasoning, and tool interaction. This enables more effecti...

πŸ“„ The PokeAgent Challenge: Competitive and Long-Context Learning at Scale
πŸ—“οΈ Published: 3/16/2026
πŸ”— http://arxiv.org/abs/2603.15563v2
πŸ‘₯ Authors: Seth Karten, Jake Grigsby, Tersoo Upaa, Junik Bae, Seonghun Hong, Hyunyoung Jeong, Jaeyoon Jung, Kun Kerdthaisong, Gyungbo Kim, Hyeokgi Kim, Yujin Kim, Eunju Kwon, Dongyu Liu, Patrick Mariglia, Sangyeon Park, Benedikt Schink, Xianwei Shi, Anthony Sistilli, Joseph Twin, Arian Urdu, Matin Urdu, Qiao Wang, Ling Wu, Wenli Zhang (possible past Baidu (China) affiliation), Kunsheng Zhou, Stephanie Milani, Kiran Vodrahalli, Amy Zhang (possible past University Of California, Berkeley affiliation), Fei Fang, Yuke Zhu (possible past Stanford University affiliation), Chi Jin
Abstract

We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategi...

πŸ“„ FuXiWeather2: Learning accurate atmospheric state estimation for operational global weather forecasting
πŸ—“οΈ Published: 3/16/2026
πŸ”— http://arxiv.org/abs/2603.15358v1
πŸ‘₯ Authors: Xiaoze Xu, Xiuyu Sun, Songling Zhu, Xiaohui Zhong, Yuanqing Huang, Zijian Zhu, Jun Liu (possible past Tencent (China) affiliation), Hao Li (possible past Tsinghua University affiliation)
Abstract

Numerical weather prediction has long been constrained by the computational bottlenecks inherent in data assimilation and numerical modeling. While machine learning has accelerated forecasting, existing models largely serve as "emulators of reanalysis products," thereby retaining their systematic biases and operational latencies. Here, we present FuXiWeather2, a unified end-to-end neural framework for assimilation and forecasting. We align training objectives directly with a combination of real-...

*Notable papers are those with at least two authors from a "big" AI/ML lab.