πŸ“„ Notable* Recent AI/ML arXiv Papers

πŸ“„ Test-Time Training with KV Binding Is Secretly Linear Attention
πŸ—“οΈ Published: 2/24/2026
πŸ”— http://arxiv.org/abs/2602.21204v1
πŸ‘₯ Authors: Junchen Liu, Sven Elflein, Or Litany (possible past Stanford University affiliation), Zan Gojcic, Ruilong Li (possible past Tsinghua University affiliation)
Abstract

Test-time training (TTT) with KV binding as sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a form of learned linear attention operator. Beyond explaining previously puzzling model...
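The claimed correspondence can be illustrated in a toy setting. The sketch below is our own construction, not the paper's: it assumes zero-initialized fast weights, an inner-product inner loss l(W) = -v^T W k (simpler than the KV-binding loss), and one SGD step per token, and shows that the resulting test-time-training update accumulates exactly the sum of outer products that defines unnormalized linear attention:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4
K = rng.normal(size=(T, d))   # keys, one per token
V = rng.normal(size=(T, d))   # values, one per token
q = rng.normal(size=d)        # query at the final position
lr = 0.5

# TTT view: fast weights W take one SGD step per token on the inner
# loss l(W) = -v^T W k, whose gradient is -v k^T.
W = np.zeros((d, d))
for k, v in zip(K, V):
    W += lr * np.outer(v, k)          # W <- W - lr * dl/dW
out_ttt = W @ q

# Linear-attention view: out = lr * sum_t v_t (k_t . q).
out_lin = lr * (V.T @ (K @ q))

assert np.allclose(out_ttt, out_lin)
```

With a squared-error inner loss the update gains an extra decay term (a DeltaNet-style rule), which is why the paper speaks of a broader class of learned linear attention operators rather than this single special case.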

πŸ“„ Aletheia tackles FirstProof autonomously
πŸ—“οΈ Published: 2/24/2026
πŸ”— http://arxiv.org/abs/2602.21201v1
πŸ‘₯ Authors: Tony Feng, Junehyuk Jung, Sang-Hyun Kim, Carlo Pagano, Sergei Gukov, Chiang-Chiang Tsai, David Woodruff, Adel Javanmard, Aryan Mokhtari, Dawsen Hwang, Yuri Chervonyi, Jonathan N. Lee, Garrett Bingham, Trieu H. Trinh (possible past Google (United States) affiliation), Vahab Mirrokni (possible past Google (United States) affiliation), Quoc V. Le (possible past Stanford University affiliation), Thang Luong (possible past Stanford University affiliation)
Abstract

We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge. Within the allowed timeframe of the challenge, Aletheia autonomously solved 6 of the 10 problems (2, 5, 7, 8, 9, 10) according to majority expert assessments; Problem 8 was the only one on which experts were not unanimous. For full transparency, we explain our interpretation of FirstProof and disclose details about our experiments as well as...

πŸ“„ Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
πŸ—“οΈ Published: 2/24/2026
πŸ”— http://arxiv.org/abs/2602.21198v1
πŸ‘₯ Authors: Yining Hong, Huang Huang, Manling Li, Li Fei-Fei (possible past Stanford University affiliation), Jiajun Wu (possible past Massachusetts Institute Of Technology affiliation), Yejin Choi (possible past Allen Institute For Artificial Intelligence affiliation)
Abstract

Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: "reflection-in-action", where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflec...
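Reflection-in-action, as described, amounts to best-of-N action selection with an internal critic. A minimal sketch follows; the function names, the toy actions, and the dummy scoring rule are all hypothetical stand-ins, not the paper's interfaces:

```python
from typing import Callable, List

def reflect_in_action(
    propose: Callable[[str, int], List[str]],   # hypothetical LLM action sampler
    score: Callable[[str, str], float],         # hypothetical internal critic
    observation: str,
    n_candidates: int = 4,
) -> str:
    """Test-time scaling: sample several candidate actions, then let an
    internal reflection (critic) score each before committing to one."""
    candidates = propose(observation, n_candidates)
    return max(candidates, key=lambda a: score(observation, a))

# Toy stand-ins for demonstration only.
actions = ["open drawer", "push drawer", "pick up the cup"]
best = reflect_in_action(
    propose=lambda obs, n: actions[:n],
    score=lambda obs, a: float(len(a)),   # dummy critic: longer is "better"
    observation="cup is inside the closed drawer",
)
assert best == "pick up the cup"
```

In a real embodied agent the critic would itself be an LLM call or value model; the structure (sample, score, commit) is the part the abstract describes.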

πŸ“„ Cooperative-Competitive Team Play of Real-World Craft Robots
πŸ—“οΈ Published: 2/24/2026
πŸ”— http://arxiv.org/abs/2602.21119v1
πŸ‘₯ Authors: Rui Zhao, Xihui Li, Yizheng Zhang, Yuzhen Liu, Zhong Zhang, Yufeng Zhang, Cheng Zhou, Zhengyou Zhang (possible past Tencent (China) affiliation), Lei Han (possible past Tencent (China) affiliation)
Abstract

Multi-agent deep Reinforcement Learning (RL) has made significant progress in developing intelligent game-playing agents in recent years. However, the efficient training of collective robots using multi-agent RL and the transfer of learned policies to real-world applications remain open research questions. In this work, we first develop a comprehensive robotic system, including simulation, distributed learning framework, and physical robot components. We then propose and evaluate reinforcement l...

πŸ“„ LogicGraph: Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification
πŸ—“οΈ Published: 2/24/2026
πŸ”— http://arxiv.org/abs/2602.21044v1
πŸ‘₯ Authors: Yanrui Wu, Lingling Zhang (possible past Google (United States) affiliation), Xinyu Zhang (possible past Baidu (China) affiliation), Jiayu Chang, Pengyu Li, Xu Jiang, Jingtao Hu, Jun Liu (possible past Tencent (China) affiliation)
Abstract

Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof. However, many real-world reasoning problems admit multiple valid derivations, requiring models to explore diverse logical paths rather than committing to one route. To address this limitation, we introduce LogicGraph, the first benchmark designed to systematically evaluate multi-path logical reasoning, constructed via a neuro-symbolic framework ...

πŸ“„ CrystaL: Spontaneous Emergence of Visual Latents in MLLMs
πŸ—“οΈ Published: 2/24/2026
πŸ”— http://arxiv.org/abs/2602.20980v1
πŸ‘₯ Authors: Yang Zhang (possible past Tsinghua University affiliation), Danyang Li, Yuxuan Li, Xin Zhang (possible past Google (United States) affiliation), Tianyu Xie, Mingming Cheng, Xiang Li
Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable performance by integrating powerful language backbones with large-scale visual encoders. Among these, latent Chain-of-Thought (CoT) methods enable implicit reasoning in continuous hidden states, facilitating seamless vision-language integration and faster inference. However, existing heuristically predefined supervision signals in latent CoT provide limited guidance for preserving critical visual information in intermediate latent...

πŸ“„ AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs
πŸ—“οΈ Published: 2/24/2026
πŸ”— http://arxiv.org/abs/2602.20720v1
πŸ‘₯ Authors: Che Wang, Jiaming Zhang, Ziqi Zhang (possible past Tencent (China) affiliation), Zijie Wang, Yinghui Wang, Jianbo Gao, Tao Wei (possible past Baidu (China) affiliation), Zhong Chen, Wei Yang Bryan Lim
Abstract

The integration of external data services (e.g., Model Context Protocol, MCP) has made large language model-based agents increasingly powerful for complex task execution. However, this advancement introduces critical security vulnerabilities, particularly indirect prompt injection (IPI) attacks. Existing attack methods are limited by their reliance on static patterns and evaluation on simple language models, failing to address the fast-evolving nature of modern AI agents. We introduce AdapTools,...

πŸ“„ CAMEL: Confidence-Gated Reflection for Reward Modeling
πŸ—“οΈ Published: 2/24/2026
πŸ”— http://arxiv.org/abs/2602.20670v1
πŸ‘₯ Authors: Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Kun Xu (possible past Tsinghua University affiliation), Yang You (possible past University Of California, Berkeley affiliation)
Abstract

Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead. We observe that the log-probability margin between verdict tokens strongly correlates with prediction correctness, providing a reliable proxy for instance di...
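The gating idea can be sketched as follows; the two-token verdict vocabulary, the threshold value, and the interface are illustrative assumptions on our part, not CAMEL's actual design:

```python
import math

def verdict_margin(logits: dict) -> float:
    """Log-probability margin between verdict tokens. `logits` maps
    candidate verdict tokens to raw model logits (a hypothetical
    interface; a real reward model exposes these via its vocabulary)."""
    log_z = math.log(sum(math.exp(v) for v in logits.values()))
    logp = {t: v - log_z for t, v in logits.items()}
    return abs(logp["A"] - logp["B"])

def judge(logits: dict, threshold: float = 1.0) -> str:
    """Confidence-gated judging: emit the verdict directly when the
    margin is large, escalate to (more expensive) generative
    reflection when the model is uncertain."""
    if verdict_margin(logits) >= threshold:
        return max(logits, key=logits.get)    # cheap scalar-style verdict
    return "reflect"                          # slow path: generate reasoning

assert judge({"A": 4.0, "B": 0.5}) == "A"        # confident: margin 3.5
assert judge({"A": 1.1, "B": 1.0}) == "reflect"  # uncertain: margin 0.1
```

Note that for a two-token softmax the log-probability margin equals the raw logit gap, which is what makes it cheap to read off per instance.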

πŸ“„ Personal Information Parroting in Language Models
πŸ—“οΈ Published: 2/24/2026
πŸ”— http://arxiv.org/abs/2602.20580v1
πŸ‘₯ Authors: Nishant Subramani (possible past Allen Institute For Artificial Intelligence affiliation), Kshitish Ghate, Mona Diab (possible past Carnegie Mellon University affiliation)
Abstract

Modern language models (LMs) are trained on large scrapes of the Web, containing millions of personal information (PI) instances, many of which LMs memorize, increasing privacy risks. In this work, we develop the regexes and rules (R&R) detector suite to detect email addresses, phone numbers, and IP addresses, which outperforms the best regex-based PI detectors. On a manually curated set of 483 instances of PI, we measure memorization, finding that 13.6% are parroted verbatim by the Pythia-6.9b m...
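For illustration, generic regex detectors for the three PI categories can be sketched like this; these simple patterns are ours and are certainly looser than the paper's R&R suite, which augments regexes with validation rules:

```python
import re

# Illustrative patterns only, not the R&R detectors from the paper.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}"
        r"(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "phone": re.compile(r"\+?\d{1,3}[ .-]?\(?\d{2,4}\)?(?:[ .-]?\d{2,4}){2,3}"),
}

def detect_pi(text: str) -> dict:
    """Return all matches per PI category (no validation rules applied)."""
    return {kind: pat.findall(text) for kind, pat in PATTERNS.items()}

hits = detect_pi("Contact ada@example.org from 192.168.0.1")
assert hits["email"] == ["ada@example.org"]
assert hits["ipv4"] == ["192.168.0.1"]
```

Bare regexes like these over-trigger (e.g. version strings that look like IPs), which is precisely the gap that rule-based post-filtering is meant to close.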

πŸ“„ From Logs to Language: Learning Optimal Verbalization for LLM-Based Recommendation in Production
πŸ—“οΈ Published: 2/24/2026
πŸ”— http://arxiv.org/abs/2602.20558v1
πŸ‘₯ Authors: Yucheng Shi, Ying Li (possible past Meta (United States) affiliation), Yu Wang (possible past Tsinghua University affiliation), Yesu Feng, Arjun Rao, Rein Houthooft (possible past University Of California, Berkeley affiliation), Shradha Sehgal, Jin Wang, Hao Zhen, Ninghao Liu, Linas Baltrunas
Abstract

Large language models (LLMs) are promising backbones for generative recommender systems, yet a key challenge remains underexplored: verbalization, i.e., converting structured user interaction logs into effective natural language inputs. Existing methods rely on rigid templates that simply concatenate fields, yielding suboptimal representations for recommendation. We propose a data-centric framework that learns verbalization for LLM-based recommendation. Using reinforcement learning, a verbalizat...

πŸ“„ PreScience: A Benchmark for Forecasting Scientific Contributions
πŸ—“οΈ Published: 2/24/2026
πŸ”— http://arxiv.org/abs/2602.20459v1
πŸ‘₯ Authors: Anirudh Ajith, Amanpreet Singh (possible past Meta (United States) affiliation), Jay Deyoung, Nadav Kunievsky, Austin C. Kozlowski, Oyvind Tafjord, James Evans, Daniel S. Weld (possible past University Of Washington affiliation), Tom Hope (possible past Tencent (China) affiliation), Doug Downey (possible past Allen Institute For Artificial Intelligence affiliation)
Abstract

Can AI systems trained on the scientific record up to a fixed point in time forecast the scientific advances that follow? Such a capability could help researchers identify collaborators and impactful research directions, and anticipate which problems and methods will become central next. We introduce PreScience -- a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generatio...

πŸ“„ Learning Physical Principles from Interaction: Self-Evolving Planning via Test-Time Memory
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.20323v1
πŸ‘₯ Authors: Haoyang Li, Yang You (possible past University Of California, Berkeley affiliation), Hao Su, Leonidas Guibas (possible past Stanford University affiliation)
Abstract

Reliable object manipulation requires understanding physical properties that vary across objects and environments. Vision-language model (VLM) planners can reason about friction and stability in general terms; however, they often cannot predict how a specific ball will roll on a particular surface or which stone will provide a stable foundation without direct experience. We present PhysMem, a memory framework that enables VLM robot planners to learn physical principles from interaction at test t...

πŸ“„ InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.20294v1
πŸ‘₯ Authors: Yu Li (possible past Tencent (China) affiliation), Pranav Narayanan Venkit, Yada Pruksachatkun, Chien-Sheng Wu (possible past Salesforce (United States) affiliation)
Abstract

Simulating real personalities with large language models requires grounding generation in authentic personal data. Existing evaluation approaches rely on demographic surveys, personality questionnaires, or short AI-led interviews as proxies, but lack direct assessment against what individuals actually said. We address this gap with an interview-grounded evaluation framework for personality simulation at a large scale. We extract over 671,000 question-answer pairs from 23,000 verified interview t...

πŸ“„ A Very Big Video Reasoning Suite
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.20159v2
πŸ‘₯ Authors: Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer, Qingying Gao, Dezhi Luo, Yaoyao Qian, Lianyu Huang, Zelong Hong, Jiahui Ge, Qianli Ma, Hang He, Yifan Zhou, Lingzi Guo, Lantao Mei, Jiachen Li, Hanwen Xing, Tianqi Zhao, Fengyuan Yu, Weihang Xiao, Yizheng Jiao, Jianheng Hou, Danyang Zhang, Pengcheng Xu, Boyang Zhong, Zehong Zhao, Gaoyun Fang, John Kitaoka, Yile Xu, Hua Xu, Kenton Blacutt, Tin Nguyen, Siyuan Song, Haoran Sun, Shaoyue Wen, Linyang He, Runming Wang, Yanzhi Wang, Mengyue Yang, Ziqiao Ma, Raphaël Millière, Freda Shi, Nuno Vasconcelos, Daniel Khashabi, Alan Yuille (possible past Google (United States) affiliation), Yilun Du (possible past Massachusetts Institute Of Technology affiliation), Ziming Liu (possible past Massachusetts Institute Of Technology affiliation), Bo Li (possible past Tencent (China) affiliation), Dahua Lin, Ziwei Liu, Vikash Kumar (possible past University Of Washington affiliation), Yijiang Li, Lei Yang (possible past Google (United States) affiliation), Zhongang Cai, Hokin Deng
Abstract

Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual environments that go beyond what text can naturally capture, enabling intuitive reasoning over spatiotemporal structure such as continuity, interaction, and causality. However, systematically studying video reasoning and its scaling behavior is hindered by the lack of large-scale training data. To addr...

πŸ“„ AdaEvolve: Adaptive LLM Driven Zeroth-Order Optimization
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.20133v1
πŸ‘₯ Authors: Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu (possible past Tencent (China) affiliation), Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen (possible past University Of California, Berkeley affiliation), Matei Zaharia (possible past University Of California, Berkeley affiliation), Alex Dimakis, Ion Stoica (possible past University Of California, Berkeley affiliation)
Abstract

The paradigm of automated program generation is shifting from one-shot generation to inference-time search, where Large Language Models (LLMs) function as semantic mutation operators within evolutionary loops. While effective, these systems are currently governed by static schedules that fail to account for the non-stationary dynamics of the search process. This rigidity results in substantial computational waste, as resources are indiscriminately allocated to stagnating populations while promis...
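The adaptive-allocation idea can be illustrated with a toy evolutionary loop in which a stand-in mutation operator plays the role of the LLM and compute is routed greedily to the island that improved most recently. Everything below (the controller, island structure, and bit-string task) is our illustrative assumption, not AdaEvolve's actual design:

```python
import random

def evolve(mutate, fitness, seed, islands=3, budget=30):
    """Inference-time search: each round, spend the next mutation on the
    island with the largest most-recent improvement, instead of a static
    round-robin schedule over all populations."""
    pops = [[seed] for _ in range(islands)]
    gains = [1.0] * islands   # optimistic init so every island gets tried
    for _ in range(budget):
        i = max(range(islands), key=lambda j: gains[j])   # greedy allocation
        parent = max(pops[i], key=fitness)
        child = mutate(parent)
        gains[i] = max(0.0, fitness(child) - fitness(parent))
        pops[i].append(child)
    return max((s for pop in pops for s in pop), key=fitness)

def flip(s):
    """Stand-in for an LLM mutation operator: flip one random bit."""
    i = random.randrange(len(s))
    return s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:]

random.seed(0)
best = evolve(flip, lambda s: s.count("1"), seed="0000000000")
assert best.count("1") >= 1   # search improves on the all-zeros seed
```

A stagnating island's recent gain drops to zero, so the controller automatically shifts budget elsewhere, which is the non-stationarity argument the abstract makes against static schedules.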

πŸ“„ NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.20119v1
πŸ‘₯ Authors: Jiahui Fu, Junyu Nan, Lingfeng Sun, Hongyu Li (possible past Meta (United States) affiliation), Jianing Qian, Jennifer L. Barry, Kris Kitani (possible past Carnegie Mellon University affiliation), George Konidaris
Abstract

Solving long-horizon tasks requires robots to integrate high-level semantic reasoning with low-level physical interaction. While vision-language models (VLMs) and video generation models can decompose tasks and imagine outcomes, they often lack the physical grounding necessary for real-world execution. We introduce NovaPlan, a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. At the high lev...

πŸ“„ Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.20078v1
πŸ‘₯ Authors: Shan Yang (possible past Google (United States) affiliation), Yang Liu (possible past Tsinghua University affiliation)
Abstract

Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise: when agents share a common reward, the actions of all $N$ agents jointly determine each agent's learning signal, so cross-agent noise grows with $N$. In the policy gradient setting, per-agent gradient estimate variance scales as $\Theta(N)$, yielding sample complexity $\mathcal{O}(N/\varepsilon)$. We observe that many domains -- cloud computing, transportation, power systems -- have differentiable analy...
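The variance claim can be made concrete with a back-of-the-envelope calculation under a simplifying independence assumption that is ours, not the paper's. Take the per-agent REINFORCE estimator with a shared additive reward:

```latex
\hat g_i \;=\; \nabla_{\theta_i}\log\pi_{\theta_i}(a_i)\, R,
\qquad
R \;=\; \sum_{j=1}^{N} r_j(a_j).
```

If the per-agent returns $r_j$ are independent with variance $\sigma^2$, the $N-1$ terms contributed by the other agents act as pure noise on agent $i$'s learning signal, so $\mathrm{Var}(\hat g_i)$ grows like $(N-1)\sigma^2 = \Theta(N)$, and the sample budget needed to hold the estimation error to a fixed tolerance then scales linearly in $N$, consistent with the quoted sample complexity.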

πŸ“„ Agents of Chaos
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.20021v1
πŸ‘₯ Authors: Natalie Shapira, Chris Wendler, Avery Yen, Gabriele Sarti, Koyena Pal, Olivia Floody, Adam Belfki, Alex Loftus, Aditya Ratan Jannali, Nikhil Prakash, Jasmine Cui, Giordano Rogers, Jannik Brinkmann, Can Rager, Amir Zur, Michael Ripa, Aruna Sankaranarayanan, David Atkinson, Rohit Gandikota, Jaden Fiotto-Kaufman, Eunjeong Hwang, Hadas Orgad, P Sam Sahil, Negev Taglicht, Tomer Shabtay, Atai Ambus, Nitay Alon, Shiri Oron, Ayelet Gordon-Tapiero, Yotam Kaplan, Vered Shwartz, Tamar Rott Shaham (possible past Technion – Israel Institute Of Technology affiliation), Christoph Riedl, Reuth Mirsky, Maarten Sap, David Manheim, Tomer Ullman (possible past Massachusetts Institute Of Technology affiliation), David Bau (possible past Google (United States) affiliation)
Abstract

We report an exploratory red-teaming study of autonomous language-model-powered agents deployed in a live laboratory environment with persistent memory, email accounts, Discord access, file systems, and shell execution. Over a two-week period, twenty AI researchers interacted with the agents under benign and adversarial conditions. Focusing on failures emerging from the integration of language models with autonomy, tool use, and multi-party communication, we document eleven representative case s...

πŸ“„ What Matters for Simulation to Online Reinforcement Learning on Real Robots
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.20220v1
πŸ‘₯ Authors: Yarden As, Dhruva Tirumala (possible past Deepmind (United Kingdom) affiliation), René Zurbrügg, Chenhao Li, Stelian Coros, Andreas Krause (possible past Eth Zurich affiliation), Markus Wulfmeier
Abstract

We investigate what specific design choices enable successful online reinforcement learning (RL) on physical robots. Across 100 real-world training runs on three distinct robotic platforms, we systematically ablate algorithmic, systems, and experimental decisions that are typically left implicit in prior work. We find that some widely used defaults can be harmful, while a set of robust, readily adopted design choices within standard RL practice yields stable learning across tasks and hardware. Th...

πŸ“„ VecFormer: Towards Efficient and Generalizable Graph Transformer with Graph Token Attention
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.19622v1
πŸ‘₯ Authors: Jingbo Zhou (possible past Baidu (China) affiliation), Jun Xia, Siyuan Li (possible past Tencent (China) affiliation), Yunfan Liu, Wenjun Wang, Yufei Huang, Changxi Chi, Mutian Hong, Zhuoli Ouyang, Shu Wang, Zhongqi Wang, Xingyu Wu, Chang Yu, Stan Z. Li
Abstract

Graph Transformer has demonstrated impressive capabilities in the field of graph representation learning. However, existing approaches face two critical challenges: (1) most models suffer from exponentially increasing computational complexity, making it difficult to scale to large graphs; (2) attention mechanisms based on node-level operations limit the flexibility of the model and result in poor generalization performance in out-of-distribution (OOD) scenarios. To address these issues, we propo...

πŸ“„ SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
πŸ—“οΈ Published: 2/24/2026
πŸ”— http://arxiv.org/abs/2602.21158v1
πŸ‘₯ Authors: Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang (possible past Baidu (China) affiliation), Kenton Murray, Hua Wei (possible past Google (United States) affiliation)
Abstract

Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning. Although recent work explores various forms of reward shaping and step-level credit assignment, a key signal remains largely overlooked: the intrinsic uncertainty of LLMs. Uncertainty reflects model confidence, reveals where exploration is needed, and offers valuable learning cues even in failed trajectories. We introduce SELAUR: Self Evolv...

πŸ“„ LUMEN: Longitudinal Multi-Modal Radiology Model for Prognosis and Diagnosis
πŸ—“οΈ Published: 2/24/2026
πŸ”— http://arxiv.org/abs/2602.21142v1
πŸ‘₯ Authors: Zhifan Jiang, Dong Yang (possible past Nvidia (United States) affiliation), Vishwesh Nath (possible past Nvidia (United States) affiliation), Abhijeet Parida, Nishad P. Kulkarni, Ziyue Xu (possible past Nvidia (United States) affiliation), Daguang Xu (possible past Nvidia (United States) affiliation), Syed Muhammad Anwar, Holger R. Roth (possible past Nvidia (United States) affiliation), Marius George Linguraru
Abstract

Large vision-language models (VLMs) have evolved from general-purpose applications to specialized use cases such as in the clinical domain, demonstrating potential for decision support in radiology. One promising application is assisting radiologists in decision-making by the analysis of radiology imaging data such as chest X-rays (CXR) via a visual and natural language question-answering (VQA) interface. When longitudinal imaging is available, radiologists analyze temporal changes, which are es...

πŸ“„ SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models
πŸ—“οΈ Published: 2/24/2026
πŸ”— http://arxiv.org/abs/2602.20901v1
πŸ‘₯ Authors: Yuechen Xie, Xiaoyan Zhang, Yicheng Shan, Hao Zhu (possible past Tsinghua University affiliation), Rui Tang, Rong Wei, Mingli Song, Yuanyu Wan, Jie Song (possible past Eth Zurich affiliation)
Abstract

Vision-Language Models (VLMs) have been increasingly applied in real-world scenarios due to their outstanding understanding and reasoning capabilities. Although VLMs have already demonstrated impressive capabilities in common visual question answering and logical reasoning, they still lack the ability to make reasonable decisions in complex real-world environments. We define this ability as spatial logical reasoning, which not only requires understanding the spatial relationships among objects i...

πŸ“„ Rethink Efficiency Side of Neural Combinatorial Solver: An Offline and Self-Play Paradigm
πŸ—“οΈ Published: 2/24/2026
πŸ”— http://arxiv.org/abs/2602.20730v1
πŸ‘₯ Authors: Zhenxing Xu, Zeyuan Ma, Weidong Bao (possible past National University Of Defense Technology affiliation), Hui Yan, Yan Zheng, Ji Wang (possible past Tencent (China) affiliation)
Abstract

We propose ECO, a versatile learning paradigm that enables efficient offline self-play for Neural Combinatorial Optimization (NCO). ECO addresses key limitations in the field through: 1) Paradigm Shift: Moving beyond inefficient online paradigms, we introduce a two-phase offline paradigm consisting of supervised warm-up and iterative Direct Preference Optimization (DPO); 2) Architecture Shift: We deliberately design a Mamba-based architecture to further enhance the efficiency in the offline para...

πŸ“„ Fuz-RL: A Fuzzy-Guided Robust Framework for Safe Reinforcement Learning under Uncertainty
πŸ—“οΈ Published: 2/24/2026
πŸ”— http://arxiv.org/abs/2602.20729v1
πŸ‘₯ Authors: Xu Wan, Chao Yang, Cheng Yang (possible past Tsinghua University affiliation), Jie Song (possible past Eth Zurich affiliation), Mingyang Sun
Abstract

Safe Reinforcement Learning (RL) is crucial for achieving high performance while ensuring safety in real-world applications. However, the complex interplay of multiple uncertainty sources in real environments poses significant challenges for interpretable risk assessment and robust decision-making. To address these challenges, we propose Fuz-RL, a fuzzy measure-guided robust framework for safe RL. Specifically, our framework develops a novel fuzzy Bellman operator for estimating robust value fun...

πŸ“„ QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs
πŸ—“οΈ Published: 2/24/2026
πŸ”— http://arxiv.org/abs/2602.20629v1
πŸ‘₯ Authors: Santiago Gonzalez, Alireza Amiri Bavandpour, Peter Ye, Edward Zhang (possible past University Of Washington affiliation), Ruslans Aleksejevs, Todor Antić, Polina Baron, Sujeet Bhalerao, Shubhrajit Bhattacharya, Zachary Burton, John Byrne, Hyungjun Choi, Nujhat Ahmed Disha, Koppany István Encz, Yuchen Fang, Robert Joseph George, Ebrahim Ghorbani, Alan Goldfarb, Jing Guo (possible past Tsinghua University affiliation), Meghal Gupta, Stefano Huber, Annika Kanckos, Minjung Kang, Hyun Jong Kim, Dino Lorenzini, Levi Lorenzo, Tianyi Mao, Giovanni Marzenta, Ariane M. Masuda, Lukas Mauth, Ana Mickovic, Andres Miniguano-Trujillo, Antoine Moulin, Wenqi Ni, Tomos Parry, Kevin Ren, Hossein Roodbarani, Mathieu Rundström, Manjil Saikia, Detchat Samart, Rebecca Steiner, Connor Stewart, Dhara Thakkar, Jeffrey Tse, Vasiliki Velona, Yunhai Xiang, Sibel Yalçın, Jun Yan, Ji Zeng, Arman Cohan, Quanquan C. Liu
Abstract

As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic Alignment Gap when applied to upper-undergraduate to early graduate level mathematics. To quantify this, we introduce QEDBench, the first large-scale dual-rubric alignment benchmark to systematically measure alignment with human experts on university-level math...

πŸ“„ CGSTA: Cross-Scale Graph Contrast with Stability-Aware Alignment for Multivariate Time-Series Anomaly Detection
πŸ—“οΈ Published: 2/24/2026
πŸ”— http://arxiv.org/abs/2602.20468v1
πŸ‘₯ Authors: Zhongpeng Qi, Jun Zhang (possible past Tencent (China) affiliation), Wei Li (possible past Peking University affiliation), Zhuoxuan Liang
Abstract

Multivariate time-series anomaly detection is essential for reliable industrial control, telemetry, and service monitoring. However, the evolving inter-variable dependencies and inevitable noise render it challenging. Existing methods often use single-scale graphs or instance-level contrast. Moreover, learned dynamic graphs can overfit noise without a stable anchor, causing false alarms or misses. To address these challenges, we propose the CGSTA framework with two key innovations. First, Dynami...

πŸ“„ GeoPT: Scaling Physics Simulation via Lifted Geometric Pre-Training
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.20399v1
πŸ‘₯ Authors: Haixu Wu, Minghao Guo, Zongyi Li, Zhiyang Dou, Mingsheng Long (possible past Tsinghua University affiliation), Kaiming He (possible past Microsoft (United States) affiliation), Wojciech Matusik
Abstract

Neural simulators promise efficient surrogates for physics simulation, but scaling them is bottlenecked by the prohibitive cost of generating high-fidelity training data. Pre-training on abundant off-the-shelf geometries offers a natural alternative, yet faces a fundamental gap: supervision on static geometry alone ignores dynamics and can lead to negative transfer on physics tasks. We present GeoPT, a unified pre-trained model for general physics simulation based on lifted geometric pre-trainin...

πŸ“„ DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning
πŸ—“οΈ Published: 2/23/2026
πŸ”— http://arxiv.org/abs/2602.19895v1
πŸ‘₯ Authors: Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang (possible past Google (United States) affiliation), Xin Wang (possible past University Of Edinburgh affiliation), Hui Shen, Jing Xiong, Chaofan Tao, Zixuan Zhong, Peizhou Huang, Mi Zhang
Abstract

Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, leading to weak and unstable learning signals in group-based policy optimization. We propose DS...

*Notable papers are those with at least two authors from a "big" AI/ML lab.