📄 Notable* Recent AI/ML arXiv Papers

📄 Atoms of Thought: Universal EEG Representation Learning with Microstates
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.20182v1
👥 Authors: Xinyang Tian, Ruitao Liu, Ziyi Ye, Siyang Xue, Xin Wang (possible past University of Edinburgh affiliation), Xuesong Chen (possible past Peking University affiliation)
Abstract

Learning universal representations from electroencephalogram (EEG) signals is a cutting-edge approach in the field of neuroinformatics and brain-computer interfaces (BCIs). Conventionally, EEG is treated as a multivariate temporal signal, where time- or frequency-domain features are extracted for representation learning. This paper investigates a simple yet effective EEG representation, i.e., microstates. Microstates represent the building blocks of brain activity patterns at a microscopic time ...

📄 VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.20082v1
👥 Authors: Zhefan Xu, Ghassen Jerfel (possible past Google (United States) affiliation), Marina Haliem, Qi Zhao (possible past Google (United States) affiliation), Jeonhyung Kang, Khaled S. Refaat
Abstract

The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework ...

📄 AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.20025v1
👥 Authors: Jiaqi Liu, Shi Qiu, Mairui Li, Bingzhou Li, Haonian Ji, Siwei Han, Xinyu Ye, Peng Xia, Zihan Dong, Congyu Zhang, Letian Zhang, Guiming Chen, Haoqin Tu, Xinyu Yang, Lu Feng, Xujiang Zhao, Haifeng Chen, Jiawei Zhou, Xiao Wang (possible past Google (United States) affiliation), Weitong Zhang, Hongtu Zhu, Yun Li, Jieru Mei, Hongliang Fei, Jiaheng Zhang, Linjie Li, Linjun Zhang, Yuyin Zhou, Sheng Wang (possible past Tencent (China) affiliation), Caiming Xiong (possible past Salesforce (United States) affiliation), James Zou, Zeyu Zheng, Cihang Xie (possible past Google (United States) affiliation), Mingyu Ding, Huaxiu Yao
Abstract

Automating scientific discovery requires more than generating papers from ideas. Real research is iterative: hypotheses are challenged from multiple perspectives, experiments fail and inform the next attempt, and lessons accumulate across cycles. Existing autonomous research systems often model this process as a linear pipeline: they rely on single-agent reasoning, stop when execution fails, and do not carry experience across runs. We present AutoResearchClaw, a multi-agent autonomous research p...

📄 Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19833v1
👥 Authors: Zhifei Xie, Kaiyu Pang, Haobin Zhang, Deheng Ye (possible past Tencent (China) affiliation), Xiaobin Hu (possible past Tencent (China) affiliation), Shuicheng Yan (possible past National University of Singapore affiliation), Chunyan Miao
Abstract

Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-...

📄 ST-TGExplainer: Disentangling Stability and Transition Patterns for Temporal GNN Interpretability
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19822v1
👥 Authors: Hongjiang Chen, Xin Zheng, Pengfei Jiao, Huan Liu (possible past Tsinghua University affiliation), Zhidong Zhao, Huaming Wu, Feng Xia (possible past Tencent (China) affiliation), Shirui Pan
Abstract

Temporal graph neural networks (TGNNs) have gained significant traction for solving real-world temporal graph tasks. However, their interpretability remains limited, as most TGNNs fail to identify which historical interactions most influence a given prediction. Despite promising progress on interpretable TGNNs, existing methods predominantly focus on previously seen historical interactions, which we term stability patterns, while overlooking newly emerging first-time interactions, which we term ...

📄 Stitched Value Model for Diffusion Alignment
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19804v1
👥 Authors: Hyojun Go, Hyungjin Chung, Prune Truong, Goutam Bhat (possible past ETH Zurich affiliation), Li Mi, Zhaochong An, Zixiang Zhao, Dominik Narnhofer, Serge Belongie (possible past Google (United States) affiliation), Federico Tombari (possible past Google (United States) affiliation), Konrad Schindler
Abstract

For practical use, diffusion- or flow-based generative models must be aligned with task-specific rewards, such as prompt fidelity or aesthetic preference. That alignment is challenging because the reward is defined for clean output images, but the alignment procedure requires value function estimates at noisy intermediate latents. Existing methods resort to Tweedie-style or Monte Carlo approximations, trading off estimator bias against computational cost: Tweedie estimates are efficient but bias...

📄 What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19762v1
👥 Authors: Yuze Zhao, Junpeng Fang, Lu Yu, Zhenya Huang, Kai Zhang, Qing Cui, Qi Liu (possible past Tencent (China) affiliation), Jun Zhou, Enhong Chen (possible past Baidu (China) affiliation)
Abstract

Code has become a standard component of modern foundation language model (LM) training, yet its role beyond programming remains unclear. We revisit the claim that code improves reasoning through controlled pretraining experiments on a 10T-token corpus with fine-grained domain separation. Our findings are threefold. First, when code is restricted to standalone executable programs and Code-NL data are controlled for, code substantially improves programming ability but does not act as a general rea...

📄 TERGAD: Structure-Aware Text-Enhanced Representations for Graph Anomaly Detection
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19738v1
👥 Authors: Wen Shi, Zhe Wang (possible past DeepMind (United Kingdom) affiliation), Huafei Huang, Qing Qing, Ziqi Xu, Qixin Zhang, Xikun Zhang, Renqiang Luo, Feng Xia (possible past Tencent (China) affiliation)
Abstract

Graph Anomaly Detection (GAD) aims to identify atypical graph entities, such as nodes, edges, or substructures, that deviate significantly from the majority. While existing text-rich approaches typically integrate structural context into the data representation pipeline using raw textual features, they often neglect the structural context of nodes. This limitation hinders their ability to detect sophisticated anomalies arising from inconsistencies between a node's inherent content and its topolo...

📄 optimize_anything: A Universal API for Optimizing any Text Parameter
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19633v1
👥 Authors: Lakshya A Agrawal, Donghyun Lee, Shangyin Tan, Wenjie Ma, Karim Elmaaroufi, Rohit Sandadi, Sanjit A. Seshia, Koushik Sen (possible past University of California, Berkeley affiliation), Dan Klein, Ion Stoica (possible past University of California, Berkeley affiliation), Joseph E. Gonzalez (possible past University of California, Berkeley affiliation), Omar Khattab, Alexandros G. Dimakis, Matei Zaharia (possible past University of California, Berkeley affiliation)
Abstract

Can a single LLM-based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI-based optimization system (supporting single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs) achieves state-of-the-art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemi...

📄 Implicit Action Chunking for Smooth Continuous Control
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19592v1
👥 Authors: Bosun Liang, Shuo Pei, Zirui Chen, Chuanzhi Fan, Chen Sun (possible past Google (United States) affiliation), Yuankai Wu, Huachun Tan, Yong Wang (possible past Baidu (China) affiliation)
Abstract

Reinforcement learning often produces high-frequency oscillatory control signals that undermine the safety and stability required for physical deployment. Explicit action chunking addresses this by predicting fixed-horizon trajectories but scales the policy output dimension proportionally with the horizon length, leading to optimization difficulties and incompatibility with standard step-wise interaction. To overcome these challenges, this paper proposes Dual-Window Smoothing (DWS), an implicit ...

📄 SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19587v1
👥 Authors: Puyi Wang, Yuhao Wang, Linjie Li, Zhengyuan Yang, Kevin Qinghong Lin (possible past National University of Singapore affiliation), Yangguang Li, Yu Cheng (possible past National University of Singapore affiliation)
Abstract

Indoor scene synthesis underpins embodied AI, robotic manipulation, and simulation-based policy evaluation, where a useful scene must specify not only what the environment looks like, but also how its objects are structured. Existing pipelines, however, typically represent generated content as static meshes and inherit articulation only from curated asset libraries, which limits object-level controllability and prevents new interactable assets from being produced on demand. We address this gap b...

📄 Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19523v1
👥 Authors: Zhiyu Xu, Lean Wang, Yuanxin Liu, Lei Li (possible past Carnegie Mellon University affiliation), Hao Zhou, Fandong Meng (possible past Tencent (China) affiliation), Jie Zhou (possible past Tsinghua University affiliation), Xu Sun (possible past Peking University affiliation)
Abstract

Vision-Language Models (VLMs) have demonstrated remarkable proficiency in general multi-modal understanding; yet they struggle to efficiently acquire continually evolving domain-specific skills. Conventional approaches to enhancing VLM capabilities, such as Supervised Fine-Tuning (SFT), require extensive dataset curation and substantial computational resources. Model merging has emerged as an efficient alternative that enables the transfer of domain-specific expertise from Large Language Models ...

📄 Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19461v1
👥 Authors: Xiaozhe Li, Yang Li (possible past Google (United States) affiliation), Xinyu Fang, Shengyuan Ding, Peiji Li, Yongkang Chen, Yichuan Ma, Tianyi Lyu, Linyang Li, Dahua Lin, Qipeng Guo, Qingwen Liu, Kai Chen (possible past Shanghai Jiao Tong University affiliation)
Abstract

On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once discovered and ceasing exploration of alternative strategies. We show this stems from reverse KL minimization's mode-seeking behavior, which reinforces the first high-reward trajectory found rather than maintaining a distribution over multiple diverse solutions. We propose DMPO (Distribution-Matching Policy Optimization), ...
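
The mode-seeking behavior of reverse KL mentioned in this abstract can be seen on a toy discrete distribution. The sketch below is not the paper's DMPO method (the abstract is truncated before describing it); the target `p` and the two candidate policies are made-up values chosen only to show that reverse KL favors a collapsed policy while forward KL favors a spread one.

```python
import math

def kl(a, b):
    """KL(a || b) for two discrete distributions given as equal-length lists."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

# Hypothetical target with two equally good "solutions" (states 0 and 3).
p = [0.49, 0.01, 0.01, 0.49]

# Two hypothetical candidate policies.
candidates = {
    "single": [0.97, 0.01, 0.01, 0.01],  # collapses onto one mode
    "spread": [0.25, 0.25, 0.25, 0.25],  # covers both modes, wastes mass in between
}

# Reverse KL is KL(q || p); forward KL is KL(p || q).
reverse = {name: kl(q, p) for name, q in candidates.items()}
forward = {name: kl(p, q) for name, q in candidates.items()}

print("reverse KL prefers:", min(reverse, key=reverse.get))
print("forward KL prefers:", min(forward, key=forward.get))
```

Reverse KL penalizes placing mass where the target has little, so dropping an entire mode costs less than covering the gap between modes.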

📄 What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19447v1
👥 Authors: Xiaozhe Li, Tianyi Lyu, Yang Li (possible past Google (United States) affiliation), Yichuan Ma, Peiji Li, Linyang Li, Qipeng Guo, Dahua Lin, Kai Chen (possible past Shanghai Jiao Tong University affiliation)
Abstract

Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed across many actions. Existing methods rely on trajectory-level rewards or proxy signals, without fully leveraging per-step environmental feedback. Multi-turn agent settings are underexplored, where feedback can include error messages, page changes, observations, or reference trajectories. We systematically study five ...

📄 Generative Recursive Reasoning
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19376v1
👥 Authors: Junyeob Baek, Mingyu Jo, Minsu Kim, Mengye Ren (possible past University of Toronto affiliation), Yoshua Bengio (possible past Mila - Quebec Artificial Intelligence Institute affiliation), Sungjin Ahn
Abstract

How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregressive sequence extension by performing iterative latent-state refinement with shared transition functions. Yet existing RRMs are largely deterministic, following a single latent trajectory and converging to a single prediction. We introduce Generative Recursive reAsoning Models (GRAM), a framework that turns recursive latent reasoning into ...

📄 HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19341v1
👥 Authors: Emmy Liu, Varun Gangal (possible past Carnegie Mellon University affiliation), Michael Yu, Zhuofu Tao, Karan Singh, Sachin Kumar, Steven Y. Feng (possible past Carnegie Mellon University affiliation)
Abstract

Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across summarization, question answering, retrieval-augmented generation, and agentic interaction. This fragmentation makes it unclear whether a mitigation that works in one setting reduces hallucinations across contexts. Current benchmarks either require human annotation and fixed references that may be memorized, or rely on observations in settings that are difficult ...

📄 STAR-PólyaMath: Multi-Agent Reasoning under Persistent Meta-Strategic Supervision
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19338v1
👥 Authors: Jiaao Wu, Xian Zhang, Hanzhang Liu, Sophia Zhang, Fan Yang (possible past Tencent (China) affiliation), Yinpeng Dong (possible past Tsinghua University affiliation)
Abstract

Frontier AI models and multi-agent systems have led to significant improvements in mathematical reasoning. However, for problems requiring extended, long-horizon reasoning, existing systems continue to suffer from fundamental reliability issues: hallucination accumulation, memory fragmentation, and imbalanced reasoning-tool trade-offs. In this paper, we introduce STAR-PólyaMath, a multi-agent framework that systematically addresses these challenges through meta-level supervision and structured R...

📄 ContextFlow: Hierarchical Task-State Alignment for Long-Horizon Embodied Agents
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19314v1
👥 Authors: Shuhan Guo, Kun Zhang (possible past Google (United States) affiliation), Haifei Liu, Xingyu Gao, Yongqi Zhang, Yaqing Wang (possible past Baidu (China) affiliation), Quanming Yao
Abstract

Long-horizon embodied agents increasingly delegate navigation, search, approach, and manipulation to specialist executors. As these executors become stronger, the main bottleneck shifts from local skill execution to maintaining a coherent task frontier across planning, monitoring, memory, and execution. We study task-state misalignment, a task-level consistency failure in which the planner's active stage, runtime evidence, remembered context, and delegated executor no longer justify the same nex...

📄 Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19228v1
👥 Authors: Xiaoou Liu, Tiejin Chen, Dengjia Zhang, Yaqing Wang (possible past Baidu (China) affiliation), Lu Cheng, Hua Wei (possible past Google (United States) affiliation)
Abstract

Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating step-by-step solutions, but diagnosing where a multi-step reasoning trace might fail remains difficult. Confidence estimation offers a diagnostic signal, yet existing methods are restricted to final answers or require internal model access. In this paper, we introduce Stepwise Confidence Attribution (SCA), a framework for closed-source LLMs that assigns step-level confidence based only ...

📄 Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19227v1
👥 Authors: Tobias Braun, Jonas Henry Grebe, Hossein Shakibania, Anna Rohrbach (possible past University of California, Berkeley affiliation), Marcus Rohrbach (possible past University of California, Berkeley affiliation)
Abstract

Unified autoregressive models (UAMs) are transformer models that generate text as well as image tokens within a single autoregressive pass. Shared parameters and a multimodal vocabulary simplify the training pipeline and facilitate flexible multimodal generation, yet might introduce new vulnerabilities. In particular, we are the first to show that this unified architecture enables multimodal backdoor attacks, where a trigger can propagate malicious effects across multiple output modalities. Spec...

📄 Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts
🗓️ Published: 5/18/2026
🔗 http://arxiv.org/abs/2605.19093v1
👥 Authors: Zhiyuan Jerry Lin, Benjamin Letham (possible past Meta (United States) affiliation), Samuel Dooley, Maximilian Balandat, Eytan Bakshy (possible past Meta (United States) affiliation)
Abstract

System prompts are a central control mechanism in modern AI systems, shaping behavior across conversations, tasks, and user populations. Yet they are difficult to tune when feedback is available only as aggregate metrics rather than per-example labels, failures, or critiques. We study this aggregate feedback setting as sample-constrained black-box optimization over discrete, variable-length text. We introduce ReElicit, a Bayesian optimization framework based on embedding by elicitation. G...

📄 Smooth Partial Lotteries for Stable Randomized Selection
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.20069v1
👥 Authors: Alexander Goldberg, Giulia Fanti (possible past University of California, Berkeley affiliation), Nihar B. Shah (possible past University of California, Berkeley affiliation)
Abstract

Competitive selection processes, from scientific funding to admissions and hiring, use evaluations to score candidates, and eventually choose a subset of them based on those scores. Recently, many organizations have adopted partial lotteries, which randomize selection based on evaluation scores. However, existing lottery designs are inherently unstable, as a small change to a single candidate's score can cause large shifts in their selection probabilities. This instability undermines a key goal ...
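
The instability described in this abstract can be illustrated with a toy threshold-based lottery. This is a sketch under assumptions: the abstract does not specify any particular lottery design, so the accept/lottery/reject rule, the thresholds `lo` and `hi`, and the scores below are all hypothetical.

```python
def partial_lottery_probs(scores, k, lo, hi):
    """Selection probabilities under a hypothetical threshold-based partial lottery.

    Scores >= hi are accepted outright, scores in [lo, hi) share the remaining
    slots by uniform lottery, and scores below lo are rejected.
    Assumes at most k candidates score >= hi.
    """
    sure = [i for i, s in enumerate(scores) if s >= hi]
    pool = [i for i, s in enumerate(scores) if lo <= s < hi]
    slots = max(k - len(sure), 0)
    p_pool = min(1.0, slots / len(pool)) if pool else 0.0
    probs = [0.0] * len(scores)
    for i in sure:
        probs[i] = 1.0
    for i in pool:
        probs[i] = p_pool
    return probs

# Hypothetical scores; candidate 1 sits just below the accept threshold.
scores = [9.0, 7.99, 7.5, 7.2, 7.0, 4.0]
before = partial_lottery_probs(scores, k=2, lo=6.0, hi=8.0)

# Nudge candidate 1 by 0.01 so it crosses the threshold.
bumped = list(scores)
bumped[1] = 8.0
after = partial_lottery_probs(bumped, k=2, lo=6.0, hi=8.0)

print(before)
print(after)
```

In this toy design a 0.01 change in one score moves that candidate from the lottery pool to certain selection and removes the last open slot for everyone else in the pool.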

📄 OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19660v1
👥 Authors: Zunhai Su, Rui Yang, Chao Zhang, Yaxiu Liu, Yifan Zhang, Wei Wu (possible past Tencent (China) affiliation), Jing Xiong, Dayou Du, Xialie Zhuang, Yulei Qian (possible past Baidu (China) affiliation), Yuchen Xie, Yik-Chung Wu, Hongxia Yang, Ngai Wong
Abstract

The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic channel-wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspect...
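
Why per-channel scaling suits Key tensors with channel-wise outliers can be sketched with a tiny quantize-dequantize round trip. This is not the OScaR method (the abstract is truncated before describing it); the 4-bit symmetric absmax scheme, the helper names, and the `keys` values are assumptions made for illustration.

```python
def quantize(values, bits):
    """Symmetric absmax quantize-dequantize of a list of floats."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / levels or 1.0
    return [round(v / scale) * scale for v in values]

def mean_rel_error(original, restored):
    """Mean relative error over all entries (assumes no zero entries)."""
    pairs = [(o, r) for ro, rr in zip(original, restored) for o, r in zip(ro, rr)]
    return sum(abs(o - r) / abs(o) for o, r in pairs) / len(pairs)

# Hypothetical Key tensor: rows are tokens, columns are channels.
# The last channel carries outlier magnitudes.
keys = [
    [0.5, -0.3, 40.0],
    [0.2, 0.4, -38.0],
    [-0.6, 0.1, 42.0],
    [0.3, -0.5, -41.0],
]

# Per-token: one scale per row, so the outlier sets the scale for every channel.
per_token = [quantize(row, 4) for row in keys]

# Per-channel: one scale per column, so the outlier channel has its own scale.
columns = [quantize(list(col), 4) for col in zip(*keys)]
per_channel = [list(row) for row in zip(*columns)]

per_token_err = mean_rel_error(keys, per_token)
per_channel_err = mean_rel_error(keys, per_channel)
print(per_token_err, per_channel_err)
```

With a shared per-token scale, the small-magnitude channels round to zero; giving each channel its own scale keeps them representable at the same bit width.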

📄 CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19436v1
👥 Authors: Ahmed Heakl, Abdelrahman M. Shaker, Youssef Mohamed, Rania Elbadry, Omar Fetouh, Fahad Shahbaz Khan (possible past Inception Institute of Artificial Intelligence affiliation), Salman Khan (possible past Inception Institute of Artificial Intelligence affiliation)
Abstract

When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that...

📄 CompoSE: Compositional Synthesis and Editing of 3D Shapes via Part-Aware Control
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19350v1
👥 Authors: Habib Slim, Shariq Farooq Bhat, Mohamed Elhoseiny (possible past Meta (United States) affiliation), Yifan Wang (possible past Stanford University affiliation), Mike Roberts
Abstract

Creating and editing high-quality 3D content remains a central challenge in computer graphics. We address this challenge by introducing CompoSE, a novel method for Compositional Synthesis and Editing of 3D shapes via part-aware control. Our method takes as input a set of coarse geometric primitives (e.g., bounding boxes) that represent distinct object parts arranged in a particular spatial configuration, and synthesizes as output part-separated 3D objects that support localized granular (i.e., c...

📄 Domain-Adaptive Communication-Rate Optimization for Sim-to-Real Humanoid-Robot Wireless XR Teleoperation
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19293v1
👥 Authors: Caolu Xu, Zhiyong Chen, Meixia Tao, Li Song, Feng Yang (possible past Google (United States) affiliation), Wenjun Zhang (possible past Shanghai Jiao Tong University affiliation)
Abstract

Wireless extended reality (XR) teleoperation provides embodied interaction capability for collecting humanoid robot demonstrations, but the large-scale adoption is restricted by the overhead of high-frequency motion transmission. This paper develops a system framework that integrates sampling, transmission, interpolation, and reconstruction and formulates a communication-rate optimization that aims to minimize the communication energy while maintaining the reconstruction accuracy of robot motion...

📄 OpenCompass: A Universal Evaluation Platform for Large Language Models
🗓️ Published: 5/19/2026
🔗 http://arxiv.org/abs/2605.19276v1
👥 Authors: Maosong Cao, Kai Chen (possible past Shanghai Jiao Tong University affiliation), Haodong Duan, Yixiao Fang, Tong Gao, Ge Jiaye, Mo Li, Hongwei Liu, Junnan Liu, Yuan Liu (possible past Google (United States) affiliation), Chengqi Lyu, Han Lyu, Ningsheng Ma, Zerun Ma, Yu Sun (possible past Baidu (China) affiliation), Zhiyong Wu (possible past Tsinghua University affiliation), Linchen Xiao, Jun Xu (possible past Google (United States) affiliation), Haochen Ye, Zhaohui Yu, Yike Yuan, Songyang Zhang, Yufeng Zhao, Fengzhe Zhou, Peiheng Zhou, Dongsheng Zhu, Lin Zhu, Jingming Zhuo
Abstract

In recent years, the field of artificial intelligence has undergone a paradigm shift from task-specific small-scale models to general-purpose large language models (LLMs). With the rapid iteration of LLMs, objective, quantitative, and comprehensive evaluation of their capabilities has become a critical link in advancing technological development. Currently, the mainstream static benchmark dataset-based evaluation methods face challenges such as the diversity of task types, inconsistent evaluatio...

*Notable papers are those with at least two authors from a "big" AI/ML lab.