πŸ“„ Notable* Recent AI/ML arXiv Papers

πŸ“„ Fast Byte Latent Transformer
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.08044v1
πŸ‘₯ Authors: Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz, Gargi Ghosh, Luke Zettlemoyer (possible past University of Washington affiliation), Christopher Potts (possible past Tencent (China) affiliation), Xiaochuang Han, Srinivasan Iyer
Abstract

Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction ...
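The training recipe above combines two objectives. A minimal sketch of such a joint loss, assuming a plain weighted sum in which a second cross-entropy term stands in for the block-wise diffusion objective (the function names and the weight `lam` are illustrative, not from the paper):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    # mean negative log-likelihood of the target ids
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(targets)), targets]).mean()

def blt_d_loss(next_logits, next_targets, block_logits, block_targets, lam=0.5):
    """Hypothetical joint objective: standard next-byte prediction plus a
    weighted auxiliary block-wise term (a stand-in for the diffusion loss)."""
    return (cross_entropy(next_logits, next_targets)
            + lam * cross_entropy(block_logits, block_targets))
```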

πŸ“„ Sycophantic AI makes human interaction feel more effortful and less satisfying over time
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07912v1
πŸ‘₯ Authors: Lujain Ibrahim, Franziska Sofia Hafner, Myra Cheng (possible past DeepMind (United Kingdom) affiliation), Cinoo Lee, Rebecca Anselmetti, Robb Willer, Luc Rocher, Diyi Yang (possible past Stanford University affiliation)
Abstract

Millions of people now turn to artificial intelligence (AI) systems for personal advice, guidance, and support. Such systems can be sycophantic, frequently affirming users' views and beliefs. Across five preregistered studies (N = 3,075 participants, 12,766 human-AI conversations), including a three-week study with a census-representative U.S. sample, we provide longitudinal experimental evidence that sycophantic AI shifts how users approach their closest relationships. We show that sycophantic ...

πŸ“„ CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07905v1
πŸ‘₯ Authors: Hexuan Deng, Xiaopeng Ke, Yichen Li, Ruina Hu, Dehao Huang, Derek F. Wong (possible past Tencent (China) affiliation), Yue Wang, Xuebo Liu, Min Zhang (possible past Tsinghua University affiliation)
Abstract

Despite the rapid development of AI reviewers, evaluating such systems remains challenging: metrics favor overlap with human reviews over correctness. However, since human reviews often cover only a subset of salient issues and sometimes contain mistakes, they are unreliable as gold references. To address this, we build category-specific benchmark subsets and skip evaluation when the corresponding human reviews are missing to strengthen Completeness. We also leverage reviewer--author--meta-revie...

πŸ“„ Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07872v1
πŸ‘₯ Authors: Yuancheng Wei, Linli Yao, Lei Li (possible past Carnegie Mellon University affiliation), Haojie Zhang, Hao Zhou, Fandong Meng (possible past Tencent (China) affiliation), Xu Sun (possible past Peking University affiliation)
Abstract

Multimodal reward models have advanced substantially in text and image domains, yet progress in video understanding reward modeling remains severely limited by the lack of robust evaluation benchmarks and high-quality preference data. To address this, we propose a unified framework spanning benchmark design, data construction, and reward model training. We introduce Video Understanding Reward Bench (VURB), a benchmark featuring 2,100 preference pairs with long chain-of-thought reasoning traces (...
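Reward models of this kind are commonly trained on preference pairs with a Bradley-Terry objective; the sketch below shows that generic loss, not the paper's specific training recipe:

```python
import numpy as np

def bradley_terry_loss(r_chosen, r_rejected):
    """Generic preference loss: -log sigmoid(r_chosen - r_rejected),
    computed stably via logaddexp. The reward model is trained to score
    the preferred response above the rejected one."""
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    return np.logaddexp(0.0, -margin).mean()
```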

πŸ“„ LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07640v1
πŸ‘₯ Authors: Jun Wang (possible past Tencent (China) affiliation), Fengpeng Li, Hang Dong, Tianjin Huang, Wei Han (possible past Google (United States) affiliation)
Abstract

Remote sensing lithology interpretation is fundamental to geological surveys, mineral exploration, and regional geological mapping. Unlike general land-cover recognition, lithology interpretation is a knowledge-intensive task that requires experts to infer rock types from various features, e.g., subtle visual, spectral, textural, geomorphological, and contextual cues, making reliable automated interpretation highly challenging. Geological knowledge-guided large multimodal models offer new opport...

πŸ“„ Post-training makes large language models less human-like
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07632v1
πŸ‘₯ Authors: Marcel Binz, Elif Akata, Abdullah Almaatouq, Mohammed Alsobay, Oleksii Ariasov, Franziska BrΓ€ndle, David Broska, Jason W. Burton, Nuno Busch, Frederick Callaway, Vanessa Cheung, Brian Christian, Julian Coda-Forno, Can Demircan, Vittoria Dentella, Maria K. Eckstein, NoΓ©mi Γ‰ltetΕ‘, Michael Franke, Thomas L. Griffiths (possible past University of California, Berkeley affiliation), Fritz GΓΌnther, Susanne Haridi, Sebastian Hellmann, Stefan Herytash, Linus Hof, Eleanor Holton, Isabelle Hoxha, Zak Hussain, Akshay Jagadish, Elif Kara, Valentin Kriegmair, Evelina Leivada, Li Ji-An, Tobias Ludwig, Maximilian Maier, Marcelo G. Mattar, Marvin Mathony, Alireza Modirshanechi, Robin Na, Mariia Nadverniuk, Antonios Nasioulas, Surabhi S. Nath, Helen Niemeyer, Kate Nussenbaum, Sebastian Olschewski, Thorsten Pachur, Stefano Palminteri, Aliona Petrenco, Camille V. Phaneuf-Hadd, Angelo Pirrone, Manuel Rausch, Laura Raveling, Shashank Reddy, Milena Rmus, Evan M. Russek, Tankred Saanum, Kai Sandbrink, Louis Schiekiera, Johannes A. Schubert, Luca M. Schulze Buschoff, Nishad Singhi, Leah H. Somerville, Mikhail S. Spektor, Xin Sui, Christopher Summerfield (possible past University of Oxford affiliation), Mirko Thalmann, Anna I. Thoma, Taisiia Tikhomirova, Vuong Truong, Polina Tsvilodub, Konstantinos Voudouris, Robert C. Wilson, Kristin Witte, Shuchen Wu, Dirk U. Wulff, Hua-Dong Xiong, Songlin Xu, Lance Ying, Xinyu Zhang (possible past Baidu (China) affiliation), Jian-Qiao Zhu, Eric Schulz
Abstract

Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture human behavior and why. To address this, we introduce Psych-201, a novel dataset that enables us to measure behavioral alignment at scale. We find that post-training -- the stage that turns base models into useful assistants -- consistently reduces alignment with human behavior across model families, sizes, and objectives. Moreover, this misalignment widens in...

πŸ“„ Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07630v1
πŸ‘₯ Authors: Zhengyang Tang, Yi Zhang (possible past Google (United States) affiliation), Chenxin Li, Xin Lai, Pengyuan Lyu (possible past Tencent (China) affiliation), Yiduo Guo, Weinong Wang, Junyi Li, Yang Ding, Huawen Shen, Zhengyao Fang, Xingran Zhou, Liang Wu, Fei Tang, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang (possible past Tencent (China) affiliation), Chengquan Zhang (possible past Baidu (China) affiliation), Han Hu
Abstract

When a phone-use agent avoids harm, does that show safety, or simply inability to act? Existing evaluations often cannot tell. A harmful outcome may be avoided because the agent recognized the risk and chose the safe action, or because it failed to understand the screen or execute any relevant action at all. These cases have different causes and call for different fixes, yet current benchmarks often merge them under task success, refusal, or final harmful outcome. We address this problem with Ph...

πŸ“„ SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07482v1
πŸ‘₯ Authors: Zizhao Hu, Ameya Godbole, Johnny Tian-Zheng Wei (possible past Tencent (China) affiliation), Mohammad Rostami, Jesse Thomason (possible past University of Washington affiliation), Robin Jia (possible past Stanford University affiliation)
Abstract

Machine unlearning for large language models (LLMs) aims to selectively remove memorized content such as private data, copyrighted text, or hazardous knowledge, without costly full retraining. Most existing methods require a retain set of curated examples to prevent catastrophic degradation of general model utility, creating an extra data dependency that complicates deployment. We propose SHRED (Self-distillation via High-surprisal-only Retain-set-free Entropy Demotion), a retain-set-free unlear...
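One way to picture "self-distillation with logit demotion": use the model's own output distribution as the distillation target, but push down the logits of forget-set tokens first. A minimal sketch, assuming a simple additive penalty `delta` (an illustrative hyperparameter; the paper's actual mechanism may differ):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def demoted_target(own_logits, forget_ids, delta=10.0):
    """Self-distillation target: the model's own logits with the logits of
    forget-set tokens pushed down by `delta`, then renormalized. No retain
    set is needed because the rest of the distribution is left untouched."""
    target = own_logits.copy()
    target[..., forget_ids] -= delta
    return softmax(target)
```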

πŸ“„ Rubric-based On-policy Distillation
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07396v1
πŸ‘₯ Authors: Junfeng Fang, Zhepei Hong, Mao Zheng, Mingyang Song, Gengsheng Li, Houcheng Jiang, Dan Zhang (possible past Google (United States) affiliation), Haiyun Guo, Xiang Wang (possible past Tencent (China) affiliation), Tat-Seng Chua
Abstract

On-policy distillation (OPD) is a powerful paradigm for model alignment, yet its reliance on teacher logits restricts its application to white-box scenarios. We contend that structured semantic rubrics can serve as a scalable alternative to teacher logits, enabling OPD using only teacher-generated responses. To prove it, we introduce ROPD, a simple yet foundational framework for rubric-based OPD. Specifically, ROPD induces prompt-specific rubrics from teacher-student contrasts, and then utilizes...

πŸ“„ RELO: Reinforcement Learning to Localize for Visual Object Tracking
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07379v1
πŸ‘₯ Authors: Xin Chen (possible past Tencent (China) affiliation), Chuanyu Sun, Jiao Xu, Houwen Peng, Dong Wang (possible past Tsinghua University affiliation), Huchuan Lu, Kede Ma
Abstract

Conventional visual object trackers localize targets using handcrafted spatial priors, often in the form of heatmaps. Such priors provide only surrogate supervision and are poorly aligned with tracking optimization and evaluation metrics, such as intersection over union (IoU) and area under the success curve (AUC). Here, we introduce RELO, a REinforcement-learning-to-LOcalize method for visual object tracking that formulates target localization as a Markov decision process. Specifically, RELO re...
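Formulating localization as a Markov decision process lets the reward be the evaluation metric itself rather than a heatmap surrogate. A minimal sketch of an IoU-based reward (a simplification for illustration; the paper's reward design may be richer):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def localization_reward(pred_box, gt_box):
    # Reward the policy directly with the metric it is evaluated on,
    # aligning training with IoU/AUC-style evaluation.
    return iou(pred_box, gt_box)
```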

πŸ“„ MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07363v1
πŸ‘₯ Authors: Ruijie Zhou, Fanxu Meng, Yufei Xu, Tongxuan Liu, Guangming Lu, Muhan Zhang (possible past Meta (United States) affiliation), Wenjie Pei (possible past Tencent (China) affiliation)
Abstract

DeepSeek Sparse Attention (DSA) sets the state of the art for fine-grained inference-time sparse attention by introducing a learned token-wise indexer that scores every prefix token and selects the most relevant ones for the main attention. To remain expressive, the indexer uses many query heads (for example, 64 on DeepSeek-V3.2) that share the same selected token set; this multi-head design is precisely what makes the indexer the dominant cost on long contexts. We propose MISA (Mixture of Index...
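The indexer-then-attend pattern described above can be sketched in a single-head form: a cheap indexer scores every prefix token, and the main attention then runs only over the top-k selected tokens (all names and shapes below are illustrative, not DSA's or MISA's actual architecture):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def indexer_sparse_attention(q, K, V, idx_q, idx_K, top_k):
    """Single-head sketch: the indexer (idx_q, idx_K) scores each of the T
    prefix tokens, and full attention is computed only over the top_k
    highest-scoring ones."""
    scores = idx_K @ idx_q                         # (T,) indexer score per token
    keep = np.sort(np.argsort(scores)[-top_k:])    # selected token indices
    att = softmax(q @ K[keep].T / np.sqrt(q.shape[-1]))
    return att @ V[keep]
```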

πŸ“„ BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07306v1
πŸ‘₯ Authors: Zhaohui Du, Zhe Wang (possible past DeepMind (United Kingdom) affiliation), Hongmei Fei, Xiwen Cao, Ting Xiao, Qi Wang (possible past Tsinghua University affiliation), Huanbo Jin, Jiaming Gu, Quan Lu, Zhe Liu
Abstract

Biological laboratory automation can reduce repetitive manual work and improve reproducibility, but reliable embodied execution in wet-lab environments remains challenging. Protocols are often unstructured, labware is frequently transparent or reflective, and multi-step procedures require state-aware execution beyond one-shot instruction following. Existing robotic systems often rely on costly hardware, fixed workflows, dedicated instruments, or robotics-oriented interfaces. Here, we introduce B...

πŸ“„ Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07276v1
πŸ‘₯ Authors: Jia Li (possible past Google (United States) affiliation), Yuxin Su, Ting Peng, Hailiang Huang, Yuetang Deng (possible past Tencent (China) affiliation), Michael R. Lyu
Abstract

Code-agent RL often receives weak feedback: rollout-time signals are reliable and executable, but capture only necessary or surface conditions for task success rather than the target semantic predicate. Using agentic compile-fix as the setting, we study signal reshaping for standard GRPO under such feedback. Our central claim is that GRPO's within-group comparison is meaningful only after three kinds of signals are reshaped: outcome rewards recover semantic ranking, process signals localize intr...
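For context, GRPO's within-group comparison normalizes each rollout's reward against its group; any reshaped outcome or process signal would be folded into `rewards` before this step. A minimal sketch of the standard group-relative advantage:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO step: z-score each rollout's reward against its group.
    If all rollouts receive the same (weak) signal, every advantage is ~0,
    which is why the reshaping the abstract describes matters."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```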

πŸ“„ From Clouds to Hallucinations: Atmospheric Retrieval Hijacking in Remote Sensing Vision-Language RAG
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07273v1
πŸ‘₯ Authors: Jiaju Han, Chao Li (possible past Baidu (China) affiliation), Chengyin Hu, Qike Zhang, Xuemeng Sun, Xin Wang (possible past University of Edinburgh affiliation), Fengyu Zhang, Xiang Chen (possible past Tencent (China) affiliation), Yiwei Wei, Jiahuan Long, Jiujiang Guo
Abstract

Multimodal RAG systems increasingly rely on vision-language retrievers to ground visual queries in external textual evidence. Existing adversarial studies on RAG mainly manipulate the retrieval corpus or memory, while attacks on vision-language and remote sensing models typically target end-task predictions. Input-space threats to the evidence retrieval stage of remote sensing multimodal RAG remain underexplored. To address this gap, we introduce CloudWeb, an atmospheric retrieval hijacking atta...

πŸ“„ Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07250v1
πŸ‘₯ Authors: Zhixue Song, Boyan Han, Yiwei Wang (possible past Google (United States) affiliation), Chi Zhang (possible past Peking University affiliation)
Abstract

Recent advancements in visual context compression enable MLLMs to process ultra-long contexts efficiently by rendering text into images. However, we identify a critical vulnerability inherent to this paradigm: lowering image resolution inadvertently catalyzes jailbreaking. Our experiments reveal that the safety defenses of SOTA models deteriorate sharply as resolution degrades, surprisingly persisting even when text remains legible. We attribute this to "Cognitive Overload", hypothesizing that...

πŸ“„ EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07247v1
πŸ‘₯ Authors: Yi Liu (possible past Google (United States) affiliation), Tingfeng Hui, Wei Zhang (possible past Tsinghua University affiliation), Li Sun, Ningxin Su, Jian Wang (possible past Baidu (China) affiliation), Sen Su
Abstract

Scalable AI agent training relies on interactive environments that faithfully simulate the consequences of agent actions. Manually crafted environments are expensive to build, brittle to extend, and fundamentally limited in diversity. A promising direction is to replace manually crafted environments with LLM-simulated counterparts. However, this paradigm hinges on an unexamined core assumption: LLMs can accurately simulate environmental feedback. In practice, LLM-simulated environments suffer f...

πŸ“„ Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07111v1
πŸ‘₯ Authors: Haozhan Tang, Xiuqi Zhu, Xinyin Zhang, Boxun Li (possible past Tsinghua University affiliation), Virginia Smith (possible past Carnegie Mellon University affiliation), Kevin Kuo
Abstract

Recent literature on fine-tuning Large Language Models highlights a fundamental debate. While Full Fine-Tuning (FFT) provides the representational plasticity required for high-entropy knowledge injection, Low-Rank Adaptation (LoRA) can match or surpass FFT performance because many tasks only require updates in a low-rank space and benefit from LoRA's additional regularization. Through empirical evaluation across diverse tasks (SQL, Medical QA, and Counterfactual Knowledge) and varying language m...
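The two sides of the debate differ in what gets updated: FFT trains the full weight W, while LoRA trains only a rank-r update B @ A added to a frozen W. A standard sketch (the alpha/r scaling and B = 0 initialization follow the original LoRA formulation; shapes and names are illustrative):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """LoRA: the frozen weight W (out x in) is adapted by a low-rank update
    B @ A (B: out x r, A: r x in), scaled by alpha / r."""
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T
```

With B initialized to zero the adapted model starts out identical to the base model, and only r*(d_in + d_out) parameters are trained instead of d_in*d_out.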

πŸ“„ TeamBench: Evaluating Agent Coordination under Enforced Role Separation
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07073v1
πŸ‘₯ Authors: Yubin Kim, Chanwoo Park, Taehan Kim, Eugene Park, Samuel Schmidgall, Salman Rahman, Chunjong Park, Cynthia Breazeal (possible past Massachusetts Institute of Technology affiliation), Xin Liu, Hamid Palangi, Hae Won Park, Daniel McDuff (possible past Google (United States) affiliation)
Abstract

Agent systems often decompose a task across multiple roles, but these roles are typically specified by prompts rather than enforced by access controls. Without enforcement, a team pass rate can mask whether agents actually coordinated or whether one role effectively did another role's work. We present TeamBench, a benchmark with 851 task templates and 931 seeded instances for evaluating agent coordination under operating system-enforced role separation. TeamBench separates specification access, ...

πŸ“„ Normalizing Trajectory Models
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.08078v1
πŸ‘₯ Authors: Jiatao Gu (possible past Meta (United States) affiliation), Tianrong Chen, Ying Shen, David Berthelot (possible past Google (United States) affiliation), Shuangfei Zhai, Josh Susskind
Abstract

Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice the likelihood framework in the process. We introduce Normalizing Trajectory Models (NTM), which models each reverse step as an expressive conditional normalizing flow with exact likelihood training. ...
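The exact-likelihood claim rests on the change-of-variables formula for normalizing flows. A one-step affine example of that formula (illustrative only, not the paper's conditional flow architecture):

```python
import numpy as np

def affine_flow_logprob(x, scale, shift):
    """Exact log-density via change of variables for the affine flow
    x = scale * z + shift with a standard normal base distribution:
    log p(x) = log N(z; 0, I) - sum(log |scale|)."""
    z = (x - shift) / scale
    log_base = -0.5 * (z ** 2 + np.log(2.0 * np.pi)).sum()
    return log_base - np.log(np.abs(scale)).sum()
```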

πŸ“„ STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.08029v1
πŸ‘₯ Authors: Ying Shen, Tianrong Chen, Yuan Gao (possible past Tencent (China) affiliation), Yizhe Zhang, Yuyang Wang, Miguel Ángel Bautista, Shuangfei Zhai, Joshua M. Susskind, Jiatao Gu (possible past Meta (United States) affiliation)
Abstract

Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are autoregressive Transformers--sharing the same causal mask,...

πŸ“„ Toward Better Geometric Representations for Molecule Generative Models
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07693v1
πŸ‘₯ Authors: Shaoheng Yan, Zian Li, Cai Zhou, Qiaojing Huang, Kai Liu (possible past Baidu (China) affiliation), Muhan Zhang (possible past Meta (United States) affiliation)
Abstract

Geometric representation-conditioned molecule generation provides an effective paradigm that decouples molecule representation modeling from structure generation. By decoupling molecule generation into two stages, first generating a meaningful molecule representation and then generating a 3D molecule conditioned on this representation, the efficiency and quality of the generation process can be significantly enhanced. However, its effectiveness is fundamentally limited by the quality of the repre...

πŸ“„ CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07335v1
πŸ‘₯ Authors: Mengran Li, Bo Li (possible past Tencent (China) affiliation), Jiaying Wang, Wenbin Xing, Yixuan Dong, Chengyang Zhang, Hongliang Zhang, Yuzhong Peng, Jinlin Wu, Bob Zhang, Bingo Wing-Kuen Ling, Fuji Yang, Zhen Lei (possible past Beijing Academy of Artificial Intelligence affiliation), Jiebo Luo, Zelin Zang
Abstract

Virtual Cell Modeling (VCM) requires models that not only predict perturbation responses, but also support targeted revision when predictions fail. Current LLM-assisted modeling workflows face a refinement-routing problem: prediction discrepancies are observed through executable implementations, but the relevant revision may involve the modeling assumption, representation design, implementation, or task constraint. Without structured feedback propagation across these levels, iterative refinement...

πŸ“„ PerCaM-Health: Personalized Dynamic Causal Graphs for Healthcare Reasoning
πŸ—“οΈ Published: 5/8/2026
πŸ”— http://arxiv.org/abs/2605.07267v1
πŸ‘₯ Authors: Elahe Khatibi, Ziyu Wang (possible past University of Oxford affiliation), Saba A. Farahani, Di Huang (possible past Google (United States) affiliation), Hung Cao, Ramesh Jain, Amir M. Rahmani
Abstract

Personalized healthcare decisions require reasoning about how physiological and behavioral variables influence an individual patient over time. Existing temporal causal discovery methods are poorly matched to this setting: cohort-level models provide stable but non-personalized structures, while per-patient discovery is unreliable because individual trajectories are short, noisy, irregular, and non-stationary. This creates a fundamental gap between population-level causal modeling and the patien...

*Notable papers are those with at least two authors from a "big" AI/ML lab.