📄 Notable* Recent AI/ML arXiv Papers

📄 InSight: Self-Guided Skill Acquisition via Steerable VLAs
🗓️ Published: 6/23/2026
🔗 http://arxiv.org/abs/2606.24884v1
👥 Authors: Maggie Wang, Lars Osterberg, Stephen Tian (possible past University of California, Berkeley affiliation), Ola Shorinwa, Jiajun Wu (possible past Massachusetts Institute of Technology affiliation), Mac Schwager (possible past Stanford University affiliation)
Abstract

Vision-language-action (VLA) models can learn manipulation skills from demonstrations, but their capabilities are bounded by the skills in the training data. We present InSight, a framework that unlocks autonomous skill acquisition by rendering VLAs steerable at the primitive-action level (e.g., "move gripper to the bowl", "lift upward", "pour the bottle"). InSight consists of two primary stages: (1) an automated segmentation pipeline that partitions demonstrations into labeled primitives via VL...

📄 Solving Inverse Problems of Chaotic Systems with Bidirectional Conditional Flow Matching
🗓️ Published: 6/23/2026
🔗 http://arxiv.org/abs/2606.24824v1
👥 Authors: Peiyan Hu, Jian Zhang (possible past Tencent (China) affiliation), Jiashu Pan, Ruiqi Feng, Tao Zhang (possible past Nvidia (United States) affiliation), Zhi-Ming Ma, Yuan-Sen Ting, Gongjie Li, Tailin Wu
Abstract

Modeling chaotic systems is crucial yet challenging. Inverse problems in chaotic dynamics, namely inferring initial conditions from final states, remain largely unsolved because of ill-posedness, non-uniqueness, instability, and potentially chaotic time-reverse dynamics. We address this open problem with Bidirectional Conditional Flow Matching (Bi-CFM), which learns bidirectional mappings between distributions of initial and final states to capture the stochasticity of chaotic evolution and miti...

📄 CineCap: Structured Reasoning with Spatio-Temporal Anchors for Cinematographic Video Captioning
🗓️ Published: 6/23/2026
🔗 http://arxiv.org/abs/2606.24636v1
👥 Authors: Xinyu Mao, Yuhui Zeng, Xiaokun Liu, Wenyu Qin, Meng Wang (possible past Google (United States) affiliation), Xin Tao (possible past Tencent (China) affiliation), Pengfei Wan, Xiaohan Xing, Max Meng
Abstract

Cinematographic captioning aims to describe how a video is filmed using professional film-language concepts such as camera movement, shot size, depth of field, composition, and shooting angle. This capability is important for fine-grained video understanding and controllable movie-quality video generation, yet remains underexplored in existing multimodal large language models. Unlike question-answering-based evaluation of cinematic understanding, cinematographic captioning requires a unified ope...

📄 video-SALMONN-R$^3$: Learning to ReWatch, ReAsk, and ReAnswer for Efficient Video Understanding
🗓️ Published: 6/23/2026
🔗 http://arxiv.org/abs/2606.24477v1
👥 Authors: Yixuan Li (possible past Meta (United States) affiliation), Guangzhi Sun (possible past University of Cambridge affiliation), Yudong Yang, Wei Li (possible past Peking University affiliation), Zejun Ma, Chao Zhang
Abstract

Video large language models (LLMs) are often constrained by computation and memory budgets, leading them to use reduced frame rates and spatial resolutions, which may cause them to miss critical information for question answering (QA). A practical and efficient solution is a two-stage paradigm: first perform coarse video understanding to localize relevant segments, and then re-watch these segments at higher temporal or spatial fidelity. In this paper, we present video-SALMONN-R$^3$, the first en...

📄 LemonHarness Technical Report
🗓️ Published: 6/23/2026
🔗 http://arxiv.org/abs/2606.24311v1
👥 Authors: Kailong Ren, Fubo Sun, Jiachen Liu (possible past Baidu (China) affiliation), Liu Yang (possible past Google (United States) affiliation), Zimo Yin, Jiaying Li, Congli Yin, Ming He, Yu Huo, Jiawei Liu, Zeping Chen, Yubin Huangfu, Ronghua Li, Yixuan Wu, Xing Su, Yanzhi Xu, Likang Wu, Hongke Zhao, Lei Zhang, Xiaohui Geng, Jianping Fan
Abstract

As large language model (LLM) agents are applied to longer tasks, they increasingly modify workspace state across multiple rounds of iteration. However, agents typically observe only tool outputs and log fragments, while the actual state changes occur in the file system. Without explicit workspace boundaries, state-changing operations such as file writes and temporary artifact generation may scatter changes across paths. Over time, these weakly constrained changes accumulate, making states such ...

📄 Pigeonholing: Bad prompts hurt models to collapse and make mistakes
🗓️ Published: 6/23/2026
🔗 http://arxiv.org/abs/2606.24267v1
👥 Authors: Hyunji Nam, Keertana Chidambaram, Dorottya Demszky (possible past Stanford University affiliation), Natasha Jaques (possible past University of California, Berkeley affiliation)
Abstract

While in-context learning is generally shown to be effective in Large Language Models (LLMs), bad contexts can cause performance degradation and mode collapse, a phenomenon we call "pigeonholing." Unintentionally bad contexts can happen without malicious jailbreaking intents: For example, a user asks the model to justify an incorrect math theorem or fails to correct the model's buggy code. Specifically, we investigate "pigeonholing" in two scenarios: (1) when the user suggests a solution, a...

📄 FlowR2A: Learning Reward-to-Action Distribution for Multimodal Driving Planning
🗓️ Published: 6/23/2026
🔗 http://arxiv.org/abs/2606.24231v1
👥 Authors: Xirui Li, Zhe Liu, Xiaoqing Ye (possible past Baidu (China) affiliation), Wenhua Han, Yifeng Pan (possible past Baidu (China) affiliation), Junyu Han (possible past Baidu (China) affiliation), Hengshuang Zhao (possible past University of Oxford affiliation)
Abstract

Multimodal driving planning faces a long-standing tension between two paradigms: scoring-based methods benefit from dense reward supervision but are confined to a fixed action vocabulary, while anchor-based methods generate proposals dynamically yet suffer from sparse supervision constrained to a single ground-truth trajectory. In this work, we propose FlowR2A, which resolves this tension by reframing simulation-based rewards from discriminative targets into generative conditions. By learning th...

📄 ReMMD: Realistic Multilingual Multi-Image Agentic Verification for Multimodal Misinformation Detection
🗓️ Published: 6/23/2026
🔗 http://arxiv.org/abs/2606.24112v1
👥 Authors: Chenhao Dang, Dantong Zhu, Jun Yang (possible past Tsinghua University affiliation), Conghui He (possible past Tsinghua University affiliation), Weijia Li (possible past Tsinghua University affiliation)
Abstract

Multimodal misinformation detection is increasingly important because viral posts now combine long multilingual narratives, several images, mixed provenance, and subtle text-image framing errors. Existing benchmarks and methods remain poorly matched to this setting: they usually isolate short captions, single images, binary labels, or one manipulation source, while agentic verification remains costly under realistic evidence search. We present ReMMD, a realistic multilingual multi-image agentic...

📄 Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning
🗓️ Published: 6/23/2026
🔗 http://arxiv.org/abs/2606.24064v1
👥 Authors: Tianyuan Shi, Canbin Huang, Bei Li, Xin Chen (possible past Tencent (China) affiliation), Xiaojun Quan, Jingang Wang, Qifan Wang (possible past Google (United States) affiliation)
Abstract

Distilling reasoning capabilities from strong to weak language models typically involves imitating specific solution trajectories, effectively transferring what to answer rather than how to reason. This trajectory-level imitation encourages memorization of instance-specific steps rather than acquisition of transferable problem-solving skills, limiting generalization to novel problems. We propose Strategy-Guided Policy Optimization (SGPO), which replaces instance-level trajectory imitation with r...

📄 Promise and challenges of heart chamber segmentation from non-contrast CT scans using contrastive unpaired image translation: a feasibility study
🗓️ Published: 6/22/2026
🔗 http://arxiv.org/abs/2606.23879v1
👥 Authors: Jing Wang (possible past Google (United States) affiliation), Tong Yu (possible past Carnegie Mellon University affiliation), Hao-En Lu, Zixue Zeng, Joseph K. Leader, Xin Meng, Jianbing Zhu, Jiantao Pu
Abstract

Purpose: To evaluate the feasibility and challenges of heart chamber segmentation from non-contrast CT scans using contrastive unpaired image translation and deep learning-based segmentation. Approach: We developed ChameleonNet, a framework utilizing the Contrastive Unpaired Translation (CUT) network with decoupled contrastive learning (DCL) loss to synthesize non-contrast CT from contrast CT scans. Using annotations of four heart chambers (left atrium (LA), left ventricle (LV), right atrium (RA...

📄 Deciphering Fingerprints of 3D Molecular Surfaces for Accurate Epitope Prediction
🗓️ Published: 6/22/2026
🔗 http://arxiv.org/abs/2606.23830v1
👥 Authors: Fang Wu, Weihao Xuan, Jure Leskovec (possible past Stanford University affiliation), Yejin Choi (possible past Allen Institute for Artificial Intelligence affiliation), Li Erran Li
Abstract

Molecular surfaces encode the geometric and physicochemical patterns that determine antibody-antigen recognition, central to epitope prediction. However, existing methods rely on sequences or backbone structures and struggle to capture discontinuous, surface-driven epitopes. This study presents SurfBind, a surface-centric learning framework for epitope prediction that operates directly on molecular surface representations. SurfBind integrates geometric and physicochemical cues through a Transfor...

📄 DiT-Reward: Generative Representations for Text-to-Image Reward Modeling
🗓️ Published: 6/22/2026
🔗 http://arxiv.org/abs/2606.23626v1
👥 Authors: Yuanming Yang, Guoqing Ma, Bo Wang (possible past Tencent (China) affiliation), Yuan Zhang (possible past Google (United States) affiliation), Wei Tang, Chenyi Li, Haoyang Huang, Nan Duan
Abstract

Can representations learned for image generation also support the evaluation of generated images? We study text-to-image reward prediction as a downstream task of generative representation learning. To this end, we introduce DiT-Reward, which converts a pretrained text-to-image Diffusion Transformer into a reward model by processing near-clean image latents and aggregating text-conditioned image representations across transformer layers. Under the same training data mixture as HPSv3, DiT-Reward ...

📄 VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct
🗓️ Published: 6/22/2026
🔗 http://arxiv.org/abs/2606.23543v1
👥 Authors: Haoling Li, Kai Zheng, Jie Wu, Can Xu (possible past Google (United States) affiliation), Qingfeng Sun, Han Hu, Yujiu Yang (possible past Tsinghua University affiliation)
Abstract

Scaling reinforcement learning for visual mathematical reasoning requires more than generating harder questions: as data volume grows, the reward labels themselves must remain reliable. Yet existing data pipelines scale supervision while trusting the labeller, and policy-side methods assume the underlying answers are already correct. We instead treat scaling as a verifiable data-construction problem and decouple two axes before any policy update: prompt difficulty, expanded by route-specific evo...

📄 Autonomous Video Generation with Counterfactual Controllability for Self-Evolving World Models
🗓️ Published: 6/23/2026
🔗 http://arxiv.org/abs/2606.24152v1
👥 Authors: Xin Wang (possible past University of Edinburgh affiliation), Wenxuan Liu, Tongtong Feng, Wenwu Zhu (possible past Tsinghua University affiliation)
Abstract

Existing literature claims that video generation essentially is world modelling. On the one hand, the claim is productive because it pushes generative AI beyond static images and toward temporally extended physical scenes. On the other hand, this claim dangerously relies on the belief that scaling visual prediction alone will automatically yield physical agents. We prefer a more accurate statement: video generation models learn a partial, implicit spatiotemporal world model, but not a fully grou...

📄 FedUP: One-Shot Federated Unlearning via Centroid-Guided Plug-in Filters
🗓️ Published: 6/23/2026
🔗 http://arxiv.org/abs/2606.24113v1
👥 Authors: Feihong Nan, Zhengyi Zhong, Pan Wang, Weidong Bao (possible past National University of Defense Technology affiliation), Xiongtao Zhang, Quan Wen, Ji Wang (possible past Tencent (China) affiliation)
Abstract

Federated unlearning (FU) is critical for complying with legal mandates like the right to be forgotten in decentralized systems, yet current methods face a persistent dilemma between non-target knowledge loss and high request latency. To resolve these issues, we propose FedUP, a one-shot federated unlearning framework utilizing lightweight pluggable filters that act as a "knowledge funnel" to screen out target data while preserving original model performance. By freezing original model parameter...

📄 NeuroSonic: Conditional Flow Matching for EEG-to-Speech Reconstruction
🗓️ Published: 6/23/2026
🔗 http://arxiv.org/abs/2606.24087v1
👥 Authors: Wenhao Gao (possible past Massachusetts Institute of Technology affiliation), Yifan Wang (possible past Stanford University affiliation), Yijia Ma, Carl Yang, Wen Li (possible past ETH Zurich affiliation), Chenyu You
Abstract

Reconstructing continuous speech from scalp electroencephalography (EEG) remains fundamentally challenging. EEG provides a weak, spatially diffuse, and highly variable measurement of distributed cortical activity, whereas speech is organized as a coherent acoustic trajectory with strong harmonic and temporal structure. The resulting mismatch makes waveform regression unstable and causes stochastic multi-step generation to be sensitive to artifact-dependent conditioning and subject variability. W...

📄 Exploring Dualistic Meta-Learning to Enhance Domain Generalization in Open Set Scenarios
🗓️ Published: 6/22/2026
🔗 http://arxiv.org/abs/2606.23758v1
👥 Authors: Xiran Wang, Jian Zhang (possible past Tencent (China) affiliation), Lei Qi, Yang Gao (possible past Tencent (China) affiliation), Yinghuan Shi
Abstract

Domain generalization learns from multiple source domains to generalize to unseen target domains. However, it often neglects the realistic case of label mismatch between source and target. Open set domain generalization is then proposed to recognize unseen classes in unseen domains. A simple approach trains one-vs-all classifiers to separate each class and detect outliers as unknown. Yet, the imbalance between few positive samples and many negative samples skews the decision boundary towards the...

📄 A Novel Approach to Temporal QoS Estimation via Extended Kalman Filter-Incorporated Latent Feature Analysis
🗓️ Published: 6/22/2026
🔗 http://arxiv.org/abs/2606.23010v1
👥 Authors: Ye Yuan (possible past Carnegie Mellon University affiliation), Song Wang, Hongxun Zhou, Ling Wang (possible past University of Oxford affiliation), Xin Luo
Abstract

Predicting temporal Quality of Service (QoS) data is critical for optimizing network services and rationalizing resource allocation in cloud computing and service-oriented systems. Existing mainstream methods have achieved promising predictive performance. However, their purely data-driven manner limits their ability to capture non-stationary temporal patterns, thereby leading to accuracy degradation when temporal QoS data exhibits fluctuations. To tackle this limitation, we propose a novel Exte...

*Notable papers are those with at least two authors from a "big" AI/ML lab.