📄 Notable* Recent AI/ML arXiv Papers

📄 Algorithmic Monocultures in Hiring
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.27371v1
👥 Authors: Rishi Bommasani, Sarah H. Bana, Kathleen A. Creel, Dan Jurafsky (possible past Stanford University affiliation), Percy Liang (possible past Stanford University affiliation)
Abstract

Many employers screen job applicants with algorithms built by the same few algorithm vendors. We hypothesize that algorithmic monoculture leads to the same individuals and members of the same racial groups facing rejection. We acquire and analyze a novel dataset of 3 million applicants submitting 4 million applications where all the applications are screened by algorithms built by the same vendor. We find clear racial disparities in applicant outcomes. Of all applications submitted by Asian and ...

📄 MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.27366v1
👥 Authors: Huawei Lin, Peng Li (possible past Tsinghua University affiliation), Jie Song (possible past ETH Zurich affiliation), Fuxin Jiang, Tieying Zhang
Abstract

Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long-term improvement. We propose MUSE-Autoskill Agent (Memory-Utilizing Skill Evolution), a skill-centric agent framework that lets agents continuously improve their task-solving capability by creating, reusing, and refining skills under a unified lifecycle (creation, memory, mana...

📄 LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.27365v1
👥 Authors: Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao (possible past Nvidia (United States) affiliation), Guilin Liu (possible past Nvidia (United States) affiliation), Jan Kautz (possible past Nvidia (United States) affiliation), Lei Zhang, Zhiding Yu (possible past Nvidia (United States) affiliation)
Abstract

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled structure of box geometry and creates a practical inference bottleneck due to strictly sequential generation. We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (...

📄 Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.27209v1
👥 Authors: Yuxin Chen, Xiaodong Cai, Junfeng Fang, Zhuowen Han, Yu Wang (possible past Tsinghua University affiliation), Yaorui Shi, Yi Zhang (possible past Google (United States) affiliation), Qi Gu, Xunliang Cai, Xiang Wang (possible past Tencent (China) affiliation), An Zhang, Tat-Seng Chua
Abstract

Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and tool use. Despite strong performance on existing benchmarks, such agents often exhibit notable degradation when deployed in real-world settings, where environments are inherently stochastic and imperfect. We argue that this discrepancy arises from a fundamental mismatch between idealized training settings and real-world interaction dynamics,...

📄 VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.27141v1
👥 Authors: Yuxin Chen, Yi Zhang (possible past Google (United States) affiliation), Zhengzhou Cai, Yaorui Shi, Zhiyuan Yao, Chenhang Cui, Jingnan Zheng, Yaqi Huo, Xi Su, Qi Gu, Xunliang Cai, Xiang Wang (possible past Tencent (China) affiliation), An Zhang, Tat-Seng Chua
Abstract

Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and...

📄 QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.27068v1
👥 Authors: Ye Yuan (possible past Carnegie Mellon University affiliation), Rui Song (possible past Peking University affiliation), Weien Li, Zeyu Li (possible past Peking University affiliation), Haochen Liu, Xiangyu Kong, Changjiang Han, Yonghan Yang, Zichen Zhao, Zixuan Dong, Fuyuan Lyu, Bowei He, Haolun Wu, Jikun Kang, Xue Liu
Abstract

Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and remain largely limited to text-only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-sour...

📄 Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.26958v1
👥 Authors: Zixuan Yang, Yiqun Chen, Wei Yang (possible past Tencent (China) affiliation), Erhan Zhang, Zihan Shen, Xiaochi Wei, Yan Gao, Yi Wu (possible past University of California, Berkeley affiliation), Yao Hu, Jiaxin Mao
Abstract

Reinforcement learning in open-ended long-form generation is challenging because reliable reference answers and automatic metrics are often unavailable. Existing rubric-based methods typically rely on pointwise LLM-as-a-judge scoring, but absolute scores are difficult to calibrate across complex responses, may provide weak discrimination among same-query rollouts, and can become saturated during optimization. We propose Tournament-GRPO, a group-wise reward framework that converts rubric-guided L...

📄 ContextGuard: Structured Self-Auditing for Context Learning in Language Models
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.26827v1
👥 Authors: Hongbo Jin, Chi Wang (possible past Microsoft (United States) affiliation), Haoran Tang (possible past University of California, Berkeley affiliation), Zhongjing Du, Xu Jiang, Jingqi Tian, Qiaoman Zhang, Jiayu Ding
Abstract

Recent benchmarks reveal that despite strong reasoning capabilities, large language models (LLMs) still struggle to faithfully apply complex contextual knowledge. These failures are often not wholesale reasoning collapses: in context-rich tasks, models may follow the central reasoning path while missing peripheral, persistent, or format-sensitive requirements....

📄 HTMLCure: Turning Browser Experience into State Guided Repair for Interactive HTML
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.26807v1
👥 Authors: Jiajun Wu (possible past Massachusetts Institute of Technology affiliation), Jian Yang, Tuney Zheng, Wei Zhang (possible past Tsinghua University affiliation), Haowen Wang, Yihang Lou, Xianglong Liu
Abstract

LLMs can now produce full HTML pages, but many of those pages are only superficially correct: they render once, then fail under scroll, hover, click, resize, or gameplay. Evaluation from screenshots can miss these failures, and filtering discards many pages that are still repairable. We introduce HTMLCure, a browser experience framework that evaluates HTML after the system has interacted with it. The evaluator executes the page across viewports and interaction states, records deterministic brows...

📄 What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.26795v1
👥 Authors: Xiang Wang (possible past Tencent (China) affiliation), Wei Wei (possible past Google (United States) affiliation)
Abstract

Chain-of-thought (CoT) prompting reliably improves language-model accuracy, but which properties of a rationale text drive the improvement is poorly understood. Prior work has largely studied generation-time behavior. We instead ask a probe-time question: given a fixed rationale in context, what in that text changes the answer? We identify two complementary sources of the gain. First, even a globally word-shuffled rationale substantially outperforms the no-rationale baseline, indicating a strong...

📄 Ratio-Variance Regularized Policy Optimization
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.26784v1
👥 Authors: Yu Luo, Shuo Han, Yihan Hu, Lei Lv, Huaping Liu (possible past Tsinghua University affiliation), Fuchun Sun (possible past Tsinghua University affiliation), Jianye Hao, Dong Li
Abstract

Standard on-policy reinforcement learning relies on heuristic clipping to enforce trust regions, but this mechanism imposes a severe cost by indiscriminately truncating high-return yet high-divergence updates. We demonstrate that explicitly constraining the policy ratio variance provides a principled local approximation to trust-region constraints, eliminating the need for binary hard clipping. By acting as a distributional "soft brake", this approach preserves critical gradient signals from n...
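
To make the clipping cost mentioned in the abstract concrete, here is a minimal sketch. It uses the standard PPO clipped surrogate, not this paper's method. It shows that once the policy ratio leaves the clip range on a positive-advantage sample, the objective stops changing with the ratio, so that sample contributes no gradient. The `variance_penalized_objective` function and its `beta` weight are hypothetical, added only to illustrate what a soft penalty on ratio variance could look like. The abstract is truncated and does not give the paper's actual regularizer.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for a single sample."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def surrogate_slope(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference slope of the surrogate w.r.t. the ratio."""
    hi = clipped_surrogate(ratio + h, advantage, eps)
    lo = clipped_surrogate(ratio - h, advantage, eps)
    return (hi - lo) / (2 * h)


def ratio_variance(ratios):
    """Population variance of a batch of policy ratios."""
    m = sum(ratios) / len(ratios)
    return sum((r - m) ** 2 for r in ratios) / len(ratios)


def variance_penalized_objective(ratios, advantages, beta=1.0):
    """Hypothetical soft alternative: unclipped surrogate minus a variance penalty.

    This is an illustrative assumption, not the objective from the paper.
    """
    surrogate = sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)
    return surrogate - beta * ratio_variance(ratios)


# Inside the clip range the slope equals the advantage; outside it is zero.
inside = surrogate_slope(1.0, 2.0)
outside = surrogate_slope(1.5, 2.0)
```

With the default `eps=0.2`, a sample at ratio 1.5 with positive advantage sits on the flat part of the clipped objective. A variance-style penalty, by contrast, changes smoothly with the ratios.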

📄 LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.26781v1
👥 Authors: Xiaohan Wang (possible past Baidu (China) affiliation), Mingze Yin, Yilin Zhao, Gang Liu (possible past Tencent (China) affiliation), Dian Li (possible past Tencent (China) affiliation)
Abstract

Advanced Large Multimodal Models (LMMs) have demonstrated impressive performance in K-12 reasoning tasks, exhibiting great promise as intelligent tutors. Realizing this potential requires models to navigate real-world examinations effectively, yet most existing benchmarks fail to capture the complexity of authentic testing environments. Specifically, most datasets are static, prone to data contamination, and are often confined to restricted modalities, disciplines, and evaluation criteria. To ad...

📄 UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.26646v1
👥 Authors: Yiqun Chen, Wei Yang (possible past Tencent (China) affiliation), Erhan Zhang, Shijie Wang, Qi Liu (possible past Tencent (China) affiliation), Zechun Niu, Bin Zhang, Haitao Li, Rui Li (possible past Google (United States) affiliation), Lingyong Yan, Jinyuan Feng, Biqing Qi, Xiaochi Wei, Yan Gao, Yi Wu (possible past University of California, Berkeley affiliation), Yao Hu, Jiaxin Mao
Abstract

LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. We present UnityMAS-O, a ...

📄 JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.26636v1
👥 Authors: Dongyun Zou, Zhuoyang Zhang, Junyu Chen, Wenkun He, Qinhe Peng, Hanrong Ye, Yao Lu (possible past Google (United States) affiliation), Hongxu Yin, Yu Wang (possible past Tsinghua University affiliation), Song Han (possible past Stanford University affiliation), Han Cai
Abstract

We introduce JetViT, a novel family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of state-of-the-art full-attention vision foundation models while achieving substantially higher inference efficiency on high-resolution images. At the core of our approach is Post-Training Attention Search, a post-training acceleration framework that converts pre-trained full-attention ViTs into efficient hybrid-attention variants by identifying and replacing redundant full-attenti...

📄 MedVol-R1: Reward-Driven Evidence Grounding for Volumetric Reasoning Segmentation
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.26621v1
👥 Authors: Zichun Wang, Hairong Shi, Bingzheng Wei (possible past Tencent (China) affiliation), Yan Xu (possible past Peking University affiliation), Zihua Wang
Abstract

Volumetric Reasoning Segmentation (VRS) aims to segment a target region in a 3D medical scan from a free-form clinical query, where the referent is often implicit and requires both medical knowledge and volume-grounded reasoning. Existing methods typically rely on specialized segmentation tokens to connect language with mask decoding, but this coupling collapses the decision process into opaque latent representations, limiting interpretability and generalization to diverse narrative expressions....

📄 Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.26606v1
👥 Authors: Woojeong Kim, Ziyi Yang (possible past Tencent (China) affiliation), Jing Nathan Yan, Jialu Liu (possible past Google (United States) affiliation)
Abstract

Reinforcement learning (RL) is the dominant paradigm for post-training large language models. However, in the online, on-policy setting, rollout generation dominates the computational cost of training. Group-based policy optimization methods compute advantages from multiple rollouts per prompt, yet they indiscriminately allocate budget to prompts with collapsed reward distributions, wasting expensive rollouts on negligible learning signals. We demonstrate that group-based updates are most effect...
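
A small sketch of why a collapsed reward distribution gives a negligible learning signal. It assumes GRPO-style group normalization, where each advantage is the reward minus the group mean, divided by the group standard deviation. The abstract is truncated and does not say which estimator the paper uses, so the `group_advantages` helper is illustrative only.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages for the rollouts of one prompt (assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every rollout earned the same reward: all advantages are zero.
collapsed = group_advantages([1.0, 1.0, 1.0, 1.0])

# Mixed outcomes: advantages are non-zero and carry a learning signal.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])
```

When all rollouts for a prompt succeed, or all fail, every advantage in the group is zero, so the rollouts spent on that prompt do not move the policy.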

📄 CmIVTP: Cross-modal Interaction-based Vessel Trajectory Prediction for Maritime Intelligence
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.26524v1
👥 Authors: Yuxu Lu, Dong Yang (possible past Nvidia (United States) affiliation), Xiaoyu Li (possible past Tencent (China) affiliation), Mengwei Bao, Congcong Zhao
Abstract

Maritime intelligent transportation systems (MITS) are essential for ensuring navigation safety and efficiency in busy waterways. However, accurate vessel trajectory prediction remains challenging due to the limitations of single-source data. Automatic identification system (AIS) data is often sparse or unavailable for small vessels, while closed-circuit television (CCTV) data alone cannot fully capture dynamic vessel behavior. To mitigate these challenges, we propose a cross-modal interaction-b...

📄 InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.26520v1
👥 Authors: Zhiwei Ning, Wenwen Tong, Xiangli Kong, Shengnan Ma, Ziyi Shang, Jingcheng Ni, Tao Hu (possible past Baidu (China) affiliation), Yong Xien Chng, Jixuan Ying, Zehuan Wu, Hanming Deng, Jie Yang (possible past Shanghai Jiao Tong University affiliation), Yuanjie Zheng, Wei Liu (possible past Tsinghua University affiliation), Lewei Lu
Abstract

While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, limiting their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model to enhance the VT-CoT capability v...

📄 The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.26494v1
👥 Authors: MiniMax: Aili Chen, Aonian Li, Baichuan Zhou, Bangwei Gong, Binyang Jiang, Boji Dan, Changqing Yu, Chao Wang (possible past Google (United States) affiliation), Cheng Ma, Cheng Zhong, Cheng Zhu, Chengjun Xiao, Chengyi Yang, Chengyu Du, Chenyang Zhang, Chi Zhang (possible past Peking University affiliation), Chuangyi Huang, Chunhao Zhang, Chunhui Du, Chunyu Zhao, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dongyu Zhang, Enhui Yang, Fei Yu, Guang Zheng, Guodong Zheng, Guohong Li, Haichao Zhu, Haigang Zhou, Haimo Zhang, Han Ding, Hao Zhang (possible past Tencent (China) affiliation), Haohai Sun, Haolin Lyu, Haonan Lu, Haoyu Wang (possible past Tencent (China) affiliation), Huajie Shi, Huiyang Li, Jiacheng Chen, Jian Zhang (possible past Tencent (China) affiliation), Jiaqi Zhuang, Jiaren Cai, Jiaxin Pan, Jiayao Li, Jiayuan Song, Jichuan Zhang, Jie Wang (possible past Tsinghua University affiliation), Jihao Gu, Jin Zhu, Jingwei Dong, Jingyang Li, Jingyu Zhang, Jingze Zhuang, Jinhao Tian, Jinli Liu, Jinyi Hu, Jun Tao, Jun Zhang (possible past Tencent (China) affiliation), Junbin Ruan, Junhao Xu, Junjie Yan, Junteng Liu, Junxian He (possible past Carnegie Mellon University affiliation), Kang Xu, Ke Ji, Ke Yang (possible past Google (United States) affiliation), Kecheng Xiao, Keyu Duan, Keyu Li, Le Han, Letian Ruan, Li Yuan (possible past National University of Singapore affiliation), Lianfei Yu, Liheng Feng, Lijie Mo, Lin Li, Lingye Bao, Lingyu Yang, Lingyuan Zhou, Loki, Lu Chen, Lunbin Ceng, Ming Li, Ming Zhong, Mingliang Tao, Mingyuan Chi, Mujie Lin, Nan Hu, Ningxin Chen, Peiyin Zhu, Peng Gao, Pengcheng Gao, Pengfei Li, Penglin Li, Pengyu Zhao, Qibin Ren, Qidi Xu, Qihan Ren, Qile Li, Qin Wang (possible past ETH Zurich affiliation), Quanliang Chen, Qunhong Ceng, Rong Tian, Rui Dong, Ruitao Leng, Ruize Zhang, Shanqi Liu, Shaoyu Chen, Sheng Jia, Shun Yao, Shuoran Zhao, Shuqi Yu, Sichen Li, Sicheng Pan, Songquan Zhu, Tengfei Li, Tian Xie, Tiancheng Qin, Tianrun Liang, Wei Liu (possible past Tsinghua University affiliation), Weiqi Xu, Weitao Li, Weixiang Chen, Weiyu Cheng, Weiyu Zhang, Wenhu Chen, Wenqian Zhao, Xiancai Chen, Xiangjun Song, Xiangyuan Wang, Xiao Luo, Xiao Su, Xiaobo Li, Xiaodong Han, Xiaojie Wu, Xihao Song, Xingyi Han, Xinyu Guan, Xuan Lu, Xun Zou, Xunhao Lai, Xutong Li, Yan Gong, Yang Wang (possible past Baidu (China) affiliation), Yang Xu, Yangsen Wang, Ye Tang, Yicheng Chen, Yinran Qiu, Yiqi Shi, Yiting Guo, Yiwen Huang, Yixuan Wang, Yongyi Hu, Yu Gao, Yu Zhang (possible past Google (United States) affiliation), Yuanxiang Ying, Yuanzhen Zhang, Yubo Wang, Yuchen Song, Yufeng Yang, Yuhang Meng, Yuhang Miao, Yuhao Li, Yujie Liu, Yulin Hu, Yunan Huang, Yunji Li, Yunyi Huang, Yusen Zhang, Yusu Hong, Yutao Xie, Yutong Zhang, Yuwen Liao, Yuxuan Shi, Yuze Wenren, Zebin Li, Zehan Li, Zejian Luo, Zeyu Jin, Zeyuan Sun, Zhanpeng Zhou, Zhaochen Su, Zhendong Li, Zhengmao Zhu, Zhengyuan Peng, Zhenhua Fan, Zhi Zhang, Zhichao Xu, Zhiheng Lv, Zhikang Xu, Zhitao He, Zhiwei He, Zhongyuan Li, Zibo Gao, Zijia Wu, Zijian Song, Zijian Zhou, Zijun Sun, Zishan Huang, Ziying Chen, Ziyue Ge
Abstract

We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unleash maximum real-world intelligence. The flagship M2 contains 229.9B total parameters with only 9.8B activated per token. Designed end-to-end for agentic deployment, the M2 series rests on three components: (i) agent-driven data pipelines producing large-scale, verifiable trajectories across agentic coding and agentic cowork, each grounded in an executable w...

📄 Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.26441v1
👥 Authors: Xiang Fang, Zeyu Xiong, Wanlong Fang, Xiaoye Qu, Chen Chen (possible past Tencent (China) affiliation), Jianfeng Dong, Keke Tang, Pan Zhou, Yu Cheng (possible past National University of Singapore affiliation), Daizong Liu
Abstract

This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on the moment proposal selection framework that utilizes contrastive learning and reconstruction paradigm for scoring the pre-defined moment proposals. Although they have achieved significant progress, we argue that their current frameworks have overlooked two indispensable issues: 1) Coarse-grained cross-modal learning: previous methods solely capture the global video...

📄 ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence
🗓️ Published: 5/25/2026
🔗 http://arxiv.org/abs/2605.26340v1
👥 Authors: Rui Meng, Bhavana Dalvi Mishra (possible past Carnegie Mellon University affiliation), Jiefeng Chen, Chun-Liang Li, Palash Goyal, Mihir Parmar, Yiwen Song, Yale Song, Rajarishi Sinha, Parthasarathy Ranganathan (possible past Google (United States) affiliation), Burak Gokturk, Jinsung Yoon (possible past Google (United States) affiliation), Tomas Pfister (possible past University of Oxford affiliation)
Abstract

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autono...

📄 Latent Recurrent Transformer: Architecture Exploration, Training Strategies, and Scaling Behavior
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.26797v1
👥 Authors: Zeyi Huang, Xuehai He, Liliang Ren, Yiping Wang, Baolin Peng, Hao Cheng (possible past Tencent (China) affiliation), Shuohang Wang, Pengcheng He, Jianfeng Gao (possible past Microsoft (United States) affiliation), Yong Jae Lee, Yelong Shen (possible past Tencent (China) affiliation)
Abstract

We study Latent Recurrent Transformer (LRT), a lightweight augmentation of autoregressive transformers that reuses a high-level source-layer hidden state from the previous token as recurrent memory for the next token. Because this source state is already computed during ordinary decoding, LRT adds a cross-layer recurrent latent pathway across positions without inserting pause tokens or extra depth loops, and the standard attention mechanism and KV-cache interface are preserved. To pretrain this ...

📄 APEX: Amplitude Anchors and Phase Priors for Target-Scarce Higher-Frequency Wave Prediction
🗓️ Published: 5/26/2026
🔗 http://arxiv.org/abs/2605.26732v1
👥 Authors: Yifan Sun (possible past Baidu (China) affiliation), Lei Cheng, Sijie Chen, Ting Zhang (possible past Meta (United States) affiliation), Jianlong Li, Shikai Fang
Abstract

Learning-based surrogates have become increasingly effective for wave-field prediction, and neural operators in particular have shown strong performance within observed frequency regimes. However, higher-frequency prediction under scarce target supervision remains comparatively underexplored, especially in wave problems where higher-frequency data are substantially more expensive to simulate or measure than lower-frequency data. A central difficulty is that cross-frequency transfer is inherently...

📄 Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation
🗓️ Published: 5/25/2026
🔗 http://arxiv.org/abs/2605.26111v1
👥 Authors: Shuhong Zheng, Aashish Kumar Misraa, Yu-Teng Li, Yu-Jhe Li (possible past Carnegie Mellon University affiliation), Igor Gilitschenski (possible past Massachusetts Institute of Technology affiliation)
Abstract

Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instructions. Existing approaches often encode text and reference images separately. This limits cross-modal reasoning abilities and causes copy-paste artifacts. Recent frameworks that connect multimodal models and diffusion models improve instruction following, but largely overlook identity preservation. To address these limitations, we condition diffusion models...

*Notable papers are those with at least two authors from a "big" AI/ML lab.