InterleaveThinker: Reinforcing Agentic Interleaved Generation

TL;DR

InterleaveThinker employs a multi-agent framework with a planner and critic, achieving high-quality interleaved text-image generation with step-wise reinforcement learning, improving performance on benchmarks by over 50%.

cs.CV 🔴 Advanced 2026-06-12 71 views

Dian Zheng Harry Lee Manyuan Zhang Kaituo Feng Zoey Guo Ray Zhang Hongsheng Li

AI Reader Arxiv Page Download PDF

multimodal generation multi-agent systems reinforcement learning text-image sequence long-horizon tasks

Key Findings

Methodology

The proposed InterleaveThinker framework integrates three main modules: a Planner that analyzes input sequences and generates a comprehensive set of instructions upfront, a Generator that produces images step-by-step based on refined prompts, and a Critic that evaluates each generated image against the instructions, providing feedback and prompt refinements. The training involves constructing high-quality datasets—Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, and Interleave-Critic-RL-13k—using supervised fine-tuning (SFT) and reinforcement learning (GRPO). To address the computational challenge of optimizing entire long trajectories (often over 25 steps), the authors develop a dual-reward strategy—accuracy reward and step-wise reward—that enables effective single-step RL. This design ensures the model can handle complex, multi-step text-image tasks with high fidelity and coherence, outperforming existing models like Nano Banana and GPT-5 on interleaved benchmarks.

Key Results

On the 4-step FLUX.2-klein benchmark, the WISE score increased from 0.47 to 0.73, and the RISE score jumped from 13.3 to 28.9, demonstrating substantial improvements in reasoning and visual coherence.
Across multiple models, performance gains exceeded 10%, especially in tasks requiring multi-step reasoning, validating the framework’s adaptability and robustness.
The single-step reinforcement learning approach significantly reduced training costs—by approximately 50%—while maintaining or improving overall output quality, proving its efficiency in long-horizon tasks.

Significance

This work addresses a fundamental bottleneck in multimodal AI—long-horizon, multi-step interleaved generation—by introducing a multi-agent architecture that decouples planning, generation, and evaluation. It effectively mitigates issues like visual over-reliance and error accumulation, enabling models to produce coherent, high-fidelity sequences over extended steps. This advancement has profound implications for applications such as visual storytelling, robotic instruction, and virtual assistants, where complex multi-modal interactions are essential. The framework’s generality allows it to be integrated with various existing models, broadening its impact across academia and industry, and paving the way for more autonomous, reliable multimodal systems.

Technical Contribution

The primary technical innovation lies in the multi-agent design combining a global planner, a step-wise generator, and an evaluator critic, forming a closed-loop system that ensures adherence to global objectives while correcting local deviations. The planner, fine-tuned via supervised learning, predicts entire instruction sequences, bypassing visual feedback over-reliance. The critic, trained with high-quality curated datasets, performs fine-grained evaluation at each step, guiding prompt refinements. The reinforcement learning component employs a dual-reward strategy—accuracy and step-wise rewards—enabling effective optimization of long sequences with minimal computational overhead. This approach introduces a new paradigm for long-horizon, multi-step multimodal generation, balancing efficiency and accuracy, and establishing theoretical guarantees for trajectory consistency.

Novelty

This is the first work to explicitly incorporate a multi-agent framework with a global planner and local critic for long-horizon interleaved text-image generation. Unlike previous models such as UMMs, which rely on step-by-step visual conditioning prone to error accumulation, InterleaveThinker decouples planning from generation, preventing myopic reactions. The dual-reward single-step RL strategy further distinguishes this work, enabling trajectory-level alignment without prohibitive computational costs. The comprehensive dataset construction pipeline and the integration of supervised fine-tuning with reinforcement learning also contribute to its novelty, providing a scalable and adaptable solution for complex multimodal tasks.

Limitations

Despite its strengths, the framework's computational demands remain high, especially during training with large datasets and reinforcement learning iterations, limiting real-time applications in resource-constrained environments.
The reliance on curated high-quality datasets may restrict generalization to unseen or highly specialized domains, requiring further dataset expansion and domain adaptation techniques.
Handling extremely long sequences (beyond 50 steps) still poses challenges due to cumulative computational costs and potential error propagation, necessitating further optimization of the RL strategy and model architecture.

Future Work

Future research will focus on improving the efficiency of the multi-agent system, exploring more scalable reinforcement learning algorithms, and extending the framework to handle even longer sequences. Integrating unsupervised or semi-supervised learning could reduce data dependency, while enhancing interpretability and robustness. Additionally, applying this architecture to real-world robotics, autonomous agents, and complex storytelling systems will be key directions. Cross-modal generalization, such as incorporating audio or haptic feedback, also presents promising avenues for creating more versatile multimodal AI systems capable of autonomous, multi-step reasoning and interaction.

AI Executive Summary

The rapid evolution of multimodal AI has unlocked new possibilities for visual storytelling, robotic guidance, and immersive virtual environments. Yet, a persistent challenge remains: enabling models to generate and reason over long, complex sequences involving multiple steps and modalities. Traditional image generation models excel at single-image tasks but falter when tasked with multi-step, interleaved text-image workflows. This gap hampers their application in scenarios requiring coherent, multi-stage interactions—such as detailed visual narratives or embodied robotic manipulation.

Addressing this challenge, the authors introduce InterleaveThinker, a pioneering multi-agent framework designed to imbue existing image generators with robust interleaved generation capabilities. The core idea is to decouple the planning, generation, and evaluation processes into three specialized modules. The planner, trained via supervised fine-tuning, predicts a comprehensive sequence of instructions before generation begins, thus avoiding the pitfalls of visual over-reliance on intermediate states. The generator then follows these instructions step-by-step, producing images aligned with the global plan. Meanwhile, the critic evaluates each generated image against the instruction, providing feedback and prompts for correction, ensuring the entire sequence remains faithful to the original goal.

This architecture fundamentally shifts how long-horizon multimodal tasks are approached. Instead of reactive, step-by-step conditioning prone to error accumulation, the system employs a proactive planning stage combined with iterative evaluation and correction. To train such a system, the authors curated high-quality datasets covering diverse scenarios—embodied manipulation, storytelling, scientific workflows—using a combination of synthetic generation, filtering, and splitting strategies. The training pipeline involves supervised fine-tuning of the planner and critic, complemented by reinforcement learning with a dual-reward mechanism that guides the critic’s step-wise corrections efficiently.

Experimental results demonstrate that InterleaveThinker significantly outperforms existing open-source models on challenging benchmarks. For example, on the 4-step FLUX.2-klein task, the WISE score improved from 0.47 to 0.73, and RISE from 13.3 to 28.9, indicating substantial gains in reasoning and visual coherence. The framework’s adaptability was validated across multiple models, including FLUX.2-klein and Qwen-image-Edit, confirming its universality. Beyond interleaved generation, the approach also enhances baseline models’ reasoning capabilities, marking a notable leap forward in multimodal AI.

This research opens new avenues for developing autonomous, reliable multimodal systems capable of handling complex, multi-step tasks. Its implications span virtual storytelling, robotic instruction, and beyond, offering a scalable, efficient blueprint for future AI systems. Despite current limitations in computational cost and data dependency, ongoing work aims to optimize efficiency, extend sequence length, and broaden application domains. Overall, InterleaveThinker represents a significant stride toward truly intelligent, multi-modal reasoning agents, setting the stage for next-generation AI applications that seamlessly integrate vision, language, and action over extended workflows.

Deep Analysis

Background

近年来，深度学习推动多模态生成技术快速发展，代表性模型如Diffusion、Autoregressive架构极大提升了图像逼真度和指令遵循能力。OpenAI的DALL·E 2、Stable Diffusion等模型在单图像生成和编辑任务中取得突破，但受限于架构设计，难以实现多步骤、多模态交错任务。随着CLIP、Florence等多模态模型的出现，支持文本与图像的交互成为可能，但在长序列生成中仍面临视觉过度依赖和误差累积的挑战。现有方法如DuoGen结合视频生成，改善连续性，但缺乏通用性和可扩展性。整体而言，长序列、多步骤的交错生成仍是AI研究中的难点，亟需创新架构和训练策略。

Core Problem

当前图像生成模型在多步骤交错任务中的表现有限，主要问题包括：1）视觉过度依赖：模型在生成过程中过度依赖中间视觉状态，导致偏离全局目标；2）逐步误差累积：小的偏差在多步骤中不断放大，最终影响整体效果。这些问题限制了模型在复杂场景中的应用，如视觉叙事、机器人指导等。解决方案需要在保证局部准确性的同时，提升整体一致性和鲁棒性，尤其在长序列任务中尤为重要。传统方法多采用逐步微调或后处理修正，但效果有限，难以应对复杂、多变的场景需求。

Innovation

本文的核心创新在于引入多智能体架构，将任务拆分为规划、生成和评估三个环节：1）规划器提前生成全局指令，避免中间状态的视觉过度依赖；2）生成器根据细化提示逐步生成图像，确保局部质量；3）批评者在每一步评估输出，识别偏差并优化提示，实现动态修正。这一设计区别于传统单一模型，提供了全局规划与局部修正的结合方式。采用GRPO的单步强化学习策略，有效降低长序列优化的计算成本，确保轨迹整体一致性。数据方面，构建多场景、多任务的高质量训练集，结合筛选和分割策略，提升模型泛化能力。这一体系突破了视觉过度依赖和误差累积的瓶颈，推动多模态交互向更高水平发展。

Methodology

�� 任务分析：输入文本-图像序列，规划器分析后提前生成全局指令（Instruction），包括每一步的操作（ui）、提示（pi）和补充信息（ai）；
�� 指令生成：利用Qwen-VL-8B-Instruct等模型进行微调，确保规划器能生成符合任务需求的全局指令集；
�� 图像生成：在每一步，根据细化提示（rt_i）和前一帧图像（Ii-1），由图像生成模型（如FLUX.2-klein）逐步生成新图像（It_i）；
�� 评估与修正：批评者（Critic）在每一步评估生成图像与指令的一致性，输出偏差判断（jt_i）和修正提示（rt+1_i），指导下一轮生成；
�� 训练策略：采用监督微调（SFT）和基于GRPO的单步强化学习，优化批评者的修正能力，确保轨迹整体一致性；
�� 数据构建：通过合成、筛选和分割多场景、多任务数据，确保训练集多样性和质量，提升模型泛化能力。

Experiments

实验采用多场景、多任务数据集，涵盖embodied manipulation、艺术、故事叙述等。对比基线包括单一模型和UMMs，评估指标包括WISE、RISE等长序列推理指标。对不同模型（如FLUX.2-klein、Qwen-image-Edit）进行测试，验证框架的适应性和性能提升。通过消融实验分析规划器、批评者和强化学习策略的贡献。实验还包括长序列任务的复杂性分析，验证单步强化学习在降低成本和提升效果方面的优势。

Results

�� 在4步FLUX.2-klein任务中，WISE指标由0.47提升至0.73，RISE由13.3跃升至28.9，显示出在复杂推理中的优越性；
�� 多模型验证显示，性能提升普遍在10%以上，尤其在连续推理和多步骤场景中效果显著；
�� 采用单步强化学习策略，显著降低训练成本（节省约50%的计算资源），同时保持甚至提升整体性能，验证了其在长序列任务中的效率优势。

Applications

�� 视觉叙事：自动生成多步骤故事情节，提高虚拟角色的交互能力；
�� 机器人操控：实现复杂指令的连续执行，提升自主机器人在家庭和工业环境中的表现；
�� 教育培训：辅助教学场景中的多步骤演示和指导，增强学习体验；
�� 影视制作：自动化生成连续场景，节省后期制作时间。未来还可结合虚拟现实，打造沉浸式交互体验。

Limitations & Outlook

�� 计算成本较高，尤其在长序列（超过50步）任务中，模型训练和推理的资源需求仍然较大；
�� 依赖高质量训练数据，数据采集和筛选过程繁琐，可能限制模型在未覆盖场景中的泛化能力，尤其在专业领域或少样本任务中表现尚待验证；
�� 当前方法对极端复杂场景的适应性有限，未来需优化模型的鲁棒性和效率，探索更智能的修正机制。

Plain Language Accessible to non-experts

想象你在做一份复杂的菜谱，需要提前规划好所有步骤，从准备食材到烹饪到装盘。普通厨师（模型）可能只会专注于每个步骤，容易在中途迷失或出错。而InterleaveThinker就像一个聪明的厨房助手，它会提前帮你把所有步骤都安排好，确保每个环节都按照大计划进行。它还会在你烹饪过程中不断检查，发现偏差就提醒你调整。这样，整个做菜过程变得井井有条，不会因为某个小错误而影响最终的美味。这个系统让复杂的菜谱变得像做家常菜一样简单，既有全局把控，又能逐步修正偏差，确保每次都做出完美的菜肴。

ELI14 Explained like you're 14

你可以把这个技术想象成一个超级聪明的老师，教你完成一项很难的任务，比如拼装一个复杂的模型。普通老师可能只会告诉你一步步怎么做，但如果你走错了，可能就会迷路或者拼错。这个新老师不但会提前帮你规划好整个拼装的步骤，还会在你拼的时候不断检查，发现哪里不对就告诉你怎么修正。这样，你就能按照计划一步步完成任务，而且每次都能修正错误，不会偏离目标。它就像一个有耐心、懂得提前安排和随时指导的好伙伴，让你轻松搞定复杂的事情。

Abstract

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose accuracy reward and step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.

cs.CV

References (20)

VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation

Max W.F. Ku, Dongfu Jiang, Cong Wei et al.

2023 177 citations ⭐ Influential View Analysis →

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou et al.

2025 724 citations ⭐ Influential View Analysis →

Emu3.5: Native Multimodal Models are World Learners

Yufeng Cui, Honghao Chen, Haoge Deng et al.

2025 88 citations View Analysis →

Idea2Img: Iterative Self-refinement with GPT-4V for Automatic Image Design and Generation

Zhengyuan Yang, Jianfeng Wang, Linjie Li et al.

2024 18 citations

CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

Wei Chen, Lin Li, Yong-Feng Yang et al.

2024 16 citations View Analysis →

MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

Kaizhi Zheng, Xuehai He, Xin Eric Wang

2023 132 citations

AUTO-ENCODING VARIATIONAL BAYES

Romain Lopez, Pierre Boyeau, N. Yosef et al.

2020 23726 citations

Adding Conditional Control to Text-to-Image Diffusion Models

Lvmin Zhang, Anyi Rao, Maneesh Agrawala

2023 7072 citations View Analysis →

Qwen2.5-VL Technical Report

Shuai Bai, Ke-qin Chen, Xuejing Liu et al.

2025 4890 citations View Analysis →

Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection

Shufan Li, Konstantinos Kallidromitis, Akash Gokul et al.

2025 39 citations View Analysis →

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey et al.

2023 4854 citations View Analysis →

From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

Le Zhuo, Liangbing Zhao, Sayak Paul et al.

2025 58 citations View Analysis →

Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

Xiangyu Zhao, Peiyuan Zhang, Kexian Tang et al.

2025 72 citations View Analysis →

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Black Forest Labs, Stephen Batifol, A. Blattmann et al.

2025 783 citations View Analysis →

GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing

Zhenyu Wang, Aoxue Li, Zhenguo Li et al.

2024 131 citations View Analysis →

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Z-Image Team, Huanqia Cai, Sihan Cao et al.

2025 156 citations View Analysis →

HunyuanImage 3.0 Technical Report

Siyu Cao, Hangting Chen, Peng Chen et al.

2025 99 citations View Analysis →

Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, P. Abbeel

2020 31639 citations View Analysis →

Making LLaMA SEE and Draw with SEED Tokenizer

Yuying Ge, Sijie Zhao, Ziyun Zeng et al.

2023 210 citations View Analysis →

Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Wenxuan Huang, Yu Zeng, Qiuchen Wang et al.

2026 19 citations View Analysis →

InterleaveThinker: Reinforcing Agentic Interleaved Generation

Key Findings

Methodology

Key Results

Significance

Technical Contribution

Novelty

Limitations

Future Work

AI Executive Summary

Deep Analysis

Background

Core Problem

Innovation

Methodology

Experiments

Results

Applications

Limitations & Outlook

Plain Language Accessible to non-experts

ELI14 Explained like you're 14

Abstract

References (20)

Related Papers

JanusMesh: Fast and Zero-Shot 3D Visual Illusion Generation via Cross-Space Denoising

UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

SSD: Spatially Speculative Decoding Accelerates Autoregressive Image Generation

CalTennis: Large Multi-View Tennis Video Dataset and Benchmark of Monocular-to-3D Pose Estimation

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

EventDrive: Event Cameras for Vision-Language Driving Intelligence