UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation

TL;DR

UniGRPO jointly optimizes text and image generation policies with GRPO, improving the quality of reasoning-driven visual generation.

cs.CV · 2026-03-25
Jie Liu, Zilyu Ye, Linxiao Yuan, Shenhan Zhu, Yu Gao, Jie Wu, Kunchang Li, Xionghui Wang, Xiaonan Nie, Weilin Huang, Wanli Ouyang
unified model reinforcement learning visual generation reasoning multimodal

Key Findings

Methodology

This paper proposes a unified reinforcement learning framework, UniGRPO, for optimizing reasoning-driven visual generation. The framework models the multimodal generation process as a Markov Decision Process (MDP) and optimizes text and image generation policies using GRPO. Specifically, standard GRPO is used for reasoning optimization, and FlowGRPO is used for visual synthesis. To ensure scalability to multi-round generation, two key modifications are made to FlowGRPO: removing classifier-free guidance and replacing the standard latent KL penalty with an MSE penalty on velocity fields.
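The second modification can be stated concretely. Below is a minimal sketch of a velocity-field MSE penalty replacing the latent KL term; the function names, the scalar-reward interface, and the coefficient `beta` are our illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def velocity_mse_penalty(v_policy: np.ndarray, v_ref: np.ndarray) -> float:
    """MSE between the current policy's predicted velocity field and a
    frozen reference policy's, used in place of a latent-space KL penalty."""
    return float(np.mean((v_policy - v_ref) ** 2))

def regularized_reward(reward: float, v_policy: np.ndarray,
                       v_ref: np.ndarray, beta: float = 0.01) -> float:
    """Sparse terminal reward minus the velocity-MSE regularizer, which
    discourages the policy from drifting far enough to hack the reward."""
    return reward - beta * velocity_mse_penalty(v_policy, v_ref)
```

Penalizing the velocity fields directly (rather than latents through a KL term) yields a regularization signal that is defined at every denoising step and does not require a tractable likelihood in latent space.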

Key Results

  • Experimental results show that UniGRPO significantly enhances the quality of reasoning-driven image generation. It scored 0.8381 on the TA benchmark and 0.90 on the GenEval benchmark, outperforming existing baseline methods.
  • By eliminating classifier-free guidance, UniGRPO demonstrates higher computational efficiency and stability in multi-round and multi-condition generation scenarios.
  • Ablation studies reveal that using an MSE penalty on velocity fields significantly reduces reward hacking while maintaining generation performance.

Significance

The introduction of UniGRPO provides a robust baseline for unified optimization of multimodal generation models, particularly in reasoning-driven visual generation tasks. By jointly optimizing text and image generation policies, UniGRPO not only improves generation quality but also offers a scalable framework for future post-training of fully interleaved models. This research addresses long-standing pain points in multimodal generation, such as effectively combining language reasoning capabilities with high-fidelity image generation.

Technical Contribution

UniGRPO makes several notable technical contributions. First, it integrates reasoning and visual synthesis into a unified optimization loop, overcoming the separation issues in existing multimodal generation methods. Second, by removing classifier-free guidance and introducing an MSE penalty on velocity fields, UniGRPO achieves higher stability and computational efficiency in multi-round generation. Lastly, it provides a scalable framework for future multimodal generation research.

Novelty

UniGRPO is the first to model reasoning-driven image generation as a unified MDP and optimize it using GRPO. Compared to existing multimodal generation methods, UniGRPO offers fundamental methodological innovations, particularly in effectively combining language reasoning and visual synthesis.

Limitations

  • In some complex multi-condition generation tasks, UniGRPO may require more computational resources to maintain generation quality.
  • Although removing classifier-free guidance improves efficiency, its removal may reduce text-image alignment in some scenarios.
  • The current experiments focus primarily on single-round generation and have not fully validated performance in more complex multi-round interaction scenarios.

Future Work

Future research directions include applying UniGRPO to more complex multi-round interaction generation scenarios, such as interactive image editing and visual storytelling. Additionally, introducing process reward models to improve the sample efficiency of RL training and ensure better interpretability of the model's decision-making process is an important research direction.

AI Executive Summary

In the field of multimodal generation, effectively combining language reasoning capabilities with high-fidelity image generation has long been a challenge. Existing methods often face issues with separate optimization between text and image generation, making true interleaved generation difficult to achieve. To address this problem, this paper proposes a unified reinforcement learning framework, UniGRPO, for optimizing reasoning-driven visual generation.

UniGRPO models the multimodal generation process as a Markov Decision Process (MDP) and optimizes text and image generation policies using GRPO. Specifically, standard GRPO is used for reasoning optimization, and FlowGRPO is used for visual synthesis. To ensure scalability to multi-round generation, two key modifications are made to FlowGRPO: removing classifier-free guidance and replacing the standard latent KL penalty with an MSE penalty on velocity fields.

Experimental results show that UniGRPO significantly enhances the quality of reasoning-driven image generation. It scored 0.8381 on the TA benchmark and 0.90 on the GenEval benchmark, outperforming existing baseline methods. By eliminating classifier-free guidance, UniGRPO demonstrates higher computational efficiency and stability in multi-round and multi-condition generation scenarios.

The introduction of UniGRPO provides a robust baseline for unified optimization of multimodal generation models, particularly in reasoning-driven visual generation tasks. By jointly optimizing text and image generation policies, UniGRPO not only improves generation quality but also offers a scalable framework for future post-training of fully interleaved models.

Despite its impressive performance, UniGRPO may require higher computational resources in some complex multi-condition generation tasks to maintain generation quality. Additionally, the current experimental setup focuses primarily on single-round generation tasks and has not fully validated its performance in more complex multi-round interaction scenarios. Future research directions include applying UniGRPO to more complex multi-round interaction generation scenarios, such as interactive image editing and visual storytelling.

Deep Analysis

Background

In recent years, the development of multimodal generation models has accelerated, particularly in the area of interleaved text and image generation. Traditional generative models, such as autoregressive models and diffusion models, typically perform well in single modalities but face challenges in multimodal generation. With the advancement of large language models (LLMs) in reasoning capabilities, researchers have begun exploring how to integrate these with high-fidelity image generation models to achieve more complex generation tasks. Existing research has largely focused on how to effectively leverage reasoning capabilities during generation, but there remain deficiencies in the collaborative optimization between text and image generation.

Core Problem

The core problem in multimodal generation is how to effectively combine language reasoning capabilities with high-fidelity image generation. Existing methods often face issues with separate optimization between text and image generation, making true interleaved generation difficult to achieve. Additionally, as the complexity of generation tasks increases, such as multi-round interactions and multi-condition generation, existing methods face challenges in computational efficiency and generation quality. Improving computational efficiency while maintaining generation quality is a pressing issue in the field of multimodal generation.

Innovation

UniGRPO presents several core innovations:


  • Models the multimodal generation process as a unified Markov Decision Process (MDP), achieving joint optimization of text and image generation policies.
  • Enhances computational efficiency and stability in multi-round generation by removing classifier-free guidance and introducing an MSE penalty on velocity fields.
  • Achieves efficient optimization in reasoning-driven image generation under a unified framework, significantly improving generation quality.

Methodology

The methodology of UniGRPO is detailed as follows:


  • Models the multimodal generation process as a Markov Decision Process (MDP), including state space, action space, transition function, and reward function.
  • Uses standard GRPO for reasoning optimization in text generation, with steps including: inputting user prompts, generating reasoning texts, calculating rewards, and updating policies.
  • Employs FlowGRPO for visual synthesis in image generation, with steps including: inputting reasoning texts, generating images, calculating rewards, and updating policies.
  • Ensures scalability to multi-round generation by removing classifier-free guidance, maintaining linear rollout of the generation process.
  • Replaces the standard latent KL penalty with an MSE penalty on velocity fields, providing a more robust regularization signal to reduce reward hacking.
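The two-stage rollout described above can be sketched as a simple loop; `reason_fn`, `synthesize_fn`, and `reward_fn` are hypothetical stand-ins for the text policy, the flow-matching image policy, and the reward model, not the paper's API.

```python
from dataclasses import dataclass
import statistics

@dataclass
class Rollout:
    prompt: str
    reasoning: str   # text expansion produced by the language policy
    image: object    # sample produced by the flow-matching policy
    reward: float    # sparse terminal reward on the final image

def collect_group(prompt, reason_fn, synthesize_fn, reward_fn, group_size=8):
    """One GRPO group: several complete rollouts for the same prompt."""
    group = []
    for _ in range(group_size):
        reasoning = reason_fn(prompt)      # stage 1: reasoning expansion
        image = synthesize_fn(reasoning)   # stage 2: image synthesis
        group.append(Rollout(prompt, reasoning, image,
                             reward_fn(prompt, image)))
    return group

def group_relative_advantages(rewards):
    """GRPO's value-model-free baseline: normalize rewards within the group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]
```

Because every rollout in a group shares the same prompt, the group mean serves as the baseline, which is what removes the need for a learned value model.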

Experiments

The experimental design includes the following aspects:


  • Setup: Builds on the pretrained Bagel model and performs supervised fine-tuning on an internal dataset.
  • Baselines: Compares with methods such as ReFL, FPO, FlowGRPO, and TextGRPO.
  • Evaluation Metrics: Uses the TA and GenEval benchmarks to evaluate text alignment and complex compositional capabilities.
  • Hyperparameters: Sets hyperparameters to ensure fairness and reproducibility of the experiments.
  • Ablation Studies: Validates the method by removing classifier-free guidance and comparing different regularization strategies.

Results

Experimental results show that UniGRPO significantly enhances the quality of reasoning-driven image generation. It scored 0.8381 on the TA benchmark and 0.90 on the GenEval benchmark, outperforming existing baseline methods. By eliminating classifier-free guidance, UniGRPO demonstrates higher computational efficiency and stability in multi-round and multi-condition generation scenarios. Ablation studies reveal that using an MSE penalty on velocity fields significantly reduces reward hacking while maintaining generation performance.

Applications

The application scenarios of UniGRPO include:


  • Interactive Image Editing: Achieves more complex image editing tasks through multi-round interaction generation. Users can interact with the model in real time to adjust image content using text prompts.
  • Visual Storytelling: Combines language reasoning capabilities and image generation capabilities to achieve more expressive visual content, applicable in fields such as advertising and education.
  • Multimodal Dialogue Systems: Enhances collaborative optimization of text and image generation in multimodal dialogue systems, achieving more natural human-computer interaction experiences.

Limitations & Outlook

Despite its impressive performance, UniGRPO may require higher computational resources in some complex multi-condition generation tasks to maintain generation quality. Additionally, the current experimental setup focuses primarily on single-round generation tasks and has not fully validated its performance in more complex multi-round interaction scenarios. Future research directions include applying UniGRPO to more complex multi-round interaction generation scenarios, such as interactive image editing and visual storytelling.

Plain Language (Accessible to non-experts)

Imagine you're in a kitchen cooking a meal. You have a recipe (user prompt) and need to make a dish (generate an image) based on it. Before cooking, you need to think through each step (reasoning), like what ingredients you need and how to combine them. Then, you start cooking (image synthesis). Throughout this process, you adjust the flavors (optimize generation policies) to ensure the final dish meets your expectations (generation quality). UniGRPO is like a smart chef, not only making delicious dishes based on the recipe but also continuously optimizing each step to ensure the final dish is both tasty and aligned with the recipe. By removing unnecessary steps (classifier-free guidance), UniGRPO can complete the entire cooking process more efficiently while ensuring the quality and taste of the dish.

ELI14 (Explained like you're 14)

Hey there! Imagine you're playing a super cool game where you control two characters: one is a language master, and the other is a drawing expert. The language master comes up with all sorts of fun storylines, and the drawing expert creates awesome pictures based on those stories. To win the game, you need these two characters to work perfectly together, creating both fun and beautiful works!

That's what UniGRPO does! It's like a super helper in the game, helping the language master and drawing expert collaborate better. It uses something called 'reinforcement learning' to help both characters improve their skills in the game, eventually becoming an unbeatable duo!

During this process, UniGRPO keeps adjusting strategies, just like you adjust tactics in a game, ensuring you score high every time. Even when faced with challenging levels, it finds the best solutions with smart strategies!

So, UniGRPO is like your best buddy in the game, helping you excel in the world of language and images, creating amazing works that wow everyone!

Glossary

UniGRPO (Unified Policy Optimization)

UniGRPO is a unified reinforcement learning framework for reasoning-driven visual generation, optimizing text and image generation policies to enhance generation quality.

In the paper, UniGRPO is used to optimize the multimodal generation process.

GRPO (Group Relative Policy Optimization)

GRPO is an efficient policy optimization method that eliminates the value model by using group-relative baselines, suitable for reasoning-intensive models.

In UniGRPO, GRPO is used for reasoning optimization.

FlowGRPO

FlowGRPO is a method that applies policy gradients to flow models by reformulating the generation process into a stochastic differential equation for visual synthesis optimization.

In UniGRPO, FlowGRPO is used for visual synthesis in image generation.
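A minimal sketch of the ODE-to-SDE idea, assuming a simple Euler discretization and a constant noise scale `sigma` (both our simplifications, not the paper's schedule): the deterministic flow update becomes a stochastic action, which is what makes policy-gradient training applicable.

```python
import numpy as np

def ode_step(x, v, dt):
    """Deterministic flow-matching update: x <- x + v * dt."""
    return x + v * dt

def sde_step(x, v, dt, sigma, rng):
    """Stochastic counterpart in the spirit of FlowGRPO: injected Gaussian
    noise turns each denoising step into a stochastic action with a
    tractable log-probability, so policy gradients can be computed."""
    noise = rng.standard_normal(np.shape(x))
    return x + v * dt + sigma * np.sqrt(dt) * noise
```

Setting `sigma = 0` recovers the deterministic sampler, so the SDE view strictly generalizes the ODE one.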

MDP (Markov Decision Process)

MDP is a mathematical framework for modeling decision processes, including state space, action space, transition function, and reward function.

UniGRPO models the multimodal generation process as an MDP.

MSE (Mean Squared Error)

MSE is a metric for measuring the difference between predicted and true values by calculating the average of squared errors to evaluate model performance.

In UniGRPO, MSE is used to replace the standard latent KL penalty.

Classifier-Free Guidance

Classifier-free guidance is a standard inference technique that combines conditional and unconditional model predictions to strengthen prompt adherence; it requires an extra unconditional forward pass at each step and branches the sampling rollout.

In UniGRPO, classifier-free guidance is removed to improve computational efficiency.
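For reference, the standard CFG combination looks like this; dropping it means one model forward pass per step instead of two, and a single unbranched rollout.

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, guidance_scale):
    """Classifier-free guidance: extrapolate the conditional prediction
    away from the unconditional one. Each sampling step needs two model
    forward passes (one conditional, one unconditional)."""
    return v_uncond + guidance_scale * (v_cond - v_uncond)
```

With `guidance_scale = 1` the output is just the conditional prediction, and larger scales push samples further toward the condition.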

TA (Text Alignment)

TA is a benchmark for evaluating text alignment in image generation, measuring how faithfully a generated image reflects the input prompt.

In experiments, TA is used to evaluate the prompt alignment of images generated by UniGRPO.

GenEval

GenEval is a standard benchmark for assessing the complex compositional capabilities of text-to-image models, including object counting, spatial relations, and attribute binding.

In experiments, GenEval is used to evaluate the image generation capabilities of UniGRPO.

Reward Hacking

Reward hacking is a failure mode of optimization in which the model exploits flaws in the reward function to obtain high rewards without genuinely improving its outputs, degrading generation quality.

In UniGRPO, reward hacking is reduced by using an MSE penalty.

Ablation Study

Ablation study is a method for validating the importance of components in a model by removing or replacing certain components and observing the impact on model performance.

In experiments, ablation studies are used to validate the effectiveness of UniGRPO.

Open Questions (Unanswered questions from this research)

  1. How to maintain context consistency in multi-round generation? Existing methods often struggle to maintain context consistency when handling long-horizon interaction generation. Future research needs to explore more effective strategies to ensure models can continuously track and maintain context in multi-round interactions.
  2. How to improve the sample efficiency of RL training? Current RL training often requires a large number of samples to achieve satisfactory performance. Introducing process reward models could be a solution, providing more granular feedback during generation to improve sample efficiency.
  3. How to maintain generation quality in multi-condition generation? Multi-condition generation tasks typically require handling multiple input conditions, and maintaining generation quality without increasing computational complexity is a challenge. Future research needs to explore more efficient strategies to achieve high-quality outputs in multi-condition generation.
  4. How to improve text-image alignment during reasoning? Although UniGRPO performs well in reasoning-driven image generation, the alignment between text and images may decrease in some scenarios. Future research needs to explore more effective strategies to improve text-image alignment during reasoning.
  5. How to enhance generation quality without increasing computational resources? Current generative models often require substantial computational resources to maintain high-quality outputs. Future research needs to explore more efficient model architectures and optimization strategies to enhance generation quality without increasing computational resources.

Applications

Immediate Applications

Interactive Image Editing

UniGRPO can be used for interactive image editing tasks, achieving more complex image editing through multi-round interaction generation. Users can interact with the model in real-time to adjust image content using text prompts.

Visual Storytelling

Combining language reasoning capabilities and image generation capabilities, UniGRPO can be used for visual storytelling, generating more expressive visual content. Applicable in fields such as advertising and education.

Multimodal Dialogue Systems

In multimodal dialogue systems, UniGRPO can enhance the collaborative optimization of text and image generation, achieving more natural human-computer interaction experiences.

Long-term Vision

Intelligent Creation Assistant

UniGRPO can become an intelligent creation assistant, helping users provide inspiration and suggestions during the creation process, generating high-quality text and image content.

Automated Content Generation

In industries such as advertising and media, UniGRPO can be used for automated content generation, improving production efficiency and reducing manual intervention.

Abstract

Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed by image synthesis. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty directly on the velocity fields, providing a more robust and direct regularization signal to mitigate reward hacking effectively. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.


References (20)

  • Emerging Properties in Unified Multimodal Pretraining. Chaorui Deng, Deyao Zhu, Kunchang Li et al. (2025, 465 citations)
  • DiffusionNFT: Online Diffusion Reinforcement with Forward Process. Kaiwen Zheng, Huayu Chen, Haotian Ye et al. (2025, 48 citations)
  • Proximal Policy Optimization Algorithms. John Schulman, Filip Wolski, Prafulla Dhariwal et al. (2017, 26037 citations)
  • RewardDance: Reward Scaling in Visual Generation. Jie Wu, Yu Gao, Zi-Nuo Ye et al. (2025, 31 citations)
  • Training Diffusion Models with Reinforcement Learning. Kevin Black, Michael Janner, Yilun Du et al. (2023, 761 citations)
  • Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization. Jiajun Fan, Shuaike Shen, Chaoran Cheng et al. (2025, 26 citations)
  • OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization. Jiacheng Zhang, Jie Wu, Weifeng Chen et al. (2024, 34 citations)
  • Directly Fine-Tuning Diffusion Models on Differentiable Rewards. Kevin Clark, Paul Vicol, Kevin Swersky et al. (2023, 357 citations)
  • Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities. Xinjie Zhang, Jintao Guo, Shanshan Zhao et al. (2025, 42 citations)
  • Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning. Zichen Miao, Jiang Wang, Ze Wang et al. (2024, 56 citations)
  • Show-o: One Single Transformer to Unify Multimodal Understanding and Generation. Jinheng Xie, Weijia Mao, Zechen Bai et al. (2024, 553 citations)
  • Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning. Xue Bin Peng, Aviral Kumar, Grace Zhang et al. (2019, 760 citations)
  • Language Models are Unsupervised Multitask Learners. Alec Radford, Jeff Wu, R. Child et al. (2019, 27794 citations)
  • VideoDPO: Omni-Preference Alignment for Video Diffusion Generation. Runtao Liu, Haoyu Wu, Ziqiang Zheng et al. (2024, 84 citations)
  • GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment. Dhruba Ghosh, H. Hajishirzi, Ludwig Schmidt (2023, 664 citations)
  • Smart-GRPO: Smartly Sampling Noise for Efficient RL of Flow-Matching Models. Benjamin Yu, Jackie Liu, Justin Cui (2025, 6 citations)
  • Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models. Shuchen Xue, Chongjian Ge, Shilong Zhang et al. (2025, 13 citations)
  • TempFlow-GRPO: When Timing Matters for GRPO in Flow Models. Xiaoxuan He, Siming Fu, Yuke Zhao et al. (2025, 45 citations)
  • DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment. Haoyou Deng, Keyu Yan, Chaojie Mao et al. (2026, 6 citations)
  • InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. Zhe Chen, Jiannan Wu, Wenhai Wang et al. (2023, 2528 citations)