UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation
UniGRPO optimizes text and image generation policies using GRPO, enhancing reasoning-driven visual generation quality.
Key Findings
Methodology
This paper proposes a unified reinforcement learning framework, UniGRPO, for optimizing reasoning-driven visual generation. The framework models the multimodal generation process as a Markov Decision Process (MDP) and optimizes text and image generation policies using GRPO. Specifically, standard GRPO is used for reasoning optimization, and FlowGRPO is used for visual synthesis. To ensure scalability to multi-round generation, two key modifications are made to FlowGRPO: removing classifier-free guidance to keep rollouts linear and unbranched, and replacing the standard latent KL penalty with an MSE penalty on velocity fields.
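As a concrete illustration of the group-relative baseline at the core of GRPO, here is a minimal Python sketch (the paper publishes no code; the function names and PyTorch framing are assumptions) of how group-normalized advantages and the clipped surrogate loss are typically computed:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: each sampled response is scored against the
    mean/std of its own group, removing the need for a learned value model."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate_loss(logp_new: torch.Tensor,
                           logp_old: torch.Tensor,
                           advantages: torch.Tensor,
                           clip: float = 0.2) -> torch.Tensor:
    """PPO-style clipped objective used by GRPO (returned as a loss to minimize)."""
    ratio = torch.exp(logp_new - logp_old)  # per-sample importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return -torch.min(unclipped, clipped).mean()
```

In UniGRPO, this same recipe is applied to both modalities: token log-probabilities for the reasoning text and per-step log-probabilities of the stochastic flow rollout for the image.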
Key Results
- Experimental results show that UniGRPO significantly enhances the quality of reasoning-driven image generation. It scored 0.8381 on the TA benchmark and 0.90 on the GenEval benchmark, outperforming existing baseline methods.
- By eliminating classifier-free guidance, UniGRPO demonstrates higher computational efficiency and stability in multi-round and multi-condition generation scenarios.
- Ablation studies reveal that using an MSE penalty on velocity fields significantly reduces reward hacking while maintaining generation performance.
Significance
The introduction of UniGRPO provides a robust baseline for unified optimization of multimodal generation models, particularly in reasoning-driven visual generation tasks. By jointly optimizing text and image generation policies, UniGRPO not only improves generation quality but also offers a scalable framework for future post-training of fully interleaved models. This research addresses a long-standing pain point in multimodal generation: the difficulty of effectively combining language reasoning capabilities with high-fidelity image generation.
Technical Contribution
UniGRPO makes several notable technical contributions. First, it integrates reasoning and visual synthesis into a unified optimization loop, overcoming the separation issues in existing multimodal generation methods. Second, by removing classifier-free guidance and introducing an MSE penalty on velocity fields, UniGRPO achieves higher stability and computational efficiency in multi-round generation. Lastly, it provides a scalable framework for future multimodal generation research.
Novelty
UniGRPO is the first to model reasoning-driven image generation as a unified MDP and optimize it using GRPO. Compared to existing multimodal generation methods, UniGRPO offers fundamental methodological innovations, particularly in effectively combining language reasoning and visual synthesis.
Limitations
- In some complex multi-condition generation tasks, UniGRPO may require higher computational resources to maintain generation quality.
- Removing classifier-free guidance may reduce alignment between generated images and text prompts in some scenarios.
- The current experimental setup focuses primarily on single-round generation tasks and has not fully validated its performance in more complex multi-round interaction scenarios.
Future Work
Future research directions include applying UniGRPO to more complex multi-round interaction generation scenarios, such as interactive image editing and visual storytelling. Additionally, introducing process reward models to improve the sample efficiency of RL training and ensure better interpretability of the model's decision-making process is an important research direction.
AI Executive Summary
In the field of multimodal generation, effectively combining language reasoning capabilities with high-fidelity image generation has long been a challenge. Existing methods often face issues with separate optimization between text and image generation, making true interleaved generation difficult to achieve. To address this problem, this paper proposes a unified reinforcement learning framework, UniGRPO, for optimizing reasoning-driven visual generation.
UniGRPO models the multimodal generation process as a Markov Decision Process (MDP) and optimizes text and image generation policies using GRPO. Specifically, standard GRPO is used for reasoning optimization, and FlowGRPO is used for visual synthesis. To ensure scalability to multi-round generation, two key modifications are made to FlowGRPO: removing classifier-free guidance to keep rollouts linear and unbranched, and replacing the standard latent KL penalty with an MSE penalty on velocity fields.
Experimental results show that UniGRPO significantly enhances the quality of reasoning-driven image generation. It scored 0.8381 on the TA benchmark and 0.90 on the GenEval benchmark, outperforming existing baseline methods. By eliminating classifier-free guidance, UniGRPO demonstrates higher computational efficiency and stability in multi-round and multi-condition generation scenarios.
The introduction of UniGRPO provides a robust baseline for unified optimization of multimodal generation models, particularly in reasoning-driven visual generation tasks. By jointly optimizing text and image generation policies, UniGRPO not only improves generation quality but also offers a scalable framework for future post-training of fully interleaved models.
Despite its impressive performance, UniGRPO may require higher computational resources in some complex multi-condition generation tasks to maintain generation quality. Additionally, the current experimental setup focuses primarily on single-round generation tasks and has not fully validated its performance in more complex multi-round interaction scenarios. Future research directions include applying UniGRPO to more complex multi-round interaction generation scenarios, such as interactive image editing and visual storytelling.
Deep Analysis
Background
In recent years, the development of multimodal generation models has accelerated, particularly in the area of interleaved text and image generation. Traditional generative models, such as autoregressive models and diffusion models, typically perform well in single modalities but face challenges in multimodal generation. With the advancement of large language models (LLMs) in reasoning capabilities, researchers have begun exploring how to integrate these with high-fidelity image generation models to achieve more complex generation tasks. Existing research has largely focused on how to effectively leverage reasoning capabilities during generation, but there remain deficiencies in the collaborative optimization between text and image generation.
Core Problem
The core problem in multimodal generation is how to effectively combine language reasoning capabilities with high-fidelity image generation. Existing methods often face issues with separate optimization between text and image generation, making true interleaved generation difficult to achieve. Additionally, as the complexity of generation tasks increases, such as multi-round interactions and multi-condition generation, existing methods face challenges in computational efficiency and generation quality. Improving computational efficiency while maintaining generation quality is a pressing issue in the field of multimodal generation.
Innovation
UniGRPO presents several core innovations:
- Models the multimodal generation process as a unified Markov Decision Process (MDP), achieving joint optimization of text and image generation policies.
- Enhances computational efficiency and stability in multi-round generation by removing classifier-free guidance and introducing an MSE penalty on velocity fields.
- Achieves efficient optimization in reasoning-driven image generation under a unified framework, significantly improving generation quality.
Methodology
The methodology of UniGRPO is detailed as follows:
- Models the multimodal generation process as a Markov Decision Process (MDP), including state space, action space, transition function, and reward function.
- Uses standard GRPO for reasoning optimization in text generation, following the steps of inputting user prompts, generating reasoning text, computing rewards, and updating the policy.
- Employs FlowGRPO for visual synthesis in image generation, following the steps of inputting the reasoning text, generating images, computing rewards, and updating the policy.
- Ensures scalability to multi-round generation by removing classifier-free guidance, maintaining a linear, unbranched rollout of the generation process.
- Replaces the standard latent KL penalty with an MSE penalty on velocity fields, providing a more robust regularization signal that reduces reward hacking (a minimal sketch of this penalty follows this list).
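To make the second modification concrete, here is a minimal sketch under assumed names: `v_policy` and `v_ref` stand in for the velocity predictions of the current policy and a frozen reference model. Instead of a KL term computed on sampled latents, the penalty is a direct mean-squared error between the two velocity fields:

```python
import torch

def velocity_mse_penalty(v_policy: torch.Tensor,
                         v_ref: torch.Tensor,
                         beta: float = 0.01) -> torch.Tensor:
    """Regularizer keeping the fine-tuned flow model's velocity field close to
    the reference model's, replacing the standard latent KL penalty."""
    return beta * torch.mean((v_policy - v_ref) ** 2)

# Schematic total objective:
# loss = clipped_surrogate_loss(...) + velocity_mse_penalty(v_policy, v_ref)
```

Because the penalty acts on the model's direct output rather than on sampled latents, it provides the denser, more direct regularization signal that the paper credits with mitigating reward hacking.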
Experiments
The experimental design includes the following aspects:
- Base Model and Data: Builds on the pretrained Bagel model and performs supervised fine-tuning on an internal dataset.
- Baselines: Compares against methods such as ReFL, FPO, FlowGRPO, and TextGRPO.
- Evaluation Metrics: Uses the TA and GenEval benchmarks to evaluate text alignment and complex compositional capabilities.
- Hyperparameters: Sets reasonable hyperparameters to ensure fairness and reproducibility of the experiments.
- Ablation Studies: Validates the method's design choices by removing classifier-free guidance and comparing different regularization strategies.
Results
Experimental results show that UniGRPO significantly enhances the quality of reasoning-driven image generation. It scored 0.8381 on the TA benchmark and 0.90 on the GenEval benchmark, outperforming existing baseline methods. By eliminating classifier-free guidance, UniGRPO demonstrates higher computational efficiency and stability in multi-round and multi-condition generation scenarios. Ablation studies reveal that using an MSE penalty on velocity fields significantly reduces reward hacking while maintaining generation performance.
Applications
The application scenarios of UniGRPO include:
- Interactive Image Editing: Achieves more complex image editing tasks through multi-round interaction generation. Users can interact with the model in real time to adjust image content using text prompts.
- Visual Storytelling: Combines language reasoning capabilities and image generation capabilities to achieve more expressive visual content, applicable in fields such as advertising and education.
- Multimodal Dialogue Systems: Enhances collaborative optimization of text and image generation in multimodal dialogue systems, achieving more natural human-computer interaction experiences.
Limitations & Outlook
Despite its impressive performance, UniGRPO may require higher computational resources in some complex multi-condition generation tasks to maintain generation quality. Additionally, the current experimental setup focuses primarily on single-round generation tasks and has not fully validated its performance in more complex multi-round interaction scenarios. Future research directions include applying UniGRPO to more complex multi-round interaction generation scenarios, such as interactive image editing and visual storytelling.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen cooking a meal. You have a recipe (user prompt) and need to make a dish (generate an image) based on it. Before cooking, you need to think through each step (reasoning), like what ingredients you need and how to combine them. Then, you start cooking (image synthesis). Throughout this process, you adjust the flavors (optimize generation policies) to ensure the final dish meets your expectations (generation quality). UniGRPO is like a smart chef, not only making delicious dishes based on the recipe but also continuously optimizing each step to ensure the final dish is both tasty and aligned with the recipe. By removing unnecessary steps (classifier-free guidance), UniGRPO can complete the entire cooking process more efficiently while ensuring the quality and taste of the dish.
ELI14 (Explained like you're 14)
Hey there! Imagine you're playing a super cool game where you control two characters: one is a language master, and the other is a drawing expert. The language master comes up with all sorts of fun storylines, and the drawing expert creates awesome pictures based on those stories. To win the game, you need these two characters to work perfectly together, creating both fun and beautiful works!
That's what UniGRPO does! It's like a super helper in the game, helping the language master and drawing expert collaborate better. It uses something called 'reinforcement learning' to help both characters improve their skills in the game, eventually becoming an unbeatable duo!
During this process, UniGRPO keeps adjusting strategies, just like you adjust tactics in a game, ensuring you score high every time. Even when faced with challenging levels, it finds the best solutions with smart strategies!
So, UniGRPO is like your best buddy in the game, helping you excel in the world of language and images, creating amazing works that wow everyone!
Glossary
UniGRPO (Unified Policy Optimization)
UniGRPO is a unified reinforcement learning framework for reasoning-driven visual generation, optimizing text and image generation policies to enhance generation quality.
In the paper, UniGRPO is used to optimize the multimodal generation process.
GRPO (Group Relative Policy Optimization)
GRPO is an efficient policy optimization method that eliminates the value model by using group-relative baselines, suitable for reasoning-intensive models.
In UniGRPO, GRPO is used for reasoning optimization.
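In the standard formulation (notation assumed here), the advantage of the $i$-th response within a group of $G$ sampled responses is computed relative to the group's own statistics:

```latex
A_i = \frac{r_i - \operatorname{mean}\!\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\!\left(\{r_j\}_{j=1}^{G}\right)}
```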
FlowGRPO
FlowGRPO is a method that applies policy gradients to flow models by reformulating the generation process into a stochastic differential equation for visual synthesis optimization.
In UniGRPO, FlowGRPO is used for visual synthesis in image generation.
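Schematically (a standard marginal-preserving construction, not necessarily the paper's exact equations), the deterministic flow ODE is converted into an SDE with the same marginals, which makes per-step log-probabilities well defined for policy gradients:

```latex
\mathrm{d}x_t = v_\theta(x_t, t)\,\mathrm{d}t
\;\;\longrightarrow\;\;
\mathrm{d}x_t = \left[v_\theta(x_t, t) + \frac{\sigma_t^2}{2}\nabla_{x}\log p_t(x_t)\right]\mathrm{d}t + \sigma_t\,\mathrm{d}w_t
```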
MDP (Markov Decision Process)
MDP is a mathematical framework for modeling decision processes, including state space, action space, transition function, and reward function.
UniGRPO models the multimodal generation process as an MDP.
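Formally, an MDP is the tuple below; in this paper's setting, the state is the interleaved text/image context generated so far, actions are generated tokens or denoising steps, and the reward is sparse and terminal:

```latex
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R), \qquad P(s' \mid s, a), \qquad R(s, a)
```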
MSE (Mean Squared Error)
MSE is a metric for measuring the difference between predicted and true values by calculating the average of squared errors to evaluate model performance.
In UniGRPO, MSE is used to replace the standard latent KL penalty.
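For completeness, the standard definition over $n$ prediction/target pairs:

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2
```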
Classifier-Free Guidance
Classifier-free guidance is a standard inference technique that strengthens conditioning by combining conditional and unconditional model predictions at each sampling step, at the cost of an extra forward pass.
In UniGRPO, classifier-free guidance is removed to keep rollouts linear and improve computational efficiency.
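In its standard form for a conditional velocity (or noise) predictor with guidance scale $w$ (symbols assumed), classifier-free guidance mixes conditional and unconditional predictions, requiring two forward passes per step; dropping it is what keeps UniGRPO's rollouts linear and unbranched:

```latex
\hat{v}(x_t, c) = v_\theta(x_t, \varnothing) + w\,\big(v_\theta(x_t, c) - v_\theta(x_t, \varnothing)\big)
```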
TA (Text Alignment)
TA is a benchmark for evaluating text alignment in image generation, measuring how well generated images match the input prompt.
In experiments, TA is used to evaluate the text alignment of UniGRPO's generated images.
GenEval
GenEval is a standard benchmark for assessing the complex compositional capabilities of text-to-image models, including object counting, spatial relations, and attribute binding.
In experiments, GenEval is used to evaluate the image generation capabilities of UniGRPO.
Reward Hacking
Reward hacking is a failure mode of optimization in which the model exploits flaws in the reward function to obtain high rewards through unintended shortcuts, leading to degraded generation quality.
In UniGRPO, reward hacking is reduced by using an MSE penalty.
Ablation Study
Ablation study is a method for validating the importance of components in a model by removing or replacing certain components and observing the impact on model performance.
In experiments, ablation studies are used to validate the effectiveness of UniGRPO.
Open Questions (Unanswered questions from this research)
1. How to maintain context consistency in multi-round generation? Existing methods often struggle to maintain context consistency when handling long-horizon interaction generation. Future research needs to explore more effective strategies to ensure models can continuously track and maintain context in multi-round interactions.
2. How to improve the sample efficiency of RL training? Current RL training often requires a large number of samples to achieve satisfactory performance. Introducing process reward models could be a solution, providing more granular feedback during generation to improve sample efficiency.
3. How to maintain generation quality in multi-condition generation? Multi-condition generation tasks typically require handling multiple input conditions, and maintaining generation quality without increasing computational complexity is a challenge. Future research needs to explore more efficient strategies to achieve high-quality outputs in multi-condition generation.
4. How to improve text-image alignment during reasoning? Although UniGRPO performs well in reasoning-driven image generation, the alignment between text and images may decrease in some scenarios. Future research needs to explore more effective strategies to improve text-image alignment during reasoning.
5. How to enhance generation quality without increasing computational resources? Current generative models often require substantial computational resources to maintain high-quality outputs. Future research needs to explore more efficient model architectures and optimization strategies to enhance generation quality without increasing computational resources.
Applications
Immediate Applications
Interactive Image Editing
UniGRPO can be used for interactive image editing tasks, achieving more complex image editing through multi-round interaction generation. Users can interact with the model in real-time to adjust image content using text prompts.
Visual Storytelling
Combining language reasoning capabilities and image generation capabilities, UniGRPO can be used for visual storytelling, generating more expressive visual content. Applicable in fields such as advertising and education.
Multimodal Dialogue Systems
In multimodal dialogue systems, UniGRPO can enhance the collaborative optimization of text and image generation, achieving more natural human-computer interaction experiences.
Long-term Vision
Intelligent Creation Assistant
UniGRPO can become an intelligent creation assistant, helping users provide inspiration and suggestions during the creation process, generating high-quality text and image content.
Automated Content Generation
In industries such as advertising and media, UniGRPO can be used for automated content generation, improving production efficiency and reducing manual intervention.
Abstract
Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed by image synthesis. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty directly on the velocity fields, providing a more robust and direct regularization signal to mitigate reward hacking effectively. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.
References (20)
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li et al.
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
Kaiwen Zheng, Huayu Chen, Haotian Ye et al.
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal et al.
RewardDance: Reward Scaling in Visual Generation
Jie Wu, Yu Gao, Zi-Nuo Ye et al.
Training Diffusion Models with Reinforcement Learning
Kevin Black, Michael Janner, Yilun Du et al.
Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization
Jiajun Fan, Shuaike Shen, Chaoran Cheng et al.
OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization
Jiacheng Zhang, Jie Wu, Weifeng Chen et al.
Directly Fine-Tuning Diffusion Models on Differentiable Rewards
Kevin Clark, Paul Vicol, Kevin Swersky et al.
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Xinjie Zhang, Jintao Guo, Shanshan Zhao et al.
Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning
Zichen Miao, Jiang Wang, Ze Wang et al.
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie, Weijia Mao, Zechen Bai et al.
Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning
Xue Bin Peng, Aviral Kumar, Grace Zhang et al.
Language Models are Unsupervised Multitask Learners
Alec Radford, Jeff Wu, Rewon Child et al.
VideoDPO: Omni-Preference Alignment for Video Diffusion Generation
Runtao Liu, Haoyu Wu, Ziqiang Zheng et al.
GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment
Dhruba Ghosh, Hannaneh Hajishirzi, Ludwig Schmidt
Smart-GRPO: Smartly Sampling Noise for Efficient RL of Flow-Matching Models
Benjamin Yu, Jackie Liu, Justin Cui
Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models
Shuchen Xue, Chongjian Ge, Shilong Zhang et al.
TempFlow-GRPO: When Timing Matters for GRPO in Flow Models
Xiaoxuan He, Siming Fu, Yuke Zhao et al.
DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment
Haoyou Deng, Keyu Yan, Chaojie Mao et al.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen, Jiannan Wu, Wenhai Wang et al.