Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

TL;DR

The AdvGRPO framework combines dense multi-channel rewards with decoupled advantage normalization for joint attacker-defender training, reporting attack success rates above 90% and stronger defender robustness than baselines.

cs.CL 2026-06-09

Blake Bullwinkel Eugenia Kim Amanda Minnich Mark Russinovich

Reinforcement Learning, Adversarial Training, Language Models, GRPO, Red Teaming

Key Findings

Methodology

This paper introduces AdvGRPO, a framework built on Group Relative Policy Optimization (GRPO) that combines dense multi-channel rewards with decoupled advantage normalization for joint attacker-defender training of language models. Training follows a curriculum that moves gradually from single-turn attacks to closed-loop multi-turn attacks, which is meant to make the attacker more responsive and adaptive. Attacker and defender are updated in alternation, so each model trains against an opponent that keeps improving. The reward channels cover attack success, prompt fidelity, reasoning traces, and helpfulness, each scored by GPT-4.1 acting as a judge. Following GDPO, each reward channel is normalized independently before the channels are combined, which keeps training stable. The models interact over multiple rounds, with per-turn feedback guiding the attacker toward more effective prompts and the defender toward safer responses. In the reported experiments, AdvGRPO produces highly transferable attack strategies and outperforms baseline methods in attack success rate (ASR) and defense robustness across several benchmarks.
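The per-channel normalization described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the equal channel weights, the `EPS` constant, and the reward values are all assumptions chosen for illustration. The sketch contrasts summing channels before normalizing with normalizing each channel on its own.

```python
from statistics import mean, pstdev

EPS = 1e-6  # assumption: small constant to avoid division by zero


def normalize(values):
    """Group-relative z-score within one rollout group."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def summed_advantages(group):
    """Baseline: sum the reward channels first, then normalize once."""
    totals = [sum(rewards.values()) for rewards in group]
    return normalize(totals)


def decoupled_advantages(group, weights=None):
    """Normalize each reward channel independently, then combine."""
    channels = list(group[0].keys())
    weights = weights or {c: 1.0 for c in channels}  # assumption: equal weights
    per_channel = {c: normalize([r[c] for r in group]) for c in channels}
    return [
        sum(weights[c] * per_channel[c][i] for c in channels)
        for i in range(len(group))
    ]


# Four rollouts for one prompt. Channel names follow the summary; values are made up.
group = [
    {"attack_success": 1.0, "fidelity": 0.0},
    {"attack_success": 0.0, "fidelity": 1.0},
    {"attack_success": 0.0, "fidelity": 0.5},
    {"attack_success": 0.0, "fidelity": 0.5},
]
summed = summed_advantages(group)
decoupled = decoupled_advantages(group)
```

In this toy group the first two rollouts have the same total reward, so the summed variant gives them identical advantages even though they succeed on different channels. The decoupled variant assigns them different advantages, which is one way to read the summary's point about avoiding signal collapse.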

Key Results

In single-turn and multi-turn attack scenarios, AdvGRPO achieves over 90% ASR on Qwen2.5-14B, with the multi-turn attacker reaching 90–91%. Reasoning-capable models such as Qwen3.5-9B reach 71–79%. Both figures significantly surpass untrained baselines.
Attacks transfer well: multi-turn attacker models maintain over 80% ASR against unseen defenders such as Gemma-2-9B and Llama-3.1-8B, indicating strong generalization.
Defense models trained via AdvGRPO cut the attack success rate on benchmarks such as HarmBench to below 2% while maintaining high performance on knowledge and reasoning tasks.
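For readers unfamiliar with the metric, the following Python sketch shows how ASR figures like those above could be computed from judge verdicts. The verdict data and defender labels are hypothetical, and counting a multi-turn attack as successful when any turn succeeds is an assumption, not something this summary states.

```python
def attack_success_rate(judgments):
    """Fraction of attack attempts the judge marked as successful."""
    if not judgments:
        raise ValueError("need at least one judged attempt")
    return sum(1 for j in judgments if j) / len(judgments)


def multi_turn_success(turn_verdicts):
    """Assumption: a multi-turn attack counts as a success if any turn succeeds."""
    return any(turn_verdicts)


# Hypothetical judge verdicts (True = attack succeeded), keyed by defender.
verdicts = {
    "trained_against": [True, True, True, False],
    "unseen_defender": [True, False, True, False],
}
asr = {name: attack_success_rate(v) for name, v in verdicts.items()}

# One hypothetical three-turn conversation that succeeds on the last turn.
conversation_succeeded = multi_turn_success([False, False, True])
```

Comparing the entry for the defender used in training with the entry for an unseen defender gives a simple view of transferability.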

Significance

This work contributes to AI safety by offering a scalable framework for automated red-team and blue-team co-training that addresses the problem of evolving adversarial strategies. Dense rewards and advantage normalization stabilize training, so the same pipeline can produce attackers that generate sophisticated attacks and defenders that resist them. These capabilities matter for deploying trustworthy AI systems in real-world applications, where adversaries continually adapt. The methodology connects reinforcement learning techniques to practical security needs and points toward safety measures that can be updated as new attacks emerge.

Technical Contribution

The paper applies GRPO to joint attacker-defender training and addresses the instability reported in prior work through per-channel reward normalization and staged curriculum learning. Advantage decoupling supports stable multi-objective optimization, while multi-turn, closed-loop interaction models realistic adversarial scenarios. GPT-4.1 acts as the judge that produces the reward signals. Together, these components yield attack strategies with high transferability and defenders with superior safety performance.
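The difference between normalizing a summed reward and normalizing each channel separately can be written out as follows. This is a sketch under stated assumptions rather than the paper's exact formulation: the group size $G$, channel count $K$, channel weights $w_k$, and stabilizer $\epsilon$ are notation introduced here for illustration.

```latex
% Rollout i in a group of G samples receives K channel rewards r_i^{(k)}.

% Summed-reward group-relative advantage (normalize once, after summing):
R_i = \sum_{k=1}^{K} r_i^{(k)}, \qquad
A_i^{\text{sum}} = \frac{R_i - \operatorname{mean}_{j}(R_j)}{\operatorname{std}_{j}(R_j) + \epsilon}

% Decoupled advantage (normalize each channel within the group, then combine):
A_i^{(k)} = \frac{r_i^{(k)} - \mu_k}{\sigma_k + \epsilon}, \qquad
\mu_k = \operatorname{mean}_{j}\big(r_j^{(k)}\big), \quad
\sigma_k = \operatorname{std}_{j}\big(r_j^{(k)}\big)

A_i^{\text{dec}} = \sum_{k=1}^{K} w_k \, A_i^{(k)}
```

Under the summed form, two rollouts with equal $R_i$ always receive the same advantage regardless of which channels produced the reward. Under the decoupled form, they can differ whenever their per-channel profiles differ.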

Novelty

According to the abstract, earlier attacker-defender co-training work applied PPO and DPO and reported that GRPO is unstable in this setting. AdvGRPO makes GRPO viable for joint training of language model attackers and defenders by combining dense multi-channel rewards with advantage decoupling. It also trains attackers for multi-turn, closed-loop adversarial interaction, which improves attack adaptability and transferability. The curriculum further distinguishes this work: it increases attack complexity gradually, which helps training converge stably.

Limitations

The training process requires substantial computational resources, especially for multi-turn interactions and multi-channel reward calculations, limiting scalability to extremely large models or real-time deployment.
While the framework improves robustness against known attack strategies, it may still face challenges against novel or highly sophisticated adversaries not encountered during training.
Dependence on GPT-4.1 as a reward scorer introduces potential biases and scoring inaccuracies, which could affect the stability and generalization of the trained models.

Future Work

Future research will focus on reducing computational overhead, exploring more efficient reward mechanisms, and extending the framework to multi-modal models involving images and audio. Additionally, integrating human-in-the-loop feedback could further enhance the alignment and safety of models. Investigating methods to improve the interpretability of attack and defense strategies will also be a priority, aiming to facilitate transparency and trustworthiness in deployed AI systems. Finally, scaling the approach to larger models and real-world scenarios remains an open challenge.

AI Executive Summary

The rapid advancement of large language models (LLMs) has brought unprecedented capabilities but also significant safety and security challenges. Traditional approaches to model alignment and safety rely heavily on static datasets and manual curation, which are insufficient against adaptive adversaries that evolve their attack strategies continuously. This dynamic adversarial landscape necessitates automated, self-improving red and blue team frameworks capable of co-evolving in real-time.

In this context, the paper introduces AdvGRPO, a pioneering reinforcement learning framework that employs Group Relative Policy Optimization (GRPO) for joint attacker-defender training. The core innovation lies in integrating dense, multi-channel reward signals with advantage decoupling techniques, which together stabilize the training process and enable models to learn sophisticated attack and defense strategies simultaneously. The framework adopts a curriculum learning approach, gradually increasing attack complexity from single-turn to multi-turn, closed-loop scenarios, thereby enhancing the attacker’s responsiveness and adaptability.
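The staged curriculum can be pictured as a simple schedule. The Python sketch below is illustrative only: the stage names, the step counts, the choice to update only the attacker before co-training, and the choice to start the alternation with the attacker are assumptions that this summary does not specify.

```python
from itertools import cycle


def training_schedule(single_turn_steps, multi_turn_steps, cotrain_rounds):
    """Yield (stage, model_to_update) pairs in curriculum order."""
    # Stages 1 and 2 (assumption: attacker-only), single-turn then multi-turn.
    for _ in range(single_turn_steps):
        yield ("single_turn", "attacker")
    for _ in range(multi_turn_steps):
        yield ("multi_turn", "attacker")
    # Stage 3: co-training, alternating which model receives the update.
    roles = cycle(["attacker", "defender"])
    for _ in range(cotrain_rounds):
        yield ("cotrain", next(roles))


schedule = list(training_schedule(2, 2, 4))
for stage, role in schedule:
    print(f"{stage:12s} update {role}")
```

Keeping the schedule separate from the update logic makes it easy to vary stage lengths without touching the training step itself.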

The methodology involves alternating updates of attacker and defender models, with each interacting through multiple rounds of dialogue. Rewards are scored by GPT-4.1, which evaluates responses based on attack success, prompt fidelity, reasoning quality, and helpfulness. Advantage normalization via GDPO ensures that each reward channel contributes effectively without signal collapse. The training process is designed to produce attack models with high transferability, capable of bypassing unseen defenses, and defense models that significantly reduce attack success rates while maintaining core capabilities.
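A closed-loop multi-turn episode of the kind described here might look like the following Python sketch. The `run_episode` function, the callables it accepts, the early stop on success, and the toy stand-ins are all hypothetical. In a real setup the attacker, defender, and judge would be model calls.

```python
def run_episode(attacker, defender, judge, goal, max_turns=3):
    """Closed-loop rollout: each attacker prompt sees the dialogue so far."""
    history, turn_rewards = [], []
    for _ in range(max_turns):
        prompt = attacker(goal, history)
        reply = defender(history + [("attacker", prompt)])
        history += [("attacker", prompt), ("defender", reply)]
        rewards = judge(goal, prompt, reply)  # dict of per-channel scores
        turn_rewards.append(rewards)
        if rewards["attack_success"] >= 1.0:  # assumption: stop once the attack lands
            break
    return history, turn_rewards


# Toy stand-ins so the sketch runs end to end.
def toy_attacker(goal, history):
    return f"attempt {len(history) // 2 + 1}: {goal}"


def toy_defender(dialogue):
    return "refusal" if len(dialogue) < 3 else "compliance"


def toy_judge(goal, prompt, reply):
    return {"attack_success": 1.0 if reply == "compliance" else 0.0}


history, turn_rewards = run_episode(
    toy_attacker, toy_defender, toy_judge, "placeholder goal"
)
```

The list `turn_rewards` holds one reward dictionary per turn, which corresponds to the per-turn feedback the summary mentions.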

Experimental results support AdvGRPO across several benchmarks. The attack success rate on Qwen2.5-14B exceeds 90% in multi-turn scenarios, and the attacks generalize to out-of-distribution defenders, outperforming state-of-the-art methods such as SEMA. Defense models trained with AdvGRPO reduce attack success rates to below 2% on benchmarks such as HarmBench while preserving knowledge and reasoning skills. These findings suggest that AdvGRPO can improve AI safety by letting models learn to attack and defend against evolving threats.

Overall, this work bridges the gap between reinforcement learning theory and practical AI security, offering a scalable, stable, and effective solution for automated adversarial training. It opens new avenues for research into multi-modal, real-time, and human-in-the-loop adversarial defense systems, paving the way for safer deployment of powerful language models in real-world applications.

Deep Dive

Abstract

AI red teaming must continually adapt to evolving attackers and defenders. Reinforcement learning offers a promising approach to discovering novel attacks, and co-training methods can produce more robust defenders in tandem. Recent works have demonstrated the efficacy of attacker-defender co-training by applying PPO and DPO, but report that GRPO is unstable in this setting. We introduce AdvGRPO, a co-training framework that makes GRPO viable for joint attacker-defender optimization using dense multi-channel rewards and decoupled advantage normalization. Training progresses through a curriculum from single-turn to closed-loop multi-turn attacks before bootstrapping co-training, where attacker and defender models are updated in alternation. We show that our method can produce highly effective and transferable attacks and that co-trained defenders outperform baselines on safety benchmarks.

cs.CL cs.AI cs.LG
