Constitutional Arms Races in the Public Goods Game: Co-Evolving LLM Constitutions Under Cooperation-Defection Pressure
LLM-guided co-evolution of constitutions reaches a stable near-parity equilibrium around 0.78 in a Public Goods Game arms race.
Key Findings
Methodology
This paper introduces an LLM-guided evolutionary search framework for co-evolving natural-language constitutions in multi-agent adversarial settings. Specifically, it studies a constitutional arms race between Blue cooperators and Red free-riders in two environments: a Public Goods Game (PGG) and a spatial grid-world. Each faction’s constitution is a priority-ordered set of natural-language rules that agents follow verbatim. The evolutionary process alternates updates between factions using OpenEvolve combined with MAP-Elites, with fitness functions including faction-specific scores, score-advantage (S_own - S_opp), and pure adversarial objectives. Fitness coupling and the evaluation seed count K are both critical: the former induces genuine adversarial pressure and the latter maintains search stability.
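The alternating update scheme described above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: it replaces OpenEvolve and MAP-Elites with a greedy accept-if-better rule, assumes one faction updates per generation (Blue on even generations, Red on odd ones), and treats `mutate` (a stand-in for the LLM mutation operator) and `fitness_of` as hypothetical caller-supplied callables.

```python
import random
from typing import Callable, List, Tuple

# A constitution is a priority-ordered list of natural-language rules.
Constitution = List[str]


def coevolve(
    blue: Constitution,
    red: Constitution,
    mutate: Callable[[Constitution, random.Random], Constitution],
    fitness_of: Callable[[str, Constitution, Constitution], float],
    generations: int = 30,
    rng_seed: int = 0,
) -> Tuple[Constitution, Constitution]:
    """Alternate Blue and Red updates against the opponent's current constitution.

    Assumption: greedy hill-climbing stands in for the real search procedure.
    """
    rng = random.Random(rng_seed)
    for gen in range(generations):
        if gen % 2 == 0:
            # Blue's turn: propose a mutant and keep it only if Blue's fitness improves.
            candidate = mutate(blue, rng)
            if fitness_of("blue", candidate, red) > fitness_of("blue", blue, red):
                blue = candidate
        else:
            # Red's turn: same rule, evaluated against the current Blue constitution.
            candidate = mutate(red, rng)
            if fitness_of("red", blue, candidate) > fitness_of("red", blue, red):
                red = candidate
    return blue, red
```

Because each faction is always scored against the opponent's latest constitution, an improvement by one side changes the landscape the other side searches, which is the arms-race property the framework relies on.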
Key Results
- In the PGG, the Blue faction’s score improved from 0.370 to 0.777 and the Red faction’s from 0.177 to 0.782 over 30 generations, converging to a stable near-parity equilibrium around 0.78 that was robust across the tested multipliers m in {1.2, 1.5, 2.0, 3.0}.
- In independently scored environments, faction scores were statistically uncorrelated (corr(S_B, S_R) = +0.088), yielding no adversarial pressure. Introducing score-advantage fitness restored adversarial dynamics and enabled a constitutional arms race.
- Under pure-adversary fitness, evaluation seed count K controlled mode regression: K=2 led to regression, while K=5 sustained strong adversarial specialization for all 30 generations, highlighting evaluation budget as a key lever.
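One way to see why the seed count K matters is to look at how averaging reduces estimation noise. The sketch below is an illustration under stated assumptions, not the paper's analysis: it assumes per-seed scores are independent with a common standard deviation, in which case the standard error of a K-seed mean is that deviation divided by the square root of K. The names `mean_fitness`, `standard_error`, and `evaluate_once` are hypothetical.

```python
import math
import statistics
from typing import Callable, Sequence


def mean_fitness(evaluate_once: Callable[[int], float], seeds: Sequence[int]) -> float:
    """Estimate fitness as the mean score over the given evaluation seeds."""
    return statistics.fmean(evaluate_once(seed) for seed in seeds)


def standard_error(per_seed_std: float, k: int) -> float:
    """Standard error of a K-seed mean, assuming independent seeds."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return per_seed_std / math.sqrt(k)
```

Under these assumptions, moving from K=2 to K=5 shrinks the standard error by a factor of sqrt(2/5), roughly 0.63. A noisier estimate makes it more likely that a weaker candidate is accepted by chance, which is one plausible reading of the regression seen at K=2.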
Significance
This study is the first to systematically demonstrate the feasibility of natural-language constitutional co-evolution under multi-agent adversarial pressure, overcoming limitations of single-agent or cooperative assumptions. By identifying fitness coupling and evaluation budget as critical factors, it advances understanding of how to induce and sustain adversarial dynamics in constitutional AI. The evolved Red constitutions serve as interpretable red-team artifacts, enabling rigorous robustness testing of cooperative governance mechanisms. This work lays foundational theoretical and practical groundwork for multi-agent governance rule design under conflict.
Technical Contribution
Technically, the paper innovates by integrating LLMs as mutation operators within OpenEvolve and MAP-Elites to evolve priority-ordered natural-language constitutional rules. It introduces the concept of score-advantage fitness to enforce fitness coupling, resolving the issue of independent scoring failing to produce adversarial pressure. Furthermore, it uncovers the critical role of evaluation seed count in mitigating fitness estimation noise and preventing mode regression during pure adversarial search. These contributions enrich the theoretical framework and engineering toolkit for LLM-guided evolutionary optimization in multi-agent adversarial contexts.
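For readers unfamiliar with MAP-Elites, the core data structure is an archive that keeps the best candidate found in each cell of a behavior-descriptor grid. The sketch below shows that idea applied to constitutions. The descriptor used here (number of rules and mean rule length, each bucketed) is a hypothetical choice for illustration; the summary does not say which descriptors the paper or OpenEvolve used.

```python
from typing import Dict, List, Optional, Tuple

Constitution = List[str]
Cell = Tuple[int, int]


def describe(constitution: Constitution) -> Cell:
    """Hypothetical descriptor: (rule-count bucket, mean-words-per-rule bucket)."""
    n_rules = len(constitution)
    mean_words = sum(len(rule.split()) for rule in constitution) / max(n_rules, 1)
    return (min(n_rules, 9), min(int(mean_words // 5), 9))


class EliteArchive:
    """Keeps at most one elite (the fittest constitution seen) per descriptor cell."""

    def __init__(self) -> None:
        self.cells: Dict[Cell, Tuple[float, Constitution]] = {}

    def try_insert(self, constitution: Constitution, fitness: float) -> bool:
        """Store the candidate if its cell is empty or it beats the incumbent."""
        cell = describe(constitution)
        incumbent = self.cells.get(cell)
        if incumbent is None or fitness > incumbent[0]:
            self.cells[cell] = (fitness, list(constitution))
            return True
        return False

    def best(self) -> Optional[Tuple[float, Constitution]]:
        """Return the highest-fitness elite across all cells, or None if empty."""
        return max(self.cells.values(), key=lambda entry: entry[0], default=None)
```

Keeping one elite per cell preserves structurally different constitutions even when they are not the global best, which gives the mutation operator a more diverse pool of parents.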
Novelty
This work is the first to explore LLM-guided co-evolution of natural-language constitutions under explicit adversarial pressure, contrasting prior research focused solely on cooperative settings. By combining fitness coupling and evaluation budget control, it achieves stable constitutional arms races and produces interpretable, transferable adversarial constitutions. This fills a gap in multi-agent constitutional AI research, pioneering methods to stress-test governance rules against adaptive social conflict.
Limitations
- Fitness estimation noise depends heavily on evaluation seed count; low budgets cause mode regression, limiting search efficiency and stability.
- Experiments are confined to PGG and a specific grid-world, lacking validation in more complex or real-world multi-agent adversarial scenarios, thus limiting generalizability.
- No transfer experiments tested adversarially evolved constitutions against fresh adversaries, so whether they are more robust in practice remains an open question.
Future Work
Future directions include extending the framework to more complex multi-agent environments with multiple heterogeneous factions, improving LLM mutation operator stability, and scaling evaluation budgets to reduce noise. Conducting transfer and robustness tests of adversarially-evolved constitutions against novel opponents is crucial to validate practical security. These efforts will advance adaptive, interpretable governance rule design for safe multi-agent systems.
AI Executive Summary
Multi-agent systems, especially those powered by large language models (LLMs), face complex dynamics of cooperation and defection. Traditional constitutional AI approaches assume single-agent or cooperative settings, which fall short in adversarial multi-agent environments where agents may engage in sabotage, blackmail, or information leaks. This paper addresses these challenges by proposing an LLM-guided evolutionary search framework that co-evolves natural-language constitutions for two adversarial factions—Blue cooperators and Red free-riders—across two distinct environments: the Public Goods Game (PGG) and a spatial grid-world. The framework leverages OpenEvolve combined with MAP-Elites to iteratively update priority-ordered rule sets that govern agent behavior, simulating a constitutional arms race.
A key technical innovation is the design of fitness functions that induce genuine adversarial pressure. The PGG environment’s inherent payoff coupling naturally fosters competition, leading to a stable near-parity equilibrium around 0.78 after 30 generations, robust across different multiplier parameters. In contrast, independently scored environments lack such coupling, resulting in uncorrelated faction scores and no true adversarial dynamics. Introducing a score-advantage fitness function (S_own - S_opp) restores competition and enables meaningful co-evolution. Additionally, the evaluation seed count K is identified as a critical hyperparameter controlling mode regression in pure adversarial search: K=2 leads to regression, whereas K=5 maintains stable adversarial specialization for all 30 generations.
Experiments further reveal the impact of information asymmetry: when the Red faction observes Blue’s actions, it gains a significant advantage, highlighting the role of information in strategy evolution. Defensive mechanisms, such as requiring coordinated attacks, effectively reduce Red’s dominance, demonstrating practical levers for mechanism design. The evolved Red constitutions, expressed as interpretable natural-language rules, serve as valuable red-team artifacts for testing cooperative governance robustness.
This work advances constitutional AI by extending it from single-agent or cooperative paradigms to multi-agent adversarial settings, providing a methodology to construct and diagnose constitutional arms races. It identifies fitness coupling and evaluation budget as essential conditions for stable co-evolution, offering insights into the design of resilient governance rules. While limitations remain—such as fitness noise sensitivity, environment scope, and lack of transfer testing—the framework lays a foundation for future research in adaptive, interpretable multi-agent governance.
Overall, this study bridges a critical gap in multi-agent alignment research, enabling the development of safer, more robust cooperative mechanisms in adversarial contexts. It opens pathways for deploying multi-agent systems with interpretable, evolvable constitutions capable of withstanding strategic conflicts, thus contributing to the broader goal of trustworthy AI governance.
Deep Analysis
Background
Large language models (LLMs) have revolutionized natural language processing, enabling increasingly autonomous intelligent agents. Constitutional AI (CAI) leverages human-written principles to align LLM behavior, improving safety and helpfulness in single-agent contexts. However, real-world multi-agent systems involve complex interactions including negotiation, competition, and information sharing, where cooperative assumptions break down. Prior work by Kumar et al. demonstrated that LLM-guided evolutionary search can discover effective cooperative constitutions in multi-agent grid-worlds, outperforming hand-crafted baselines. Yet, these studies focused on static cooperative environments, leaving open questions about constitutional evolution under adversarial pressure, the reliability of LLM mutation operators in such contexts, and the influence of different social dilemma structures. Addressing these gaps is critical as multi-agent AI systems are increasingly deployed in settings with conflicting goals and strategic incentives.
Core Problem
The core problem is designing and evolving natural-language constitutions that govern multi-agent behavior under adversarial pressure, enabling stable and interpretable governance in environments with cooperation-defection tensions. Key challenges include: 1) ensuring the fitness function induces genuine adversarial selection pressure rather than independent faction optimization; 2) verifying the reliability and stability of LLM mutation operators when optimizing adversarial-specialist objectives; 3) understanding how different social dilemma structures (e.g., Public Goods Game vs. spatial grid-world) affect constitutional evolution; and 4) controlling evaluation budget and fitness estimation noise to prevent mode regression and maintain robust search dynamics. Solving these challenges is essential for advancing multi-agent alignment beyond cooperative settings.
Innovation
This work introduces several core innovations:
1) An LLM-guided evolutionary search framework that co-evolves priority-ordered natural-language constitutions for two adversarial factions, simulating a constitutional arms race, contrasting prior cooperative-only approaches.
2) The formulation and empirical validation of score-advantage fitness (S_own - S_opp) as a necessary mechanism to induce fitness coupling and genuine adversarial dynamics, overcoming the failure of independent scoring.
3) Identification of evaluation seed count K as a critical hyperparameter controlling mode regression in pure adversarial search, highlighting evaluation budget as a key lever alongside mutation operator design.
4) Comprehensive experimentation across structurally distinct environments (PGG and grid-world), demonstrating the generality of the approach and the influence of environmental payoff structure on evolutionary dynamics.
5) Generation of interpretable, transferable adversarial constitutions that serve as red-team artifacts, enabling rigorous robustness testing of cooperative governance mechanisms.
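The three fitness modes named in this work differ only in how the two faction scores are combined. The sketch below writes them out; the mode labels (`"own"`, `"advantage"`, `"adversary"`) and the function name are illustrative assumptions, while the formulas follow the definitions given in the text (faction score, S_own - S_opp, and 1 - S_opp).

```python
def fitness(mode: str, s_own: float, s_opp: float) -> float:
    """Combine own and opponent scores under one of three fitness modes."""
    if mode == "own":
        # Independent scoring: the opponent's score has no effect on fitness.
        return s_own
    if mode == "advantage":
        # Score-advantage: couples the factions, since hurting the opponent helps.
        return s_own - s_opp
    if mode == "adversary":
        # Pure adversary: only suppressing the opponent's score matters.
        return 1.0 - s_opp
    raise ValueError(f"unknown fitness mode: {mode!r}")
```

As a worked example using the final PGG scores reported above, Blue's score advantage would be 0.777 - 0.782 = -0.005, consistent with the near-parity description. Note that only the last two modes depend on `s_opp`, which is why independent scoring cannot by itself create adversarial pressure.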
Methodology
- Environment Setup: Two environments are used: a Public Goods Game (PGG) with 6 agents (3 Blue cooperators, 3 Red free-riders), 20 rounds, and multiplier m in {1.2, 1.5, 2.0, 3.0}; and an 8×8 spatial grid-world with two hidden factions, each controlling 3 agents.
- Constitutional Representation: Each faction’s constitution is a priority-ordered set of natural-language rules that agents follow verbatim during decision-making.
- Evolutionary Framework: OpenEvolve combined with MAP-Elites is employed to perform LLM-guided mutation and selection, alternating updates between Blue and Red factions to maximize fitness.
- Fitness Functions: Three modes are explored: faction-specific score S_own, score-advantage S_own - S_opp, and pure adversarial 1 - S_opp. PGG’s shared pool naturally couples payoffs; grid-world experiments require score-advantage fitness to induce adversarial pressure.
- Evaluation Strategy: Multiple random seeds (K=2 or 5) are used to estimate fitness, mitigating noise and preventing mode regression.
- Experimental Protocol: Runs span 30 generations, tracking score trajectories, equilibrium convergence, information asymmetry effects, and defensive mechanism impacts.
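To make the shared-pool coupling concrete, here is a single round of a textbook linear Public Goods Game. This is an assumption-laden sketch: the summary does not give the paper's exact payoff rule, endowment, or how raw payoffs are normalized into scores, so the unit endowment and the function name `pgg_round` are illustrative only.

```python
from typing import List, Sequence


def pgg_round(
    contributions: Sequence[float], multiplier: float, endowment: float = 1.0
) -> List[float]:
    """One round of a standard linear Public Goods Game.

    Each agent keeps what it did not contribute and receives an equal
    share of the multiplied common pool.
    """
    n = len(contributions)
    if n == 0:
        raise ValueError("need at least one agent")
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]


# Three full contributors followed by three free-riders, with m = 2.0.
payoffs = pgg_round([1.0, 1.0, 1.0, 0.0, 0.0, 0.0], multiplier=2.0)
print(payoffs)  # [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
```

In this textbook form, every agent's payoff depends on everyone's contributions through the shared pool, so one faction's behavior directly changes the other's score. With 6 agents, each unit contributed returns only m/6 to the contributor, which is below 1 for every multiplier listed above, so free-riding is individually tempting even though full contribution maximizes the group total.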
Experiments
Experiments are organized into PGG and grid-world categories. In the PGG, the initial Blue score is 0.370 and the initial Red score is 0.177; after 30 generations, both converge near 0.78, a result robust across the tested multipliers m. Pure adversarial fitness experiments show that Red can effectively suppress Blue’s score. Grid-world experiments reveal that independent scoring fails to induce adversarial dynamics; adopting score-advantage fitness restores competition. Fixing Blue’s cooperative constitution C* and evolving Red from a zero-sum seed yields a mean Red advantage of -0.27 (Red scores below Blue), indicating that C* is structurally resilient. Requiring coordinated attacks lowers Red’s advantage further, to -0.66. In information asymmetry experiments, Red’s advantage rises to +0.415 when it can observe Blue’s actions. Pure adversarial search with K=2 suffers mode regression; increasing to K=5 stabilizes specialization, underscoring the importance of the evaluation budget.
Results
In PGG, Blue’s score rose from 0.370 to 0.777 and Red’s from 0.177 to 0.782 over 30 generations, converging to a stable near-parity equilibrium (~0.78) robust across multipliers m. Independently scored environments showed negligible correlation between faction scores (corr=+0.088), failing to produce adversarial pressure; score-advantage fitness restored competitive dynamics. Pure adversarial fitness experiments revealed evaluation seed count K as a pivotal factor: K=2 led to mode regression, whereas K=5 maintained stable adversarial specialization throughout 30 generations. Information asymmetry experiments demonstrated Red’s significant advantage (+0.415) when observing Blue’s actions. Defensive mechanisms requiring coordinated attacks effectively reduced Red’s mean advantage from -0.27 to -0.66, illustrating practical mechanism design levers.
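The coupling diagnostic reported here, corr(S_B, S_R), can be computed from paired per-run faction scores. The sketch below assumes the statistic is a Pearson correlation, which the summary does not state explicitly; the function name and the idea of collecting one (S_B, S_R) pair per evaluation are also assumptions.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

A value near zero, such as the +0.088 reported, suggests that one faction's score carries little linear information about the other's. A strongly negative value would be the signature expected if gains for one side came at the other's expense.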
Applications
The findings apply to governance rule design in multi-agent systems involving cooperation and competition, such as automated economic markets, distributed resource management, and security red teaming. The evolved natural-language constitutions are interpretable, facilitating human expert review and adjustment to enhance transparency and safety. Red constitutions serve as red-team artifacts for robustness testing of cooperative mechanisms, aiding secure deployment and regulation of multi-agent AI systems.
Limitations & Outlook
Fitness estimation noise is heavily dependent on evaluation seed count; low budgets cause mode regression, limiting search efficiency and stability. Experiments are limited to PGG and a specific grid-world, lacking validation in more complex or real-world multi-agent adversarial scenarios, thus limiting generalizability. No transfer experiments comparing adversarially-evolved constitutions against novel adversaries were conducted, leaving open questions about practical robustness superiority.
Abstract
Frontier LLM agents engage in blackmail, sabotage, and document leaks under goal conflicts in agentic settings, exposing limitations of alignment methods built around single-agent or cooperative assumptions. Recent work shows LLM-guided evolutionary search can discover effective cooperative constitutions, but two properties of the adversarial setting remain uncharacterized: whether the fitness function actually induces adversarial pressure, and whether the LLM mutation operator behaves reliably under adversarial-specialist objectives. We study adversarial constitutional co-evolution (Blue cooperators vs. Red free-riders, 30 generations) across a Public Goods Game (PGG) and a spatial grid-world. Three findings: (1) in the PGG, both factions converge to a near-parity equilibrium at S approximately 0.78, robust across tested multipliers m in {1.2, 1.5, 2.0, 3.0}; (2) in independently scored environments, per-faction scoring leaves outcomes statistically uncoupled, with corr(S_B, S_R) = +0.088, and produces no adversarial pressure; a score-advantage fitness target S_own - S_opp restores it; (3) under pure-adversary fitness, evaluation seed count K controls mode regression: K = 2 regresses, while K = 5 sustains a strong specialist for all 30 generations. Adversarial co-evolution of natural-language constitutions is feasible, but only under coupled fitness and adequate evaluation budget; the evolved Red constitutions serve as interpretable red-team artifacts for testing future cooperative designs.
References (20)
Deep Learning Meets Mechanism Design: Key Results and Some Novel Applications
V. Sankar, Vishisht Srihari Rao, Mayank Ratan Bhardwaj et al.
An Interpretable Automated Mechanism Design Framework with Large Language Models
Jiayuan Liu, Mingyu Guo, Vincent Conitzer
Evolving Interpretable Constitutions for Multi-Agent Coordination
Ujwal Kumar, A. Saito, Hershraj Niranjani et al.
Dota 2 with Large Scale Deep Reinforcement Learning
Christopher Berner, Greg Brockman, Brooke Chan et al.
Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning
Natasha Jaques, Angeliki Lazaridou, Edward Hughes et al.
Agentic Misalignment: How LLMs Could Be Insider Threats
Aengus Lynch, Benjamin Wright, Caleb Larson et al.
The Coming Crisis of Multi-Agent Misalignment: AI Alignment Must Be a Dynamic and Social Process
F. Carichon, Aditi Khandelwal, Marylou Fauchard et al.
Volunteering as Red Queen Mechanism for Cooperation in Public Goods Games
C. Hauert, S. De Monte, J. Hofbauer et al.
Multi-agent Reinforcement Learning in Sequential Social Dilemmas
Joel Z. Leibo, V. Zambaldi, Marc Lanctot et al.
Mathematical discoveries from program search with large language models
B. Romera-Paredes, M. Barekatain, Alexander Novikov et al.
Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions
Rui Wang, J. Lehman, J. Clune et al.
The evolution of cooperation
R. May
Inequity aversion improves cooperation in intertemporal social dilemmas
Edward Hughes, Joel Z. Leibo, Matthew Phillips et al.
Evolving AI Collectives to Enhance Human Diversity and Enable Self-Regulation
Shiyang Lai, Yujin Potter, Junsol Kim et al.
Mastering the game of Go with deep neural networks and tree search
David Silver, Aja Huang, Chris J. Maddison et al.
Human-centred mechanism design with Democratic AI
R. Koster, Jan Balaguer, Andrea Tacchetti et al.
Generative Agents: Interactive Simulacra of Human Behavior
J. Park, Joseph O'Brien, Carrie J. Cai et al.
Evolution through Large Models
J. Lehman, Jonathan Gordon, Shawn Jain et al.
Cooperation and Punishment in Public Goods Experiments
E. Fehr, S. Gächter
Illuminating search spaces by mapping elites
Jean-Baptiste Mouret, J. Clune