Constitutional Arms Races in the Public Goods Game: Co-Evolving LLM Constitutions Under Cooperation-Defection Pressure

TL;DR

LLM-guided co-evolution of natural-language constitutions converges to a stable near-parity equilibrium (S ≈ 0.78) in a Public Goods Game arms race.

cs.MA 2026-05-26
Ujwal Kumar, Arth Singh, Hershraj Niranjani, Machiko Hirota, Takehiro Takayanagi, Alice Saito, Eiji Kamioka, Phan Xuan Tan
Constitutional AI Public Goods Game Multi-agent Adversarial LLM Evolutionary Search Mechanism Design

Key Findings

Methodology

This paper introduces an LLM-guided evolutionary search framework for co-evolving natural-language constitutions in multi-agent adversarial settings. Specifically, it studies a constitutional arms race between Blue cooperators and Red free-riders across a Public Goods Game (PGG) and a spatial grid-world environment. Each faction’s constitution is a priority-ordered set of natural-language rules that agents follow verbatim. The evolutionary process alternates updates between factions using OpenEvolve combined with MAP-Elites. Three fitness functions are compared: the faction-specific score, the score-advantage (S_own - S_opp), and a pure adversarial objective. Two factors prove critical for inducing genuine adversarial pressure and keeping the search stable: coupling between the two factions’ fitness, and the evaluation seed count K.
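The three fitness modes can be written down compactly. The following is a minimal sketch, not the authors' code: the names `FitnessMode` and `fitness` are hypothetical, and the pure adversarial form `1 - S_opp` is taken from the methodology description elsewhere in this summary.

```python
from enum import Enum


class FitnessMode(Enum):
    OWN = "own"              # faction-specific score: S_own
    ADVANTAGE = "advantage"  # score-advantage: S_own - S_opp
    ADVERSARY = "adversary"  # pure adversarial: 1 - S_opp


def fitness(mode: FitnessMode, s_own: float, s_opp: float) -> float:
    """Return the scalar that selection maximizes for one faction."""
    if mode is FitnessMode.OWN:
        # Ignores the opponent entirely, so it creates no adversarial pressure
        # unless the environment itself couples the two scores.
        return s_own
    if mode is FitnessMode.ADVANTAGE:
        # Couples the factions: hurting the opponent raises own fitness.
        return s_own - s_opp
    if mode is FitnessMode.ADVERSARY:
        # Rewards only suppression of the opponent's score.
        return 1.0 - s_opp
    raise ValueError(f"unknown fitness mode: {mode}")
```

Only `OWN` is independent of the opponent's score, which is why it relies on the environment to supply any coupling.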

Key Results

  • In the PGG, Blue faction’s score improved from 0.370 to 0.777 and Red faction’s from 0.177 to 0.782 over 30 generations, converging to a stable near-parity equilibrium around 0.78, robust across multipliers m in {1.2, 1.5, 2.0, 3.0}.
  • In independently scored environments, faction scores were statistically uncorrelated (corr(S_B, S_R) = +0.088), yielding no adversarial pressure. Introducing score-advantage fitness restored adversarial dynamics and enabled a constitutional arms race.
  • Under pure-adversary fitness, evaluation seed count K controlled mode regression: K=2 led to regression, while K=5 sustained strong adversarial specialization for all 30 generations, highlighting evaluation budget as a key lever.
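To make the payoff coupling in the PGG concrete, here is a minimal sketch of one round of the standard linear Public Goods Game. It assumes a unit endowment and an equal split of the multiplied pool; the paper's exact scoring and normalization (which yield values such as 0.78) are not specified in this summary, so `pgg_round_payoffs` and its parameters are illustrative only.

```python
def pgg_round_payoffs(contributions, multiplier, endowment=1.0):
    """Raw payoffs for one round of a linear Public Goods Game.

    Each agent keeps (endowment - contribution) and receives an equal
    share of the pooled contributions after multiplication.
    """
    n = len(contributions)
    if n == 0:
        raise ValueError("need at least one agent")
    for c in contributions:
        if not 0.0 <= c <= endowment:
            raise ValueError("contribution must lie in [0, endowment]")
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]


# Hypothetical round: 3 full contributors, 3 free-riders, m = 1.5.
payoffs = pgg_round_payoffs([1.0, 1.0, 1.0, 0.0, 0.0, 0.0], multiplier=1.5)
print(payoffs)
```

Because every agent's payoff includes the shared pool, a change in one faction's contributions moves the other faction's payoff as well. That is the coupling the independently scored environment lacks.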

Significance

This study is the first to systematically demonstrate the feasibility of natural-language constitutional co-evolution under multi-agent adversarial pressure, overcoming limitations of single-agent or cooperative assumptions. By identifying fitness coupling and evaluation budget as critical factors, it advances understanding of how to induce and sustain adversarial dynamics in constitutional AI. The evolved Red constitutions serve as interpretable red-team artifacts, enabling rigorous robustness testing of cooperative governance mechanisms. This work lays foundational theoretical and practical groundwork for multi-agent governance rule design under conflict.

Technical Contribution

Technically, the paper innovates by integrating LLMs as mutation operators within OpenEvolve and MAP-Elites to evolve priority-ordered natural-language constitutional rules. It introduces the concept of score-advantage fitness to enforce fitness coupling, resolving the issue of independent scoring failing to produce adversarial pressure. Furthermore, it uncovers the critical role of evaluation seed count in mitigating fitness estimation noise and preventing mode regression during pure adversarial search. These contributions enrich the theoretical framework and engineering toolkit for LLM-guided evolutionary optimization in multi-agent adversarial contexts.
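The role of the seed count K can be illustrated with a small estimator. This is a sketch under the assumption that per-seed scores are independent draws. `estimate_fitness` and the `evaluate` callable are hypothetical stand-ins for running one evaluation episode of a constitution.

```python
import math
import statistics
from typing import Callable


def estimate_fitness(
    evaluate: Callable[[int], float], seeds: list[int]
) -> tuple[float, float]:
    """Average a noisy fitness over K = len(seeds) evaluation seeds.

    Returns (mean, standard_error). The standard error is NaN when K < 2
    because a sample standard deviation needs at least two scores.
    """
    scores = [evaluate(seed) for seed in seeds]
    mean = statistics.fmean(scores)
    if len(scores) < 2:
        return mean, float("nan")
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr
```

Under the independence assumption the standard error scales as 1/sqrt(K), so moving from K=2 to K=5 shrinks it by a factor of about 1.58. A noisier estimate makes it more likely that a weaker candidate displaces a stronger one by chance, which is one plausible reading of the regression seen at K=2.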

Novelty

This work is the first to explore LLM-guided co-evolution of natural-language constitutions under explicit adversarial pressure, in contrast to prior research focused solely on cooperative settings. By combining fitness coupling and evaluation budget control, it achieves stable constitutional arms races and produces interpretable adversarial constitutions. This fills a gap in multi-agent constitutional AI research by providing a method to stress-test governance rules against adaptive social conflict.

Limitations

  • Fitness estimation noise depends heavily on evaluation seed count; low budgets cause mode regression, limiting search efficiency and stability.
  • Experiments are confined to PGG and a specific grid-world, lacking validation in more complex or real-world multi-agent adversarial scenarios, thus limiting generalizability.
  • No transfer experiments were conducted to compare robustness of adversarially-evolved constitutions against fresh adversaries, leaving open questions on practical superiority.

Future Work

Future directions include extending the framework to more complex multi-agent environments with multiple heterogeneous factions, improving LLM mutation operator stability, and scaling evaluation budgets to reduce noise. Conducting transfer and robustness tests of adversarially-evolved constitutions against novel opponents is crucial to validate practical security. These efforts will advance adaptive, interpretable governance rule design for safe multi-agent systems.

AI Executive Summary

Multi-agent systems, especially those powered by large language models (LLMs), face complex dynamics of cooperation and defection. Traditional constitutional AI approaches assume single-agent or cooperative settings, which fall short in adversarial multi-agent environments where agents may engage in sabotage, blackmail, or information leaks. This paper addresses these challenges by proposing an LLM-guided evolutionary search framework that co-evolves natural-language constitutions for two adversarial factions—Blue cooperators and Red free-riders—across two distinct environments: the Public Goods Game (PGG) and a spatial grid-world. The framework leverages OpenEvolve combined with MAP-Elites to iteratively update priority-ordered rule sets that govern agent behavior, simulating a constitutional arms race.
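A priority-ordered rule set of this kind can be represented very simply. The sketch below is an assumption about one reasonable encoding, not the paper's data format: the `Constitution` class, its `render` prompt layout, and the `promote` edit are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Constitution:
    faction: str
    # Index 0 holds the highest-priority rule.
    rules: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the rules as numbered text for inclusion in an agent prompt."""
        header = f"Constitution for faction {self.faction} (earlier rules take priority):"
        body = [f"{i}. {rule}" for i, rule in enumerate(self.rules, start=1)]
        return "\n".join([header, *body])

    def promote(self, index: int) -> "Constitution":
        """Return a copy in which the rule at `index` moves up one priority level."""
        if not 0 < index < len(self.rules):
            raise IndexError("index must refer to a rule below the top")
        rules = list(self.rules)
        rules[index - 1], rules[index] = rules[index], rules[index - 1]
        return Constitution(self.faction, rules)
```

Keeping the rules as an ordered list means a mutation can add, rewrite, delete, or reorder rules. The rendered text is what agents read verbatim.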

A key technical innovation is the design of fitness functions that induce genuine adversarial pressure. The PGG environment’s inherent payoff coupling naturally fosters competition, leading to a stable near-parity equilibrium around 0.78 after 30 generations, robust across the tested multiplier values. In contrast, independently scored environments lack such coupling, resulting in uncorrelated faction scores and no true adversarial dynamics. Introducing a score-advantage fitness function (S_own - S_opp) restores competition and enables meaningful co-evolution. Additionally, the evaluation seed count K is identified as a critical hyperparameter controlling mode regression in pure adversarial search: with K=2 the search regresses, whereas K=5 sustains stable adversarial specialization for all 30 generations.
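The coupling diagnostic reported as corr(S_B, S_R) can be reproduced with a plain Pearson correlation over paired faction scores. The helper below is a sketch. How the paper pairs the scores (per episode, per generation, or per candidate) is not stated in this summary, so the input lists are an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long score sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A value near zero, such as the reported +0.088, suggests that improving one faction's score neither helps nor hurts the other. A strongly negative value is what a coupled, zero-sum-like objective would be expected to produce.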

Experiments further reveal the impact of information asymmetry: when the Red faction observes Blue’s actions, it gains a significant advantage, highlighting the role of information in strategy evolution. Defensive mechanisms, such as requiring coordinated attacks, effectively reduce Red’s dominance, demonstrating practical levers for mechanism design. The evolved Red constitutions, expressed as interpretable natural-language rules, serve as valuable red-team artifacts for testing cooperative governance robustness.

This work advances constitutional AI by extending it from single-agent or cooperative paradigms to multi-agent adversarial settings, providing a methodology to construct and diagnose constitutional arms races. It identifies fitness coupling and evaluation budget as essential conditions for stable co-evolution, offering insights into the design of resilient governance rules. While limitations remain—such as fitness noise sensitivity, environment scope, and lack of transfer testing—the framework lays a foundation for future research in adaptive, interpretable multi-agent governance.

Overall, this study bridges a critical gap in multi-agent alignment research, enabling the development of safer, more robust cooperative mechanisms in adversarial contexts. It opens pathways for deploying multi-agent systems with interpretable, evolvable constitutions capable of withstanding strategic conflicts, thus contributing to the broader goal of trustworthy AI governance.

Deep Analysis

Background

Large language models (LLMs) have revolutionized natural language processing, enabling increasingly autonomous intelligent agents. Constitutional AI (CAI) leverages human-written principles to align LLM behavior, improving safety and helpfulness in single-agent contexts. However, real-world multi-agent systems involve complex interactions including negotiation, competition, and information sharing, where cooperative assumptions break down. Prior work by Kumar et al. demonstrated that LLM-guided evolutionary search can discover effective cooperative constitutions in multi-agent grid-worlds, outperforming hand-crafted baselines. Yet, these studies focused on static cooperative environments, leaving open questions about constitutional evolution under adversarial pressure, the reliability of LLM mutation operators in such contexts, and the influence of different social dilemma structures. Addressing these gaps is critical as multi-agent AI systems are increasingly deployed in settings with conflicting goals and strategic incentives.

Core Problem

The core problem is designing and evolving natural-language constitutions that govern multi-agent behavior under adversarial pressure, enabling stable and interpretable governance in environments with cooperation-defection tensions. Key challenges include: 1) ensuring the fitness function induces genuine adversarial selection pressure rather than independent faction optimization; 2) verifying the reliability and stability of LLM mutation operators when optimizing adversarial-specialist objectives; 3) understanding how different social dilemma structures (e.g., Public Goods Game vs. spatial grid-world) affect constitutional evolution; and 4) controlling evaluation budget and fitness estimation noise to prevent mode regression and maintain robust search dynamics. Solving these challenges is essential for advancing multi-agent alignment beyond cooperative settings.

Innovation

This work introduces several core innovations:

1) An LLM-guided evolutionary search framework that co-evolves priority-ordered natural-language constitutions for two adversarial factions, simulating a constitutional arms race, in contrast to prior cooperative-only approaches.

2) The formulation and empirical validation of score-advantage fitness (S_own - S_opp) as a necessary mechanism to induce fitness coupling and genuine adversarial dynamics, overcoming the failure of independent scoring.

3) Identification of evaluation seed count K as a critical hyperparameter controlling mode regression in pure adversarial search, highlighting evaluation budget as a key lever alongside mutation operator design.

4) Experiments across two structurally distinct environments (PGG and grid-world), showing how the environment's payoff structure shapes evolutionary dynamics.

5) Generation of interpretable adversarial constitutions that serve as red-team artifacts for testing the robustness of cooperative governance mechanisms; their transfer to unseen opponents is not evaluated.

Methodology

  • Environment Setup: Two environments are used: a Public Goods Game (PGG) with 6 agents (3 Blue cooperators, 3 Red free-riders), 20 rounds, and multiplier m in {1.2, 1.5, 2.0, 3.0}; and an 8×8 spatial grid-world with two hidden factions, each controlling 3 agents.
  • Constitutional Representation: Each faction’s constitution is a priority-ordered set of natural-language rules that agents follow verbatim during decision-making.
  • Evolutionary Framework: OpenEvolve combined with MAP-Elites is employed to perform LLM-guided mutation and selection, alternating updates between Blue and Red factions to maximize fitness.
  • Fitness Functions: Three modes are explored: faction-specific score S_own, score-advantage S_own - S_opp, and pure adversarial 1 - S_opp. PGG’s shared pool naturally couples payoffs; grid-world experiments require score-advantage fitness to induce adversarial pressure.
  • Evaluation Strategy: Multiple random seeds (K=2 or 5) are used to estimate fitness, mitigating noise and preventing mode regression.
  • Experimental Protocol: Runs span 30 generations, tracking score trajectories, equilibrium convergence, information asymmetry effects, and defensive mechanism impacts.
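The alternating update scheme described in this list can be sketched as a toy loop. This is not OpenEvolve's API and not the authors' implementation: `mutate` stands in for the LLM mutation operator, `evaluate` for an environment rollout returning (S_B, S_R), and `descriptor` for a MAP-Elites behavior descriptor, all of which are hypothetical. Score-advantage fitness is assumed.

```python
import random
from typing import Callable

Constitution = tuple[str, ...]
# Archive maps a descriptor cell to (fitness, elite constitution).
Archive = dict[int, tuple[float, Constitution]]


def insert_elite(archive: Archive, cell: int, fit: float, cand: Constitution) -> bool:
    """MAP-Elites rule: keep a candidate if its cell is empty or it beats the incumbent."""
    if cell not in archive or fit > archive[cell][0]:
        archive[cell] = (fit, cand)
        return True
    return False


def best(archive: Archive) -> Constitution:
    """Return the highest-fitness elite across all cells."""
    return max(archive.values(), key=lambda entry: entry[0])[1]


def coevolve(
    seed_blue: Constitution,
    seed_red: Constitution,
    mutate: Callable[[Constitution], Constitution],
    evaluate: Callable[[Constitution, Constitution], tuple[float, float]],
    descriptor: Callable[[Constitution], int],
    generations: int,
    rng: random.Random,
) -> dict[str, Archive]:
    archives: dict[str, Archive] = {"blue": {}, "red": {}}
    s_b, s_r = evaluate(seed_blue, seed_red)
    insert_elite(archives["blue"], descriptor(seed_blue), s_b - s_r, seed_blue)
    insert_elite(archives["red"], descriptor(seed_red), s_r - s_b, seed_red)

    for gen in range(generations):
        # Alternate which faction is updated; the other side is held fixed.
        side, other = ("blue", "red") if gen % 2 == 0 else ("red", "blue")
        parent = rng.choice(list(archives[side].values()))[1]
        child = mutate(parent)
        opponent = best(archives[other])
        if side == "blue":
            s_b, s_r = evaluate(child, opponent)
            fit = s_b - s_r
        else:
            s_b, s_r = evaluate(opponent, child)
            fit = s_r - s_b
        insert_elite(archives[side], descriptor(child), fit, child)
    return archives
```

One simplification to keep in mind: stored fitness values were measured against whichever opponent was current at insertion time, so they can go stale as the other faction evolves. A fuller version would periodically re-evaluate elites against the current opponent.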

Experiments

Experiments are organized into PGG and grid-world categories. In the PGG, the initial Blue score is 0.370 and the initial Red score is 0.177; after 30 generations, both converge near 0.78, a result that is robust across the tested multipliers m. Pure adversarial fitness experiments show that Red can effectively suppress Blue’s score. Grid-world experiments reveal that independent scoring fails to induce adversarial dynamics; adopting score-advantage fitness restores competition. Fixing Blue’s cooperative constitution C* and evolving Red from a zero-sum seed yields a mean Red advantage of -0.27; because the value is negative, Red still trails Blue, which indicates that C* is structurally resilient. Requiring coordinated attacks lowers Red’s advantage further, to -0.66. In the information asymmetry experiments, Red’s advantage rises to +0.415 when it can observe Blue’s actions. Pure adversarial search with K=2 suffers mode regression; increasing to K=5 stabilizes specialization, underscoring the importance of the evaluation budget.
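For context on the tested multipliers, the incentive structure of a standard linear PGG can be worked out directly. The formulation below assumes an endowment e, contributions c_i in [0, e], and an equal split of the multiplied pool among N agents; the paper's exact payoff definition may differ.

```latex
\pi_i = e - c_i + \frac{m}{N}\sum_{j=1}^{N} c_j,
\qquad
\frac{\partial \pi_i}{\partial c_i} = \frac{m}{N} - 1,
\qquad
\sum_{i=1}^{N} \pi_i = N e + (m - 1)\sum_{j=1}^{N} c_j .
```

With N = 6 and m in {1.2, 1.5, 2.0, 3.0}, the marginal per-capita return m/N takes the values 0.2, 0.25, about 0.33, and 0.5. Each is below 1, so every unit contributed costs the contributor 1 - m/N, while m > 1 means the group total grows by m - 1 per unit contributed. Under this assumed formulation, all four tested settings are therefore genuine social dilemmas.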

Results

In PGG, Blue’s score rose from 0.370 to 0.777 and Red’s from 0.177 to 0.782 over 30 generations, converging to a stable near-parity equilibrium (~0.78) robust across multipliers m. Independently scored environments showed negligible correlation between faction scores (corr=+0.088), failing to produce adversarial pressure; score-advantage fitness restored competitive dynamics. Pure adversarial fitness experiments revealed evaluation seed count K as a pivotal factor: K=2 led to mode regression, whereas K=5 maintained stable adversarial specialization throughout 30 generations. Information asymmetry experiments demonstrated Red’s significant advantage (+0.415) when observing Blue’s actions. Defensive mechanisms requiring coordinated attacks effectively reduced Red’s mean advantage from -0.27 to -0.66, illustrating practical mechanism design levers.

Applications

The findings apply to governance rule design in multi-agent systems involving cooperation and competition, such as automated economic markets, distributed resource management, and security red teaming. The evolved natural-language constitutions are interpretable, facilitating human expert review and adjustment to enhance transparency and safety. Red constitutions serve as red-team artifacts for robustness testing of cooperative mechanisms, aiding secure deployment and regulation of multi-agent AI systems.

Limitations & Outlook

Fitness estimation noise is heavily dependent on evaluation seed count; low budgets cause mode regression, limiting search efficiency and stability. Experiments are limited to PGG and a specific grid-world, lacking validation in more complex or real-world multi-agent adversarial scenarios, thus limiting generalizability. No transfer experiments comparing adversarially-evolved constitutions against novel adversaries were conducted, leaving open questions about practical robustness superiority.

Abstract

Frontier LLM agents engage in blackmail, sabotage, and document leaks under goal conflicts in agentic settings, exposing limitations of alignment methods built around single-agent or cooperative assumptions. Recent work shows LLM-guided evolutionary search can discover effective cooperative constitutions, but two properties of the adversarial setting remain uncharacterized: whether the fitness function actually induces adversarial pressure, and whether the LLM mutation operator behaves reliably under adversarial-specialist objectives. We study adversarial constitutional co-evolution (Blue cooperators vs. Red free-riders, 30 generations) across a Public Goods Game (PGG) and a spatial grid-world. Three findings: (1) in the PGG, both factions converge to a near-parity equilibrium at S approximately 0.78, robust across tested multipliers m in {1.2, 1.5, 2.0, 3.0}; (2) in independently scored environments, per-faction scoring leaves outcomes statistically uncoupled, with corr(S_B, S_R) = +0.088, and produces no adversarial pressure; a score-advantage fitness target S_own - S_opp restores it; (3) under pure-adversary fitness, evaluation seed count K controls mode regression: K = 2 regresses, while K = 5 sustains a strong specialist for all 30 generations. Adversarial co-evolution of natural-language constitutions is feasible, but only under coupled fitness and adequate evaluation budget; the evolved Red constitutions serve as interpretable red-team artifacts for testing future cooperative designs.

cs.MA cs.GT cs.NE

References (20)

Deep Learning Meets Mechanism Design: Key Results and Some Novel Applications

V. Sankar, Vishisht Srihari Rao, Mayank Ratan Bhardwaj et al.

2024 2 citations

An Interpretable Automated Mechanism Design Framework with Large Language Models

Jiayuan Liu, Mingyu Guo, Vincent Conitzer

2025 6 citations

Evolving Interpretable Constitutions for Multi-Agent Coordination

Ujwal Kumar, A. Saito, Hershraj Niranjani et al.

2026 2 citations

Dota 2 with Large Scale Deep Reinforcement Learning

Christopher Berner, Greg Brockman, Brooke Chan et al.

2019 2142 citations

Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning

Natasha Jaques, Angeliki Lazaridou, Edward Hughes et al.

2018 543 citations

Agentic Misalignment: How LLMs Could Be Insider Threats

Aengus Lynch, Benjamin Wright, Caleb Larson et al.

2025 111 citations

The Coming Crisis of Multi-Agent Misalignment: AI Alignment Must Be a Dynamic and Social Process

F. Carichon, Aditi Khandelwal, Marylou Fauchard et al.

2025 11 citations

Volunteering as Red Queen Mechanism for Cooperation in Public Goods Games

C. Hauert, S. De Monte, J. Hofbauer et al.

2002 1013 citations

Multi-agent Reinforcement Learning in Sequential Social Dilemmas

Joel Z. Leibo, V. Zambaldi, Marc Lanctot et al.

2017 689 citations

Mathematical discoveries from program search with large language models

B. Romera-Paredes, M. Barekatain, Alexander Novikov et al.

2023 922 citations

Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions

Rui Wang, J. Lehman, J. Clune et al.

2019 295 citations

The evolution of cooperation

R. May

1981 23037 citations

Inequity aversion improves cooperation in intertemporal social dilemmas

Edward Hughes, Joel Z. Leibo, Matthew Phillips et al.

2018 261 citations

Evolving AI Collectives to Enhance Human Diversity and Enable Self-Regulation

Shiyang Lai, Yujin Potter, Junsol Kim et al.

2024 12 citations

Mastering the game of Go with deep neural networks and tree search

David Silver, Aja Huang, Chris J. Maddison et al.

2016 18839 citations

Human-centred mechanism design with Democratic AI

R. Koster, Jan Balaguer, Andrea Tacchetti et al.

2022 109 citations

Generative Agents: Interactive Simulacra of Human Behavior

J. Park, Joseph O'Brien, Carrie J. Cai et al.

2023 4190 citations

Evolution through Large Models

J. Lehman, Jonathan Gordon, Shawn Jain et al.

2022 144 citations

Cooperation and Punishment in Public Goods Experiments

E. Fehr, S. Gächter

2000 4425 citations

Illuminating search spaces by mapping elites

Jean-Baptiste Mouret, J. Clune

2015 899 citations