Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning

TL;DR

Code-A1 enhances code and test generation through an adversarial co-evolution framework.

cs.CL · 2026-03-17
Aozhe Wang, Yuchen Yan, Nan Zhou, Zhengxi Lu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
adversarial learning · code generation · test generation · reinforcement learning · large language models

Key Findings

Methodology

Code-A1 employs an adversarial co-evolution framework to optimize a Code LLM and a Test LLM using reinforcement learning. The Code LLM aims to pass more tests, while the Test LLM seeks to expose more defects. A Mistake Book mechanism is introduced for experience replay, and a composite reward system balances test validity with adversarial difficulty.

Key Results

  • Experiments on Qwen2.5-Coder models show that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, with significantly improved test generation capabilities. For instance, the 3B model achieves a Mul score of 15.29, surpassing the 7B base model's 14.72.
  • In code generation benchmarks, the 1.5B model of Code-A1 achieves an average accuracy of 56.95%, surpassing the Golden Tests baseline of 56.23% and the Self-Play approach of 55.88%.
  • For test generation, Code-A1's Test LLM evolves from simple validity to high discriminatory power, with the 3B model achieving a Mul score of 15.29, significantly outperforming SFT's 8.53.

Significance

Code-A1 addresses a core limitation of static rewards: they fail to adapt as model capabilities improve. By replacing them with adversarial, co-evolving rewards, it significantly enhances both code and test generation. The method is significant for research and offers industry a scalable way to reduce reliance on human-annotated test suites.

Technical Contribution

The technical contributions of Code-A1 include its adversarial co-evolution framework, which separates code and test generation tasks to eliminate self-collusion risks and enable safe white-box test generation. Additionally, the Mistake Book mechanism and composite reward design provide stable training signals.

Novelty

Code-A1 is the first to introduce adversarial co-evolution into code RL, enabling dynamic and adaptive verifiable rewards. This innovation allows the model to continuously evolve beyond any static performance ceiling.

Limitations

  • Code-A1 may perform suboptimally in extremely complex code and test scenarios, as the current model scale and training data may not cover all possible edge cases.
  • While the Mistake Book mechanism is introduced, experience replay on large datasets may lead to increased computational overhead.
  • Adversarial training may cause the model to overfit to specific types of tests in certain situations.

Future Work

Future research directions include extending the applicability of Code-A1 to more diverse code and test scenarios, optimizing the efficiency of adversarial training, and exploring the effective application of the Mistake Book mechanism on larger datasets.

AI Executive Summary

In the field of code generation, traditional reinforcement learning methods rely on static human-annotated test suites, which suffer from limited coverage and inability to adapt to improving model capabilities. Existing self-play methods attempt to unify code and test generation within a single model but face issues of self-collusion with white-box access and generic tests with black-box restrictions.

Code-A1 introduces an adversarial co-evolution framework that separates the Code LLM and Test LLM, each optimized for opposing objectives. The Code LLM aims to pass more tests, while the Test LLM seeks to expose more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation.

Technically, Code-A1 incorporates a Mistake Book mechanism for experience replay and designs a composite reward system balancing test validity with adversarial difficulty. Experimental results demonstrate that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, with significantly improved test generation capabilities.

The adversarial co-evolution framework of Code-A1 holds substantial significance in academia and offers a scalable solution for industry, reducing reliance on human-annotated test suites. By enabling dynamic and adaptive verifiable rewards, the model can continuously evolve beyond any static performance ceiling.

However, Code-A1 may perform suboptimally in extremely complex code and test scenarios, as the current model scale and training data may not cover all possible edge cases. Future research directions include extending the applicability of Code-A1 to more diverse code and test scenarios, optimizing the efficiency of adversarial training, and exploring the effective application of the Mistake Book mechanism on larger datasets.

Deep Analysis

Background

Code generation is a critical task in artificial intelligence, and recent advancements in large language models have significantly improved its capabilities. However, traditional code generation methods primarily rely on static human-annotated test suites, which suffer from limited coverage and inability to adapt to improving model capabilities. Recent efforts have explored self-play methods that unify code and test generation within a single model, but these face issues of self-collusion with white-box access and generic tests with black-box restrictions.

Core Problem

Existing code generation methods rely on static human-annotated test suites, which suffer from limited coverage and inability to adapt to improving model capabilities. Self-play methods attempt to unify code and test generation within a single model but face issues of self-collusion with white-box access and generic tests with black-box restrictions. These issues limit the potential for performance improvement and practical application.

Innovation

Code-A1 introduces an adversarial co-evolution framework that separates the Code LLM and Test LLM, each optimized for opposing objectives. The Code LLM aims to pass more tests, while the Test LLM seeks to expose more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation. Additionally, Code-A1 incorporates a Mistake Book mechanism for experience replay and designs a composite reward system balancing test validity with adversarial difficulty.

Methodology

  • Code-A1 employs an adversarial co-evolution framework to optimize a Code LLM and a Test LLM using reinforcement learning.
  • The Code LLM aims to pass more tests, while the Test LLM seeks to expose more defects.
  • A Mistake Book mechanism is introduced for experience replay, recording historical failed tests for each question.
  • A composite reward system is designed to balance test validity with adversarial difficulty.
  • Experiments are conducted on Qwen2.5-Coder models to validate Code-A1's performance.
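The opposing-objective structure above can be sketched as a single training step. This is a minimal illustration, not the paper's implementation: the model interfaces, the zero-sum reward split, and the test representation (input, expected-output pairs) are all assumptions for clarity.

```python
def run_tests(code_fn, tests):
    """Return the fraction of (input, expected) tests the candidate passes."""
    passed = sum(1 for inp, expected in tests if code_fn(inp) == expected)
    return passed / len(tests) if tests else 0.0

def adversarial_step(code_llm, test_llm, question, mistake_book):
    """One co-evolution step with opposing rewards for the two models.
    code_llm / test_llm are stand-ins for policy models; in the paper
    they would be updated with an RL algorithm using these rewards."""
    candidate = code_llm.generate(question)           # candidate solution
    tests = test_llm.generate(question, candidate)    # white-box: sees the code
    tests += mistake_book.get(question, [])           # replay past failures

    pass_rate = run_tests(candidate, tests)
    code_reward = pass_rate                # Code LLM: pass more tests
    test_reward = 1.0 - pass_rate          # Test LLM: expose more defects

    # record newly failed tests so resolved errors are not forgotten
    failed = [t for t in tests if candidate(t[0]) != t[1]]
    mistake_book.setdefault(question, []).extend(failed)
    return code_reward, test_reward
```

The key property this illustrates is that the two rewards sum to a constant, so one model's gain is the other's loss, which is what drives the co-evolution.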

Experiments

Experiments are conducted on Qwen2.5-Coder models with three scales: 1.5B, 3B, and 7B. The experimental design includes both code generation and test generation, using benchmarks such as HumanEval, MBPP, and BigCodeBench. In the experiments, the Code LLM and Test LLM generate candidate solutions and test suites, respectively, and execute them in a sandbox environment. The Mistake Book mechanism records historical failed tests for each question, ensuring that the model does not forget resolved errors during training.
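Executing generated code against generated tests requires isolation so that crashes or hangs do not affect training. A minimal sandbox along these lines, assuming a plain subprocess with a timeout (a real sandbox would also restrict filesystem and network access), might look like:

```python
import os
import subprocess
import sys
import tempfile

def sandboxed_run(solution_src: str, test_src: str, timeout: float = 5.0) -> bool:
    """Run a candidate solution plus one test in a subprocess, so that
    exceptions, crashes, or infinite loops stay contained. Returns True
    iff the test passes (the process exits cleanly)."""
    program = solution_src + "\n" + test_src
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0    # a failed assertion exits nonzero
    except subprocess.TimeoutExpired:
        return False                     # treat hangs as failures
    finally:
        os.remove(path)
```

Pass/fail signals from this kind of harness are what the composite reward is computed from.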

Results

Experimental results show that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, with significantly improved test generation capabilities. For instance, the 3B model achieves a Mul score of 15.29, surpassing the 7B base model's 14.72. Additionally, in code generation benchmarks, the 1.5B model of Code-A1 achieves an average accuracy of 56.95%, surpassing the Golden Tests baseline of 56.23% and the Self-Play approach of 55.88%. For test generation, Code-A1's Test LLM evolves from simple validity to high discriminatory power, with the 3B model achieving a Mul score of 15.29, significantly outperforming SFT's 8.53.

Applications

The adversarial co-evolution framework of Code-A1 holds substantial significance in academia and offers a scalable solution for industry, reducing reliance on human-annotated test suites. By enabling dynamic and adaptive verifiable rewards, the model can continuously evolve beyond any static performance ceiling. This method can be applied to automated test generation in software development, improving test coverage and efficiency while reducing human intervention.

Limitations & Outlook

Code-A1 may perform suboptimally in extremely complex code and test scenarios, as the current model scale and training data may not cover all possible edge cases. Additionally, while the Mistake Book mechanism is introduced, experience replay on large datasets may lead to increased computational overhead. Adversarial training may cause the model to overfit to specific types of tests in certain situations. Future research directions include extending the applicability of Code-A1 to more diverse code and test scenarios, optimizing the efficiency of adversarial training, and exploring the effective application of the Mistake Book mechanism on larger datasets.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen cooking a meal. You have a recipe (code) and need to ensure it produces a delicious dish (correct result). To verify this, you need someone to taste it (test). Traditionally, you might rely on a fixed food critic (static tests) to taste your dish, but they might only like certain flavors and can't fully evaluate your dish.

Code-A1 is like a dynamic team of food critics who adjust their tastes based on your dish (dynamic tests). This way, your dish can perform well under different tastes.

Throughout this process, Code-A1 keeps a record of each tasting result (Mistake Book) to ensure you don't repeat the same mistakes. Ultimately, your cooking skills (code generation ability) will continuously improve, able to handle various taste challenges.

This approach not only improves the quality of the dish (code correctness) but also reduces reliance on a fixed food critic (human-annotated tests), giving you more freedom in the kitchen.

ELI14 (explained like you're 14)

Hey there, buddy! Imagine you're playing a super cool game where you're a top-notch programmer writing perfect code to defeat enemies! But the problem is, your enemies keep changing, so you can't use the same strategy to beat them.

That's where Code-A1 comes in as your super assistant. It helps you generate all kinds of tests, like different enemies, so you can keep improving your skills!

Every time your code fails a test, Code-A1 records it, so you don't make the same mistake again. It even adjusts the difficulty based on your performance, helping you level up in the game.

In the end, you'll become an unbeatable programmer, writing perfect code and defeating all the enemies! Isn't that awesome?

Glossary

Adversarial Co-evolution

A method where two models are simultaneously optimized through adversarial training, often used for generation and testing tasks.

Used in Code-A1 to optimize code and test generation models.

Reinforcement Learning

A machine learning approach where models learn strategies to maximize cumulative rewards through reward signals.

Used to optimize Code-A1's code and test generation models.

Mistake Book

A mechanism that records historical failed tests for each question, ensuring the model does not forget resolved errors during training.

Used in Code-A1 for experience replay.
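A minimal sketch of such a store, assuming a per-question list with a bounded size (the capacity limit and method names are illustrative, not from the paper):

```python
from collections import defaultdict

class MistakeBook:
    """Per-question store of tests the current policy previously failed,
    replayed during training so resolved errors are not forgotten."""

    def __init__(self, max_per_question: int = 32):
        self.failed = defaultdict(list)
        self.max_per_question = max_per_question

    def record(self, question: str, test: str) -> None:
        """Add a newly failed test, ignoring duplicates and keeping
        only the most recent entries up to the capacity limit."""
        bucket = self.failed[question]
        if test not in bucket:
            bucket.append(test)
            del bucket[:-self.max_per_question]

    def replay(self, question: str) -> list:
        """Return the stored tests to append to a new test suite."""
        return list(self.failed[question])
```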

Composite Reward

A reward mechanism that combines multiple objectives to balance test validity with adversarial difficulty.

Used in Code-A1 to optimize the test generation model.
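One plausible shape for such a reward, sketched under assumptions: the validity gate, the linear mix, and the `alpha` weighting are illustrative choices, not the paper's formula.

```python
def composite_test_reward(valid: bool, pass_rate: float,
                          alpha: float = 0.5) -> float:
    """Hypothetical composite reward for the Test LLM, balancing validity
    (the test accepts a known-correct reference solution) against
    adversarial difficulty (how often candidate solutions fail it)."""
    if not valid:
        return 0.0                     # invalid tests earn nothing
    difficulty = 1.0 - pass_rate       # harder tests fail more candidates
    return alpha + (1.0 - alpha) * difficulty
```

The validity gate matters: without it, the Test LLM could maximize difficulty with tests that no correct solution can pass.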

Qwen2.5-Coder

A model used for code and test generation, available in 1.5B, 3B, and 7B scales.

Used in Code-A1's experiments.

White-box Testing

A testing method where testers have access to the internal structure of the code to generate more targeted tests.

Used in Code-A1 for generating targeted tests.

Mul Score

A metric for evaluating test generation model performance, combining test validity and adversarial capability.

Used in Code-A1's experimental results.

Self-collusion

In self-play methods, the model generates simple tests for easy rewards, distorting the training signal.

Avoided in Code-A1 by separating model structures.

Dynamic Reward

A reward mechanism that adjusts based on model capability changes, avoiding the limitations of static tests.

Used in Code-A1 to optimize model performance.

Experience Replay

A method that stabilizes the training process by reusing historical experiences.

Implemented in Code-A1 through the Mistake Book.

Open Questions (unanswered questions from this research)

  1. How can the Mistake Book mechanism be effectively applied to larger datasets without excessive computational overhead? The current implementation may hit performance bottlenecks at scale.
  2. Code-A1 may perform suboptimally in extremely complex code and test scenarios. How can its applicability be extended to cover more diverse scenarios?
  3. Adversarial training may cause the model to overfit to specific types of tests. How can more robust training mechanisms be designed to avoid this?
  4. In practical applications, how can Code-A1 be effectively integrated into existing software development processes to maximize its benefits?
  5. How can the efficiency of adversarial training be further optimized to reduce training time and computational resource consumption?

Applications

Immediate Applications

Automated Test Generation

Code-A1 can be used in software development for automated test generation, improving test coverage and efficiency while reducing human intervention.

Code Quality Improvement

Through dynamic and adaptive verifiable rewards, Code-A1 can continuously improve code quality, reducing potential errors and vulnerabilities.

Education and Training

Code-A1 can be used in programming education and training, generating diverse test cases to help students improve their coding skills.

Long-term Vision

Software Development Process Optimization

Code-A1's adversarial co-evolution framework can be integrated into software development processes, enhancing overall development efficiency and quality.

Intelligent Programming Assistant

In the future, Code-A1 could evolve into an intelligent programming assistant, automatically generating and optimizing code to boost developer productivity.

Abstract

Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion where the model produces trivial tests for easy rewards, yet black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation, where the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward balancing test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, while significantly improving test generation capability.

