Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning
Code-A1 enhances code and test generation through an adversarial co-evolution framework.
Key Findings
Methodology
Code-A1 employs an adversarial co-evolution framework to optimize a Code LLM and a Test LLM using reinforcement learning. The Code LLM aims to pass more tests, while the Test LLM seeks to expose more defects. A Mistake Book mechanism is introduced for experience replay, and a composite reward system balances test validity with adversarial difficulty.
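As a rough illustration of these opposing objectives, the sketch below pairs a pass-rate reward for the Code LLM with a validity-gated difficulty reward for the Test LLM. The function names, the validity gate, and the weight alpha are assumptions made for illustration; the paper only states that its composite reward balances test validity with adversarial difficulty.

    def code_reward(pass_flags):
        # Code LLM objective: pass as many adversarial tests as possible.
        # pass_flags is a list of booleans, one per test run against the solution.
        return sum(pass_flags) / max(len(pass_flags), 1)

    def test_reward(is_valid, candidate_pass_flags, alpha=0.5):
        # Test LLM objective (illustrative): a test that rejects a known-correct
        # reference solution is invalid and earns nothing; otherwise its reward
        # grows with the fraction of candidate solutions it breaks.
        if not is_valid:
            return 0.0
        difficulty = 1.0 - sum(candidate_pass_flags) / max(len(candidate_pass_flags), 1)
        return alpha + (1.0 - alpha) * difficulty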
Key Results
- Experiments on Qwen2.5-Coder models show that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, with significantly improved test generation capabilities. For instance, the 3B model achieves a Mul score of 15.29, surpassing the 7B base model's 14.72.
- In code generation benchmarks, the 1.5B model of Code-A1 achieves an average accuracy of 56.95%, surpassing the Golden Tests baseline of 56.23% and the Self-Play approach of 55.88%.
- For test generation, Code-A1's Test LLM evolves from simple validity to high discriminatory power, with the 3B model achieving a Mul score of 15.29, significantly outperforming SFT's 8.53.
Significance
Code-A1 addresses the limitation of static rewards that fail to adapt to improving model capabilities, significantly enhancing both code and test generation. Beyond its research value for verifiable-reward RL, it offers a scalable solution for industry by reducing reliance on human-annotated test suites.
Technical Contribution
The technical contributions of Code-A1 include its adversarial co-evolution framework, which separates code and test generation tasks to eliminate self-collusion risks and enable safe white-box test generation. Additionally, the Mistake Book mechanism and composite reward design provide stable training signals.
Novelty
Code-A1 is the first to introduce adversarial co-evolution into code RL, enabling dynamic and adaptive verifiable rewards. This innovation allows the model to continuously evolve beyond any static performance ceiling.
Limitations
- Code-A1 may perform suboptimally in extremely complex code and test scenarios, as the current model scale and training data may not cover all possible edge cases.
- While the Mistake Book mechanism is introduced, experience replay on large datasets may lead to increased computational overhead.
- Adversarial training may cause the model to overfit to specific types of tests in certain situations.
Future Work
Future research directions include extending the applicability of Code-A1 to more diverse code and test scenarios, optimizing the efficiency of adversarial training, and exploring the effective application of the Mistake Book mechanism on larger datasets.
AI Executive Summary
In code generation, traditional reinforcement learning methods rely on static human-annotated test suites, which suffer from limited coverage and cannot adapt as model capabilities improve. Existing self-play methods attempt to unify code and test generation within a single model, but face a dilemma: white-box access invites self-collusion, while black-box restrictions yield only generic tests that miss implementation-specific bugs.
Code-A1 introduces an adversarial co-evolution framework that separates the Code LLM and Test LLM, each optimized for opposing objectives. The Code LLM aims to pass more tests, while the Test LLM seeks to expose more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation.
Technically, Code-A1 incorporates a Mistake Book mechanism for experience replay and designs a composite reward system balancing test validity with adversarial difficulty. Experimental results demonstrate that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, with significantly improved test generation capabilities.
The adversarial co-evolution framework of Code-A1 holds substantial significance in academia and offers a scalable solution for industry, reducing reliance on human-annotated test suites. By enabling dynamic and adaptive verifiable rewards, the model can continuously evolve beyond any static performance ceiling.
However, Code-A1 may perform suboptimally in extremely complex code and test scenarios, as the current model scale and training data may not cover all possible edge cases. Future research directions include extending the applicability of Code-A1 to more diverse code and test scenarios, optimizing the efficiency of adversarial training, and exploring the effective application of the Mistake Book mechanism on larger datasets.
Deep Analysis
Background
Code generation is a critical task in artificial intelligence, and recent advances in large language models have significantly improved its capabilities. However, reinforcement learning for code generation primarily relies on static human-annotated test suites, which offer limited coverage and cannot adapt as models improve. Recent efforts have explored self-play methods that unify code and test generation within a single model, but these either collapse into self-collusion when given white-box access or produce only generic tests under black-box restrictions.
Core Problem
Existing code generation methods rely on static human-annotated test suites, which suffer from limited coverage and cannot adapt to improving model capabilities. Self-play methods attempt to unify code and test generation within a single model, but white-box access leads to self-collusion while black-box restrictions yield generic tests. These issues cap achievable performance and limit practical application.
Innovation
Code-A1 introduces an adversarial co-evolution framework that separates the Code LLM and Test LLM, each optimized for opposing objectives. The Code LLM aims to pass more tests, while the Test LLM seeks to expose more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation. Additionally, Code-A1 incorporates a Mistake Book mechanism for experience replay and designs a composite reward system balancing test validity with adversarial difficulty.
Methodology
- Code-A1 employs an adversarial co-evolution framework to optimize a Code LLM and a Test LLM using reinforcement learning.
- The Code LLM aims to pass more tests, while the Test LLM seeks to expose more defects.
- A Mistake Book mechanism is introduced for experience replay, recording historical failed tests for each question (see the sketch after this list).
- A composite reward system is designed to balance test validity with adversarial difficulty.
- Experiments are conducted on Qwen2.5-Coder models to validate Code-A1's performance.
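As referenced in the list above, here is a minimal sketch of one way the Mistake Book could be kept for experience replay, assuming it maps each question to the tests that previously exposed failures. The class and method names are illustrative assumptions, not the paper's implementation.

    from collections import defaultdict

    class MistakeBook:
        # Illustrative store of historical failing tests, keyed by question.

        def __init__(self):
            # question id -> set of test cases that previously exposed a defect
            self._book = defaultdict(set)

        def record(self, question_id, failed_tests):
            # Remember tests that broke the Code LLM's solution for this question.
            self._book[question_id].update(failed_tests)

        def replay(self, question_id, fresh_tests):
            # Mix historical failing tests with freshly generated adversarial tests,
            # so the Code LLM keeps being checked against errors it already fixed.
            return list(self._book[question_id]) + list(fresh_tests)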
Experiments
Experiments are conducted on Qwen2.5-Coder models with three scales: 1.5B, 3B, and 7B. The experimental design includes both code generation and test generation, using benchmarks such as HumanEval, MBPP, and BigCodeBench. In the experiments, the Code LLM and Test LLM generate candidate solutions and test suites, respectively, and execute them in a sandbox environment. The Mistake Book mechanism records historical failed tests for each question, ensuring that the model does not forget resolved errors during training.
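To make the sandbox step concrete, the hedged sketch below runs one candidate solution against one generated assert-style test in a subprocess with a timeout. It is an illustration under stated assumptions; the paper's actual sandbox and its isolation guarantees are not described here.

    import os, subprocess, sys, tempfile

    def run_in_sandbox(solution_code: str, test_code: str, timeout_s: float = 5.0) -> bool:
        # Concatenate the candidate solution with a generated assert-style test.
        program = solution_code + "\n\n" + test_code
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program)
            path = f.name
        try:
            proc = subprocess.run([sys.executable, path],
                                  capture_output=True, timeout=timeout_s)
            return proc.returncode == 0   # passes iff no assertion error or crash
        except subprocess.TimeoutExpired:
            return False                  # treat timeouts as failures
        finally:
            os.unlink(path)

For example, run_in_sandbox("def add(a, b):\n    return a + b", "assert add(2, 3) == 5") would return True, while a test with a failing assertion or an infinite loop would return False.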
Results
Experimental results show that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, while significantly improving test generation capability. On code generation benchmarks, the 1.5B model reaches an average accuracy of 56.95%, surpassing the Golden Tests baseline (56.23%) and the Self-Play approach (55.88%). For test generation, the Test LLM evolves from producing merely valid tests to highly discriminative ones: the 3B model reaches a Mul score of 15.29, exceeding both the 7B base model (14.72) and the SFT baseline (8.53).
Applications
The adversarial co-evolution framework of Code-A1 offers a scalable solution for industry by reducing reliance on human-annotated test suites, and its dynamic, adaptive verifiable rewards allow models to keep improving beyond any static performance ceiling. In practice, the framework can drive automated test generation in software development, improving test coverage and efficiency while reducing human intervention.
Limitations & Outlook
Code-A1 may perform suboptimally in extremely complex code and test scenarios, as the current model scale and training data may not cover all possible edge cases. Additionally, while the Mistake Book mechanism is introduced, experience replay on large datasets may lead to increased computational overhead. Adversarial training may cause the model to overfit to specific types of tests in certain situations. Future research directions include extending the applicability of Code-A1 to more diverse code and test scenarios, optimizing the efficiency of adversarial training, and exploring the effective application of the Mistake Book mechanism on larger datasets.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen cooking a meal. You have a recipe (code) and need to ensure it produces a delicious dish (correct result). To verify this, you need someone to taste it (test). Traditionally, you might rely on a fixed food critic (static tests) to taste your dish, but they might only like certain flavors and can't fully evaluate your dish.
Code-A1 is like a dynamic team of food critics whose tastes keep adapting to your dish (dynamic tests). To satisfy them, your cooking has to hold up against ever-tougher and more varied palates.
Throughout this process, Code-A1 keeps a notebook of every dish the critics rejected (the Mistake Book), so you don't repeat the same mistakes. Over time, your cooking skills (code generation ability) keep improving and can handle all kinds of taste challenges.
This approach not only improves the quality of the dish (code correctness) but also reduces reliance on a fixed food critic (human-annotated tests), giving you more freedom in the kitchen.
ELI14 (Explained like you're 14)
Hey there, buddy! Imagine you're playing a super cool game where you're a top-notch programmer writing perfect code to defeat enemies! But the problem is, your enemies keep changing, so you can't use the same strategy to beat them.
That's where Code-A1 comes in as your super assistant. It helps you generate all kinds of tests, like different enemies, so you can keep improving your skills!
Every time your code fails one of those tests, Code-A1 writes it down so you won't make the same mistake again. And as you get better, it keeps making the tests tougher, helping you level up in the game.
In the end, you'll become an unbeatable programmer, writing perfect code and defeating all the enemies! Isn't that awesome?
Glossary
Adversarial Co-evolution
A method where two models are simultaneously optimized through adversarial training, often used for generation and testing tasks.
Used in Code-A1 to optimize code and test generation models.
Reinforcement Learning
A machine learning approach where models learn strategies to maximize cumulative rewards through reward signals.
Used to optimize Code-A1's code and test generation models.
Mistake Book
A mechanism that records historical failed tests for each question, ensuring the model does not forget resolved errors during training.
Used in Code-A1 for experience replay.
Composite Reward
A reward mechanism that combines multiple objectives to balance test validity with adversarial difficulty.
Used in Code-A1 to optimize the test generation model.
Qwen2.5-Coder
A model used for code and test generation, available in 1.5B, 3B, and 7B scales.
Used in Code-A1's experiments.
White-box Testing
A testing method where testers have access to the internal structure of the code to generate more targeted tests.
Used in Code-A1 for generating targeted tests.
Mul Score
A metric for evaluating test generation model performance, combining test validity and adversarial capability.
Used in Code-A1's experimental results.
Self-collusion
In self-play methods, the model generates simple tests for easy rewards, distorting the training signal.
Avoided in Code-A1 by architecturally separating the Code LLM and the Test LLM.
Dynamic Reward
A reward mechanism that adjusts based on model capability changes, avoiding the limitations of static tests.
Used in Code-A1 to optimize model performance.
Experience Replay
A method that stabilizes the training process by reusing historical experiences.
Implemented in Code-A1 through the Mistake Book.
Open Questions (Unanswered questions from this research)
- How can the Mistake Book mechanism be effectively applied to larger datasets to avoid excessive computational overhead? The current implementation may lead to performance bottlenecks on large datasets.
- Code-A1 may perform suboptimally in extremely complex code and test scenarios. How can its applicability be extended to cover more diverse scenarios?
- Adversarial training may cause the model to overfit to specific types of tests. How can more robust training mechanisms be designed to avoid this issue?
- In practical applications, how can Code-A1 be effectively integrated into existing software development processes to maximize its benefits?
- How can the efficiency of adversarial training be further optimized to reduce training time and computational resource consumption?
Applications
Immediate Applications
Automated Test Generation
Code-A1 can be used in software development for automated test generation, improving test coverage and efficiency while reducing human intervention.
Code Quality Improvement
Through dynamic and adaptive verifiable rewards, Code-A1 can continuously improve code quality, reducing potential errors and vulnerabilities.
Education and Training
Code-A1 can be used in programming education and training, generating diverse test cases to help students improve their coding skills.
Long-term Vision
Software Development Process Optimization
Code-A1's adversarial co-evolution framework can be integrated into software development processes, enhancing overall development efficiency and quality.
Intelligent Programming Assistant
In the future, Code-A1 could evolve into an intelligent programming assistant, automatically generating and optimizing code to boost developer productivity.
Abstract
Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion where the model produces trivial tests for easy rewards, yet black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation, where the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward balancing test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, while significantly improving test generation capability.