Code Review Agent Benchmark
The c-CRAB dataset evaluates code review agents' abilities; existing agents collectively solve only about 40% of its tasks.
Key Findings
Methodology
The paper introduces the c-CRAB dataset to evaluate code review agents. The dataset is systematically constructed from human reviews: for each human review of a pull request, corresponding tests are generated to assess agent-generated reviews. This allows an objective evaluation of whether agents identify the issues highlighted in human reviews. The evaluation framework is used to test PR-Agent, Devin, Claude Code, and Codex.
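The summary does not include the paper's harness code; the following is a minimal sketch of how such a test-based evaluation loop could look, assuming each c-CRAB task pairs a pull request with executable checks derived from a human review (all names here are hypothetical):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CrabTask:
    pr_diff: str  # the pull request under review
    issue_checks: list[Callable[[str], bool]]  # checks derived from the human review

def solved(task: CrabTask, review_agent: Callable[[str], str]) -> bool:
    """A task counts as solved if the agent's review passes every
    executable check generated from the corresponding human review."""
    review = review_agent(task.pr_diff)
    return all(check(review) for check in task.issue_checks)

def solve_rate(tasks: list[CrabTask], agent: Callable[[str], str]) -> float:
    return sum(solved(t, agent) for t in tasks) / len(tasks)
```

The key property is that the verdict comes from running checks, not from comparing review texts.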
Key Results
- Existing code review agents collectively solve only about 40% of c-CRAB tasks, indicating significant room for improvement in future research.
- Agent-generated reviews often focus on different aspects than human reviews, suggesting potential for human-agent collaboration.
- Agent-generated tests from the dataset act as a held-out test suite and hence as a quality gate for agent-generated reviews.
Significance
This study provides a new standard for evaluating code review agents through the c-CRAB dataset. The dataset not only highlights the limitations of current agents but also points to directions for future research. By facilitating human-agent collaboration, future software development teams can enhance code quality and review efficiency. Additionally, the test suite from the c-CRAB dataset acts as a quality gate for agent-generated reviews.
Technical Contribution
The construction method of the c-CRAB dataset differs from existing text similarity-based evaluation methods, offering an objective evaluation mechanism based on executable tests. This approach ensures reproducibility and stability in evaluations and lays the groundwork for future collaboration among code generation, test generation, and code review agents.
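To make the contrast concrete, here is an illustrative sketch of the two evaluation styles; the similarity measure and check interface are assumptions for illustration, not the paper's implementation:

```python
import difflib

def text_similarity_score(agent_review: str, human_review: str) -> float:
    """The traditional proxy: lexical overlap between review texts.
    High overlap does not prove the agent found a real issue."""
    return difflib.SequenceMatcher(None, agent_review, human_review).ratio()

def test_based_verdict(agent_review: str, issue_checks) -> bool:
    """c-CRAB-style evaluation: the review either passes the executable
    checks derived from the human review or it does not."""
    return all(check(agent_review) for check in issue_checks)
```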
Novelty
The c-CRAB dataset is the first to convert human review feedback into executable tests for evaluating code review agents. This approach differs from traditional text similarity-based evaluations by providing a more objective and verifiable standard.
Limitations
- Current code review agents collectively solve only about 40% of c-CRAB tasks, indicating limitations in identifying complex issues.
- Agent-generated reviews focus on different aspects than human reviews, which may lead to inconsistencies in practical applications.
- The construction of the c-CRAB dataset relies on the quality of human reviews, which may be affected by noise in human review practices.
Future Work
Future research can explore how to improve code review agents' performance on c-CRAB tasks, particularly in identifying complex and diverse issues. Additionally, studying best practices for human-agent collaboration to leverage the complementary strengths of agents and human reviews is crucial.
AI Executive Summary
In modern software development, code review is a critical step to ensure code quality. However, with the widespread use of AI agents in code generation, the volume of automatically generated code has surged, placing immense pressure on human reviewers. Existing code review agents show limited ability to identify code issues, and they are typically evaluated by textual similarity to human reviews, which fails to reflect whether an agent has found a real issue.
To address this problem, researchers have developed the c-CRAB dataset, a benchmark specifically designed to evaluate the capabilities of code review agents. The c-CRAB dataset provides an objective evaluation standard by converting human review feedback into executable tests. This method not only verifies whether agents identify issues highlighted in human reviews but also ensures reproducibility and stability in evaluations.
In experiments, researchers used the c-CRAB dataset to test several existing code review agents, including the open-source PR-Agent and commercial agents Devin, Claude Code, and Codex. Results show that these agents collectively solve only about 40% of c-CRAB tasks, indicating significant room for improvement in identifying complex issues.
Furthermore, the study found that agent-generated reviews often focus on different aspects than human reviews. This does not imply poor review quality but suggests potential for human-agent collaboration. By combining different perspectives from humans and agents, future software development teams can more effectively identify and resolve code issues.
While the c-CRAB dataset provides a new standard for evaluating code review agents, its construction relies on the quality of human reviews, which may be affected by noise in human review practices. Future research can explore how to improve agents' performance on c-CRAB tasks and optimize human-agent collaboration to enhance code review efficiency and quality.
Deep Analysis
Background
Code review is a crucial quality assurance practice in software development. With the advancement of AI technology, automated code generation tools are increasingly widely used, leading to a surge in the volume of generated code. Human reviewers cannot keep pace with this growth, creating a bottleneck in the development process. Traditional code review relies on human reviewers carefully examining code changes to detect defects, enforce project standards, and maintain long-term code quality. Recently, automated code review tools have emerged to accelerate this process by analyzing pull requests and generating review feedback. Yet existing evaluation methods primarily rely on textual similarity, which fails to measure whether agent-generated reviews identify real issues in the code.
Core Problem
With the widespread application of automated code generation tools, the speed and volume of code generation have increased significantly, placing immense pressure on human reviewers. Existing code review agents show limited ability to identify code issues, and they are typically evaluated by textual similarity to human reviews, which fails to reflect whether an agent has found a real issue. Existing evaluation methods also struggle with noise and incompleteness in human review comments.
Innovation
The core innovation of the c-CRAB dataset lies in converting human review feedback into executable tests for evaluating code review agents. This approach differs from traditional text similarity-based evaluations by providing a more objective and verifiable standard. Through this method, researchers can objectively evaluate whether agents identify issues highlighted in human reviews and ensure reproducibility and stability in evaluations. Additionally, the test suite from the c-CRAB dataset acts as a quality gate for agent-generated reviews.
Methodology
- The c-CRAB dataset is constructed by systematically generating corresponding tests from human reviews (see the sketch after this list).
- The evaluation framework is used to test PR-Agent, Devin, Claude Code, and Codex.
- Executable tests serve as the evaluation standard, ensuring objectivity and reproducibility.
- The potential for human-agent collaboration is explored by comparing agent-generated reviews with human reviews.
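The summary does not detail how tests are generated from reviews; the sketch below only illustrates the shape of such a step, with `llm_complete` standing in for a hypothetical text-generation callable:

```python
def review_to_test(review_comment: str, pr_diff: str, llm_complete) -> str:
    """Turn a human review comment into an executable test. The paper's
    actual generation procedure may differ substantially from this."""
    prompt = (
        "A human reviewer left this comment on a pull request:\n"
        f"{review_comment}\n\n"
        "Pull request diff:\n"
        f"{pr_diff}\n\n"
        "Write a pytest test that fails while the flagged issue is "
        "present and passes once it is addressed."
    )
    return llm_complete(prompt)
```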
Experiments
The experimental design uses the c-CRAB dataset to test several existing code review agents: the open-source PR-Agent and the commercial agents Devin, Claude Code, and Codex. For each task, tests systematically generated from the human review are used to evaluate the agent-generated review. The evaluation framework verifies through executable tests whether agents identify the issues highlighted in human reviews, ensuring reproducibility and stability.
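As a hedged sketch of what executing such generated tests might look like, assuming they are checked into the task's repository as pytest files (the runner below is illustrative, not the paper's):

```python
import subprocess

def run_generated_tests(repo_dir: str, test_paths: list[str]) -> dict[str, bool]:
    """Run each generated test file in isolation; True means it passed."""
    outcomes = {}
    for path in test_paths:
        proc = subprocess.run(
            ["python", "-m", "pytest", path, "-q"],
            cwd=repo_dir,
            capture_output=True,
        )
        outcomes[path] = proc.returncode == 0
    return outcomes
```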
Results
The experimental results show that existing code review agents collectively solve only about 40% of c-CRAB tasks, indicating significant room for improvement in identifying complex issues. Additionally, the study found that agent-generated reviews often focus on different aspects than human reviews. This does not imply poor review quality but suggests potential for human-agent collaboration. By combining different perspectives from humans and agents, future software development teams can more effectively identify and resolve code issues.
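"Collectively" here means the union of tasks solved by at least one agent, which is a weaker bar than any single agent's score. A small sketch of that aggregation (the data shapes are assumptions):

```python
def collective_solve_rate(per_agent_solved: dict[str, set[str]],
                          all_task_ids: set[str]) -> float:
    """per_agent_solved maps an agent name to the set of task ids it solved;
    a task counts if at least one agent solved it."""
    solved_by_any = set().union(*per_agent_solved.values()) if per_agent_solved else set()
    return len(solved_by_any & all_task_ids) / len(all_task_ids)
```

For example, if one agent solves {"t1"} and another solves {"t2"} out of tasks {"t1", "t2", "t3"}, the collective rate is 2/3.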
Applications
The c-CRAB dataset provides a new standard for evaluating code review agents, applicable to assessing existing and future code review agent tools. By combining different perspectives from humans and agents, future software development teams can more effectively identify and resolve code issues. Additionally, the test suite from the c-CRAB dataset acts as a quality gate for agent-generated reviews.
Limitations & Outlook
While the c-CRAB dataset provides a new standard for evaluating code review agents, its construction relies on the quality of human reviews, which may be affected by noise in human review practices. Existing code review agents collectively solve only about 40% of c-CRAB tasks, indicating limitations in identifying complex issues. Additionally, agent-generated reviews focus on different aspects than human reviews, which may lead to inconsistencies in practical applications.
Plain Language (accessible to non-experts)
Imagine you're in a kitchen cooking a meal. You have a recipe (code), but you're not sure if it's perfect (error-free). Traditionally, you'd ask an experienced chef (human reviewer) to check your recipe to ensure there are no mistakes. However, as you cook more dishes, your chef becomes overwhelmed. That's when you decide to use a smart assistant (code review agent) to help you check the recipe. This smart assistant can quickly scan the recipe and find potential issues, but it sometimes misses details (limited ability to identify issues). To ensure the assistant's suggestions are correct, you decide to use a new method: every time the assistant finds an issue, you perform a small test (executable test) to see if the issue really exists. If the test passes, it means the assistant found a real problem. This method allows you to quickly find and fix issues in the recipe while reducing the chef's burden. This is the core idea of the c-CRAB dataset: evaluating code review agents' abilities through executable tests to ensure they can identify real issues.
ELI14 (explained like you're 14)
Hey there! Imagine you're playing a super complex game, and you need to keep upgrading your gear (code) to defeat enemies (bugs). Usually, you'd ask a skilled friend (human reviewer) to help you check your gear to make sure there are no problems. But now, you have a new helper (code review agent) that can quickly help you check your gear. However, sometimes it misses some issues (limited ability to identify problems). To make sure the helper's suggestions are correct, you decide to do a small test (executable test) every time it finds an issue to see if the problem really exists. If the test passes, it means the helper found a real problem. This way, you can quickly find and fix issues in your gear while reducing your friend's burden. This is the core idea of the c-CRAB dataset: evaluating code review agents' abilities through executable tests to ensure they can identify real issues.
Glossary
Code Review
Code review is a quality assurance practice in software development where developers inspect pull requests before merging code changes to detect defects, enforce project standards, and maintain long-term code quality.
In this paper, code review is the core task for evaluating code review agents' capabilities.
c-CRAB Dataset
c-CRAB is a benchmark dataset specifically designed to evaluate code review agents' capabilities by converting human review feedback into executable tests, providing an objective evaluation standard.
The paper introduces the c-CRAB dataset as a new standard for evaluating code review agents.
Executable Test
Executable tests are tests converted from human review feedback to verify whether code review agents identify real issues.
The c-CRAB dataset evaluates code review agents' capabilities through executable tests.
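A hypothetical illustration (not taken from the paper): a human comment such as "parse_amount crashes on empty input" could be converted into a test like the following, where the `billing` module stands in for the code under review:

```python
import pytest
from billing import parse_amount  # hypothetical module under review

def test_parse_amount_rejects_empty_input():
    # Fails while the issue flagged by the human reviewer is present,
    # passes once empty input is handled with an explicit error.
    with pytest.raises(ValueError):
        parse_amount("")
```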
PR-Agent
PR-Agent is an open-source code review agent tool used to automatically generate code review feedback.
In the experiments, PR-Agent is one of the code review agents evaluated.
Devin
Devin is Cognition's commercial AI software engineering agent, used here to automatically generate code review feedback.
In the experiments, Devin is one of the code review agents evaluated.
Claude Code
Claude Code is Anthropic's commercial coding agent, used here to automatically generate code review feedback.
In the experiments, Claude Code is one of the code review agents evaluated.
Codex
Codex is OpenAI's commercial coding agent, used here to automatically generate code review feedback.
In the experiments, Codex is one of the code review agents evaluated.
Textual Similarity
Textual similarity is a method for evaluating the similarity between generated comments and human comments, typically relying on lexical overlap or embedding similarity.
Existing code review agent evaluation methods often rely on textual similarity.
Human-Agent Collaboration
Human-agent collaboration refers to humans and AI agents working together in the code review process to enhance code quality and review efficiency.
The paper explores the potential for human-agent collaboration in code review.
Benchmark Dataset
A benchmark dataset is a standard dataset used to evaluate the performance of algorithms or tools, providing a comparable evaluation standard.
c-CRAB is a benchmark dataset for evaluating code review agents.
Open Questions (unanswered questions from this research)
1. Existing code review agents have limited performance in identifying complex issues. Future research needs to explore how to improve agents' performance on c-CRAB tasks, particularly in identifying complex and diverse issues.
2. The construction of the c-CRAB dataset relies on the quality of human reviews, which may be affected by noise in human review practices. Future research can explore how to improve the dataset's quality and reliability.
3. Agent-generated reviews focus on different aspects than human reviews, which may lead to inconsistencies in practical applications. Future research can explore how to optimize human-agent collaboration to leverage the complementary strengths of agents and human reviews.
4. Existing evaluation methods face challenges in handling noise and incompleteness in human review comments. Future research can explore more effective evaluation methods to improve the accuracy and reliability of evaluations.
Applications
Immediate Applications
Code Review Agent Evaluation
The c-CRAB dataset can be immediately used to evaluate existing and future code review agent tools, helping developers choose the tools that best suit their needs.
Human-Agent Collaboration Optimization
By combining different perspectives from humans and agents, software development teams can more effectively identify and resolve code issues, enhancing code quality and review efficiency.
Code Quality Assurance
The test suite from the c-CRAB dataset acts as a quality gate for agent-generated reviews, ensuring that generated reviews can identify real issues.
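In practice such a gate could be a filter that only surfaces agent reviews passing the held-out checks; a minimal sketch, with the check interface assumed as before:

```python
def gate_reviews(agent_reviews, issue_checks):
    """Split agent reviews into those that pass the held-out checks
    (safe to surface) and those held back for human triage."""
    passed, held_back = [], []
    for review in agent_reviews:
        ok = all(check(review) for check in issue_checks)
        (passed if ok else held_back).append(review)
    return passed, held_back
```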
Long-term Vision
Automated Development Processes
By improving code review agents' capabilities, future software development processes can achieve higher levels of automation, reducing the burden on human reviewers.
Intelligent Software Development
As code review agents' capabilities improve, future software development can achieve more intelligent collaboration, enhancing development efficiency and code quality.
Abstract
Software engineering agents have shown significant promise in writing code. As AI agents permeate code writing, and generate huge volumes of code automatically -- the matter of code quality comes front and centre. As the automatically generated code gets integrated into huge code-bases -- the issue of code review and broadly quality assurance becomes important. In this paper, we take a fresh look at the problem and curate a code review dataset for AI agents to work with. Our dataset called c-CRAB (pronounced see-crab) can evaluate agents for code review tasks. Specifically, given a pull request (which could be coming from code generation agents or humans), if a code review agent produces a review, our evaluation framework can assess the reviewing capability of the code review agent. Our evaluation framework is used to evaluate the state of the art today -- the open-source PR-Agent, as well as commercial code review agents from Devin, Claude Code, and Codex. Our c-CRAB dataset is systematically constructed from human reviews -- given a human review of a pull request instance, we generate corresponding tests to evaluate the code review agent generated reviews. Such a benchmark construction gives us several insights. Firstly, the existing review agents taken together can solve only around 40% of the c-CRAB tasks, indicating the potential to close this gap by future research. Secondly, we observe that the agent reviews often consider different aspects from the human reviews -- indicating the potential for human-agent collaboration for code review that could be deployed in future software teams. Last but not least, the agent generated tests from our dataset act as a held-out test suite and hence quality gate for agent generated reviews. What this will mean for future collaboration of code generation agents, test generation agents, and code review agents remains to be investigated.