Evaluating LLM-Based Test Generation Under Software Evolution

TL;DR

Study shows LLMs struggle with test generation under software evolution, with the pass rate of newly generated tests dropping to 66% under semantic-altering changes.

cs.SE · Advanced · 2026-03-25
Sabaat Haroon, Mohammad Taha Khan, Muhammad Ali Gulzar
Large Language Models · Software Testing · Code Evolution · Automated Testing · Semantic Analysis

Key Findings

Methodology

The study employs an automated mutation-driven framework to analyze the test generation performance of eight LLMs across 22,374 program variants. By introducing Semantic Altering Changes (SAC) and Semantic Preserving Changes (SPC), it evaluates how generated tests respond to code evolution. LLMs achieve 79% line coverage and 76% branch coverage on the original programs, but under SACs the pass rate of newly generated tests drops to 66% and branch coverage declines to 60%.
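
The paper's mutation operators are not reproduced in this summary; as a rough illustration, the Python sketch below shows the flavor of the two change classes on a hypothetical function (names and edits are invented, not drawn from the study's dataset).

```python
# Illustrative sketch only: the function and both edits are hypothetical,
# meant to show the flavor of SAC vs. SPC edits, not the paper's operators.

def price_after_discount(price, rate):
    """Original program: applies a simple percentage discount."""
    return price - price * rate

def price_after_discount_sac(price, rate):
    """Semantic Altering Change (SAC): discounts are now capped at 50%,
    so observable behavior differs whenever rate > 0.5."""
    if rate > 0.5:
        rate = 0.5
    return price - price * rate

def price_after_discount_spc(p, r):
    """Semantic Preserving Change (SPC): renamed parameters and refactored
    arithmetic; outputs match the original for every input."""
    discount = p * r
    return p - discount

# A behavior-aware test generator should update its expectations only for
# the SAC variant; tests carried over from the original still fit the SPC one.
assert price_after_discount(100, 0.75) == 25.0
assert price_after_discount_spc(100, 0.75) == 25.0
assert price_after_discount_sac(100, 0.75) == 50.0
```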

Key Results

  • On original programs, LLMs achieved 79% line coverage and 76% branch coverage with fully passing test suites.
  • Under Semantic Altering Changes (SAC), the pass rate of newly generated tests drops to 66%, and branch coverage declines to 60%. Over 99% of failing SAC tests pass on the original program, indicating residual alignment with original behavior.
  • Under Semantic Preserving Changes (SPC), despite unchanged functionality, pass rates fall to 79% and branch coverage to 69%. Models generate more new tests while discarding many baseline tests, suggesting sensitivity to lexical changes.

Significance

This study reveals the limitations of current LLMs in software test generation, especially in adapting to code evolution. The findings indicate that LLMs rely heavily on surface-level cues rather than deep semantic understanding in test generation. This has significant implications for academia and industry, highlighting the shortcomings of current automated test generation techniques and providing directions for future improvements.

Technical Contribution

The technical contribution of this study lies in the development of a novel mutation-driven evaluation framework that systematically assesses LLMs' test generation capabilities under program changes. Unlike existing static benchmarks, this framework reveals performance differences when LLMs face semantic changes and provides direct assessment of whether LLM-generated tests reflect true reasoning about program behavior.

Novelty

This study is the first to systematically evaluate LLMs' test generation capabilities under program evolution, particularly under semantic altering and preserving changes. The innovation lies in using a mutation-driven approach to reveal LLMs' reliance on surface patterns rather than deep semantic understanding in test generation.

Limitations

  • The study primarily relies on a mutation-driven framework, which may not fully simulate real-world code change scenarios.
  • LLMs perform poorly in handling complex semantic changes, particularly when deep semantic understanding is required.
  • The experiments are limited to Java and Python implementations, which may not fully represent the performance across other programming languages.

Future Work

Future research directions include developing LLMs with better semantic understanding capabilities to improve their test generation performance under code evolution. Additionally, exploring the integration of other test generation techniques, such as symbolic execution and fuzz testing, could enhance test coverage and accuracy.

AI Executive Summary

In software development, automated test generation is a crucial task, especially when code frequently evolves. Large Language Models (LLMs) have been increasingly used for automated unit test generation in recent years. However, it remains unclear whether these tests genuinely reflect deep reasoning about program behavior or merely replicate superficial patterns learned during training.

This study presents a large-scale empirical analysis of LLM-based test generation under program evolution. Using an automated mutation-driven framework, the study evaluates the performance of eight LLMs across 22,374 program variants. By introducing Semantic Altering Changes (SAC) and Semantic Preserving Changes (SPC), the study analyzes how generated tests respond to these code changes.

The experimental results show that LLMs achieve 79% line coverage and 76% branch coverage on original programs, with fully passing test suites. However, under semantic-altering changes, the pass rate of newly generated tests drops to 66%, and branch coverage declines to 60%. Over 99% of failing SAC tests pass on the original program, indicating residual alignment with original behavior rather than adaptation to updated semantics.

Even under semantic-preserving changes, despite unchanged functionality, pass rates still drop to 79% and branch coverage to 69%. This suggests that the models are sensitive to lexical changes rather than to true semantic impact. They also generate more new tests while discarding many baseline tests, further supporting this conclusion.

Overall, this study reveals the limitations of current LLMs in test generation, especially in adapting to code evolution. This finding has significant implications for academia and industry, highlighting the shortcomings of current automated test generation techniques and providing directions for future improvements. Future research could explore developing LLMs with better semantic understanding capabilities to improve their test generation performance under code evolution.

Deep Analysis

Background

As software development evolves, automated test generation has become a critical area in software engineering. Large Language Models (LLMs) have been increasingly used for automated unit test generation, particularly for well-known public programming tasks. LLMs can generate syntactically valid and often semantically correct test cases. However, producing complete and reliable automated test suites requires deep reasoning about control flow, execution paths, and the specific functional state of the provided implementation. Real-world software is inherently dynamic: code is frequently reused, refactored, and slightly tweaked to serve different functional purposes. Therefore, when developers supply this modified code to generate a high-coverage test suite, they expect the resulting tests to accurately reflect the current code.

Core Problem

Current studies and benchmarks often overlook how code evolution impacts model robustness and behavior. This gap leaves a critical question: do LLMs genuinely comprehend the semantics of the provided code, or are they performing shallow pattern replication? Consider a scenario where a developer makes a minor semantic-altering modification to adapt an existing open-source function to a new use case. Even when explicitly prompted to generate high-coverage tests, an ideal test generator should adapt to the provided logic. However, if the LLM relies heavily on memorized structures, it may implicitly assume the modified code is a 'buggy' version of the original program. It may also completely overlook the semantic implication of the code change.
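
To make this failure mode concrete, here is a hypothetical sketch (not taken from the paper's dataset) of the "residual alignment" the study measures: a test written against the original behavior keeps asserting the old contract and therefore fails on the intentionally modified code.

```python
import pytest

# Hypothetical example of residual alignment. Assume the ORIGINAL clamp()
# raised ValueError for out-of-range inputs; the developer has since changed
# it to saturate instead, and supplies this evolved version for testing.

def clamp(value, low, high):
    """Evolved semantics: out-of-range inputs are saturated, not rejected."""
    return max(low, min(high, value))

def test_clamp_rejects_out_of_range():
    # Encodes the ORIGINAL contract. It would pass on the old implementation
    # but fails here, mirroring the paper's finding that over 99% of failing
    # SAC tests still pass on the original program.
    with pytest.raises(ValueError):
        clamp(-5, 0, 10)

def test_clamp_saturates_low():
    # A test that has adapted to the updated semantics.
    assert clamp(-5, 0, 10) == 0
```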

Innovation

The core innovation of this study lies in developing a novel mutation-driven evaluation framework to systematically assess LLMs' test generation capabilities under program changes. This framework introduces two classes of code changes: Semantic Altering Changes (SAC) and Semantic Preserving Changes (SPC). By analyzing LLM behavior under both change types, the study isolates whether performance differences arise from semantic misunderstanding or reliance on superficial code patterns. This dual perspective allows the characterization of three properties of LLM-based test generation: sensitivity to semantic change, resilience to non-functional structural change, and stability of generated test suites across evolving programs.

Methodology

  • The study employs an automated mutation-driven framework to analyze the test generation performance of eight LLMs across 22,374 program variants.
  • The framework introduces Semantic Altering Changes (SAC) and Semantic Preserving Changes (SPC) to evaluate how generated tests respond to code evolution.
  • On original programs, LLMs achieve 79% line coverage and 76% branch coverage with fully passing test suites.
  • Under SACs, the pass rate of newly generated tests drops to 66%, and branch coverage declines to 60%.
  • Under SPCs, despite unchanged functionality, pass rates fall to 79% and branch coverage to 69%.

Experiments

The experimental design utilizes the Project CodeNet dataset, focusing on Java and Python implementations. Baseline evaluation reveals that LLMs perform well on original, unmodified programs, achieving an average of 79.2% line coverage and 76.1% branch coverage with fully passing test suites, containing 13.1 tests per program on average. However, this performance degrades sharply under code changes. When subjected to Semantic Altering Changes (SAC), the pass rate of newly generated tests plummets to 66.5%, and branch coverage falls to 60.6%. Under Semantic Preserving Changes (SPC), the test pass rate decreases from 100% to 79%, and branch coverage drops to 69%.

Results

The experimental results show that LLMs achieve 79% line coverage and 76% branch coverage on original programs, with fully passing test suites. Under semantic-altering changes, the pass rate of newly generated tests drops to 66%, and branch coverage declines to 60%. Over 99% of failing SAC tests pass on the original program, indicating residual alignment with original behavior rather than adaptation to updated semantics. Even under semantic-preserving changes, despite unchanged functionality, pass rates still drop to 79% and branch coverage to 69%, suggesting that the models react to lexical changes rather than to true semantic impact.

Applications

The study's application scenarios primarily focus on automated test generation, especially when code frequently evolves. The findings indicate that current LLMs perform poorly in handling code changes, providing directions for developing more semantically aware test generation tools. Additionally, the study can be applied in software engineering education to help students understand the impact of code evolution on test generation.

Limitations & Outlook

The study primarily relies on a mutation-driven framework, which may not fully simulate real-world code change scenarios. Additionally, LLMs perform poorly in handling complex semantic changes, particularly when deep semantic understanding is required. The experiments are limited to Java and Python implementations, which may not fully represent the performance across other programming languages. Future research could explore developing LLMs with better semantic understanding capabilities to improve their test generation performance under code evolution.

Plain Language (Accessible to non-experts)

Imagine you're in a kitchen cooking a meal. You have a recipe that tells you how to make a delicious dish step by step. Now, suppose you have a smart assistant that can automatically generate a shopping list and cooking steps based on the recipe. This assistant is like a Large Language Model (LLM), which can generate test cases based on program code.

However, when you decide to make some small changes to the recipe, like adding a new spice, the assistant might get confused because it's used to working with the original recipe. This is similar to how LLMs perform when faced with code changes. When the code changes, LLMs might not adapt to these changes, and the generated test cases might no longer be valid.

The study shows that LLMs perform poorly when handling code changes, especially when semantics change. This is like your assistant struggling to adjust the shopping list and cooking steps when faced with new ingredients. To improve LLM performance, we need to develop smarter assistants that can understand recipe changes and adjust the shopping list and cooking steps accordingly.

In summary, current LLMs rely heavily on surface-level cues rather than deep semantic understanding when handling code changes. This finding provides directions for future improvements, helping us develop smarter automated test generation tools.

ELI14 (Explained like you're 14)

Hey there! Imagine you're playing a super cool game. You have an assistant that can automatically generate game guides to help you level up easily. This assistant is like a Large Language Model (LLM), which can generate test cases based on program code.

But sometimes, the game gets updated with new levels or changes some rules. When that happens, your assistant might get a bit confused because it's used to working with the old rules. This is similar to how LLMs perform when faced with code changes. When the code changes, LLMs might not adapt to these changes, and the generated test cases might no longer be valid.

The study found that LLMs perform poorly when handling code changes, especially when rules change. This is like your assistant struggling to adjust the guide when faced with new game rules. To improve LLM performance, we need to develop smarter assistants that can understand game rule changes and adjust the guide accordingly.

In summary, current LLMs rely heavily on surface-level cues rather than deep understanding when handling code changes. This finding provides directions for future improvements, helping us develop smarter automated test generation tools.

Glossary

Large Language Model (LLM)

A large language model is a deep learning-based model capable of generating natural language text. It is trained on vast amounts of text data and can understand and generate complex language structures.

In this paper, LLMs are used for automated test case generation for programs.

Automated Test Generation

Automated test generation is a technique that uses tools or algorithms to automatically generate software test cases. It aims to improve testing efficiency and coverage while reducing manual testing workload.

The paper investigates LLMs' performance in automated test generation.

Semantic Altering Change (SAC)

Semantic altering change refers to modifications in program code that lead to changes in program behavior. It typically involves adjustments to logic structures or functionalities.

The paper introduces SACs to evaluate LLMs' test generation capabilities.

Semantic Preserving Change (SPC)

Semantic preserving change refers to modifications in program code that do not affect program functionality or behavior. It typically involves code refactoring or variable renaming.

The paper introduces SPCs to evaluate LLMs' robustness.

Mutation-Driven Framework

A mutation-driven framework is a technique that evaluates the effectiveness of test suites by introducing code mutations. It simulates code changes to test the adaptability of generated test cases.

The paper uses a mutation-driven framework to analyze LLMs' test generation performance under program changes.
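
The framework's implementation is not reproduced here; the toy loop below is only a self-contained sketch of the idea, with a stub standing in for the LLM and coverage measurement omitted.

```python
# Toy sketch of a mutation-driven evaluation loop (not the paper's framework).
# The "model" is a stub that derives expectations from the ORIGINAL program,
# which is exactly the residual-alignment behavior such a framework can expose.

def original(x):
    return x * 2

def sac_variant(x):            # semantic-altering change: output differs
    return x * 2 + 1

def spc_variant(y):            # semantic-preserving change: same behavior, new shape
    doubled = y + y
    return doubled

def stub_generate_tests(_variant):
    # Stand-in for an LLM: (input, expected) pairs anchored to the original.
    return [(0, original(0)), (3, original(3)), (7, original(7))]

def pass_rate(variant, tests):
    passed = sum(1 for x, expected in tests if variant(x) == expected)
    return passed / len(tests)

for name, variant in [("SAC", sac_variant), ("SPC", spc_variant)]:
    print(name, "pass rate:", pass_rate(variant, stub_generate_tests(variant)))
# Prints a 0.0 pass rate for the SAC variant and 1.0 for the SPC variant,
# the asymmetry that a mutation-driven evaluation is designed to surface.
```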

Test Coverage

Test coverage is a metric that measures the extent to which test cases cover program code. It typically includes line coverage and branch coverage, reflecting test cases' coverage of code execution paths.

The paper evaluates the coverage of test cases generated by LLMs.

Line Coverage

Line coverage refers to the proportion of program code lines executed by test cases. High line coverage usually indicates that test cases cover most code lines.

In the paper, LLMs achieve 79% line coverage on original programs.

Branch Coverage

Branch coverage refers to the proportion of program code branches executed by test cases. High branch coverage usually indicates that test cases cover most code branches.

In the paper, LLMs achieve 76% branch coverage on original programs.
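
As a generic illustration (not drawn from the paper's benchmarks), the snippet below shows why branch coverage is stricter than line coverage: one test can execute every line of a function yet still miss a branch outcome.

```python
# Generic coverage illustration (unrelated to the paper's dataset).

def absolute(x):
    result = x
    if x < 0:
        result = -x
    return result

def test_negative_input():
    # Running only this test executes every line of absolute() -- 100% line
    # coverage -- but exercises only the True outcome of the `if`, so branch
    # coverage stays below 100%.
    assert absolute(-3) == 3

def test_positive_input():
    # Adding this test exercises the False outcome, completing branch coverage.
    assert absolute(3) == 3
```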

Code Evolution

Code evolution refers to the continuous modification and updating of code during software development. These changes may involve adding features, fixing bugs, or optimizing performance.

The paper investigates LLMs' test generation performance under code evolution.

Symbolic Execution

Symbolic execution is a program analysis technique that explores all possible execution paths of a program using symbolic variables instead of concrete values.

The paper mentions symbolic execution as a complementary test generation technique that future work could integrate with LLM-based generation.

Open Questions (Unanswered questions from this research)

  1. Current LLMs perform poorly in handling complex semantic changes, particularly when deep semantic understanding is required. Future research needs to develop models with stronger semantic understanding to improve test generation under code evolution.
  2. The study primarily relies on a mutation-driven framework, which may not fully simulate real-world code change scenarios. Future research could explore integrating other test generation techniques, such as symbolic execution and fuzz testing, to enhance test coverage and accuracy.
  3. The experiments are limited to Java and Python implementations, which may not fully represent performance across other programming languages. Future research could expand to other languages to verify how these findings generalize.
  4. Current LLMs rely heavily on surface-level cues rather than deep semantic understanding in test generation. How can their semantic understanding be strengthened so that generated tests track evolving program behavior?
  5. The findings show that LLMs adapt poorly to code evolution, pointing toward the development of more semantically aware automated test generation tools.

Applications

Immediate Applications

Automated Test Generation

The findings can be applied to the development of automated test generation tools, especially when code frequently evolves, to improve test coverage and accuracy.

Software Engineering Education

The findings can be applied in software engineering education to help students understand the impact of code evolution on test generation and improve their testing skills in software development.

Code Refactoring Tools

The findings can be applied to the development of code refactoring tools to help developers generate efficient test cases during code refactoring, improving code quality.

Long-term Vision

Intelligent Test Generation Tools

Develop more intelligent test generation tools that can understand code semantic changes and adjust generated test cases accordingly, improving test effectiveness and coverage.

Cross-Language Test Generation

Expand research to other programming languages to develop tools capable of generating efficient test cases across different languages, improving software development efficiency and quality.

Abstract

Large Language Models (LLMs) are increasingly used for automated unit test generation. However, it remains unclear whether these tests reflect genuine reasoning about program behavior or simply reproduce superficial patterns learned during training. If the latter dominates, LLM-generated tests may exhibit weaknesses such as reduced coverage, missed regressions, and undetected faults. Understanding how LLMs generate tests and how those tests respond to code evolution is therefore essential. We present a large-scale empirical study of LLM-based test generation under program changes. Using an automated mutation-driven framework, we analyze how generated tests react to semantic-altering changes (SAC) and semantic-preserving changes (SPC) across eight LLMs and 22,374 program variants. LLMs achieve strong baseline results, reaching 79% line coverage and 76% branch coverage with fully passing test suites on the original programs. However, performance degrades as programs evolve. Under SACs, the pass rate of newly generated tests drops to 66%, and branch coverage declines to 60%. More than 99% of failing SAC tests pass on the original program while executing the modified region, indicating residual alignment with the original behavior rather than adaptation to updated semantics. Performance also declines under SPCs despite unchanged functionality: pass rates fall to 79% and branch coverage to 69%. Although SPC edits preserve semantics, they often introduce larger syntactic changes, leading to instability in generated test suites. Models generate more new tests while discarding many baseline tests, suggesting sensitivity to lexical changes rather than true semantic impact. Overall, our results indicate that current LLM-based test generation relies heavily on surface-level cues and struggles to maintain regression awareness as programs evolve.

cs.SE cs.AI