Probe-and-Refine Tuning of Repository Guidance for Coding Agents

TL;DR

This paper proposes probe-and-refine tuning, an iterative method that uses synthetic bug-fix probes to improve a repository's guidance file, raising the mean resolve rate on SWE-bench Verified from 28.3% to 33.0% with Qwen3.5-35B-A3B.

cs.SE Advanced 2026-06-19

Asa Shepard Jeannie Albrecht

AI Software Engineering Large Language Models Prompt Tuning Code Agents

Key Findings

Methodology

This paper introduces a lightweight probe-and-refine tuning process that relies solely on single-shot calls to a large language model (LLM), with no agent loop or tool use during tuning. The process generates synthetic bug-fix probes and uses them to diagnose deficiencies in a repository guidance file: the model attempts to fix each probe, an automatic evaluation assesses the fix, and the guidance file is patched iteratively based on that feedback. On SWE-bench Verified, across four independent trials with Qwen3.5-35B-A3B at 200 steps, the mean resolve rate rises from 28.3% (the static knowledge base used to initialize tuning) to 33.0% (p<0.001). The gain comes from coverage rather than per-patch precision: refined guidance yields evaluable patches for 14.5 percentage points more instances, while per-patch precision stays statistically constant (~59%, p = 0.119). Additional experiments show that guidance is what lets the agent use a larger step budget productively, and that the tuning loop degrades when the model cannot produce sufficiently diagnostic output, though per-patch precision remains constant even then.

Key Results

Across four independent trials, probe-and-refine tuning achieved a mean resolve rate of 33.0%, compared with 28.3% for the static knowledge base and 25.5% for the unguided baseline (p<0.001 for both contrasts).
The improvement came primarily from coverage: refined guidance produced evaluable patches for 14.5 percentage points more instances, indicating better file localization, while per-patch precision remained statistically unchanged at approximately 59% (p = 0.119).
A step-budget experiment showed that guidance is what lets the agent use larger step budgets productively. A cross-model experiment with NVIDIA-Nemotron-3-Nano-30B-A3B showed that the tuning loop degrades when the model cannot generate sufficiently diagnostic output, though per-patch precision remains constant even then.

Significance

This work underscores the importance of guidance content quality in large language model-based code agents. By leveraging synthetic failure probes for iterative guidance refinement, the approach addresses the limitations of static instructions and complex reinforcement learning methods. It demonstrates that simple, content-focused iterative tuning can significantly improve task coverage and resolution rates, offering a scalable and computationally efficient pathway for enhancing AI-assisted software development. The findings have broad implications for deploying autonomous code repair systems in real-world environments, reducing manual debugging efforts, and enabling continuous improvement of AI-driven tools.

Technical Contribution

The main technical contribution is a probe-and-refine procedure that relies solely on single-shot LLM calls to diagnose and improve repository guidance files iteratively. The process generates synthetic bug-fix probes, attempts fixes, evaluates the outcomes, and mechanically patches the guidance content, with no multi-step agent loop, tool use, or reinforcement learning during tuning. The approach turns static, generic guidance into specialized, operational instructions tailored to each repository, increasing coverage and resolve rate. The method is evaluated with two models: the gains are reported for Qwen3.5-35B-A3B, while the cross-model experiment with NVIDIA-Nemotron-3-Nano-30B-A3B shows that the tuning loop degrades when the model's output is not sufficiently diagnostic.

Novelty

This research is the first to systematically leverage synthetic bug probes for iterative, content-driven guidance refinement in code agents. Unlike prior work that focuses on static instructions, fine-tuning, or multi-step interactions, the probe-and-refine method emphasizes diagnosis and mechanical correction of guidance content through single-shot LLM calls. Its novelty lies in transforming failure feedback into actionable guidance updates, which substantially improves coverage and resolution rates without complex training or reinforcement learning. This approach introduces a new paradigm in prompt engineering and operational guidance for AI in software engineering.

Limitations

The effectiveness diminishes when the model's capacity to generate diagnostic outputs is limited, leading to degraded tuning loops and reduced improvements in guidance quality.
The guidance length increases by approximately 63% after refinement, which may pose challenges under context length constraints in practical deployment.
The current validation is limited to two models (Qwen3.5-35B-A3B and NVIDIA-Nemotron-3-Nano-30B-A3B) and one benchmark (SWE-bench Verified); broader applicability across other models and real-world repositories still needs to be tested.

Future Work

Future directions include integrating adaptive sample generation techniques to enhance the diversity and relevance of synthetic probes, exploring multi-modal feedback mechanisms, and extending the approach to larger models and more complex codebases. Additionally, automating the generation of high-quality synthetic failure scenarios and applying the method in live industrial environments for continuous online tuning are promising avenues. Further research is needed to understand the limits of content-based guidance refinement and to develop more scalable, robust frameworks for AI-assisted software maintenance.

AI Executive Summary

In the rapidly evolving field of AI-powered software engineering, large language models (LLMs) have demonstrated remarkable capabilities in code understanding and generation. However, their effectiveness in real-world tasks heavily depends on the quality of operational guidance provided within repositories. Traditionally, developers maintain static guidance files such as AGENTS.md, which encode conventions, entry points, and debugging workflows. Yet, the impact of these files on model performance remains controversial, with studies reporting conflicting results—some indicating efficiency gains, others noting reduced resolution rates.

This paper introduces a novel approach called probe-and-refine tuning, designed to systematically improve repository guidance through iterative, synthetic failure-based diagnosis. Unlike traditional methods that rely on multi-step agent loops or reinforcement learning, this approach employs only single-shot calls to an LLM to generate synthetic bug-fix probes, attempt repairs, evaluate outcomes, and mechanically update guidance content. The process is lightweight, transparent, and highly adaptable, requiring no additional training or complex interactions.

The core idea is to use the model's own failures on synthetic probes to identify gaps in the guidance, then refine the instructions to close those gaps. In this way the guidance evolves from generic, broad instructions into specific, operationally relevant directives tailored to each repository. On SWE-bench Verified, across four independent trials with Qwen3.5-35B-A3B, this iterative refinement raises the mean resolve rate from 28.3% (static knowledge base) to 33.0%, a statistically significant improvement (p<0.001). The gain stems primarily from increased coverage (more instances produce evaluable patches), while per-patch precision remains stable at around 59%.

Further analysis shows that guidance is what lets the agent use a larger step budget productively. A cross-model experiment with NVIDIA-Nemotron-3-Nano-30B-A3B finds that the tuning loop degrades when the model cannot generate sufficiently diagnostic output, though per-patch precision remains constant even then. Together these findings indicate that the refinement process depends on the diagnostic quality of the model driving it, while the precision of the resulting patches is stable across conditions.

Overall, this work advances the understanding of how operational guidance impacts code agent performance. It offers a simple yet powerful framework for iterative, failure-informed guidance optimization, paving the way for more autonomous, reliable AI-assisted software development. Future research will focus on scaling the approach, integrating multi-modal feedback, and deploying in real-world industrial settings to realize continuous, adaptive code repair systems.

Deep Analysis

Background

The application of large language models (LLMs) in software engineering has led to significant progress in code synthesis, debugging, and documentation generation. Early efforts focused on fine-tuning models like CodeT5 and CodeBERT, or developing scaffolds such as tree search and local search algorithms, to improve task-specific performance. Recent advancements include integrating structural knowledge graphs (e.g., RepoGraph) and contextual information (e.g., AGENTS.md files) to enhance model understanding of repositories. Despite these developments, the effectiveness of repository-level guidance remains debated. Some studies report efficiency gains when using curated guidance, while others observe decreased bug resolution rates with automatically generated files. These conflicting results suggest that the quality and production process of guidance files are critical factors. Existing approaches often rely on static instructions or complex iterative training, which can be costly and inflexible. The need for a systematic, lightweight method to iteratively refine operational guidance based on failure feedback remains unmet. This research addresses this gap by proposing a content-driven, synthetic failure-based tuning framework that enhances guidance quality without extensive retraining.

Core Problem

The core challenge in deploying AI-based code agents is the variability and often suboptimal quality of repository guidance files, which encode operational knowledge such as debugging workflows, module navigation, and testing procedures. Static guidance files are typically generated once and remain unchanged, limiting their adaptability to new or unforeseen issues. When models encounter unfamiliar failures, they tend to jump to incorrect files or propose invalid fixes, partly because guidance lacks the diagnostic specificity needed for effective troubleshooting. The fundamental bottleneck is the absence of a systematic, scalable mechanism to diagnose and iteratively improve guidance content based on actual failure feedback. Existing methods either rely on costly fine-tuning, reinforcement learning, or manual curation, which are not scalable or adaptable enough for diverse repositories. The challenge is to develop a lightweight, automated process that can diagnose deficiencies, generate targeted guidance updates, and improve coverage and effectiveness in a scalable manner.

Innovation

This paper introduces a novel probe-and-refine tuning process that leverages synthetic bug-fix probes to iteratively diagnose and improve repository guidance files. Its key innovations include:

- Generating diverse synthetic failure scenarios using the language model itself, simulating real debugging challenges.
- Conducting single-shot attempts to fix these synthetic bugs, followed by automatic evaluation of the fixes.
- Mechanically patching guidance files based on diagnostic feedback, expanding coverage and operational specificity.
- Eliminating the need for multi-step agent loops, reinforcement learning, or retraining, making the process lightweight and transparent.
- Validating the approach across multiple repositories and models, demonstrating consistent improvements in bug resolution rates.

This approach contrasts with traditional static guidance or complex fine-tuning, emphasizing content diagnosis and mechanical correction to achieve scalable, effective guidance refinement.
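To make "mechanical patching" concrete, here is a minimal sketch of a deterministic edit applier. The summary does not describe the paper's actual edit format, so the `GuidanceEdit` schema, the `## ` section headings, and the `apply_edits` function are all hypothetical; only the 3000-character limit is taken from the Methodology section of this summary.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class GuidanceEdit:
    """One proposed addition to the guidance file (hypothetical schema)."""
    section: str  # heading the rule belongs under
    rule: str     # one-line instruction to add


def apply_edits(guidance: str, edits: Iterable[GuidanceEdit], max_chars: int = 3000) -> str:
    """Apply edits deterministically: no LLM call, no free-form rewriting."""
    lines = guidance.splitlines()
    for edit in edits:
        bullet = f"- {edit.rule}"
        if bullet in lines:
            continue  # rule already present, nothing to do
        heading = f"## {edit.section}"
        updated = list(lines)
        if heading in updated:
            # Insert at the end of the existing section.
            start = updated.index(heading) + 1
            end = next(
                (i for i in range(start, len(updated)) if updated[i].startswith("## ")),
                len(updated),
            )
            updated.insert(end, bullet)
        else:
            # Unknown section: append a new one at the end of the file.
            updated += [heading, bullet]
        # Assumed policy: drop any edit that would push the file over the cap.
        if len("\n".join(updated)) <= max_chars:
            lines = updated
    return "\n".join(lines)
```

Because the applier is plain string manipulation, every change to the guidance can be traced back to a single edit, which is one plausible reading of the "transparent" property claimed above.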

Methodology

- Build repository structural knowledge using tree-sitter parsing, extracting key modules, dependencies, and entry points, summarized in natural language.
- Generate synthetic bug-fix probes by sampling from the model with temperature 0.9, producing diverse failure scenarios across subsystems.
- For each probe, attempt to generate a fix using the current guidance, simulating the code agent's behavior.
- Evaluate the attempted fix with a single-shot model call, assessing success and identifying shortcomings.
- Based on the evaluation, generate guidance edits (such as procedural rules, structural navigation, or quality gates) and mechanically patch the guidance file.
- Limit guidance length to 3000 characters, ensuring concise, operational instructions.
- Repeat the cycle for 3–5 iterations, stopping early if guidance stabilizes or no further improvements are detected.
- The entire process relies solely on single-shot LLM calls, avoiding multi-step reasoning or reinforcement learning, thus maintaining simplicity and efficiency.
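The steps above can be sketched as a short control loop. This is an illustration under stated assumptions, not the authors' implementation: the four callables stand in for single-shot LLM calls, and their names and signatures, the `ProbeResult` record, and the policy of stopping when a candidate exceeds the length cap are all hypothetical. The 3000-character cap and the iteration limit come from the list above.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_GUIDANCE_CHARS = 3000  # length cap stated in the methodology


@dataclass
class ProbeResult:
    probe: str       # synthetic bug description
    resolved: bool   # evaluator's verdict on the attempted fix
    diagnosis: str   # evaluator's note on what the guidance was missing


def probe_and_refine(
    guidance: str,
    generate_probes: Callable[[str], List[str]],       # hypothetical LLM call
    attempt_fix: Callable[[str, str], str],            # hypothetical LLM call
    evaluate: Callable[[str, str], ProbeResult],       # hypothetical LLM call
    propose_patch: Callable[[str, List[ProbeResult]], str],  # hypothetical
    max_iterations: int = 5,
) -> str:
    """Iteratively patch `guidance` until it stabilizes or the budget runs out."""
    for _ in range(max_iterations):
        probes = generate_probes(guidance)
        results = [evaluate(p, attempt_fix(guidance, p)) for p in probes]
        failures = [r for r in results if not r.resolved]
        if not failures:
            break  # no diagnosed gaps left on this batch of probes
        candidate = propose_patch(guidance, failures)
        # Assumed stopping policy: stop if nothing changed or the cap is exceeded.
        if candidate == guidance or len(candidate) > MAX_GUIDANCE_CHARS:
            break
        guidance = candidate
    return guidance
```

Passing the model calls in as plain functions keeps the loop itself free of any agent machinery, which mirrors the claim that tuning needs only single-shot calls; it also makes the loop easy to exercise with stub functions before wiring in a real model.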

Experiments

The experimental setup involved four repositories (Django, SymPy, Matplotlib, and scikit-learn), each evaluated on SWE-bench Verified. For each repository, the guidance was initially constructed from static structural and procedural knowledge, then refined through 3–5 iterations of the probe-and-refine process. The average guidance length increased from 1687 to 2754 characters, mainly through added repo-specific navigation, debugging workflows, and quality rules. The models used were Qwen3.5-35B-A3B for the main experiments and NVIDIA-Nemotron-3-Nano-30B-A3B for cross-model validation. The primary metric was resolve rate, the percentage of instances where the agent's patch passed all tests. Baselines were the unguided and static-guidance conditions. The experiments also included a step-budget analysis, varying the number of steps allowed, and a cross-model test to evaluate behavior under a different model. Iterative guidance refinement improved resolve rates over both baselines, with p-values below 0.001.
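As a quick check on the reported lengths, the snippet below computes the relative growth of the guidance and the room left under the 3000-character limit. All three numbers come from this summary; the variable names are illustrative.

```python
before_chars = 1687  # average guidance length before refinement
after_chars = 2754   # average guidance length after refinement
cap_chars = 3000     # stated guidance length limit

growth = (after_chars - before_chars) / before_chars
headroom = cap_chars - after_chars

print(f"growth: {growth:.1%}")         # growth: 63.2%
print(f"headroom: {headroom} chars")   # headroom: 246 chars
```

The growth matches the roughly 63% increase cited under Limitations. Since these are averages, individual repositories may sit closer to the cap than the 246-character figure suggests.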

Results

The probe-and-refine procedure achieved an average bug resolution rate of 33.0%, outperforming static knowledge base guidance (28.3%) and unguided baselines (25.5%), with high statistical significance (p<0.001). Content analysis revealed that the refined guidance added detailed procedural workflows, structural navigation instructions, and quality gates, which collectively increased the coverage of evaluable patches by 14.5 percentage points. The per-patch precision remained stable at about 59%, indicating that the improvement was primarily due to better diagnosis and localization rather than more aggressive or less accurate fixes. Step-budget experiments demonstrated that guidance enabled the model to utilize larger step counts more productively, and cross-model validation showed that when the model’s diagnostic output diminished, the tuning loop’s effectiveness declined, though patch accuracy remained unaffected. These findings confirm that content-driven guidance refinement effectively enhances the operational performance of code agents.
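The coverage-versus-precision reading of these results rests on a simple identity: resolve rate equals coverage times per-patch precision. The sketch below computes the three quantities from per-instance outcomes. It assumes that an instance counts as "evaluable" when the agent produced a patch that could be tested, which is one plausible reading of the summary; the `InstanceOutcome` record and the toy data are invented for illustration and are not results from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, NamedTuple


@dataclass(frozen=True)
class InstanceOutcome:
    has_evaluable_patch: bool  # assumed meaning: a testable patch was produced
    resolved: bool             # the patch passed all tests


class Decomposition(NamedTuple):
    coverage: float      # evaluable instances / all instances
    precision: float     # resolved instances / evaluable instances
    resolve_rate: float  # resolved instances / all instances


def decompose(outcomes: Iterable[InstanceOutcome]) -> Decomposition:
    items = list(outcomes)
    if not items:
        raise ValueError("need at least one instance")
    evaluable = sum(o.has_evaluable_patch for o in items)
    resolved = sum(o.has_evaluable_patch and o.resolved for o in items)
    coverage = evaluable / len(items)
    precision = resolved / evaluable if evaluable else 0.0
    return Decomposition(coverage, precision, resolved / len(items))


# Made-up outcomes, not data from the paper: 10 instances, 6 evaluable, 3 resolved.
toy = (
    [InstanceOutcome(True, True)] * 3
    + [InstanceOutcome(True, False)] * 3
    + [InstanceOutcome(False, False)] * 4
)
d = decompose(toy)
print(d)  # coverage 0.6, precision 0.5, resolve_rate 0.3
```

Under this identity, a method can raise the resolve rate either by producing testable patches for more instances or by making each patch more likely to pass. The reported result is that refinement moves the first factor while the second stays near 59%.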

Applications

This approach can be directly applied to automated bug fixing, continuous integration pipelines, and AI-assisted code maintenance workflows. By iteratively refining operational guidance based on synthetic failure feedback, developers can reduce manual debugging efforts and improve the robustness of autonomous code repair systems. The method is especially suitable for open-source projects and large-scale enterprise systems where frequent updates and bug fixes are needed. Long-term, integrating this framework into CI/CD pipelines could enable continuous, self-improving guidance that adapts to evolving codebases, significantly reducing downtime and manual intervention. It also opens avenues for developing more intelligent, context-aware guidance systems that learn from failures without extensive retraining.

Limitations & Outlook

The effectiveness of the method depends on the model’s ability to generate diagnostic outputs; when this capacity is limited, the tuning loop degrades, reducing improvements. The guidance content length increases by about 63% after refinement, which may pose challenges in contexts with strict token limits. The validation is currently limited to specific models and datasets, and its generalization to other models, languages, or larger codebases remains to be demonstrated. Additionally, the synthetic probes, while diverse, may not capture all real-world failure modes, potentially limiting coverage in complex scenarios. Future work should focus on adaptive probe generation, multi-modal feedback integration, and real-world deployment to address these limitations.

Plain Language Accessible to non-experts

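Imagine you work in a factory that turns out all kinds of goods every day. Sometimes the production line runs into trouble: a machine suddenly stops, for example, and the whole line comes to a halt. The factory's engineers write an operations manual that tells workers what to do when something goes wrong. But the manual is sometimes not detailed enough, so workers still have to figure things out by trial and error, which costs a lot of time.

Now suppose you have a very clever robot assistant that can simulate all sorts of breakdowns on its own, try to fix them, and then tell you which parts of the manual are unclear and how they should be changed. It is like giving the robot a manual that is constantly updated, so that when a new problem comes up it can quickly find a solution. By repeatedly trying, failing, and revising the manual, the robot gets better and better and helps the factory solve problems faster.

The key to this method is that the robot does not know every answer from the start. Instead, it simulates faults, attempts repairs, evaluates the results, and keeps improving the guidance. As a result, the factory's production line becomes more stable and more efficient. It is much like learning a new skill: only by practicing repeatedly and taking stock of what you learned do you get good at it. The process is simple and efficient. It needs no complicated programs, and just letting the robot keep trying and correcting can make a big difference.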

ELI14 Explained like you're 14

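Imagine you are learning to cook at school. Your teacher gives you a recipe book, but sometimes the recipes are not detailed enough: you do not know when to add a bit more salt, or how to tell whether the dish is cooked through. So you try cooking it yourself, taste it, and tell yourself, "This isn't salty enough, add more salt." Then you try again until it tastes just right.

Now suppose there is a super-smart robot chef that can try cooking the dishes itself and then tell you which steps are unclear and what needs improving. It simulates different amounts of seasoning, tries different techniques, and keeps revising the recipe so that you no longer have to guess when you cook.

The robot is improving the recipe through trial and error: by simulating the problems that might come up, it gradually learns how to write better instructions. That way, whatever dish you make, the robot can give you the most practical advice and help you become a great cook. The method is simple and effective, just like how you get better and better by practicing and learning from experience.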

Abstract

LLM-based coding agents need higher-level operational knowledge about a repository (which files house which subsystems, how to run the test suite, which workflows have historically led to wrong fixes) that does not exist in the code itself. Engineers typically maintain AGENTS.md files to supply this context as instructions for coding agents, but whether they help is contested: recent studies disagree on whether LLM-generated guidance improves or harms agent performance. In this paper we show that how the guidance is produced is the decisive variable, and introduce probe-and-refine tuning: a procedure that uses synthetic bug-fix probes to iteratively diagnose and patch a repository's guidance file through single-shot LLM calls, with no agent loop or tool use during tuning. On SWE-bench Verified across four independent trials with Qwen3.5-35B-A3B at 200 steps, probe-and-refine achieves 33.0% mean resolve rate vs. 28.3% for the static knowledge base used to initialize it and 25.5% for an unguided baseline (p < 0.001 for both probe-and-refine contrasts). The improvement comes from coverage rather than precision: refined guidance produces evaluable patches for 14.5 percentage points (pp) more instances while per-patch precision remains statistically constant (~59%, p = 0.119), showing that improved guidance helps agents reach the correct file rather than improving the quality of the changes they make. Further, a step-budget experiment shows that guidance is what lets the agent use a larger step budget productively, and a cross-model experiment with NVIDIA-Nemotron-3-Nano-30B-A3B finds that the tuning loop degrades when the model cannot generate sufficiently diagnostic output, though per-patch precision remains constant even then.

cs.SE cs.LG
