From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification
Using Dafny-based formal verification to turn natural language into verified code, the Gemma 4-31B model achieved a 90.91% verification success rate.
Key Findings
Methodology
This study proposes a Dafny-based formal verification framework to evaluate large language models' ability to generate code from natural language. Three prompting strategies were used: contextless prompts, signature prompts, and self-healing prompts. Self-healing prompts mimic human developers' workflows through iterative feedback from the Dafny verifier, significantly improving code verification success rates.
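The three tiers can be sketched as prompt builders. This is a minimal illustration of the tiering idea only; the exact prompt templates below are assumptions, not the paper's wording.

```python
# Sketch of the three prompt tiers (templates are illustrative assumptions).

def contextless_prompt(problem: str) -> str:
    # Tier 1: only the natural-language problem statement, no structure.
    return f"Write a verified Dafny program for:\n{problem}"

def signature_prompt(problem: str, signature: str) -> str:
    # Tier 2: add a structural anchor (a fixed method signature).
    return (f"Write a verified Dafny program for:\n{problem}\n"
            f"Use exactly this method signature:\n{signature}")

def self_healing_prompt(problem: str, code: str, verifier_errors: str) -> str:
    # Tier 3: feed the Dafny verifier's error output back for repair.
    return (f"The following Dafny solution to:\n{problem}\n"
            f"failed verification:\n{code}\n"
            f"Verifier output:\n{verifier_errors}\n"
            "Fix the code and specifications so verification succeeds.")
```

Each tier strictly adds information: the signature constrains the model's output shape, and the self-healing tier turns verification into an iterative dialogue.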
Key Results
- Result 1: With structured signature prompts and self-healing prompts, the Gemma 4-31B model achieved a 90.91% verification success rate, while GPT-OSS 120B improved from zero to 81.82%. This indicates that structured prompts and feedback mechanisms enable open-weight language models to effectively generate formally verified code.
- Result 2: Contextless prompting strategies led to near-universal failure, highlighting the importance of structured prompts in formal verification.
- Result 3: Functional validation was performed using the uDebug platform, ensuring that generated code is not only formally correct but also functionally robust in practical applications.
Significance
This research holds significant implications for academia and industry. It demonstrates the potential of large language models in formal verification and provides a viable path for high-assurance software development. By combining formal verification with large language models, the study addresses reliability issues in code generation, opening new possibilities for automated software engineering.
Technical Contribution
Technical contributions include: 1) Introducing a novel Dafny-based formal verification framework, 2) Significantly improving code generation accuracy through structured signature and self-healing prompting strategies, 3) Integrating the uDebug platform into the verification process, providing a dual-layer validation mechanism to ensure functional correctness.
Novelty
This study is the first to integrate large language models with Dafny formal verification, proposing a new method for code generation. Compared to existing work, this study not only focuses on code generation but also emphasizes code verification and functional correctness, filling a gap in generating formally verified code directly from natural language.
Limitations
- Limitation 1: Although self-healing prompts significantly improve verification success rates, there are still failures in some complex problems, indicating a need for further optimization of prompting strategies.
- Limitation 2: The Dafny ecosystem's dataset size is small, which may limit the model's generalization capabilities on larger-scale problems.
- Limitation 3: The current study focuses mainly on algorithmic problems and has not been validated in broader software engineering domains.
Future Work
Future research directions include: 1) Expanding dataset size to cover more diverse problem types, 2) Optimizing prompting strategies to further improve verification success rates, 3) Exploring the application of large language models in other formal verification languages such as Coq and Lean.
AI Executive Summary
In the field of software engineering, automated code generation has been a topic of great interest. However, existing large language models often produce code that is syntactically plausible but semantically incorrect, a phenomenon known as 'hallucination.' To address this issue, researchers have proposed a new approach that combines large language models with Dafny formal verification to ensure the correctness of generated code.
This study provides a dataset called NaturalLanguage2VerifiedCode (NL2VC)-60, which includes 60 complex algorithmic problems. Researchers used three prompting strategies: contextless prompts, signature prompts, and self-healing prompts. Self-healing prompts mimic human developers' workflows through iterative feedback from the Dafny verifier, significantly improving code verification success rates.
In the experiments, researchers selected 11 problem sets and evaluated them using seven open-weight large language models. The results showed that structured signature prompts and self-healing prompts significantly improved verification success rates, with the Gemma 4-31B model achieving a 90.91% verification success rate, and GPT-OSS 120B improving from zero to 81.82%. These results indicate that structured prompts and feedback mechanisms enable open-weight language models to effectively generate formally verified code.
Additionally, the study integrated the uDebug platform for functional validation, ensuring that generated code is not only formally correct but also functionally robust in practical applications. uDebug is a community-driven platform designed for competitive programmers to validate their solutions against high-quality test suites.
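Judge-style functional validation of this kind amounts to running a solution against input/expected-output pairs. uDebug itself is a web platform, so the local harness below is only an analogy of that check, not its actual API.

```python
import subprocess

def functional_check(binary: str, cases) -> bool:
    """Run an executable against (stdin, expected_stdout) pairs, in the spirit
    of judge-style validation. `binary` and the case format are assumptions."""
    for stdin_data, expected in cases:
        result = subprocess.run([binary], input=stdin_data,
                                capture_output=True, text=True, timeout=10)
        if result.stdout.strip() != expected.strip():
            return False  # functional mismatch, even if verification passed
    return True
```

This is the second layer of the paper's dual validation: a program can satisfy a trivial specification yet still fail such concrete test cases.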
This research holds significant implications for academia and industry. It demonstrates the potential of large language models in formal verification and provides a viable path for high-assurance software development. By combining formal verification with large language models, the study addresses reliability issues in code generation, opening new possibilities for automated software engineering.
Despite the significant achievements, the study also has some limitations. For example, the Dafny ecosystem's dataset size is small, which may limit the model's generalization capabilities on larger-scale problems. Additionally, although self-healing prompts significantly improve verification success rates, there are still failures in some complex problems. Future research directions include expanding dataset size, optimizing prompting strategies, and exploring the application of large language models in other formal verification languages.
Deep Analysis
Background
Formal verification plays a crucial role in software engineering by providing mathematical proof that programs strictly satisfy their specifications. Its application is increasingly widespread in high-stakes domains such as security-sensitive infrastructure, cryptographic libraries, and autonomous aerospace systems. However, writing program properties and proofs remains a creative and manual process requiring significant expertise. With the rise of large language models like GitHub Copilot and Amazon Q Developer, software development workflows have transformed significantly. These AI-driven tools accelerate programming tasks through natural language to code translation and intelligent autocompletion. However, despite their fluency, these systems often produce code that is syntactically plausible but semantically incorrect, especially when dealing with complex algorithmic logic. To ensure the correctness and logical integrity of generated code, there is a need to bridge the gap between AI synthesis and formal verification.
Core Problem
Large language models show promise in automated software engineering, yet their guarantee of correctness is frequently undermined by erroneous or hallucinated code. To enforce model honesty, formal verification requires LLMs to synthesize implementation logic alongside formal specifications that are subsequently proven correct by a mathematical verifier. However, the transition from informal natural language to precise formal specification remains an arduous task. Existing formal verification languages, such as F*, Coq, Lean, and the Java Modeling Language (JML), face challenges in this process. Dafny, as a language that balances imperative programming and automated theorem proving, supports verification via assertions, preconditions, and postconditions. However, authoring formal specifications and auxiliary verification assertions remains difficult.
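Dafny discharges `requires`/`ensures` clauses statically with a theorem prover. As a rough dynamic analogue only, the Python sketch below encodes the same contracts as runtime assertions on an integer square root (the example function is illustrative, not from the paper).

```python
def int_sqrt(n: int) -> int:
    # Dafny analogue: requires n >= 0
    assert n >= 0, "precondition violated: n must be non-negative"
    r = 0
    # Dafny analogue of a loop invariant: r*r <= n holds on every iteration.
    while (r + 1) * (r + 1) <= n:
        r += 1
    # Dafny analogue: ensures r*r <= n < (r+1)*(r+1)
    assert r * r <= n < (r + 1) * (r + 1), "postcondition violated"
    return r
```

The key difference is that Dafny proves these properties for all inputs before the program runs, whereas assertions only catch violations on the inputs actually executed.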
Innovation
The core innovations of this study include: 1) Introducing the NaturalLanguage2VerifiedCode (NL2VC)-60 dataset, designed to evaluate the synthesis of formally verified code from complex, real-world requirements. 2) Developing a Dafny-based formal verification framework to evaluate large language models' ability to generate code from natural language. 3) Introducing three prompting strategies: contextless prompts, signature prompts, and self-healing prompts. Self-healing prompts mimic human developers' workflows through iterative feedback from the Dafny verifier, significantly improving code verification success rates. 4) Integrating the uDebug platform for functional validation, ensuring that generated code is not only formally correct but also functionally robust.
Methodology
- Introduced the NaturalLanguage2VerifiedCode (NL2VC)-60 dataset, containing 60 complex algorithmic problems.
- Used three prompting strategies: contextless prompts, signature prompts, and self-healing prompts.
- Mimicked human developers' workflows through iterative feedback from the Dafny verifier using self-healing prompts.
- Selected 11 problem sets and evaluated them using seven open-weight large language models.
- Integrated the uDebug platform for functional validation, ensuring that generated code is functionally robust in practical applications.
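The self-healing workflow described above is essentially a generate-verify-repair loop. The sketch below factors the LLM call and the verifier into injectable callables; the `generate`/`verify` interfaces, the iteration budget, and the `dafny verify` CLI invocation are all assumptions, not the paper's exact setup.

```python
import subprocess, tempfile

def self_heal(problem, generate, verify, max_rounds=3):
    """Iterative repair loop: `generate(problem, feedback)` stands in for an
    LLM call; `verify(code)` returns (ok, errors)."""
    feedback = ""
    for _ in range(max_rounds):
        code = generate(problem, feedback)
        ok, errors = verify(code)
        if ok:
            return code  # verified solution
        feedback = errors  # verifier output steers the next attempt
    return None  # budget exhausted without verification

def dafny_verify(code):
    """One plausible verifier adapter, assuming a `dafny verify` CLI on PATH."""
    with tempfile.NamedTemporaryFile("w", suffix=".dfy", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(["dafny", "verify", path],
                            capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr
```

Separating `verify` from the loop makes the strategy testable without a Dafny installation and reusable with other verifiers.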
Experiments
The experimental design involved selecting 11 problem sets from the UVa Online Judge and evaluating them using seven open-weight large language models. Three prompting strategies were used: contextless prompts, signature prompts, and self-healing prompts. Self-healing prompts mimic human developers' workflows through iterative feedback from the Dafny verifier. The uDebug platform was integrated for functional validation, ensuring that generated code is functionally robust in practical applications. The results showed that structured signature prompts and self-healing prompts significantly improved verification success rates, with the Gemma 4-31B model achieving a 90.91% verification success rate, and GPT-OSS 120B improving from zero to 81.82%.
Results
The results showed that structured signature prompts and self-healing prompts significantly improved verification success rates, with the Gemma 4-31B model achieving a 90.91% verification success rate, and GPT-OSS 120B improving from zero to 81.82%. Contextless prompting strategies led to near-universal failure, highlighting the importance of structured prompts in formal verification. Functional validation was performed using the uDebug platform, ensuring that generated code is not only formally correct but also functionally robust in practical applications.
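The reported percentages are consistent with simple fractions of the 11 evaluated problems (the per-model pass counts below are inferred from the rates, not stated explicitly in this summary):

```python
# Verification success rate as a percentage of problems passed.
def success_rate(passed: int, total: int) -> float:
    return round(100 * passed / total, 2)

print(success_rate(10, 11))  # 90.91  (matches the Gemma 4-31B figure)
print(success_rate(9, 11))   # 81.82  (matches the GPT-OSS 120B figure)
```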
Applications
The application scenarios of this study include: 1) High-assurance software development by combining formal verification with large language models to address reliability issues in code generation. 2) Automated software engineering, providing a viable path for software development by leveraging large language models to accelerate programming tasks and reduce manual intervention. 3) Application in high-stakes domains such as security-sensitive infrastructure, cryptographic libraries, and autonomous aerospace systems.
Limitations & Outlook
Despite the significant achievements, the study also has some limitations. For example, the Dafny ecosystem's dataset size is small, which may limit the model's generalization capabilities on larger-scale problems. Additionally, although self-healing prompts significantly improve verification success rates, there are still failures in some complex problems. Future research directions include expanding dataset size, optimizing prompting strategies, and exploring the application of large language models in other formal verification languages.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen cooking a meal. You have a recipe (natural language description), but you need to ensure each step is executed correctly (formal verification). A large language model acts like a smart assistant that can help you cook based on the recipe, but sometimes it makes mistakes, like adding the wrong ingredient. This is where a verifier (Dafny) comes in to check if each step is correct. Researchers found that by giving the assistant some structured prompts, like telling it to add salt before sugar, its performance improves. Additionally, if the assistant makes a mistake, the verifier will point out where it went wrong, allowing the assistant to adjust based on feedback until each step is correct. It's like in the kitchen, where you keep trying and adjusting until you make the perfect dish. In this way, researchers demonstrated how to use large language models and formal verification to generate high-quality software code.
ELI14 (Explained like you're 14)
Hey there! Imagine you're playing a game where you have to complete tasks based on clues. Sometimes, the clues are vague, and you don't know what to do. This is like the problem large language models face when generating code. To help them, we give them clearer clues, like in a game where you get a more detailed task description. This way, they can complete the task better! And if they make a mistake, we have a super smart checker (Dafny) to tell them where they went wrong. Then, they can adjust based on the checker's feedback, just like you keep trying in a game until you complete the task. This way, we can ensure the generated code is correct, just like you getting a high score in the game! Cool, right?
Glossary
Large Language Model
A type of AI model trained on vast amounts of text data, capable of generating natural language text.
Used for automated software code generation.
Formal Verification
The process of ensuring that a program strictly satisfies its specifications through mathematical proof.
Used to verify the correctness of code generated by large language models.
Dafny
A programming language that supports formal verification, allowing verification through assertions, preconditions, and postconditions.
Used as a formal verification framework.
Self-Healing Prompting
A prompting strategy that helps large language models correct errors through feedback mechanisms.
Used to improve code verification success rates.
uDebug
A community-driven platform for validating programmers' solutions against high-quality test suites.
Used for functional validation to ensure code is robust in practical applications.
Hallucination
The phenomenon in which large language models generate output, such as code, that is syntactically plausible but semantically incorrect.
Needs to be addressed through formal verification.
Signature Prompt
A prompting strategy that provides structured hints to help large language models generate more accurate code.
Used to improve code verification success rates.
UVa Online Judge
An online automated judging system providing a vast number of programming problems.
Used for selecting experimental problem sets.
Program Synthesis
The process of automatically generating a program that meets a specific specification.
The core task of the study.
Verification Success Rate
The proportion of generated code that passes formal verification.
Used to evaluate the performance of large language models.
Open Questions (Unanswered questions from this research)
1. How can the generalization capabilities of large language models be improved on larger datasets? The existing Dafny ecosystem's dataset size is small, which may limit the model's performance on larger-scale problems. Developing larger datasets and exploring ways to optimize prompting strategies are needed.
2. How can large language models be applied in other formal verification languages? The current study focuses mainly on Dafny and has not been validated in other formal verification languages such as Coq and Lean. Exploring the potential application of large language models in these languages is needed.
3. How can the effectiveness of self-healing prompts be further improved? Although self-healing prompts significantly improve verification success rates, there are still failures in some complex problems. Optimizing prompting strategies and exploring new feedback mechanisms are needed.
4. How can the hallucination problem in code generated by large language models be addressed? Existing formal verification methods can solve part of the problem, but new methods are needed to improve the accuracy of code generation.
5. How can large language models be applied in broader software engineering domains? The current study focuses mainly on algorithmic problems and has not been validated in broader software engineering domains. Exploring the potential application of large language models in other domains is needed.
Applications
Immediate Applications
High-Assurance Software Development
Combining formal verification with large language models to address reliability issues in code generation, providing a viable path for high-assurance software development.
Automated Software Engineering
Providing a viable path for software development by leveraging large language models to accelerate programming tasks and reduce manual intervention.
Security-Sensitive Infrastructure
Applying formal verification in security-sensitive infrastructure to ensure system security and reliability, preventing critical vulnerabilities and operational disruptions.
Long-term Vision
Cryptographic Libraries
Applying formal verification in cryptographic libraries to ensure the correctness and security of encryption algorithms, preventing data breaches.
Autonomous Aerospace Systems
Applying formal verification in autonomous aerospace systems to ensure system security and reliability, preventing critical vulnerabilities and operational disruptions.
Abstract
Large Language Models (LLMs) show promise in automated software engineering, yet their guarantee of correctness is frequently undermined by erroneous or hallucinated code. To enforce model honesty, formal verification requires LLMs to synthesize implementation logic alongside formal specifications that are subsequently proven correct by a mathematical verifier. However, the transition from informal natural language to precise formal specification remains an arduous task. Our work addresses this by providing the NaturalLanguage2VerifiedCode (NL2VC)-60 dataset: a collection of 60 complex algorithmic problems. We evaluate 11 randomly selected problem sets across seven open-weight LLMs using a tiered prompting strategy: contextless prompts, signature prompts providing structural anchors, and self-healing prompts utilizing iterative feedback from the Dafny verifier. To address vacuous verification, where models satisfy verifiers with trivial specifications, we integrate the uDebug platform to ensure functional validation. Our results show that while contextless prompting leads to near-universal failure, structural signatures and iterative self-healing facilitate a dramatic performance turnaround. Specifically, Gemma 4-31B achieved a 90.91% verification success rate, while GPT-OSS 120B rose from zero to 81.82% success with signature-guided feedback. These findings indicate that formal verification is now attainable for open-weight LLMs, which serve as effective apprentices for synthesizing complex annotations and facilitating high-assurance software development.