From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification

TL;DR

Using Dafny-based formal verification with structured signature and self-healing prompts, the Gemma 4-31B model achieved a 90.91% verification success rate when generating code from natural language.

cs.SE · 2026-04-24
Md Erfan, Md Kamal Hossain Chowdhury, Ahmed Ryan, Md Rayhanur Rahman
formal verification · Dafny · program synthesis · software correctness · large language models

Key Findings

Methodology

This study proposes a Dafny-based formal verification framework to evaluate large language models' ability to generate code from natural language. Three prompting strategies were used: contextless prompts, signature prompts, and self-healing prompts. Self-healing prompts mimic human developers' workflows through iterative feedback from the Dafny verifier, significantly improving code verification success rates.
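The self-healing loop described above can be sketched as follows. This is a minimal, hypothetical illustration: `generate` and `verify` are stand-ins for an LLM call and an invocation of the Dafny verifier, not the paper's actual implementation.

```python
# Hypothetical sketch of the self-healing prompting loop: a candidate
# Dafny program is checked by the verifier, and any error messages are
# fed back into the prompt until verification succeeds or the retry
# budget runs out.

MAX_ROUNDS = 3

def generate(prompt: str) -> str:
    # Stand-in for an LLM call; returns a candidate Dafny program as text.
    # Here it "repairs" the program once verifier feedback is present.
    if "Error" in prompt:
        return "method Abs(x: int) returns (y: int) ensures y >= 0 { ... }"
    return "method Abs(x: int) returns (y: int) { ... }"

def verify(program: str) -> tuple[bool, str]:
    # Stand-in for the Dafny verifier: succeeds only when the candidate
    # carries a postcondition.
    if "ensures" in program:
        return True, ""
    return False, "Error: postcondition missing"

def self_heal(task: str) -> tuple[bool, int]:
    prompt = task
    for attempt in range(1, MAX_ROUNDS + 1):
        candidate = generate(prompt)
        ok, feedback = verify(candidate)
        if ok:
            return True, attempt
        # Append verifier feedback and retry, mimicking a developer's
        # edit-verify cycle.
        prompt = f"{task}\nVerifier feedback: {feedback}"
    return False, MAX_ROUNDS

result = self_heal("Write a verified absolute-value method in Dafny.")
```

In this toy run the first attempt fails verification and the second succeeds after feedback is injected, which is the behavior the self-healing strategy relies on.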

Key Results

  • Result 1: With structured signature prompts and self-healing prompts, the Gemma 4-31B model achieved a 90.91% verification success rate, while GPT-OSS 120B improved from zero to 81.82%. This indicates that structured prompts and feedback mechanisms enable open-weight language models to effectively generate formally verified code.
  • Result 2: Contextless prompting strategies led to near-universal failure, highlighting the importance of structured prompts in formal verification.
  • Result 3: Functional validation was performed using the uDebug platform, ensuring that generated code is not only formally correct but also functionally robust in practical applications.

Significance

This research holds significant implications for academia and industry. It demonstrates the potential of large language models in formal verification and provides a viable path for high-assurance software development. By combining formal verification with large language models, the study addresses reliability issues in code generation, opening new possibilities for automated software engineering.

Technical Contribution

Technical contributions include: 1) Introducing a novel Dafny-based formal verification framework, 2) Significantly improving code generation accuracy through structured signature and self-healing prompting strategies, 3) Integrating the uDebug platform into the verification process, providing a dual-layer validation mechanism to ensure functional correctness.

Novelty

This study is the first to integrate large language models with Dafny-based formal verification in this way, proposing a new method for code generation. Compared to existing work, it emphasizes not only code generation but also verification and functional correctness, filling a gap in generating formally verified code from natural language.

Limitations

  • Limitation 1: Although self-healing prompts significantly improve verification success rates, there are still failures in some complex problems, indicating a need for further optimization of prompting strategies.
  • Limitation 2: Datasets in the Dafny ecosystem are small, which may limit models' generalization to larger-scale problems.
  • Limitation 3: The current study focuses mainly on algorithmic problems and has not been validated in broader software engineering domains.

Future Work

Future research directions include: 1) Expanding dataset size to cover more diverse problem types, 2) Optimizing prompting strategies to further improve verification success rates, 3) Exploring the application of large language models in other formal verification languages such as Coq and Lean.

AI Executive Summary

In the field of software engineering, automated code generation has been a topic of great interest. However, existing large language models often produce code that is syntactically plausible but semantically incorrect, a phenomenon known as 'hallucination.' To address this issue, researchers have proposed a new approach that combines large language models with Dafny formal verification to ensure the correctness of generated code.

This study provides a dataset called NaturalLanguage2VerifiedCode (NL2VC)-60, which includes 60 complex algorithmic problems. Researchers used three prompting strategies: contextless prompts, signature prompts, and self-healing prompts. Self-healing prompts mimic human developers' workflows through iterative feedback from the Dafny verifier, significantly improving code verification success rates.

In the experiments, researchers randomly selected 11 problem sets from the dataset and evaluated them across seven open-weight large language models. The results showed that structured signature prompts and self-healing prompts significantly improved verification success rates: the Gemma 4-31B model achieved a 90.91% verification success rate, and GPT-OSS 120B improved from zero to 81.82%. These results indicate that structured prompts and feedback mechanisms enable open-weight language models to effectively generate formally verified code.

Additionally, the study integrated the uDebug platform for functional validation, ensuring that generated code is not only formally correct but also functionally robust in practical applications. uDebug is a community-driven platform designed for competitive programmers to validate their solutions against high-quality test suites.

This research holds significant implications for academia and industry. It demonstrates the potential of large language models in formal verification and provides a viable path for high-assurance software development. By combining formal verification with large language models, the study addresses reliability issues in code generation, opening new possibilities for automated software engineering.

Despite the significant achievements, the study also has some limitations. For example, the Dafny ecosystem's dataset size is small, which may limit the model's generalization capabilities on larger-scale problems. Additionally, although self-healing prompts significantly improve verification success rates, there are still failures in some complex problems. Future research directions include expanding dataset size, optimizing prompting strategies, and exploring the application of large language models in other formal verification languages.

Deep Analysis

Background

Formal verification plays a crucial role in software engineering by providing mathematical proof that programs strictly satisfy their specifications. Its application is increasingly widespread in high-stakes domains such as security-sensitive infrastructure, cryptographic libraries, and autonomous aerospace systems. However, writing program properties and proofs remains a creative and manual process requiring significant expertise. With the rise of large language models like GitHub Copilot and Amazon Q Developer, software development workflows have transformed significantly. These AI-driven tools accelerate programming tasks through natural language to code translation and intelligent autocompletion. However, despite their fluency, these systems often produce code that is syntactically plausible but semantically incorrect, especially when dealing with complex algorithmic logic. To ensure the correctness and logical integrity of generated code, there is a need to bridge the gap between AI synthesis and formal verification.

Core Problem

Large language models show promise in automated software engineering, yet their guarantee of correctness is frequently undermined by erroneous or hallucinated code. To enforce model honesty, formal verification requires LLMs to synthesize implementation logic alongside formal specifications that are subsequently proven correct by a mathematical verifier. However, the transition from informal natural language to precise formal specification remains an arduous task. Existing formal verification languages, such as F*, Coq, Lean, and the Java Modeling Language (JML), face challenges in this process. Dafny, as a language that balances imperative programming and automated theorem proving, supports verification via assertions, preconditions, and postconditions. However, authoring formal specifications and auxiliary verification assertions remains difficult.
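The verification primitives mentioned above can be illustrated with a minimal Dafny method. This is an illustrative sketch, not an example from the paper's dataset: the `ensures` clauses are postconditions that the Dafny verifier discharges automatically for this method body.

```dafny
// A method returning the maximum of two integers, with postconditions
// the Dafny verifier proves automatically from the method body.
method Max(a: int, b: int) returns (m: int)
  ensures m >= a && m >= b
  ensures m == a || m == b
{
  if a >= b {
    m := a;
  } else {
    m := b;
  }
}
```

Authoring such specifications by hand is the expertise bottleneck the paper targets: the LLM is asked to produce both the implementation and the `requires`/`ensures` annotations.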

Innovation

The core innovations of this study include: 1) Introducing the NaturalLanguage2VerifiedCode (NL2VC)-60 dataset, designed to evaluate the synthesis of formally verified code from complex, real-world requirements. 2) Developing a Dafny-based formal verification framework to evaluate large language models' ability to generate code from natural language. 3) Introducing three prompting strategies: contextless prompts, signature prompts, and self-healing prompts. Self-healing prompts mimic human developers' workflows through iterative feedback from the Dafny verifier, significantly improving code verification success rates. 4) Integrating the uDebug platform for functional validation, ensuring that generated code is not only formally correct but also functionally robust.

Methodology

  • Introduced the NaturalLanguage2VerifiedCode (NL2VC)-60 dataset, containing 60 complex algorithmic problems.
  • Used three prompting strategies: contextless prompts, signature prompts, and self-healing prompts.
  • Mimicked human developers' workflows through iterative feedback from the Dafny verifier using self-healing prompts.
  • Selected 11 problem sets and evaluated them using seven open-weight large language models.
  • Integrated the uDebug platform for functional validation, ensuring that generated code is functionally robust in practical applications.

Experiments

The experimental design involved selecting 11 problem sets from the UVa Online Judge and evaluating them using seven open-weight large language models. Three prompting strategies were used: contextless prompts, signature prompts, and self-healing prompts. Self-healing prompts mimic human developers' workflows through iterative feedback from the Dafny verifier. The uDebug platform was integrated for functional validation, ensuring that generated code is functionally robust in practical applications. The results showed that structured signature prompts and self-healing prompts significantly improved verification success rates, with the Gemma 4-31B model achieving a 90.91% verification success rate, and GPT-OSS 120B improving from zero to 81.82%.

Results

The results showed that structured signature prompts and self-healing prompts significantly improved verification success rates, with the Gemma 4-31B model achieving a 90.91% verification success rate, and GPT-OSS 120B improving from zero to 81.82%. Contextless prompting strategies led to near-universal failure, highlighting the importance of structured prompts in formal verification. Functional validation was performed using the uDebug platform, ensuring that generated code is not only formally correct but also functionally robust in practical applications.
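As a concrete check on the reported figures: with 11 evaluated problem sets, 90.91% and 81.82% correspond to 10 and 9 verified solutions respectively. The metric can be stated as a one-line helper (a hypothetical function name, not from the paper):

```python
# Verification success rate: the percentage of evaluated problems whose
# generated code passes the Dafny verifier, rounded to two decimals.
def verification_success_rate(verified: int, total: int) -> float:
    return round(100 * verified / total, 2)

# 10 of 11 verified -> 90.91; 9 of 11 verified -> 81.82
```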

Applications

The application scenarios of this study include: 1) High-assurance software development by combining formal verification with large language models to address reliability issues in code generation. 2) Automated software engineering, providing a viable path for software development by leveraging large language models to accelerate programming tasks and reduce manual intervention. 3) Application in high-stakes domains such as security-sensitive infrastructure, cryptographic libraries, and autonomous aerospace systems.

Limitations & Outlook

Despite the significant achievements, the study also has some limitations. For example, the Dafny ecosystem's dataset size is small, which may limit the model's generalization capabilities on larger-scale problems. Additionally, although self-healing prompts significantly improve verification success rates, there are still failures in some complex problems. Future research directions include expanding dataset size, optimizing prompting strategies, and exploring the application of large language models in other formal verification languages.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen cooking a meal. You have a recipe (natural language description), but you need to ensure each step is executed correctly (formal verification). A large language model acts like a smart assistant that can help you cook based on the recipe, but sometimes it makes mistakes, like adding the wrong ingredient. This is where a verifier (Dafny) comes in to check if each step is correct. Researchers found that by giving the assistant some structured prompts, like telling it to add salt before sugar, its performance improves. Additionally, if the assistant makes a mistake, the verifier will point out where it went wrong, allowing the assistant to adjust based on feedback until each step is correct. It's like in the kitchen, where you keep trying and adjusting until you make the perfect dish. In this way, researchers demonstrated how to use large language models and formal verification to generate high-quality software code.

ELI14 (explained like you're 14)

Hey there! Imagine you're playing a game where you have to complete tasks based on clues. Sometimes, the clues are vague, and you don't know what to do. This is like the problem large language models face when generating code. To help them, we give them clearer clues, like in a game where you get a more detailed task description. This way, they can complete the task better! And if they make a mistake, we have a super smart checker (Dafny) to tell them where they went wrong. Then, they can adjust based on the checker's feedback, just like you keep trying in a game until you complete the task. This way, we can ensure the generated code is correct, just like you getting a high score in the game! Cool, right?

Glossary

Large Language Model

A type of AI model trained on vast amounts of text data, capable of generating natural language text.

Used for automated software code generation.

Formal Verification

The process of ensuring that a program strictly satisfies its specifications through mathematical proof.

Used to verify the correctness of code generated by large language models.

Dafny

A programming language that supports formal verification, allowing verification through assertions, preconditions, and postconditions.

Used as a formal verification framework.

Self-Healing Prompting

A prompting strategy that helps large language models correct errors through feedback mechanisms.

Used to improve code verification success rates.

uDebug

A community-driven platform for validating programmers' solutions against high-quality test suites.

Used for functional validation to ensure code is robust in practical applications.

Hallucination

Code generated by large language models that is syntactically plausible but semantically incorrect.

Needs to be addressed through formal verification.

Signature Prompt

A prompting strategy that provides structured hints to help large language models generate more accurate code.

Used to improve code verification success rates.

UVa Online Judge

An online automated judging system providing a vast number of programming problems.

Used for selecting experimental problem sets.

Program Synthesis

The process of automatically generating a program that meets a specific specification.

The core task of the study.

Verification Success Rate

The proportion of generated code that passes formal verification.

Used to evaluate the performance of large language models.

Open Questions (unanswered questions from this research)

  • 1 How can the generalization capabilities of large language models be improved on larger datasets? The existing Dafny ecosystem's dataset size is small, which may limit the model's performance on larger-scale problems. Developing larger datasets and exploring ways to optimize prompting strategies are needed.
  • 2 How can large language models be applied in other formal verification languages? The current study focuses mainly on Dafny and has not been validated in other formal verification languages such as Coq and Lean. Exploring the potential application of large language models in these languages is needed.
  • 3 How can the effectiveness of self-healing prompts be further improved? Although self-healing prompts significantly improve verification success rates, there are still failures in some complex problems. Optimizing prompting strategies and exploring new feedback mechanisms are needed.
  • 4 How can the hallucination problem in code generated by large language models be addressed? Existing formal verification methods can solve part of the problem, but new methods are needed to improve the accuracy of code generation.
  • 5 How can large language models be applied in broader software engineering domains? The current study focuses mainly on algorithmic problems and has not been validated in broader software engineering domains. Exploring the potential application of large language models in other domains is needed.

Applications

Immediate Applications

High-Assurance Software Development

Combining formal verification with large language models to address reliability issues in code generation, providing a viable path for high-assurance software development.

Automated Software Engineering

Providing a viable path for software development by leveraging large language models to accelerate programming tasks and reduce manual intervention.

Security-Sensitive Infrastructure

Applying formal verification in security-sensitive infrastructure to ensure system security and reliability, preventing critical vulnerabilities and operational disruptions.

Long-term Vision

Cryptographic Libraries

Applying formal verification in cryptographic libraries to ensure the correctness and security of encryption algorithms, preventing data breaches.

Autonomous Aerospace Systems

Applying formal verification in autonomous aerospace systems to ensure system security and reliability, preventing critical vulnerabilities and operational disruptions.

Abstract

Large Language Models (LLMs) show promise in automated software engineering, yet their guarantee of correctness is frequently undermined by erroneous or hallucinated code. To enforce model honesty, formal verification requires LLMs to synthesize implementation logic alongside formal specifications that are subsequently proven correct by a mathematical verifier. However, the transition from informal natural language to precise formal specification remains an arduous task. Our work addresses this by providing the NaturalLanguage2VerifiedCode (NL2VC)-60 dataset: a collection of 60 complex algorithmic problems. We evaluate 11 randomly selected problem sets across seven open-weight LLMs using a tiered prompting strategy: contextless prompts, signature prompts providing structural anchors, and self-healing prompts utilizing iterative feedback from the Dafny verifier. To address vacuous verification, where models satisfy verifiers with trivial specifications, we integrate the uDebug platform to ensure functional validation. Our results show that while contextless prompting leads to near-universal failure, structural signatures and iterative self-healing facilitate a dramatic performance turnaround. Specifically, Gemma 4-31B achieved a 90.91% verification success rate, while GPT-OSS 120B rose from zero to 81.82% success with signature-guided feedback. These findings indicate that formal verification is now attainable for open-weight LLMs, which serve as effective apprentices for synthesizing complex annotations and facilitating high-assurance software development.

cs.SE cs.AI
