daVinci-Env: Open SWE Environment Synthesis at Scale
OpenSWE synthesizes 45,320 executable Docker environments via a multi-agent pipeline, enhancing SWE agent training efficiency.
Key Findings
Methodology
OpenSWE is realized through a multi-agent synthesis pipeline deployed on a 64-node distributed cluster. This pipeline automates repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. A quality-centric filtering pipeline characterizes the inherent difficulty of each environment, filtering out instances that are unsolvable or insufficiently challenging, retaining only those that maximize learning efficiency.
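The pipeline stages above (repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis) can be sketched as a chain of agent steps over a shared candidate record. This is a minimal illustration with hypothetical names and data shapes, not the released API:

```python
# Hypothetical sketch of the multi-agent synthesis stages; each function
# stands in for one agent in the pipeline. Names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class EnvCandidate:
    repo: str
    dockerfile: str = ""
    eval_script: str = ""
    build_ok: bool = False
    log: list = field(default_factory=list)

def explore_repository(repo: str) -> EnvCandidate:
    # Agent 1: inspect the repository layout and dependency files.
    cand = EnvCandidate(repo=repo)
    cand.log.append("explored")
    return cand

def construct_dockerfile(cand: EnvCandidate) -> EnvCandidate:
    # Agent 2: emit a Dockerfile pinning the inferred dependencies.
    cand.dockerfile = f"FROM python:3.11\nRUN pip install -e /{cand.repo}"
    cand.log.append("dockerfile")
    return cand

def generate_eval_script(cand: EnvCandidate) -> EnvCandidate:
    # Agent 3: produce the test command used to verify solutions.
    cand.eval_script = "pytest -x"
    cand.log.append("eval_script")
    return cand

def analyze_tests(cand: EnvCandidate) -> EnvCandidate:
    # Agent 4: iterate on build/test failures; here we simply mark success.
    cand.build_ok = True
    cand.log.append("tested")
    return cand

def synthesize(repo: str) -> EnvCandidate:
    cand = explore_repository(repo)
    for stage in (construct_dockerfile, generate_eval_script, analyze_tests):
        cand = stage(cand)
    return cand

env = synthesize("example/repo")
print(env.log)
```

In the real system each stage runs across a 64-node cluster and loops on failures rather than succeeding unconditionally; the sketch only shows the staged hand-off between agents.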
Key Results
- OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% accuracy on SWE-bench Verified, establishing SOTA among the Qwen2.5 series.
- Out-of-domain gains of up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.
- Data scaling analysis shows a log-linear improvement trend with the addition of high-quality environments.
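The log-linear trend means accuracy grows roughly linearly in the logarithm of the number of environments. The toy fit below illustrates the idea with made-up data points (the paper does not release these numbers); under such a law, each doubling of environments adds a constant number of accuracy points:

```python
# Illustrative least-squares fit of accuracy vs. log(#environments).
# The data points are hypothetical, chosen only to show the method.
import math

envs = [1_000, 2_000, 4_000, 8_000, 16_000]
acc = [40.0, 45.1, 49.8, 55.2, 60.1]  # hypothetical accuracies

x = [math.log(n) for n in envs]
xm = sum(x) / len(x)
ym = sum(acc) / len(acc)
slope = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, acc)) / \
        sum((xi - xm) ** 2 for xi in x)

# Under a log-linear law, each doubling adds roughly slope * log(2) points.
gain_per_doubling = slope * math.log(2)
print(round(gain_per_doubling, 2))
```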
Significance
The introduction of OpenSWE provides the academic community with a transparent and reproducible framework for SWE agent training, breaking the opacity and high-cost barriers of industrial solutions. Through large-scale environment synthesis and quality filtering, OpenSWE not only enhances agent training efficiency but also demonstrates significant performance improvements in cross-domain tasks.
Technical Contribution
Technically, OpenSWE offers a complete open-source synthesis pipeline covering every aspect from repository exploration to Docker environment construction. By leveraging a multi-agent system, OpenSWE automates the large-scale generation of environments and ensures data quality through difficulty-aware filtering.
Novelty
OpenSWE is the first framework to provide fully transparent SWE agent training environments at such a large scale. Compared to existing solutions like SWE-rebench and SWE-Factory, OpenSWE leads not only in scale but also in the quality and diversity of environments.
Limitations
- OpenSWE may encounter build failures in unstable network conditions due to its reliance on Docker environments.
- The construction and validation of environments require significant computational resources, which may be challenging for smaller research teams.
Future Work
Future work can focus on further optimizing the efficiency and stability of environment synthesis while exploring support for more programming languages. The community can leverage OpenSWE's open-source nature for improvements and extensions.
AI Executive Summary
In the field of software engineering, training agents capable of autonomous code editing, test execution, and solution optimization requires large-scale, executable, and verifiable environments. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups.
OpenSWE is a fully transparent framework for SWE agent training, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis.
Beyond scale, OpenSWE proposes a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. The entire project represents a total investment of approximately $1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality-guaranteed environments.
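The difficulty filter can be pictured as thresholding per-environment solve rates estimated from sampled trajectories: environments that are never solved are likely unsolvable, and those always solved carry little learning signal. The names and thresholds below are assumptions for illustration, not the paper's exact procedure:

```python
# Hedged sketch of difficulty-aware filtering: keep only environments whose
# empirical solve rate is strictly between the two extremes. Thresholds and
# identifiers are hypothetical.

def filter_by_difficulty(solve_rates: dict[str, float],
                         lo: float = 0.0, hi: float = 1.0) -> list[str]:
    """Keep environments with solve rate strictly inside (lo, hi)."""
    return [env for env, rate in solve_rates.items() if lo < rate < hi]

rates = {
    "repoA#41": 0.0,   # never solved -> likely broken or unsolvable
    "repoB#7":  0.35,  # informative difficulty -> keep
    "repoC#12": 1.0,   # always solved -> insufficiently challenging
    "repoD#3":  0.8,   # keep
}
kept = filter_by_difficulty(rates)
print(sorted(kept))
```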
Extensive experiments validate OpenSWE's effectiveness: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, establishing SOTA among the Qwen2.5 series. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.
The introduction of OpenSWE provides the academic community with a transparent and reproducible framework for SWE agent training, breaking the opacity and high-cost barriers of industrial solutions. Through large-scale environment synthesis and quality filtering, OpenSWE not only enhances agent training efficiency but also demonstrates significant performance improvements in cross-domain tasks. Future work can focus on further optimizing the efficiency and stability of environment synthesis while exploring support for more programming languages. The community can leverage OpenSWE's open-source nature for improvements and extensions.
Deep Analysis
Background
In recent years, the rapid advancement of Large Language Models (LLMs) has catalyzed the development of autonomous software engineering (SWE) agents. These systems can interpret complex requirements, navigate extensive codebases, iteratively edit code, run tests, and refine solutions without human intervention. However, constructing high-quality and diverse executable environments at scale remains a critical bottleneck. While recent open-source efforts such as SWE-rebench, SWE-Universe, and SWE-Factory have made progress toward automation, the computational and infrastructure costs of generating validated environments at scale remain extraordinarily high. This effectively excludes most academic research groups and creates a stark divide between industrial solutions, which achieve scale but keep their infrastructure unreleased, and open-source alternatives, which remain limited in both scale and repository diversity.
Core Problem
Training capable SWE agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. Beyond the cost of environment construction, the quality and difficulty distribution of these environments are equally critical for effective agent training. While scaling the number of environments is a necessary condition, it is far from sufficient on its own.
Innovation
OpenSWE's core innovation is a fully transparent SWE agent training framework. First, it achieves large-scale environment automation through a multi-agent synthesis pipeline. Second, it introduces a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are unsolvable or insufficiently challenging. Together, these advance both the scale and the quality of environment synthesis within a transparent, reproducible framework.
Methodology
- Multi-agent synthesis pipeline: Deployed on a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis.
- Quality filtering pipeline: Characterizes the inherent difficulty of each environment, filtering out unsolvable or insufficiently challenging instances.
- Data scaling analysis: Shows a log-linear improvement trend with the addition of high-quality environments.
- SWE-focused training: Demonstrates significant improvements in cross-domain tasks, including mathematical reasoning and science benchmarks.
Experiments
Experiments evaluate OpenSWE-32B and OpenSWE-72B on SWE-bench Verified, with additional out-of-domain evaluations on mathematical reasoning and science benchmarks; accuracy is the primary metric. OpenSWE-32B and OpenSWE-72B reach 62.4% and 66.0%, respectively, establishing SOTA among the Qwen2.5 series. A data scaling analysis further shows a log-linear improvement trend as high-quality environments are added.
Results
The results show that OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% accuracy on SWE-bench Verified, establishing SOTA among the Qwen2.5 series. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall. Data scaling analysis shows a log-linear improvement trend with the addition of high-quality environments.
Applications
The application scenarios of OpenSWE include:
- SWE agent training: Provides large-scale, executable, and verifiable environments, enhancing agent training efficiency.
- Cross-domain task performance improvement: Demonstrates significant improvements in mathematical reasoning and science benchmarks.
- Open-source community improvements and extensions: The community can leverage OpenSWE's open-source nature for improvements and extensions.
Limitations & Outlook
The limitations of OpenSWE include:
- OpenSWE may encounter build failures in unstable network conditions due to its reliance on Docker environments.
- The construction and validation of environments require significant computational resources, which may be challenging for smaller research teams.
- Future work can focus on further optimizing the efficiency and stability of environment synthesis while exploring support for more programming languages.
Plain Language Accessible to non-experts
Imagine you're running a large factory that needs to process various raw materials from different suppliers. To ensure smooth production, you need an automated system to manage the procurement, storage, and use of these materials. OpenSWE is like this factory management system, capable of automatically sourcing raw materials from suppliers worldwide and selecting and storing them based on quality and demand.
In this system, each supplier is like a code repository, and the raw materials are code snippets from these repositories. OpenSWE uses a multi-agent system to automatically explore these repositories, construct executable environments, and generate evaluation scripts to verify the correctness of the code.
The core of this system is its ability to handle a large volume of raw materials while selecting them based on quality and demand, ensuring that only the highest quality materials are used in production. It's like an intelligent procurement system that can dynamically adjust based on market demand and production plans.
Through this approach, OpenSWE not only improves the factory's production efficiency but also ensures product quality and consistency. In the future, this system can be further expanded to support more types of raw materials and production lines.
ELI14 Explained like you're 14
Hey there! Imagine you're playing a super complex game where you need to keep writing code to solve different problems. To make your code better, you need a huge practice field that can run your code and tell you where to improve.
OpenSWE is like this super practice field! It's like a giant game map with thousands of different levels, each one a code challenge. You can practice writing code here, run tests, and keep improving based on feedback.
What's cooler is that OpenSWE automatically picks the best levels for you to practice on, so you don't waste time on challenges that are too easy or too hard. It's like a smart game assistant that always finds the best challenges for you.
So, if you want to become a coding master, OpenSWE is your best training buddy! It will help you keep getting better in the world of coding and become a real code master!
Glossary
Docker Environment
A Docker environment is an isolated container built with lightweight OS-level virtualization, allowing developers to run applications with pinned, reproducible dependencies. In OpenSWE, it is used to create executable code testing environments.
Used for building and running executable code testing environments.
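As a concrete picture of what one synthesized artifact might look like, the sketch below generates a Dockerfile string that pins a base image and a fixed commit so the evaluation script runs against a reproducible snapshot. The repository URL, commit, and install steps are hypothetical; the released Dockerfiles are repository-specific:

```python
# Illustrative generator for a pinned, reproducible Dockerfile.
# All specifics (URL, commit, commands) are assumptions for illustration.

def make_dockerfile(repo_url: str, commit: str,
                    python_version: str = "3.11") -> str:
    return "\n".join([
        f"FROM python:{python_version}",          # pinned base image
        "RUN apt-get update && apt-get install -y git",
        f"RUN git clone {repo_url} /app && cd /app && git checkout {commit}",
        "WORKDIR /app",
        "RUN pip install -e . && pip install pytest",
        'CMD ["pytest", "-x"]',                   # evaluation entrypoint
    ])

df = make_dockerfile("https://github.com/example/project", "abc123")
print(df.splitlines()[0])
```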
Multi-Agent System
A multi-agent system is a distributed system where multiple independent agents work together to complete complex tasks. In OpenSWE, it is used to automate environment synthesis.
Used to automate repository exploration, Dockerfile construction, and evaluation script generation.
Quality Filtering Pipeline
A quality filtering pipeline is a data processing mechanism that filters data based on its inherent properties. In OpenSWE, it is used to filter out unsolvable or insufficiently challenging environments.
Used to filter and retain environments that maximize learning efficiency.
SWE Agent
An SWE agent is a software engineering agent capable of autonomous code editing, test execution, and solution optimization. In OpenSWE, it is trained using large-scale environments.
Trained and optimized using environments provided by OpenSWE.
SWE-bench Verified
SWE-bench Verified is a benchmark used to evaluate the performance of SWE agents. In OpenSWE's experiments, it is used to verify model accuracy.
Used to evaluate the performance of OpenSWE-32B and OpenSWE-72B.
Qwen2.5 Series
The Qwen2.5 series is a family of open-weight large language models from which the OpenSWE models are trained. In OpenSWE's experiments, it serves as the baseline for comparison.
Used to compare the performance of OpenSWE.
Log-Linear Growth
Log-linear growth is a scaling pattern in which output improves linearly with the logarithm of input scale. In OpenSWE's experiments, it describes the trend of model performance improvement.
Describes the trend of model performance improvement with the addition of high-quality environments.
Mathematical Reasoning
Mathematical reasoning refers to the ability to perform logical reasoning and problem-solving in mathematical problems. In OpenSWE's experiments, it is used to evaluate cross-domain task performance improvement.
Used to evaluate the cross-domain performance improvement of SWE-focused training.
Science Benchmark
A science benchmark is a set of tests used to evaluate a model's scientific reasoning capabilities. In OpenSWE's experiments, it is used to verify the model's cross-domain performance.
Used to verify OpenSWE's cross-domain performance.
Open-Source Framework
An open-source framework is a software development framework with publicly available source code, allowing community improvements and extensions. In OpenSWE, all Dockerfiles, evaluation scripts, and infrastructure are open-sourced.
Ensures the transparency and reproducibility of OpenSWE.
Open Questions Unanswered questions from this research
1. Despite significant progress in environment synthesis and quality filtering, OpenSWE has room for improvement in supporting more programming languages. The current framework primarily supports Python, and future work can explore support for other languages.
2. OpenSWE may encounter build failures in unstable network conditions due to its reliance on Docker environments, posing a challenge to environment stability. Improving the stability of environment synthesis is a future research direction.
3. The construction and validation of environments require significant computational resources, which may be challenging for smaller research teams. Reducing computational costs and improving resource utilization efficiency is an important topic for future research.
4. While OpenSWE demonstrates significant improvements in mathematical reasoning and science benchmarks, further validation is needed for performance improvements in other domain tasks. Future work can explore performance evaluation in more cross-domain tasks.
5. OpenSWE's quality filtering pipeline relies primarily on the inherent difficulty characteristics of environments. Further optimization of the filtering mechanism to improve data quality is a research-worthy issue.
6. Although OpenSWE provides a transparent and reproducible framework, more participation and support are needed for community improvements and extensions. How to incentivize community participation is a thought-provoking issue.
7. OpenSWE has innovated in both the scale and quality of environment synthesis, but further exploration is needed on how to optimize synthesis efficiency and stability.
Applications
Immediate Applications
SWE Agent Training
OpenSWE provides large-scale, executable, and verifiable environments that significantly enhance the training efficiency of SWE agents. Researchers and developers can use these environments for agent training and optimization.
Cross-Domain Task Performance Improvement
Training on OpenSWE demonstrates significant performance improvements in cross-domain tasks such as mathematical reasoning and science benchmarks.
Open-Source Community Improvements and Extensions
OpenSWE's open-source nature allows the community to make improvements and extensions, enabling researchers to conduct further research and development using this framework.
Long-term Vision
Multi-Language Support
In the future, OpenSWE can be expanded to support more programming languages, broadening its application scope and impact.
Environment Synthesis Efficiency Optimization
By further optimizing the synthesis pipeline and resource utilization efficiency, OpenSWE can generate more high-quality environments in a shorter time.
Abstract
Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. We present OpenSWE, the largest fully transparent framework for SWE agent training in Python, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. Beyond scale, we propose a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. With $891K spent on environment construction and an additional $576K on trajectory sampling and difficulty-aware curation, the entire project represents a total investment of approximately $1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality-guaranteed environments. Extensive experiments validate OpenSWE's effectiveness: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, establishing SOTA among the Qwen2.5 series. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.
References (20)
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich et al.
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Mandar Joshi, Eunsol Choi, Daniel S. Weld et al.
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
M-A-P Team, Xinrun Du, Yifan Yao et al.
Measuring
Daniel Lafrenière
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
Carlos E. Jimenez, K. Lieret, Karthik R. Narasimhan et al.
Training Verifiers to Solve Math Word Problems
K. Cobbe, Vineet Kosaraju, Mo Bavarian et al.
Training Software Engineering Agents and Verifiers with SWE-Gym
Jiayi Pan, Xingyao Wang, Graham Neubig et al.
Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents
Zonghan Yang, Shengjie Wang, Kelin Fu et al.
daVinci-Dev: Agent-native Mid-training for Software Engineering
Ji Zeng, Dayuan Fu, Tiantian Mi et al.
SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks
Lianghong Guo, Yanlin Wang, Caihua Li et al.
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
Jiawei Liu, Chun Xia, Yuyao Wang et al.
daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently
Mohan Jiang, Dayuan Fu, Junhao Shi et al.
Agentless: Demystifying LLM-based Software Engineering Agents
Chun Xia, Yinlin Deng, S. Dunn et al.
SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving
Chaofan Tao, Jieru Chen, Yuxin Jiang et al.
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun et al.
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao, Jeffrey Zhao, Dian Yu et al.
SWE-Universe: Scale Real-World Verifiable Environments to Millions
Mouxiang Chen, Lei Zhang, Yunlong Feng et al.
AgentRefine: Enhancing Agent Generalization through Refinement Tuning
Dayuan Fu, Keqing He, Yejie Wang et al.
SWE-Mirror: Scaling Issue-Resolving Datasets by Mirroring Issues Across Repositories
Junhao Wang, Daoguang Zan, Shulin Xin et al.
Context as a Tool: Context Management for Long-Horizon SWE-Agents
Shukai Liu, Jian Yang, Bo Jiang et al.