daVinci-Env: Open SWE Environment Synthesis at Scale
OpenSWE synthesizes 45,320 executable Docker environments via a multi-agent pipeline, enhancing SWE agent training efficiency.
Key Findings
Methodology
OpenSWE is realized through a multi-agent synthesis pipeline deployed on a 64-node distributed cluster. This pipeline automates repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. A quality-centric filtering pipeline characterizes the inherent difficulty of each environment, filtering out instances that are unsolvable or insufficiently challenging, retaining only those that maximize learning efficiency.
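The pipeline stages above (repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis) can be sketched as a chain of agent steps over a shared candidate record. This is a minimal illustration with hypothetical names and data shapes, not the released API:

```python
# Hypothetical sketch of the multi-agent synthesis stages; each function
# stands in for one agent in the pipeline. Names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class EnvCandidate:
    repo: str
    dockerfile: str = ""
    eval_script: str = ""
    build_ok: bool = False
    log: list = field(default_factory=list)

def explore_repository(repo: str) -> EnvCandidate:
    # Agent 1: inspect the repository layout and dependency files.
    cand = EnvCandidate(repo=repo)
    cand.log.append("explored")
    return cand

def construct_dockerfile(cand: EnvCandidate) -> EnvCandidate:
    # Agent 2: emit a Dockerfile pinning the inferred dependencies.
    cand.dockerfile = f"FROM python:3.11\nRUN pip install -e /{cand.repo}"
    cand.log.append("dockerfile")
    return cand

def generate_eval_script(cand: EnvCandidate) -> EnvCandidate:
    # Agent 3: produce the test command used to verify solutions.
    cand.eval_script = "pytest -x"
    cand.log.append("eval_script")
    return cand

def analyze_tests(cand: EnvCandidate) -> EnvCandidate:
    # Agent 4: iterate on build/test failures; here we simply mark success.
    cand.build_ok = True
    cand.log.append("tested")
    return cand

def synthesize(repo: str) -> EnvCandidate:
    cand = explore_repository(repo)
    for stage in (construct_dockerfile, generate_eval_script, analyze_tests):
        cand = stage(cand)
    return cand

env = synthesize("example/repo")
print(env.log)
```

In the real system each stage runs across a 64-node cluster and loops on failures rather than succeeding unconditionally; the sketch only shows the staged hand-off between agents.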
Key Results
- OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% accuracy on SWE-bench Verified, establishing SOTA among the Qwen2.5 series.
- Out-of-domain gains of up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.
- Data scaling analysis shows a log-linear improvement trend with the addition of high-quality environments.
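The log-linear trend means accuracy grows roughly linearly in the logarithm of the number of environments. The toy fit below illustrates the idea with made-up data points (the paper does not release these numbers); under such a law, each doubling of environments adds a constant number of accuracy points:

```python
# Illustrative least-squares fit of accuracy vs. log(#environments).
# The data points are hypothetical, chosen only to show the method.
import math

envs = [1_000, 2_000, 4_000, 8_000, 16_000]
acc = [40.0, 45.1, 49.8, 55.2, 60.1]  # hypothetical accuracies

x = [math.log(n) for n in envs]
xm = sum(x) / len(x)
ym = sum(acc) / len(acc)
slope = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, acc)) / \
        sum((xi - xm) ** 2 for xi in x)

# Under a log-linear law, each doubling adds roughly slope * log(2) points.
gain_per_doubling = slope * math.log(2)
print(round(gain_per_doubling, 2))
```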
Significance
The introduction of OpenSWE provides the academic community with a transparent and reproducible framework for SWE agent training, breaking the opacity and high-cost barriers of industrial solutions. Through large-scale environment synthesis and quality filtering, OpenSWE not only enhances agent training efficiency but also demonstrates significant performance improvements in cross-domain tasks.
Technical Contribution
Technically, OpenSWE offers a complete open-source synthesis pipeline covering every aspect from repository exploration to Docker environment construction. By leveraging a multi-agent system, OpenSWE automates the large-scale generation of environments and ensures data quality through difficulty-aware filtering.
Novelty
OpenSWE is the first framework to provide fully transparent SWE agent training environments at such a large scale. Compared to existing solutions like SWE-rebench and SWE-Factory, OpenSWE leads not only in scale but also in the quality and diversity of environments.
Limitations
- OpenSWE may encounter build failures in unstable network conditions due to its reliance on Docker environments.
- The construction and validation of environments require significant computational resources, which may be challenging for smaller research teams.
Future Work
Future work can focus on further optimizing the efficiency and stability of environment synthesis while exploring support for more programming languages. The community can leverage OpenSWE's open-source nature for improvements and extensions.
AI Executive Summary
In the field of software engineering, training agents capable of autonomous code editing, test execution, and solution optimization requires large-scale, executable, and verifiable environments. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups.
OpenSWE is a fully transparent framework for SWE agent training, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis.
Beyond scale, OpenSWE proposes a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. The entire project represents a total investment of approximately $1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality-guaranteed environments.
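The difficulty filter can be pictured as thresholding per-environment solve rates estimated from sampled trajectories: environments that are never solved are likely unsolvable, and those always solved carry little learning signal. The names and thresholds below are assumptions for illustration, not the paper's exact procedure:

```python
# Hedged sketch of difficulty-aware filtering: keep only environments whose
# empirical solve rate is strictly between the two extremes. Thresholds and
# identifiers are hypothetical.

def filter_by_difficulty(solve_rates: dict[str, float],
                         lo: float = 0.0, hi: float = 1.0) -> list[str]:
    """Keep environments with solve rate strictly inside (lo, hi)."""
    return [env for env, rate in solve_rates.items() if lo < rate < hi]

rates = {
    "repoA#41": 0.0,   # never solved -> likely broken or unsolvable
    "repoB#7":  0.35,  # informative difficulty -> keep
    "repoC#12": 1.0,   # always solved -> insufficiently challenging
    "repoD#3":  0.8,   # keep
}
kept = filter_by_difficulty(rates)
print(sorted(kept))
```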
Extensive experiments validate OpenSWE's effectiveness: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, establishing SOTA among the Qwen2.5 series. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.
The introduction of OpenSWE provides the academic community with a transparent and reproducible framework for SWE agent training, breaking the opacity and high-cost barriers of industrial solutions. Through large-scale environment synthesis and quality filtering, OpenSWE not only enhances agent training efficiency but also demonstrates significant performance improvements in cross-domain tasks. Future work can focus on further optimizing the efficiency and stability of environment synthesis while exploring support for more programming languages. The community can leverage OpenSWE's open-source nature for improvements and extensions.
Deep Analysis
Background
In recent years, the rapid advancement of Large Language Models (LLMs) has catalyzed the development of autonomous software engineering (SWE) agents. These systems can interpret complex requirements, navigate extensive codebases, iteratively edit code, run tests, and refine solutions without human intervention. However, constructing high-quality and diverse executable environments at scale remains a critical bottleneck. While recent open-source efforts such as SWE-rebench, SWE-Universe, and SWE-Factory have made progress toward automation, the computational and infrastructure costs of generating validated environments at scale remain extraordinarily high. This effectively excludes most academic research groups and creates a stark divide between industrial solutions, which achieve scale but keep their infrastructure unreleased, and open-source alternatives, which remain limited in both scale and repository diversity.
Core Problem
Training capable SWE agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. Beyond the cost of environment construction, the quality and difficulty distribution of these environments are equally critical for effective agent training. While scaling the number of environments is a necessary condition, it is far from sufficient on its own.
Innovation
OpenSWE's core innovation is a fully transparent SWE agent training framework. First, it achieves large-scale environment automation through a multi-agent synthesis pipeline. Second, it introduces a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are unsolvable or insufficiently challenging. Together, these advance both the scale and the quality of environment synthesis within a transparent, reproducible framework.
Methodology
- Multi-agent synthesis pipeline: Deployed on a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis.
- Quality filtering pipeline: Characterizes the inherent difficulty of each environment, filtering out unsolvable or insufficiently challenging instances.
- Data scaling analysis: Shows a log-linear improvement trend with the addition of high-quality environments.
- SWE-focused training: Demonstrates significant improvements in cross-domain tasks, including mathematical reasoning and science benchmarks.
Experiments
Experiments evaluate OpenSWE-32B and OpenSWE-72B on SWE-bench Verified, with additional out-of-domain evaluations on mathematical reasoning and science benchmarks; accuracy is the primary metric. OpenSWE-32B and OpenSWE-72B reach 62.4% and 66.0%, respectively, establishing SOTA among the Qwen2.5 series. A data scaling analysis further shows a log-linear improvement trend as high-quality environments are added.
Results
The results show that OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% accuracy on SWE-bench Verified, establishing SOTA among the Qwen2.5 series. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall. Data scaling analysis shows a log-linear improvement trend with the addition of high-quality environments.
Applications
The application scenarios of OpenSWE include:
- SWE agent training: Provides large-scale, executable, and verifiable environments, enhancing agent training efficiency.
- Cross-domain task performance improvement: Demonstrates significant improvements in mathematical reasoning and science benchmarks.
- Open-source community improvements and extensions: The community can leverage OpenSWE's open-source nature for improvements and extensions.
Limitations & Outlook
The limitations of OpenSWE include:
- OpenSWE may encounter build failures in unstable network conditions due to its reliance on Docker environments.
- The construction and validation of environments require significant computational resources, which may be challenging for smaller research teams.
- Future work can focus on further optimizing the efficiency and stability of environment synthesis while exploring support for more programming languages.
Plain Language Accessible to non-experts
Imagine you're running a large factory that needs to process various raw materials from different suppliers. To ensure smooth production, you need an automated system to manage the procurement, storage, and use of these materials. OpenSWE is like this factory management system, capable of automatically sourcing raw materials from suppliers worldwide and selecting and storing them based on quality and demand.
In this system, each supplier is like a code repository, and the raw materials are code snippets from these repositories. OpenSWE uses a multi-agent system to automatically explore these repositories, construct executable environments, and generate evaluation scripts to verify the correctness of the code.
The core of this system is its ability to handle a large volume of raw materials while selecting them based on quality and demand, ensuring that only the highest quality materials are used in production. It's like an intelligent procurement system that can dynamically adjust based on market demand and production plans.
Through this approach, OpenSWE not only improves the factory's production efficiency but also ensures product quality and consistency. In the future, this system can be further expanded to support more types of raw materials and production lines.
ELI14 Explained like you're 14
Hey there! Imagine you're playing a super complex game where you need to keep writing code to solve different problems. To make your code better, you need a huge practice field that can run your code and tell you where to improve.
OpenSWE is like this super practice field! It's like a giant game map with thousands of different levels, each one a code challenge. You can practice writing code here, run tests, and keep improving based on feedback.
What's cooler is that OpenSWE automatically picks the best levels for you to practice on, so you don't waste time on challenges that are too easy or too hard. It's like a smart game assistant that always finds the best challenges for you.
So, if you want to become a coding master, OpenSWE is your best training buddy! It will help you keep getting better in the world of coding and become a real code master!
Glossary
Docker Environment
A Docker environment is an isolated container built with lightweight OS-level virtualization, allowing developers to run applications with pinned, reproducible dependencies. In OpenSWE, it is used to create executable code testing environments.
Used for building and running executable code testing environments.
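As a concrete picture of what one synthesized artifact might look like, the sketch below generates a Dockerfile string that pins a base image and a fixed commit so the evaluation script runs against a reproducible snapshot. The repository URL, commit, and install steps are hypothetical; the released Dockerfiles are repository-specific:

```python
# Illustrative generator for a pinned, reproducible Dockerfile.
# All specifics (URL, commit, commands) are assumptions for illustration.

def make_dockerfile(repo_url: str, commit: str,
                    python_version: str = "3.11") -> str:
    return "\n".join([
        f"FROM python:{python_version}",          # pinned base image
        "RUN apt-get update && apt-get install -y git",
        f"RUN git clone {repo_url} /app && cd /app && git checkout {commit}",
        "WORKDIR /app",
        "RUN pip install -e . && pip install pytest",
        'CMD ["pytest", "-x"]',                   # evaluation entrypoint
    ])

df = make_dockerfile("https://github.com/example/project", "abc123")
print(df.splitlines()[0])
```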
Multi-Agent System
A multi-agent system is a distributed system where multiple independent agents work together to complete complex tasks. In OpenSWE, it is used to automate environment synthesis.
Used to automate repository exploration, Dockerfile construction, and evaluation script generation.
Quality Filtering Pipeline
A quality filtering pipeline is a data processing mechanism that filters data based on its inherent properties. In OpenSWE, it is used to filter out unsolvable or insufficiently challenging environments.
Used to filter and retain environments that maximize learning efficiency.
SWE Agent
An SWE agent is a software engineering agent capable of autonomous code editing, test execution, and solution optimization. In OpenSWE, it is trained using large-scale environments.
Trained and optimized using environments provided by OpenSWE.
SWE-bench Verified
SWE-bench Verified is a benchmark used to evaluate the performance of SWE agents. In OpenSWE's experiments, it is used to verify model accuracy.
Used to evaluate the performance of OpenSWE-32B and OpenSWE-72B.
Qwen2.5 Series
The Qwen2.5 series is a family of open-weight large language models from which the OpenSWE models are trained. In OpenSWE's experiments, it serves as the baseline for comparison.
Used to compare the performance of OpenSWE.
Log-Linear Growth
Log-linear growth is a scaling pattern in which output improves linearly with the logarithm of input scale. In OpenSWE's experiments, it describes the trend of model performance improvement.
Describes the trend of model performance improvement with the addition of high-quality environments.
Mathematical Reasoning
Mathematical reasoning refers to the ability to perform logical reasoning and problem-solving in mathematical problems. In OpenSWE's experiments, it is used to evaluate cross-domain task performance improvement.
Used to evaluate the cross-domain performance improvement of SWE-focused training.
Science Benchmark
A science benchmark is a set of tests used to evaluate a model's scientific reasoning capabilities. In OpenSWE's experiments, it is used to verify the model's cross-domain performance.
Used to verify OpenSWE's cross-domain performance.
Open-Source Framework
An open-source framework is a software development framework with publicly available source code, allowing community improvements and extensions. In OpenSWE, all Dockerfiles, evaluation scripts, and infrastructure are open-sourced.
Ensures the transparency and reproducibility of OpenSWE.
Open Questions Unanswered questions from this research
1. Despite significant progress in environment synthesis and quality filtering, OpenSWE has room for improvement in supporting more programming languages. The current framework primarily supports Python, and future work can explore support for other languages.
2. OpenSWE may encounter build failures in unstable network conditions due to its reliance on Docker environments, posing a challenge to environment stability. Improving the stability of environment synthesis is a future research direction.
3. The construction and validation of environments require significant computational resources, which may be challenging for smaller research teams. Reducing computational costs and improving resource utilization efficiency is an important topic for future research.
4. While OpenSWE demonstrates significant improvements in mathematical reasoning and science benchmarks, further validation is needed for performance improvements in other domain tasks. Future work can explore performance evaluation in more cross-domain tasks.
5. OpenSWE's quality filtering pipeline relies primarily on the inherent difficulty characteristics of environments. Further optimization of the filtering mechanism to improve data quality is a research-worthy issue.
6. Although OpenSWE provides a transparent and reproducible framework, more participation and support are needed for community improvements and extensions. How to incentivize community participation is a thought-provoking issue.
7. OpenSWE has innovated in both the scale and quality of environment synthesis, but further exploration is needed on how to optimize synthesis efficiency and stability.
Applications
Immediate Applications
SWE Agent Training
OpenSWE provides large-scale, executable, and verifiable environments that significantly enhance the training efficiency of SWE agents. Researchers and developers can use these environments for agent training and optimization.
Cross-Domain Task Performance Improvement
Training on OpenSWE demonstrates significant performance improvements in cross-domain tasks such as mathematical reasoning and science benchmarks.
Open-Source Community Improvements and Extensions
OpenSWE's open-source nature allows the community to make improvements and extensions, enabling researchers to conduct further research and development using this framework.
Long-term Vision
Multi-Language Support
In the future, OpenSWE can be expanded to support more programming languages, broadening its application scope and impact.
Environment Synthesis Efficiency Optimization
By further optimizing the synthesis pipeline and resource utilization efficiency, OpenSWE can generate more high-quality environments in a shorter time.
Abstract
Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. We present OpenSWE, the largest fully transparent framework for SWE agent training in Python, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. Beyond scale, we propose a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. With $891K spent on environment construction and an additional $576K on trajectory sampling and difficulty-aware curation, the entire project represents a total investment of approximately $1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality-guaranteed environments. Extensive experiments validate OpenSWE's effectiveness: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, establishing SOTA among the Qwen2.5 series. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.
References (20)
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich et al.
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Mandar Joshi, Eunsol Choi, Daniel S. Weld et al.
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
M-A-P Team, Xinrun Du, Yifan Yao et al.
Measuring
Daniel Lafrenière
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
Carlos E. Jimenez, K. Lieret, Karthik R. Narasimhan et al.
Training Verifiers to Solve Math Word Problems
K. Cobbe, Vineet Kosaraju, Mo Bavarian et al.
Training Software Engineering Agents and Verifiers with SWE-Gym
Jiayi Pan, Xingyao Wang, Graham Neubig et al.
Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents
Zonghan Yang, Shengjie Wang, Kelin Fu et al.
daVinci-Dev: Agent-native Mid-training for Software Engineering
Ji Zeng, Dayuan Fu, Tiantian Mi et al.
SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks
Lianghong Guo, Yanlin Wang, Caihua Li et al.
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
Jiawei Liu, Chun Xia, Yuyao Wang et al.
daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently
Mohan Jiang, Dayuan Fu, Junhao Shi et al.
Agentless: Demystifying LLM-based Software Engineering Agents
Chun Xia, Yinlin Deng, S. Dunn et al.
SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving
Chaofan Tao, Jieru Chen, Yuxin Jiang et al.
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun et al.
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao, Jeffrey Zhao, Dian Yu et al.
SWE-Universe: Scale Real-World Verifiable Environments to Millions
Mouxiang Chen, Lei Zhang, Yunlong Feng et al.
AgentRefine: Enhancing Agent Generalization through Refinement Tuning
Dayuan Fu, Keqing He, Yejie Wang et al.
SWE-Mirror: Scaling Issue-Resolving Datasets by Mirroring Issues Across Repositories
Junhao Wang, Daoguang Zan, Shulin Xin et al.
Context as a Tool: Context Management for Long-Horizon SWE-Agents
Shukai Liu, Jian Yang, Bo Jiang et al.