Self-Improving Language Models with Bidirectional Evolutionary Search
Proposes Bidirectional Evolutionary Search (BES), combining forward candidate evolution with backward goal decomposition to enhance exploration and verification in language models.
Key Findings
Methodology
The proposed BES framework integrates two core mechanisms: forward candidate evolution using genetic-inspired operators (such as combination, translocation, crossover, deletion) and backward goal decomposition that recursively breaks down complex tasks into verifiable sub-goals. In the forward phase, BES employs these operators to recombine partial trajectories, generating diverse solutions beyond the model’s autoregressive support. Concurrently, the backward phase constructs a goal tree, where each node represents a sub-goal evaluated by task-specific verifiers (rule-based, similarity models, or LLM judges). This dual process allows dense feedback to guide the search efficiently. Theoretical analysis confirms that pure expansion-based methods are confined within a narrow entropy shell, whereas evolution operators enable escape, expanding the exploration space exponentially. Extensive experiments on logical reasoning, multi-hop reasoning, and open problem benchmarks demonstrate BES’s superiority over existing methods like GRPO, Tree-GRPO, and open-source frameworks, with significant improvements in accuracy, sample efficiency, and stability.
Key Results
- In the Knights-and-Knaves logical reasoning dataset, BES improved validation accuracy from approximately 30% to over 50%, outperforming baseline methods such as GRPO and MaxRL, and showing consistent self-improvement during training.
- On MuSiQue multi-hop reasoning tasks, BES achieved accuracy gains of 3-4 percentage points over baseline models like Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct, with increased valid search counts and more effective action sequences, indicating enhanced reasoning capabilities.
- In open problem benchmarks such as circle packing and Heilbronn convex problems, BES outperformed open-source frameworks (ShinkaEvolve, GEPA), delivering higher average and best scores, with lower variance and more stable search trajectories, demonstrating robust exploration and solution quality.
Significance
This work addresses fundamental limitations of current search strategies in large language models, notably the sparse verification signals and limited exploration beyond high-probability regions. By integrating evolutionary algorithms with goal decomposition, BES significantly expands the search space and improves the quality of solutions in complex reasoning tasks. The theoretical guarantees and empirical results suggest that BES can serve as a foundational framework for autonomous model self-improvement, enabling models to discover high-quality solutions that were previously inaccessible. Its implications span improving AI reasoning, scientific computation, automated programming, and decision-making systems, marking a step toward more autonomous and reliable AI systems.
Technical Contribution
The key technical contributions include: • Designing four novel evolution operators inspired by biological recombination, enabling candidate diversity beyond autoregressive support; • Developing a recursive backward goal decomposition mechanism that provides dense, interpretable feedback for guiding forward search; • Theoretical proof that evolution operators can escape the entropy shell confining expansion-only search, exponentially increasing exploration capacity; • Integrating these components into a cohesive bidirectional search framework, validated through rigorous experiments on multiple challenging tasks. This approach advances the state-of-the-art in search-based reasoning and self-improvement for large models.
Novelty
This research introduces the first systematic combination of bidirectional search—merging evolutionary candidate generation with recursive goal decomposition—for large language models. Unlike prior methods limited to autoregressive expansion or unidirectional tree search, BES leverages genetic-inspired operators to diversify solutions and employs a recursive goal tree to densely guide the search. The theoretical analysis confirming the ability to break entropy shell constraints is novel, providing a rigorous foundation for exploration beyond traditional limits. This dual mechanism represents a significant departure from existing approaches, offering a new paradigm for scalable, effective self-improvement in AI systems.
Limitations
- The computational overhead of BES is higher than traditional methods due to multiple rounds of candidate recombination and goal decomposition, which may hinder real-time deployment in resource-constrained environments.
- The effectiveness of the backward goal decomposition depends heavily on the quality and design of task-specific verifiers, which may be challenging to construct for certain complex or unstructured tasks.
- Current experiments focus on moderate-sized models and specific reasoning tasks; scalability to very large models (e.g., GPT-4/5) and diverse domains remains to be validated. Future work should optimize efficiency and generalize the framework.
- While the theoretical guarantees are robust, practical implementation may face challenges in balancing exploration and exploitation, especially in highly stochastic or ambiguous tasks.
Future Work
Future research directions include optimizing the efficiency of BES to reduce computational costs, exploring adaptive strategies for tuning evolution operators and goal decomposition parameters, and extending the framework to multi-modal and multi-task settings. Additionally, integrating reinforcement learning techniques could further enhance search strategies, enabling models to learn optimal exploration policies. Broader validation across diverse domains, larger models, and real-world applications will be essential to realize BES’s full potential. Finally, developing automated methods for constructing high-quality verifiers will facilitate wider adoption in complex, unstructured tasks.
AI Executive Summary
In recent years, large language models (LLMs) have demonstrated remarkable reasoning and problem-solving capabilities across various domains. However, their performance in complex, hard problems remains limited by the inherent constraints of traditional sampling and search methods. Standard approaches like best-of-N sampling and tree search are effective for moderate tasks but struggle with exploration in low-probability regions and sparse verification signals, often leading to suboptimal solutions or failure to improve during self-training.
To address these challenges, Guowei Xu and colleagues introduce a novel framework called Bidirectional Evolutionary Search (BES). This approach fundamentally rethinks how models explore their solution space by coupling forward candidate evolution with backward goal decomposition. The core idea is inspired by biological evolution and recursive problem breakdown, enabling the model to generate diverse solutions and verify them at multiple granularities.
In the forward phase, BES employs a set of genetic-inspired operators—combination, translocation, crossover, and deletion—to recombine partial trajectories, effectively expanding the search beyond the narrow probability-supported regions typical of autoregressive methods. These operators allow the search to escape the entropy shell, a theoretical construct that confines traditional expansion-based search within low-entropy regions. Meanwhile, the backward phase recursively decomposes the original task into smaller, verifiable sub-goals, forming a goal tree that guides the forward search with dense, interpretable feedback.
Theoretical analysis confirms that pure expansion methods are confined within a narrow entropy shell, limiting exploration. Conversely, the introduction of evolution operators enables the search to break free from this shell, exponentially increasing the search space. The backward goal decomposition further enhances efficiency by reducing the number of samples needed to find correct solutions, especially in complex tasks.
Empirical results demonstrate BES’s effectiveness across multiple challenging benchmarks. In logical reasoning tasks like Knights-and-Knaves, BES consistently improves validation accuracy from around 30% to over 50%. In multi-hop reasoning datasets such as MuSiQue, it outperforms baseline algorithms by 3-4 percentage points, with increased search efficiency and solution diversity. For open problem benchmarks like circle packing and the Heilbronn convex problem, BES achieves higher solution quality and greater stability than existing open-source frameworks, with lower variance across runs.
These advances have significant implications. BES not only enhances the reasoning and problem-solving abilities of large models but also provides a systematic, theoretically grounded approach to self-improvement. Its ability to explore beyond the traditional probability support opens new avenues for autonomous AI systems capable of continuous learning and adaptation. The framework’s modular design allows integration with various verification mechanisms and can be extended to multi-modal, multi-task, and real-time applications.
Despite its promising results, BES faces challenges such as computational overhead and reliance on task-specific verifiers. Future work will focus on optimizing efficiency, automating verifier construction, and validating scalability across larger models and diverse domains. Overall, BES represents a significant step toward more autonomous, capable, and reliable AI systems, with broad potential for scientific, industrial, and societal impact.
Deep Analysis
Background
The evolution of large language models has revolutionized natural language processing, enabling unprecedented capabilities in understanding and generation tasks. Early models like GPT-2 and BERT laid the groundwork, demonstrating the power of transformer architectures and pretraining on massive datasets. Building on these foundations, subsequent models such as GPT-3, Llama, and PaLM have scaled up parameters and training data, achieving remarkable performance across tasks.
However, despite these advances, models still face significant challenges in complex reasoning, multi-step problem solving, and open-ended tasks. Traditional sampling methods like best-of-N and beam search rely heavily on model probabilities, which often confine solutions within high-probability regions, limiting exploration of low-probability but correct solutions. Recent efforts incorporated reinforcement learning, tree search, and reasoning chains (e.g., Chain-of-Thought prompting), but these approaches often suffer from sparse verification signals and limited exploration capacity.
In parallel, research in search algorithms—drawing from classical AI, evolutionary algorithms, and reinforcement learning—has sought to improve exploration efficiency. Notable works include Tree of Thoughts, Graph of Thoughts, and evolutionary methods like AlphaEvolve and ShinkaEvolve, which maintain candidate populations and mutate solutions to discover better answers. Despite progress, these methods largely remain constrained by the support of the model’s probability distribution, unable to fully explore the solution space.
This landscape motivated the development of BES, which aims to fundamentally expand the exploration capacity of language models by combining biologically inspired evolution with recursive goal decomposition. The approach seeks to overcome the limitations of existing methods, enabling models to discover solutions in previously inaccessible regions of the search space, thereby pushing the boundaries of AI reasoning and self-improvement.
Core Problem
Current search strategies for large language models predominantly rely on autoregressive expansion or tree-based exploration, which are inherently limited by the model’s probability distribution. This results in a narrow exploration scope confined within a 'narrow entropy shell,' restricting the model's ability to find low-probability, correct solutions in complex or hard problems.
Moreover, the sparse verification signals—often binary or coarse—fail to provide dense feedback necessary for guiding the search effectively, especially in tasks requiring multi-step reasoning or intricate problem decomposition. As a consequence, models struggle to improve during self-training or to solve challenging tasks reliably.
These limitations hinder the development of autonomous, self-improving AI systems capable of tackling real-world problems that demand exploration beyond the model’s immediate probability support. Addressing this issue requires a new search paradigm that can both diversify candidate solutions and provide dense, interpretable feedback to guide the search process efficiently.
Innovation
The paper introduces several key innovations:
- �� Bidirectional Search Framework: Combines forward candidate evolution with backward goal decomposition, enabling both exploration and dense verification.
- �� Evolution Operators: Inspired by biological recombination, these operators (combination, translocation, crossover, deletion) allow candidates to be restructured and recombined, breaking free from the entropy shell that confines traditional expansion.
- �� Recursive Goal Decomposition: Breaks down complex tasks into a tree of sub-goals, each with verifiable success criteria, providing dense feedback and guiding the search.
- �� Theoretical Guarantees: Proven that evolution operators can escape the entropy shell, exponentially increasing the search space, and that backward goal decomposition reduces sample complexity.
- �� Practical Integration: The framework is designed to work with existing verifiers and models, making it adaptable across tasks and domains.
These innovations collectively enable the model to explore a much larger solution space, discover high-quality solutions in difficult tasks, and systematically improve through self-guided search.
Methodology
- �� Design forward search: Starting from initial partial trajectories, apply standard expansion and introduce four evolution operators (combination, translocation, crossover, deletion) to generate diverse candidates.
- �� Implement evolution operators: For single-parent operators, select parent nodes based on backward scores, and for two-parent operators, select pairs with complementary strengths, using Boltzmann distributions with temperature annealing.
- �� Construct backward goal decomposition: From the top-level goal, recursively split into finer sub-goals, each with a verifier assessing candidate coverage.
- �� Integrate search cycles: After several forward steps, invoke backward goal splitting on unsatisfied sub-goals, update the goal tree, and re-score candidates based on sub-goal coverage.
- �� Theorize and validate: Prove that evolution operators can escape the entropy shell confining expansion-only search, and that goal decomposition reduces sample complexity exponentially.
- �� Conduct experiments: Use datasets like Knights-and-Knaves, MuSiQue, circle packing, and Heilbronn problems, comparing BES with baselines in accuracy, sample efficiency, and stability.
- �� Ablation studies: Remove components like evolution operators or goal decomposition to validate their contributions.
Experiments
The experimental setup involves evaluating BES across multiple tasks and models:
- �� Logical reasoning: Using Knights-and-Knaves dataset, measuring validation accuracy improvements during training, comparing BES with GRPO and MaxRL.
- �� Multi-hop reasoning: On MuSiQue dataset, testing models like Llama-3.2-3B and Llama-3.1-8B, assessing accuracy, search counts, and valid actions.
- �� Open problem solving: On circle packing and Heilbronn convex problems, evaluating solution quality, variance, and stability across different frameworks.
- �� Hyperparameters: Number of search steps, operator probabilities, goal decomposition frequency, verifier types.
- �� Baselines: Including traditional methods (GRPO, Tree-GRPO), open-source frameworks (ShinkaEvolve, GEPA), and human/closed-source solutions.
- �� Metrics: Accuracy, success rate, search efficiency, solution diversity, computational cost.
- �� Ablation experiments: Removing evolution operators or goal decomposition to analyze their impact.
- �� Cost analysis: Wall-clock time, API usage, and resource consumption across methods.
Results
The results demonstrate that BES significantly outperforms baselines across tasks. In logical reasoning, validation accuracy improves from 30% to over 50%, with consistent self-improvement. In MuSiQue, accuracy gains of 3-4% are observed, with increased valid searches and more diverse solutions. On open problems, BES achieves higher average scores (e.g., 2.63 vs. 2.46 in circle packing), with lower variance, indicating more stable and reliable search. Ablation studies confirm that both evolution operators and goal decomposition are essential for performance gains. Cost analysis shows that BES incurs only about 30% more computational overhead than baseline methods, yet delivers substantially better solutions, validating its efficiency and effectiveness in complex reasoning scenarios.
Applications
BES can be applied in various domains requiring complex reasoning and autonomous problem-solving, such as scientific discovery, automated programming, strategic game playing, and decision-making systems. Its ability to generate high-quality, diverse solutions makes it suitable for training data augmentation, self-improving agents, and intelligent assistants. The framework’s modular design allows integration with different verifiers and models, facilitating adaptation to tasks like theorem proving, multi-modal reasoning, and multi-step planning. Over the long term, BES could enable AI systems that continuously learn and evolve, pushing the boundaries of autonomous intelligence and problem-solving capabilities in real-world applications.
Limitations & Outlook
Despite its promising results, BES faces challenges including high computational costs due to multiple candidate recombinations and recursive goal splitting, which may limit scalability and real-time deployment. The effectiveness heavily depends on the quality of verifiers, which can be difficult to design for unstructured or domain-specific tasks. The current experiments focus on moderate-sized models and specific benchmarks; scalability to larger models and diverse real-world problems remains to be validated. Future work should focus on optimizing efficiency, automating verifier construction, and exploring adaptive strategies for balancing exploration and exploitation to make BES more practical for widespread use.
Plain Language Accessible to non-experts
想象你在一家工厂工作,工厂里有许多不同的机器,每台机器都能完成特定的任务。以前,工厂的设计只让机器按照固定的流程工作,遇到复杂的问题时,它们只能沿着预设的路径尝试解决方案。这就像模型用自回归的方法逐步生成答案,限制了探索的范围。
现在,工厂引入了一种新策略,像是给机器装上了“创造”能力,让它们可以像人一样,尝试组合不同的零件,甚至交换零件,创造出全新的方案。这就是演化算子,比如组合、交叉和删除。与此同时,工厂还会把一个大任务拆成许多小任务,逐个确认每个小任务是否完成,就像拼拼图一样,把复杂的问题变成一堆简单的拼图,逐一解决。
通过这两种方法,工厂的机器不再局限于原有的设计,而是可以不断尝试新的组合和拆解,找到更快、更好的解决方案。这就像给模型装上了“创造”和“拆解”的双重能力,让它在面对难题时,能像人一样灵活思考,找到最优答案。这种方法不仅让解决问题变得更快,也让答案变得更丰富、更可靠。
ELI14 Explained like you're 14
想象你在玩一个超级复杂的拼图游戏,里面有很多碎片,要拼出一幅完整的画。以前,你只能按照顺序一块一块拼,试了很多次,还拼错了很多次。这就像模型用一种叫自回归的方法,逐步生成答案,但有时候会陷入死胡同,找不到正确的拼法。
现在,假设你有一种新策略,你可以把拼图拆成几个小部分,先确认每个小部分是不是拼对了,然后再试着把这些小部分拼在一起。更厉害的是,你还能把不同的小部分交换,试试不同的组合,找到最合适的拼法。这就像论文里的演化算子,让模型可以像“创造者”一样,尝试不同的答案组合,而不是只沿着一条路径走。
另外,你还可以把一个大任务拆成很多小任务,比如先拼出拼图的边框,再拼里面的内容。每完成一个小任务,你都可以检查一下,确保没错。这就像把复杂的问题拆成一堆简单的子问题,逐个解决。
通过这些方法,你的拼图游戏变得更聪明、更灵活,也更快能拼出完整的画。模型也是一样,借助这种“拆解”和“创造”的技巧,能更好地解决难题,找到更好的答案。未来,这种方法还能帮我们解决很多复杂的问题,比如科学研究、自动编程,甚至是创造新发明!
Glossary
Bidirectional Evolutionary Search (BES)(双向进化搜索)
A search framework combining forward candidate evolution with backward goal decomposition, designed to enhance exploration and verification in large models.
The core algorithm proposed in the paper, used to improve reasoning and self-improvement capabilities.
Evolution Operators(演化算子)
Genetic-inspired operations—combination, translocation, crossover, deletion—that recombine and mutate candidate solutions to diversify search space.
Applied during forward search to generate diverse candidate trajectories.
Backward Goal Decomposition(逆向目标分解)
A recursive process that breaks down complex tasks into verifiable sub-goals, providing dense feedback for guiding search.
Enhances search efficiency and solution quality by dense supervision.
Verifier(验证器)
A tool or model that assesses whether a candidate solution satisfies a sub-goal or task requirement, such as rule checkers or similarity models.
Used to evaluate candidate trajectories during backward goal decomposition.
Entropy Shell(熵壳)
A theoretical confinement where candidates generated by pure expansion are limited to low-entropy regions, restricting exploration.
Theoretical analysis shows evolution operators can break this shell.
Open Questions Unanswered questions from this research
- 1 While BES demonstrates strong performance on moderate-scale models and specific tasks, its scalability and efficiency on very large models like GPT-4 or GPT-5 are still unverified. Future research needs to optimize computational costs and adapt the framework for real-time applications. Additionally, designing universal, high-quality verifiers for diverse tasks remains challenging. How to automate and generalize verifier construction to handle unstructured or domain-specific problems is an open question. Further, integrating BES with reinforcement learning or other adaptive strategies could enhance its exploration-exploitation balance, enabling more autonomous and scalable self-improvement systems.
Applications
Immediate Applications
Training Data Generation
Using BES to generate high-quality, diverse training samples for complex reasoning tasks, improving model robustness and accuracy, especially in low-resource or hard problem domains.
Autonomous Reasoning Agents
Enhancing AI agents with BES-driven search to improve decision-making, problem-solving, and reasoning capabilities in applications like automated scientific discovery or strategic games.
Complex Optimization Tasks
Applying BES in automated planning, program synthesis, and combinatorial optimization, where exploring large solution spaces efficiently is critical.
Long-term Vision
Self-Improving AI Systems
Developing AI that can autonomously evolve and improve itself through iterative BES-based search, pushing towards fully autonomous, lifelong learning systems.
Cross-Domain Multi-Modal Reasoning
Extending BES to multi-modal data (text, images, video) and multi-task scenarios, enabling AI to perform complex reasoning across diverse information sources.
Abstract
Search has been proposed as an effective method for self-improving language models and agentic systems, both for post-training sample generation and for inference. However, widely used methods such as best-of-N sampling and tree search face two fundamental limitations: they are guided by sparse verification signals, and they construct candidates primarily through autoregressive expansion, restricting exploration to regions with substantial model probability mass. To address these, we propose Bidirectional Evolutionary Search (BES), a search framework that couples forward candidate evolution with backward goal decomposition. In the forward search, BES augments standard expansion with evolution operators that recombine partial trajectories to generate candidates that are difficult to obtain from a single model rollout. In the backward search, BES recursively decomposes the original task into checkable subgoals, producing dense intermediate feedback that guides forward search. We provide theoretical motivation showing that candidates generated by expansion-only search are confined to a narrow entropy shell while evolutionary operators can escape it, and that backward search can exponentially reduce the number of required samples to find a correct answer. Experiments show that on challenging post-training tasks where mainstream post-training algorithms fail to improve, BES enables consistent gains, and on three open problem solving benchmarks at inference time, BES outperforms existing open-source frameworks in both average and best-case performance. Code and trained models are available at https://github.com/Embodied-Minds-Lab/BES.
References (20)
Tree Search for LLM Agent Reinforcement Learning
Yuxiang Ji, Ziyu Ma, Yong Wang et al.
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution
R. Lange, Yuki Imajuku, Edoardo Cetin
TreeRL: LLM Reinforcement Learning with On-Policy Tree Search
Zhenyu Hou, Ziniu Hu, Yujiang Li et al.
Olympiad-level formal mathematical reasoning with reinforcement learning
T. Hubert, Rishi S Mehta, Laurent Sartran et al.
Branch-and-Bound Methods: A Survey
E. Lawler, D. Wood
Differential Evolution – A Simple and Efficient Heuristic for global Optimization over Continuous Spaces
R. Storn, K. Price
Graph of Thoughts: Solving Elaborate Problems with Large Language Models
Maciej Besta, Nils Blach, Aleš Kubíček et al.
Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search
Kou Misaki, Yuichi Inoue, Yuki Imajuku et al.
Efficient Evolutionary Search Over Chemical Space with Large Language Models
Haorui Wang, Marta Skreta, C. Ser et al.
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for LLM Problem-Solving
Yangzhen Wu, Zhiqing Sun, Shanda Li et al.
Genetic Programming
Moshe Sipper
Some Genetic Aspects of Sex
H. Muller
ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search
Dan Zhang, Sining Zhoubian, Yisong Yue et al.
Mathematical discoveries from program search with large language models
B. Romera-Paredes, M. Barekatain, Alexander Novikov et al.
Bandit Based Monte-Carlo Planning
Levente Kocsis, Csaba Szepesvari
♫ MuSiQue: Multihop Questions via Single-hop Question Composition
H. Trivedi, Niranjan Balasubramanian, Tushar Khot et al.
AlphaEvolve: A coding agent for scientific and algorithmic discovery
Alexander Novikov, Ngân V˜u, Marvin Eisenberger et al.
Voyager: An Open-Ended Embodied Agent with Large Language Models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang et al.
IMPORTANT
Ruth Edwards
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
Lakshya A. Agrawal, Shangyin Tan, Dilara Soylu et al.