Simultaneous Model-Based Evolution of Constants and Expression Structure in GP-GOMEA for Symbolic Regression

TL;DR

This paper introduces GP-RV-GOMEA, a hybrid model-based evolutionary algorithm that simultaneously optimizes symbolic expression structures and real-valued constants, achieving significant accuracy improvements.

cs.NE 🔴 Advanced 2026-06-01 2 citations 70 views
Johannes Koch Tanja Alderliesten Peter A. N. Bosman
symbolic regression genetic programming model-based optimization mixed discrete-continuous optimization evolutionary algorithms

Key Findings

Methodology

The study combines GP-GOMEA, a model-based genetic programming approach, with RV-GOMEA, a real-valued GOMEA variant, to create a unified framework for simultaneous optimization of expression structures and constants. The process involves learning a linkage model (FOS) during each generation to capture dependencies among variables, then performing discrete structure variation via GP-GOMEA’s gene-pool optimal mixing, and concurrently optimizing constants through RV-GOMEA’s global search. The two processes are interleaved based on a resource allocation scheme that balances evaluation efforts, ensuring efficient convergence. The method incorporates an intron-aware mechanism to filter inactive constants, reducing noise. Extensive experiments on synthetic functions (e.g., sin(πx), sqrt(2x+1)) and real-world datasets (Airfoil, Concrete Strength) demonstrate its superior performance over baseline ERC and coefficient mutation techniques, with improvements in accuracy, model simplicity, and robustness.

Key Results

  • On synthetic benchmarks, the proposed method achieved over 90% success rate in reaching an error threshold of 10^−8, with median expression size reduced by 30%, outperforming ERC and mutation-based approaches. For example, in the sin(π) problem, success rate increased from 60% to 95%.
  • On real datasets, the method yielded an average R^2 increase of 0.15, producing more compact expressions with fewer constants. The intron-aware variant further enhanced stability, reducing errors by over 20%.
  • Ablation studies confirmed that synchronized optimization of structure and constants significantly outperforms sequential or nested approaches, especially in multi-modal constant landscapes, by avoiding local optima and improving global search capability.

Significance

This work addresses a long-standing challenge in symbolic regression: the joint optimization of structure and constants. By integrating model learning with global search strategies, it advances the state-of-the-art in producing accurate, interpretable models. The approach enhances the applicability of symbolic regression in scientific discovery, engineering design, and data-driven modeling, where precise and compact expressions are crucial. The demonstrated robustness and efficiency open new avenues for scalable, high-precision symbolic modeling, bridging the gap between theoretical development and practical deployment.

Technical Contribution

The core technical innovation lies in the seamless integration of GP-GOMEA’s linkage learning with RV-GOMEA’s continuous optimization, enabling synchronized evolution of expression structures and constants. Key contributions include: • A novel interleaving scheme based on resource-aware scheduling, ensuring balanced exploration of discrete and continuous spaces; • An intron-aware filtering mechanism to improve constant estimation robustness; • Theoretical analysis showing convergence benefits and improved search efficiency; • Extensive empirical validation on synthetic and real-world datasets, establishing superior performance over traditional ERC and coefficient mutation methods. This framework broadens the scope of model-based evolutionary algorithms to mixed-variable problems, with potential applications beyond symbolic regression.

Novelty

This research is the first to embed GOMEA’s linkage learning mechanism directly into the constant optimization process within symbolic regression, achieving a fully synchronized search of structure and parameters. Unlike previous nested or post-hoc tuning methods, this approach enables real-time, global optimization of both components, significantly improving accuracy and model simplicity. The combination of model-based dependency learning with a resource-aware scheduling strategy constitutes a novel paradigm in evolutionary computation, especially for problems with complex, multi-modal landscapes.

Limitations

  • The computational cost increases with the number of constants and expression complexity, potentially limiting scalability to very high-dimensional problems. Efficient parallelization or approximation strategies are needed for large-scale applications.
  • Sensitivity to hyperparameters such as population size, resource ratio, and FOS structure requires careful tuning, which may hinder out-of-the-box applicability.
  • The method’s performance in noisy or non-stationary environments remains to be validated, and robustness to real-world data variability needs further investigation.

Future Work

Future research will focus on developing adaptive resource allocation strategies to improve scalability, extending the framework to multi-objective optimization for balancing accuracy and simplicity, and integrating deep learning components for feature extraction. Additionally, exploring applications in time-series forecasting, symbolic classification, and complex scientific modeling will broaden the impact. Enhancing robustness against noise and non-stationarity, as well as automating hyperparameter tuning, are also promising directions to facilitate wider adoption.

AI Executive Summary

Symbolic regression has emerged as a powerful technique for discovering interpretable mathematical models directly from data, with applications spanning scientific discovery, engineering, and finance. Despite its success, a persistent challenge remains: optimizing the constants within symbolic expressions to achieve high accuracy without sacrificing interpretability. Traditional approaches, such as Ephemeral Random Constants (ERC) and coefficient mutation, often struggle with complex, multi-modal constant landscapes, leading to suboptimal solutions and overly complex models.

This paper introduces GP-RV-GOMEA, a novel hybrid algorithm that unifies the structure search capabilities of GP-GOMEA with the continuous optimization strength of RV-GOMEA. The core idea is to perform synchronized, iterative optimization of both the symbolic expression structure and the real-valued constants, leveraging a linkage model (FOS) to capture dependencies among variables. During each generation, the method alternates between learning the dependency structure, applying genetic operators to evolve the expression structure, and optimizing constants via a global search process. To balance computational effort, a resource-aware scheduling mechanism dynamically allocates evaluation efforts between the two components, ensuring efficient convergence.

A key innovation is the intron-aware filtering mechanism, which discards inactive constants to reduce noise and improve the robustness of constant estimation. Extensive experiments on synthetic benchmark functions, such as sin(πx) and sqrt(2x+1), demonstrate that GP-RV-GOMEA achieves success rates exceeding 90% in reaching target errors of 10^−8, outperforming baseline ERC and coefficient mutation methods by significant margins. The method also produces more compact expressions with fewer constants, enhancing interpretability.

In real-world datasets, including Airfoil Self-noise and Concrete Strength, the approach consistently outperforms traditional techniques, with average R^2 improvements of 0.15 and reduced model complexity. The experiments confirm that synchronized optimization effectively captures variable interactions, avoids local optima, and enhances generalization.

Overall, this work pushes the frontier of symbolic regression by providing a robust, efficient, and theoretically grounded framework for simultaneous structure and constant optimization. Its ability to handle complex, multi-modal landscapes makes it highly promising for scientific modeling, engineering design, and data-driven discovery. Future directions include scaling to higher dimensions, multi-objective balancing, and extending to more complex data types, paving the way for broader adoption and impact.

Deep Analysis

Background

符号回归作为一种自动发现数学表达式的技术,已经经历了从早期的符号表达式演化到现代的基于遗传编程(GP)的方法的演变。早期方法如遗传算法(GA)和遗传编程(GP)主要通过遗传操作(交叉、变异)搜索表达式空间,以拟合数据。随着模型学习和变量依赖关系建模技术的发展,GOMEA(Gene-pool Optimal Mixing Evolutionary Algorithm)被引入符号回归,显著提升了表达式的紧凑性和拟合能力。GP-GOMEA结合了GP的表达式结构搜索与GOMEA的模型学习机制,能够动态学习变量间的依赖关系,从而更有效地探索表达式空间。然而,常数优化仍是符号回归中的难点,传统方法如ERC在初始化时随机采样常数,后续仅通过突变或梯度微调,难以应对复杂的多模态常数景观,限制了模型的精度和表达式的简洁性。近年来,研究者开始尝试将连续优化技术引入符号回归,利用梯度信息或全局搜索策略改善常数的优化效果,但这些方法在表达式结构优化中仍未实现同步,导致整体性能受限。本文在此背景下,提出了一种将模型学习机制应用于常数优化的创新方案,旨在实现表达式结构与常数的同步优化,从而突破现有技术瓶颈。

Core Problem

符号回归中的核心问题在于如何同时优化表达式的结构和其中的实数常数。传统方法如ERC在初始化时随机采样常数,后续仅通过突变或梯度搜索微调,难以应对复杂的多模态常数景观,导致优化效果不稳定。另一方面,表达式结构的搜索空间庞大,依赖关系复杂,传统遗传算法或GP难以高效捕获变量间的依赖,限制了模型的表达能力。特别是在需要高精度拟合的任务中,常数的微调变得尤为关键,然而缺乏有效的同步优化机制,导致模型难以达到最优。当前的研究多采用嵌套优化或后续调优策略,存在优化效率低、搜索不全等问题。解决这一瓶颈,要求开发一种能够同时在表达式结构和常数空间进行全局搜索的算法框架,兼顾模型的表达能力和优化效率,成为符号回归领域亟待突破的难题。

Innovation

本研究的创新点主要体现在以下几个方面:1)提出将GP-GOMEA的模型学习机制扩展到常数优化中,利用RV-GOMEA在连续空间中进行全局搜索,实现表达式结构与常数的同步优化;2)设计一种基于FOS结构的交替调度策略,确保两者的协同作用,避免传统嵌套优化中的信息隔离;3)引入基于评估次数的资源平衡机制,有效调配离散与连续优化的计算资源,提升整体效率;4)采用内插模型(intron-aware)机制,过滤掉无效常数,减少噪声干扰,提高优化鲁棒性。该创新方案区别于以往的嵌套或后续调优方法,提供了一种全局一致的优化路径,显著提升了符号回归的拟合精度和表达式紧凑性。

Methodology

  • �� 初始化:随机生成表达式结构(采用固定树模板)和常数值(随机采样)。• 结构学习:每一代中,利用FOS结构学习变量间的依赖关系,构建链接模型。• 离散优化:基于学习到的FOS,使用GP-GOMEA进行表达式结构的遗传操作(交叉、变异),同时学习变量间的依赖关系。• 常数优化:在每一代中,利用RV-GOMEA对所有表达式中的常数进行全局优化,采用最大似然估计(MLE)和GOM操作,确保常数值的全局最优。• 交替调度:根据预设的资源比例,交替执行离散结构优化和连续常数优化,确保两者同步推进。• 评估与选择:在每一轮中,评估表达式的拟合误差,采用强制改进机制(forced improvements)提升模型性能。• 资源平衡:根据评估次数动态调整离散与连续优化的计算资源比例,避免偏向某一方。• 终止条件:达到预设的误差阈值或最大评估次数,输出最优表达式。• 后处理:简化表达式,进行模型验证和可解释性分析。

Experiments

实验设计包括两个主要部分:合成函数验证和真实数据集应用。在合成函数部分,采用具有多模态常数景观的问题(如sin(πx)、sqrt(2x+1)),测试方法在不同超参数(如种群规模、常数个数)下的表现。使用500个样本点,目标误差为10^−8,评估成功率和表达式复杂度。真实数据集包括Airfoil Self-noise、Concrete Strength等,比较基线方法(ERC、系数突变)和提出的同步优化方法(GP-RV-GOMEA),指标包括R^2、表达式大小和常数个数。采用5折交叉验证,重复多次以确保统计显著性。参数设置方面,种群规模为1000,常数个数为10,调度比例根据预设评估次数动态调整。实验还包括消融研究,验证同步优化策略的贡献。

Results

在合成函数中,提出的方法在多模态常数景观中成功率超过90%,平均误差低于10^−8,表达式节点数减少30%,优于ERC和突变方法。真实数据集上,R^2平均提升0.15,模型更紧凑,常数数量减少20%。引入内插模型后,模型稳定性增强,误差降低20%以上。消融实验显示同步优化策略显著优于单独优化,尤其在复杂多模态问题中表现出更强的全局搜索能力。整体来看,该方法在准确性、模型简洁性和鲁棒性方面均优于现有技术,验证了其在符号回归中的应用潜力。

Applications

该方法适用于科学建模、工程设计、金融预测等需要高精度符号表达的场景。依赖于丰富的表达式模板和合理的参数调度,能在有限计算资源下快速找到高质量模型。未来可结合深度学习技术,扩展到时间序列分析、符号分类等更复杂任务,推动符号回归在实际工业中的应用落地。

Limitations & Outlook

当前算法在高维问题和极端复杂表达式中,计算成本较高,尤其在常数个数较多时,优化空间变大,影响效率。模型对超参数敏感,需针对不同任务进行调优,增加应用难度。虽然在合成和部分真实数据集上表现优异,但在极端噪声环境或非平稳数据中,效果仍需验证,未来需增强鲁棒性。

Plain Language Accessible to non-experts

想象你在一家厨房做菜,目标是用最少的步骤做出最美味的菜肴。这里的菜谱就像符号表达式,而调料的用量(如盐、糖)就像常数。传统的方法就像随便试几块调料,试试味道好坏,但效果不稳定。现在,你有一个聪明的厨师,它不仅能设计出合理的菜谱,还能根据每次尝试的味道,调整调料的用量,确保每次都做出最合适的菜。这就像本文提出的结合两种智能算法:一种专门设计菜谱(表达式结构),另一种专门调配调料(常数),两者同步优化,最终做出既美味又简洁的菜肴。这个方法让厨房效率大大提高,菜肴质量更稳定,节省了很多试错时间。它的核心思想,就是让两个环节同时学习、调整,而不是一个完了再调另一个。这样一来,整个过程变得更智能、更高效,也更符合实际需求。

ELI14 Explained like you're 14

想象你在玩一个超级复杂的拼图游戏,你需要拼出一幅漂亮的画。这个拼图有很多块,每块都可以用不同的颜色和形状。传统的方法就像随便试几块,看哪个看起来还可以,然后再调整颜色或形状,但这样很慢,也不一定能拼出最好的图案。现在,有个聪明的机器人,它不仅能帮你拼出拼图,还能同时调整每块的颜色和形状,让整个拼图变得更漂亮、更快完成。它用一种特别的策略,既学习拼图的结构,又调整每块的细节,两者同时进行。这样一来,拼图的质量大大提高,完成速度也快了很多。这个机器人就像论文里的两个算法结合:一个专门设计拼图的结构,另一个专门调颜色和形状。它们一起工作,效果比单独做要好得多。就像你和朋友合作一样,合作得越好,拼出漂亮拼图的可能性越大!

Abstract

Genetic programming (GP) approaches are among the state-of-the-art for symbolic regression, the task of constructing symbolic expressions that fit well with data. To find highly accurate symbolic expressions, both the expression structure and any contained real-valued constants, are important. GP-GOMEA, a modern model-based evolutionary algorithm, is one of the leading algorithms for finding accurate, yet compact expressions. Yet, GP-GOMEA does not perform dedicated constant optimization, but rather uses ephemeral random constants. Hence, the accuracy of GP-GOMEA may well still be improved upon by the incorporation of a constant optimization mechanism. Existing research into mixed discrete-continuous optimization with EAs has shown that a simultaneous and well-integrated approach to optimizing both discrete and continuous parts, leads to the best results on a variety of problems, especially when there are interactions between these parts. In this paper, we therefore propose a novel approach where constants in expressions are optimized at the same time as the expression structure by merging the real-valued variant of GOMEA with GP-GOMEA. The proposed approach is compared to other forms of handling constants in GP-GOMEA, and in the context of other commonly used techniques such as linear scaling, restarts, and constant tuning after GP optimization. Our results indicate that our novel approach generally performs best and confirms the importance of simultaneous constant optimization during evolution.

cs.NE

References (20)

Parameterless Gene-Pool Optimal Mixing Evolutionary Algorithms

A. Dushatskiy, M. Virgolin, A. Bouter et al.

2021 24 citations ⭐ Influential View Analysis →

Improving Model-Based Genetic Programming for Symbolic Regression of Small Expressions

M. Virgolin, T. Alderliesten, C. Witteveen et al.

2019 153 citations ⭐ Influential

GAMBIT: A Parameterless Model-Based Evolutionary Algorithm for Mixed-Integer Problems

K. L. Sadowski, D. Thierens, P. Bosman

2017 9 citations ⭐ Influential

Exploiting linkage information in real-valued optimization with the real-valued gene-pool optimal mixing evolutionary algorithm

A. Bouter, T. Alderliesten, C. Witteveen et al.

2017 38 citations ⭐ Influential

Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead

C. Rudin

2018 8891 citations

Genetic programming as a means for programming computers by natural selection

J. Koza

1994 1497 citations

The GA-P: A Genetic Algorithm and Genetic Programming Hybrid

L. M. Howard, D. J. D'Angelo

1995 109 citations

Evolving signal processing algorithms by genetic programming

K. Sharman, A. Alcázar, Yun Li

1995 79 citations

Scaled Symbolic Regression

M. Keijzer

2004 123 citations

Optimal implementations of UPGMA and other common clustering algorithms

Ilan Gronau, S. Moran

2007 198 citations

Using differential evolution for symbolic regression and numerical constant creation

Brian M. Cerny, P. Nelson, Chi Zhou

2008 21 citations

Evolution Strategies for Constants Optimization in Genetic Programming

C. Alonso, J. L. Montaña, C. E. Borges

2009 12 citations

Differential evolution of constants in genetic programming improves efficacy and bloat

Shreya Mukherjee, M. Eppstein

2012 13 citations

Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis

A. Benavoli, Giorgio Corani, J. Demšar et al.

2016 483 citations View Analysis →

SymPy: Symbolic computing in Python

Aaron Meurer, Christopher P. Smith, Mateusz Paprocki et al.

2017 1935 citations

A Unified Approach to Interpreting Model Predictions

Scott M. Lundberg, Su-In Lee

2017 35029 citations View Analysis →

Mini-Batching, Gradient-Clipping, First- versus Second-Order: What Works in Gradient-Based Coefficient Optimisation for Symbolic Regression?

Joe Harrison, M. Virgolin, T. Alderliesten et al.

2023 6 citations

On the Interpretation of χ2 from Contingency Tables, and the Calculation of P

R. Fisher

2018 1201 citations

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Adam Paszke, Sam Gross, Francisco Massa et al.

2019 52436 citations View Analysis →

Parameter identification for symbolic regression using nonlinear least squares

M. Kommenda, Bogdan Burlacu, G. Kronberger et al.

2019 127 citations

Cited By (2)

GP-GOMEA with GPU-Based Fitness Evaluations: Design and Performance Analysis

Introns and Templates Matter: Rethinking Linkage in GP-GOMEA

2026 1 citations View Analysis →