On the Generalization Bounds of Symbolic Regression with Genetic Programming
Analysis of generalization bounds in symbolic regression with genetic programming, decomposing the generalization gap into structure selection and constant fitting.
Key Findings
Methodology
This paper provides a learning-theoretic analysis of symbolic regression models, focusing on those represented as expression trees and evolved with genetic programming (GP). The study derives a generalization bound for GP-style symbolic regression under constraints on tree size, depth, and learnable constants. The result decomposes the generalization gap into two interpretable components: a structure-selection term, reflecting the combinatorial complexity of choosing an expression-tree structure, and a constant-fitting term, capturing the complexity of optimizing numerical constants within a fixed structure.
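To make the shape of this result concrete, here is a schematic version in our own notation, not the paper's exact statement: with $n$ training samples, a set $\mathcal{S}$ of admissible tree structures under the size and depth limits, and $k$ learnable constants covered at resolution $\varepsilon$, a bound of this flavor reads

$$
\underbrace{\mathbb{E}[\ell(f)] - \hat{\mathbb{E}}_n[\ell(f)]}_{\text{generalization gap}}
\;\lesssim\;
\underbrace{\sqrt{\frac{\log\lvert\mathcal{S}\rvert}{n}}}_{\text{structure selection}}
\;+\;
\underbrace{\sqrt{\frac{k\,\log(1/\varepsilon)}{n}}}_{\text{constant fitting}}
$$

The first term shrinks when structural restrictions shrink $\lvert\mathcal{S}\rvert$; the second shrinks when fewer constants need to be fitted or predictions are less sensitive to them.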
Key Results
- Result 1: The study shows that structural restrictions improve generalization by reducing hypothesis-class growth: limiting the depth and size of trees lowers the combinatorial complexity of the model class and thereby the risk of overfitting (see the tree-counting sketch after this list).
- Result 2: The analysis finds that stability mechanisms such as protected operators and interval arithmetic effectively control the sensitivity of predictions to parameter perturbations, which is reflected in better test performance in experiments.
- Result 3: Experiments also demonstrate that common GP practices, such as parsimony pressure and depth limits, are directly related to theoretical complexity terms, providing a theoretical explanation for empirically observed behaviors.
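To illustrate Result 1, here is a toy calculation (our own, with an arbitrary primitive set, not from the paper) of how the number of admissible tree structures, and hence a $\log\lvert\mathcal{S}\rvert$-style structure-selection term, explodes with the depth limit:

```python
import math

def count_trees(depth, n_leaves=3, n_unary=2, n_binary=4):
    """Number of distinct expression trees of depth <= depth."""
    if depth == 0:
        return n_leaves                       # a leaf: a variable or a constant
    smaller = count_trees(depth - 1, n_leaves, n_unary, n_binary)
    return (n_leaves                          # trees that are just a leaf
            + n_unary * smaller               # unary operator over one subtree
            + n_binary * smaller ** 2)        # binary operator over two subtrees

for d in range(1, 7):
    print(d, round(math.log(count_trees(d)), 1))  # log|S| roughly doubles per level
```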
Significance
This research provides a theoretical framework for symbolic regression, explaining why GP models can generalize beyond training data. By linking practical design choices to explicit complexity terms in the generalization bound, the study offers a theoretical explanation for commonly observed empirical behaviors in GP-based symbolic regression. This not only helps understand the effectiveness of existing methods but also lays the foundation for developing more rigorous methods for controlling model complexity.
Technical Contribution
Technical contributions include the first derivation of an explicit generalization bound for GP-style symbolic regression models, together with a decomposition of the generalization gap into structure-selection and constant-fitting terms. This decomposition provides a theoretical lens for understanding how common mechanisms in GP influence generalization. Additionally, the study demonstrates how structural restrictions and stability mechanisms can control model complexity.
Novelty
This study is the first to link the generalization bounds of symbolic regression to specific complexity terms, providing a theoretical explanation for common mechanisms in GP. Unlike previous methods that primarily rely on empirical validation, this paper offers a unified learning-theoretic framework that reveals the joint influence of structure selection and parameter fitting on generalization.
Limitations
- Limitation 1: The study is based on worst-case complexity measures, which may be conservative in practice and fail to fully capture the effective complexity of observed data.
- Limitation 2: The generalization bounds are data-independent, which may affect their quantitative tightness, necessitating the development of data-dependent bounds.
- Limitation 3: The current analysis does not fully consider the effective set of structures explored by the search process.
Future Work
Future directions include developing data-dependent bounds to better capture the effective complexity of symbolic regression models on observed data. Additionally, leveraging the proposed bound for the design of generalization-aware symbolic regression algorithms, such as optimizing surrogate objectives derived from the generalization bound, is a promising avenue.
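As a concrete illustration of such a surrogate objective, here is a minimal sketch in Python; the penalty form, the proxy terms, and the weight are our own assumptions, not the paper's proposal:

```python
# Rank candidate expressions by a bound-inspired penalized score instead of
# raw training error. The penalty grows with tree size (a proxy for the
# structure-selection term) and the number of fitted constants (a proxy for
# the constant-fitting term). All constants here are illustrative.
import math

def surrogate_score(train_mse, tree_size, n_constants, n_samples,
                    n_primitives=10, weight=1.0):
    structure = tree_size * math.log(n_primitives)   # log-cardinality proxy
    penalty = math.sqrt((structure + n_constants) / n_samples)
    return train_mse + weight * penalty

# (training MSE, tree size, number of constants) for two candidates:
candidates = [(0.10, 25, 6), (0.12, 9, 2)]
best = min(candidates, key=lambda c: surrogate_score(*c, n_samples=200))
print(best)   # the smaller, slightly worse-fitting expression wins
```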
AI Executive Summary
Symbolic regression (SR) is a method aimed at discovering interpretable mathematical expressions directly from data, widely used in scientific discovery and interpretable modeling. Despite its significant empirical success, the theoretical understanding of why GP-based SR generalizes beyond training data remains limited.
This paper provides a learning-theoretic analysis of symbolic regression models, focusing on those represented as expression trees and evolved with genetic programming (GP). The study derives a generalization bound for GP-style symbolic regression under constraints on tree size, depth, and learnable constants. The result decomposes the generalization gap into two interpretable components: a structure-selection term, reflecting the combinatorial complexity of choosing an expression-tree structure, and a constant-fitting term, capturing the complexity of optimizing numerical constants within a fixed structure.
This decomposition provides a theoretical lens for understanding how common mechanisms in GP influence generalization. Specifically, structural restrictions improve generalization performance by reducing hypothesis-class growth, while stability mechanisms such as protected operators and interval arithmetic control the sensitivity of predictions to parameter perturbations. By linking these practical design choices to explicit complexity terms in the generalization bound, the study offers a theoretical explanation for commonly observed empirical behaviors in GP-based symbolic regression.
Experiments confirm the analysis: limiting the depth and size of trees, which lowers the combinatorial complexity of the hypothesis class, reduces the risk of overfitting, and stability mechanisms are reflected in better test performance.
In conclusion, this research not only helps understand the effectiveness of existing methods but also lays the foundation for developing more rigorous methods for controlling model complexity. Future work includes developing data-dependent bounds to better capture the effective complexity of symbolic regression models on observed data.
Deep Analysis
Background
Symbolic regression (SR) is a method aimed at discovering analytical expressions that describe relationships between input variables and target outputs directly from data. Unlike conventional regression methods, which assume a predefined model class and then optimize parameters within that class, SR simultaneously searches for both the functional form of the predictor and its numerical coefficients. Because the output of SR is an explicit mathematical expression, the resulting models are often interpretable and can sometimes reveal underlying mechanisms in the data. Genetic programming (GP) is one of the most widely used approaches to SR. In GP-based SR, candidate models are typically represented as expression trees whose internal nodes correspond to operators and whose leaves correspond to variables or constants. Evolutionary search is then used to explore the space of symbolic structures through operations such as mutation, crossover, and selection. Additionally, the numerical constants within a given structure are often refined using dedicated continuous optimization techniques, such as nonlinear least squares. This hybrid nature—combining discrete structure search with continuous parameter optimization—makes GP particularly attractive for SR, as it enables open-ended exploration of functional forms without requiring prior commitment to a specific model class.
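A minimal Python sketch of this expression-tree representation and its recursive evaluation (the node layout and primitive set are illustrative, not tied to any particular GP system):

```python
# An expression tree: internal nodes are operators, leaves are variables or
# constants, as described above. This is our own minimal sketch.
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                                  # operator name, 'x0', 'x1', ..., or 'const'
    children: list = field(default_factory=list)
    value: float = 0.0                       # used only when op == 'const'

OPS = {'add': lambda a, b: a + b,
       'mul': lambda a, b: a * b,
       'sin': lambda a: math.sin(a)}

def evaluate(node, x):
    """Recursively evaluate an expression tree on input vector x."""
    if node.op == 'const':
        return node.value
    if node.op.startswith('x'):
        return x[int(node.op[1:])]           # leaf: an input variable
    args = [evaluate(c, x) for c in node.children]
    return OPS[node.op](*args)

# The expression (x0 + 1.5) * sin(x1):
tree = Node('mul', [Node('add', [Node('x0'), Node('const', value=1.5)]),
                    Node('sin', [Node('x1')])])
print(evaluate(tree, [2.0, 0.5]))
```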
Core Problem
Despite the significant empirical success of GP in symbolic regression, the theoretical understanding of its generalization capabilities remains limited. Specifically, why a model found by GP should generalize beyond the training data is a question that has not been fully explained in learning-theoretic terms. This question concerns how well a model that fits the training data will perform on unseen data, typically formalized through the notion of a generalization gap, the difference between training error and expected prediction error. In practice, GP-based SR is known to be sensitive to design choices such as tree-size limits, maximum depth, constant optimization strategies, and the use of protected or numerically stable operators, many of which are motivated by the need to control overfitting.
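In standard notation (ours, consistent with the description above), for a learned predictor $\hat{f}$ under squared loss the generalization gap is

$$
\operatorname{gap}(\hat{f}) \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\!\left[\big(\hat{f}(x)-y\big)^2\right] \;-\; \frac{1}{n}\sum_{i=1}^{n}\big(\hat{f}(x_i)-y_i\big)^2 .
$$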
Innovation
The core innovation of this paper lies in deriving an explicit generalization bound for GP-style symbolic regression models for the first time and decomposing the generalization gap into structure selection and constant fitting. This decomposition provides a theoretical lens for understanding how common mechanisms in GP influence generalization. Specifically, structural restrictions improve generalization performance by reducing hypothesis-class growth, while stability mechanisms such as protected operators and interval arithmetic control the sensitivity of predictions to parameter perturbations. By linking these practical design choices to explicit complexity terms in the generalization bound, the study offers a theoretical explanation for commonly observed empirical behaviors in GP-based symbolic regression.
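For concreteness, here is a hedged sketch of typical protected operators; the exact fallback values differ across GP systems, so the constants below are illustrative:

```python
# Protected operators return a safe fallback on inputs where the raw
# operation would be undefined or numerically unstable.
import math

def pdiv(a, b, eps=1e-9):
    """Protected division: return a neutral value when |b| is near zero."""
    return a / b if abs(b) > eps else 1.0

def plog(a, eps=1e-9):
    """Protected logarithm: operate on |a| and clamp it away from zero."""
    return math.log(abs(a)) if abs(a) > eps else 0.0

def psqrt(a):
    """Protected square root: operate on the absolute value."""
    return math.sqrt(abs(a))

print(pdiv(1.0, 0.0), plog(0.0), psqrt(-4.0))   # 1.0 0.0 2.0
```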
Methodology
The methodology of this paper includes the following key steps:
- Deriving the generalization bound: under constraints on tree size, depth, and learnable constants, derive an explicit bound on the generalization gap of GP-style symbolic regression.
- Decomposing the generalization gap: split the gap into a structure-selection term, reflecting the combinatorial complexity of choosing an expression-tree structure, and a constant-fitting term, capturing the complexity of optimizing numerical constants within a fixed structure (see the constant-fitting sketch after this list).
- Linking theory to practice: connect practical design choices to explicit complexity terms in the bound, providing a theoretical explanation for commonly observed empirical behaviors in GP-based symbolic regression.
- Experimental validation: validate the analysis through experiments on the impact of structural restrictions and stability mechanisms on generalization performance.
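To ground the constant-fitting step, here is a minimal sketch of refining the constants of a fixed structure with nonlinear least squares; the structure f(x) = c0·sin(c1·x) + c2 is a hypothetical example, not one from the paper:

```python
# Fit the constants of a fixed symbolic structure with nonlinear least
# squares; in GP the structure would come from the evolved expression tree.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 50)
y = 2.0 * np.sin(1.5 * x) + 0.5 + 0.05 * rng.standard_normal(50)

def residuals(c):
    return c[0] * np.sin(c[1] * x) + c[2] - y

# Levenberg-Marquardt refinement of the constants within the fixed structure:
fit = least_squares(residuals, x0=[1.0, 1.0, 0.0], method='lm')
print(fit.x)   # approximately [2.0, 1.5, 0.5]
```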
Experiments
The experimental design compares multiple datasets and baseline configurations to evaluate generalization performance. The key controlled parameters include tree-size and depth limits and the constant-optimization strategy. Ablation studies isolate the effect of individual design choices, and the theoretical analysis is validated by comparing the performance of different models on held-out test data.
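A sketch of what such an ablation can look like using the gplearn library (whose `div` primitive is protected by default); the dataset, parameter grid, and settings below are illustrative assumptions, not the paper's protocol:

```python
# Ablation over depth limits and parsimony pressure, measuring the
# train-test gap for each configuration. Settings are illustrative.
import numpy as np
from gplearn.genetic import SymbolicRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = X[:, 0] ** 2 + np.sin(X[:, 1])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for init_depth, parsimony in [((2, 4), 0.001), ((2, 8), 0.001), ((2, 8), 0.01)]:
    sr = SymbolicRegressor(population_size=500, generations=20,
                           init_depth=init_depth,
                           parsimony_coefficient=parsimony,
                           function_set=('add', 'sub', 'mul', 'div'),
                           random_state=0)
    sr.fit(X_tr, y_tr)
    gap = sr.score(X_tr, y_tr) - sr.score(X_te, y_te)   # train-test R^2 gap
    print(init_depth, parsimony, round(gap, 4))
```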
Results
The results show that limiting the depth and size of trees lowers the combinatorial complexity of the model class and thereby reduces the risk of overfitting. Stability mechanisms such as protected operators and interval arithmetic are likewise reflected in better test performance, validating the theoretical analysis. The experiments further demonstrate that common GP practices, such as parsimony pressure and depth limits, map directly onto theoretical complexity terms, providing a theoretical explanation for empirically observed behaviors.
Applications
The applications of this research include scientific discovery and interpretable modeling, where symbolic regression can provide compact analytical expressions that offer insights in addition to predictive accuracy. Through theoretical analysis, the study lays the foundation for developing more rigorous methods for controlling model complexity, with broad academic and industrial impact.
Limitations & Outlook
Although the study provides a theoretical framework for symbolic regression, the worst-case complexity measures used may be conservative in practice and fail to fully capture the effective complexity of observed data. Additionally, the generalization bounds are data-independent, which may affect their quantitative tightness. Future work needs to develop data-dependent bounds to better capture the effective complexity of symbolic regression models on observed data.
Plain Language (accessible to non-experts)
Imagine you're in a kitchen cooking. You have a lot of ingredients (data) but don't know what dish (model) to make. Symbolic regression is like a smart chef who can discover the perfect recipe (mathematical expression) from these ingredients without knowing in advance what dish to make. Genetic programming is like the chef's assistant, trying different combinations (expression trees) to find the best recipe. Our research is like a cookbook, guiding the chef on how to choose the right combination of ingredients (structure selection) and how to adjust the amount of seasoning (constant fitting) to ensure the dish is not only tasty (fits the training data) but also suitable for more people's tastes (generalizes well). By limiting the complexity of the recipe (tree size and depth) and using stable cooking methods (protected operators and interval arithmetic), the chef can make more stable and delicious dishes.
ELI14 (explained like you're 14)
Hey there! Imagine you're playing a game where the goal is to find a hidden treasure (mathematical expression). You have a super helper called symbolic regression, which can help you find the treasure's location from a bunch of clues (data). This helper has a special tool called genetic programming, which is like a magic key that tries different paths (expression trees) to find the treasure. Our research is like a guidebook that tells you how to choose the right path and how to adjust your tools to make sure you not only find the treasure but also succeed in other levels. By limiting the complexity of the paths (tree size and depth) and using stable tools (protected operators and interval arithmetic), you can find treasures that keep working in levels you've never seen before. Isn't that cool?
Glossary
Symbolic Regression
Symbolic regression is a method for discovering mathematical expressions directly from data, known for its high interpretability.
Used in the paper to discover analytical expressions describing data relationships.
Genetic Programming
Genetic programming is a computational technique based on evolutionary algorithms for automatically generating computer programs.
Used in the paper to explore the space of symbolic structures.
Generalization Bound
A generalization bound is a theoretical upper limit on the generalization gap, the difference between a model's expected error on unseen data and its error on the training data.
Used in the paper to evaluate the generalization capability of models.
Expression Tree
An expression tree is a data structure representing the hierarchical structure of a mathematical expression.
Used in the paper to represent symbolic regression models.
Structure Selection
Structure selection refers to choosing the structure of an expression tree in symbolic regression.
Used in the paper to analyze the combinatorial complexity of models.
Constant Fitting
Constant fitting refers to optimizing numerical constants within an expression tree in symbolic regression.
Used in the paper to analyze the complexity of parameter optimization.
Parsimony Pressure
Parsimony pressure is a strategy in genetic programming to favor smaller models.
Used in the paper to control the structural complexity of models.
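A minimal sketch of parsimony pressure as a size-penalized fitness (the penalty coefficient is an illustrative choice, and other variants such as lexicographic parsimony pressure exist):

```python
# Parsimony pressure: training error plus a penalty per tree node,
# so that smaller expressions can win selection.
def penalized_fitness(mse, tree_size, parsimony_coefficient=0.01):
    """Lower is better."""
    return mse + parsimony_coefficient * tree_size

# A slightly worse-fitting but much smaller tree beats a bloated one:
print(penalized_fitness(0.10, 9), penalized_fitness(0.08, 40))  # 0.19 vs 0.48
```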
Protected Operator
A protected operator is a modified version of a standard mathematical operator that returns a safe fallback value on inputs where the original would be undefined or numerically unstable, such as division by values near zero.
Used in the paper to enhance the numerical stability of models.
Interval Arithmetic
Interval arithmetic is a method of computing with ranges of values rather than single numbers, so that the possible outputs of an expression can be bounded and numerical errors detected.
Used in the paper to enhance the numerical stability of models.
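A minimal interval-arithmetic sketch (our own construction): propagating input ranges through an expression can flag, for example, a possible division by zero before any point is evaluated:

```python
# Intervals are (low, high) pairs; each operation bounds all possible results.
def iadd(a, b):
    return (a[0] + b[0], a[1] + b[1])

def imul(a, b):
    ps = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(ps), max(ps))

def idiv(a, b):
    if b[0] <= 0.0 <= b[1]:
        raise ValueError("denominator interval contains zero")
    ps = [a[0] / b[0], a[0] / b[1], a[1] / b[0], a[1] / b[1]]
    return (min(ps), max(ps))

x = (1.0, 2.0)                                  # input known to lie in [1, 2]
print(idiv(iadd(x, (0.5, 0.5)), imul(x, x)))    # safe: denominator in [1, 4]
```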
Combinatorial Complexity
Combinatorial complexity refers to the number of possible combinations when selecting model structures.
Used in the paper to analyze the complexity of structure selection.
Open Questions (unanswered questions from this research)
- Open Question 1: How can the effective complexity of observed data be better captured in symbolic regression? Current methods rely on worst-case complexity measures, which may be conservative in practice.
- Open Question 2: How can data-dependent generalization bounds be developed to better reflect model performance on actual data?
- Open Question 3: How can structure selection and constant fitting be combined effectively to optimize the generalization performance of symbolic regression models?
- Open Question 4: How can stability mechanisms be better utilized in symbolic regression to enhance numerical stability?
- Open Question 5: How can generalization-aware symbolic regression algorithms be designed to better control model complexity and improve generalization performance?
Applications
Immediate Applications
Scientific Discovery
Symbolic regression can be used to discover underlying mechanisms in scientific data, providing interpretable mathematical expressions that help scientists understand complex phenomena.
Interpretable Modeling
In fields requiring high interpretability, such as finance and healthcare, symbolic regression can provide transparent models that help decision-makers understand the reasoning process.
Engineering Optimization
Symbolic regression can be used in engineering optimization problems, helping engineers improve design and performance by discovering analytical expressions of systems.
Long-term Vision
Automated Scientific Research
Symbolic regression has the potential to become an important tool in automated scientific research, accelerating the process of scientific discovery by automatically uncovering patterns in data.
Intelligent Decision Systems
Combining the interpretability and predictive power of symbolic regression, more intelligent and transparent decision systems can be developed in the future, with wide applications across industries.
Abstract
Symbolic regression (SR) with genetic programming (GP) aims to discover interpretable mathematical expressions directly from data. Despite its strong empirical success, the theoretical understanding of why GP-based SR generalizes beyond the training data remains limited. In this work, we provide a learning-theoretic analysis of SR models represented as expression trees. We derive a generalization bound for GP-style SR under constraints on tree size, depth, and learnable constants. Our result decomposes the generalization gap into two interpretable components: a structure-selection term, reflecting the combinatorial complexity of choosing an expression-tree structure, and a constant-fitting term, capturing the complexity of optimizing numerical constants within a fixed structure. This decomposition provides a theoretical perspective on several widely used practices in GP, including parsimony pressure, depth limits, numerically stable operators, and interval arithmetic. In particular, our analysis shows how structural restrictions reduce hypothesis-class growth while stability mechanisms control the sensitivity of predictions to parameter perturbations. By linking these practical design choices to explicit complexity terms in the generalization bound, our work offers a principled explanation for commonly observed empirical behaviors in GP-based SR and contributes towards a more rigorous understanding of its generalization properties.