HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification
HorizonMath evaluates AI progress in mathematical discovery using an automated verification framework, with GPT 5.4 Pro proposing solutions that improve on the best-known published results for two problems (pending expert review).
Key Findings
Methodology
HorizonMath offers a benchmark of over 100 predominantly unsolved problems across 8 domains in computational and applied mathematics. The framework employs high-precision numerical comparison and deterministic constraint-checkers to automatically verify the correctness of candidate solutions. This approach leverages the generator-verifier gap, where discovery is hard but verification is simple, providing a scalable evaluation platform for AI progress in mathematical discovery.
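To make the numerical half of this pipeline concrete, here is a minimal sketch of how a high-precision check might work. The library choice (mpmath), the 60-digit working precision, the 50-digit tolerance, and the toy Gaussian-integral problem are illustrative assumptions, not the benchmark's actual harness:

```python
# Hypothetical sketch of HorizonMath-style numerical verification.
# mpmath, the precision settings, and the toy problem are illustrative
# assumptions, not the benchmark's actual harness.
from mpmath import mp, mpf, exp, sqrt, pi, quad, inf

mp.dps = 60  # work at 60 significant digits

# "Problem": a quantity defined only numerically, here a Gaussian integral.
reference = quad(lambda x: exp(-x**2), [0, inf])

def verify(candidate, reference, tol_digits=50):
    """Accept the candidate iff it matches the reference to tol_digits digits."""
    return abs(candidate - reference) < mpf(10) ** (-tol_digits)

# Candidate closed form proposed by a model: sqrt(pi) / 2.
candidate = sqrt(pi) / 2
print(verify(candidate, reference))  # True: agreement to ~50 digits
```

This captures the asymmetry the benchmark exploits: producing the closed form sqrt(pi)/2 requires insight, while checking it against the numerical reference takes milliseconds.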
Key Results
- For two optimization problems, GPT 5.4 Pro proposed solutions that improve upon the best-known published results, representing potential novel contributions (pending expert review). These problems ask for an explicit construction that surpasses a published result, and automated verification confirmed that GPT 5.4 Pro's constructions outperform the previously published human optimizations in certain settings.
- Of the benchmark's 101 problems, 10 are rated solvability 0; GPT 5.4 Pro solved 5 of these, while Gemini 3.1 Pro and Opus 4.6 solved only 3, giving GPT 5.4 Pro the strongest showing among the frontier models evaluated.
- Because the problems' solutions are unknown, they cannot appear in any training corpus, making HorizonMath immune to data contamination. Most state-of-the-art models score near 0% on the benchmark, underscoring its difficulty.
Significance
The introduction of HorizonMath provides a standardized evaluation tool for AI progress in mathematical discovery, addressing the scalability issues of existing benchmarks that rely on formal proof verification or manual review. By automating verification, HorizonMath not only reduces evaluation costs but also enhances objectivity and speed. Its open and scalable framework serves as a growing community resource, advancing AI's autonomous capabilities in mathematical research.
Technical Contribution
HorizonMath significantly differs from existing mathematical benchmarks by introducing an automated verification framework. Its design leverages the generator-verifier gap, making the verification process fast and free from human intervention. Additionally, HorizonMath's open-source and modular problem format facilitates community contributions and feedback, promoting AI's autonomy in mathematical discovery.
Novelty
HorizonMath is the first automated verification benchmark focused on unsolved mathematical problems. Its design not only avoids data contamination but also provides a fast, objective correctness signal through high-precision numerical comparison and deterministic constraint-checking. This innovation gives HorizonMath a unique advantage in evaluating AI's mathematical discovery capabilities.
Limitations
- While matching a high-precision numerical reference provides strong evidence, it does not formally prove the exact correctness of a closed-form expression; such solutions are best regarded as strong conjectures until proven (see the illustrative example after this list).
- The compliance checker may occasionally accept solutions exploiting subtle loopholes or reject valid ones using unusual but legitimate constructions.
- Current frontier models score near 0% on HorizonMath, indicating that the benchmark's challenge may exceed the capabilities of existing AI systems.
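A classic near-coincidence makes the first limitation vivid. The example below is illustrative and not from the paper: e^(pi*sqrt(163)) agrees with an integer to roughly 12 decimal places before diverging, so a verifier with too loose a tolerance would wrongly accept "nearest integer" as its closed form.

```python
# Why numerical agreement is evidence, not proof: e^(pi*sqrt(163)) is
# famously close to an integer. (Illustrative example, not from the paper.)
from mpmath import mp, exp, pi, sqrt, nint

mp.dps = 40
almost_integer = exp(pi * sqrt(163))
nearest = nint(almost_integer)
print(abs(almost_integer - nearest))  # ~7.5e-13: close, but not zero
```

High-precision comparison guards against false positives like this one, but even agreement to 50 digits remains evidence rather than formal proof.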
Future Work
Future research directions include expanding the benchmark to accept simplifications that are not necessarily exact closed forms according to the current definition. This flexibility would help capture a broader spectrum of research, especially in fields like physics, where the goal is not always an exact closed-form solution. As AI capabilities advance, HorizonMath will provide a concrete and reproducible signal of progress toward autonomous mathematical research.
AI Executive Summary
HorizonMath is an innovative benchmark designed to evaluate AI progress in mathematical discovery. Traditional mathematical benchmarks often rely on formal proof verification or manual review, which are costly and difficult to scale. HorizonMath addresses this issue by introducing an automated verification framework. Its design leverages the generator-verifier gap, where candidate solutions are hard to produce but easy to verify, enabling fast and objective evaluation.
HorizonMath comprises over 100 predominantly unsolved problems across 8 domains in computational and applied mathematics. Each problem is chosen so that its solution is unknown, and therefore absent from any training corpus, preventing data contamination. By employing high-precision numerical comparison and deterministic constraint-checkers, HorizonMath can automatically verify the correctness of candidate solutions. Its open and scalable framework serves as a growing community resource, advancing AI's autonomous capabilities in mathematical research.
In experiments, GPT 5.4 Pro proposed solutions that improve upon the best-known published results for two optimization problems, representing potential novel contributions (pending expert review). These problems ask for an explicit construction that surpasses a published result, and automated verification confirmed that GPT 5.4 Pro's constructions outperform the previously published human optimizations in certain settings. Additionally, of the benchmark's 101 problems, 10 are rated solvability 0; GPT 5.4 Pro solved 5 of these, while Gemini 3.1 Pro and Opus 4.6 solved only 3, giving GPT 5.4 Pro the strongest showing among the frontier models evaluated.
However, HorizonMath also has limitations. While matching a high-precision numerical reference provides strong evidence, it does not formally prove the exact correctness of a closed-form expression. Such solutions are best regarded as strong conjectures until proven. Additionally, the compliance checker may occasionally accept solutions exploiting subtle loopholes or reject valid ones using unusual but legitimate constructions.
Overall, HorizonMath provides a standardized evaluation tool for AI progress in mathematical discovery, addressing the scalability issues of existing benchmarks that rely on formal proof verification or manual review. By automating verification, HorizonMath not only reduces evaluation costs but also enhances objectivity and speed. Future research directions include expanding the benchmark to accept simplifications that are not necessarily exact closed forms according to the current definition. This flexibility would help capture a broader spectrum of research, especially in fields like physics, where the goal is not always an exact closed-form solution. As AI capabilities advance, HorizonMath will provide a concrete and reproducible signal of progress toward autonomous mathematical research.
Deep Analysis
Background
In the field of artificial intelligence, autonomous mathematical discovery has been a significant research direction. Recently, with the rapid development of large language models, AI's capabilities in mathematical and scientific reasoning have significantly improved. However, whether AI can perform novel research remains a widely debated and underexplored question. Existing mathematical benchmarks, such as GSM8K and MATH, primarily assess AI's performance on known problems and are nearing saturation. Even more challenging benchmarks, like IMO-Bench and PutnamBench, only evaluate problems with known solutions, thus providing minimal signal on whether an AI system can produce novel mathematical results. To fill this gap, HorizonMath was introduced to evaluate AI progress in mathematical discovery through an automated verification framework.
Core Problem
The core problem HorizonMath addresses is how to evaluate AI's performance on unsolved mathematical problems. Solutions to these problems are hard to discover, requiring meaningful mathematical insight, yet computationally cheap and simple to verify. Since the solutions are unknown, HorizonMath is immune to data contamination, and most state-of-the-art models score near 0% on the benchmark. The importance of this problem lies in its challenge to AI's mathematical reasoning capabilities and in providing a standardized evaluation tool for AI's autonomous capabilities in mathematical research.
Innovation
HorizonMath's core innovations lie in its automated verification framework and problem design. First, HorizonMath automatically verifies the correctness of candidate solutions through high-precision numerical comparison and deterministic constraint-checking. Second, HorizonMath's design leverages the generator-verifier gap, making the verification process fast and free from human intervention. Finally, HorizonMath's open-source and modular problem format facilitates community contributions and feedback, promoting AI's autonomy in mathematical discovery.
Methodology
HorizonMath's methodology includes the following key steps:
- Problem Selection: Identify candidate problems from the mathematical literature, ensuring their solutions are unknown in the training corpus.
- Automated Verification: Verify the correctness of candidate solutions through high-precision numerical comparison and deterministic constraint-checking (a toy constraint-checker is sketched after this list).
- Generator-Verifier Gap: Leverage the generator-verifier gap, making the verification process fast and free from human intervention.
- Open Source: Provide an open-source, modular problem format to facilitate community contributions and feedback.
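For construction-style problems, the deterministic constraint-checking step can be pictured as follows. The task ("place n points in the unit square with minimum pairwise distance beating a published bound"), the bound, and the function names are toy assumptions for illustration, not problems or code from the benchmark:

```python
# Hypothetical deterministic constraint-checker for a construction-style
# problem. The task and the "published bound" are toy assumptions.
from itertools import combinations
from math import dist

PUBLISHED_BOUND = 0.50  # illustrative best-known value, not a real record

def check_construction(points, n_required, bound=PUBLISHED_BOUND):
    """Deterministically verify every constraint of a candidate construction."""
    if len(points) != n_required:
        return False, "wrong number of points"
    if not all(0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 for x, y in points):
        return False, "point outside the unit square"
    min_dist = min(dist(p, q) for p, q in combinations(points, 2))
    if min_dist <= bound:
        return False, f"min distance {min_dist:.6f} does not beat {bound}"
    return True, f"valid: min distance {min_dist:.6f} beats {bound}"

# A candidate construction: the four corners plus the center (n = 5).
candidate = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5)]
print(check_construction(candidate, n_required=5))
```

Because every constraint is checked exhaustively and deterministically, a run either certifies the construction or reports exactly which constraint failed, with no human in the loop.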
Experiments
In the experimental design, HorizonMath comprises over 100 predominantly unsolved problems across 8 domains in computational and applied mathematics. Each problem is chosen so that its solution is unknown in the training corpus, preventing data contamination. By employing high-precision numerical comparison and deterministic constraint-checkers, HorizonMath automatically verifies the correctness of candidate solutions. The models evaluated include GPT 5.4 Pro, Gemini 3.1 Pro, and Opus 4.6.
Results
The experimental results show that GPT 5.4 Pro proposed solutions that improve upon the best-known published results for two optimization problems, representing potential novel contributions (pending expert review). Additionally, of the benchmark's 101 problems, 10 are rated solvability 0; GPT 5.4 Pro solved 5 of these, while Gemini 3.1 Pro and Opus 4.6 solved only 3, giving GPT 5.4 Pro the strongest showing among the frontier models evaluated.
Applications
HorizonMath's application scenarios include providing a standardized evaluation tool for AI progress in mathematical discovery. By automating verification, HorizonMath not only reduces evaluation costs but also enhances objectivity and speed. Its open and scalable framework serves as a growing community resource, advancing AI's autonomous capabilities in mathematical research.
Limitations & Outlook
HorizonMath's limitations include that while matching a high-precision numerical reference provides strong evidence, it does not formally prove the exact correctness of a closed-form expression. Additionally, the compliance checker may occasionally accept solutions exploiting subtle loopholes or reject valid ones using unusual but legitimate constructions. Current frontier models score near 0% on HorizonMath, indicating that the benchmark's challenge may exceed the capabilities of existing AI systems.
Plain Language (Accessible to non-experts)
Imagine you're in a giant library filled with countless books and problems. Each book represents a mathematical problem, and your task is to find the unsolved puzzles. HorizonMath is like a smart librarian that not only helps you find these puzzles but also checks whether a proposed answer actually solves them. It uses something called automated verification, like a smart answer checker.
Think of it as playing a complex jigsaw puzzle game. Each puzzle piece is a mathematical problem, and you need to find the right pieces to complete the picture. HorizonMath is like a smart puzzle assistant that helps you quickly find the right pieces and ensures they all fit perfectly.
In this process, HorizonMath relies on a trick called the generator-verifier gap. Finding the right puzzle piece can take a long, hard search, but checking whether a piece actually fits takes only a moment. That asymmetry is what makes the whole puzzle-solving process fast and efficient.
In summary, HorizonMath is like a smart assistant that helps you find unsolved puzzles in the ocean of mathematics and quickly verifies if your answers are correct. This makes mathematical research more efficient and fun.
ELI14 (Explained like you're 14)
Hey there! Did you know there are tons of math problems out there that haven't been solved yet? They're like unsolved mysteries! And HorizonMath is a super cool tool that helps AI solve these mysteries!
Imagine you're playing a super complex puzzle game, and each puzzle piece is a math problem. HorizonMath is like a smart puzzle assistant that helps you find the right pieces quickly and makes sure they all fit perfectly.
How does it do that? HorizonMath uses something called automated verification to check if the answers are correct, like a smart answer checker. That way, we know which answers are right and which ones need more work.
So next time you're in math class and come across a tough problem, think of HorizonMath. It's like AI's best buddy, helping solve those super tricky math problems! Isn't that cool?
Glossary
Automated Verification
Automated verification is a method of automatically checking the correctness of mathematical problem answers using computer programs. It verifies candidate solutions through high-precision numerical comparison and deterministic constraint-checking.
Used in HorizonMath to verify AI-generated mathematical problem solutions.
Generator-Verifier Gap
The generator-verifier gap refers to the characteristic where candidate solutions are hard to generate but easy to verify. This gap makes the verification process fast and free from human intervention.
HorizonMath leverages this gap to design problems, making the verification process more efficient.
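To make the gap concrete, here is a toy illustration (not from the paper): in subset-sum, finding a certificate may require searching exponentially many subsets, while checking a proposed certificate is a single linear pass.

```python
# Toy illustration of a generator-verifier gap (subset-sum): finding a
# certificate can take exponential search; checking one is cheap.
from itertools import combinations

def generate(numbers, target):
    """Hard direction: exhaustive search over all 2^n subsets."""
    for r in range(1, len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == target:
                return subset
    return None

def verify(subset, numbers, target):
    """Easy direction: one pass to check the proposed certificate."""
    return set(subset) <= set(numbers) and sum(subset) == target

nums = [3, 34, 4, 12, 5, 2]
cert = generate(nums, 9)            # finds (4, 5) after a search
print(cert, verify(cert, nums, 9))  # (4, 5) True
```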
Numerical Comparison
Numerical comparison is a method of verifying the correctness of mathematical expressions by comparing computed results with high-precision reference values.
Used in HorizonMath to verify the correctness of candidate solutions.
Deterministic Constraint Checking
Deterministic constraint checking is a method of verifying the correctness of candidate solutions by checking if they satisfy all required properties.
Used to verify construction problems in HorizonMath.
Data Contamination
Data contamination refers to the situation where test data is included in the training dataset, potentially leading to unrealistically high performance during testing.
HorizonMath avoids data contamination by designing problems with unknown solutions.
Optimization Problem
An optimization problem involves finding the best solution with respect to an objective function, often by constructing an object that surpasses published results.
Used in HorizonMath to evaluate AI's capabilities in mathematical discovery.
Closed-form Expression
A closed-form expression is a mathematical expression that can be represented using a finite number of standard mathematical operators and functions.
HorizonMath requires candidate solutions to be closed-form expressions for verification.
Modular Problem Format
A modular problem format is a way of designing problems that allows them to be independently verified and expanded.
HorizonMath uses a modular problem format to facilitate community contributions.
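As a sketch of what "modular" could mean in practice, each entry might bundle its statement with its own verifier, so problems can be added and checked independently. The field names below are assumptions for illustration, not HorizonMath's actual schema:

```python
# Hypothetical sketch of a modular problem entry; field names are
# illustrative assumptions, not HorizonMath's actual schema.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Problem:
    problem_id: str                  # stable identifier, e.g. "opt-017"
    domain: str                      # one of the benchmark's 8 domains
    statement: str                   # human-readable problem statement
    verifier: Callable[[Any], bool]  # deterministic checker for candidates

def evaluate(problem: Problem, candidate: Any) -> bool:
    """Each problem ships its own verifier, so entries stay independent."""
    return problem.verifier(candidate)

# Toy entry: "find x > 0 with x^2 = 2", verified numerically.
toy = Problem("toy-001", "analysis", "Find x > 0 with x^2 = 2.",
              lambda x: x > 0 and abs(x * x - 2) < 1e-12)
print(evaluate(toy, 2 ** 0.5))  # True
```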
High-Precision Numerical Reference
A high-precision numerical reference is a computed result used to verify the correctness of candidate solutions.
HorizonMath verifies candidate solutions by comparing them to high-precision numerical references.
Community Resource
A community resource is a tool or platform open for use and contribution by the research community.
HorizonMath, as an open benchmark, welcomes community contributions of new problems.
Open Questions (Unanswered questions from this research)
1. How can we verify the correctness of closed-form expressions without relying on high-precision numerical comparison? Current methods provide strong evidence but do not formally prove exact correctness.
2. How can the compliance checker be made more accurate, so it neither accepts solutions that exploit subtle loopholes nor rejects valid solutions built from unusual but legitimate constructions?
3. How can problem design better differentiate AI capabilities in mathematical discovery? Current frontier models score near 0% on HorizonMath, suggesting its difficulty may exceed the capabilities of existing systems.
4. How can HorizonMath scale without increasing verification costs? Existing mathematical benchmarks often rely on formal proof verification or manual review, both of which are costly and difficult to scale.
5. How can HorizonMath's evaluation precision be improved without sacrificing verification speed? Automated verification enhances objectivity and speed, but may not provide sufficient precision in every case.
Applications
Immediate Applications
Mathematical Research Evaluation
HorizonMath can serve as an evaluation tool in mathematical research, helping researchers quickly verify the correctness of their solutions.
AI Model Testing
HorizonMath can be used to test AI models' capabilities in mathematical discovery, helping developers identify strengths and weaknesses.
Educational Tool
HorizonMath can serve as an educational tool, helping students understand the complexity of mathematical problems and solutions.
Long-term Vision
Autonomous Mathematical Research
As AI capabilities advance, HorizonMath is expected to promote AI's autonomy in mathematical research, becoming a major driver of mathematical discovery.
Cross-Disciplinary Applications
HorizonMath's automated verification framework can be extended to other disciplines, promoting AI applications in scientific research.
Abstract
Can AI make progress on important, unsolved mathematical problems? Large language models are now capable of sophisticated mathematical and scientific reasoning, but whether they can perform novel research is still widely debated and underexplored. We introduce HorizonMath, a benchmark of over 100 predominantly unsolved problems spanning 8 domains in computational and applied mathematics, paired with an open-source evaluation framework for automated verification. Our benchmark targets a class of problems where discovery is hard, requiring meaningful mathematical insight, but verification is computationally efficient and simple. Because these solutions are unknown, HorizonMath is immune to data contamination, and most state-of-the-art models score near 0%. Existing research-level benchmarks instead rely on formal proof verification or manual review, both of which are expensive to scale. Using this platform, we find two problems for which GPT 5.4 Pro proposes solutions that improve on the best-known published results, representing potential novel contributions (pending expert review). We release HorizonMath as an open challenge and a growing community resource, where correct solutions to problems in the unsolved problem classes could constitute novel results in the mathematical literature.
References
UQ: Assessing Language Models on Unsolved Questions
Fan Nie, Ken Ziyu Liu, Zihao Wang et al.
Resolution of Erdős Problem #728: a writeup of Aristotle's Lean proof
Nat Sothanaphan
AlphaEvolve: A coding agent for scientific and algorithmic discovery
Alexander Novikov, Ngân Vũ, Marvin Eisenberger et al.
Accelerating Scientific Research with Gemini: Case Studies and Common Techniques
David P. Woodruff, Vincent Cohen-Addad, Lalit Jain et al.
Closed Forms: What They Are and Why We Care
J. Borwein, R. Crandall
Semi-Autonomous Mathematics Discovery with Gemini: A Case Study on the Erdős Problems
Tony Feng, Trieu H. Trinh, G. Bingham et al.
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Elliott S. Glazer, Ege Erdil, T. Besiroglu et al.
Training Verifiers to Solve Math Word Problems
K. Cobbe, Vineet Kosaraju, Mo Bavarian et al.
Mathematical discoveries from program search with large language models
Bernardino Romera-Paredes, M. Barekatain, Alexander Novikov et al.
OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems
Chaoqun He, Renjie Luo, Yuzhuo Bai et al.
Towards Robust Mathematical Reasoning
Thang Luong, Dawsen Hwang, Hoang Nguyen et al.
HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class
J. Roggeveen, Erik Y. Wang, Will Flintoft et al.
Single-minus gluon tree amplitudes are nonzero
A. Guevara, A. Lupsasca, David Skinner et al.
Mathematical exploration and discovery at scale
Bogdan Georgiev, Javier G'omez-Serrano, Terence Tao et al.
Learning to Discover at Test Time
Mert Yuksekgonul, Daniel Koceja, Xinhao Li et al.
PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition
G. Tsoukalas, Jasper Lee, J. Jennings et al.
Theory and computation of spheroidal wavefunctions
P. Falloon, P. Abbott, J. B. Wang
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath et al.