HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification
HorizonMath evaluates AI progress in mathematical discovery using an automated verification framework, with GPT 5.4 Pro proposing solutions that improve on the best-known published results for two problems (pending expert review).
Key Findings
Methodology
HorizonMath offers a benchmark of over 100 predominantly unsolved problems across 8 domains in computational and applied mathematics. The framework employs high-precision numerical comparison and deterministic constraint-checkers to automatically verify the correctness of candidate solutions. This approach leverages the generator-verifier gap, where discovery is hard but verification is simple, providing a scalable evaluation platform for AI progress in mathematical discovery.
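To make the numerical half of this pipeline concrete, here is a minimal sketch of how a high-precision check might work. The library choice (mpmath), the 60-digit working precision, the 50-digit tolerance, and the toy Gaussian-integral problem are illustrative assumptions, not the benchmark's actual harness:

```python
# Hypothetical sketch of HorizonMath-style numerical verification.
# mpmath, the precision settings, and the toy problem are illustrative
# assumptions, not the benchmark's actual harness.
from mpmath import mp, mpf, exp, sqrt, pi, quad, inf

mp.dps = 60  # work at 60 significant digits

# "Problem": a quantity defined only numerically, here a Gaussian integral.
reference = quad(lambda x: exp(-x**2), [0, inf])

def verify(candidate, reference, tol_digits=50):
    """Accept the candidate iff it matches the reference to tol_digits digits."""
    return abs(candidate - reference) < mpf(10) ** (-tol_digits)

# Candidate closed form proposed by a model: sqrt(pi) / 2.
candidate = sqrt(pi) / 2
print(verify(candidate, reference))  # True: agreement to ~50 digits
```

This captures the asymmetry the benchmark exploits: producing the closed form sqrt(pi)/2 requires insight, while checking it against the numerical reference takes milliseconds.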
Key Results
- For two optimization problems, GPT 5.4 Pro proposed solutions that improve upon the best-known published results, representing potential novel contributions (pending expert review). These problems ask for an explicit construction that surpasses a published result, and automated verification confirmed that GPT 5.4 Pro's constructions outperform the previously published human optimizations in certain settings.
- Of the benchmark's 101 problems, 10 are rated solvability 0; GPT 5.4 Pro solved 5 of these, while Gemini 3.1 Pro and Opus 4.6 solved only 3, giving GPT 5.4 Pro the strongest showing among the frontier models evaluated.
- Because the problems' solutions are unknown, they cannot appear in any training corpus, making HorizonMath immune to data contamination. Most state-of-the-art models score near 0% on the benchmark, underscoring its difficulty.
Significance
The introduction of HorizonMath provides a standardized evaluation tool for AI progress in mathematical discovery, addressing the scalability issues of existing benchmarks that rely on formal proof verification or manual review. By automating verification, HorizonMath not only reduces evaluation costs but also enhances objectivity and speed. Its open and scalable framework serves as a growing community resource, advancing AI's autonomous capabilities in mathematical research.
Technical Contribution
HorizonMath significantly differs from existing mathematical benchmarks by introducing an automated verification framework. Its design leverages the generator-verifier gap, making the verification process fast and free from human intervention. Additionally, HorizonMath's open-source and modular problem format facilitates community contributions and feedback, promoting AI's autonomy in mathematical discovery.
Novelty
HorizonMath is the first automated verification benchmark focused on unsolved mathematical problems. Its design not only avoids data contamination but also provides a fast, objective correctness signal through high-precision numerical comparison and deterministic constraint-checking. This innovation gives HorizonMath a unique advantage in evaluating AI's mathematical discovery capabilities.
Limitations
- While matching a high-precision numerical reference provides strong evidence, it does not formally prove the exact correctness of a closed-form expression; such solutions are best regarded as strong conjectures until proven (see the illustrative example after this list).
- The compliance checker may occasionally accept solutions exploiting subtle loopholes or reject valid ones using unusual but legitimate constructions.
- Current frontier models score near 0% on HorizonMath, indicating that the benchmark's challenge may exceed the capabilities of existing AI systems.
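A classic near-coincidence makes the first limitation vivid. The example below is illustrative and not from the paper: e^(pi*sqrt(163)) agrees with an integer to roughly 12 decimal places before diverging, so a verifier with too loose a tolerance would wrongly accept "nearest integer" as its closed form.

```python
# Why numerical agreement is evidence, not proof: e^(pi*sqrt(163)) is
# famously close to an integer. (Illustrative example, not from the paper.)
from mpmath import mp, exp, pi, sqrt, nint

mp.dps = 40
almost_integer = exp(pi * sqrt(163))
nearest = nint(almost_integer)
print(abs(almost_integer - nearest))  # ~7.5e-13: close, but not zero
```

High-precision comparison guards against false positives like this one, but even agreement to 50 digits remains evidence rather than formal proof.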
Future Work
Future research directions include expanding the benchmark to accept simplifications that are not necessarily exact closed forms according to the current definition. This flexibility would help capture a broader spectrum of research, especially in fields like physics, where the goal is not always an exact closed-form solution. As AI capabilities advance, HorizonMath will provide a concrete and reproducible signal of progress toward autonomous mathematical research.
AI Executive Summary
HorizonMath is an innovative benchmark designed to evaluate AI progress in mathematical discovery. Traditional mathematical benchmarks often rely on formal proof verification or manual review, which are costly and difficult to scale. HorizonMath addresses this issue by introducing an automated verification framework. Its design leverages the generator-verifier gap, where candidate solutions are hard to produce but easy to verify, enabling fast and objective evaluation.
HorizonMath comprises over 100 predominantly unsolved problems across 8 domains in computational and applied mathematics. Each problem is chosen so that its solution is unknown, and therefore absent from any training corpus, preventing data contamination. By employing high-precision numerical comparison and deterministic constraint-checkers, HorizonMath can automatically verify the correctness of candidate solutions. Its open and scalable framework serves as a growing community resource, advancing AI's autonomous capabilities in mathematical research.
In experiments, GPT 5.4 Pro proposed solutions that improve upon the best-known published results for two optimization problems, representing potential novel contributions (pending expert review). These problems ask for an explicit construction that surpasses a published result, and automated verification confirmed that GPT 5.4 Pro's constructions outperform the previously published human optimizations in certain settings. Additionally, of the benchmark's 101 problems, 10 are rated solvability 0; GPT 5.4 Pro solved 5 of these, while Gemini 3.1 Pro and Opus 4.6 solved only 3, giving GPT 5.4 Pro the strongest showing among the frontier models evaluated.
However, HorizonMath also has limitations. While matching a high-precision numerical reference provides strong evidence, it does not formally prove the exact correctness of a closed-form expression. Such solutions are best regarded as strong conjectures until proven. Additionally, the compliance checker may occasionally accept solutions exploiting subtle loopholes or reject valid ones using unusual but legitimate constructions.
Overall, HorizonMath provides a standardized evaluation tool for AI progress in mathematical discovery, addressing the scalability issues of existing benchmarks that rely on formal proof verification or manual review. By automating verification, HorizonMath not only reduces evaluation costs but also enhances objectivity and speed. Future research directions include expanding the benchmark to accept simplifications that are not necessarily exact closed forms according to the current definition. This flexibility would help capture a broader spectrum of research, especially in fields like physics, where the goal is not always an exact closed-form solution. As AI capabilities advance, HorizonMath will provide a concrete and reproducible signal of progress toward autonomous mathematical research.
Deep Analysis
Background
In the field of artificial intelligence, autonomous mathematical discovery has been a significant research direction. Recently, with the rapid development of large language models, AI's capabilities in mathematical and scientific reasoning have significantly improved. However, whether AI can perform novel research remains a widely debated and underexplored question. Existing mathematical benchmarks, such as GSM8K and MATH, primarily assess AI's performance on known problems and are nearing saturation. Even more challenging benchmarks, like IMO-Bench and PutnamBench, only evaluate problems with known solutions, thus providing minimal signal on whether an AI system can produce novel mathematical results. To fill this gap, HorizonMath was introduced to evaluate AI progress in mathematical discovery through an automated verification framework.
Core Problem
The core problem HorizonMath addresses is how to evaluate AI's performance on unsolved mathematical problems. Solutions to these problems are hard to discover, requiring meaningful mathematical insight, yet computationally cheap and simple to verify. Since the solutions are unknown, HorizonMath is immune to data contamination, and most state-of-the-art models score near 0% on the benchmark. The importance of this problem lies in its challenge to AI's mathematical reasoning capabilities and in providing a standardized evaluation tool for AI's autonomous capabilities in mathematical research.
Innovation
HorizonMath's core innovations lie in its automated verification framework and problem design. First, HorizonMath automatically verifies the correctness of candidate solutions through high-precision numerical comparison and deterministic constraint-checking. Second, HorizonMath's design leverages the generator-verifier gap, making the verification process fast and free from human intervention. Finally, HorizonMath's open-source and modular problem format facilitates community contributions and feedback, promoting AI's autonomy in mathematical discovery.
Methodology
HorizonMath's methodology includes the following key steps:
- Problem Selection: Identify candidate problems from the mathematical literature, ensuring their solutions are unknown in the training corpus.
- Automated Verification: Verify the correctness of candidate solutions through high-precision numerical comparison and deterministic constraint-checking (a toy constraint-checker is sketched after this list).
- Generator-Verifier Gap: Leverage the generator-verifier gap, making the verification process fast and free from human intervention.
- Open Source: Provide an open-source, modular problem format to facilitate community contributions and feedback.
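For construction-style problems, the deterministic constraint-checking step can be pictured as follows. The task ("place n points in the unit square with minimum pairwise distance beating a published bound"), the bound, and the function names are toy assumptions for illustration, not problems or code from the benchmark:

```python
# Hypothetical deterministic constraint-checker for a construction-style
# problem. The task and the "published bound" are toy assumptions.
from itertools import combinations
from math import dist

PUBLISHED_BOUND = 0.50  # illustrative best-known value, not a real record

def check_construction(points, n_required, bound=PUBLISHED_BOUND):
    """Deterministically verify every constraint of a candidate construction."""
    if len(points) != n_required:
        return False, "wrong number of points"
    if not all(0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 for x, y in points):
        return False, "point outside the unit square"
    min_dist = min(dist(p, q) for p, q in combinations(points, 2))
    if min_dist <= bound:
        return False, f"min distance {min_dist:.6f} does not beat {bound}"
    return True, f"valid: min distance {min_dist:.6f} beats {bound}"

# A candidate construction: the four corners plus the center (n = 5).
candidate = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5)]
print(check_construction(candidate, n_required=5))
```

Because every constraint is checked exhaustively and deterministically, a run either certifies the construction or reports exactly which constraint failed, with no human in the loop.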
Experiments
In the experimental design, HorizonMath comprises over 100 predominantly unsolved problems across 8 domains in computational and applied mathematics. Each problem is chosen so that its solution is unknown in the training corpus, preventing data contamination. By employing high-precision numerical comparison and deterministic constraint-checkers, HorizonMath automatically verifies the correctness of candidate solutions. The models evaluated include GPT 5.4 Pro, Gemini 3.1 Pro, and Opus 4.6.
Results
The experimental results show that GPT 5.4 Pro proposed solutions that improve upon the best-known published results for two optimization problems, representing potential novel contributions (pending expert review). Additionally, of the benchmark's 101 problems, 10 are rated solvability 0; GPT 5.4 Pro solved 5 of these, while Gemini 3.1 Pro and Opus 4.6 solved only 3, giving GPT 5.4 Pro the strongest showing among the frontier models evaluated.
Applications
HorizonMath's application scenarios include providing a standardized evaluation tool for AI progress in mathematical discovery. By automating verification, HorizonMath not only reduces evaluation costs but also enhances objectivity and speed. Its open and scalable framework serves as a growing community resource, advancing AI's autonomous capabilities in mathematical research.
Limitations & Outlook
HorizonMath's limitations include that while matching a high-precision numerical reference provides strong evidence, it does not formally prove the exact correctness of a closed-form expression. Additionally, the compliance checker may occasionally accept solutions exploiting subtle loopholes or reject valid ones using unusual but legitimate constructions. Current frontier models score near 0% on HorizonMath, indicating that the benchmark's challenge may exceed the capabilities of existing AI systems.
Plain Language (Accessible to non-experts)
Imagine you're in a giant library filled with countless books and problems. Each book represents a mathematical problem, and your task is to find the unsolved puzzles. HorizonMath is like a smart librarian that not only helps you find these puzzles but also checks whether a proposed answer actually solves them. It uses something called automated verification, like a smart answer checker.
Think of it as playing a complex jigsaw puzzle game. Each puzzle piece is a mathematical problem, and you need to find the right pieces to complete the picture. HorizonMath is like a smart puzzle assistant that helps you quickly find the right pieces and ensures they all fit perfectly.
In this process, HorizonMath relies on a trick called the generator-verifier gap. Finding the right puzzle piece can take a long, hard search, but checking whether a piece actually fits takes only a moment. That asymmetry is what makes the whole puzzle-solving process fast and efficient.
In summary, HorizonMath is like a smart assistant that helps you find unsolved puzzles in the ocean of mathematics and quickly verifies if your answers are correct. This makes mathematical research more efficient and fun.
ELI14 (Explained like you're 14)
Hey there! Did you know there are tons of math problems out there that haven't been solved yet? They're like unsolved mysteries! And HorizonMath is a super cool tool that helps AI solve these mysteries!
Imagine you're playing a super complex puzzle game, and each puzzle piece is a math problem. HorizonMath is like a smart puzzle assistant that helps you find the right pieces quickly and makes sure they all fit perfectly.
How does it do that? HorizonMath uses something called automated verification to check if the answers are correct, like a smart answer checker. That way, we know which answers are right and which ones need more work.
So next time you're in math class and come across a tough problem, think of HorizonMath. It's like AI's best buddy, helping solve those super tricky math problems! Isn't that cool?
Glossary
Automated Verification
Automated verification is a method of automatically checking the correctness of mathematical problem answers using computer programs. It verifies candidate solutions through high-precision numerical comparison and deterministic constraint-checking.
Used in HorizonMath to verify AI-generated mathematical problem solutions.
Generator-Verifier Gap
The generator-verifier gap refers to the characteristic where candidate solutions are hard to generate but easy to verify. This gap makes the verification process fast and free from human intervention.
HorizonMath leverages this gap to design problems, making the verification process more efficient.
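To make the gap concrete, here is a toy illustration (not from the paper): in subset-sum, finding a certificate may require searching exponentially many subsets, while checking a proposed certificate is a single linear pass.

```python
# Toy illustration of a generator-verifier gap (subset-sum): finding a
# certificate can take exponential search; checking one is cheap.
from itertools import combinations

def generate(numbers, target):
    """Hard direction: exhaustive search over all 2^n subsets."""
    for r in range(1, len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == target:
                return subset
    return None

def verify(subset, numbers, target):
    """Easy direction: one pass to check the proposed certificate."""
    return set(subset) <= set(numbers) and sum(subset) == target

nums = [3, 34, 4, 12, 5, 2]
cert = generate(nums, 9)            # finds (4, 5) after a search
print(cert, verify(cert, nums, 9))  # (4, 5) True
```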
Numerical Comparison
Numerical comparison is a method of verifying the correctness of mathematical expressions by comparing computed results with high-precision reference values.
Used in HorizonMath to verify the correctness of candidate solutions.
Deterministic Constraint Checking
Deterministic constraint checking is a method of verifying the correctness of candidate solutions by checking if they satisfy all required properties.
Used to verify construction problems in HorizonMath.
Data Contamination
Data contamination refers to the situation where test data is included in the training dataset, potentially leading to unrealistically high performance during testing.
HorizonMath avoids data contamination by designing problems with unknown solutions.
Optimization Problem
An optimization problem involves finding the best solution with respect to an objective function, often by constructing an object that surpasses published results.
Used in HorizonMath to evaluate AI's capabilities in mathematical discovery.
Closed-form Expression
A closed-form expression is a mathematical expression that can be represented using a finite number of standard mathematical operators and functions.
HorizonMath requires candidate solutions to be closed-form expressions for verification.
Modular Problem Format
A modular problem format is a way of designing problems that allows them to be independently verified and expanded.
HorizonMath uses a modular problem format to facilitate community contributions.
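As a sketch of what "modular" could mean in practice, each entry might bundle its statement with its own verifier, so problems can be added and checked independently. The field names below are assumptions for illustration, not HorizonMath's actual schema:

```python
# Hypothetical sketch of a modular problem entry; field names are
# illustrative assumptions, not HorizonMath's actual schema.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Problem:
    problem_id: str                  # stable identifier, e.g. "opt-017"
    domain: str                      # one of the benchmark's 8 domains
    statement: str                   # human-readable problem statement
    verifier: Callable[[Any], bool]  # deterministic checker for candidates

def evaluate(problem: Problem, candidate: Any) -> bool:
    """Each problem ships its own verifier, so entries stay independent."""
    return problem.verifier(candidate)

# Toy entry: "find x > 0 with x^2 = 2", verified numerically.
toy = Problem("toy-001", "analysis", "Find x > 0 with x^2 = 2.",
              lambda x: x > 0 and abs(x * x - 2) < 1e-12)
print(evaluate(toy, 2 ** 0.5))  # True
```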
High-Precision Numerical Reference
A high-precision numerical reference is a computed result used to verify the correctness of candidate solutions.
HorizonMath verifies candidate solutions by comparing them to high-precision numerical references.
Community Resource
A community resource is a tool or platform open for use and contribution by the research community.
HorizonMath, as an open benchmark, welcomes community contributions of new problems.
Open Questions (Unanswered questions from this research)
1. How can we verify the correctness of closed-form expressions without relying on high-precision numerical comparison? Current methods provide strong evidence but do not formally prove exact correctness.
2. How can the compliance checker be made more accurate, so it neither accepts solutions that exploit subtle loopholes nor rejects valid solutions built from unusual but legitimate constructions?
3. How can problem design better differentiate AI capabilities in mathematical discovery? Current frontier models score near 0% on HorizonMath, suggesting its difficulty may exceed the capabilities of existing systems.
4. How can HorizonMath scale without increasing verification costs? Existing mathematical benchmarks often rely on formal proof verification or manual review, both of which are costly and difficult to scale.
5. How can HorizonMath's evaluation precision be improved without sacrificing verification speed? Automated verification enhances objectivity and speed, but may not provide sufficient precision in every case.
Applications
Immediate Applications
Mathematical Research Evaluation
HorizonMath can serve as an evaluation tool in mathematical research, helping researchers quickly verify the correctness of their solutions.
AI Model Testing
HorizonMath can be used to test AI models' capabilities in mathematical discovery, helping developers identify strengths and weaknesses.
Educational Tool
HorizonMath can serve as an educational tool, helping students understand the complexity of mathematical problems and solutions.
Long-term Vision
Autonomous Mathematical Research
As AI capabilities advance, HorizonMath is expected to promote AI's autonomy in mathematical research, becoming a major driver of mathematical discovery.
Cross-Disciplinary Applications
HorizonMath's automated verification framework can be extended to other disciplines, promoting AI applications in scientific research.
Abstract
Can AI make progress on important, unsolved mathematical problems? Large language models are now capable of sophisticated mathematical and scientific reasoning, but whether they can perform novel research is still widely debated and underexplored. We introduce HorizonMath, a benchmark of over 100 predominantly unsolved problems spanning 8 domains in computational and applied mathematics, paired with an open-source evaluation framework for automated verification. Our benchmark targets a class of problems where discovery is hard, requiring meaningful mathematical insight, but verification is computationally efficient and simple. Because these solutions are unknown, HorizonMath is immune to data contamination, and most state-of-the-art models score near 0%. Existing research-level benchmarks instead rely on formal proof verification or manual review, both of which are expensive to scale. Using this platform, we find two problems for which GPT 5.4 Pro proposes solutions that improve on the best-known published results, representing potential novel contributions (pending expert review). We release HorizonMath as an open challenge and a growing community resource, where correct solutions to problems in the unsolved problem classes could constitute novel results in the mathematical literature.
References
UQ: Assessing Language Models on Unsolved Questions
Fan Nie, Ken Ziyu Liu, Zihao Wang et al.
Resolution of Erdős Problem #728: a writeup of Aristotle's Lean proof
Nat Sothanaphan
AlphaEvolve: A coding agent for scientific and algorithmic discovery
Alexander Novikov, Ngân Vũ, Marvin Eisenberger et al.
Accelerating Scientific Research with Gemini: Case Studies and Common Techniques
David P. Woodruff, Vincent Cohen-Addad, Lalit Jain et al.
Closed Forms: What They Are and Why We Care
J. Borwein, R. Crandall
Semi-Autonomous Mathematics Discovery with Gemini: A Case Study on the Erdős Problems
Tony Feng, Trieu H. Trinh, G. Bingham et al.
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Elliott S. Glazer, Ege Erdil, T. Besiroglu et al.
Training Verifiers to Solve Math Word Problems
K. Cobbe, Vineet Kosaraju, Mo Bavarian et al.
Mathematical discoveries from program search with large language models
Bernardino Romera-Paredes, M. Barekatain, Alexander Novikov et al.
OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems
Chaoqun He, Renjie Luo, Yuzhuo Bai et al.
Towards Robust Mathematical Reasoning
Thang Luong, Dawsen Hwang, Hoang Nguyen et al.
HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class
J. Roggeveen, Erik Y. Wang, Will Flintoft et al.
Single-minus gluon tree amplitudes are nonzero
A. Guevara, A. Lupsasca, David Skinner et al.
Mathematical exploration and discovery at scale
Bogdan Georgiev, Javier G'omez-Serrano, Terence Tao et al.
Learning to Discover at Test Time
Mert Yuksekgonul, Daniel Koceja, Xinhao Li et al.
PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition
G. Tsoukalas, Jasper Lee, J. Jennings et al.
Theory and computation of spheroidal wavefunctions
P. Falloon, P. Abbott, J. B. Wang
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath et al.