MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
MathNet provides a global multimodal benchmark for mathematical reasoning and retrieval, covering 30,676 Olympiad-level problems from 47 countries.
Key Findings
Methodology
MathNet's methodology includes three core tasks: problem solving, math-aware retrieval, and retrieval-augmented problem solving. The dataset comprises Olympiad-level problems from 47 countries, covering 17 languages and various mathematical domains. The retrieval benchmark consists of mathematically equivalent and structurally similar problem pairs curated by human experts. Experiments evaluated several state-of-the-art reasoning and embedding models, revealing the challenges current models face in mathematical reasoning and retrieval tasks.
Key Results
- Gemini-3.1-Pro achieved 78.4% accuracy in the problem-solving task, while GPT-5 scored 69.3%, indicating that even state-of-the-art models struggle with Olympiad-level problems.
- In the math-aware retrieval task, embedding models performed poorly, struggling to retrieve equivalent problems.
- DeepSeek-V3.2-Speciale achieved up to a 12% performance gain in the retrieval-augmented generation task, obtaining the highest scores on the benchmark.
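The retrieval-augmented setup behind the last result can be sketched as follows. This is a minimal toy pipeline, not MathNet's actual implementation: the corpus, the `embed` stub, and the prompt format are all illustrative assumptions, with a bag-of-characters vector standing in for a real embedding model.

```python
# Minimal sketch of retrieval-augmented problem solving: retrieve the most
# similar solved problem, then prepend it to the query prompt.
import math

CORPUS = [
    {"problem": "Prove that the sum of two even numbers is even.",
     "solution": "Write the numbers as 2a and 2b; their sum is 2(a + b)."},
    {"problem": "Show that the product of two odd numbers is odd.",
     "solution": "Write them as 2a + 1 and 2b + 1 and expand the product."},
]

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model: a toy bag-of-characters vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_rag_prompt(query: str) -> str:
    # Rank corpus problems by similarity to the query and keep the best one.
    q = embed(query)
    best = max(CORPUS, key=lambda ex: cosine(q, embed(ex["problem"])))
    return (f"Related problem: {best['problem']}\n"
            f"Related solution: {best['solution']}\n\n"
            f"Now solve: {query}")

prompt = build_rag_prompt("Prove that the sum of two multiples of 4 is even.")
print(prompt.splitlines()[0])
```

The sensitivity to retrieval quality reported above follows directly from this structure: whatever `best` the retriever selects is injected into the generator's context, so a wrong retrieval feeds the model a misleading exemplar.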
Significance
The significance of MathNet lies in its ability to fill the gaps in existing benchmarks regarding scale, language coverage, and task diversity. By providing a large-scale, multimodal, and multilingual dataset of Olympiad-level problems, MathNet offers a new evaluation platform for mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. This will help advance research in mathematical reasoning, particularly in how models handle complex mathematical problems and retrieve equivalent problems.
Technical Contribution
MathNet's technical contributions include the first benchmark focused on mathematical problem retrieval and a large-scale, high-quality Olympiad dataset. Beyond problem solving and retrieval, the retrieval-augmented generation task demonstrates the significant impact of retrieval quality on generation performance. The public release of MathNet's dataset and benchmark will serve as a valuable resource for academia and industry.
Novelty
MathNet's novelty lies in its global multimodal and multilingual coverage and the introduction of the first mathematical problem retrieval benchmark. Unlike existing Olympiad datasets, MathNet is larger in scale and richer in language and task diversity.
Limitations
- Current models perform poorly in retrieving equivalent problems, especially when dealing with complex mathematical structures.
- The performance of retrieval-augmented generation tasks is highly dependent on retrieval quality, meaning retrieval errors can lead to decreased generation performance.
Future Work
Future research directions include improving embedding models' ability to recognize mathematical structures and exploring better integration of retrieval and generation models to enhance mathematical reasoning capabilities. Further work could also focus on expanding MathNet's dataset and benchmark to cover more mathematical domains and languages.
AI Executive Summary
Mathematical problem solving has long been a crucial test of reasoning for large language models and multimodal models. However, existing benchmarks are limited in size, language coverage, and task diversity. To address this, we introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems, along with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems.
MathNet's dataset spans 47 countries, 17 languages, and various mathematical domains, comprising 30,676 expert-authored problems with solutions. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts.
Experimental results show that even state-of-the-art reasoning models, such as Gemini-3.1-Pro and GPT-5, remain challenged by Olympiad-level problems, while embedding models struggle to retrieve equivalent problems. We further demonstrate that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark.
MathNet not only provides the largest high-quality Olympiad dataset but also introduces the first benchmark for evaluating mathematical problem retrieval. We publicly release both the dataset and benchmark to promote research in mathematical reasoning and retrieval in academia and industry.
Despite MathNet's significance in advancing mathematical reasoning research, current models still face challenges in handling complex mathematical structures and retrieving equivalent problems. Future research could focus on improving embedding models' ability to recognize mathematical structures and better integrating retrieval and generation models to enhance mathematical reasoning capabilities.
Deep Analysis
Background
Mathematical reasoning has long been a core benchmark for evaluating AI reasoning capabilities. Early efforts focused on text-based arithmetic problems, while recent research has expanded to competition-level reasoning, theorem proving, and multimodal problem solving. Existing datasets fall broadly into text-only benchmarks, multimodal benchmarks, and aggregated suites. While these datasets have advanced mathematical reasoning research, they remain limited in scale, language diversity, and structured similarity annotations. MathNet fills this gap by providing a large-scale, multimodal, and multilingual dataset of Olympiad-level problems.
Core Problem
Mathematical problem solving is a core benchmark for evaluating AI reasoning capabilities. However, existing Olympiad-level datasets are typically drawn from community platforms such as AoPS and cover only a handful of competitions in the U.S. and China. The lack of open, high-quality, and diverse benchmarks constrains research progress. MathNet addresses this gap by presenting mathematics problems sourced from 47 countries across four decades, providing an unprecedented foundation for exploring mathematical generalization and analogical reasoning.
Innovation
MathNet's core innovations include its global multimodal and multilingual coverage and the introduction of the first mathematical problem retrieval benchmark. Unlike existing Olympiad datasets, MathNet is larger in scale and richer in language and task diversity. It supports three tasks: problem solving, math-aware retrieval, and retrieval-augmented problem solving. By providing a large-scale, high-quality Olympiad dataset, MathNet offers a new evaluation platform for mathematical reasoning in generative models and mathematical retrieval in embedding-based systems.
Methodology
- MathNet-Solve: A collection of 30,676 Olympiad-level math problems with aligned LaTeX and natural-language statements, expert solutions, and metadata spanning 47 countries, 17 languages, and 65+ mathematical domains.
- MathNet-Retrieve: A dataset for retrieval consisting of 40,000 additional synthetic problems derived from 10,000 anchor problems, each paired with 1 equivalent positive and 3 hard negatives.
- MathNet-RAG: An evaluation dataset of 35 anchor problems and 35 expert-paired real problems, all drawn entirely from MathNet-Solve.
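The MathNet-Retrieve layout described above (one anchor, one equivalent positive, three hard negatives) can be pictured with a small record type. The field names and example problems here are assumptions for illustration, not MathNet's actual schema.

```python
# Illustrative record layout for a retrieval benchmark built from anchor
# problems, each paired with 1 equivalent positive and 3 hard negatives.
from dataclasses import dataclass, field

@dataclass
class RetrievalRecord:
    anchor: str                     # the query problem statement
    positive: str                   # mathematically equivalent rewrite
    hard_negatives: list[str] = field(default_factory=list)  # similar-looking distractors

    def candidates(self) -> list[str]:
        # Candidate pool a retriever must rank: the positive among 3 negatives.
        return [self.positive] + self.hard_negatives

record = RetrievalRecord(
    anchor="Find all integers n with n^2 - n divisible by 2.",
    positive="Show that n(n - 1) is even for every integer n.",
    hard_negatives=[
        "Find all integers n with n^2 + n divisible by 3.",
        "Show that n^2 - 1 is divisible by 8 for odd n.",
        "Find all integers n with n^2 - n divisible by 5.",
    ],
)
print(len(record.candidates()))  # 4 candidates per anchor
```

Hard negatives are what makes the task difficult: they share surface form with the anchor, so a retriever must recognize the underlying mathematical structure rather than lexical overlap.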
Experiments
The experimental design includes evaluating 27 models on MathNet-Solve, MathNet-Retrieve, and MathNet-RAG. On MathNet-Solve, we evaluate two types of models: text-only and multimodal models. On MathNet-Retrieve, we assess retrieval performance using embeddings derived from a diverse set of state-of-the-art models. On MathNet-RAG, we limit evaluations to seven state-of-the-art open-source and proprietary models, as this benchmark requires human grading.
Results
On MathNet-Solve, the strongest model is Gemini-3.1-Pro, achieving 76.3% overall accuracy. MathNet-Retrieve remains highly challenging at the top-1 level, with even the strongest models achieving only ~5% Recall@1. On MathNet-RAG, Expert-RAG is the strongest setting overall, with DeepSeek-V3.2-Speciale reaching the best result at 97.3% under human grading.
Applications
MathNet's dataset and benchmark provide a valuable resource for academia and industry, particularly in mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. It can be used to evaluate and improve existing mathematical reasoning models and provide a foundation for developing new mathematical reasoning and retrieval methods.
Limitations & Outlook
Despite MathNet's significance in advancing mathematical reasoning research, current models still face challenges in handling complex mathematical structures and retrieving equivalent problems. The performance of retrieval-augmented generation tasks is highly dependent on retrieval quality, meaning retrieval errors can lead to decreased generation performance. Future research could focus on improving embedding models' ability to recognize mathematical structures and better integrating retrieval and generation models to enhance mathematical reasoning capabilities.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen preparing a big meal. You have a variety of ingredients (like MathNet's dataset), each with different flavors and uses (like different math problems). You need to choose and combine these ingredients according to recipes (like the model's algorithms) to create delicious dishes (like solving math problems).
However, sometimes you might face challenges, such as not finding a specific ingredient (like models struggling to retrieve equivalent problems) or being unsure of the best use of an ingredient (like difficulties in mathematical reasoning).
To overcome these challenges, you can try different combinations and cooking methods (like experimenting and adjusting parameters in models) or refer to other chefs' experiences (like using retrieved related problems in retrieval-augmented generation tasks).
Ultimately, through continuous trial and improvement, you can create a delicious meal (like achieving success in mathematical reasoning tasks). MathNet is like a rich pantry, offering you endless possibilities.
ELI14 (Explained like you're 14)
Hey there! Did you know that math isn't just those formulas and problems you see in class? It's actually like a super fun puzzle game! Imagine you have a giant puzzle, and each piece represents a math problem. MathNet is like a huge puzzle library with pieces from all over the world.
Now, imagine you're a puzzle master, and you need to use these pieces to complete a super complex puzzle. Every time you find the right piece, it's like solving a math problem. But sometimes, you might find pieces that look similar but don't fit, just like how models struggle to find equivalent problems.
To help you finish the puzzle faster, you can use some tricks, like finding the edge pieces first (like using retrieval-augmented generation tasks in models), which helps you build the puzzle's framework quicker.
So, MathNet is like a super cool puzzle library that helps you explore and discover more fun in the world of math!
Glossary
MathNet
A global multimodal and multilingual dataset of Olympiad-level math problems for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems.
The core dataset and benchmark in this paper.
Multimodal
Involves processing and analyzing multiple data forms (e.g., text, images).
MathNet supports multimodal problem-solving.
Retrieval-Augmented Generation
Enhancing a generative model's reasoning ability by retrieving related problems.
Used in MathNet to improve mathematical reasoning performance.
Embedding Model
A model that converts data (e.g., text) into vector representations for similarity computation and retrieval.
Used for MathNet's math-aware retrieval task.
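As a concrete illustration of this definition, embeddings are compared by cosine similarity. The 3-dimensional vectors below are made up for the example; real embedding models produce high-dimensional vectors.

```python
# Toy illustration: "embeddings" compared by cosine similarity.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

a = [1.0, 2.0, 0.0]   # embedding of problem A (hypothetical)
b = [2.0, 4.0, 0.0]   # embedding of a rephrasing of problem A
c = [0.0, 0.0, 3.0]   # embedding of an unrelated problem

print(round(cosine(a, b), 3))  # 1.0: same direction, maximal similarity
print(round(cosine(a, c), 3))  # 0.0: orthogonal, no similarity
```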
Equivalent Problem
Problems that are mathematically the same despite possibly different surface statements.
Included in MathNet's retrieval benchmark.
Recall@k
The fraction of queries for which a correct item appears among the top k retrieved results.
Used to evaluate MathNet-Retrieve's performance.
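A minimal implementation of this metric, assuming each query comes with a ranked candidate list and one known correct item; the ids and rankings below are made-up examples.

```python
# Recall@k: fraction of queries whose correct item appears in the
# top-k retrieved results.
def recall_at_k(rankings, gold, k):
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)

rankings = [["p3", "p1", "p9"],   # retrieved ids, best first, per query
            ["p7", "p2", "p4"],
            ["p5", "p6", "p8"]]
gold = ["p1", "p7", "p8"]         # correct (equivalent) problem per query

print(recall_at_k(rankings, gold, 1))  # only the second query hits at top-1
print(recall_at_k(rankings, gold, 3))  # all three hit within the top 3
```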
Generative Model
A model capable of generating text or other data forms.
Used for MathNet's mathematical reasoning task.
Olympiad
A high-difficulty mathematics competition, such as the International Mathematical Olympiad or its national counterparts.
The source of MathNet's dataset.
Structural Similarity
In mathematical problems, refers to similarity in structure rather than surface form.
A key concept in MathNet-Retrieve task.
Multilingual
A dataset or system that supports multiple languages.
MathNet covers 17 languages.
Open Questions (Unanswered questions from this research)
1. How to improve embedding models' ability to recognize mathematical structures remains open. Current methods perform poorly on complex mathematical structures, and new techniques are needed to strengthen structural recognition.
2. Retrieval-augmented generation performance is highly dependent on retrieval quality, so retrieval errors directly degrade generation. Improving retrieval accuracy remains a challenge.
3. Existing multimodal models perform poorly on symbolic tasks. How to better integrate multimodal information to enhance mathematical reasoning remains to be explored.
4. While MathNet provides a new evaluation platform for mathematical reasoning, expanding the dataset to cover more mathematical domains and languages remains an open question.
5. Better identifying and retrieving equivalent problems remains a challenge in mathematical problem retrieval, especially for complex mathematical structures.
Applications
Immediate Applications
Mathematical Reasoning Model Evaluation
MathNet can be used to evaluate existing mathematical reasoning models, helping researchers identify model strengths and weaknesses and guide model improvements.
Educational Tool Development
Using MathNet's dataset and benchmark, new educational tools can be developed to help students improve their mathematical reasoning skills.
Mathematics Competition Preparation
MathNet's dataset can be used for mathematics competition preparation, helping students practice and improve their ability to solve complex mathematical problems.
Long-term Vision
Cross-Language Mathematics Education
MathNet's multilingual support can promote cross-language mathematics education, helping students from different language backgrounds learn mathematics better.
Intelligent Mathematics Assistant
By combining MathNet's dataset and benchmark, intelligent mathematics assistants can be developed to help users solve complex mathematical problems.
Abstract
Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.