MathNet: A Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

TL;DR

MathNet provides a global multimodal benchmark for mathematical reasoning and retrieval, covering 30,676 Olympiad-level problems from 47 countries.

cs.AI · 2026-04-21
Shaden Alshammari Kevin Wen Abrar Zainal Mark Hamilton Navid Safaei Sultan Albarakati William T. Freeman Antonio Torralba
Keywords: mathematical reasoning, multimodal retrieval, Olympiad dataset

Key Findings

Methodology

MathNet's methodology includes three core tasks: problem solving, math-aware retrieval, and retrieval-augmented problem solving. The dataset comprises Olympiad-level problems from 47 countries, covering 17 languages and various mathematical domains. The retrieval benchmark consists of mathematically equivalent and structurally similar problem pairs curated by human experts. Experiments evaluated several state-of-the-art reasoning and embedding models, revealing the challenges current models face in mathematical reasoning and retrieval tasks.
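
As a concrete illustration of the third task, here is a minimal sketch of retrieval-augmented problem solving: retrieve the most similar solved problems by embedding similarity and prepend them to the prompt as worked exemplars. The `embed` and `generate` helpers are assumptions for exposition, not the paper's actual pipeline.

```python
# Minimal sketch of retrieval-augmented problem solving. `embed`
# (text -> unit-norm vector) and `generate` (prompt -> solution text)
# are hypothetical helpers, not the paper's pipeline.
import numpy as np

def solve_with_retrieval(problem, corpus, corpus_vecs, embed, generate, k=3):
    # Rank solved corpus problems by cosine similarity to the query problem.
    q = embed(problem)
    sims = corpus_vecs @ q
    nearest = np.argsort(-sims)[:k]

    # Prepend the retrieved problem/solution pairs as worked exemplars.
    exemplars = "\n\n".join(
        f"Problem: {corpus[i]['problem']}\nSolution: {corpus[i]['solution']}"
        for i in nearest
    )
    prompt = f"{exemplars}\n\nNow solve:\n{problem}"
    return generate(prompt)
```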

Key Results

  • Gemini-3.1-Pro achieved 78.4% accuracy in the problem-solving task, while GPT-5 scored 69.3%, indicating that even state-of-the-art models struggle with Olympiad-level problems.
  • In the math-aware retrieval task, embedding models performed poorly, struggling to retrieve equivalent problems.
  • DeepSeek-V3.2-Speciale achieved up to a 12% performance gain in the retrieval-augmented generation task, obtaining the highest scores on the benchmark.

Significance

The significance of MathNet lies in its ability to fill the gaps in existing benchmarks regarding scale, language coverage, and task diversity. By providing a large-scale, multimodal, and multilingual dataset of Olympiad-level problems, MathNet offers a new evaluation platform for mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. This will help advance research in mathematical reasoning, particularly in how models handle complex mathematical problems and retrieve equivalent problems.

Technical Contribution

MathNet's technical contributions include being the first benchmark focused on mathematical problem retrieval and providing a large-scale, high-quality Olympiad dataset. Beyond the problem-solving and retrieval tasks, its retrieval-augmented generation task demonstrates the significant impact of retrieval quality on generation performance. The public release of MathNet's dataset and benchmark will serve as a valuable resource for academia and industry.

Novelty

MathNet's novelty lies in its global multimodal and multilingual coverage and the introduction of the first mathematical problem retrieval benchmark. Unlike existing Olympiad datasets, MathNet is larger in scale and richer in language and task diversity.

Limitations

  • Current models perform poorly in retrieving equivalent problems, especially when dealing with complex mathematical structures.
  • The performance of retrieval-augmented generation tasks is highly dependent on retrieval quality, meaning retrieval errors can lead to decreased generation performance.

Future Work

Future research directions include improving embedding models' ability to recognize mathematical structures and exploring better integration of retrieval and generation models to enhance mathematical reasoning capabilities. Further work could also focus on expanding MathNet's dataset and benchmark to cover more mathematical domains and languages.

AI Executive Summary

Mathematical problem solving has long been a crucial test of reasoning for large language models and multimodal models. However, existing benchmarks are limited in size, language coverage, and task diversity. To address this, we introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems, along with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems.

MathNet's dataset spans 47 countries, 17 languages, and various mathematical domains, comprising 30,676 expert-authored problems with solutions. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts.

Experimental results show that even state-of-the-art reasoning models, such as Gemini-3.1-Pro and GPT-5, remain challenged by Olympiad-level problems, while embedding models struggle to retrieve equivalent problems. We further demonstrate that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark.

MathNet not only provides the largest high-quality Olympiad dataset but also introduces the first benchmark for evaluating mathematical problem retrieval. We publicly release both the dataset and benchmark to promote research in mathematical reasoning and retrieval in academia and industry.

Despite MathNet's significance in advancing mathematical reasoning research, current models still face challenges in handling complex mathematical structures and retrieving equivalent problems. Future research could focus on improving embedding models' ability to recognize mathematical structures and better integrating retrieval and generation models to enhance mathematical reasoning capabilities.

Deep Analysis

Background

Mathematical reasoning has long been a core benchmark for evaluating AI reasoning capabilities. Early efforts focused on text-based arithmetic problems, while recent research has expanded to competition-level reasoning, theorem proving, and multimodal problem solving. Existing datasets can be broadly categorized into text-only benchmarks, multimodal benchmarks, and aggregates. While these datasets have advanced mathematical reasoning research, they remain limited in scale, language diversity, and structured similarity annotations. MathNet fills this gap by providing a large-scale, multimodal, and multilingual dataset of Olympiad-level problems.

Core Problem

Mathematical problem solving is a core benchmark for evaluating AI reasoning capabilities. However, existing Olympiad-level datasets are typically drawn from community platforms such as AoPS and cover only a handful of competitions in the U.S. and China. This constrains research progress due to the lack of open, high-quality, and diverse benchmarks. MathNet addresses this gap by presenting mathematics problems sourced from 47 countries across two decades, providing an unprecedented foundation for exploring mathematical generalization and analogical reasoning.

Innovation

MathNet's core innovations include its global multimodal and multilingual coverage and the introduction of the first mathematical problem retrieval benchmark. Unlike existing Olympiad datasets, MathNet is larger in scale and richer in language and task diversity. It supports three tasks: problem solving, math-aware retrieval, and retrieval-augmented problem solving. By providing a large-scale, high-quality Olympiad dataset, MathNet offers a new evaluation platform for mathematical reasoning in generative models and mathematical retrieval in embedding-based systems.

Methodology

  • MathNet-Solve: A collection of 30,676 Olympiad-level math problems with aligned LaTeX and natural-language statements, expert solutions, and metadata spanning 47 countries, 17 languages, and 65+ mathematical domains.
  • MathNet-Retrieve: A retrieval dataset of 40,000 additional synthetic problems derived from 10,000 anchor problems, each paired with 1 equivalent positive and 3 hard negatives (see the record sketch after this list).
  • MathNet-RAG: An evaluation dataset of 35 anchor problems and 35 expert-paired real problems, all drawn entirely from MathNet-Solve.
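
To make the pairing structure concrete, here is a minimal sketch of one MathNet-Retrieve record; the field names are illustrative assumptions for exposition, not the paper's released schema.

```python
# Illustrative sketch of a single MathNet-Retrieve record; field names
# are assumptions for exposition, not the paper's released schema.
from dataclasses import dataclass, field

@dataclass
class RetrievalRecord:
    anchor: str                # original Olympiad problem (LaTeX statement)
    positive: str              # mathematically equivalent rewrite
    hard_negatives: list[str] = field(default_factory=list)  # 3 similar-looking but inequivalent problems
    language: str = "en"       # one of the 17 covered languages
```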

Experiments

The experimental design includes evaluating 27 models on MathNet-Solve, MathNet-Retrieve, and MathNet-RAG. On MathNet-Solve, we evaluate two types of models: text-only and multimodal models. On MathNet-Retrieve, we assess retrieval performance using embeddings derived from a diverse set of state-of-the-art models. On MathNet-RAG, we limit evaluations to seven state-of-the-art open-source and proprietary models, as this benchmark requires human grading.
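
To illustrate how the retrieval track could be scored, the sketch below computes Recall@k over embedded problems using cosine similarity; the `embed` call in the usage comment is an assumed stand-in for any embedding model, not the paper's evaluation harness.

```python
# Hedged sketch of embedding-based retrieval scoring (Recall@k), assuming
# query and corpus vectors are unit-normalized so the dot product equals
# cosine similarity.
import numpy as np

def recall_at_k(query_vecs, corpus_vecs, gold_indices, k=1):
    """Fraction of queries whose gold item ranks in the top k by cosine similarity."""
    sims = query_vecs @ corpus_vecs.T          # (num_queries, corpus_size) similarities
    topk = np.argsort(-sims, axis=1)[:, :k]    # indices of the k nearest corpus items
    hits = [gold in row for gold, row in zip(gold_indices, topk)]
    return float(np.mean(hits))

# Usage (embed is a hypothetical embedding function):
# q = embed(anchor_texts); c = embed(candidate_texts)
# print(recall_at_k(q, c, gold_indices, k=1))
```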

Results

On MathNet-Solve, the strongest model is Gemini-3.1-Pro, achieving 76.3% overall accuracy. MathNet-Retrieve remains highly challenging at the top-1 level, with even the strongest models achieving only ∼5% Recall@1. On MathNet-RAG, Expert-RAG is the strongest setting overall, with DeepSeek-V3.2-Speciale reaching the best result at 97.3% under human grading.

Applications

MathNet's dataset and benchmark provide a valuable resource for academia and industry, particularly in mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. It can be used to evaluate and improve existing mathematical reasoning models and provide a foundation for developing new mathematical reasoning and retrieval methods.

Limitations & Outlook

Despite MathNet's significance in advancing mathematical reasoning research, current models still face challenges in handling complex mathematical structures and retrieving equivalent problems. The performance of retrieval-augmented generation tasks is highly dependent on retrieval quality, meaning retrieval errors can lead to decreased generation performance. Future research could focus on improving embedding models' ability to recognize mathematical structures and better integrating retrieval and generation models to enhance mathematical reasoning capabilities.

Plain Language (Accessible to non-experts)

Imagine you're in a kitchen preparing a big meal. You have a variety of ingredients (like MathNet's dataset), each with different flavors and uses (like different math problems). You need to choose and combine these ingredients according to recipes (like the model's algorithms) to create delicious dishes (like solving math problems).

However, sometimes you might face challenges, such as not finding a specific ingredient (like models struggling to retrieve equivalent problems) or being unsure of the best use of an ingredient (like difficulties in mathematical reasoning).

To overcome these challenges, you can try different combinations and cooking methods (like experimenting and adjusting parameters in models) or refer to other chefs' experiences (like using retrieved related problems in retrieval-augmented generation tasks).

Ultimately, through continuous trial and improvement, you can create a delicious meal (like achieving success in mathematical reasoning tasks). MathNet is like a rich pantry, offering you endless possibilities.

ELI14 (Explained like you're 14)

Hey there! Did you know that math isn't just those formulas and problems you see in class? It's actually like a super fun puzzle game! Imagine you have a giant puzzle, and each piece represents a math problem. MathNet is like a huge puzzle library with pieces from all over the world.

Now, imagine you're a puzzle master, and you need to use these pieces to complete a super complex puzzle. Every time you find the right piece, it's like solving a math problem. But sometimes, you might find pieces that look similar but don't fit, just like how models struggle to find equivalent problems.

To help you finish the puzzle faster, you can use some tricks, like finding the edge pieces first (like using retrieval-augmented generation tasks in models), which helps you build the puzzle's framework quicker.

So, MathNet is like a super cool puzzle library that helps you explore and discover more fun in the world of math!

Glossary

MathNet

A global multimodal and multilingual dataset of Olympiad-level math problems for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems.

The core dataset and benchmark in this paper.

Multimodal

Involves processing and analyzing multiple data forms (e.g., text, images).

MathNet supports multimodal problem-solving.

Retrieval-Augmented Generation

Enhancing a generative model's reasoning by retrieving related problems and supplying them as context.

Used in MathNet to improve mathematical reasoning performance.

Embedding Model

A model that converts data (e.g., text) into vector representations for similarity computation and retrieval.

Used for MathNet's math-aware retrieval task.

Equivalent Problem

Problems that share the same underlying mathematics, even when their surface wording or presentation differs.

Included in MathNet's retrieval benchmark.

Recall@k

The fraction of queries for which a relevant item appears among the top k retrieved results.

Used to evaluate MathNet-Retrieve's performance.
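
In formula form (a standard information-retrieval definition; the notation is illustrative, not the paper's): with query set Q, relevant items rel(q), and top-k retrievals top_k(q),

```latex
% Standard IR definition of Recall@k; notation is illustrative.
\[
  \mathrm{Recall@}k \;=\; \frac{1}{|Q|} \sum_{q \in Q}
  \mathbf{1}\!\left[\, \mathrm{rel}(q) \cap \mathrm{top}_k(q) \neq \varnothing \,\right]
\]
```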

Generative Model

A model capable of generating text or other data forms.

Used for MathNet's mathematical reasoning task.

Olympiad

High-difficulty competition mathematics, as exemplified by the International Mathematical Olympiad (IMO) and comparable national and regional olympiads.

The source of MathNet's dataset.

Structural Similarity

In mathematical problems, refers to similarity in structure rather than surface form.

A key concept in the MathNet-Retrieve task.

Multilingual

A dataset or system that supports multiple languages.

MathNet covers 17 languages.

Open Questions (Unanswered questions from this research)

  1. How to improve embedding models' ability to recognize mathematical structures remains an open question. Current methods perform poorly when dealing with complex mathematical structures, requiring new techniques to enhance models' structural recognition capabilities.
  2. The performance of retrieval-augmented generation tasks is highly dependent on retrieval quality, meaning retrieval errors can lead to decreased generation performance. Improving retrieval accuracy remains a challenge.
  3. Existing multimodal models have limited performance in handling symbolic tasks. How to better integrate multimodal information to enhance mathematical reasoning capabilities remains to be explored.
  4. While MathNet provides a new evaluation platform for mathematical reasoning, how to expand the dataset to cover more mathematical domains and languages remains an open question.
  5. In mathematical problem retrieval, how to better identify and retrieve equivalent problems remains a challenge, especially when dealing with complex mathematical structures.

Applications

Immediate Applications

Mathematical Reasoning Model Evaluation

MathNet can be used to evaluate existing mathematical reasoning models, helping researchers identify model strengths and weaknesses and guide model improvements.

Educational Tool Development

Using MathNet's dataset and benchmark, new educational tools can be developed to help students improve their mathematical reasoning skills.

Mathematics Competition Preparation

MathNet's dataset can be used for mathematics competition preparation, helping students practice and improve their ability to solve complex mathematical problems.

Long-term Vision

Cross-Language Mathematics Education

MathNet's multilingual support can promote cross-language mathematics education, helping students from different language backgrounds learn mathematics better.

Intelligent Mathematics Assistant

By combining MathNet's dataset and benchmark, intelligent mathematics assistants can be developed to help users solve complex mathematical problems.

Abstract

Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.

Subjects: cs.AI, cs.DL, cs.IR, cs.LG
