COMPOSE: Composing Future Theorems from Citations and Formal Structure

TL;DR

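Proposes COMPOSE, a dual-graph framework that conditions a language model on both citation graphs and formal theorem dependency graphs to generate plausible future theorem-like claims. It is trained on 108K paired graph examples and evaluated on a benchmark of 47K future papers.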

cs.CL Advanced 2026-05-29

David Busbib, Michael Werman

Mathematical Reasoning, Graph Neural Networks, Scientific Text Generation, Knowledge Graphs, Formal Dependencies

Key Findings

Methodology

COMPOSE is a dual-graph architecture that encodes the scientific citation graph and the formal theorem dependency graph separately, each with a dedicated GNN encoder. Cross-attention fuses the two representations into a single knowledge embedding, and a pre-trained math-specialized language model conditioned on that embedding generates future theorem-like claims. The dataset comprises 108,000 paired scientific-formal graph samples from arXiv and Mathlib. Training has two stages: the graph encoders are first optimized with alignment objectives, and the generation model is then fine-tuned with graph-conditioned loss functions. In experiments, COMPOSE outperforms baselines on future-paper retrieval and theorem generation, as measured by automatic metrics and LLM-judge evaluations.
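The fusion step can be illustrated with a minimal sketch. It assumes the two encoders have already produced node embeddings of equal dimension, and it omits the learned query, key, and value projections that a trained cross-attention layer would have. The function names (`cross_attend`, `fuse`), the mean pooling, and the concatenation are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """Each query vector attends over keys_values (used as both keys and values)."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        attended.append([
            sum(w * v[j] for w, v in zip(weights, keys_values))
            for j in range(dim)
        ])
    return attended

def mean_pool(vectors):
    # Average a list of equal-length vectors into one vector.
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def fuse(citation_nodes, formal_nodes):
    """Cross-attend in both directions, then pool and concatenate (assumed design)."""
    citation_ctx = cross_attend(citation_nodes, formal_nodes)
    formal_ctx = cross_attend(formal_nodes, citation_nodes)
    return mean_pool(citation_ctx) + mean_pool(formal_ctx)

# Toy embeddings standing in for GNN encoder outputs (hypothetical values).
citation_nodes = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
formal_nodes = [[0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
fused = fuse(citation_nodes, formal_nodes)
print(len(fused))
```

In this sketch the fused vector has twice the node dimension, one half per attention direction. In a real system a vector of this kind would be projected into the decoder's input space before conditioning generation.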

Key Results

On the benchmark of 47K future papers, COMPOSE achieves a Tgt-Sim score of 0.525, ahead of variants that use only the citation graph or only the formal graph (best 0.471), with a Gap of 0.240, indicating that its generated claims align more closely with actual future research. It ranks the correct future paper in the top 10 in 50.8% of cases, ahead of the baselines.
In generation quality assessments, COMPOSE averages 3.36/5 under LLM-judge evaluation, scoring particularly well on mathematical content, depth, and precision. Ablation studies support the importance of dual-graph fusion: performance drops when either graph source or the fusion step is removed.
With either decoder tested (DeepSeek-Math 7B and Mistral 7B), COMPOSE outperforms competing methods, which suggests the approach does not depend on a specific language model. Beyond predicting specific future theorems, its outputs also indicate broader research directions.
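The top-10 figure above is a hit rate at k: the fraction of queries whose true future paper appears among the k highest-ranked candidates. The sketch below shows how such a number can be computed. It assumes generated claims and candidate papers are compared by cosine similarity of embedding vectors, which the summary does not state, and all names and toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_of_target(query, candidates, target_index):
    """1-based rank of the target among candidates sorted by similarity to query."""
    target_score = cosine(query, candidates[target_index])
    better = sum(
        1 for i, c in enumerate(candidates)
        if i != target_index and cosine(query, c) > target_score
    )
    return better + 1

def hit_rate_at_k(queries, candidates, target_indices, k=10):
    """Fraction of queries whose target candidate is ranked within the top k."""
    hits = sum(
        1 for q, t in zip(queries, target_indices)
        if rank_of_target(q, candidates, t) <= k
    )
    return hits / len(queries)

# Toy data: three candidate papers, two generated claims (hypothetical values).
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.2, 0.8]]
targets = [0, 2]
print(hit_rate_at_k(queries, candidates, targets, k=1))
print(hit_rate_at_k(queries, candidates, targets, k=2))
```

On this toy data the first query ranks its target first and the second ranks its target second, so the hit rate rises from one half at k=1 to one at k=2.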

Significance

This work addresses a longstanding challenge in mathematical AI: integrating the way scientific literature evolves with the formal logical structure of mathematics in order to generate meaningful future results. By bridging informal research narratives and formal theorem dependencies, COMPOSE moves toward automated mathematical discovery and could change how researchers identify promising directions and validate hypotheses. Its ability to generate grounded, mathematically rich claims is a step toward intelligent scientific assistants.

Technical Contribution

The paper introduces a dual-graph encoding framework that uses GNNs to encode separate but interconnected scientific and formal structures. Fusion via cross-attention integrates the two knowledge sources, while the two-stage training optimizes first the graph representations and then generation quality. Dataset construction uses dense retrieval and alignment strategies to combine informal and formal sources. The architecture is modular, so the decoder can be swapped, as the experiments with two different 7B decoders show. Together, these contributions advance grounded scientific language modeling.

Novelty

The work is presented as the first to systematically combine scientific citation graphs with formal theorem dependency graphs for future theorem generation. Unlike prior models that rely on either textual or formal structure alone, COMPOSE fuses both sources to produce more grounded and mathematically rigorous claims. Its dual-graph architecture, alignment strategies, and training regimen integrate informal scientific narratives with formal logical dependencies, opening new avenues for automated mathematical reasoning.

Limitations

The model's performance heavily depends on the quality of the constructed graphs and alignment accuracy; errors in these steps can lead to less coherent or incorrect generated claims, especially in less-studied or emerging fields.
The framework is currently tailored to mathematics; its applicability to other scientific domains remains to be validated, as different fields may require different graph structures and alignment strategies.
Computational costs are high due to large-scale graph encoding and fine-tuning, which may limit real-time deployment or scaling to larger datasets without further optimization.

Future Work

Future research could explore multi-modal knowledge integration, such as incorporating figures, code snippets, or experimental data, to enrich the reasoning process. Developing more efficient graph construction and alignment algorithms will be crucial for scalability. Extending this framework to other scientific disciplines, like physics or biology, could broaden its impact. Additionally, integrating reinforcement learning to optimize the creativity and correctness of generated claims, and deploying in real-world scientific workflows, are promising directions.

AI Executive Summary

The rapid growth of scientific literature, especially in mathematics, presents both opportunities and challenges for automated knowledge discovery. Traditional models have struggled to synthesize the vast amount of informal research narratives with the formal logical structures underlying mathematical theorems. This disconnect hampers the ability of AI systems to generate meaningful future results, limiting their usefulness in advancing scientific frontiers.

Addressing this challenge, the authors propose COMPOSE, a novel dual-graph framework that integrates scientific citation networks with formal theorem dependency graphs. This architecture leverages the complementary strengths of both sources: the citation graph captures the evolution and direction of research, while the formal graph encodes the logical dependencies among theorems. By encoding these graphs separately with dedicated GNNs and then fusing their representations through cross-attention, COMPOSE creates a rich, grounded knowledge base that conditions a language model to generate plausible future theorems.
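To make the graph-encoding step concrete, the following is a minimal sketch of one message-passing round of the kind a GNN encoder performs, applied to a toy citation graph. It uses plain mean aggregation with no learned weights or nonlinearity; the summary does not specify the GNN variant COMPOSE uses, so the aggregation rule, the node names, and the edge direction (cited paper to citing paper) are assumptions for illustration.

```python
def message_passing_layer(features, edges):
    """One round of mean aggregation: each node averages itself and its in-neighbors."""
    # Every node starts with a self-loop so it keeps part of its own signal.
    neighbors = {node: [node] for node in features}
    for src, dst in edges:
        neighbors[dst].append(src)
    updated = {}
    for node, incoming in neighbors.items():
        dim = len(features[node])
        updated[node] = [
            sum(features[n][j] for n in incoming) / len(incoming)
            for j in range(dim)
        ]
    return updated

# Toy citation graph: the anchor paper cites paper_a and paper_b (hypothetical).
features = {
    "paper_a": [1.0, 0.0],
    "paper_b": [0.0, 1.0],
    "anchor": [0.0, 0.0],
}
edges = [("paper_a", "anchor"), ("paper_b", "anchor")]
updated = message_passing_layer(features, edges)
print(updated["anchor"])
```

After one round the anchor's embedding mixes information from both cited papers, while the papers with no incoming edges keep their original features. The same routine would apply to a theorem dependency graph, with edges running from a lemma to the theorems that depend on it.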

Training is supported by a large-scale dataset of 108,000 paired scientific-formal graph samples drawn from arXiv and Mathlib. Dense retrieval and alignment strategies link informal research narratives to formal theorem structures, which provides the supervision signal. Training proceeds in two stages: first, the graph encoders are optimized with link prediction and alignment objectives; second, a math-specialized language model is fine-tuned with a graph-conditioned generation loss.
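The summary does not give the exact form of the alignment objective. One common choice for aligning paired embeddings is a contrastive InfoNCE loss, sketched below under that assumption. Here `sci_embs[i]` and `formal_embs[i]` are taken to be a matched scientific-formal pair, and every other formal embedding in the batch serves as a negative. The function name, the temperature value, and the one-directional form (scientific to formal only) are illustrative choices, not details from the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(sci_embs, formal_embs, temperature=0.1):
    """Mean InfoNCE loss where sci_embs[i] and formal_embs[i] form a positive pair."""
    losses = []
    for i, s in enumerate(sci_embs):
        logits = [dot(s, f) / temperature for f in formal_embs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability assigned to the matched pair.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy unit-norm embeddings (hypothetical): correctly paired versus swapped pairs.
sci = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(sci, [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = info_nce(sci, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < mismatched_loss)
```

The loss is close to zero when each scientific embedding is most similar to its own formal partner and grows large when the pairs are swapped, which is the behavior an alignment stage would optimize for.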

Experimental results demonstrate that COMPOSE outperforms existing baselines in multiple metrics. It achieves a Tgt-Sim of 0.525, surpassing models that rely on single sources, and ranks the correct future papers in the top 10 in over half of the cases. Qualitative assessments via LLM judges confirm that the generated claims are more mathematically rich, precise, and logically consistent. Ablation studies further validate the importance of dual-graph fusion, with performance degrading when either graph source or fusion mechanism is removed.

This work is a step toward automated mathematical discovery, offering a grounded approach to predicting and generating future research directions. Its implications may extend beyond mathematics to broader applications in scientific knowledge synthesis and AI-assisted research. Challenges remain in graph construction quality, computational efficiency, and generalization to other domains. Future efforts could focus on multi-modal data integration, cross-disciplinary adaptation, and real-world deployment, with the longer-term aim of autonomous scientific reasoning systems.

Deep Dive

Abstract

A plausible future mathematical claim must satisfy two constraints: it should follow the direction of prior work and respect the formal dependencies that constrain what can validly follow. Existing approaches typically model only one of these sources, producing claims that are either weakly grounded or insufficiently motivated. We introduce grounded future mathematical generation, where the goal is to generate a plausible future theorem-like claim for an anchor paper using two complementary sources of context: its scientific citation graph and aligned formal theorem dependency graph. To address this setting, we propose COMPOSE, a dual-graph framework that conditions a language model on both scientific citation context and formal theorem structure. To support this setting, we construct a dataset of 108K paired scientific-formal graph examples from arXiv and Mathlib, together with a benchmark of 47K future papers from 2024-2025. Experiments show that COMPOSE outperforms strong baselines on retrieval to real future papers and achieves the best overall performance under LLM-judge evaluation, producing more grounded and mathematically richer outputs. These results show that future mathematical generation benefits from combining scientific context with formal structure. Project page is available at https://david-busbib.github.io/COMPOSE-page/.
