COMPOSE: Composing Future Theorems from Citations and Formal Structure
Proposes COMPOSE, a dual-graph framework that combines citation graphs with formal theorem dependency graphs to generate plausible future theorems, trained on 108K paired graph examples and evaluated on 47K future papers.
Key Findings
Methodology
The paper introduces COMPOSE, a dual-graph architecture that encodes the scientific citation graph and the formal theorem dependency graph separately, each with a dedicated GNN encoder. Cross-attention fuses the two representations into a unified knowledge embedding, and a pre-trained, math-specialized language model conditioned on that embedding generates future theorem-like claims. The dataset comprises 108,000 paired scientific-formal graph samples drawn from arXiv and Mathlib. Training proceeds in two stages: the graph encoders are first optimized with link prediction and alignment objectives, and the generation model is then fine-tuned with a graph-conditioned generation loss. In experiments, COMPOSE outperforms baselines on future paper retrieval and theorem generation, as measured by automatic metrics and LLM-judge evaluation.
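To make the fusion step concrete, here is a minimal sketch of cross-attention between two sets of node embeddings. It is an illustration under stated assumptions, not the authors' implementation: it uses a single head and no learned projection matrices, and the names `cross_attention`, `citation_nodes`, and `formal_nodes` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention (no learned projections).

    Each query vector (here: a citation-graph node embedding) is replaced by
    a weighted average of the key/value vectors (here: formal-graph node
    embeddings), with weights given by softmax of the scaled dot products.
    """
    dim = len(queries[0])
    fused = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        fused.append([
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ])
    return fused


# Toy data (assumed): 2 citation nodes, 3 formal theorem nodes, dimension 2.
citation_nodes = [[1.0, 0.0], [0.0, 1.0]]
formal_nodes = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused_nodes = cross_attention(citation_nodes, formal_nodes)
```

Each fused vector stays in the span of the formal-graph embeddings, so a citation node that points in the direction of a particular theorem embedding is pulled toward that theorem. A real system would add learned query, key, and value projections and multiple heads.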
Key Results
- On the 47K future-paper test set, COMPOSE achieves a Tgt-Sim score of 0.525, compared with at most 0.471 for models that use only the citation graph or only the formal graph, and a Gap of 0.240, indicating that its generated claims align more closely with actual future research. It ranks the correct future paper in the top 10 in 50.8% of cases, ahead of the baselines.
- In generation quality assessments, COMPOSE scores an average of 3.36/5 in LLM judge evaluations, excelling particularly in mathematical content, depth, and precision. Ablation studies confirm the importance of dual-graph fusion, with performance drops when removing either graph source or fusion step.
- Across different decoders (DeepSeek-Math 7B and Mistral 7B), COMPOSE consistently outperforms competitors, demonstrating robustness and generalization. The model not only predicts specific future theorems but also offers valuable insights into broader research directions.
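The top-10 figure above is a retrieval-style metric. The sketch below shows how such a number could be computed, assuming that generated claims and candidate future papers are embedded in the same vector space and compared by cosine similarity. This summary does not define Tgt-Sim or Gap, so the sketch makes no attempt to reproduce them; the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_of_target(query, candidates, target_index):
    """1-based rank of the target candidate when sorted by similarity to query."""
    sims = [cosine(query, c) for c in candidates]
    target_sim = sims[target_index]
    return 1 + sum(1 for s in sims if s > target_sim)


def hits_at_k(ranks, k):
    """Fraction of queries whose target is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Toy data (assumed): embeddings of 3 candidate future papers and
# 2 generated claims, each with a known target paper index.
candidate_papers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
generated_claims = [[0.9, 0.1], [0.2, 0.8]]
target_indices = [0, 2]

ranks = [
    rank_of_target(claim, candidate_papers, target)
    for claim, target in zip(generated_claims, target_indices)
]
top1 = hits_at_k(ranks, 1)
top2 = hits_at_k(ranks, 2)
```

With a 47K-paper candidate pool, the same `hits_at_k(ranks, 10)` call would give the share of generated claims whose true future paper lands in the top 10.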
Significance
This work addresses a longstanding challenge in mathematical AI: integrating scientific literature's evolutionary paths with formal logical structures to generate meaningful future results. By bridging informal research narratives and formal theorem dependencies, COMPOSE advances the field toward automated mathematical discovery, potentially transforming how researchers identify promising directions and validate hypotheses. Its ability to generate grounded, mathematically rich claims paves the way for intelligent scientific assistants, accelerating innovation across disciplines.
Technical Contribution
The paper introduces a novel dual-graph encoding framework, leveraging GNNs for separate but interconnected scientific and formal structures. The fusion via cross-attention enables effective knowledge integration, while the two-stage training optimizes both graph representations and generation quality. The dataset construction employs dense retrieval and alignment strategies, combining informal and formal sources. The architecture's modular design allows flexible adaptation to different decoders, demonstrating scalability and robustness. These innovations collectively push the frontier of grounded scientific language modeling.
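As a rough picture of what a GNN encoder does on either graph, the sketch below runs one round of mean-aggregation message passing. It is a simplification under stated assumptions: edges are treated as undirected, there are no learned weights or nonlinearities, and the node names and the `self_weight` parameter are hypothetical rather than taken from the paper.

```python
def mean_message_passing(features, edges, self_weight=0.5):
    """One round of mean-aggregation message passing on an undirected graph.

    features: dict mapping node name -> embedding (list of floats)
    edges: list of (node, node) pairs
    Each node's new embedding mixes its own vector with the mean of its
    neighbors' vectors; isolated nodes keep their own vector.
    """
    neighbors = {node: [] for node in features}
    for src, dst in edges:
        neighbors[src].append(dst)
        neighbors[dst].append(src)

    updated = {}
    for node, vec in features.items():
        nbrs = neighbors[node]
        if nbrs:
            mean = [
                sum(features[n][d] for n in nbrs) / len(nbrs)
                for d in range(len(vec))
            ]
        else:
            mean = vec
        updated[node] = [
            self_weight * x + (1 - self_weight) * m
            for x, m in zip(vec, mean)
        ]
    return updated


# Toy theorem dependency graph (assumed): theorem_c depends on two lemmas.
node_features = {
    "lemma_a": [1.0, 0.0],
    "lemma_b": [0.0, 1.0],
    "theorem_c": [0.0, 0.0],
}
dependency_edges = [("lemma_a", "theorem_c"), ("lemma_b", "theorem_c")]
encoded = mean_message_passing(node_features, dependency_edges)
```

After one round, `theorem_c` carries information from both lemmas it depends on. Stacking several rounds lets information travel along longer dependency or citation chains.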
Novelty
This is the first work to systematically combine scientific citation graphs with formal theorem dependency graphs for the purpose of future theorem generation. Unlike prior models that focus solely on textual or formal structures, COMPOSE fuses these sources to produce more grounded and mathematically rigorous claims. Its dual-graph architecture, alignment strategies, and training regimen represent a significant step forward in integrating informal scientific narratives with formal logical dependencies, opening new avenues for automated mathematical reasoning.
Limitations
- The model's performance heavily depends on the quality of the constructed graphs and alignment accuracy; errors in these steps can lead to less coherent or incorrect generated claims, especially in less-studied or emerging fields.
- Currently tailored for mathematics, its applicability to other scientific domains remains to be validated, as different fields may require different graph structures and alignment strategies.
- Computational costs are high due to large-scale graph encoding and fine-tuning, which may limit real-time deployment or scaling to larger datasets without further optimization.
Future Work
Future research could explore multi-modal knowledge integration, such as incorporating figures, code snippets, or experimental data, to enrich the reasoning process. Developing more efficient graph construction and alignment algorithms will be crucial for scalability. Extending this framework to other scientific disciplines, like physics or biology, could broaden its impact. Additionally, integrating reinforcement learning to optimize the creativity and correctness of generated claims, and deploying in real-world scientific workflows, are promising directions.
AI Executive Summary
The rapid growth of scientific literature, especially in mathematics, presents both opportunities and challenges for automated knowledge discovery. Traditional models have struggled to synthesize the vast amount of informal research narratives with the formal logical structures underlying mathematical theorems. This disconnect hampers the ability of AI systems to generate meaningful future results, limiting their usefulness in advancing scientific frontiers.
Addressing this challenge, the authors propose COMPOSE, a novel dual-graph framework that integrates scientific citation networks with formal theorem dependency graphs. This architecture leverages the complementary strengths of both sources: the citation graph captures the evolution and direction of research, while the formal graph encodes the logical dependencies among theorems. By encoding these graphs separately with dedicated GNNs and then fusing their representations through cross-attention, COMPOSE creates a rich, grounded knowledge base that conditions a language model to generate plausible future theorems.
The construction of a large-scale dataset, comprising 108,000 paired scientific-formal graph samples from arXiv and Mathlib, underpins the training process. The dataset employs dense retrieval and alignment strategies to link informal research narratives with formal theorem structures, enabling effective supervision. The training proceeds in two stages: first, optimizing graph encoders with link prediction and alignment objectives; second, fine-tuning a math-specialized language model with graph-conditioned generation loss.
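The first training stage combines link prediction with an alignment objective. The summary does not give the exact loss formulas, so the sketch below assumes two common choices: binary cross-entropy on dot-product edge scores for link prediction, and an InfoNCE-style contrastive loss for aligning paper embeddings with their paired formal-graph embeddings. All names, the temperature value, and the toy vectors are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def link_prediction_loss(embeddings, positive_edges, negative_edges):
    """Binary cross-entropy on sigmoid(dot product) edge scores."""
    loss = 0.0
    for u, v in positive_edges:
        loss -= math.log(sigmoid(dot(embeddings[u], embeddings[v])))
    for u, v in negative_edges:
        loss -= math.log(1.0 - sigmoid(dot(embeddings[u], embeddings[v])))
    return loss / (len(positive_edges) + len(negative_edges))


def info_nce(paper_embs, formal_embs, temperature=0.1):
    """Mean InfoNCE loss where formal_embs[i] is the positive for paper_embs[i]."""
    losses = []
    for i, paper in enumerate(paper_embs):
        logits = [dot(paper, formal) / temperature for formal in formal_embs]
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)


# Toy data (assumed): two papers and their paired formal-graph embeddings.
papers = [[1.0, 0.0], [0.0, 1.0]]
formal_aligned = [[1.0, 0.0], [0.0, 1.0]]
formal_swapped = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = info_nce(papers, formal_aligned)
swapped_loss = info_nce(papers, formal_swapped)

# Toy graph (assumed): a-b is a true edge, a-c is a sampled non-edge.
node_embs = {"a": [2.0, 0.0], "b": [2.0, 0.0], "c": [-2.0, 0.0]}
edge_loss = link_prediction_loss(node_embs, [("a", "b")], [("a", "c")])
```

The alignment loss is near zero when each paper is closest to its own formal graph and large when the pairing is swapped, which is the signal that pulls the two encoders into a shared space before the generation model is fine-tuned.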
Experimental results demonstrate that COMPOSE outperforms existing baselines in multiple metrics. It achieves a Tgt-Sim of 0.525, surpassing models that rely on single sources, and ranks the correct future papers in the top 10 in over half of the cases. Qualitative assessments via LLM judges confirm that the generated claims are more mathematically rich, precise, and logically consistent. Ablation studies further validate the importance of dual-graph fusion, with performance degrading when either graph source or fusion mechanism is removed.
This work marks a significant step toward automated mathematical discovery, offering a grounded, scalable approach to predicting and generating future research directions. Its implications extend beyond mathematics, hinting at broader applications in scientific knowledge synthesis and AI-assisted research. Nonetheless, challenges remain in graph construction quality, computational efficiency, and generalization to other domains. Suggested directions for future work include multi-modal data integration, cross-disciplinary adaptation, and real-world deployment, with the longer-term aim of scientific reasoning systems that can accelerate innovation across fields.
Deep Dive
Abstract
A plausible future mathematical claim must satisfy two constraints: it should follow the direction of prior work and respect the formal dependencies that constrain what can validly follow. Existing approaches typically model only one of these sources, producing claims that are either weakly grounded or insufficiently motivated. We introduce grounded future mathematical generation, where the goal is to generate a plausible future theorem-like claim for an anchor paper using two complementary sources of context: its scientific citation graph and aligned formal theorem dependency graph. To address this setting, we propose COMPOSE, a dual-graph framework that conditions a language model on both scientific citation context and formal theorem structure. To support this setting, we construct a dataset of 108K paired scientific-formal graph examples from arXiv and Mathlib, together with a benchmark of 47K future papers from 2024–2025. Experiments show that COMPOSE outperforms strong baselines on retrieval to real future papers and achieves the best overall performance under LLM-judge evaluation, producing more grounded and mathematically richer outputs. These results show that future mathematical generation benefits from combining scientific context with formal structure. Project page is available at https://david-busbib.github.io/COMPOSE-page/.