MolE-RAG: Molecular Structure-Enhanced Retrieval-Augmented Generation for Chemistry
MolE-RAG integrates literature, molecular features, and structural similarity to enhance LLM-based molecular property prediction, improving ROC-AUC by up to 28 percentage points and reducing RMSE by up to 67%.
Key Findings
Methodology
The proposed MolE-RAG framework is a training-free, molecule-centric retrieval-augmented generation approach that combines three inference-time contexts: (1) textual retrieval with BM25 over a chemistry literature corpus; (2) injection of molecule-specific information, including synonyms, identifiers, functional groups, and physicochemical descriptors; and (3) structure-based retrieval that uses task-adaptive molecular fingerprints for similarity search. The process begins by generating hybrid queries that combine task descriptions, domain-specific keywords, and molecule identifiers to retrieve relevant literature snippets. In parallel, molecular features are extracted from SMILES strings and injected into the prompt to provide explicit chemical information. A set of structurally similar molecules is also retrieved from the training data, ranked by Tanimoto similarity over the selected fingerprints, and used as contextual examples. These sources are combined flexibly depending on the task and model to form an augmented prompt from which the LLM predicts molecular properties. Evaluations across nine datasets show that this multi-source context improves prediction accuracy, with ROC-AUC gains of up to 28 percentage points and RMSE reductions of up to 67% across models including GPT-4, Qwen, and ChemDFM.
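The textual retrieval step can be illustrated with a minimal Okapi BM25 scorer. This is a sketch under stated assumptions, not the authors' implementation: the whitespace tokenizer, the `hybrid_query` helper, the toy corpus, the smoothed IDF variant, and the parameter values `k1=1.5` and `b=0.75` are all illustrative choices that the source does not specify.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for illustration).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def hybrid_query(task, keywords, identifiers):
    # Hypothetical helper: concatenate task text, keywords, and molecule names.
    return " ".join([task, *keywords, *identifiers])


# Toy corpus standing in for literature snippets.
corpus = [
    "aspirin inhibits cyclooxygenase and shows low blood-brain barrier penetration",
    "solvation free energy of small molecules in water",
    "caffeine crosses the blood-brain barrier readily",
]
query = hybrid_query("blood-brain barrier penetration", ["permeability"], ["aspirin"])
scores = bm25_scores(query, corpus)
ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranked)  # document indices, best match first
```

Because BM25 matches terms lexically, adding molecule identifiers to the query pulls in snippets that name the compound directly, which is presumably the motivation for the hybrid query.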
Key Results
- On six classification datasets, MolE-RAG improved ROC-AUC scores by 15-28 percentage points, with the largest gains seen in models with initially low performance, such as GPT-4o-mini, which increased from 54.9 to 74.7. In regression tasks, RMSE decreased by over 50%, with FreeSolv showing a maximum reduction of 67% (from 12.585 to 4.128). The effectiveness of different context sources varied across models; textual retrieval was most beneficial for some, while structural similarity was more impactful for others. Smaller open-source models like Qwen3 and Mistral, which initially underperformed, achieved performance comparable to or surpassing some proprietary models after applying MolE-RAG, highlighting its role in compensating for limited model capacity.
- The experimental results confirm that integrating external knowledge sources through retrieval significantly enhances the chemical reasoning capabilities of LLMs. The performance improvements were consistent across diverse datasets, demonstrating the robustness of the approach. Notably, the choice of molecular fingerprint (e.g., ECFP4, MACCS) and retrieval strategy was task-dependent, with validation-based selection further optimizing results. The framework's flexibility allows for tailored configurations, making it adaptable to various chemical prediction tasks and model architectures.
- Overall, MolE-RAG demonstrates that multi-source, inference-time knowledge integration can bridge the gap between SMILES representations and chemical understanding, enabling more accurate and reliable molecular property predictions without additional training. This approach paves the way for scalable, knowledge-rich AI systems in drug discovery, materials science, and beyond, where external scientific knowledge can be dynamically incorporated to inform decision-making.
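The results above mix two kinds of improvement: ROC-AUC gains are absolute differences in percentage points, while RMSE reductions are relative to the baseline value. The short sketch below recomputes both from the figures quoted in the first bullet; the helper names are illustrative.

```python
def point_gain(before, after):
    """Absolute change, in percentage points when the inputs are percentages."""
    return after - before


def relative_reduction(before, after):
    """Fractional decrease relative to the baseline value."""
    return (before - after) / before


# Figures quoted above: GPT-4o-mini ROC-AUC and FreeSolv RMSE.
auc_gain = point_gain(54.9, 74.7)
rmse_drop = relative_reduction(12.585, 4.128)

print(f"ROC-AUC gain: {auc_gain:.1f} percentage points")  # 19.8
print(f"RMSE reduction: {rmse_drop:.1%}")                 # 67.2%
```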
Significance
This work addresses a fundamental challenge in applying large language models to chemistry: the semantic gap between molecular structure representations and natural language understanding. By introducing a flexible, multi-source retrieval framework, the authors enable LLMs to access external scientific knowledge, chemical descriptors, and structural analogs at inference time. This significantly enhances the models’ reasoning and prediction capabilities, especially in data-scarce scenarios or for complex molecules. The approach reduces reliance on extensive fine-tuning or large annotated datasets, democratizing advanced molecular prediction tools. Its modular design allows easy integration with existing models and datasets, fostering broader adoption in pharmaceutical research, toxicology, and materials engineering. The demonstrated performance gains across multiple datasets and models highlight its potential to accelerate discovery pipelines and improve predictive accuracy in real-world applications.
Technical Contribution
The primary technical innovation lies in the design of a multi-source, inference-time knowledge integration framework that combines textual retrieval (via BM25), chemical feature injection, and structure-based similarity search. Unlike prior work that often relies on fine-tuning or single-source knowledge bases, MolE-RAG operates without additional training, making it highly scalable and adaptable. The framework employs task-adaptive strategies for selecting optimal fingerprints and descriptors, ensuring relevance across diverse tasks. The use of large pre-trained LLMs as the backbone, coupled with dynamic retrieval modules, creates a hybrid system capable of sophisticated chemical reasoning. This approach opens new avenues for knowledge-augmented AI, enabling models to leverage external scientific literature, detailed molecular features, and analogs seamlessly during inference, thus extending the capabilities of existing models beyond their training data.
Novelty
This study is the first to systematically integrate three complementary inference-time knowledge sources—literature retrieval, molecular feature injection, and structural similarity—into a unified, training-free framework for molecular property prediction. While prior works have explored retrieval-augmented methods or molecular similarity search independently, the novelty here is in their combined, flexible application tailored to diverse chemical tasks. The task-adaptive selection of molecular fingerprints and descriptors further distinguishes this work, enabling optimal performance across datasets. This multi-modal, multi-source fusion at inference time represents a significant step forward in making large language models more chemically intelligent without the need for extensive retraining or fine-tuning.
Limitations
- The reliance on predefined molecular fingerprints and similarity metrics may limit the generalization to highly novel or complex molecules outside the chemical space covered by training data. Future work could explore learned representations for more robust similarity assessment.
- Textual retrieval quality depends heavily on the construction of hybrid queries; lexical mismatches or incomplete literature coverage can reduce effectiveness. Incorporating semantic retrieval methods could mitigate this.
- While the framework improves performance without fine-tuning, it introduces additional computational overhead due to retrieval steps, which may impact real-time applications. Optimization of retrieval pipelines is needed for industrial deployment.
Future Work
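Future research will focus on:
- introducing structural representations learned by deep models to improve the accuracy and generalization of structure-similarity retrieval;
- incorporating up-to-date scientific literature and multimodal data (such as images and spectra) to support dynamic knowledge updates and multi-source fusion;
- designing more efficient retrieval strategies and prompt-generation mechanisms to reduce computational cost and improve response time;
- applying the framework to real industrial settings such as drug discovery and materials design to validate its commercial potential and practical utility.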
AI Executive Summary
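Molecular property prediction has long been a core challenge in the digital transformation of drug development and materials science. Traditional approaches depend on extensive experiments and expert knowledge, which makes them costly and slow. In recent years, large language models (LLMs) have begun to show promise in chemistry thanks to their strong natural language understanding, but their grasp of molecular structure remains limited, and the limitation is especially visible when they process structural representations such as SMILES.
To address this bottleneck, the paper proposes MolE-RAG (Molecule-Centric Retrieval-Augmented Generation), a multi-source retrieval-augmented generation method that requires no fine-tuning and aims to improve LLM performance on molecular property prediction. The method combines three kinds of inference-time context. First, it uses BM25 to retrieve relevant passages from a large chemistry literature corpus, giving the model scientific background. Second, it extracts structural information, functional groups, and physicochemical descriptors from the molecule's SMILES string and injects them into the prompt to enrich the molecular representation. Third, it uses task-adaptive molecular fingerprints for structural similarity retrieval, finding structurally similar molecules in the training set to serve as examples.
This multi-source fusion strategy broadens the information the model can reason over and improves molecular property prediction. In experiments on nine public and proprietary datasets, models using MolE-RAG improved ROC-AUC on classification tasks by up to 28 percentage points and reduced RMSE on regression tasks by up to 67%. The benefit of retrieval augmentation is most pronounced when data are limited or model capacity is constrained, to the point that small open-source models approach or surpass some specialized models.
The significance of the study is that it moves past the traditional dependence on large training sets and fine-tuning, offering a new direction for AI applications in chemistry. By introducing a multimodal, multi-source knowledge integration framework, it extends the potential of LLMs for scientific reasoning and lays groundwork for more intelligent drug design and materials innovation. With richer knowledge bases and multimodal information, MolE-RAG could see broader use in real industrial settings and help advance the digitalization of scientific research.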
Deep Analysis
Background
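Chemistry AI has evolved from rule-based methods to machine learning. Early approaches relied on expert knowledge and hand-engineered features, which were inefficient and hard to generalize. In recent years, deep learning, in particular graph neural networks (such as SchNet and GROVER), has achieved breakthroughs in molecular property prediction, but it remains constrained by the expressive power of structural representations. The success of large pretrained language models (such as the GPT series) in natural language processing has spurred a wave of applications in chemistry and given rise to specialized models such as ChemNet and MolInstructions, which attempt to capture chemical knowledge through natural language. However, structural representations such as SMILES leave a semantic gap in model understanding, which limits reasoning ability. Even so, retrieval-augmented methods (such as ChemRAG and MolRAG) have shown the potential to improve model performance substantially without fine-tuning. This background provides the foundation for the multi-source retrieval-augmented framework proposed in the paper.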
Core Problem
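Applying large language models to chemistry currently faces two main bottlenecks. The first is the semantic gap between structural representations (such as SMILES) and natural language, which makes it hard for models to reason accurately about molecular properties. The second is the lack of an effective mechanism for integrating chemical knowledge, which limits performance on complex tasks. Traditional approaches depend on large amounts of labeled data and fine-tuning, which is costly and hard to generalize. Existing knowledge-augmentation techniques mostly rely on a single information source and struggle to cover chemical knowledge comprehensively, particularly for structurally complex or novel molecules. These problems constrain the effectiveness of LLMs in practical applications such as drug design and toxicology assessment, and they call for an efficient, flexible knowledge-fusion strategy that overcomes the limits on representation and reasoning.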
Innovation
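The core innovation of the paper is MolE-RAG, a retrieval-augmented framework that fuses multiple sources of inference-time context. Without any model fine-tuning, it combines literature retrieval, molecular feature injection, and structural similarity search to improve chemical reasoning. The specific innovations are:
- BM25-based text retrieval that draws relevant background from a large body of chemistry literature, compensating for gaps in the model's domain knowledge;
- extraction of structural information, functional groups, and physicochemical descriptors from SMILES, injected into the prompt to enrich the model's representation of the molecule;
- structural similarity retrieval with task-adaptive molecular fingerprints, which supplies analogous examples and strengthens the model's structural reasoning.
This multimodal, multi-source fusion strategy moves beyond the limits of a single knowledge source and gives the model a more complete basis for chemical reasoning.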
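As an illustration of the structural retrieval idea, the sketch below ranks training molecules by Tanimoto similarity and returns the top matches. It assumes fingerprints have already been computed elsewhere (for example with a cheminformatics toolkit) and are represented as Python sets of on-bit indices; the toy fingerprints, labels, and function names are hypothetical and not taken from the paper.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def top_k_similar(query_fp, train, k=5):
    """Return the k (similarity, smiles, label) tuples most similar to the query."""
    scored = [(tanimoto(query_fp, fp), smiles, label) for smiles, fp, label in train]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy training set: (SMILES, fingerprint on-bit set, label). The bit sets are
# made up for illustration and are NOT real ECFP4 or MACCS fingerprints.
train = [
    ("CCO", {1, 2, 3}, 1),
    ("CCN", {1, 2, 4}, 0),
    ("c1ccccc1", {7, 8, 9}, 0),
]

neighbors = top_k_similar({1, 2, 3, 5}, train, k=2)
for similarity, smiles, label in neighbors:
    print(f"{smiles}: similarity={similarity:.2f}, label={label}")
```

Swapping the fingerprint type changes only how the on-bit sets are produced, so a task-adaptive choice of fingerprint can reuse the same retrieval function.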
Methodology
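- Input: the SMILES string of the molecule to be predicted and a task description.
- Text retrieval: an LLM generates a hybrid query that combines the task description with chemistry keywords, and BM25 retrieves relevant passages from the chemistry literature corpus.
- Molecular feature injection: functional groups and physicochemical descriptors (such as LogP and molecular weight) are extracted from the SMILES string and injected into the prompt to enrich the molecular representation.
- Structural similarity retrieval: molecular fingerprints (such as ECFP4) are computed, and Tanimoto similarity is used to retrieve structurally similar molecules from the training set as examples.
- Prompt construction: the task instruction, retrieved text, molecular features, and similar molecules are assembled into an augmented prompt.
- Prediction: the LLM generates the final molecular property prediction.
- Multi-source fusion: the combination of context sources is chosen flexibly according to the task and model performance to optimize prediction quality.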
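The prompt-construction step can be sketched as a function that assembles whichever context sources are enabled. The section headings, the output instruction, and the `build_prompt` signature are assumptions made for illustration; the source does not give the actual prompt template.

```python
def build_prompt(task, smiles, snippets=(), features=None, neighbors=()):
    """Assemble an augmented prompt from the enabled context sources.

    Any source left empty is skipped, which mirrors choosing a different
    combination of contexts per task or model.
    """
    parts = [f"Task: {task}", f"Molecule (SMILES): {smiles}"]
    if snippets:
        parts.append("Relevant literature:")
        parts.extend(f"- {text}" for text in snippets)
    if features:
        parts.append("Molecular features:")
        parts.extend(f"- {name}: {value}" for name, value in features.items())
    if neighbors:
        parts.append("Structurally similar training molecules:")
        parts.extend(
            f"- {smi} (similarity {sim:.2f}): label {label}"
            for smi, sim, label in neighbors
        )
    parts.append("Answer with the predicted property only.")
    return "\n".join(parts)


# Toy inputs; the snippet text, neighbor, and label are placeholders.
full_prompt = build_prompt(
    task="Predict whether the molecule is toxic (yes/no).",
    smiles="CCO",
    snippets=["Placeholder literature snippet about short-chain alcohols."],
    features={"molecular weight": 46.07, "functional groups": "hydroxyl"},
    neighbors=[("CCCO", 0.60, "no")],
)
baseline_prompt = build_prompt("Predict whether the molecule is toxic (yes/no).", "CCO")
print(full_prompt)
```

Calling the function with only the task and SMILES reproduces a SMILES-only baseline prompt, which makes ablations over context sources a matter of omitting arguments.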
Experiments
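- Datasets: nine molecular property prediction tasks, including public (e.g., MoleculeNet) and proprietary datasets, with scaffold splitting and a train/validation/test ratio of 8:1:1.
- Models: several classes of LLMs (GPT-4, Qwen, ChemDFM, and others), evaluated in a zero-shot inference setting.
- Baseline: the same models given only the SMILES string.
- Metrics: ROC-AUC for classification tasks and RMSE for regression tasks.
- Hyperparameters: the top 5 similar molecules are retrieved, and the fingerprint type is selected on the validation set.
- Ablation: context sources are removed one at a time to analyze their contribution to performance.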
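For reference, the two evaluation metrics can be computed as in the minimal sketch below. ROC-AUC is written in its pairwise form (the probability that a random positive is scored above a random negative, with ties counted as one half), which is quadratic in the number of examples and intended for illustration only; the toy labels and scores are made up.

```python
import math


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")

    def pair_credit(p, n):
        # Full credit if the positive outranks the negative, half for a tie.
        if p > n:
            return 1.0
        return 0.5 if p == n else 0.0

    total = sum(pair_credit(p, n) for p in pos for n in neg)
    return total / (len(pos) * len(neg))


def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


auc = roc_auc([1, 0, 1, 0], [0.9, 0.4, 0.35, 0.2])
err = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(f"ROC-AUC = {auc:.2f}")  # 0.75: three of four pairs are ordered correctly
print(f"RMSE = {err:.4f}")     # sqrt(4/3), about 1.1547
```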
Results
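- On classification tasks, MolE-RAG improves performance substantially, with a maximum ROC-AUC gain of 28 percentage points (e.g., Qwen3 from 53.0 to 80.1); the effect is strongest for models with weak baseline performance.
- On regression tasks, RMSE falls by more than 50% on average, with the largest reduction, 67%, on FreeSolv (e.g., Mistral from 12.585 to 4.128).
- Models differ in which context source they depend on: text retrieval works best for some, while structural similarity works better for others.
- Small open-source models (Qwen3, Mistral) come close to, and in some cases surpass, specialized models (such as ChemDFM) once MolE-RAG is applied, confirming the compensating effect of retrieval augmentation.
- The choice of structural fingerprint (e.g., ECFP4, MACCS) has a marked, task-dependent effect, and the best fingerprint is selected using the validation set.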
Applications
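- Immediate applications: the method can be used for drug screening, toxicology assessment, materials design, and similar scenarios, drawing on existing chemistry literature and structure databases to improve model performance quickly and lower experimental costs.
- Long-term vision: combined with up-to-date scientific literature and multimodal data, it could underpin intelligent drug discovery platforms that automate design and optimization end to end, supporting precision medicine and faster development of new materials.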
Limitations & Outlook
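- Structural similarity retrieval depends on predefined fingerprints and similarity metrics, and may perform poorly in some regions of chemical space.
- Literature retrieval currently relies mainly on keyword matching and may miss deeper knowledge connections.
- Information fusion remains inadequate for complex or heterogeneous chemical tasks; richer knowledge sources and multimodal information are needed to improve robustness.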
Abstract
Large language models (LLMs) have shown promise for molecular property prediction, but their ability to reason over chemical structures remains limited, as molecular representations such as SMILES differ substantially from the natural language on which LLMs are primarily trained. To bridge this semantic and chemical knowledge gap, we propose MolE-RAG, a training-free, molecule-centric retrieval-augmented generation framework for LLM-based molecular property prediction. MolE-RAG augments each prediction with three complementary sources of inference-time context: retrieved chemistry literature, molecule-specific information including compound synonyms, identifiers, functional group annotations, and physicochemical descriptors, and structurally similar molecules retrieved from the training set. We evaluate MolE-RAG across nine molecular property prediction tasks using proprietary, chemistry-specialized, and open-source LLMs. Across general-purpose LLMs, MolE-RAG improves ROC-AUC by up to 28 percentage points on classification tasks and reduces regression RMSE by up to 67% relative to a SMILES-only baseline. We further find that the utility of each context source varies across models and tasks, with different models benefiting most from textual retrieval, molecular context, or structural retrieval. These results suggest that molecule-centric retrieval can improve LLM-based molecular property prediction without model fine-tuning while providing a flexible framework for integrating heterogeneous chemical knowledge at inference time.
References (20)
Knowledge graph-enhanced molecular contrastive learning with functional prompt
Yin Fang, Qiang Zhang, Ningyu Zhang et al.
MoleculeNet: a benchmark for molecular machine learning
Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg et al.
The Probabilistic Relevance Framework: BM25 and Beyond
S. Robertson, H. Zaragoza
Self-Supervised Graph Transformer on Large-Scale Molecular Data
Yu Rong, Yatao Bian, Tingyang Xu et al.
SchNet: A continuous-filter convolutional neural network for modeling quantum interactions
Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix et al.
Benchmarking Retrieval-Augmented Generation for Chemistry
Xianrui Zhong, Bowen Jin, Siru Ouyang et al.
Molecular Property Prediction: A Multilevel Quantum Interactions Modeling Perspective
Chengqiang Lu, Qi Liu, Chao Wang et al.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Patrick Lewis, Ethan Perez, Aleksandra Piktus et al.
Addressing toxicity risk when designing and selecting compounds in early drug discovery.
M. Segall, Chris Barber
Molecular property prediction: recent trends in the era of artificial intelligence.
Jie Shen, C. Nicolaou
Molecular fingerprint similarity search in virtual screening.
Adrià Cereto-Massagué, María José Ojeda, Cristina Valls et al.
AccFG: Accurate Functional Group Extraction and Molecular Structure Comparison
Xuan Liu, Sarathkrishna Swaminathan, Dmitry Zubarev et al.
SPECTER: Document-level Representation Learning using Citation-informed Transformers
Arman Cohan, Sergey Feldman, Iz Beltagy et al.
Improvement of Prediction Performance With Conjoint Molecular Fingerprint in Deep Learning
Liangxu Xie, Lei Xu, R. Kong et al.
Bi-level Contrastive Learning for Knowledge-Enhanced Molecule Representations
Pengcheng Jiang, Cao Xiao, Tianfan Fu et al.
Molecular similarity: a key technique in molecular informatics.
A. Bender, R. Glen
Understanding the Limitations of Deep Models for Molecular property prediction: Insights and Solutions
Jun Xia, Lecheng Zhang, Xiao Zhu et al.
Concepts and applications of molecular similarity
Marvin Johnson, G. Maggiora
Extended-Connectivity Fingerprints
David Rogers, M. Hahn
One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome
A. Capecchi, Daniel Probst, J. Reymond