Evaluating the Progression of Large Language Model Capabilities for Small-Molecule Drug Design

TL;DR

A smaller model post-trained with reinforcement learning excels in small-molecule drug design tasks, rivaling state-of-the-art frontier models.

cs.LG · 2026-04-18
Shriram Chennakesavalu, Kirill Shmilovich, Hayley Weir, Colin Grambow, John Bradshaw, Patricia Suriana, Chen Cheng, Kangway Chuang

Keywords: large language models · small-molecule drug design · reinforcement learning · chemical tasks · model evaluation

Key Findings

Methodology

This study introduces a suite of chemical tasks, including molecular property prediction, molecular representation transformations, and molecular design, structured as reinforcement learning environments. The study finds that, after post-training on these tasks, smaller models can perform comparably to state-of-the-art models in small-molecule drug design despite starting from weaker base models.

Key Results

  • Result 1: The post-trained smaller model (Aspen) excels at multi-turn molecular design, performing comparably to closed frontier models, particularly in simulated real-world lead optimization.
  • Result 2: On RDKit property prediction tasks, Aspen substantially improved prediction accuracy for H-bond donor and acceptor counts, reaching 0.80 and 0.85, respectively.
  • Result 3: On the multiproperty constrained generation task, Aspen's valid-response rate rose from 0.77 to 1.00, and its all-constraint satisfaction rate improved from 0.09 to 0.21.
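The two generation metrics in Result 3 can be made concrete with a minimal sketch. The helper names, toy lookup table, and constraints below are illustrative assumptions, not the paper's actual scoring code (which would use a real SMILES parser such as RDKit):

```python
# Sketch of the two generation metrics: valid-response rate (fraction of
# responses that parse as a molecule) and all-constraint satisfaction rate
# (fraction where every property constraint holds). Parser and property
# names are illustrative stand-ins, not the paper's implementation.

def evaluate_generations(responses, constraints, parse, properties):
    """responses: list of model outputs (e.g. SMILES strings).
    constraints: dict mapping property name -> predicate on its value.
    parse: returns a molecule object, or None if the string is invalid.
    properties: computes a dict of property values for a molecule."""
    valid = [m for m in (parse(r) for r in responses) if m is not None]
    satisfied = sum(
        1 for m in valid
        if all(pred(properties(m)[name]) for name, pred in constraints.items())
    )
    n = len(responses)
    return {"valid_rate": len(valid) / n, "all_constraint_rate": satisfied / n}

# Toy usage with a fake parser/property function standing in for RDKit:
mols = {"CCO": {"mw": 46.07, "hbd": 1}, "CCN": {"mw": 45.08, "hbd": 1}}
metrics = evaluate_generations(
    responses=["CCO", "CCN", "not-a-molecule", "CCO"],
    constraints={"mw": lambda v: v < 50, "hbd": lambda v: v >= 1},
    parse=lambda s: s if s in mols else None,
    properties=lambda m: mols[m],
)
print(metrics)  # valid_rate 0.75, all_constraint_rate 0.75
```

Note that the all-constraint rate is computed over all responses, not just the valid ones, so an invalid response counts as a failure for both metrics.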

Significance

This study demonstrates the potential of large language models in small-molecule drug design by structuring chemical tasks as reinforcement learning environments. Post-training allows smaller models to achieve performance comparable to frontier models, providing a practical route for drug discovery, especially in low-data experimental settings.

Technical Contribution

The technical contribution lies in structuring small-molecule drug design tasks as reinforcement learning environments and demonstrating how post-training can significantly enhance model performance. This approach enables smaller models to excel in chemical tasks, narrowing the capability gap with state-of-the-art models.

Novelty

This study is the first to systematically structure small-molecule drug design tasks as reinforcement learning environments and demonstrates that post-training can significantly enhance model performance, particularly in low-data scenarios.

Limitations

  • Limitation 1: Despite post-training improvements, models still face challenges in low-data experimental settings, particularly in DMPK solubility prediction tasks where all models have negative R² values.
  • Limitation 2: In molecular representation transformation tasks, the Aspen model still achieves near-zero accuracy on the most challenging nomenclature and representation tasks.
  • Limitation 3: In the multiproperty constrained generation task, although Aspen improved, constraint composition remains the core difficulty.

Future Work

Future research directions include exploring other training procedures, such as midtraining, to inject new knowledge into the base model and further enhance performance in chemical tasks.

AI Executive Summary

Large language models (LLMs) have shown potential to accelerate small-molecule drug design, yet their practical utility remains unclear due to the lack of benchmarks reflecting real-world scenarios. In this study, we introduce a suite of chemically-grounded tasks spanning molecular property prediction, molecular representation transformations, and molecular design, formulated as reinforcement learning (RL) environments, enabling a unified approach for evaluation and post-training.

Across three model families, we find that frontier models are increasingly proficient at chemical tasks, but there is significant room for improvement, especially in experimental settings with low data. Critically, we show that RL-based post-training can substantially improve performance. A smaller model post-trained on our environments becomes competitive with state-of-the-art frontier models, despite a significantly weaker base model.

This suggests a practical route toward employing LLMs in drug discovery: by combining carefully designed evaluation tasks with targeted post-training, we can both elucidate and close critical capability gaps.

In our experimental design, we utilized RDKit property prediction tasks, experimental property prediction tasks, multiple choice tasks, molecular representation transformation tasks, and multiproperty constrained generation tasks. These tasks evaluate a model's ability to meaningfully reason about small-molecule chemistry in real-world settings. Experimental results show that the Aspen model excels in several tasks, particularly in the multiproperty constrained generation task, where its valid response rate increased from 0.77 to 1.00.

Despite significant progress, models still face challenges in low-data experimental settings, particularly in DMPK solubility prediction tasks where all models have negative R² values. Future research directions include exploring other training procedures, such as midtraining, to inject new knowledge into the base model and further enhance performance in chemical tasks.

Deep Analysis

Background

In recent years, large language models (LLMs) have demonstrated remarkable capabilities across various domains, particularly in natural language processing and generation tasks. However, their practical application in small-molecule drug design remains limited, partly due to the lack of benchmarks that reflect real-world scenarios. Drug discovery is a complex process involving a vast array of computational, experimental, and clinical methods. Designing generalist systems that can seamlessly synthesize information and leverage tools across the full drug discovery pipeline can potentially reduce the time and cost of designing a drug. Recently, LLMs have been deployed in various drug discovery contexts, including target identification, lead optimization, and toxicity prediction. However, these applications are limited by the performance of the base LLM, especially in fundamental biological and chemical tasks.

Core Problem

Small-molecule drug design is a critical component of drug discovery, involving tasks such as molecular property prediction, molecular representation transformations, and molecular design. However, existing LLMs have limited performance in these tasks, particularly in low-data experimental settings. Improving LLM performance in small-molecule drug design tasks, especially under low-data conditions, is a significant challenge in current research.

Innovation

The core innovation of this study lies in structuring small-molecule drug design tasks as reinforcement learning environments and significantly enhancing model performance through post-training. Specifically, we designed a suite of chemical tasks, including molecular property prediction, molecular representation transformations, and molecular design, structured as RL environments. This approach enables smaller models to excel in chemical tasks, narrowing the capability gap with state-of-the-art models.

Methodology

  • Task Design: Construct a suite of chemical tasks, including molecular property prediction, molecular representation transformations, and molecular design.
  • RL Environment: Structure tasks as RL environments, enabling a unified approach for evaluation and post-training.
  • Model Selection: Evaluate three model families, including GPT-5, Claude Opus 4, and Qwen-30B-A3B.
  • Post-Training: Conduct RL post-training on smaller models and evaluate their performance on the chemical tasks.
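The paper's actual environments are not reproduced in this summary, but the RL-environment framing above can be sketched as a single-turn task with a reset/step interface. The class name, prompt format, and toy task table below are illustrative assumptions; real labels would come from RDKit or experimental data:

```python
import random

# Minimal sketch of a chemical task wrapped as an RL environment:
# reset() emits a prompt, step() scores the model's answer as a reward.
# The task data below is a toy stand-in for RDKit-computed labels.
TASKS = [
    {"smiles": "CCO", "question": "How many H-bond donors?", "answer": "1"},
    {"smiles": "CC(=O)O", "question": "How many H-bond donors?", "answer": "1"},
    {"smiles": "CCN", "question": "How many H-bond donors?", "answer": "1"},
]

class PropertyPredictionEnv:
    def __init__(self, tasks, seed=0):
        self.tasks = tasks
        self.rng = random.Random(seed)
        self.current = None

    def reset(self):
        # Sample a task and return its prompt as the observation.
        self.current = self.rng.choice(self.tasks)
        return f"SMILES: {self.current['smiles']}\n{self.current['question']}"

    def step(self, answer: str):
        # Binary reward: 1.0 for an exact match, 0.0 otherwise. Episodes
        # are single-turn, so the environment is done after one step.
        reward = 1.0 if answer.strip() == self.current["answer"] else 0.0
        return reward, True

env = PropertyPredictionEnv(TASKS)
prompt = env.reset()
reward, done = env.step("1")
print(reward, done)  # 1.0 True
```

The same interface extends naturally to the paper's multi-turn design tasks by having step() return a new observation and done=False until the episode budget is exhausted.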

Experiments

The experimental design includes multiple chemical tasks: RDKit property prediction, experimental property prediction, multiple choice tasks, molecular representation transformations, and multiproperty constrained generation. These tasks evaluate a model's ability to reason about small-molecule chemistry in real-world settings. We utilized various datasets, including internal potency and DMPK datasets and the FS-Mol dataset. Results show that the Aspen model excels in several tasks, particularly in the multiproperty constrained generation task, where its valid response rate increased from 0.77 to 1.00.

Results

Experimental results demonstrate that the Aspen model excels in several tasks. In RDKit property prediction tasks, the Aspen model significantly improved the prediction accuracy of H-bond donor and acceptor counts, reaching 0.80 and 0.85, respectively. In the multiproperty constrained generation task, Aspen's valid response rate increased from 0.77 to 1.00, and the all-constraint satisfaction rate improved from 0.09 to 0.21. Despite significant progress, models still face challenges in low-data experimental settings, particularly in DMPK solubility prediction tasks where all models have negative R² values.
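A negative R² is possible because R² compares a model against the trivial baseline of always predicting the mean of the observations; a model with larger squared error than that baseline scores below zero. A small self-contained illustration (toy numbers, not the paper's data):

```python
# R² = 1 - SS_res / SS_tot. It goes negative whenever the model's squared
# error exceeds that of simply predicting the mean of the observations,
# which is how all evaluated models score on DMPK solubility.

def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
good = [1.1, 1.9, 3.2, 3.8]   # close to the targets
bad = [4.0, 1.0, 4.0, 1.0]    # worse than predicting the mean (2.5)

print(round(r_squared(y_true, good), 3))  # 0.98
print(r_squared(y_true, bad))             # -3.0
```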

Applications

The applications of this study include various stages of drug discovery, such as target identification, lead optimization, and toxicity prediction. By combining carefully-designed evaluation tasks with targeted post-training, we can elucidate and close critical capability gaps, providing a practical route for drug discovery.

Limitations & Outlook

Despite post-training improvements, models still face challenges in low-data experimental settings, particularly in DMPK solubility prediction tasks where all models have negative R² values. Additionally, in molecular representation transformation tasks, the Aspen model still achieves near-zero accuracy on the most challenging nomenclature and representation tasks. Future research directions include exploring other training procedures, such as midtraining, to inject new knowledge into the base model and further enhance performance in chemical tasks.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen, and a large language model is like a super chef assistant. This assistant not only helps you look up recipes but also adjusts them to your taste and even predicts dishes you might like. Just as a cook creates meals from different ingredients and tools, in drug design we need to design effective drugs based on different molecular properties. The large language model acts like this super chef assistant, helping us quickly find the right molecular combinations and thus speeding up drug development. Through reinforcement learning, the assistant keeps improving its skills; even when ingredients are limited, it can still create delicious dishes. That is the role of large language models in small-molecule drug design: through learning and training, they help scientists design new drugs more quickly and effectively.

ELI14 (explained like you're 14)

Hey there! Imagine you're playing a super complex game with all sorts of tasks, like finding hidden treasures or unlocking new levels. A large language model is like the ultimate game assistant: it helps you find the fastest routes and even predicts enemy moves, making it easier to win. In the science world, this assistant is used to design new drugs. Scientists set it up with various tasks, like predicting molecular properties or designing new molecules. Through continuous training, the assistant gets smarter and can perform well even when there isn't much data. Just as the game assistant helps you defeat the toughest bosses, in drug design the large language model helps scientists find the most effective candidate molecules. Isn't that cool?

Glossary

Large Language Model (LLM)

A large-scale machine learning model capable of understanding and generating natural language, typically containing billions of parameters.

Used in this paper to evaluate capabilities in small-molecule drug design tasks.

Small-Molecule Drug Design

The process of discovering new drugs by designing and optimizing small-molecule compounds.

The focus of this study, evaluating LLM capabilities in this task.

Reinforcement Learning (RL)

A machine learning method that trains models through rewards and punishments to improve performance on specific tasks.

Used for post-training LLMs to enhance performance in chemical tasks.

RDKit

An open-source toolkit for cheminformatics and molecular modeling.

Used to evaluate model performance in molecular property prediction tasks.

DMPK (Drug Metabolism and Pharmacokinetics)

The study of drug metabolism processes and pharmacokinetic properties within the body.

Used to evaluate model performance in experimental property prediction tasks.

SMILES (Simplified Molecular Input Line Entry System)

A text format used to describe molecular structures.

Used as input and output in molecular representation transformation tasks.

IUPAC Nomenclature

A set of rules established by the International Union of Pure and Applied Chemistry for naming chemical substances.

Used to evaluate model performance in molecular representation transformation tasks.

Multiproperty Constrained Generation

Generating molecules that satisfy multiple property constraints.

Used to evaluate model capability in complex molecular design tasks.

Qwen-30B-A3B

A Mixture-of-Experts language model with 30 billion total parameters, of which roughly 3 billion are active per token (the "A3B" in its name).

One of the models evaluated in this paper.

Claude Opus 4

A frontier large language model used for natural language processing tasks.

One of the models evaluated in this paper.

Open Questions (unanswered questions from this research)

  1. How can we improve large language model performance in drug design tasks under low-data conditions? Current methods perform poorly when experimental data are limited, particularly on DMPK solubility prediction, where all models have negative R² values.
  2. How can we enhance model accuracy on the most challenging nomenclature and representation tasks in molecular representation transformation? Current models perform poorly on IUPAC→SMILES and SMILES→IUPAC conversion.
  3. How can we design more effective reward functions for complex chemical tasks? Current reward functions fail to provide a sufficient learning signal on some tasks.
  4. In the multiproperty constrained generation task, how can we improve models' ability to compose constraints? Although Aspen improved, constraint composition remains the core difficulty.
  5. How can we inject new knowledge into the base model to enhance its performance on chemical tasks? Current methods fail to significantly improve performance on certain tasks.
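One commonly discussed remedy for the reward-signal and constraint-composition questions is reward shaping: granting partial credit per satisfied constraint instead of an all-or-nothing signal. The paper does not specify its reward design, so the sketch below is a generic illustration with assumed names and thresholds:

```python
# Two candidate rewards for multiproperty constrained generation.
# Sparse: 1.0 only if every constraint holds (weak learning signal early on).
# Shaped: fraction of constraints satisfied (denser signal, though it can be
# gamed if individual constraints are easy to satisfy in isolation).

def sparse_reward(values, constraints):
    return 1.0 if all(pred(values[k]) for k, pred in constraints.items()) else 0.0

def shaped_reward(values, constraints):
    hits = sum(1 for k, pred in constraints.items() if pred(values[k]))
    return hits / len(constraints)

constraints = {
    "mw": lambda v: 200 <= v <= 500,  # molecular weight window
    "logp": lambda v: v <= 5,         # lipophilicity cap
    "hbd": lambda v: v <= 5,          # H-bond donor cap
}
candidate = {"mw": 320.4, "logp": 6.1, "hbd": 2}  # violates the logp cap

print(sparse_reward(candidate, constraints))  # 0.0
print(shaped_reward(candidate, constraints))  # ~0.667
```

The shaped variant gives the policy gradient something to climb even before any generation satisfies all constraints at once, which is exactly the regime where the all-constraint rate sits near 0.09.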

Applications

Immediate Applications

Target Identification

Leveraging the reasoning capabilities of large language models, scientists can more quickly identify potential drug targets, accelerating the drug discovery process.

Lead Optimization

Large language models can drive molecular design and optimization, improving the efficacy and safety of lead compounds.

Toxicity Prediction

Through the predictive capabilities of large language models, scientists can more accurately assess compound toxicity, reducing experimental costs.

Long-term Vision

Personalized Drug Design

Leveraging the reasoning capabilities of large language models, scientists could design drugs tailored to individual patients, improving treatment outcomes.

Automated Drug Discovery

LLM-driven automation could eventually cover the full drug discovery pipeline, significantly reducing its time and cost.

Abstract

Large Language Models (LLMs) have the potential to accelerate small molecule drug design due to their ability to reason about information from diverse sources and formats. However, their practical utility remains unclear due to the lack of benchmarks that reflect real-world scenarios. In this work, we introduce a suite of chemically-grounded tasks spanning molecular property prediction, molecular representation transformations, and molecular design. Importantly, we formulate these tasks as reinforcement learning (RL) environments, enabling a unified approach for evaluation and post-training. Across three model families, we find that frontier models are increasingly proficient at chemical tasks, but that there is significant room for improvement, especially in experimental settings with low data. Critically, we show that RL-based post-training can substantially improve performance. A smaller model post-trained on our environments becomes competitive with state-of-the-art frontier models, despite a significantly weaker base model. This suggests a practical route toward employing LLMs in drug discovery; by combining carefully-designed evaluation tasks with targeted post-training, we can both elucidate and close critical capability gaps.

cs.LG physics.chem-ph

References (20)

  1. FS-Mol: A Few-Shot Learning Dataset of Molecules. Megan Stanley, J. Bronskill, Krzysztof Maziarz et al., 2021.
  2. Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? Yang Yue, Zhiqin Chen, Rui Lu et al., 2025.
  3. DAPO: An Open-Source LLM Reinforcement Learning System at Scale. Qiying Yu, Zheng Zhang, Ruofei Zhu et al., 2025.
  4. Training a Scientific Reasoning Model for Chemistry. Siddharth Narayanan, James D. Braza, Ryan-Rhys Griffiths et al., 2025.
  5. Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Mario Krenn, Florian Hase, AkshatKumar Nigam et al., 2019.
  6. Efficient Memory Management for Large Language Model Serving with PagedAttention. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang et al., 2023.
  7. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. Zhihong Shao, Peiyi Wang, Qihao Zhu et al., 2024.
  8. Augmenting large language models with chemistry tools. Andrés M Bran, Sam Cox, Oliver Schilter et al., 2023.
  9. Multitask Deep Learning Models of Combined Industrial Absorption, Distribution, Metabolism, and Excretion Datasets to Improve Generalization. Joseph A Napoli, Michael Reutlinger, Patricia Brandl et al., 2025.
  10. Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows. Wanghan Xu, Yuhao Zhou, Yifan Zhou et al., 2025.
  11. What Will it Take to Fix Benchmarking in Natural Language Understanding? Samuel R. Bowman, George E. Dahl, 2021.
  12. Vinardo: A Scoring Function Based on Autodock Vina Improves Scoring, Docking, and Virtual Screening. R. Quiroga, Marcos A. Villarreal, 2016.
  13. Policy Gradient Methods for Reinforcement Learning with Function Approximation. R. Sutton, David A. McAllester, Satinder Singh et al., 1999.
  14. ReAct: Synergizing Reasoning and Acting in Language Models. Shunyu Yao, Jeffrey Zhao, Dian Yu et al., 2022.
  15. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Carlos E. Jimenez, John Yang, Alexander Wettig et al., 2023.
  16. Towards an AI co-scientist. Juraj Gottweis, Wei-Hung Weng, Alexander Daryin et al., 2025.
  17. OmniScience: A Domain-Specialized LLM for Scientific Reasoning and Discovery. Vignesh Prabhakar, Md. Amirul Islam, Adam A. Atanas et al., 2025.
  18. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. Tri Dao, 2023.
  19. Measuring Massive Multitask Language Understanding. Dan Hendrycks, Collin Burns, Steven Basart et al., 2020.
  20. Assessing the Chemical Intelligence of Large Language Models. Nicholas T. Runcie, Charlotte M. Deane, F. Imrie, 2025.