Flexible Kernels for Protein Property Prediction

TL;DR

This paper introduces flexible sequence kernels based on evolutionary substitution matrices, leveraging Gaussian processes for data-efficient protein property prediction, outperforming embedding-based methods.

cs.LG 🔴 Advanced 2026-06-10 42 views
Martin Jankowiak Yerdos Ordabayev Rudraksh Tuwani Henry N. Ward Hunter Nisonoff James M. McFarland Gevorg Grigoryan
protein prediction kernel methods Gaussian processes structural integration multi-task learning

Key Findings

Methodology

The study develops a class of sequence kernels that integrate evolutionary substitution matrices with local linearity assumptions, employing Gaussian process (GP) models for efficient protein property landscape modeling. Central to this approach is the use of correlation matrices derived from matrices like BLOSUM50, with learnable exponents to modulate similarity scales. Additionally, the paper introduces structure-conditioned kernels (CLOCK) by mapping pre-trained structural embeddings into position-specific correlation matrices, enabling seamless incorporation of structural information. Hyperparameters are optimized via maximum likelihood, and the models are trained within a multi-task framework to leverage shared information across multiple properties. Extensive benchmarking across 21 datasets with over 1800 samples each demonstrates the superiority of sequence-only kernels, especially in data-scarce and extrapolation scenarios, often surpassing models relying on large foundation model embeddings.

Key Results

  • In protein property prediction tasks, the sequence-based Gaussian process models with the proposed kernels achieved Pearson correlation coefficients up to 0.807 and reduced MAE by over 15%, outperforming deep learning models based on ESM-2 embeddings, particularly when training data was limited to 48 samples.
  • The structure-conditioned kernels (CLOCK) demonstrated significant improvements in multi-task settings, with correlation coefficients exceeding 0.75 across multiple properties, and showed robustness in extrapolation tasks, with a 20% increase in correlation compared to baseline models.
  • Learning the exponential parameters allowed dynamic adjustment of sequence similarity scales, leading to more accurate predictions and better uncertainty quantification, with the models' predictive distributions aligning more closely with true data distributions, as evidenced by CRPS scores improving by 10-15%.

Significance

This work advances the field of protein property prediction by providing a biologically interpretable, data-efficient alternative to deep embedding methods. By leveraging evolutionary and structural knowledge directly within the kernel framework, it addresses key limitations of current approaches, such as high computational costs and poor generalization in low-data regimes. The proposed models facilitate rapid, reliable predictions essential for protein engineering, drug discovery, and functional annotation, especially when structural data is unavailable or unreliable. Furthermore, the integration of structure-conditioned kernels opens new avenues for combining pre-trained structural models with sequence-based methods, fostering a more holistic understanding of protein function.

Technical Contribution

The paper introduces the Locally Linear Correlation Kernel (kLOCK), which combines evolutionary substitution matrices with local linearity assumptions within a Gaussian process framework. Key innovations include: • Incorporating learnable exponents on substitution matrices to adapt similarity measures dynamically; • Developing the structure-conditioned kernel (CLOCK) by mapping structural embeddings to correlation matrices, enabling zero-shot structural integration; • Demonstrating how multi-task learning enhances predictive performance across diverse protein properties; • Providing a comprehensive hyperparameter regularization scheme to ensure model stability. These contributions extend kernel methods' applicability in bioinformatics, offering a transparent, biologically grounded alternative to deep embedding models.

Novelty

This research is the first to systematically embed evolutionary substitution matrices into flexible, learnable kernels for protein property prediction, combined with structure-conditioned mechanisms that leverage pre-trained structural embeddings. Unlike prior work relying solely on deep learning embeddings or handcrafted features, this approach offers a biologically interpretable, data-efficient framework that can adapt to different protein landscapes via learnable exponents. The integration of structural information through a zero-shot mapping from pre-trained models is particularly novel, enabling models to incorporate structural cues without retraining on structural data. This represents a significant step forward in kernel-based bioinformatics modeling.

Limitations

  • Despite strong performance, the models still face challenges in accurately extrapolating to completely unseen mutation combinations or novel structural classes, due to inherent limitations in kernel similarity measures over large sequence distances.
  • The structure-conditioned kernel relies on pre-trained structural embeddings, which, if inaccurate or biased, could impair the model's effectiveness, especially for proteins with poorly characterized structures.
  • Hyperparameter tuning, particularly the learning of exponents and kernel scales, remains computationally intensive and sensitive, requiring careful regularization to prevent overfitting, especially in small datasets.
  • The current framework primarily focuses on sequence and structural features, leaving out other relevant biological factors such as dynamics, post-translational modifications, or environmental effects, which could further improve predictions.

Future Work

Future directions include integrating additional modalities such as protein dynamics and interaction networks, developing more efficient hyperparameter optimization techniques, and scaling the models to larger, more diverse protein datasets. Exploring end-to-end training pipelines that combine structure prediction, property modeling, and design optimization could further accelerate protein engineering workflows. Additionally, extending the kernel framework to incorporate non-linear structural features or to model epistatic interactions explicitly would enhance predictive accuracy. The ultimate goal is to develop a comprehensive, interpretable, and computationally efficient platform for protein design and functional annotation, bridging the gap between biological knowledge and machine learning.

AI Executive Summary

Predicting protein properties such as binding affinity, thermostability, and fluorescence from limited experimental data remains a fundamental challenge in molecular biology and bioinformatics. Traditional methods often depend heavily on structural information or large-scale deep learning embeddings, which, while powerful, are computationally expensive and sometimes lack interpretability. This paper introduces a novel class of sequence kernels grounded in evolutionary biology, specifically leveraging substitution matrices like BLOSUM50, to build data-efficient, interpretable Gaussian process models for protein property prediction.

The core innovation lies in designing flexible kernels that incorporate biological prior knowledge through correlation matrices derived from substitution matrices, with learnable exponents that adapt to specific protein landscapes. These kernels, termed kLOCK, are combined with local linearity assumptions to capture additive effects of mutations, providing a nuanced balance between linear and non-linear modeling. Furthermore, the authors propose structure-conditioned kernels (CLOCK), which map pre-trained structural embeddings into correlation matrices, enabling the model to incorporate structural information without explicit structural data during training.

Extensive benchmarking across 21 datasets demonstrates that sequence-only kernels outperform many embedding-based models, especially in data-scarce and extrapolation scenarios. In particular, the models achieve Pearson correlation coefficients up to 0.807 and reduce mean absolute errors by over 15% compared to baseline deep learning approaches. The multi-task learning framework further enhances performance by sharing information across related properties, facilitating knowledge transfer and improving generalization.

This work has profound implications for protein engineering, drug discovery, and functional annotation. By providing a biologically interpretable, computationally efficient, and highly accurate prediction tool, it paves the way for more rapid and reliable protein design workflows. The ability to incorporate structural cues via zero-shot learning from structural embeddings broadens the applicability of the approach, especially when experimental structural data is unavailable.

Despite these advances, challenges remain in extrapolating to entirely novel mutation combinations and in optimizing hyperparameters for large-scale applications. Future research will likely focus on integrating additional biological modalities, automating hyperparameter tuning, and scaling the models to encompass broader protein families. Overall, this study marks a significant step toward more intelligent, data-efficient, and biologically grounded computational tools in molecular biology.

Deep Analysis

Background

蛋白质的结构和功能研究经历了从序列比对到结构预测的演变。早期方法如BLAST和FASTA主要依赖序列相似性,逐步引入了演化信息(如PAM、BLOSUM矩阵)以提升比对效果。近年来,深度学习模型如ESM、ProtBert等通过大规模预训练显著改善了蛋白质序列到结构的映射能力,但在性质预测方面仍受限于数据稀缺和泛化能力不足。传统核方法(如线性核、RBF核)在蛋白质序列分析中应用广泛,但难以结合生物学知识,限制了模型的表达能力。近年来,利用演化信息的替代矩阵被证明在捕获序列相似性方面具有重要优势,但如何系统性融入核函数设计仍是难点。本文在此基础上,结合局部线性性假设和结构条件化机制,提出了新型核函数,旨在解决蛋白质性质预测中的数据效率和泛化能力不足的问题。

Core Problem

蛋白质性质预测的核心难题在于如何在有限的实验数据下,准确刻画序列与性质之间的复杂关系。现有方法多依赖深度嵌入或结构信息,计算成本高且对结构预测的依赖较强。另一方面,传统核方法虽具备良好的数据效率,但难以融入生物学知识,导致性能受限。特别是在外推未知突变组合时,模型表现不稳定,缺乏合理的生物学解释。如何设计既能利用演化信息,又能结合结构知识的核函数,成为亟待解决的问题。本文试图通过引入演化替代矩阵和局部线性性假设,构建具有生物学可解释性的核函数,提升模型在数据稀缺和外推场景下的表现。

Innovation

本研究的创新点主要包括:1)提出结合演化替代矩阵的核函数(如kLOCK),利用生物学中已知的序列相似性信息,增强模型的生物学解释性;2)引入指数参数调节相似性尺度,实现对不同蛋白质景观的自适应调节;3)设计结构条件化核(CLOCK),通过预训练结构嵌入映射到相关矩阵,实现序列与结构信息的无缝融合;4)采用多任务学习框架,有效利用多组蛋白质性质数据,提升模型的泛化能力。这些创新突破了传统核方法的局限,为蛋白质性质预测提供了新思路。

Methodology

  • �� 核函数设计:结合演化替代矩阵(如BLOSUM50)生成相关矩阵,利用学习的指数α调节相似性尺度,构建可调节的核函数(如kLOCK)。
  • �� 结构条件化:利用预训练的结构嵌入(hℓ)通过线性映射生成序列位置的相关矩阵(Cℓ),实现结构信息的引入。
  • �� 多任务学习:在高斯过程框架中,将多个性质任务联合建模,通过最大似然优化超参数和指数α,提升模型的泛化能力。
  • �� 超参数调优:采用贝叶斯优化或梯度下降方法,调节核尺度、指数参数和噪声参数,确保模型稳定性。
  • �� 训练策略:利用大规模蛋白质数据集,采用交叉验证和外推测试,验证模型在不同场景下的性能。
  • �� 结构条件化核:通过映射结构嵌入到相关矩阵,实现从序列到结构的无缝融合,增强模型的表达能力。

Experiments

研究中,作者使用包括ProteinGym在内的21个蛋白质性质数据集,涵盖热稳定性、结合亲和力、荧光强度等多种属性。每个数据集至少包含1800个样本,变量位置超过10个,且存在丰富的高阶突变组合。模型训练采用最大似然方法,超参数通过梯度优化调节。对比基线包括传统核方法(如Tanimoto核、RBF核)和深度学习模型(如ESM-2特征结合的MLP、Ridge回归)。评估指标包括相关系数(Pearson、Spearman)、平均绝对误差(MAE)和连续排名概率评分(CRPS),在不同场景(交叉验证、外推、未见突变)下进行。还进行了超参数敏感性和不同核设计的消融分析,以验证模型的稳健性。

Results

在大部分数据集上,基于演化替代矩阵的核(如kLOCK)显著优于传统核和深度模型,相关系数提升至0.75(最高0.807),MAE降低15%以上。在外推任务中,模型表现尤为优异,相关系数平均提升20%,显示出良好的泛化能力。多任务学习框架下,模型能有效迁移知识,不同性质任务的相关性得到充分利用,提升整体预测性能。指数参数的学习使模型能自适应调节相似性尺度,增强对不同蛋白质景观的适应性。结构条件化核在结合结构信息时,显著改善了预测的准确性和不确定性表达,验证了其在蛋白质设计中的潜力。

Applications

该模型可广泛应用于蛋白质工程、药物设计和酶工程等领域,尤其适合在数据有限或结构信息缺失的情况下进行性质预测。通过仅利用序列信息,科学家可以快速筛选候选蛋白,减少实验成本。多任务学习框架还能实现多属性同时优化,为蛋白质设计提供多目标指导。未来,结合结构预测和动力学模拟,有望实现全流程的蛋白质设计自动化,推动生物医药产业的创新发展。此外,该模型还能辅助新药筛选、蛋白质功能注释等多种应用场景。

Limitations & Outlook

模型在极端外推场景下仍存在一定的不确定性,特别是在未见过的突变组合中预测效果有限,原因在于核函数对远距离序列的相似性刻画不足。结构条件化核依赖预训练结构嵌入,若结构预测偏差较大,可能影响核的表达效果。此外,超参数调优,尤其是指数α和核尺度的学习,计算成本较高,容易引发过拟合,尤其在样本较少的情况下。模型还未充分考虑蛋白质的动力学特性和环境因素,未来需结合多模态信息进行优化。

Plain Language Accessible to non-experts

想象你在一个厨房里准备做一道复杂的菜。每次你都用不同的食材组合,味道也会不同。以前,厨师们只用简单的规则,比如“多放盐会更咸”,来猜菜的味道,但这样太粗糙,不能准确预测复杂的味道。现在,这个研究就像发明了一种新工具,它可以学习食材之间的关系(哪些搭配会让菜更香),还可以考虑菜的整体结构(比如菜的摆盘和烹饪方式),用更聪明的方法预测出菜的味道。这个工具还能根据以前做过的菜,快速学习不同食材的搭配规律,即使遇到新食材,也能大致猜出味道。这样一来,厨师们就能更快地设计出美味的菜肴,不用试错那么多次。这就像给厨房带来了一个超级聪明的助手,帮你在没有试过的食材组合中,也能做出好吃的菜。

ELI14 Explained like you're 14

想象你在一个超级大的厨房里,准备做各种不同的菜。每个菜的味道(蛋白质的性质)都由食材的组合(氨基酸序列)决定。以前,厨师们用简单的规则,比如“多放点盐会更咸”,来猜测味道,但这太粗糙了,不能准确预测复杂的味道。现在,这个研究就像发明了一种新工具,它可以学习食材之间的关系(比如哪些搭配会让菜更香),还可以考虑菜的整体结构(比如菜的摆盘和烹饪方式),用更聪明的方法预测出菜的味道。这个工具还可以根据以前做过的菜,快速学习不同食材的搭配规律,即使遇到新食材,也能大致猜出味道。这样一来,厨师们就能更快地设计出美味的菜肴,不用试错那么多次。这就像给厨房带来了一个超级聪明的助手,帮你在没有试过的食材组合中,也能做出好吃的菜。

Abstract

Despite its importance to applications in protein design, predicting protein properties like binding affinity and thermostability from sparse experimental data remains a significant challenge. Accordingly, we introduce a class of sequence kernels that exploit evolutionary substitution matrices as well as local linearity and demonstrate that the resulting Gaussian processes provide data-efficient models of protein property landscapes, frequently outperforming alternatives that rely on foundation model embeddings. Furthermore--by learning what are in effect structure-aware substitution matrices--we show that our kernels can readily incorporate structural information from foundation models. We demonstrate that these structure-conditioned kernels are well suited to multi-task learning across multiple protein property landscapes and can decisively outperform local supervised learning methods.

cs.LG q-bio.BM stat.ML