Effective Biological Representation Learning by Masking Gene Expression

TL;DR

This paper introduces TxFM, a masked autoencoder trained on 1.4 million RNA-seq samples, outperforming large-scale foundation models in gene representation learning.

cs.LG | 2026-05-30

Kian Kenyon-Dean Alina Selega Ihab Bendidi Jordan M. Sorokin Luca Bertinetto David Errington Hayley Donnella Oren Kraus

transcriptomics self-supervised learning masked autoencoder gene representation transfer learning

Key Findings

Methodology

The study develops TxFM, a transformer-based masked autoencoder tailored to count-based RNA-seq data. It pairs a transformer encoder with an MLP decoder and trains with a Poisson negative log-likelihood loss, which matches the non-negative, discrete nature of gene counts, together with a bounded tanh activation that keeps training stable. The training dataset, DiverseRNA-1.4M, comprises 1.4 million curated bulk and single-cell samples spanning a range of biological conditions. Training masks a high proportion (~90%) of genes per sample at random, and the encoder sees only the unmasked genes, which encourages it to learn robust, biologically meaningful representations. Ablation studies identify the design choices that most influence transfer performance: masking ratio, activation function, and data preprocessing. The model's learned gene parameters capture functional relationships and gene modules, and the model outperforms atlas-scale foundation models on transfer tasks, especially on unseen perturbation datasets.
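To make the loss concrete, here is a minimal sketch of a Poisson negative log-likelihood over gene counts. The summary does not say how the bounded tanh activation is wired into the model, so `bounded_log_rate` (a tanh rescaled to cap the predicted log-rate) and its `max_log_rate` parameter are assumptions made purely for illustration, not the paper's design.

```python
import math

def bounded_log_rate(z, max_log_rate=10.0):
    # Hypothetical link (an assumption, not taken from the paper):
    # squash a raw decoder output into [-max_log_rate, max_log_rate]
    # so that exp(log_rate) cannot overflow.
    return max_log_rate * math.tanh(z)

def poisson_nll(counts, log_rates):
    # Mean Poisson negative log-likelihood:
    #   exp(log_rate) - x * log_rate + log(x!)
    # log(x!) is computed as lgamma(x + 1).
    total = 0.0
    for x, log_lam in zip(counts, log_rates):
        total += math.exp(log_lam) - x * log_lam + math.lgamma(x + 1)
    return total / len(counts)

# Toy example: four genes with observed counts and raw decoder outputs.
counts = [0, 3, 12, 1]
decoder_outputs = [-0.5, 0.1, 0.25, 0.0]
log_rates = [bounded_log_rate(z) for z in decoder_outputs]
loss = poisson_nll(counts, log_rates)
print(f"mean Poisson NLL: {loss:.3f}")
```

The loss for a single gene is smallest when the predicted rate equals the observed count, which is what pushes the decoder toward the data.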

Key Results

TxFM trained on DiverseRNA-1.4M achieves the highest perturbation and cell representation scores across three unseen datasets, with an average score of 39.11, surpassing atlas-scale foundation models such as Geneformer-v2 and Tahoe-x1 by over 10%. It does so even though those models were trained on corpora over 100 times larger, which points to the efficiency of curated data and careful architecture design.
In gene relationship reconstruction, TxFM's decoder parameters reach a recall of 42.7%, outperforming scVI (40.4%) and PCA (29.2%). Post-PCA processing of transformer embeddings further improves recall by 42%, indicating that biologically relevant relationships naturally reside in low-dimensional manifolds within high-dimensional gene embedding spaces.
Ablation experiments reveal that data curation, masking ratio, and model architecture significantly influence performance. Removing specific datasets like K562 or bulk RNA-seq samples reduces performance marginally, confirming the robustness and generalizability of the approach. The model maintains superior performance even with limited data, emphasizing the importance of data quality and architecture over sheer size.
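The gene-relationship result can be illustrated with a small sketch that projects gene embeddings with PCA and then measures how many known gene pairs rank among the most similar pairs. The summary does not describe the benchmark's exact protocol, so the recall definition used here (a known pair counts as recovered if its cosine similarity falls in the top `top_fraction` of all pairs), the function names, and the synthetic data are all assumptions.

```python
import numpy as np

def pca_project(embeddings, n_components):
    # Center the rows, then project onto the leading principal axes
    # obtained from the SVD of the centered matrix.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def pair_recall(embeddings, known_pairs, top_fraction=0.05):
    # Assumed metric: fraction of known pairs whose cosine similarity
    # is at or above the (1 - top_fraction) quantile over all gene pairs.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)
    sim = unit @ unit.T
    upper = np.triu_indices(len(embeddings), k=1)
    threshold = np.quantile(sim[upper], 1.0 - top_fraction)
    hits = sum(bool(sim[i, j] >= threshold) for i, j in known_pairs)
    return hits / len(known_pairs)

# Synthetic stand-in for learned gene embeddings: four gene "modules"
# buried in noise. Real embeddings would come from the trained model.
rng = np.random.default_rng(0)
n_genes, dim, n_modules = 40, 64, 4
module_of = np.arange(n_genes) % n_modules
centers = rng.normal(size=(n_modules, dim))
embeddings = centers[module_of] + 2.0 * rng.normal(size=(n_genes, dim))
known_pairs = [(i, j) for i in range(n_genes) for j in range(i + 1, n_genes)
               if module_of[i] == module_of[j]]

raw = pair_recall(embeddings, known_pairs)
reduced = pair_recall(pca_project(embeddings, n_components=4), known_pairs)
print(f"recall on raw embeddings: {raw:.3f}, after PCA: {reduced:.3f}")
```

Whether the projection helps on real embeddings, and by how much, is an empirical question; the sketch only shows the mechanics of comparing recall before and after PCA.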

Significance

This work demonstrates that self-supervised masked autoencoding can produce high-fidelity gene representations from noisy, high-dimensional transcriptomic data. It challenges the notion that larger datasets are always necessary for effective transfer learning, showing instead that careful data curation and architecture design are crucial. The ability to learn meaningful gene relationships and cell states without relying on external priors or massive atlases opens new avenues for functional genomics, biomarker discovery, and personalized medicine. The approach provides a scalable, data-efficient framework adaptable to diverse biological contexts, potentially transforming how transcriptomic data is utilized in both research and clinical settings.

Technical Contribution

The paper introduces a transformer-based masked autoencoder architecture optimized for count data. A Poisson likelihood-based loss accounts for the non-negative, discrete nature of RNA-seq counts, and a bounded tanh activation keeps training stable. The paper emphasizes the importance of data curation, masking strategy, and model architecture for transferability. The model learns gene parameters that encode functional relationships, which can be extracted and analyzed without supervision. The systematic ablation studies provide practical guidelines for model design, and the evaluation across multiple datasets demonstrates robustness and superior transfer performance compared to existing foundation models.

Novelty

The work is presented as the first application of masked autoencoders tailored specifically to RNA-seq count data, combining a Poisson loss with a bounded activation function to improve biological interpretability and transferability. Unlike prior models that rely on external priors or large atlases, TxFM emphasizes data quality and architectural choices, achieving high-quality gene representations with significantly less data. The work also shows that a curated dataset can outperform larger, less curated ones, challenging the scale-first approach to transcriptomics modeling.

Limitations

Despite its strengths, the model struggles with extremely sparse or highly noisy data, especially for lowly expressed genes, due to limitations of the Poisson assumption and masking strategy. The approach may also be sensitive to hyperparameters such as masking ratio and data preprocessing, requiring careful tuning for different datasets.
The current model architecture primarily captures static gene relationships and cell states, lacking explicit modeling of dynamic processes or spatial context. Extending the framework to multi-modal data or temporal sequences remains an open challenge.
Computational costs, although lower than atlas-scale models, are still significant, especially during hyperparameter tuning and large-scale training. Further optimization and hardware acceleration are needed for broader deployment.

Future Work

Future directions include integrating multi-omics data (e.g., proteomics, spatial transcriptomics) to enrich representations, developing more sophisticated noise models (e.g., negative binomial with zero inflation), and exploring dynamic modeling of cell states over time. Additionally, scaling the dataset and model capacity, along with transfer learning strategies, could further improve performance. The authors also plan to investigate the biological interpretability of learned gene modules and relationships, aiming to discover novel biomarkers and therapeutic targets.
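For reference, the noise models mentioned above relate to the Poisson likelihood as follows. These are the standard textbook forms, not formulas taken from the paper; the symbols $\mu$ (mean), $\theta$ (inverse dispersion), and $\pi$ (zero-inflation probability) are notation chosen here.

```latex
% Poisson: variance equals the mean
P_{\mathrm{Pois}}(x \mid \mu) = \frac{\mu^{x} e^{-\mu}}{x!},
\qquad \operatorname{Var}[x] = \mu

% Negative binomial: extra variance controlled by \theta
P_{\mathrm{NB}}(x \mid \mu, \theta) =
  \frac{\Gamma(x + \theta)}{x!\,\Gamma(\theta)}
  \left(\frac{\theta}{\theta + \mu}\right)^{\theta}
  \left(\frac{\mu}{\theta + \mu}\right)^{x},
\qquad \operatorname{Var}[x] = \mu + \frac{\mu^{2}}{\theta}

% Zero-inflated negative binomial: extra point mass at zero
P_{\mathrm{ZINB}}(x \mid \mu, \theta, \pi) =
  \pi\,\mathbf{1}[x = 0] + (1 - \pi)\,P_{\mathrm{NB}}(x \mid \mu, \theta)
```

As $\theta \to \infty$ the negative binomial reduces to the Poisson, so the Poisson loss can be read as the special case with no overdispersion and no zero inflation.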

AI Executive Summary

The rapid growth of transcriptomic data from RNA sequencing has revolutionized our understanding of cellular function and heterogeneity. However, extracting meaningful, transferable representations from these high-dimensional, noisy datasets remains a major challenge. Traditional linear methods such as PCA, while computationally efficient, often fall short in capturing complex biological relationships. Deep learning approaches, especially transformer-based models, have shown promise in other domains but face difficulties when applied directly to count-based gene expression data due to the discrete, non-negative nature of counts and the high technical noise involved.

This paper introduces TxFM, a novel masked autoencoder architecture tailored for RNA-seq count data. The core idea is to mask a large fraction (~90%) of gene expression values randomly during training, then task the model with reconstructing the full expression profile. Unlike sequence-based models in NLP, gene expression data is unordered, so the transformer encoder processes only unmasked genes, and the model learns a compact, biologically meaningful cell and gene representation. The Poisson likelihood is used as the reconstruction loss, aligning well with the count data distribution, and a bounded tanh activation ensures stable training.
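The masking step described above can be sketched as follows, assuming a single sample represented as a list of per-gene counts. The function names and the `(gene index, count)` token format are hypothetical; the summary does not specify how TxFM tokenizes its input.

```python
import random

def split_visible_masked(n_genes, mask_ratio=0.9, rng=None):
    # Pick a random subset of genes to hide; the rest stay visible.
    if rng is None:
        rng = random.Random()
    n_masked = round(n_genes * mask_ratio)
    masked = set(rng.sample(range(n_genes), n_masked))
    visible = [g for g in range(n_genes) if g not in masked]
    return visible, sorted(masked)

def encoder_inputs(counts, visible):
    # Hypothetical token format: (gene index, count) pairs.
    # Each token carries its gene identity, so no positional
    # order between tokens is needed.
    return [(g, counts[g]) for g in visible]

# Toy sample with 50 genes; about 90% of them are hidden from the encoder.
rng = random.Random(0)
counts = [rng.randint(0, 20) for _ in range(50)]
visible, masked = split_visible_masked(len(counts), mask_ratio=0.9, rng=rng)
tokens = encoder_inputs(counts, visible)
print(f"{len(tokens)} visible tokens, {len(masked)} masked genes")
```

Because the encoder only receives the visible tokens, its input is roughly a tenth of the full gene set at a 90% masking ratio, while reconstruction still targets the full profile.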

The training dataset, DiverseRNA-1.4M, is a curated collection of 1.4 million samples from various sources, including single-cell and bulk RNA-seq, emphasizing data quality and diversity. Extensive ablation studies identify optimal masking ratios, activation functions, and data preprocessing strategies, demonstrating their impact on transfer performance. The model’s learned gene parameters effectively capture functional relationships, enabling accurate gene clustering and relationship inference without external priors.

Experimental results show that TxFM trained on this curated dataset outperforms atlas-scale foundation models like Geneformer-v2 and Tahoe-x1 across multiple transfer tasks, including perturbation classification and cell state discrimination. Notably, the model achieves these results with significantly less data, highlighting the importance of data curation and architecture design. The learned representations are robust and transferable, making them valuable for downstream applications such as biomarker discovery, drug target identification, and functional annotation.

Overall, this work demonstrates that carefully designed self-supervised learning frameworks can unlock the full potential of transcriptomic data. By focusing on data quality, model architecture, and biological relevance, TxFM sets a new standard for gene expression representation learning. Future work will explore multi-omics integration, dynamic modeling, and biological interpretability, promising further advances in computational genomics and precision medicine.

Deep Dive

Abstract

RNA sequencing produces rich and diverse datasets of gene expression, offering compelling insights into cellular state and function that have many applications in drug discovery. Modeling such data is challenging due to inherent technical noise and experimental batch effects, as evidenced by many existing transcriptomic foundation models (FMs) underperforming relative to linear baselines. Such results raise the question of whether deep representation learning provides a distinct advantage over the direct use of raw transcript counts. Our work explores this by developing a new self-supervised model, TxFM, with a focus on inductive representation learning evaluations. TxFM employs a masked autoencoding approach tailored to diverse RNA-seq count data, and our ablation study empirically identifies crucial architecture configurations required for strong transfer performance. Additionally, we curate a public training corpus, DiverseRNA-1.4M, and find that TxFM trained on this curated dataset yields high-fidelity gene representations that outperform FMs trained on atlas-scale corpora over 100x larger. Overall, our results indicate that inductive self-supervised learning is a viable modeling approach for transcriptomics representation, provided a careful synthesis of model architecture and training data curation.
