Effective Biological Representation Learning by Masking Gene Expression
This paper introduces TxFM, a masked autoencoder trained on 1.4 million RNA-seq samples that outperforms atlas-scale foundation models in gene representation learning.
Key Findings
Methodology
The study develops TxFM, a transformer-based masked autoencoder tailored to count-based RNA-seq data. It pairs a transformer encoder with an MLP decoder and is trained with a Poisson negative log-likelihood loss, which respects the non-negative, discrete nature of gene counts, together with a bounded tanh activation. The training dataset, DiverseRNA-1.4M, contains 1.4 million curated bulk and single-cell samples spanning varied biological conditions. During training, a high proportion (~90%) of the genes in each sample is masked at random and the encoder sees only the unmasked genes, which encourages it to learn robust, biologically meaningful representations. Ablation studies identify the design choices that most influence transfer performance, including masking ratio, activation function, and data preprocessing. The learned gene parameters capture functional relationships and gene modules, and the model outperforms atlas-scale foundation models on transfer tasks, especially on unseen perturbation datasets.
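The masking and loss described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names, the toy count vector, and in particular the way the bounded tanh is applied (here, scaling a raw decoder output into a bounded log-rate) are assumptions made for the example. The loss is taken over the full profile, since the summary describes reconstruction of the full expression profile.

```python
import numpy as np

def random_mask(n_genes, mask_ratio, rng):
    """Boolean mask over genes: True means the gene is hidden from the encoder."""
    n_masked = int(round(mask_ratio * n_genes))
    masked = np.zeros(n_genes, dtype=bool)
    masked[rng.choice(n_genes, size=n_masked, replace=False)] = True
    return masked

def bounded_log_rate(raw, scale=10.0):
    """ASSUMPTION: one way a bounded tanh could be used, keeping log-rates
    inside [-scale, scale] so that exp(log_rate) cannot overflow."""
    return scale * np.tanh(raw)

def poisson_nll(counts, log_rate):
    """Mean Poisson negative log-likelihood, dropping the constant log(counts!)."""
    return float(np.mean(np.exp(log_rate) - counts * log_rate))

rng = np.random.default_rng(0)
n_genes = 1000
counts = rng.poisson(lam=2.0, size=n_genes)      # toy expression profile

masked = random_mask(n_genes, mask_ratio=0.9, rng=rng)
visible_idx = np.flatnonzero(~masked)            # genes the encoder would see
visible_counts = counts[visible_idx]

raw_output = rng.normal(size=n_genes)            # stand-in for a decoder output
loss = poisson_nll(counts, bounded_log_rate(raw_output))
```

With a 90% masking ratio the encoder input shrinks to a tenth of the genes, which is also what makes encoding only the visible genes cheaper than encoding the full profile.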
Key Results
- TxFM trained on DiverseRNA-1.4M achieves the highest perturbation and cell representation scores across three unseen datasets, with an average score of 39.11, surpassing atlas-scale foundation models including Geneformer-v2 and Tahoe-x1 by over 10%. It does so with over 100 times less training data, demonstrating the efficiency of curated data and architecture design.
- In gene relationship reconstruction, TxFM's decoder parameters reach a recall of 42.7%, outperforming scVI (40.4%) and PCA (29.2%). Applying PCA to the transformer embeddings further improves recall by 42%, suggesting that biologically relevant relationships lie in low-dimensional subspaces of the high-dimensional gene embedding space.
- Ablation experiments show that data curation, masking ratio, and model architecture significantly influence performance. Removing specific data sources, such as K562 or the bulk RNA-seq samples, reduces performance only marginally, which suggests the approach is robust and generalizes well. The model maintains superior performance even with limited data, underscoring the importance of data quality and architecture over sheer size.
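The gene relationship recall in the second result can be illustrated with a small sketch. The summary does not define the metric, so the version below is an assumption: a known gene pair counts as recovered when either gene appears among the other's k nearest neighbors by cosine similarity. The helper names, the toy embeddings, and the choice of k are all hypothetical.

```python
import numpy as np

def pca_project(emb, n_components):
    """Project row embeddings onto their top principal components via SVD."""
    centered = emb - emb.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def pair_recall_at_k(emb, known_pairs, k):
    """Fraction of known gene pairs recovered among k nearest cosine neighbors."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)               # a gene is not its own neighbor
    top_k = np.argsort(-sim, axis=1)[:, :k]
    hits = sum((j in top_k[i]) or (i in top_k[j]) for i, j in known_pairs)
    return hits / len(known_pairs)

# Toy gene embeddings: genes 0-2 form one module, genes 3-5 another.
rng = np.random.default_rng(0)
centers = np.zeros((2, 8))
centers[0, 0] = 5.0
centers[1, 1] = 5.0
gene_emb = np.repeat(centers, 3, axis=0) + 0.01 * rng.normal(size=(6, 8))

# Hypothetical reference pairs: two within-module, one across modules.
known_pairs = [(0, 1), (3, 4), (0, 3)]

recall_raw = pair_recall_at_k(gene_emb, known_pairs, k=2)
recall_pca = pair_recall_at_k(pca_project(gene_emb, 2), known_pairs, k=2)
```

In practice the reference pairs would come from a curated relationship database, and the same function could be applied to decoder parameters, scVI parameters, or PCA loadings to compare methods on equal footing.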
Significance
This work demonstrates that self-supervised masked autoencoding can produce high-fidelity gene representations from noisy, high-dimensional transcriptomic data. It challenges the notion that larger datasets are always necessary for effective transfer learning, showing instead that careful data curation and architecture design are crucial. The ability to learn meaningful gene relationships and cell states without relying on external priors or massive atlases opens new avenues for functional genomics, biomarker discovery, and personalized medicine. The approach offers a data-efficient framework adaptable to diverse biological contexts, with potential uses in both research and clinical settings.
Technical Contribution
The paper introduces a transformer-based masked autoencoder architecture optimized for count data, combining a Poisson likelihood-based loss suited to the non-negative, discrete nature of RNA-seq counts with a bounded tanh activation for stable training. It emphasizes the importance of data curation, masking strategy, and model architecture for transferability. The model learns gene parameters that encode functional relationships, which can be extracted and analyzed without supervision. The systematic ablation studies provide practical guidelines for model design, and the evaluation across multiple datasets demonstrates robustness and superior transfer performance compared to existing foundation models.
Novelty
The work is presented as the first application of masked autoencoders specifically tailored to RNA-seq count data, combining a Poisson loss and a bounded activation function to improve biological interpretability and transferability. Unlike prior models that rely on external priors or large atlases, TxFM emphasizes data quality and architectural choices, achieving high-quality gene representations with significantly less data. The work also highlights the importance of data curation, showing that curated datasets can outperform larger, less curated ones, which runs counter to the prevailing emphasis on scale in transcriptomics modeling.
Limitations
- Despite its strengths, the model struggles with extremely sparse or highly noisy data, especially for lowly expressed genes, due to limitations of the Poisson assumption and masking strategy. The approach may also be sensitive to hyperparameters such as masking ratio and data preprocessing, requiring careful tuning for different datasets.
- The current model architecture primarily captures static gene relationships and cell states, lacking explicit modeling of dynamic processes or spatial context. Extending the framework to multi-modal data or temporal sequences remains an open challenge.
- Computational costs, although lower than atlas-scale models, are still significant, especially during hyperparameter tuning and large-scale training. Further optimization and hardware acceleration are needed for broader deployment.
Future Work
Future directions include integrating multi-omics data (e.g., proteomics, spatial transcriptomics) to enrich representations, developing more sophisticated noise models (e.g., negative binomial with zero inflation), and exploring dynamic modeling of cell states over time. Additionally, scaling the dataset and model capacity, along with transfer learning strategies, could further improve performance. The authors also plan to investigate the biological interpretability of learned gene modules and relationships, aiming to discover novel biomarkers and therapeutic targets.
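For reference, the zero-inflated negative binomial noise model mentioned above has the following standard form. The symbols are chosen for this illustration and do not come from the paper: y is an observed count, μ the mean, θ the inverse dispersion, and π the probability of an excess zero.

```latex
P(y \mid \mu, \theta, \pi) =
\begin{cases}
\pi + (1-\pi)\left(\dfrac{\theta}{\theta+\mu}\right)^{\theta}, & y = 0, \\[2ex]
(1-\pi)\,\dfrac{\Gamma(y+\theta)}{y!\,\Gamma(\theta)}
\left(\dfrac{\theta}{\theta+\mu}\right)^{\theta}
\left(\dfrac{\mu}{\theta+\mu}\right)^{y}, & y = 1, 2, \dots
\end{cases}
```

The negative binomial component has variance μ + μ²/θ, so it can absorb overdispersion that a Poisson model (variance μ) cannot; setting π = 0 and letting θ grow without bound recovers the Poisson case.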
AI Executive Summary
The rapid growth of transcriptomic data from RNA sequencing has revolutionized our understanding of cellular function and heterogeneity. However, extracting meaningful, transferable representations from these high-dimensional, noisy datasets remains a major challenge. Traditional linear methods such as PCA, while computationally efficient, often fall short in capturing complex biological relationships. Deep learning approaches, especially transformer-based models, have shown promise in other domains but face difficulties when applied directly to count-based gene expression data due to the discrete, non-negative nature of counts and the high technical noise involved.
This paper introduces TxFM, a masked autoencoder architecture tailored for RNA-seq count data. The core idea is to randomly mask a large fraction (~90%) of gene expression values during training, then task the model with reconstructing the full expression profile. Unlike token sequences in NLP, gene expression profiles are unordered. The transformer encoder processes only the unmasked genes, and the model learns compact, biologically meaningful cell and gene representations. A Poisson likelihood serves as the reconstruction loss, matching the distribution of count data, and a bounded tanh activation keeps training stable.
The training dataset, DiverseRNA-1.4M, is a curated collection of 1.4 million samples from various sources, including single-cell and bulk RNA-seq, emphasizing data quality and diversity. Extensive ablation studies identify optimal masking ratios, activation functions, and data preprocessing strategies, demonstrating their impact on transfer performance. The model’s learned gene parameters effectively capture functional relationships, enabling accurate gene clustering and relationship inference without external priors.
Experimental results show that TxFM trained on this curated dataset outperforms atlas-scale foundation models such as Geneformer-v2 and Tahoe-x1 across multiple transfer tasks, including perturbation classification and cell state discrimination. Notably, the model achieves these results with a training corpus over 100 times smaller, highlighting the importance of data curation and architecture design. The learned representations are robust and transferable, making them valuable for downstream applications such as biomarker discovery, drug target identification, and functional annotation.
Overall, this work demonstrates that carefully designed self-supervised learning frameworks can unlock the full potential of transcriptomic data. By focusing on data quality, model architecture, and biological relevance, TxFM sets a new standard for gene expression representation learning. Future work will explore multi-omics integration, dynamic modeling, and biological interpretability, promising further advances in computational genomics and precision medicine.
Deep Dive
Abstract
RNA sequencing produces rich and diverse datasets of gene expression, offering compelling insights into cellular state and function that have many applications in drug discovery. Modeling such data is challenging due to inherent technical noise and experimental batch effects, as evidenced by many existing transcriptomic foundation models (FMs) underperforming relative to linear baselines. Such results raise the question of whether deep representation learning provides a distinct advantage over the direct use of raw transcript counts. Our work explores this by developing a new self-supervised model, TxFM, with a focus on inductive representation learning evaluations. TxFM employs a masked autoencoding approach tailored to diverse RNA-seq count data, and our ablation study empirically identifies crucial architecture configurations required for strong transfer performance. Additionally, we curate a public training corpus, DiverseRNA-1.4M, and find that TxFM trained on this curated dataset yields high-fidelity gene representations that outperform FMs trained on atlas-scale corpora over 100x larger. Overall, our results indicate that inductive self-supervised learning is a viable modeling approach for transcriptomics representation, provided a careful synthesis of model architecture and training data curation.