The download progress bar reads: terabytes of data. Tens of millions of cells. About 20,000 genes per cell. Human Cell Atlas-scale efforts and related public atlases are steadily mapping cell types across the body, and the datasets keep growing. If you printed each cell’s expression profile on a single sheet of paper, the stack would stretch across cities.
The scale is not just logistically staggering — it breaks traditional analysis pipelines. No graduate student can manually annotate tens of millions of cells. No standard clustering algorithm scales elegantly to this size. Batch correction across dozens of labs, protocols, and patient cohorts becomes a combinatorial nightmare. And yet, buried somewhere in those expression profiles are undiscovered cell states, rare populations that appear in disease but not in health, and regulatory programs that no one has described. The answers are in the data; the question is how to ask for them.
The solution mirrors what happened in natural language processing when the internet became too large for human curation: stop annotating everything manually, and instead train a model large enough to learn the structure of the data itself. GPT didn’t need humans to label every sentence as grammatical or ungrammatical — it learned grammar by reading enough text. A single-cell foundation model doesn’t need humans to annotate every cell — it learns what “normal” cell states look like by reading enough expression profiles, so it can spot what’s abnormal, incomplete, or novel.
This is the shift this chapter describes: from analysis pipelines that require labeled data at every step, to foundation models that learn a general cellular language and then apply it wherever it’s needed — across tissues, diseases, and species they’ve never seen before.
Single-cell RNA-sequencing has generated an unprecedented explosion of data. By the mid-2020s, public repositories and atlas projects contained gene expression measurements from well over 100 million individual cells. The Human Cell Atlas aims to create reference maps of human cell types across tissues, while related portals such as CELLxGENE aggregate standardized datasets at even larger scale.
But this wealth of data creates new challenges:
The Scale Problem: Each scRNA-seq experiment generates a matrix with 20,000–30,000 genes (rows) and 10,000–500,000 cells (columns). That’s up to 15 billion measurements per experiment. Training a model on even a fraction of published datasets requires processing tens of billions of data points.
The Integration Problem: Cells sequenced in different labs, using different protocols, from different individuals show substantial technical variation. A beta cell from Lab A looks different from a beta cell from Lab B, even though they’re biologically similar. Integrating data across studies requires sophisticated methods to separate biological variation from technical noise.
The Annotation Problem: Manually identifying cell types requires expert knowledge and is incredibly time-consuming. The same cell type may have different names in different papers (pancreatic beta cell = β-cell = insulin-producing cell = INS+ cell). Creating consistent annotations across millions of cells is impossible without automation.
The Generalization Problem: Traditional machine learning models trained on one tissue or condition often fail when applied to new contexts. A model trained on healthy pancreatic tissue may not recognize stressed beta cells from patients with diabetes. We need models that capture fundamental principles of cellular biology that transfer across contexts.
By the end of this chapter, you should be able to:
The term “foundation model” emerged in 2021 to describe large AI systems trained on broad data that can be adapted to many downstream tasks. Think of GPT-3 for language or CLIP for images. These models learn general representations during pre-training, then can be fine-tuned for specific applications with minimal additional data.
For single-cell biology, this paradigm is transformative. Instead of training a new neural network from scratch for every biological question, we can:
The key innovation is that the pre-trained model learns a “universal language” of gene expression—patterns that hold across tissues, conditions, and even species.
Single-cell datasets share critical properties with language:
Structure: Just as words appear in sequences with grammatical rules, genes are expressed in coordinated patterns governed by regulatory networks. Co-expressed genes are like phrases; regulatory modules are like grammar.
Context-dependence: A gene’s “meaning” depends on what other genes are expressed, just as a word’s meaning depends on surrounding words. The gene CD4 in a T cell has different biological implications than the same gene expressed in a monocyte or macrophage.
Transferability: Core biological principles (cell cycle, stress response, differentiation) appear across cell types, similar to how narrative structure transfers across texts.
Scale: We have hundreds of millions of single-cell profiles—sufficient data to learn meaningful patterns.
Biological analogy: Think of scBERT, Geneformer, and scGPT as BERT for text, but instead of words, the “vocabulary” is genes, and instead of sentences, the “text” is a cell’s gene expression profile. Just as BERT learns that “bank” means something different in “river bank” vs. “bank account,” these models learn that CD4 means something different in a T cell versus a macrophage.
Most single-cell foundation models use masked gene prediction as their pre-training task, directly inspired by BERT’s masked language modeling:
This seemingly simple task forces the model to learn:
scBERT (2022) was one of the first attempts to apply transformer architecture to single-cell data. The key innovation was treating genes as “words” and cells as “sentences.”
Input representation: Each gene’s expression is represented as:
Gene Embedding = Gene Identity Embedding + Expression Value Embedding
The gene identity tells the model “this is the BRCA1 gene,” while the expression value indicates “it’s highly expressed” or “barely detectable.” This dual embedding captures both what genes are present and how active they are.
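To make the dual embedding concrete, here is a minimal PyTorch sketch. The class name, dimensions, and expression-binning scheme are illustrative assumptions for this chapter, not the published scBERT code.

```python
# Minimal sketch of a scBERT-style dual input embedding (assumed, simplified design).
import torch
import torch.nn as nn

class GeneExpressionEmbedding(nn.Module):
    def __init__(self, n_genes: int, n_bins: int, d_model: int):
        super().__init__()
        # "Which gene is this?" -- one learned vector per gene in the vocabulary
        self.gene_id_emb = nn.Embedding(n_genes, d_model)
        # "How strongly is it expressed?" -- expression discretized into bins
        self.expr_emb = nn.Embedding(n_bins, d_model)

    def forward(self, gene_ids: torch.Tensor, expr_bins: torch.Tensor) -> torch.Tensor:
        # gene_ids, expr_bins: (batch, n_tokens) integer tensors
        # Gene Embedding = Gene Identity Embedding + Expression Value Embedding
        return self.gene_id_emb(gene_ids) + self.expr_emb(expr_bins)

# Example: three genes from one cell, expression binned into 0..9
emb = GeneExpressionEmbedding(n_genes=20000, n_bins=10, d_model=128)
tokens = emb(torch.tensor([[12, 305, 7811]]), torch.tensor([[9, 0, 3]]))
print(tokens.shape)  # torch.Size([1, 3, 128])
```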
Architecture note: scBERT follows the BERT pretraining/fine-tuning idea for scRNA-seq and uses attention-efficient transformer components to handle thousands of genes per cell. Exact layer counts, embedding sizes, and training sets depend on the implementation and should be cited from the model paper or repository when used in a methods section.
Let’s walk through what happens when scBERT sees a T cell:
Tokenization: The cell’s 20,000 gene expression values are converted to tokens. Genes with zero or very low expression might be filtered out, leaving ~5,000–8,000 “active” genes.
Embedding: Each gene gets an embedding vector (128 dimensions in this illustration; the exact size depends on the implementation) that combines gene identity and expression level. Marker genes such as CD3D and CD3E carry representations that the model has learned to associate with T cell contexts.
Self-attention: The transformer layers allow every gene to “attend” to every other gene. CD3D can look at CD3E, CD8A, and other T cell markers to understand the cellular context.
Cell-level representation: The final hidden states are pooled to create a single vector representing the entire cell’s transcriptional state.
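A compact sketch of that forward pass, using a generic PyTorch transformer encoder rather than the attention-efficient layers a real implementation would use. The dimensions are illustrative, and far fewer gene tokens are used here than in practice to keep the example light.

```python
# Illustrative forward pass: embedded gene tokens -> self-attention -> pooled cell embedding.
import torch
import torch.nn as nn

d_model = 128
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=4,
)

# Pretend these 1,024 tokens are already-embedded "active" genes from one cell
# (real models handle thousands of gene tokens via efficient attention).
gene_tokens = torch.randn(1, 1024, d_model)
hidden = encoder(gene_tokens)          # every gene attends to every other gene
cell_embedding = hidden.mean(dim=1)    # pool token states into one cell-level vector
print(cell_embedding.shape)            # torch.Size([1, 128])
```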
Cell type annotation: After pre-training, scBERT can be fine-tuned on labeled cells from one dataset and achieve strong performance on held-out cells. Performance on completely new datasets depends on tissue, batch effects, annotation granularity, and how closely the new data match the pretraining and fine-tuning data.
Batch effect correction: Because scBERT is pre-trained on gene expression patterns from many studies, it learns to separate biological variation (real differences between cell types) from technical variation (differences between sequencing runs).
Gene imputation: scBERT can predict expression levels for genes that were missed by sequencing (dropouts). The model uses patterns from other genes to infer what the missing values should be.
scBERT treats all genes equally, but we know that transcription factors and signaling molecules have outsized importance in determining cell state. The model doesn’t incorporate our prior biological knowledge about gene regulatory networks.
Additionally, scBERT operates on processed count data, losing information about RNA splicing, velocity (directionality of change), or spatial context that might be available in the original data.
Geneformer (2023) took a different approach by representing each cell as a ranked list of expressed genes. Instead of relying on absolute count values, it asks which genes are unusually high or low within a cell relative to a large reference corpus.
Rank-value encoding: Genes are sorted by expression level within each cell:
This rank-based approach makes the model robust to technical variation in absolute expression levels. A gene might have 100 counts in one experiment and 1,000 counts in another, but if it’s the 5th most highly expressed gene in both cases, it gets the same rank.
Why ranking matters: In single-cell data, the relative order of gene expression often matters more than absolute values. If FOXP3 is among the top 50 expressed genes in a T cell, that’s probably a regulatory T cell—regardless of whether FOXP3’s count is 50 or 500.
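The sketch below shows the core of rank-value encoding under simplifying assumptions; Geneformer additionally normalizes each gene by its median expression across the pretraining corpus before ranking, which is omitted here.

```python
# Minimal sketch of rank-value encoding (simplified; no corpus-median normalization).
import numpy as np

def rank_encode(counts: np.ndarray, gene_names: list, top_k: int = 2048) -> list:
    """Return the cell as an ordered list of gene names, highest expression first."""
    order = np.argsort(counts)[::-1]           # indices from highest to lowest count
    order = order[counts[order] > 0][:top_k]   # keep only expressed genes, truncate
    return [gene_names[i] for i in order]

genes = ["FOXP3", "IL2RA", "CD3D", "ACTB", "INS"]
cell_a = np.array([500, 120, 300, 900, 0])
cell_b = np.array([50, 12, 30, 90, 0])         # 10x lower depth, same relative order
print(rank_encode(cell_a, genes))              # ['ACTB', 'FOXP3', 'CD3D', 'IL2RA']
print(rank_encode(cell_b, genes))              # identical ranking despite the scale change
```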
Model specifications:
Integration with biological knowledge: Geneformer does not need to initialize gene embeddings from Gene Ontology annotations. Instead, it learns gene-gene relationships from large-scale single-cell expression context. Biological structure emerges from the self-supervised task and can then be probed through attention, perturbation, and fine-tuning analyses.
Perhaps Geneformer’s most impressive capability is transfer learning for network biology. After pretraining, the model can be fine-tuned for tasks such as gene dosage sensitivity, chromatin dynamics, and disease-state modeling with much less task-specific data than would be needed to train from scratch.
Example: Prioritizing cardiomyopathy targets
These predictions are hypotheses for experimental follow-up, not direct proof of therapeutic efficacy.
Because Geneformer uses attention mechanisms, we can examine which genes the model focuses on when making predictions. In heart-cell analyses, attention and perturbation patterns can highlight cardiac transcription factors such as GATA4, NKX2-5, and MEF2C, which gives researchers hypotheses to compare with known biology.
This interpretability is crucial for biological applications. We don’t just want a “black box” that makes good predictions; we want to understand why it makes those predictions and whether its reasoning aligns with biological mechanisms.
scGPT (2023) asks a different question: instead of just predicting masked genes, can we generate entirely new cellular states? This generative approach enables new applications like:
Generative modeling: Rather than only filling in masked genes, scGPT learns to generate expression values conditioned on the rest of a cell’s profile and on condition tokens such as a perturbation. This is similar to how GPT predicts the next word in a sentence, except the model is predicting gene expression, including how a cell’s state might change after a perturbation.
Multi-task pre-training: scGPT trains on three tasks simultaneously:
This multi-task approach forces the model to learn different aspects of cellular biology:
Model specifications:
A key innovation in scGPT is conditional generation—the ability to generate cells with specific properties.
Example: Simulating drug treatment
Input: Baseline cardiomyocyte + "doxorubicin treatment"
Output: Predicted gene expression after drug exposure
The model learned patterns of how cells respond to perturbations by training on datasets where the same cell types were measured with and without treatments. It can then predict responses to drugs or genetic perturbations it hasn’t seen during training.
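The snippet below is a purely hypothetical sketch of what conditional generation looks like from a user’s point of view. The class and method names are placeholders rather than the scGPT API, and the additive “effect” is a toy stand-in for a learned conditional decoder.

```python
# Hypothetical sketch of conditional generation; names and mechanics are illustrative only.
import numpy as np

class PerturbationModel:
    """Stand-in for a pretrained conditional generator."""
    def __init__(self, perturbation_effects: dict):
        self.effects = perturbation_effects  # toy: one learned shift per condition

    def predict_response(self, baseline_expr: np.ndarray, condition: str) -> np.ndarray:
        # A real model conditions its decoder on a perturbation token;
        # here we simply apply an additive shift in log-expression space.
        return baseline_expr + self.effects[condition]

# Toy usage: baseline cardiomyocyte profile plus a "doxorubicin" condition token
baseline = np.log1p(np.array([120.0, 3.0, 45.0, 0.0]))
model = PerturbationModel({"doxorubicin": np.array([-0.8, 1.5, -0.2, 0.4])})
predicted = model.predict_response(baseline, "doxorubicin")
print(predicted)
```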
One of scGPT’s most impressive capabilities is predicting cellular responses to genetic perturbations:
Task: What happens if we knock out gene X in cell type Y?
The model approaches this by:
For example, knocking out a transcription factor might cause downstream target genes to decrease, while stress response genes increase to compensate. The model predicts these cascade effects.
Validation: Perturbation prediction must be evaluated against CRISPR, drug-treatment, or other perturbation datasets in the relevant cell type. Reported correlations vary by benchmark and should not be treated as a universal performance guarantee.
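One common way such an evaluation is set up, sketched with illustrative variable names: correlate the predicted change from control with the measured change for a held-out perturbation.

```python
# Sketch: correlate predicted vs. measured expression changes (delta from control).
import numpy as np
from scipy.stats import pearsonr

def perturbation_correlation(pred_expr, measured_expr, control_expr):
    pred_delta = pred_expr - control_expr        # predicted change vs. control
    true_delta = measured_expr - control_expr    # measured change vs. control
    r, _ = pearsonr(pred_delta, true_delta)
    return r

control = np.array([5.0, 2.0, 0.5, 3.0])
measured = np.array([4.2, 3.1, 0.4, 3.5])
predicted = np.array([4.5, 2.8, 0.6, 3.3])
print(perturbation_correlation(predicted, measured, control))
```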
| Model family | Released | Main input | Key Innovation | Best For |
|---|---|---|---|---|
| scBERT | 2022 | scRNA-seq | BERT-style pretraining for cell annotation | Cell type annotation |
| Geneformer | 2023 | scRNA-seq | Rank-value encoding and transfer learning | Network biology and in silico perturbation |
| scGPT | 2023 (preprint; published 2024) | single-cell multi-omics-style inputs | Generative pretraining and perturbation modeling | Perturbation prediction and integration |
| scFoundation | 2024 | large-scale transcriptomics | Large-scale transcriptomic representation learning | General transcriptomic transfer tasks |
| Multimodal/spatial models | emerging | RNA, ATAC, spatial, protein | Incorporate additional measurements | Tissue context and regulatory mechanisms |
Use scBERT when:
Use Geneformer when:
Use scGPT when:
Use multimodal or spatial models when:
Benchmark numbers depend heavily on train/test split, tissue, cell ontology, preprocessing, and whether a model is fine-tuned. A model that performs well for cell type annotation may not be best for perturbation prediction, batch integration, or rare-state discovery. Choose based on your biological question, not just overall accuracy.
[Optional: The Math] — Attention Mechanisms in Single-Cell Models
The self-attention mechanism is central to all these foundation models. Here’s how it works for gene expression:
Input: A cell with gene expression values g₁, g₂, …, gₙ
Step 1: Create Query, Key, Value matrices
Q = W_Q × [g₁, g₂, ..., gₙ]
K = W_K × [g₁, g₂, ..., gₙ]
V = W_V × [g₁, g₂, ..., gₙ]

Where W_Q, W_K, and W_V are learned weight matrices.
Step 2: Compute attention scores
Attention(gᵢ, gⱼ) = exp(Qᵢ · Kⱼ / √d) / Σₖ exp(Qᵢ · Kₖ / √d)

This score represents “how much should gene i pay attention to gene j?”
Step 3: Weighted combination
Outputᵢ = Σⱼ Attention(gᵢ, gⱼ) × Vⱼ

Biological interpretation: High attention from CD3D to CD3E makes sense—they’re both T cell receptor components. High attention from FOXP3 to IL2RA makes sense—both are regulatory T cell markers. The model learns these gene-gene relationships from data.
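The same three steps in NumPy, for a single attention head with illustrative dimensions:

```python
# Direct NumPy translation of the three steps above (one attention head, toy sizes).
import numpy as np

rng = np.random.default_rng(0)
n_genes, d = 5, 16                      # 5 gene tokens, 16-dimensional embeddings
G = rng.normal(size=(n_genes, d))       # input gene embeddings g1..gn
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = G @ W_Q, G @ W_K, G @ W_V                                  # Step 1
scores = Q @ K.T / np.sqrt(d)                                        # compatibility of gene i with gene j
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)    # Step 2: row-wise softmax
output = attn @ V                                                    # Step 3: weighted combination
print(attn[0].round(3), output.shape)                                # attention of gene 1 over all genes
```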
Foundation models are powerful, but you’ll almost always need to adapt them to your specific biological question. Here’s the typical workflow:
Step 1: Load pre-trained model — obtain the publicly released model weights from the authors’ repository.
Step 2: Prepare your data — normalize and filter your scRNA-seq data as described in Chapter 15.
Step 3: Fine-tune on labeled subset — use 1,000 labeled cells to teach the model cell type classification for your specific tissue.
Step 4: Apply to all cells — classify all 50,000 cells using the fine-tuned model.
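A hedged end-to-end sketch of these four steps in generic PyTorch. The encoder here is a stand-in for whichever pretrained foundation model you load from its repository, and the data tensors are placeholders for your own preprocessed matrix.

```python
# Sketch of the fine-tuning workflow; architecture, sizes, and data are placeholders.
import torch
import torch.nn as nn

# Step 1: "load" a pretrained encoder (stand-in module; real weights come from the repo)
encoder = nn.Sequential(nn.Linear(2000, 512), nn.ReLU(), nn.Linear(512, 512))
# encoder.load_state_dict(torch.load("released_weights.pt"))  # real checkpoint goes here

# Step 2: preprocessed data (here: 1,000 labeled cells x 2,000 highly variable genes)
X = torch.randn(1000, 2000)               # placeholder for your normalized expression matrix
y = torch.randint(0, 8, (1000,))          # placeholder labels for 8 cell types

# Step 3: attach a small classification head and fine-tune with a small learning rate (see below)
model = nn.Sequential(encoder, nn.Linear(512, 8))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

# Step 4: apply the fine-tuned model to all 50,000 cells
all_cells = torch.randn(50000, 2000)       # placeholder for the full dataset
with torch.no_grad():
    predictions = model(all_cells).argmax(dim=1)
```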
This is where foundation models shine. Traditional approaches might need:
With pre-trained models:
When fine-tuning, use a much smaller learning rate than pre-training:
Why? The pre-trained model has already learned good representations. You want to gently adjust these representations for your task, not overwrite them completely.
Layer freezing: You can freeze earlier layers and only train later layers. This prevents overfitting when you have limited labeled data. Early layers learn general features (basic gene co-expression patterns), while later layers learn task-specific features (which patterns indicate specific cell types).
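Continuing the sketch above, freezing the earliest encoder layer and training the rest with a small learning rate might look like this; the layer index and rate are illustrative choices, not prescriptions.

```python
# Freeze the first (most general) encoder layer, fine-tune the rest gently.
for param in encoder[0].parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-5,   # roughly 10-100x smaller than typical pre-training rates
)
```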
After processing a cell through the model, we get a high-dimensional vector (e.g., 512 dimensions) that represents that cell’s transcriptional state. This is the cell embedding.
Cells with similar embeddings should be biologically similar:
We can use dimensionality reduction (UMAP, t-SNE) to visualize these embeddings in 2D:
High-dim embedding (512-D) → UMAP → 2D plot
Key insight: Foundation model embeddings often produce cleaner, more interpretable UMAP plots than traditional methods (PCA on raw counts) because they’ve learned to emphasize biologically relevant variation while ignoring technical noise.
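For example, assuming the model has produced a cells × 512 embedding matrix, a typical scanpy-based visualization might look like the following sketch (the random matrix stands in for real model output).

```python
# Sketch: visualize foundation-model cell embeddings with scanpy.
import numpy as np
import anndata as ad
import scanpy as sc

embeddings = np.random.rand(5000, 512)        # placeholder for the model's cell embeddings
adata = ad.AnnData(X=embeddings)
adata.obsm["X_foundation"] = embeddings

sc.pp.neighbors(adata, use_rep="X_foundation")  # k-NN graph built in embedding space
sc.tl.umap(adata)                               # 512-D -> 2-D
sc.pl.umap(adata)                               # color by cell type once annotations exist
```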
Biological analogy: Gene regulatory network inference from embeddings is like reconstructing who gave instructions to whom in a cell — if gene A’s expression always predicts gene B’s across millions of cells, A might be upstream of B in the regulatory network. Foundation models make this inference much more reliable by drawing on patterns from far more cells than any single experiment.
Beyond cells, these models also create gene embeddings—vectors representing each gene’s functional properties:
Application: Finding genes with unknown function
For example, if an uncharacterized gene’s embedding is surrounded by cell cycle genes, it likely plays a role in cell division.
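A small sketch of that lookup: rank genes by cosine similarity to a query gene in embedding space. The embedding matrix and gene names here are placeholders for real model output.

```python
# Sketch: nearest neighbors of a gene in embedding space by cosine similarity.
import numpy as np

def nearest_genes(query: str, gene_embeddings: np.ndarray, gene_names: list, k: int = 5):
    normed = gene_embeddings / np.linalg.norm(gene_embeddings, axis=1, keepdims=True)
    q = normed[gene_names.index(query)]
    sims = normed @ q                              # cosine similarity to every gene
    top = np.argsort(sims)[::-1][1:k + 1]          # skip the query itself
    return [(gene_names[i], float(sims[i])) for i in top]

# Toy usage; real embeddings would place co-functional genes near each other
names = ["CDK1", "MKI67", "TOP2A", "ALB", "UNKNOWN_GENE"]
emb = np.random.rand(5, 64)
print(nearest_genes("UNKNOWN_GENE", emb, names))
```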
Some foundation models are trained on data from multiple species (human, mouse, zebrafish). This creates cross-species embeddings where:
This is powerful for studying human disorders where we can’t do experiments directly.
Background: Type 2 diabetes involves dysfunction of pancreatic beta cells, but the molecular mechanisms remain unclear. Different patients show different patterns of beta cell stress and failure.
The Challenge: Researchers collect pancreatic islet samples from individuals with type 2 diabetes and from unaffected controls. After scRNA-seq, they have expression data from many thousands of cells. But which specific beta cell states are associated with the disease?
Foundation Model Approach:
Step 1: Transfer learning
Step 2: Identify altered states
Step 3: Cluster altered cells
Step 4: Patient stratification
Key Results:
Clinical Implications:
Why Foundation Models Helped:
The newest generation of foundation models extends beyond scRNA-seq to integrate multiple data types:
scFoundation (2024): A large-scale single-cell transcriptomics foundation model
Spatial foundation models: Incorporate spatial information from spatial transcriptomics
Traditional scRNA-seq loses spatial context—we don’t know where cells were in the tissue. Spatial transcriptomics preserves this, but generates different data types.
These models learn:
Trajectory and perturbation models: Predict cellular differentiation trajectories
During development, cells transition through intermediate states. Trajectory-aware models learn these temporal relationships from time-course, perturbation, or RNA velocity-style data.
Example: Reprogramming fibroblasts to neurons
Causality: These models learn correlations, not causal relationships. Just because genes X and Y are always co-expressed doesn’t mean X causes Y (they might both be caused by Z).
Rare cell types: Models are biased toward common cell types in training data. A cell type that appears in only 1% of training samples will be poorly represented.
Dynamic processes: Current models see static snapshots. They struggle with rapidly changing processes (immune responses, cell cycle) where timing matters.
Cellular interactions: Single-cell models analyze isolated cells. They miss intercellular signaling, physical contacts, and tissue-level organization (though spatial models are addressing this).
Computational requirements:
Data quality:
Foundation models are complex. Understanding why they make specific predictions remains difficult:
Best practice: Always validate model predictions experimentally when possible. Use the model to generate hypotheses, not as final proof.
Good use cases:
Problematic use cases:
Start simple:
Validate predictions:
Training data bias:
Clinical translation:
Foundation models learn general patterns from massive datasets and can be adapted to specific tasks through fine-tuning, requiring far less labeled data than training from scratch
scBERT pioneered BERT-style masked gene prediction for single cells, treating genes as tokens and cells as sequences, achieving strong performance on cell type annotation
Geneformer introduced rank-value encoding and self-supervised transfer learning, enabling network-biology predictions and in silico perturbation analysis with limited task-specific data
scGPT added generative capabilities, allowing prediction of cellular responses to perturbations and simulation of drug treatments or genetic modifications
Multimodal and spatial models integrate RNA-seq with chromatin accessibility, spatial position, protein abundance, or temporal dynamics to capture additional biological complexity
Transfer learning dramatically reduces data requirements: fine-tuning on 100–1,000 labeled cells often achieves performance that would require 10,000+ cells when training from scratch
Cell and gene embeddings provide interpretable, low-dimensional representations that capture biological similarity and can transfer knowledge across species
Limitations include lack of causal understanding, bias toward common cell types, computational requirements, and interpretability challenges that require careful validation
| Term | Definition |
|---|---|
| Attention mechanism | A neural network component that learns which parts of the input are most relevant for making predictions, allowing the model to focus on important gene-gene relationships |
| Batch effect | Technical variation in gene expression measurements arising from different experimental conditions, sequencing runs, or laboratories rather than true biological differences |
| Cell embedding | A high-dimensional vector representation of a cell’s transcriptional state learned by a neural network, where similar cells have similar embeddings |
| Conditional generation | The ability to generate synthetic data with specific properties by conditioning the generative model on desired characteristics (e.g., cell type, treatment condition) |
| Fine-tuning | Adapting a pre-trained model to a specific task by training on a small amount of task-specific data with a low learning rate |
| Foundation model | A large AI model trained on broad data that can be adapted to many downstream tasks, typically through fine-tuning or few-shot learning |
| Gene embedding | A vector representation of a gene that captures its functional properties and regulatory relationships based on co-expression patterns across many cells |
| Generative model | A model that learns the probability distribution of data and can generate new synthetic samples, rather than just classifying existing samples |
| In-context learning | The ability to perform a task by seeing a few examples without any parameter updates or fine-tuning, analogous to few-shot learning in language models |
| Masked gene prediction | A pre-training objective where random genes are hidden and the model learns to predict them from surrounding context, forcing it to learn gene-gene relationships |
| Multi-modal learning | Training models on multiple data types simultaneously (e.g., RNA expression and chromatin accessibility) to learn relationships between modalities |
| Perturbation prediction | Using trained models to forecast how cells will respond to genetic or chemical interventions without running actual experiments |
| Rank-value encoding | Representing gene expression by ranking genes within each cell rather than using absolute expression values, making models more robust to technical variation |
| Transfer learning | Training a model on one dataset or task, then applying or adapting it to a different but related task with minimal additional training |
| Zero-shot learning | Applying a model to tasks or data types it has never been explicitly trained on, relying on general patterns learned during pre-training |
Explain why masked gene prediction is an effective pre-training objective for single-cell data. What biological knowledge does a model gain by learning to predict masked genes from unmasked ones?
Compare rank-value encoding (Geneformer) with absolute expression values (scBERT). In what scenarios would each approach be more appropriate? What biological assumptions does each make?
A foundation model trained primarily on immune cells is applied to analyze neurons. What challenges might arise? How would you assess whether the model is making valid predictions?
Consider a model that achieves 98% accuracy on cell type classification but uses attention patterns that don’t match known biology. Is this model trustworthy? What additional validation would you require before using it for hypothesis generation?
How do foundation models handle the scale problem in single-cell biology? Explain why training one large model on millions of cells is more effective than training separate models for each experiment.
Describe the trade-off between model size and computational efficiency. When might you choose a smaller, faster model over a larger, more accurate one?
Explain how cell embeddings enable cross-species comparison. What does it mean for a human T cell and a mouse T cell to have similar embeddings?
Why is fine-tuning with a small learning rate important? What would happen if you used the same learning rate for fine-tuning as for pre-training?
scBERT
Geneformer
scGPT
In Chapter 17: Toward Whole-Cell Modeling, we’ll explore how foundation models are being integrated into comprehensive models of entire cells. We’ll see how:
Before moving on, make sure you can: