Consider a chef who has spent 20 years cooking every cuisine in the world — French, Japanese, Ethiopian, Peruvian. They’ve never cooked Korean food. But when they walk into a Korean kitchen for the first time, they don’t start from zero. They already understand heat, timing, flavor balance, fermentation. They know that doenjang is in the same family as miso. Their first attempt at kimchi jjigae will be far better than someone who has never cooked at all.
This is the core idea behind foundation models: train on a massive, diverse dataset to learn general principles, then specialize for a new task with very little additional data. The chef’s 20 years of broad cooking experience is the pre-training phase. Walking into the Korean kitchen for the first time — and succeeding quickly — is fine-tuning. In genomics, “cooking every cuisine” means pre-training on millions of DNA sequences from across the tree of life. “Walking into a Korean kitchen” means fine-tuning for your specific research question — with perhaps only a hundred labeled examples.
Why does this matter? Because the hardest biological questions are rarely the well-funded ones with thousands of labeled samples. A rare neurological disorder affecting 50 patients worldwide. A previously unstudied cell type from pancreatic islets. A pathogen that no one has sequenced before. In each case, a model trained from scratch would fail — you simply don’t have enough data to teach it genomics from the ground up. But a foundation model already knows genomics. You’re just teaching it the last mile.
The deep learning models we’ve encountered so far — DeepSEA, Basenji, Enformer — are powerful but face a critical limitation: every new biological question requires new labeled data and new training. Foundation models break this cycle. They learn the grammar of the genome once, from massive unlabeled sequence data, and transfer that grammar wherever it’s needed. This chapter is about how they do it.
Let's look at that limitation more closely. DeepSEA, Basenji, and Enformer all require task-specific training data: DeepSEA needs chromatin accessibility measurements, Basenji needs expression data across tissues, and each model must be trained from scratch on labeled examples for its particular task.
This creates several problems for biological research:
The Data Scarcity Problem: Many important biological questions involve rare cell types, rare conditions, or novel experimental contexts where labeled training data is limited or nonexistent. You might have RNA-seq from 30 patients with a rare disorder, or ATAC-seq from 100 cells of a newly discovered cell type. Traditional supervised learning fails here—you can’t train a deep neural network with 30 examples.
The Task-Specific Problem: Every new biological question requires collecting new training data and training a new model. Want to predict enhancers? Train a model. Want to predict promoters? Train another model. Want to predict splice sites? Train yet another model. Each task requires thousands of labeled examples and days of training time.
The Generalization Problem: Models trained on one species, tissue, or condition often fail when applied to different contexts. A model trained on human heart tissue might not work for brain tissue. A model trained on reference genomes might not handle patient-specific variants well.
These challenges highlight a fundamental inefficiency: we’re asking each model to learn the basic grammar of genomic sequences from scratch, over and over again. It’s like teaching someone to read Shakespeare by only showing them Shakespeare—they’ll learn Shakespeare specifically, but not the general rules of English that could help them read anything.
The experimental alternative—generating comprehensive labeled data for every possible biological question—would cost billions of dollars and take decades. A model trained to predict chromatin states in one cell type can’t simply be asked “What about this other cell type?” without retraining.
What we need is a different approach: models that learn general genomic principles from massive unlabeled data, then transfer that knowledge to specific tasks with minimal additional training. This is the promise of foundation models.
The term “foundation model” emerged in 2021 to describe a new paradigm in artificial intelligence. A foundation model is a large-scale model trained on broad, unlabeled data that can be adapted to a wide range of downstream tasks with minimal task-specific training.
The key characteristics are:
The foundation model concept originated in natural language processing (NLP) with models like BERT and GPT. These models are trained on billions of words of text to understand language structure, then can be adapted to specific tasks like sentiment analysis, question answering, or translation with relatively little task-specific data.
Genomics is actually an ideal domain for foundation models, perhaps even more so than natural language:
Abundant Unlabeled Data: There are billions of DNA sequences available—complete genomes from thousands of species, metagenomic data from environments, and sequence databases like GenBank containing trillions of base pairs. This data is unlabeled in the sense that we don’t have ground-truth annotations for every function, but it contains patterns.
Conserved Grammar: Just as all English text follows grammatical rules, all genomic sequences follow biological rules—promoter motifs, splice signals, regulatory grammars, codon usage patterns. A model that learns these rules could apply them across contexts.
Transfer Across Tasks: The same sequence features relevant for predicting enhancers might also be relevant for predicting transcription factor binding, chromatin accessibility, or evolutionary constraint. Knowledge should transfer.
Data Scarcity for Specific Tasks: While we have abundant sequence data, we often have limited labeled data for specific biological questions—especially for rare cell types, rare conditions, or novel organisms.
Cost of Experimental Validation: Generating labeled training data in genomics is expensive. A single ChIP-seq experiment costs $1,000-5,000. Whole-genome sequencing with deep phenotyping for thousands of individuals costs millions. If a foundation model could reduce the labeled data requirement from 10,000 examples to 100 examples, the cost savings would be enormous.
Let’s contrast the old paradigm with the new:
Old Paradigm (Task-Specific Models):
New Paradigm (Foundation Models):
The key insight: learning general genomic patterns from unlabeled data is the expensive part. Once learned, adapting to specific tasks is relatively cheap.
This is analogous to how humans learn. You don’t learn to read from scratch every time you encounter a new book. You learned general reading skills once, and now you can read anything—scientific papers, novels, recipes—with minimal adjustment.
Transfer learning is the ability of a model to apply knowledge learned from one task to a different but related task. Instead of starting from a blank slate (random parameters), you start from a model that already knows something useful.
The biological analogy: imagine you’re a cell biologist who studies yeast. You’ve spent years learning molecular biology techniques, experimental design, and data analysis. Now you switch to studying human cells. You don’t start from zero—much of your knowledge transfers. You understand Western blots, PCR, microscopy, statistical analysis. You only need to learn the specifics of human cell culture and human-specific biology. Your “pretrained” expertise transfers.
In neural networks, transfer learning works similarly:
Source Task (Pretraining): The model learns general features from a large dataset. For genomics, this might be “predict the next nucleotide in a sequence” trained on 3 billion base pairs of genomic DNA.
Target Task (Fine-tuning): The model adapts to a specific task using a smaller dataset. For example, “classify whether a sequence is an enhancer” using 5,000 labeled examples.
The pretrained model’s parameters encode general genomic knowledge—motif patterns, dinucleotide frequencies, splice signals, regulatory grammars. This knowledge helps even on tasks the model wasn’t explicitly trained for.
Transfer learning is effective when the source and target tasks share underlying structure. Genomics has abundant shared structure:
Shared Motifs: Transcription factor binding motifs like TATA boxes, E-boxes, and CTCF sites appear in many contexts. A model that learns to recognize these motifs when predicting chromatin accessibility can transfer that knowledge to predicting transcription factor binding.
Shared Regulatory Grammar: Enhancers in liver and enhancers in brain have different specific sequences, but they follow similar structural principles—transcription factor binding site clustering, appropriate distances from promoters, GC content patterns.
Conserved Evolutionary Patterns: Functionally important regions tend to be conserved across species. A model trained to recognize conservation patterns in one context can apply that to other contexts.
Shared Sequence Context: The nucleotide patterns surrounding splice sites, start codons, and poly-A signals are similar across genes and tissues.
Let’s formalize this intuition. Suppose we have a source task with abundant data (for example, masked-nucleotide prediction over millions of sequences) and a target task with only a small labeled dataset (for example, enhancer classification with a few thousand examples).
A neural network can be thought of as two components:
$$f(x) = g(h(x))$$
Where $h(\cdot)$ is the feature extractor (the backbone) that maps an input sequence $x$ to a learned representation, and $g(\cdot)$ is the task-specific head that maps that representation to a prediction.
Standard Training (No Transfer):
Transfer Learning:
The pretrained $h()$ learns features like “transcription factor motif detected at position 47” or “high GC content in this window”—features useful for many tasks, not just the source task.
Feature Extraction (Frozen Features; see the sketch after this list):
Fine-tuning (Adapted Features):
Multi-task Learning:
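Of the strategies above, feature extraction is the simplest to sketch in code. The following is a minimal, illustrative PyTorch example, not any published model: `PretrainedBackbone` stands in for a pretrained $h(\cdot)$, its weights are frozen, and only a small task head $g(\cdot)$ is trained.

```python
import torch
import torch.nn as nn

class PretrainedBackbone(nn.Module):
    """Stand-in for h(x): maps an encoded DNA sequence (batch, 4, L) to an embedding."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Conv1d(4, embed_dim, kernel_size=11, padding=5)
        self.pool = nn.AdaptiveAvgPool1d(1)

    def forward(self, x):
        h = torch.relu(self.conv(x))           # (batch, embed_dim, L)
        return self.pool(h).squeeze(-1)        # (batch, embed_dim)

backbone = PretrainedBackbone()
# In practice, pretrained weights would be loaded here, e.g. (hypothetical file name):
# backbone.load_state_dict(torch.load("pretrained_backbone.pt"))

# Feature extraction: freeze h(.) and train only the new head g(.)
for p in backbone.parameters():
    p.requires_grad = False

head = nn.Linear(128, 1)                       # g(.): e.g., enhancer vs. not
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Toy batch standing in for one-hot encoded sequences and binary labels
x = torch.randn(8, 4, 200)
y = torch.randint(0, 2, (8, 1)).float()

logits = head(backbone(x))                     # f(x) = g(h(x))
loss = loss_fn(logits, y)
loss.backward()
optimizer.step()
print(float(loss))
```

Full fine-tuning would simply skip the freezing loop (typically with a smaller learning rate for the backbone), and multi-task learning would attach several heads to the same shared backbone.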
Foundation models need to learn from massive datasets, but manually labeling billions of sequences is impossible. Even if you had the funding, what would you label them with? Most sequences don’t have functional annotations.
This is where self-supervised learning comes in: using the data itself to create training signals, without needing human-provided labels.
The key insight: you can create a supervised learning problem from unlabeled data by hiding parts of the data and asking the model to predict them.
The most successful self-supervised approach is language modeling: predicting masked or future tokens.
For Text (BERT approach):
The model must understand grammar, context, and semantics to predict masked words correctly. By training on billions of sentences, models learn language structure without anyone manually labeling what each sentence means.
For DNA Sequences (Genomic Language Models):
ATGCGATTACGATCGTACGATATGCGA[MASK][MASK]ACGATCGTACGAT

To predict the masked nucleotides, the model must learn local sequence patterns and their biological context (for example, if ATG is masked, it might be a start codon). This self-supervised task doesn’t require any experimental data—just raw sequences. Yet it forces the model to learn genomic grammar.
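Creating these masked training examples takes nothing more than the raw sequence. A minimal sketch (the 15% masking rate is a common choice; real models mask at the token level and use more elaborate schemes):

```python
import random

def mask_sequence(seq, mask_rate=0.15, mask_token="[MASK]"):
    """Hide a random subset of positions; return the masked tokens and the targets."""
    tokens = list(seq)
    targets = {}                  # position -> true nucleotide the model must recover
    for i in range(len(tokens)):
        if random.random() < mask_rate:
            targets[i] = tokens[i]
            tokens[i] = mask_token
    return tokens, targets

random.seed(0)
masked, targets = mask_sequence("ATGCGATTACGATCGTACGATATGCGATTACGATCGTACGAT")
print(masked)    # the sequence with some positions replaced by [MASK]
print(targets)   # e.g. {3: 'C', 17: 'G', ...} -- the prediction targets
```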
Several self-supervised objectives have proven effective:
Masked Language Modeling (MLM):
Next Token Prediction (Autoregressive):
Given ATGCGA, predict the next nucleotide.

Sequence Order Prediction:
Contrastive Learning:
The effectiveness of self-supervised learning might seem magical—how can predicting masked nucleotides teach a model about enhancers, splice sites, or gene expression?
The key is that genomic sequences have structure, and that structure is relevant to function:
Local Structure: To predict masked nucleotides in motifs, the model must learn motif patterns. A model that learns that TATA[MASK][MASK] is often TATAAA (the TATA box) has learned something about promoters.
Context Dependencies: To predict a nucleotide, the model must consider surrounding context. This forces learning of regulatory grammar—transcription factor binding sites cluster near enhancers, splice donor sites appear in specific contexts.
Evolutionary Constraints: Functionally important sequences are conserved. A model trained on sequences from multiple species will learn that certain patterns are preserved, indicating functional importance.
Statistical Patterns: Gene-rich regions, repeat elements, CpG islands, and other large-scale genomic features have characteristic sequence statistics. Learning to predict nucleotides captures these patterns.
The result: a model pretrained to predict masked nucleotides develops internal representations that encode functional properties of genomic sequences, even though those properties were never explicitly labeled during training.
[Optional: The Math]
Math Box: The Masked Language Modeling Objective
Let’s formalize the masked language modeling objective mathematically.
Given a genomic sequence $\mathbf{s} = (s_1, s_2, \ldots, s_L)$ where each $s_i \in \{A, C, G, T\}$ and $L$ is the sequence length:
Step 1: Masking
- Randomly select positions $M \subset \{1, 2, \ldots, L\}$ to mask (typically 15% of positions)
- Create masked sequence $\tilde{\mathbf{s}}$ where $\tilde{s}_i = [MASK]$ if $i \in M$, otherwise $\tilde{s}_i = s_i$
Step 2: Prediction
- Model $f_\theta$ (parameterized by $\theta$) takes the masked sequence and outputs a probability distribution over nucleotides at each position: $$P_\theta(s_i \mid \tilde{\mathbf{s}}) = \text{softmax}(f_\theta(\tilde{\mathbf{s}})_i)$$
Step 3: Loss Function
- Objective is to maximize the probability of the correct nucleotides at masked positions: $$\mathcal{L}(\theta) = -\sum_{i \in M} \log P_\theta(s_i \mid \tilde{\mathbf{s}})$$
This is negative log-likelihood (cross-entropy loss). We want to maximize log-probability, which is equivalent to minimizing negative log-probability.
Biological Interpretation:
- The model learns to encode each position’s context (surrounding nucleotides) into a representation
- This representation must contain information about motifs, patterns, and constraints
- These learned representations transfer to downstream tasks like enhancer prediction
Example Calculation: Suppose at a masked position the true nucleotide is A, and the model outputs probabilities:
- $P(A) = 0.7$
- $P(C) = 0.1$
- $P(G) = 0.1$
- $P(T) = 0.1$
Loss at this position: $-\log(0.7) \approx 0.357$

If the model is uncertain and outputs $P(A) = 0.25$ (random guessing), the loss would be $-\log(0.25) \approx 1.386$—a much higher penalty.
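A toy check of this arithmetic in code (not a training loop—just the cross-entropy at a single masked position):

```python
import math

def masked_position_loss(probs, true_base):
    """Negative log-likelihood at one masked position."""
    return -math.log(probs[true_base])

confident = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
uniform   = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(round(masked_position_loss(confident, "A"), 3))  # 0.357
print(round(masked_position_loss(uniform, "A"), 3))    # 1.386
```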
Over billions of sequences, minimizing this loss forces the model to learn predictive patterns in genomic sequences.
Pretraining is the computationally expensive stage where the model learns general genomic knowledge from massive unlabeled data.
Pretraining Data:
Pretraining Task:
Pretraining Outcome:
Biological Analogy: Pretraining is like learning to read and write in your native language. You read thousands of books, articles, and documents. You haven’t specifically trained for writing scientific papers, but you’ve learned grammar, vocabulary, and structure that will help with any writing task.
Another way to picture it: a medical residency program. After years of seeing diverse patients and building broad clinical experience, a physician then chooses a specialty. Likewise, the model reads millions of unlabeled sequences, developing general biological intuition before specializing.
Fine-tuning is the task-specific stage where the pretrained model adapts to your biological question using a small amount of labeled data.
Fine-tuning Data:
Fine-tuning Process:
Fine-tuning Outcome:
Biological Analogy: Fine-tuning is like taking a general biology course (pretraining) then specializing in neuroscience (fine-tuning). You don’t relearn cell biology from scratch—you build on foundation and add specialization.
Continuing the residency analogy: choosing a specialty builds on broad foundational training, so the model can specialize for a specific task (e.g., variant effect prediction) with relatively few labeled examples.
There are several approaches to fine-tuning, trading off between computation and adaptation:
Full Fine-Tuning:
Feature Extraction (Frozen Backbone):
Partial Fine-Tuning:
Low-Rank Adaptation (LoRA; sketched below):
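To make the low-rank idea concrete, here is a toy LoRA-style adapter wrapping a frozen linear layer (illustrative only—libraries such as Hugging Face PEFT implement this properly, and the dimensions and hyperparameters below are arbitrary):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W plus a trainable low-rank update B @ A."""
    def __init__(self, pretrained: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = pretrained
        for p in self.base.parameters():
            p.requires_grad = False                  # original weights stay fixed
        in_f, out_f = pretrained.in_features, pretrained.out_features
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))   # zero init: starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), rank=8)
out = layer(torch.randn(2, 512))                     # same interface as the original layer

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total}")           # only a few percent is updated
```

Only the small matrices $A$ and $B$ are trained, so the number of updated parameters is a small fraction of the original layer's.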
Let’s walk through a concrete example.
Problem: Predict enhancers in pancreatic beta cells. You have ChIP-seq data for the enhancer-associated histone mark H3K27ac in beta cells, giving you 3,000 positive examples (enhancers) and 10,000 negative examples (non-enhancers).
Approach 1: Training from Scratch
Approach 2: Using Pretrained Foundation Model
Actual Numbers from Research: Studies comparing these approaches on ENCODE enhancer prediction tasks found:
This demonstrates the power of transfer learning: pretrained knowledge dramatically reduces labeled data requirements.
Fine-tuning requires some labeled examples. But what if you have almost no labeled data? Foundation models enable a spectrum of learning paradigms:
Traditional Supervised Learning:
Fine-Tuning:
Few-Shot Learning:
One-Shot Learning:
Zero-Shot Learning:
Biological Analogy: Like a trained immunologist recognizing a pathogen they have never seen before. Prior knowledge generalizes to new situations.
Zero-shot learning is the ability to perform a task without any task-specific training examples. This might seem impossible—how can a model do something it was never trained to do?
The key: the pretrained model’s internal representations encode genomic properties. Even without explicit training on your task, the model might have learned relevant features during pretraining.
How It Works:
Example: Predicting Regulatory Function
Suppose you want to predict whether a sequence is a promoter, enhancer, or silencer, but you have no labeled training data.
With a pretrained foundation model:
This works because the model learned during pretraining that promoters have certain sequence characteristics (TATA boxes, CpG islands, proximity to transcription start sites), enhancers have others (TF motif clusters, H3K27ac association), etc. Even though it was never explicitly trained to classify regulatory elements, it learned features that distinguish them.
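As a minimal illustration of the zero-shot workflow—no labels, just pretrained representations—the sketch below embeds sequences with a frozen encoder and groups them by similarity. Here `embed` is a hypothetical stand-in; with a real foundation model you would call its released embedding API instead.

```python
import zlib
import numpy as np
from sklearn.cluster import KMeans

def embed(seq, dim=64):
    """Hypothetical stand-in for a frozen pretrained encoder's sequence embedding."""
    rng = np.random.default_rng(zlib.crc32(seq.encode()))
    return rng.normal(size=dim)

sequences = ["TATAAA" * 30, "GGGCGG" * 30, "CAGCTG" * 30, "ATATAT" * 30]
X = np.stack([embed(s) for s in sequences])

# Group sequences purely by embedding similarity -- no labels are ever used
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)
```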
Limitations:
When to Use:
Few-shot learning extends zero-shot by allowing a handful of examples. The goal is to adapt to a new task with 5-100 examples instead of thousands.
Approach 1: Fine-Tuning with Regularization
Approach 2: Metric Learning
Approach 3: Prompt-Based Learning
Example: Discovering Binding Sites with Few Examples
Suppose you’re studying a transcription factor with only 10 known binding sites (from ChIP-seq peaks).
Traditional approach: Can’t train a deep learning model with 10 examples—would severely overfit.
Few-shot approach with foundation model:
Research has shown this can achieve ~70% recall at 5% false positive rate with just 10 examples—far better than motif-based methods.
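A minimal sketch of this few-shot idea using a nearest-centroid rule over frozen embeddings. The `embed` function is a hypothetical placeholder for a pretrained foundation model, and the sequences and scores are illustrative, not real ChIP-seq data.

```python
import zlib
import numpy as np

def embed(seq, dim=64):
    """Hypothetical stand-in for a frozen pretrained encoder."""
    rng = np.random.default_rng(zlib.crc32(seq.encode()))
    return rng.normal(size=dim)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Support set: the 10 known binding sites (illustrative sequences)
known_sites = ["GGGACTTTCC" + base * 10 for base in "ACGTACGTAC"]
centroid = np.mean([embed(s) for s in known_sites], axis=0)

# Score candidate sequences by similarity to the centroid of the known sites
candidates = ["GGGACTTTCCTTTTTTTTTT", "ACACACACACACACACACAC"]
for seq in candidates:
    print(seq, round(cosine(embed(seq), centroid), 3))
```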
Let’s examine a practical case study from recent research.
Problem: Predict which genomic variants affect gene splicing. Standard approach requires collecting thousands of variants with experimentally validated splicing outcomes—expensive and time-consuming.
Zero-Shot Approach Using Pretrained Model:
Researchers used a genomic foundation model (similar to models we’ll discuss in Chapters 13-14) and asked: can embeddings predict splicing effects without any splicing-specific training?
Method:
Results:
Key Insight: The model learned during pretraining that certain sequence patterns near exon-intron boundaries are important. When variants disrupt these patterns, embeddings change substantially. The model never saw splice labels during training, but learned features relevant to splicing.
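A sketch of the embedding-distance approach described in this case study. Again, `embed` is a hypothetical placeholder; with a real pretrained model you would embed the reference and variant windows using its released encoder and compare the representations.

```python
import zlib
import numpy as np

def embed(seq, dim=64):
    """Hypothetical stand-in for a pretrained genomic encoder."""
    rng = np.random.default_rng(zlib.crc32(seq.encode()))
    return rng.normal(size=dim)

def variant_effect_score(ref_seq, pos, alt_base):
    """Zero-shot score: how far does the variant move the sequence's embedding?"""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return float(np.linalg.norm(embed(ref_seq) - embed(alt_seq)))

# Toy window containing a splice-donor-like GT dinucleotide at positions 3-4
window = "CAGGTAAGT" + "ACGT" * 20
print(variant_effect_score(window, pos=3, alt_base="C"))   # variant hitting the donor site
print(variant_effect_score(window, pos=40, alt_base="C"))  # deep intronic variant
```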
Practical Application:
Foundation models for genomics are typically based on transformer architectures (which we introduced in Chapter 10). Key design decisions:
Model Size:
Context Length:
Tokenization (see the sketch after this list):
Positional Encoding:
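For example, several DNA language models (DNABERT among them) tokenize sequences into overlapping k-mers rather than single nucleotides. A minimal sketch of that choice and its vocabulary-size consequence:

```python
def kmer_tokenize(seq, k=6, stride=1):
    """Split a DNA sequence into overlapping k-mer tokens (stride=1 -> fully overlapping)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

print(kmer_tokenize("ATGCGATTACGA", k=6))
# ['ATGCGA', 'TGCGAT', 'GCGATT', 'CGATTA', 'GATTAC', 'ATTACG', 'TTACGA']

# The vocabulary grows as 4^k: 4 tokens for single nucleotides, 4096 for 6-mers
print(4 ** 1, 4 ** 6)
```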
Training a genomic foundation model requires substantial resources:
Data Scale:
Computational Requirements:
Training Time:
Data Preprocessing:
How do we know if a foundation model is good? Unlike task-specific models, we can’t evaluate on a single task.
Pretraining Metrics:
Downstream Task Performance:
Probing Tasks:
Data Efficiency:
Research on language models has revealed “scaling laws”: predictable relationships between model size, data size, and performance.
Scaling Laws for Genomics:
Similar patterns emerge in genomic foundation models:
Larger models → better transfer learning (but with diminishing returns)
More diverse data → better generalization
Longer training → better, but eventually plateaus
Practical Implications:
One of the most exciting applications of genomic foundation models is transferring knowledge across species.
The Challenge:
Transfer Learning Approach:
Pretrain on multi-species data (human, mouse, zebrafish, fly, etc.)
Fine-tune on target species with limited data
Results from Research: Studies show impressive cross-species transfer:
Why It Works:
Similar benefits apply within a species across tissues or conditions:
Example: Tissue-Specific Enhancers
Example: Condition-Specific Expression
A challenge in transfer learning is domain shift: when test data differs systematically from training data.
Examples in Genomics:
Approaches to Handle Domain Shift:
Domain-Adversarial Training:
Unsupervised Domain Adaptation:
Multi-Task Learning:
Calibration:
Foundation models aren’t always the best choice. Use them when:
✅ Good Use Cases:
❌ Less Suitable Cases:
Many genomic foundation models are now publicly available. How to choose?
Consider:
Pretraining Data:
Model Size:
Context Length:
Task Performance:
Availability and Documentation:
To get best results when fine-tuning:
Data Preparation:
Hyperparameter Selection:
Regularization:
Gradual Unfreezing (see the sketch after this list):
Task-Specific Architecture:
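A sketch combining two of these practices—a much smaller learning rate for pretrained layers and gradual unfreezing—using generic PyTorch modules as placeholders for a real pretrained backbone; the schedule and rates are illustrative, not tuned values.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(                      # placeholder for a pretrained encoder
    nn.Linear(4 * 200, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
)
head = nn.Linear(128, 1)                       # new task-specific layer

# Discriminative learning rates: tiny for pretrained weights, larger for the new head
optimizer = torch.optim.Adam([
    {"params": backbone.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])

# Gradual unfreezing: start with the backbone frozen, release it after a few epochs
for p in backbone.parameters():
    p.requires_grad = False

for epoch in range(10):
    if epoch == 3:                             # unfreeze once the head has stabilized
        for p in backbone.parameters():
            p.requires_grad = True
    x = torch.randn(16, 4 * 200)               # toy batch standing in for encoded sequences
    y = torch.randint(0, 2, (16, 1)).float()
    loss = nn.functional.binary_cross_entropy_with_logits(head(backbone(x)), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```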
Foundation models are complex and can be hard to interpret. Approaches to understand them:
Attention Visualization:
Gradient-Based Methods:
In Silico Mutagenesis (see the sketch after this list):
Embedding Analysis:
Probing Classifiers:
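In silico mutagenesis is straightforward to sketch against any sequence-scoring function. Here `model_score` is a placeholder; with a fine-tuned foundation model it would be the model's predicted probability for the task of interest.

```python
BASES = "ACGT"

def model_score(seq):
    """Placeholder scorer; replace with a fine-tuned model's prediction."""
    return seq.count("GGGACTTTCC") + 0.01 * seq.count("GC")

def in_silico_mutagenesis(seq):
    """Score the effect of every possible single-nucleotide substitution."""
    ref = model_score(seq)
    effects = {}                                # (position, alt base) -> change in score
    for i, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue
            mutant = seq[:i] + alt + seq[i + 1:]
            effects[(i, alt)] = model_score(mutant) - ref
    return effects

seq = "AAGGGACTTTCCAA"
effects = in_silico_mutagenesis(seq)

# Positions where substitutions most reduce the score mark the bases the model relies on
worst = min(effects, key=effects.get)
print(worst, round(effects[worst], 3))
```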
Despite their promise, genomic foundation models face challenges:
Limited Context Length:
Computational Cost:
Data Bias:
Interpretability:
Generalization to Rare Events:
Research is addressing these limitations:
Longer Context:
Efficient Models:
Diverse Training Data:
Interpretable Architectures:
Active Learning:
Exciting developments on the horizon:
Multi-Modal Foundation Models:
Foundation Models for Protein Sequences:
Genome-Scale Simulations:
Personalized Genomics:
Automated Scientific Discovery:
Key Takeaways:
Foundation models are large-scale models pretrained on broad data that can be adapted to many downstream tasks, addressing the challenge of limited labeled data in genomics
Transfer learning enables applying knowledge from one task to another, dramatically reducing the labeled data requirement from tens of thousands to hundreds of examples
Self-supervised learning trains models on unlabeled genomic sequences using objectives like masked nucleotide prediction, learning genomic grammar without experimental data
Pretraining and fine-tuning is a two-stage paradigm: first learn general genomic patterns from massive unlabeled data, then adapt to specific tasks with small labeled datasets
Zero-shot and few-shot learning extend foundation models to scenarios with minimal or no labeled data, using learned embeddings for prediction without task-specific training
Cross-species and cross-condition transfer allows knowledge learned in one biological context (e.g., human liver) to improve predictions in another (e.g., mouse pancreas) by leveraging conserved genomic principles
Foundation models work because genomic sequences have shared structure—transcription factor motifs, regulatory grammar, evolutionary conservation—that transfers across contexts and tasks
| Term | Definition |
|---|---|
| Domain adaptation | Adjusting a model to perform well on a new domain (e.g., different species or tissue) that differs from the training domain |
| Embedding | Vector representation of a sequence produced by a neural network, encoding its properties in continuous space |
| Feature extraction | Using a pretrained model’s learned representations without updating its parameters, only training task-specific output layers |
| Few-shot learning | Learning to perform a task with very limited labeled examples (typically 5-100), leveraging pretrained knowledge |
| Fine-tuning | Adapting a pretrained model to a specific task by training on task-specific labeled data with small learning rate adjustments |
| Foundation model | Large-scale model trained on broad data that can be adapted to many downstream tasks through transfer learning |
| Masked language modeling (MLM) | Self-supervised learning objective where random tokens are masked and the model learns to predict them from context |
| One-shot learning | Extreme case of few-shot learning where only a single example is available for a new task |
| Pretraining | Initial training phase where a model learns general patterns from large-scale unlabeled data before being adapted to specific tasks |
| Self-supervised learning | Learning from unlabeled data by creating training signals from the data itself (e.g., predicting masked parts) |
| Transfer learning | Applying knowledge learned from one task or domain to improve performance on a different but related task |
| Zero-shot learning | Performing a task without any task-specific training examples, relying only on pretrained knowledge and embeddings |
Explain why a genomic foundation model pretrained on human sequences might help predict enhancers in mouse liver, even though it never saw mouse liver ChIP-seq data during pretraining. What genomic features transfer across this species and tissue boundary?
A researcher has RNA-seq data from only 20 patients with a rare neurological disorder. Should they train a deep learning model from scratch or use a pretrained foundation model? Explain your reasoning considering both accuracy and the biology of rare disorders.
Masked language modeling requires a model to predict hidden nucleotides from context. How does this self-supervised task force the model to learn about biologically relevant features like transcription factor binding motifs or splice sites, even though these features are never explicitly labeled during training?
Compare the computational costs of: (a) training a task-specific model from scratch for 10 different genomic prediction tasks, versus (b) pretraining one foundation model then fine-tuning it for the same 10 tasks. When would each approach be more efficient?
Zero-shot learning allows making predictions without any task-specific training data by using sequence embeddings. Describe a biological scenario where zero-shot predictions would be particularly valuable, and explain what limitations you would expect compared to fine-tuned predictions.
Explain why longer context length (the number of nucleotides a model can process at once) is particularly important for foundation models in genomics, compared to foundation models in natural language processing.
A foundation model trained only on reference human genome sequences might perform poorly when applied to cancer genomes with many somatic mutations. This is called “domain shift.” Suggest two approaches to address this problem while still leveraging the pretrained model.
Some researchers argue that foundation models are “black boxes” that provide predictions without biological insight, while others claim they can reveal biological principles. Evaluate both perspectives using specific examples of what we can and cannot learn from foundation model embeddings.
Ethical Considerations in Population Genomics: Most genomic foundation models are pretrained predominantly on European ancestry genomes. Discuss how this could lead to performance disparities when models are applied to individuals of other ancestries. What steps should researchers take to address this bias, and what are the broader equity implications?
The Cost-Benefit of Foundation Models: Pretraining a large genomic foundation model might cost $500,000 in computational resources, while task-specific models for individual labs cost ~$1,000 each. From a scientific community perspective, when does it make sense to invest in foundation models versus encouraging labs to train task-specific models? Consider issues of access, reproducibility, and innovation.
Interpretability vs. Performance: Foundation models often provide better predictions than simpler, more interpretable models (like position weight matrices for motifs). In what biological contexts is it acceptable to use “black box” predictions, and when is mechanistic interpretability essential? Give specific examples.
Data Privacy and Sensitive Information: Genomic data is highly sensitive and can identify individuals. If foundation models are trained on genomic databases, could they inadvertently memorize and reveal information about individuals in the training data? What safeguards should be implemented?
Generalization to Truly Novel Biology: Foundation models excel at transferring knowledge to scenarios similar to their training data. However, what about truly novel biology—undiscovered regulatory mechanisms, synthetic genomes, or organisms from extreme environments? Discuss the limitations of transfer learning for genuinely new biological phenomena and how these might be addressed.
Bommasani, R., et al. (2021). “On the Opportunities and Risks of Foundation Models.” arXiv:2108.07258.
Devlin, J., et al. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL.
Pan, S. J., & Yang, Q. (2010). “A Survey on Transfer Learning.” IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345-1359.
Dalla-Torre, H., et al. (2023). “The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics.” bioRxiv.
Jumper, J., et al. (2021). “Highly accurate protein structure prediction with AlphaFold.” Nature, 596, 583-589.
Nguyen, E., et al. (2023). “HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution.” arXiv:2306.15794.
Hugging Face Genomics Models: https://huggingface.co/models?other=genomics
Repository of pretrained genomic models including DNABERT, Nucleotide Transformer
Papers with Code - Transfer Learning in Genomics: https://paperswithcode.com/task/transfer-learning
Benchmarks and leaderboards for transfer learning methods
Awesome Genomic Language Models: https://github.com/topics/genomic-language-models
Community-curated list of genomic foundation models and resources
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
Chapter 15 on representation learning provides theoretical foundations for why transfer learning works
Weiss, K., Khoshgoftaar, T. M., & Wang, D. (2016). “A survey of transfer learning.” Journal of Big Data, 3(1), 1-40.
Comprehensive technical survey of transfer learning methods
You’ve now learned the conceptual foundations of genomic foundation models—what they are, why they work, and how they’re changing genomics research. Foundation models address the fundamental challenge of data scarcity by learning general genomic principles from massive unlabeled data, then transferring that knowledge to specific tasks with minimal labeled examples.
In the next three chapters, we’ll examine specific genomic foundation models in detail:
Chapter 13: DNA Language Models will cover the first generation of genomic foundation models based on BERT-style architectures:
Chapter 14: Next-Generation DNA Models will explore cutting-edge architectures that address current limitations:
Chapter 15: Introduction to Single-Cell Omics will shift from DNA sequences to gene expression, preparing for:
Before proceeding, make sure you’re comfortable with:
If any of these concepts are unclear, review the relevant sections in this chapter. The specific models in Chapters 13-14 all build on these foundational concepts, so understanding them now will make the next chapters much easier to follow.
Ready to dive into specific DNA language models? Let’s explore how BERT-style transformers have been adapted for genomic sequences in Chapter 13!
Chapter 12 Complete
Next: Chapter 13 - DNA Language Models