Three photographers stand before the same mountain landscape at golden hour. The first uses a macro lens — pressed close to a wildflower in the foreground, resolving every vein in a single petal, the texture of a stamen, the precise arc of a dew drop. The second has a telephoto — trained on the horizon, tracking the V-formation of geese moving south across the ridge, capturing the sweep of their journey but nothing of the flower below. The third holds a fisheye — stepping back until the entire panorama fills the frame: mountain, meadow, birds, flower, sky, all at once, compressed into a single circular window on the world.
They are standing in the same place, photographing the same scene. Yet each returns with something fundamentally different — not because one is a better photographer, but because their lenses were built for different questions. The macro lens would be useless for tracking migration. The fisheye would dissolve the flower petal into an indistinct smear. The right lens depends entirely on what you are trying to see.
Genomic data presents exactly this problem. A short transcription factor binding motif — 8 nucleotides tucked inside a 200-base-pair regulatory sequence — calls for a macro lens: precise, local, pattern-sensitive. A gene expression time course across T cell activation calls for a telephoto: sequential, history-aware, tracking how the present depends on the past. Long-range chromatin interactions between enhancers and promoters separated by half a megabase call for the fisheye: a view wide enough to see connections that local inspection would miss entirely.
In Chapter 3, we learned how neural networks learn in general. This chapter asks a different question: what shape should a network take? Convolutional networks, recurrent networks, and Transformers are not interchangeable tools — they are different lenses. Choosing the wrong architecture is not a minor inefficiency; it is like trying to photograph a bird migration with a macro lens. This chapter is about learning to see which lens fits which problem.
The biological world generates diverse data types, each with unique structural properties:
Spatial patterns in sequences: Transcription factor binding sites, splice sites, and regulatory elements are position-specific patterns. A motif at positions 50-58 carries information, but the same motif shifted to positions 100-108 means something different. Fully-connected networks treat every position as an independent feature, destroying these spatial relationships.
Sequential dependencies in time series: Gene expression, cell state trajectories, and developmental processes unfold over time. Expression at time t depends on history: what happened at times t-1, t-2, t-3. Treating each time point independently loses this temporal structure.
Long-range dependencies: Enhancers regulate genes megabases away. Alternative splicing depends on splicing signals hundreds of nucleotides apart. 3D genome organization brings distant loci together. Standard neural networks have limited “receptive fields”—they can’t see these long-range interactions.
Why not just use fully-connected networks for everything? Because they ignore all of this structure: every input gets its own independent weight, so parameter counts explode, spatial and temporal relationships are destroyed, and nothing learned at one position transfers to another.
The solution: Specialized neural network architectures that match the structure of biological data.
By the end of this chapter, you will be able to:
[Biological Analogy] CNNs are like scanning a histology slide section by section, detecting cell patterns in small windows—the same pattern detector slides across the entire sample, regardless of position.
Imagine you’re looking for transcription factor binding motifs in DNA sequences. A motif like “GATAAG” might appear anywhere in the sequence—position 10, position 150, position 800. You don’t want to learn three different “GATAAG detectors” for three different positions.
CNNs solve this with a simple idea: Use the same pattern detector (called a filter or kernel) and slide it across the entire sequence, like using a magnifying glass to scan a document.
Connection to Chapter 3: Remember how individual neurons in Chapter 3 had weights for each input? CNNs use the same set of weights at every position—this is called weight sharing, and it dramatically reduces the number of parameters!
Think of it like this: You have a small window (the filter) that you slide along the DNA sequence, checking for a specific pattern at each position.
Step 1: Define a filter (kernel)
A filter is a small set of learnable weights. For DNA sequences, a filter might be 8 bp wide:
Filter (length 8):
[w₁, w₂, w₃, w₄, w₅, w₆, w₇, w₈]
Step 2: Slide the filter across the sequence
Apply the filter to every position:
DNA sequence: ATCGATAAGCCGTA...
Position 1: ATCGATAA → compute score
Position 2: TCGATAAG → compute score
Position 3: CGATAAGC → compute score
...
At each position, multiply filter weights by sequence features and sum them up.
Step 3: Apply activation function
Activated score = ReLU(Score + bias)
Result: A “feature map” showing where the pattern appears:
Feature map: [0.2, 0.0, 0.0, 8.5, 7.3, 0.1, ...]
↑ ↑
High activations = pattern detected!
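The three steps above can be sketched in plain Python. This is a toy: the filter is hand-set (length 6, scoring one point per base that matches GATAAG) rather than learned, so the numbers are illustrative only:

```python
# One-hot channel index for each base (A, C, G, T).
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Encode DNA as a list of 4-channel vectors."""
    return [[1.0 if BASES[b] == c else 0.0 for c in range(4)] for b in seq]

def conv1d(seq, filt, bias=0.0):
    """Slide one filter across the sequence (steps 1-3 from the text).

    At each window: multiply weights by inputs, sum, add bias, apply ReLU.
    Returns the resulting feature map.
    """
    x = one_hot(seq)
    k = len(filt)
    feature_map = []
    for i in range(len(x) - k + 1):            # step 2: every window start
        score = sum(filt[j][c] * x[i + j][c]   # weight times input, summed
                    for j in range(k) for c in range(4))
        feature_map.append(max(0.0, score + bias))   # step 3: ReLU
    return feature_map

# Step 1: a hand-set filter that rewards an exact GATAAG match.
filt = [[1.0 if BASES[b] == c else 0.0 for c in range(4)] for b in "GATAAG"]

feature_map = conv1d("ATCGATAAGCCGTA", filt, bias=-5.0)
print(feature_map)  # only the window starting at GATAAG (index 3) survives ReLU
```

The same filter weights are reused at every window, which is exactly the weight sharing described above.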
Let’s detect “GATAAG” in a real sequence.
Input sequence:
Position: 1 2 3 4 5 6 7 8 9 10 11 12
Sequence: A T C G A T A A G C C G
After training, the CNN learns filter weights that produce high scores wherever the GATAAG pattern appears.
At the window covering positions 4-9 (where "GATAAG" is present), the filter gives a high score (~9.3).
At other positions, without the pattern, the score is low (~0.2).
The CNN automatically learns which weights detect important patterns by training on labeled data!
Real CNNs use dozens or hundreds of filters simultaneously:
Each filter creates one feature map, so 100 filters create 100 feature maps showing different patterns.
After convolution, we often use max pooling to shrink the feature map and gain tolerance to small shifts in motif position:
Before pooling: [0.2, 0.0, 8.5, 7.3, 0.1, 0.3]
Max pooling (window=2): [0.2, 8.5, 0.3]
↑ ↑ ↑
Take max from each pair
Biological intuition: “Is there a GATA motif somewhere in this region?” is often more important than “Is there a GATA motif exactly at position 57?”
This is hierarchical feature learning—just like how your visual system detects edges → shapes → objects!
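The pooling step above can be sketched directly, using the same numbers as the example:

```python
def max_pool(feature_map, window=2):
    """Keep only the strongest activation in each non-overlapping window."""
    return [max(feature_map[i:i + window])
            for i in range(0, len(feature_map), window)]

pooled = max_pool([0.2, 0.0, 8.5, 7.3, 0.1, 0.3], window=2)
print(pooled)  # [0.2, 8.5, 0.3]: the 8.5 peak survives, position detail is coarsened
```

The output is half the length, but the strong motif activation is preserved, which is why pooling trades positional precision for robustness.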
A typical CNN for sequence analysis:
Input: DNA sequence (1000 bp × 4 channels)
↓
Conv Layer 1: 128 filters, size 8
↓ (128 feature maps)
ReLU + Max Pool (window=2)
↓
Conv Layer 2: 64 filters, size 8
↓ (64 feature maps)
ReLU + Max Pool (window=2)
↓
Flatten → Fully Connected Layer → Output
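The "4 channels" in the input line refer to one-hot encoding: each base becomes a 4-dimensional vector with a single 1. A minimal sketch:

```python
def one_hot_dna(seq):
    """Encode a DNA string as one 4-channel vector per base (A, C, G, T order)."""
    table = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
             "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
    return [table[base] for base in seq.upper()]

encoded = one_hot_dna("GATC")
print(encoded)  # [[0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0]]
```

A 1000 bp input thus becomes a 1000 × 4 matrix, the shape the first convolutional layer expects.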
Parameter efficiency: Conv Layer 1 needs only 128 × (8 × 4 + 1) = 4,224 parameters, because the same 8-wide, 4-channel filters are reused at every one of the ~1000 positions. A fully-connected layer over the same 1000 × 4 input would need 4,000 weights per hidden unit.
[Optional: The Math]
At each position i, the convolution output is:
y(i) = ReLU( Σⱼ w(j) · x(i+j) + b )
where w is the filter weight vector, x is the input sequence, and b is a bias term. The same w and b are used at every position—this is weight sharing.
What is DeepSEA? A CNN-based model that predicts chromatin features directly from DNA sequence (Zhou & Troyanskaya 2015, Nature Methods).
Task: Given a 1000 bp DNA sequence, predict 919 chromatin features: transcription factor binding (690 profiles), DNase I hypersensitivity (125), and histone marks (104).
Training data: DeepSEA and similar CNN models were trained on massive public epigenome datasets:
Major Genomics Consortia Data
ENCODE (Encyclopedia of DNA Elements)
Goal: Catalog all functional elements in the human genome
Data: >10,000 experiments across cell types
Includes: TF binding, histone marks, chromatin accessibility, RNA expression
Why it matters: Provides ground truth labels for training AI models
Roadmap Epigenomics
Goal: Map epigenomic landscapes across human tissues
Data: 111 reference epigenomes
Focus: DNA methylation, histone modifications, chromatin states
Why it matters: Shows how the same DNA sequence has different functions in different cell types
FANTOM (Functional Annotation of the Mammalian Genome)
Goal: Identify transcription start sites and enhancers
Data: Cap Analysis Gene Expression (CAGE) across hundreds of samples
Why it matters: Defines where transcription actually begins genome-wide
Why it matters: A single nucleotide change can disrupt a binding site. DeepSEA can predict this effect without doing expensive experiments!
✅ Good for: local, position-independent patterns such as TF binding motifs, splice signals, and image data (histology, microscopy)
❌ Not ideal for: long-range dependencies or temporal trajectories, where a filter's small receptive field cannot see the relevant context
[Biological Analogy] RNNs are like reading a protein sequence one amino acid at a time, updating your interpretation as you go—each new residue informs your understanding of the whole chain’s function.
Imagine predicting gene expression at hour 24 based on measurements at hours 0, 6, 12, 18:
Hour: 0 6 12 18 24
Expression: 2.3 4.1 7.8 6.2 ???
The challenge: Expression at hour 24 doesn’t just depend on hour 18—it depends on the entire trajectory. Did expression increase gradually (2.3→4.1→7.8) or spike suddenly? This history matters!
CNNs can't handle this well because their filters see only a fixed-width local window and carry no memory of what came before it.
We need a network that processes sequences one step at a time, maintaining memory of what came before.
An RNN maintains a hidden state that gets updated at each time step. Think of it as the network’s “memory” or “notes” that it carries forward.
h₀ (initial memory: empty)
↓
x₁ → [RNN] → h₁ → output₁
↓
x₂ → [RNN] → h₂ → output₂
↓
x₃ → [RNN] → h₃ → output₃
At each step, the new input and the previous hidden state are combined into an updated hidden state. The hidden state is like taking notes as you read a story: each new sentence updates your understanding!
At each time step, the RNN does two things:
new_memory = combine(current_input, previous_memory)
output = process(new_memory)
Key point: The same processing happens at every time step—the network uses the same “rules” throughout the sequence.
Let’s predict expression at hour 24:
Input sequence: [2.3, 4.1, 7.8, 6.2] (hours 0, 6, 12, 18)
Target: 5.5 (hour 24)
Processing step by step:
Hour 0: Input: 2.3 → Memory: “I saw expression = 2.3”
Hour 6: Input: 4.1 + Previous memory → Memory: “I saw 2.3, then 4.1 (increasing trend!)”
Hour 12: Input: 7.8 + Memory → Memory: “Expression is rising: 2.3 → 4.1 → 7.8”
Hour 18: Input: 6.2 + Memory → Memory: “Rose to 7.8, then dropped to 6.2 (peak and decline!)”
Final prediction: Based on memory “peaked at 7.8, now declining to 6.2” → Prediction: ~5.5
The final memory contains information about the entire trajectory!
[Optional: The Math]
At each step t, the RNN update is:
h(t) = tanh( Wₓ · x(t) + Wₕ · h(t-1) + b )
output(t) = Wₒ · h(t)
where Wₓ, Wₕ, Wₒ are weight matrices shared across all time steps.
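With scalar inputs and a one-dimensional hidden state, the update rule can be run directly on the expression trajectory from the example. The weights here (Wx, Wh, b) are hand-picked for illustration, not trained:

```python
import math

def rnn_step(x, h_prev, Wx=0.3, Wh=0.5, b=-1.0):
    """One RNN update: h(t) = tanh(Wx * x(t) + Wh * h(t-1) + b)."""
    return math.tanh(Wx * x + Wh * h_prev + b)

h = 0.0                          # h0: empty initial memory
for x in [2.3, 4.1, 7.8, 6.2]:   # hours 0, 6, 12, 18
    h = rnn_step(x, h)           # the same weights are reused at every step

# The final hidden state is a compressed summary of the whole trajectory.
print(round(h, 3))
```

A prediction head (the Wₒ · h(t) line above) would then map this final state to the hour-24 expression value.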
Connection to Chapter 2: This is like Bayesian updating! Each new observation updates the network’s “belief” (hidden state) about what’s happening.
RNNs have a serious limitation: they forget long-term dependencies.
Why? During training, the network needs to learn from examples that happened many steps ago. But as gradients flow backward through time, they get weaker and weaker—like a whisper that fades as it travels through a long corridor.
In practice: Basic RNNs can only remember ~10 time steps back. For biology, this is problematic: regulatory context can span hundreds of nucleotides, and splice-site pairs sit hundreds of bases apart.
Imagine trying to predict the ending of a book but only remembering the last 2 pages—that’s the vanishing gradient problem!
Solution: Long Short-Term Memory (LSTM) networks.
[Biological Analogy] LSTMs are like immunological memory—they can retain information about early antigens even after many cell divisions, selectively keeping important signals while discarding noise.
LSTMs solve the forgetting problem with a clever mechanism: gates that control information flow.
Think of memory like a notepad where you can write new notes, erase outdated ones, and read off whatever is currently relevant.
Unlike basic RNNs that gradually forget everything, LSTMs actively choose what to remember and what to forget!
At each time step, an LSTM asks three questions:
1. What should I forget? (Forget gate)
2. What should I remember? (Input gate)
3. What should I output? (Output gate)
An LSTM has two types of memory:
Cell state (C): Long-term memory storage
↓
Hidden state (h): What's currently active/relevant
The cell state is like a conveyor belt that carries information forward, with gates deciding what gets added or removed along the way.
[Optional: The Math]
The three gates are computed as:
- Forget gate: f(t) = σ( Wf · [h(t-1), x(t)] + bf )
- Input gate: i(t) = σ( Wi · [h(t-1), x(t)] + bi )
- Output gate: o(t) = σ( Wo · [h(t-1), x(t)] + bo )
Cell state update: C(t) = f(t) ⊙ C(t-1) + i(t) ⊙ tanh( Wc · [h(t-1), x(t)] + bc )
Hidden state: h(t) = o(t) ⊙ tanh( C(t) )
where σ is the sigmoid function (outputs 0–1) and ⊙ is element-wise multiplication.
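A scalar version of these equations (hidden size 1, hand-set illustrative weights rather than trained ones) shows the mechanics, including how a forget gate near 1 preserves an early signal through later silent steps:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, C_prev, W):
    """One LSTM update following the gate equations above.

    W maps each gate name to (weight_h, weight_x, bias); scalars for clarity.
    """
    def pre(name):
        wh, wx, b = W[name]
        return wh * h_prev + wx * x + b

    f = sigmoid(pre("f"))                       # forget gate: what to erase
    i = sigmoid(pre("i"))                       # input gate: what to write
    o = sigmoid(pre("o"))                       # output gate: what to expose
    C = f * C_prev + i * math.tanh(pre("c"))    # cell state update
    h = o * math.tanh(C)                        # hidden state
    return h, C

# Hand-set illustrative weights (not trained); the forget bias of 1.0
# keeps f close to 1, so the cell state decays slowly.
W = {"f": (0.1, 0.2, 1.0), "i": (0.1, 0.5, 0.0),
     "o": (0.1, 0.5, 0.0), "c": (0.2, 0.4, 0.0)}

h, C = 0.0, 0.0
for x in [1.0, 0.0, 0.0, 0.0]:   # a single early signal, then silence
    h, C = lstm_step(x, h, C, W)

# The cell state still retains part of the early signal after three silent steps.
print(round(C, 3))
```

This slow decay of C is the conveyor-belt behavior described above; a basic RNN's tanh update would wash the signal out much faster.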
Imagine an LSTM processing a gene sequence to predict splice sites:
Position 1-50: Exon sequence
Memory: "I'm in an exon" (stored in cell state)
Output: "Not a splice site"
Position 100: Donor site (GT)
Forget gate: "Keep exon memory" (forget = 0.9, keep most of it)
Input gate: "Remember this GT!" (input = 1.0, store strongly)
Updated memory: "Was in exon, now saw donor site GT"
Output: "This is a donor splice site!"
Position 101-500: Intron sequence
Forget gate: "GT is still relevant, but exon info can fade"
Cell state: Maintains "GT donor site" memory across 400 nucleotides
Position 530: Acceptor site (AG)
Cell state still remembers: "Saw GT donor 430 nt ago"
Input gate: "Remember this AG!"
Output: "This AG pairs with the GT I saw earlier—it's an acceptor!"
The key insight: The cell state carried “GT donor” information across 430 nucleotides—something basic RNNs can’t do!
SpliceAI (Jaganathan et al., 2019) predicts splice sites using a deep residual convolutional network rather than an LSTM. It is useful to mention here because it solves the same biological problem LSTMs were designed to address: carrying sequence context across hundreds to thousands of nucleotides.
Task: Given a gene sequence, predict, for each position, whether it is a splice donor site, a splice acceptor site, or neither.
Challenge: Splice sites can depend on regulatory elements hundreds or thousands of nucleotides away.
Impact: Predicts how variants affect splicing—a mutation far from the canonical GT-AG splice dinucleotides can disrupt splicing by altering regulatory sequence.
For many biological sequences, context matters in both directions:
← Look backward
A T C G [?] T A G C
→ Look forward
Bidirectional LSTM: Process the sequence twice, once left-to-right and once right-to-left, then combine the two hidden states at each position.
Example: Protein secondary structure prediction needs context from both amino acids before and after each position.
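A toy sketch of the two-pass idea, reusing a simple tanh update with hand-set, purely illustrative weights:

```python
import math

def step(x, h, Wx=0.3, Wh=0.5, b=-1.0):
    """A toy recurrent update with fixed illustrative weights."""
    return math.tanh(Wx * x + Wh * h + b)

def bidirectional_states(xs):
    """Run the update left-to-right and right-to-left, then pair the
    hidden states so each position sees context from both directions."""
    fwd, h = [], 0.0
    for x in xs:
        h = step(x, h)
        fwd.append(h)
    bwd, h = [], 0.0
    for x in reversed(xs):
        h = step(x, h)
        bwd.append(h)
    bwd.reverse()
    return list(zip(fwd, bwd))

states = bidirectional_states([2.3, 4.1, 7.8, 6.2])
print(states[1])  # position 1 now carries history from before AND after it
```

A real bidirectional LSTM does the same thing with LSTM cells and learned weights, then feeds the concatenated pair into a prediction layer.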
✅ Good for: ordered data where history matters — time courses, cell state trajectories, sequences with moderate-range dependencies
❌ Not ideal for: very long dependencies (hundreds of steps or more), or settings where slow sequential training is a bottleneck
[Biological Analogy] The Transformer attention mechanism is like a transcription factor that can “attend” to all accessible chromatin regions simultaneously, not just nearby ones—it can detect a distant enhancer at 50 kb as easily as one at 500 bp.
Even LSTMs have limitations:
1. Sequential processing: Must process step 1 before step 2 before step 3…
2. Still limited range: While better than RNNs, LSTMs struggle with dependencies 500+ steps apart
3. No direct “importance” mechanism: LSTM learns what’s important implicitly, but can’t explicitly say “position 500 is crucial for position 800”
Transformers solve all three problems with a revolutionary idea: attention.
Instead of reading a sequence step-by-step, look at all positions at once and learn which ones are important for each other.
Analogy: You're writing a research paper. Instead of reading all 50 references sequentially, you scan all of them at once, judge how relevant each one is to the sentence you're writing, and draw most heavily on the relevant few.
For genomics: When processing position 800 (a promoter), the Transformer can directly look at position 500 (an enhancer) without reading through positions 501-799!
The key insight: Each position asks “Who should I pay attention to?” and gets answers from all other positions.
Three simple steps:
1. Each position announces what it has:
Position 500: "I'm an enhancer with GATA binding site"
Position 800: "I'm a promoter for gene X"
2. Each position asks what it needs:
Position 800 asks: "Where are my enhancers?"
Position 500 responds: "I'm an enhancer!" (high relevance)
Position 300 responds: "I'm in a coding region" (low relevance)
3. Each position gets information from relevant positions:
Position 800: Pays 70% attention to position 500 (enhancer)
Pays 5% attention to position 300 (coding)
Pays 25% to other positions
The beautiful part: The Transformer learns which positions are relevant through training!
[Optional: The Math]
For each position, three vectors are computed: Query (Q), Key (K), Value (V).
Attention score: score(Q, K) = QKᵀ / √d
Attention weights: α = softmax(score)
Output: z = α · V
The Query represents “what I’m looking for,” the Key represents “what I have to offer,” and the Value is the actual information passed. The division by √d prevents very large dot products.
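The three equations above can be run on a toy example: one query (say, a "promoter" position) attending over three positions with 2-dimensional, hand-set Q/K/V vectors (not learned weights):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one query over all positions."""
    d = len(Q)
    scores = [sum(q * k for q, k in zip(Q, key)) / math.sqrt(d)  # QKᵀ/√d
              for key in K]
    weights = softmax(scores)                       # α = softmax(scores)
    out = [sum(w * v[i] for w, v in zip(weights, V))  # z = α · V
           for i in range(len(V[0]))]
    return weights, out

# Toy setup: the first key (an "enhancer"-like position) points in the same
# direction as the query, so it should win most of the attention.
Q = [1.0, 0.0]
K = [[3.0, 0.0],   # aligned with the query
     [0.0, 3.0],   # orthogonal: irrelevant
     [0.5, 0.5]]   # weakly related
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights, z = attention(Q, K, V)
print([round(w, 2) for w in weights])  # highest weight on the first position
```

Note the weights sum to 1: attention distributes a fixed budget of relevance across all positions, regardless of distance.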
When you train a Transformer, you can visualize attention weights to see what the model learned.
Example reading: “When processing the promoter at position 800, the model shows high attention weight (~1.0) at position 500, indicating this position is highly relevant.”
In biology, this might reveal: Position 500 is an enhancer that regulates the gene at position 800—discovered by the model without being told about such relationships.
Different biological relationships exist simultaneously: enhancer-promoter contacts, transcription factor co-binding, splice-site pairing, chromatin structure. A single attention pattern cannot capture them all.
Solution: Use multiple “attention heads,” each learning different relationships:
Head 1: Learns enhancer-promoter pairs
Head 2: Learns transcription factor sites
Head 3: Learns splice site pairs
Head 4: Learns chromatin structure
...
It’s like having multiple experts, each specializing in different types of genomic relationships!
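Multiple heads can be sketched as separate attention computations with different queries; in a real Transformer each head has its own learned Q/K/V projections, but hand-set queries (illustrative only) show the specialization idea:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

def head(query, keys):
    """One attention head: a weight distribution over positions for one query."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

keys = [[3.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

# Two heads with different queries concentrate on different positions,
# like experts specializing in different relationship types.
head1 = head([1.0, 0.0], keys)   # focuses on position 0
head2 = head([0.0, 1.0], keys)   # focuses on position 1
print([round(w, 2) for w in head1], [round(w, 2) for w in head2])
```

The heads' outputs are then concatenated and mixed, so the model can use several relationship types at once for the same position.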
1. Parallel processing: Look at all positions simultaneously → fast training
2. Long-range dependencies: Position 1 can directly attend to position 1000 → no distance limit
3. Interpretability: Visualize attention weights → see what the model learned
Connection to Chapters 2-3: Remember Bayesian inference from Chapter 2? Attention is like computing “how much does this piece of evidence matter for my prediction?” The Transformer learns these relevance weights automatically!
Enformer uses Transformers to predict gene expression from DNA sequence (Avsec et al. 2021, Nature Methods):
Task: Given ~200 kb of DNA sequence, predict thousands of genomic tracks: gene expression (CAGE), TF binding, chromatin accessibility, and histone marks, binned along the sequence.
Architecture: convolutional layers first summarize local sequence into coarse bins; Transformer blocks then let every bin attend across the full input window.
Key discovery: The model used sequence information from distal enhancers tens of kilobases away and prioritized enhancer-gene links that agreed with CRISPRi and chromatin evidence. This suggests that Transformer-style long-range integration can recover some regulatory relationships from sequence, but attention patterns should still be treated as hypotheses rather than direct measurements of 3D contacts.
Why it matters: When you find a disease variant, Enformer can predict whether it affects a distant gene’s expression, even without doing experiments.
✅ Good for: long-range dependencies (enhancer-promoter pairs, residue contacts), large training datasets, and tasks where attention maps aid interpretation
❌ Not ideal for: small datasets (many parameters, data-hungry) or very long sequences where the quadratic cost of all-pairs attention becomes prohibitive
Ask yourself these questions:
1. What structure does my data have?
2. How long are the sequences/dependencies?
3. How much data do you have?
4. Do you need interpretability?
| Your Task | Recommended Architecture | Why |
|---|---|---|
| Find TF binding motifs | CNN | Local patterns, position-independent |
| Predict splice sites | Bidirectional LSTM or Transformer | Need context from both directions |
| Classify cell types from expression | Fully-connected | No spatial/temporal structure |
| Predict next time point in trajectory | LSTM | Sequential dependencies |
| Find enhancer-promoter pairs | Transformer | Long-range dependencies (10-100kb) |
| Segment cells in microscopy | CNN (U-Net) | Spatial image patterns |
| Predict protein structure contacts | Transformer | Long-range residue interactions |
Real-world problems often combine architectures:
CNN + Transformer (like Enformer):
CNN extracts local features (motifs)
↓
Transformer captures long-range interactions (regulatory elements)
↓
Final prediction
Why hybrid? CNNs are efficient for local patterns, Transformers excel at long-range. Use each for what it’s good at!
Start simple, then add complexity: begin with the smallest architecture that matches your data's structure, and add layers or attention only if validation performance demands it.
Don't over-engineer: a small CNN trained on limited data often beats a large Transformer that overfits.
1. Different architectures for different data structures: CNNs for local spatial patterns, RNNs/LSTMs for sequential dependencies, Transformers for long-range interactions.
2. Why these architectures matter: each encodes an assumption (weight sharing, memory, attention) that matches how biological information is organized.
3. Trade-offs: parameter-efficient but local (CNN), memory-equipped but slow to train (LSTM), unlimited range but data-hungry (Transformer).
4. Practical guidance: match the architecture to your data's structure, dependency length, dataset size, and interpretability needs; start simple.
5. Connection to foundations: all three are trained with the gradient descent and backpropagation of Chapter 3, and attention echoes the Bayesian weighting of evidence from Chapter 2.
| Term | Definition |
|---|---|
| Convolutional Neural Network (CNN) | Architecture using sliding filters to detect local patterns |
| Filter/Kernel | Small matrix of weights that slides across input |
| Feature map | Output showing where patterns were detected |
| Pooling | Summarizing/reducing spatial dimensions |
| Recurrent Neural Network (RNN) | Architecture processing sequences step-by-step |
| Hidden state | RNN’s memory carried forward across time steps |
| Vanishing gradient | Problem where gradients become too small to enable learning |
| LSTM | RNN variant with gates to control memory |
| Gates | Mechanisms controlling information flow (forget, input, output) |
| Transformer | Architecture using attention for long-range dependencies |
| Attention | Mechanism for weighting importance of different positions |
| Multi-head attention | Multiple attention mechanisms learning different relationships |
| Bidirectional | Processing sequence in both forward and backward directions |
| Receptive field | Range of input that affects a particular output |
Architecture intuition: You need to predict whether a 500bp DNA sequence is an active enhancer. The key features are: (1) presence of TF binding motifs (8-15bp each) and (2) specific combinations of motifs within 100bp windows. Would you use CNN, LSTM, or Transformer? Why?
CNNs and weight sharing: Explain why using the same filter at every position (weight sharing) is useful for finding motifs. What would happen if each position had its own unique filter?
Memory and biology: LSTMs can “remember” information across long sequences. Give two biological examples where this long-term memory would be crucial for making accurate predictions.
Attention visualization: If you visualize attention weights from a Transformer trained on gene regulation, and you see strong attention between position 1000 and position 50000, what biological relationship might this reveal? How would you experimentally validate this?
Comparing approaches: You have three models for splice site prediction, say a small CNN, a bidirectional LSTM, and a large Transformer:
Which would you choose if you have: (a) limited data, (b) limited time, (c) need best accuracy regardless of cost?
Hybrid reasoning: Enformer uses CNNs followed by Transformers. Why not just use Transformers from the start? What advantage does the CNN provide?
Bidirectional context: When would you want bidirectional processing instead of forward-only? Give a specific biological example where information from both directions matters.
Architecture limitations: What problems would arise if you tried to use a CNN to predict gene expression 24 hours from now, given measurements from hours 0, 6, 12, 18? What about using an LSTM?
CNNs for Genomics:
Zhou & Troyanskaya (2015) “Predicting effects of noncoding variants with deep learning-based sequence model.” Nature Methods 12:931-934.
Jaganathan et al. (2019) “Predicting splicing from primary sequence with deep learning.” Cell 176:535-548.
Transformers:
Vaswani et al. (2017) “Attention is all you need.” NeurIPS 2017.
Avsec et al. (2021) “Effective gene expression prediction from sequence by integrating long-range interactions.” Nature Methods 18:1196-1203.
Eraslan et al. (2019) “Deep learning: new computational modelling techniques for genomics.” Nature Reviews Genetics 20:389-403.
Zou et al. (2019) “A primer on deep learning in genomics.” Nature Genetics 51:12-18.
The Illustrated Transformer by Jay Alammar
Distill.pub: Attention and Augmented RNNs
You now understand the three major neural network architectures and when to use each!
You’ve learned: how CNNs exploit weight sharing to find local motifs, how RNNs and LSTMs carry memory through sequences, and how Transformer attention connects distant positions directly.
Before moving on, make sure you can: explain weight sharing, the vanishing gradient problem, LSTM gating, and attention, and choose an architecture for a new biological dataset.
Self-check: for each task in the decision table above, can you justify the recommended architecture in one sentence?
👉 Continue to Chapter 5: Genetic Variation and Genomic Technologies
“The future of biology is at the intersection of experiments and computation. Neither alone is sufficient.”