Imagine unrolling a 1,000-nucleotide stretch of DNA across your desk — a long paper ribbon of A’s, T’s, G’s, and C’s stretching from one end to the other. You pick up a magnifying glass, a small one, about 8 nucleotides wide, and start sliding it from left to right across the ribbon. At position 14, you notice a pattern: G-A-T-A followed by an A-rich stretch — a GATA motif, a known binding site for a family of transcription factors. You slide on. At position 83, another pattern emerges: a GC-rich cluster with a specific internal spacing. You note each observation on a sticky note and pin it to the wall. By the time the magnifying glass has traveled the full length of the ribbon, the wall is covered: dozens of motif observations, each recording what the glass saw at each position along the way.
Congratulations — you just performed a convolution. A small, fixed-width window, sliding systematically from one end of a sequence to the other, recording local pattern matches at every position. The sticky notes on your wall are a feature map: a compact summary of what each position in the sequence looks like through that particular lens.
A Convolutional Neural Network does exactly this, with two differences of scale and autonomy. First, it slides not one magnifying glass but hundreds — each tuned to detect a different pattern, producing hundreds of feature maps simultaneously. Second, and more importantly, it does not arrive with those lenses pre-programmed. It learns which patterns to look for by examining millions of sequences with known regulatory activity, adjusting its filters until the patterns that predict function come into focus. The convolution operation itself is ancient mathematics; the novelty is that the filters are not designed by a biologist but discovered by the network from data.
This chapter opens the hood on that process. We will work through how convolutions are applied to DNA sequences, how filters learn to recognize motifs, how stacked layers build from local patterns to global function, and how models like DeepSEA and Basset put these ideas into practice for predicting regulatory activity across the human genome.
The noncoding genome represents about 98% of human DNA, yet we understand far less about it than we do about protein-coding regions. This vast regulatory landscape includes:
Each regulatory element works in a cell-type-specific manner. An enhancer active in neurons might be completely inactive in muscle cells. The same DNA sequence can have different functions depending on:
Testing regulatory activity experimentally requires techniques like:
Why we need computational approaches:
Scale: Testing every possible single nucleotide variant in every regulatory element would require ~400 billion experiments (1 million elements × 400 bases per element × 1,000 cell types)
Cost: Even with high-throughput methods, comprehensive experimental mapping costs tens of millions of dollars per cell type
Speed: Experimental characterization takes months to years; predictions take seconds
Personalization: Each individual has ~4-5 million variants. We need to predict which variants affect regulation in their specific genome
Hypothesis generation: Computational predictions help prioritize which experiments to actually perform in the lab
The challenge is to build models that can learn the “regulatory code”—the rules that determine which DNA sequences function as enhancers, promoters, or silencers in which cell types. This is where convolutional neural networks excel: they can learn to recognize sequence patterns (motifs) and their combinations that determine regulatory function.
By the end of this chapter, you will be able to:
Before diving into specific models, let’s understand why convolutional neural networks are particularly well-suited for analyzing DNA sequences.
DNA regulatory elements work through transcription factor binding. Transcription factors are proteins that recognize and bind to specific DNA sequences called motifs. A typical motif is 6-20 base pairs long and has some flexibility—for example, the binding site for the transcription factor CTCF looks roughly like: CCGCGNGGNGGCAG (where N means any nucleotide).
Think of a CNN filter as a molecular scanner—much like a restriction enzyme that slides along the DNA helix looking for its specific recognition sequence before cutting. A CNN filter “slides” along the encoded DNA looking for patterns it has learned to recognize during training.
The regulatory activity of a DNA sequence depends on:
This is fundamentally a pattern recognition problem—and CNNs are excellent at recognizing patterns in sequential data.
Recall from Chapter 3 that convolutional neural networks use filters that slide across input data, detecting local patterns. For DNA sequences:
DNA is encoded as a one-hot matrix: Each nucleotide becomes a 4-dimensional vector:
Convolutional filters learn motifs: A filter of width 12 can learn to recognize a 12-bp motif. The first layer learns individual motifs (like transcription factor binding sites).
Deeper layers learn motif combinations: Subsequent layers learn how motifs are arranged relative to each other—which combinations create strong enhancers versus weak ones.
Pooling captures position flexibility: Max pooling allows the network to recognize that a motif is present without caring about its exact position within a window.
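The one-hot encoding step above can be sketched in a few lines of plain Python. This is an illustrative helper, not code from any particular model; the row order A, C, G, T is a convention we assume here.

```python
# One-hot encode a DNA sequence into a 4 x L matrix (rows: A, C, G, T).
# Illustrative sketch; row ordering is an assumed convention.

NUCLEOTIDES = "ACGT"

def one_hot_encode(seq):
    """Return a 4 x len(seq) matrix of 0/1 values; ambiguous bases (N) stay all-zero."""
    seq = seq.upper()
    matrix = [[0] * len(seq) for _ in NUCLEOTIDES]
    for pos, base in enumerate(seq):
        row = NUCLEOTIDES.find(base)
        if row != -1:                 # 'N' and other ambiguity codes become all zeros
            matrix[row][pos] = 1
    return matrix

encoded = one_hot_encode("GATAN")
# Column for 'G' reads [0, 0, 1, 0] top to bottom; the 'N' column is all zeros.
```

A filter of width 12 then operates on a 4 × 12 slice of this matrix at each position, which is why first-layer filters map so directly onto position weight matrices for motifs.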
Compared to traditional motif scanning (e.g., scanning JASPAR motifs with FIMO):
Compared to traditional machine learning (like SVMs with k-mer features):
When we visualize the filters in trained CNNs, we find that:
First convolutional layer: Learns individual transcription factor motifs
Second convolutional layer: Learns motif pairs and spacing constraints
Deeper layers: Learn cell-type-specific regulatory logic
This hierarchical learning mirrors the biological reality: individual transcription factors bind to motifs, combinations of transcription factors work together to regulate genes, and the cell-type-specific combination of available transcription factors determines which enhancers are active.
DeepSEA (Deep learning-based Sequence Analyzer) was one of the first successful applications of CNNs to regulatory genomics. Published in 2015 by Jian Zhou and Olga Troyanskaya, it demonstrated that DNA sequence alone could predict chromatin features with remarkable accuracy.
Think of DeepSEA as a well-trained physician who learns to recognize disease patterns by studying thousands of past cases: by training on thousands of ENCODE ChIP-seq and related experiments, DeepSEA learned to recognize the sequence patterns that predict regulatory activity.
DeepSEA was trained on data from the ENCODE Project and Roadmap Epigenomics Consortium, which includes:
The training set consisted of:
DeepSEA uses a relatively simple architecture:
Input: 1000 bp DNA sequence (4 × 1000 matrix after one-hot encoding)
↓
Conv Layer 1: 320 filters, width 8
↓ ReLU activation
↓ Max pooling (width 4)
↓
Conv Layer 2: 480 filters, width 8
↓ ReLU activation
↓ Max pooling (width 4)
↓
Conv Layer 3: 960 filters, width 8
↓ ReLU activation
↓
Fully Connected Layer: 925 units
↓ ReLU activation
↓ Dropout (50%)
↓
Output Layer: 919 units (one per chromatin feature)
↓ Sigmoid activation (multi-label classification)
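As a sanity check on the diagram above, we can trace how the 1,000-bp input shrinks layer by layer. This sketch assumes 'valid' (no-padding) convolutions and non-overlapping max pooling; under those assumptions, 53 positions × 960 filters is what feeds the fully connected layer.

```python
# Trace how the 1000-bp input shrinks through DeepSEA-style layers,
# assuming 'valid' convolutions (no padding) and non-overlapping max pooling.

def conv_len(length, filter_width):
    """Output length of a valid (no-padding) convolution."""
    return length - filter_width + 1

def pool_len(length, pool_width):
    """Output length of non-overlapping max pooling."""
    return length // pool_width

length = 1000
length = pool_len(conv_len(length, 8), 4)   # Conv1 (320 filters) + pool -> 248
length = pool_len(conv_len(length, 8), 4)   # Conv2 (480 filters) + pool -> 60
length = conv_len(length, 8)                # Conv3 (960 filters)        -> 53

print(length)   # 53 positions x 960 filters flatten into the fully connected layer
```

Working through these shapes by hand is a good habit: most architecture bugs in sequence models are off-by-one or padding mistakes that show up immediately in this arithmetic.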
Key design choices:
Three convolutional layers: Captures patterns at multiple scales—individual motifs (layer 1), motif pairs (layer 2), and larger regulatory modules (layer 3)
Increasing filter numbers: More filters in deeper layers allow learning more complex patterns
Small filter width (8 bp): Matches typical transcription factor motif lengths
Multi-task learning: Predicting 919 features simultaneously helps the network learn shared regulatory logic
Sigmoid output: Each chromatin feature is predicted independently (probability from 0 to 1)
DeepSEA achieved impressive accuracy:
Why some features are easier than others:
DeepSEA’s most powerful application is predicting how genetic variants affect regulatory elements. The approach is called in silico mutagenesis:
Δ score = Alternative prediction − Reference prediction

For example, consider a variant in an enhancer active in liver cells:
DeepSEA computes this difference for all 919 chromatin features, giving a comprehensive view of how a variant affects regulation across cell types.
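The in silico mutagenesis loop itself is simple, whatever model sits behind it. The sketch below is a minimal illustration: `predict` stands in for a trained model's scoring function, and the toy predictor (counting GATA occurrences) is a hypothetical placeholder, not DeepSEA's actual output.

```python
# In silico mutagenesis sketch: substitute each alternative base at one position
# and score the change with a predict() function (here, a toy stand-in).

BASES = "ACGT"

def mutate(seq, pos, base):
    """Return the sequence with position `pos` replaced by `base`."""
    return seq[:pos] + base + seq[pos + 1:]

def variant_effects(seq, pos, predict):
    """Delta score (alt minus ref) for each alternative allele at `pos`."""
    ref_score = predict(seq)
    return {b: predict(mutate(seq, pos, b)) - ref_score
            for b in BASES if b != seq[pos]}

# Toy predictor: counts GATA occurrences, standing in for a trained CNN.
toy_predict = lambda s: float(s.count("GATA"))
variant_effects("AAGATAAA", 2, toy_predict)   # {'A': -1.0, 'C': -1.0, 'T': -1.0}
```

With a real model, `predict` would return one score per chromatin feature, so the delta becomes a 919-element vector per alternative allele rather than a single number.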
Strong predictions (|Δ score| > 0.5):
Moderate predictions (|Δ score| = 0.2-0.5):
Weak predictions (|Δ score| < 0.2):
Cell-type specificity: If a variant shows large Δ scores for enhancer marks (H3K27ac) specifically in brain cell types but not liver cell types, this suggests:
While DeepSEA predicts chromatin features, Basenji (developed by David Kelley and colleagues at Calico, published in 2018) goes a step further: it predicts actual gene expression levels from DNA sequence.
Knowing that a sequence is an enhancer (from DeepSEA) is valuable, but biologists often want to know:
Basenji addresses these questions by directly predicting genome-wide gene expression and chromatin accessibility from sequence.
Basenji was trained on much larger genomic windows than DeepSEA:
Why such long sequences?
Original Basenji (2018):
Input: 131,072 bp sequence
↓
8 Convolutional layers (with batch norm and pooling)
↓
2 Dilated convolutional layers (expands receptive field)
↓
Dense output layer
↓
Output: 4229 tracks, each 1024 bins (128 bp per bin)
Basenji2 (2020 update):
Unlike DeepSEA, which makes a single prediction per sequence, Basenji makes spatially-resolved predictions:
For gene expression:
Basenji’s variant effect prediction is more informative than DeepSEA’s:
Sequence-to-expression changes: Shows exactly where in the 131 kb window the effect occurs and can identify which gene’s expression changes
Quantitative expression changes: Predicts fold-change in expression (2× increase, 50% decrease, etc.), which is more directly interpretable than chromatin feature changes
Cell-type-specific effects: Can show that a variant increases expression 3-fold in heart but decreases it 50% in liver
Example interpretation:
Imagine a variant 50 kb upstream of the gene APOE:
This is much more directly interpretable than “0.5 change in H3K27ac prediction”—we can immediately see this variant might affect neuronal function by reducing APOE expression.
Basenji achieved:
Limitations:
Both models use CNNs to predict regulatory function from sequence, but they differ in important ways:
| Aspect | DeepSEA | Basenji |
|---|---|---|
| Input size | 1,000 bp | 131,072 bp |
| Primary output | Chromatin features | Gene expression + chromatin |
| Spatial resolution | Single value per feature | Track across sequence (128 bp bins) |
| Number of tracks | 919 | 4000+ |
| Computational cost | Low (fast predictions) | High (slower, more memory) |
| Best use case | Variant prioritization, TF binding | Expression changes, long-range effects |
| Can identify target gene | No (too short) | Yes (if within 131 kb) |
When to use each:
Use DeepSEA when:
Use Basenji when:
Use both when:
[Optional: The Math]
Variant Effect Score
The basic formula for variant effect prediction:
$$\Delta S_f = S_f(\text{alt}) - S_f(\text{ref})$$
Where $\Delta S_f$ = change in score for feature $f$, $S_f(\text{alt})$ = prediction for alternative allele, $S_f(\text{ref})$ = prediction for reference allele.
Example Calculation
Reference sequence (G at position 500):
- H3K27ac in hepatocytes: 0.82
- H3K4me1 in hepatocytes: 0.91
- DNase in hepatocytes: 0.88
Alternative sequence (A at position 500):
- H3K27ac in hepatocytes: 0.23
- H3K4me1 in hepatocytes: 0.85
- DNase in hepatocytes: 0.31
Variant effects:
- $\Delta S_{\text{H3K27ac}} = 0.23 - 0.82 = -0.59$ ← Strong negative effect
- $\Delta S_{\text{H3K4me1}} = 0.85 - 0.91 = -0.06$ ← Minimal effect
- $\Delta S_{\text{DNase}} = 0.31 - 0.88 = -0.57$ ← Strong negative effect
Biological interpretation: The variant likely disrupts an active enhancer (H3K27ac down), while the enhancer remains in a poised state (H3K4me1 unchanged), and chromatin becomes less accessible (DNase down).
For overall impact, we can also aggregate across features: $$\text{Impact Score} = \sum_{f \in \text{relevant features}} |\Delta S_f| \times w_f$$ where $w_f$ is a weight for feature importance (e.g., weighting H3K27ac heavily for enhancers).
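The worked example above can be reproduced in a few lines. The per-feature weights below are illustrative choices (H3K27ac weighted heavily for enhancers), not values from any published model.

```python
# Reproduce the worked example: per-feature delta scores and a weighted impact score.
# The feature weights are illustrative, not from any published model.

def delta_scores(ref, alt):
    """Per-feature change: alternative prediction minus reference prediction."""
    return {f: round(alt[f] - ref[f], 2) for f in ref}

ref = {"H3K27ac": 0.82, "H3K4me1": 0.91, "DNase": 0.88}
alt = {"H3K27ac": 0.23, "H3K4me1": 0.85, "DNase": 0.31}

deltas = delta_scores(ref, alt)
# {'H3K27ac': -0.59, 'H3K4me1': -0.06, 'DNase': -0.57}

# Aggregate |delta| with per-feature weights to get a single impact score.
weights = {"H3K27ac": 2.0, "H3K4me1": 1.0, "DNase": 1.0}
impact = sum(abs(deltas[f]) * weights[f] for f in deltas)
```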
The success of DeepSEA and Basenji inspired many architectural improvements.
Problem: Standard convolutions have limited receptive fields. To see 1000 bp, you’d need many layers of 3-bp filters.
Solution: Dilated convolutions (also called atrous convolutions) insert gaps between filter positions.
- A standard 3-bp filter looks at positions [i, i+1, i+2]
- A dilated filter with dilation=2 looks at [i, i+2, i+4]
- With dilation=4: [i, i+4, i+8]
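Two small helpers make the geometry concrete. The first lists which positions a filter reads at a given dilation; the second shows why stacking layers with doubling dilation rates (1, 2, 4, ...) grows the receptive field exponentially — about nine such width-3 layers already cover ~1,000 bp.

```python
# Which input positions does a width-3 filter anchored at i read?
def filter_positions(i, width=3, dilation=1):
    return [i + k * dilation for k in range(width)]

filter_positions(0, dilation=1)   # [0, 1, 2]  standard convolution
filter_positions(0, dilation=2)   # [0, 2, 4]
filter_positions(0, dilation=4)   # [0, 4, 8]

# Receptive field after n layers of width-3 filters whose dilation doubles
# each layer (1, 2, 4, ...): each layer adds (width - 1) * dilation positions.
def receptive_field(n_layers, width=3):
    rf, dilation = 1, 1
    for _ in range(n_layers):
        rf += (width - 1) * dilation
        dilation *= 2
    return rf

receptive_field(9)   # 1023 positions -- roughly 1 kb from nine cheap layers
```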
Advantages:
Problem: Very deep networks are hard to train (vanishing gradients)
Solution: Residual connections (from ResNet) allow gradients to flow directly through the network.
x → Conv → ReLU → Conv → Add → ReLU → ...
│                         ↑
└──── skip connection ────┘
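The idea reduces to output = F(x) + x, where F is the learned transform (the Conv → ReLU → Conv path). A toy numeric sketch, with F as a stand-in function rather than real convolution layers:

```python
# A residual block in miniature: the transform's output is added back to its
# input, so even if the learned transform F is near zero, information (and
# gradients during training) pass through unchanged.

def relu(v):
    return [max(0.0, x) for x in v]

def residual_block(x, transform):
    fx = transform(x)                               # the learned part, F(x)
    return relu([a + b for a, b in zip(fx, x)])     # Add, then ReLU

# With a transform that outputs all zeros, the block still passes x through:
zero_transform = lambda x: [0.0] * len(x)
residual_block([1.0, 2.0, 3.0], zero_transform)     # -> [1.0, 2.0, 3.0]
```

This is why residual networks can be made very deep: a layer that has learned nothing useful defaults to the identity instead of corrupting the signal.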
Advantages:
Problem: Regulatory patterns exist at multiple scales (6 bp motifs, 50 bp motif pairs, 500 bp regulatory modules)
Solution: Use multiple parallel convolutional paths with different filter sizes, then combine.
Input ─→ Conv(width=4)  ─→ Concat → ...
     ├─→ Conv(width=8)  ─┤
     └─→ Conv(width=16) ─┘
Problem: Not all parts of a sequence are equally important
Solution: Attention mechanisms learn to focus on important regions. In a genomics context, attention might focus on transcription factor binding sites while ignoring repetitive elements. (We explore this fully in Chapter 10.)
Let’s walk through an illustrative example of using CNN models for variant interpretation. The biology is inspired by real enhancer-variant studies around hematopoietic genes, but the specific coordinate and quantitative values below are a teaching scenario unless a primary citation is added.
TAL1 (T-cell acute lymphocytic leukemia 1) is a transcription factor critical for blood cell development. A +51 kb enhancer upstream of TAL1 is active specifically in erythroid cells (red blood cell precursors).
Suppose researchers identify a single nucleotide variant in this enhancer:
The researchers performed reporter assays:
This took approximately 6 months of lab work and cost ~$15,000.
Running the variant through DeepSEA (this takes ~30 seconds):
Top predicted effects:
(K562 is an erythroid cell line commonly used in ENCODE)
Interpretation:
Accuracy: DeepSEA’s predictions matched the experimental results closely. The model predicted the variant would disrupt GATA binding and reduce enhancer activity—exactly what was observed.
Running through Basenji2 (takes ~2 minutes on GPU):
Predicted expression changes for TAL1:
Interpretation:
When models disagree with experiments:
While CNN-based models have been remarkably successful, they have important limitations.
1. Trans-acting factors: Models only see DNA sequence. They can’t predict whether a transcription factor is expressed in a given cell, whether it is post-translationally modified, or whether signaling pathways have activated or repressed it.
2. Long-range interactions: Even Basenji’s 131 kb window can’t capture enhancers located 500 kb or more from target genes, inter-chromosomal interactions, or topologically associating domain (TAD) structures.
3. DNA methylation: Most models don’t include information about CpG island methylation status, parent-of-origin effects (imprinting), or age-related methylation changes.
4. Chromatin accessibility context: Models predict from sequence alone, but real regulatory activity depends on whether chromatin is already open in that cell type and nucleosome positioning.
5. Environmental and developmental context: Models can’t predict how regulation changes with hormonal signals, stress responses, developmental timing, or disease states.
Be skeptical of predictions when:
CNNs are natural for DNA sequence analysis because they detect local patterns (motifs) and learn hierarchical representations (motifs → motif combinations → regulatory logic)
DeepSEA predicts chromatin features from 1000 bp sequences, including transcription factor binding and histone modifications across 919 features and multiple cell types
Basenji predicts gene expression from 131 kb sequences, enabling quantitative predictions of expression changes and identification of affected genes
Variant effect prediction works through in silico mutagenesis: comparing model predictions between reference and alternative alleles to calculate Δ scores
Architecture innovations like dilated convolutions, residual connections, and attention mechanisms have improved model performance and interpretability
CNN models have limitations: they only see sequence, miss long-range interactions, depend on training data coverage, and can’t capture trans-acting factors or environmental context
Real-world success: Models like DeepSEA and Basenji have successfully predicted functional variants, guided experiments, and accelerated regulatory genomics research
| Term | Definition |
|---|---|
| Attention mechanism | Neural network component that learns to weight different parts of input differently, focusing on important regions |
| Basenji | CNN-based model that predicts gene expression and chromatin accessibility from DNA sequences up to 131 kb long |
| CAGE (Cap Analysis of Gene Expression) | Technique that identifies transcription start sites and measures expression levels |
| Chromatin feature | Experimentally measurable property of chromatin, such as DNase sensitivity, histone modifications, or transcription factor binding |
| Convolutional filter | Learnable pattern detector that slides across DNA sequence, typically learning transcription factor motifs |
| DeepSEA | CNN-based model that predicts 919 chromatin features from 1000 bp DNA sequences |
| Dilated convolution | Convolutional operation with gaps between filter positions, allowing larger receptive fields without additional parameters |
| Enhancer | Regulatory DNA sequence that increases gene expression, often located far from target genes and active in specific cell types |
| In silico mutagenesis | Computational technique for predicting variant effects by comparing model predictions between reference and alternative sequences |
| Multi-task learning | Training approach where model predicts multiple related outputs simultaneously, learning shared representations |
| One-hot encoding | Representation of DNA where each nucleotide becomes a 4-dimensional binary vector (A, C, G, T) |
| Receptive field | Region of input sequence that influences a particular output, determined by filter sizes and number of layers |
| Regulatory element | DNA sequence that controls gene expression without coding for protein, including enhancers, promoters, silencers, and insulators |
| Residual connection | Skip connection that adds layer input to output, enabling training of very deep networks |
| Variant effect prediction | Computational estimation of how a genetic variant affects regulatory function or gene expression |
Why are convolutional neural networks particularly well-suited for analyzing DNA sequences, compared to fully connected networks? What properties of CNN architecture match the biological properties of regulatory elements?
DeepSEA was trained on 919 chromatin features from multiple cell types. Explain how multi-task learning (predicting all features simultaneously) helps the model learn better representations than training 919 separate models.
A researcher finds that DeepSEA predicts a variant disrupts GATA1 binding (Δ score = -0.82) in K562 cells but has minimal effect on GATA1 binding in HepG2 cells (Δ score = -0.05). What might explain this cell-type-specific prediction? What additional information would help interpret this result?
Basenji can predict gene expression from sequence, but it cannot predict how expression changes when a signaling pathway is activated (e.g., when cells receive a hormone signal). Explain what information Basenji is missing and why this limitation exists.
Compare the advantages and disadvantages of DeepSEA’s 1 kb input window versus Basenji’s 131 kb window. For what types of biological questions would each be more appropriate?
A variant shows a strong DeepSEA prediction (disrupts H3K27ac, Δ score = -0.67) but weak Basenji prediction (minimal expression change). Propose three biological explanations for why these predictions might disagree.
Imagine you’re studying a rare neurodevelopmental condition and have identified 200 noncoding variants in affected individuals. Design a computational pipeline using DeepSEA and Basenji to prioritize which variants to validate experimentally. What criteria would you use?
CNN models can learn transcription factor motifs from data without being given a motif database. How would you validate that a learned filter actually represents a biologically meaningful motif? What experiments or analyses would you perform?