Something puzzling is happening in the β-globin locus. Regulatory sequences such as the locus control region can sit tens of kilobases from the globin genes they control. A variant in one of these noncoding elements may alter gene expression even though no protein-coding exon is changed. The variant is whispering instructions across a long genomic distance, and the gene on the other end is listening.
How? The convolutional networks we’ve studied so far struggle with that question. Their effective receptive field — the genomic window they can use for one prediction — can be much smaller than the distances over which enhancers, silencers, and insulators act. CNN-based tools like DeepSEA faithfully report the local chromatin state at the variant site, but they have limited ability to connect that site to a distant target gene.
Many enhancer-promoter interactions span tens to hundreds of kilobases, and some stretch beyond 1 million bases. These long-range interactions are not edge cases — they are fundamental to how gene regulation works, especially in development and tissue-specific contexts. A model that cannot see across these distances will always be missing part of the story.
The answer involves an architectural innovation borrowed from machine translation: a mechanism called attention that lets a model look at every position in a sequence simultaneously and learn which distant positions talk to each other. This is exactly the problem transformers were designed to solve — not in genomics initially, but in natural language processing. Welcome to transformers.
The genome isn’t a simple linear instruction manual where each gene operates independently. Instead, it’s a complex three-dimensional structure where regulatory elements communicate across vast genomic distances:
Chromatin Looping in Gene Regulation:
Think of it this way: a transcription factor doesn’t just check its immediate neighborhood—it scans the entire accessible genome to find its binding partners, wherever they are. Enhancers can “reach across” vast genomic distances to activate a promoter, much like two people can have a conversation across a large room.
The Splicing Challenge:
By the end of this chapter, you will be able to:
Before we introduce transformers, let’s understand exactly why CNNs face challenges with long-range interactions—and what that means biologically.
Recall from Chapter 9 that CNNs use filters (kernels) that slide across sequences, detecting local patterns. Each convolutional layer has a receptive field—the span of input sequence that can influence a single output position.
How receptive fields grow:
To reach a receptive field of 100,000 bp with kernel size 3:
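For intuition, a stack of stride-1 convolutions (no pooling or dilation) grows its receptive field by only kernel_size - 1 positions per layer, so reaching 100,000 bp with kernel size 3 would take on the order of 50,000 layers. A minimal sketch of that arithmetic (plain Python, no deep learning framework assumed):

```python
def stacked_conv_receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Receptive field of stacked stride-1 convolutions (no pooling, no dilation)."""
    return 1 + num_layers * (kernel_size - 1)

print(stacked_conv_receptive_field(10))    # 21 bp after 10 layers
print(stacked_conv_receptive_field(100))   # 201 bp after 100 layers

# Layers needed to reach a 100,000 bp receptive field with kernel size 3:
layers_needed = (100_000 - 1) // (3 - 1) + 1
print(layers_needed)                       # 50000 layers
```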
DeepSEA’s approach:
Basenji’s approach (Chapter 9):
Example 1: Enhancer-Promoter Interactions
The β-globin locus control region (LCR) contains enhancers 50-60 kb upstream of the HBB gene. Variants in the LCR cause β-thalassemia by reducing HBB expression.
Example 2: Splicing Across Large Genes
The DMD gene (dystrophin) spans 2.2 million bases:
A variant in intron 20 might affect exon 25 inclusion 50 kb away, but standard CNNs can’t integrate this information.
Researchers tried several approaches:
1. Dilated Convolutions (Atrous Convolutions)
2. Extremely Deep Networks
3. Larger Input Windows
None of these fully solve the long-range interaction problem. We need a fundamentally different architecture—one that can relate any position to any other position regardless of distance.
Enter transformers.
Transformers were introduced in the 2017 paper “Attention Is All You Need” by Vaswani et al. for machine translation. The key insight: replace sequential processing with parallel processing based on attention mechanisms.
Why transformers work for language:
Biological analogy:
Self-attention is the core innovation. Here’s how it works:
The Basic Idea: For each position in a sequence, attention computes how much to “attend to” every other position. Positions that are relevant get high attention scores; irrelevant positions get low scores.
Three Components (the QKV framework):
For each sequence position, we create three vectors: a Query ("what am I looking for?"), a Key ("what do I contain?"), and a Value (the information that gets passed along when this position is attended to).
How it works step-by-step:
Imagine analyzing a genomic sequence with three positions (simplified):
Position 1: TATA box (promoter element)
Position 2: random intronic sequence
Position 3: GATA motif (enhancer element)
Step 1: Create Q, K, V vectors. Each position gets Query, Key, and Value vectors (learned during training):
Step 2: Compute attention scores. For Position 1, we want to know: “Which other positions are relevant to me?”
Calculate: Q₁ · K₁, Q₁ · K₂, Q₁ · K₃ (dot products)
The dot product measures similarity—high values mean “these positions should interact.”
Step 3: Normalize with softmax. Convert the scores to probabilities that sum to 1: here, 0.2, 0.1, and 0.7 for Positions 1, 2, and 3.
Step 4: Weighted sum of values. Output for Position 1 = 0.2 × V₁ + 0.1 × V₂ + 0.7 × V₃
Position 1’s representation now incorporates strong information from Position 3 (the enhancer), weak information from Position 2.
The key insight: This happens in parallel for ALL positions, and they can all attend to each other regardless of distance.
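To make these steps concrete, here is a minimal NumPy sketch of scaled dot-product attention for the three toy positions. The Q, K, and V vectors are hand-picked stand-ins for learned parameters, chosen so that Position 1 attends mostly to Position 3, roughly reproducing the 0.2 / 0.1 / 0.7 weights above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (positions x positions) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Hypothetical, hand-picked vectors: pos 1 = TATA box, pos 2 = random intron,
# pos 3 = GATA enhancer motif.
Q = np.array([[1.0, 0.0], [0.1, 0.1], [0.0, 1.0]])
K = np.array([[0.2, 0.0], [-0.5, -0.5], [2.0, 0.0]])
V = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])

out, w = attention(Q, K, V)
print(np.round(w[0], 2))    # attention of position 1 over positions 1-3: ~[0.19 0.12 0.69]
print(np.round(out[0], 2))  # position 1's new representation: ~[0.19 0.69]
```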
[Optional: The Math]
Attention function:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Where:
- $Q$: Query matrix (sequence_length × d_k)
- $K$: Key matrix (sequence_length × d_k)
- $V$: Value matrix (sequence_length × d_v)
- $d_k$: dimension of the key vectors; dividing by $\sqrt{d_k}$ keeps the dot products from growing so large that the softmax saturates, which would otherwise produce vanishingly small gradients
- $QK^T$: produces attention scores for all position pairs
- softmax: normalizes to probabilities
- Result: weighted sum of values based on attention
Multi-head attention:
Instead of one attention mechanism, use multiple “heads” in parallel: $$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, …, \text{head}_h)W^O$$ where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
Different heads can specialize: Head 1 might learn promoter-enhancer interactions; Head 2 might learn exon-exon interactions; Head 3 might learn CTCF-CTCF (chromatin loop) interactions.
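A self-contained NumPy sketch of this multi-head pattern. The projection matrices are random stand-ins for learned weights, and the head count and dimensions are arbitrary choices for illustration.

```python
import numpy as np
rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, n_heads=4, d_head=8):
    """Project X into per-head Q, K, V, attend in each head, concatenate, project back."""
    d_model = X.shape[-1]
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(attention(X @ Wq, X @ Wk, X @ Wv))
    Wo = rng.normal(size=(n_heads * d_head, d_model))
    return np.concatenate(heads, axis=-1) @ Wo

X = rng.normal(size=(16, 32))         # 16 positions, 32-dimensional embeddings
print(multi_head_attention(X).shape)  # (16, 32)
```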
Positional Encoding: Transformers process all positions in parallel, losing sequence order information. To preserve order, positional encodings are added to input embeddings: $$PE_{(pos,\ 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos,\ 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
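A small sketch of the sinusoidal encoding above (NumPy; the sequence length and embedding dimension here are arbitrary). Some genomic transformers, including Enformer, use relative rather than absolute positional encodings, but the sinusoidal form is the classic starting point.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    i = np.arange(d // 2)[None, :]                    # (1, d/2)
    angles = positions / np.power(10000.0, 2 * i / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)   # even embedding dimensions
    pe[:, 1::2] = np.cos(angles)   # odd embedding dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=1536, d=128)
print(pe.shape)  # (1536, 128): one encoding vector per sequence position
```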
Basenji (Chapter 9 recap):
Enformer (Avsec et al., 2021):
Enformer combines the best of both worlds: CNNs for local patterns, transformers for long-range interactions.
Architecture overview:
Input: 196,608 bp DNA sequence
↓
[Convolutional Stem]
- 7 convolutional blocks
- Reduces sequence length (downsampling)
- Extracts local features
- Output: 1,536 positions × 1,536 features
↓
[Transformer Layers]
- 11 transformer blocks
- Multi-head self-attention
- Each position can attend to all others
- Learns long-range dependencies
↓
[Prediction Heads]
- Predicts 5,313 experimental tracks:
* CAGE (gene expression)
* DNase-seq (accessibility)
* H3K4me3, H3K27ac, etc. (histone marks)
- Outputs: bin-level predictions across sequence
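To show how the pieces fit together, here is a heavily simplified PyTorch sketch of the same conv-stem-plus-transformer pattern. It is not the published Enformer implementation: the stem, pooling, layer sizes, and output activation are illustrative placeholders, and the demo instantiation uses tiny dimensions so it runs quickly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEnformerLike(nn.Module):
    """Toy conv-stem + transformer + heads pattern (illustrative, not the real Enformer)."""

    def __init__(self, n_tracks=5313, d_model=1536, n_layers=11, n_heads=8):
        super().__init__()
        # Convolutional stem: detect local motifs and downsample to coarse bins.
        # Input channels = 4 (one-hot A, C, G, T).
        self.stem = nn.Sequential(
            nn.Conv1d(4, d_model, kernel_size=15, padding=7),
            nn.MaxPool1d(kernel_size=128),   # crude stand-in for the pooling tower
            nn.ReLU(),
        )
        # Transformer: every bin can attend to every other bin in the window.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Prediction head: one non-negative value per bin per experimental track.
        self.head = nn.Linear(d_model, n_tracks)

    def forward(self, one_hot_seq):          # (batch, 4, seq_len)
        x = self.stem(one_hot_seq)           # (batch, d_model, bins)
        x = x.transpose(1, 2)                # (batch, bins, d_model)
        x = self.transformer(x)              # long-range mixing across bins
        return F.softplus(self.head(x))      # (batch, bins, n_tracks)

# Tiny instantiation so the sketch runs quickly; the real model uses
# d_model=1536, 11 transformer layers, 8 heads, and 5,313 output tracks.
model = TinyEnformerLike(n_tracks=8, d_model=96, n_layers=2, n_heads=4)
toy_seq = torch.zeros(2, 4, 12_800)          # placeholder one-hot DNA, 12.8 kb
print(model(toy_seq).shape)                  # torch.Size([2, 100, 8])
```

The division of labor is the point: convolutions turn raw bases into bin-level features cheaply, and attention then relates those bins to one another regardless of the distance between them.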
Why this hybrid approach?
How to predict variant effects:
Reference prediction: Input 197 kb sequence centered on gene/variant → get predictions for all 5,313 tracks
Variant prediction: Input same sequence with variant substituted → get predictions for all tracks
Calculate difference:
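A sketch of this reference-versus-alternate workflow. The `model.predict`, `one_hot_encode`, and `variant_effect_scores` names are hypothetical helpers for illustration, not the API of any particular package, and a single-nucleotide substitution is assumed.

```python
import numpy as np

def one_hot_encode(seq: str) -> np.ndarray:
    """Minimal A/C/G/T one-hot encoder; unknown bases become all-zero rows."""
    lookup = {"A": 0, "C": 1, "G": 2, "T": 3}
    arr = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in lookup:
            arr[i, lookup[base]] = 1.0
    return arr

def variant_effect_scores(model, ref_seq: str, alt_allele: str, offset: int):
    """Score a variant as the difference between alternate and reference predictions.

    model      : any object with .predict(one_hot) -> (bins, tracks) array
    ref_seq    : reference window centered on the gene/variant (e.g. ~197 kb)
    alt_allele : substituted base at `offset` (0-based; single-nucleotide variant assumed)
    """
    alt_seq = ref_seq[:offset] + alt_allele + ref_seq[offset + len(alt_allele):]

    ref_pred = model.predict(one_hot_encode(ref_seq))    # (bins, tracks)
    alt_pred = model.predict(one_hot_encode(alt_seq))    # (bins, tracks)

    delta = alt_pred - ref_pred                          # per-bin, per-track change
    per_track_max_change = np.abs(delta).max(axis=0)     # one summary number per track
    return delta, per_track_max_change
```

In practice the per-track differences are usually summarized further, for example by aggregating the change in CAGE predictions over bins near the target gene's transcription start site.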
Example: Enhancer variant analysis
Hypothetical example: a variant 85 kb upstream of FOXP2 lies in a brain-active enhancer. Enformer can compare reference and alternate sequences across a ~200 kb window and predict whether the alternate allele changes CAGE or chromatin-track outputs near FOXP2. Such a prediction would prioritize the variant for follow-up, but it would still need eQTL, reporter, CRISPRi, or other functional evidence.
Performance metrics (Avsec et al., 2021) depend on the assay and evaluation split. For CAGE gene-expression prediction at human protein-coding genes, the reported mean correlation increased from 0.81 for Basenji2 to 0.85 for Enformer. Across genome-wide tracks, Enformer also improved prediction of CAGE, histone marks, transcription factor binding, and DNA accessibility on held-out genomic regions.
By visualizing attention weights and contribution scores, researchers found evidence that Enformer uses biologically meaningful sequence context:
Promoter-enhancer grammar: contribution scores highlight distal enhancer-like sequences, often agreeing with H3K27ac and CRISPRi enhancer data
CTCF-associated boundaries: attention and contribution patterns can reflect CTCF/TAD-boundary sequence features
Splicing regulatory elements: sequence models can highlight exons and intronic elements that may influence splice-site choice
This is remarkable, but subtle: Enformer learned sequence features associated with distal regulation and TAD boundaries from linear DNA sequence alone. This is not the same as measuring 3D chromatin contacts directly.
Chromatin states represent distinct functional categories of genomic regions:
The Sei approach:
Sei (Chen et al., 2022) uses a deep sequence model to predict regulatory profiles and summarize them into sequence classes:
Input: 4,096 bp DNA sequence
↓
[CNN Encoder] — extracts local sequence features
↓
[Sequence representation layers] — integrate information across the 4 kb window
↓
[Prediction Heads]
- Predicts 21,907 TF binding profiles
- Predicts chromatin accessibility
- Predicts histone modification patterns
↓
[Chromatin State Classification]
- Maps predictions to 40 distinct chromatin states
- Tissue-specific state predictions
Sei’s approach:
Sequence class scoring: Each sequence gets a “class score” for chromatin states
Variant effect as class shift:
Example application:
Variant rs7903146, an intronic type 2 diabetes GWAS variant in TCF7L2 (the class scores below are illustrative, not actual Sei output):
Reference sequence:
- Active enhancer (pancreatic islet): 0.88
- Weak enhancer (liver): 0.34
- Quiescent (brain): 0.02
Variant sequence:
- Active enhancer (pancreatic islet): 0.12 ← decreased!
- Weak enhancer (liver): 0.31 ← minimal change
- Quiescent (brain): 0.03 ← no change
Interpretation: the variant is predicted to disrupt the enhancer specifically in pancreatic islet cells, where TCF7L2 regulates insulin secretion. This matches the tissue specificity of type 2 diabetes.
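Expressed as code, the class-shift idea is just a per-class difference between alternate and reference scores. The numbers below are the illustrative scores from this example, not real Sei output.

```python
# Illustrative sequence-class scores from the example above (hypothetical values).
ref_scores = {"Active enhancer (islet)": 0.88, "Weak enhancer (liver)": 0.34,
              "Quiescent (brain)": 0.02}
alt_scores = {"Active enhancer (islet)": 0.12, "Weak enhancer (liver)": 0.31,
              "Quiescent (brain)": 0.03}

# Variant effect as a class shift: alternate minus reference, per class.
shifts = {name: alt_scores[name] - ref_scores[name] for name in ref_scores}
top_hit = max(shifts, key=lambda name: abs(shifts[name]))
print(top_hit, round(shifts[top_hit], 2))   # Active enhancer (islet) -0.76
```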
One advantage of attention mechanisms: interpretability.
For a sequence predicted as “strong enhancer”:
Pre-mRNA splicing:
Splicing signals:
The challenge: Splicing signals can be >1,000 bp from splice sites. Deep intronic variants can activate cryptic sites. The DMD gene spans 2.2 million base pairs, with variants deep in introns capable of causing Duchenne muscular dystrophy.
SpliceAI (Jaganathan et al., 2019) uses dilated convolutions to achieve long-range predictions without a full transformer architecture.
Input: Up to 10,000 bp sequence (centered on potential splice site)
↓
[Residual Blocks with Dilated Convolutions]
Block 1: dilation = 1 (local context)
Block 2: dilation = 2
Block 3: dilation = 4
Block 4: dilation = 8
...
Block 8: dilation = 128
(schematic doubling schedule; the published SpliceAI-10k model uses groups of residual blocks with progressively wider kernels and larger dilations to span the full 10 kb window)
↓
[Output Predictions at Each Position]
- Acceptor splice site probability
- Donor splice site probability
- Probability position is in exon
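Each stride-1 dilated convolution adds (kernel_size - 1) × dilation positions to the receptive field, so a geometrically increasing dilation schedule covers kilobases with only a handful of blocks. A quick calculation using the schematic doubling schedule above (the published model's exact kernel widths and dilations differ):

```python
def dilated_receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 dilated convolutions (one conv per block)."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Doubling schedule from the schematic above, kernel size 3:
print(dilated_receptive_field(3, [1, 2, 4, 8, 16, 32, 64, 128]))   # 511 bp

# Wider kernels and more convolutions per dilation level push this to ~10,000 bp,
# which is how the published model covers its full input window.
```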
Why this works:
SpliceAI score for variants:
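Concretely, SpliceAI-style variant scoring runs the model on the reference and alternate sequences and takes, for each of four categories (acceptor gain, acceptor loss, donor gain, donor loss), the largest probability change within a window around the variant; the headline delta score is the maximum of the four. A minimal sketch, assuming `ref_probs` and `alt_probs` are (positions, 2) arrays of acceptor and donor probabilities for a single-nucleotide variant (so coordinates line up):

```python
import numpy as np

def spliceai_style_delta_scores(ref_probs, alt_probs, variant_pos, window=50):
    """Per-category delta scores within +/- `window` bp of the variant.

    ref_probs, alt_probs : arrays of shape (positions, 2) holding
                           [acceptor probability, donor probability]
                           for the reference and alternate sequences.
    """
    lo = max(0, variant_pos - window)
    hi = min(len(ref_probs), variant_pos + window + 1)
    diff = alt_probs[lo:hi] - ref_probs[lo:hi]

    return {
        "acceptor_gain": float(np.max(diff[:, 0].clip(min=0))),
        "acceptor_loss": float(np.max((-diff[:, 0]).clip(min=0))),
        "donor_gain":    float(np.max(diff[:, 1].clip(min=0))),
        "donor_loss":    float(np.max((-diff[:, 1]).clip(min=0))),
    }

# The headline delta score is the maximum of the four categories; commonly used
# cutoffs in practice are around 0.2 (high recall), 0.5, and 0.8 (high precision).
```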
Types of splicing effects detected:
Performance on known splicing variants:
Jaganathan et al. (2019) tested SpliceAI on:
Example: Synonymous variant with splicing impact
Patient with cystic fibrosis:
Precomputed SpliceAI annotations are commonly used through tools such as VEP plugins, dbNSFP-style annotation resources, and clinical analysis pipelines; the scores are distributed as separate resources rather than being part of databases like ClinVar.
One of the most exciting aspects of transformer models is that we can visualize what they learn by examining attention weights.
What are attention weights?
Example (illustrative, hypothetical weights): Enformer-style attention for the β-globin locus
Analyzing HBB gene promoter region:
Positions (schematic): Promoter [0], LCR [50 kb], Intergenic [75 kb], Downstream [100 kb]
Attention weights from promoter position:
- Self (promoter): 0.15
- LCR (50 kb away): 0.78 ← Strong attention!
- Intergenic: 0.02
- Downstream: 0.05
Interpretation: the model assigns high relevance to a distal regulatory region. This is a hypothesis about enhancer-promoter regulation, not proof of a physical loop by itself.
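A quick way to eyeball weights like these is a simple bar chart; the values below are the illustrative numbers from this example, and matplotlib is assumed to be available.

```python
import matplotlib.pyplot as plt

# Illustrative attention weights from the promoter position (hypothetical values).
targets = ["Promoter (self)", "LCR (50 kb)", "Intergenic (75 kb)", "Downstream (100 kb)"]
weights = [0.15, 0.78, 0.02, 0.05]

plt.figure(figsize=(5, 3))
plt.bar(targets, weights)
plt.ylabel("Attention weight")
plt.title("Attention from the HBB promoter position (illustrative)")
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.show()
```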
Comparing with chromatin data:
Discovered patterns:
Promoter-Enhancer Syntax: distal regulatory sequences can contribute to promoter predictions; CTCF/TAD-boundary patterns can shape how information flows across the input window
Splice Site Recognition: Attention between donor and acceptor sites; strong attention within exons (exon definition); attention to regulatory elements (ESEs, ISSs)
Motif Co-occurrence: AP-1 and ETS motifs attend to each other, suggesting cooperative TF binding—this matches known TF partnerships
Attention is not causation:
Best practice:
Avsec et al. (2021) tested whether Enformer contribution scores could prioritize candidate enhancer-gene links without requiring new cell-type-specific Hi-C data for every prediction.
Experimental setup:
Enhancer prioritization: Enformer contribution scores ranked validated enhancer-gene pairs above many nonvalidated candidates across CRISPRi datasets. The model was competitive with some annotation-based enhancer prioritization approaches, even though its input was DNA sequence.
TAD-boundary signal: Attention analyses showed patterns around TAD boundaries that were consistent with reduced information flow across boundaries. This supports the idea that the model learned sequence features associated with genome organization, but it should not be read as a direct Hi-C substitute.
Discovered regulatory grammar:
Patient presentation:
Genetic testing:
Initial findings:
SpliceAI analysis:
Prediction details:
Reference sequence:
- No cryptic splice sites predicted
- Normal exon 7 - exon 8 splicing
Variant sequence:
- Creates cryptic donor site (GT) at +784
- Predicts pseudoexon inclusion
- 47 bp insertion in mRNA
- Frameshift → premature stop codon
RT-PCR from patient muscle biopsy:
Protein analysis:
For this patient:
Broader implications:
| Term | Definition |
|---|---|
| Attention mechanism | Method for computing relevance of different positions in a sequence to each other, enabling long-range dependency modeling regardless of distance. |
| Attention weight | Numerical score (0 to 1) quantifying how much one position “attends to” another position; high values indicate strong predicted interactions. |
| Chromatin loop | Physical interaction between distant genomic regions mediated by protein complexes; often brings enhancers and promoters into proximity. |
| Cryptic splice site | Sequence resembling a splice site (GT-AG) that is normally not used but can be activated by variants or regulatory changes. |
| Dilated convolution | Convolutional operation with gaps between kernel positions, enabling larger receptive fields without increasing parameters. |
| Enformer | Transformer-based model predicting gene expression and chromatin features from ~200 kb DNA sequences; improves over CNN-based Basenji2 by better using distal regulatory sequence. |
| Exonic splicing enhancer (ESE) | Sequence element in exons that promotes exon inclusion through binding of SR proteins. |
| Intronic splicing silencer (ISS) | Sequence element in introns that suppresses splice site recognition, preventing cryptic splicing. |
| Multi-head attention | Parallel attention mechanisms (heads) that can learn different types of relationships; outputs are combined for final representation. |
| Positional encoding | Mathematical representation of sequence position added to input embeddings, preserving order information in transformers. |
| Query-Key-Value (QKV) | Framework for attention computation where queries search for relevant keys to retrieve corresponding values. |
| Receptive field | Span of input sequence that can influence a single output position; larger receptive fields capture longer-range dependencies. |
| Sei | Sequence-based chromatin profile model that uses deep learning to assign sequence classes and support noncoding variant interpretation. |
| Self-attention | Attention mechanism where a sequence attends to itself, computing relationships between all position pairs within the same sequence. |
| SpliceAI | Deep learning model predicting splicing effects using dilated convolutions; achieves 10 kb receptive field for splice site prediction. |
| Transformer | Neural network architecture based on self-attention mechanisms, originally developed for natural language processing, now applied to genomics. |
Receptive field limitations: Explain why a CNN with 20 convolutional layers (kernel size 3) still cannot effectively model interactions between an enhancer 100 kb upstream and its target promoter. What specifically limits the flow of information?
Attention vs convolution: A researcher argues that “attention is just a fancy way of doing convolution across the entire sequence.” Explain why this is incorrect and what fundamentally distinguishes the attention mechanism from convolution.
Biological relevance of long-range modeling: The dystrophin gene (DMD) spans 2.2 million bases. Would Enformer (197 kb input) be sufficient to analyze all potential regulatory interactions for this gene? Why or why not? What biological features might be missed?
Attention interpretability: When Enformer shows high attention between two genomic positions 80 kb apart, what can we confidently conclude, and what remains uncertain? How would you validate that this attention reflects a functional regulatory interaction?
Tissue-specific predictions: SpliceAI achieves roughly 95% top-k accuracy at predicting splice sites but doesn’t explicitly model tissue types. How can it predict splicing so accurately given that splicing is often tissue-specific? What information in the sequence might enable tissue-agnostic prediction?
Dilated convolutions vs transformers: SpliceAI uses dilated convolutions while Enformer uses transformers, both achieving long-range predictions. Compare these approaches: What are the computational trade-offs? When might you prefer one over the other?
Variant effect prediction: A noncoding variant 150 kb upstream of a gene shows a large predicted effect in Enformer but is absent from GWAS studies of any trait. What might explain this discrepancy? What additional evidence would you seek?
Multi-head attention specialization: Enformer uses 8 attention heads per layer. What biological advantage might there be to having multiple parallel attention mechanisms rather than a single, larger attention mechanism?