Start with a single nucleotide — an A at one position in the genome. Now zoom out to 100 bases: you see a transcription factor binding motif. Zoom to 1,000: an exon boundary with splice signals. Zoom to 10,000: a complete gene with its promoter. Zoom to 100,000: an entire regulatory domain with enhancers, insulators, and their target genes interacting across vast distances. Zoom to 1,000,000: a topologically associating domain where the 3D folding of chromatin determines which enhancers can reach which promoters. At every scale, new biology emerges. The genome is not a flat string of letters — it is a deeply hierarchical document, and understanding it requires reading at multiple resolutions simultaneously.
But until recently, most AI models could only see a few hundred to a few thousand bases at a time — like trying to understand a city by looking through a keyhole. The transformer models from the previous chapter are powerful, but they carry a fundamental computational burden: their attention mechanism has quadratic complexity. Double the sequence length and the computation quadruples. Increase it tenfold and the computation increases a hundredfold. At 1 million base pairs, the attention matrix alone would require terabytes of GPU memory. The biology demands scale. The architecture forbids it.
This collision between biological necessity and computational constraint has driven a new wave of architectural innovation. Researchers have asked: what if we could capture long-range dependencies without computing all pairwise attention? What if the architecture scaled linearly with sequence length instead of quadratically? The answer has come from an unlikely direction — state space models and convolution operators originally developed for signal processing, now repurposed to read DNA at scales transformers cannot reach.
This chapter is about models that finally open the door: HyenaDNA, Mamba, and Caduceus. They represent not just engineering improvements but a genuine expansion in what questions genomics AI can ask.
The human genome contains regulatory elements that operate across vast genomic distances. Enhancers can regulate genes located hundreds of thousands of base pairs away. Topologically associating domains (TADs) span megabase-scale regions. Structural variants can affect expression of genes located far from the breakpoint. Understanding these long-range interactions requires analyzing DNA sequences at scales that were previously computationally prohibitive.
Standard Transformer models face a fundamental limitation: their attention mechanism has quadratic complexity, meaning the computational cost scales with the square of the sequence length. This quadratic scaling creates practical barriers: the L × L attention matrix must be computed and stored at every layer, training time grows quadratically with context, and practical context windows top out at a few thousand bases.
Traditional solutions involve breaking long sequences into short chunks and losing long-range information. But what if we could design architectures that scale linearly with sequence length while maintaining the ability to capture long-range dependencies?
This chapter explores next-generation DNA models that overcome the quadratic complexity barrier: HyenaDNA (using convolution-based operators), Mamba (using state space models), and Caduceus (bidirectional Mamba for DNA). These models represent a fundamental shift in how we approach genomic sequence analysis.
After completing this chapter, you will be able to:
- Explain why the quadratic complexity of attention limits genomic context length
- Describe how HyenaDNA, Mamba, and Caduceus achieve sub-quadratic scaling
- Compare the strengths and trade-offs of each architecture for specific genomic tasks
- Identify biological questions that become tractable with megabase-scale context
Let’s make the computational challenge concrete. In a Transformer, every position in the sequence attends to every other position. For a sequence of length L, this means computing and storing an L × L matrix of attention scores: O(L²) operations and O(L²) memory per attention layer.
Let’s calculate what this means for real genomic sequences:
Example: BRCA1 gene region (100,000 bp). Attention over 100,000 positions requires 10¹⁰ pairwise scores, roughly 40 GB for a single attention matrix in 32-bit floats, before accounting for multiple heads and layers.
Example: Chromosome 21 (46 million bp). The attention matrix would have roughly 2 × 10¹⁵ entries, on the order of 8 petabytes in 32-bit floats. No amount of hardware engineering closes a gap of that size.
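To see the scaling concretely, here is a back-of-the-envelope calculation in plain Python (our sketch; it assumes one attention matrix in 32-bit floats and ignores heads, layers, and activations):

```python
def attention_matrix_gb(seq_len: int, bytes_per_entry: int = 4) -> float:
    """Memory for a single L x L attention matrix, in gigabytes."""
    return seq_len**2 * bytes_per_entry / 1e9

for L in [512, 100_000, 1_000_000, 46_000_000]:
    print(f"L = {L:>10,}: {attention_matrix_gb(L):>15,.2f} GB")
# L =        512:            0.00 GB
# L =    100,000:           40.00 GB
# L =  1,000,000:        4,000.00 GB
# L = 46,000,000:    8,464,000.00 GB  (~8.5 petabytes)
```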
You might ask: why not analyze DNA in short chunks? This approach has several limitations: a regulatory element and its target often land in different chunks, no information flows across chunk boundaries, and chunk edges fall at arbitrary positions relative to biological features.
Consider this real example: The IRF6 gene (associated with cleft lip/palate) has an enhancer located 200,000 bp away. Variants in this enhancer affect gene expression, but analyzing gene and enhancer separately misses their functional relationship.
Before next-generation models, researchers tried several approaches:
- Sparse attention (e.g., Longformer, BigBird): each position attends only to a subset of others (local windows plus a few global tokens), reducing cost but giving up full pairwise access.
- Sliding windows: the model scans the sequence in overlapping fixed-size windows, so any dependency longer than the window is invisible.
- Hierarchical models: sequences are compressed into coarser representations at higher levels, which extends reach but sacrifices nucleotide-level resolution.
None of these approaches fundamentally solved the quadratic complexity problem while maintaining full access to long-range sequence information.
Biological Analogy (HyenaDNA long context): Instead of reading in short 512 bp windows, the entire genomic locus (100 kb) can be read at once — distant enhancer-promoter relationships are not missed.
HyenaDNA replaces the attention mechanism with a convolutional operator that can be computed efficiently using the Fast Fourier Transform (FFT). The key insight: convolutions can capture long-range dependencies through their filter design, and FFT makes them computationally efficient.
The model uses what the authors call the Hyena operator, which combines two ingredients: implicitly parameterized long convolutions, whose filters are generated by a small network rather than stored explicitly, and data-controlled gating, element-wise multiplications that modulate the signal based on the input itself.
Complexity comparison: attention costs O(L²), while the FFT-based Hyena operator costs O(L log L). For L = 1,000,000, this means roughly 10¹² pairwise operations for attention versus about 2 × 10⁷ for the FFT-based convolution, a speedup on the order of 50,000×.
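That ratio is easy to check (plain Python, counting idealized operations rather than wall-clock time on real hardware):

```python
import math

L = 1_000_000
attention_ops = L**2             # one score per pair of positions: ~1e12
fft_conv_ops = L * math.log2(L)  # FFT-style L log L scaling: ~2e7
print(f"speedup ~ {attention_ops / fft_conv_ops:,.0f}x")  # ~50,171x
```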
The Hyena operator works in three steps:
Step 1: Projection. The input sequence X is mapped through three parallel projections (analogous to Q, K, V in attention), here called v, x₁, and x₂.
Step 2: Long Convolution. Instead of attention scores, a learned convolutional filter, potentially as long as the sequence itself, is applied to v and computed efficiently with the FFT.
Step 3: Gating. The convolution outputs are combined with the remaining projections by element-wise multiplication, so the input itself controls which filtered signals pass through.
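Putting the three steps together, here is a minimal PyTorch sketch of an order-2 Hyena-style operator (function names, shapes, and the per-channel filter layout are our assumptions for illustration, not the released HyenaDNA code):

```python
import torch

def fft_long_conv(u: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    """Causal long convolution y = h * u via FFT in O(L log L). u, h: (..., L)."""
    L = u.shape[-1]
    n = 2 * L  # zero-pad so circular convolution equals linear convolution
    y = torch.fft.irfft(torch.fft.rfft(u, n=n) * torch.fft.rfft(h, n=n), n=n)
    return y[..., :L]

def hyena_operator(x, proj_v, proj_x1, proj_x2, h1, h2):
    """Order-2 Hyena-style operator on (batch, channels, L) inputs."""
    v, x1, x2 = proj_v(x), proj_x1(x), proj_x2(x)  # Step 1: three projections
    z = x1 * fft_long_conv(v, h1)                  # Steps 2-3: convolve, then gate
    return x2 * fft_long_conv(z, h2)               # second conv + gate (order 2)

# Toy usage: 8 channels over 1,024 positions; filters are as long as the input
d, L = 8, 1024
projs = [torch.nn.Conv1d(d, d, kernel_size=1) for _ in range(3)]
h1, h2 = 0.01 * torch.randn(d, L), 0.01 * torch.randn(d, L)
y = hyena_operator(torch.randn(1, d, L), *projs, h1, h2)  # y: (1, 8, 1024)
```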
HyenaDNA was pretrained on the human reference genome using next-nucleotide (autoregressive) prediction over single-character tokens, with the context length gradually increased to 1 million bp through a sequence-length warm-up schedule.
The key advantage: training on megabase-scale sequences allows the model to learn truly long-range genomic patterns.
Analysis of the trained model and downstream tasks suggests it can capture sequence patterns across much longer windows than standard attention models:
Long-range sequence context: downstream results suggest the model draws on information from positions tens to hundreds of kilobases away, far beyond the few-kilobase windows of standard attention models.
Regulatory element interactions: an enhancer and its distant target promoter can fall within a single input window, so their joint sequence patterns are visible to the model during training.
Repetitive element patterns: dispersed repeat families such as Alu and LINE elements recur across long distances, and long windows let the model relate far-apart copies of the same family.
[Optional: The Math]
Math Box: FFT and Convolution
Why FFT Makes Convolution Fast
A convolution between sequence X (length L) and filter H (length K) requires:
- Direct computation: O(L × K) operations
- FFT-based computation: O(L log L + K log K)
For long filters (K ≈ L), the FFT approach is dramatically faster.
The Convolution Theorem: Convolution in time domain = Multiplication in frequency domain
Conv(X, H) = IFFT(FFT(X) ⊙ FFT(H))

Where:
- FFT: Fast Fourier Transform (O(L log L))
- ⊙: Element-wise multiplication (O(L))
- IFFT: Inverse FFT (O(L log L))
Total complexity: O(L log L) instead of O(L²)
For genomic sequences:
- L = 1,000,000 bp
- Direct convolution: 1 trillion operations
- FFT convolution: 20 million operations
Biological interpretation: This is like analyzing all possible enhancer-promoter pairs simultaneously, but at a fraction of the computational cost of checking each pair individually.
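The theorem is easy to verify numerically. This NumPy snippet (ours) checks that direct and FFT-based convolution give the same answer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)   # "sequence"
h = rng.standard_normal(1024)   # "filter"

direct = np.convolve(x, h)      # O(L*K) direct computation
n = len(direct)                 # full linear convolution has length L + K - 1
via_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

print(np.allclose(direct, via_fft))  # True: same result at O(L log L) cost
```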
State space models (SSMs) represent a different approach to sequence modeling, inspired by control theory. Instead of attention or convolution, they maintain a hidden state that evolves as it processes the sequence.
Key concept: At each position, the model reads the current input, updates a fixed-size hidden state, and produces an output from that state. Nothing is ever compared against all previous positions: the state is the memory.
Think of it like a cell integrating signals over time: the cell’s internal state summarizes everything it has sensed so far, and each new signal nudges that state rather than forcing the cell to re-examine its entire history.
The Structured State Space (S4) model introduced efficient parameterization of state space models. The core equations:
h(t+1) = A × h(t) + B × x(t) # State update
y(t) = C × h(t) # Output
Where: h(t) is the hidden state, x(t) is the input at position t, y(t) is the output, A is the state transition matrix, B maps the input into the state, and C reads the output from the state.
Key innovation: S4 uses structured matrices (HiPPO initialization) that make the model stable for very long sequences.
Complexity: O(L) for sequence length L
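Here is a minimal NumPy sketch of that recurrence (a didactic position-by-position loop; production S4/Mamba implementations use parallel scans and fused kernels instead):

```python
import numpy as np

def ssm_scan(A, B, C, x):
    """Discrete linear SSM: h(t+1) = A h(t) + B x(t), y(t) = C h(t).
    A: (N, N), B: (N,), C: (N,), x: (L,) -> y: (L,). One pass: O(L)."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                # single left-to-right scan
        h = A @ h + B * x_t      # state update
        ys.append(C @ h)         # readout
    return np.array(ys)

# Toy usage: a 4-dimensional state over a length-10 input
N = 4
y = ssm_scan(0.9 * np.eye(N), np.ones(N), np.ones(N) / N, np.arange(10.0))
```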
Biological Analogy (Mamba / state space models): Like simultaneously holding short-term and long-term memory while reading a genome — processed efficiently without the computational cost of attention.
Mamba improves upon S4 by making the state space parameters data-dependent. Instead of fixed A, B, C matrices, Mamba computes them based on the input:
B(t) = Linear_B(x(t)) # Input-dependent
C(t) = Linear_C(x(t)) # Input-dependent
Δ(t) = Softplus(Linear_Δ(x(t))) # Input-dependent discretization
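Below is a hedged sketch of how these input-dependent parameters enter the recurrence, for one channel with a diagonal state and a simplified Euler-style discretization (the real Mamba kernel uses a fused, hardware-aware scan):

```python
import torch
import torch.nn.functional as F

def selective_scan(x, A, W_B, W_C, W_dt):
    """Mamba-style selective scan for one channel. x: (L,);
    A: (N,) diagonal (negative) state decay; W_B, W_C: nn.Linear(1, N);
    W_dt: nn.Linear(1, 1). Simplified for teaching, not the real kernel."""
    h = torch.zeros_like(A)
    ys = []
    for x_t in x:
        xt = x_t.view(1)
        B_t = W_B(xt)                   # input-dependent B(t)
        C_t = W_C(xt)                   # input-dependent C(t)
        dt = F.softplus(W_dt(xt))       # input-dependent step size Δ(t)
        A_bar = torch.exp(dt * A)       # discretize: small Δ ≈ keep old state
        h = A_bar * h + dt * B_t * x_t  # selective update: write, or mostly skip
        ys.append((C_t * h).sum())      # readout
    return torch.stack(ys)

# Hypothetical usage with a 16-dimensional state over a length-64 input
N = 16
A = -torch.rand(N)                      # negative values => decaying memory
W_B, W_C = torch.nn.Linear(1, N), torch.nn.Linear(1, N)
W_dt = torch.nn.Linear(1, 1)
y = selective_scan(torch.randn(64), A, W_B, W_C, W_dt)
```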
Why this matters for DNA: the model can decide, from the sequence itself, what to write into memory and what to let decay. A conserved regulatory motif can be stored and carried forward for hundreds of kilobases, while a stretch of low-complexity repeat can be passed over quickly.
Think of it like a researcher scanning a chromosome: skimming quickly through repetitive stretches, then slowing down to take careful notes at promoters, splice sites, and other informative regions.
Mamba achieves linear complexity while maintaining long-range capability: a single O(L) scan over the sequence, carrying a fixed-size state from position to position. Compared to Transformers, there is no L × L attention matrix to compute or store, so memory grows linearly with sequence length and megabase-scale inputs become practical on a single GPU.
Both strands of DNA are biologically meaningful:
Transcription factor binding sites can appear on either strand. Genes can be encoded on either strand. Regulatory elements work regardless of orientation.
Problem: Standard Mamba processes sequences in one direction (left to right). This creates an asymmetry that doesn’t reflect biological reality.
Caduceus solves this by using bidirectional Mamba layers:
Approach 1: RC-Augmentation. The model is trained on each sequence and its reverse complement, and predictions from the two strands can be averaged at test time, encouraging strand-symmetric behavior without changing the architecture.
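Reverse complementation itself is a one-liner, and strand-averaged scoring follows directly. In this minimal sketch (ours), model_score stands in for any function mapping a sequence to a prediction:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string, e.g. 'ATGC' -> 'GCAT'."""
    return seq.translate(COMPLEMENT)[::-1]

def rc_averaged_score(model_score, seq: str) -> float:
    """RC-augmentation at inference: average the prediction over both strands."""
    return 0.5 * (model_score(seq) + model_score(reverse_complement(seq)))
```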
Approach 2: BiMamba Layers. Each block runs Mamba over the sequence in both directions and combines the two passes, building bidirectionality into the architecture itself.
The BiMamba block, in pseudocode (mamba_fwd and mamba_rev are two separate Mamba layers; x has shape (batch, L, d)):
# Forward pass: left-to-right scan
h_fwd = mamba_fwd(x)
# Reverse pass: flip the sequence, scan, flip back
h_rev = mamba_rev(x.flip(-2)).flip(-2)
# Mix the two directions
output = h_fwd + h_rev  # or a learnable combination
Caduceus uses a family of bidirectional and reverse-complement-aware MambaDNA blocks. Reported checkpoints vary in size and context length, so the safest description is architectural rather than a single fixed recipe: BiMamba blocks provide bidirectionality, and reverse-complement equivariance comes from weight sharing between strands or from RC-augmentation.
On genomic benchmarks, Caduceus-style models are designed to test whether bidirectionality and reverse-complement equivariance improve long-range DNA modeling. Reported gains are benchmark-dependent, so use the points below as qualitative expectations:
Regulatory element prediction: on genomic benchmark suites, Caduceus-style models are reported to be competitive with, and sometimes better than, similarly sized Transformer baselines.
Variant effect prediction: reported gains are largest for variants far from the features they affect, where long context and strand symmetry both plausibly help.
Downstream fine-tuning: pretrained checkpoints can be fine-tuned on modest hardware, since the linear-complexity layers keep memory requirements low.
Let’s compare the three architectures:
| Feature | HyenaDNA | Mamba | Caduceus |
|---|---|---|---|
| Core mechanism | Long convolution | State space model | Bidirectional SSM |
| Complexity | O(L log L) | O(L) | O(L) |
| Max context | 1M bp | 1M+ bp | 1M bp |
| Directionality | Causal (unidirectional) | Unidirectional | Explicitly bidirectional |
| Training speed | Fast | Very fast | Very fast |
| Inference speed | Very fast | Very fast | Very fast |
| Memory usage | Low | Very low | Very low |
HyenaDNA: choose it when you need single-nucleotide resolution over very long windows and a convolutional inductive bias for sequence patterns.
Mamba: choose it when raw throughput matters most, for example when scanning many megabase-scale regions, or when autoregressive generation is needed.
Caduceus: choose it for strand-symmetric tasks such as transcription factor binding or variant effect prediction, where both strands carry signal.
Still use Transformers when: sequences are short (up to a few kilobases), when attention maps are wanted for interpretability, or when a well-validated short-context pretrained model already fits the task.
Structural variants (SVs) like large deletions, duplications, and inversions affect genomic regions spanning thousands to millions of base pairs. Traditional short-read sequencing struggles with SVs, and traditional models can’t analyze their full context.
Case example: Analyzing a 500 kb deletion
Using Caduceus with 1M bp context: the entire deleted region plus roughly 250 kb of flanking sequence on each side fits into a single window.
This enables: scoring the locus with and without the deletion, asking which genes lose nearby regulatory sequence, and generating hypotheses about position effects on neighboring genes.
Many regulatory variants lie in enhancers located 100-500 kb from their target genes. Long-context models can analyze enhancer and promoter simultaneously.
Example workflow: extract the genomic window spanning both enhancer and promoter, score the reference and the variant-containing versions of that window with a long-context model, and treat the score difference as a first-pass estimate of regulatory impact, as in the sketch below.
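Here is a hedged sketch of that ref-versus-alt scoring step, assuming a HuggingFace-style autoregressive DNA language model; the model/tokenizer interface is an assumption for illustration, not a specific released API:

```python
import torch

def variant_delta_score(model, tokenizer, ref_seq: str, alt_seq: str) -> float:
    """Log-likelihood difference between alternate and reference windows.
    Negative values mean the variant makes the sequence less 'genome-like'.
    Assumes a causal LM whose forward pass returns .loss given labels."""
    scores = {}
    for name, seq in (("ref", ref_seq), ("alt", alt_seq)):
        ids = tokenizer(seq, return_tensors="pt")["input_ids"]
        with torch.no_grad():
            out = model(input_ids=ids, labels=ids)   # causal LM: loss = mean NLL
        scores[name] = -out.loss.item() * ids.shape[1]  # total log-likelihood
    return scores["alt"] - scores["ref"]
```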
Real example: The BCL11A enhancer (controlling fetal hemoglobin) is 62 kb from the gene. Variants in this enhancer affect hemoglobin levels in sickle cell disease. Long-context models can directly model this regulatory relationship.
Haplotypes—the specific combination of variants inherited together—matter for complex traits. Analyzing haplotype structure requires looking at variants across hundreds of kilobases.
Using HyenaDNA for haplotype analysis: encode each phased haplotype as a single long sequence carrying all of its variants, then compare haplotypes by model likelihood or embedding distance, capturing combined effects that variant-by-variant scoring would miss.
Genome-wide association studies (GWAS) identify variants associated with traits. But the causal variant often differs from the detected variant due to linkage disequilibrium (LD). Long-context models can analyze entire associated loci.
Approach: feed the entire associated locus into the model, score each candidate variant in its full sequence context, and rank candidates by predicted functional impact rather than by LD statistics alone (see the sketch below).
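For example, reusing the hypothetical variant_delta_score helper from the enhancer-promoter section (and assuming model and tokenizer are already loaded), candidates at one locus could be ranked like this:

```python
# Hypothetical inputs: (variant_id, ref_window, alt_window) for one GWAS locus
candidates = [
    ("rs0001", "ACGT" * 25_000, "ACGA" + "ACGT" * 24_999),
    ("rs0002", "ACGT" * 25_000, "ACTT" + "ACGT" * 24_999),
]
ranked = sorted(
    candidates,
    key=lambda v: abs(variant_delta_score(model, tokenizer, v[1], v[2])),
    reverse=True,  # largest predicted functional impact first
)
```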
Example: The FTO locus associated with obesity spans 500 kb. The causal mechanism involves an enhancer regulating IRX3 and IRX5, not FTO itself, a distant regulatory relationship that only became clear when the locus was analyzed as a whole rather than gene by gene.
Background: Dr. Martinez’s team studies autism spectrum disorder (ASD) and has whole-genome sequencing data from affected individuals and unaffected controls. Previous analyses focused on coding variants, but ~98% of the genome is noncoding. Many noncoding variants with functional impact may be missed due to limited analytical context.
Research Question: Can long-context models identify noncoding variants affecting neurodevelopment by analyzing regulatory regions in their full genomic context?
Approach:
Data preparation: collect noncoding variants from cases and controls, extracting for each variant a window wide enough (100 kb or more of flanking sequence) to include nearby genes and annotated regulatory elements.
Model application (using Caduceus): score the reference and variant versions of each window and record the change in model score, relying on the model’s reverse-complement equivariance so both strands are covered.
Prioritization: rank variants by predicted effect size, then intersect the top candidates with brain-expressed genes and neurodevelopmental annotations.
Illustrative results: in this teaching scenario, the top-ranked candidates cluster in enhancers of synaptic genes rather than scattering randomly across the genome.
Example validation plan: the top 10 variants are selected for CRISPR-based validation in neuronal cell models, editing each variant into its native locus and measuring expression of its putative target gene.
Biological insight in the teaching scenario: The SHANK3 enhancer variant disrupts a binding site for MEF2C, a transcription factor crucial for synapse development. The variant reduces SHANK3 expression by 40% in neurons. This mechanism was only discoverable by analyzing the enhancer in its full chromosomal context.
Impact: the scenario illustrates how scoring variants in their full chromosomal context can surface noncoding candidates that variant-by-variant annotation pipelines would miss entirely.
Reference note: This case study is a hypothetical synthesis based on principles from the long-context modeling papers cited at the end of this chapter; it does not describe a published ASD study.
Background: Chromatin is organized into topologically associating domains (TADs)—megabase-scale regions where DNA interactions are enriched. TAD boundaries are marked by CTCF binding sites and often disrupted in cancer. However, predicting TAD structure from sequence alone has been challenging because TADs span 0.5-2 Mb.
Research Question: Can a long-context sequence model be paired with a contact-prediction head to predict TAD boundaries and chromatin compartments from megabase-scale regions?
Approach:
Training data: matched pairs of reference genome sequence and Hi-C contact maps over megabase-scale windows, with TAD boundary calls and CTCF ChIP-seq peaks as auxiliary labels.
Model training: a long-context sequence encoder (for example, a HyenaDNA- or Caduceus-style backbone) feeding a contact-prediction head that outputs a binned 2D contact map; a sketch of such a head follows this list.
Validation: evaluation on held-out chromosomes, comparing predicted boundaries against Hi-C-derived boundaries and checking that predicted boundaries are enriched for CTCF binding sites.
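As a concrete illustration of the contact-prediction head mentioned above, here is a hedged PyTorch sketch that turns per-position embeddings into a symmetric binned contact map (layer sizes and the pooling scheme are our assumptions, loosely modeled on published sequence-to-contact-map designs):

```python
import torch
import torch.nn as nn

class ContactHead(nn.Module):
    """Map per-position embeddings (batch, L, d) to a binned contact map."""
    def __init__(self, d_model: int = 256, bin_size: int = 128):
        super().__init__()
        self.pool = nn.AvgPool1d(bin_size)  # average embeddings per genomic bin
        self.out = nn.Conv2d(d_model, 1, kernel_size=1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        hb = self.pool(h.transpose(1, 2)).transpose(1, 2)  # (batch, nbins, d)
        pair = hb.unsqueeze(2) + hb.unsqueeze(1)  # symmetric pairwise features
        return self.out(pair.permute(0, 3, 1, 2)).squeeze(1)  # (batch, nb, nb)

# Toy usage: 256-dim embeddings over 4,096 positions -> 32 x 32 contact logits
head = ContactHead()
logits = head(torch.randn(1, 4096, 256))
```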
Illustrative results: in this scenario, the model recovers most strong TAD boundaries on held-out chromosomes, with errors concentrated at weak or cell-type-specific boundaries.
Variant Effect Prediction: the trained model is then applied to structural variants, asking, for example, whether a deletion removes a boundary and fuses two adjacent TADs, or whether an inversion repositions a CTCF site.
Biological Insights: much of TAD boundary placement appears to be encoded in local sequence, with CTCF motif presence and orientation carrying substantial signal.
Clinical Relevance: boundary-disrupting structural variants can rewire enhancer-promoter contacts, a mechanism documented in developmental disorders and cancers, so sequence-based contact prediction offers a way to triage SVs of unknown significance.
Limitations: Hi-C training data is limited in resolution and cell-type coverage, and sequence alone cannot capture condition-specific chromatin states.
Reference note: This scenario is inspired by published approaches that predict chromatin contact maps directly from DNA sequence, combined with the long-context architectures cited at the end of this chapter.
Computational Challenges: pretraining at megabase context still requires substantial GPU memory and weeks of compute, and efficient FFT and scan kernels must be carefully engineered for each hardware platform.
Biological Limitations: these models see only sequence; epigenetic state, cell type, and environment are invisible to them, and models pretrained on the reference genome may transfer imperfectly across populations.
Interpretability: there are no attention maps to inspect, so explaining a prediction requires gradient-based attribution, occlusion, or probing, all of which are less mature for SSMs and long convolutions.
Validation Challenges: long-range predictions (for example, a specific enhancer-promoter interaction) are expensive to test, since assays such as Hi-C, CRISPRi tiling, and MPRAs have limited throughput at these scales.
1. Hybrid Architectures: combining multiple mechanisms, for example interleaving a few attention layers with SSM or convolution layers to pair local pairwise precision with long-range reach.
2. Multi-Modal Models: integrating sequence with other data such as chromatin accessibility, methylation, and expression, so predictions reflect cellular context rather than sequence alone.
3. Larger Context Windows: pushing toward chromosome-scale inputs, where whole gene neighborhoods, arrays of TADs, and large structural variants fit in one window.
4. Improved Pretraining: better training objectives, such as masked or span-corruption objectives and variant- or species-aware tasks, rather than next-token prediction alone.
5. Clinical Translation: making models clinically useful through calibrated uncertainty estimates, standardized variant-interpretation benchmarks, and prospective validation.
6. Cross-Species Models: learning from comparative genomics by pretraining across many genomes, letting conservation highlight functional sequence.
Transformer models face quadratic complexity (O(L²)) that limits them to analyzing sequences of a few thousand base pairs, missing important long-range genomic interactions.
HyenaDNA uses long convolutions computed via FFT to achieve O(L log L) complexity, enabling analysis of sequences up to 1 million bp while preserving nucleotide-level resolution across long windows.
Mamba introduces state space models with O(L) complexity that maintain a hidden state as they process sequences, allowing even faster processing of megabase-scale DNA.
Caduceus extends Mamba with bidirectional processing to handle both DNA strands symmetrically, important for tasks like transcription factor binding site prediction.
Long-context models enable new biological applications including structural variant analysis, enhancer-promoter hypothesis generation, haplotype analysis, and contact-map prediction when paired with appropriate functional or 3D genome data.
Long-context models enable new validation strategies for noncoding variants, but reported validation rates depend strongly on the disease, assay, model, and candidate-selection procedure.
Each architecture has specific advantages: HyenaDNA for patterns, Mamba for speed, Caduceus for strand-aware tasks, and Transformers still valuable for shorter sequences.
Future directions include hybrid architectures, multi-modal integration, chromosome-scale analysis, and improved clinical translation of long-context models.
| Term | Definition |
|---|---|
| Bidirectional processing | Analyzing DNA sequences in both forward and reverse directions simultaneously to capture biology of both strands. |
| Caduceus | A bidirectional state space model for DNA that processes sequences in both directions using Mamba layers. |
| Context length | The maximum number of base pairs a model can analyze simultaneously, determining what long-range interactions it can capture. |
| Fast Fourier Transform (FFT) | An algorithm that computes the discrete Fourier transform in O(L log L) time, enabling convolutions in O(L log L) instead of O(L²). |
| Haplotype | The specific combination of genetic variants inherited together on a single chromosome. |
| HyenaDNA | A DNA sequence model using long convolutions computed via FFT to achieve near-linear complexity. |
| Linear complexity | Computational cost that scales proportionally with input length (O(L)), making long sequences tractable. |
| Long convolution | A convolutional filter whose length is comparable to the input sequence (up to millions of positions), capable of capturing long-range dependencies. |
| Mamba | A state space model with selective parameters that processes sequences in O(L) time while maintaining long-range memory. |
| Quadratic complexity | Computational cost that scales with the square of input length (O(L²)), limiting Transformers to short sequences. |
| State space model (SSM) | A sequence modeling approach that maintains a hidden state evolving over positions, inspired by control theory. |
| Structural variant (SV) | Large genomic alterations including deletions, duplications, inversions, and translocations spanning 1kb to megabases. |
| Topologically associating domain (TAD) | A genomic region spanning 0.5-2 Mb where DNA interactions are enriched, bounded by CTCF sites. |
Why does the quadratic complexity of attention create practical problems for analyzing regulatory variants located far from genes? Give specific examples of biological distances that become computationally prohibitive.
Explain how HyenaDNA’s use of convolution with FFT achieves better computational complexity than attention. What is the tradeoff between these two approaches?
Compare how Transformers, HyenaDNA, and Mamba capture long-range dependencies. Which biological scenarios favor each architecture?
Why is bidirectional processing important for DNA sequence analysis? What biological features require seeing both strands?
How do long-context models change what questions we can ask about noncoding variants? What new types of analyses become possible?
A researcher wants to analyze a 300 kb deletion that spans three genes. Which model architecture would you recommend and why?
Explain why state space models can process sequences in linear time. What is the key difference from attention that enables this?
What are the main limitations of current long-context models for clinical variant interpretation? How might these be addressed in future work?
Ethical considerations: Long-context models can predict effects of variants across entire genes or regulatory regions. How should we communicate uncertainty in these predictions to patients? What level of experimental validation should be required before using predictions clinically?
Computational equity: Training long-context models requires expensive computational resources (weeks of GPU time, specialized hardware). How does this affect which research groups can develop these models? What strategies could make long-context analysis more accessible?
Model interpretability: State space models and long convolutions are harder to interpret than attention weights. For clinical use, how important is interpretability versus accuracy? Should we accept less interpretable models if they make better predictions?
Scaling limits: Current models analyze up to 1M bp. The human genome is 3.2 billion bp. What biological questions require even longer context (e.g., chromosome-scale or genome-scale)? Are there fundamental limits to how much context is useful?
Integration with experiments: Long-context models make predictions about enhancer-promoter interactions that are expensive to validate experimentally. How should we prioritize which predictions to validate? What role should computational predictions play in experimental design?
Poli et al. (2023). “Hyena Hierarchy: Towards Larger Convolutional Language Models.” ICML.
Gu & Dao (2023). “Mamba: Linear-Time Sequence Modeling with Selective State Spaces.” arXiv.
Schiff et al. (2024). “Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling.” arXiv.
Gu et al. (2022). “On the Parameterization and Initialization of Diagonal State Space Models.” NeurIPS.
Nguyen et al. (2023). “HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution.” NeurIPS.
Mamba Documentation: https://github.com/state-spaces/mamba
HyenaDNA GitHub: https://github.com/HazyResearch/hyena-dna
Long Context Models Survey: https://github.com/Strivin0311/long-llms-learning
In the next chapter, we’ll transition from DNA sequence models to single-cell omics. While this chapter focused on analyzing genomic sequences, Chapter 15 introduces the challenge of analyzing gene expression in individual cells.
Preview of Chapter 15: Introduction to Single-Cell Omics
Single-cell RNA sequencing (scRNA-seq) measures expression of ~20,000 genes in individual cells. A typical experiment generates data from 10,000-1,000,000 cells. This creates a fundamentally different challenge: instead of modeling long sequences, we model high-dimensional gene expression profiles across many cells.
You’ll learn: how single-cell expression data is structured, why its high dimensionality and sparsity call for different models than sequences do, and how AI methods represent and compare individual cells.
Prerequisites for Chapter 15: comfort with the embedding, pretraining, and fine-tuning concepts introduced in the earlier chapters of this book.
Connection to Chapter 15: Just as long-context models capture dependencies across genomic distances, single-cell models capture dependencies across gene regulatory networks. Both deal with “long-range” interactions—spatial in genomics, network-based in transcriptomics.
Ready to explore how cells differ at the molecular level? → [Continue to Chapter 15: Introduction to Single-Cell Omics]