Chapter 14: Next-Generation DNA Models

Start with a single nucleotide — an A at one position in the genome. Now zoom out to 100 bases: you see a transcription factor binding motif. Zoom to 1,000: an exon boundary with splice signals. Zoom to 10,000: a complete gene with its promoter. Zoom to 100,000: an entire regulatory domain with enhancers, insulators, and their target genes interacting across vast distances. Zoom to 1,000,000: a topologically associating domain where the 3D folding of chromatin determines which enhancers can reach which promoters. At every scale, new biology emerges. The genome is not a flat string of letters — it is a deeply hierarchical document, and understanding it requires reading at multiple resolutions simultaneously.

But until recently, most AI models could only see a few hundred to a few thousand bases at a time — like trying to understand a city by looking through a keyhole. The transformer models from the previous chapter are powerful, but they carry a fundamental computational burden: their attention mechanism has quadratic complexity. Double the sequence length and the computation quadruples. Increase it tenfold and the computation increases a hundredfold. At 1 million base pairs, the attention matrix alone would require terabytes of GPU memory. The biology demands scale. The architecture forbids it.

This collision between biological necessity and computational constraint has driven a new wave of architectural innovation. Researchers have asked: what if we could capture long-range dependencies without computing all pairwise attention? What if the architecture scaled linearly with sequence length instead of quadratically? The answer has come from an unlikely direction — state space models and convolution operators originally developed for signal processing, now repurposed to read DNA at scales transformers cannot reach.

This chapter is about models that finally open the door: HyenaDNA, Mamba, and Caduceus. They represent not just engineering improvements but a genuine expansion in what questions genomics AI can ask.


The Biological Challenge

The human genome contains regulatory elements that operate across vast genomic distances. Enhancers can regulate genes located hundreds of thousands of base pairs away. Topologically associating domains (TADs) span megabase-scale regions. Structural variants can affect expression of genes located far from the breakpoint. Understanding these long-range interactions requires analyzing DNA sequences at scales that were previously computationally prohibitive.

Standard Transformer models face a fundamental limitation: their attention mechanism has quadratic complexity—the computational cost scales with the square of the sequence length. This means:

  • 10,000 bp sequence: 100 million attention computations
  • 100,000 bp sequence: 10 billion attention computations (100× increase)
  • 1,000,000 bp sequence: 1 trillion attention computations (10,000× increase)

This quadratic scaling creates practical barriers:

  • Memory requirements: a 1 million bp sequence requires terabytes of GPU memory for a single float32 attention matrix (roughly 4 TB), before counting multiple heads and layers
  • Training time: Models become too slow to train on long sequences
  • Inference cost: Analyzing genome-wide variants becomes economically infeasible
  • Limited context: Models can only “see” a few thousand bases at once

Traditional solutions involve breaking long sequences into short chunks and losing long-range information. But what if we could design architectures that scale linearly with sequence length while maintaining the ability to capture long-range dependencies?

This chapter explores next-generation DNA models that overcome the quadratic complexity barrier: HyenaDNA (using convolution-based operators), Mamba (using state space models), and Caduceus (bidirectional Mamba for DNA). These models represent a fundamental shift in how we approach genomic sequence analysis.


Learning Objectives

After completing this chapter, you will be able to:

  • Explain why Transformer models have quadratic complexity and how this limits genomic analysis
  • Describe the key architectural innovations that enable linear-time sequence modeling
  • Compare the tradeoffs between attention-based and state space-based models for DNA
  • Understand how convolutional operators can approximate attention at lower computational cost
  • Apply long-context DNA models to analyze regulatory variants and structural variation
  • Evaluate when to use Transformer models versus next-generation architectures
  • Interpret results from models that analyze sequences at megabase scale

14.1 The Quadratic Complexity Problem

Understanding Attention Complexity

Let’s make the computational challenge concrete. In a Transformer, every position in the sequence attends to every other position. For a sequence of length L:

  • Number of attention pairs: L × L = L²
  • Each attention computation: Involves vector operations (dot products, softmax)
  • Total complexity: O(L²) for attention mechanism

Let’s calculate what this means for real genomic sequences:

Example: BRCA1 gene region (100,000 bp)

  • Attention pairs: 100,000² = 10,000,000,000 (10 billion)
  • Memory for attention matrix: ~400 GB (using float32)
  • This exceeds typical GPU memory (16-80 GB)

Example: Chromosome 21 (46 million bp)

  • Attention pairs: 46,000,000² ≈ 2 × 10¹⁵ (2 quadrillion)
  • Memory required: ~8 petabytes
  • Clearly impossible on current hardware
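
To make the arithmetic above reproducible, here is a minimal sketch in plain Python that computes the number of attention pairs and the memory for a single float32 attention matrix at different sequence lengths. Real Transformers multiply this by the number of attention heads and layers.

def attention_cost(seq_len, bytes_per_value=4):
    # One score per position pair; float32 = 4 bytes per score.
    pairs = seq_len ** 2
    memory_gb = pairs * bytes_per_value / 1e9
    return pairs, memory_gb

for L in (10_000, 100_000, 1_000_000, 46_000_000):
    pairs, gb = attention_cost(L)
    print(f"L = {L:>12,d}   pairs = {pairs:.1e}   matrix = {gb:,.1f} GB")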

Why Can’t We Just Use Short Sequences?

You might ask: why not analyze DNA in short chunks? This approach has several limitations:

  1. Lost long-range interactions: Enhancers regulating distant genes are missed
  2. Structural variant context: Large deletions or duplications span boundaries
  3. Haplotype information: Phasing requires analyzing linked variants across distances
  4. TAD structure: Topologically associating domains require chromosome-scale context
  5. Splicing patterns: Alternative splicing depends on distant regulatory signals

Consider a classic real example: the SHH gene is controlled by a limb enhancer (the ZRS) located nearly 1 million bp away, inside an intron of a neighboring gene. Point mutations in this enhancer cause limb malformations such as polydactyly, but analyzing the gene and the enhancer separately misses their functional relationship.

Previous Attempts at Solutions

Before next-generation models, researchers tried several approaches:

Sparse Attention (e.g., Longformer, BigBird)

  • Only attend to nearby positions and a few global positions
  • Problem: May miss important long-range interactions
  • Still has significant memory requirements

Sliding Windows

  • Analyze DNA in overlapping chunks
  • Problem: Arbitrary boundaries, no true long-range integration
  • Post-hoc merging of predictions is challenging

Hierarchical Models

  • First encode local regions, then attend over region summaries
  • Problem: Local encoding may lose important details
  • Still limited in maximum context

None of these approaches fundamentally solved the quadratic complexity problem while maintaining full access to long-range sequence information.


14.2 HyenaDNA: Convolution-Based Long-Range Modeling

The Core Innovation

Biological Analogy (HyenaDNA long context): Instead of reading in short 512 bp windows, the entire genomic locus (100 kb) can be read at once — distant enhancer-promoter relationships are not missed.

HyenaDNA replaces the attention mechanism with a convolutional operator that can be computed efficiently using the Fast Fourier Transform (FFT). The key insight: convolutions can capture long-range dependencies through their filter design, and FFT makes them computationally efficient.

The model uses what the authors call the Hyena operator, which combines:

  1. Data-dependent gating (like attention’s value mechanism)
  2. Long convolution filters (capturing long-range patterns)
  3. FFT-based computation (achieving linear complexity)

Complexity comparison:

  • Standard Attention: O(L²)
  • HyenaDNA: O(L log L)

For L = 1,000,000, this means:

  • Attention: ~1 trillion operations
  • HyenaDNA: ~20 million operations (50,000× fewer!)

Architecture Details

The Hyena operator works in three steps:

Step 1: Projection. The input sequence X is mapped to three parallel projections (similar to Q, K, V in attention):

  • Query projection: X_q
  • Key projection: X_k
  • Value projection: X_v

Step 2: Long convolution. Instead of attention scores, apply learned convolutional filters:

  • Filters can have length up to 1 million positions
  • Can in principle capture sequence patterns separated by hundreds of kilobases
  • Computed efficiently using FFT

Step 3: Gating. Combine the projections with element-wise operations:

  • Output = X_v ⊙ Conv(X_k) ⊙ X_q
  • ⊙ denotes element-wise multiplication
  • This provides data-dependent mixing like attention
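
A minimal numerical sketch of this three-step recipe, assuming only NumPy. The projections are stand-in random matrices and the filter is a single random long kernel; the real Hyena operator learns its filters implicitly and stacks several gate-and-convolve stages, so treat this as an illustration of the structure rather than the HyenaDNA implementation.

import numpy as np

rng = np.random.default_rng(0)
L, d = 1024, 16                       # sequence length, channel width
x = rng.standard_normal((L, d))       # embedded DNA sequence

# Step 1: three projections (stand-ins for learned linear layers)
W_q, W_k, W_v = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
x_q, x_k, x_v = x @ W_q, x @ W_k, x @ W_v

# Step 2: long convolution along the sequence, computed with the FFT.
# The filter is as long as the sequence, so every position can influence every other.
h = rng.standard_normal((L, d)) * 0.01           # stand-in long filter, one per channel
n = 2 * L                                        # zero-pad to avoid circular wrap-around
conv = np.fft.irfft(np.fft.rfft(x_k, n=n, axis=0) *
                    np.fft.rfft(h, n=n, axis=0), n=n, axis=0)[:L]

# Step 3: element-wise gating combines the projections, data-dependently
y = x_v * conv * x_q                             # Output = X_v ⊙ Conv(X_k) ⊙ X_q
print(y.shape)                                   # (1024, 16)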

Training HyenaDNA

HyenaDNA was pretrained on the human reference genome using:

  • Context length: up to 1 million tokens/base positions in the long-context setting
  • Training objective: nucleotide-level language modeling rather than BERT-style 6-mer masked-token prediction
  • Tokenization: single-nucleotide resolution, preserving SNP-level changes
  • Model sizes: multiple checkpoints, from small models to larger long-context variants

The key advantage: training on megabase-scale sequences allows the model to learn truly long-range genomic patterns.

What HyenaDNA Learned

Analysis of the trained model and downstream tasks suggests it can capture sequence patterns across much longer windows than standard attention models:

Long-range sequence context:

  • regulatory motifs separated by long distances
  • repetitive-element structure across large regions
  • sequence composition and motif organization at scales beyond typical transformer windows

Regulatory element interactions:

  • candidate enhancer-promoter pairs separated by long genomic distances
  • CTCF binding site relationships
  • Locus control regions and gene clusters

Repetitive element patterns:

  • Long interspersed nuclear elements (LINEs)
  • Short interspersed nuclear elements (SINEs)
  • Tandem repeat organization

[Optional: The Math]

Math Box: FFT and Convolution

Why FFT Makes Convolution Fast

A convolution between sequence X (length L) and filter H (length K) requires:

  • Direct computation: O(L × K) operations
  • FFT-based computation: O(L log L + K log K)

For long filters (K ≈ L), the FFT approach is dramatically faster.

The Convolution Theorem: Convolution in time domain = Multiplication in frequency domain

Conv(X, H) = IFFT(FFT(X) ⊙ FFT(H))

Where:

  • FFT: Fast Fourier Transform (O(L log L))
  • ⊙: Element-wise multiplication (O(L))
  • IFFT: Inverse FFT (O(L log L))

Total complexity: O(L log L) instead of O(L²)

For genomic sequences:

  • L = 1,000,000 bp
  • Direct convolution: 1 trillion operations
  • FFT convolution: 20 million operations

Biological interpretation: This is like analyzing all possible enhancer-promoter pairs simultaneously, but at a fraction of the computational cost of checking each pair individually.
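
A quick numerical check of the convolution theorem, assuming nothing beyond NumPy: the FFT route (with zero-padding so the circular convolution matches the linear one) reproduces the direct computation.

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(4096)        # "sequence"
h = rng.standard_normal(4096)        # long filter, as long as the sequence

direct = np.convolve(x, h)           # direct linear convolution: O(L × K)

n = len(x) + len(h) - 1              # zero-pad, multiply spectra, invert: O(L log L)
fft_based = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

print(np.allclose(direct, fft_based))   # True, up to floating-point error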


14.3 State Space Models: Mamba and S4

Introduction to State Space Models

State space models (SSMs) represent a different approach to sequence modeling, inspired by control theory. Instead of attention or convolution, they maintain a hidden state that evolves as it processes the sequence.

Key concept: At each position, the model:

  1. Updates its hidden state based on previous state and current input
  2. Produces an output based on the hidden state
  3. The state can “remember” information from distant positions

Think of it like a cell integrating signals over time:

  • The cell’s current state depends on all previous signals
  • But it doesn’t need to store every past signal explicitly
  • It maintains a compressed representation of history

The S4 Model

The Structured State Space (S4) model introduced efficient parameterization of state space models. The core equations:

h(t+1) = A × h(t) + B × x(t)    # State update
y(t) = C × h(t)                  # Output

Where:

  • h(t): Hidden state at position t
  • x(t): Input at position t
  • y(t): Output at position t
  • A, B, C: Learned parameter matrices

Key innovation: S4 uses structured matrices (HiPPO initialization) that make the model stable for very long sequences.

Complexity: O(L) for sequence length L
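
The recurrence above takes only a few lines of NumPy. This is a deliberately tiny, fixed-parameter state space layer (one input channel, stand-in matrices, no HiPPO initialization or learned discretization), meant to show why the cost is linear: each position touches the fixed-size hidden state once, and no L × L matrix is ever formed.

import numpy as np

rng = np.random.default_rng(2)
L, n = 1000, 8                        # sequence length, state dimension
x = rng.standard_normal(L)            # one scalar input per position

A = np.eye(n) * 0.95                  # stand-in stable state matrix (S4 uses HiPPO here)
B = rng.standard_normal(n) * 0.1
C = rng.standard_normal(n) * 0.1

h = np.zeros(n)                       # hidden state: a fixed-size memory of the past
y = np.empty(L)
for t in range(L):                    # a single pass over the sequence: O(L)
    h = A @ h + B * x[t]              # state update
    y[t] = C @ h                      # output read-out
print(y[:5])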

Mamba: Selective State Space Models

Biological Analogy (Mamba / state space models): Like simultaneously holding short-term and long-term memory while reading a genome — processed efficiently without the computational cost of attention.

Mamba improves upon S4 by making the state space parameters data-dependent. Instead of fixed A, B, C matrices, Mamba computes them based on the input:

B(t) = Linear_B(x(t))    # Input-dependent
C(t) = Linear_C(x(t))    # Input-dependent  
Δ(t) = Softplus(Linear_Δ(x(t)))  # Input-dependent discretization

Why this matters for DNA:

  • Model can selectively remember important features (like TFBS)
  • Can ignore repetitive or uninformative regions
  • Adapts its processing to sequence content

Think of it like a researcher scanning a chromosome:

  • Slow down and pay attention at regulatory regions
  • Quickly skip through repetitive elements
  • Adjust memory capacity based on information density
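
Building on the fixed-parameter scan from the previous section, here is a minimal illustration of the selective idea: B, C, and the step size Δ are now computed from the input at each position, so the layer can decide how strongly to write each base into its state and how quickly to forget it. All projections are random stand-ins; the real Mamba layer adds channel mixing, learned discretization, and a hardware-aware parallel scan.

import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

rng = np.random.default_rng(3)
L, n = 1000, 8
x = rng.standard_normal(L)

A = -np.ones(n)                              # diagonal, negative: decaying memory
W_B, W_C, W_d = [rng.standard_normal(n) * 0.1 for _ in range(3)]

h = np.zeros(n)
y = np.empty(L)
for t in range(L):
    B_t = W_B * x[t]                         # input-dependent write strength
    C_t = W_C * x[t]                         # input-dependent read-out
    delta = softplus(W_d * x[t])             # input-dependent step size:
    A_bar = np.exp(delta * A)                #   small delta ~ skim past, large delta ~ pay attention
    h = A_bar * h + delta * B_t * x[t]       # selective state update
    y[t] = C_t @ h
print(y[:5])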

Mamba’s Computational Advantages

Mamba achieves linear complexity while maintaining long-range capability:

  1. Memory efficiency: No attention matrix needed
  2. Fast inference: Single sequential pass through sequence
  3. Long context: Can process 1M+ bp sequences
  4. Parallel training: Uses efficient parallel scan algorithms

Compared to Transformers (figures depend on model size, context length, and hardware; the numbers below are rough, reported orders of magnitude):

  • Speed: roughly 5× higher inference throughput
  • Memory: substantially less GPU memory, since no attention matrix is ever materialized
  • Context: practical context lengths one to two orders of magnitude longer

14.4 Caduceus: Bidirectional Mamba for DNA

The Directionality Problem

Both strands of DNA are biologically meaningful:

  • Forward strand: read 5’ to 3’ along the reference coordinate
  • Reverse strand: runs 3’ to 5’ along that same coordinate; reading it 5’ to 3’ gives the reverse complement

Transcription factor binding sites can appear on either strand. Genes can be encoded on either strand. Regulatory elements work regardless of orientation.

Problem: Standard Mamba processes sequences in one direction (left to right). This creates an asymmetry that doesn’t reflect biological reality.

Caduceus Architecture

Caduceus solves this by using bidirectional Mamba layers:

Approach 1: RC-Augmentation

  • Process sequence in forward direction
  • Also process reverse complement
  • Average or concatenate the two representations

Approach 2: BiMamba Layers

  • Simultaneously process in both directions
  • Mix information from both directions at each layer
  • More parameter-efficient than separate models

The BiMamba block:

# Schematic BiMamba block (pseudocode, not the actual Caduceus API).
# The reverse pass flips the sequence, runs a second Mamba layer, and flips
# the result back into forward coordinates before mixing.

h_fwd = mamba_fwd(x)                  # forward-direction pass
h_rev = flip(mamba_rev(flip(x)))      # reverse-direction pass

output = h_fwd + h_rev                # or a learnable combination
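
Approach 1 (RC-augmentation) needs nothing model-specific beyond a reverse-complement helper. A minimal sketch in Python; score_fn is a hypothetical stand-in for whatever per-sequence prediction a model exposes:

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def rc_augmented_score(seq, score_fn):
    # score_fn: hypothetical model forward pass; averaging the two orientations
    # enforces strand symmetry after the fact, without changing the model.
    return 0.5 * (score_fn(seq) + score_fn(reverse_complement(seq)))

print(reverse_complement("ACGTTAGC"))   # GCTAACGT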

Caduceus Training

Caduceus uses a family of bidirectional and reverse-complement-aware MambaDNA blocks. Reported checkpoints vary in size and context length, so the safest description is architectural rather than a single fixed recipe:

  • Architecture: bidirectional Mamba-style blocks with reverse-complement equivariance in the DNA-specific variants
  • Context length: designed for long-range DNA sequence modeling
  • Objective: sequence language modeling and downstream fine-tuning
  • Key design goal: treat forward and reverse-complement DNA consistently

Performance Comparison

On genomic benchmarks, Caduceus-style models are designed to test whether bidirectionality and reverse-complement equivariance improve long-range DNA modeling. Reported gains are benchmark-dependent, so use the points below as qualitative expectations:

Regulatory element prediction:

  • Can match or exceed transformer baselines on some tasks
  • Can use longer contexts with lower memory pressure than full attention
  • Is especially relevant when strand symmetry matters

Variant effect prediction:

  • Can be applied to promoter and regulatory variants
  • Can test whether distal sequence context improves variant-effect prediction
  • Still requires task-specific validation

Downstream fine-tuning:

  • Potentially cheaper than full-attention transformers for long windows
  • Useful for comparing architecture choices on the same genomic benchmark
  • Requires careful train/test splits to avoid sequence leakage

14.5 Comparing Next-Generation Models

HyenaDNA vs Mamba vs Caduceus

Let’s compare the three architectures:

Feature           | HyenaDNA                 | Mamba                        | Caduceus
Core mechanism    | Long convolution         | Selective state space model  | Bidirectional SSM
Complexity        | O(L log L)               | O(L)                         | O(L)
Max context       | Up to ~1M bp             | 1M+ bp                       | Long; checkpoint-dependent
Directionality    | Causal (as pretrained)   | Unidirectional               | Explicitly bidirectional, RC-aware
Training speed    | Fast                     | Very fast                    | Very fast
Inference speed   | Very fast                | Very fast                    | Very fast
Memory usage      | Low                      | Very low                     | Very low

When to Use Each Model

HyenaDNA:

  • Best for: Analyzing repeated patterns across long distances
  • Advantages: FFT efficiency, interpretable filters
  • Use cases: Repetitive element analysis, chromatin domains

Mamba:

  • Best for: Fast inference on very long sequences
  • Advantages: Truly linear complexity, lowest memory
  • Use cases: Genome-wide scanning, real-time analysis

Caduceus:

  • Best for: Tasks requiring strand symmetry
  • Advantages: Explicit bidirectionality, strong performance
  • Use cases: TFBS prediction, splicing analysis, general genomics

Still use Transformers when:

  • Context length < 10,000 bp (quadratic complexity manageable)
  • Need interpretable attention weights
  • Task benefits from explicit pairwise comparisons
  • Have existing Transformer infrastructure

14.6 Long-Context Applications

Application 1: Structural Variant Analysis

Structural variants (SVs) like large deletions, duplications, and inversions affect genomic regions spanning thousands to millions of base pairs. Short-read sequencing struggles to detect SVs, and short-context models cannot analyze their full sequence context.

Case example: Analyzing a 500 kb deletion

Using Caduceus with 1M bp context:

  1. Encode the sequence containing the deletion plus its flanking regions (a minimal scoring sketch follows at the end of this application)
  2. Model captures:
    • Disrupted TAD boundaries
    • Lost enhancer-promoter contacts
    • Affected gene expression patterns
    • Compensatory regulatory changes

This enables:

  • Predicting phenotypic impact of SVs
  • Identifying pathogenic deletions
  • Understanding dosage sensitivity
  • Prioritizing SVs for experimental validation
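
As a sketch of how the encode-and-compare step above could look in code: build the reference window and the deletion-carrying window, then compare whatever quantity the model exposes. The embed callable below is a hypothetical stand-in, not the Caduceus API, and the cosine-distance score is just one simple choice of effect size.

import numpy as np

def apply_deletion(ref_seq: str, del_start: int, del_end: int) -> str:
    # Remove the half-open interval [del_start, del_end) from the reference window.
    return ref_seq[:del_start] + ref_seq[del_end:]

def sv_effect_score(ref_seq, del_start, del_end, embed):
    # embed: hypothetical callable mapping a sequence to a fixed-length vector.
    alt_seq = apply_deletion(ref_seq, del_start, del_end)
    e_ref, e_alt = embed(ref_seq), embed(alt_seq)
    cos = np.dot(e_ref, e_alt) / (np.linalg.norm(e_ref) * np.linalg.norm(e_alt))
    return 1.0 - cos                  # larger = reference and variant diverge more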

Application 2: Enhancer-Promoter Prediction

Many regulatory variants lie in enhancers located 100-500 kb from their target genes. Long-context models can analyze enhancer and promoter simultaneously.

Example workflow:

  1. Input: 200 kb region containing enhancer and gene
  2. Model identifies:
    • Enhancer boundaries
    • Promoter location
    • CTCF insulator positions
    • Chromatin looping probability
  3. Variant effect prediction:
    • Test variant in enhancer
    • Predict change in gene expression
    • Account for 3D genomic context
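
The same compare-reference-to-alternate pattern applies to a single-nucleotide variant inside an enhancer: edit one base in the long window and take the difference in whatever the model predicts for the target gene. predict_expression below is a hypothetical stand-in for a fine-tuned long-context model's output.

def apply_snv(window: str, offset: int, alt_base: str) -> str:
    # offset: 0-based position of the variant within the extracted window
    return window[:offset] + alt_base + window[offset + 1:]

def snv_delta(window, offset, alt_base, predict_expression):
    # predict_expression: hypothetical callable returning predicted expression
    # of the target gene given the full long-context window.
    ref_pred = predict_expression(window)
    alt_pred = predict_expression(apply_snv(window, offset, alt_base))
    return alt_pred - ref_pred        # positive = predicted up-regulation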

Real example: The BCL11A erythroid enhancer (which controls fetal hemoglobin levels) lies roughly 55-62 kb downstream of the transcription start site, within an intron of the gene itself. Variants in this enhancer affect fetal hemoglobin levels in sickle cell disease and beta-thalassemia. Long-context models can directly model this regulatory relationship.

Application 3: Haplotype Analysis

Haplotypes—the specific combination of variants inherited together—matter for complex traits. Analyzing haplotype structure requires looking at variants across 100s of kb.

Using HyenaDNA for haplotype analysis:

  1. Input: Personal genome sequence spanning 500 kb
  2. Model identifies:
    • Haplotype blocks (regions of linkage)
    • Recombination hotspots
    • Compound heterozygous combinations
  3. Applications:
    • Pharmacogenomics (drug response haplotypes)
    • Risk prediction (combined variant effects)
    • Evolutionary history

Application 4: Locus-Wide Association

Genome-wide association studies (GWAS) identify variants associated with traits. But the causal variant often differs from the detected variant due to linkage disequilibrium (LD). Long-context models can analyze entire associated loci.

Approach:

  1. Input: 1 Mb region around GWAS signal
  2. Model evaluates:
    • Every variant’s functional impact
    • Regulatory context
    • Gene targets
    • Epistatic interactions
  3. Output: Prioritized causal variant list

Example: The FTO locus associated with obesity spans hundreds of kilobases. The causal mechanism involves intronic FTO enhancers regulating IRX3 and IRX5, located hundreds of kilobases away, rather than FTO itself. Chromatin-conformation and eQTL studies uncovered this distant regulatory relationship; long-context sequence models offer a way to probe such loci directly from sequence.


Teaching Scenario: Analyzing Noncoding Variants in Neurodevelopmental Disorders

Background: Dr. Martinez’s team studies autism spectrum disorder (ASD) and has whole-genome sequencing data from affected individuals and unaffected controls. Previous analyses focused on coding variants, but ~98% of the genome is noncoding. Many noncoding variants with functional impact may be missed due to limited analytical context.

Research Question: Can long-context models identify noncoding variants affecting neurodevelopment by analyzing regulatory regions in their full genomic context?

Approach:

  1. Data preparation:

    • Extracted all rare noncoding variants (MAF < 0.1%)
    • Identified 125,000 candidate regulatory variants
    • For each variant, extracted 500 kb of genomic context centered on the variant (see the extraction sketch after this list)
  2. Model application (using Caduceus):

    • Analyzed each variant in 500 kb context window
    • Predicted impact on:
      • Nearby gene expression
      • Chromatin accessibility
      • Transcription factor binding
      • Long-range regulatory interactions
  3. Prioritization:

    • Ranked variants by predicted functional impact
    • Considered gene targets expressed in brain
    • Evaluated enrichment in ASD cases vs. controls
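
One way the context-extraction step could be implemented, assuming an indexed reference FASTA and the pysam library; the file path and coordinates below are placeholders:

import pysam

def variant_window(fasta_path, chrom, pos, flank=250_000):
    # pos is 1-based (as in a VCF); pysam.fetch uses 0-based, half-open coordinates.
    fa = pysam.FastaFile(fasta_path)
    start = max(0, pos - 1 - flank)
    end = min(fa.get_reference_length(chrom), pos - 1 + flank)
    return fa.fetch(chrom, start, end).upper()

# Hypothetical usage, assuming an indexed hg38 FASTA on disk:
# window = variant_window("hg38.fa", "chr22", 30_000_000)   # ~500 kb window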

Illustrative results:

  • Identified 47 high-confidence noncoding variants enriched in ASD cases
  • 23 variants located in enhancers >100 kb from target genes
  • Would have been missed by standard 5-10 kb context windows
  • Top hit: Enhancer 380 kb from SHANK3 gene

Example validation plan: Selected top 10 variants for CRISPR-based validation:

  • Created deletions of enhancer regions in neural progenitor cells
  • Measured target gene expression changes
  • 7/10 showed significant expression changes (70% validation rate)
  • Much higher than previous noncoding validation rates (20-30%)

Biological insight in the teaching scenario: The SHANK3 enhancer variant disrupts a binding site for MEF2C, a transcription factor crucial for synapse development. The variant reduces SHANK3 expression by 40% in neurons. This mechanism was only discoverable by analyzing the enhancer in its full chromosomal context.

Impact:

  • Long-context models enable noncoding variant interpretation
  • Identifying distant regulatory relationships requires >100 kb context
  • Provides new therapeutic targets (MEF2C pathway)
  • Demonstrates importance of analyzing full genomic context

Reference note: This case study is a hypothetical synthesis based on principles from:

  • Morrow et al. (2021) “Noncoding variants in neurodevelopmental disorders” (hypothetical synthesis)
  • Real SHANK3 biology from Leblond et al. (2014) Nature Genetics

Teaching Scenario: Predicting Chromatin Structure at Megabase Scale

Background: Chromatin is organized into topologically associating domains (TADs)—megabase-scale regions where DNA interactions are enriched. TAD boundaries are marked by CTCF binding sites and often disrupted in cancer. However, predicting TAD structure from sequence alone has been challenging because TADs span 0.5-2 Mb.

Research Question: Can a long-context sequence model be paired with a contact-prediction head to predict TAD boundaries and chromatin compartments from megabase-scale regions?

Approach:

  1. Training data:

    • Hi-C contact maps from 30 cell types (ENCODE)
    • Shows which genomic regions physically interact
    • Resolution: 10 kb bins across genome
  2. Model training:

    • Input: 1 Mb DNA sequence
    • Output: Predicted contact probability for all pairs
    • Architecture: long-context sequence encoder with a contact-prediction head (sketched after this list)
    • Loss: Mean squared error on contact frequencies
  3. Validation:

    • Tested on held-out cell types
    • Compared to experimental Hi-C
    • Measured TAD boundary prediction accuracy
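
One common way to realize the contact-prediction head mentioned in the architecture step: pool per-base embeddings into fixed-size bins matching the Hi-C resolution, build a feature for every pair of bins, and predict a symmetric contact map. The sketch below assumes PyTorch and uses stand-in dimensions (a 100 kb toy window rather than 1 Mb); it shows the head only, not the long-context encoder or the training loop.

import torch
import torch.nn as nn

class ContactHead(nn.Module):
    # Maps per-bin embeddings (batch, n_bins, d) to a symmetric map (batch, n_bins, n_bins).
    def __init__(self, d_model: int, d_hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 1)
        )

    def forward(self, bins: torch.Tensor) -> torch.Tensor:
        B, n, d = bins.shape
        left = bins.unsqueeze(2).expand(B, n, n, d)    # embedding of bin i
        right = bins.unsqueeze(1).expand(B, n, n, d)   # embedding of bin j
        scores = self.mlp(torch.cat([left, right], dim=-1)).squeeze(-1)
        return 0.5 * (scores + scores.transpose(1, 2)) # enforce symmetry

# Toy example: 100 kb of per-base embeddings pooled into ten 10 kb bins.
per_base = torch.randn(1, 100_000, 64)                 # would come from the encoder
bins = per_base.view(1, 10, 10_000, 64).mean(dim=2)    # mean-pool each 10 kb bin
print(ContactHead(64)(bins).shape)                     # torch.Size([1, 10, 10])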

Illustrative results:

  • Achieved 0.82 correlation with experimental Hi-C
  • Correctly identified 76% of TAD boundaries
  • Predicted tissue-specific chromatin organization
  • Identified CTCF sites critical for TAD formation

Variant Effect Prediction: Applied model to analyze structural variants:

  • 500 kb deletion removes TAD boundary → fusion of adjacent TADs
  • Predicts abnormal enhancer-promoter contacts
  • Validated in cancer genomes with known rearrangements

Biological Insights:

  1. TAD boundaries enriched for:
    • Convergent CTCF motifs
    • Housekeeping genes
    • GC-rich sequences
  2. Cell-type-specific TADs driven by:
    • Tissue-specific enhancers
    • Variable CTCF binding strength
    • Cohesin recruitment sites

Clinical Relevance:

  • Predicted impact of structural variants on 3D genome
  • Identified variants disrupting TAD boundaries in developmental disorders
  • Enables interpretation of noncoding SVs

Limitations:

  • Requires training on expensive Hi-C data
  • Cell-type specificity not fully captured
  • Some long-range interactions still missed

Reference note: This scenario is inspired by approaches from:

  • Schwessinger et al. (2020) Nature Methods (DeepC: predicting chromatin interactions from megabase-scale sequence)
  • Zhou (2022) Nature Genetics (Orca: sequence-based prediction of 3D genome architecture)

14.7 Limitations and Future Directions

Current Limitations

Computational Challenges:

  • Training still requires significant GPU resources
  • 1M bp sequences need 40-80 GB GPU memory
  • Pretraining on full human genome takes weeks
  • Limited availability of trained models

Biological Limitations:

  • Cell-type specificity requires additional inputs
  • 3D genome structure not fully encoded in sequence
  • Epigenetic modifications not captured
  • Environmental context missing

Interpretability:

  • Harder to interpret than attention weights
  • Convolution filters less intuitive than attention
  • State space dynamics difficult to visualize
  • Black-box predictions for clinical use

Validation Challenges:

  • Limited experimental data for 100+ kb contexts
  • Difficult to validate predicted long-range interactions
  • Unclear how to benchmark megabase-scale predictions
  • Few datasets with sufficient genomic context

Future Research Directions

1. Hybrid Architectures. Combining multiple mechanisms:

  • Local attention + global state space
  • Convolution for patterns + attention for specific interactions
  • Best of both worlds

2. Multi-Modal Models. Integrating sequence with other data:

  • DNA sequence + chromatin accessibility
  • Sequence + evolutionary conservation
  • Sequence + 3D genome structure
  • Sequence + gene expression

3. Larger Context Windows. Pushing toward chromosome scale:

  • Current: 1M bp (0.03% of genome)
  • Goal: ~50M bp (an entire small chromosome)
  • Would enable full-locus analysis
  • Requires further efficiency improvements

4. Improved Pretraining. Better training objectives:

  • Multi-task learning (expression + chromatin + conservation)
  • Contrastive learning across species
  • Self-supervised from single-cell data
  • Incorporating structure prediction

5. Clinical Translation. Making models clinically useful:

  • Calibrated uncertainty estimates
  • Interpretable predictions
  • Integration with electronic health records
  • Real-time variant interpretation

6. Cross-Species Models. Learning from comparative genomics:

  • Pretrain on 100+ vertebrate genomes
  • Learn conserved regulatory logic
  • Identify species-specific elements
  • Evolutionary constraint prediction

Summary

Key Takeaways

  • Transformer models face quadratic complexity (O(L²)) that limits them to analyzing sequences of a few thousand base pairs, missing important long-range genomic interactions.

  • HyenaDNA uses long convolutions computed via FFT to achieve O(L log L) complexity, enabling analysis of sequences up to 1 million bp while preserving nucleotide-level resolution across long windows.

  • Mamba introduces state space models with O(L) complexity that maintain a hidden state as they process sequences, allowing even faster processing of megabase-scale DNA.

  • Caduceus extends Mamba with bidirectional processing to handle both DNA strands symmetrically, important for tasks like transcription factor binding site prediction.

  • Long-context models enable new biological applications including structural variant analysis, enhancer-promoter hypothesis generation, haplotype analysis, and contact-map prediction when paired with appropriate functional or 3D genome data.

  • Long-context models enable new validation strategies for noncoding variants, but reported validation rates depend strongly on the disease, assay, model, and candidate-selection procedure.

  • Each architecture has specific advantages: HyenaDNA for patterns, Mamba for speed, Caduceus for strand-aware tasks, and Transformers still valuable for shorter sequences.

  • Future directions include hybrid architectures, multi-modal integration, chromosome-scale analysis, and improved clinical translation of long-context models.


📖 Key Terms
Bidirectional processing: Analyzing DNA sequences in both the forward and reverse-complement directions simultaneously, to capture the biology of both strands.
Caduceus: A bidirectional state space model for DNA that processes sequences in both directions using Mamba-style layers.
Context length: The maximum number of base pairs a model can analyze simultaneously, which determines what long-range interactions it can capture.
Fast Fourier Transform (FFT): An efficient algorithm that allows long convolutions to be computed in O(L log L) time instead of O(L²).
Haplotype: The specific combination of genetic variants inherited together on a single chromosome.
HyenaDNA: A DNA sequence model using long convolutions computed via FFT to achieve near-linear complexity.
Linear complexity: Computational cost that scales proportionally with input length (O(L)), making long sequences tractable.
Long convolution: A convolutional filter whose length can approach the sequence length (up to millions of positions), capable of capturing long-range dependencies.
Mamba: A state space model with selective (input-dependent) parameters that processes sequences in O(L) time while maintaining long-range memory.
Quadratic complexity: Computational cost that scales with the square of input length (O(L²)), limiting standard Transformers to short sequences.
State space model (SSM): A sequence modeling approach, inspired by control theory, that maintains a hidden state evolving over positions.
Structural variant (SV): A large genomic alteration (deletion, duplication, inversion, or translocation), typically ranging from roughly 50 bp up to megabases.
Topologically associating domain (TAD): A genomic region, typically 0.5-2 Mb, within which DNA-DNA interactions are enriched, often bounded by CTCF sites.

Conceptual Questions

  1. Why does the quadratic complexity of attention create practical problems for analyzing regulatory variants located far from genes? Give specific examples of biological distances that become computationally prohibitive.

  2. Explain how HyenaDNA’s use of convolution with FFT achieves better computational complexity than attention. What is the tradeoff between these two approaches?

  3. Compare how Transformers, HyenaDNA, and Mamba capture long-range dependencies. Which biological scenarios favor each architecture?

  4. Why is bidirectional processing important for DNA sequence analysis? What biological features require seeing both strands?

  5. How do long-context models change what questions we can ask about noncoding variants? What new types of analyses become possible?

  6. A researcher wants to analyze a 300 kb deletion that spans three genes. Which model architecture would you recommend and why?

  7. Explain why state space models can process sequences in linear time. What is the key difference from attention that enables this?

  8. What are the main limitations of current long-context models for clinical variant interpretation? How might these be addressed in future work?


Discussion Questions

  1. Ethical considerations: Long-context models can predict effects of variants across entire genes or regulatory regions. How should we communicate uncertainty in these predictions to patients? What level of experimental validation should be required before using predictions clinically?

  2. Computational equity: Training long-context models requires expensive computational resources (weeks of GPU time, specialized hardware). How does this affect which research groups can develop these models? What strategies could make long-context analysis more accessible?

  3. Model interpretability: State space models and long convolutions are harder to interpret than attention weights. For clinical use, how important is interpretability versus accuracy? Should we accept less interpretable models if they make better predictions?

  4. Scaling limits: Current models analyze up to 1M bp. The human genome is 3.2 billion bp. What biological questions require even longer context (e.g., chromosome-scale or genome-scale)? Are there fundamental limits to how much context is useful?

  5. Integration with experiments: Long-context models make predictions about enhancer-promoter interactions that are expensive to validate experimentally. How should we prioritize which predictions to validate? What role should computational predictions play in experimental design?


Further Reading

Foundational Papers

  1. Poli et al. (2023) “Hyena Hierarchy: Towards Larger Convolutional Language Models” ICML

  2. Gu & Dao (2023) “Mamba: Linear-Time Sequence Modeling with Selective State Spaces” arXiv

  3. Schiff et al. (2024) “Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling” arXiv

Reviews and Perspectives

  1. Gu et al. (2023) “On the Parameterization and Initialization of Diagonal State Space Models” NeurIPS

    • Theory behind efficient state space models
    • S4 architecture and variants
  2. Nguyen et al. (2023) “HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution” NeurIPS

    • Applications to genomics
    • Benchmark results

Online Resources

  1. Mamba Documentation: https://github.com/state-spaces/mamba

    • Official implementation and examples
  2. HyenaDNA GitHub: https://github.com/HazyResearch/hyena-dna

    • Code, pretrained models, tutorials
  3. Long Context Models Survey: https://github.com/Strivin0311/long-llms-learning

    • Comprehensive list of long-context architectures

What’s Next?

In the next chapter, we’ll transition from DNA sequence models to single-cell omics. While this chapter focused on analyzing genomic sequences, Chapter 15 introduces the challenge of analyzing gene expression in individual cells.

Preview of Chapter 15: Introduction to Single-Cell Omics

Single-cell RNA sequencing (scRNA-seq) measures expression of ~20,000 genes in individual cells. A typical experiment generates data from 10,000-1,000,000 cells. This creates a fundamentally different challenge: instead of modeling long sequences, we model high-dimensional gene expression profiles across many cells.

You’ll learn:

  • Why bulk RNA-seq averages away important biology
  • How scRNA-seq overcomes this limitation
  • Computational challenges of sparse, high-dimensional data
  • Dimensionality reduction and cell type identification
  • How AI models learn gene expression patterns

Prerequisites for Chapter 15:

  • Understand neural network basics (Chapter 2)
  • Familiar with representation learning concepts
  • Basic knowledge of gene expression
  • Comfortable with high-dimensional data concepts

Connection to Chapter 15: Just as long-context models capture dependencies across genomic distances, single-cell models capture dependencies across gene regulatory networks. Both deal with “long-range” interactions—spatial in genomics, network-based in transcriptomics.

Ready to explore how cells differ at the molecular level? → [Continue to Chapter 15: Introduction to Single-Cell Omics]