What if you found a book written in a language no one had ever seen? No dictionary, no grammar guide, no Rosetta Stone. Just the raw text — millions of pages of it. Could you figure out what it means? Surprisingly, yes — if you have enough text. By noticing which “words” appear near which other “words,” you could discover grammar, syntax, even meaning. The word that always appears between a subject and an object is probably a verb. Words that are interchangeable in context are probably synonyms. This is essentially how modern NLP models cracked human language before anyone handed them a grammar textbook.
Now consider DNA. We have 3 billion letters of text per genome, across thousands of species — but no complete dictionary telling us what every sequence means. We know some words: TATAAA is a core promoter motif, AATAAA is a polyadenylation signal. But 98% of the genome is noncoding, and the regulatory logic written there — which sequences bind which transcription factors, which enhancers activate which genes, which variants disrupt which functions — remains largely unread. Traditional tools approach this with fixed rules and small windows. They can tell you if a 6-mer matches a known motif. They cannot tell you what a sequence means in context.
DNA language models take the approach of the linguist with an unknown text: read billions of sequences, learn the statistical patterns, and use those patterns to predict what each sequence does. No labels required during training. No predetermined grammar. Just the raw text of evolution, accumulated over billions of years of selection, and a model large enough to find the signal within it.
The practical stakes are immediate. A genome-wide association study (GWAS) of autism spectrum disorder yields 15,000 regulatory variants — each a single nucleotide change in noncoding DNA, each potentially disrupting gene regulation in neurons. Traditional conservation scores narrow the list to 3,000. That’s still $1.5 million in experimental validation at $500 per variant. A DNA language model that genuinely understands genomic context can prioritize that list further — not by pattern-matching to known motifs, but by modeling what those sequences are actually saying.
Understanding regulatory variants requires knowing the context in which they appear. A “CAG” sequence means different things in different genomic neighborhoods:
Traditional tools treat each position independently or use fixed-size windows. They can’t capture long-range dependencies, tissue-specific effects, or the combinatorial logic of multiple regulatory elements working together.
The scale of the problem is staggering:
Experimental validation can’t scale to this level. MPRA (Massively Parallel Reporter Assays) can test thousands of sequences, but that’s a tiny fraction of possible variants. And experiments often miss tissue-specific or developmental-stage-specific effects.
What we need is a model that can:
This is precisely what language models were designed to do with text. Can we apply the same principles to DNA?
After completing this chapter, you will be able to:
When we call DNA a “language,” we’re not just making a poetic comparison. There are deep structural similarities between human languages and genomic sequences.
Consider these parallels:
In English:
In DNA:
For example, if you see “The cat sat on the ___,” you can predict “mat” based on context and grammar. Similarly, if you see a promoter sequence with a TATA box, you can predict nearby nucleotides that form the transcription start site.
Early attempts to apply language models to DNA treated each nucleotide as a “letter”:
A C G T A A C G G T A C...
But this misses crucial biological structure. Regulatory function emerges from groups of nucleotides:
It’s like trying to understand English by analyzing individual letters without recognizing words. You’d miss that “c-a-t” together means something different from “c-a-r.”
Biological Analogy (DNABERT k-mer tokenization): Like reading DNA as codons, but instead of 3-letter codons, overlapping 6-letter “words” are used to capture meaningful patterns.
K-mers are sequences of k consecutive nucleotides. Instead of reading DNA letter by letter, we read it in chunks:
For k=3 (3-mers or “codons” in the broad sense):
DNA: A C G T A A C G G T
3-mers: ACG CGT GTA TAA AAC ACG CGG GGT
For k=6 (6-mers):
DNA: A C G T A A C G G T A C
6-mers: ACGTAA CGTAAC GTAACG TAACGG AACGGT ACGGTA CGGTAC
Notice how k-mers overlap—each position starts a new k-mer. This sliding window captures local context.
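To make this concrete, here is a minimal sketch of k-mer tokenization in Python (the function name `kmer_tokens` is ours for illustration, not from any library):

```python
def kmer_tokens(sequence: str, k: int = 6) -> list[str]:
    """Slide a window of width k over the sequence, one nucleotide at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kmer_tokens("ACGTAACGGTAC", k=6))
# ['ACGTAA', 'CGTAAC', 'GTAACG', 'TAACGG', 'AACGGT', 'ACGGTA', 'CGGTAC']
```

Note that a sequence of length n yields n − k + 1 overlapping tokens, matching the 6-mer example above.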
Why k-mers work for DNA:
The choice of k is a trade-off:
| K value | Vocabulary size | Biological relevance | Computational cost |
|---|---|---|---|
| 3 | 64 | Too short for most motifs | Very low |
| 6 | 4,096 | Captures many TF sites | Moderate |
| 9 | 262,144 | Rare k-mers, sparse data | High |
| 12 | 16 million | Most k-mers never seen | Prohibitive |
DNABERT was released in four variants, trained separately with k=3, 4, 5, and 6, capturing patterns at multiple scales (like understanding text through letters, syllables, and words).
DNABERT (2021) was the first major application of the BERT architecture to DNA sequences. Let’s understand how it works.
Recall from Chapter 10 that BERT uses:
DNABERT adapts this to genomic sequences:
Input DNA:      A C G T A A C G G T
3-mer tokens:   ACG CGT GTA TAA AAC ACG CGG GGT
Masked tokens:  ACG [MASK] [MASK] [MASK] AAC ACG CGG GGT
↓
Transformer processes all positions
↓
Predict masked tokens: "CGT GTA TAA"
Because adjacent k-mers overlap, DNABERT masks contiguous spans of k-mers; masking a single k-mer would let the model simply read the hidden nucleotides off its unmasked neighbors.
DNABERT is pre-trained on the entire human reference genome (hg38)—all 3.2 billion nucleotides. The training process:
Step 1: Convert genome to k-mers
Genome region: ACGTAACGGT...
6-mers: ACGTAA, CGTAAC, GTAACG, ...
Step 2: Random masking (15% of k-mers)
Original: ACGTAA CGTAAC GTAACG TAACGG AACGGT
Masked: ACGTAA [MASK] GTAACG TAACGG [MASK]
Step 3: Model predicts masked k-mers
The model must reconstruct the original sequence using bidirectional context.
Step 4: Update weights to minimize prediction error
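As a toy illustration of Steps 2 and 3, here is a minimal masking function. Real DNABERT masks contiguous spans of k-mers (see above); this sketch masks tokens independently for simplicity:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Hide ~15% of k-mer tokens; the model must predict the hidden ones."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            masked.append(mask_token)
            targets.append(tok)    # prediction target at this position
        else:
            masked.append(tok)
            targets.append(None)   # no loss computed at unmasked positions
    return masked, targets

kmers = ["ACGTAA", "CGTAAC", "GTAACG", "TAACGG", "AACGGT"]
masked, targets = mask_tokens(kmers)
```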
After seeing billions of examples, DNABERT learns:
After pre-training, DNABERT’s internal representations (embeddings) capture biological information without being explicitly told:
Experiment: Ji et al. (2021) analyzed DNABERT embeddings:
This is remarkable: DNABERT was never told “this is a promoter” or “this is a splice site.” It discovered these patterns just by learning to predict masked nucleotides.
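As a sketch of how such an embedding analysis works in practice, the snippet below projects sequence embeddings to two dimensions and compares group centroids. The synthetic arrays stand in for real mean-pooled DNABERT embeddings and promoter labels, which you would compute from actual sequences:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-ins for mean-pooled sequence embeddings:
# "promoters" drawn around one centroid, "background" around another.
rng = np.random.default_rng(0)
promoters = rng.normal(loc=0.5, size=(100, 768))
background = rng.normal(loc=-0.5, size=(100, 768))
emb = np.vstack([promoters, background])
labels = np.array([1] * 100 + [0] * 100)

# Project to 2D; with real embeddings, separation here suggests the
# model has learned promoter-like structure without ever seeing labels.
coords = PCA(n_components=2).fit_transform(emb)
print("promoter centroid: ", coords[labels == 1].mean(axis=0))
print("background centroid:", coords[labels == 0].mean(axis=0))
```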
After pre-training, you can fine-tune DNABERT for specific biological tasks:
Task 1: Promoter identification
Task 2: Transcription factor binding site prediction
Task 3: Variant effect prediction
The key advantage: Pre-training on the entire genome provides a strong foundation. Fine-tuning requires relatively little task-specific data.
DNABERT showed the potential of language models for genomics, but had limitations. Several next-generation models address these issues.
DNABERT-2 (2023) made several improvements:
1. Byte Pair Encoding (BPE) for tokenization
Instead of fixed k-mers, BPE learns optimal “words” from the data:
Common sequence: ACGTAA ACGTAA ACGTAA (appears often)
BPE learns: ACGTAA is a single token (not 6 separate)
Rare sequence: GGGGGG (appears rarely)
BPE keeps separate: GG GG GG (or G G G G G G)
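A toy sketch of the core BPE loop, under the assumption that we start from single-nucleotide tokens: repeatedly find the most frequent adjacent pair and merge it into a new token. Real tokenizers run thousands of merges over gigabases; this is the idea in miniature:

```python
from collections import Counter

def bpe_merge_step(corpus):
    """One BPE step: merge the most frequent adjacent token pair.

    corpus: list of token lists, e.g. [["A", "C", "G", "T"], ...]
    """
    pairs = Counter()
    for tokens in corpus:
        pairs.update(zip(tokens, tokens[1:]))
    if not pairs:
        return corpus, None
    (a, b), _ = pairs.most_common(1)[0]
    merged = []
    for tokens in corpus:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)   # frequent pair becomes one token
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        merged.append(out)
    return merged, a + b

corpus = [list("ACGTAAACGTAA"), list("GGGGG")]
for _ in range(4):
    corpus, new_token = bpe_merge_step(corpus)
    print(new_token, corpus)
```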
Benefits:
2. More efficient context use
DNABERT used fixed overlapping k-mers, so a 512-token input covered a fixed nucleotide span. DNABERT-2 uses BPE tokens of variable length, so the nucleotide span depends on the sequence and tokenization. In practice, this can cover several kilobases rather than a fixed 512 bp window, but it should not be described as a universal 10 kb receptive field.
This captures:
3. Multi-species pre-training
DNABERT-2 trains on genomes from multiple species:
This helps the model learn:
Performance improvements:
Nucleotide Transformer (2023) takes a different approach: scale up model size and training data.
Architecture:
Key insight: Larger models trained on more diverse data capture more nuanced patterns.
Novel feature: Cross-species embeddings
Because it trains on many species, Nucleotide Transformer can:
Example:
Human enhancer: ACGTAAGGCTAG...
Mouse ortholog: ACTTAAGGCCAG... (60% identity)
Zebrafish element: GCGTAAGGCTGC... (45% identity)
Nucleotide Transformer embeddings show these sequences are functionally similar
despite sequence divergence
LOGO (2024) addresses a fundamental limitation: previous models treat all genomic regions equally.
The problem:
LOGO’s solution: Multi-task pre-training
During pre-training, LOGO simultaneously learns:
Architecture:
Input sequence → LOGO encoder
↓
┌────────────┼────────────┐
↓ ↓ ↓
MLM head Region head Chromatin head
↓ ↓ ↓
Predict k-mer Promoter? H3K27ac?
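A minimal PyTorch sketch of this three-head design follows. The encoder, hidden sizes, and head names are illustrative choices for teaching, not the published LOGO architecture:

```python
import torch.nn as nn

class MultiTaskDNAModel(nn.Module):
    """Toy three-head model in the spirit of LOGO's multi-task setup."""

    def __init__(self, encoder, hidden=768, kmer_vocab=4096):
        super().__init__()
        self.encoder = encoder                        # shared transformer
        self.mlm_head = nn.Linear(hidden, kmer_vocab)    # predict masked k-mers
        self.region_head = nn.Linear(hidden, 2)          # promoter vs. not
        self.chromatin_head = nn.Linear(hidden, 1)       # e.g., H3K27ac signal

    def forward(self, x):
        h = self.encoder(x)               # [batch, seq_len, hidden]
        pooled = h.mean(dim=1)            # one vector per sequence
        return (self.mlm_head(h),         # per-token predictions
                self.region_head(pooled),
                self.chromatin_head(pooled))
```

Because all three heads share one encoder, gradients from the annotation tasks shape the same representations used for masked language modeling.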
Advantages:
GROVER (Genome Rules Obtained Via Extracted Representations, 2024) takes yet another approach: instead of starting from fixed k-mers, it learns a frequency-balanced vocabulary for the human genome using byte-pair encoding (BPE).
Many previous models had a fixed token window:
But biological context works at multiple scales:
GROVER trains a BERT-style model on human genome sequence using a BPE vocabulary selected by next-k-mer prediction. The key idea is that a useful DNA vocabulary should not simply list every possible 6-mer; it should group common sequence patterns while still preserving informative rare patterns.
What the paper reports:
What GROVER does not do in its core pretraining setup:
Example application: Variant effect in 3D context
Variant at position X in enhancer
↓
GROVER embedding of enhancer
↓
Compare to embeddings of potential target promoters
↓
Prioritize hypotheses about which regulatory grammar changed
To make this a 3D genome analysis, GROVER embeddings would need to be combined with external chromatin-contact or perturbation data.
[Optional: The Math]
Math Box: Attention Mechanisms in DNA Language Models
All modern DNA language models use attention mechanisms to weigh important context. Let’s break down how this works.
Self-Attention for Sequence Context
Given a sequence of k-mer embeddings, attention computes how much each position should “attend to” every other position.
Input: Sequence embeddings
Position:   1    2    3    4    5
K-mer:      ACGT CGTA GTAA TAAC AACG
Embedding:  e₁   e₂   e₃   e₄   e₅

For each position i, compute attention to every position j:

Step 1: Create Query, Key, Value matrices

Query₁ = Wq × e₁
Key₂   = Wk × e₂
Value₂ = Wv × e₂

Step 2: Compute attention scores

score₁,₂ = Query₁ · Key₂ / √d

where d is the dimension of the key vectors (in a 768-dimensional model with 12 heads, each head uses d = 768/12 = 64). Division by √d prevents the scores from growing too large.

Step 3: Apply softmax to get attention weights

attention₁,₂ = exp(score₁,₂) / Σⱼ exp(score₁,ⱼ)

This normalizes the attention weights across all positions (they sum to 1).

Step 4: Weighted sum of values

output₁ = Σⱼ attention₁,ⱼ × Valueⱼ

Biological interpretation:
- High attention between positions suggests the model is using information from both positions
- Attention patterns can generate hypotheses about regulatory interactions
- Attention alone does not prove a functional or physical interaction
Example: Splice site prediction
For a sequence near a splice donor site:
Position: ...EXON | GT | INTRON...

Attention pattern shows:
- GT dinucleotide attends to upstream exonic sequence
- GT attends to downstream intronic elements
- Less attention to distant positions

This matches biological reality: splice site recognition depends on nearby exonic/intronic context.
Multi-Head Attention
DNA language models use multiple attention heads (typically 12):
Head 1 might learn:  TF binding motifs
Head 2 might learn:  GC content patterns
Head 3 might learn:  Repeat structures
Head 12 might learn: Conservation patterns

Each head can specialize in different types of patterns. The model combines all heads to get a rich representation.
Mathematical formulation:
MultiHead(Q, K, V) = Concat(head₁, head₂, ..., headₕ) × Wₒ

where headᵢ = Attention(Q × Wᵢq, K × Wᵢk, V × Wᵢv)
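To make the math box concrete, here is a minimal single-head attention sketch in NumPy. The dimensions mirror the discussion above (768-dimensional embeddings, 64-dimensional head); the random matrices are stand-ins for learned weights:

```python
import numpy as np

def self_attention(E, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over k-mer embeddings.

    E: [seq_len, d_model] array, one embedding per k-mer token.
    """
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # score_ij = q_i · k_j / √d
    scores -= scores.max(axis=-1, keepdims=True)     # numeric stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions j
    return weights @ V                               # output_i = Σⱼ a_ij × v_j

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 768, 64, 5
E = rng.normal(size=(seq_len, d_model))              # e₁ ... e₅
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = self_attention(E, Wq, Wk, Wv)                  # shape [5, 64]
```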
Let’s compare the major DNA language models:
| Model | Year | Scale | Training Data | Context | Key Feature |
|---|---|---|---|---|---|
| DNABERT | 2021 | BERT-scale | Human genome | fixed k-mer window | Early DNA BERT model |
| DNABERT-2 | 2023 | 117M reported | Multi-species genomes | variable BPE span | Efficient BPE tokenization |
| Nucleotide Transformer | 2023/2024 | up to billions of parameters | Many genomes across species | varies by checkpoint | Scaling and cross-species training |
| LOGO | 2024 | model-specific | Sequence plus functional annotations | task-dependent | Multi-task sequence learning |
| GROVER | 2024 | 12-layer BERT-style | Human genome sequence | BPE token window | Frequency-balanced genomic vocabulary |
There is no single leaderboard that ranks all DNA language models across all genomic tasks. Performance depends on:
General trends:
Training these models from scratch is expensive:
| Model type | Pretraining cost | Fine-tuning cost | Inference cost |
|---|---|---|---|
| Small BERT-style DNA model | Moderate | Low to moderate | Low |
| Efficient BPE model | Moderate | Low to moderate | Low |
| Billion-parameter model | Very high | Moderate to high | Moderate |
| Long-context model | High, especially for long windows | Task-dependent | Moderate to high |
Good news: You don’t need to train from scratch! Pre-trained models are publicly available. You only pay for fine-tuning on your specific task.
Storage requirements:
Let’s walk through how you’d actually use these models for real biological questions.
Decision tree:
Need maximum accuracy? → Nucleotide Transformer (but slower)
Need speed and efficiency? → DNABERT-2
Working with non-coding variants? → LOGO or GROVER
Limited computational resources? → DNABERT
Multiple species analysis? → Nucleotide Transformer or DNABERT-2
DNA language models expect specific input formats:
# Example: Preparing sequences for DNABERT-2
from transformers import AutoTokenizer

# Load pre-trained tokenizer (the DNABERT-2 repo ships custom code,
# so trust_remote_code=True is required)
tokenizer = AutoTokenizer.from_pretrained(
    "zhihan1996/DNABERT-2-117M", trust_remote_code=True
)

# Your sequence
sequence = "ACGTAACGGTACGTA"

# Tokenize into model-ready tensors
tokens = tokenizer(sequence, return_tensors="pt")

# Now ready for model input
Key considerations:
Option A: Get embeddings for downstream analysis
from transformers import AutoModel

# Load pre-trained model (again with trust_remote_code=True)
model = AutoModel.from_pretrained(
    "zhihan1996/DNABERT-2-117M", trust_remote_code=True
)

# Get embeddings (DNABERT-2 returns a tuple; the first element
# holds the per-token hidden states)
embeddings = model(**tokens)[0]
# Shape: [batch_size, sequence_length, embedding_dim]
# embedding_dim is 768 for this checkpoint
Option B: Fine-tune for specific prediction
from transformers import AutoModelForSequenceClassification

# Load model with a freshly initialized classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "zhihan1996/DNABERT-2-117M",
    num_labels=2,  # Binary classification
    trust_remote_code=True,
)

# Fine-tune on your labeled data
# (see the training sketch below)

# Make predictions on new sequences
predictions = model(**tokens).logits
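A minimal fine-tuning sketch using the Hugging Face Trainer, assuming `train_dataset` is a tokenized dataset with a `labels` column (the `output_dir` name is a placeholder):

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="dnabert2-finetuned",   # placeholder path
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=3e-5,
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```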
For embeddings:
For predictions:
Background: Genome-wide association studies of autism spectrum disorder identified 102 genomic regions associated with the condition. But each region contains dozens to hundreds of variants. Which ones are actually functional?
Important note: The workflow below is a teaching scenario showing how a DNA language model could be used. The specific validation rates, allele frequencies, and variant sequence are illustrative unless replaced with a primary study citation.
Approach:
Researchers used DNABERT-2 to prioritize candidate variants:
Step 1: Fine-tune on brain regulatory data
Step 2: Predict variant effects
Step 3: Validate predictions
Key finding: One variant in an enhancer near CHD8 (a known autism-associated gene):
Reference: ...ACGTAACG[G]TACGTA... (predicted: 0.82 active enhancer)
Alternate: ...ACGTAACG[A]TACGTA... (predicted: 0.23 active enhancer)
This single nucleotide change disrupts a predicted CTCF binding site, reducing enhancer activity by 60% in MPRA validation.
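As a sketch of how such a reference-versus-alternate comparison could be scored, assuming the fine-tuned classifier from earlier in the chapter and standard Hugging Face logits output (the sequences are the illustrative ones above):

```python
import torch

def enhancer_probability(model, tokenizer, sequence):
    """Predicted probability that `sequence` is an active enhancer (class 1)."""
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

ref = "ACGTAACGGTACGTA"   # reference allele context (illustrative)
alt = "ACGTAACGATACGTA"   # same window with the G>A change
delta = enhancer_probability(model, tokenizer, ref) - \
        enhancer_probability(model, tokenizer, alt)
# Large positive delta → variant predicted to reduce enhancer activity
```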
Clinical significance: In a real study, this final step would require independent genetic association, segregation or de novo evidence where appropriate, phenotype matching, and functional validation. DNABERT-2-style prioritization can nominate candidates, but it cannot establish clinical causality on its own.
Background: Many species lack extensive epigenomic data. Can we use language models trained on well-studied species to predict regulatory elements in unstudied species?
Approach:
Researchers can use Nucleotide Transformer-style embeddings to predict regulatory elements across species with limited annotations:
Step 1: Train on multi-species data
Step 2: Test on species with limited data
Step 3: Make predictions
Illustrative results:
Zebrafish predictions:
Cross-species conservation without alignment:
Human enhancer: ACGTAAGGCT...
Mouse ortholog: ACTTAAGGCC... (65% identity)
Zebrafish prediction: GCGTAAGGCA... (52% identity, no detectable alignment)
All three have similar Nucleotide Transformer embeddings
→ Functionally equivalent despite sequence divergence
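A sketch of the underlying comparison: mean-pool each element's per-token embeddings into one vector, then measure cosine similarity. The random vectors below are stand-ins for real model output (in practice, `last_hidden_state.mean` over tokens per sequence):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical mean-pooled embeddings for the three elements
rng = np.random.default_rng(1)
emb_human = rng.normal(size=768)
emb_mouse = emb_human + 0.3 * rng.normal(size=768)   # similar embedding
emb_fish = emb_human + 0.5 * rng.normal(size=768)

print(cosine(emb_human, emb_mouse))  # high → predicted functional similarity
print(cosine(emb_human, emb_fish))
```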
Impact: This approach illustrates why multi-species pretraining is useful: embeddings can nominate candidate regulatory elements in species where epigenomic assays are sparse. Actual discovery claims should cite the specific benchmark or experimental validation study.
Despite impressive performance, DNA language models have important limitations.
Most models still can’t capture very long-range interactions:
Current limits:
Biological reality:
DNA language models learn from DNA sequence alone (mostly). But:
Example:
Sequence X in embryonic stem cells: Active enhancer
Same sequence in liver cells: Inactive/repressed
DNA language model only sees sequence → Can't distinguish
Partial solutions:
Current models work well for SNVs (single nucleotide variants) but struggle with:
Why?
DNA language models are still “black boxes”:
Attention visualization helps but is limited:
High attention between positions 45 and 67
→ But what does this mean biologically?
→ Which proteins bind there?
→ How does this affect gene expression?
Models learn patterns from training data:
Consequence:
Where are DNA language models heading?
Approaches in development:
Goal: Handle entire chromosomes (100+ million bp) in single model
Combining DNA sequence with other data:
Example architecture:
DNA sequence → Language model → Embeddings
↓
H3K27ac signal → CNN → Embeddings ─→ Fusion → Prediction
↑
ATAC-seq data → CNN → Embeddings ─→
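A minimal sketch of such a late-fusion head, with illustrative dimensions; the three embedding sources are assumed to be computed upstream by the models in the diagram:

```python
import torch
import torch.nn as nn

class FusionPredictor(nn.Module):
    """Concatenate sequence and assay embeddings, then predict activity."""

    def __init__(self, d_seq=768, d_assay=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(d_seq + 2 * d_assay, 256),
            nn.ReLU(),
            nn.Linear(256, 1),   # e.g., a regulatory activity score
        )

    def forward(self, seq_emb, h3k27ac_emb, atac_emb):
        combined = torch.cat([seq_emb, h3k27ac_emb, atac_emb], dim=-1)
        return self.fuse(combined)

# Example with random stand-in embeddings (batch of 4)
seq, h3k, atac = torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 128)
pred = FusionPredictor()(seq, h3k, atac)   # shape [4, 1]
```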
Building truly general-purpose models:
This is the approach of recent models like:
Models that explicitly learn evolutionary constraints:
Current models are discriminative (classify/predict). Future models may be generative:
Potential application:
Input: "Design an enhancer active in T cells but not B cells"
Model generates: Novel sequence meeting these criteria
Validate in MPRA
Key Takeaways:
| Term | Definition |
|---|---|
| Attention mechanism | Method for the model to weight important sequence context, computing how much each position should “attend to” every other position |
| BERT (Bidirectional Encoder Representations from Transformers) | Model architecture that processes sequences in both directions simultaneously |
| Byte Pair Encoding (BPE) | Tokenization method that learns optimal “words” from data based on frequency |
| Context window | Maximum input span a model can process at once; depending on tokenization, this may be measured in tokens rather than directly in base pairs |
| Embedding | Numerical vector representation of a sequence that captures its biological properties |
| Fine-tuning | Adapting a pre-trained model to a specific task with limited task-specific data |
| Foundation model | Large pre-trained model that can be adapted to many downstream tasks |
| K-mer | Sequence of k consecutive nucleotides used as a token (e.g., 6-mer = ACGTAA) |
| Masked language modeling (MLM) | Training objective where the model predicts randomly hidden tokens from surrounding context |
| Multi-head attention | Using multiple attention mechanisms in parallel, each learning different types of patterns |
| Pre-training | Training a model on large unlabeled datasets to learn general patterns before fine-tuning |
| Self-attention | Mechanism allowing each position in a sequence to attend to all other positions |
| Tokenization | Converting raw DNA sequence into discrete units (tokens) for model input |
| Transfer learning | Using knowledge learned from one task (pre-training) to improve performance on another task (fine-tuning) |
| Zero-shot learning | Making predictions on new tasks without any task-specific training examples |
Why is k-mer tokenization more appropriate for DNA than single-nucleotide tokenization? What biological properties make k-mers useful units?
DNABERT learns that promoter sequences cluster together (have similar embeddings) without being explicitly told which sequences are promoters. Explain how masked language modeling enables this unsupervised learning of biological function.
Compare the trade-offs between DNABERT (smaller, human-genome-focused model) and Nucleotide Transformer (larger, multi-species model family). In what scenarios would you choose each?
A researcher wants to predict the effect of a variant in an enhancer 500kb from its target gene. Which DNA language model(s) from this chapter would be most appropriate, and why? What are the limitations?
Explain why DNA language models generally outperform conservation-based methods (like GERP++) for variant effect prediction, even though conservation scores use evolutionary information across species.
If you fine-tune DNABERT on enhancer data from liver cells, can you use it to predict enhancers in brain cells? Why or why not? What additional information would help?
Attention weights in DNA language models often show high attention between positions that are far apart in sequence. What biological interactions might this represent? Give specific examples.
LOGO uses multi-task pre-training (masked language modeling + region classification + chromatin state prediction). Why does learning multiple tasks simultaneously improve performance compared to learning each task separately?
Ethical considerations: DNA language models can predict regulatory effects of variants, but these predictions aren’t perfect. How should we handle cases where a model predicts high functional impact for a variant, but clinical geneticists are uncertain? Who should make the final decision about variant interpretation?
Data representation: These models are trained primarily on human reference genomes and well-studied populations. How might this bias affect predictions for variants common in under-represented populations? What steps could be taken to address this?
Mechanistic understanding vs. prediction accuracy: DNA language models can achieve high accuracy without explaining how a variant causes its effect. Is this acceptable for clinical use? When is mechanistic understanding essential vs. when is accurate prediction sufficient?
Resource allocation: Training large DNA language models can require substantial GPU infrastructure and engineering time. Is this a good use of research funding compared to funding experimental validation studies? How should the field balance computational vs. experimental approaches?
Generalization limits: Current models work well for SNVs in regulatory regions but struggle with structural variants and coding sequences. Should we develop specialized models for each variant type and genomic context, or pursue a single “universal” model? What are the trade-offs?
DNABERT (2021) Ji, Y., Zhou, Z., Liu, H., & Davuluri, R. V. (2021). DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics, 37(15), 2112-2120.
Nucleotide Transformer (2024) Dalla-Torre, H., Gonzalez, L., Mendoza-Revilla, J., et al. (2024). The Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nature Methods.
DNABERT-2 (2023) Zhou, Z., Ji, Y., Li, W., et al. (2023). DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome. arXiv preprint.
GROVER (2024) Sanabria, M., Hirsch, J., Joubert, P. M., et al. (2024). DNA language model GROVER learns sequence context in the human genome. Nature Machine Intelligence.
DNA Language Models for Variant Effects (2023) Benegas, G., Batra, S. S., & Song, Y. S. (2023). DNA language models are powerful predictors of genome-wide variant effects. PNAS, 120(44), e2311219120.
Language Models for Biological Research (2024) Simon, E., Swanson, K., & Zou, J. (2024). Language models for biological research: a primer. Nature Methods, 21, 1422-1429.
Hugging Face Model Hub - Genomics Models https://huggingface.co/models?pipeline_tag=feature-extraction&search=dna
Nucleotide Transformer GitHub https://github.com/instadeepai/nucleotide-transformer
DNABERT Documentation https://github.com/jerryji1993/DNABERT
Deep Learning for Life Sciences Ramsundar, B., Eastman, P., Walters, P., & Pande, V. (2019). Chapter 8: Language Models. In Deep Learning for the Life Sciences. O’Reilly Media.
Biological Sequence Analysis Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.
In Chapter 14: Next-Generation DNA Models, we’ll explore even more advanced architectures that push beyond the transformer paradigm:
These models address key limitations of current transformers:
Prerequisites for Chapter 14:
Coming up: We’ll see how alternative architectures can capture chromosome-scale context while remaining computationally tractable—opening new possibilities for understanding long-range gene regulation and structural variant effects.
[Continue to Chapter 14: Next-Generation DNA Models →]
This chapter is part of “AI for Biologists: From Genomic Variants to Cellular Models”
Licensed under CC BY-NC-SA 4.0