When Mendel crossed purple and white pea plants, he observed predictable patterns but didn’t know what caused the difference. Today, we can answer his question at the molecular level: the difference is a change in DNA sequence—a genetic variant.
When you sequence a genome, the useful output is not one long string of As, Ts, Gs, and Cs but a list of millions of positions where that genome differs from the reference genome. Understanding these differences is the foundation of modern genetics. This chapter explains what variants are, how we classify them, and what effects they have.
The terms mutation, variant, and polymorphism are often used inconsistently. Here’s a practical guide:
Term | Definition | Connotation | When to Use |
---|---|---|---|
Variant | Any DNA sequence difference | Neutral | Default term—safe for any difference |
Mutation | DNA change, especially rare/new | Often implies disease | Rare changes causing disease |
Polymorphism | Common variant (>1% frequency) | Usually benign | Common variation like blood types |
SNP | Single Nucleotide Polymorphism | Common, benign | Pronounced “snip”—common single-base changes |
The relationship:
{@html `         All Variants
          /          \
    Mutations     Polymorphisms
  (rare, often   (common, usually
    harmful)        benign)
`}
Examples:
In this chapter, we use “variant” as the default neutral term.
Understanding variants requires multiple perspectives. The same variant can be described by its size, location, effect, and frequency:
Classification | Types | Detection | Clinical Importance |
---|---|---|---|
By Size | SNV, indel, structural variant | SNVs easiest; SVs need long reads | Size affects detection and impact |
By Location | Coding, non-coding, regulatory | WES sees coding; WGS sees all | Location determines interpretability |
By Effect | LoF, missense, synonymous | Coding effects predictable; non-coding uncertain | Effect determines pathogenicity |
By Frequency | Common (>1%), rare (<1%) | All methods | Common variants rarely cause severe disease |
Definition (SNV, single-nucleotide variant): One base differs from the reference.
{@html `Reference: ...ATGCGATCG...
Your DNA:  ...ATGCTATCG...
                  ↑
             G→T change
`}
Frequency: ~4-5 million per genome (one every 600-800 bases)
Detection: Easiest to detect with both WGS and WES—the “bread and butter” of variant calling
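As a minimal sketch of the idea, the comparison in the diagram above can be written as a position-by-position scan of two equal-length strings. The function name `find_snvs` is hypothetical, and the genome size of about 3.1 billion bases used in the density check is an assumption not stated in this chapter. Real variant callers work from aligned reads and quality scores, not from two clean strings.

```python
def find_snvs(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length (indels are not handled here)")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


# The example from the diagram: one G->T change
snvs = find_snvs("ATGCGATCG", "ATGCTATCG")
print(snvs)  # [(4, 'G', 'T')]

# Rough density check. GENOME_SIZE is an assumed value (~3.1 billion bases).
GENOME_SIZE = 3_100_000_000
for snv_count in (4_000_000, 5_000_000):
    print(f"{snv_count:,} SNVs -> one every {GENOME_SIZE // snv_count} bases")
```

Under that assumed genome size, 4-5 million SNVs works out to one every 775 to 620 bases, which matches the 600-800 range quoted above.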
Definition (indel, small insertion or deletion): Small pieces of DNA added or removed (typically 1-50 bp)
Deletion example:
{@html `Reference: ...ATGCGATCG...
Variant: ...ATG---TCG... (CGA deleted)
`}
Insertion example:
{@html `Reference: ...ATGCGATCG...
Variant: ...ATGCAAAGATCG... (AAA inserted)
`}
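Critical factor: is the length divisible by 3? Within coding sequence, an indel whose length is not a multiple of 3 shifts the reading frame (frameshift), while a multiple of 3 adds or removes whole codons and preserves the frame (in-frame).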
Frequency: ~400,000-500,000 per genome
Detection: Slightly harder than SNVs, especially in repetitive regions
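The divisible-by-3 rule is simple enough to sketch directly. The function name `indel_frame_effect` is hypothetical, and the alleles are written with a leading anchor base in the style of VCF records, which is an assumption about input format. The result is only meaningful for indels that fall inside coding sequence.

```python
def indel_frame_effect(ref_allele: str, alt_allele: str) -> str:
    """Classify a coding indel as 'in-frame' or 'frameshift' from its length change."""
    length_change = abs(len(alt_allele) - len(ref_allele))
    if length_change == 0:
        raise ValueError("not an indel: alleles have the same length")
    # Multiples of 3 add or remove whole codons; anything else shifts the frame
    return "in-frame" if length_change % 3 == 0 else "frameshift"


print(indel_frame_effect("GCGA", "G"))   # CGA deleted (3 bp)  -> in-frame
print(indel_frame_effect("C", "CAAA"))   # AAA inserted (3 bp) -> in-frame
print(indel_frame_effect("CT", "C"))     # 1 bp deleted        -> frameshift
```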
Definition (structural variant, SV): Large-scale changes (>50 bp, often >1 kb)
Types:
Impact: ~1,000-2,000 SVs per person, affecting more total bases than all SNVs combined
Detection challenge:
Technology | SV Detection Capability |
---|---|
Short-read (Illumina) | Poor—can infer but not confirm |
Long-read (PacBio/Nanopore) | Excellent—reads span breakpoints |
WES | Very poor: only exons are captured, so most breakpoints fall outside the sequenced regions |
WGS | Good with long reads |
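The size thresholds used in this section (1 bp, 1-50 bp, >50 bp) can be sketched as a small classifier. The function name `classify_by_size` and the allele-string input are illustrative assumptions. The "multi-nucleotide substitution" label covers a case this chapter does not discuss, and real structural variants such as inversions and translocations cannot be described by allele lengths alone.

```python
def classify_by_size(ref_allele: str, alt_allele: str) -> str:
    """Bucket a variant as SNV, indel, or structural variant by its size."""
    size_change = abs(len(alt_allele) - len(ref_allele))
    if size_change == 0:
        if len(ref_allele) == 1:
            return "SNV"
        # Same length but longer than one base: outside this chapter's three categories
        return "multi-nucleotide substitution"
    if size_change <= 50:
        return "indel"
    return "structural variant"


print(classify_by_size("G", "T"))             # SNV
print(classify_by_size("ATGC", "A"))          # indel (3 bp deletion)
print(classify_by_size("A", "A" + "T" * 51))  # structural variant (51 bp insertion)
```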
The genome isn’t uniform. A variant’s effect depends critically on where it lands.
Why coding variants are easier:
Why non-coding variants are harder:
All coding variants affect how DNA is translated into protein. Understanding requires knowing the genetic code reads in 3-base units (codons):
{@html `DNA:     ATG CAT GCA TTG AAA
Protein: Met-His-Ala-Leu-Lys
`}
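To make the codon reading concrete, here is a minimal translation sketch using the standard genetic code. The names `CODON_TABLE` and `translate` are illustrative, the output uses one-letter amino acid codes (M, H, A, L, K for Met, His, Ala, Leu, Lys), and the input is assumed to be the coding strand starting in frame.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate in-frame coding DNA, stopping at the first stop codon."""
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGCATGCATTGAAA"))  # MHALK
```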
Type | DNA Change | Protein Effect | Pathogenicity | Example Disease |
---|---|---|---|---|
Synonymous | GAA→GAG | Glu→Glu (same) | Usually benign | Rarely disease-causing |
Missense | GAA→GCA | Glu→Ala (different) | Variable | Sickle cell (HBB) |
Nonsense | CAG→TAG | Gln→Stop | Almost always harmful | Many CFTR variants |
Frameshift | ATG CAT GCA → ATG CA_ GCA | Scrambles downstream | Almost always harmful | Duchenne muscular dystrophy |
In-frame indel | ATG CAT GCA TTG → ATG CAT TTG | Removes/adds amino acid(s) | Variable | EGFR in lung cancer |
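The first three rows of the table can be reproduced by comparing the amino acid encoded before and after a single-codon change. This is a sketch only: `CODON_TO_AA` is a deliberately partial codon table containing just the codons from the table, and `classify_codon_change` is a hypothetical helper. The "stop-loss" branch handles a case the table does not cover.

```python
# Partial codon table: only the codons that appear in the table above
CODON_TO_AA = {
    "GAA": "Glu",
    "GAG": "Glu",
    "GCA": "Ala",
    "CAG": "Gln",
    "TAG": "Stop",
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon substitution by its effect on the protein."""
    ref_aa = CODON_TO_AA[ref_codon]
    alt_aa = CODON_TO_AA[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "Stop":
        return "nonsense"
    if ref_aa == "Stop":
        return "stop-loss"  # not covered in the table above
    return "missense"


print(classify_codon_change("GAA", "GAG"))  # synonymous
print(classify_codon_change("GAA", "GCA"))  # missense
print(classify_codon_change("CAG", "TAG"))  # nonsense
```

Knowing the category is only the first step: as the Pathogenicity column shows, a missense call says nothing by itself about whether the change is harmful.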
1. Synonymous (Silent):
2. Missense (Most Complex):
3. Loss-of-Function (LoF) = Nonsense + Frameshift:
4. In-frame indels:
{@html `Normal:      ATG|CAT|GCA|TTG|AAA
             Met-His-Ala-Leu-Lys

Frameshift   ATG|CA_|GCA|TTG|AAA  (1 base deleted)
(1 bp del):  ATG|CAG|CAT|TGA|AA...
             Met-Gln-His-STOP     ← Wrong amino acids, early stop

In-frame     ATG|CAT|___|TTG|AAA  (3 bases deleted)
(3 bp del):  ATG|CAT|TTG|AAA
             Met-His-Leu-Lys      ← One amino acid missing, frame preserved
`}
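The diagram above can be checked by deleting bases from the sequence and re-reading it in codons. In this sketch, `CODONS` is a partial table holding only the seven codons that occur in this example, and `delete_bases` and `translate` are hypothetical helper names.

```python
# Partial codon table: only the codons that occur in this example
CODONS = {
    "ATG": "Met", "CAT": "His", "GCA": "Ala", "TTG": "Leu",
    "AAA": "Lys", "CAG": "Gln", "TGA": "STOP",
}


def delete_bases(dna: str, start: int, length: int) -> str:
    """Remove `length` bases beginning at 0-based index `start`."""
    return dna[:start] + dna[start + length:]


def translate(dna: str) -> str:
    """Read complete codons left to right, stopping after the first stop codon."""
    residues = []
    for pos in range(0, len(dna) - 2, 3):
        residue = CODONS[dna[pos:pos + 3]]
        residues.append(residue)
        if residue == "STOP":
            break
    return "-".join(residues)


normal = "ATGCATGCATTGAAA"
frameshift = delete_bases(normal, 5, 1)  # drop the T of CAT
in_frame = delete_bases(normal, 6, 3)    # drop the whole GCA codon

print(translate(normal))      # Met-His-Ala-Leu-Lys
print(translate(frameshift))  # Met-Gln-His-STOP
print(translate(in_frame))    # Met-His-Leu-Lys
```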
Background: Genes have introns (removed) and exons (kept). Splicing happens at precise sequences:
Canonical splice site variants (GT→AT or AG→AA):
Example: BRCA2 splice site mutations → non-functional protein → hereditary breast cancer
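A canonical splice site check reduces to looking at the first and last two bases of an intron. The sketch below assumes the intron is given as coding-strand DNA, and both the function name `has_canonical_splice_sites` and the short intron sequences are made up for illustration. Real splice predictors score a much wider sequence context than these two dinucleotides.

```python
def has_canonical_splice_sites(intron: str) -> bool:
    """True if the intron starts with GT (donor) and ends with AG (acceptor)."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Hypothetical toy introns, far shorter than real ones
print(has_canonical_splice_sites("GTAAGTTTTCAG"))  # True: GT...AG intact
print(has_canonical_splice_sites("ATAAGTTTTCAG"))  # False: donor GT->AT
print(has_canonical_splice_sites("GTAAGTTTTCAA"))  # False: acceptor AG->AA
```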
Near-splice variants:
Types and locations:
Element | Location | Effect | Example |
---|---|---|---|
Promoter | Near gene start | Affects transcription initiation | HBB promoter → β-thalassemia |
Enhancer | Can be 100+ kb away | Increases gene expression | MYC enhancer → cancer |
5’ UTR | Before start codon | Affects translation efficiency | Various |
3’ UTR | After stop codon | Affects mRNA stability, miRNA binding | TP53 3’UTR → cancer risk |
Challenge: Unlike coding variants (direct amino acid change), regulatory effects are:
Most are neutral but exceptions exist:
Bottom line: These are hardest to interpret—active research area.
When you sequence a genome and find millions of variants, you need a systematic filtering approach.
{@html `~4-5 million variants detected
↓
Filter 1: Frequency (remove common >1%)
↓
~50,000-100,000 rare variants
↓
Filter 2: Location (focus on coding + splice sites)
↓
~5,000-10,000 protein-affecting variants
↓
Filter 3: Gene relevance (phenotype match)
↓
~100-500 candidates
↓
Filter 4: Effect prediction (LoF, pathogenic missense)
↓
~5-20 strong candidates
↓
Filter 5: Inheritance pattern + family data
↓
1-3 likely causal variants
`}
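The first three filters of the funnel above can be sketched as successive list filters. Everything here is illustrative: the `Variant` record, its field names, the consequence labels, and the gene names are hypothetical and do not follow any particular annotation tool's output. Filters 4 and 5 are left out because they need prediction scores and family data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str
    allele_frequency: float  # population frequency, 0.0 to 1.0
    consequence: str         # hypothetical annotation label


RARE_THRESHOLD = 0.01  # the 1% cutoff used in this chapter
PROTEIN_AFFECTING = {"missense", "nonsense", "frameshift", "in-frame indel", "splice site"}


def filter_candidates(variants: list[Variant], phenotype_genes: set[str]) -> list[Variant]:
    """Apply filters 1-3: frequency, location/effect class, gene relevance."""
    rare = [v for v in variants if v.allele_frequency < RARE_THRESHOLD]
    coding = [v for v in rare if v.consequence in PROTEIN_AFFECTING]
    return [v for v in coding if v.gene in phenotype_genes]


variants = [
    Variant("GENE_A", 0.25, "missense"),      # common: removed by filter 1
    Variant("GENE_B", 0.0005, "intronic"),    # rare but non-coding: removed by filter 2
    Variant("GENE_C", 0.0001, "frameshift"),  # rare and coding, wrong gene: removed by filter 3
    Variant("GENE_D", 0.0, "nonsense"),       # passes all three filters
]

candidates = filter_candidates(variants, phenotype_genes={"GENE_A", "GENE_D"})
print([v.gene for v in candidates])  # ['GENE_D']
```

Applying the cheapest and most aggressive filter first (frequency) mirrors the order of the funnel: each later step then runs on a much smaller list.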
Strong evidence for pathogenicity:
Uncertain significance (VUS):
Likely benign:
Case: Child with developmental delays. WES finds a novel variant in the KMT2D gene:
Step 1—Check frequency:
Step 2—Check variant type:
Step 3—Check gene:
Step 4—Check inheritance:
Step 5—Check databases:
Conclusion: Pathogenic variant that likely explains the patient’s phenotype
Variant Type | WGS | WES | Notes |
---|---|---|---|
Coding SNVs/indels | ✓✓ | ✓✓ | Both excellent |
Splice sites (canonical) | ✓✓ | ✓✓ | WES captures exon boundaries |
Deep intronic | ✓✓ | ✗ | WES misses most introns |
Regulatory (promoter/enhancer) | ✓✓ | ✗ | WES misses non-coding |
Structural variants | ✓ | ✗/? | WES poor; WGS with long reads best |
Clinical decision:
Variants exist at multiple scales: SNVs (1 bp), indels (1-50 bp), structural variants (>50 bp)—each requires different detection methods
Location determines interpretability: Coding variants are easier to interpret than non-coding ones, even though about 98% of the genome is non-coding
Effect on protein predicts pathogenicity: LoF variants (nonsense/frameshift) are almost always harmful, missense variants are variable, and synonymous variants are usually benign
Frequency matters: Common variants (>1%) rarely cause severe disease because natural selection keeps severely harmful variants rare
Multiple evidence types needed: No single criterion is sufficient; integrate frequency, effect, gene constraint, inheritance, and functional data
Many variants remain uncertain: VUS (Variant of Uncertain Significance) is common, especially for missense and non-coding variants
Sequencing strategy matters: WES excellent for coding variants; WGS needed for comprehensive analysis
Understanding variant types is foundational. The next sections explore:
Everything that follows builds on this framework: variants are the molecular basis of alleles, and alleles follow Mendel’s rules, but now we can see them directly in the DNA sequence.