When you sequence someone’s genome using NGS, you get millions of differences compared to the reference genome. A typical person has:
The overwhelming question is: which ones matter?
Finding those needle-in-a-haystack variants requires annotation (adding meaning to each variant) and databases (comparing your findings to what’s already known).
Variant annotation adds biological and clinical context to each variant. For every difference you find, annotation helps answer four key questions:
{@html `┌─────────────────────────────────────────────────┐
│ THE FOUR ANNOTATION QUESTIONS │
├─────────────────────────────────────────────────┤
│ 1. LOCATION: Gene? Exon? Regulatory region? │
│ 2. FUNCTION: Protein change? Splicing effect? │
│ 3. POPULATION: Common or rare? │
│ 4. CLINICAL: Known disease association? │
└─────────────────────────────────────────────────┘
`}
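One way to picture the four questions is as the fields of a single record per variant. The sketch below is illustrative only: the class name, field names, and the `unanswered` helper are assumptions made for this example, not the output schema of any annotation tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VariantAnnotation:
    """One variant, organized around the four annotation questions.

    All field names are hypothetical; None means "not looked up yet".
    """
    location: Optional[str] = None               # 1. gene / exon / regulatory region
    function: Optional[str] = None               # 2. protein change or splicing effect
    population_frequency: Optional[float] = None  # 3. common or rare
    clinical_significance: Optional[str] = None   # 4. known disease association

    def unanswered(self) -> List[str]:
        """Return the questions that still have no answer."""
        answers = {
            "location": self.location,
            "function": self.function,
            "population": self.population_frequency,
            "clinical": self.clinical_significance,
        }
        return [question for question, value in answers.items() if value is None]


# Location and function filled in; the database lookups are still pending.
example = VariantAnnotation(location="CFTR exon 11", function="stop-gain, p.Gly542*")
print(example.unanswered())
```

The first two fields are what annotation tools typically fill in; the last two come from database lookups.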
Three major tools automate variant annotation. Each has distinct strengths:
Tool | Primary Strength | Best Used For | Output Example |
---|---|---|---|
VEP (Ensembl) | Comprehensive location mapping, regulatory regions | Whole-genome analysis including non-coding regions | “CFTR exon 11, stop-gain, p.Gly542*” |
ANNOVAR | Multi-database integration, flexible filtering | Prioritizing variants from large datasets | Adds gnomAD frequencies, conservation scores, disease links |
SnpEff | Automatic impact classification | Quick screening for disease-causing variants | HIGH (stop-gain), MODERATE (missense), LOW (synonymous) |
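The impact tiers in the SnpEff row can be thought of as a lookup from consequence type to severity. The following is a simplified sketch of that idea, not SnpEff's actual implementation: the table covers only a handful of consequence terms, and the `UNKNOWN` fallback and function names are assumptions for illustration.

```python
# Simplified consequence-to-impact lookup (a small illustrative subset).
IMPACT_BY_CONSEQUENCE = {
    "stop_gained": "HIGH",
    "frameshift_variant": "HIGH",
    "missense_variant": "MODERATE",
    "synonymous_variant": "LOW",
    "intron_variant": "MODIFIER",
}

# Most severe first; UNKNOWN is a hypothetical fallback for unlisted terms.
SEVERITY_ORDER = ["HIGH", "MODERATE", "LOW", "MODIFIER", "UNKNOWN"]


def impact_of(consequence):
    """Return the impact tier for a consequence term."""
    return IMPACT_BY_CONSEQUENCE.get(consequence, "UNKNOWN")


def sort_by_impact(consequences):
    """Order consequence terms from most to least severe."""
    return sorted(consequences, key=lambda c: SEVERITY_ORDER.index(impact_of(c)))


print(sort_by_impact(["synonymous_variant", "stop_gained", "missense_variant"]))
```

Sorting by tier is what makes this kind of classification useful for quick screening: the variants most likely to disrupt a protein rise to the top.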
In practice, researchers often use multiple tools for cross-validation. For example, finding a CFTR stop-gain mutation:
Variant databases form a hierarchical information structure: known variants → population frequency → clinical significance.
What dbSNP is: The comprehensive catalog of human genetic variation, maintained by NCBI since 1998.
Key features:
Usage: Check whether your variant has an rs number. If the variant is common (>5% frequency), it is likely benign.
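The usage rule can be written as a small triage function. This is a minimal sketch: the function name, the return labels, and the placeholder rs number in the example call are assumptions, and the 5% cutoff is simply the figure quoted in the text.

```python
from typing import Optional

COMMON_AF = 0.05  # the >5% frequency cutoff quoted in the text


def frequency_triage(rs_id: Optional[str], allele_frequency: Optional[float]) -> str:
    """First-pass triage from catalog membership and allele frequency."""
    if rs_id is None:
        return "novel: no rs number"
    if allele_frequency is None:
        return "known: frequency unavailable"
    if allele_frequency > COMMON_AF:
        return "known and common: likely benign"
    return "known but rare: needs further review"


# "rs0000000" is a placeholder identifier, not a real dbSNP record.
print(frequency_triage("rs0000000", 0.12))
print(frequency_triage(None, None))
```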
What gnomAD is: High-quality sequencing data from 140,000+ individuals across diverse populations.
Key features:
Example: A novel variant in a developmental disorder patient:
Why diversity matters: What’s common in one population may be rare in another. Population-specific data prevents misclassification.
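A common way to act on this is to check the highest frequency seen in any single population rather than one pooled number. The sketch below assumes frequencies are already available as a plain dictionary. The population labels and values are invented for illustration and are not real gnomAD data.

```python
def highest_population_frequency(af_by_population):
    """Return (population, frequency) for the population where the variant is most common."""
    return max(af_by_population.items(), key=lambda item: item[1])


def common_in_any_population(af_by_population, threshold=0.05):
    """True if the variant exceeds the threshold in at least one population."""
    _, frequency = highest_population_frequency(af_by_population)
    return frequency > threshold


# Invented example frequencies: rare or absent in two groups, common in a third.
toy_frequencies = {"East Asian": 0.08, "European": 0.0, "African": 0.0002}

print(highest_population_frequency(toy_frequencies))
print(common_in_any_population(toy_frequencies))
```

A reference set containing only the second group would make this variant look novel, which is the misclassification risk described above.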
What ClinVar is: A clinical variant database linking variants to diseases with evidence-based classifications.
Classification system:
{@html `Pathogenic ←→ Likely Pathogenic ←→ VUS ←→ Likely Benign ←→ Benign
▲ ▲ ▲
Causes disease Uncertain significance Harmless
`}
Clinical workflow:
The VUS challenge: Many variants lack sufficient evidence for classification. As data accumulates, VUSs get reclassified, so patients often recheck their results annually.
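Because the five categories form an ordered scale, a reclassification can be modeled as a move along that scale. The following is an illustrative sketch only: the enum and the `describe_change` helper are hypothetical names, and ClinVar itself stores classifications as text labels, not numbers.

```python
from enum import IntEnum


class Classification(IntEnum):
    """Five-tier scale, ordered from harmless to disease-causing."""
    BENIGN = 1
    LIKELY_BENIGN = 2
    VUS = 3
    LIKELY_PATHOGENIC = 4
    PATHOGENIC = 5


def describe_change(old, new):
    """Summarize how a variant's classification moved between two reviews."""
    if new > old:
        return "moved toward pathogenic"
    if new < old:
        return "moved toward benign"
    return "unchanged"


print(describe_change(Classification.VUS, Classification.LIKELY_BENIGN))
```

Comparing a stored classification with the current one in this way is one simple approach to flagging results that are worth rechecking.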
For variants not yet in ClinVar, computational tools predict likely effects:
Tool | Scoring System | Interpretation Threshold | Best For |
---|---|---|---|
CADD | Phred-like scale | >20 = top 1% most deleterious; >30 = top 0.1% | General variant impact |
REVEL | 0-1 scale | >0.5 = likely pathogenic | Missense variants in Mendelian diseases |
AlphaMissense | AI-based classification | Benign/Ambiguous/Pathogenic | Structure-based predictions for 71M variants |
Important limitation: These are predictions, not proof. Functional studies or clinical evidence are needed for confirmation.
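The CADD thresholds follow from its Phred-like scaling, where a score of 10 × n corresponds to the top 10^(-n) fraction of possible variants. The sketch below shows that conversion and applies the table's cutoffs. The function names are assumptions for illustration, and the cutoffs are the ones quoted in the table, not universal standards.

```python
CADD_CUTOFF = 20    # table: >20 = top 1% most deleterious
REVEL_CUTOFF = 0.5  # table: >0.5 = likely pathogenic


def cadd_top_fraction(phred_score):
    """Convert a Phred-like CADD score to the fraction of variants ranked at least as deleterious."""
    return 10 ** (-phred_score / 10)


def supporting_predictors(cadd=None, revel=None):
    """List the predictors whose score exceeds the cutoff from the table."""
    support = []
    if cadd is not None and cadd > CADD_CUTOFF:
        support.append("CADD")
    if revel is not None and revel > REVEL_CUTOFF:
        support.append("REVEL")
    return support


print(cadd_top_fraction(20))   # roughly 0.01, i.e. top 1%
print(cadd_top_fraction(30))   # roughly 0.001, i.e. top 0.1%
print(supporting_predictors(cadd=32, revel=0.89))
```

Agreement between predictors raises confidence but, as noted above, is still not proof.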
Before annotating variants, we need a map of the genome showing where genes, exons, and regulatory regions are located.
Feature | GENCODE | RefSeq |
---|---|---|
Coverage | Comprehensive: ~45,000 genes (coding + non-coding) | Conservative: ~20,000 protein-coding genes |
Curation | Manual + automated | Primarily manual |
Identifiers | ENSG/ENST | NM/NP |
Best for | Research, non-coding RNA studies | Clinical diagnostics |
Updates | Frequent | Less frequent, more stable |
Clinical example: The BRCA1 founder mutation is reported as “NM_007294.3:c.5266dup”, which names the variant relative to a versioned RefSeq transcript, the standard in clinical settings.
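A name like this has a regular structure: transcript accession, version, coordinate type, and the change itself. The following is a deliberately simplified sketch that splits such a string with a regular expression. It handles only the basic `accession.version:type.change` shape and is not a full parser for the nomenclature; the function name and returned keys are assumptions.

```python
import re

# Simplified pattern: two-letter prefix, accession number, version, coordinate type, change.
VARIANT_NAME = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+)"
    r":(?P<coordinate_type>[cgmnpr])\.(?P<change>\S+)$"
)


def parse_variant_name(name):
    """Split a transcript-based variant name into its parts, or return None if it does not match."""
    match = VARIANT_NAME.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["version"] = int(parts["version"])
    return parts


print(parse_variant_name("NM_007294.3:c.5266dup"))
```

Keeping the version number matters because position numbering can differ between versions of the same transcript.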
Biobanks are large collections of DNA samples linked to detailed health information. They answer: “Do people with this variant have higher disease rates?”
Biobank | Scale | Population | Key Strength | Representative Finding |
---|---|---|---|---|
UK Biobank | 500,000 | UK adults | NHS health records + imaging | Gene-environment interactions (coffee, sleep, heart disease) |
All of Us | 245,000+ | US diversity focus | Addressing health disparities | Population-specific drug metabolism variants |
KOVA | 5,305 | Korean | East Asian variant reference | 22.8% better filtering vs European databases |
FinnGen | 500,000+ | Finnish | Founder effect for rare variants | Rare heart/metabolic disorder variants |
ToMMo | 150,000 | Japanese (3 generations) | Gene-environment over time | Post-disaster stress/trauma genetics |
Clinical impact: Population-specific databases prevent misclassification. A variant common in East Asians but absent from European databases could be wrongly labeled pathogenic without appropriate reference data.
Patient: 4-year-old with developmental delays, seizures, abnormal brain MRI.
Diagnostic workflow:
{@html `25,000 variants found (whole-exome sequencing)
↓
[Annotation: VEP + ANNOVAR]
↓
~200 rare variants affecting protein function
↓
[Filter: ClinVar (remove benign) + gnomAD (<0.5%)]
↓
40 candidate variants
↓
[Prioritize: Genes causing similar symptoms]
↓
3 variants in relevant genes
↓
[Predict: CADD, REVEL, AlphaMissense]
↓
SCN1A variant identified:
• CADD: 32 (top 0.1%)
• REVEL: 0.89 (pathogenic)
• gnomAD: Not in 140,000+ people
• De novo (new mutation, not inherited)
↓
DIAGNOSIS: Dravet syndrome
`}
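The filtering steps in this workflow can be sketched as a chain of simple checks over annotated variants. Everything below is a toy illustration: the dictionary keys, the placeholder gene names (`GENE_A` to `GENE_D`), and all values except the SCN1A CADD score from the case are invented, and real pipelines operate on VCF files and annotation tool output, not hand-written dictionaries.

```python
RARE_AF = 0.005  # the <0.5% gnomAD cutoff used in the workflow


def prioritize(variants, phenotype_genes):
    """Apply the workflow's filters in order, then rank survivors by CADD score."""
    candidates = []
    for v in variants:
        if v["impact"] not in {"HIGH", "MODERATE"}:
            continue  # keep only variants affecting protein function
        if v["gnomad_af"] is not None and v["gnomad_af"] >= RARE_AF:
            continue  # drop variants that are not rare
        if v["clinvar"] in {"Benign", "Likely benign"}:
            continue  # drop known benign variants
        if v["gene"] not in phenotype_genes:
            continue  # keep genes linked to similar symptoms
        candidates.append(v)
    return sorted(candidates, key=lambda v: v["cadd"], reverse=True)


# Toy input; gnomad_af=None means the variant was not observed in the database.
toy_variants = [
    {"gene": "SCN1A", "impact": "MODERATE", "gnomad_af": None, "clinvar": None, "cadd": 32},
    {"gene": "GENE_A", "impact": "HIGH", "gnomad_af": 0.12, "clinvar": "Benign", "cadd": 35},
    {"gene": "GENE_B", "impact": "LOW", "gnomad_af": None, "clinvar": None, "cadd": 5},
    {"gene": "GENE_C", "impact": "MODERATE", "gnomad_af": 0.0001, "clinvar": None, "cadd": 18},
    {"gene": "GENE_D", "impact": "MODERATE", "gnomad_af": 0.001, "clinvar": None, "cadd": 21},
]

ranked = prioritize(toy_variants, phenotype_genes={"SCN1A", "GENE_D"})
print([v["gene"] for v in ranked])
```

Each filter is cheap on its own; the narrowing from thousands of variants to a few candidates comes from applying them in sequence.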
Outcome:
These databases and tools enable:
Research:
Medicine:
Global Health:
As sequencing becomes widespread, every genome sequenced adds to our collective knowledge, making the next diagnosis faster and more accurate.
Annotation adds meaning to millions of variants through four key questions: location, function, population frequency, and clinical significance
Three-tier database system: dbSNP (catalog) → gnomAD (frequency) → ClinVar (clinical interpretation)
Multiple tools complement each other: Use VEP, ANNOVAR, and SnpEff together; combine CADD, REVEL, and AlphaMissense for predictions
Population diversity is essential: Reference data must match patient ancestry to avoid misclassification
Integration is key: Clinical diagnosis requires combining annotation tools, databases, pathogenicity prediction, and biobank data into a coherent workflow