Chapter 5: Genetic Variation and Genomic Technologies

Interactive: Chapter 5


4,100,000.

That is approximately how many positions in your DNA differ from the person sitting next to you in class right now. Not approximate — not “a lot” — a specific, enumerable number. If you printed those differences on paper, one variant per line, you would fill 82,000 pages. Stack those pages and the pile would be taller than you are.

Somewhere in that stack — perhaps on page 47,231 or page 81,003 — there might be a single line that explains why your family has a history of early-onset heart disease, or why you can taste bitter compounds in broccoli that your roommate cannot, or why a drug that works for most people gives you a side effect that mystifies your physician. One line in 82,000 pages. And that is assuming the answer lies in a single variant. Many traits and diseases arise from combinations: two variants that are harmless alone but disruptive together, or a common variant that only matters in a particular environmental context.

For most of human history, this stack was invisible. We knew that heredity existed, that traits ran in families, that identical twins were more similar than fraternal ones — but the molecular substrate was opaque. The sequencing revolution of the last two decades cracked it open. We can now read every letter of a person’s genome in a matter of days, for a cost that continues to fall. The stack has become legible. The problem, now, is navigation.

This chapter is about the technologies that generate the stack, the types of variants it contains, and the computational strategies that make it navigable. It is the problem that motivates everything that follows in this book — because before AI can help us interpret variation, we need to understand what variation is, how we measure it, and why the sheer scale of it demands something smarter than a manual search.


The Biological Challenge: Scale, Complexity, and the Noncoding Genome

Before we can apply AI to interpret genetic variants, we need to understand what we’re dealing with. The human genome presents several interconnected challenges:

Challenge 1: Sheer Volume

Each human genome contains approximately 4-5 million variants compared to the reference genome (GRCh38). These include:

  • ~3.5 million single nucleotide variants (SNVs)
  • ~500,000 small insertions and deletions (indels, 1-50 bp)
  • ~5,000-10,000 structural variants (>50 bp)

If you examined one variant per minute, working 24/7, it would take over 7 years to review a single genome. We’re now sequencing hundreds of thousands of genomes per year. Manual review is impossible.

Challenge 2: Most of the Genome Isn’t Genes

Only about 1.5% of the human genome codes for proteins (exons). Yet variants outside protein-coding regions can have profound functional impacts:

  • Promoters control when and where genes are expressed
  • Enhancers can be located millions of base pairs away from their target genes
  • Silencers repress gene expression
  • Splice sites determine which parts of genes are included in final transcripts
  • Long-range regulatory elements coordinate expression of multiple genes

A variant in an enhancer 500 kilobases upstream of a critical developmental gene can be just as impactful as a variant that changes an amino acid in that gene’s protein. But we have far less understanding of how noncoding variants work.

Challenge 3: Context Matters

The same DNA sequence can have completely different functions depending on:

  • Cell type: An enhancer active in neurons might be inactive in liver cells
  • Developmental stage: Critical during embryogenesis, irrelevant in adults
  • Environmental conditions: Some regulatory elements respond to hormones, nutrients, or stress
  • Genetic background: Variants interact with each other

A variant that’s harmless in most people might have functional impact when combined with other variants. We call these epistatic interactions, and they’re everywhere.

Challenge 4: Limited Experimental Capacity

How do we figure out what variants actually do? The gold standard is experimental testing:

  • Introduce the variant into cells using CRISPR
  • Measure gene expression changes
  • Assess cellular phenotypes
  • Test in animal models

But this approach costs $10,000-$50,000 per variant and takes weeks to months. With 4-5 million variants per genome, experimental validation of everything is neither practical nor affordable.

We need computational predictions to prioritize which variants are worth experimental follow-up.


Learning Objectives

By the end of this chapter, you will be able to:

  • Distinguish between SNVs, indels, and structural variants and explain their relative frequencies
  • Describe the major genome sequencing technologies and their advantages for variant detection
  • Explain why noncoding variants are harder to interpret than coding variants
  • Identify the key functional categories of noncoding regulatory elements
  • Understand how variant allele frequency relates to functional impact
  • Explain why experimental validation cannot scale to millions of variants
  • Describe the concept of variant pathogenicity and why it’s complex for noncoding variants

5.1 Types of Genetic Variation

Human genetic diversity is beautiful and complex. Let’s start with the major categories of variation and what they mean biologically.

5.1.1 Single Nucleotide Variants (SNVs)

The most common type of genetic variation is the single nucleotide variant (SNV)—one DNA base differs from the reference genome.

Example:

Reference genome:  ...ATCG[A]GTCC...
Individual's DNA:  ...ATCG[G]GTCC...
                         ^
                     SNV: A→G

Key facts about SNVs:

  • Make up ~85% of all variants
  • ~3.5 million SNVs per individual genome
  • Most are inherited from parents (present in germline)
  • Some are de novo (new mutations not in parents)
  • Distribution: ~38,000 in coding exons, most in noncoding regions

Functional impact depends on location:

In protein-coding regions, SNVs can be:

  • Synonymous: Change the DNA but not the amino acid (due to genetic code redundancy)
    • Example: GAA → GAG (both code for glutamate)
    • Usually neutral, though can affect splicing or mRNA stability
  • Missense: Change one amino acid to another
    • Example: GAA (glutamate) → AAA (lysine)
    • Impact ranges from neutral to severe depending on:
      • Chemical properties (charged → uncharged is more disruptive)
      • Location (active site vs. surface)
      • Conservation (changing a conserved residue is riskier)
  • Nonsense: Create a premature stop codon
    • Example: TGG (tryptophan) → TGA (stop)
    • Usually severe—truncated protein often nonfunctional
    • Can trigger nonsense-mediated decay (mRNA destroyed)

In noncoding regions, SNVs can affect:

  • Transcription factor binding sites
  • Splicing signals
  • RNA secondary structure
  • Long-range regulatory interactions

5.1.2 Insertions and Deletions (Indels)

Indels are small insertions or deletions, typically 1-50 base pairs.

Example of a 2 bp deletion:

Reference:  ...ATCGAA[GT]CCTA...
Individual: ...ATCGAA[--]CCTA...
                      deletion

Key facts about indels:

  • ~500,000 indels per genome
  • ~15% of all variants
  • Can be inherited or de novo
  • Particularly important in repetitive sequences

In protein-coding regions:

  • Frameshift indels: Not a multiple of 3 bp
    • Shift the reading frame
    • Usually severe—completely alters downstream protein sequence
    • Example: Deletion of 2 bp shifts all subsequent codons
  • In-frame indels: Multiples of 3 bp
    • Add or remove amino acids without shifting frame
    • Impact varies (can be neutral or severe)
    • Example: 3 bp deletion removes one amino acid

In noncoding regions:

  • Can create or destroy regulatory motifs
  • Affect spacing between regulatory elements
  • Alter DNA shape and flexibility

5.1.3 Structural Variants (SVs)

Structural variants are large-scale genomic changes, typically >50 bp, often much larger (kilobases to megabases).

Types of SVs:

  1. Deletions: Large segments removed

    • Can delete entire genes or regulatory regions
    • Example: 22q11.2 deletion syndrome (DiGeorge syndrome)—3 megabase deletion affecting ~90 genes
  2. Duplications: Segments copied

    • Gene dosage effects (too much of a protein)
    • Example: PMP22 duplication causes Charcot-Marie-Tooth neuropathy
  3. Inversions: Segment flipped orientation

    • Can disrupt genes at breakpoints
    • Can separate genes from regulatory elements
  4. Translocations: Segments moved between chromosomes

    • Classic example: BCR-ABL fusion in chronic myeloid leukemia
  5. Copy number variants (CNVs): Deletions or duplications

    • ~5,000-10,000 per genome
    • Can affect gene dosage
    • Some are common and benign, others cause disorders

Detection challenges:

  • Require specialized analysis methods
  • Short-read sequencing can miss some SVs
  • Long-read sequencing (PacBio, Oxford Nanopore) improves detection
  • Often underappreciated in clinical diagnostics

5.1.4 Variant Allele Frequency and Functional Impact

Not all variants are created equal. One key predictor of functional impact is how common a variant is in the population.

The logic:

  • Variants with severe functional impact reduce reproductive fitness
  • Natural selection removes them from the population over generations
  • Therefore, common variants are usually neutral or mildly beneficial
  • Rare variants are enriched for those with functional impact

Allele frequency categories:

  • Common (>1% frequency):
    • Present in millions of people
    • Generally neutral or mildly deleterious
    • Examples: APOE ε4 allele (~15% frequency, risk factor for Alzheimer’s)
  • Low frequency (0.1-1%):
    • Present in thousands to millions
    • Mixed bag—some neutral, some with mild effects
  • Rare (<0.1%):
    • Present in dozens to thousands
    • Enriched for functional variants
    • Focus of clinical sequencing
  • Ultra-rare (<0.01%):
    • Seen in very few individuals or families
    • Highest enrichment for disorder-causing variants
    • But also includes recent neutral mutations
  • De novo (not in parents):
    • New mutation in the individual
    • ~70 de novo variants per person
    • Important in conditions affecting reproductive fitness

Important caveat: This is a statistical tendency, not a rule. Some common variants do have functional impact (e.g., sickle cell allele is protective against malaria). And most rare variants are still neutral—they’re just rare because they arose recently.


5.2 Genome Sequencing Technologies

Understanding genetic variation requires the ability to accurately read DNA sequences. Let’s explore the major technologies and their trade-offs.

5.2.1 From Microarrays to Whole-Genome Sequencing

DNA Microarrays (SNP chips)

  • Principle: Hybridization-based detection of known SNPs
  • Coverage: 500,000 to 5 million pre-selected SNPs
  • Advantages:
    • Cheap ($50-$100 per sample)
    • Fast (thousands of samples in parallel)
    • Great for population genetics and GWAS
  • Limitations:
    • Only detects pre-selected variants (can’t discover new ones)
    • Misses indels, structural variants, rare variants
    • Not useful for clinical diagnostics
  • Use cases: Ancestry testing (23andMe), GWAS studies, agriculture

Whole-Exome Sequencing (WES)

  • Principle: Sequence only protein-coding exons (~1.5% of genome)
  • Coverage: ~20,000 genes, ~180,000 exons
  • Advantages:
    • Cheaper than whole-genome ($300-$500)
    • High depth of coverage (>100×)
    • Enriched for functional variants in genes
    • Standard for clinical diagnostics
  • Limitations:
    • Misses noncoding variants (promoters, enhancers, etc.)
    • Variable coverage at exon boundaries
    • Misses ~2% of exons due to capture inefficiency
  • Use cases: Clinical diagnostics, studies of Mendelian disorders

Whole-Genome Sequencing (WGS)

  • Principle: Sequence the entire genome
  • Coverage: All 3.2 billion base pairs
  • Advantages:
    • Detects all variant types (SNVs, indels, SVs)
    • Covers noncoding regulatory regions
    • More uniform coverage than WES
    • Can detect repeat expansions, some epigenetic marks
  • Limitations:
    • More expensive ($600-$1,000, dropping fast)
    • Generates huge amounts of data (~100 GB per genome)
    • Harder to interpret (97% is noncoding)
  • Use cases: Research, comprehensive clinical diagnostics, population genomics

The trend: WGS costs are dropping rapidly. By 2025, WGS is approaching the cost of WES, making it increasingly the preferred method.

5.2.2 Short-Read vs. Long-Read Sequencing

Short-Read Sequencing (Illumina)

  • Read length: 150-300 base pairs
  • Advantages:
    • Extremely accurate (>99.9%)
    • High throughput (billions of reads)
    • Cost-effective
    • Dominant platform (>80% of market)
  • Limitations:
    • Struggles with repetitive regions
    • Can’t span some structural variants
    • Difficult to phase variants (determine if on same chromosome)
    • Hard to assemble de novo genomes
  • How it works: Sequencing by synthesis—fluorescently labeled nucleotides incorporated one at a time

Long-Read Sequencing

PacBio HiFi

  • Read length: 10,000-25,000 base pairs (10-25 kb)
  • Accuracy: ~99.9% (after circular consensus)
  • Advantages:
    • Spans repetitive regions
    • Can phase variants directly
    • Better SV detection
    • Reads through complex genomic regions
  • Use cases: De novo assembly, SVs, complex regions

Oxford Nanopore

  • Read length: Up to 2 million base pairs (ultra-long reads!)
  • Accuracy: ~95-99% (improving rapidly)
  • Advantages:
    • Longest reads available
    • Real-time sequencing
    • Portable (MinION device fits in your hand)
    • Can detect base modifications (methylation) directly
  • Unique feature: Measures ionic current changes as DNA passes through nanopore
  • Use cases: Structural variants, repeat expansions, rapid diagnostics

The future: Long-read sequencing is improving rapidly in accuracy and cost. Within 5-10 years, it may become the standard for comprehensive genomic analysis.

5.2.3 Sequencing Depth and Coverage

Depth and coverage are critical concepts for understanding sequencing quality.

Sequencing depth (or coverage): Number of times a given base is sequenced

  • Notation: “30× coverage” means each base is sequenced ~30 times on average
  • Higher depth = more confident variant calls

Typical depths:

  • WES: 80-100× in targeted regions
  • WGS: 30-40× for clinical, 10-15× for population studies
  • Single-cell: 0.01-1× (very sparse!)

Why does depth matter?

At 10× depth:

  • Can reliably detect homozygous variants
  • Will miss some heterozygous variants (need reads from both alleles)
  • Difficult to distinguish true variants from sequencing errors

At 30× depth:

  • Reliable heterozygous variant calling
  • Standard for clinical WGS
  • Good balance of cost and accuracy

At 100× depth:

  • Very high confidence
  • Can detect low-frequency somatic variants (cancer, mosaicism)
  • Expensive (3× more data than 30×)

Coverage uniformity also matters:

  • Some regions are hard to sequence (GC-rich, repetitive)
  • “30× average coverage” might mean some regions are only 5×
  • Exome capture creates variable coverage across exons

5.3 The Noncoding Genome: Regulatory Elements

The 98.5% of the genome that doesn’t code for proteins is not “junk DNA.” It’s a vast regulatory network controlling when, where, and how much genes are expressed.

5.3.1 Types of Regulatory Elements

Promoters

  • Location: Immediately upstream of gene start site (typically within 1 kb)
  • Function: Binding site for RNA polymerase and transcription factors
  • Size: ~100-1,000 bp
  • Key features:
    • Core promoter elements (TATA box, Inr, DPE)
    • Transcription factor binding sites
    • Often CpG-rich (clusters of CG dinucleotides)
  • Impact of variants: Can dramatically alter gene expression levels

Enhancers

  • Location: Can be far from target gene (often 50-500 kb away, sometimes >1 Mb!)
  • Function: Boost transcription of target genes in specific cell types/conditions
  • Size: ~50-1,500 bp
  • Key features:
    • Multiple transcription factor binding sites (TFBSs)
    • Cell-type specific activity
    • Can regulate multiple genes
    • DNA loops to contact promoters
  • Numbers: ~400,000-1,000,000 enhancers in human genome
  • Impact of variants: Can cause disorders by altering gene expression in specific tissues

Classic example: A variant in an enhancer 1 megabase from the SOX9 gene causes campomelic dysplasia (skeletal malformation). The variant doesn’t touch the gene, but disrupts its expression during bone development.

Silencers

  • Function: Repress gene transcription
  • Mechanism: Recruit repressive transcription factors
  • Less understood: Historically harder to identify than enhancers
  • Impact: Variants can cause inappropriate gene activation

Insulators

  • Function: Block enhancer-promoter interactions
  • Mechanism: Create boundaries between regulatory domains
  • Key protein: CTCF (binds insulator sites)
  • Impact: Variants can allow enhancers to act on wrong genes

Splice Sites

  • Location: Exon-intron boundaries
  • Function: Determine which exons are included in mature mRNA
  • Critical sequences:
    • Donor site (GT at start of intron): ...exon|GT...intron
    • Acceptor site (AG at end of intron): ...intron...AG|exon...
    • Branch point (A typically ~20-50 bp before acceptor)
  • Impact: Variants can cause:
    • Exon skipping (entire exon missing from mRNA)
    • Intron retention (intron not removed)
    • Cryptic splice site activation (wrong site used)
    • These often lead to frameshift or nonsense-mediated decay

Example: Many variants in the BRCA1 gene associated with cancer risk are actually splice site variants, not amino acid changes.

5.3.2 Chromatin States and Epigenetic Marks

DNA doesn’t exist naked in the cell—it’s wrapped around histone proteins, forming chromatin. The state of chromatin determines which parts of the genome are accessible.

Histone modifications are chemical tags on histone proteins that mark functional states:

  • H3K4me3 (histone H3, lysine 4, tri-methylated):

    • Mark of active promoters
    • Sharp peaks at transcription start sites
    • Used to identify active genes
  • H3K4me1 (histone H3, lysine 4, mono-methylated):

    • Mark of enhancers
    • Broader regions
    • Often co-occurs with H3K27ac at active enhancers
  • H3K27ac (histone H3, lysine 27, acetylated):

    • Mark of active regulatory regions
    • Distinguishes active from poised/inactive enhancers
    • Strong predictor of functional impact
  • H3K27me3 (histone H3, lysine 27, tri-methylated):

    • Mark of repressed regions
    • Often at developmentally silenced genes
    • Polycomb-mediated repression
  • H3K9me3 (histone H3, lysine 9, tri-methylated):

    • Mark of heterochromatin
    • Permanently silent regions
    • Repetitive elements, centromeres

Chromatin accessibility (open vs. closed):

  • Open chromatin: DNA accessible to transcription factors
    • Measured by: DNase-seq, ATAC-seq
    • Indicates regulatory regions
  • Closed chromatin: DNA tightly wrapped, inaccessible
    • Inactive regulatory regions
    • Silent heterochromatin

Why this matters for variant interpretation:

  • Variants in open chromatin with active histone marks are more likely to have functional impact
  • Variants in closed, repressed chromatin are less likely to matter
  • Cell-type specificity: Same variant might be in open chromatin in neurons but closed in liver cells

5.4 Experimental Data: The Foundation for AI Models

To build AI models that predict variant impacts, we need training data from experiments. Let’s look at the major data types.

5.4.1 Chromatin Immunoprecipitation Sequencing (ChIP-seq)

What it measures: Where specific proteins bind to DNA genome-wide

How it works:

  1. Crosslink proteins to DNA (formaldehyde)
  2. Fragment DNA into small pieces
  3. Use antibodies to pull down specific protein (e.g., a transcription factor)
  4. Sequence the DNA fragments that were bound
  5. Map reads to genome to find binding sites

Data output:

  • Peaks (regions with high read counts)
  • Indicates where protein was bound
  • Resolution: ~100-300 bp

Common applications:

  • Transcription factor ChIP-seq: Find binding sites for specific TFs
  • Histone modification ChIP-seq: Map H3K4me3, H3K27ac, etc.
  • RNA Polymerase II ChIP-seq: Find actively transcribed regions

Example dataset: ENCODE has ChIP-seq data for:

  • ~600 transcription factors
  • Dozens of histone modifications
  • Across ~100 cell types
  • Millions of regulatory elements characterized

5.4.2 Assays for Chromatin Accessibility

DNase-seq (DNase I hypersensitivity sequencing)

  • Principle: DNase I enzyme cuts accessible DNA
  • Measures: Regulatory elements, TF footprints
  • Resolution: ~50-200 bp
  • Advantages: Gold standard, high resolution
  • Limitations: Requires many cells, technically challenging

ATAC-seq (Assay for Transposase-Accessible Chromatin)

  • Principle: Tn5 transposase inserts sequencing adapters into accessible DNA
  • Measures: Open chromatin regions
  • Resolution: ~50-200 bp
  • Advantages:
    • Easy protocol
    • Requires few cells (50,000 cells, or even single cells!)
    • Fast
  • Limitations: More background noise than DNase-seq
  • Status: Becoming the dominant method

5.4.3 Chromosome Conformation Capture (3C and derivatives)

DNA forms 3D structures in the nucleus. Distant genomic regions physically contact each other, especially enhancer-promoter pairs.

Hi-C: Genome-wide contacts

  • Measures: All pairwise interactions across the genome
  • Resolution: 1 kb to 1 Mb (depends on sequencing depth)
  • Reveals:
    • Topologically associating domains (TADs)
    • A/B compartments (active vs. inactive)
    • Specific enhancer-promoter loops
  • Data size: Huge (billions of read pairs per experiment)

Promoter-Capture Hi-C

  • Focus: Interactions involving promoters specifically
  • Advantages: Enriched for functional regulatory contacts
  • Use: Linking enhancers to target genes

Why this matters:

  • An enhancer variant only matters if it regulates an important gene
  • Hi-C tells us which genes an enhancer likely controls
  • Critical for interpreting noncoding variants

5.4.4 Massively Parallel Reporter Assays (MPRAs)

The functional testing problem: We can measure chromatin states, but do variants actually affect gene expression?

MPRA solution: Test thousands of variants simultaneously

How it works:

  1. Synthesize thousands of DNA sequences (with and without variants)
  2. Each sequence gets a unique DNA barcode
  3. Insert into reporter constructs (sequence → barcode → reporter gene)
  4. Transfect into cells
  5. Measure reporter gene expression by sequencing barcodes
  6. Compare expression between variant and reference sequences

Example results:

  • “Variant A increases expression 2-fold” → likely functional
  • “Variant B has no effect” → likely neutral
  • Test 10,000+ variants in a single experiment

Limitations:

  • Not in native genomic context
  • May miss long-range effects
  • Cell type and condition specific
  • Expensive and technically challenging

But: Provides ground truth labels for training AI models!


5.5 The Variant Interpretation Challenge

Now we can see the full scope of the problem. Let’s bring it together.

5.5.1 The Clinical Genetics Workflow

When a patient gets whole-genome or whole-exome sequencing, here’s what happens:

Step 1: Variant Calling

  • Align sequencing reads to reference genome
  • Identify positions that differ (variants)
  • Filter out low-quality calls
  • Output: 4-5 million variants (WGS) or ~50,000 (WES)

Step 2: Annotation

  • Determine genomic context of each variant
    • Coding vs. noncoding
    • Gene name
    • Consequence type (missense, synonymous, splice site, etc.)
  • Add population frequencies from databases (gnomAD)
  • Add known associations (ClinVar, OMIM)

Step 3: Filtering

  • Remove common variants (>1% frequency) → Usually neutral
  • Focus on rare variants in genes relevant to phenotype
  • Prioritize by predicted functional impact
  • This is where AI comes in!

Step 4: Interpretation

  • Review remaining candidate variants
  • Check literature on genes
  • Evaluate evidence for functional impact
  • Assess whether variant explains patient phenotype

Step 5: Reporting

  • Classify variants:
    • Disorder-causing (pathogenic)
    • Likely disorder-causing (likely pathogenic)
    • Uncertain significance (VUS)
    • Likely neutral (likely benign)
    • Neutral (benign)
  • Report clinically relevant findings to physician

The bottleneck: Steps 3-5 require expert manual review. A clinical geneticist might spend 4-8 hours per case. We’re sequencing faster than we can interpret.

5.5.2 Why Noncoding Variants Are Harder

Coding variants have clear rules:

  • Stop codons are usually bad
  • Frameshifts are usually bad
  • Changing a conserved amino acid in an active site is probably bad
  • We understand the genetic code perfectly

Noncoding variants are murkier:

  • We don’t know where all regulatory elements are
  • Even when we know location, we don’t fully understand the code
    • Enhancers have complex grammars of multiple TF binding sites
    • Spacing and orientation matter
    • Context-dependent (cell type, developmental stage)
  • Effects can be subtle (20% change in expression, not complete loss)
  • Variants can affect tissue-specific expression (hard to detect experimentally)

Example challenge:

  • A variant in an enhancer might reduce expression of gene X by 30% specifically in developing neurons during weeks 8-12 of embryonic development
  • No way to test this experimentally in humans
  • Even patient-derived neurons in a dish won’t perfectly recapitulate developmental timing
  • We need computational predictions

5.5.3 The Promise of AI

This is where AI shines. AI models can:

  1. Learn patterns from large-scale experimental data

    • ENCODE, Roadmap Epigenomics: thousands of experiments
    • Learn what makes a functional enhancer
    • Learn chromatin signatures of regulatory elements
  2. Integrate multiple data types

    • Sequence + chromatin state + conservation + 3D structure
    • Capture complex interactions humans can’t easily describe
  3. Make predictions for every possible variant

    • Once trained, predict in silico (no experiments needed)
    • Fast: millions of variants in minutes
    • Consistent: same variant always gets same score
  4. Prioritize variants for experimental validation

    • Rank by predicted impact
    • Focus experimental resources on high-priority variants

What we need from AI models:

  • Accurate predictions of functional impact
  • Calibrated confidence scores (know when predictions are uncertain)
  • Interpretable results (why did the model make this prediction?)
  • Generalization across cell types and conditions

Summary

Key Takeaways

  • Human genomes contain 4-5 million variants compared to reference; most are neutral, but finding functional variants is critical for understanding genetic disorders
  • Variant types include SNVs (~85%), indels (~15%), and structural variants; each has distinct functional consequences and detection challenges
  • Sequencing technologies range from cheap microarrays (known variants only) to comprehensive WGS (all variant types); long-read sequencing is improving detection of complex variants
  • The noncoding genome (98.5%) contains vast regulatory networks including promoters, enhancers, silencers, and insulators; these control gene expression in cell-type and condition-specific ways
  • Experimental data (ChIP-seq, ATAC-seq, Hi-C, MPRAs) reveal regulatory element locations and functions, providing training data for AI models
  • Chromatin states (histone modifications, accessibility) indicate functional activity; variants in open, active chromatin are more likely to have functional impact
  • Clinical variant interpretation requires filtering millions of variants to identify a handful of candidates; manual review is time-consuming and doesn’t scale
  • Noncoding variants are especially challenging due to incomplete understanding of regulatory codes, context-dependence, and difficulty of experimental validation
  • AI’s promise is learning patterns from large-scale data to predict variant impacts accurately, consistently, and at scale

📖 Key Terms
Term Definition
Allele frequency Proportion of individuals in a population carrying a particular variant; rare variants are enriched for functional impacts
ATAC-seq Assay for Transposase-Accessible Chromatin; uses Tn5 transposase to map open chromatin regions genome-wide
ChIP-seq Chromatin Immunoprecipitation sequencing; identifies genome-wide binding sites of specific proteins (transcription factors, modified histones)
De novo variant A genetic variant present in an individual but absent from both parents; arises as a new mutation
Enhancer Regulatory DNA sequence that increases gene transcription; can be located far from target gene and functions in cell-type-specific manner
Epistasis Interaction between different genetic variants; functional impact of one variant depends on presence of others
Frameshift Insertion or deletion that’s not a multiple of 3 base pairs; shifts reading frame and alters all downstream amino acids
Hi-C Chromosome conformation capture method that maps all pairwise 3D contacts across the genome; reveals enhancer-promoter interactions
Indel Insertion or deletion of DNA sequence, typically 1-50 base pairs in length
Missense variant SNV that changes one amino acid to another in a protein sequence
MPRA Massively Parallel Reporter Assay; tests functional impact of thousands of variants simultaneously by measuring reporter gene expression
Nonsense variant SNV that creates a premature stop codon, typically resulting in truncated protein
Promoter Regulatory region immediately upstream of gene that controls transcription initiation; binding site for RNA polymerase and transcription factors
Splice site Sequence at exon-intron boundary that directs splicing; variants can cause exon skipping or intron retention
Structural variant (SV) Large-scale genomic alteration >50 bp, including deletions, duplications, inversions, and translocations
Synonymous variant SNV that changes DNA sequence but not amino acid due to genetic code redundancy; usually neutral but can affect splicing
Variant of uncertain significance (VUS) A genetic variant whose functional impact and clinical relevance are unclear; requires additional evidence
Whole-exome sequencing (WES) Sequencing of protein-coding regions (~1.5% of genome); standard for clinical diagnostics
Whole-genome sequencing (WGS) Sequencing of entire genome including noncoding regions; provides comprehensive variant detection

Test Your Understanding

  1. Variant frequency and function: You find a missense variant in a cancer-related gene. In one database, it’s present at 0.01% frequency. In another, it’s at 2% frequency. How would these different frequencies affect your interpretation of whether this variant has functional impact? What might explain the discrepancy?

  2. Technology trade-offs: A research team wants to study structural variants associated with autism spectrum disorder. They have budget for either: (a) short-read WGS of 1,000 individuals at 30× coverage, or (b) long-read WGS of 200 individuals at 30× coverage. What would you recommend and why? What types of variants might be missed or gained with each approach?

  3. Noncoding complexity: Imagine a variant in an enhancer 800 kb upstream of a neurodevelopmental gene. The variant is heterozygous (one copy affected). It reduces enhancer activity by 25% specifically in developing neurons. Why would this be difficult to detect experimentally? Why might patients with this variant show variable phenotypes?

  4. Experimental validation priorities: You’ve identified 50 rare variants in regulatory elements near genes associated with congenital heart defects. You can functionally test 5 variants using MPRA. What additional information would help you prioritize which 5 to test? Consider sequence features, chromatin states, conservation, and Hi-C data.

  5. Epistasis: Two variants individually have no functional impact, but when present together in the same individual, they cause a developmental disorder. Why does this make clinical interpretation challenging? How might AI models help (or struggle) with detecting such epistatic interactions?

  6. De novo variants: A patient has a severe developmental disorder, but trio sequencing reveals only 2 de novo coding variants, both missense variants in genes not previously associated with any disorder. Neither variant is in a highly conserved position. How would you approach interpreting whether one of these variants is responsible? What if the causative variant were actually in a noncoding regulatory element?

  7. Cell-type specificity: A variant falls within an enhancer that’s active only in pancreatic beta cells during embryonic development (weeks 8-12). The patient has neonatal diabetes. Why is it essentially impossible to functionally validate this variant in the patient’s cells? How could AI models trained on embryonic pancreas chromatin data help?

  8. Coverage considerations: When doing clinical WGS, some labs use 30× coverage while others use 60×. For a heterozygous rare variant, why might 30× coverage sometimes miss the variant? If budget is limited, what’s better: sequencing more patients at lower coverage or fewer patients at higher coverage?


Further Reading

Foundational Papers

  • Auton et al. (2015) “A global reference for human genetic variation.” Nature 526:68-74.

    • The 1000 Genomes Project final paper; comprehensive catalog of human variation
  • Karczewski et al. (2020) “The mutational constraint spectrum quantified from variation in 141,456 humans.” Nature 581:434-443.

    • The gnomAD v2 paper; essential resource for variant frequencies
  • ENCODE Project Consortium (2012) “An integrated encyclopedia of DNA elements in the human genome.” Nature 489:57-74.

    • The original ENCODE paper; foundation for understanding noncoding genome

Reviews for Understanding Regulatory Elements

  • Deplancke et al. (2016) “The genetics of transcription factor DNA binding variation.” Cell 166:538-554.

  • Shendure & Akey (2015) “The origins, determinants, and consequences of human mutations.” Science 349:1478-1483.

  • Maurano et al. (2012) “Systematic localization of common disease-associated variation in regulatory DNA.” Science 337:1190-1195.

Sequencing Technologies

  • Wenger et al. (2019) “Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome.” Nature Biotechnology 37:1155-1162.

  • Sedlazeck et al. (2018) “Accurate detection of complex structural variations using single-molecule sequencing.” Nature Methods 15:461-468.

Clinical Variant Interpretation

  • Richards et al. (2015) “Standards and guidelines for the interpretation of sequence variants.” Genetics in Medicine 17:405-424.

    • ACMG guidelines for variant classification; clinical standard
  • Rehm et al. (2015) “ClinGen—the Clinical Genome Resource.” New England Journal of Medicine 372:2235-2242.

Online Resources

  • gnomAD Browser (gnomad.broadinstitute.org) — Explore variant frequencies in ~140,000 individuals
  • ENCODE Portal (encodeproject.org) — Download ChIP-seq, ATAC-seq, and other functional genomics data
  • UCSC Genome Browser (genome.ucsc.edu) — Visualize genomic regions with annotations
  • ClinVar (ncbi.nlm.nih.gov/clinvar) — Database of clinically relevant variants with interpretations

What’s Next?

Now that you understand the scale and complexity of genetic variation, we’re ready to explore how earlier computational tools attempted to tackle variant interpretation.

In Chapter 6, we’ll explore:

  • Why evolutionary conservation is predictive of functional impact
  • Traditional tools like SIFT and PolyPhen-2
  • Conservation scoring methods (GERP, phyloP, phastCons)
  • How these approaches prioritize variants for clinical investigation
  • The limitations that motivate machine learning approaches

Before moving on, make sure you:

  • Understand the differences between SNVs, indels, and SVs
  • Can explain why rare variants are enriched for functional impacts
  • Know the major types of noncoding regulatory elements
  • Appreciate the scale challenge (millions of variants per genome)
  • Have explored the gnomAD or ENCODE websites

Ready? Let’s see how conservation analysis and traditional tools approach variant interpretation!