If sequencing technology can read all 3.2 billion base pairs of the human genome, why would we choose to sequence only part of it? It’s a bit like asking: if you’re looking for a lost key in your house, why not search every room, every drawer, every pocket? The answer is simple: time, cost, and practicality.
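In genomics, we face a similar trade-off. You can sequence the entire genome (whole-genome sequencing, WGS) or only its protein-coding portion (whole-exome sequencing, WES).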
The choice between these approaches has shaped how we discover disease genes, diagnose patients, and conduct research. Let’s understand what each approach offers and when to use which one.
Before we compare WGS and WES, we need to understand what the exome is and why it’s special.
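Your genome contains about 20,000 protein-coding genes. But genes aren’t continuous stretches of DNA. Instead, each gene is split into exons, the segments that end up in the finished RNA message, and introns, the intervening segments that are cut out.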
When cells make proteins, they transcribe the entire gene into RNA, then cut out the introns and splice the exons together. The final messenger RNA (mRNA) contains only exonic sequence, which is then translated into protein.
The collection of all exons in the genome is called the exome. It makes up only 1-2% of your total DNA—about 30-50 million base pairs out of 3.2 billion. But here’s why it’s important:
About 85% of known disease-causing mutations occur in exons.
Why? Because exons directly code for proteins. A single base change in an exon can swap one amino acid for another (missense) or create a premature stop signal (nonsense), and inserting or deleting a base or two can shift the reading frame (frameshift). Any of these can break a protein’s function, potentially causing disease. Changes in introns or intergenic regions can affect gene regulation or RNA splicing, but they’re less likely to have dramatic effects.
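The consequence types named above can be told apart mechanically by translating the coding sequence before and after a change. The following is a minimal sketch using the standard genetic code; the function names `translate` and `classify` and the toy sequence are illustrative assumptions, not part of any annotation tool.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon ('*')."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)


def classify(ref_cds, alt_cds):
    """Rough consequence label for a change to a coding sequence (toy rules)."""
    if (len(alt_cds) - len(ref_cds)) % 3 != 0:
        return "frameshift"          # length change not a multiple of 3
    ref_p, alt_p = translate(ref_cds), translate(alt_cds)
    if ref_p == alt_p:
        return "synonymous"          # protein unchanged
    if len(ref_cds) != len(alt_cds):
        return "in-frame indel"
    if "*" in alt_p and len(alt_p) < len(ref_p):
        return "nonsense"            # new premature stop
    return "missense"                # amino acid substitution


REF = "ATGGAGGAGTGGTAA"                    # toy gene: Met-Glu-Glu-Trp-stop
print(classify(REF, "ATGGTGGAGTGGTAA"))    # GAG -> GTG (Glu -> Val)
print(classify(REF, "ATGGAGGAGTGATAA"))    # TGG -> TGA (Trp -> stop)
print(classify(REF, "ATGGAGAGTGGTAA"))     # one base deleted
```

Real annotators also handle splice sites, stop-loss changes, and multiple transcripts, none of which this sketch attempts.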
This high concentration of disease variants in exons created an opportunity: if most pathogenic variants are in just 1-2% of the genome, we could focus our sequencing there.
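As a quick check on the "1-2%" figure, the fraction can be computed directly from the sizes quoted above. The helper name `exome_fraction` is just for illustration.

```python
GENOME_BP = 3_200_000_000  # approximate human genome size used in this article


def exome_fraction(exome_bp, genome_bp=GENOME_BP):
    """Fraction of the genome covered by an exome of the given size."""
    return exome_bp / genome_bp


for exome_bp in (30_000_000, 50_000_000):
    print(f"{exome_bp / 1e6:.0f} Mb exome = {exome_fraction(exome_bp):.2%} of the genome")
```

The two ends of the range come out just under 1% and about 1.6%, which is where the rounded "1-2%" comes from.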
Before diving into details, let’s see how these approaches compare:
Feature | Whole-Exome Sequencing (WES) | Whole-Genome Sequencing (WGS) |
---|---|---|
Coverage | ~1-2% of genome (exons only) | 100% of genome |
Target size | 30-50 million bases | 3.2 billion bases |
Data per sample | ~6 GB | ~90-100 GB |
Cost (2024) | ~$400-500 | ~$600-1,000 |
Typical depth | 100-150× | 30-40× |
SNVs/indels in exons | ✓ Excellent | ✓ Excellent |
Structural variants | ✗ Mostly missed | ✓ Detected |
Non-coding variants | ✗ Missed | ✓ Detected |
Regulatory regions | ✗ Missed | ✓ Detected |
Repeat expansions | ✗ Missed | ✓ Detected (especially with long reads) |
Analysis complexity | Lower—fewer variants | Higher—millions of variants |
Interpretation | Easier—focus on coding | Harder—non-coding interpretation uncertain |
Diagnostic yield (rare disease) | 25-50% | 30-55% (slightly higher) |
Best for | Known coding disorders, cost-sensitive studies | Complex cases, structural variants, discovery |
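The gap in data volume follows from simple arithmetic: the sequence you need is roughly target size times mean depth. The sketch below uses a 40 Mb exome and depths taken from the middle of the ranges in the table; these specific values are assumptions for illustration. It counts bases only, so it ignores off-target reads in WES and is not the same thing as file size on disk, which depends on format and compression.

```python
def bases_needed(target_bp, mean_depth):
    """Approximate sequenced bases required to reach a mean depth over a target."""
    return target_bp * mean_depth


wes_bases = bases_needed(40_000_000, 125)      # assumed 40 Mb exome at 125x
wgs_bases = bases_needed(3_200_000_000, 30)    # whole genome at 30x

print(f"WES: {wes_bases / 1e9:.0f} gigabases")
print(f"WGS: {wgs_bases / 1e9:.0f} gigabases")
print(f"WGS needs about {wgs_bases / wes_bases:.0f}x more sequence")
```

Even though WES is sequenced several times deeper, the much smaller target keeps the total far lower.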
WES doesn’t actually sequence only exons—that would be technically challenging. Instead, it uses a clever trick called target capture or enrichment:
Think of it like using a magnet to pick out just the metal pieces from a mixed pile of materials.
Figure: WES Process with Target Capture. This diagram shows the key step that distinguishes WES from WGS: exome enrichment. After DNA extraction and fragmentation, biotinylated probes (baits) specifically bind to exonic sequences. These are then captured using streptavidin-coated magnetic beads, physically separating exonic fragments from the rest of the genome. This targeted approach concentrates sequencing effort on the ~1-2% of the genome that codes for proteins. Source: Microbe Notes
Commercial capture kits from vendors such as Agilent (SureSelect), Illumina (TruSeq), and Twist Bioscience each target slightly different exon sets with different capture efficiencies. Not all exons capture equally well: regions with extreme GC content, repetitive sequences, or complex secondary structures may have coverage gaps.
WES made its breakthrough in 2010 with a paper that became a model for the approach. Researchers were studying four affected individuals from three families with Miller syndrome, a rare developmental disorder causing facial and limb abnormalities. Traditional gene-hunting methods had failed.
The strategy was simple but revolutionary:
They found that every affected individual carried rare variants in both copies of a gene called DHODH, the pattern expected for a recessive disorder. DHODH had never been linked to any human disease before. This discovery took months instead of years and cost thousands instead of millions.
This paper launched the WES era. Suddenly, researchers could identify disease genes for rare disorders by sequencing just a few affected individuals.
WES quickly became the first-line genetic test for diagnosing rare diseases, with a diagnostic yield of 25-50%. In other words, for patients with suspected genetic disorders, WES identifies the genetic cause in about one-quarter to one-half of cases. This is remarkable considering these patients had often undergone years of testing with no diagnosis.
When WES works well:
When WES comes up short:
WGS is conceptually simpler than WES—you just sequence everything:
No target capture step. No enrichment. Just sequence the whole thing.
Figure: Whole-Genome Sequencing (WGS) Workflow. This diagram illustrates the complete process of WGS from sample to analysis. Unlike WES, this process captures all genomic regions including coding exons, introns, regulatory elements, and intergenic sequences. Source: Microbe Notes
Structural variants: Consider a patient with developmental delay and seizures. WES found nothing. With WGS, researchers discovered a large deletion removing three exons of a neurodevelopmental gene. WES had missed this because when exons are deleted there’s nothing for the baits to capture, and the resulting drop in coverage is hard to tell apart from the uneven coverage that capture normally produces. WGS, with its more uniform coverage, detected the deletion through the drop in read depth across that region. This happens in about 5-10% of cases where WES fails.
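Detecting a deletion from read depth can be sketched in a few lines: bin the genome, compare each bin's depth to the sample-wide median, and report runs of consecutive low bins. The function name, the 0.6 ratio cutoff, and the three-bin minimum are illustrative assumptions; production callers also use split and discordant reads and correct for GC content.

```python
from statistics import median


def find_depth_drops(bin_depths, ratio_cutoff=0.6, min_bins=3):
    """Return (start, end) bin ranges (end exclusive) with sustained low depth."""
    baseline = median(bin_depths)
    runs, start = [], None
    for i, depth in enumerate(bin_depths):
        low = depth < ratio_cutoff * baseline
        if low and start is None:
            start = i                          # a low-depth run begins
        elif not low and start is not None:
            if i - start >= min_bins:          # keep only sustained drops
                runs.append((start, i))
            start = None
    if start is not None and len(bin_depths) - start >= min_bins:
        runs.append((start, len(bin_depths)))
    return runs


# Toy 30x genome: bins 4-7 sit near half depth, as in a heterozygous deletion.
depths = [31, 29, 30, 32, 15, 14, 16, 15, 30, 28, 12, 31]
print(find_depth_drops(depths))
```

The isolated low bin near the end is ignored because a single noisy bin is not enough evidence, which is also why the much noisier depth profile of capture data makes this harder in WES.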
Regulatory variants: Some forms of beta-thalassemia (reduced hemoglobin production) are caused by mutations in the regulatory regions of the HBB gene, not in the coding sequence. These would be missed by WES but are detectable with WGS.
Deep intronic variants: While WES captures exon-intron boundaries, deep intronic variants can create new splice sites or destroy existing ones. These “cryptic splice variants” can cause disease but lie beyond WES’s reach.
WGS opens up the study of variants in non-coding regions, which make up about 98% of the genome. Most non-coding sequence is probably functionally neutral, but some regions are critically important. Enhancers and promoters control when and where genes are expressed. A mutation in an enhancer can affect a gene hundreds of thousands of bases away, causing disease without touching the gene itself.
The challenge: We’re still learning which non-coding variants matter. The human genome contains millions of non-coding variants, and distinguishing functional from neutral ones is difficult. This is an active area of research.
Whether you’re sequencing a whole genome or just the exome, the basic workflow is similar—with one key difference for WES: an extra step to capture only the exons. This section walks through the laboratory and computational process that transforms a blood sample into a list of genetic variants.
Library preparation is the process of converting your DNA sample into a form that sequencing machines can read. The workflow differs significantly between short-read platforms (Illumina) and long-read platforms (PacBio, Nanopore), each optimized for their specific sequencing chemistry.
Illumina library prep is the most common approach for both WGS and WES, optimized for generating billions of short, highly accurate reads.
📺 Video Resource: For a detailed walkthrough of Illumina library preparation, watch this Illumina expert tutorial which covers best practices for Nextera library prep.
Overview of the Workflow:
1. DNA Extraction and QC
2. Fragmentation
3. End Repair and Adapter Ligation
4. PCR Amplification
5. Target Capture (WES Only!)
6. QC Check
Key Advantages: High accuracy (>99.9%), massive throughput (billions of reads), well-established protocols
Limitations: Short reads struggle with repetitive regions and structural variants
PacBio library prep creates circular DNA templates that enable multiple reads of the same molecule for high-accuracy long reads.
📺 Video Resource: For PacBio and Nanopore library preparation workflows, watch this comparative tutorial.
Overview of the Workflow:
1. DNA Extraction and QC
2. Fragmentation (Gentle!)
3. End Repair and Hairpin Adapter Ligation
4. No PCR Amplification (Usually!)
5. QC Check
The HiFi Advantage: The polymerase circles the SMRTbell template 10-20 times, reading the same DNA sequence repeatedly. A consensus is computed from these multiple passes, achieving >99.9% accuracy despite long read lengths.
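The idea behind consensus calling can be illustrated with a simple majority vote across passes. This is a deliberately simplified sketch: it assumes the passes are already aligned, equal in length, and contain only substitution errors, whereas real consensus software aligns the passes and uses a probabilistic model. The function name `consensus` is an assumption for illustration.

```python
from collections import Counter


def consensus(passes):
    """Majority base at each position across pre-aligned, equal-length passes."""
    return "".join(
        Counter(column).most_common(1)[0][0]   # most frequent base in this column
        for column in zip(*passes)
    )


# Five noisy passes over the same molecule; each error appears in only one pass.
passes = [
    "ACGTTAGC",
    "ACGTAAGC",
    "ACCTTAGC",
    "ACGTTAGC",
    "ACGTTTGC",
]
print(consensus(passes))
```

Because random errors rarely hit the same position in most passes, the vote recovers the underlying sequence even when individual passes are imperfect.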
Key Advantages: Long reads (10-20 kb), high accuracy, no PCR bias, detects DNA modifications
Limitations: Lower throughput than Illumina, higher cost per base
Nanopore library prep is the simplest and fastest, enabling ultra-long reads and even portable sequencing.
📺 Video Resource: See the comparative tutorial mentioned above for Nanopore workflows.
Overview of the Workflow:
1. DNA Extraction and QC
2. Minimal or No Fragmentation
3. Adapter Ligation
4. No PCR Amplification (PCR-Free)
5. Minimal QC
The Nanopore Advantage: Simplest library prep, real-time sequencing (see data as it’s generated), ultra-long reads (>100 kb, sometimes >1 Mb), portable devices (MinION is USB-sized).
Key Advantages: Ultra-long reads, fastest library prep, portable sequencing, real-time results, native DNA sequencing
Limitations: Higher per-base error rate than Illumina/PacBio (though improving rapidly), lower throughput per run
Feature | Illumina | PacBio HiFi | Oxford Nanopore |
---|---|---|---|
Library Prep Time | 4-8 hours | 3-6 hours | 10 min - 2 hours |
Read Length | 150-300 bp | 10-20 kb | 10-100+ kb |
Accuracy | >99.9% | >99.9% | 95-99% |
PCR Amplification | Yes (usually) | No (usually) | No |
Best For | WES, high-throughput WGS | Structural variants, phasing | Ultra-long reads, genome assembly |
Fragmentation | Required | Controlled | Minimal/none |
Special Feature | Target capture for WES | SMRTbell circular templates | Motor proteins, portability |
Bottom line: Choose your platform based on your scientific question:
The library is loaded onto a flow cell—a glass slide with millions of tiny spots where sequencing happens.
Cluster generation: Library fragments bind to oligonucleotides on the flow cell surface, then through “bridge amplification” create dense clusters of ~1,000 identical copies. Single molecules don’t produce enough signal to detect—clusters amplify the signal.
Sequencing by Synthesis (SBS):
This produces millions of short reads simultaneously—typically 2×150 bp (paired-end sequencing, reading both ends of each fragment).
Typical output:
PacBio uses SMRT Cells containing 25 million tiny wells called zero-mode waveguides (ZMWs). Each well holds a single DNA polymerase molecule.
Single-Molecule Real-Time (SMRT) sequencing:
Read lengths: 15,000-20,000 bp average, with some exceeding 100,000 bp. These long, accurate reads can span entire genes, resolve structural variants, phase variants, and sequence through repetitive regions.
Nanopore sequencing threads DNA through protein pores in a membrane. As bases pass through, they disrupt an electrical current in characteristic ways, enabling ultra-long reads (>100,000 bp, sometimes >1 million bp), real-time results, and portable devices like the USB-sized MinION.
After sequencing, you have millions or billions of reads. Now comes the computational challenge: figuring out what your genome looks like and how it differs from the reference.
Each base call comes with a quality score (Q score):
FastQC checks per-base quality, sequence duplication levels, adapter contamination, and GC content distribution. Low-quality bases (usually at read ends) and adapter sequences are trimmed.
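Quality scores follow the Phred convention, Q = -10 log10(P), where P is the probability that the base call is wrong. The sketch below converts scores to error probabilities, decodes a Phred+33 quality string as used in modern FASTQ files, and trims a low-quality tail. The function names and the Q20 cutoff are illustrative assumptions rather than the behavior of any particular trimming tool.

```python
def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string):
    """Convert a FASTQ quality string (Phred+33 encoding) to integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_low_quality_tail(seq, quality_string, min_q=20):
    """Drop trailing bases whose quality is below min_q."""
    quals = decode_phred33(quality_string)
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quality_string[:end]


for q in (10, 20, 30, 40):
    print(f"Q{q}: 1 error in {round(1 / error_probability(q)):,} bases")

print(trim_low_quality_tail("ACGTACGT", "IIIIII#!"))
```

In the example read, the last two bases have scores of 2 and 0 and are removed, while the first six (score 40) are kept.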
Your reads need to be mapped back to their original genomic positions.
Alignment tools:
The aligner compares each read to the reference genome (e.g., GRCh38 or T2T-CHM13), finds the best-matching location, and produces a BAM file (Binary Alignment Map).
Challenges: Repetitive regions create ambiguity for short reads. Structural variants can cause incorrect alignment or prevent alignment entirely.
PCR created many identical copies of some original DNA molecules. Tools like Picard MarkDuplicates identify and mark duplicates based on identical mapping positions and sequences. Duplicates inflate coverage artificially and can bias variant calling.
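The core idea of duplicate marking can be sketched as grouping read pairs by where they map and keeping one representative per group. This is a simplified illustration, not Picard's actual algorithm; the dictionary keys (`name`, `chrom`, `start`, `mate_start`, `strand`, `mean_qual`) are hypothetical field names chosen for the example.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Reads sharing chromosome, start, mate start, and strand are treated as
    copies of one original molecule; the highest-quality copy is kept.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["mate_start"], read["strand"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["mean_qual"])
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates


reads = [
    {"name": "r1", "chrom": "chr1", "start": 100, "mate_start": 350, "strand": "+", "mean_qual": 35.0},
    {"name": "r2", "chrom": "chr1", "start": 100, "mate_start": 350, "strand": "+", "mean_qual": 30.0},
    {"name": "r3", "chrom": "chr1", "start": 100, "mate_start": 360, "strand": "+", "mean_qual": 20.0},
]
print(mark_duplicates(reads))
```

Here `r2` is flagged because it maps exactly like `r1` but with lower quality, while `r3` has a different mate position and is treated as an independent fragment.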
Sophisticated software tools use statistical models to distinguish real variants from sequencing errors.
Popular tools:
How it works: For each genome position, the caller stacks up all covering reads, counts bases, calculates likelihood of a real variant versus error, considers quality scores, checks depth (20-30+ reads give confidence), and applies filters.
Example:
Output: A VCF file (Variant Call Format) listing all detected variants with position, reference/alternate bases, quality score, genotype, and read depth.
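To make the "likelihood of a real variant versus error" step concrete, here is a minimal sketch of a diploid genotype likelihood calculation at one position. It is a textbook-style simplification under stated assumptions (independent reads, errors spread evenly over the three other bases, no priors, base qualities of at least 1) and is not the model used by any specific caller. The function names are illustrative.

```python
import math


def genotype_likelihoods(pileup, ref, alt):
    """Log10 likelihood of each diploid genotype given (base, phred) observations."""
    logl = {"0/0": 0.0, "0/1": 0.0, "1/1": 0.0}
    for base, q in pileup:
        e = 10 ** (-q / 10)                      # chance this base call is wrong
        p_if_ref = 1 - e if base == ref else e / 3
        p_if_alt = 1 - e if base == alt else e / 3
        logl["0/0"] += math.log10(p_if_ref)
        logl["0/1"] += math.log10(0.5 * p_if_ref + 0.5 * p_if_alt)
        logl["1/1"] += math.log10(p_if_alt)
    return logl


def call_genotype(pileup, ref, alt):
    """Pick the genotype with the highest likelihood."""
    logl = genotype_likelihoods(pileup, ref, alt)
    return max(logl, key=logl.get)


# 30 reads at Q30: 14 support the reference A, 16 support G.
balanced = [("A", 30)] * 14 + [("G", 30)] * 16
# 30 reads at Q30: a single G among 29 A's looks like a sequencing error.
one_error = [("A", 30)] * 29 + [("G", 30)]

print(call_genotype(balanced, "A", "G"))
print(call_genotype(one_error, "A", "G"))
```

The roughly even split is best explained by a heterozygous genotype, while a lone mismatching read is cheaper to explain as an error than as a variant.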
Initial variant calls contain false positives. Filtering removes these based on:
Result: A high-confidence variant call set—thousands to millions of variants depending on WGS versus WES.
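Hard filtering of this kind can be sketched as a handful of threshold checks. The field names (`qual`, `depth`, `alt_reads`) and every threshold below are hypothetical values chosen for illustration; real pipelines tune their own criteria and often use more sophisticated, model-based filters.

```python
def passes_filters(variant, min_qual=30.0, min_depth=20, min_alt_fraction=0.2):
    """Return True if a variant call clears all illustrative hard filters."""
    if variant["depth"] == 0 or variant["depth"] < min_depth:
        return False                              # too few reads to trust
    if variant["qual"] < min_qual:
        return False                              # caller was not confident
    alt_fraction = variant["alt_reads"] / variant["depth"]
    return alt_fraction >= min_alt_fraction       # enough reads support the change


calls = [
    {"id": "var1", "qual": 60.0, "depth": 35, "alt_reads": 17},
    {"id": "var2", "qual": 12.0, "depth": 40, "alt_reads": 20},   # low quality
    {"id": "var3", "qual": 55.0, "depth": 8, "alt_reads": 4},     # low depth
    {"id": "var4", "qual": 50.0, "depth": 50, "alt_reads": 3},    # weak support
]
print([c["id"] for c in calls if passes_filters(c)])
```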
Annotation tools (VEP, ANNOVAR, SnpEff) add biological context:
Genomic location: Is this in a gene? Which one? In an exon, intron, or intergenic region?
Functional impact:
Population frequency: How common is this in databases like gnomAD? Common variants (>1%) are usually benign.
Clinical significance: Is it in ClinVar? What’s the classification (pathogenic, benign, VUS)?
Prediction scores: CADD, REVEL scores indicate likely deleteriousness and pathogenicity.
Output: An annotated VCF file, often converted to a spreadsheet for easier interpretation.
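Once variants are annotated, prioritization is largely a matter of filtering and sorting on the added fields. The sketch below assumes a simplified record with hypothetical keys (`gnomad_af`, `consequence`, `clinvar`, `cadd`); it is not the output format of VEP, ANNOVAR, or SnpEff, and the 1% frequency cutoff simply mirrors the rule of thumb above.

```python
PROTEIN_ALTERING = {"missense", "nonsense", "frameshift"}


def prioritize(variants, max_af=0.01):
    """Keep rare, protein-altering, not-known-benign variants, highest CADD first."""
    kept = [
        v for v in variants
        if v["gnomad_af"] <= max_af
        and v["consequence"] in PROTEIN_ALTERING
        and v["clinvar"] != "benign"
    ]
    return sorted(kept, key=lambda v: v["cadd"], reverse=True)


annotated = [
    {"id": "v1", "gnomad_af": 0.25, "consequence": "missense", "clinvar": "benign", "cadd": 8.0},
    {"id": "v2", "gnomad_af": 0.0001, "consequence": "missense", "clinvar": "VUS", "cadd": 24.0},
    {"id": "v3", "gnomad_af": 0.0, "consequence": "nonsense", "clinvar": "pathogenic", "cadd": 38.0},
    {"id": "v4", "gnomad_af": 0.0002, "consequence": "synonymous", "clinvar": "VUS", "cadd": 5.0},
]
print([v["id"] for v in prioritize(annotated)])
```

In practice this shortlist is then reviewed against the patient's symptoms and inheritance pattern rather than accepted automatically.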
Day 1-3: DNA extraction and library preparation (WES includes exome capture)
Day 4-5: Sequencing on NovaSeq 6000, generating ~80 million read pairs for WES
Day 6-7: Bioinformatics—QC, alignment, duplicate marking, variant calling and filtering
Day 8-10: Interpretation—filter for rare variants, focus on genes related to symptoms, prioritize high-impact variants, check ClinVar, validate with Sanger sequencing
Result: Genetic diagnosis made in ~10 days, compared to months or years with older approaches.
Patient: 8-year-old boy with intellectual disability, autism, and dysmorphic facial features
Testing strategy:
This deletion was missed by WES (no exons to capture) and too small for microarray. WGS provided the answer after other methods failed.
The cost gap between WES and WGS is shrinking rapidly:
Year | WES Cost | WGS Cost | Difference |
---|---|---|---|
2010 | ~$5,000 | ~$50,000 | 10× |
2015 | ~$1,000 | ~$5,000 | 5× |
2020 | ~$500 | ~$1,000 | 2× |
2024 | ~$400-500 | ~$600-1,000 | 1.5× |
As costs converge, the argument for WES weakens. Soon, WGS may cost the same as WES.
Storage needs:
For large biobanks, this 15× difference matters. The UK Biobank sequenced 500,000 genomes—at 90 GB each, that’s 45 petabytes requiring serious infrastructure.
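The storage arithmetic is easy to reproduce for any cohort size. This sketch uses the per-sample sizes quoted in this article and decimal units (1 PB = 1,000,000 GB); the function name is an illustrative assumption.

```python
GB_PER_PB = 1_000_000  # decimal petabyte


def cohort_storage_pb(n_samples, gb_per_sample):
    """Raw storage in petabytes for a cohort, before backups or replication."""
    return n_samples * gb_per_sample / GB_PER_PB


print(f"WGS, 500,000 samples: {cohort_storage_pb(500_000, 90):.0f} PB")
print(f"WES, 500,000 samples: {cohort_storage_pb(500_000, 6):.0f} PB")
```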
Sequencing is fast. Interpretation is slow.
As one geneticist put it: “We went from being starved for data to drowning in it.”
There’s an interesting argument emerging: WES might be a transitional technology.
You can do WGS but initially analyze only the exonic regions—getting WES-equivalent results while keeping the full dataset for later. This gives you:
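Restricting a genome-wide call set to exonic regions amounts to an interval lookup. The sketch below assumes exon coordinates in BED-style form (0-based start, end exclusive) that do not overlap within a chromosome; the function names and coordinates are made up for illustration.

```python
from bisect import bisect_right


def build_index(exons):
    """Group sorted exon starts and ends by chromosome for fast lookup."""
    index = {}
    for chrom, start, end in sorted(exons):
        starts, ends = index.setdefault(chrom, ([], []))
        starts.append(start)
        ends.append(end)
    return index


def in_exome(index, chrom, pos):
    """True if a 0-based position falls inside any exon interval."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect_right(starts, pos) - 1      # last exon starting at or before pos
    return i >= 0 and pos < ends[i]


exons = [("chr1", 100, 200), ("chr1", 500, 650), ("chr2", 50, 80)]
index = build_index(exons)

wgs_variants = [("chr1", 150), ("chr1", 300), ("chr1", 649), ("chr2", 80)]
print([v for v in wgs_variants if in_exome(index, *v)])
```

Variants outside the intervals are not discarded from the dataset; they are simply left out of the first-pass analysis and remain available for later review.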
WES still has advantages:
Most WGS and WES today use short-read sequencing (Illumina), which struggles with repetitive regions, structural variants, and phasing.
Long-read sequencing is changing this, with reads of 15,000-20,000 bp (PacBio HiFi) or more than 100,000 bp (Oxford Nanopore) that can:
The T2T-CHM13 genome (the first complete human genome) was built using long reads. As long-read WGS becomes more affordable, it will likely replace short-read approaches.
Whole-Exome Sequencing:
Whole-Genome Sequencing:
The trend is clear: We’re moving toward universal WGS. But for now, both approaches have their place, and the choice depends on your specific question, resources, and needs. The key insight: WES isn’t a subset of WGS in practice—it’s a different experimental design with different strengths and weaknesses. Understanding when to use each approach is an essential skill for modern geneticists and clinicians.