It is 2 PM on a Tuesday, and the weekly lab meeting has turned into a heated debate. A graduate student is presenting a rare variant in SCN1A — a sodium channel gene associated with epilepsy — found in a patient with developmental delays and uncontrolled seizures. She pulls up the prediction scores on the projector, and the room erupts.
“SIFT says it’s tolerated.”
“But PolyPhen says probably damaging — score of 0.89!”
“The CADD score is only 12. That’s below the usual cutoff of 15.”
“Wait — look at the conservation track. This residue is in a highly constrained region. PhyloP is 4.7.”
“It’s a leucine-to-proline substitution in a transmembrane helix. Prolines are helix breakers. Does anyone’s structural predictor have an opinion?”
“It appears once in gnomAD out of 125,000 alleles. That’s rare, but not absent.”
The PI leans back in her chair and looks at the ceiling. “So we have four tools, three contradictory opinions, one structural concern, and a frequency that proves nothing either way. Which do we trust?” A silence settles over the room.
This is the variant interpretation problem at its most honest — and it plays out in research labs and clinical genetics programs every day. No single tool commands enough authority to close the debate, because each tool was built to capture a different signal, each has its own blind spots, and the signals themselves sometimes point in opposite directions. What the field needed was not yet another individual tool, but a principled way to combine them. This chapter is about how machine learning ensemble methods provide exactly that.
Genetic variant interpretation faces a fundamental problem: no single feature perfectly predicts functional impact. Conservation matters, but not all conserved positions are functionally critical. Predicted protein changes matter, but not all amino acid substitutions affect function. Allele frequency matters, but rare variants aren’t always disorder-causing.
In Chapter 6, we explored individual prediction tools—each capturing one aspect of variant biology. But variants are complex, multifaceted entities. A variant in a highly conserved residue might still be neutral if it preserves chemical properties. A variant at a poorly conserved position might still have functional impact if it disrupts a critical binding site.
Why we can’t just pick the “best” tool:
First, there is no single “best” tool; different tools excel at different tasks. SIFT leans on conservation, PolyPhen-2 adds protein structure, and CADD covers the whole genome, so which tool “wins” depends on the variant in front of you.
Second, the scale of the problem is enormous. Whole-genome sequencing identifies 4-5 million variants per person. Whole-exome sequencing still yields 20,000-30,000 variants. Even after filtering to coding variants in known disorder genes, hundreds of candidates remain.
Third, individual tools make errors in predictable patterns. SIFT tends to be conservative (fewer false positives but more false negatives). PolyPhen-2 tends to be aggressive (fewer false negatives but more false positives). Understanding and compensating for these systematic biases requires analyzing thousands of validated examples—a job for machine learning.
The computational solution: Instead of choosing one tool or manually weighing evidence, we can train machine learning models to automatically learn how reliable each source of evidence is, weight the evidence accordingly, and combine it into a single prediction.
This approach, called ensemble learning, treats each prediction tool as a “voter” in a committee. Think of it like a scientific panel review—no single reviewer decides; the committee vote reduces individual bias. The machine learning model learns how much weight to give each voter, under what circumstances, and how to combine their votes into a final prediction. The result: predictions that are more accurate than any single tool alone.
By the end of this chapter, you will be able to explain why ensembles outperform individual tools, describe how CADD, DANN, MetaLR/MetaSVM, and REVEL are built and trained, interpret their scores, and choose the right method for a given variant type.
Imagine you’re trying to estimate the weight of a large pumpkin at a county fair. You could: (1) guess it yourself, (2) ask the single most experienced farmer, or (3) collect guesses from everyone at the fair and combine them.
Option 3 is ensemble learning. Even if no single person knows the exact weight, combining many estimates—especially when you know who tends to overestimate or underestimate—often gives remarkably accurate results.
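A toy sketch of why this works: many biased, noisy guesses, combined with a simple bias-corrected average, land close to the truth (all numbers below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
true_weight = 50.0  # kg, the answer nobody knows

# 100 guessers: each has a personal bias plus random error
biases = rng.normal(0, 5, size=100)
guesses = true_weight + biases + rng.normal(0, 3, size=100)

naive_average = guesses.mean()
# If we know each guesser's historical bias, we can subtract it first
corrected_average = (guesses - biases).mean()

print(f"naive avg = {naive_average:.1f}, bias-corrected avg = {corrected_average:.1f}")
```

The bias-corrected average is exactly what ensemble learning automates: it learns each voter’s systematic error from past examples and compensates for it.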
The same principle applies to variant prediction. We have dozens of “expert judges” (prediction tools), each with its own strengths and biases. Ensemble methods learn to combine them optimally.
Consider a simple example. Tool A correctly identifies 70% of variants with functional impact but has a 20% false positive rate. Tool B correctly identifies 65% but has only a 10% false positive rate. If you simply pick the “better” tool (A), you miss opportunities: the cases where B is right and A is wrong, and the extra confidence you should have when both tools agree.
An ensemble method can learn these patterns from thousands of validated examples, automatically discovering the best way to combine tools.
1. Feature Integration (CADD)
Combines raw features (conservation scores, frequencies, structural predictions) directly. The model learns which features matter and how to weight them. Think of this as giving the model the raw ingredients and letting it learn the recipe.
2. Meta-Prediction (MetaLR, MetaSVM, REVEL)
Combines outputs from existing prediction tools. Instead of learning from raw features, these models learn from what other tools predict. Think of this as combining expert opinions.
3. Neural Network Ensembles (DANN)
Uses neural networks instead of linear models to capture complex, nonlinear relationships between features. Can learn interactions between features that linear models miss.
Each approach has trade-offs in interpretability, computational cost, and accuracy. We’ll explore each in detail.
CADD (pronounced “cad”) was one of the first successful ensemble methods for variant interpretation. Published in 2014 and continuously updated, CADD scores are now among the most widely used in clinical genetics.
CADD asks a clever question: How different is this variant from what we’d expect to see in typical human genetic variation?
The reasoning: variants that cause disorders are rare in the population precisely because natural selection removes them. If we can quantify how “unusual” a variant is compared to common variation, we can estimate its likelihood of having functional impact.
CADD doesn’t directly predict “disorder-causing” vs “neutral.” Instead, it predicts “looks like common variants” vs “looks like rare, selected-against variants.” This subtle difference makes CADD applicable across many contexts.
CADD uses a brilliant training approach:
Positive examples (variants likely to be selected against): millions of simulated de novo variants. Because they come from a mutation model rather than from living people, they have never been filtered by natural selection, so damaging variants are still among them.
Negative examples (variants tolerated by selection): observed variants that arose in the human lineage and became common or fixed. Having persisted through many generations of selection, they are very unlikely to be strongly deleterious.
The model learns to distinguish “looks like it hasn’t been filtered by selection” (potentially functional) from “looks like it survived selection” (probably neutral).
CADD integrates 63 different features across multiple categories:
Sequence Conservation (15 annotations)
Functional Predictions (12 annotations)
Regulatory Annotations (20 annotations)
Transcript Annotations (8 annotations)
Population Genetics (8 annotations)
CADD doesn’t just use these 63 features—it learns their relative importance from data.
CADD uses a linear classifier that learns weights for each feature and combines them into a final score. You can think of this as a weighted vote: each of the 63 features casts a “vote,” and CADD learns how much to trust each voter.
[Optional: The Math]
CADD uses a Support Vector Machine (SVM) — a type of linear classifier. An SVM learns weights (w) for each feature (x) and combines them:
score = w₁x₁ + w₂x₂ + … + w₆₃x₆₃ + b
Where xᵢ are the 63 feature values, wᵢ are learned weights (positive weights increase the score; negative weights decrease it), and b is a bias term. For example, if PhyloP conservation is x₁ = 5.2 and its learned weight is w₁ = 0.3, it contributes 0.3 × 5.2 = 1.56 to the final score.
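A minimal sketch of that weighted sum in code (the weights, features, and bias below are invented placeholders; the real values come from training on millions of variants):

```python
import numpy as np

w = np.zeros(63)        # placeholder learned weights
x = np.zeros(63)        # placeholder feature values for one variant
w[0], x[0] = 0.3, 5.2   # e.g., PhyloP conservation, as in the text
b = -0.5                # placeholder bias term

raw_score = w @ x + b   # the SVM decision value: a dot product plus a bias
print(w[0] * x[0])      # PhyloP's contribution: 0.3 * 5.2 = 1.56
print(raw_score)        # 1.56 - 0.5 = 1.06
```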
CADD outputs scores on a phred-like scale: a scaled score of 10 means the variant ranks in the top 10% of all possible variants genome-wide for predicted deleteriousness, 20 means the top 1%, and 30 means the top 0.1%.
Higher scores indicate greater predicted deleteriousness.
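The conversion from a variant’s genome-wide rank to the phred-like scale is a one-liner (a sketch; `rank_fraction` is the fraction of all possible variants scoring as high or higher):

```python
import math

def phred_scale(rank_fraction: float) -> float:
    """Map a rank fraction (smaller = more deleterious-looking) to a
    CADD-style phred-like score."""
    return -10 * math.log10(rank_fraction)

print(phred_scale(0.01))   # top 1%   -> 20.0
print(phred_scale(0.001))  # top 0.1% -> 30.0
```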
Rough prioritization heuristics: scores below 10 rarely merit follow-up, 10-20 is a gray zone, and scores above 15-20 (the cutoffs most often quoted in clinical pipelines, as in the lab meeting debate) mark variants worth prioritizing.
Important caveats: CADD measures predicted deleteriousness, not clinical pathogenicity; the cutoffs above are screening heuristics, not diagnostic thresholds; and typical score distributions differ between genes, so a score that stands out in one gene may be unremarkable in another.
Think of a CADD score like a pathologist grading a biopsy: it combines multiple features (conservation, structural impact, population frequency) into a single severity grade. A high CADD score doesn’t make the final diagnosis, but it tells you which specimen to look at more closely.
Let’s return to the SCN1A variant from the lab meeting:
Variant: chr2:166,848,542 A>C (GRCh38)
Gene: SCN1A
Change: Leu1231Pro (leucine to proline at position 1,231)
CADD score: 28.3
What does CADD = 28.3 tell us? On the phred-like scale, 28.3 places this variant in roughly the top 0.15% of all possible variants genome-wide for predicted deleteriousness, comfortably above the common cutoffs of 15-20 and consistent with the conservation and structural signals raised in the meeting.
However, CADD alone doesn’t establish causation. We’d still need segregation with the disorder in the family, functional studies, absence in large control databases, and other clinical evidence.
CADD uses a linear model, which assumes features combine additively. But biological relationships are often nonlinear and interactive. For example, strong conservation may matter only when the substitution also changes amino acid chemistry, and a population frequency signal may deserve different weight in coding sequence than in regulatory sequence.
DANN (Deleterious Annotation of genetic variants using Neural Networks) addresses this by using a neural network instead of a linear model.
CADD computes: score = w₁x₁ + w₂x₂ + … + b
DANN computes: score = neural_network(x₁, x₂, …, x₆₃)
The neural network can learn complex, nonlinear relationships between the 63 features—interactions that a linear model cannot capture.
What does this enable? The network can learn context-dependent rules, for example that conservation evidence should count for more when the variant also disrupts protein structure, or that regulatory annotations matter only outside coding sequence.
DANN uses a neural network with input, hidden, and output layers. The same 63 features used by CADD serve as inputs. The network was trained with the same strategy as CADD (simulated vs. common variants).
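A minimal sketch of the idea using scikit-learn’s MLPClassifier as a stand-in for DANN’s deep network (the real model is larger and trained on millions of simulated vs. observed variants; the feature matrix and labels below are random placeholders):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 63))    # placeholder: 63 CADD-style annotations
y = rng.integers(0, 2, size=1000)  # placeholder: simulated (1) vs. common (0)

# Hidden layers let the model capture feature interactions that a
# linear classifier like CADD's SVM cannot represent.
net = MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=200, random_state=0)
net.fit(X, y)

scores = net.predict_proba(X)[:, 1]  # DANN-style 0-1 deleteriousness scores
```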
DANN shows modest but consistent improvements over CADD, both on ClinVar variants with known functional impact and on validated neutral variants.
The improvement seems small (2-3%), but at the scale of clinical genetics—where thousands of variants must be prioritized—this translates to hundreds of additional correctly classified variants.
CADD’s linear model is interpretable: you can see exactly how much each feature contributed to the final score. DANN’s neural network is a “black box”: you can’t easily explain why it gave a specific score.
This matters in clinical genetics. When counseling patients, being able to say “this variant scores high because it’s in a highly conserved region, disrupts protein structure, and is absent from population databases” is more compelling than “the neural network gave it a high score.”
Current practice: many labs report an interpretable score such as CADD in the clinical write-up and consult DANN as a secondary check when the interpretable tools disagree or sit near their cutoffs.
CADD and DANN work directly with 63 raw features. But what if you want to combine existing prediction tools (SIFT, PolyPhen-2, MutationTaster, etc.) without retraining everything from scratch?
Meta-predictors take the outputs of multiple existing tools as inputs and learn to combine them optimally.
Think of meta-prediction as a “second opinion” approach:
1. Run the variant through many existing tools (SIFT, PolyPhen-2, MutationTaster, and so on), collecting each tool’s score.
2. Train a machine learning model on these tool scores, using thousands of validated examples with known labels.
3. Output a meta-score that’s more accurate than any individual tool.
MetaLR and MetaSVM are sister methods that differ only in their machine learning algorithm: MetaLR uses logistic regression, while MetaSVM uses a support vector machine.
Both integrate scores from 10 existing tools, including SIFT, PolyPhen-2, MutationTaster, and CADD itself.
Notice that CADD is one of the inputs: meta-predictors can use other ensemble methods as features!
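A minimal MetaLR-style sketch (the score matrix and labels are random placeholders standing in for real tool outputs and validated variants; this illustrates the recipe, not the published model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
tool_scores = rng.uniform(size=(5000, 10))  # placeholder: 10 component tool scores
labels = rng.integers(0, 2, size=5000)      # placeholder: validated labels

meta = LogisticRegression()
meta.fit(tool_scores, labels)

print(meta.coef_)  # how much the meta-predictor learned to trust each tool
meta_score = meta.predict_proba(tool_scores[:1])[:, 1]  # 0-1 meta-score
```

Swapping LogisticRegression for a support vector machine (e.g., sklearn.svm.SVC) gives the MetaSVM variant of the same recipe.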
MetaLR and MetaSVM output scores from 0 to 1, with higher scores indicating greater predicted functional impact.
Clinical usage: labs typically demand a high score before counting the prediction as supporting evidence of pathogenicity.
These thresholds are chosen to maximize specificity (minimize false positives) in clinical settings.
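A sketch of how a specificity-first cutoff might be chosen on a validation set (the 95% target and the score array are illustrative assumptions, not published thresholds):

```python
import numpy as np

def threshold_for_specificity(neutral_scores: np.ndarray, target_spec: float = 0.95) -> float:
    """Choose the cutoff above which at most (1 - target_spec) of known-neutral
    variants are called damaging, guaranteeing the target specificity."""
    return float(np.quantile(neutral_scores, target_spec))

neutral_scores = np.random.default_rng(0).uniform(size=2000)  # placeholder validation data
cutoff = threshold_for_specificity(neutral_scores, target_spec=0.95)
print(f"call 'damaging' only above {cutoff:.2f}")
```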
Advantages: meta-predictors are cheap to build because they reuse scores that are already computed, they are easy to update when new component tools appear, and they inherit the combined strengths of their inputs.
Limitations: they also inherit their components’ shared blind spots, they cannot score a variant when a component score is missing, and circularity is a risk when the component tools were trained on the same validated variants as the meta-predictor.
Most ensemble methods treat all variants equally. But clinical genetics has a specific pain point: rare missense variants in exome sequencing data.
Why are rare missense variants challenging? They are too rare for population frequency to settle the question, their effects are subtler than those of truncating variants, and they make up the bulk of the unresolved candidates in a typical exome.
REVEL (Rare Exome Variant Ensemble Learner) was specifically designed for this problem.
REVEL makes several key choices:
1. Focus exclusively on missense variants
Doesn’t try to handle splice variants, indels, or regulatory variants. Allows optimization for the specific challenges of missense prediction.
2. Use only rare variants in training
Training exclusively on rare variants matches the clinical use case and prevents the model from leaning on allele frequency shortcuts that are useless when every candidate is rare.
3. Integrate a specific set of 13 tools
Selected for complementary strengths in missense variant prediction: MutPred, FATHMM v2.3, VEST 3.0, PolyPhen-2, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP++, SiPhy, phyloP, and phastCons.
4. Use a random forest classifier
Random forests are ensemble methods within an ensemble method!
What is a random forest?
Imagine you want to classify variants, so you train a decision tree:
                     Is SIFT < 0.05?
                    /               \
                  Yes                No
                   |                  |
         Is PolyPhen > 0.9?     Is PhyloP > 3?
            /        \             /       \
          Yes         No         Yes        No
           |           |          |          |
      Functional   Neutral   Functional   Neutral
One tree can overfit to training data and make errors. A random forest builds hundreds of trees, each trained on a random bootstrap sample of the training examples and a random subset of the features.
Then votes: the majority vote across all trees gives the final prediction.
Why this works: each tree overfits in its own way, but because the trees see different examples and different features, their errors are largely uncorrelated and tend to cancel when the votes are combined.
REVEL trains 500 decision trees and averages their predictions.
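A minimal REVEL-style sketch with scikit-learn (the 13-column feature matrix and labels are random placeholders; the published model was trained on curated rare missense variants):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(5000, 13))   # placeholder: scores from 13 component tools
y = rng.integers(0, 2, size=5000)  # placeholder: pathogenic (1) vs. neutral (0)

# 500 trees, each fit on a bootstrap sample with random feature subsets,
# mirroring the forest described above.
forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X, y)

revel_like_score = forest.predict_proba(X[:1])[:, 1]  # 0-1 score for one variant
```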
REVEL outputs scores from 0 to 1, calibrated specifically for rare missense variants; higher scores indicate greater predicted functional impact.
On rare missense variants specifically, this outperforms both individual tools and general-purpose ensemble methods.
Use REVEL for: rare missense variants in exome or gene panel data, the setting it was trained and calibrated on.
Don’t use REVEL for: splice variants, indels, regulatory variants, or anything that isn’t a missense change; its training never covered them.
Clinical tip: Many labs use a tiered approach: screen everything with a general-purpose score such as CADD, re-rank the rare missense candidates with REVEL, and send the remaining conflicts to expert review.
Let’s compare the ensemble methods we’ve covered:
| Method | Type | # Features | Variant Types | Best For | Output Scale |
|---|---|---|---|---|---|
| CADD | Linear (SVM) | 63 | All | General screening | Phred (10-40) |
| DANN | Neural network | 63 | All | Complex cases | 0-1 |
| MetaLR | Meta (logistic) | 10 tools | All | Combining tools | 0-1 |
| MetaSVM | Meta (SVM) | 10 tools | All | Combining tools | 0-1 |
| REVEL | Meta (random forest) | 13 tools | Missense only | Rare missense | 0-1 |
On a benchmark set of 5,000 ClinVar variants and 5,000 common variants:
| Method | Sensitivity | Specificity | AUC |
|---|---|---|---|
| SIFT | 66% | 82% | 0.84 |
| PolyPhen-2 | 72% | 79% | 0.87 |
| CADD | 77% | 88% | 0.91 |
| DANN | 79% | 89% | 0.92 |
| MetaSVM | 81% | 90% | 0.93 |
| REVEL* | 75% | 92% | 0.93 |
*REVEL evaluated only on rare missense variants
Key observations: every ensemble method outperforms the individual tools; the meta-predictors (MetaSVM, REVEL) achieve the highest AUC; and REVEL trades some sensitivity for the best specificity, matching its specificity-first clinical niche.
When methods agree, confidence is high. For a random set of 1,000 rare variants in disorder genes, the major ensemble methods agree (all scoring high or all scoring low) on roughly 71% of variants.
The 29% where they disagree are the challenging cases that require expert review.
Decision tree for method selection:
    Is the variant missense?
    ├─ Yes → Is it rare (MAF < 0.1%)?
    │        ├─ Yes → Use REVEL (primary) + CADD (secondary)
    │        └─ No  → Use CADD
    └─ No → What type?
             ├─ Truncating (nonsense, frameshift) → likely functional; confirm with gene constraint
             ├─ Splice-region → Use SpliceAI
             └─ Regulatory → Use DeepSEA or Enformer (Chapters 7-8)
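The same triage logic as a code sketch (the Variant fields, consequence strings, and the 0.1% cutoff mirror the tree above; none of this is a published pipeline):

```python
from dataclasses import dataclass

@dataclass
class Variant:
    consequence: str  # e.g. "missense", "nonsense", "splice_region", "regulatory"
    maf: float        # minor allele frequency

def recommend_tool(v: Variant) -> str:
    """Mirror the method-selection decision tree above."""
    if v.consequence == "missense":
        return "REVEL (primary) + CADD (secondary)" if v.maf < 0.001 else "CADD"
    if v.consequence in ("nonsense", "frameshift"):
        return "likely functional; confirm with gene constraint"
    if v.consequence == "splice_region":
        return "SpliceAI"
    return "DeepSEA or Enformer"

print(recommend_tool(Variant("missense", maf=0.0)))  # REVEL (primary) + CADD (secondary)
```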
When ensemble methods disagree, consider:
1. Check the raw features
If CADD is high but REVEL is low: examine conservation, structure, and frequency independently. Often reveals why tools disagree.
2. Consider variant type
If splice variant, don’t trust REVEL (designed for missense). If in regulatory region, even high CADD may miss enhancer disruption.
3. Examine population data
If variant appears multiple times in gnomAD, likely neutral regardless of scores. If completely absent and in constrained gene, scores matter more.
4. Look for functional studies
Published experiments trump predictions. Check OMIM, ClinVar, PubMed for functional data.
5. When in doubt, use multiple lines of evidence
No single score should decide a variant’s fate. Combine computational predictions with segregation data, functional assays, population frequency, and the clinical phenotype before drawing conclusions.
Key Takeaways:
- No single tool is best; ensemble methods combine complementary signals and consistently outperform individual predictors.
- CADD integrates 63 raw features with a linear model and scores deleteriousness, not clinical pathogenicity, on a phred-like scale.
- DANN uses the same features with a neural network, gaining a few points of accuracy at the cost of interpretability.
- Meta-predictors (MetaLR, MetaSVM, REVEL) combine the outputs of existing tools; REVEL is the specialist for rare missense variants.
- When ensemble methods disagree, examine the raw features, the variant type, and the population data, and look for published functional evidence.
| Term | Definition |
|---|---|
| Deleteriousness score | A quantitative measure of how likely a variant is to have functional impact or be selected against |
| Ensemble learning | Machine learning approach that combines multiple models or features to improve prediction accuracy |
| Feature integration | Ensemble strategy that combines raw data (e.g., conservation scores, frequencies) directly |
| Meta-prediction | Ensemble strategy that combines outputs from existing prediction tools |
| Phred-scaled score | Logarithmic scale where a score of X means the variant ranks in the top 10^(-X/10) fraction of variants (e.g., 20 = top 1%, 30 = top 0.1%) |
| Random forest | Ensemble method that combines predictions from many decision trees to reduce overfitting |
| Support Vector Machine (SVM) | Linear classifier that finds the optimal boundary between classes in high-dimensional space |
| Training data | Validated examples (with known labels) used to train machine learning models |
| Sensitivity | Proportion of variants with functional impact correctly identified (true positive rate) |
| Specificity | Proportion of neutral variants correctly identified (true negative rate) |
1. Why does combining multiple prediction tools often work better than using the single “best” tool? Explain using the concept of complementary information.
2. CADD uses common variants as negative training examples, reasoning that natural selection has already filtered out variants with functional impact. What are potential limitations of this assumption? When might it fail?
3. DANN improves on CADD by using neural networks instead of linear models. Explain in biological terms what kinds of relationships between features a neural network can learn that a linear model cannot.
4. MetaSVM and REVEL both use ensemble learning but perform differently on rare missense variants. Why does REVEL outperform MetaSVM specifically for this variant class?
5. A variant has CADD = 32, DANN = 0.95, MetaSVM = 0.45, and REVEL = 0.22. What might explain these conflicting scores? How would you interpret this variant?
6. Why is interpretability important in clinical variant interpretation? Describe a situation where you’d prefer CADD over DANN despite DANN’s higher accuracy.
7. Many ensemble methods are trained on ClinVar variants, which are heavily biased toward coding variants in well-studied genes. How might this affect their performance on regulatory variants or variants in poorly characterized genes?
8. If you were designing a new ensemble method for structural variants (deletions, duplications > 1 kb), what features would you integrate and why?