Interactive: Chapter 1 — Explore the variant hunting problem and see how AI narrows the search.
Dr. Sarah An stares at her computer screen, frustrated. She just received whole-genome sequencing (WGS) data from a 7-year-old patient with a complex rare disorder—severe skeletal abnormalities, neurodevelopmental delays, and metabolic dysfunction. Both parents are healthy, suggesting this is a de novo (new) mutation.
The WGS reveals 4.5 million genetic variants compared to the reference genome. Her computational pipeline identifies approximately 70 de novo variants—changes found in the patient but not in either parent. These are the prime suspects.
She runs standard pathogenicity prediction tools—CADD scores and PolyPhen-2—to narrow down the list. After filtering, she’s left with 3 coding mutations in genes with unknown function or disease relevance, and 7 noncoding variants in regulatory regions.
Any of these 10 variants could be causative. But which one?
Each functional validation experiment takes 2–3 months and costs $8,000–15,000. Testing all 10 would take 2+ years and over $100,000. Even then, the noncoding variants are particularly challenging—their effects on gene regulation are subtle and context-dependent, requiring cell-type-specific assays, enhancer reporter experiments, and potentially CRISPR editing in patient-derived cells.
Three days later, using an AI-powered variant prioritization model trained on millions of variants and functional genomics data, Sarah narrows the list to 2 high-confidence candidates: one coding variant in a gene involved in skeletal development, and one noncoding variant in an enhancer predicted to disrupt a transcription factor expressed in developing bone and neural tissue.
Within two months, functional experiments confirm the noncoding variant as causative—it disrupts an enhancer driving expression of a gene essential for both skeletal and neural development.
This is one example of how we can leverage the power of AI in genomics: not replacing scientific intuition, but dramatically amplifying it to navigate the vast search space of human genetic variation.
By the end of this chapter, you will be able to:
Modern biology faces two unprecedented explosions.
Data Explosion: Whole genome sequencing can easily read a single human genome of approximately 3 billion nucleotides, with each person having 3-5 million variants that differ from one another. Single-cell RNA-seq experiments generate data from millions of cells, while various epigenome sequencing techniques can provide information for chromatin accessibility maps across millions of genomic regions. Proteomics experiments can generate millions of spectra and identify thousands to tens of thousands of peptides and proteins, depending on the technology and study design.
Many Hypotheses to Explore: Large-scale genomics studies don’t just generate data but they also generate thousands of testable hypotheses. GWAS studies identify hundreds of loci associated with each complex trait, but most are in noncoding regions with unknown mechanisms. Genome sequencing in rare disease cohorts reveals dozens of candidate genes per patient, each requiring functional validation. Cancer genomics finds hundreds of somatic mutations per tumor, but only a subset are “driver” mutations versus neutral “passengers.” Single-cell atlases reveal thousands of cell type-specific gene expression patterns, each suggesting regulatory hypotheses. Spatial transcriptomics shows genes co-expressed in tissue neighborhoods, implicating thousands of potential cell-cell interactions.
Systems-Level Complexity: The challenge isn’t just quantity but the complexity inherent in our cells and tissues. Genes operate in networks, not isolation, and a single phenotype often involves dozens to hundreds of genes working together. Context matters profoundly: the same variant can be benign in one genetic background but pathogenic in another, and gene function depends on cell type, developmental stage, and environmental conditions. Combinatorial interactions add another layer of complexity, where two variants individually benign might be harmful together, causing the number of possible combinations to explode exponentially. Pleiotropic effects further complicate the picture: one gene affects multiple phenotypes while one phenotype is affected by multiple genes, creating a many-to-many mapping rather than a simple one-to-one relationship.
The fundamental problem is that we can generate biological data and hypotheses far faster than we can experimentally test them. A single GWAS might implicate 500 genes, and testing each would take decades. AI helps us predict which experiments to prioritize and which hypotheses are most likely to be true.
Artificial Intelligence is the broadest concept—any technique that enables computers to mimic human intelligence. This includes playing chess, recognizing faces, translating languages, predicting protein structures, and classifying cell types.
Machine Learning is a subset of AI focused on one question:
“Can computers learn patterns from data instead of having humans program every rule explicitly?”
Think of the difference between a recipe book and a chef who has tasted thousands of dishes. A recipe book follows fixed rules: “if ingredient X, then do step Y.” A chef who has tasted enough dishes develops intuition—they can sense what’s missing without consulting a recipe. Machine learning is closer to the chef: instead of programming every rule, we show the algorithm thousands of examples and let it discover the patterns.
Instead of telling a computer “if the DNA sequence has a TATA box at position -25 and GC content > 60%, it’s probably a promoter,” we give the computer thousands of labeled examples of promoters and non-promoters, and let it figure out the patterns on its own.
How a machine learning algorithm learns — an example: predicting variant functional impact
You have:
The algorithm learns by:
This cycle—guess, compare, adjust, repeat—is how the algorithm “learns” from data, much like how you improve at a skill through practice and feedback.
Want to see the math? For a simple model, the algorithm discovers the best weights for each input:
Score = (Conservation × w₁) + (Frequency × w₂) + (StructuralChange × w₃) + biasThe values w₁, w₂, w₃ (and bias) are discovered automatically from thousands of examples—we never specify them. A high w₁ would mean “conservation matters a lot for pathogenicity,” while a negative w₂ would mean “common variants are probably benign.”
Deep Learning uses artificial neural networks with many layers to automatically discover patterns in data—think of it as machine learning taken to a much greater depth.
Each layer builds on the previous one, like a series of filters in a microscopy pipeline. The first layer might detect simple sequence motifs such as TATA or CAAT boxes. The second layer combines those motifs to detect larger regulatory modules. The third layer identifies context-dependent regulatory logic. The fourth layer predicts cell-type-specific enhancer activity.
No one programs these layers to look for those patterns—the network discovers them by itself after seeing enough examples. This hierarchical feature learning is what makes deep learning so powerful for genomics.

Figure 1.1: The AI Hierarchy - From Broad to Specific.
Genomic Tool Examples:
| Tool | Category | Why? |
|---|---|---|
| BLAST | AI (not ML) | Uses programmed rules for alignment |
| Random Forest classifier | ML (not DL) | Learns from data, no neural networks |
| AlphaFold | DL | Deep neural networks with many layers |
AI models learn associations (correlations). They do not learn causation.
This is perhaps the most critical concept for biologists to understand.
Suppose you have data showing:
What can you conclude?
❌ WRONG: “Gene X causes Disease Y”
❌ WRONG: “Targeting Gene X will cure Disease Y”
✓ CORRECT: “Gene X expression and Disease Y are associated”
Why? All four scenarios below produce the same correlation, but require completely different interventions:
| Scenario | What’s actually happening | What you should do |
|---|---|---|
| Gene X → Disease Y | Gene X directly causes the disease | Target Gene X therapeutically |
| Disease Y → Gene X | The disease itself triggers Gene X expression | Treat the disease, not Gene X |
| Inflammation → Gene X and Disease Y | A third factor drives both | Treat inflammation |
| Environmental factor → Gene X and Disease Y | Shared cause in the environment | Change the environment |
An AI model trained on correlation data cannot tell you which scenario is true. All four look identical in the data.
“The causes of the data cannot be extracted from the data alone.” — Richard McElreath, Statistical Rethinking
Things to prove causation:
Controlled perturbation
Observation of effect
Mechanism validation
AI’s role: Predict which of 1000 genes to perturb first
Experiments’ role: Establish that perturbation actually causes the effect
You already draw causal models in biology—you just call them pathway diagrams or signaling cascades. The gene regulation pathway below is a perfect example:
Transcription Factor (TF)
↓
Enhancer Activity
↓
Gene Expression
↓
Protein Level
↓
Phenotype
Each arrow means “causes.” This diagram captures something important:
This distinction matters enormously for experiments. If you want to test whether TF activity causes the phenotype, you need to perturb TF (e.g., with a dominant negative or CRISPR)—not just observe the correlation between TF and phenotype in your RNA-seq data.
Researchers in statistics and computer science call these pathway diagrams Directed Acyclic Graphs (DAGs)—a formal name for something biologists have always drawn intuitively.
The takeaway: Before interpreting any AI prediction, draw out the causal pathway you believe is operating. This helps you design experiments that test the mechanism, not just the correlation.
Most of us learned statistics with one goal: get a p-value below 0.05 and reject the null hypothesis. “Is there an effect?” If yes, move on.
But genomics asks a fundamentally different question: “Which mechanism explains this phenotype?”
Consider a GWAS study that finds 200 variants associated with Type 2 diabetes. Rejecting the null hypothesis for each one is easy—we know they have effects. The hard question is: which of these variants disrupts insulin signaling, and how? Does it affect a transcription factor binding site? An enhancer? A splice site? These are competing mechanistic hypotheses, and no amount of p-value testing can distinguish between them.
An analogy: Imagine your car won’t start. A mechanic who only asks “Is there a problem?” (null hypothesis test) isn’t very helpful—obviously there’s a problem. A good mechanic asks “Is it the battery, the starter motor, or the fuel pump?”—competing hypotheses about mechanism. That’s what genomics needs.
What this means for how you use AI tools:
When an AI model predicts “this variant is 95% likely to be pathogenic,” it’s not just saying “effect exists.” Based on patterns learned from millions of variants, it’s implicitly pointing toward a mechanism. Your job as a scientist is to:
AI models are most valuable when they help you compare competing biological mechanisms, not just confirm that something is statistically significant.
AI has already made real discoveries across genomics. Here are key examples:
AlphaFold 2 achieved near-experimental accuracy predicting 3D protein structures in hours instead of months (Jumper et al 2021, Nature). AlphaFold 3 extended to protein complexes and DNA/RNA interactions (Abramson et al 2024, Nature). Over 200 million structures are now freely available, accelerating drug discovery and disease research.
DeepVariant treats sequencing as image recognition, reducing error rates by 50% vs. traditional methods (Poplin et al 2018, Nature Biotechnology). Now standard in clinical sequencing. Models like DeepSEA and Basenji extended this to predict regulatory variant effects (Zhou & Troyanskaya 2015, Nature Methods; Kelley et al 2018, Genome Research). Transformer models predict gene expression, chromatin state, and histone modifications from DNA sequence alone (Avsec et al 2021, Nature Methods). This enables predicting noncoding variant effects and revealing long-range regulatory interactions up to 100kb away.
Models like scGPT and Geneformer treat genes as words in language, learning reusable cellular representations (Cui et al 2024, Nature Methods; Theodoris et al 2023, Nature). These models build on Human Cell Atlas-scale datasets and can reduce cell type annotation from weeks to hours.
Deep learning screens 100+ million molecules virtually in days. Halicin—a novel antibiotic effective against drug-resistant bacteria—was discovered this way (Stokes et al 2020, Cell). Regulatory agencies are also developing frameworks for AI/ML models and other New Approach Methodologies in drug development. These tools may reduce, refine, or supplement some animal studies, but they should not be described as a blanket replacement for animal testing unless a specific FDA qualification or guidance document is cited.
These breakthroughs share key features:
Perhaps the most profound change AI brings isn’t speed or scale—it’s a fundamental transformation in how we do science.
Traditional: Hypothesis-Driven Research
Observe → Hypothesis → Experiment → Data → Accept/Reject → New Hypothesis
Limitations: One hypothesis at a time, months to years per cycle, testing only what we suspect.
New: AI-Augmented Discovery Loop
Large-scale data → AI training → Thousands of predictions
↑ ↓
New data ← Selective validation ← Prioritize by confidence
From sequential testing to parallel exploration. AI generates thousands of hypotheses simultaneously, experiments validate the most promising ones, results improve the model, and the cycle accelerates.
Recent work envisions AI Virtual Cells (AIVC)—comprehensive computational models simulating cellular behavior across molecular, cellular, and tissue scales (Bunne et al 2024, Cell).

Figure: Capabilities of the AI Virtual Cell. The AIVC provides universal representations (UR) of cell states that can be obtained across species and conditions from different data modalities (A). These representations enable predicting cell biology, modeling dynamics, and performing in silico experiments (B). The utility depends on interactions at individual, community, and societal levels—requiring accessibility, interpretability, evaluation frameworks, privacy protection, and collaborative development (C). Source: Bunne et al 2024, Cell. License: CC-BY 4.0.
This enables in silico experimentation:
Drug screening example:
| Approach | Compounds | Cost | Time | Hits |
|---|---|---|---|---|
| Traditional | 10,000 physical | $50M | 2 years | 5-10 |
| AI-augmented | 100M virtual → 1,000 physical | $5M | 6 months | 20-30 |
The most powerful approach combines AI prediction with human expertise:
Scientist's Question → Virtual Cell Simulation →
Scientist Reviews + Domain Knowledge →
Lab Experiments → Virtual Cell Learns → (Loop continues)
AI amplifies—doesn’t replace—biological expertise. Scientists still ask questions, interpret meaning, decide what to test, and validate results. But now they can explore vastly larger hypothesis spaces.
| Level | Who | What You Can Do | Time Investment |
|---|---|---|---|
| Consumer | All biologists | Use existing AI tools (AlphaFold, CADD scores) Interpret predictions critically Understand limitations and when to validate Recognize biases |
Hours (this course) |
| User | Data-oriented | Run pre-trained models on your data Perform data preprocessing and visualization Integrate AI into analysis pipelines |
Weeks of practice |
| Developer | Computational biology | Fine-tune and train new models Develop novel architectures Collaborate as equal partner with ML researchers |
Months to years |
This textbook targets Levels 1-2.
| Use AI When: | Don’t Use AI When: |
|---|---|
| Large datasets (1000+ examples) | Very little data (<100 examples) |
| Complex patterns (many variables) | Mechanism understanding is critical |
| Expensive/slow experiments | Very high stakes without validation |
| Need for scale (millions of predictions) | Problem is simple (basic statistics work) |
| Similar problems solved (transfer learning) | Training data doesn’t match your population |
Decision Framework:
Need to prioritize/predict many things?
NO → Traditional experiments
YES → Have >1000 training examples?
NO → Use statistics or small ML models
YES → Pattern too complex for simple rules?
NO → Try simple models first (linear, random forest)
YES → Consider deep learning
↓
Always validate key predictions experimentally!
| Failure Type | Example | What Went Wrong | Lesson |
|---|---|---|---|
| Overfitting | Sepsis prediction: 80% accuracy in training, random chance in real hospitals | Learned when nurses check vitals, not sepsis biology | Validate on truly independent data from different sources |
| Unnecessary Complexity | Deep learning 85% vs. simple linear 87% accuracy | Problem was actually linear; complexity hurt performance | Start simple, only add complexity when needed |
| Population Bias | 30% more variants flagged as “pathogenic” in African genomes | Training data >80% European ancestry; novelty interpreted as pathogenicity | Ensure training data represents application population |
| Confounding | Gene X “causes” disease | Actually: Disease → Inflammation → Gene X | Draw causal models; design experiments to test |
Key Takeaways:
| Term | Definition |
|---|---|
| Artificial Intelligence (AI) | Field of making computers perform tasks requiring human intelligence |
| Machine Learning (ML) | Algorithms that learn patterns from data without explicit programming |
| Deep Learning (DL) | ML using multi-layered neural networks |
| Causal Inference | Methods for establishing causation, not just correlation |
| DAG (Directed Acyclic Graph) | A formal name for the pathway diagrams biologists already draw—arrows indicate cause-and-effect, and no circular loops are allowed |
| Parameters | Numerical values a model learns to make predictions |
| Training Data | Examples with known answers used to teach models |
| Optimization | Adjusting parameters to minimize prediction errors |
| Overfitting | Learning training data too well, failing on new data |
| Bias | Systematic errors from unrepresentative training data |
| Confounding | When a third variable creates spurious associations |
| Foundation Model | Large-scale models trained on diverse data, transferable to many tasks |
| AI Virtual Cell (AIVC) | Computational model simulating cellular behavior across scales |
| Active Learning | Iterative process where AI identifies most informative next experiments |
| CRISPR knockout/knockdown | Gene editing to delete or reduce gene expression for causal validation |
| Drug inhibition | Chemical compounds blocking protein function to test causation |
| Overexpression | Artificially increasing gene expression levels to observe phenotypic effects |
Answer:
Example: BLAST is AI but not ML (uses rules). A Random Forest variant classifier is ML but not DL (no neural networks). AlphaFold is DL (deep neural networks).
Answer:
AI can only find correlations (patterns that occur together), not causation (one thing directly causing another).
Example: If Gene X expression correlates with Disease Y, there are multiple possible explanations:
AI cannot distinguish between these scenarios. Only controlled experiments (like CRISPR knockout, drug inhibition, or overexpression) can establish causation by directly perturbing the gene and observing if the disease phenotype changes.
Answer:
Use AI when:
Use simple experiments or statistics when:
Key principle: AI helps you decide what to test experimentally, but experiments prove why something happens.