Dr. Sarah An stares at her computer screen, frustrated. She just received whole-genome sequencing (WGS) data from a 7-year-old patient with a complex rare disorder—severe skeletal abnormalities, neurodevelopmental delays, and metabolic dysfunction. Both parents are healthy, suggesting this is a de novo (new) mutation. The WGS reveals 4.5 million genetic variants compared to the reference genome. Her computational pipeline identifies approximately 70 de novo variants—changes found in the patient but not in either parent. These are the prime suspects. She runs standard pathogenicity prediction tools—CADD scores and PolyPhen-2—to narrow down the list. After filtering, she’s left with 3 coding mutations in genes with unknown function or disease relevance, and 7 noncoding variants in regulatory regions.
Any of these 10 variants could be causative. But which one?
Each functional validation experiment takes 2-3 months and costs $8,000-15,000. Testing all 10 would take 2+ years and over $100,000. Even then, the noncoding variants are challenging to test—their effects on gene regulation are subtle and context-dependent, requiring cell-type-specific assays, enhancer reporter experiments, and potentially CRISPR editing in patient-derived cells. Three days later, using an AI-powered variant prioritization model trained on millions of variants and functional genomics data, Sarah narrows the list to 2 high-confidence candidates: one coding variant in a gene involved in skeletal development with a damaging structural prediction, and one noncoding variant in an enhancer predicted to disrupt binding of a critical transcription factor expressed in developing bone and neural tissue. Within two months, functional experiments confirm the noncoding variant as causative—it disrupts an enhancer driving expression of a gene essential for both skeletal and neural development.
This is one example of how we can leverage the power of AI in genomics: not to replace scientific intuition, but to dramatically amplify it, helping us navigate the vast search space of human genetic variation.
Modern biology faces two unprecedented explosions, one in data and one in hypotheses, compounded by systems-level complexity.
Data Explosion: Whole-genome sequencing can easily read a single human genome of approximately 3 billion nucleotides, and each person carries 3-5 million variants relative to the reference genome. Single-cell RNA-seq experiments generate data from millions of cells, while epigenome sequencing techniques map chromatin accessibility across millions of genomic regions. Proteomics technologies can identify hundreds of thousands of peptide sequences from a single experiment.
Hypothesis Explosion: Large-scale genomics studies don’t just generate data; they generate thousands of testable hypotheses. GWAS identify hundreds of loci associated with each complex trait, but most lie in noncoding regions with unknown mechanisms. Genome sequencing in rare disease cohorts reveals dozens of candidate genes per patient, each requiring functional validation. Cancer genomics finds hundreds of somatic mutations per tumor, but only a subset are “driver” mutations versus neutral “passengers.” Single-cell atlases reveal thousands of cell type-specific gene expression patterns, each suggesting regulatory hypotheses. Spatial transcriptomics shows genes co-expressed in tissue neighborhoods, implicating thousands of potential cell-cell interactions.
Systems-Level Complexity: The challenge isn’t just quantity but the complexity inherent in our cells and tissues. Genes operate in networks, not isolation, and a single phenotype often involves dozens to hundreds of genes working together. Context matters profoundly: the same variant can be benign in one genetic background but pathogenic in another, and gene function depends on cell type, developmental stage, and environmental conditions. Combinatorial interactions add another layer of complexity, where two variants individually benign might be harmful together, causing the number of possible combinations to explode combinatorially. Pleiotropic effects further complicate the picture: one gene affects multiple phenotypes while one phenotype is affected by multiple genes, creating a many-to-many mapping rather than a simple one-to-one relationship.
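To make that combinatorial point concrete, here is a minimal sketch in plain Python (no genomics libraries; the variant counts are arbitrary) showing how fast the number of variant combinations grows:

```python
from math import comb

# Number of k-way combinations among n candidate variants.
# Even restricting attention to pairs, the hypothesis space explodes.
for n in [10, 100, 1000, 10_000]:
    print(f"{n:>6} variants -> {comb(n, 2):>12,} pairs, {comb(n, 3):>16,} triples")
```

With just 10,000 candidate variants there are roughly 50 million pairs, far more than any lab could ever test directly.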
The fundamental problem is that we can generate biological data and hypotheses far faster than we can experimentally test them. A single GWAS might implicate 500 genes, and testing each would take decades. AI helps us predict which experiments to prioritize and which hypotheses are most likely to be true.
Artificial Intelligence is the broadest concept—any technique that enables computers to mimic human intelligence. This includes playing chess, recognizing faces, translating languages, predicting protein structures, and classifying cell types.
Machine Learning is a subset of AI focused on one question:
“Can computers learn patterns from data instead of having humans program every rule explicitly?”
Instead of telling a computer “if the DNA sequence has TATA box at position -25 and GC content > 60%, it’s probably a promoter,” we give the computer thousands of examples of promoters and non-promoters, and let it figure out the patterns. The learning process works by having the algorithm compare its predictions to the correct answers, measure the error, and then automatically adjust its approach through mathematical optimization.
What Is “Learning” in Machine Learning? Consider an example: predicting variant functional impact.
You have: thousands of variants already labeled as pathogenic or benign, each described by features such as evolutionary conservation, population frequency, and predicted structural change.
The algorithm learns by: making predictions from those features, comparing its predictions to the known labels, measuring the error, and adjusting its internal parameters to reduce it.
This adjustment process is optimization—finding the best parameters that minimize prediction errors.
For a simple model, the algorithm discovers the best weights:
Score = (Conservation × w₁) + (Frequency × w₂) + (StructuralChange × w₃) + bias
The algorithm discovers the best values for w₁, w₂, w₃, and bias by examining thousands of examples. We don’t tell the algorithm what values to use—it discovers them from data.
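Here is a minimal sketch of that learning process using scikit-learn and synthetic data (the feature values, the hidden labeling rule, and all numbers below are invented for illustration; a real variant classifier would use measured features and curated labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated training data: 5,000 variants, three features per variant
# (conservation, population frequency, predicted structural change).
n = 5000
X = np.column_stack([
    rng.uniform(0, 1, n),   # Conservation (0 = not conserved, 1 = highly conserved)
    rng.uniform(0, 1, n),   # Frequency    (0 = never seen,    1 = very common)
    rng.uniform(0, 1, n),   # StructuralChange (0 = none, 1 = severe)
])

# Hidden "true" rule used only to generate labels: pathogenic variants
# tend to be conserved, rare, and structurally disruptive.
logit = 4 * X[:, 0] - 4 * X[:, 1] + 3 * X[:, 2] - 1.5
y = (rng.uniform(0, 1, n) < 1 / (1 + np.exp(-logit))).astype(int)

# The model sees only (X, y) and discovers w1, w2, w3, and the bias itself.
model = LogisticRegression().fit(X, y)
print("learned weights:", model.coef_[0])    # should roughly recover [4, -4, 3]
print("learned bias:   ", model.intercept_[0])
```

The fitted coefficients play exactly the role of w₁, w₂, w₃, and the bias in the score formula above; logistic regression simply adds a threshold to turn the score into a classification.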
Deep Learning uses artificial neural networks with many layers to automatically discover patterns in data. Each layer builds on the previous one in a hierarchical fashion. The first layer might detect simple sequence motifs such as TATA or CAAT boxes, while the second layer might combine these motifs to detect larger regulatory modules. The third layer might identify context-dependent regulatory logic, and the fourth layer might predict cell-type-specific enhancer activity.
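The sketch below (PyTorch, untrained, purely illustrative; the layer sizes and the EnhancerNet name are arbitrary choices) shows what such a layered architecture looks like in code:

```python
import torch
import torch.nn as nn

# A minimal sketch of the hierarchy described above: each layer
# builds on the previous one's output.
class EnhancerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            # Layer 1: scan the one-hot DNA (4 channels) for short motifs
            nn.Conv1d(4, 32, kernel_size=8), nn.ReLU(),
            # Layer 2: combine nearby motifs into larger regulatory modules
            nn.Conv1d(32, 64, kernel_size=8), nn.ReLU(),
            nn.MaxPool1d(4),
            # Layer 3: context-dependent combinations of modules
            nn.Conv1d(64, 64, kernel_size=8), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            # Layer 4: map to a single output, e.g. an enhancer activity score
            nn.Linear(64, 1),
        )

    def forward(self, x):          # x: (batch, 4, sequence_length)
        return self.layers(x)

# One random one-hot input of length 200 bp, just to show the shapes work.
model = EnhancerNet()
dna = torch.zeros(1, 4, 200)
dna[0, torch.randint(0, 4, (200,)), torch.arange(200)] = 1.0
print(model(dna).shape)  # torch.Size([1, 1])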

Figure 1.1: The AI Hierarchy - From Broad to Specific.
Genomic Tool Examples:
| Tool | Category | Why? |
|---|---|---|
| BLAST | AI (not ML) | Uses programmed rules for alignment |
| Random Forest classifier | ML (not DL) | Learns from data, no neural networks |
| AlphaFold | DL | Deep neural networks with many layers |
AI models learn associations (correlations). They do not learn causation.
This is perhaps the most critical concept for biologists to understand.
Suppose you have data showing: patients with Disease Y consistently have elevated expression of Gene X.
What can you conclude?
❌ WRONG: “Gene X causes Disease Y”
❌ WRONG: “Targeting Gene X will cure Disease Y”
✓ CORRECT: “Gene X expression and Disease Y are associated”
Why? Consider these scenarios:
Scenario 1: Gene X → Disease Y (causal)
[Targeting Gene X might cure disease]
Scenario 2: Disease Y → Gene X (reverse causation)
[Gene X is just responding to disease]
Scenario 3: Gene X ← Inflammation → Disease Y (confounding)
[Both are symptoms; treat inflammation instead]
Scenario 4: Gene X ← Environmental Factor → Disease Y (common cause)
[Change environment, not the gene]
All four scenarios produce identical correlations, but require completely different interventions!
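A short simulation makes this concrete. The sketch below (synthetic numbers throughout) generates data from Scenario 3, where inflammation drives both Gene X and Disease Y with no arrow between them, and shows that X and Y are nonetheless strongly correlated:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Scenario 3: inflammation drives BOTH Gene X expression and Disease Y risk.
# There is no arrow from X to Y. All numbers here are synthetic.
inflammation = rng.normal(0, 1, n)
gene_x  = 2.0 * inflammation + rng.normal(0, 1, n)           # Inflammation -> Gene X
disease = (1.5 * inflammation + rng.normal(0, 1, n)) > 1.0   # Inflammation -> Disease Y

# Gene X and Disease Y are strongly correlated, despite no causal link:
print("corr(Gene X, Disease Y):", round(np.corrcoef(gene_x, disease)[0, 1], 2))

# An observational model happily "predicts" disease from Gene X, but an
# intervention on Gene X would do nothing: disease depends only on inflammation.
```

A model trained on this data would rate Gene X a strong disease predictor, yet knocking it out would change nothing, which is exactly why prediction and intervention must be kept distinct.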
“The causes of the data cannot be extracted from the data alone. We need an additional external model, a causal model of some kind.” — Richard McElreath, Statistical Rethinking
This is why Nancy Cartwright’s slogan is so important: “No causes in, no causes out.” You cannot discover causation by data mining alone—you must bring causal assumptions to the data.
To prove causation, you need:
- Controlled perturbation
- Observation of the effect
- Mechanism validation
AI’s role: Predict which of 1000 genes to perturb first
Experiments’ role: Establish that perturbation actually causes the effect
Modern causal inference uses Directed Acyclic Graphs (DAGs) to represent causal relationships. A DAG is a diagram with arrows showing cause-and-effect relationships, where “directed” means arrows have direction (A→B means A causes B) and “acyclic” means no circular loops exist (no A→B→C→A).
Simple DAG Example: Gene Regulation
Transcription Factor (TF)
↓
Enhancer Activity
↓
Gene Expression
↓
Protein Level
↓
Phenotype
This DAG states:
- Changing TF changes Enhancer (causal)
- Changing Protein changes Phenotype (causal)
- Protein and TF are correlated, but not directly causally related
Key insight: The DAG helps you design experiments to establish causation, not just correlation.
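As a sketch of how a DAG can be encoded and queried in code (using the networkx library; the node names follow the diagram above):

```python
import networkx as nx

# The gene-regulation DAG from the text, encoded as a directed graph.
dag = nx.DiGraph([
    ("TF", "Enhancer"),
    ("Enhancer", "GeneExpression"),
    ("GeneExpression", "Protein"),
    ("Protein", "Phenotype"),
])

print(nx.is_directed_acyclic_graph(dag))        # True: no feedback loops
print(list(nx.topological_sort(dag)))           # causal ordering, TF first
print(sorted(nx.descendants(dag, "Enhancer")))  # everything an enhancer edit can affect
```

Queries like `descendants` answer experimental-design questions directly: perturbing the enhancer can only affect its downstream nodes, so any observed change elsewhere points to a missing arrow in your causal model.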
For biology students: statistics is often taught through a flowchart approach:
Is data normal? → YES → t-test
→ NO → Mann-Whitney U test
The goal is typically to reject a null hypothesis: “Does this gene variant have NO effect on disease risk?” If p < 0.05, we conclude “yes, it has an effect.”
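For reference, that flowchart translates almost line for line into code (a sketch using SciPy, with synthetic expression values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
control   = rng.normal(10.0, 2.0, 30)   # synthetic expression values
treatment = rng.normal(11.5, 2.0, 30)

# The flowchart, literally: test normality, then pick the test.
normal = (stats.shapiro(control).pvalue > 0.05 and
          stats.shapiro(treatment).pvalue > 0.05)
if normal:
    result = stats.ttest_ind(control, treatment)
else:
    result = stats.mannwhitneyu(control, treatment)
print(f"p = {result.pvalue:.4f}")
```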
This approach has serious limitations in modern genomics:
“No effect” is not a realistic biological hypothesis: in a densely interconnected biological system, almost every variant has some effect, so with enough samples any test eventually reaches p < 0.05.
Null models are not unique: there are many reasonable ways to specify “no effect,” and each choice yields a different p-value.
Industrial vs. research contexts: rejecting a null can be enough for quality control, but research aims to compare mechanistic explanations, not merely detect that something is nonzero.
What this means for AI and genomics:
When an AI model predicts “this variant is 95% likely to be pathogenic,” it’s not just saying “effect exists.” It’s implicitly proposing mechanisms based on patterns it learned. Your job as a scientist is to treat that prediction as a set of competing hypotheses: identify the plausible mechanisms behind it, design experiments that distinguish them, and validate the winner.
AI models should help you compare competing biological hypotheses, not just confirm that “something is significant.”
AI has already made real discoveries across genomics. Here are key examples:
AlphaFold 2 achieved near-experimental accuracy predicting 3D protein structures in hours instead of months (Jumper et al. 2021, Nature). AlphaFold 3 extended this to protein complexes and DNA/RNA interactions (Abramson et al. 2024, Nature). Over 200 million structures are now freely available, accelerating drug discovery and disease research.
DeepVariant treats variant calling as an image recognition problem, reducing error rates by ~50% versus traditional methods (Poplin et al. 2018, Nature Biotechnology), and is now standard in clinical sequencing. Models like DeepSEA and Basenji extended deep learning to predict regulatory variant effects (Zhou & Troyanskaya 2015, Nature Methods; Kelley et al. 2018, Genome Research). Transformer models predict gene expression, chromatin state, and histone modifications from DNA sequence alone (Avsec et al. 2021, Nature Methods), enabling prediction of noncoding variant effects and revealing long-range regulatory interactions up to 100 kb away.
Models like scGPT and Geneformer treat genes as words in a language, learning universal cellular representations (Cui et al. 2024, Nature Methods; Theodoris et al. 2023, Nature). These models support efforts like the Human Cell Atlas and have reduced cell type annotation from weeks to hours.
Deep learning can screen 100+ million molecules virtually in days. Halicin, a novel antibiotic effective against drug-resistant bacteria, was discovered this way (Stokes et al. 2020, Cell). A major regulatory milestone came in 2024 when the FDA accepted Recursion Pharmaceuticals’ AI-based models as a replacement for animal testing in certain toxicology studies. This represents the first time AI predictions were formally approved to substitute for traditional animal experiments in drug development, potentially accelerating timelines while reducing costs and ethical concerns.
These breakthroughs share key features: massive training datasets, well-defined prediction tasks, and experimental validation of the models’ key predictions.
Perhaps the most profound change AI brings isn’t speed or scale—it’s a fundamental transformation in how we do science.
Traditional: Hypothesis-Driven Research
Observe → Hypothesis → Experiment → Data → Accept/Reject → New Hypothesis
Limitations: One hypothesis at a time, months to years per cycle, testing only what we suspect.
New: AI-Augmented Discovery Loop
Large-scale data → AI training → Thousands of predictions
↑ ↓
New data ← Selective validation ← Prioritize by confidence
From sequential testing to parallel exploration. AI generates thousands of hypotheses simultaneously, experiments validate the most promising ones, results improve the model, and the cycle accelerates.
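A toy version of this loop (scikit-learn, with entirely synthetic “hypotheses” and a hidden validation rule invented for illustration) shows how prioritizing by model confidence concentrates experimental effort on likely hits:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)

# Toy setup: 10,000 candidate "hypotheses", each with 5 features; a hidden
# rule decides which ones would validate. Everything here is synthetic.
X = rng.normal(size=(10_000, 5))
truth = (X[:, 0] + X[:, 1] ** 2 > 1.0)

labeled = list(rng.choice(len(X), 50, replace=False))  # initial experiments
for cycle in range(5):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[labeled], truth[labeled])

    # Prioritize by confidence: "validate" the 50 most promising untested ones.
    scores = model.predict_proba(X)[:, 1]
    untested = np.setdiff1d(np.arange(len(X)), labeled)
    chosen = untested[np.argsort(scores[untested])[-50:]]

    hit_rate = truth[chosen].mean()  # fraction of validations that succeed
    print(f"cycle {cycle}: hit rate among validated candidates = {hit_rate:.2f}")
    labeled.extend(chosen)           # new data feeds the next training round
```

Each cycle, the validated results retrain the model, and the hit rate among chosen candidates climbs well above the background rate; the essence of the AI-augmented loop.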
Recent work envisions AI Virtual Cells (AIVC)—comprehensive computational models simulating cellular behavior across molecular, cellular, and tissue scales (Bunne et al 2024, Cell).

Figure: Capabilities of the AI Virtual Cell. The AIVC provides universal representations (UR) of cell states that can be obtained across species and conditions from different data modalities (A). These representations enable predicting cell biology, modeling dynamics, and performing in silico experiments (B). The utility depends on interactions at individual, community, and societal levels—requiring accessibility, interpretability, evaluation frameworks, privacy protection, and collaborative development (C). Source: Bunne et al 2024, Cell. License: CC-BY 4.0.
This enables in silico experimentation:
Drug screening example:
| Approach | Compounds | Cost | Time | Hits |
|---|---|---|---|---|
| Traditional | 10,000 physical | $50M | 2 years | 5-10 |
| AI-augmented | 100M virtual → 1,000 physical | $5M | 6 months | 20-30 |
The most powerful approach combines AI prediction with human expertise:
Scientist's Question → Virtual Cell Simulation →
Scientist Reviews + Domain Knowledge →
Lab Experiments → Virtual Cell Learns → (Loop continues)
AI amplifies—doesn’t replace—biological expertise. Scientists still ask questions, interpret meaning, decide what to test, and validate results. But now they can explore vastly larger hypothesis spaces.
| Level | Who | What You Can Do | Time Investment |
|---|---|---|---|
| Consumer | All biologists | Use existing AI tools (AlphaFold, CADD scores); interpret predictions critically; understand limitations and when to validate; recognize biases | Hours (this course) |
| User | Data-oriented biologists | Run pre-trained models on your data; perform data preprocessing and visualization; integrate AI into analysis pipelines | Weeks of practice |
| Developer | Computational biologists | Fine-tune and train new models; develop novel architectures; collaborate as an equal partner with ML researchers | Months to years |
This textbook targets the first two levels: Consumer and User.
| Use AI When: | Don’t Use AI When: |
|---|---|
| Large datasets (1000+ examples) | Very little data (<100 examples) |
| Complex patterns (many variables) | Mechanism understanding is critical |
| Expensive/slow experiments | Very high stakes without validation |
| Need for scale (millions of predictions) | Problem is simple (basic statistics work) |
| Similar problems solved (transfer learning) | Training data doesn’t match your population |
Decision Framework:
Need to prioritize/predict many things?
  NO  → Traditional experiments
  YES → Have >1000 training examples?
    NO  → Use statistics or small ML models
    YES → Pattern too complex for simple rules?
      NO  → Try simple models first (linear, random forest)
      YES → Consider deep learning
Always validate key predictions experimentally!
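The same flowchart can be written as a small function (a sketch; the inputs and wording simply mirror the flowchart above):

```python
def recommend_approach(many_predictions: bool,
                       n_training_examples: int,
                       pattern_is_complex: bool) -> str:
    """The decision flowchart above, expressed as code."""
    if not many_predictions:
        return "Traditional experiments"
    if n_training_examples < 1000:
        return "Statistics or small ML models"
    if not pattern_is_complex:
        return "Simple models first (linear, random forest)"
    return "Consider deep learning, and validate key predictions experimentally"

print(recommend_approach(many_predictions=True,
                         n_training_examples=50_000,
                         pattern_is_complex=True))
```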
| Failure Type | Example | What Went Wrong | Lesson |
|---|---|---|---|
| Overfitting | Sepsis prediction: 80% accuracy in training, random chance in real hospitals | Learned when nurses check vitals, not sepsis biology | Validate on truly independent data from different sources |
| Unnecessary Complexity | Deep learning 85% vs. simple linear 87% accuracy | Problem was actually linear; complexity hurt performance | Start simple, only add complexity when needed |
| Population Bias | 30% more variants flagged as “pathogenic” in African genomes | Training data >80% European ancestry; novelty interpreted as pathogenicity | Ensure training data represents application population |
| Confounding | Gene X “causes” disease | Actually: Disease → Inflammation → Gene X | Draw causal models; design experiments to test |
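The first failure mode, overfitting, is easy to reproduce. In the sketch below (synthetic data, random labels carrying no real signal), a flexible model scores perfectly on its training data yet performs at chance on held-out data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Features with NO real signal: the labels are pure noise.
X = rng.normal(size=(200, 500))          # 200 patients, 500 features
y = rng.integers(0, 2, 200)              # random "disease" labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("training accuracy:", model.score(X_tr, y_tr))   # ~1.0 (memorized noise)
print("held-out accuracy:", model.score(X_te, y_te))   # ~0.5 (chance)
```

This is why the table's first lesson insists on truly independent validation data: training accuracy alone says nothing about whether the model learned biology or noise.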
Key Takeaways:
- AI, ML, and DL are nested categories: models learn patterns from data rather than following hand-coded rules.
- AI models learn correlations, not causation; only perturbation experiments establish cause and effect.
- Use AI to prioritize hypotheses at scale, and always validate key predictions experimentally.
Q: How do AI, machine learning, and deep learning relate to each other?
Answer: AI is the broadest category (any technique that mimics human intelligence); machine learning is the subset of AI that learns patterns from data; deep learning is the subset of ML that uses many-layered neural networks.
Example: BLAST is AI but not ML (uses programmed rules). A Random Forest variant classifier is ML but not DL (no neural networks). AlphaFold is DL (deep neural networks).
Q: Why can’t AI models establish causation?
Answer: AI can only find correlations (patterns that occur together), not causation (one thing directly causing another).
Example: If Gene X expression correlates with Disease Y, there are multiple possible explanations: Gene X causes Disease Y, Disease Y alters Gene X expression, or a third factor (such as inflammation) drives both.
AI cannot distinguish between these scenarios. Only controlled experiments (like CRISPR knockout, drug inhibition, or overexpression) can establish causation, by directly perturbing the gene and observing whether the disease phenotype changes.
Q: When should you use AI, and when should you rely on simple experiments or statistics?
Answer:
Use AI when: you have large datasets (1000+ examples), complex multi-variable patterns, expensive or slow experiments, or a need for predictions at scale.
Use simple experiments or statistics when: you have very little data, the problem is simple, mechanistic understanding is critical, or the stakes are too high to act without validation.
Key principle: AI helps you decide what to test experimentally, but experiments prove why something happens.
Hands-on practice: Access Lab 1.1 on Google Colab.