Under the microscope, the tumor biopsy looks like a uniform mass of cells. The bulk RNA-seq confirms this impression: moderate upregulation of immune genes, modest changes in metabolic pathways. A seemingly straightforward picture — immune infiltration, metabolic stress, the usual hallmarks. You write it up in your notebook and prepare for the next experiment.
Then a collaborator runs the same tissue through a single-cell sequencer. The result is a revelation. That “uniform mass” contains at least 15 distinct cell populations. Some T cells are in an exhausted state, their cytotoxic function silenced by chronic antigen exposure. Others are actively proliferating — a completely different biology. Some cancer cells are dividing rapidly, others are quiescent and potentially therapy-resistant. The metabolic “moderate upregulation” was actually two opposing populations averaged together — one dramatically upregulated, the other completely silent. The average was a statistical artifact that described neither population accurately.
The single-cell data didn’t just add resolution. It told an entirely different biological story. The exhausted T cells explain why checkpoint inhibitor therapy might fail in this patient. The quiescent cancer cells explain why the tumor regrows after chemotherapy. The rare dendritic cell population — just 2% of cells, invisible in bulk measurements — shows the strongest activation signature and may be the key to designing a better immunotherapy. None of this was visible at bulk resolution. All of it was present, waiting to be seen.
This is the promise — and the challenge — of single-cell omics. The data is richer by orders of magnitude. But a single experiment profiling 50,000 cells produces as much data as 50,000 bulk experiments. Understanding what you’ve measured requires computational methods as sophisticated as the biology itself.
When you extract RNA from a tissue sample and perform bulk RNA-seq, you measure the average gene expression across millions of cells. If 90% of cells express gene A at 10 copies and 10% express it at 100 copies, you detect an average of 19 copies per cell. Nothing in that single number reveals the underlying heterogeneity.
This averaging problem becomes critical in several contexts:
Tissue Complexity: Your brain contains over 100 different neuronal cell types, plus glia, immune cells, and vascular cells. A bulk measurement mixes all these signals together. When studying autism spectrum disorder or Alzheimer’s, you can’t tell which specific cell types show altered gene expression.
Rare Cell Populations: Stem cells often comprise less than 1% of a tissue. Circulating tumor cells can be 1 in 10 million blood cells. Pancreatic beta cells make up only 1-2% of the pancreas. These rare but critical populations vanish into the noise of bulk measurements.
Dynamic Processes: During development or immune responses, cells transition through transient states. Bulk measurements capture a static average but miss the trajectory. You can’t see that some cells are differentiating while others remain in a progenitor state.
Cellular Heterogeneity in Disease: Tumors aren’t uniform masses—they contain cancer cells at different stages, various immune cells, stromal cells, and more. In autism spectrum disorder, specific neuronal subtypes may be affected while others remain unaffected. Bulk measurements hide this crucial heterogeneity.
The experimental solution exists: single-cell RNA sequencing (scRNA-seq) and related technologies. But these methods generate massive datasets—analyzing 10,000 cells produces as much data as 10,000 bulk RNA-seq experiments. A single study might profile 100,000–500,000 cells. This is where computational methods become not just helpful but absolutely necessary.
After completing this chapter, you will be able to:
In bulk RNA-seq, which you may have encountered in earlier biology courses, the workflow is straightforward:
The result: one measurement per gene representing the average across all cells. For a human sample with ~20,000 genes, you get ~20,000 measurements total.
This approach works beautifully for some questions. If you want to know whether liver cells generally express more albumin than kidney cells, bulk sequencing answers this clearly. The liver-versus-kidney difference is so large that cellular heterogeneity within each tissue doesn’t matter.
Consider a simple example with real numbers. You analyze a tissue sample containing two cell types:
Bulk sequencing reports: 40 copies per cell on average (0.8 × 50 + 0.2 × 0 = 40).
Now imagine you’re comparing unaffected tissue to samples from patients with a metabolic disorder:
Unaffected tissue:
Affected tissue:
Bulk RNA-seq shows a 25% decrease in gene X expression. But gene X expression in individual cells hasn’t changed at all! The change is entirely due to altered cell type proportions. This is called a “composition effect,” and it has confounded many published bulk studies.
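The arithmetic behind this composition effect is easy to verify. Here is a minimal Python sketch; the per-cell expression values (50 and 0 copies) come from the example above, while the affected tissue’s 60% Type A proportion is an assumption chosen to reproduce the 25% decrease:

```python
# Composition effect: the bulk average shifts even though per-cell
# expression never changes. The 60% Type A fraction for affected
# tissue is an assumption that matches the 25% decrease in the text.

def bulk_average(frac_type_a, expr_a=50, expr_b=0):
    """Average copies per cell for a Type A / Type B mixture."""
    return frac_type_a * expr_a + (1 - frac_type_a) * expr_b

unaffected = bulk_average(0.80)   # 80% Type A
affected = bulk_average(0.60)     # 60% Type A (assumed)

decrease = (unaffected - affected) / unaffected
print(unaffected, affected)       # 40.0 30.0
print(f"{decrease:.0%} decrease") # 25% decrease
```

Note that both bulk values arise from identical per-cell expression — only the mixing proportions differ.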
Single-cell RNA-seq solves this by measuring each cell independently. For the same sample, you’d get:
Cell_1 (Type A): 52 copies
Cell_2 (Type A): 48 copies
Cell_3 (Type B): 0 copies
Cell_4 (Type A): 51 copies
...
Cell_10000 (Type B): 0 copies
Now you can:
The cost: instead of one measurement per gene, you have 10,000 measurements per gene. Your data matrix goes from 20,000 numbers to 200 million numbers.
[What’s new here for AI/ML?] You likely already know about scRNA-seq from earlier courses. What changes with AI is what we can do with this data at scale. Traditional tools could cluster 10,000 cells; modern foundation models learn from 100 million cells across hundreds of studies to build universal representations of cellular state. The biology is familiar — the computational leap is new.
The fundamental problem in scRNA-seq is this: how do you keep track of which RNA molecules came from which cell when you sequence millions of molecules together?
The solution is elegant: molecular barcoding. Each cell gets a unique DNA barcode sequence, and all RNA molecules from that cell are tagged with its barcode before pooling.
The most widely used platform, 10x Genomics Chromium, works like this:
Step 1: Cell Encapsulation
Step 2: Cell Lysis and Barcoding
Step 3: Breaking Droplets and Library Prep
Step 4: Sequencing and Demultiplexing
The beauty of this system: you can sequence 10,000 cells in one lane, and the barcodes tell you which reads came from which cell. No manual cell sorting required.
Typical 10x Genomics Run:
Why Not More Genes per Cell?
You might wonder: if cells express ~10,000–15,000 genes, why do we only detect 2,000–5,000? Three reasons:
This creates sparsity: most entries in your data matrix are zeros, even when the gene is actually expressed at low levels.
Smart-seq2/3 (full-length sequencing):
BD Rhapsody (targeted sequencing):
SPLiT-seq (combinatorial barcoding):
A single-cell RNA-seq dataset is organized as a matrix:
| | Cell_1 | Cell_2 | Cell_3 | ... | Cell_10000 |
|---|---|---|---|---|---|
| Gene_1 (TP53) | 45 | 0 | 23 | ... | 12 |
| Gene_2 (ACTB) | 523 | 487 | 612 | ... | 445 |
| Gene_3 (GAPDH) | 234 | 198 | 276 | ... | 201 |
| ... | ... | ... | ... | ... | ... |
| Gene_20000 | 0 | 2 | 0 | ... | 0 |
Biological analogy: This is like a giant spreadsheet where rows are genes and columns are cells — imagine a classroom attendance sheet, except instead of ‘present/absent’ it counts how many mRNA molecules each student (cell) produced for each subject (gene). Most entries will be zero because most cells only express a fraction of all genes.
Dimensions:
Memory Requirements:
Unlike bulk RNA-seq where most genes have non-zero counts, single-cell data is typically 85–95% zeros:
Bulk RNA-seq: [523, 234, 45, 12, 89, ...] (most non-zero)
scRNA-seq: [0, 0, 45, 0, 0, 12, 0, 0, 0, 523, ...] (mostly zeros)
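Because of this sparsity, single-cell matrices are stored in compressed formats that record only the nonzero entries. Here is a toy sketch of the idea; real pipelines use CSR/CSC sparse matrices (e.g., from SciPy) rather than Python lists:

```python
# Toy expression vector for one cell (mostly zeros, as in scRNA-seq)
dense_cell = [0, 0, 45, 0, 0, 12, 0, 0, 0, 523]

# Coordinate representation: store only (gene_index, count) pairs
sparse_cell = [(i, v) for i, v in enumerate(dense_cell) if v != 0]

print(sparse_cell)                              # [(2, 45), (5, 12), (9, 523)]
print(len(dense_cell), "->", len(sparse_cell))  # 10 stored values -> 3
```

With 90% zeros, this roughly decimates memory use — the difference between a dataset that fits in RAM and one that doesn’t.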
Sources of Zeros:
You cannot distinguish biological from technical zeros without additional information. This ambiguity complicates downstream analysis.
Unlike bulk RNA-seq, single-cell data has different statistical properties:
Bulk RNA-seq counts per gene:
Single-cell UMI counts per gene per cell:
This means standard bulk RNA-seq analysis methods often fail for single-cell data. Methods need to:
[Optional: The Math] — Sequencing Depth and Gene Detection Probability
For a gene expressed at m mRNA copies per cell, the probability of detecting it in scRNA-seq depends on:
- Capture efficiency (c): What fraction of mRNAs get captured? Typically c ≈ 0.1–0.3
- Sequencing depth (d): How many UMIs per cell? Typically d = 10,000–50,000
- Library size (L): Total mRNAs in cell, typically L ≈ 100,000–500,000
The expected number of UMIs detected for this gene is:
E[UMIs detected] = m × c × (d / L)
Example: A gene with 100 copies, 20% capture, 20,000 depth, 200,000 library size:
E[UMIs] = 100 × 0.2 × (20,000 / 200,000) = 2
The actual count follows a Poisson distribution, so:
- P(detecting 0) = e^(−2) ≈ 13.5%
- P(detecting 1) ≈ 27%
- P(detecting 2) ≈ 27%
This shows why even expressed genes often show zeros, and why increasing sequencing depth helps but has diminishing returns.
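The worked example above can be checked numerically with a short standard-library script (the parameter values are the ones from the example):

```python
import math

def expected_umis(m, c, d, L):
    """Expected UMI count: copies x capture efficiency x depth/library size."""
    return m * c * (d / L)

def p_detect(k, lam):
    """Poisson probability of observing exactly k UMIs."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = expected_umis(m=100, c=0.2, d=20_000, L=200_000)
print(lam)                          # 2.0
print(round(p_detect(0, lam), 3))   # 0.135 -> ~13.5% chance of a dropout
print(round(p_detect(1, lam), 3))   # 0.271
print(round(p_detect(2, lam), 3))   # 0.271
```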
While scRNA-seq tells you which genes are expressed, it doesn’t directly tell you why. Gene regulation happens largely through regulatory elements—promoters and enhancers—and these elements need to be accessible to transcription factors.
Chromatin accessibility indicates which regulatory elements are “open” (accessible) versus “closed” (wrapped tightly in nucleosomes). In active regulatory regions:
ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) uses a clever molecular trick:
Bulk ATAC-seq gives you accessibility averaged across millions of cells. Single-cell ATAC-seq (scATAC-seq) measures accessibility in individual cells.
Unlike scRNA-seq which measures ~20,000 genes, scATAC-seq measures accessibility across:
The data matrix looks similar:
| | Cell_1 | Cell_2 | Cell_3 | ... | Cell_10000 |
|---|---|---|---|---|---|
| Peak_1 (chr1:1000) | 1 | 0 | 1 | ... | 0 |
| Peak_2 (chr1:5400) | 0 | 0 | 0 | ... | 1 |
| Peak_3 (chr2:8900) | 2 | 1 | 0 | ... | 0 |
| ... | ... | ... | ... | ... | ... |
| Peak_200000 | 0 | 0 | 1 | ... | 0 |
Values are typically 0, 1, or 2 (rare events where multiple insertions happened in the same peak in the same cell).
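Because counts above 1 are rare and carry little extra information, a common preprocessing choice is to binarize scATAC profiles into “accessible vs. not accessible.” A one-line sketch on toy counts:

```python
# Binarize one cell's peak counts: was the peak accessible at all?
peak_counts = [0, 1, 2, 0, 1]                     # toy counts for 5 peaks
binary = [1 if c > 0 else 0 for c in peak_counts]
print(binary)  # [0, 1, 1, 0, 1]
```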
scATAC-seq data is extremely sparse—typically 95–99% zeros. Why?
Typical scATAC-seq cell:
The power of scATAC-seq comes from linking regulatory elements to gene expression:
Increasingly, researchers perform multimodal measurements: scRNA-seq + scATAC-seq on the same cells. This reveals:
Early scRNA-seq studies (2015–2017) profiled 100–1,000 cells. Today:
Computational implications:
A dataset with 1 million cells and 20,000 genes contains 20 billion values. Even in sparse format, this requires:
Standard analysis tools designed for bulk data simply cannot handle these scales.
The technical zeros in single-cell data create a phenomenon called “dropout”: genes that are expressed but appear as zeros in many cells.
Consequences:
Various computational approaches try to address this:
When you run scRNA-seq experiments on different days, with different reagent lots, or in different labs, you introduce batch effects—technical variation that doesn’t reflect biology.
Example scenario:
You profile:
Ideally, A and C should cluster together (both unaffected), and B and D together (both affected). But batch effects might make:
Batch effects can be larger than biological effects in single-cell data because:
Computational solutions:
In droplet-based scRNA-seq, sometimes two cells are captured in the same droplet. The result: a “doublet” that appears to express genes from both cell types.
Detection challenge:
If Cell Type A expresses genes {X, Y, Z} and Cell Type B expresses genes {A, B, C}, a doublet appears to express {X, Y, Z, A, B, C}. This might look like:
With 5–8% doublet rates in 10,000-cell experiments, you might have 500–800 doublets. Leaving them in your analysis creates artificial clusters.
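A quick sanity check on these numbers, plus an illustration of why doublets mislead clustering. The gene names are the placeholders from the example above, and the 6% rate is an assumed midpoint of the quoted range:

```python
# A doublet's apparent profile is roughly the union of its parents'.
# Gene names are the placeholders from the text; 6% is an assumed
# midpoint of the 5-8% doublet rate quoted above.

type_a_markers = {"X", "Y", "Z"}
type_b_markers = {"A", "B", "C"}
doublet_profile = type_a_markers | type_b_markers
print(sorted(doublet_profile))     # ['A', 'B', 'C', 'X', 'Y', 'Z']

n_cells, doublet_rate = 10_000, 0.06
print(int(n_cells * doublet_rate)) # ~600 expected doublets
```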
Computational detection:
These tools work well but aren’t perfect. Manual inspection of suspicious clusters remains important.
With 20,000 genes, each cell is a point in 20,000-dimensional space. But most genes are zeros, and many genes are correlated.
Problems in high dimensions:
Solution: Dimensionality reduction
Compress 20,000 genes down to 20–50 dimensions that capture most variation:
These methods are essential preprocessing steps before clustering and visualization.
Most single-cell RNA-seq analyses follow this pipeline:
We’ll walk through each step with biological intuition.
Bad cells to remove:
Typical QC metrics:
Example:
Cell_1: 15,000 UMIs, 3,500 genes, 5% mitochondrial → KEEP
Cell_2: 300 UMIs, 180 genes, 8% mitochondrial → REMOVE (low quality)
Cell_3: 80,000 UMIs, 8,000 genes, 4% mitochondrial → REMOVE (likely doublet)
Cell_4: 12,000 UMIs, 2,800 genes, 35% mitochondrial → REMOVE (dying cell)
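These decisions can be encoded as a small filter function. The cutoffs below (500 minimum UMIs, 200 minimum genes, a 50,000-UMI ceiling, 20% mitochondrial maximum) are illustrative choices that reproduce the example’s keep/remove calls — not universal standards; real thresholds should be tuned per dataset:

```python
# Minimal QC filter reproducing the example's decisions.
# All cutoffs are illustrative assumptions, not universal standards.

def passes_qc(umis, genes, pct_mito,
              min_umis=500, min_genes=200, max_umis=50_000, max_mito=0.20):
    if umis < min_umis or genes < min_genes:
        return False          # likely empty droplet / degraded cell
    if umis > max_umis:
        return False          # suspiciously high capture -> possible doublet
    if pct_mito > max_mito:
        return False          # high mitochondrial fraction -> stressed/dying cell
    return True

cells = [
    ("Cell_1", 15_000, 3_500, 0.05),   # keep
    ("Cell_2",    300,   180, 0.08),   # remove: low quality
    ("Cell_3", 80_000, 8_000, 0.04),   # remove: likely doublet
    ("Cell_4", 12_000, 2_800, 0.35),   # remove: dying cell
]
kept = [name for name, u, g, m in cells if passes_qc(u, g, m)]
print(kept)  # ['Cell_1']
```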
After QC, you might retain 85–95% of cells from a high-quality experiment.
Cells capture different amounts of RNA—some cells are larger, some droplets captured more efficiently. This creates technical variation that you need to remove.
Problem:
Cell A captured 10,000 UMIs total; gene X has 20 UMIs.
Cell B captured 30,000 UMIs total; gene X has 40 UMIs.
Is gene X expressed more highly in Cell B? Not necessarily—Cell B just captured more molecules overall.
Solution: Normalize to counts per 10,000 (or per 1 million)
Cell A: 20 / 10,000 × 10,000 = 20
Cell B: 40 / 30,000 × 10,000 = 13.3
After normalization, Cell A actually has higher relative expression of gene X.
Log transformation:
Count data is heavily right-skewed (most genes have low counts, few have very high counts). Taking the log compresses the range:
log(count + 1)
The “+1” prevents log(0) = −infinity.
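Both steps fit in a few lines. This sketch applies counts-per-10,000 normalization followed by the log transform to gene X from the example, using only the standard library:

```python
import math

def cp10k(count, total_umis):
    """Counts-per-10,000 normalization for one gene in one cell."""
    return count / total_umis * 10_000

def log_norm(count, total_umis):
    """log(normalized count + 1); the +1 avoids log(0)."""
    return math.log(cp10k(count, total_umis) + 1)

# Gene X from the example: Cell A (20/10,000), Cell B (40/30,000)
print(cp10k(20, 10_000))             # 20.0
print(round(cp10k(40, 30_000), 1))   # 13.3
print(round(log_norm(20, 10_000), 3))
```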
Not all 20,000 genes are informative. Many genes:
Highly variable genes show more variation than expected from technical noise alone. These are typically:
Standard practice: Select top 2,000–5,000 highly variable genes for downstream analysis. This:
PCA (Principal Component Analysis):
PCA finds linear combinations of genes that explain the most variation:
Most analyses use the first 20–50 PCs and discard the rest. This captures major biological variation while removing noise.
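The PCA step itself can be sketched with plain NumPy via singular value decomposition of the centered matrix. The toy data below (100 cells, 2,000 genes, random counts) is an illustrative stand-in for a real log-normalized expression matrix; in practice you would use Scanpy’s `sc.pp.pca` or Seurat’s `RunPCA`:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy log-normalized matrix: 100 cells x 2,000 highly variable genes
X = np.log1p(rng.poisson(1.0, size=(100, 2000)).astype(float))

# PCA via SVD of the centered matrix; keep the top 50 components
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
n_pcs = 50
pcs = U[:, :n_pcs] * S[:n_pcs]        # cell embeddings, shape (100, 50)

# Fraction of total variance captured by the retained PCs
var_frac = (S[:n_pcs] ** 2).sum() / (S ** 2).sum()
print(pcs.shape)
```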
UMAP (Uniform Manifold Approximation and Projection):
UMAP takes the 20–50 PCs and projects them down to 2 dimensions for visualization. Unlike PCA, UMAP is nonlinear—it tries to preserve local structure (cells that are close in high dimensions stay close in 2D).
Typical result:
Cells form distinct clusters in UMAP space. Each cluster often corresponds to a cell type or cell state.
Biological analogy: Cell clustering and UMAP are like sorting a mixed population of cells under a microscope — cells that behave similarly cluster together, revealing hidden subpopulations you couldn’t see in bulk experiments. The UMAP plot is your map of the cell landscape.
Clustering algorithms group cells based on their expression profiles. The most common approach:
Key parameter: resolution
There’s no single “correct” clustering—it depends on your biological question.
After clustering, you have groups of cells, but what are they?
Marker gene approach:
Automated annotation:
Tools like SingleR, CellTypist, and scType use reference datasets to automatically annotate cells. They work well for common cell types but struggle with rare or novel populations.
Study: Travaglini et al., “A molecular cell atlas of the human lung from single-cell RNA sequencing.” Nature 2020.
Challenge: The lung contains dozens of cell types—epithelial cells (multiple subtypes), immune cells, endothelial cells, fibroblasts, and more. Bulk RNA-seq can’t resolve this complexity, and traditional histology only reveals morphology, not molecular state.
Approach:
Key Findings:
Epithelial diversity: Found rare cell types including pulmonary ionocytes (important for cystic fibrosis), neuroendocrine cells, and basal cell subtypes
Alveolar cell states: Distinguished AT1 cells (gas exchange) from AT2 cells (surfactant production), plus intermediate states suggesting regeneration
Immune landscape: Characterized tissue-resident macrophages, dendritic cells, T cell subsets, and B cells—each with distinct gene expression profiles
Disease alterations: In pulmonary fibrosis samples:
Impact:
Computational Challenge: Processing 312,928 cells required specialized infrastructure and algorithms. The researchers used Scanpy (Python) and Seurat (R), running on high-memory servers with 256+ GB RAM.
Study: Corces et al., “Single-cell epigenomic analyses implicate candidate causal variants at inherited risk loci for Alzheimer’s and Parkinson’s diseases.” Nature Genetics 2020. (The approach is adapted here as an illustrative example for the autism context.)
Challenge: Genome-wide association studies (GWAS) identify hundreds of genetic variants associated with autism spectrum disorder. But 90%+ of these variants are in non-coding regions. Which cell types do they affect? What regulatory elements are involved?
Approach:
Key Findings:
Cell type-specific accessibility: Different neuronal subtypes showed distinct chromatin accessibility patterns, with excitatory neurons and inhibitory interneurons having the most distinctive profiles
GWAS variant enrichment: Autism-associated variants were significantly enriched in regulatory elements accessible in:
Target gene prediction: By linking accessible enhancers to nearby genes, identified likely target genes for non-coding variants, including genes involved in synaptic function and neuronal development
Developmental timing: Many affected regulatory elements showed evidence of being active during fetal brain development, suggesting critical windows for autism risk
Why This Needed Single-Cell:
Bulk ATAC-seq would have mixed signals from dozens of cell types. The brain contains:
By measuring individual cells, researchers pinpointed which specific cell types harbor the regulatory elements affected by autism-associated variants.
Throughout this chapter, you’ve seen why single-cell omics creates unique computational challenges:
Traditional statistical methods struggle with these challenges. This is where machine learning and deep learning become essential tools.
In the next chapter, we’ll explore single-cell foundation models—large neural networks trained on millions of cells that can:
These models represent a new paradigm: rather than analyzing each dataset in isolation, we can train models on comprehensive cell atlases and apply them to new data. This is the future of single-cell analysis.
Single-cell omics measures individual cells rather than tissue averages, revealing cellular heterogeneity invisible to bulk sequencing methods
Droplet-based scRNA-seq uses microfluidic barcoding to profile thousands of cells simultaneously, with each cell receiving a unique molecular barcode
Single-cell data is extremely sparse (85–95% zeros), creating unique computational challenges not present in bulk sequencing
scRNA-seq measures gene expression while scATAC-seq measures chromatin accessibility, providing complementary views of cellular state and gene regulation
Standard analysis pipeline includes QC, normalization, feature selection, dimensionality reduction, clustering, and cell type annotation
Major computational challenges include scale (millions of cells), dropout (technical zeros), batch effects, and doublets
Single-cell resolution reveals biology impossible to see otherwise: rare cell types, cell state transitions, composition effects, and disease-affected populations
Machine learning is essential for handling single-cell data scale, integrating modalities, and extracting biological insights from noisy, sparse measurements
| Term | Definition |
|---|---|
| ATAC-seq | Technology that maps chromatin accessibility by identifying regions where Tn5 transposase can insert sequencing adapters. |
| Batch effects | Technical variation introduced by processing samples at different times, with different reagents, or in different laboratories, which can obscure biological differences. |
| Cell barcoding | Molecular technique where each cell receives a unique DNA barcode sequence, allowing RNA molecules from thousands of cells to be pooled and sequenced together while maintaining cell identity. |
| Composition effect | A change in bulk measurements caused by altered cell type proportions rather than changes in gene expression within individual cell types. |
| Dimensionality reduction | Computational technique that projects high-dimensional data (20,000 genes) into lower dimensions (20–50 PCs or 2D for visualization) while preserving meaningful variation. |
| Doublet | A droplet containing two cells instead of one, resulting in mixed expression profiles that can be mistaken for novel cell types. |
| Dropout | Technical phenomenon where an expressed gene appears as zero count in single-cell data due to low capture efficiency, creating false sparsity. |
| Highly variable genes | Genes showing more expression variation across cells than expected from technical noise alone, typically including cell type markers and biologically dynamic genes. |
| PCA | Linear dimensionality reduction method that identifies axes of maximum variation in the data, commonly used as first step in single-cell analysis. |
| scATAC-seq | Single-cell version of ATAC-seq that measures chromatin accessibility in individual cells, revealing cell type-specific regulatory landscapes. |
| scRNA-seq | Transcriptomic technology that measures gene expression in individual cells, revealing cellular heterogeneity within tissues. |
| Sparsity | Property of single-cell data where most values are zeros (85–95%), resulting from a combination of low expression, low capture efficiency, and finite sequencing depth. |
| UMI | Random short DNA sequence attached to each RNA molecule before PCR, allowing true biological molecules to be distinguished from PCR duplicates. |
| UMAP | Nonlinear dimensionality reduction technique that preserves local structure, widely used for visualizing single-cell data in 2D. |
A researcher measures average gene expression from a tumor sample using bulk RNA-seq and finds that Gene X is expressed at 50 copies per cell. After performing single-cell RNA-seq on the same tumor, they discover that 60% of cells express Gene X at 100 copies, while 40% express it at 0 copies. When they analyze an additional tumor from a different patient, bulk shows 30 copies per cell, while single-cell shows 30% of cells expressing Gene X at 100 copies and 70% at 0 copies. What biological interpretation does single-cell resolution provide that bulk measurements miss?
In scRNA-seq, a gene expressed at 50 copies per cell shows counts of 0 in 30% of cells, 1–5 in 50% of cells, and 10+ in 20% of cells. Is this gene actually not expressed in 30% of cells, or is something else happening? What technical factors could cause this pattern?
You perform scRNA-seq on brain tissue and identify 15 clusters. One cluster expresses both neuronal markers (MAP2, SYP) and astrocyte markers (GFAP, AQP4). What are three possible explanations for this observation, and how would you investigate which explanation is correct?
A researcher compares unaffected brain tissue to samples from patients with autism spectrum disorder using scRNA-seq. They find that Gene Y shows no change in expression level within any cell type, but the proportion of excitatory neurons increases from 40% to 55% in affected samples. How would this appear in a bulk RNA-seq experiment? Why might this lead to incorrect conclusions?
Why is scATAC-seq data (95–99% zeros) even more sparse than scRNA-seq data (85–95% zeros), despite measuring a similar number of features? Consider the molecular biology underlying each technology.
You sequence 10,000 cells at 20,000 UMIs per cell versus 2,000 cells at 100,000 UMIs per cell (same total sequencing cost). What are the trade-offs between these two strategies, and when would you choose each approach?
Imagine you’re studying immune responses to infection. You profile immune cells at 0, 2, 6, 12, and 24 hours post-infection using scRNA-seq. You discover that some cells present at 6 hours don’t cluster with cells from any other time point. What are possible biological interpretations? How could you test these hypotheses?
A cell shows 50% of its reads mapping to mitochondrial genes. Why is this considered a quality control failure? What biological state might this cell be in, and why don’t we want to include such cells in downstream analysis?
Klein AM, et al. “Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells.” Cell 2015;161(5):1187–1201.
Zheng GXY, et al. “Massively parallel digital transcriptional profiling of single cells.” Nature Communications 2017;8:14049.
Buenrostro JD, et al. “Single-cell chromatin accessibility reveals principles of regulatory variation.” Nature 2015;523(7561):486–490.
Lähnemann D, et al. “Eleven grand challenges in single-cell data science.” Genome Biology 2020;21:31.
Luecken MD & Theis FJ. “Current best practices in single-cell RNA-seq analysis: a tutorial.” Molecular Systems Biology 2019;15(6):e8746.
Cusanovich DA, et al. “The cis-regulatory dynamics of embryonic development at single-cell resolution.” Nature 2018;555(7697):538–542.
In Chapter 16: Single-Cell Foundation Models, we’ll explore how deep learning models trained on millions of cells can:
These foundation models represent a paradigm shift: rather than analyzing each dataset independently, we can build upon comprehensive cell atlases to understand new data.
Before moving to Chapter 16, make sure you can: