Under the microscope, the tumor biopsy looks like a uniform mass of cells. The bulk RNA-seq confirms this impression: moderate upregulation of immune genes, modest changes in metabolic pathways. A seemingly straightforward picture — immune infiltration, metabolic stress, the usual hallmarks. You write it up in your notebook and prepare for the next experiment.
Then a collaborator runs the same tissue through a single-cell sequencer. The result is a revelation. That “uniform mass” contains at least 15 distinct cell populations. Some T cells are in an exhausted state, their cytotoxic function silenced by chronic antigen exposure. Others are actively proliferating — a completely different biology. Some cancer cells are dividing rapidly, others are quiescent and potentially therapy-resistant. The metabolic “moderate upregulation” was actually two opposing populations averaged together — one dramatically upregulated, the other completely silent. The average was a statistical artifact that described neither population accurately.
The single-cell data didn’t just add resolution. It told an entirely different biological story. The exhausted T cells explain why checkpoint inhibitor therapy might fail in this patient. The quiescent cancer cells explain why the tumor regrows after chemotherapy. The rare dendritic cell population — just 2% of cells, invisible in bulk measurements — shows the strongest activation signature and may be the key to designing a better immunotherapy. None of this was visible at bulk resolution. All of it was present, waiting to be seen.
This is the promise — and the challenge — of single-cell omics. The data is richer by orders of magnitude. But a single experiment profiling 50,000 cells produces as much data as 50,000 bulk experiments. Understanding what you’ve measured requires computational methods as sophisticated as the biology itself.
When you extract RNA from a tissue sample and perform bulk RNA-seq, you measure the average gene expression across millions of cells. If 90% of cells express gene A at 10 copies and 10% express it at 100 copies, you detect an average of 19 copies per cell. Nothing in that single number reveals the underlying heterogeneity.
This averaging problem becomes critical in several contexts:
Tissue Complexity: Your brain contains over 100 different neuronal cell types, plus glia, immune cells, and vascular cells. A bulk measurement mixes all these signals together. When studying autism spectrum disorder or Alzheimer’s, you can’t tell which specific cell types show altered gene expression.
Rare Cell Populations: Stem cells often comprise less than 1% of a tissue. Circulating tumor cells can be 1 in 10 million blood cells. Pancreatic beta cells make up only 1-2% of the pancreas. These rare but critical populations vanish into the noise of bulk measurements.
Dynamic Processes: During development or immune responses, cells transition through transient states. Bulk measurements capture a static average but miss the trajectory. You can’t see that some cells are differentiating while others remain in a progenitor state.
Cellular Heterogeneity in Disease: Tumors aren’t uniform masses—they contain cancer cells at different stages, various immune cells, stromal cells, and more. In autism spectrum disorder, specific neuronal subtypes may be affected while others remain unaffected. Bulk measurements hide this crucial heterogeneity.
The experimental solution exists: single-cell RNA sequencing (scRNA-seq) and related technologies. But these methods generate massive datasets—analyzing 10,000 cells produces as much data as 10,000 bulk RNA-seq experiments. A single study might profile 100,000–500,000 cells. This is where computational methods become not just helpful but absolutely necessary.
After completing this chapter, you will be able to:
In bulk RNA-seq, which you may have encountered in earlier biology courses, the workflow is straightforward:
The result: one measurement per gene representing the average across all cells. For a human sample with ~20,000 genes, you get ~20,000 measurements total.
This approach works beautifully for some questions. If you want to know whether liver cells generally express more albumin than kidney cells, bulk sequencing answers this clearly. The liver-versus-kidney difference is so large that cellular heterogeneity within each tissue doesn’t matter.
Consider a simple example with real numbers. You analyze a tissue sample containing two cell types:
Bulk sequencing reports: 40 copies per cell on average (0.8 × 50 + 0.2 × 0 = 40).
Now imagine you’re comparing unaffected tissue to samples from patients with a metabolic disorder:
Unaffected tissue:
Affected tissue:
Bulk RNA-seq shows a 25% decrease in gene X expression. But gene X expression in individual cells hasn’t changed at all! The change is entirely due to altered cell type proportions. This is called a “composition effect,” and it has confounded many published bulk studies.
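The arithmetic behind this composition effect is easy to verify. Here is a minimal Python sketch; the per-cell expression values (50 and 0 copies) come from the example above, while the affected tissue’s 60% Type A proportion is an assumption chosen to reproduce the 25% decrease:

```python
# Composition effect: the bulk average shifts even though per-cell
# expression never changes. The 60% Type A fraction for affected
# tissue is an assumption that matches the 25% decrease in the text.

def bulk_average(frac_type_a, expr_a=50, expr_b=0):
    """Average copies per cell for a Type A / Type B mixture."""
    return frac_type_a * expr_a + (1 - frac_type_a) * expr_b

unaffected = bulk_average(0.80)   # 80% Type A
affected = bulk_average(0.60)     # 60% Type A (assumed)

decrease = (unaffected - affected) / unaffected
print(unaffected, affected)       # 40.0 30.0
print(f"{decrease:.0%} decrease") # 25% decrease
```

Note that both bulk values arise from identical per-cell expression — only the mixing proportions differ.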
Single-cell RNA-seq solves this by measuring each cell independently. For the same sample, you’d get:
Cell_1 (Type A): 52 copies
Cell_2 (Type A): 48 copies
Cell_3 (Type B): 0 copies
Cell_4 (Type A): 51 copies
...
Cell_10000 (Type B): 0 copies
Now you can:
The cost: instead of one measurement per gene, you have 10,000 measurements per gene. Your data matrix goes from 20,000 numbers to 200 million numbers.
[What’s new here for AI/ML?] You likely already know about scRNA-seq from earlier courses. What changes with AI is what we can do with this data at scale. Traditional tools could cluster 10,000 cells; modern foundation models learn from 100 million cells across hundreds of studies to build universal representations of cellular state. The biology is familiar — the computational leap is new.
The fundamental problem in scRNA-seq is this: how do you keep track of which RNA molecules came from which cell when you sequence millions of molecules together?
The solution is elegant: molecular barcoding. Each cell gets a unique DNA barcode sequence, and all RNA molecules from that cell are tagged with its barcode before pooling.
The most widely used platform, 10x Genomics Chromium, works like this:
Step 1: Cell Encapsulation
Step 2: Cell Lysis and Barcoding
Step 3: Breaking Droplets and Library Prep
Step 4: Sequencing and Demultiplexing
The beauty of this system: you can sequence 10,000 cells in one lane, and the barcodes tell you which reads came from which cell. No manual cell sorting required.
Typical 10x Genomics Run:
Why Not More Genes per Cell?
You might wonder: if cells express ~10,000–15,000 genes, why do we only detect 2,000–5,000? Three reasons:
This creates sparsity: most entries in your data matrix are zeros, even when the gene is actually expressed at low levels.
Smart-seq2/3 (full-length sequencing):
BD Rhapsody (targeted sequencing):
SPLiT-seq (combinatorial barcoding):
A single-cell RNA-seq dataset is organized as a matrix:
| | Cell_1 | Cell_2 | Cell_3 | ... | Cell_10000 |
|---|---|---|---|---|---|
| Gene_1 (TP53) | 45 | 0 | 23 | ... | 12 |
| Gene_2 (ACTB) | 523 | 487 | 612 | ... | 445 |
| Gene_3 (GAPDH) | 234 | 198 | 276 | ... | 201 |
| ... | ... | ... | ... | ... | ... |
| Gene_20000 | 0 | 2 | 0 | ... | 0 |
Biological analogy: This is like a giant spreadsheet where rows are genes and columns are cells — imagine a classroom attendance sheet, except instead of ‘present/absent’ it counts how many mRNA molecules each student (cell) produced for each subject (gene). Most entries will be zero because most cells only express a fraction of all genes.
Dimensions:
Memory Requirements:
Unlike bulk RNA-seq where most genes have non-zero counts, single-cell data is typically 85–95% zeros:
Bulk RNA-seq: [523, 234, 45, 12, 89, ...] (most non-zero)
scRNA-seq: [0, 0, 45, 0, 0, 12, 0, 0, 0, 523, ...] (mostly zeros)
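Because of this sparsity, single-cell matrices are stored in compressed formats that record only the nonzero entries. Here is a toy sketch of the idea; real pipelines use CSR/CSC sparse matrices (e.g., from SciPy) rather than Python lists:

```python
# Toy expression vector for one cell (mostly zeros, as in scRNA-seq)
dense_cell = [0, 0, 45, 0, 0, 12, 0, 0, 0, 523]

# Coordinate representation: store only (gene_index, count) pairs
sparse_cell = [(i, v) for i, v in enumerate(dense_cell) if v != 0]

print(sparse_cell)                              # [(2, 45), (5, 12), (9, 523)]
print(len(dense_cell), "->", len(sparse_cell))  # 10 stored values -> 3
```

With 90% zeros, this roughly decimates memory use — the difference between a dataset that fits in RAM and one that doesn’t.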
Sources of Zeros:
You cannot distinguish biological from technical zeros without additional information. This ambiguity complicates downstream analysis.
Unlike bulk RNA-seq, single-cell data has different statistical properties:
Bulk RNA-seq counts per gene:
Single-cell UMI counts per gene per cell:
This means standard bulk RNA-seq analysis methods often fail for single-cell data. Methods need to:
[Optional: The Math] — Sequencing Depth and Gene Detection Probability
For a gene expressed at m mRNA copies per cell, the probability of detecting it in scRNA-seq depends on:
- Capture efficiency (c): What fraction of mRNAs get captured? Typically c ≈ 0.1–0.3
- Sequencing depth (d): How many UMIs per cell? Typically d = 10,000–50,000
- Library size (L): Total mRNAs in cell, typically L ≈ 100,000–500,000
The expected number of UMIs detected for this gene is:
E[UMIs detected] = m × c × (d / L)
Example: A gene with 100 copies, 20% capture, 20,000 depth, 200,000 library size:
E[UMIs] = 100 × 0.2 × (20,000 / 200,000) = 2
The actual count follows a Poisson distribution, so:
- P(detecting 0) = e^(−2) ≈ 13.5%
- P(detecting 1) ≈ 27%
- P(detecting 2) ≈ 27%
This shows why even expressed genes often show zeros, and why increasing sequencing depth helps but has diminishing returns.
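The worked example above can be checked numerically with a short standard-library script (the parameter values are the ones from the example):

```python
import math

def expected_umis(m, c, d, L):
    """Expected UMI count: copies x capture efficiency x depth/library size."""
    return m * c * (d / L)

def p_detect(k, lam):
    """Poisson probability of observing exactly k UMIs."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = expected_umis(m=100, c=0.2, d=20_000, L=200_000)
print(lam)                          # 2.0
print(round(p_detect(0, lam), 3))   # 0.135 -> ~13.5% chance of a dropout
print(round(p_detect(1, lam), 3))   # 0.271
print(round(p_detect(2, lam), 3))   # 0.271
```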
While scRNA-seq tells you which genes are expressed, it doesn’t directly tell you why. Gene regulation happens largely through regulatory elements—promoters and enhancers—and these elements need to be accessible to transcription factors.
Chromatin accessibility indicates which regulatory elements are “open” (accessible) versus “closed” (wrapped tightly in nucleosomes). In active regulatory regions:
ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) uses a clever molecular trick:
Bulk ATAC-seq gives you accessibility averaged across millions of cells. Single-cell ATAC-seq (scATAC-seq) measures accessibility in individual cells.
Unlike scRNA-seq which measures ~20,000 genes, scATAC-seq measures accessibility across:
The data matrix looks similar:
| | Cell_1 | Cell_2 | Cell_3 | ... | Cell_10000 |
|---|---|---|---|---|---|
| Peak_1 (chr1:1000) | 1 | 0 | 1 | ... | 0 |
| Peak_2 (chr1:5400) | 0 | 0 | 0 | ... | 1 |
| Peak_3 (chr2:8900) | 2 | 1 | 0 | ... | 0 |
| ... | ... | ... | ... | ... | ... |
| Peak_200000 | 0 | 0 | 1 | ... | 0 |
Values are typically 0, 1, or 2 (rare events where multiple insertions happened in the same peak in the same cell).
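Because counts above 1 are rare and carry little extra information, a common preprocessing choice is to binarize scATAC profiles into “accessible vs. not accessible.” A one-line sketch on toy counts:

```python
# Binarize one cell's peak counts: was the peak accessible at all?
peak_counts = [0, 1, 2, 0, 1]                     # toy counts for 5 peaks
binary = [1 if c > 0 else 0 for c in peak_counts]
print(binary)  # [0, 1, 1, 0, 1]
```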
scATAC-seq data is extremely sparse—typically 95–99% zeros. Why?
Typical scATAC-seq cell:
The power of scATAC-seq comes from linking regulatory elements to gene expression:
Increasingly, researchers perform multimodal measurements: scRNA-seq + scATAC-seq on the same cells. This reveals:
Early scRNA-seq studies (2015–2017) profiled 100–1,000 cells. Today:
Computational implications:
A dataset with 1 million cells and 20,000 genes contains 20 billion values. Even in sparse format, this requires:
Standard analysis tools designed for bulk data simply cannot handle these scales.
The technical zeros in single-cell data create a phenomenon called “dropout”: genes that are expressed but appear as zeros in many cells.
Consequences:
Various computational approaches try to address this:
When you run scRNA-seq experiments on different days, with different reagent lots, or in different labs, you introduce batch effects—technical variation that doesn’t reflect biology.
Example scenario:
You profile:
Ideally, A and C should cluster together (both unaffected), and B and D together (both affected). But batch effects might make:
Batch effects can be larger than biological effects in single-cell data because:
Computational solutions:
In droplet-based scRNA-seq, sometimes two cells are captured in the same droplet. The result: a “doublet” that appears to express genes from both cell types.
Detection challenge:
If Cell Type A expresses genes {X, Y, Z} and Cell Type B expresses genes {A, B, C}, a doublet appears to express {X, Y, Z, A, B, C}. This might look like:
With 5–8% doublet rates in 10,000-cell experiments, you might have 500–800 doublets. Leaving them in your analysis creates artificial clusters.
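A quick sanity check on these numbers, plus an illustration of why doublets mislead clustering. The gene names are the placeholders from the example above, and the 6% rate is an assumed midpoint of the quoted range:

```python
# A doublet's apparent profile is roughly the union of its parents'.
# Gene names are the placeholders from the text; 6% is an assumed
# midpoint of the 5-8% doublet rate quoted above.

type_a_markers = {"X", "Y", "Z"}
type_b_markers = {"A", "B", "C"}
doublet_profile = type_a_markers | type_b_markers
print(sorted(doublet_profile))     # ['A', 'B', 'C', 'X', 'Y', 'Z']

n_cells, doublet_rate = 10_000, 0.06
print(int(n_cells * doublet_rate)) # ~600 expected doublets
```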
Computational detection:
These tools work well but aren’t perfect. Manual inspection of suspicious clusters remains important.
With 20,000 genes, each cell is a point in 20,000-dimensional space. But most genes are zeros, and many genes are correlated.
Problems in high dimensions:
Solution: Dimensionality reduction
Compress 20,000 genes down to 20–50 dimensions that capture most variation:
These methods are essential preprocessing steps before clustering and visualization.
Most single-cell RNA-seq analyses follow this pipeline:
We’ll walk through each step with biological intuition.
Bad cells to remove:
Typical QC metrics:
Example:
Cell_1: 15,000 UMIs, 3,500 genes, 5% mitochondrial → KEEP
Cell_2: 300 UMIs, 180 genes, 8% mitochondrial → REMOVE (low quality)
Cell_3: 80,000 UMIs, 8,000 genes, 4% mitochondrial → REMOVE (likely doublet)
Cell_4: 12,000 UMIs, 2,800 genes, 35% mitochondrial → REMOVE (dying cell)
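These decisions can be encoded as a small filter function. The cutoffs below (500 minimum UMIs, 200 minimum genes, a 50,000-UMI ceiling, 20% mitochondrial maximum) are illustrative choices that reproduce the example’s keep/remove calls — not universal standards; real thresholds should be tuned per dataset:

```python
# Minimal QC filter reproducing the example's decisions.
# All cutoffs are illustrative assumptions, not universal standards.

def passes_qc(umis, genes, pct_mito,
              min_umis=500, min_genes=200, max_umis=50_000, max_mito=0.20):
    if umis < min_umis or genes < min_genes:
        return False          # likely empty droplet / degraded cell
    if umis > max_umis:
        return False          # suspiciously high capture -> possible doublet
    if pct_mito > max_mito:
        return False          # high mitochondrial fraction -> stressed/dying cell
    return True

cells = [
    ("Cell_1", 15_000, 3_500, 0.05),   # keep
    ("Cell_2",    300,   180, 0.08),   # remove: low quality
    ("Cell_3", 80_000, 8_000, 0.04),   # remove: likely doublet
    ("Cell_4", 12_000, 2_800, 0.35),   # remove: dying cell
]
kept = [name for name, u, g, m in cells if passes_qc(u, g, m)]
print(kept)  # ['Cell_1']
```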
After QC, you might retain 85–95% of cells from a high-quality experiment.
Cells capture different amounts of RNA—some cells are larger, some droplets captured more efficiently. This creates technical variation that you need to remove.
Problem:
Cell A captured 10,000 UMIs total; gene X has 20 UMIs.
Cell B captured 30,000 UMIs total; gene X has 40 UMIs.
Is gene X expressed more highly in Cell B? Not necessarily—Cell B just captured more molecules overall.
Solution: Normalize to counts per 10,000 (or per 1 million)
Cell A: 20 / 10,000 × 10,000 = 20
Cell B: 40 / 30,000 × 10,000 = 13.3
After normalization, Cell A actually has higher relative expression of gene X.
Log transformation:
Count data is heavily right-skewed (most genes have low counts, few have very high counts). Taking the log compresses the range:
log(count + 1)
The “+1” prevents log(0) = −infinity.
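Both steps fit in a few lines. This sketch applies counts-per-10,000 normalization followed by the log transform to gene X from the example, using only the standard library:

```python
import math

def cp10k(count, total_umis):
    """Counts-per-10,000 normalization for one gene in one cell."""
    return count / total_umis * 10_000

def log_norm(count, total_umis):
    """log(normalized count + 1); the +1 avoids log(0)."""
    return math.log(cp10k(count, total_umis) + 1)

# Gene X from the example: Cell A (20/10,000), Cell B (40/30,000)
print(cp10k(20, 10_000))             # 20.0
print(round(cp10k(40, 30_000), 1))   # 13.3
print(round(log_norm(20, 10_000), 3))
```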
Not all 20,000 genes are informative. Many genes:
Highly variable genes show more variation than expected from technical noise alone. These are typically:
Standard practice: Select top 2,000–5,000 highly variable genes for downstream analysis. This:
PCA (Principal Component Analysis):
PCA finds linear combinations of genes that explain the most variation:
Most analyses use the first 20–50 PCs and discard the rest. This captures major biological variation while removing noise.
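The PCA step itself can be sketched with plain NumPy via singular value decomposition of the centered matrix. The toy data below (100 cells, 2,000 genes, random counts) is an illustrative stand-in for a real log-normalized expression matrix; in practice you would use Scanpy’s `sc.pp.pca` or Seurat’s `RunPCA`:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy log-normalized matrix: 100 cells x 2,000 highly variable genes
X = np.log1p(rng.poisson(1.0, size=(100, 2000)).astype(float))

# PCA via SVD of the centered matrix; keep the top 50 components
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
n_pcs = 50
pcs = U[:, :n_pcs] * S[:n_pcs]        # cell embeddings, shape (100, 50)

# Fraction of total variance captured by the retained PCs
var_frac = (S[:n_pcs] ** 2).sum() / (S ** 2).sum()
print(pcs.shape)
```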
UMAP (Uniform Manifold Approximation and Projection):
UMAP takes the 20–50 PCs and projects them down to 2 dimensions for visualization. Unlike PCA, UMAP is nonlinear—it tries to preserve local structure (cells that are close in high dimensions stay close in 2D).
Typical result:
Cells form distinct clusters in UMAP space. Each cluster often corresponds to a cell type or cell state.
Biological analogy: Cell clustering and UMAP are like sorting a mixed population of cells under a microscope — cells that behave similarly cluster together, revealing hidden subpopulations you couldn’t see in bulk experiments. The UMAP plot is your map of the cell landscape.
Clustering algorithms group cells based on their expression profiles. The most common approach:
Key parameter: resolution
There’s no single “correct” clustering—it depends on your biological question.
After clustering, you have groups of cells, but what are they?
Marker gene approach:
Automated annotation:
Tools like SingleR, CellTypist, and scType use reference datasets to automatically annotate cells. They work well for common cell types but struggle with rare or novel populations.
Study: Travaglini et al., “A molecular cell atlas of the human lung from single-cell RNA sequencing.” Nature 2020.
Challenge: The lung contains dozens of cell types—epithelial cells (multiple subtypes), immune cells, endothelial cells, fibroblasts, and more. Bulk RNA-seq can’t resolve this complexity, and traditional histology only reveals morphology, not molecular state.
Approach:
Key Findings:
Epithelial diversity: Found rare cell types including pulmonary ionocytes (important for cystic fibrosis), neuroendocrine cells, and basal cell subtypes
Alveolar cell states: Distinguished AT1 cells (gas exchange) from AT2 cells (surfactant production), plus intermediate states suggesting regeneration
Immune landscape: Characterized tissue-resident macrophages, dendritic cells, T cell subsets, and B cells—each with distinct gene expression profiles
Disease alterations: In pulmonary fibrosis samples:
Impact:
Computational Challenge: Processing 312,928 cells required specialized infrastructure and algorithms. The researchers used Scanpy (Python) and Seurat (R), running on high-memory servers with 256+ GB RAM.
Study: Corces et al., “Single-cell epigenomic analyses implicate candidate causal variants at inherited risk loci for Alzheimer’s and Parkinson’s diseases.” Nature Genetics 2020. (The approach is adapted here as an illustrative example for the autism context.)
Challenge: Genome-wide association studies (GWAS) identify hundreds of genetic variants associated with autism spectrum disorder. But 90%+ of these variants are in non-coding regions. Which cell types do they affect? What regulatory elements are involved?
Approach:
Key Findings:
Cell type-specific accessibility: Different neuronal subtypes showed distinct chromatin accessibility patterns, with excitatory neurons and inhibitory interneurons having the most distinctive profiles
GWAS variant enrichment: Autism-associated variants were significantly enriched in regulatory elements accessible in:
Target gene prediction: By linking accessible enhancers to nearby genes, identified likely target genes for non-coding variants, including genes involved in synaptic function and neuronal development
Developmental timing: Many affected regulatory elements showed evidence of being active during fetal brain development, suggesting critical windows for autism risk
Why This Needed Single-Cell:
Bulk ATAC-seq would have mixed signals from dozens of cell types. The brain contains:
By measuring individual cells, researchers pinpointed which specific cell types harbor the regulatory elements affected by autism-associated variants.
Throughout this chapter, you’ve seen why single-cell omics creates unique computational challenges:
Traditional statistical methods struggle with these challenges. This is where machine learning and deep learning become essential tools.
In the next chapter, we’ll explore single-cell foundation models—large neural networks trained on millions of cells that can:
These models represent a new paradigm: rather than analyzing each dataset in isolation, we can train models on comprehensive cell atlases and apply them to new data. This is the future of single-cell analysis.
Single-cell omics measures individual cells rather than tissue averages, revealing cellular heterogeneity invisible to bulk sequencing methods
Droplet-based scRNA-seq uses microfluidic barcoding to profile thousands of cells simultaneously, with each cell receiving a unique molecular barcode
Single-cell data is extremely sparse (85–95% zeros), creating unique computational challenges not present in bulk sequencing
scRNA-seq measures gene expression while scATAC-seq measures chromatin accessibility, providing complementary views of cellular state and gene regulation
Standard analysis pipeline includes QC, normalization, feature selection, dimensionality reduction, clustering, and cell type annotation
Major computational challenges include scale (millions of cells), dropout (technical zeros), batch effects, and doublets
Single-cell resolution reveals biology impossible to see otherwise: rare cell types, cell state transitions, composition effects, and disease-affected populations
Machine learning is essential for handling single-cell data scale, integrating modalities, and extracting biological insights from noisy, sparse measurements
| Term | Definition |
|---|---|
| ATAC-seq | Technology that maps chromatin accessibility by identifying regions where Tn5 transposase can insert sequencing adapters. |
| Batch effects | Technical variation introduced by processing samples at different times, with different reagents, or in different laboratories, which can obscure biological differences. |
| Cell barcoding | Molecular technique where each cell receives a unique DNA barcode sequence, allowing RNA molecules from thousands of cells to be pooled and sequenced together while maintaining cell identity. |
| Composition effect | A change in bulk measurements caused by altered cell type proportions rather than changes in gene expression within individual cell types. |
| Dimensionality reduction | Computational technique that projects high-dimensional data (20,000 genes) into lower dimensions (20–50 PCs or 2D for visualization) while preserving meaningful variation. |
| Doublet | A droplet containing two cells instead of one, resulting in mixed expression profiles that can be mistaken for novel cell types. |
| Dropout | Technical phenomenon where an expressed gene appears as zero count in single-cell data due to low capture efficiency, creating false sparsity. |
| Highly variable genes | Genes showing more expression variation across cells than expected from technical noise alone, typically including cell type markers and biologically dynamic genes. |
| PCA | Linear dimensionality reduction method that identifies axes of maximum variation in the data, commonly used as first step in single-cell analysis. |
| scATAC-seq | Single-cell version of ATAC-seq that measures chromatin accessibility in individual cells, revealing cell type-specific regulatory landscapes. |
| scRNA-seq | Transcriptomic technology that measures gene expression in individual cells, revealing cellular heterogeneity within tissues. |
| Sparsity | Property of single-cell data where most values are zeros (85–95%), resulting from a combination of low expression, low capture efficiency, and finite sequencing depth. |
| UMI | Random short DNA sequence attached to each RNA molecule before PCR, allowing true biological molecules to be distinguished from PCR duplicates. |
| UMAP | Nonlinear dimensionality reduction technique that preserves local structure, widely used for visualizing single-cell data in 2D. |
A researcher measures average gene expression from a tumor sample using bulk RNA-seq and finds that Gene X is expressed at 50 copies per cell. After performing single-cell RNA-seq on the same tumor, they discover that 60% of cells express Gene X at 100 copies, while 40% express it at 0 copies. When they analyze an additional tumor from a different patient, bulk shows 30 copies per cell, while single-cell shows 30% of cells expressing Gene X at 100 copies and 70% at 0 copies. What biological interpretation does single-cell resolution provide that bulk measurements miss?
In scRNA-seq, a gene expressed at 50 copies per cell shows counts of 0 in 30% of cells, 1–5 in 50% of cells, and 10+ in 20% of cells. Is this gene actually not expressed in 30% of cells, or is something else happening? What technical factors could cause this pattern?
You perform scRNA-seq on brain tissue and identify 15 clusters. One cluster expresses both neuronal markers (MAP2, SYP) and astrocyte markers (GFAP, AQP4). What are three possible explanations for this observation, and how would you investigate which explanation is correct?
A researcher compares unaffected brain tissue to samples from patients with autism spectrum disorder using scRNA-seq. They find that Gene Y shows no change in expression level within any cell type, but the proportion of excitatory neurons increases from 40% to 55% in affected samples. How would this appear in a bulk RNA-seq experiment? Why might this lead to incorrect conclusions?
Why is scATAC-seq data (95–99% zeros) even more sparse than scRNA-seq data (85–95% zeros), despite measuring a similar number of features? Consider the molecular biology underlying each technology.
You sequence 10,000 cells at 20,000 UMIs per cell versus 2,000 cells at 100,000 UMIs per cell (same total sequencing cost). What are the trade-offs between these two strategies, and when would you choose each approach?
Imagine you’re studying immune responses to infection. You profile immune cells at 0, 2, 6, 12, and 24 hours post-infection using scRNA-seq. You discover that some cells present at 6 hours don’t cluster with cells from any other time point. What are possible biological interpretations? How could you test these hypotheses?
A cell shows 50% of its reads mapping to mitochondrial genes. Why is this considered a quality control failure? What biological state might this cell be in, and why don’t we want to include such cells in downstream analysis?
Klein AM, et al. “Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells.” Cell 2015;161(5):1187–1201.
Zheng GXY, et al. “Massively parallel digital transcriptional profiling of single cells.” Nature Communications 2017;8:14049.
Buenrostro JD, et al. “Single-cell chromatin accessibility reveals principles of regulatory variation.” Nature 2015;523(7561):486–490.
Lähnemann D, et al. “Eleven grand challenges in single-cell data science.” Genome Biology 2020;21:31.
Luecken MD & Theis FJ. “Current best practices in single-cell RNA-seq analysis: a tutorial.” Molecular Systems Biology 2019;15(6):e8746.
Cusanovich DA, et al. “The cis-regulatory dynamics of embryonic development at single-cell resolution.” Nature 2018;555(7697):538–542.
In Chapter 16: Single-Cell Foundation Models, we’ll explore how deep learning models trained on millions of cells can:
These foundation models represent a paradigm shift: rather than analyzing each dataset independently, we can build upon comprehensive cell atlases to understand new data.
Before moving to Chapter 16, make sure you can: