Interactive Lab: Chapter 3 — Try the 2D classifier, activation function explorer, single neuron playground, and training visualizer!
You’re a second-year biology student staring at 200 DNA sequences, each 20 nucleotides long. Half of them are real splice donor sites (the GT at exon-intron boundaries where your pre-mRNA gets cut), and half are decoy sequences that look similar but aren’t functional. Your professor challenges you: “Can you figure out the rule that separates real splice sites from fake ones?”
You start by looking for the obvious pattern: GT at positions 3-4. Sure enough, almost every real splice site has it. But so do most of the decoys—your professor was sneaky. You look more carefully. Real ones tend to have a G or A at position 2, a run of purines before the GT, and a specific pattern at positions 5-8. You start combining rules: “If position 2 is A AND position 5 is A AND position 7 is G, then… probably real?”
After an hour, you have a messy decision tree with 12 rules that correctly classifies about 75% of the sequences. Your professor smiles and says: “Not bad. But a neural network can learn to do this at 95% accuracy in about 3 seconds—and it discovers patterns you didn’t even notice.”
How?
That’s what this chapter is about. In Chapter 2, you learned that deep learning is a practical approximation of Bayesian inference—finding the best explanation (MAP) by minimizing a loss function. Now we’ll open the box and see the actual machinery: artificial neurons, layers, forward propagation, and the remarkable algorithm called backpropagation that allows networks to learn from mistakes.
The key insight: neural networks aren’t magic. They’re built from the simplest possible building block—a single artificial neuron that does weighted addition followed by a simple nonlinear function. Stack enough of these together, and they can learn almost any pattern.
By the end of this chapter, you will be able to explain how an artificial neuron turns weighted inputs into an output, why activation functions and hidden layers are necessary, how forward propagation and backpropagation let a network learn from labeled data, and how to recognize and combat overfitting.
You learned about neurons in introductory biology. Let’s revisit the key features:
A biological neuron:
Dendrites (inputs) → Cell body (integration + threshold) → Axon (output)
        ↑                            ↑                            ↑
   Signals from               Sum of signals                Signal to next
   other neurons            exceeds threshold?                  neurons
Key properties: the neuron integrates many incoming signals, fires only when their combined strength crosses a threshold, and adjusts its synaptic strengths with experience. This is exactly what artificial neurons mimic.
An artificial neuron takes the biological concept and simplifies it to pure math:
Inputs: x₁, x₂, x₃, ... xₙ (like signals from dendrites)
Weights: w₁, w₂, w₃, ... wₙ (like synaptic strengths)
Bias: b (like the resting potential)
Summation: z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b (like cell body integration)
Activation: a = f(z) (like the firing threshold)
Output: a (like the axon signal)
Genomics example: Is this variant pathogenic?
Imagine a single neuron that takes three features of a genetic variant:
z = (0.8 × conservation) + (-0.6 × frequency) + (0.5 × structural_impact) + (-0.1)
            ↑                       ↑                       ↑                  ↑
    "High conservation       "Common variants      "Structural changes       Bias
     suggests important"      are usually benign"   may be damaging"
Notice how the weights encode what the neuron has learned about the relationship between inputs and the output: a positive weight (conservation, structural impact) pushes the prediction toward “pathogenic,” a negative weight (frequency) pushes it toward “benign,” and the size of each weight says how much that feature matters.
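To make the arithmetic concrete, here is a minimal Python sketch of this single neuron, using the illustrative weights from the formula above (the feature values passed in at the end are made up for the example):

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def variant_neuron(conservation, frequency, structural_impact):
    """A single artificial neuron: weighted sum + bias, then a sigmoid activation."""
    z = 0.8 * conservation - 0.6 * frequency + 0.5 * structural_impact - 0.1
    return sigmoid(z)

# A highly conserved, rare variant that alters protein structure (illustrative values):
print(variant_neuron(conservation=0.9, frequency=0.01, structural_impact=0.8))  # ~0.73
```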
| Biological Neuron | Artificial Neuron | What It Does |
|---|---|---|
| Dendrites | Inputs (x₁, x₂, …) | Receive information |
| Synaptic strength | Weights (w₁, w₂, …) | How much each input matters |
| Resting potential | Bias (b) | Default tendency |
| Cell body summation | z = Σwᵢxᵢ + b | Combine all information |
| Firing threshold | Activation function f(z) | Decide whether to “fire” |
| Axon output | Output a = f(z) | Send signal forward |
| Long-term potentiation | Increasing weight during training | Learning from experience |
| Long-term depression | Decreasing weight during training | Unlearning wrong patterns |
Important caveat: This analogy is useful for intuition, but artificial neurons are a dramatic simplification. Real neurons communicate through complex temporal patterns of spikes, use dozens of neurotransmitters, and have intricate dendritic computation. The analogy helps you understand the concept, not the biology.
Without an activation function, our neuron just computes: z = w₁x₁ + w₂x₂ + b
This is a linear function. It can only draw straight lines.
Why is this a problem?
Think about classifying genetic variants as pathogenic vs. benign. In reality, the features interact: high conservation matters most when the variant also disrupts protein structure, and even a conserved variant is usually benign if it is common in the population.
This relationship is nonlinear—you can’t separate pathogenic from benign with a single straight line in the conservation × structural_impact space.
The activation function adds the nonlinearity we need.
σ(z) = 1 / (1 + e^(-z))
Input: any number from -∞ to +∞
Output: a number between 0 and 1
Biology analogy: Like the dose-response curve in pharmacology. At low drug concentrations, there’s no response. At high concentrations, the response saturates. In between, there’s a smooth S-shaped transition.
When it’s useful: When you want an output that represents a probability. “There’s a 73% chance this variant is pathogenic.”
The problem: For very large or very small inputs, the sigmoid is nearly flat (gradient ≈ 0). This makes learning very slow — a problem called vanishing gradients that we’ll discuss later.
tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
Input: any number from -∞ to +∞
Output: a number between -1 and +1
Biology analogy: Like gene expression fold-change centered on zero. Positive means upregulated, negative means downregulated, zero means no change. The magnitude tells you how strong the effect is, saturating at extreme values.
Advantage over sigmoid: Centered around zero, which helps with training. But still suffers from vanishing gradients at the extremes.
ReLU(z) = max(0, z)
Input: any number
Output: 0 if input is negative, otherwise the input itself
Biology analogy: Like a gene that’s either silent (expression = 0) or active (expression proportional to signal). There’s no “negative expression” — once the gene is off, it’s off. But when it’s on, the response is proportional to the input.
Why it’s the most popular today: it is trivially cheap to compute, and its gradient is exactly 1 for any positive input, so gradients don’t shrink as they flow backward through many layers.
The downside: “Dead neurons” — if a neuron’s input is always negative, it always outputs 0 and can never recover. Variants like Leaky ReLU (allows a small negative slope) fix this.
| Activation | Output Range | Best For | Problem |
|---|---|---|---|
| Sigmoid | 0 to 1 | Output layer (binary classification) | Vanishing gradients |
| Tanh | -1 to 1 | Hidden layers (old models) | Vanishing gradients |
| ReLU | 0 to ∞ | Hidden layers (modern default) | Dead neurons |
Modern practice: Use ReLU (or its variants) for hidden layers, sigmoid for binary output, softmax (generalized sigmoid) for multi-class output.
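If you want to see these functions side by side, here is a short NumPy sketch of the activations discussed above (Leaky ReLU included for comparison):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # output in (0, 1)

def tanh(z):
    return np.tanh(z)                        # output in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)                # 0 for negative inputs, z otherwise

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)     # small negative slope avoids "dead" neurons

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))   # [0.007 0.269 0.5   0.731 0.993]  (rounded)
print(tanh(z))      # [-1.000 -0.762 0.000 0.762 1.000]  (rounded)
print(relu(z))      # [0. 0. 0. 1. 5.]
```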
Let’s see what a single neuron can actually do. With two inputs (conservation score and structural impact), the neuron computes:
z = w₁ × conservation + w₂ × structural_impact + b
output = sigmoid(z)
If output > 0.5, we predict “pathogenic.” Otherwise, “benign.”
The decision boundary is where output = 0.5, which means z = 0:
w₁ × conservation + w₂ × structural_impact + b = 0
This is the equation of a straight line in 2D space (or a plane in 3D, or a hyperplane in higher dimensions).
The single neuron divides the input space with a straight line.
Everything on one side is predicted pathogenic; everything on the other side is predicted benign.
What if the real pattern looks like this?
Structural Impact
        ↑
        |  B B B B P P
        |  B B P P P P
        |  B P P P P B
        |  P P P B B B
        |  P P B B B B
        +——————————————→ Conservation
P = Pathogenic, B = Benign
The pathogenic variants form a diagonal band — you can’t separate them with a single straight line! This is an example of a nonlinear decision boundary.
This is why we need multiple neurons organized in layers.
A single neuron draws one line. What if we combine multiple neurons?
Layer 1: Three neurons, each drawing a different line
Neuron 1: "Is conservation > 0.7?"
Neuron 2: "Is structural impact > 0.5?"
Neuron 3: "Is conservation + structural impact > 1.0?"
Layer 2: One neuron that combines the answers
Neuron 4: "Based on answers from neurons 1-3, is this pathogenic?"
Three lines can carve out a triangular region in the input space. More neurons = more lines = more complex boundaries.
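Here is a hand-built sketch of that idea in Python. The weights and thresholds are set by hand purely for illustration (a trained network would learn them from data, and would use smooth activations rather than the hard step function used here):

```python
def step(z):
    """Hard yes/no threshold, used here only to mirror the questions above."""
    return 1.0 if z > 0 else 0.0

def hidden_layer(conservation, structural_impact):
    h1 = step(conservation - 0.7)                      # "Is conservation > 0.7?"
    h2 = step(structural_impact - 0.5)                 # "Is structural impact > 0.5?"
    h3 = step(conservation + structural_impact - 1.0)  # "Is their sum > 1.0?"
    return h1, h2, h3

def output_neuron(h1, h2, h3):
    # Fires only if all three hidden neurons answered "yes" (hand-set combining rule)
    return step(h1 + h2 + h3 - 2.5)

print(output_neuron(*hidden_layer(0.9, 0.8)))  # 1.0 -> inside the carved-out region ("pathogenic")
print(output_neuron(*hidden_layer(0.2, 0.3)))  # 0.0 -> outside the region ("benign")
```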
Input Layer          Hidden Layer         Output Layer
(features)        (learned patterns)      (prediction)
x₁ ——→  [h₁] ——→
     ╲  ╱       ╲
x₂ ——→  [h₂] ——→  [output] → "Pathogenic" or "Benign"
     ╱  ╲       ╱
x₃ ——→  [h₃] ——→
Input layer: Raw features (conservation, frequency, structural impact)
Hidden layer(s): Neurons that discover intermediate patterns. We don’t tell them what to look for — they figure it out during training.
Output layer: Final prediction.
Each connection has its own weight. In this small network there are 3 × 3 = 9 weights from inputs to hidden neurons plus 3 hidden biases, and 3 weights from hidden neurons to the output plus 1 output bias: 16 learnable parameters in total.
Modern networks like AlphaFold have millions of parameters. The principle is the same — just more neurons and more layers.
A network with many hidden layers is called a deep neural network:
Input → Hidden 1 → Hidden 2 → Hidden 3 → ... → Output
           ↑            ↑            ↑
        Simple     Intermediate   Complex
        patterns   combinations   abstractions
Genomics analogy: How a deep network reads DNA
Layer 1: Detects simple motifs
"There's a TATA at this position"
"There's a GC-rich region here"
Layer 2: Combines motifs into modules
"TATA box + Inr element = core promoter signature"
"CpG island + GC-box = housekeeping promoter pattern"
Layer 3: Recognizes regulatory logic
"Core promoter + enhancer motifs within 500bp = active promoter"
"Core promoter + repressor motifs = silenced promoter"
Layer 4: Predicts function
"This sequence is an active enhancer in neural tissue"
Each layer builds on the previous one, creating increasingly abstract representations. The first layer sees individual nucleotides; the last layer understands regulatory function.
This hierarchical feature learning is what makes deep learning so powerful for genomics.
Forward propagation is simply computing the output of the network given an input. Information flows forward from input to output, one layer at a time.
Example: Predicting splice site functionality
Input: A 20-nucleotide DNA sequence, one-hot encoded (each position is 4 numbers: [A, C, G, T])
Step 1: Input → Hidden Layer 1 (10 neurons)
For each hidden neuron j (j = 1 to 10):
z_j = w₁ⱼ×x₁ + w₂ⱼ×x₂ + ... + w₈₀ⱼ×x₈₀ + bⱼ
h_j = ReLU(z_j)
Result: 10 numbers representing "features" the network discovered
Step 2: Hidden Layer 1 → Hidden Layer 2 (5 neurons)
For each neuron k (k = 1 to 5):
z_k = w₁ₖ×h₁ + w₂ₖ×h₂ + ... + w₁₀ₖ×h₁₀ + bₖ
h_k = ReLU(z_k)
Result: 5 numbers representing higher-level patterns
Step 3: Hidden Layer 2 → Output (1 neuron)
z_out = w₁×h₁ + w₂×h₂ + ... + w₅×h₅ + b
output = sigmoid(z_out)
Result: 0.87 → "87% likely to be a functional splice site"
That’s it. Forward propagation is just repeated weighted sums followed by activation functions. Matrix multiplication and simple nonlinear functions — nothing more.
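The same three steps in NumPy, with randomly initialized (untrained) weights, just to show how little machinery forward propagation actually requires:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Randomly initialized (untrained) parameters for the 80 -> 10 -> 5 -> 1 network above
W1, b1 = rng.normal(0, 0.1, (10, 80)), np.zeros(10)
W2, b2 = rng.normal(0, 0.1, (5, 10)),  np.zeros(5)
W3, b3 = rng.normal(0, 0.1, (1, 5)),   np.zeros(1)

def forward(x):
    """Forward propagation: repeated weighted sums followed by activations."""
    h1 = relu(W1 @ x + b1)        # Step 1: input -> hidden layer 1 (10 features)
    h2 = relu(W2 @ h1 + b2)       # Step 2: hidden 1 -> hidden 2 (5 features)
    return sigmoid(W3 @ h2 + b3)  # Step 3: hidden 2 -> P(functional splice site)

x = rng.integers(0, 2, 80).astype(float)  # stand-in for a one-hot encoded 20-nt sequence
print(forward(x))  # a value near 0.5: the network is untrained, so its output is uninformative
```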
Neural networks need numerical inputs. How do we convert a DNA sequence?
Sequence:   A  T  G  C  A  ...
            ↓  ↓  ↓  ↓  ↓
       A:   1  0  0  0  1  ...
       C:   0  0  0  1  0  ...
       G:   0  0  1  0  0  ...
       T:   0  1  0  0  0  ...
Each nucleotide becomes a vector of length 4, with a 1 at the position corresponding to the base and 0s elsewhere. A 20-nucleotide sequence becomes 80 numbers — a format the network can process.
Why not just use A=1, C=2, G=3, T=4? Because that would imply G is “greater than” C or that T-G = A, which makes no biological sense. One-hot encoding treats each base as an independent category.
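A minimal one-hot encoder in Python (column order A, C, G, T, matching the diagram above):

```python
import numpy as np

def one_hot_encode(sequence, alphabet="ACGT"):
    """Convert a DNA string into a (length x 4) binary matrix."""
    index = {base: i for i, base in enumerate(alphabet)}
    encoding = np.zeros((len(sequence), len(alphabet)))
    for pos, base in enumerate(sequence.upper()):
        encoding[pos, index[base]] = 1.0
    return encoding

x = one_hot_encode("ATGCA")
print(x)
# [[1. 0. 0. 0.]   A
#  [0. 0. 0. 1.]   T
#  [0. 0. 1. 0.]   G
#  [0. 1. 0. 0.]   C
#  [1. 0. 0. 0.]]  A
print(x.flatten().shape)  # (20,) for this 5-nt example; a 20-nt sequence flattens to 80 numbers
```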
Remember from Chapter 2:
Minimizing loss = Maximizing likelihood
Loss = -log P(data | parameters)
Now let’s see this in practice with neural networks.
For predicting “pathogenic or benign”:
Loss = -[y × log(ŷ) + (1-y) × log(1-ŷ)]
Where:
y = true label (1 = pathogenic, 0 = benign)
ŷ = network's prediction (e.g., 0.87)
Let’s compute:
Example 1: Good prediction. The variant is pathogenic (y = 1) and the network outputs, say, ŷ = 0.9. Loss = -log(0.9) ≈ 0.11. Small.
Example 2: Bad prediction. The variant is pathogenic (y = 1) but the network hedges at ŷ = 0.5. Loss = -log(0.5) ≈ 0.69. This is the "coin flip" loss.
Example 3: Confident wrong prediction. The variant is pathogenic (y = 1) but the network says ŷ = 0.05, "almost certainly benign." Loss = -log(0.05) ≈ 3.0. Huge.
Key insight: The loss is small when the prediction matches the truth, and large when it doesn’t. Being confidently wrong is penalized most severely.
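You can check these numbers yourself with a few lines of Python (the predicted probabilities are the illustrative values used above):

```python
import math

def binary_cross_entropy(y_true, y_pred):
    """Loss for a single example: -[y log(y_hat) + (1 - y) log(1 - y_hat)]."""
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

print(binary_cross_entropy(1, 0.90))  # ~0.11  good prediction, small loss
print(binary_cross_entropy(1, 0.50))  # ~0.69  coin-flip prediction
print(binary_cross_entropy(1, 0.05))  # ~3.00  confidently wrong, large loss
```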
Total Loss = (1/N) × Σ Loss for each training example
"Average badness across all training examples"
Training goal: Find the weights that minimize this total loss.
We know that the loss measures how wrong the network is, and that changing the weights changes the loss. But how do we know which weights to change, and by how much?
This is where backpropagation comes in — arguably the most important algorithm in deep learning.
Imagine a factory assembly line that produces defective products:
Raw materials → Station A → Station B → Station C → Final product (defective!)
                    ↑           ↑           ↑
                "Was it     "Or was it  "Or was it
                my fault?"   my fault?"  my fault?"
When the final product is bad, you need to figure out which station is responsible. You trace backward from the defect: was Station C’s work flawed, or was the part it received from Station B already damaged? And was Station B at fault, or did the problem start back at Station A?
Backpropagation does exactly this — it traces backward from the loss, computing how much each weight contributed to the error.
Backpropagation relies on a calculus concept called the chain rule:
If y depends on u, and u depends on x, then:
dy/dx = dy/du × du/dx
"How y changes with x" = "How y changes with u" × "How u changes with x"
Applied to our network:
How Loss changes with w₁ (a weight in Layer 1):
dLoss/dw₁ = dLoss/dOutput × dOutput/dHidden × dHidden/dw₁
                  ↑                ↑                ↑
          "How much does    "How much does    "How much does
           output affect     hidden layer      w₁ affect
           the loss?"        affect output?"   hidden layer?"
Read it like a sentence: “How much does the loss change when we nudge w₁?” = “How much does the output affect the loss?” × “How much does the hidden layer affect the output?” × “How much does w₁ affect the hidden layer?” Each factor is easy to compute individually. Multiplied together, they tell us exactly how much w₁ is responsible for the error — and therefore how much to adjust it.
Repeat for many epochs:
For each batch of training examples:
1. FORWARD PASS
Input → Hidden → Output → Prediction
2. COMPUTE LOSS
Compare prediction to true label
Loss = how wrong we are
3. BACKWARD PASS (Backpropagation)
Compute gradient of loss with respect to every weight
"How much is each weight responsible for the error?"
4. UPDATE WEIGHTS
w_new = w_old - learning_rate × gradient
"Adjust each weight to reduce the error"
The learning rate controls how big each adjustment step is: too small and training crawls; too large and the updates overshoot, so the loss bounces around instead of decreasing.
Connecting to Chapter 2: This is the gradient descent we discussed — climbing the posterior landscape (or equivalently, descending the loss landscape). Backpropagation is how we compute the direction to move.
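Here is the entire loop for the simplest possible case: a single sigmoid neuron trained by gradient descent on made-up data. For deeper networks, step 3 is where backpropagation’s chain rule does the work, but the structure of the loop is identical:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 3 features per variant, label 1 = pathogenic, 0 = benign (made up for illustration)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(float)  # a hidden "rule" for the neuron to discover

w, b = np.zeros(3), 0.0
learning_rate = 0.1

for epoch in range(100):
    # 1. Forward pass
    y_hat = sigmoid(X @ w + b)
    # 2. Compute loss (binary cross-entropy averaged over the batch)
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    # 3. Backward pass: gradient of the loss with respect to each parameter
    error = y_hat - y                 # dLoss/dz for a sigmoid + cross-entropy output
    grad_w = X.T @ error / len(y)
    grad_b = error.mean()
    # 4. Update weights in the direction that reduces the loss
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(loss, 3))  # falls from ~0.69 (random guessing) as the neuron learns the rule
```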
Task: Train a network to predict whether coding variants are pathogenic.
Data: 10,000 labeled variants from ClinVar
Network architecture:
5 inputs → [16 ReLU neurons] → [8 ReLU neurons] → [1 sigmoid neuron]
                                                          ↓
                                                    P(pathogenic)
Total parameters: (5×16+16) + (16×8+8) + (8×1+1) = 96 + 136 + 9 = 241 parameters
Training process:
Epoch 1: Random weights → Loss = 0.693 (equivalent to random guessing)
Epoch 10: Learning patterns → Loss = 0.42
Network discovers: "high conservation = more likely pathogenic"
Epoch 50: Refining → Loss = 0.18
Network discovers: "rare + conserved + structural change = pathogenic"
Epoch 100: Converged → Loss = 0.12
Network has learned complex nonlinear interactions
"High conservation + rare + near active site" → strong pathogenic signal
"High conservation + common" → probably benign (population evidence)
Key observation: We never told the network these rules. It discovered them from data by adjusting 241 parameters to minimize prediction errors.
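For scale, here is roughly what this architecture looks like in a framework such as PyTorch (a sketch assuming PyTorch is installed; loading the ClinVar features and the training loop itself are omitted):

```python
import torch
import torch.nn as nn

# The 5 -> 16 -> 8 -> 1 architecture from the worked example
model = nn.Sequential(
    nn.Linear(5, 16), nn.ReLU(),
    nn.Linear(16, 8), nn.ReLU(),
    nn.Linear(8, 1),  nn.Sigmoid(),   # output = P(pathogenic)
)

# 241 trainable parameters, matching the count above
print(sum(p.numel() for p in model.parameters()))  # 241

loss_fn = nn.BCELoss()                              # binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```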
The problem: A network with enough parameters can memorize the training data perfectly — but fail on new data.
Genomics analogy: Imagine studying for an exam by memorizing every single practice problem with its exact answer. You score 100% on the practice set. But on the real exam, the questions are slightly different, and you fail because you never understood the underlying principles.
Training accuracy: 99.5% ← Memorized training data
Test accuracy: 62.3% ← Fails on new data
This network learned: "Variant rs12345 is pathogenic"
Instead of learning: "Rare, conserved variants in constrained genes tend to be pathogenic"
Solutions:
| Technique | What It Does | Biology Analogy |
|---|---|---|
| More training data | Harder to memorize when there’s more to learn | Studying from 100 textbooks vs. 1 |
| Dropout | Randomly disable neurons during training | Studying with a different study group each day |
| Regularization | Penalize large weights (Chapter 2’s “prefer simpler models”) | Occam’s razor |
| Early stopping | Stop training before overfitting occurs | Knowing when to stop studying and sleep |
| Validation set | Monitor performance on unseen data | Practice exams that aren’t in your study guide |
The problem: In deep networks with sigmoid/tanh activations, gradients become extremely small in early layers, so they learn very slowly or not at all.
Layer 5 gradient: 0.25
Layer 4 gradient: 0.25 × 0.25 = 0.0625
Layer 3 gradient: 0.0625 × 0.25 = 0.0156
Layer 2 gradient: 0.0156 × 0.25 = 0.0039
Layer 1 gradient: 0.0039 × 0.25 = 0.00098 ← Almost zero! Layer 1 barely learns.
Solutions: use ReLU-family activations in hidden layers (their gradient is 1 for positive inputs, so it doesn’t shrink layer after layer), combined with sensible weight initialization. This is a big part of why ReLU became the modern default.
Hyperparameters are choices YOU make before training (unlike weights, which the network learns).
| Hyperparameter | What to Try | Rule of Thumb |
|---|---|---|
| Learning rate | 0.001, 0.01, 0.1 | Start with 0.001 for Adam optimizer |
| Number of layers | 1-5 for most tasks | Start simple, add layers if underfitting |
| Neurons per layer | 16, 32, 64, 128 | Wider for more complex data |
| Batch size | 32, 64, 128 | 32 is a good default |
| Epochs | 10-1000 | Use early stopping |
The golden rule: Start simple. A small network that works is better than a large network that overfits.
This section contains matrix mathematics and is completely optional: you can skip it and still understand the concepts. It’s here for readers curious about the formal details.
For a network with one hidden layer:
Hidden layer:
z⁽¹⁾ = W⁽¹⁾x + b⁽¹⁾ (matrix multiplication + bias)
h = f(z⁽¹⁾) (activation function, element-wise)
Output layer:
z⁽²⁾ = W⁽²⁾h + b⁽²⁾
ŷ = σ(z⁽²⁾) (sigmoid for binary classification)
Where x is the input vector, W⁽¹⁾ and b⁽¹⁾ are the hidden layer’s weight matrix and bias vector, W⁽²⁾ and b⁽²⁾ are the output layer’s, f is the hidden activation (e.g., ReLU), and σ is the sigmoid.
Starting from the loss L = -[y log(ŷ) + (1-y) log(1-ŷ)]:
Step 1: Gradient at output
dL/dz⁽²⁾ = ŷ - y (remarkably simple!)
Step 2: Gradient for output weights
dL/dW⁽²⁾ = (ŷ - y) × hᵀ
dL/db⁽²⁾ = ŷ - y
Step 3: Gradient passed to hidden layer
dL/dh = W⁽²⁾ᵀ × (ŷ - y)
Step 4: Gradient through activation
dL/dz⁽¹⁾ = dL/dh ⊙ f'(z⁽¹⁾) (⊙ = element-wise multiplication)
For ReLU: f'(z) = 1 if z > 0, else 0
Step 5: Gradient for hidden weights
dL/dW⁽¹⁾ = dL/dz⁽¹⁾ × xᵀ
dL/db⁽¹⁾ = dL/dz⁽¹⁾
Step 6: Update every parameter by gradient descent
W⁽¹⁾ ← W⁽¹⁾ - α × dL/dW⁽¹⁾
b⁽¹⁾ ← b⁽¹⁾ - α × dL/db⁽¹⁾
W⁽²⁾ ← W⁽²⁾ - α × dL/dW⁽²⁾
b⁽²⁾ ← b⁽²⁾ - α × dL/db⁽²⁾
Where α = learning rate
The beauty of backpropagation: no matter how deep the network, the chain rule lets us compute gradients for every weight efficiently in a single backward pass.
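Putting the five steps together, here is a compact NumPy sketch of a single training step for a one-hidden-layer network. The layer sizes and input values are arbitrary, and a real implementation would average gradients over batches of examples rather than fitting one example repeatedly:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):       return np.maximum(0.0, z)
def relu_grad(z):  return (z > 0).astype(float)
def sigmoid(z):    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer (4 neurons), 3 inputs, 1 output; sizes chosen arbitrarily for the sketch
W1, b1 = rng.normal(0, 0.5, (4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(0, 0.5, (1, 4)), np.zeros((1, 1))
alpha = 0.1                                 # learning rate

def train_step(x, y):
    """One forward + backward pass for a single example, following the equations above."""
    global W1, b1, W2, b2
    # Forward pass
    z1 = W1 @ x + b1
    h  = relu(z1)
    z2 = W2 @ h + b2
    y_hat = sigmoid(z2)
    # Backward pass (chain rule, output to input)
    dz2 = y_hat - y                         # Step 1: dL/dz(2)
    dW2 = dz2 @ h.T                         # Step 2: dL/dW(2)
    db2 = dz2
    dh  = W2.T @ dz2                        # Step 3: gradient passed to hidden layer
    dz1 = dh * relu_grad(z1)                # Step 4: through the ReLU
    dW1 = dz1 @ x.T                         # Step 5: dL/dW(1)
    db1 = dz1
    # Step 6: gradient descent update
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2
    return (-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))).item()

x = np.array([[0.9], [0.01], [0.8]])        # made-up feature column vector
for step in range(50):
    loss = train_step(x, y=1.0)
print(round(loss, 4))                        # shrinks toward 0 as the network fits this one example
```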
Key Takeaways:
Artificial neurons mimic biological neurons — weighted inputs, summation, nonlinear activation, output. But this is a simplification, not a biological model.
Activation functions add nonlinearity — without them, any network collapses to a single linear function. ReLU is the modern default for hidden layers.
A single neuron draws a straight line — it can only separate data that’s linearly separable. This is why we need multiple neurons and layers.
Layers build hierarchical features — early layers detect simple patterns (DNA motifs), later layers combine them into complex concepts (regulatory logic).
Forward propagation is repeated weighted sums + activations. Information flows from input to output.
Loss functions measure prediction error — binary cross-entropy for classification, connecting to Chapter 2’s negative log-likelihood.
Backpropagation computes gradients using the chain rule — it assigns “blame” to each weight for the error, working backward from the output.
Training = forward pass + loss + backward pass + weight update, repeated thousands of times.
Overfitting is the main danger — combat it with more data, dropout, regularization, and early stopping.
Start simple — a small working network is better than a complex failing one.
| Term | Definition |
|---|---|
| Artificial neuron (perceptron) | Basic unit that computes a weighted sum of inputs, adds bias, applies activation function |
| Weight | Learned parameter controlling the strength of a connection between neurons |
| Bias | Learned parameter allowing the neuron’s activation threshold to shift |
| Activation function | Nonlinear function applied after summation (ReLU, sigmoid, tanh) |
| ReLU (Rectified Linear Unit) | max(0, z) — most popular modern activation |
| Hidden layer | Layer of neurons between input and output that learns intermediate features |
| Deep neural network | Network with multiple hidden layers |
| Forward propagation | Computing output from input through the network layers |
| One-hot encoding | Representing categorical data (A, C, G, T) as binary vectors |
| Loss function | Measures how wrong the network’s predictions are |
| Binary cross-entropy | Standard loss for binary classification problems |
| Backpropagation | Algorithm that computes gradients of the loss with respect to all weights using the chain rule |
| Gradient | Direction and magnitude of steepest increase in a function |
| Learning rate | Hyperparameter controlling the step size of weight updates |
| Epoch | One complete pass through the entire training dataset |
| Overfitting | Network memorizes training data instead of learning general patterns |
| Dropout | Regularization technique that randomly disables neurons during training |
| Vanishing gradient | Problem where gradients become too small for early layers to learn |
| Hyperparameter | Settings chosen before training (learning rate, architecture, etc.) |
Answer:
XOR is a pattern where the output is 1 only when exactly one of two inputs is 1:
| Input A | Input B | Output |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
A single neuron can only draw a straight line to separate classes. But XOR requires a curved or multi-part boundary — no single line can separate the 1s from the 0s.
Why this matters for genomics: Many biological relationships are XOR-like. Compensatory mutation pairs are one example: either variant alone disrupts a protein, but the two together restore its structure, so the damaging outcome occurs only when exactly one is present.
These nonlinear interactions require at least two layers (a hidden layer + output layer) to capture. This is one of the fundamental reasons we need deep networks for biological prediction tasks.
Answer:
This is overfitting to the training distribution.
During training, the network adjusted its 241+ parameters to minimize loss on European-ancestry variants. The patterns it learned are specific to that population: which variants are rare, which are common, and which combinations of features tend to co-occur in European-ancestry genomes.
The problem: African genomes have ~25% more genetic variation than European genomes. Many variants that are common in African populations are rare or absent in European databases. The network has never seen these patterns during training, so it misclassifies them — often flagging normal African variation as “pathogenic” because it looks rare from a European perspective.
This is not just overfitting — it’s a dangerous form of bias. Solutions include training on ancestry-diverse datasets, evaluating accuracy separately for each population, and stating clearly which populations a model has been validated on.
From a neural network perspective: The network’s learned weights encode European-specific patterns. The decision boundaries it drew during training don’t generalize to regions of feature space occupied by African-ancestry variants.
Answer:
Diagnosis: Severe overfitting. The network has memorized the training enhancers (low training loss) but cannot generalize to new enhancers (high validation loss).
The gap:
Training loss: 0.08 (excellent on seen data)
Validation loss: 0.95 (terrible on unseen data)
Gap: 0.87 (huge — clear overfitting)
What to try, in order:
More training data — If you have only 500 enhancer sequences, try to get 5,000. This is often the single most effective fix.
Regularization — Add L2 regularization (weight decay) to penalize large weights. Start with λ = 0.01.
Dropout — Add dropout (p = 0.3-0.5) after hidden layers. This forces the network to learn redundant representations.
Simpler architecture — Reduce the number of layers and neurons. If you have a 5-layer network, try 2 layers. The network might be too complex for the amount of data you have.
Early stopping — Monitor validation loss during training and stop when it starts increasing (even if training loss is still decreasing).
Data augmentation — For DNA sequences, you could augment by reverse-complementing sequences (both strands should have similar regulatory activity); a minimal helper is sketched below.
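For the data augmentation point, a small reverse-complement helper (train on both the original sequence and its reverse complement):

```python
def reverse_complement(sequence):
    """Return the reverse complement of a DNA string, for sequence-model augmentation."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(sequence.upper()))

print(reverse_complement("ATGCA"))  # TGCAT
```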
The underlying principle from Chapter 2: With limited data, the posterior distribution is wide (high uncertainty). A complex model with many parameters can find a sharp peak in training data that doesn’t generalize. Regularization acts like an informative prior, keeping the model simpler and more generalizable.