Backpropagation

The Heart of Neural Network Learning

Dr. Dhaval Patel • 2025

🧠 What Makes Neural Networks Learn?

Imagine you're trying to teach a computer to recognize handwritten digits. You show it thousands of examples, but how does it actually learn from mistakes?

[Network diagram: sample handwritten digits feed an input layer of 784 pixels, through two hidden layers of 16 neurons each, to an output layer of 10 digits]
  • Backpropagation is the algorithm that makes learning possible
  • It works backwards through the network, adjusting connections
  • Like a coach analyzing what went wrong after each play
  • The math is complex, but the intuition is beautifully simple

🎯 The Big Picture

Learning Through Error Correction

📊 The Learning Challenge

Let's start with a concrete example. Our network is trying to classify this handwritten digit:

[Image: a handwritten digit "2"]

Target: The network should recognize this as the digit "2"

But our untrained network outputs garbage! Instead of confidently saying "2", it might output [0.1, 0.2, 0.05, 0.3, 0.1, 0.1, 0.05, 0.05, 0.03, 0.02]: essentially random guesses, assigning only 0.05 to the correct digit "2".

The question is: How do we fix this?

  • We can't directly change the output neurons
  • We can only adjust weights and biases
  • But we need to figure out which weights to change and by how much
  • This is where backpropagation comes to the rescue!

💡 The Core Intuition: Three Ways to Improve

When a neuron gives the wrong answer, there are exactly three things we can adjust:

1. Adjust the Bias

Like adjusting a neuron's "threshold" - make it easier or harder to activate. If we want neuron "2" to be more active, increase its bias!

2. Adjust the Weights

Like adjusting connection strength - strengthen connections from bright pixels that look like a "2", weaken connections from pixels that don't.

3. Change Previous Layer Activations

Like asking for better input - we can't directly control this, but we can identify what we wish the previous layer would do.

4. Repeat Backwards

The magic happens here! We propagate these "wishes" backwards through every layer, creating a chain of adjustments.

🧠 Test Your Understanding

Question: If we want the "2" neuron to activate more strongly, and we see that it's connected to a bright pixel that's clearly part of the digit "2", what should we do to the weight of that connection?
A) Decrease the weight (make it more negative)
B) Increase the weight (make it more positive)
C) Set the weight to zero
D) The weight doesn't matter
Answer: B) Increase the weight (make it more positive)

Think of it this way: Weight × Activation = Contribution

If the pixel is bright (high activation) and it's part of a "2", then increasing the weight makes this pixel contribute more strongly to the "2" neuron. This is exactly what we want!

Strong "2" pixel (0.9) × Bigger weight (0.8) = Stronger vote for "2" (0.72)
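In code, the same weight-times-activation arithmetic looks like this (all numbers are illustrative, not from a trained network):

```python
import numpy as np

# Three input pixels feeding the "2" neuron: bright, dim, off.
activations = np.array([0.9, 0.1, 0.0])
weights     = np.array([0.8, 0.3, 0.5])   # connection strengths (made up)
bias = -0.2

# Each input's "vote" is weight * activation; the weighted sum adds the bias.
contributions = weights * activations
z = contributions.sum() + bias
print(contributions)   # [0.72 0.03 0.  ]
print(round(z, 2))     # 0.55
```

The bright pixel dominates the weighted sum, which is exactly why strengthening its weight is the most effective way to boost the "2" neuron.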

⚖️ Weight Adjustments: The Details

🎯 Proportional to Activations

Here's a key insight: adjust weights in proportion to the input activations.

Big activation = Big weight change
Small activation = Small weight change

Why? Because you get more "bang for your buck" by adjusting connections to bright neurons. A small change to a weight connected to a bright neuron has a big impact!

  • Bright input neuron (0.8) → Large weight adjustment
  • Dim input neuron (0.1) → Small weight adjustment
  • Off input neuron (0.0) → No weight adjustment
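As a rough sketch of this proportionality (the learning rate and error signal below are illustrative values of our own, not from the slides):

```python
import numpy as np

eta = 0.1                      # learning rate (assumed)
error_signal = 1.0             # "make the '2' neuron more active"
activations = np.array([0.8, 0.1, 0.0])  # bright, dim, off inputs

# The weight update scales with the input activation:
delta_w = eta * error_signal * activations
print(delta_w)  # [0.08 0.01 0.  ]  -- the off neuron's weight gets no update
```

Big activation, big update; zero activation, no update at all.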

🧠 Hebbian Learning Connection

This connects to a famous principle in neuroscience:

"Neurons that fire together, wire together"

In our case: pixels that are "on" when we're looking at a "2" should have stronger connections to the "2" neuron.

Active Input + Desired Output → Strengthen Connection

🔄 Propagating Backwards: The Chain Reaction

Now comes the truly clever part. We've figured out how to improve the output layer, but what about all the hidden layers?

[Diagram: error signals flow backwards (← ← ←) from the output layer, where we know what we want, through the hidden layers, where we must work out what each neuron should do]
The Big Idea: If we know what we want from the output layer, we can work backwards to figure out what we want from each previous layer!
  • Layer 3 (Output): "I want neuron 2 to be more active and the others less active"
  • Layer 2 (Hidden): "To help layer 3, I should adjust based on my connections to it"
  • Layer 1 (Hidden): "To help layer 2, I should adjust based on my connections to it"
  • Weights: "Each weight adjusts based on the neurons it connects"

This creates a beautiful chain reaction of adjustments, flowing backwards through the entire network!

🔢 The Mathematics: Understanding the Gradient

Behind all this intuition lies elegant mathematics. Let's connect our intuitive understanding to the formal calculus that makes it all work.

🏔️ The gradient vector points uphill in cost space - we walk downhill to minimize error
The Gradient Vector:
∇C = [∂C/∂w⁽¹⁾, ∂C/∂b⁽¹⁾, ..., ∂C/∂w⁽ᴸ⁾, ∂C/∂b⁽ᴸ⁾]ᵀ

Each entry in this vector tells us: "How much does the cost change if I nudge this particular weight or bias?"

The Chain Rule Connection: To find ∂C/∂w⁽ᴸ⁾, we trace the path:
w⁽ᴸ⁾ → z⁽ᴸ⁾ → a⁽ᴸ⁾ → C₀
  • ∂C/∂w⁽ᴸ⁾: "How does cost change with the last layer's weight?"
  • Chain Rule: Break it into smaller, manageable pieces
  • Each piece: A simple derivative we can actually calculate
  • Multiply together: Get the total effect on cost
Beautiful Insight: The "impossible" derivative ∂C/∂w⁽ᴸ⁾ becomes three simple derivatives multiplied together!

🔗 The Chain Rule: Breaking Down the "Impossible"

Let's see how we calculate ∂C₀/∂w⁽ᴸ⁾ step by step. This seems impossible at first, but the chain rule makes it manageable!

Step 1: The Chain of Dependencies

w⁽ᴸ⁾ → z⁽ᴸ⁾ → a⁽ᴸ⁾ → C₀
A tiny change to the weight causes a change to z, which changes a, which changes the cost.

∂C₀/∂w⁽ᴸ⁾ = (∂z⁽ᴸ⁾/∂w⁽ᴸ⁾) × (∂a⁽ᴸ⁾/∂z⁽ᴸ⁾) × (∂C₀/∂a⁽ᴸ⁾)
Step 2: First Link: ∂z⁽ᴸ⁾/∂w⁽ᴸ⁾

From z⁽ᴸ⁾ = w⁽ᴸ⁾a⁽ᴸ⁻¹⁾ + b⁽ᴸ⁾
"How does the weighted sum change when we change the weight?"

∂z⁽ᴸ⁾/∂w⁽ᴸ⁾ = a⁽ᴸ⁻¹⁾

This is why bright neurons have more influence!

Step 3: Second Link: ∂a⁽ᴸ⁾/∂z⁽ᴸ⁾

From a⁽ᴸ⁾ = σ(z⁽ᴸ⁾)
"How does the activation change when we change the weighted sum?"

∂a⁽ᴸ⁾/∂z⁽ᴸ⁾ = σ'(z⁽ᴸ⁾)

This is the slope of the activation function!

Step 4: Third Link: ∂C₀/∂a⁽ᴸ⁾

From C₀ = (a⁽ᴸ⁾ - y)²
"How does the cost change when we change the final activation?"

∂C₀/∂a⁽ᴸ⁾ = 2(a⁽ᴸ⁾ - y)

Bigger errors create stronger learning signals!

🎯 Putting It All Together

The Final Formula:

∂C₀/∂w⁽ᴸ⁾ = a⁽ᴸ⁻¹⁾ × σ'(z⁽ᴸ⁾) × 2(a⁽ᴸ⁾ - y)

Three simple pieces that capture the essence of learning!
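We can sanity-check the three-piece formula numerically. The sketch below uses made-up values for a single neuron (a⁽ᴸ⁻¹⁾ = 0.6, w = 0.5, b = 0.1, y = 1) and compares the chain-rule gradient to a finite-difference estimate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

a_prev, b, y = 0.6, 0.1, 1.0   # previous activation, bias, target (assumed)

def cost(w):
    z = w * a_prev + b
    a = sigmoid(z)
    return (a - y) ** 2

w = 0.5
z = w * a_prev + b
a = sigmoid(z)
sigma_prime = a * (1 - a)      # derivative of the sigmoid at z

# The three-piece chain-rule formula from the slide:
grad_formula = a_prev * sigma_prime * 2 * (a - y)

# Finite-difference check: (C(w+h) - C(w-h)) / 2h should agree closely.
h = 1e-6
grad_numeric = (cost(w + h) - cost(w - h)) / (2 * h)
print(abs(grad_formula - grad_numeric) < 1e-8)  # True
```

The two estimates agree to many decimal places, confirming that the "impossible" derivative really is just three simple pieces multiplied together.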

🧮 Mathematical Intuition Check

Looking at our formula: ∂C₀/∂w⁽ᴸ⁾ = a⁽ᴸ⁻¹⁾ × σ'(z⁽ᴸ⁾) × 2(a⁽ᴸ⁾ - y)

If the previous neuron has activation a⁽ᴸ⁻¹⁾ = 0 (completely off), what happens to the gradient for this weight?
A) The gradient becomes very large
B) The gradient becomes zero
C) The gradient becomes negative
D) The gradient is unaffected
Answer: B) The gradient becomes zero

This is a profound insight! When a⁽ᴸ⁻¹⁾ = 0, the entire gradient becomes zero because:

∂C₀/∂w⁽ᴸ⁾ = 0 × σ'(z⁽ᴸ⁾) × 2(a⁽ᴸ⁾ - y) = 0

What this means: If a neuron is completely inactive (a⁽ᴸ⁻¹⁾ = 0), then the weights carrying its signal forward (the weights that multiply a⁽ᴸ⁻¹⁾) won't learn at all! This is one reason the choice of activation function matters and why "dead neurons" are a problem in deep learning.

This mathematical property directly reflects our intuition: "You can't strengthen a connection if there's no signal flowing through it!"
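The same zero-gradient property can be read directly off the formula; a tiny check with illustrative numbers of our own:

```python
# dC/dw = a_prev * sigma'(z) * 2 * (a - y); all inputs here are made up.
def weight_gradient(a_prev, sigma_prime, a_out, y):
    return a_prev * sigma_prime * 2 * (a_out - y)

# Large error and a steep slope, but a dead (zero) input kills the gradient:
print(weight_gradient(0.0, 0.25, 0.9, 0.0))             # 0.0
# The same error with an active input yields a usable gradient:
print(round(weight_gradient(0.8, 0.25, 0.9, 0.0), 2))   # 0.36
```

No matter how wrong the output is, a weight fed by a silent neuron receives no learning signal.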

🎲 Making It Practical: Mini-Batches

🐌 The Slow Way: Full Batch

In theory, we should look at ALL training examples before making any weight adjustments.

Problem: This is incredibly slow!
50,000 examples × complex calculations = hours per step
🐢 Very slow, but perfectly accurate steps

🏃‍♂️ The Fast Way: Mini-Batches

Instead, we use random mini-batches of ~100 examples and take many quick steps.

Solution: Much faster!
100 examples × many quick steps = rapid learning
🏃‍♂️ Like a drunk person stumbling quickly downhill
(but it works!)
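A minimal sketch of mini-batch gradient descent, on a toy one-parameter least-squares problem of our own (not the slides' digit network):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=500)
y = 3.0 * X + rng.normal(scale=0.1, size=500)  # true slope is 3.0

w, eta, batch_size = 0.0, 0.1, 100
for step in range(200):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # random mini-batch
    xb, yb = X[idx], y[idx]
    grad = np.mean(2 * (w * xb - yb) * xb)  # batch-averaged gradient of (wx - y)^2
    w -= eta * grad                         # quick, slightly noisy downhill step
print(round(w, 2))  # close to the true slope 3.0
```

Each step uses only 100 of the 500 examples, so the gradient is noisy, yet the "drunk stumble" still converges to roughly the right answer far faster than waiting for full-batch steps.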

🧠 Master Challenge: Putting It All Together

Final Challenge: You have a network with layers of sizes [784, 16, 16, 10]. During backpropagation, you notice that ∂C/∂w₁₅,₇⁽²⁾ (the weight from neuron 7 in layer 1 to neuron 15 in layer 2) has a very large magnitude. What does this tell you about the learning process?
A) This weight is broken and should be reset to zero
B) This weight has a strong influence on the cost and will receive large updates
C) The network is overfitting to the training data
D) The learning rate is too high
Answer: B) This weight has a strong influence on the cost and will receive large updates

You've just witnessed the beauty of backpropagation in action! A large gradient magnitude means:

Mathematical Interpretation:
|∂C/∂w₁₅,₇⁽²⁾| is large → small changes to this weight cause big changes to the cost
Learning Interpretation:
This connection is crucial for the network's current performance. The weight update will be:
w₁₅,₇⁽²⁾ ← w₁₅,₇⁽²⁾ - α × (large gradient)
What this means practically:
• This connection will change significantly in the next update
• The network has identified this as an important feature pathway
• The learning is focused where it matters most

Beautiful insight: The mathematics automatically prioritizes the most impactful connections!

🎯 The Complete Picture

How It All Comes Together

🎯 The Complete Picture: Intuition Meets Mathematics

Step 1: Forward Pass

Feed an example through: a⁽ᴸ⁾ = σ(w⁽ᴸ⁾a⁽ᴸ⁻¹⁾ + b⁽ᴸ⁾)
Calculate what the network predicts and compare to truth: C₀ = (a⁽ᴸ⁾ - y)²

Step 2: Backward Pass: Chain Rule Magic

∂C₀/∂w⁽ᴸ⁾ = (∂z⁽ᴸ⁾/∂w⁽ᴸ⁾) × (∂a⁽ᴸ⁾/∂z⁽ᴸ⁾) × (∂C₀/∂a⁽ᴸ⁾)
Break the "impossible" derivative into three manageable pieces.

Step 3: Propagate Through All Layers

∂C₀/∂a_k⁽ᴸ⁻¹⁾ = Σⱼ (∂z_j⁽ᴸ⁾/∂a_k⁽ᴸ⁻¹⁾) × (∂a_j⁽ᴸ⁾/∂z_j⁽ᴸ⁾) × (∂C₀/∂a_j⁽ᴸ⁾)
Work backwards layer by layer, summing over all paths of influence.

Step 4: Average Over Mini-Batches

∇C ≈ (1/m) Σₖ∈batch ∇Cₖ
Use random samples to approximate the full gradient efficiently.

Step 5: Update and Repeat

w ← w - α∇C
Step downhill in parameter space, repeat until the network learns!

The Mathematical Miracle: Simple calculus rules (chain rule, partial derivatives) combined with matrix operations create a learning algorithm that can recognize faces, translate languages, and play games at superhuman levels!
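The five steps above can be sketched end to end for a tiny network. This is a made-up [2, 3, 1] sigmoid net fitting a single example (variable names and sizes are our own, not the slides' [784, 16, 16, 10] network), trained with plain single-example gradient descent:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(3, 2)), np.zeros((3, 1))  # layer 1: 2 -> 3
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))  # layer 2: 3 -> 1

x = np.array([[0.5], [0.2]])   # one training input (made up)
y = np.array([[1.0]])          # its target

alpha = 0.5
for step in range(1000):
    # Step 1. Forward pass: a = sigma(W a_prev + b), cost C0 = (a - y)^2
    z1 = W1 @ x + b1;  a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

    # Steps 2-3. Backward pass: chain rule, layer by layer
    delta2 = 2 * (a2 - y) * a2 * (1 - a2)      # dC0/dz2
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # dC0/dz1, summed over paths

    # Step 5. Update: step downhill (one example, so no batch averaging here)
    W2 -= alpha * delta2 @ a1.T; b2 -= alpha * delta2
    W1 -= alpha * delta1 @ x.T;  b1 -= alpha * delta1

print(a2.item())  # approaches the target 1.0
```

Note that `delta2 @ a1.T` is exactly the slide's ∂C₀/∂w = a⁽ᴸ⁻¹⁾ × σ'(z⁽ᴸ⁾) × 2(a⁽ᴸ⁾ − y), written as a matrix product, and `W2.T @ delta2` is the sum over paths from Step 3.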

✨ The Beauty of Backpropagation

We've journeyed from intuition to rigorous mathematics. Here's what makes backpropagation so remarkable:

  • 🧠 Biological Inspiration: "Neurons that fire together, wire together" becomes ∂z/∂w = a
  • 📐 Mathematical Elegance: The chain rule decomposes complexity into simplicity
  • ⚡ Computational Efficiency: One backward pass computes ALL gradients simultaneously
  • 🎯 Error-Driven Learning: Bigger mistakes automatically get more attention
  • 🌊 Information Flow: Knowledge propagates backwards through the entire network
From Simple to Extraordinary:
∂/∂x, matrix multiplication, and σ(x) → Image recognition, language translation, game mastery!
The Philosophical Wonder: We took the fuzzy concept of "learning from mistakes" and turned it into precise mathematical operations. Each partial derivative represents a tiny adjustment that collectively creates intelligence.
🧠 Remember: Every time a neural network correctly identifies a cat, translates a sentence, or recommends a movie, it's because millions of these tiny mathematical adjustments, orchestrated by backpropagation, taught it to recognize patterns in data.

The next time you see ∂C/∂w, don't just see symbols - see the mathematical poetry of learning itself!

🎓 You now understand both the intuition AND the mathematics of how neural networks learn!