The Heart of Neural Network Learning
Dr. Dhaval Patel • 2025
Imagine you're trying to teach a computer to recognize handwritten digits. You show it thousands of examples, but how does it actually learn from mistakes?
Learning Through Error Correction
Let's start with a concrete example. Our network is trying to classify a handwritten digit.
Target: the network should recognize it as the digit "2", but its output is wrong.
The question is: how do we fix this?
When a neuron gives the wrong answer, there are exactly three things we can adjust:
1. The bias: like adjusting a neuron's "threshold", making it easier or harder to activate. If we want neuron "2" to be more active, increase its bias!
2. The weights: like adjusting connection strength; strengthen connections from bright pixels that look like a "2", weaken connections from pixels that don't.
3. The previous layer's activations: like asking for better input. We can't directly control these, but we can identify what we wish the previous layer would do.
The magic happens here! We propagate these "wishes" backwards through every layer, creating a chain of adjustments.
Here's a key insight: adjust weights in proportion to the input activations.
Why? Because you get more "bang for your buck" by adjusting connections to bright neurons. A small change to a weight connected to a bright neuron has a big impact!
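A toy numeric sketch of that "bang for your buck" effect (all values here are made up): the same error signal moves each weight in proportion to the activation flowing through it.

```python
# Toy values: the same error signal adjusts each weight in proportion
# to the activation feeding through it.
error_signal = 0.5            # hypothetical dC/dz for the "2" neuron
activations = [0.9, 0.1]      # a bright pixel and a dim pixel
grads = [a * error_signal for a in activations]
print(grads)  # the bright pixel's weight gradient is 9x larger
```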
This connects to a famous principle in neuroscience, Hebbian learning: "neurons that fire together, wire together."
In our case: pixels that are "on" when we're looking at a "2" should have stronger connections to the "2" neuron.
Now comes the truly clever part. We've figured out how to improve the output layer, but what about all the hidden layers?
This creates a beautiful chain reaction of adjustments, flowing backwards through the entire network!
Behind all this intuition lies elegant mathematics. Let's connect our intuitive understanding to the formal calculus that makes it all work.
Each entry in the gradient vector ∇C tells us: "How much does the cost change if I nudge this particular weight or bias?"
Let's see how we calculate ∂C₀/∂w⁽ᴸ⁾ step by step. This seems impossible at first, but the chain rule makes it manageable!
w⁽ᴸ⁾ → z⁽ᴸ⁾ → a⁽ᴸ⁾ → C₀
A tiny change to the weight causes a change to z, which changes a, which changes the cost.
From z⁽ᴸ⁾ = w⁽ᴸ⁾a⁽ᴸ⁻¹⁾ + b⁽ᴸ⁾:
"How does the weighted sum change when we change the weight?"
∂z⁽ᴸ⁾/∂w⁽ᴸ⁾ = a⁽ᴸ⁻¹⁾
This is why bright neurons have more influence!
From a⁽ᴸ⁾ = σ(z⁽ᴸ⁾):
"How does the activation change when we change the weighted sum?"
∂a⁽ᴸ⁾/∂z⁽ᴸ⁾ = σ′(z⁽ᴸ⁾)
This is the slope of the activation function!
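As a quick sanity check (z = 0.3 is an arbitrary test point), the slope of the logistic sigmoid can be verified numerically against the identity σ′(z) = σ(z)(1 − σ(z)):

```python
import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

z = 0.3                                   # an arbitrary test point
analytic = sigmoid(z) * (1 - sigmoid(z))  # sigma'(z) = sigma(z)(1 - sigma(z))
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6  # central difference
print(analytic, numeric)                  # the two slopes agree
```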
From C₀ = (a⁽ᴸ⁾ - y)²:
"How does the cost change when we change the final activation?"
∂C₀/∂a⁽ᴸ⁾ = 2(a⁽ᴸ⁾ - y)
Bigger errors create stronger learning signals!
The Final Formula:
∂C₀/∂w⁽ᴸ⁾ = a⁽ᴸ⁻¹⁾ · σ′(z⁽ᴸ⁾) · 2(a⁽ᴸ⁾ - y)
Three simple pieces that capture the essence of learning!
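Here is a minimal sketch, with made-up values for a single last-layer weight, that computes the three pieces and checks their product against a finite-difference gradient:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up values for a single weight on the last layer:
a_prev, b, y = 0.8, 0.1, 1.0    # previous activation, bias, target
w = 0.5

z = w * a_prev + b              # z = w * a_prev + b
a = sigmoid(z)                  # a = sigma(z)

dz_dw = a_prev                  # piece 1: dz/dw
da_dz = a * (1 - a)             # piece 2: sigma'(z) for the logistic sigmoid
dC_da = 2 * (a - y)             # piece 3: dC0/da from C0 = (a - y)^2
grad = dz_dw * da_dz * dC_da    # the final formula

# Check against a finite-difference estimate of dC0/dw:
eps = 1e-6
C = lambda w_: (sigmoid(w_ * a_prev + b) - y) ** 2
numeric = (C(w + eps) - C(w - eps)) / (2 * eps)
print(grad, numeric)            # the two values agree
```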
In theory, we should look at ALL training examples before making any weight adjustments.
In practice that's far too slow, so we use random mini-batches of ~100 examples and take many quick, slightly noisy steps instead.
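A toy sketch of why that works (the per-example "gradients" here are just random numbers standing in for each ∇Cₖ): a 100-example sample tracks the full average closely at a fraction of the cost.

```python
import random

random.seed(0)
# Stand-in per-example gradients (random numbers playing the role of each grad C_k).
per_example_grads = [random.gauss(0.3, 1.0) for _ in range(50_000)]
full_gradient = sum(per_example_grads) / len(per_example_grads)

batch = random.sample(per_example_grads, 100)   # a random mini-batch
mini_batch_estimate = sum(batch) / len(batch)
print(full_gradient, mini_batch_estimate)       # close, at ~1/500 the cost
```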
How It All Comes Together
Feed an example through: a⁽ᴸ⁾ = σ(w⁽ᴸ⁾a⁽ᴸ⁻¹⁾ + b⁽ᴸ⁾)
Calculate what the network predicts and compare to truth: C₀ = (a⁽ᴸ⁾ - y)²
∂C₀/∂w⁽ᴸ⁾ = (∂z⁽ᴸ⁾/∂w⁽ᴸ⁾) × (∂a⁽ᴸ⁾/∂z⁽ᴸ⁾) × (∂C₀/∂a⁽ᴸ⁾)
Break the "impossible" derivative into three manageable pieces.
∂C₀/∂a_k⁽ᴸ⁻¹⁾ = Σⱼ (∂z_j⁽ᴸ⁾/∂a_k⁽ᴸ⁻¹⁾) × (∂a_j⁽ᴸ⁾/∂z_j⁽ᴸ⁾) × (∂C₀/∂a_j⁽ᴸ⁾)
Work backwards layer by layer, summing over all paths of influence.
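A sketch of that sum for a single hidden activation, assuming a sigmoid output layer and toy weights (other inputs to each zⱼ are omitted for simplicity):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

a_k  = 0.6                        # the hidden activation we're asking about
w_jk = [0.4, -0.7, 0.2]           # weights from a_k to each output neuron j
z_j  = [w * a_k for w in w_jk]    # weighted sums (other inputs omitted)
a_j  = [sigmoid(z) for z in z_j]  # output activations
y_j  = [1.0, 0.0, 0.0]            # one-hot targets

# Sum over all paths of influence: dz_j/da_k * da_j/dz_j * dC0/da_j
dC_da_k = sum(
    w * (a * (1 - a)) * 2 * (a - y)
    for w, a, y in zip(w_jk, a_j, y_j)
)
print(dC_da_k)  # the "wish": how the cost changes if a_k nudges up
```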
∇C ≈ (1/m) Σₖ∈batch ∇Cₖ
Use random samples to approximate the full gradient efficiently.
w ← w - α∇C
Step downhill in parameter space, repeat until the network learns!
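The whole loop above can be sketched end to end for a single sigmoid neuron; the data and hyperparameters here are made up purely for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy task: one sigmoid neuron learns to map input 0.9 -> 1.0 and 0.1 -> 0.0.
random.seed(1)
data = [(0.9, 1.0), (0.1, 0.0)] * 50          # (input, target) pairs
w, b, alpha = random.random(), 0.0, 1.0       # initial weight, bias, step size

for epoch in range(200):
    random.shuffle(data)
    for start in range(0, len(data), 10):     # mini-batches of 10
        batch = data[start:start + 10]
        gw = gb = 0.0
        for x, y in batch:
            z = w * x + b                     # forward pass
            a = sigmoid(z)                    # prediction; cost is (a - y)^2
            dC_dz = (a * (1 - a)) * 2 * (a - y)   # chain rule: da/dz * dC/da
            gw += dC_dz * x                   # dC/dw (since dz/dw = x)
            gb += dC_dz                       # dC/db (since dz/db = 1)
        w -= alpha * gw / len(batch)          # average the batch, step downhill
        b -= alpha * gb / len(batch)

print(sigmoid(w * 0.9 + b), sigmoid(w * 0.1 + b))  # predictions move toward 1 and 0
```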
We've journeyed from intuition to rigorous mathematics. Here's what makes backpropagation so remarkable:
The next time you see ∂C/∂w, don't just see symbols - see the mathematical poetry of learning itself!