Welcome to the fascinating world of Gradient Descent!
The algorithm that teaches machines to recognize patterns
We want our neural network to look at handwritten digits and correctly identify what number they represent.
Each 28×28 pixel image becomes 784 input values (one for each pixel's brightness).
These flow through layers of neurons, each with weights and biases, to produce 10 outputs (one for each digit 0-9).
Our network has 13,002 parameters (weights and biases) that determine its behavior. Initially, these are just random numbers!
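Where does 13,002 come from? A minimal sketch, assuming the classic two-hidden-layer layout (784 → 16 → 16 → 10), which is one architecture that yields exactly that count:

```python
import numpy as np

# Assumed architecture: 784 inputs -> 16 -> 16 -> 10 outputs.
# (The layer sizes are an assumption; they happen to give 13,002 parameters.)
layer_sizes = [784, 16, 16, 10]

rng = np.random.default_rng(0)  # parameters start as random numbers!
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in layer_sizes[1:]]

n_params = sum(W.size for W in weights) + sum(b.size for b in biases)
print(n_params)  # 13002 = 12,960 weights + 42 biases
```

Counting it out: 784×16 + 16×16 + 16×10 = 12,960 weights, plus 16 + 16 + 10 = 42 biases.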
With random weights, our network produces "utter trash" 🗑️
To improve, we first need to measure how badly our network is performing. This is where the cost function comes in!
When we show our network a "3", we want the output neuron for digit 3 (index 3) to light up (value = 1) and all the others to stay dim (value = 0).
The cost measures how far we are from this ideal.
A low cost means the network is doing well. A high cost means it's making lots of mistakes.
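One common way to turn "how far from ideal" into a single number is the sum of squared differences between the network's 10 outputs and the one-hot target. A sketch (the exact loss is a choice; this is the simplest one):

```python
import numpy as np

def cost(output, target):
    # Sum of squared differences between the 10 outputs and the ideal target.
    return np.sum((output - target) ** 2)

# Ideal target for a "3": the neuron at index 3 is 1, all others are 0.
target = np.zeros(10)
target[3] = 1.0

confident = np.array([0, 0, 0.1, 0.9, 0, 0, 0.05, 0, 0, 0])  # mostly right
confused  = np.full(10, 0.5)                                  # "utter trash"

print(cost(confident, target))  # small: the network is doing well
print(cost(confused, target))   # large: lots of mistakes
```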
Exercise: given a few candidate output vectors for an image of a "4", could you rank them from highest cost to lowest?
Now we have a clear goal: Find the values of our 13,002 parameters that minimize the cost function.
Imagine we only have ONE parameter to optimize. Our cost function would look like a simple curve, and we want to find its minimum.
For very simple functions, we could solve this mathematically (find where slope = 0).
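With one parameter we can see both approaches side by side. For a toy cost like C(w) = (w − 2)² + 1, the slope is 2(w − 2), which is zero exactly at w = 2; gradient descent reaches the same point by repeatedly stepping against the slope:

```python
def cost(w):
    return (w - 2.0) ** 2 + 1.0   # toy one-parameter cost, minimum at w = 2

def slope(w):
    return 2.0 * (w - 2.0)        # derivative dC/dw; zero exactly at the minimum

w = -5.0                          # start at a random point on the curve
learning_rate = 0.1
for _ in range(100):
    w -= learning_rate * slope(w) # step downhill, against the slope
print(round(w, 4))                # converges to 2.0
```

The appeal of the iterative version is that it never needs to *solve* anything, only to *evaluate the slope*, which is why it scales to functions far too messy for algebra.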
Now imagine our cost function has two inputs instead of one. We can visualize this as a landscape with hills and valleys.
In multiple dimensions, "slope" becomes a vector called the gradient.
The gradient points in the direction of steepest ascent, so the negative gradient points downhill!
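The two-parameter version of the same loop, on a made-up bowl-shaped landscape (the cost function here is invented for illustration):

```python
import numpy as np

def gradient(p):
    # Gradient of the toy cost C(x, y) = (x - 1)^2 + 2(y + 0.5)^2,
    # a bowl whose bottom sits at (1, -0.5).
    x, y = p
    return np.array([2.0 * (x - 1.0), 4.0 * (y + 0.5)])

p = np.array([4.0, 3.0])          # start somewhere up on the hillside
for _ in range(200):
    p -= 0.1 * gradient(p)        # move in the NEGATIVE gradient direction
print(p)                          # rolls down to roughly [1.0, -0.5]
```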
The same principle works in our actual problem with 13,002 dimensions, even though we can't visualize it!
Think of all weights and biases as one giant vector with 13,002 components.
The gradient is also a vector with 13,002 components, telling us how to nudge each parameter.
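So the entire learning rule is one repeated vector update: θ ← θ − η∇C(θ). A sketch, using a tiny stand-in cost so it runs (in the real network, the gradient would come from backpropagation):

```python
import numpy as np

def gradient_descent(theta, grad_cost, learning_rate=0.01, steps=1000):
    # theta: all 13,002 weights and biases flattened into one giant vector.
    # grad_cost: returns the 13,002-component gradient at theta.
    for _ in range(steps):
        theta = theta - learning_rate * grad_cost(theta)  # nudge every parameter at once
    return theta

# Stand-in cost C(theta) = |theta|^2, whose gradient is 2 * theta.
theta0 = np.ones(13_002)
theta = gradient_descent(theta0, lambda t: 2 * t)
print(np.linalg.norm(theta))  # shrinks toward 0, the minimum of the stand-in cost
```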
Some connections in our network have a much bigger impact on the final result than others. The gradient naturally encodes this importance!
Large change in cost function
→ This weight really matters!
→ Make a bigger adjustment
Small change in cost function
→ This weight doesn't matter much
→ Make a smaller adjustment
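You can see this numerically: in a made-up cost where one weight matters far more than another, the finite-difference partial derivatives come out with very different magnitudes, so gradient descent automatically nudges the important weight harder:

```python
import numpy as np

def cost(w):
    # Toy cost where w[0] matters much more than w[1] (illustrative only).
    return 100.0 * w[0] ** 2 + 0.1 * w[1] ** 2

def numeric_gradient(w, h=1e-6):
    # Finite-difference estimate of each partial derivative dC/dw_i.
    grad = np.zeros_like(w)
    for i in range(len(w)):
        bumped = w.copy()
        bumped[i] += h
        grad[i] = (cost(bumped) - cost(w)) / h
    return grad

g = numeric_gradient(np.array([1.0, 1.0]))
print(g)  # roughly [200, 0.2]: the first weight gets a ~1000x bigger nudge
```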
We've learned what gradient descent does, but how do we actually compute that gradient for a neural network?
But that's a story for our next lesson! 📚
Neural network learning is really just finding the best values for thousands of parameters by minimizing a cost function.
Next up: Backpropagation - the algorithm that makes it all possible!