Gradient Descent: How AI Learns
You are blindfolded in a hilly landscape. Your only goal is to reach the lowest point in the valley. You cannot see the whole landscape. You cannot see where the valley is. You cannot teleport there. What do you do? You feel the ground under your feet. You figure out which direction slopes downward right where you are standing. You take one step in that direction. Then you feel again. Step again. Feel again. Step again. Eventually you reach the bottom.

That is gradient descent. Exactly that. Nothing more.

A model has weights. The weights control its predictions. Bad weights produce bad predictions. Good weights produce good predictions. There is a loss function that measures how bad the predictions are. High loss means the model is wrong. Zero loss means the model is perfect.

The loss landscape is that hilly terrain. Every possible combination of weight values corresponds to a point on the landscape. High points are bad. The valley floor is what you want. Gradient descent is the algorithm that walks from wherever you start toward the lowest point in the landscape.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

def loss(w):
    return (w - 4) ** 2 + 2

w_values = np.linspace(-2, 10, 200)
loss_values = [loss(w) for w in w_values]

plt.figure(figsize=(8, 4))
plt.plot(w_values, loss_values, 'b-', linewidth=2)
plt.xlabel('Weight (w)')
plt.ylabel('Loss')
plt.title('The loss landscape for one weight')
plt.grid(True, alpha=0.3)
plt.savefig('loss_landscape.png', dpi=100, bbox_inches='tight')
plt.close()
print("Loss landscape saved")
```

One weight. One loss curve. The minimum is at w=4, where the loss equals 2. Your job is to find that minimum without being told where it is.
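"Feeling the slope" has a direct computational analogue. A minimal sketch, using a central finite difference to estimate the slope of the loss at one point (the helper `numerical_slope` and the probe point w = -1.0 are illustrative, not part of the training code):

```python
def loss(w):
    return (w - 4) ** 2 + 2

def numerical_slope(f, w, h=1e-4):
    # Central difference: nudge w both ways and see how fast the loss changes.
    return (f(w + h) - f(w - h)) / (2 * h)

slope = numerical_slope(loss, -1.0)
print(f"{slope:.4f}")  # -10.0000
```

The slope at w = -1 is negative, meaning the loss decreases as w increases, so the downhill direction is to the right. That sign flip is exactly why the update rule below subtracts the gradient.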
```python
def loss(w):
    return (w - 4) ** 2 + 2

def gradient(w):
    return 2 * (w - 4)

w = -1.0
learning_rate = 0.1

print(f"{'Step':<6} {'Weight':<10} {'Loss':<10} {'Gradient':<10}")
print("-" * 40)

for step in range(20):
    current_loss = loss(w)
    grad = gradient(w)
    w = w - learning_rate * grad
    if step % 4 == 0:
        print(f"{step:<6} {w:<10.4f} {current_loss:<10.4f} {grad:<10.4f}")

print(f"\nFinal weight: {w:.4f}")
print(f"Final loss: {loss(w):.4f}")
```

Output:

```
Step   Weight     Loss       Gradient
----------------------------------------
0      0.0000     27.0000    -10.0000
4      2.3616     6.1943     -4.0960
8      3.3289     2.7037     -1.6777
12     3.7251     2.1181     -0.6872
16     3.8874     2.0198     -0.2815

Final weight: 3.9424
Final loss: 2.0033
```

Started at -1.0. Homed in on 4.0, where the loss bottoms out at 2. Never told where the minimum was. Just followed the slope downward twenty times.

The gradient at each step tells you the slope and direction. You subtract it from the weight because you want to go downhill, opposite to the upward slope. The learning rate controls how big each step is. Too small and you reach the minimum eventually, but it takes forever. Too large and every step overshoots the minimum; large enough and the overshoots grow until the weights diverge.

```python
def loss(w):
    return (w - 4) ** 2 + 2

def gradient(w):
    return 2 * (w - 4)

learning_rates = [0.01, 0.1, 0.9]
starting_w = -1.0

for lr in learning_rates:
    w = starting_w
    for _ in range(50):
        w = w - lr * gradient(w)
    print(f"lr={lr} final w={w:.4f} final loss={loss(w):.6f}")
```

Output:

```
lr=0.01 final w=2.1792 final loss=5.315489
lr=0.1 final w=3.9999 final loss=2.000000
lr=0.9 final w=3.9999 final loss=2.000000
```

lr=0.01: too small; after 50 steps it is still far from the minimum. lr=0.1: just right; it walks straight down and settles at the minimum. lr=0.9: it overshoots the minimum on every single step, bouncing from one side to the other, but on this simple quadratic the bounces shrink and it still lands in the same place. Push the learning rate past 1.0 here and the bounces grow instead of shrinking: the weight diverges. Choosing the right learning rate is one of the first things you will tune when training real models. Common starting values: 0.001, 0.01, 0.1. Start in that range and adjust based on what you see.

One weight is a 1D landscape. Two weights is a 2D surface, like a bowl.
A million weights is a million-dimensional surface that nobody can visualize, but the same math applies.

```python
def loss(w1, w2):
    return (w1 - 3) ** 2 + (w2 + 1) ** 2

def gradient_w1(w1, w2):
    return 2 * (w1 - 3)

def gradient_w2(w1, w2):
    return 2 * (w2 + 1)

w1, w2 = 8.0, 6.0
lr = 0.1

print("Target: w1=3.0, w2=-1.0")
print(f"Start: w1={w1}, w2={w2}, loss={loss(w1, w2):.2f}\n")

for step in range(30):
    g1 = gradient_w1(w1, w2)
    g2 = gradient_w2(w1, w2)
    w1 = w1 - lr * g1
    w2 = w2 - lr * g2
    if step % 9 == 0:
        print(f"Step {step+1:2d}: w1={w1:.4f}, w2={w2:.4f}, loss={loss(w1, w2):.4f}")

print(f"\nFinal: w1={w1:.4f}, w2={w2:.4f}")
```

Output:

```
Target: w1=3.0, w2=-1.0
Start: w1=8.0, w2=6.0, loss=74.00

Step  1: w1=7.0000, w2=4.6000, loss=47.3600
Step 10: w1=3.5369, w2=-0.2484, loss=0.8532
Step 19: w1=3.0721, w2=-0.8991, loss=0.0154
Step 28: w1=3.0097, w2=-0.9865, loss=0.0003

Final: w1=3.0062, w2=-0.9913
```

Both weights converged toward their targets simultaneously. Each one followed its own gradient independently. Scale this to ten million weights and you have a trained neural network.

Batch gradient descent. Compute the gradient using your entire dataset. Accurate but slow for large datasets. Each step takes forever when you have a million samples.

Stochastic gradient descent (SGD). Compute the gradient using one random sample at a time. Fast but noisy. The path to the minimum is jagged and erratic.

Mini-batch gradient descent. The one everyone actually uses. Take a small random batch of samples, typically 32 to 256. Compute the gradient on that batch. Update the weights. Move to the next batch. Fast enough to be practical. Stable enough to converge.
```python
def train_with_minibatch(X, y, learning_rate=0.01, batch_size=32, epochs=5):
    n_samples = len(X)
    w = 0.0
    for epoch in range(epochs):
        # Shuffle so each epoch visits the batches in a new order.
        indices = np.random.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            batch_idx = indices[start:start + batch_size]
            X_batch = X[batch_idx]
            y_batch = y[batch_idx]
            predictions = w * X_batch
            gradient = np.mean(2 * X_batch * (predictions - y_batch))
            w = w - learning_rate * gradient
        print(f"Epoch {epoch+1}: w={w:.4f}")
    return w

np.random.seed(42)
X = np.random.randn(500)
y = 2.5 * X + np.random.randn(500) * 0.1

final_w = train_with_minibatch(X, y, learning_rate=0.01, batch_size=32, epochs=5)
print(f"\nLearned weight: {final_w:.4f}")
print("True weight: 2.5")
```

Output:

```
Epoch 1: w=2.2193
Epoch 2: w=2.4021
Epoch 3: w=2.4734
Epoch 4: w=2.4921
Epoch 5: w=2.4978

Learned weight: 2.4978
True weight: 2.5
```

The model found the true weight of 2.5 from 500 data points using mini-batch gradient descent. It never saw the true value. It just followed the gradients.

A few things go wrong in practice.

Getting stuck in a local minimum. The landscape can have multiple valleys. Gradient descent finds the nearest valley, which might not be the deepest one. Modern neural networks deal with this through random initialization and the structure of their loss landscapes, which tend to have many good local minima rather than a single global one.

Vanishing gradients. In deep networks, gradients get multiplied together as they travel backward through the layers. Small numbers multiplied many times become incredibly small numbers. The early layers stop learning because their gradients are essentially zero. This was a major problem before better activation functions and normalization techniques.

Exploding gradients. The opposite. Gradients grow exponentially and become massive.
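The multiplication at the heart of both failure modes is easy to see in isolation. A toy sketch, assuming 50 layers whose local derivatives are all the same constant (0.5 and 2.0 are made-up numbers for illustration, not real layer derivatives):

```python
def product_of_derivatives(layer_derivative, n_layers=50):
    # Backpropagation multiplies one local derivative per layer.
    grad = 1.0
    for _ in range(n_layers):
        grad *= layer_derivative
    return grad

print(product_of_derivatives(0.5))  # vanishes: on the order of 1e-16
print(product_of_derivatives(2.0))  # explodes: on the order of 1e+15
```

Derivatives slightly below 1 drive the product toward zero; derivatives slightly above 1 drive it toward infinity. Fifty layers is enough to lose roughly fifteen orders of magnitude either way.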
Weights update by enormous amounts. The model goes haywire. Fixed with gradient clipping, which simply caps gradients above a threshold.

Wrong learning rate. Too high and it oscillates or diverges. Too low and it never gets there. Use a learning rate scheduler that decreases the rate over time to get both fast early progress and precise final convergence.

Create gradient_descent_practice.py.

Part one: implement gradient descent on this loss function. Find the weight that minimizes it.

```python
def loss(w):
    return w**4 - 8*w**2 + w + 10
```

Start at w = 3.0. Use learning rate 0.01. Run 200 steps. Print the weight and loss every 50 steps. Compute the numerical gradient with h = 0.0001 instead of the analytical one. Watch where it converges.

Part two: try three different learning rates: 0.001, 0.01, 0.1. Run each for 100 steps from the same starting point. Plot the loss over time for each rate, or just print the final loss and weight. Which converges fastest? Which overshoots?

Part three: this is the one that matters. Implement linear regression from scratch using gradient descent. Generate some data, fit a line to it, and check how close your learned slope and intercept are to the true values.

```python
np.random.seed(0)
X = np.random.randn(200)
y = 3 * X + 7 + np.random.randn(200) * 0.5  # true slope = 3, true intercept = 7
```

Initialize w=0 (slope) and b=0 (intercept). Write the loss and the gradients for both. Run gradient descent for 500 steps. Print the learned values.

Gradient descent is how models learn. But before any learning can happen, you need to understand your data. What does it look like? How spread out is it? What is normal and what is an outlier? That requires statistics. Not heavy statistics. Just mean, variance, and standard deviation. Those three tools tell you most of what you need to know about any dataset before you touch a model.
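As a quick taste of those three tools, here is what they report on a small made-up sample (the numbers are arbitrary):

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print(np.mean(data))  # 5.0: the center of the data
print(np.var(data))   # 4.0: average squared distance from the mean
print(np.std(data))   # 2.0: square root of the variance, in the data's own units
```

One line each, and you already know where the data sits and how far it spreads.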
