Imagine standing on a hilly landscape, blindfolded, with one goal: reach the lowest point. You can't see the terrain, but you can feel the ground's slope beneath your feet. Logic dictates a simple strategy: take steps in the direction where the ground descends most steeply.
This intuition—moving in the direction of steepest descent—is precisely what gradient-based optimization does in machine learning. But instead of a 2D hillside, we navigate loss functions with millions of dimensions. Instead of physical terrain, we traverse abstract parameter spaces. The tool that generalizes 'slope' to this vast setting is the gradient vector.
The gradient is not merely important to machine learning—it is the foundation upon which nearly all modern optimization rests. Every training iteration of every neural network computes gradients. Every weight update moves opposite to the gradient. Understanding gradients deeply is essential for understanding how machines learn.
By the end of this page, you will deeply understand gradient vectors: their definition via partial derivatives, their geometric meaning as the direction of steepest ascent, their relationship to directional derivatives, and their practical computation. You'll see why gradients are the key to optimization in high-dimensional spaces.
Before we can construct the gradient, we need to understand how to measure change in a multivariable function along individual coordinate directions. This is the role of partial derivatives.
Definition (Partial Derivative):
The partial derivative of f(x) = f(x₁, x₂, ..., xₙ) with respect to the variable xᵢ is:
$$\frac{\partial f}{\partial x_i}(\mathbf{x}) = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}$$
The Core Idea:
A partial derivative measures how f changes when we vary only one variable while holding all others fixed. It's literally the ordinary derivative with respect to that variable, treating all other variables as constants.
Alternative Notations:
You'll encounter several equivalent notations:

- ∂f/∂xᵢ (Leibniz notation, used throughout this page)
- f_{xᵢ} (subscript notation)
- ∂ᵢf or Dᵢf (operator notation)
Computing Partial Derivatives:
The procedure is straightforward: treat every variable except xᵢ as a constant, then differentiate with respect to xᵢ using the ordinary single-variable rules.
Example:
For f(x, y, z) = x²y + yz³ + sin(xy):

$$\frac{\partial f}{\partial x} = 2xy + y\cos(xy), \qquad \frac{\partial f}{\partial y} = x^2 + z^3 + x\cos(xy), \qquad \frac{\partial f}{\partial z} = 3yz^2$$

Each partial is obtained by treating the other two variables as constants. The code below verifies these against finite differences:
```python
import numpy as np

def f(x, y, z):
    """Example function: f(x, y, z) = x²y + yz³ + sin(xy)"""
    return x**2 * y + y * z**3 + np.sin(x * y)

def partial_f_partial_x(x, y, z):
    """Analytical partial derivative with respect to x."""
    return 2*x*y + y * np.cos(x*y)

def partial_f_partial_y(x, y, z):
    """Analytical partial derivative with respect to y."""
    return x**2 + z**3 + x * np.cos(x*y)

def partial_f_partial_z(x, y, z):
    """Analytical partial derivative with respect to z."""
    return 3 * y * z**2

def numerical_partial_derivative(f, point, variable_index, h=1e-7):
    """
    Compute partial derivative numerically using central differences.

    Parameters:
    - f: function taking coordinates as separate arguments
    - point: tuple of (x, y, z, ...)
    - variable_index: which variable to differentiate (0-indexed)
    - h: step size for finite differences

    Returns:
    - Numerical approximation of partial derivative at point
    """
    point_plus = list(point)
    point_minus = list(point)
    point_plus[variable_index] += h
    point_minus[variable_index] -= h
    # Central difference formula: (f(x+h) - f(x-h)) / 2h
    return (f(*point_plus) - f(*point_minus)) / (2 * h)

# Test at a specific point
test_point = (2.0, 3.0, 1.0)  # x=2, y=3, z=1

print("Partial Derivatives at (x=2, y=3, z=1):")
print("-" * 50)

# Compute analytical derivatives
analytical = [
    partial_f_partial_x(*test_point),
    partial_f_partial_y(*test_point),
    partial_f_partial_z(*test_point),
]

# Compute numerical derivatives
numerical = [
    numerical_partial_derivative(f, test_point, i)
    for i in range(3)
]

variables = ['x', 'y', 'z']
for i, var in enumerate(variables):
    print(f"∂f/∂{var}:")
    print(f"  Analytical: {analytical[i]:.8f}")
    print(f"  Numerical:  {numerical[i]:.8f}")
    print(f"  Error:      {abs(analytical[i] - numerical[i]):.2e}")
    print()

print("The gradient vector at this point is:")
print(f"∇f = ({analytical[0]:.4f}, {analytical[1]:.4f}, {analytical[2]:.4f})")
```

In practice, we use automatic differentiation (like PyTorch's autograd), which computes exact analytical gradients through the chain rule. However, numerical gradients via finite differences are invaluable for gradient checking—verifying that your analytical gradient implementation is correct.
While partial derivatives tell us how a function changes along each coordinate axis individually, we often need to understand change in all directions at once. The gradient vector collects all partial derivatives into a single mathematical object.
Definition (Gradient):
The gradient of a scalar-valued function f: ℝⁿ → ℝ is the vector of all partial derivatives:
$$\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
The symbol ∇ is called nabla (or sometimes del). The gradient ∇f(x) is a vector in ℝⁿ—it has the same dimension as the input space.
Why a Vector?
The gradient's power lies in treating all directional information as a unified object. Separate partial derivatives would give n unrelated numbers. The gradient vector encodes both a direction (where f increases fastest) and a magnitude (how fast it increases there), as the geometry section below makes precise.
Gradient Properties:

- Linearity: ∇(af + bg) = a∇f + b∇g
- Product rule: ∇(fg) = f∇g + g∇f
- ∇f(x) = 0 at critical points, the candidates for minima, maxima, and saddle points
ML Examples of Gradients:
Linear Regression (MSE Loss):
For loss L(w) = (1/n)Σᵢ(wᵀxᵢ - yᵢ)², the gradient is:
$$\nabla_\mathbf{w} L = \frac{2}{n} \sum_{i=1}^{n} (\mathbf{w}^\top \mathbf{x}_i - y_i) \mathbf{x}_i = \frac{2}{n} \mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{y})$$
Logistic Regression (Cross-Entropy Loss):
For loss L(w) = -(1/n)Σᵢ[yᵢ log(σ(wᵀxᵢ)) + (1-yᵢ)log(1-σ(wᵀxᵢ))]:
$$\nabla_\mathbf{w} L = \frac{1}{n} \sum_{i=1}^{n} (\sigma(\mathbf{w}^\top \mathbf{x}_i) - y_i) \mathbf{x}_i$$
Notice how both gradients have the intuitive form: error × input.
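To make these formulas concrete, here is a minimal NumPy sketch (with synthetic X, y, and w chosen arbitrarily for illustration) that evaluates both gradients in vectorized form. Note that each result has the same shape as w:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 features (synthetic)
y = rng.normal(size=100)               # regression targets (synthetic)
y_bin = (y > 0).astype(float)          # binary labels for logistic regression
w = rng.normal(size=5)

n = len(y)
# MSE gradient: (2/n) Xᵀ(Xw − y)
grad_mse = (2 / n) * X.T @ (X @ w - y)
# Cross-entropy gradient: (1/n) Xᵀ(σ(Xw) − y)
grad_ce = (1 / n) * X.T @ (sigmoid(X @ w) - y_bin)

print(grad_mse.shape, grad_ce.shape)   # both (5,): same dimension as w
```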
The gradient's geometric meaning is what makes it so powerful for optimization. Understanding this geometry deeply transforms gradient descent from a mechanical algorithm into an intuitive navigation strategy.
The Fundamental Theorem:
The gradient ∇f(x) points in the direction of steepest ascent of f at x. The rate of increase in that direction equals the magnitude ‖∇f(x)‖.
Conversely, -∇f(x) points in the direction of steepest descent.
Proof Sketch:
The directional derivative (rate of change) in direction u (unit vector) is ∇f · u = ‖∇f‖ cos(θ), where θ is the angle between ∇f and u. This is maximized when θ = 0 (i.e., u parallel to ∇f) and minimized when θ = π (u anti-parallel to ∇f).
Orthogonality to Level Sets:
Consider a level set L_c = {x : f(x) = c}. If we move along the level set, f doesn't change—so the directional derivative in any tangent direction is zero. But D_uf = ∇f · u = 0 means u ⊥ ∇f. Therefore, ∇f is perpendicular to all tangent directions—it's normal to the level set.
Intuitive Picture:
Imagine a topographic map with contour lines (level sets of elevation):

- Each contour connects points of equal elevation, so walking along one changes nothing.
- The gradient at any point is perpendicular to the contour through it, pointing uphill.
- Where contours crowd together, the terrain is steep and the gradient magnitude is large.
```python
import numpy as np
import matplotlib.pyplot as plt

def f(x, y):
    """A quadratic function (like MSE loss)."""
    return (x - 2)**2 + (y - 1)**2

def gradient_f(x, y):
    """Gradient of f."""
    return np.array([2*(x - 2), 2*(y - 1)])

# Create a grid for contour plot
x = np.linspace(-1, 5, 100)
y = np.linspace(-2, 4, 100)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

# Create figure
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Contour plot with gradient vectors
ax1 = axes[0]
contours = ax1.contour(X, Y, Z, levels=15, cmap='viridis')
ax1.clabel(contours, inline=True, fontsize=8)

# Add gradient vectors at several points
points = [
    (0, 0), (1, 0), (0, 2), (1, 2), (3, 0),
    (4, 1), (3, 2), (0, 3), (4, 3)
]

for px, py in points:
    grad = gradient_f(px, py)
    grad_normalized = grad / (np.linalg.norm(grad) + 1e-8)
    # Plot gradient vector (direction of steepest ascent)
    ax1.arrow(px, py, grad_normalized[0]*0.4, grad_normalized[1]*0.4,
              head_width=0.12, head_length=0.08, fc='red', ec='red')
    ax1.plot(px, py, 'ko', markersize=5)

# Mark minimum
ax1.plot(2, 1, 'g*', markersize=15, label='Minimum (2, 1)')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Gradient Vectors (Red) Perpendicular to Contours')
ax1.legend()
ax1.set_aspect('equal')

# Plot 2: Gradient descent path
ax2 = axes[1]
contours2 = ax2.contour(X, Y, Z, levels=15, cmap='viridis', alpha=0.7)

# Gradient descent from starting point
start = np.array([4.5, 3.5])
learning_rate = 0.1
path = [start.copy()]

current = start.copy()
for i in range(20):
    grad = gradient_f(current[0], current[1])
    current = current - learning_rate * grad  # Gradient DESCENT
    path.append(current.copy())

path = np.array(path)
ax2.plot(path[:, 0], path[:, 1], 'r.-', linewidth=2, markersize=10,
         label='Gradient Descent Path')
ax2.plot(path[0, 0], path[0, 1], 'go', markersize=12, label='Start')
ax2.plot(2, 1, 'g*', markersize=15, label='Minimum')

ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('Gradient Descent: Moving Opposite to Gradient')
ax2.legend()
ax2.set_aspect('equal')

plt.tight_layout()
plt.savefig('gradient_geometry.png', dpi=150, bbox_inches='tight')
plt.show()

print("Key observations:")
print("- Gradient vectors point perpendicular to contour lines")
print("- Gradients point toward higher function values (ascent)")
print("- Gradient descent moves OPPOSITE to gradient (descent)")
print("- Path curves toward minimum, following steepest descent")
```

Be careful: ∇f points toward HIGHER values (ascent). For MINIMIZATION (the usual case in ML), we move in direction -∇f. Gradient DESCENT subtracts the gradient: θ_new = θ_old - α∇L(θ_old).
Partial derivatives measure change along coordinate axes. But what if we want to know how a function changes along an arbitrary direction? This calls for directional derivatives.
Definition (Directional Derivative):
The directional derivative of f at x in direction u (a unit vector, ‖u‖ = 1) is:
$$D_\mathbf{u} f(\mathbf{x}) = \lim_{h \to 0} \frac{f(\mathbf{x} + h\mathbf{u}) - f(\mathbf{x})}{h}$$
The Key Theorem:
If f is differentiable at x, the directional derivative is:
$$D_\mathbf{u} f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{u} = \|\nabla f(\mathbf{x})\| \cos(\theta)$$
where θ is the angle between ∇f and u.
Interpretation:
The directional derivative is the scalar projection of the gradient onto the direction u. Consequently, the rate of change in any direction is bounded between -‖∇f‖ and ‖∇f‖, as the table below summarizes.
Special Cases:
Partial derivatives are directional derivatives along coordinate axes: choosing u = eᵢ (the i-th standard basis vector) gives

$$D_{\mathbf{e}_i} f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{e}_i = \frac{\partial f}{\partial x_i}$$
So partial derivatives are just special cases of directional derivatives!
| Direction u | D_u f | Interpretation |
|---|---|---|
| ∇f / ‖∇f‖ | ‖∇f‖ (maximum) | Steepest ascent |
| -∇f / ‖∇f‖ | -‖∇f‖ (minimum) | Steepest descent |
| Any u ⊥ ∇f | 0 | Tangent to level set |
| Arbitrary unit u | ‖∇f‖ cos(θ) | Between -‖∇f‖ and ‖∇f‖ |
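A quick numerical sanity check of the theorem and this table, using an arbitrarily chosen quadratic test function: the central-difference rate of change along a random unit direction should match ∇f · u, and a direction orthogonal to ∇f (tangent to the level set) should give approximately zero:

```python
import numpy as np

def f(x):
    """Arbitrary test function: f(x) = x₁² + 3x₂²."""
    return x[0]**2 + 3 * x[1]**2

def grad_f(x):
    return np.array([2 * x[0], 6 * x[1]])

x0 = np.array([1.0, -2.0])
g = grad_f(x0)

# Random unit direction
rng = np.random.default_rng(0)
u = rng.normal(size=2)
u /= np.linalg.norm(u)

h = 1e-6
numeric = (f(x0 + h * u) - f(x0 - h * u)) / (2 * h)  # D_u f by central difference
print(f"numeric: {numeric:.8f}, analytic ∇f·u: {g @ u:.8f}")

# A direction perpendicular to ∇f is tangent to the level set: D_u f ≈ 0
perp = np.array([-g[1], g[0]]) / np.linalg.norm(g)
print(f"tangent direction: {(f(x0 + h * perp) - f(x0 - h * perp)) / (2 * h):.2e}")
```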
Why Directional Derivatives Matter for ML:
Theoretical: Proves that gradient descent moves in the locally optimal direction
Line Search: When minimizing along a direction d, we're computing D_d f to find step size
Second-Order Methods: Curvature along d involves second directional derivatives
Understanding Momentum: Momentum blends the current gradient with past update directions; directional derivatives let us check whether the blended direction d is still a descent direction (it is whenever D_d L = ∇L · d < 0)
If v is not a unit vector, we can still form D_v f = ∇f · v (without requiring ‖v‖ = 1). This is the rate of change of f along the path x + tv per unit t, which scales with ‖v‖. Gradient descent's step θ := θ - α∇L uses exactly this unnormalized form with v = ∇L.
Computing gradients in neural networks presents unique challenges due to the composition of many functions and the enormous number of parameters. The solution is backpropagation—an efficient algorithm exploiting the chain rule.
The Chain Rule for Compositions:
If y = f(g(x)), then:
$$\frac{\partial y}{\partial x_i} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x_i}$$
For deeper compositions y = f₃(f₂(f₁(x))):
$$\nabla_\mathbf{x} y = \left(\frac{\partial f_3}{\partial f_2}\right) \left(\frac{\partial f_2}{\partial f_1}\right) \left(\nabla_\mathbf{x} f_1\right)$$
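A tiny scalar sketch of this composition rule, with arbitrarily chosen f₁ = x², f₂ = sin, f₃ = exp; the product of the local derivatives matches a finite-difference estimate:

```python
import numpy as np

# Composition y = f3(f2(f1(x))) with arbitrary choices f1 = x², f2 = sin, f3 = exp.
# Chain rule: dy/dx = exp(sin(x²)) · cos(x²) · 2x, accumulated right to left.
x = 0.7
a, da = x**2, 2 * x                # f1 and its local derivative
b, db = np.sin(a), np.cos(a)       # f2 evaluated at f1(x)
c, dc = np.exp(b), np.exp(b)       # f3 evaluated at f2(f1(x)); exp is its own derivative
analytic = dc * db * da

h = 1e-7
numeric = (np.exp(np.sin((x + h)**2)) - np.exp(np.sin((x - h)**2))) / (2 * h)
print(f"analytic: {analytic:.8f}, numeric: {numeric:.8f}")
```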
Backpropagation Intuition:

- The forward pass evaluates the network layer by layer and caches intermediate values.
- The backward pass starts from ∂L/∂output and applies the chain rule in reverse, multiplying each layer's local derivative into the running gradient.
- Every parameter's gradient is collected along the way, so no derivative is recomputed.
Computational Efficiency:
A naive approach to computing ∂L/∂θᵢ separately for each of the millions of parameters would be catastrophically slow. Backpropagation computes all gradients in one backward pass, with cost proportional to the forward pass. This is why neural network training is feasible.
```python
import numpy as np

class ManualBackpropNetwork:
    """
    A simple 2-layer neural network with manual gradient computation.
    Demonstrates backpropagation without automatic differentiation.

    Architecture: Input -> Linear -> ReLU -> Linear -> MSE Loss
    """

    def __init__(self, input_dim, hidden_dim, output_dim):
        # He initialization (scaled for ReLU)
        self.W1 = np.random.randn(hidden_dim, input_dim) * np.sqrt(2.0 / input_dim)
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(output_dim, hidden_dim) * np.sqrt(2.0 / hidden_dim)
        self.b2 = np.zeros(output_dim)
        # Cache for backward pass
        self.cache = {}

    def forward(self, X):
        """Forward pass, caching intermediate values."""
        # Layer 1: z1 = W1 @ x + b1 (for each sample)
        z1 = X @ self.W1.T + self.b1        # Shape: (batch, hidden)
        # ReLU activation
        h1 = np.maximum(0, z1)
        # Layer 2: z2 = W2 @ h1 + b2
        z2 = h1 @ self.W2.T + self.b2       # Shape: (batch, output)
        # Cache for backprop
        self.cache = {'X': X, 'z1': z1, 'h1': h1, 'z2': z2}
        return z2

    def mse_loss(self, predictions, targets):
        """Mean squared error loss."""
        return np.mean((predictions - targets)**2)

    def backward(self, predictions, targets):
        """
        Backward pass: compute gradients of loss w.r.t. all parameters.

        This is the core of backpropagation:
        1. Start with gradient of loss w.r.t. output
        2. Propagate backward through each layer using chain rule
        """
        batch_size = predictions.shape[0]
        # Retrieve cached values
        X = self.cache['X']
        z1 = self.cache['z1']
        h1 = self.cache['h1']

        # ========================================
        # Gradient of MSE Loss w.r.t. predictions
        # L = (1/n) * sum((pred - target)^2)
        # dL/dpred = (2/n) * (pred - target)
        # ========================================
        dL_dz2 = (2.0 / batch_size) * (predictions - targets)

        # ========================================
        # Layer 2 gradients (Linear: z2 = W2 @ h1 + b2)
        # dL/dW2 = dL/dz2 @ h1^T (summed over batch)
        # dL/db2 = sum(dL/dz2)
        # dL/dh1 = W2^T @ dL/dz2 (for propagation)
        # ========================================
        dL_dW2 = dL_dz2.T @ h1              # Shape: (output, hidden)
        dL_db2 = np.sum(dL_dz2, axis=0)
        dL_dh1 = dL_dz2 @ self.W2           # Shape: (batch, hidden)

        # ========================================
        # ReLU gradient (h1 = max(0, z1))
        # dh1/dz1 = 1 if z1 > 0, else 0
        # ========================================
        dL_dz1 = dL_dh1 * (z1 > 0).astype(float)  # Element-wise

        # ========================================
        # Layer 1 gradients (Linear: z1 = W1 @ x + b1)
        # ========================================
        dL_dW1 = dL_dz1.T @ X               # Shape: (hidden, input)
        dL_db1 = np.sum(dL_dz1, axis=0)

        return {
            'dW1': dL_dW1, 'db1': dL_db1,
            'dW2': dL_dW2, 'db2': dL_db2
        }

    def gradient_descent_step(self, gradients, learning_rate):
        """Update parameters using computed gradients."""
        self.W1 -= learning_rate * gradients['dW1']
        self.b1 -= learning_rate * gradients['db1']
        self.W2 -= learning_rate * gradients['dW2']
        self.b2 -= learning_rate * gradients['db2']

# Demonstration
np.random.seed(42)

# Create network
net = ManualBackpropNetwork(input_dim=4, hidden_dim=8, output_dim=2)

# Synthetic data
X_train = np.random.randn(32, 4)   # 32 samples, 4 features
y_train = np.random.randn(32, 2)   # 32 samples, 2 outputs

# Training loop
print("Manual Backpropagation Training:")
print("-" * 50)

for epoch in range(100):
    # Forward pass
    predictions = net.forward(X_train)
    loss = net.mse_loss(predictions, y_train)
    # Backward pass
    gradients = net.backward(predictions, y_train)
    # Update
    net.gradient_descent_step(gradients, learning_rate=0.01)

    if epoch % 20 == 0:
        grad_norm = np.sqrt(sum(np.sum(g**2) for g in gradients.values()))
        print(f"Epoch {epoch:3d}: Loss = {loss:.6f}, ||∇L|| = {grad_norm:.6f}")

print("\nThis is exactly what PyTorch/TensorFlow do automatically!")
```

Modern deep learning frameworks (PyTorch, TensorFlow, JAX) implement automatic differentiation, computing exact gradients for arbitrary computational graphs. You define the forward pass; the framework automatically generates the backward pass. Understanding manual backprop helps you debug when autograd gives unexpected results.
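Since the note above mentions autograd, here is the same idea in PyTorch for comparison (the function echoes the earlier partial-derivative example, restricted to x and y):

```python
import torch

# f(x, y) = x²y + sin(xy); autograd builds the backward pass for us
v = torch.tensor([2.0, 3.0], requires_grad=True)
x, y = v[0], v[1]
f = x**2 * y + torch.sin(x * y)
f.backward()

# v.grad holds (∂f/∂x, ∂f/∂y) = (2xy + y·cos(xy), x² + x·cos(xy))
print(v.grad)
```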
Armed with the gradient, we can now state the fundamental algorithm of machine learning optimization: gradient descent.
Gradient Descent Algorithm:
Input: Initial parameters θ₀, learning rate α, loss function L
For t = 0, 1, 2, ... until convergence:
1. Compute gradient: g_t = ∇L(θ_t)
2. Update parameters: θ_{t+1} = θ_t - α · g_t
Why This Works:

To first order, L(θ - αg) ≈ L(θ) - α‖g‖². Since ‖g‖² > 0 whenever g ≠ 0, a sufficiently small step strictly decreases the loss.
Learning Rate α:
The learning rate controls step size:

- Too small: convergence is painfully slow.
- Too large: updates overshoot the minimum and can oscillate or diverge.
- Well chosen: steady, rapid progress toward a minimum.
Finding good learning rates is a major practical challenge. Modern methods (Adam, learning rate schedulers) help automate this.
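The three regimes are easy to see on a toy problem. A minimal sketch on f(θ) = θ² (gradient 2θ), where the update θ ← (1 - 2α)θ makes the behavior explicit; the specific α values are illustrative choices:

```python
def run_gd(lr, steps=20, theta0=1.0):
    """Gradient descent on f(θ) = θ², whose gradient is 2θ."""
    theta = theta0
    for _ in range(steps):
        theta -= lr * 2 * theta    # θ ← θ − α·∇f(θ) = (1 − 2α)θ
    return theta

# α = 0.01: slow shrink; α = 0.4: rapid convergence; α = 1.1: divergence
for lr in [0.01, 0.4, 1.1]:
    print(f"α = {lr:4.2f} -> θ after 20 steps: {run_gd(lr): .6e}")
```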
Convergence Guarantees:
For smooth, convex functions with appropriate learning rate:

- Gradient descent converges to the global minimum.
- With step size α ≤ 1/L (where L is the smoothness constant), the suboptimality decreases at rate O(1/t); strong convexity improves this to a geometric rate.
For non-convex functions (neural networks):

- Only convergence to a stationary point (∇L = 0) is guaranteed.
- That point may be a local minimum, a plateau, or a saddle point.
Despite theoretical concerns about local minima and saddle points, gradient descent works remarkably well for neural networks in practice. Research suggests that in high dimensions, most local minima have loss values close to the global minimum, and saddle points can be escaped via noise in SGD.
While gradients power machine learning, they can misbehave in various ways. Understanding these pathologies is essential for debugging training issues.
Vanishing Gradients:
Problem: In deep networks, gradients can become exponentially small as they propagate backward. If |∂f/∂x| < 1 at each layer, a 100-layer product shrinks like 0.9¹⁰⁰ ≈ 2.7 × 10⁻⁵.
This plagues:

- Deep networks with sigmoid or tanh activations
- Vanilla RNNs processing long sequences
Solutions:

- ReLU-family activations (gradient 1 on the active region)
- Residual (skip) connections
- Careful initialization and normalization layers
- Gated recurrent architectures (LSTM, GRU)
Exploding Gradients:
Problem: The opposite—gradients grow exponentially. If |∂f/∂x| > 1 at each layer, a 100-layer product grows like 1.1¹⁰⁰ ≈ 13,780.
This causes:

- Loss spiking to NaN or infinity
- Wild parameter oscillations that destroy training progress
Solutions:

- Gradient clipping: rescale the gradient when its norm exceeds a threshold (see the sketch below)
- Smaller learning rates
- Careful weight initialization
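A minimal sketch of clipping by global norm, the most common variant (the max_norm threshold is an illustrative choice):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# An 'exploded' gradient (norm 500) is rescaled to norm 1
grads = [np.array([300.0, -400.0])]
clipped = clip_by_global_norm(grads)
print(np.linalg.norm(clipped[0]))   # 1.0
```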
Zero Gradients (Saturation):
Problem: Some regions have exactly zero gradient, stopping learning entirely. The classic example is a "dead" ReLU neuron whose pre-activation is negative for every input.
Solutions:

- Leaky ReLU, ELU, or GELU activations, which keep a nonzero gradient everywhere
- Careful initialization and moderate learning rates to avoid pushing neurons into dead regions
| Activation | Formula | Gradient | Issue |
|---|---|---|---|
| Sigmoid | σ(z) = 1/(1+e^(-z)) | σ(z)(1-σ(z)) | Vanishes for large \|z\| |
| Tanh | tanh(z) | 1 - tanh²(z) | Vanishes for large \|z\| |
| ReLU | max(0, z) | 1 if z > 0, else 0 | Dead neurons for z < 0 |
| Leaky ReLU | max(αz, z) with α=0.01 | 1 if z > 0, else α | Keeps a small nonzero gradient for z < 0 |
| GELU | z·Φ(z) | Φ(z) + z·φ(z) | No dead neurons |
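To see the saturation rows of this table numerically, evaluate σ′ and tanh′ at a few growing values of z (a small sketch):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Gradients of sigmoid and tanh collapse toward zero as |z| grows
for z in [0.0, 2.0, 5.0, 10.0]:
    d_sig = sigmoid(z) * (1 - sigmoid(z))   # σ'(z) = σ(z)(1 − σ(z))
    d_tanh = 1 - np.tanh(z)**2              # tanh'(z) = 1 − tanh²(z)
    print(f"z = {z:4.1f}: σ' = {d_sig:.2e}, tanh' = {d_tanh:.2e}")
```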
When implementing custom layers or operations, always verify gradients numerically: compute ∂L/∂θ by finite differences and compare to your analytical gradient. Relative error should be < 10⁻⁵. This catches bugs that would otherwise cause silent training failures.
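A minimal gradient-checking helper along these lines (the test function f(θ) = ‖θ‖², with gradient 2θ, is just an example):

```python
import numpy as np

def gradient_check(f, grad_f, theta, h=1e-6, tol=1e-5):
    """Compare an analytical gradient against central finite differences."""
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += h
        tm[i] -= h
        numeric[i] = (f(tp) - f(tm)) / (2 * h)
    analytic = grad_f(theta)
    rel_err = np.linalg.norm(analytic - numeric) / (
        np.linalg.norm(analytic) + np.linalg.norm(numeric) + 1e-12)
    return rel_err < tol, rel_err

# Example: f(θ) = ‖θ‖² has gradient 2θ, so the check should pass
ok, err = gradient_check(lambda t: t @ t, lambda t: 2 * t,
                         np.array([1.0, -2.0, 3.0]))
print(f"passed: {ok}, relative error: {err:.2e}")
```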
The gradient is the workhorse of machine learning optimization. We've covered it from multiple angles—definition, geometry, computation, and application.
Key Concepts:

- Partial derivative ∂f/∂xᵢ: rate of change along one coordinate, all others held fixed
- Gradient ∇f: the vector of all partial derivatives; it points in the direction of steepest ascent, with magnitude equal to the maximal rate of increase
- ∇f is orthogonal to level sets
- Directional derivative: D_u f = ∇f · u, maximized along the gradient
- Gradient descent: θ ← θ - α∇L(θ), stepping opposite the gradient
- Pathologies: vanishing, exploding, and zero gradients, each with standard remedies
What's Next:
Gradients tell us about functions from ℝⁿ to ℝ (scalar outputs). But what about functions that output vectors—like neural network layers? The next page introduces Jacobian matrices, which generalize gradients to vector-valued functions. Jacobians are essential for understanding how errors propagate through network layers during backpropagation.
You now deeply understand gradient vectors—from definition through geometry to practical computation. Gradients are the foundation of every neural network training algorithm. Next, we generalize to Jacobian matrices for vector-valued functions.