Consider a remarkable fact: near any point where a function is smooth, you can approximate it to high accuracy using nothing but polynomials. The exponential function? Locally, a polynomial. Sine and cosine? Polynomials. The complex loss landscape of a neural network? Locally, approximately a polynomial.
This is the power of Taylor series—one of the most useful tools in all of applied mathematics. By expanding a function as a polynomial, we can analyze, approximate, and optimize it using little more than algebra and linear algebra.
In machine learning, Taylor expansions underlie the convergence analysis of gradient descent, second-order methods such as Newton's method, and approximations such as the Laplace approximation.
This page develops Taylor series for multivariable functions, building toward the crucial second-order approximation that captures curvature through the Hessian matrix.
By the end of this page, you will understand Taylor series in multiple dimensions: the general formula, first and second-order approximations, multi-index notation, convergence considerations, and applications to analyzing optimization landscapes.
Before tackling multiple dimensions, let's solidify our understanding of the single-variable case.
Taylor Series Expansion:
For a function f: ℝ → ℝ that is infinitely differentiable at point a:
$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x - a)^n$$
Expanded:
$$f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f'''(a)}{3!}(x-a)^3 + \cdots$$
Key Terms: a is the expansion point (or center), f⁽ⁿ⁾(a) is the n-th derivative of f evaluated at a, and the truncated sums below are called Taylor polynomials.
Truncated Expansions (Taylor Polynomials):
In practice, we truncate after finitely many terms, keeping the n-th order Taylor polynomial:
$$T_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!} (x - a)^k$$
The truncation error is O((x-a)^{n+1}) for an n-th order expansion.
Classic Examples:
| Function | Taylor Series at a=0 | Convergence |
|---|---|---|
| eˣ | 1 + x + x²/2! + x³/3! + ... | All ℝ |
| sin(x) | x - x³/3! + x⁵/5! - ... | All ℝ |
| 1/(1-x) | 1 + x + x² + x³ + ... | -1 < x < 1 |
| ln(1+x) | x - x²/2 + x³/3 - ... | -1 < x ≤ 1 |
The factor 1/n! arises because differentiating (x-a)ⁿ exactly n times yields n!. The 1/n! cancels this, ensuring that the n-th derivative of the Taylor polynomial at a equals f⁽ⁿ⁾(a): the polynomial is designed so that all derivatives match at the expansion point.
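As a quick numerical check of the truncation error (a minimal sketch; the choice of eˣ and the evaluation point are arbitrary), the script below compares truncated Taylor polynomials against the true value and against the leading error term:

```python
import math

def taylor_exp(x, order):
    """Truncated Taylor polynomial of e^x about a = 0."""
    return sum(x**n / math.factorial(n) for n in range(order + 1))

x = 0.5
for order in range(1, 6):
    approx = taylor_exp(x, order)
    error = abs(math.exp(x) - approx)
    # Leading error term is x^(order+1) / (order+1)!
    predicted = x**(order + 1) / math.factorial(order + 1)
    print(f"order {order}: error = {error:.2e}, leading term = {predicted:.2e}")
```

The error shrinks roughly like (x-a)ⁿ⁺¹, as the O((x-a)ⁿ⁺¹) bound predicts.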
The Taylor series extends naturally to functions of multiple variables, though the notation becomes more complex.
First-Order (Linear) Approximation:
For f: ℝⁿ → ℝ near point a:
$$f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a})$$
This is the tangent hyperplane to the surface f(x) at a.
Second-Order (Quadratic) Approximation:
$$f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \frac{1}{2} (\mathbf{x} - \mathbf{a})^\top \mathbf{H}_f(\mathbf{a}) (\mathbf{x} - \mathbf{a})$$
where H_f(a) is the Hessian matrix of second partial derivatives:
$$\mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$$
General Form (Multi-Index Notation):
For the full infinite series, we use multi-indices α = (α₁, ..., αₙ) with |α| = α₁+...+αₙ:
$$f(\mathbf{x}) = \sum_{|\alpha| = 0}^{\infty} \frac{D^\alpha f(\mathbf{a})}{\alpha!} (\mathbf{x} - \mathbf{a})^\alpha$$
where α! = α₁!·α₂!·...·αₙ! and (x - a)^α = (x₁-a₁)^{α₁}...(xₙ-aₙ)^{αₙ}.
Don't worry about this notation—we'll primarily use first and second-order approximations.
| Order | Term | Geometric Meaning | ML Application |
|---|---|---|---|
| 0 | f(a) | Function value at point | Current loss value |
| 1 | ∇f(a)ᵀ(Δx) | Linear (tangent plane) | Gradient descent direction |
| 2 | ½ΔxᵀHΔx | Quadratic (curvature) | Newton's method, learning rate |
| 3+ | Higher-order terms | Fine corrections | Rarely used in optimization |
The quadratic term ½ΔxᵀHΔx is a quadratic form. Its sign depends on the eigenvalues of H. Positive eigenvalues = upward curving (minimum direction). Negative eigenvalues = downward curving (maximum direction). Mixed = saddle point.
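To make this concrete, here is a small sketch (the matrices below are illustrative examples, not taken from the text) that classifies a critical point from the eigenvalues of a symmetric Hessian:

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    """Classify a critical point from the eigenvalues of a symmetric Hessian."""
    eigvals = np.linalg.eigvalsh(H)  # real eigenvalues, since H is symmetric
    if np.all(eigvals > tol):
        return "local minimum (positive definite)"
    if np.all(eigvals < -tol):
        return "local maximum (negative definite)"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point (indefinite)"
    return "degenerate (eigenvalues near zero; higher-order test needed)"

print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 1.0]])))    # minimum
print(classify_critical_point(np.array([[-1.0, 0.0], [0.0, -3.0]])))  # maximum
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -1.0]])))   # saddle
```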
Let's understand why the multivariate Taylor series takes its form by building it up from directional derivatives.
Reduction to One Dimension:
Consider moving from a toward x along the straight line:
$$\mathbf{r}(t) = \mathbf{a} + t(\mathbf{x} - \mathbf{a}), \quad t \in [0, 1]$$
Define g(t) = f(r(t)). This is a single-variable function!
Apply the 1D Taylor series:
$$g(1) = g(0) + g'(0) + \frac{1}{2}g''(0) + \cdots$$
But g(0) = f(a) and g(1) = f(x). So:
$$f(\mathbf{x}) = f(\mathbf{a}) + g'(0) + \frac{1}{2}g''(0) + \cdots$$
Computing g'(0):
By the chain rule:
$$g'(t) = \nabla f(\mathbf{r}(t))^\top \mathbf{r}'(t) = \nabla f(\mathbf{r}(t))^\top (\mathbf{x} - \mathbf{a})$$
At t=0:
$$g'(0) = \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a})$$
Computing g''(0):
Differentiating again:
$$g''(t) = (\mathbf{x} - \mathbf{a})^\top \mathbf{H}_f(\mathbf{r}(t)) (\mathbf{x} - \mathbf{a})$$
At t=0:
$$g''(0) = (\mathbf{x} - \mathbf{a})^\top \mathbf{H}_f(\mathbf{a}) (\mathbf{x} - \mathbf{a})$$
Assembling the Pieces:
Putting it together:
$$f(\mathbf{x}) = f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \frac{1}{2} (\mathbf{x} - \mathbf{a})^\top \mathbf{H}_f(\mathbf{a}) (\mathbf{x} - \mathbf{a}) + O(|\mathbf{x}-\mathbf{a}|^3)$$
This derivation shows the multivariate Taylor series is just the 1D version applied along any straight-line path!
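As a sanity check on this derivation, the short sketch below (using f(x, y) = sin(x)cos(y), the same example as the longer script that follows) compares finite-difference estimates of g'(0) and g''(0) with ∇f(a)ᵀ(x − a) and (x − a)ᵀH(a)(x − a):

```python
import numpy as np

f = lambda p: np.sin(p[0]) * np.cos(p[1])
a = np.array([1.0, 0.5])
x = np.array([1.4, 0.9])

grad_a = np.array([np.cos(a[0]) * np.cos(a[1]), -np.sin(a[0]) * np.sin(a[1])])
H_a = np.array([[-np.sin(a[0]) * np.cos(a[1]), -np.cos(a[0]) * np.sin(a[1])],
                [-np.cos(a[0]) * np.sin(a[1]), -np.sin(a[0]) * np.cos(a[1])]])

g = lambda t: f(a + t * (x - a))           # 1D slice of f along the segment
h = 1e-4
g1_fd = (g(h) - g(-h)) / (2 * h)           # central difference for g'(0)
g2_fd = (g(h) - 2 * g(0) + g(-h)) / h**2   # central difference for g''(0)

print(g1_fd, grad_a @ (x - a))             # should agree to several decimals
print(g2_fd, (x - a) @ H_a @ (x - a))      # should agree to several decimals
```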
The following script visualizes the zeroth-, first-, and second-order Taylor approximations of f(x, y) = sin(x)cos(y) around an expansion point and quantifies their errors:

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def f(x, y):
    """Example function: f(x, y) = sin(x) * cos(y)"""
    return np.sin(x) * np.cos(y)

def gradient_f(x, y):
    """Gradient of f."""
    df_dx = np.cos(x) * np.cos(y)
    df_dy = -np.sin(x) * np.sin(y)
    return np.array([df_dx, df_dy])

def hessian_f(x, y):
    """Hessian of f."""
    d2f_dx2 = -np.sin(x) * np.cos(y)
    d2f_dxdy = -np.cos(x) * np.sin(y)
    d2f_dy2 = -np.sin(x) * np.cos(y)
    return np.array([
        [d2f_dx2, d2f_dxdy],
        [d2f_dxdy, d2f_dy2]
    ])

def taylor_0(x, y, a, b):
    """Zeroth-order: constant."""
    return f(a, b) * np.ones_like(x)

def taylor_1(x, y, a, b):
    """First-order: linear (tangent plane)."""
    grad = gradient_f(a, b)
    return f(a, b) + grad[0]*(x-a) + grad[1]*(y-b)

def taylor_2(x, y, a, b):
    """Second-order: quadratic."""
    grad = gradient_f(a, b)
    H = hessian_f(a, b)
    linear = grad[0]*(x-a) + grad[1]*(y-b)
    # Quadratic term: 0.5 * [dx, dy] @ H @ [dx, dy]^T
    dx, dy = x - a, y - b
    quadratic = 0.5 * (H[0,0]*dx**2 + 2*H[0,1]*dx*dy + H[1,1]*dy**2)
    return f(a, b) + linear + quadratic

# Expansion point
a, b = 1.0, 0.5

# Create a grid
x = np.linspace(a - 1.5, a + 1.5, 100)
y = np.linspace(b - 1.5, b + 1.5, 100)
X, Y = np.meshgrid(x, y)

# Compute all approximations
Z_true = f(X, Y)
Z_0 = taylor_0(X, Y, a, b)
Z_1 = taylor_1(X, Y, a, b)
Z_2 = taylor_2(X, Y, a, b)

# Plotting
fig = plt.figure(figsize=(16, 12))

# 3D surface comparisons
for idx, (Z_approx, title) in enumerate([
    (Z_true, "True Function: sin(x)cos(y)"),
    (Z_1, "1st Order (Tangent Plane)"),
    (Z_2, "2nd Order (Quadratic)"),
], 1):
    ax = fig.add_subplot(2, 3, idx, projection='3d')
    ax.plot_surface(X, Y, Z_approx, cmap='viridis', alpha=0.8)
    ax.plot_surface(X, Y, Z_true, cmap='plasma', alpha=0.3)
    ax.scatter([a], [b], [f(a, b)], color='red', s=100, label='Expansion point')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('f(x,y)')
    ax.set_title(title)

# Error plots (2D heatmaps)
for idx, (Z_approx, title, order) in enumerate([
    (Z_0, "0th Order Error", 0),
    (Z_1, "1st Order Error", 1),
    (Z_2, "2nd Order Error", 2),
], 4):
    ax = fig.add_subplot(2, 3, idx)
    error = np.abs(Z_true - Z_approx)
    im = ax.contourf(X, Y, np.log10(error + 1e-10), levels=20, cmap='hot')
    ax.plot(a, b, 'c*', markersize=15)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_title(f'{title} (log₁₀ scale)')
    plt.colorbar(im, ax=ax)

plt.tight_layout()
plt.savefig('taylor_approximations.png', dpi=150, bbox_inches='tight')
plt.show()

# Quantify errors at various distances
print("\nError Analysis (expansion at ({:.1f}, {:.1f})):".format(a, b))
print("-" * 60)

distances = [0.1, 0.3, 0.5, 1.0]
for dist in distances:
    # Sample points at distance 'dist' from (a, b)
    theta_samples = np.linspace(0, 2*np.pi, 100)
    errors_0, errors_1, errors_2 = [], [], []
    for theta in theta_samples:
        px, py = a + dist*np.cos(theta), b + dist*np.sin(theta)
        true_val = f(px, py)
        errors_0.append(abs(true_val - taylor_0(px, py, a, b)))
        errors_1.append(abs(true_val - taylor_1(px, py, a, b)))
        errors_2.append(abs(true_val - taylor_2(px, py, a, b)))
    print(f"Distance {dist:.1f}: Max errors = {max(errors_0):.4f} (0th), "
          f"{max(errors_1):.4f} (1st), {max(errors_2):.4f} (2nd)")
```

The second-order term involves the Hessian matrix—the matrix of all second partial derivatives. The Hessian captures curvature information about the function.
Definition (Hessian Matrix):
$$\mathbf{H}_f(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$
Key Properties:
Symmetry: If f has continuous second partial derivatives, ∂²f/∂xᵢ∂xⱼ = ∂²f/∂xⱼ∂xᵢ (Schwarz's theorem). Thus H = Hᵀ.
Dimension: n × n for f: ℝⁿ → ℝ
Eigenvalues: All real (since symmetric). They determine curvature magnitude along principal axes.
Relationship to Jacobian: The Hessian is the Jacobian of the gradient: $\mathbf{H}_f = \mathbf{J}_{\nabla f}$
Curvature Interpretation:
The eigenvalues λ₁, ..., λₙ of H at a critical point (where ∇f = 0) determine the point's nature: all positive gives a local minimum, all negative a local maximum, mixed signs a saddle point, and any zero eigenvalues leave the second-order test inconclusive.
Some example Hessians:
| Function | Hessian | Properties |
|---|---|---|
| f(x) = x²/2 | H = 1 | Positive, convex |
| f(x) = ‖x‖²/2 | H = I | Positive definite, isotropically convex |
| f(x) = xᵀAx (A sym.) | H = 2A | Definiteness matches A |
| MSE Loss (linear reg.) | H = 2XᵀX/n | Positive semidefinite |
| Cross-entropy (logistic) | H = Xᵀdiag(p(1-p))X | Positive semidefinite |
In high-dimensional loss landscapes (like neural networks), saddle points vastly outnumber local minima. At a random critical point in n dimensions, each eigenvalue is ~50% likely positive, ~50% negative. So the probability of all positive (local min) is 2⁻ⁿ—vanishingly small as n grows!
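A quick Monte Carlo sketch of this point (using random symmetric matrices as crude stand-ins for Hessians at random critical points; the 2⁻ⁿ figure itself comes from the coin-flip heuristic above and is only approximate): the fraction of draws whose eigenvalues are all positive collapses as n grows, while saddles dominate.

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 2000

for n in [2, 4, 6, 10]:
    minima = saddles = 0
    for _ in range(trials):
        A = rng.standard_normal((n, n))
        H = (A + A.T) / 2                      # random symmetric "Hessian"
        eigvals = np.linalg.eigvalsh(H)
        if np.all(eigvals > 0):
            minima += 1
        elif np.any(eigvals > 0) and np.any(eigvals < 0):
            saddles += 1
    print(f"n={n:2d}: minima fraction = {minima/trials:.3f}, "
          f"saddle fraction = {saddles/trials:.3f}")
```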
The second-order Taylor expansion provides a quadratic model of the loss function. This model is the foundation for second-order optimization methods.
The Quadratic Model:
Near current parameters θ, approximate the loss:
$$L(\mathbf{\theta} + \Delta\mathbf{\theta}) \approx L(\mathbf{\theta}) + \nabla L^\top \Delta\mathbf{\theta} + \frac{1}{2} \Delta\mathbf{\theta}^\top \mathbf{H} \Delta\mathbf{\theta}$$
Newton's Method:
Set derivative of quadratic model to zero:
$$\nabla_{\Delta\theta}\left[ L + \nabla L^\top \Delta\theta + \frac{1}{2} \Delta\theta^\top \mathbf{H} \Delta\theta \right] = \nabla L + \mathbf{H} \Delta\theta = \mathbf{0}$$
Solving: Δθ = -H⁻¹∇L
Newton Update: θ ← θ - H⁻¹∇L
Comparison to Gradient Descent:
| Method | Update | Rate | Cost |
|---|---|---|---|
| Gradient Descent | θ - α∇L | Sublinear O(1/t) for convex; linear for strongly convex | O(n) per step |
| Newton's Method | θ - H⁻¹∇L | Quadratic (error ~ c^{2^t}) near the optimum | O(n³) per step |
Newton converges much faster (quadratically!) near the minimum, but each step costs O(n³) to invert the Hessian—often prohibitive for large n.
Practical Variants: quasi-Newton methods (BFGS, L-BFGS) build approximate curvature from gradient history; Gauss-Newton and natural-gradient methods use structured curvature surrogates; Hessian-free methods use only Hessian-vector products with conjugate gradients; diagonal and Kronecker-factored (K-FAC-style) approximations trade accuracy for scalability. One such variant is sketched after the code example below.
The following script compares gradient descent and (damped) Newton's method on the Rosenbrock function, a classic ill-conditioned benchmark:

```python
import numpy as np
import matplotlib.pyplot as plt

def rosenbrock(x, y, a=1, b=100):
    """The Rosenbrock function: a challenging optimization benchmark."""
    return (a - x)**2 + b * (y - x**2)**2

def rosenbrock_gradient(x, y, a=1, b=100):
    """Gradient of Rosenbrock function."""
    dx = -2*(a - x) - 4*b*x*(y - x**2)
    dy = 2*b*(y - x**2)
    return np.array([dx, dy])

def rosenbrock_hessian(x, y, a=1, b=100):
    """Hessian of Rosenbrock function."""
    dxx = 2 - 4*b*(y - 3*x**2)
    dxy = -4*b*x
    dyy = 2*b
    return np.array([[dxx, dxy], [dxy, dyy]])

def gradient_descent(f, grad, x0, lr=0.001, max_iter=10000, tol=1e-8):
    """Standard gradient descent."""
    x = x0.copy()
    path = [x.copy()]
    for i in range(max_iter):
        g = grad(x[0], x[1])
        x = x - lr * g
        path.append(x.copy())
        if np.linalg.norm(g) < tol:
            break
    return np.array(path)

def newton_method(f, grad, hessian, x0, max_iter=100, tol=1e-8):
    """Newton's method with Hessian."""
    x = x0.copy()
    path = [x.copy()]
    for i in range(max_iter):
        g = grad(x[0], x[1])
        H = hessian(x[0], x[1])
        # Newton step: delta = -H^{-1} @ g
        try:
            delta = -np.linalg.solve(H, g)
        except np.linalg.LinAlgError:
            # Hessian singular, fall back to gradient step
            delta = -0.01 * g
        # Damped Newton (line search would be better)
        alpha = 1.0
        while alpha > 1e-10:
            x_new = x + alpha * delta
            if f(x_new[0], x_new[1]) < f(x[0], x[1]):
                break
            alpha *= 0.5
        x = x + alpha * delta
        path.append(x.copy())
        if np.linalg.norm(g) < tol:
            break
    return np.array(path)

# Starting point (far from minimum at (1, 1))
x0 = np.array([-1.5, 2.0])

# Run both methods
path_gd = gradient_descent(rosenbrock, rosenbrock_gradient, x0,
                           lr=0.001, max_iter=5000)
path_newton = newton_method(rosenbrock, rosenbrock_gradient, rosenbrock_hessian,
                            x0, max_iter=50)

print(f"Gradient Descent: {len(path_gd)} iterations")
print(f"Newton's Method: {len(path_newton)} iterations")
print(f"\nGD final point: {path_gd[-1]}")
print(f"Newton final point: {path_newton[-1]}")

# Plotting
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Create contour plot background
x = np.linspace(-2, 2, 200)
y = np.linspace(-1, 3, 200)
X, Y = np.meshgrid(x, y)
Z = rosenbrock(X, Y)

for ax, path, title in [(axes[0], path_gd, f'Gradient Descent ({len(path_gd)} steps)'),
                        (axes[1], path_newton, f"Newton's Method ({len(path_newton)} steps)")]:
    ax.contour(X, Y, Z, levels=np.logspace(-1, 3, 20), cmap='viridis', alpha=0.7)
    ax.plot(path[:, 0], path[:, 1], 'r.-', linewidth=1.5, markersize=4)
    ax.plot(path[0, 0], path[0, 1], 'go', markersize=10, label='Start')
    ax.plot(1, 1, 'r*', markersize=15, label='Minimum (1, 1)')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_title(title)
    ax.legend()
    ax.set_aspect('equal')

plt.tight_layout()
plt.savefig('newton_vs_gd.png', dpi=150, bbox_inches='tight')
plt.show()

# Convergence plot
fig, ax = plt.subplots(figsize=(10, 6))

losses_gd = [rosenbrock(p[0], p[1]) for p in path_gd]
losses_newton = [rosenbrock(p[0], p[1]) for p in path_newton]

ax.semilogy(losses_gd[:500], 'b-', label='Gradient Descent', alpha=0.7)
ax.semilogy(losses_newton, 'r-', label="Newton's Method", linewidth=2)
ax.set_xlabel('Iteration')
ax.set_ylabel('Loss (log scale)')
ax.set_title('Convergence Comparison')
ax.legend()
ax.grid(True, alpha=0.3)

plt.savefig('convergence_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
```

Newton's method has three issues: (1) computing H costs O(n²) and inverting it costs O(n³)—prohibitive for neural nets with n = 10⁸ parameters; (2) H might be indefinite (not positive definite), causing Newton to move toward saddle points; (3) the quadratic model is only accurate locally, so far from the minimum pure Newton can diverge.
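As an illustration of one practical variant, the sketch below (assuming SciPy is available; `rosenbrock_vec` and `rosenbrock_grad_vec` are small wrappers introduced here) runs the quasi-Newton L-BFGS method, which approximates curvature from gradient history instead of forming and inverting H:

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock_vec(p, a=1, b=100):
    """Rosenbrock function taking a single parameter vector."""
    x, y = p
    return (a - x)**2 + b * (y - x**2)**2

def rosenbrock_grad_vec(p, a=1, b=100):
    """Gradient of the Rosenbrock function as a vector."""
    x, y = p
    return np.array([-2*(a - x) - 4*b*x*(y - x**2), 2*b*(y - x**2)])

result = minimize(rosenbrock_vec, x0=np.array([-1.5, 2.0]),
                  jac=rosenbrock_grad_vec, method='L-BFGS-B')
print(result.x, result.nit)  # converges to (1, 1) in far fewer steps than plain GD
```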
Taylor series provide the mathematical framework to analyze why and when gradient descent converges.
Setting Up the Analysis:
Let f be a smooth function we're minimizing. Gradient descent updates:
$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha \nabla f(\mathbf{x}_k)$$
Taylor expansion of f at the new point:
$$f(\mathbf{x}_{k+1}) = f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^\top (\mathbf{x}_{k+1} - \mathbf{x}_k) + O(\|\mathbf{x}_{k+1} - \mathbf{x}_k\|^2)$$
Substituting the GD update:
$$f(\mathbf{x}_{k+1}) = f(\mathbf{x}_k) - \alpha |\nabla f(\mathbf{x}_k)|^2 + O(\alpha^2)$$
Key Insight:
For small α, the first-order term dominates: each step decreases f by approximately α‖∇f‖². This proves gradient descent makes progress (when gradient is nonzero and α is small enough).
Optimal Learning Rate:
Including the second-order term:
$$f(\mathbf{x}_{k+1}) \approx f(\mathbf{x}_k) - \alpha |\nabla f|^2 + \frac{\alpha^2}{2} \nabla f^\top \mathbf{H} \nabla f$$
Minimizing over α, the optimal step size (under the quadratic approximation) is:
$$\alpha^* = \frac{|\nabla f|^2}{\nabla f^\top \mathbf{H} \nabla f}$$
This is the exact line search result for quadratic functions.
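Here is a minimal numerical check of this claim for a purely quadratic function f(x) = ½xᵀAx − bᵀx, whose Hessian is A (the matrix and vectors are arbitrary test data):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)          # symmetric positive definite Hessian
b = rng.standard_normal(4)

f = lambda x: 0.5 * x @ A @ x - b @ x
x = rng.standard_normal(4)
g = A @ x - b                        # gradient of the quadratic at x

alpha_star = (g @ g) / (g @ A @ g)   # predicted optimal step size

# Compare against a brute-force scan over candidate step sizes
alphas = np.linspace(1e-4, 2 * alpha_star, 2001)
losses = [f(x - a * g) for a in alphas]
print(alpha_star, alphas[np.argmin(losses)])  # should agree closely
```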
Condition Number and Convergence:
Define the condition number κ = λ_max / λ_min (ratio of largest to smallest eigenvalue of H).
For a quadratic function, GD has convergence rate:
$$f(\mathbf{x}_k) - f^* \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^{2k} (f(\mathbf{x}_0) - f^*)$$
Large κ (ill-conditioned) means slow convergence. This is why poorly conditioned problems are hard for gradient descent.
| Function Property | Maximum Safe α | Convergence Rate |
|---|---|---|
| L-smooth (‖∇²f‖ ≤ L) | α < 2/L | O(1/k) |
| μ-strongly convex | α ≤ 2/(L+μ) | O((1-μ/L)^k) linear |
| Quadratic, condition κ | α = 2/(λ_max + λ_min) | O(((κ-1)/(κ+1))^{2k}) |
| Non-convex, L-smooth | α < 1/L | O(1/√k) to critical point |
A function is L-smooth if its gradient is L-Lipschitz: ‖∇f(x) - ∇f(y)‖ ≤ L‖x - y‖. Equivalently, the Hessian eigenvalues are bounded by L. The safe learning rate is α < 2/L because larger steps can overshoot and increase the loss.
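The effect of the condition number can be seen directly on a small quadratic. A minimal sketch (the eigenvalues and starting point are chosen for illustration) runs gradient descent with α = 2/(λ_max + λ_min) and compares the observed error decay with the ((κ−1)/(κ+1))^{2k} rate from the table above:

```python
import numpy as np

lam_min, lam_max = 1.0, 25.0               # Hessian eigenvalues, kappa = 25
H = np.diag([lam_min, lam_max])
kappa = lam_max / lam_min
alpha = 2.0 / (lam_max + lam_min)          # step size from the table above
rho = (kappa - 1) / (kappa + 1)

f = lambda x: 0.5 * x @ H @ x              # quadratic with minimum value f* = 0
x = np.array([1.0, 1.0])
f0 = f(x)

for k in range(1, 21):
    x = x - alpha * (H @ x)                # gradient descent step
    if k % 5 == 0:
        print(f"k={k:2d}: f(x_k)/f(x_0) = {f(x)/f0:.3e}, "
              f"rate rho^(2k) = {rho**(2*k):.3e}")
```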
Taylor series appear throughout machine learning beyond just optimization analysis. Here are key applications:
1. Activation Function Approximations:
Many activations are well described by their low-order Taylor expansions: for small inputs, tanh(x) ≈ x − x³/3 and σ(x) ≈ 1/2 + x/4 − x³/48, which is why saturating activations behave almost linearly near zero; the popular tanh-based GELU approximation likewise relies on a low-order polynomial correction term. A quick numerical check appears below.
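A minimal sketch checking these small-input expansions (the interval is an arbitrary choice):

```python
import numpy as np

x = np.linspace(-0.5, 0.5, 5)

tanh_taylor = x - x**3 / 3                  # tanh(x) ≈ x - x^3/3 near 0
sigmoid_taylor = 0.5 + x / 4 - x**3 / 48    # sigma(x) ≈ 1/2 + x/4 - x^3/48 near 0

print(np.max(np.abs(np.tanh(x) - tanh_taylor)))            # ~1e-3
print(np.max(np.abs(1 / (1 + np.exp(-x)) - sigmoid_taylor)))  # ~1e-4
```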
2. Loss Landscape Visualization:
To visualize high-dimensional loss landscapes, project onto 2 random directions u, v:
$$\text{plot } f(\mathbf{\theta} + \alpha \mathbf{u} + \beta \mathbf{v})$$
Taylor expansion predicts this is approximately quadratic near θ:
$$\approx f(\mathbf{\theta}) + \alpha \nabla f \cdot \mathbf{u} + \beta \nabla f \cdot \mathbf{v} + \frac{1}{2}(\alpha^2 \mathbf{u}^\top\mathbf{H}\mathbf{u} + \cdots)$$
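A minimal sketch of this slicing idea (using a synthetic toy loss in place of a real network, with randomly drawn directions u and v):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
d = 50                                  # parameter dimension
theta = rng.standard_normal(d)          # "current parameters"
A = rng.standard_normal((d, d))
Q = A @ A.T / d                         # fixed PSD matrix defining a toy loss

def loss(p):
    """Toy loss: quadratic bowl plus a mild nonlinearity."""
    return 0.5 * p @ Q @ p + 0.1 * np.sum(np.sin(p))

u = rng.standard_normal(d)
u /= np.linalg.norm(u)
v = rng.standard_normal(d)
v /= np.linalg.norm(v)

alphas = np.linspace(-2, 2, 60)
betas = np.linspace(-2, 2, 60)
Z = np.array([[loss(theta + a * u + b * v) for a in alphas] for b in betas])

plt.contourf(alphas, betas, Z, levels=30, cmap='viridis')
plt.xlabel('alpha (direction u)')
plt.ylabel('beta (direction v)')
plt.title('2D slice of a toy loss landscape')
plt.colorbar()
plt.show()
```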
3. Fisher Information and Natural Gradient:
The Fisher information matrix F is the expected outer product of the score (the gradient of the log-likelihood); under standard regularity conditions it equals the negative expected Hessian of the log-likelihood:
$$\mathbf{F} = \mathbb{E}[\nabla \log p(x|\theta) \nabla \log p(x|\theta)^\top]$$
Natural gradient uses F instead of H: θ ← θ - αF⁻¹∇L. This accounts for the geometry of probability distributions.
4. Laplace Approximation:
In Bayesian deep learning, the posterior p(θ|data) is intractable. The Laplace approximation fits a Gaussian using the second-order Taylor expansion at the mode:
$$p(\theta|\text{data}) \approx \mathcal{N}(\theta^*, \mathbf{H}^{-1})$$
where θ* is the MAP estimate and H is the Hessian of the negative log-posterior.
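As a minimal 1D illustration (using a Beta density as a stand-in posterior so the mode and second derivative have closed forms), the sketch below matches a Gaussian N(θ*, H⁻¹) at the mode:

```python
import numpy as np
from scipy.stats import beta, norm

a_param, b_param = 8.0, 4.0                        # Beta "posterior" parameters
mode = (a_param - 1) / (a_param + b_param - 2)     # MAP estimate theta*

# Second derivative of the negative log-density at the mode (the 1D "Hessian")
H = (a_param - 1) / mode**2 + (b_param - 1) / (1 - mode)**2
laplace = norm(loc=mode, scale=np.sqrt(1.0 / H))   # N(theta*, H^{-1})

theta = np.linspace(0.01, 0.99, 200)
max_gap = np.max(np.abs(beta.pdf(theta, a_param, b_param) - laplace.pdf(theta)))
print(f"mode = {mode:.3f}, Laplace std = {np.sqrt(1/H):.3f}, "
      f"max density gap = {max_gap:.3f}")
```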
Whenever you encounter a nonlinear function in ML, ask: 'What does its Taylor expansion look like?' The first-order term reveals gradient-based behavior. The second-order term reveals curvature and conditioning. This thinking illuminates many ML phenomena.
Taylor series provide the mathematical lens through which we understand local function behavior—essential for optimization theory and practice.
Key Concepts: Taylor series approximate a function locally by polynomials built from its derivatives; the first-order term (the gradient) gives the tangent-plane approximation; the second-order term (the Hessian) captures curvature; and the resulting quadratic model underlies Newton's method, learning-rate selection, and condition-number analysis of gradient descent.
What's Next:
We've now seen how the Hessian appears in the second-order Taylor expansion. The final page dives deeper into second-order approximations and the Hessian matrix specifically: properties, computation, eigenvalue analysis for critical point classification, and practical approaches for leveraging curvature in large-scale optimization.
You now understand Taylor series in multiple dimensions and their central role in optimization. The second-order approximation with the Hessian matrix enables sophisticated optimization algorithms and deep analysis of learning dynamics. Next: a focused study of second-order approximations and Hessian properties.