Consider a remarkable fact: near any point where a function is smooth, you can approximate it to high accuracy using nothing but polynomials. The exponential function? Locally, a polynomial. Sine and cosine? Polynomials. The complex loss landscape of a neural network? Locally, approximately a polynomial.
This is the power of Taylor series—one of the most useful tools in all of applied mathematics. By expanding a function as a polynomial, we can analyze, approximate, and optimize it using little more than algebra and linear algebra.
In machine learning, Taylor expansions underlie the convergence analysis of gradient descent, second-order methods such as Newton's method, and approximations such as the Laplace approximation.
This page develops Taylor series for multivariable functions, building toward the crucial second-order approximation that captures curvature through the Hessian matrix.
By the end of this page, you will understand Taylor series in multiple dimensions: the general formula, first and second-order approximations, multi-index notation, convergence considerations, and applications to analyzing optimization landscapes.
Before tackling multiple dimensions, let's solidify our understanding of the single-variable case.
Taylor Series Expansion:
For a function f: ℝ → ℝ that is infinitely differentiable at point a:
$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x - a)^n$$
Expanded:
$$f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f'''(a)}{3!}(x-a)^3 + \cdots$$
Key Terms: a is the expansion point (or center), f⁽ⁿ⁾(a) is the n-th derivative of f evaluated at a, and the truncated sums below are called Taylor polynomials.
Truncated Expansions (Taylor Polynomials):
In practice, we truncate after finitely many terms, keeping the n-th order Taylor polynomial:
$$T_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!} (x - a)^k$$
The truncation error is O((x-a)^{n+1}) for an n-th order expansion.
Classic Examples:
| Function | Taylor Series at a=0 | Convergence |
|---|---|---|
| eˣ | 1 + x + x²/2! + x³/3! + ... | All ℝ |
| sin(x) | x - x³/3! + x⁵/5! - ... | All ℝ |
| 1/(1-x) | 1 + x + x² + x³ + ... | -1 < x < 1 |
| ln(1+x) | x - x²/2 + x³/3 - ... | -1 < x ≤ 1 |
The factor 1/n! arises because differentiating (x-a)ⁿ exactly n times yields n!. The 1/n! cancels this, ensuring that the n-th derivative of the Taylor polynomial at a equals f⁽ⁿ⁾(a): the polynomial is designed so that all derivatives match at the expansion point.
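As a quick numerical check of the truncation error (a minimal sketch; the choice of eˣ and the evaluation point are arbitrary), the script below compares truncated Taylor polynomials against the true value and against the leading error term:

```python
import math

def taylor_exp(x, order):
    """Truncated Taylor polynomial of e^x about a = 0."""
    return sum(x**n / math.factorial(n) for n in range(order + 1))

x = 0.5
for order in range(1, 6):
    approx = taylor_exp(x, order)
    error = abs(math.exp(x) - approx)
    # Leading error term is x^(order+1) / (order+1)!
    predicted = x**(order + 1) / math.factorial(order + 1)
    print(f"order {order}: error = {error:.2e}, leading term = {predicted:.2e}")
```

The error shrinks roughly like (x-a)ⁿ⁺¹, as the O((x-a)ⁿ⁺¹) bound predicts.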
The Taylor series extends naturally to functions of multiple variables, though the notation becomes more complex.
First-Order (Linear) Approximation:
For f: ℝⁿ → ℝ near point a:
$$f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a})$$
This is the tangent hyperplane to the surface f(x) at a.
Second-Order (Quadratic) Approximation:
$$f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \frac{1}{2} (\mathbf{x} - \mathbf{a})^\top \mathbf{H}_f(\mathbf{a}) (\mathbf{x} - \mathbf{a})$$
where H_f(a) is the Hessian matrix of second partial derivatives:
$$\mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$$
General Form (Multi-Index Notation):
For the full infinite series, we use multi-indices α = (α₁, ..., αₙ) with |α| = α₁+...+αₙ:
$$f(\mathbf{x}) = \sum_{|\alpha| = 0}^{\infty} \frac{D^\alpha f(\mathbf{a})}{\alpha!} (\mathbf{x} - \mathbf{a})^\alpha$$
where α! = α₁!·α₂!·...·αₙ! and (x - a)^α = (x₁-a₁)^{α₁}...(xₙ-aₙ)^{αₙ}.
Don't worry about this notation—we'll primarily use first and second-order approximations.
| Order | Term | Geometric Meaning | ML Application |
|---|---|---|---|
| 0 | f(a) | Function value at point | Current loss value |
| 1 | ∇f(a)ᵀ(Δx) | Linear (tangent plane) | Gradient descent direction |
| 2 | ½ΔxᵀHΔx | Quadratic (curvature) | Newton's method, learning rate |
| 3+ | Higher-order terms | Fine corrections | Rarely used in optimization |
The quadratic term ½ΔxᵀHΔx is a quadratic form. Its sign depends on the eigenvalues of H. Positive eigenvalues = upward curving (minimum direction). Negative eigenvalues = downward curving (maximum direction). Mixed = saddle point.
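To make this concrete, here is a small sketch (the matrices below are illustrative examples, not taken from the text) that classifies a critical point from the eigenvalues of a symmetric Hessian:

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    """Classify a critical point from the eigenvalues of a symmetric Hessian."""
    eigvals = np.linalg.eigvalsh(H)  # real eigenvalues, since H is symmetric
    if np.all(eigvals > tol):
        return "local minimum (positive definite)"
    if np.all(eigvals < -tol):
        return "local maximum (negative definite)"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point (indefinite)"
    return "degenerate (eigenvalues near zero; higher-order test needed)"

print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 1.0]])))    # minimum
print(classify_critical_point(np.array([[-1.0, 0.0], [0.0, -3.0]])))  # maximum
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -1.0]])))   # saddle
```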
Let's understand why the multivariate Taylor series takes its form by building it up from directional derivatives.
Reduction to One Dimension:
Consider moving from a toward x along the straight line:
$$\mathbf{r}(t) = \mathbf{a} + t(\mathbf{x} - \mathbf{a}), \quad t \in [0, 1]$$
Define g(t) = f(r(t)). This is a single-variable function!
Apply the 1D Taylor series:
$$g(1) = g(0) + g'(0) + \frac{1}{2}g''(0) + \cdots$$
But g(0) = f(a) and g(1) = f(x). So:
$$f(\mathbf{x}) = f(\mathbf{a}) + g'(0) + \frac{1}{2}g''(0) + \cdots$$
Computing g'(0):
By the chain rule:
$$g'(t) = \nabla f(\mathbf{r}(t))^\top \mathbf{r}'(t) = \nabla f(\mathbf{r}(t))^\top (\mathbf{x} - \mathbf{a})$$
At t=0:
$$g'(0) = \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a})$$
Computing g''(0):
Differentiating again:
$$g''(t) = (\mathbf{x} - \mathbf{a})^\top \mathbf{H}_f(\mathbf{r}(t)) (\mathbf{x} - \mathbf{a})$$
At t=0:
$$g''(0) = (\mathbf{x} - \mathbf{a})^\top \mathbf{H}_f(\mathbf{a}) (\mathbf{x} - \mathbf{a})$$
Assembling the Pieces:
Putting it together:
$$f(\mathbf{x}) = f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \frac{1}{2} (\mathbf{x} - \mathbf{a})^\top \mathbf{H}_f(\mathbf{a}) (\mathbf{x} - \mathbf{a}) + O(|\mathbf{x}-\mathbf{a}|^3)$$
This derivation shows the multivariate Taylor series is just the 1D version applied along any straight-line path!
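As a sanity check on this derivation, the short sketch below (using f(x, y) = sin(x)cos(y), the same example as the longer script that follows) compares finite-difference estimates of g'(0) and g''(0) with ∇f(a)ᵀ(x − a) and (x − a)ᵀH(a)(x − a):

```python
import numpy as np

f = lambda p: np.sin(p[0]) * np.cos(p[1])
a = np.array([1.0, 0.5])
x = np.array([1.4, 0.9])

grad_a = np.array([np.cos(a[0]) * np.cos(a[1]), -np.sin(a[0]) * np.sin(a[1])])
H_a = np.array([[-np.sin(a[0]) * np.cos(a[1]), -np.cos(a[0]) * np.sin(a[1])],
                [-np.cos(a[0]) * np.sin(a[1]), -np.sin(a[0]) * np.cos(a[1])]])

g = lambda t: f(a + t * (x - a))           # 1D slice of f along the segment
h = 1e-4
g1_fd = (g(h) - g(-h)) / (2 * h)           # central difference for g'(0)
g2_fd = (g(h) - 2 * g(0) + g(-h)) / h**2   # central difference for g''(0)

print(g1_fd, grad_a @ (x - a))             # should agree to several decimals
print(g2_fd, (x - a) @ H_a @ (x - a))      # should agree to several decimals
```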
The following script visualizes the zeroth-, first-, and second-order Taylor approximations of f(x, y) = sin(x)cos(y) around an expansion point and quantifies their errors:

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def f(x, y):
    """Example function: f(x, y) = sin(x) * cos(y)"""
    return np.sin(x) * np.cos(y)

def gradient_f(x, y):
    """Gradient of f."""
    df_dx = np.cos(x) * np.cos(y)
    df_dy = -np.sin(x) * np.sin(y)
    return np.array([df_dx, df_dy])

def hessian_f(x, y):
    """Hessian of f."""
    d2f_dx2 = -np.sin(x) * np.cos(y)
    d2f_dxdy = -np.cos(x) * np.sin(y)
    d2f_dy2 = -np.sin(x) * np.cos(y)
    return np.array([
        [d2f_dx2, d2f_dxdy],
        [d2f_dxdy, d2f_dy2]
    ])

def taylor_0(x, y, a, b):
    """Zeroth-order: constant."""
    return f(a, b) * np.ones_like(x)

def taylor_1(x, y, a, b):
    """First-order: linear (tangent plane)."""
    grad = gradient_f(a, b)
    return f(a, b) + grad[0]*(x-a) + grad[1]*(y-b)

def taylor_2(x, y, a, b):
    """Second-order: quadratic."""
    grad = gradient_f(a, b)
    H = hessian_f(a, b)
    linear = grad[0]*(x-a) + grad[1]*(y-b)
    # Quadratic term: 0.5 * [dx, dy] @ H @ [dx, dy]^T
    dx, dy = x - a, y - b
    quadratic = 0.5 * (H[0,0]*dx**2 + 2*H[0,1]*dx*dy + H[1,1]*dy**2)
    return f(a, b) + linear + quadratic

# Expansion point
a, b = 1.0, 0.5

# Create a grid
x = np.linspace(a - 1.5, a + 1.5, 100)
y = np.linspace(b - 1.5, b + 1.5, 100)
X, Y = np.meshgrid(x, y)

# Compute all approximations
Z_true = f(X, Y)
Z_0 = taylor_0(X, Y, a, b)
Z_1 = taylor_1(X, Y, a, b)
Z_2 = taylor_2(X, Y, a, b)

# Plotting
fig = plt.figure(figsize=(16, 12))

# 3D surface comparisons
for idx, (Z_approx, title) in enumerate([
    (Z_true, "True Function: sin(x)cos(y)"),
    (Z_1, "1st Order (Tangent Plane)"),
    (Z_2, "2nd Order (Quadratic)"),
], 1):
    ax = fig.add_subplot(2, 3, idx, projection='3d')
    ax.plot_surface(X, Y, Z_approx, cmap='viridis', alpha=0.8)
    ax.plot_surface(X, Y, Z_true, cmap='plasma', alpha=0.3)
    ax.scatter([a], [b], [f(a, b)], color='red', s=100, label='Expansion point')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('f(x,y)')
    ax.set_title(title)

# Error plots (2D heatmaps)
for idx, (Z_approx, title, order) in enumerate([
    (Z_0, "0th Order Error", 0),
    (Z_1, "1st Order Error", 1),
    (Z_2, "2nd Order Error", 2),
], 4):
    ax = fig.add_subplot(2, 3, idx)
    error = np.abs(Z_true - Z_approx)
    im = ax.contourf(X, Y, np.log10(error + 1e-10), levels=20, cmap='hot')
    ax.plot(a, b, 'c*', markersize=15)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_title(f'{title} (log₁₀ scale)')
    plt.colorbar(im, ax=ax)

plt.tight_layout()
plt.savefig('taylor_approximations.png', dpi=150, bbox_inches='tight')
plt.show()

# Quantify errors at various distances
print("\nError Analysis (expansion at ({:.1f}, {:.1f})):".format(a, b))
print("-" * 60)

distances = [0.1, 0.3, 0.5, 1.0]
for dist in distances:
    # Sample points at distance 'dist' from (a, b)
    theta_samples = np.linspace(0, 2*np.pi, 100)
    errors_0, errors_1, errors_2 = [], [], []
    for theta in theta_samples:
        px, py = a + dist*np.cos(theta), b + dist*np.sin(theta)
        true_val = f(px, py)
        errors_0.append(abs(true_val - taylor_0(px, py, a, b)))
        errors_1.append(abs(true_val - taylor_1(px, py, a, b)))
        errors_2.append(abs(true_val - taylor_2(px, py, a, b)))
    print(f"Distance {dist:.1f}: Max errors = {max(errors_0):.4f} (0th), "
          f"{max(errors_1):.4f} (1st), {max(errors_2):.4f} (2nd)")
```

The second-order term involves the Hessian matrix—the matrix of all second partial derivatives. The Hessian captures curvature information about the function.
Definition (Hessian Matrix):
$$\mathbf{H}_f(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$
Key Properties:
Symmetry: If f has continuous second partial derivatives, ∂²f/∂xᵢ∂xⱼ = ∂²f/∂xⱼ∂xᵢ (Schwarz's theorem). Thus H = Hᵀ.
Dimension: n × n for f: ℝⁿ → ℝ
Eigenvalues: All real (since symmetric). They determine curvature magnitude along principal axes.
Relationship to Jacobian: The Hessian is the Jacobian of the gradient: $\mathbf{H}_f = \mathbf{J}_{\nabla f}$
Curvature Interpretation:
The eigenvalues λ₁, ..., λₙ of H at a critical point (where ∇f = 0) determine the point's nature: all positive gives a local minimum, all negative a local maximum, mixed signs a saddle point, and any zero eigenvalues leave the second-order test inconclusive.
Some example Hessians:
| Function | Hessian | Properties |
|---|---|---|
| f(x) = x²/2 | H = 1 | Positive, convex |
| f(x) = ‖x‖²/2 | H = I | Positive definite, isotropically convex |
| f(x) = xᵀAx (A sym.) | H = 2A | Definiteness matches A |
| MSE Loss (linear reg.) | H = 2XᵀX/n | Positive semidefinite |
| Cross-entropy (logistic) | H = Xᵀdiag(p(1-p))X | Positive semidefinite |
In high-dimensional loss landscapes (like neural networks), saddle points vastly outnumber local minima. At a random critical point in n dimensions, each eigenvalue is ~50% likely positive, ~50% negative. So the probability of all positive (local min) is 2⁻ⁿ—vanishingly small as n grows!
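A quick Monte Carlo sketch of this point (using random symmetric matrices as crude stand-ins for Hessians at random critical points; the 2⁻ⁿ figure itself comes from the coin-flip heuristic above and is only approximate): the fraction of draws whose eigenvalues are all positive collapses as n grows, while saddles dominate.

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 2000

for n in [2, 4, 6, 10]:
    minima = saddles = 0
    for _ in range(trials):
        A = rng.standard_normal((n, n))
        H = (A + A.T) / 2                      # random symmetric "Hessian"
        eigvals = np.linalg.eigvalsh(H)
        if np.all(eigvals > 0):
            minima += 1
        elif np.any(eigvals > 0) and np.any(eigvals < 0):
            saddles += 1
    print(f"n={n:2d}: minima fraction = {minima/trials:.3f}, "
          f"saddle fraction = {saddles/trials:.3f}")
```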
The second-order Taylor expansion provides a quadratic model of the loss function. This model is the foundation for second-order optimization methods.
The Quadratic Model:
Near current parameters θ, approximate the loss:
$$L(\mathbf{\theta} + \Delta\mathbf{\theta}) \approx L(\mathbf{\theta}) + \nabla L^\top \Delta\mathbf{\theta} + \frac{1}{2} \Delta\mathbf{\theta}^\top \mathbf{H} \Delta\mathbf{\theta}$$
Newton's Method:
Set derivative of quadratic model to zero:
$$\nabla_{\Delta\theta}\left[ L + \nabla L^\top \Delta\theta + \frac{1}{2} \Delta\theta^\top \mathbf{H} \Delta\theta \right] = \nabla L + \mathbf{H} \Delta\theta = \mathbf{0}$$
Solving: Δθ = -H⁻¹∇L
Newton Update: θ ← θ - H⁻¹∇L
Comparison to Gradient Descent:
| Method | Update | Rate | Cost |
|---|---|---|---|
| Gradient Descent | θ - α∇L | Sublinear O(1/t) for convex; linear for strongly convex | O(n) per step |
| Newton's Method | θ - H⁻¹∇L | Quadratic (error ~ c^{2^t}) near the optimum | O(n³) per step |
Newton converges much faster (quadratically!) near the minimum, but each step costs O(n³) to invert the Hessian—often prohibitive for large n.
Practical Variants: quasi-Newton methods (BFGS, L-BFGS) build approximate curvature from gradient history; Gauss-Newton and natural-gradient methods use structured curvature surrogates; Hessian-free methods use only Hessian-vector products with conjugate gradients; diagonal and Kronecker-factored (K-FAC-style) approximations trade accuracy for scalability. One such variant is sketched after the code example below.
The following script compares gradient descent and (damped) Newton's method on the Rosenbrock function, a classic ill-conditioned benchmark:

```python
import numpy as np
import matplotlib.pyplot as plt

def rosenbrock(x, y, a=1, b=100):
    """The Rosenbrock function: a challenging optimization benchmark."""
    return (a - x)**2 + b * (y - x**2)**2

def rosenbrock_gradient(x, y, a=1, b=100):
    """Gradient of Rosenbrock function."""
    dx = -2*(a - x) - 4*b*x*(y - x**2)
    dy = 2*b*(y - x**2)
    return np.array([dx, dy])

def rosenbrock_hessian(x, y, a=1, b=100):
    """Hessian of Rosenbrock function."""
    dxx = 2 - 4*b*(y - 3*x**2)
    dxy = -4*b*x
    dyy = 2*b
    return np.array([[dxx, dxy], [dxy, dyy]])

def gradient_descent(f, grad, x0, lr=0.001, max_iter=10000, tol=1e-8):
    """Standard gradient descent."""
    x = x0.copy()
    path = [x.copy()]
    for i in range(max_iter):
        g = grad(x[0], x[1])
        x = x - lr * g
        path.append(x.copy())
        if np.linalg.norm(g) < tol:
            break
    return np.array(path)

def newton_method(f, grad, hessian, x0, max_iter=100, tol=1e-8):
    """Newton's method with Hessian."""
    x = x0.copy()
    path = [x.copy()]
    for i in range(max_iter):
        g = grad(x[0], x[1])
        H = hessian(x[0], x[1])
        # Newton step: delta = -H^{-1} @ g
        try:
            delta = -np.linalg.solve(H, g)
        except np.linalg.LinAlgError:
            # Hessian singular, fall back to gradient step
            delta = -0.01 * g
        # Damped Newton (line search would be better)
        alpha = 1.0
        while alpha > 1e-10:
            x_new = x + alpha * delta
            if f(x_new[0], x_new[1]) < f(x[0], x[1]):
                break
            alpha *= 0.5
        x = x + alpha * delta
        path.append(x.copy())
        if np.linalg.norm(g) < tol:
            break
    return np.array(path)

# Starting point (far from minimum at (1, 1))
x0 = np.array([-1.5, 2.0])

# Run both methods
path_gd = gradient_descent(rosenbrock, rosenbrock_gradient, x0,
                           lr=0.001, max_iter=5000)
path_newton = newton_method(rosenbrock, rosenbrock_gradient, rosenbrock_hessian,
                            x0, max_iter=50)

print(f"Gradient Descent: {len(path_gd)} iterations")
print(f"Newton's Method: {len(path_newton)} iterations")
print(f"\nGD final point: {path_gd[-1]}")
print(f"Newton final point: {path_newton[-1]}")

# Plotting
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Create contour plot background
x = np.linspace(-2, 2, 200)
y = np.linspace(-1, 3, 200)
X, Y = np.meshgrid(x, y)
Z = rosenbrock(X, Y)

for ax, path, title in [(axes[0], path_gd, f'Gradient Descent ({len(path_gd)} steps)'),
                        (axes[1], path_newton, f"Newton's Method ({len(path_newton)} steps)")]:
    ax.contour(X, Y, Z, levels=np.logspace(-1, 3, 20), cmap='viridis', alpha=0.7)
    ax.plot(path[:, 0], path[:, 1], 'r.-', linewidth=1.5, markersize=4)
    ax.plot(path[0, 0], path[0, 1], 'go', markersize=10, label='Start')
    ax.plot(1, 1, 'r*', markersize=15, label='Minimum (1, 1)')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_title(title)
    ax.legend()
    ax.set_aspect('equal')

plt.tight_layout()
plt.savefig('newton_vs_gd.png', dpi=150, bbox_inches='tight')
plt.show()

# Convergence plot
fig, ax = plt.subplots(figsize=(10, 6))

losses_gd = [rosenbrock(p[0], p[1]) for p in path_gd]
losses_newton = [rosenbrock(p[0], p[1]) for p in path_newton]

ax.semilogy(losses_gd[:500], 'b-', label='Gradient Descent', alpha=0.7)
ax.semilogy(losses_newton, 'r-', label="Newton's Method", linewidth=2)
ax.set_xlabel('Iteration')
ax.set_ylabel('Loss (log scale)')
ax.set_title('Convergence Comparison')
ax.legend()
ax.grid(True, alpha=0.3)

plt.savefig('convergence_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
```

Newton's method has three issues: (1) computing H costs O(n²) and inverting it costs O(n³)—prohibitive for neural nets with n = 10⁸ parameters; (2) H might be indefinite (not positive definite), causing Newton to move toward saddle points; (3) the quadratic model is only accurate locally, so far from the minimum pure Newton can diverge.
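As an illustration of one practical variant, the sketch below (assuming SciPy is available; `rosenbrock_vec` and `rosenbrock_grad_vec` are small wrappers introduced here) runs the quasi-Newton L-BFGS method, which approximates curvature from gradient history instead of forming and inverting H:

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock_vec(p, a=1, b=100):
    """Rosenbrock function taking a single parameter vector."""
    x, y = p
    return (a - x)**2 + b * (y - x**2)**2

def rosenbrock_grad_vec(p, a=1, b=100):
    """Gradient of the Rosenbrock function as a vector."""
    x, y = p
    return np.array([-2*(a - x) - 4*b*x*(y - x**2), 2*b*(y - x**2)])

result = minimize(rosenbrock_vec, x0=np.array([-1.5, 2.0]),
                  jac=rosenbrock_grad_vec, method='L-BFGS-B')
print(result.x, result.nit)  # converges to (1, 1) in far fewer steps than plain GD
```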
Taylor series provide the mathematical framework to analyze why and when gradient descent converges.
Setting Up the Analysis:
Let f be a smooth function we're minimizing. Gradient descent updates:
$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha \nabla f(\mathbf{x}_k)$$
Taylor expansion of f at the new point:
$$f(\mathbf{x}_{k+1}) = f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^\top (\mathbf{x}_{k+1} - \mathbf{x}_k) + O(\|\mathbf{x}_{k+1} - \mathbf{x}_k\|^2)$$
Substituting the GD update:
$$f(\mathbf{x}_{k+1}) = f(\mathbf{x}_k) - \alpha |\nabla f(\mathbf{x}_k)|^2 + O(\alpha^2)$$
Key Insight:
For small α, the first-order term dominates: each step decreases f by approximately α‖∇f‖². This proves gradient descent makes progress (when gradient is nonzero and α is small enough).
Optimal Learning Rate:
Including the second-order term:
$$f(\mathbf{x}_{k+1}) \approx f(\mathbf{x}_k) - \alpha |\nabla f|^2 + \frac{\alpha^2}{2} \nabla f^\top \mathbf{H} \nabla f$$
Minimizing over α, the optimal step size (under the quadratic approximation) is:
$$\alpha^* = \frac{|\nabla f|^2}{\nabla f^\top \mathbf{H} \nabla f}$$
This is the exact line search result for quadratic functions.
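Here is a minimal numerical check of this claim for a purely quadratic function f(x) = ½xᵀAx − bᵀx, whose Hessian is A (the matrix and vectors are arbitrary test data):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)          # symmetric positive definite Hessian
b = rng.standard_normal(4)

f = lambda x: 0.5 * x @ A @ x - b @ x
x = rng.standard_normal(4)
g = A @ x - b                        # gradient of the quadratic at x

alpha_star = (g @ g) / (g @ A @ g)   # predicted optimal step size

# Compare against a brute-force scan over candidate step sizes
alphas = np.linspace(1e-4, 2 * alpha_star, 2001)
losses = [f(x - a * g) for a in alphas]
print(alpha_star, alphas[np.argmin(losses)])  # should agree closely
```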
Condition Number and Convergence:
Define the condition number κ = λ_max / λ_min (ratio of largest to smallest eigenvalue of H).
For a quadratic function, GD has convergence rate:
$$f(\mathbf{x}_k) - f^* \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^{2k} (f(\mathbf{x}_0) - f^*)$$
Large κ (ill-conditioned) means slow convergence. This is why poorly conditioned problems are hard for gradient descent.
| Function Property | Maximum Safe α | Convergence Rate |
|---|---|---|
| L-smooth (‖∇²f‖ ≤ L) | α < 2/L | O(1/k) |
| μ-strongly convex | α ≤ 2/(L+μ) | O((1-μ/L)^k) linear |
| Quadratic, condition κ | α = 2/(λ_max + λ_min) | O(((κ-1)/(κ+1))^{2k}) |
| Non-convex, L-smooth | α < 1/L | O(1/√k) to critical point |
A function is L-smooth if its gradient is L-Lipschitz: ‖∇f(x) - ∇f(y)‖ ≤ L‖x - y‖. Equivalently, the Hessian eigenvalues are bounded by L. The safe learning rate is α < 2/L because larger steps can overshoot and increase the loss.
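The effect of the condition number can be seen directly on a small quadratic. A minimal sketch (the eigenvalues and starting point are chosen for illustration) runs gradient descent with α = 2/(λ_max + λ_min) and compares the observed error decay with the ((κ−1)/(κ+1))^{2k} rate from the table above:

```python
import numpy as np

lam_min, lam_max = 1.0, 25.0               # Hessian eigenvalues, kappa = 25
H = np.diag([lam_min, lam_max])
kappa = lam_max / lam_min
alpha = 2.0 / (lam_max + lam_min)          # step size from the table above
rho = (kappa - 1) / (kappa + 1)

f = lambda x: 0.5 * x @ H @ x              # quadratic with minimum value f* = 0
x = np.array([1.0, 1.0])
f0 = f(x)

for k in range(1, 21):
    x = x - alpha * (H @ x)                # gradient descent step
    if k % 5 == 0:
        print(f"k={k:2d}: f(x_k)/f(x_0) = {f(x)/f0:.3e}, "
              f"rate rho^(2k) = {rho**(2*k):.3e}")
```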
Taylor series appear throughout machine learning beyond just optimization analysis. Here are key applications:
1. Activation Function Approximations:
Many activations are well described by their low-order Taylor expansions: for small inputs, tanh(x) ≈ x − x³/3 and σ(x) ≈ 1/2 + x/4 − x³/48, which is why saturating activations behave almost linearly near zero; the popular tanh-based GELU approximation likewise relies on a low-order polynomial correction term. A quick numerical check appears below.
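A minimal sketch checking these small-input expansions (the interval is an arbitrary choice):

```python
import numpy as np

x = np.linspace(-0.5, 0.5, 5)

tanh_taylor = x - x**3 / 3                  # tanh(x) ≈ x - x^3/3 near 0
sigmoid_taylor = 0.5 + x / 4 - x**3 / 48    # sigma(x) ≈ 1/2 + x/4 - x^3/48 near 0

print(np.max(np.abs(np.tanh(x) - tanh_taylor)))            # ~1e-3
print(np.max(np.abs(1 / (1 + np.exp(-x)) - sigmoid_taylor)))  # ~1e-4
```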
2. Loss Landscape Visualization:
To visualize high-dimensional loss landscapes, project onto 2 random directions u, v:
$$\text{plot } f(\mathbf{\theta} + \alpha \mathbf{u} + \beta \mathbf{v})$$
Taylor expansion predicts this is approximately quadratic near θ:
$$\approx f(\mathbf{\theta}) + \alpha \nabla f \cdot \mathbf{u} + \beta \nabla f \cdot \mathbf{v} + \frac{1}{2}(\alpha^2 \mathbf{u}^\top\mathbf{H}\mathbf{u} + \cdots)$$
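A minimal sketch of this slicing idea (using a synthetic toy loss in place of a real network, with randomly drawn directions u and v):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
d = 50                                  # parameter dimension
theta = rng.standard_normal(d)          # "current parameters"
A = rng.standard_normal((d, d))
Q = A @ A.T / d                         # fixed PSD matrix defining a toy loss

def loss(p):
    """Toy loss: quadratic bowl plus a mild nonlinearity."""
    return 0.5 * p @ Q @ p + 0.1 * np.sum(np.sin(p))

u = rng.standard_normal(d)
u /= np.linalg.norm(u)
v = rng.standard_normal(d)
v /= np.linalg.norm(v)

alphas = np.linspace(-2, 2, 60)
betas = np.linspace(-2, 2, 60)
Z = np.array([[loss(theta + a * u + b * v) for a in alphas] for b in betas])

plt.contourf(alphas, betas, Z, levels=30, cmap='viridis')
plt.xlabel('alpha (direction u)')
plt.ylabel('beta (direction v)')
plt.title('2D slice of a toy loss landscape')
plt.colorbar()
plt.show()
```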
3. Fisher Information and Natural Gradient:
The Fisher information matrix F is the expected outer product of the score (the gradient of the log-likelihood); under standard regularity conditions it equals the negative expected Hessian of the log-likelihood:
$$\mathbf{F} = \mathbb{E}[\nabla \log p(x|\theta) \nabla \log p(x|\theta)^\top]$$
Natural gradient uses F instead of H: θ ← θ - αF⁻¹∇L. This accounts for the geometry of probability distributions.
4. Laplace Approximation:
In Bayesian deep learning, the posterior p(θ|data) is intractable. The Laplace approximation fits a Gaussian using the second-order Taylor expansion at the mode:
$$p(\theta|\text{data}) \approx \mathcal{N}(\theta^*, \mathbf{H}^{-1})$$
where θ* is the MAP estimate and H is the Hessian of the negative log-posterior.
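As a minimal 1D illustration (using a Beta density as a stand-in posterior so the mode and second derivative have closed forms), the sketch below matches a Gaussian N(θ*, H⁻¹) at the mode:

```python
import numpy as np
from scipy.stats import beta, norm

a_param, b_param = 8.0, 4.0                        # Beta "posterior" parameters
mode = (a_param - 1) / (a_param + b_param - 2)     # MAP estimate theta*

# Second derivative of the negative log-density at the mode (the 1D "Hessian")
H = (a_param - 1) / mode**2 + (b_param - 1) / (1 - mode)**2
laplace = norm(loc=mode, scale=np.sqrt(1.0 / H))   # N(theta*, H^{-1})

theta = np.linspace(0.01, 0.99, 200)
max_gap = np.max(np.abs(beta.pdf(theta, a_param, b_param) - laplace.pdf(theta)))
print(f"mode = {mode:.3f}, Laplace std = {np.sqrt(1/H):.3f}, "
      f"max density gap = {max_gap:.3f}")
```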
Whenever you encounter a nonlinear function in ML, ask: 'What does its Taylor expansion look like?' The first-order term reveals gradient-based behavior. The second-order term reveals curvature and conditioning. This thinking illuminates many ML phenomena.
Taylor series provide the mathematical lens through which we understand local function behavior—essential for optimization theory and practice.
Key Concepts: Taylor series approximate a function locally by polynomials built from its derivatives; the first-order term (the gradient) gives the tangent-plane approximation; the second-order term (the Hessian) captures curvature; and the resulting quadratic model underlies Newton's method, learning-rate selection, and condition-number analysis of gradient descent.
What's Next:
We've now seen how the Hessian appears in the second-order Taylor expansion. The final page dives deeper into second-order approximations and the Hessian matrix specifically: properties, computation, eigenvalue analysis for critical point classification, and practical approaches for leveraging curvature in large-scale optimization.
You now understand Taylor series in multiple dimensions and their central role in optimization. The second-order approximation with the Hessian matrix enables sophisticated optimization algorithms and deep analysis of learning dynamics. Next: a focused study of second-order approximations and Hessian properties.