In your first encounter with calculus, you learned to analyze functions of a single variable: f(x). You computed derivatives, found critical points, and understood how a function changes as its single input varies. But in machine learning, we inhabit a fundamentally different world—one where models depend on hundreds, thousands, or even millions of variables simultaneously.
Consider a simple neural network for image classification. Each pixel in a 224×224 RGB image contributes 3 values, yielding 150,528 input dimensions. The network's weights might add another 25 million parameters. How does the loss function change when we nudge one of these millions of values? How do we navigate this incomprehensibly vast space to find optimal parameters?
The answer lies in multivariate calculus—the mathematical framework for analyzing functions of multiple variables. This isn't merely an extension of single-variable calculus; it's a fundamentally richer theory that enables the optimization machinery powering modern machine learning.
By the end of this page, you will deeply understand multivariable functions: their definition, representation, and geometric interpretation. You'll see how functions map from high-dimensional input spaces to outputs, and develop intuition for thinking about surfaces, level sets, and how functions behave across multiple dimensions—all essential foundations for gradient-based optimization.
A multivariable function (also called a multivariate function or function of several variables) is a function whose domain consists of ordered tuples of real numbers rather than single real numbers. Formally:
Definition (Multivariable Function):
A function f: ℝⁿ → ℝ maps an n-dimensional input vector x = (x₁, x₂, ..., xₙ) to a single real number:
$$f(\mathbf{x}) = f(x_1, x_2, \ldots, x_n)$$
More generally, a function f: ℝⁿ → ℝᵐ maps n-dimensional inputs to m-dimensional outputs, but we'll focus primarily on scalar-valued functions (m = 1) since these represent the loss functions we optimize in machine learning.
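To make this concrete in code, here is a minimal sketch (in NumPy; the particular choice of f is arbitrary) of a scalar-valued function that takes an n-dimensional vector and returns one real number:

```python
import numpy as np

def f(x: np.ndarray) -> float:
    """A scalar-valued multivariable function f: R^n -> R.
    Here f(x) = sum_i x_i^2, chosen only as a simple example."""
    return float(np.sum(x ** 2))

x = np.array([1.0, -2.0, 0.5])   # a point in R^3
print(f(x))                      # 5.25: n real numbers in, one real number out
```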
The Conceptual Leap:
In single-variable calculus, the graph of f(x) is a curve in 2D. In two-variable calculus, the graph of f(x, y) is a surface in 3D. But what about f(x₁, x₂, ..., x₁₀₀)? We cannot visualize 101-dimensional space, yet the mathematics works identically. This abstraction is powerful: the tools we develop for two or three variables extend naturally to arbitrary dimensions.
Notation Conventions:
Throughout machine learning, you'll encounter several equivalent notations:
In machine learning, the function f typically represents either: (1) a model prediction f(x; θ) = ŷ, mapping input features to predictions, or (2) a loss function L(θ) = (1/n)Σᵢℓ(f(xᵢ; θ), yᵢ), measuring how well parameters θ explain training data. Both are fundamentally multivariable functions, and understanding their behavior requires multivariate calculus.
Understanding the domain and range of multivariable functions is crucial for reasoning about machine learning models and their optimization landscapes.
Domain of a Multivariable Function:
The domain of f: ℝⁿ → ℝ is the set of all input vectors x for which f(x) is defined. Unlike single-variable functions where the domain is a subset of the real line, multivariable domains are subsets of n-dimensional space.
Common Domain Types in ML:
| Domain Type | Mathematical Description | ML Example |
|---|---|---|
| Entire space ℝⁿ | No restrictions on inputs | Linear regression weights (unconstrained) |
| Open ball Bᵣ(c) | {x : ‖x - c‖ < r} | Parameters near initialization for local analysis |
| Closed hypercube [a,b]ⁿ | All coordinates bounded | Normalized features in [0,1] |
| Simplex Δₙ | {x ≥ 0 : Σxᵢ = 1} | Probability distributions (softmax outputs) |
| Positive orthant ℝⁿ₊ | {x : xᵢ > 0 for all i} | Variance parameters, learning rates |
| Constraint manifold | {x : g(x) = 0} | Parameters satisfying normalization constraints |
Range and Level Sets:
The range of f: ℝⁿ → ℝ is the set of all output values the function attains. For loss functions, the range is typically [0, ∞) or ℝ, depending on whether the loss is non-negative.
Level Sets (Contours):
A level set of f is the set of all points where f takes a constant value:
$$L_c = \{\mathbf{x} \in \mathbb{R}^n : f(\mathbf{x}) = c\}$$
For a function of two variables, level sets are curves called contour lines. For three variables, they're surfaces. For n variables, they're (n-1)-dimensional hypersurfaces.
Geometric Intuition:
Imagine a topographic map showing elevation. Each contour line connects points of equal height. Similarly, for a loss function L(θ), level sets connect all parameter configurations with the same loss. The gradient at any point is perpendicular to the level set, pointing toward higher values—a fact that's fundamental to gradient descent.
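The sketch below checks this perpendicularity claim numerically for an arbitrary two-variable quadratic: a finite-difference gradient is computed at a point, and the function value is probed along the contour's tangent direction (nearly constant) and along the gradient direction (increasing).

```python
import numpy as np

def f(x, y):
    """An arbitrary smooth example function: an elongated bowl."""
    return x**2 + 2*y**2

def finite_diff_grad(f, x, y, h=1e-6):
    """Approximate the gradient (df/dx, df/dy) with central differences."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return np.array([dfdx, dfdy])

x0, y0 = 1.0, 0.5
g = finite_diff_grad(f, x0, y0)

# A direction tangent to the level set: rotate the gradient by 90 degrees.
t = np.array([-g[1], g[0]])
t /= np.linalg.norm(t)

eps = 1e-4
print("f at the point:      ", f(x0, y0))
print("f along the contour: ", f(x0 + eps * t[0], y0 + eps * t[1]))   # ~unchanged
print("f along the gradient:", f(*(np.array([x0, y0]) + eps * g / np.linalg.norm(g))))  # larger
```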
While we cannot directly visualize functions in high dimensions, we can use projections, slices, and 2D contour plots of selected variable pairs. In practice, machine learning practitioners often visualize loss landscapes by projecting onto random or principal directions, revealing saddle points, valleys, and local minima structure.
Let's ground these abstractions with concrete examples from machine learning. Each example illustrates how real ML models are fundamentally multivariable functions.
Example 1: Linear Regression Prediction
A linear regression model with n features computes:
$$f(\mathbf{x}; \mathbf{w}, b) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = \mathbf{w}^\top \mathbf{x} + b$$
This function takes x ∈ ℝⁿ as input and w ∈ ℝⁿ, b ∈ ℝ as parameters, producing a scalar output. For fixed parameters it is affine in the input x, and for a fixed input it is linear in the parameters (w, b).
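A minimal sketch of this prediction in NumPy (the feature values and parameters below are made up for illustration):

```python
import numpy as np

# The linear prediction f(x; w, b) = w^T x + b
w = np.array([0.5, -1.2, 2.0])   # weights, one per feature
b = 0.3                          # bias
x = np.array([1.0, 0.0, 1.5])    # a single input with n = 3 features

y_hat = w @ x + b                # scalar prediction
print(y_hat)                     # 0.5*1.0 + (-1.2)*0.0 + 2.0*1.5 + 0.3 = 3.8
```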
Example 2: Mean Squared Error Loss
Given m training examples {(xᵢ, yᵢ)}, the MSE loss as a function of the weights and bias is:
$$L(\mathbf{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left( \mathbf{w}^\top \mathbf{x}_i + b - y_i \right)^2$$
This is a function of the n+1 parameters (w, b), mapping ℝⁿ⁺¹ → ℝ. It is a convex quadratic function of the parameters; its Hessian is positive definite (a bowl shape with a unique minimum) when the bias-augmented data matrix has full column rank.
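Here is a small sketch of the loss as a function of the parameters (the toy dataset is invented for illustration); evaluating it at two different (w, b) points emphasizes that L maps ℝⁿ⁺¹ to ℝ:

```python
import numpy as np

def mse_loss(w, b, X, y):
    """MSE loss L(w, b) over a dataset: a map from R^(n+1) to R.
    X has shape (m, n); y has shape (m,)."""
    residuals = X @ w + b - y
    return float(np.mean(residuals ** 2))

# Tiny synthetic dataset (arbitrary numbers, for illustration only)
X = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, -1.0]])
y = np.array([5.0, 2.0, 1.0])

# The loss is a function of the parameters, not of a single variable:
print(mse_loss(np.array([1.0, 2.0]), 0.0, X, y))   # loss at (w, b) = ([1, 2], 0)
print(mse_loss(np.array([0.5, 1.0]), 0.5, X, y))   # loss at another point in R^3
```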
Example 3: Softmax Cross-Entropy Loss
For K-class classification with input z (logits):
$$\text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
$$L(\mathbf{z}; y) = -\log\left( \text{softmax}(\mathbf{z})_y \right) = -z_y + \log\left( \sum_{j=1}^{K} e^{z_j} \right)$$
This is a nonlinear, convex function of z ∈ ℝᴷ.
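A short sketch of the log-sum-exp form above, using the standard max-subtraction trick for numerical stability (the logits are arbitrary):

```python
import numpy as np

def softmax_cross_entropy(z, y):
    """Cross-entropy loss L(z; y) = -z_y + log(sum_j exp(z_j)),
    a convex function of the logits z in R^K.
    The max is subtracted inside exp() so it cannot overflow."""
    z_shifted = z - np.max(z)
    log_sum_exp = np.log(np.sum(np.exp(z_shifted))) + np.max(z)
    return float(-z[y] + log_sum_exp)

z = np.array([2.0, -1.0, 0.5])        # arbitrary logits for K = 3 classes
print(softmax_cross_entropy(z, y=0))  # small loss: class 0 has the largest logit
print(softmax_cross_entropy(z, y=1))  # larger loss: class 1 has a small logit
```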
Example 4: Neural Network Prediction
A two-layer neural network:
$$f(\mathbf{x}; \mathbf{W}^{(1)}, \mathbf{b}^{(1)}, \mathbf{W}^{(2)}, \mathbf{b}^{(2)}) = \mathbf{W}^{(2)} \sigma(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}) + \mathbf{b}^{(2)}$$
Here σ is a nonlinear activation (ReLU, sigmoid, etc.). The loss landscape as a function of all weight matrices and biases is highly non-convex, with many local minima and saddle points.
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Example 1: Linear function f(x, y) = 2x + 3y + 1
def linear_function(x, y):
    """A simple linear function of two variables."""
    return 2*x + 3*y + 1

# Example 2: Quadratic function (like MSE loss surface)
def quadratic_loss(w1, w2, x_data=None, y_data=None):
    """
    Simplified MSE loss surface for 2-parameter linear regression.
    L(w1, w2) = (1/n) * sum((w1*x + w2 - y)^2)

    For demonstration, we use a simple quadratic bowl:
    L(w1, w2) = (w1 - 3)^2 + (w2 - 2)^2
    Minimum at (3, 2)
    """
    return (w1 - 3)**2 + (w2 - 2)**2

# Example 3: Non-convex function (like neural network loss)
def non_convex_loss(w1, w2):
    """
    A non-convex function with multiple local minima.
    Illustrates the complexity of neural network loss landscapes.
    """
    return np.sin(w1) * np.cos(w2) + 0.1 * (w1**2 + w2**2)

# Visualize the quadratic (convex) loss surface
fig = plt.figure(figsize=(15, 5))

# Plot 1: 3D surface of quadratic loss
ax1 = fig.add_subplot(131, projection='3d')
w1 = np.linspace(-2, 8, 100)
w2 = np.linspace(-3, 7, 100)
W1, W2 = np.meshgrid(w1, w2)
L = quadratic_loss(W1, W2)

ax1.plot_surface(W1, W2, L, cmap='viridis', alpha=0.8)
ax1.set_xlabel('w₁')
ax1.set_ylabel('w₂')
ax1.set_zlabel('Loss L(w₁, w₂)')
ax1.set_title('Quadratic Loss Surface (Convex)')

# Plot 2: Contour plot (level sets) of quadratic loss
ax2 = fig.add_subplot(132)
contour = ax2.contour(W1, W2, L, levels=20, cmap='viridis')
ax2.clabel(contour, inline=True, fontsize=8)
ax2.plot(3, 2, 'r*', markersize=15, label='Minimum at (3, 2)')
ax2.set_xlabel('w₁')
ax2.set_ylabel('w₂')
ax2.set_title('Level Sets (Contours) of Quadratic Loss')
ax2.legend()

# Plot 3: Non-convex loss surface
ax3 = fig.add_subplot(133, projection='3d')
w1_nc = np.linspace(-5, 5, 100)
w2_nc = np.linspace(-5, 5, 100)
W1_nc, W2_nc = np.meshgrid(w1_nc, w2_nc)
L_nc = non_convex_loss(W1_nc, W2_nc)

ax3.plot_surface(W1_nc, W2_nc, L_nc, cmap='plasma', alpha=0.8)
ax3.set_xlabel('w₁')
ax3.set_ylabel('w₂')
ax3.set_zlabel('Loss')
ax3.set_title('Non-Convex Loss Surface')

plt.tight_layout()
plt.savefig('loss_surfaces.png', dpi=150, bbox_inches='tight')
plt.show()

print("Key observations:")
print("- Quadratic loss has a unique global minimum (bowl shape)")
print("- Contour lines are ellipses centered at the minimum")
print("- Non-convex loss has multiple local minima and saddle points")
```

Developing geometric intuition for multivariable functions is essential for understanding optimization behavior in machine learning.
Graphs and Surfaces:
For a function f: ℝ² → ℝ, the graph is the set of all points (x, y, f(x, y)) ∈ ℝ³—a surface in three-dimensional space. For a quadratic function like MSE loss, this surface is a paraboloid (bowl shape). For more complex functions, the surface can have peaks, valleys, saddle points, and ridges.
Cross-Sections and Slices:
We can understand high-dimensional functions by examining lower-dimensional slices:
This technique is invaluable for loss landscape visualization: we often pick two random directions in parameter space and plot the loss along them.
Direction and Distance:
In ℝⁿ, a direction is specified by a unit vector u with ‖u‖ = 1. Moving from point x in direction u by distance t gives:
$$\mathbf{x} + t\mathbf{u}$$
The function value along this ray is g(t) = f(x + tu), a single-variable function. This reduction to one dimension is the conceptual foundation for directional derivatives.
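A small sketch of this reduction (the function, base point, and direction are arbitrary choices): restricting f to the ray x + tu yields an ordinary single-variable function of t.

```python
import numpy as np

def f(x):
    """An arbitrary smooth function of several variables (for illustration)."""
    return float(np.sum(x ** 2) + np.sin(x[0]))

def slice_along_direction(f, x, u, ts):
    """Restrict f to the ray x + t*u, giving a single-variable function g(t)."""
    u = u / np.linalg.norm(u)          # make u a unit vector
    return np.array([f(x + t * u) for t in ts])

x = np.array([1.0, -0.5, 2.0])         # base point in R^3
u = np.array([1.0, 1.0, 0.0])          # a direction (normalized inside)
ts = np.linspace(-1.0, 1.0, 5)

print(slice_along_direction(f, x, u, ts))  # five values of g(t) = f(x + t u)
```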
Our 3D intuition can mislead us in high dimensions. For example, in high-dimensional spaces, most critical points of random functions are saddle points rather than local minima. This means gradient descent often escapes apparent 'traps' that don't exist in low dimensions.
Before we can discuss derivatives of multivariable functions, we need to understand continuity in higher dimensions.
Continuity of Multivariable Functions:
A function f: ℝⁿ → ℝ is continuous at a point a if:
$$\lim_{\mathbf{x} \to \mathbf{a}} f(\mathbf{x}) = f(\mathbf{a})$$
Crucially, convergence x → a must hold for every possible approach path. In single-variable calculus, there are only two directions (left and right). In ℝⁿ, there are infinitely many directions and paths.
Path Dependence of Limits:
Consider the function:
$$f(x, y) = \begin{cases} \frac{xy}{x^2 + y^2} & \text{if } (x, y) \neq (0, 0) \\ 0 & \text{if } (x, y) = (0, 0) \end{cases}$$
Approaching (0, 0) along the x-axis (y = 0): f(x, 0) = 0, so the limit is 0.
Approaching along the line y = x: f(x, x) = x²/(2x²) = 1/2, so the limit is 1/2.
Different paths give different limits, so this function is not continuous at (0, 0).
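A quick numerical check of this path dependence, reproducing the example above:

```python
import numpy as np

def f(x, y):
    """The example above: xy / (x^2 + y^2), with f(0, 0) defined as 0."""
    if x == 0 and y == 0:
        return 0.0
    return x * y / (x**2 + y**2)

# Approach (0, 0) along two different paths and watch the values.
for t in [0.1, 0.01, 0.001]:
    print(f"t = {t}: along the x-axis f = {f(t, 0.0):.4f}, "
          f"along y = x f = {f(t, t):.4f}")
# The x-axis path gives values equal to 0, the diagonal path gives 0.5:
# the limit depends on the path, so f is not continuous at the origin.
```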
Smoothness and Differentiability:
A function is smooth (C∞) if all partial derivatives of all orders exist and are continuous. Most loss functions in machine learning are smooth except at specific points:
These non-smooth points create challenges for gradient-based optimization, motivating smooth approximations (softplus, smooth L1, etc.).
| Function | Smoothness | Impact on Optimization |
|---|---|---|
| MSE Loss | C∞ (infinitely smooth) | Gradients well-defined everywhere |
| Cross-Entropy Loss | C∞ (for positive inputs) | Smooth, but can have large gradients near 0 |
| ReLU Activation | C⁰ (continuous, non-differentiable at 0) | Subgradients used; gradient = 0 or 1 |
| Sigmoid Activation | C∞ | Smooth but saturates (vanishing gradients) |
| L1 Regularization | C⁰ (non-differentiable at 0) | Promotes sparsity; requires proximal methods |
| Softmax | C∞ | Smooth, gradients involve all outputs |
When functions aren't differentiable at some points, we use subgradients—generalizations of gradients for convex functions. For ReLU at z = 0, any value in [0, 1] is a valid subgradient. In practice, implementations typically use 0 or 1 arbitrarily at the non-differentiable point, and training still works because hitting exactly z = 0 has probability zero.
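To make the ReLU case concrete, here is a small sketch of one possible subgradient implementation; the `value_at_zero` parameter is our own illustrative knob, not a standard API:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_subgradient(z, value_at_zero=0.0):
    """A subgradient of ReLU: 0 for z < 0, 1 for z > 0, and any value in
    [0, 1] at z = 0. Here the choice at zero is exposed as a parameter;
    many implementations simply pick 0 (an assumption, not a universal rule)."""
    g = (z > 0).astype(float)
    g[z == 0] = value_at_zero
    return g

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))                   # [0. 0. 3.]
print(relu_subgradient(z))       # [0. 0. 1.]  with the choice g(0) = 0
print(relu_subgradient(z, 1.0))  # [0. 1. 1.]  an equally valid subgradient choice
```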
Understanding the structure of the input space is essential for reasoning about optimization and generalization in machine learning.
Open and Closed Sets:
In ℝⁿ, an open ball of radius r centered at c is:
$$B_r(\mathbf{c}) = \{\mathbf{x} \in \mathbb{R}^n : \|\mathbf{x} - \mathbf{c}\| < r\}$$
A set S ⊆ ℝⁿ is open if every point has an open ball entirely contained in S. A set is closed if its complement is open (equivalently, if it contains all its limit points).
Bounded and Compact Sets:
A set is bounded if it fits inside some ball of finite radius. A set is compact if it is both closed and bounded. The Extreme Value Theorem states that a continuous function on a compact set attains its maximum and minimum—this guarantees that minimization problems over compact domains have solutions.
Relevance to ML:
Convex Sets:
A set C ⊆ ℝⁿ is convex if the line segment between any two points in C lies entirely within C:
$$\forall \mathbf{x}, \mathbf{y} \in C, \forall t \in [0, 1]: t\mathbf{x} + (1-t)\mathbf{y} \in C$$
Convexity of the domain is crucial for convex optimization, where local minima are global. Examples of convex sets:
Non-convex domains (e.g., the union of two disjoint balls) can create disconnected feasible regions where gradient descent cannot move between components.
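The following sketch tests the segment condition numerically for a single ball (convex) and for the union of two disjoint balls (non-convex); the sets and points are arbitrary choices:

```python
import numpy as np

def in_unit_ball(x, center):
    """Membership test for the closed ball of radius 1 around `center`."""
    return np.linalg.norm(x - center) <= 1.0

def segment_stays_inside(member, a, b, num=101):
    """Check (numerically) whether every point t*a + (1-t)*b lies in the set."""
    return all(member(t * a + (1 - t) * b) for t in np.linspace(0, 1, num))

a, b = np.array([0.5, 0.0]), np.array([-0.5, 0.0])

# A single ball is convex: the segment between two members stays inside.
ball = lambda x: in_unit_ball(x, np.array([0.0, 0.0]))
print(segment_stays_inside(ball, a, b))        # True

# The union of two disjoint balls is not convex: take one point in each ball.
union = lambda x: (in_unit_ball(x, np.array([-3.0, 0.0]))
                   or in_unit_ball(x, np.array([3.0, 0.0])))
p, q = np.array([-3.0, 0.0]), np.array([3.0, 0.0])
print(segment_stays_inside(union, p, q))       # False: the midpoint is outside
```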
A powerful perspective is to view multivariable functions as transformations that map one space to another. This viewpoint is central to understanding neural networks.
Layer-by-Layer Transformations:
In a neural network, each layer is a function:
$$\mathbf{h}^{(\ell)} = \sigma(\mathbf{W}^{(\ell)} \mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)})$$
This maps from ℝⁿ⁽ˡ⁻¹⁾ (previous layer's dimension) to ℝⁿ⁽ˡ⁾ (current layer's dimension). The full network is the composition of these layer functions:
$$f = f^{(L)} \circ f^{(L-1)} \circ \cdots \circ f^{(1)}$$
The Chain Rule for Compositions:
When we compose functions, their derivatives multiply. If g: ℝⁿ → ℝᵐ and f: ℝᵐ → ℝᵖ, then:
$$\frac{\partial (f \circ g)}{\partial \mathbf{x}} = \frac{\partial f}{\partial \mathbf{g}} \cdot \frac{\partial \mathbf{g}}{\partial \mathbf{x}}$$
This is the matrix chain rule, and it's the mathematical foundation of backpropagation. We'll explore this in depth when we study Jacobian matrices.
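As a preview, the sketch below verifies the matrix chain rule numerically with finite-difference Jacobians for two small, arbitrary maps g: ℝ² → ℝ³ and f: ℝ³ → ℝ:

```python
import numpy as np

def g(x):
    """g: R^2 -> R^3, an arbitrary smooth map for illustration."""
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def f(u):
    """f: R^3 -> R, e.g. a scalar 'loss' of the intermediate values."""
    return float(u[0] + 2 * u[1] * u[2])

def jacobian(fn, x, h=1e-6):
    """Finite-difference Jacobian of fn at x (rows = outputs, cols = inputs)."""
    fx = np.atleast_1d(fn(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[j] += h
        xm[j] -= h
        J[:, j] = (np.atleast_1d(fn(xp)) - np.atleast_1d(fn(xm))) / (2 * h)
    return J

x = np.array([0.7, -1.3])
J_fg = jacobian(lambda v: f(g(v)), x)          # Jacobian of the composition
J_f = jacobian(f, g(x))                        # df/dg, shape (1, 3)
J_g = jacobian(g, x)                           # dg/dx, shape (3, 2)

print(np.allclose(J_fg, J_f @ J_g, atol=1e-4)) # True: derivatives multiply
```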
Representations and Embeddings:
The intermediate values h⁽ˡ⁾ are learned representations of the input. Each layer transforms the representation, ideally making the final classification or regression task easier. Functions can:
```python
import numpy as np

def layer_function(W, b, h_prev, activation='relu'):
    """
    A single neural network layer as a multivariable function.
    Maps: R^(input_dim) -> R^(output_dim)

    Parameters:
    - W: weight matrix of shape (output_dim, input_dim)
    - b: bias vector of shape (output_dim,)
    - h_prev: input from previous layer, shape (input_dim,)
    - activation: nonlinearity to apply

    Returns:
    - h_next: output of this layer, shape (output_dim,)
    """
    # Linear transformation: z = W @ h + b
    z = W @ h_prev + b

    # Apply nonlinearity
    if activation == 'relu':
        return np.maximum(0, z)
    elif activation == 'sigmoid':
        return 1 / (1 + np.exp(-z))
    elif activation == 'linear':
        return z
    else:
        raise ValueError(f"Unknown activation: {activation}")

def neural_network_forward(x, layers):
    """
    Full neural network as composition of layer functions.
    f = f^(L) ∘ f^(L-1) ∘ ... ∘ f^(1)

    Parameters:
    - x: input vector
    - layers: list of (W, b, activation) tuples

    Returns:
    - output: final network output
    - activations: list of all intermediate representations
    """
    activations = [x]  # h^(0) = x
    h = x

    for i, (W, b, activation) in enumerate(layers):
        h = layer_function(W, b, h, activation)
        activations.append(h)
        print(f"Layer {i+1}: R^{len(activations[-2])} -> R^{len(h)}")

    return h, activations

# Example: 3-layer network for 10-class classification
# Input: 784 dimensions (flattened 28x28 image)
# Hidden: 256 -> 128 dimensions
# Output: 10 dimensions (class logits)

np.random.seed(42)

layers = [
    (np.random.randn(256, 784) * 0.01, np.zeros(256), 'relu'),    # R^784 -> R^256
    (np.random.randn(128, 256) * 0.01, np.zeros(128), 'relu'),    # R^256 -> R^128
    (np.random.randn(10, 128) * 0.01, np.zeros(10), 'linear'),    # R^128 -> R^10
]

# Random input (like a flattened image)
x = np.random.randn(784)

output, activations = neural_network_forward(x, layers)

print(f"\nInput dimension: {len(x)}")
print(f"Output dimension: {len(output)}")
print(f"Number of intermediate representations: {len(activations)}")

# Count total parameters
total_params = sum(W.size + b.size for W, b, _ in layers)
print(f"\nTotal parameters: {total_params:,}")
print("The loss function is a multivariable function of all these parameters!")
```

We've laid the groundwork for multivariate calculus by establishing what multivariable functions are, how to think about them geometrically, and why they're central to machine learning.
Key Concepts Covered:
What's Next:
Now that we understand what multivariable functions are, we're ready to ask: How do they change? The next page introduces gradient vectors—the fundamental tool for measuring how a function changes in all directions simultaneously. The gradient generalizes the derivative to multiple dimensions and provides the direction of steepest ascent, making it the cornerstone of optimization algorithms like gradient descent.
You now have a solid foundation in multivariable functions. You understand their definition, geometric interpretation, and central role in machine learning. Next, we'll explore gradient vectors—the direction of steepest ascent that drives every gradient-based optimization algorithm.