In your first encounter with calculus, you learned to analyze functions of a single variable: f(x). You computed derivatives, found critical points, and understood how a function changes as its single input varies. But in machine learning, we inhabit a fundamentally different world—one where models depend on hundreds, thousands, or even millions of variables simultaneously.
Consider a simple neural network for image classification. Each pixel in a 224×224 RGB image contributes 3 values, yielding 150,528 input dimensions. The network's weights might add another 25 million parameters. How does the loss function change when we nudge one of these millions of values? How do we navigate this incomprehensibly vast space to find optimal parameters?
The answer lies in multivariate calculus—the mathematical framework for analyzing functions of multiple variables. This isn't merely an extension of single-variable calculus; it's a fundamentally richer theory that enables the optimization machinery powering modern machine learning.
By the end of this page, you will deeply understand multivariable functions: their definition, representation, and geometric interpretation. You'll see how functions map from high-dimensional input spaces to outputs, and develop intuition for thinking about surfaces, level sets, and how functions behave across multiple dimensions—all essential foundations for gradient-based optimization.
A multivariable function (also called a multivariate function or function of several variables) is a function whose domain consists of ordered tuples of real numbers rather than single real numbers. Formally:
Definition (Multivariable Function):
A function f: ℝⁿ → ℝ maps an n-dimensional input vector x = (x₁, x₂, ..., xₙ) to a single real number:
$$f(\mathbf{x}) = f(x_1, x_2, \ldots, x_n)$$
More generally, a function f: ℝⁿ → ℝᵐ maps n-dimensional inputs to m-dimensional outputs, but we'll focus primarily on scalar-valued functions (m = 1) since these represent the loss functions we optimize in machine learning.
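To make this concrete in code, here is a minimal sketch (in NumPy; the particular choice of f is arbitrary) of a scalar-valued function that takes an n-dimensional vector and returns one real number:

```python
import numpy as np

def f(x: np.ndarray) -> float:
    """A scalar-valued multivariable function f: R^n -> R.
    Here f(x) = sum_i x_i^2, chosen only as a simple example."""
    return float(np.sum(x ** 2))

x = np.array([1.0, -2.0, 0.5])   # a point in R^3
print(f(x))                      # 5.25: n real numbers in, one real number out
```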
The Conceptual Leap:
In single-variable calculus, the graph of f(x) is a curve in 2D. In two-variable calculus, the graph of f(x, y) is a surface in 3D. But what about f(x₁, x₂, ..., x₁₀₀)? We cannot visualize 101-dimensional space, yet the mathematics works identically. This abstraction is powerful: the tools we develop for two or three variables extend naturally to arbitrary dimensions.
Notation Conventions:
Throughout machine learning, you'll encounter several equivalent notations:
In machine learning, the function f typically represents either: (1) a model prediction f(x; θ) = ŷ, mapping input features to predictions, or (2) a loss function L(θ) = (1/n)Σᵢℓ(f(xᵢ; θ), yᵢ), measuring how well parameters θ explain training data. Both are fundamentally multivariable functions, and understanding their behavior requires multivariate calculus.
Understanding the domain and range of multivariable functions is crucial for reasoning about machine learning models and their optimization landscapes.
Domain of a Multivariable Function:
The domain of f: ℝⁿ → ℝ is the set of all input vectors x for which f(x) is defined. Unlike single-variable functions where the domain is a subset of the real line, multivariable domains are subsets of n-dimensional space.
Common Domain Types in ML:
| Domain Type | Mathematical Description | ML Example |
|---|---|---|
| Entire space ℝⁿ | No restrictions on inputs | Linear regression weights (unconstrained) |
| Open ball Bᵣ(c) | {x : ‖x - c‖ < r} | Parameters near initialization for local analysis |
| Closed hypercube [a,b]ⁿ | All coordinates bounded | Normalized features in [0,1] |
| Simplex Δₙ | {x ≥ 0 : Σxᵢ = 1} | Probability distributions (softmax outputs) |
| Positive orthant ℝⁿ₊ | {x : xᵢ > 0 for all i} | Variance parameters, learning rates |
| Constraint manifold | {x : g(x) = 0} | Parameters satisfying normalization constraints |
Range and Level Sets:
The range of f: ℝⁿ → ℝ is the set of all output values the function attains. For loss functions, the range is typically [0, ∞) or ℝ, depending on whether the loss is non-negative.
Level Sets (Contours):
A level set of f is the set of all points where f takes a constant value:
$$L_c = \{\mathbf{x} \in \mathbb{R}^n : f(\mathbf{x}) = c\}$$
For a function of two variables, level sets are curves called contour lines. For three variables, they're surfaces. For n variables, they're (n-1)-dimensional hypersurfaces.
Geometric Intuition:
Imagine a topographic map showing elevation. Each contour line connects points of equal height. Similarly, for a loss function L(θ), level sets connect all parameter configurations with the same loss. The gradient at any point is perpendicular to the level set, pointing toward higher values—a fact that's fundamental to gradient descent.
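The sketch below checks this perpendicularity claim numerically for an arbitrary two-variable quadratic: a finite-difference gradient is computed at a point, and the function value is probed along the contour's tangent direction (nearly constant) and along the gradient direction (increasing).

```python
import numpy as np

def f(x, y):
    """An arbitrary smooth example function: an elongated bowl."""
    return x**2 + 2*y**2

def finite_diff_grad(f, x, y, h=1e-6):
    """Approximate the gradient (df/dx, df/dy) with central differences."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return np.array([dfdx, dfdy])

x0, y0 = 1.0, 0.5
g = finite_diff_grad(f, x0, y0)

# A direction tangent to the level set: rotate the gradient by 90 degrees.
t = np.array([-g[1], g[0]])
t /= np.linalg.norm(t)

eps = 1e-4
print("f at the point:      ", f(x0, y0))
print("f along the contour: ", f(x0 + eps * t[0], y0 + eps * t[1]))   # ~unchanged
print("f along the gradient:", f(*(np.array([x0, y0]) + eps * g / np.linalg.norm(g))))  # larger
```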
While we cannot directly visualize functions in high dimensions, we can use projections, slices, and 2D contour plots of selected variable pairs. In practice, machine learning practitioners often visualize loss landscapes by projecting onto random or principal directions, revealing saddle points, valleys, and local minima structure.
Let's ground these abstractions with concrete examples from machine learning. Each example illustrates how real ML models are fundamentally multivariable functions.
Example 1: Linear Regression Prediction
A linear regression model with n features computes:
$$f(\mathbf{x}; \mathbf{w}, b) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = \mathbf{w}^\top \mathbf{x} + b$$
This function takes x ∈ ℝⁿ as input and w ∈ ℝⁿ, b ∈ ℝ as parameters, producing a scalar output. For fixed parameters it is affine in the input x, and for a fixed input it is linear in the parameters (w, b).
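A minimal sketch of this prediction in NumPy (the feature values and parameters below are made up for illustration):

```python
import numpy as np

# The linear prediction f(x; w, b) = w^T x + b
w = np.array([0.5, -1.2, 2.0])   # weights, one per feature
b = 0.3                          # bias
x = np.array([1.0, 0.0, 1.5])    # a single input with n = 3 features

y_hat = w @ x + b                # scalar prediction
print(y_hat)                     # 0.5*1.0 + (-1.2)*0.0 + 2.0*1.5 + 0.3 = 3.8
```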
Example 2: Mean Squared Error Loss
Given m training examples {(xᵢ, yᵢ)}, the MSE loss as a function of the weights and bias is:
$$L(\mathbf{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left( \mathbf{w}^\top \mathbf{x}_i + b - y_i \right)^2$$
This is a function of the n+1 parameters (w, b), mapping ℝⁿ⁺¹ → ℝ. It is a convex quadratic function of the parameters; its Hessian is positive definite (a bowl shape with a unique minimum) when the bias-augmented data matrix has full column rank.
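Here is a small sketch of the loss as a function of the parameters (the toy dataset is invented for illustration); evaluating it at two different (w, b) points emphasizes that L maps ℝⁿ⁺¹ to ℝ:

```python
import numpy as np

def mse_loss(w, b, X, y):
    """MSE loss L(w, b) over a dataset: a map from R^(n+1) to R.
    X has shape (m, n); y has shape (m,)."""
    residuals = X @ w + b - y
    return float(np.mean(residuals ** 2))

# Tiny synthetic dataset (arbitrary numbers, for illustration only)
X = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, -1.0]])
y = np.array([5.0, 2.0, 1.0])

# The loss is a function of the parameters, not of a single variable:
print(mse_loss(np.array([1.0, 2.0]), 0.0, X, y))   # loss at (w, b) = ([1, 2], 0)
print(mse_loss(np.array([0.5, 1.0]), 0.5, X, y))   # loss at another point in R^3
```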
Example 3: Softmax Cross-Entropy Loss
For K-class classification with input z (logits):
$$\text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
$$L(\mathbf{z}; y) = -\log\left( \text{softmax}(\mathbf{z})_y \right) = -z_y + \log\left( \sum_{j=1}^{K} e^{z_j} \right)$$
This is a nonlinear, convex function of z ∈ ℝᴷ.
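A short sketch of the log-sum-exp form above, using the standard max-subtraction trick for numerical stability (the logits are arbitrary):

```python
import numpy as np

def softmax_cross_entropy(z, y):
    """Cross-entropy loss L(z; y) = -z_y + log(sum_j exp(z_j)),
    a convex function of the logits z in R^K.
    The max is subtracted inside exp() so it cannot overflow."""
    z_shifted = z - np.max(z)
    log_sum_exp = np.log(np.sum(np.exp(z_shifted))) + np.max(z)
    return float(-z[y] + log_sum_exp)

z = np.array([2.0, -1.0, 0.5])        # arbitrary logits for K = 3 classes
print(softmax_cross_entropy(z, y=0))  # small loss: class 0 has the largest logit
print(softmax_cross_entropy(z, y=1))  # larger loss: class 1 has a small logit
```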
Example 4: Neural Network Prediction
A two-layer neural network:
$$f(\mathbf{x}; \mathbf{W}^{(1)}, \mathbf{b}^{(1)}, \mathbf{W}^{(2)}, \mathbf{b}^{(2)}) = \mathbf{W}^{(2)} \sigma(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}) + \mathbf{b}^{(2)}$$
Here σ is a nonlinear activation (ReLU, sigmoid, etc.). The loss landscape as a function of all weight matrices and biases is highly non-convex, with many local minima and saddle points.
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Example 1: Linear function f(x, y) = 2x + 3y + 1
def linear_function(x, y):
    """A simple linear function of two variables."""
    return 2*x + 3*y + 1

# Example 2: Quadratic function (like MSE loss surface)
def quadratic_loss(w1, w2, x_data=None, y_data=None):
    """
    Simplified MSE loss surface for 2-parameter linear regression.
    L(w1, w2) = (1/n) * sum((w1*x + w2 - y)^2)

    For demonstration, we use a simple quadratic bowl:
    L(w1, w2) = (w1 - 3)^2 + (w2 - 2)^2
    Minimum at (3, 2)
    """
    return (w1 - 3)**2 + (w2 - 2)**2

# Example 3: Non-convex function (like neural network loss)
def non_convex_loss(w1, w2):
    """
    A non-convex function with multiple local minima.
    Illustrates the complexity of neural network loss landscapes.
    """
    return np.sin(w1) * np.cos(w2) + 0.1 * (w1**2 + w2**2)

# Visualize the quadratic (convex) loss surface
fig = plt.figure(figsize=(15, 5))

# Plot 1: 3D surface of quadratic loss
ax1 = fig.add_subplot(131, projection='3d')
w1 = np.linspace(-2, 8, 100)
w2 = np.linspace(-3, 7, 100)
W1, W2 = np.meshgrid(w1, w2)
L = quadratic_loss(W1, W2)

ax1.plot_surface(W1, W2, L, cmap='viridis', alpha=0.8)
ax1.set_xlabel('w₁')
ax1.set_ylabel('w₂')
ax1.set_zlabel('Loss L(w₁, w₂)')
ax1.set_title('Quadratic Loss Surface (Convex)')

# Plot 2: Contour plot (level sets) of quadratic loss
ax2 = fig.add_subplot(132)
contour = ax2.contour(W1, W2, L, levels=20, cmap='viridis')
ax2.clabel(contour, inline=True, fontsize=8)
ax2.plot(3, 2, 'r*', markersize=15, label='Minimum at (3, 2)')
ax2.set_xlabel('w₁')
ax2.set_ylabel('w₂')
ax2.set_title('Level Sets (Contours) of Quadratic Loss')
ax2.legend()

# Plot 3: Non-convex loss surface
ax3 = fig.add_subplot(133, projection='3d')
w1_nc = np.linspace(-5, 5, 100)
w2_nc = np.linspace(-5, 5, 100)
W1_nc, W2_nc = np.meshgrid(w1_nc, w2_nc)
L_nc = non_convex_loss(W1_nc, W2_nc)

ax3.plot_surface(W1_nc, W2_nc, L_nc, cmap='plasma', alpha=0.8)
ax3.set_xlabel('w₁')
ax3.set_ylabel('w₂')
ax3.set_zlabel('Loss')
ax3.set_title('Non-Convex Loss Surface')

plt.tight_layout()
plt.savefig('loss_surfaces.png', dpi=150, bbox_inches='tight')
plt.show()

print("Key observations:")
print("- Quadratic loss has a unique global minimum (bowl shape)")
print("- Contour lines are ellipses centered at the minimum")
print("- Non-convex loss has multiple local minima and saddle points")
```

Developing geometric intuition for multivariable functions is essential for understanding optimization behavior in machine learning.
Graphs and Surfaces:
For a function f: ℝ² → ℝ, the graph is the set of all points (x, y, f(x, y)) ∈ ℝ³—a surface in three-dimensional space. For a quadratic function like MSE loss, this surface is a paraboloid (bowl shape). For more complex functions, the surface can have peaks, valleys, saddle points, and ridges.
Cross-Sections and Slices:
We can understand high-dimensional functions by examining lower-dimensional slices:
This technique is invaluable for loss landscape visualization: we often pick two random directions in parameter space and plot the loss along them.
Direction and Distance:
In ℝⁿ, a direction is specified by a unit vector u with ‖u‖ = 1. Moving from point x in direction u by distance t gives:
$$\mathbf{x} + t\mathbf{u}$$
The function value along this ray is g(t) = f(x + tu), a single-variable function. This reduction to one dimension is the conceptual foundation for directional derivatives.
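A small sketch of this reduction (the function, base point, and direction are arbitrary choices): restricting f to the ray x + tu yields an ordinary single-variable function of t.

```python
import numpy as np

def f(x):
    """An arbitrary smooth function of several variables (for illustration)."""
    return float(np.sum(x ** 2) + np.sin(x[0]))

def slice_along_direction(f, x, u, ts):
    """Restrict f to the ray x + t*u, giving a single-variable function g(t)."""
    u = u / np.linalg.norm(u)          # make u a unit vector
    return np.array([f(x + t * u) for t in ts])

x = np.array([1.0, -0.5, 2.0])         # base point in R^3
u = np.array([1.0, 1.0, 0.0])          # a direction (normalized inside)
ts = np.linspace(-1.0, 1.0, 5)

print(slice_along_direction(f, x, u, ts))  # five values of g(t) = f(x + t u)
```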
Our 3D intuition can mislead us in high dimensions. For example, in high-dimensional spaces, most critical points of random functions are saddle points rather than local minima. This means gradient descent often escapes apparent 'traps' that don't exist in low dimensions.
Before we can discuss derivatives of multivariable functions, we need to understand continuity in higher dimensions.
Continuity of Multivariable Functions:
A function f: ℝⁿ → ℝ is continuous at a point a if:
$$\lim_{\mathbf{x} \to \mathbf{a}} f(\mathbf{x}) = f(\mathbf{a})$$
Crucially, convergence x → a must hold for every possible approach path. In single-variable calculus, there are only two directions (left and right). In ℝⁿ, there are infinitely many directions and paths.
Path Dependence of Limits:
Consider the function:
$$f(x, y) = \begin{cases} \frac{xy}{x^2 + y^2} & \text{if } (x, y) \neq (0, 0) \\ 0 & \text{if } (x, y) = (0, 0) \end{cases}$$
Approaching (0, 0) along the x-axis (y = 0): f(x, 0) = 0, so the limit is 0.
Approaching along the line y = x: f(x, x) = x²/(2x²) = 1/2, so the limit is 1/2.
Different paths give different limits, so this function is not continuous at (0, 0).
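A quick numerical check of this path dependence, reproducing the example above:

```python
import numpy as np

def f(x, y):
    """The example above: xy / (x^2 + y^2), with f(0, 0) defined as 0."""
    if x == 0 and y == 0:
        return 0.0
    return x * y / (x**2 + y**2)

# Approach (0, 0) along two different paths and watch the values.
for t in [0.1, 0.01, 0.001]:
    print(f"t = {t}: along the x-axis f = {f(t, 0.0):.4f}, "
          f"along y = x f = {f(t, t):.4f}")
# The x-axis path gives values equal to 0, the diagonal path gives 0.5:
# the limit depends on the path, so f is not continuous at the origin.
```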
Smoothness and Differentiability:
A function is smooth (C∞) if all partial derivatives of all orders exist and are continuous. Most loss functions in machine learning are smooth except at specific points:
These non-smooth points create challenges for gradient-based optimization, motivating smooth approximations (softplus, smooth L1, etc.).
| Function | Smoothness | Impact on Optimization |
|---|---|---|
| MSE Loss | C∞ (infinitely smooth) | Gradients well-defined everywhere |
| Cross-Entropy Loss | C∞ (for positive inputs) | Smooth, but can have large gradients near 0 |
| ReLU Activation | C⁰ (continuous, non-differentiable at 0) | Subgradients used; gradient = 0 or 1 |
| Sigmoid Activation | C∞ | Smooth but saturates (vanishing gradients) |
| L1 Regularization | C⁰ (non-differentiable at 0) | Promotes sparsity; requires proximal methods |
| Softmax | C∞ | Smooth, gradients involve all outputs |
When functions aren't differentiable at some points, we use subgradients—generalizations of gradients for convex functions. For ReLU at z = 0, any value in [0, 1] is a valid subgradient. In practice, implementations typically use 0 or 1 arbitrarily at the non-differentiable point, and training still works because hitting exactly z = 0 has probability zero.
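To make the ReLU case concrete, here is a small sketch of one possible subgradient implementation; the `value_at_zero` parameter is our own illustrative knob, not a standard API:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_subgradient(z, value_at_zero=0.0):
    """A subgradient of ReLU: 0 for z < 0, 1 for z > 0, and any value in
    [0, 1] at z = 0. Here the choice at zero is exposed as a parameter;
    many implementations simply pick 0 (an assumption, not a universal rule)."""
    g = (z > 0).astype(float)
    g[z == 0] = value_at_zero
    return g

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))                   # [0. 0. 3.]
print(relu_subgradient(z))       # [0. 0. 1.]  with the choice g(0) = 0
print(relu_subgradient(z, 1.0))  # [0. 1. 1.]  an equally valid subgradient choice
```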
Understanding the structure of the input space is essential for reasoning about optimization and generalization in machine learning.
Open and Closed Sets:
In ℝⁿ, an open ball of radius r centered at c is:
$$B_r(\mathbf{c}) = \{\mathbf{x} \in \mathbb{R}^n : \|\mathbf{x} - \mathbf{c}\| < r\}$$
A set S ⊆ ℝⁿ is open if every point has an open ball entirely contained in S. A set is closed if its complement is open (equivalently, if it contains all its limit points).
Bounded and Compact Sets:
A set is bounded if it fits inside some ball of finite radius. A set is compact if it is both closed and bounded. The Extreme Value Theorem states that a continuous function on a compact set attains its maximum and minimum—this guarantees that minimization problems over compact domains have solutions.
Relevance to ML:
Convex Sets:
A set C ⊆ ℝⁿ is convex if the line segment between any two points in C lies entirely within C:
$$\forall \mathbf{x}, \mathbf{y} \in C, \forall t \in [0, 1]: t\mathbf{x} + (1-t)\mathbf{y} \in C$$
Convexity of the domain is crucial for convex optimization, where local minima are global. Examples of convex sets:
Non-convex domains (e.g., the union of two disjoint balls) can create disconnected feasible regions where gradient descent cannot move between components.
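The following sketch tests the segment condition numerically for a single ball (convex) and for the union of two disjoint balls (non-convex); the sets and points are arbitrary choices:

```python
import numpy as np

def in_unit_ball(x, center):
    """Membership test for the closed ball of radius 1 around `center`."""
    return np.linalg.norm(x - center) <= 1.0

def segment_stays_inside(member, a, b, num=101):
    """Check (numerically) whether every point t*a + (1-t)*b lies in the set."""
    return all(member(t * a + (1 - t) * b) for t in np.linspace(0, 1, num))

a, b = np.array([0.5, 0.0]), np.array([-0.5, 0.0])

# A single ball is convex: the segment between two members stays inside.
ball = lambda x: in_unit_ball(x, np.array([0.0, 0.0]))
print(segment_stays_inside(ball, a, b))        # True

# The union of two disjoint balls is not convex: take one point in each ball.
union = lambda x: (in_unit_ball(x, np.array([-3.0, 0.0]))
                   or in_unit_ball(x, np.array([3.0, 0.0])))
p, q = np.array([-3.0, 0.0]), np.array([3.0, 0.0])
print(segment_stays_inside(union, p, q))       # False: the midpoint is outside
```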
A powerful perspective is to view multivariable functions as transformations that map one space to another. This viewpoint is central to understanding neural networks.
Layer-by-Layer Transformations:
In a neural network, each layer is a function:
$$\mathbf{h}^{(\ell)} = \sigma(\mathbf{W}^{(\ell)} \mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)})$$
This maps from ℝⁿ⁽ˡ⁻¹⁾ (previous layer's dimension) to ℝⁿ⁽ˡ⁾ (current layer's dimension). The full network is the composition of these layer functions:
$$f = f^{(L)} \circ f^{(L-1)} \circ \cdots \circ f^{(1)}$$
The Chain Rule for Compositions:
When we compose functions, their derivatives multiply. If g: ℝⁿ → ℝᵐ and f: ℝᵐ → ℝᵖ, then:
$$\frac{\partial (f \circ g)}{\partial \mathbf{x}} = \frac{\partial f}{\partial \mathbf{g}} \cdot \frac{\partial \mathbf{g}}{\partial \mathbf{x}}$$
This is the matrix chain rule, and it's the mathematical foundation of backpropagation. We'll explore this in depth when we study Jacobian matrices.
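As a preview, the sketch below verifies the matrix chain rule numerically with finite-difference Jacobians for two small, arbitrary maps g: ℝ² → ℝ³ and f: ℝ³ → ℝ:

```python
import numpy as np

def g(x):
    """g: R^2 -> R^3, an arbitrary smooth map for illustration."""
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def f(u):
    """f: R^3 -> R, e.g. a scalar 'loss' of the intermediate values."""
    return float(u[0] + 2 * u[1] * u[2])

def jacobian(fn, x, h=1e-6):
    """Finite-difference Jacobian of fn at x (rows = outputs, cols = inputs)."""
    fx = np.atleast_1d(fn(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[j] += h
        xm[j] -= h
        J[:, j] = (np.atleast_1d(fn(xp)) - np.atleast_1d(fn(xm))) / (2 * h)
    return J

x = np.array([0.7, -1.3])
J_fg = jacobian(lambda v: f(g(v)), x)          # Jacobian of the composition
J_f = jacobian(f, g(x))                        # df/dg, shape (1, 3)
J_g = jacobian(g, x)                           # dg/dx, shape (3, 2)

print(np.allclose(J_fg, J_f @ J_g, atol=1e-4)) # True: derivatives multiply
```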
Representations and Embeddings:
The intermediate values h⁽ˡ⁾ are learned representations of the input. Each layer transforms the representation, ideally making the final classification or regression task easier. Functions can:
```python
import numpy as np

def layer_function(W, b, h_prev, activation='relu'):
    """
    A single neural network layer as a multivariable function.
    Maps: R^(input_dim) -> R^(output_dim)

    Parameters:
    - W: weight matrix of shape (output_dim, input_dim)
    - b: bias vector of shape (output_dim,)
    - h_prev: input from previous layer, shape (input_dim,)
    - activation: nonlinearity to apply

    Returns:
    - h_next: output of this layer, shape (output_dim,)
    """
    # Linear transformation: z = W @ h + b
    z = W @ h_prev + b

    # Apply nonlinearity
    if activation == 'relu':
        return np.maximum(0, z)
    elif activation == 'sigmoid':
        return 1 / (1 + np.exp(-z))
    elif activation == 'linear':
        return z
    else:
        raise ValueError(f"Unknown activation: {activation}")

def neural_network_forward(x, layers):
    """
    Full neural network as composition of layer functions.
    f = f^(L) ∘ f^(L-1) ∘ ... ∘ f^(1)

    Parameters:
    - x: input vector
    - layers: list of (W, b, activation) tuples

    Returns:
    - output: final network output
    - activations: list of all intermediate representations
    """
    activations = [x]  # h^(0) = x
    h = x

    for i, (W, b, activation) in enumerate(layers):
        h = layer_function(W, b, h, activation)
        activations.append(h)
        print(f"Layer {i+1}: R^{len(activations[-2])} -> R^{len(h)}")

    return h, activations

# Example: 3-layer network for 10-class classification
# Input: 784 dimensions (flattened 28x28 image)
# Hidden: 256 -> 128 dimensions
# Output: 10 dimensions (class logits)

np.random.seed(42)

layers = [
    (np.random.randn(256, 784) * 0.01, np.zeros(256), 'relu'),    # R^784 -> R^256
    (np.random.randn(128, 256) * 0.01, np.zeros(128), 'relu'),    # R^256 -> R^128
    (np.random.randn(10, 128) * 0.01, np.zeros(10), 'linear'),    # R^128 -> R^10
]

# Random input (like a flattened image)
x = np.random.randn(784)

output, activations = neural_network_forward(x, layers)

print(f"\nInput dimension: {len(x)}")
print(f"Output dimension: {len(output)}")
print(f"Number of intermediate representations: {len(activations)}")

# Count total parameters
total_params = sum(W.size + b.size for W, b, _ in layers)
print(f"\nTotal parameters: {total_params:,}")
print("The loss function is a multivariable function of all these parameters!")
```

We've laid the groundwork for multivariate calculus by establishing what multivariable functions are, how to think about them geometrically, and why they're central to machine learning.
Key Concepts Covered:
What's Next:
Now that we understand what multivariable functions are, we're ready to ask: How do they change? The next page introduces gradient vectors—the fundamental tool for measuring how a function changes in all directions simultaneously. The gradient generalizes the derivative to multiple dimensions and provides the direction of steepest ascent, making it the cornerstone of optimization algorithms like gradient descent.
You now have a solid foundation in multivariable functions. You understand their definition, geometric interpretation, and central role in machine learning. Next, we'll explore gradient vectors—the direction of steepest ascent that drives every gradient-based optimization algorithm.