Imagine you're a cartographer tasked with creating a two-dimensional map from a globe. You face an impossible choice: no flat map can perfectly represent a spherical Earth. Some distortion is inevitable. The question becomes: which distortion is least harmful for your purposes?
Principal Component Analysis faces a remarkably similar challenge. Given high-dimensional data, we must project it onto a lower-dimensional subspace. Some information loss is inevitable. PCA's elegant answer begins with a deceptively simple principle: preserve the directions where data varies most.
This page develops the maximum variance formulation of PCA—the foundational perspective that transforms an abstract data compression problem into a concrete optimization problem with a beautiful closed-form solution.
By the end of this page, you will understand why maximizing variance is a principled approach to dimensionality reduction, derive the optimization problem that defines principal components, see how the solution emerges naturally from eigenvalue decomposition, and gain geometric intuition for what PCA is actually computing.
Before diving into mathematics, let's build intuition for why variance is the right quantity to maximize. Consider a dataset with observations in a high-dimensional space. Each dimension carries some information, but not all dimensions are equally informative.
When data varies strongly along a particular direction, that direction carries discriminative information—it helps distinguish one data point from another. Conversely, if data barely varies along some direction, that direction provides little value for distinguishing points.
Consider a concrete example: suppose you're analyzing customer data with features including age, annual income, and a unique customer ID number. The customer ID might span a huge numerical range, but it's meaningless noise—it doesn't encode any real pattern. Meanwhile, the relationship between age and income might reveal meaningful purchasing behavior clusters.
From an information-theoretic perspective, variance is closely related to entropy for Gaussian-distributed data. Maximizing variance is equivalent to preserving the most entropy—keeping the directions that carry the most 'surprise' or information about the data.
Another way to view this: high-variance directions typically carry signal, while low-variance directions often carry noise. Noise arises from measurement error, rounding, or irrelevant factors. By focusing on high-variance directions, we implicitly perform denoising.
From a compression standpoint, if we must describe data using fewer numbers, we should use those numbers for the dimensions that matter most. Spending bits on near-constant dimensions wastes capacity. Variance tells us where the 'action' is.
| Perspective | Key Insight | What Variance Captures |
|---|---|---|
| Information Theory | Variance relates to entropy for Gaussian data | Information content / surprise |
| Signal Processing | High variance ≈ signal, low variance ≈ noise | Signal-to-noise separation |
| Data Compression | Allocate representation budget to variable dimensions | Where data 'spreads out' |
| Geometric | Find the 'long axes' of the data cloud | Principal directions of spread |
| Statistical | Capture the most variation in fewest dimensions | Dominant modes of variation |
Let's formalize the maximum variance objective. We'll build the mathematical machinery step by step, ensuring every symbol and operation is crystal clear.
Suppose we have a dataset of n observations, each with d features. We represent this as a data matrix $\mathbf{X}$ with dimensions $n \times d$:
$$\mathbf{X} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_d^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_d^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(n)} & x_2^{(n)} & \cdots & x_d^{(n)} \end{bmatrix}$$
Each row is an observation $\mathbf{x}^{(i)} \in \mathbb{R}^d$, and each column represents a feature across all observations.
We assume the data is centered: the mean of each feature is zero. If not, we subtract the mean: $\tilde{\mathbf{x}} = \mathbf{x} - \bar{\mathbf{x}}$. This is critical because PCA finds directions of variance around the centroid. Without centering, we'd conflate the location of the data with its spread.
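As a quick illustration, here is a minimal NumPy sketch of the centering step on synthetic data (the array names and random values are purely illustrative):

```python
import numpy as np

# A toy dataset whose features have different means and scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2]) + np.array([10.0, -3.0, 0.5])

# Centering: subtract each feature's (column) mean so every feature has mean zero.
X_centered = X - X.mean(axis=0)
print(np.allclose(X_centered.mean(axis=0), 0.0))   # True
```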
We want to project data onto a direction defined by a unit vector $\mathbf{w} \in \mathbb{R}^d$ where $\|\mathbf{w}\| = 1$. The projection of a single observation $\mathbf{x}^{(i)}$ onto this direction is:
$$z^{(i)} = \mathbf{w}^T \mathbf{x}^{(i)} = \sum_{j=1}^{d} w_j x_j^{(i)}$$
This scalar $z^{(i)}$ represents the 'coordinate' of point $\mathbf{x}^{(i)}$ along the direction $\mathbf{w}$. Geometrically, it's the signed length of the projection of $\mathbf{x}^{(i)}$ onto the line through the origin with direction $\mathbf{w}$.
The variance of these projected values across all observations is:
$$\text{Var}(z) = \frac{1}{n} \sum_{i=1}^{n} (z^{(i)})^2 = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{w}^T \mathbf{x}^{(i)})^2$$
Note: we use $(z^{(i)})^2$ rather than $(z^{(i)} - \bar{z})^2$ because the data is centered, so $\bar{z} = \mathbf{w}^T \bar{\mathbf{x}} = 0$.
Let's rewrite this in matrix form. If we project all observations, we get a vector:
$$\mathbf{z} = \mathbf{X}\mathbf{w}$$
The variance becomes:
$$\text{Var}(z) = \frac{1}{n} \mathbf{z}^T \mathbf{z} = \frac{1}{n} \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{w}^T \mathbf{S} \mathbf{w}$$
where $\mathbf{S} = \frac{1}{n} \mathbf{X}^T \mathbf{X}$ is the sample covariance matrix (for centered data).
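The equivalence between the sum form and the quadratic form $\mathbf{w}^T \mathbf{S} \mathbf{w}$ is easy to check numerically. A minimal sketch, again on synthetic centered data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
X = X - X.mean(axis=0)                    # centered data, shape (n, d)
n, d = X.shape

w = rng.normal(size=d)
w = w / np.linalg.norm(w)                 # unit-length direction

# Variance of projections, computed two equivalent ways.
z = X @ w                                 # z^(i) = w^T x^(i) for every observation
var_direct = np.mean(z**2)                # (1/n) * sum_i (z^(i))^2

S = (X.T @ X) / n                         # sample covariance matrix of centered data
var_quadratic = w @ S @ w                 # w^T S w

print(np.allclose(var_direct, var_quadratic))   # True
```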
The covariance matrix $\mathbf{S}$ completely determines the solution to PCA. It captures all pairwise relationships between features. Understanding its structure—that it's symmetric, positive semi-definite, and has real eigenvalues—is essential for understanding PCA.
We can now state the first principal component problem precisely:
Problem: Find the unit vector $\mathbf{w}_1$ that maximizes the variance of projections:
$$\mathbf{w}_1 = \arg\max_{\mathbf{w}} \mathbf{w}^T \mathbf{S} \mathbf{w} \quad \text{subject to} \quad \mathbf{w}^T \mathbf{w} = 1$$
The constraint $\mathbf{w}^T \mathbf{w} = 1$ is crucial. Without it, we could make variance arbitrarily large by scaling $\mathbf{w}$. We want a direction, and directions are represented by unit vectors.
This is a constrained optimization problem. We use the method of Lagrange multipliers. Define the Lagrangian:
$$\mathcal{L}(\mathbf{w}, \lambda) = \mathbf{w}^T \mathbf{S} \mathbf{w} - \lambda(\mathbf{w}^T \mathbf{w} - 1)$$
Taking the gradient with respect to $\mathbf{w}$ and setting it to zero:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\mathbf{S}\mathbf{w} - 2\lambda\mathbf{w} = 0$$
This simplifies to:
$$\mathbf{S}\mathbf{w} = \lambda\mathbf{w}$$
This is an eigenvalue equation! The optimal direction $\mathbf{w}$ must be an eigenvector of the covariance matrix $\mathbf{S}$, and $\lambda$ is the corresponding eigenvalue.
The fact that principal components are eigenvectors of the covariance matrix is not a computational trick—it emerges naturally from the optimization. This is one of the most elegant results in applied mathematics: a statistical question (maximize variance) has a geometric answer (eigenvectors).
The covariance matrix $\mathbf{S}$ (being symmetric positive semi-definite) has $d$ real, non-negative eigenvalues. Let's order them:
$$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$$
with corresponding eigenvectors $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_d$.
Now, the variance along eigenvector $\mathbf{w}_k$ is:
$$\mathbf{w}_k^T \mathbf{S} \mathbf{w}_k = \mathbf{w}_k^T (\lambda_k \mathbf{w}_k) = \lambda_k \|\mathbf{w}_k\|^2 = \lambda_k$$
The eigenvalue equals the variance along its eigenvector!
Therefore, the first principal component—the direction of maximum variance—is the eigenvector corresponding to the largest eigenvalue:
$$\mathbf{w}_1 = \text{eigenvector of } \mathbf{S} \text{ with eigenvalue } \lambda_1$$
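A short sketch tying the pieces together: build $\mathbf{S}$ for a synthetic correlated dataset, extract the top eigenvector with NumPy's `eigh`, and confirm that its eigenvalue equals the variance of the projections and that no random unit direction does better (the data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
# Correlated 2-D data so one direction clearly dominates.
X = rng.normal(size=(1000, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
X = X - X.mean(axis=0)
n = X.shape[0]

S = (X.T @ X) / n
eigvals, eigvecs = np.linalg.eigh(S)      # eigh: symmetric matrices, ascending eigenvalues
w1 = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue
lam1 = eigvals[-1]

# The eigenvalue equation S w1 = lam1 w1 holds...
print(np.allclose(S @ w1, lam1 * w1))
# ...and lam1 equals the variance of the data projected onto w1.
print(np.allclose(np.mean((X @ w1)**2), lam1))
# No random unit direction yields more variance.
dirs = rng.normal(size=(1000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
print(np.all(np.mean((X @ dirs.T)**2, axis=0) <= lam1 + 1e-12))
```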
The first principal component captures the direction of maximum variance. But we typically want more than one component. How do we find the second, third, and subsequent components?
If we simply re-ran the optimization for the second component, we'd get the same answer—the first component already maximizes variance! We need an additional constraint: the second component must be orthogonal to the first.
The problem for the second principal component is:
$$\mathbf{w}_2 = \arg\max_{\mathbf{w}} \mathbf{w}^T \mathbf{S} \mathbf{w} \quad \text{subject to} \quad \mathbf{w}^T \mathbf{w} = 1, \quad \mathbf{w}^T \mathbf{w}_1 = 0$$
Using Lagrange multipliers with both constraints:
$$\mathcal{L}(\mathbf{w}, \lambda, \mu) = \mathbf{w}^T \mathbf{S} \mathbf{w} - \lambda(\mathbf{w}^T \mathbf{w} - 1) - \mu(\mathbf{w}^T \mathbf{w}_1)$$
Taking the gradient:
$$2\mathbf{S}\mathbf{w} - 2\lambda\mathbf{w} - \mu\mathbf{w}_1 = 0$$
Multiplying both sides by $\mathbf{w}_1^T$:
$$2\mathbf{w}_1^T \mathbf{S}\mathbf{w} - 2\lambda\mathbf{w}_1^T\mathbf{w} - \mu\mathbf{w}_1^T\mathbf{w}_1 = 0$$
Since $\mathbf{S}$ is symmetric: $\mathbf{w}_1^T \mathbf{S}\mathbf{w} = \mathbf{w}^T \mathbf{S}\mathbf{w}_1 = \mathbf{w}^T (\lambda_1 \mathbf{w}_1) = \lambda_1 (\mathbf{w}^T \mathbf{w}_1) = 0$
Also, $\mathbf{w}_1^T\mathbf{w} = 0$ (orthogonality constraint) and $\mathbf{w}_1^T\mathbf{w}_1 = 1$.
This gives us: $0 - 0 - \mu = 0$, so $\mu = 0$.
With $\mu = 0$, the optimality condition reduces to:
$$\mathbf{S}\mathbf{w} = \lambda\mathbf{w}$$
The same eigenvalue equation! The second principal component is the eigenvector with the second-largest eigenvalue.
This pattern continues: the $k$-th principal component is the eigenvector of $\mathbf{S}$ with the $k$-th largest eigenvalue. All principal components are simply the eigenvectors of the covariance matrix, ordered by eigenvalue magnitude. This is why computing PCA reduces to eigendecomposition.
There's a beautiful theorem underlying this: eigenvectors of a symmetric matrix corresponding to distinct eigenvalues are automatically orthogonal. Since $\mathbf{S}$ is symmetric (real and $\mathbf{S} = \mathbf{S}^T$), its eigenvectors form an orthonormal basis.
Proof sketch: Let $\mathbf{S}\mathbf{w}_i = \lambda_i \mathbf{w}_i$ and $\mathbf{S}\mathbf{w}_j = \lambda_j \mathbf{w}_j$ with $\lambda_i \neq \lambda_j$.
Compute: $\mathbf{w}_j^T \mathbf{S} \mathbf{w}_i = \mathbf{w}_j^T (\lambda_i \mathbf{w}_i) = \lambda_i (\mathbf{w}_j^T \mathbf{w}_i)$
Also: $\mathbf{w}_j^T \mathbf{S} \mathbf{w}_i = (\mathbf{S}\mathbf{w}_j)^T \mathbf{w}_i = (\lambda_j \mathbf{w}_j)^T \mathbf{w}_i = \lambda_j (\mathbf{w}_j^T \mathbf{w}_i)$
Equating: $\lambda_i (\mathbf{w}_j^T \mathbf{w}_i) = \lambda_j (\mathbf{w}_j^T \mathbf{w}_i)$
Since $\lambda_i \neq \lambda_j$, we must have $\mathbf{w}_j^T \mathbf{w}_i = 0$. QED.
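Numerically, this means the full set of principal components comes from a single symmetric eigendecomposition. A minimal sketch on synthetic data (`eigh` returns eigenvalues in ascending order, so we reverse them):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
X = X - X.mean(axis=0)
S = (X.T @ X) / X.shape[0]

eigvals, eigvecs = np.linalg.eigh(S)       # ascending order for symmetric matrices
order = np.argsort(eigvals)[::-1]          # re-order so lambda_1 >= lambda_2 >= ...
lam = eigvals[order]
W = eigvecs[:, order]                      # column k is the k-th principal component

# The eigenvectors form an orthonormal basis: W^T W = I.
print(np.allclose(W.T @ W, np.eye(5)))
print(lam)                                 # variances along each principal direction
```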
The algebra tells us that principal components are eigenvectors of the covariance matrix. But what does this mean geometrically? The visual intuition is remarkably powerful.
Visualize your $d$-dimensional data as a cloud of points. If features are correlated, this cloud is elongated—stretched more in some directions than others. It forms an ellipsoid shape (for roughly Gaussian data).
The principal components are precisely the principal axes of this ellipsoid: the first component points along the longest axis, the second along the longest axis orthogonal to the first, and so on. The eigenvalues are the squared semi-axis lengths: $\sqrt{\lambda_k}$ is the standard deviation along the $k$-th principal axis.
| Mathematical Object | Geometric Meaning | Statistical Meaning |
|---|---|---|
| Eigenvector $\mathbf{w}_k$ | $k$-th principal axis of data ellipsoid | Direction of $k$-th most variance |
| Eigenvalue $\lambda_k$ | Squared length of $k$-th semi-axis | Variance along $k$-th principal direction |
| $\sqrt{\lambda_k}$ | Length of $k$-th semi-axis | Standard deviation along $k$-th PC |
| Projection $z_k = \mathbf{w}_k^T \mathbf{x}$ | Coordinate along $k$-th axis | $k$-th principal component score |
Another way to view PCA: it rotates the coordinate system to align with the data's natural axes. In the original coordinates, features may be correlated—the data cloud is tilted. In the PCA coordinate system, the axes align with directions of maximum spread, and the transformed features are uncorrelated.
Let $\mathbf{W} = [\mathbf{w}_1 | \mathbf{w}_2 | \cdots | \mathbf{w}_d]$ be the matrix of all eigenvectors (as columns). The transformation:
$$\mathbf{Z} = \mathbf{X}\mathbf{W}$$
gives us data in the new coordinate system. The covariance of $\mathbf{Z}$ is diagonal:
$$\text{Cov}(\mathbf{Z}) = \mathbf{W}^T \mathbf{S} \mathbf{W} = \mathbf{\Lambda} = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$$
This is the decorrelation property of PCA: the transformed features have zero covariance with each other.
PCA transforms correlated features into uncorrelated principal components. This is valuable because many machine learning algorithms (like linear regression) suffer when features are highly correlated (multicollinearity). PCA-transformed features are guaranteed to be uncorrelated.
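A small sketch of the decorrelation property on synthetic correlated data: after rotating by $\mathbf{W}$, the sample covariance of the scores is (numerically) diagonal:

```python
import numpy as np

rng = np.random.default_rng(4)
# Strongly correlated features in the original coordinates.
A = np.array([[1.0, 0.9, 0.8],
              [0.0, 0.5, 0.4],
              [0.0, 0.0, 0.2]])
X = rng.normal(size=(2000, 3)) @ A
X = X - X.mean(axis=0)
n = X.shape[0]

S = (X.T @ X) / n
eigvals, W = np.linalg.eigh(S)
W = W[:, ::-1]                             # columns ordered by decreasing eigenvalue

Z = X @ W                                  # rotate into the PCA coordinate system
cov_Z = (Z.T @ Z) / n                      # should be diag(lambda_1, ..., lambda_d)
print(np.allclose(cov_Z, np.diag(eigvals[::-1])))   # True: off-diagonals vanish
```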
The covariance matrix is the heart of PCA. Understanding its structure reveals why PCA works and what its limitations are.
For centered data, the covariance matrix is:
$$\mathbf{S} = \frac{1}{n} \mathbf{X}^T \mathbf{X} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}^{(i)} (\mathbf{x}^{(i)})^T$$
Each entry $S_{jk}$ is the covariance between features $j$ and $k$:
$$S_{jk} = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)} x_k^{(i)}$$
The diagonal entries $S_{jj}$ are the variances of individual features.
A critical practical consideration: if you have fewer observations than features ($n < d$), the covariance matrix is rank deficient. It will have at most $n - 1$ non-zero eigenvalues (subtracting 1 for centering).
This means you can extract at most $n - 1$ meaningful principal components; the remaining $d - n + 1$ directions have zero estimated variance.
This is common in genomics (thousands of genes, hundreds of samples), text analysis (millions of words, thousands of documents), and image analysis (millions of pixels, limited images).
When $d \gg n$, PCA can still be computed but interpretation changes. Many eigenvalues will be zero or near-zero, not because those directions are noise-free, but because we don't have enough data to estimate variance in those directions. This is a fundamental limitation, not a bug in PCA.
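This rank deficiency is easy to see numerically. A sketch with synthetic data where $n \ll d$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 20, 100                             # far fewer samples than features
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)

S = (X.T @ X) / n
eigvals = np.linalg.eigvalsh(S)[::-1]      # eigenvalues in descending order

# At most n - 1 = 19 eigenvalues are non-zero; the rest vanish (up to round-off)
# because centering removes one degree of freedom, so rank(S) <= n - 1.
print(np.sum(eigvals > 1e-10))             # 19
print(eigvals[n - 1:].max())               # ~0
```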
Understanding the computational aspects of PCA is essential for applying it to real-world datasets. The choice of algorithm depends critically on the dimensions of your problem.
The direct approach has two steps: form the covariance matrix $\mathbf{S} = \frac{1}{n}\mathbf{X}^T\mathbf{X}$, which costs $O(nd^2)$, then compute its eigendecomposition, which costs $O(d^3)$.
Total: $O(nd^2 + d^3)$
This works well when $d$ is moderate (hundreds to low thousands).
Instead of explicitly forming $\mathbf{S}$, we can use the Singular Value Decomposition of $\mathbf{X}$ directly:
$$\mathbf{X} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$$
The right singular vectors $\mathbf{V}$ are exactly the eigenvectors of $\mathbf{X}^T\mathbf{X}$ (proportional to $\mathbf{S}$). The singular values $\sigma_k$ relate to eigenvalues by $\lambda_k = \sigma_k^2 / n$.
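A brief numerical check of this correspondence, using NumPy's `svd` on synthetic data (eigenvectors are only defined up to a sign flip, so the comparison uses absolute values):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 6))
X = X - X.mean(axis=0)
n = X.shape[0]

# Eigendecomposition route.
S = (X.T @ X) / n
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

# SVD route: the covariance matrix is never formed.
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# Singular values and eigenvalues agree via lambda_k = sigma_k^2 / n.
print(np.allclose(sigma**2 / n, eigvals))
# Right singular vectors match the eigenvectors up to sign.
print(np.allclose(np.abs(Vt), np.abs(eigvecs.T), atol=1e-6))
```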
Advantages of SVD: it never forms the $d \times d$ matrix $\mathbf{S}$ explicitly, it is more numerically stable (forming $\mathbf{X}^T\mathbf{X}$ squares the condition number), and truncated or randomized variants can compute only the leading $k$ components.
| Method | Time Complexity | Space Complexity | Best When |
|---|---|---|---|
| Full Eigendecomposition | $O(nd^2 + d^3)$ | $O(d^2)$ | $d$ is small |
| Full SVD | $O(\min(nd^2, n^2d))$ | $O(nd)$ | $n$ and $d$ moderate |
| Truncated SVD | $O(ndk)$ | $O((n+d)k)$ | Only need $k \ll d$ components |
| Randomized SVD | $O(nd \log k)$ | $O((n+d)k)$ | Large datasets, approximate OK |
| Incremental PCA | $O(ndk)$ per pass | $O(dk)$ | Data doesn't fit in memory |
When $n \ll d$ (more features than samples), there's a clever trick. Instead of the $d \times d$ covariance matrix, compute the $n \times n$ Gram matrix:
$$\mathbf{G} = \frac{1}{n} \mathbf{X}\mathbf{X}^T$$
The eigenvectors of $\mathbf{G}$ can be converted to eigenvectors of $\mathbf{S}$:
If $\mathbf{G}\mathbf{u} = \lambda \mathbf{u}$, then $\mathbf{S}(\mathbf{X}^T\mathbf{u}) = \lambda(\mathbf{X}^T\mathbf{u})$
So $\mathbf{w} = \mathbf{X}^T\mathbf{u}$ (normalized) is the eigenvector of $\mathbf{S}$.
Complexity: $O(n^2d + n^3)$ — much better when $n \ll d$.
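A minimal sketch of the trick on synthetic data with $n \ll d$; the eigenvalue equation for $\mathbf{S}$ is verified without ever forming the $d \times d$ matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 50, 5000                            # n << d: the Gram-matrix route pays off
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)

# n x n Gram matrix instead of the d x d covariance matrix.
G = (X @ X.T) / n
gvals, gvecs = np.linalg.eigh(G)
lam, u = gvals[-1], gvecs[:, -1]           # largest eigenvalue and its eigenvector

# Map back: w = X^T u (normalized) is an eigenvector of S with the same eigenvalue.
w = X.T @ u
w = w / np.linalg.norm(w)

Sw = (X.T @ (X @ w)) / n                   # compute S w without forming S explicitly
print(np.allclose(Sw, lam * w))            # True: S w = lambda w
```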
For most applications, use your library's PCA implementation (scikit-learn, NumPy, etc.). These handle the algorithm selection automatically. Only worry about computational details if you hit memory or time limits—then consider randomized or incremental variants.
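For example, a minimal scikit-learn usage sketch (note that scikit-learn normalizes by $1/(n-1)$ rather than $1/n$, so its reported variances differ from the convention above by a factor of $n/(n-1)$):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(8)
X = rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 10))   # correlated synthetic data

# scikit-learn centers the data and picks a solver automatically.
pca = PCA(n_components=3)
Z = pca.fit_transform(X)                   # scores: projections onto the top 3 components

print(pca.components_.shape)               # (3, 10): rows are the principal directions
print(pca.explained_variance_)             # variances along the top 3 components
print(pca.explained_variance_ratio_.sum()) # fraction of total variance retained
```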
We've developed the maximum variance formulation of PCA from first principles. Let's consolidate the key insights:

- Projecting centered data onto a unit vector $\mathbf{w}$ gives variance $\mathbf{w}^T \mathbf{S} \mathbf{w}$, where $\mathbf{S}$ is the sample covariance matrix.
- Maximizing this variance subject to $\|\mathbf{w}\| = 1$ leads, via Lagrange multipliers, to the eigenvalue equation $\mathbf{S}\mathbf{w} = \lambda\mathbf{w}$.
- The eigenvalue $\lambda_k$ is exactly the variance captured along the $k$-th principal direction.
- The principal components are the eigenvectors of $\mathbf{S}$, ordered by decreasing eigenvalue; they form an orthonormal basis that decorrelates the data.
- Computationally, PCA reduces to an eigendecomposition of $\mathbf{S}$ or, often preferably, an SVD of $\mathbf{X}$.
The maximum variance formulation tells us what PCA computes. In the next page, we'll explore an equivalent but different perspective: minimum reconstruction error. This view reveals PCA as the optimal linear projection for data compression—a perspective that connects to autoencoders and helps us understand what information PCA discards.