Imagine you're a cartographer tasked with creating a two-dimensional map from a globe. You face an impossible choice: no flat map can perfectly represent a spherical Earth. Some distortion is inevitable. The question becomes: which distortion is least harmful for your purposes?
Principal Component Analysis faces a remarkably similar challenge. Given high-dimensional data, we must project it onto a lower-dimensional subspace. Some information loss is inevitable. PCA's elegant answer begins with a deceptively simple principle: preserve the directions where data varies most.
This page develops the maximum variance formulation of PCA—the foundational perspective that transforms an abstract data compression problem into a concrete optimization problem with a beautiful closed-form solution.
By the end of this page, you will understand why maximizing variance is a principled approach to dimensionality reduction, derive the optimization problem that defines principal components, see how the solution emerges naturally from eigenvalue decomposition, and gain geometric intuition for what PCA is actually computing.
Before diving into mathematics, let's build intuition for why variance is the right quantity to maximize. Consider a dataset with observations in a high-dimensional space. Each dimension carries some information, but not all dimensions are equally informative.
When data varies strongly along a particular direction, that direction carries discriminative information—it helps distinguish one data point from another. Conversely, if data barely varies along some direction, that direction provides little value for distinguishing points.
Consider a concrete example: suppose you're analyzing customer data with features including age, annual income, and a unique customer ID number. The customer ID might span a huge numerical range, but it's meaningless noise—it doesn't encode any real pattern. Meanwhile, the relationship between age and income might reveal meaningful purchasing behavior clusters.
From an information-theoretic perspective, variance is closely related to entropy for Gaussian-distributed data. Maximizing variance is equivalent to preserving the most entropy—keeping the directions that carry the most 'surprise' or information about the data.
Another way to view this: high-variance directions typically carry signal, while low-variance directions often carry noise. Noise arises from measurement error, rounding, or irrelevant factors. By focusing on high-variance directions, we implicitly perform denoising.
From a compression standpoint, if we must describe data using fewer numbers, we should use those numbers for the dimensions that matter most. Spending bits on near-constant dimensions wastes capacity. Variance tells us where the 'action' is.
| Perspective | Key Insight | What Variance Captures |
|---|---|---|
| Information Theory | Variance relates to entropy for Gaussian data | Information content / surprise |
| Signal Processing | High variance ≈ signal, low variance ≈ noise | Signal-to-noise separation |
| Data Compression | Allocate representation budget to variable dimensions | Where data 'spreads out' |
| Geometric | Find the 'long axes' of the data cloud | Principal directions of spread |
| Statistical | Capture the most variation in fewest dimensions | Dominant modes of variation |
Let's formalize the maximum variance objective. We'll build the mathematical machinery step by step, ensuring every symbol and operation is crystal clear.
Suppose we have a dataset of n observations, each with d features. We represent this as a data matrix $\mathbf{X}$ with dimensions $n \times d$:
$$\mathbf{X} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_d^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_d^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(n)} & x_2^{(n)} & \cdots & x_d^{(n)} \end{bmatrix}$$
Each row is an observation $\mathbf{x}^{(i)} \in \mathbb{R}^d$, and each column represents a feature across all observations.
We assume the data is centered: the mean of each feature is zero. If not, we subtract the mean: $\tilde{\mathbf{x}} = \mathbf{x} - \bar{\mathbf{x}}$. This is critical because PCA finds directions of variance around the centroid. Without centering, we'd conflate the location of the data with its spread.
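As a quick illustration, here is a minimal NumPy sketch of the centering step on synthetic data (the array names and random values are purely illustrative):

```python
import numpy as np

# A toy dataset whose features have different means and scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2]) + np.array([10.0, -3.0, 0.5])

# Centering: subtract each feature's (column) mean so every feature has mean zero.
X_centered = X - X.mean(axis=0)
print(np.allclose(X_centered.mean(axis=0), 0.0))   # True
```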
We want to project data onto a direction defined by a unit vector $\mathbf{w} \in \mathbb{R}^d$ where $\|\mathbf{w}\| = 1$. The projection of a single observation $\mathbf{x}^{(i)}$ onto this direction is:
$$z^{(i)} = \mathbf{w}^T \mathbf{x}^{(i)} = \sum_{j=1}^{d} w_j x_j^{(i)}$$
This scalar $z^{(i)}$ represents the 'coordinate' of point $\mathbf{x}^{(i)}$ along the direction $\mathbf{w}$. Geometrically, it's the signed length of the projection of $\mathbf{x}^{(i)}$ onto the line through the origin with direction $\mathbf{w}$.
The variance of these projected values across all observations is:
$$\text{Var}(z) = \frac{1}{n} \sum_{i=1}^{n} (z^{(i)})^2 = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{w}^T \mathbf{x}^{(i)})^2$$
Note: we use $(z^{(i)})^2$ rather than $(z^{(i)} - \bar{z})^2$ because the data is centered, so $\bar{z} = \mathbf{w}^T \bar{\mathbf{x}} = 0$.
Let's rewrite this in matrix form. If we project all observations, we get a vector:
$$\mathbf{z} = \mathbf{X}\mathbf{w}$$
The variance becomes:
$$\text{Var}(z) = \frac{1}{n} \mathbf{z}^T \mathbf{z} = \frac{1}{n} \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{w}^T \mathbf{S} \mathbf{w}$$
where $\mathbf{S} = \frac{1}{n} \mathbf{X}^T \mathbf{X}$ is the sample covariance matrix (for centered data).
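The equivalence between the sum form and the quadratic form $\mathbf{w}^T \mathbf{S} \mathbf{w}$ is easy to check numerically. A minimal sketch, again on synthetic centered data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
X = X - X.mean(axis=0)                    # centered data, shape (n, d)
n, d = X.shape

w = rng.normal(size=d)
w = w / np.linalg.norm(w)                 # unit-length direction

# Variance of projections, computed two equivalent ways.
z = X @ w                                 # z^(i) = w^T x^(i) for every observation
var_direct = np.mean(z**2)                # (1/n) * sum_i (z^(i))^2

S = (X.T @ X) / n                         # sample covariance matrix of centered data
var_quadratic = w @ S @ w                 # w^T S w

print(np.allclose(var_direct, var_quadratic))   # True
```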
The covariance matrix $\mathbf{S}$ completely determines the solution to PCA. It captures all pairwise relationships between features. Understanding its structure—that it's symmetric, positive semi-definite, and has real eigenvalues—is essential for understanding PCA.
We can now state the first principal component problem precisely:
Problem: Find the unit vector $\mathbf{w}_1$ that maximizes the variance of projections:
$$\mathbf{w}_1 = \arg\max_{\mathbf{w}} \mathbf{w}^T \mathbf{S} \mathbf{w} \quad \text{subject to} \quad \mathbf{w}^T \mathbf{w} = 1$$
The constraint $\mathbf{w}^T \mathbf{w} = 1$ is crucial. Without it, we could make variance arbitrarily large by scaling $\mathbf{w}$. We want a direction, and directions are represented by unit vectors.
This is a constrained optimization problem. We use the method of Lagrange multipliers. Define the Lagrangian:
$$\mathcal{L}(\mathbf{w}, \lambda) = \mathbf{w}^T \mathbf{S} \mathbf{w} - \lambda(\mathbf{w}^T \mathbf{w} - 1)$$
Taking the gradient with respect to $\mathbf{w}$ and setting it to zero:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\mathbf{S}\mathbf{w} - 2\lambda\mathbf{w} = 0$$
This simplifies to:
$$\mathbf{S}\mathbf{w} = \lambda\mathbf{w}$$
This is an eigenvalue equation! The optimal direction $\mathbf{w}$ must be an eigenvector of the covariance matrix $\mathbf{S}$, and $\lambda$ is the corresponding eigenvalue.
The fact that principal components are eigenvectors of the covariance matrix is not a computational trick—it emerges naturally from the optimization. This is one of the most elegant results in applied mathematics: a statistical question (maximize variance) has a geometric answer (eigenvectors).
The covariance matrix $\mathbf{S}$ (being symmetric positive semi-definite) has $d$ real, non-negative eigenvalues. Let's order them:
$$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$$
with corresponding eigenvectors $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_d$.
Now, the variance along eigenvector $\mathbf{w}_k$ is:
$$\mathbf{w}_k^T \mathbf{S} \mathbf{w}_k = \mathbf{w}_k^T (\lambda_k \mathbf{w}_k) = \lambda_k \|\mathbf{w}_k\|^2 = \lambda_k$$
The eigenvalue equals the variance along its eigenvector!
Therefore, the first principal component—the direction of maximum variance—is the eigenvector corresponding to the largest eigenvalue:
$$\mathbf{w}_1 = \text{eigenvector of } \mathbf{S} \text{ with eigenvalue } \lambda_1$$
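A short sketch tying the pieces together: build $\mathbf{S}$ for a synthetic correlated dataset, extract the top eigenvector with NumPy's `eigh`, and confirm that its eigenvalue equals the variance of the projections and that no random unit direction does better (the data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
# Correlated 2-D data so one direction clearly dominates.
X = rng.normal(size=(1000, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
X = X - X.mean(axis=0)
n = X.shape[0]

S = (X.T @ X) / n
eigvals, eigvecs = np.linalg.eigh(S)      # eigh: symmetric matrices, ascending eigenvalues
w1 = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue
lam1 = eigvals[-1]

# The eigenvalue equation S w1 = lam1 w1 holds...
print(np.allclose(S @ w1, lam1 * w1))
# ...and lam1 equals the variance of the data projected onto w1.
print(np.allclose(np.mean((X @ w1)**2), lam1))
# No random unit direction yields more variance.
dirs = rng.normal(size=(1000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
print(np.all(np.mean((X @ dirs.T)**2, axis=0) <= lam1 + 1e-12))
```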
The first principal component captures the direction of maximum variance. But we typically want more than one component. How do we find the second, third, and subsequent components?
If we simply re-ran the optimization for the second component, we'd get the same answer—the first component already maximizes variance! We need an additional constraint: the second component must be orthogonal to the first.
The problem for the second principal component is:
$$\mathbf{w}_2 = \arg\max_{\mathbf{w}} \mathbf{w}^T \mathbf{S} \mathbf{w} \quad \text{subject to} \quad \mathbf{w}^T \mathbf{w} = 1, \quad \mathbf{w}^T \mathbf{w}_1 = 0$$
Using Lagrange multipliers with both constraints:
$$\mathcal{L}(\mathbf{w}, \lambda, \mu) = \mathbf{w}^T \mathbf{S} \mathbf{w} - \lambda(\mathbf{w}^T \mathbf{w} - 1) - \mu(\mathbf{w}^T \mathbf{w}_1)$$
Taking the gradient:
$$2\mathbf{S}\mathbf{w} - 2\lambda\mathbf{w} - \mu\mathbf{w}_1 = 0$$
Multiplying both sides by $\mathbf{w}_1^T$:
$$2\mathbf{w}_1^T \mathbf{S}\mathbf{w} - 2\lambda\mathbf{w}_1^T\mathbf{w} - \mu\mathbf{w}_1^T\mathbf{w}_1 = 0$$
Since $\mathbf{S}$ is symmetric: $\mathbf{w}_1^T \mathbf{S}\mathbf{w} = \mathbf{w}^T \mathbf{S}\mathbf{w}_1 = \mathbf{w}^T (\lambda_1 \mathbf{w}_1) = \lambda_1 (\mathbf{w}^T \mathbf{w}_1) = 0$
Also, $\mathbf{w}_1^T\mathbf{w} = 0$ (orthogonality constraint) and $\mathbf{w}_1^T\mathbf{w}_1 = 1$.
This gives us: $0 - 0 - \mu = 0$, so $\mu = 0$.
With $\mu = 0$, the optimality condition reduces to:
$$\mathbf{S}\mathbf{w} = \lambda\mathbf{w}$$
The same eigenvalue equation! The second principal component is the eigenvector with the second-largest eigenvalue.
This pattern continues: the $k$-th principal component is the eigenvector of $\mathbf{S}$ with the $k$-th largest eigenvalue. All principal components are simply the eigenvectors of the covariance matrix, ordered by eigenvalue magnitude. This is why computing PCA reduces to eigendecomposition.
There's a beautiful theorem underlying this: eigenvectors of a symmetric matrix corresponding to distinct eigenvalues are automatically orthogonal. Since $\mathbf{S}$ is symmetric (real and $\mathbf{S} = \mathbf{S}^T$), its eigenvectors form an orthonormal basis.
Proof sketch: Let $\mathbf{S}\mathbf{w}_i = \lambda_i \mathbf{w}_i$ and $\mathbf{S}\mathbf{w}_j = \lambda_j \mathbf{w}_j$ with $\lambda_i \neq \lambda_j$.
Compute: $\mathbf{w}_j^T \mathbf{S} \mathbf{w}_i = \mathbf{w}_j^T (\lambda_i \mathbf{w}_i) = \lambda_i (\mathbf{w}_j^T \mathbf{w}_i)$
Also: $\mathbf{w}_j^T \mathbf{S} \mathbf{w}_i = (\mathbf{S}\mathbf{w}_j)^T \mathbf{w}_i = (\lambda_j \mathbf{w}_j)^T \mathbf{w}_i = \lambda_j (\mathbf{w}_j^T \mathbf{w}_i)$
Equating: $\lambda_i (\mathbf{w}_j^T \mathbf{w}_i) = \lambda_j (\mathbf{w}_j^T \mathbf{w}_i)$
Since $\lambda_i \neq \lambda_j$, we must have $\mathbf{w}_j^T \mathbf{w}_i = 0$. QED.
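Numerically, this means the full set of principal components comes from a single symmetric eigendecomposition. A minimal sketch on synthetic data (`eigh` returns eigenvalues in ascending order, so we reverse them):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
X = X - X.mean(axis=0)
S = (X.T @ X) / X.shape[0]

eigvals, eigvecs = np.linalg.eigh(S)       # ascending order for symmetric matrices
order = np.argsort(eigvals)[::-1]          # re-order so lambda_1 >= lambda_2 >= ...
lam = eigvals[order]
W = eigvecs[:, order]                      # column k is the k-th principal component

# The eigenvectors form an orthonormal basis: W^T W = I.
print(np.allclose(W.T @ W, np.eye(5)))
print(lam)                                 # variances along each principal direction
```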
The algebra tells us that principal components are eigenvectors of the covariance matrix. But what does this mean geometrically? The visual intuition is remarkably powerful.
Visualize your $d$-dimensional data as a cloud of points. If features are correlated, this cloud is elongated—stretched more in some directions than others. It forms an ellipsoid shape (for roughly Gaussian data).
The principal components are precisely the principal axes of this ellipsoid: the first component points along the longest axis, the second along the longest axis orthogonal to the first, and so on. The eigenvalues are the squared semi-axis lengths: $\sqrt{\lambda_k}$ is the standard deviation along the $k$-th principal axis.
| Mathematical Object | Geometric Meaning | Statistical Meaning |
|---|---|---|
| Eigenvector $\mathbf{w}_k$ | $k$-th principal axis of data ellipsoid | Direction of $k$-th most variance |
| Eigenvalue $\lambda_k$ | Squared length of $k$-th semi-axis | Variance along $k$-th principal direction |
| $\sqrt{\lambda_k}$ | Length of $k$-th semi-axis | Standard deviation along $k$-th PC |
| Projection $z_k = \mathbf{w}_k^T \mathbf{x}$ | Coordinate along $k$-th axis | $k$-th principal component score |
Another way to view PCA: it rotates the coordinate system to align with the data's natural axes. In the original coordinates, features may be correlated—the data cloud is tilted. In the PCA coordinate system, the axes align with directions of maximum spread, and the transformed features are uncorrelated.
Let $\mathbf{W} = [\mathbf{w}_1 | \mathbf{w}_2 | \cdots | \mathbf{w}_d]$ be the matrix of all eigenvectors (as columns). The transformation:
$$\mathbf{Z} = \mathbf{X}\mathbf{W}$$
gives us data in the new coordinate system. The covariance of $\mathbf{Z}$ is diagonal:
$$\text{Cov}(\mathbf{Z}) = \mathbf{W}^T \mathbf{S} \mathbf{W} = \mathbf{\Lambda} = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$$
This is the decorrelation property of PCA: the transformed features have zero covariance with each other.
PCA transforms correlated features into uncorrelated principal components. This is valuable because many machine learning algorithms (like linear regression) suffer when features are highly correlated (multicollinearity). PCA-transformed features are guaranteed to be uncorrelated.
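A small sketch of the decorrelation property on synthetic correlated data: after rotating by $\mathbf{W}$, the sample covariance of the scores is (numerically) diagonal:

```python
import numpy as np

rng = np.random.default_rng(4)
# Strongly correlated features in the original coordinates.
A = np.array([[1.0, 0.9, 0.8],
              [0.0, 0.5, 0.4],
              [0.0, 0.0, 0.2]])
X = rng.normal(size=(2000, 3)) @ A
X = X - X.mean(axis=0)
n = X.shape[0]

S = (X.T @ X) / n
eigvals, W = np.linalg.eigh(S)
W = W[:, ::-1]                             # columns ordered by decreasing eigenvalue

Z = X @ W                                  # rotate into the PCA coordinate system
cov_Z = (Z.T @ Z) / n                      # should be diag(lambda_1, ..., lambda_d)
print(np.allclose(cov_Z, np.diag(eigvals[::-1])))   # True: off-diagonals vanish
```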
The covariance matrix is the heart of PCA. Understanding its structure reveals why PCA works and what its limitations are.
For centered data, the covariance matrix is:
$$\mathbf{S} = \frac{1}{n} \mathbf{X}^T \mathbf{X} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}^{(i)} (\mathbf{x}^{(i)})^T$$
Each entry $S_{jk}$ is the covariance between features $j$ and $k$:
$$S_{jk} = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)} x_k^{(i)}$$
The diagonal entries $S_{jj}$ are the variances of individual features.
A critical practical consideration: if you have fewer observations than features ($n < d$), the covariance matrix is rank deficient. It will have at most $n - 1$ non-zero eigenvalues (subtracting 1 for centering).
This means you can extract at most $n - 1$ meaningful principal components; the remaining $d - n + 1$ directions have zero estimated variance.
This is common in genomics (thousands of genes, hundreds of samples), text analysis (millions of words, thousands of documents), and image analysis (millions of pixels, limited images).
When $d \gg n$, PCA can still be computed but interpretation changes. Many eigenvalues will be zero or near-zero, not because those directions are noise-free, but because we don't have enough data to estimate variance in those directions. This is a fundamental limitation, not a bug in PCA.
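This rank deficiency is easy to see numerically. A sketch with synthetic data where $n \ll d$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 20, 100                             # far fewer samples than features
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)

S = (X.T @ X) / n
eigvals = np.linalg.eigvalsh(S)[::-1]      # eigenvalues in descending order

# At most n - 1 = 19 eigenvalues are non-zero; the rest vanish (up to round-off)
# because centering removes one degree of freedom, so rank(S) <= n - 1.
print(np.sum(eigvals > 1e-10))             # 19
print(eigvals[n - 1:].max())               # ~0
```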
Understanding the computational aspects of PCA is essential for applying it to real-world datasets. The choice of algorithm depends critically on the dimensions of your problem.
The direct approach has two steps: form the covariance matrix $\mathbf{S} = \frac{1}{n}\mathbf{X}^T\mathbf{X}$, which costs $O(nd^2)$, then compute its eigendecomposition, which costs $O(d^3)$.
Total: $O(nd^2 + d^3)$
This works well when $d$ is moderate (hundreds to low thousands).
Instead of explicitly forming $\mathbf{S}$, we can use the Singular Value Decomposition of $\mathbf{X}$ directly:
$$\mathbf{X} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$$
The right singular vectors $\mathbf{V}$ are exactly the eigenvectors of $\mathbf{X}^T\mathbf{X}$ (proportional to $\mathbf{S}$). The singular values $\sigma_k$ relate to eigenvalues by $\lambda_k = \sigma_k^2 / n$.
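A brief numerical check of this correspondence, using NumPy's `svd` on synthetic data (eigenvectors are only defined up to a sign flip, so the comparison uses absolute values):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 6))
X = X - X.mean(axis=0)
n = X.shape[0]

# Eigendecomposition route.
S = (X.T @ X) / n
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

# SVD route: the covariance matrix is never formed.
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# Singular values and eigenvalues agree via lambda_k = sigma_k^2 / n.
print(np.allclose(sigma**2 / n, eigvals))
# Right singular vectors match the eigenvectors up to sign.
print(np.allclose(np.abs(Vt), np.abs(eigvecs.T), atol=1e-6))
```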
Advantages of SVD: it never forms the $d \times d$ matrix $\mathbf{S}$ explicitly, it is more numerically stable (forming $\mathbf{X}^T\mathbf{X}$ squares the condition number), and truncated or randomized variants can compute only the leading $k$ components.
| Method | Time Complexity | Space Complexity | Best When |
|---|---|---|---|
| Full Eigendecomposition | $O(nd^2 + d^3)$ | $O(d^2)$ | $d$ is small |
| Full SVD | $O(\min(nd^2, n^2d))$ | $O(nd)$ | $n$ and $d$ moderate |
| Truncated SVD | $O(ndk)$ | $O((n+d)k)$ | Only need $k \ll d$ components |
| Randomized SVD | $O(nd \log k)$ | $O((n+d)k)$ | Large datasets, approximate OK |
| Incremental PCA | $O(ndk)$ per pass | $O(dk)$ | Data doesn't fit in memory |
When $n \ll d$ (more features than samples), there's a clever trick. Instead of the $d \times d$ covariance matrix, compute the $n \times n$ Gram matrix:
$$\mathbf{G} = \frac{1}{n} \mathbf{X}\mathbf{X}^T$$
The eigenvectors of $\mathbf{G}$ can be converted to eigenvectors of $\mathbf{S}$:
If $\mathbf{G}\mathbf{u} = \lambda \mathbf{u}$, then $\mathbf{S}(\mathbf{X}^T\mathbf{u}) = \lambda(\mathbf{X}^T\mathbf{u})$
So $\mathbf{w} = \mathbf{X}^T\mathbf{u}$ (normalized) is the eigenvector of $\mathbf{S}$.
Complexity: $O(n^2d + n^3)$ — much better when $n \ll d$.
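A minimal sketch of the trick on synthetic data with $n \ll d$; the eigenvalue equation for $\mathbf{S}$ is verified without ever forming the $d \times d$ matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 50, 5000                            # n << d: the Gram-matrix route pays off
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)

# n x n Gram matrix instead of the d x d covariance matrix.
G = (X @ X.T) / n
gvals, gvecs = np.linalg.eigh(G)
lam, u = gvals[-1], gvecs[:, -1]           # largest eigenvalue and its eigenvector

# Map back: w = X^T u (normalized) is an eigenvector of S with the same eigenvalue.
w = X.T @ u
w = w / np.linalg.norm(w)

Sw = (X.T @ (X @ w)) / n                   # compute S w without forming S explicitly
print(np.allclose(Sw, lam * w))            # True: S w = lambda w
```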
For most applications, use your library's PCA implementation (scikit-learn, NumPy, etc.). These handle the algorithm selection automatically. Only worry about computational details if you hit memory or time limits—then consider randomized or incremental variants.
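For example, a minimal scikit-learn usage sketch (note that scikit-learn normalizes by $1/(n-1)$ rather than $1/n$, so its reported variances differ from the convention above by a factor of $n/(n-1)$):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(8)
X = rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 10))   # correlated synthetic data

# scikit-learn centers the data and picks a solver automatically.
pca = PCA(n_components=3)
Z = pca.fit_transform(X)                   # scores: projections onto the top 3 components

print(pca.components_.shape)               # (3, 10): rows are the principal directions
print(pca.explained_variance_)             # variances along the top 3 components
print(pca.explained_variance_ratio_.sum()) # fraction of total variance retained
```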
We've developed the maximum variance formulation of PCA from first principles. Let's consolidate the key insights:

- Projecting centered data onto a unit vector $\mathbf{w}$ gives variance $\mathbf{w}^T \mathbf{S} \mathbf{w}$, where $\mathbf{S}$ is the sample covariance matrix.
- Maximizing this variance subject to $\|\mathbf{w}\| = 1$ leads, via Lagrange multipliers, to the eigenvalue equation $\mathbf{S}\mathbf{w} = \lambda\mathbf{w}$.
- The eigenvalue $\lambda_k$ is exactly the variance captured along the $k$-th principal direction.
- The principal components are the eigenvectors of $\mathbf{S}$, ordered by decreasing eigenvalue; they form an orthonormal basis that decorrelates the data.
- Computationally, PCA reduces to an eigendecomposition of $\mathbf{S}$ or, often preferably, an SVD of $\mathbf{X}$.
The maximum variance formulation tells us what PCA computes. In the next page, we'll explore an equivalent but different perspective: minimum reconstruction error. This view reveals PCA as the optimal linear projection for data compression—a perspective that connects to autoencoders and helps us understand what information PCA discards.