Imagine you're a cartographer tasked with creating a two-dimensional map from a globe. You face an impossible choice: no flat map can perfectly represent a spherical Earth. Some distortion is inevitable. The question becomes: which distortion is least harmful for your purposes?

Principal Component Analysis faces a remarkably similar challenge. Given high-dimensional data, we must project it onto a lower-dimensional subspace. Some information loss is inevitable. PCA's elegant answer begins with a deceptively simple principle: preserve the directions where the data varies most.

This page develops the maximum variance formulation of PCA—the foundational perspective that transforms an abstract data compression problem into a concrete optimization problem with a beautiful closed-form solution.
By the end of this page, you will understand why maximizing variance is a principled approach to dimensionality reduction, derive the optimization problem that defines principal components, see how the solution emerges naturally from eigenvalue decomposition, and gain geometric intuition for what PCA is actually computing.
Before diving into the mathematics, let's build intuition for why variance is the right quantity to maximize. Consider a dataset with observations in a high-dimensional space. Each dimension carries some information, but not all dimensions are equally informative.

### The Information Density Perspective

When data varies strongly along a particular direction, that direction carries discriminative information—it helps distinguish one data point from another. Conversely, if data barely varies along some direction, that direction provides little value for distinguishing points.

Consider a concrete example: suppose you're analyzing customer data with features including age, annual income, and a unique customer ID number. The customer ID might span a huge numerical range, but it's meaningless noise—it doesn't encode any real pattern. Meanwhile, the relationship between age and income might reveal meaningful purchasing behavior clusters.
From an information-theoretic perspective, variance is closely related to entropy for Gaussian-distributed data. Maximizing variance is equivalent to preserving the most entropy—keeping the directions that carry the most 'surprise' or information about the data.
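For a one-dimensional Gaussian this connection can be made exact. The differential entropy of $\mathcal{N}(\mu, \sigma^2)$ is

$$H = \frac{1}{2} \ln\!\left(2\pi e \sigma^2\right),$$

which increases monotonically with the variance $\sigma^2$. For Gaussian data, the highest-variance directions are therefore literally the highest-entropy directions.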
| Perspective | Key Insight | What Variance Captures |
|---|---|---|
| Information Theory | Variance relates to entropy for Gaussian data | Information content / surprise |
| Signal Processing | High variance ≈ signal, low variance ≈ noise | Signal-to-noise separation |
| Data Compression | Allocate representation budget to variable dimensions | Where data 'spreads out' |
| Geometric | Find the 'long axes' of the data cloud | Principal directions of spread |
| Statistical | Capture the most variation in fewest dimensions | Dominant modes of variation |
Let's formalize the maximum variance objective. We'll build the mathematical machinery step by step, making sure every symbol and operation is clear.

### Data Representation

Suppose we have a dataset of $n$ observations, each with $d$ features. We represent this as a data matrix $\mathbf{X}$ with dimensions $n \times d$:

$$\mathbf{X} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_d^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_d^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(n)} & x_2^{(n)} & \cdots & x_d^{(n)} \end{bmatrix}$$

Each row is an observation $\mathbf{x}^{(i)} \in \mathbb{R}^d$, and each column represents a feature across all observations.
We assume the data is centered: the mean of each feature is zero. If not, we subtract the mean: $\tilde{\mathbf{x}} = \mathbf{x} - \bar{\mathbf{x}}$. This is critical because PCA finds directions of variance around the centroid. Without centering, we'd conflate the location of the data with its spread.
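A minimal centering step in NumPy (a sketch; the array names and numbers are only illustrative):

```python
import numpy as np

# X has shape (n, d): rows are observations, columns are features.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

X_centered = X - X.mean(axis=0)   # subtract the per-feature mean

print(X.mean(axis=0))             # roughly [5, 5, 5]
print(X_centered.mean(axis=0))    # numerically zero in every column
```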
The covariance matrix $\mathbf{S}$ completely determines the solution to PCA. It captures all pairwise relationships between features. Understanding its structure—that it's symmetric, positive semi-definite, and has real eigenvalues—is essential for understanding PCA.
We can now state the first principal component problem precisely.

### The First Principal Component

The projection of a centered data point $\mathbf{x}^{(i)}$ onto a direction $\mathbf{w}$ is the scalar $\mathbf{w}^T \mathbf{x}^{(i)}$. The variance of these projections is

$$\frac{1}{n} \sum_{i=1}^{n} \left( \mathbf{w}^T \mathbf{x}^{(i)} \right)^2 = \mathbf{w}^T \left( \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}^{(i)} (\mathbf{x}^{(i)})^T \right) \mathbf{w} = \mathbf{w}^T \mathbf{S} \mathbf{w},$$

where $\mathbf{S}$ is the covariance matrix of the data.

Problem: Find the unit vector $\mathbf{w}_1$ that maximizes the variance of the projections:

$$\mathbf{w}_1 = \arg\max_{\mathbf{w}} \; \mathbf{w}^T \mathbf{S} \mathbf{w} \quad \text{subject to} \quad \mathbf{w}^T \mathbf{w} = 1$$

The constraint $\mathbf{w}^T \mathbf{w} = 1$ is crucial. Without it, we could make the variance arbitrarily large simply by scaling $\mathbf{w}$. We want a direction, and directions are represented by unit vectors.
The fact that principal components are eigenvectors of the covariance matrix is not a computational trick—it emerges naturally from the optimization. This is one of the most elegant results in applied mathematics: a statistical question (maximize variance) has a geometric answer (eigenvectors).
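To make this concrete, here is the standard Lagrange multiplier argument, sketched briefly. Introduce a multiplier $\lambda$ for the unit-norm constraint:

$$\mathcal{L}(\mathbf{w}, \lambda) = \mathbf{w}^T \mathbf{S} \mathbf{w} - \lambda \left( \mathbf{w}^T \mathbf{w} - 1 \right)$$

Setting the gradient with respect to $\mathbf{w}$ to zero gives

$$2\mathbf{S}\mathbf{w} - 2\lambda\mathbf{w} = \mathbf{0} \quad \Longrightarrow \quad \mathbf{S}\mathbf{w} = \lambda\mathbf{w},$$

so any stationary point is an eigenvector of $\mathbf{S}$. Substituting back, the variance attained is $\mathbf{w}^T \mathbf{S} \mathbf{w} = \lambda \mathbf{w}^T \mathbf{w} = \lambda$: each candidate direction captures exactly its eigenvalue's worth of variance, so the maximum is achieved by the eigenvector with the largest eigenvalue $\lambda_1$.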
The first principal component captures the direction of maximum variance. But we typically want more than one component. How do we find the second, third, and subsequent components?

### The Orthogonality Requirement

If we simply re-ran the optimization for the second component, we'd get the same answer—the first component already maximizes variance! We need an additional constraint: the second component must be orthogonal to the first.

The problem for the second principal component is:

$$\mathbf{w}_2 = \arg\max_{\mathbf{w}} \; \mathbf{w}^T \mathbf{S} \mathbf{w} \quad \text{subject to} \quad \mathbf{w}^T \mathbf{w} = 1, \quad \mathbf{w}^T \mathbf{w}_1 = 0$$
This pattern continues: the $k$-th principal component is the eigenvector of $\mathbf{S}$ with the $k$-th largest eigenvalue. All principal components are simply the eigenvectors of the covariance matrix, ordered by eigenvalue magnitude. This is why computing PCA reduces to eigendecomposition.
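Putting the recipe together, here is a minimal NumPy sketch (illustrative code; the function and variable names are my own, not from any library):

```python
import numpy as np

def pca_eig(X, k):
    """PCA of an (n, d) data matrix via eigendecomposition of the covariance matrix.

    Returns the top-k principal directions (as columns) and their eigenvalues.
    """
    X_centered = X - X.mean(axis=0)            # center each feature
    n = X_centered.shape[0]
    S = (X_centered.T @ X_centered) / n        # covariance matrix, shape (d, d)
    eigvals, eigvecs = np.linalg.eigh(S)       # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]          # sort eigenvalues in descending order
    return eigvecs[:, order[:k]], eigvals[order[:k]]

# Example: project 3-D data onto its top 2 principal components.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])
W, lam = pca_eig(X, k=2)
Z = (X - X.mean(axis=0)) @ W                   # principal component scores
print(lam)                                     # variances along the top-2 directions
```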
The algebra tells us that principal components are eigenvectors of the covariance matrix. But what does this mean geometrically? The visual intuition is remarkably powerful.

### The Data Cloud

Visualize your $d$-dimensional data as a cloud of points. If features are correlated, this cloud is elongated—stretched more in some directions than others. For roughly Gaussian data, it forms an ellipsoid.
| Mathematical Object | Geometric Meaning | Statistical Meaning |
|---|---|---|
| Eigenvector $\mathbf{w}_k$ | $k$-th principal axis of data ellipsoid | Direction of $k$-th most variance |
| Eigenvalue $\lambda_k$ | Squared length of $k$-th semi-axis | Variance along $k$-th principal direction |
| $\sqrt{\lambda_k}$ | Length of $k$-th semi-axis | Standard deviation along $k$-th PC |
| Projection $z_k = \mathbf{w}_k^T \mathbf{x}$ | Coordinate along $k$-th axis | $k$-th principal component score |
PCA transforms correlated features into uncorrelated principal components. This is valuable because many machine learning algorithms (like linear regression) suffer when features are highly correlated (multicollinearity). PCA-transformed features are guaranteed to be uncorrelated.
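A quick numerical check of this uncorrelatedness claim (a NumPy sketch with made-up data):

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated 2-D data: the second feature is a noisy copy of the first.
x1 = rng.normal(size=500)
X = np.column_stack([x1, 0.8 * x1 + 0.2 * rng.normal(size=500)])

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(Xc)
eigvals, W = np.linalg.eigh(S)       # columns of W are the principal directions

Z = Xc @ W                           # principal component scores
print(np.cov(Z, rowvar=False))       # off-diagonal entries are numerically zero
```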
The covariance matrix is the heart of PCA. Understanding its structure reveals why PCA works and what its limitations are.

### Definition and Properties

For centered data, the covariance matrix is:

$$\mathbf{S} = \frac{1}{n} \mathbf{X}^T \mathbf{X} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}^{(i)} (\mathbf{x}^{(i)})^T$$

Each entry $S_{jk}$ is the covariance between features $j$ and $k$:

$$S_{jk} = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)} x_k^{(i)}$$

The diagonal entries $S_{jj}$ are the variances of individual features.
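A small NumPy check of this definition (a sketch; note that the $\frac{1}{n}$ convention used here corresponds to `bias=True` in `np.cov`, which otherwise defaults to the $\frac{1}{n-1}$ estimator):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)                 # center each feature

S_manual = Xc.T @ Xc / len(Xc)          # (1/n) X^T X
S_numpy = np.cov(Xc, rowvar=False, bias=True)

print(np.allclose(S_manual, S_numpy))                    # True
print(np.allclose(S_manual, S_manual.T))                 # symmetric
print(np.all(np.linalg.eigvalsh(S_manual) >= -1e-12))    # PSD, up to rounding
```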
When $d \gg n$, PCA can still be computed but interpretation changes. Many eigenvalues will be zero or near-zero, not because those directions are noise-free, but because we don't have enough data to estimate variance in those directions. This is a fundamental limitation, not a bug in PCA.
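A small illustration of this point (dimensions chosen arbitrarily): with $n$ centered observations, $\mathbf{S}$ has rank at most $n-1$, so at most $n-1$ eigenvalues can be nonzero no matter how large $d$ is.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 20, 500                       # far fewer observations than features
X = rng.normal(size=(n, d))
Xc = X - X.mean(axis=0)

S = Xc.T @ Xc / n                    # (d, d) covariance matrix
eigvals = np.linalg.eigvalsh(S)
print(np.sum(eigvals > 1e-10))       # at most n - 1 = 19 nonzero eigenvalues
```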
Understanding the computational aspects of PCA is essential for applying it to real-world datasets. The choice of algorithm depends critically on the dimensions of your problem.

### Approach 1: Full Eigendecomposition

The direct approach:

1. Compute the covariance matrix $\mathbf{S} = \frac{1}{n}\mathbf{X}^T\mathbf{X}$ — $O(nd^2)$
2. Eigendecompose $\mathbf{S}$ — $O(d^3)$

Total: $O(nd^2 + d^3)$

This works well when $d$ is moderate (hundreds to low thousands).
| Method | Time Complexity | Space Complexity | Best When |
|---|---|---|---|
| Full Eigendecomposition | $O(nd^2 + d^3)$ | $O(d^2)$ | $d$ is small |
| Full SVD | $O(\min(nd^2, n^2d))$ | $O(nd)$ | $n$ and $d$ moderate |
| Truncated SVD | $O(ndk)$ | $O((n+d)k)$ | Only need $k \ll d$ components |
| Randomized SVD | $O(nd \log k)$ | $O((n+d)k)$ | Large datasets, approximate OK |
| Incremental PCA | $O(ndk)$ per pass | $O(dk)$ | Data doesn't fit in memory |
For most applications, use your library's PCA implementation (scikit-learn, NumPy, etc.). These handle the algorithm selection automatically. Only worry about computational details if you hit memory or time limits—then consider randomized or incremental variants.
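For instance, a typical scikit-learn call looks something like this (a brief sketch; the data here is random and only for illustration, and `svd_solver="auto"` lets the library choose among the strategies in the table above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 50))

pca = PCA(n_components=10, svd_solver="auto")  # library picks full vs. randomized SVD
Z = pca.fit_transform(X)                       # (1000, 10) principal component scores

print(pca.explained_variance_[:3])   # variances (eigenvalues) of the top components
print(pca.components_.shape)         # (10, 50): principal directions as rows
```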
We've developed the maximum variance formulation of PCA from first principles. Let's consolidate the key insights:

- Variance is a principled proxy for information: directions along which the data barely varies contribute little to distinguishing observations.
- The first principal component is the unit vector $\mathbf{w}_1$ maximizing $\mathbf{w}^T \mathbf{S} \mathbf{w}$; the constraint $\mathbf{w}^T \mathbf{w} = 1$ makes it a direction rather than an arbitrary scaling.
- The optimization is solved by eigendecomposition: the $k$-th principal component is the eigenvector of the covariance matrix with the $k$-th largest eigenvalue, and that eigenvalue equals the variance captured along it.
- Geometrically, the principal components are the axes of the data ellipsoid; statistically, the resulting component scores are uncorrelated.
- Computationally, PCA reduces to an eigendecomposition or SVD, with truncated, randomized, and incremental variants for large problems.
The maximum variance formulation tells us what PCA computes. In the next page, we'll explore an equivalent but different perspective: minimum reconstruction error. This view reveals PCA as the optimal linear projection for data compression—a perspective that connects to autoencoders and helps us understand what information PCA discards.