In the previous page, we derived PCA by asking: Which directions capture the most variance? This led us to eigenvectors of the covariance matrix. Now we'll approach PCA from a completely different angle—one that illuminates its nature as an optimal compression scheme.
Imagine you're an engineer tasked with compressing sensor data for transmission over a bandwidth-limited channel. You have 100 sensors, but can only transmit 10 numbers per time step. Your goal: choose those 10 numbers such that the original 100 sensor readings can be reconstructed as accurately as possible on the receiving end.
This is the minimum reconstruction error perspective on PCA. We'll see that minimizing reconstruction error and maximizing variance are two sides of the same coin—and this equivalence reveals deep insights about what PCA preserves and what it discards.
By the end of this page, you will understand the reconstruction error formulation of PCA, prove that it yields the same solution as variance maximization, develop geometric intuition for projections and reconstructions, and see how this view connects PCA to autoencoders and compression theory.
Let's formalize what it means to 'compress' and 'reconstruct' data using linear operations.
We have data points $\mathbf{x}^{(i)} \in \mathbb{R}^d$ (centered, as usual). We want to:

- Encode each $\mathbf{x}^{(i)}$ into a lower-dimensional code $\mathbf{z}^{(i)} \in \mathbb{R}^k$ with $k < d$
- Decode each code back into a reconstruction $\hat{\mathbf{x}}^{(i)} \in \mathbb{R}^d$ that is as close to the original as possible
We restrict ourselves to linear encoders and decoders:
$$\mathbf{z}^{(i)} = \mathbf{W}^T \mathbf{x}^{(i)} \quad (\text{Encoding: } d \to k)$$ $$\hat{\mathbf{x}}^{(i)} = \mathbf{W} \mathbf{z}^{(i)} \quad (\text{Decoding: } k \to d)$$
Here, $\mathbf{W}$ is a $d \times k$ matrix whose columns are the encoding directions. We'll require its columns to be orthonormal: $\mathbf{W}^T \mathbf{W} = \mathbf{I}_k$.
The orthonormality constraint serves two purposes: (1) it prevents redundancy—non-orthogonal columns would encode overlapping information, and (2) it makes the decoder uniquely determined by the encoder. With orthonormal $\mathbf{W}$, the encoding-decoding pair represents projection onto a $k$-dimensional subspace followed by embedding back into $\mathbb{R}^d$.
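As a minimal NumPy sketch of this encode/decode pipeline (the toy data and variable names are my own; rows are observations, so the per-vector encoding $\mathbf{W}^T\mathbf{x}$ becomes the matrix product `X @ W`):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy centered data: n = 200 observations, d = 5 features.
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)

# An arbitrary orthonormal encoder W (d x k), built by QR-orthonormalizing
# random columns. PCA will later tell us the *optimal* choice of W.
k = 2
W, _ = np.linalg.qr(rng.normal(size=(5, k)))

Z = X @ W        # encode: R^5 -> R^2 for each observation
X_hat = Z @ W.T  # decode: embed the codes back into R^5
```

Any orthonormal `W` gives a valid encoder/decoder pair; the question the rest of the page answers is which `W` reconstructs best.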
Combining encoding and decoding:
$$\hat{\mathbf{x}}^{(i)} = \mathbf{W} \mathbf{W}^T \mathbf{x}^{(i)} = \mathbf{P} \mathbf{x}^{(i)}$$
where $\mathbf{P} = \mathbf{W}\mathbf{W}^T$ is a projection matrix. It projects data onto the $k$-dimensional subspace spanned by the columns of $\mathbf{W}$.
Properties of $\mathbf{P}$:

- Symmetric: $\mathbf{P}^T = \mathbf{P}$
- Idempotent: $\mathbf{P}^2 = \mathbf{P}$ (projecting a second time changes nothing)
- Rank $k$: its eigenvalues are $k$ ones and $d - k$ zeros
The reconstruction $\hat{\mathbf{x}}^{(i)}$ is the orthogonal projection of $\mathbf{x}^{(i)}$ onto the subspace spanned by $\mathbf{W}$'s columns.
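These projection properties are easy to confirm numerically. A small sketch (dimensions chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Orthonormal columns: d = 6, k = 3.
W, _ = np.linalg.qr(rng.normal(size=(6, 3)))
P = W @ W.T  # projection matrix onto the column space of W

assert np.allclose(P, P.T)             # symmetric
assert np.allclose(P @ P, P)           # idempotent: P^2 = P
assert np.linalg.matrix_rank(P) == 3   # rank k
```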
How do we measure the quality of our compression? The natural choice is the reconstruction error—how different is $\hat{\mathbf{x}}$ from the original $\mathbf{x}$?
For a single observation, the reconstruction error is:
$$\mathbf{e}^{(i)} = \mathbf{x}^{(i)} - \hat{\mathbf{x}}^{(i)} = \mathbf{x}^{(i)} - \mathbf{W}\mathbf{W}^T\mathbf{x}^{(i)} = (\mathbf{I} - \mathbf{W}\mathbf{W}^T)\mathbf{x}^{(i)}$$
The squared error magnitude is:
$$\|\mathbf{e}^{(i)}\|^2 = \|\mathbf{x}^{(i)} - \hat{\mathbf{x}}^{(i)}\|^2$$
Averaging over all observations, the mean squared reconstruction error is:
$$\mathcal{J}(\mathbf{W}) = \frac{1}{n} \sum_{i=1}^{n} \|\mathbf{x}^{(i)} - \hat{\mathbf{x}}^{(i)}\|^2 = \frac{1}{n} \sum_{i=1}^{n} \|\mathbf{x}^{(i)} - \mathbf{W}\mathbf{W}^T\mathbf{x}^{(i)}\|^2$$
Our goal: find the orthonormal $\mathbf{W}$ that minimizes $\mathcal{J}(\mathbf{W})$.
The Minimum Reconstruction Error Problem:
$$\mathbf{W}^* = \arg\min_{\mathbf{W}} \frac{1}{n} \sum_{i=1}^{n} \|\mathbf{x}^{(i)} - \mathbf{W}\mathbf{W}^T\mathbf{x}^{(i)}\|^2 \quad \text{s.t.} \quad \mathbf{W}^T\mathbf{W} = \mathbf{I}_k$$
Geometrically, we're choosing a $k$-dimensional subspace (a hyperplane through the origin) such that the sum of squared distances from data points to the subspace is minimized. Each $\hat{\mathbf{x}}^{(i)}$ is the closest point on the subspace to $\mathbf{x}^{(i)}$—it's the orthogonal projection.
Now we'll prove the beautiful result that links variance maximization to error minimization. The key is a decomposition of total variance.
For any orthonormal $\mathbf{W}$, we can write:
$$\mathbf{x}^{(i)} = \hat{\mathbf{x}}^{(i)} + \mathbf{e}^{(i)} = \mathbf{W}\mathbf{W}^T\mathbf{x}^{(i)} + (\mathbf{I} - \mathbf{W}\mathbf{W}^T)\mathbf{x}^{(i)}$$
This splits each data point into:

- A reconstruction component $\hat{\mathbf{x}}^{(i)}$, lying in the subspace spanned by $\mathbf{W}$'s columns
- An error (residual) component $\mathbf{e}^{(i)}$, lying in the orthogonal complement of that subspace
Since $\hat{\mathbf{x}}^{(i)}$ and $\mathbf{e}^{(i)}$ are orthogonal (projection property), the Pythagorean theorem applies:
$$\|\mathbf{x}^{(i)}\|^2 = \|\hat{\mathbf{x}}^{(i)}\|^2 + \|\mathbf{e}^{(i)}\|^2$$
Summing over all observations and dividing by $n$:
$$\frac{1}{n}\sum_i \|\mathbf{x}^{(i)}\|^2 = \frac{1}{n}\sum_i \|\hat{\mathbf{x}}^{(i)}\|^2 + \frac{1}{n}\sum_i \|\mathbf{e}^{(i)}\|^2$$
In terms of variances (the data is centered, so the mean is the origin and $\frac{1}{n}\sum_i \|\mathbf{x}^{(i)}\|^2$ is the total variance):
$$\text{Total Variance} = \text{Preserved Variance} + \text{Reconstruction Error}$$
This is the key insight: Total variance is constant. It's a property of the data, independent of our choice of $\mathbf{W}$. Therefore:
$$\text{Minimize Error} \Leftrightarrow \text{Maximize Preserved Variance}$$
The two formulations are exactly equivalent!
Let's make this precise. The total variance is:
$$\text{Total} = \frac{1}{n}\sum_i \|\mathbf{x}^{(i)}\|^2 = \text{tr}(\mathbf{S}) = \sum_{j=1}^{d} \lambda_j$$
where $\text{tr}(\mathbf{S})$ is the trace of the covariance matrix and $\lambda_j$ are its eigenvalues.
The preserved variance (variance of reconstructed points) equals the variance of projections onto the subspace. If $\mathbf{W} = [\mathbf{w}_1 | \cdots | \mathbf{w}_k]$ with orthonormal columns:
$$\text{Preserved} = \frac{1}{n}\sum_i \|\hat{\mathbf{x}}^{(i)}\|^2 = \sum_{j=1}^{k} \mathbf{w}_j^T \mathbf{S} \mathbf{w}_j = \text{tr}(\mathbf{W}^T \mathbf{S} \mathbf{W})$$
Since Total is fixed:
$$\text{Error} = \text{Total} - \text{Preserved} = \text{tr}(\mathbf{S}) - \text{tr}(\mathbf{W}^T \mathbf{S} \mathbf{W})$$
Minimizing error ⟺ maximizing $\text{tr}(\mathbf{W}^T \mathbf{S} \mathbf{W})$, the preserved variance.
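This trace identity holds for *any* orthonormal $\mathbf{W}$, not just the optimal one, which makes it easy to sanity-check numerically. A sketch with toy data (names are my own):

```python
import numpy as np

rng = np.random.default_rng(2)

# Centered toy data and its covariance matrix S.
X = rng.normal(size=(500, 4))
X -= X.mean(axis=0)
n = X.shape[0]
S = X.T @ X / n

# Any orthonormal W (d = 4, k = 2) -- not necessarily the eigenvectors.
W, _ = np.linalg.qr(rng.normal(size=(4, 2)))

# Mean squared reconstruction error, computed directly from the data...
X_hat = X @ W @ W.T
J = np.mean(np.sum((X - X_hat) ** 2, axis=1))

# ...matches tr(S) - tr(W^T S W), computed purely from the covariance.
assert np.isclose(J, np.trace(S) - np.trace(W.T @ S @ W))
```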
We've shown that minimizing reconstruction error is equivalent to maximizing:
$$\text{tr}(\mathbf{W}^T \mathbf{S} \mathbf{W}) = \sum_{j=1}^{k} \mathbf{w}_j^T \mathbf{S} \mathbf{w}_j$$
subject to $\mathbf{W}^T\mathbf{W} = \mathbf{I}_k$ (orthonormal columns).
For each column $\mathbf{w}_j$, we're maximizing $\mathbf{w}_j^T \mathbf{S} \mathbf{w}_j$ subject to $\|\mathbf{w}_j\| = 1$. This is the Rayleigh quotient, and from our variance maximization derivation, we know the maximum is achieved when $\mathbf{w}_j$ is an eigenvector of $\mathbf{S}$.
With the orthogonality constraints between columns, the solution is:
$$\mathbf{W}^* = [\mathbf{w}_1 | \mathbf{w}_2 | \cdots | \mathbf{w}_k]$$
where $\mathbf{w}_j$ is the eigenvector of $\mathbf{S}$ with the $j$-th largest eigenvalue $\lambda_j$.
With the optimal $\mathbf{W}^*$, the minimum reconstruction error is:
$$\mathcal{J}^* = \text{tr}(\mathbf{S}) - \sum_{j=1}^{k} \lambda_j = \sum_{j=k+1}^{d} \lambda_j$$
The error equals the sum of the discarded eigenvalues—the eigenvalues corresponding to the principal components we didn't include.
Similarly, the preserved variance is:
$$\text{Preserved Variance} = \sum_{j=1}^{k} \lambda_j$$
The preserved variance equals the sum of the kept eigenvalues.
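A quick numerical check of both identities, using the eigendecomposition of a toy covariance matrix (data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Correlated centered toy data in d = 5 dimensions.
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
X -= X.mean(axis=0)
S = X.T @ X / X.shape[0]

# eigh returns ascending eigenvalues; sort descending.
lam, V = np.linalg.eigh(S)
lam, V = lam[::-1], V[:, ::-1]

# Optimal encoder: top-k eigenvectors of S.
k = 2
W = V[:, :k]
X_hat = X @ W @ W.T

J = np.mean(np.sum((X - X_hat) ** 2, axis=1))      # reconstruction error
preserved = np.mean(np.sum(X_hat ** 2, axis=1))    # preserved variance

assert np.isclose(J, lam[k:].sum())         # error = sum of discarded eigenvalues
assert np.isclose(preserved, lam[:k].sum()) # preserved = sum of kept eigenvalues
```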
Each eigenvalue $\lambda_j$ represents the variance along principal component $j$. Including the $j$-th PC in our encoding preserves $\lambda_j$ units of variance and prevents $\lambda_j$ units of reconstruction error. This makes choosing $k$ (how many components to keep) interpretable: we're doing a cost-benefit analysis where each additional component provides $\lambda_j$ benefit.
| Components $k$ | Preserved Variance | Reconstruction Error | % Variance Explained |
|---|---|---|---|
| $k = 0$ | 0 | $\sum_{j=1}^{d} \lambda_j$ | 0% |
| $k = 1$ | $\lambda_1$ | $\sum_{j=2}^{d} \lambda_j$ | $\lambda_1 / \sum \lambda$ |
| $k = 2$ | $\lambda_1 + \lambda_2$ | $\sum_{j=3}^{d} \lambda_j$ | $(\lambda_1 + \lambda_2) / \sum \lambda$ |
| $k = d$ | $\sum_{j=1}^{d} \lambda_j$ | 0 | 100% |
The reconstruction error perspective offers beautiful geometric intuition. Let's visualize what's happening in different scenarios.
Consider 2D data points and their projection onto a line through the origin. For each point:

- The reconstruction $\hat{\mathbf{x}}$ is the closest point on the line (the foot of the perpendicular)
- The error magnitude $\|\mathbf{e}\|$ is the perpendicular distance from the point to the line
The sum of squared distances from all points to the line is the reconstruction error. PCA finds the line that minimizes this total squared distance—equivalently, the line that passes through the 'middle' of the data cloud in a least-squares sense.
With 3D data projected to 2D, we're finding the best-fit plane through the origin. Each point's error is its perpendicular distance to the plane.
The optimal plane is spanned by the two eigenvectors with largest eigenvalues—the first two principal components. Points project onto this plane, and the distances from the original points to their projections are minimized.
PCA uses orthogonal projection—each point projects to its closest point on the subspace. This is different from oblique projections, which map points along non-perpendicular directions. The orthogonality is why the Pythagorean theorem applies and why each reconstruction is orthogonal to its residual.
Just as important as the $k$-dimensional subspace we project onto is the $(d-k)$-dimensional residual subspace we project away from. The residual subspace is spanned by the remaining eigenvectors $\mathbf{w}_{k+1}, \ldots, \mathbf{w}_d$.
The error vector $\mathbf{e}^{(i)}$ lies entirely in this residual subspace. If we wanted to perfectly reconstruct the data, we'd need both:
$$\mathbf{x}^{(i)} = \underbrace{\mathbf{W}_k \mathbf{W}_k^T \mathbf{x}^{(i)}}_{\text{kept}} + \underbrace{\mathbf{W}_{-k} \mathbf{W}_{-k}^T \mathbf{x}^{(i)}}_{\text{discarded}}$$
where $\mathbf{W}_{-k}$ contains the remaining $d-k$ eigenvectors.
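This two-part decomposition is exact, which a short NumPy sketch can confirm (toy data assumed):

```python
import numpy as np

rng = np.random.default_rng(7)

X = rng.normal(size=(50, 4))
X -= X.mean(axis=0)

# Full eigenvector basis of the covariance, sorted descending.
lam, V = np.linalg.eigh(X.T @ X / X.shape[0])
V = V[:, ::-1]

k = 2
Wk, Wrest = V[:, :k], V[:, k:]   # kept vs. discarded eigenvectors

kept = X @ Wk @ Wk.T             # projection onto the principal subspace
discarded = X @ Wrest @ Wrest.T  # projection onto the residual subspace

# The two projections are orthogonal and sum back to X exactly.
assert np.allclose(kept + discarded, X)
```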
The reconstruction error perspective reveals PCA as a special case of a broader class of models: autoencoders. Understanding this connection illuminates both what PCA does and what its limitations are.
An autoencoder consists of:

- An encoder $f: \mathbb{R}^d \to \mathbb{R}^k$ that compresses the input into a low-dimensional code
- A decoder $g: \mathbb{R}^k \to \mathbb{R}^d$ that maps the code back to a reconstruction
The model is trained to minimize reconstruction loss: $\|\mathbf{x} - g(f(\mathbf{x}))\|^2$
PCA is the specific case where both $f$ and $g$ are linear:
| Aspect | PCA | Neural Autoencoder |
|---|---|---|
| Encoder | Linear: $\mathbf{W}^T \mathbf{x}$ | Nonlinear: Neural network |
| Decoder | Linear: $\mathbf{W} \mathbf{z}$ | Nonlinear: Neural network |
| Training | Closed-form (eigendecomposition) | Gradient descent |
| Learned representation | Linear subspace | Nonlinear manifold |
| Optimality | Global optimum guaranteed | Local optima possible |
| Capacity | Linear relationships only | Arbitrary nonlinear relationships |
A remarkable result: if you train a neural-network autoencoder with linear activations (no nonlinearities) by gradient descent on the reconstruction error, the encoder weights converge to span the same subspace as the top $k$ principal components—though not necessarily the eigenvector basis itself, since any rotation of that basis within the subspace achieves the same loss. The linear reconstruction objective has no spurious local minima, so gradient descent reaches a global optimum.
Among all linear encoders/decoders, PCA minimizes reconstruction error. This is a strong optimality statement: no clever choice of linear encoding can do better than the eigenvector solution.
Theorem (Eckart-Young-Mirsky): The best rank-$k$ approximation to a matrix (in Frobenius norm) is given by keeping the top $k$ singular value components. For centered data, this is equivalent to the PCA solution.
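The equivalence can be checked directly: the truncated-SVD reconstruction of a centered data matrix coincides with the PCA reconstruction. A sketch (toy data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(4)

X = rng.normal(size=(100, 6))
X -= X.mean(axis=0)

k = 3
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Best rank-k approximation per Eckart-Young: keep top k singular triplets.
X_svd = (U[:, :k] * s[:k]) @ Vt[:k]

# PCA reconstruction: right singular vectors are the covariance eigenvectors.
W = Vt[:k].T
X_pca = X @ W @ W.T

assert np.allclose(X_svd, X_pca)
```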
This theorem tells us that:

- PCA's optimality is even stronger than stated so far: among *all* rank-$k$ approximations, not just orthogonal projections, none achieves lower Frobenius-norm error than keeping the top $k$ components.
- PCA can be computed directly from the SVD of the centered data matrix, without ever forming the covariance matrix.
Understanding the structure of reconstruction errors provides valuable diagnostic information and can reveal when PCA is appropriate or inappropriate for a dataset.
For each observation, we can compute:
$$e_i = \|\mathbf{x}^{(i)} - \hat{\mathbf{x}}^{(i)}\| = \sqrt{\sum_{j=k+1}^{d} (z_j^{(i)})^2}$$
where $z_j^{(i)} = \mathbf{w}_j^T \mathbf{x}^{(i)}$ are the scores on the discarded components.
Points with high reconstruction error are outliers with respect to the principal subspace—they have unusual values on the less important components.
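This makes per-point reconstruction error a simple outlier score. A sketch with a deliberately planted outlier (all data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Most variance lies in the first two dimensions; the third is nearly flat.
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])
X[0] += np.array([0.0, 0.0, 5.0])  # plant an outlier off the principal plane
X -= X.mean(axis=0)

# Top-2 principal directions from the covariance matrix.
S = X.T @ X / X.shape[0]
lam, V = np.linalg.eigh(S)
W = V[:, ::-1][:, :2]

# Per-point reconstruction error = distance to the principal plane.
residual = X - X @ W @ W.T
errs = np.linalg.norm(residual, axis=1)

assert errs.argmax() == 0  # the planted outlier scores highest
```

The outlier sits close to the data in the first two dimensions, so it would be invisible in the scores—only its reconstruction error gives it away.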
For the entire dataset, the total reconstruction error (residual sum of squares) is:
$$\text{RSS} = \sum_{i=1}^{n} \|\mathbf{e}^{(i)}\|^2 = n \sum_{j=k+1}^{d} \lambda_j$$
This can be normalized to a fraction of total variance:
$$\text{Fraction Unexplained} = \frac{\sum_{j=k+1}^{d} \lambda_j}{\sum_{j=1}^{d} \lambda_j}$$
The complement is the proportion of variance explained:
$$\text{Explained Variance Ratio} = 1 - \text{Fraction Unexplained} = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{d} \lambda_j}$$
This is the most common metric for choosing the number of components.
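A minimal sketch of this selection rule (the 95% threshold and toy spectrum are my own choices, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy data with a decaying variance spectrum.
X = rng.normal(size=(400, 6)) @ np.diag([4.0, 2.0, 1.0, 0.5, 0.2, 0.1])
X -= X.mean(axis=0)

# Covariance eigenvalues, sorted descending.
lam = np.linalg.eigvalsh(X.T @ X / X.shape[0])[::-1]

# Cumulative explained variance ratio for k = 1, ..., d.
evr = np.cumsum(lam) / lam.sum()

# Smallest k whose cumulative ratio reaches the threshold.
k = int(np.searchsorted(evr, 0.95) + 1)
```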
Low reconstruction error doesn't necessarily mean PCA has captured meaningful structure. If the true structure is nonlinear, PCA might achieve low error by capturing the linear component while missing the important nonlinear relationships. Always combine quantitative metrics with qualitative inspection of results.
We've developed the minimum reconstruction error formulation of PCA and shown its deep connection to variance maximization. Here are the essential insights:
| Variance Maximization | Reconstruction Minimization |
|---|---|
| Find directions of max spread | Find best approximating subspace |
| Preserve signal | Minimize distortion |
| Forward: what to keep | Backward: what we lose |
| Eigenvalue = variance along PC | Eigenvalue = error prevented by PC |
Both lead to the same solution: eigenvectors of the covariance matrix, ordered by eigenvalue.
We've now seen why eigenvectors of the covariance matrix are the answer from two perspectives. In the next page, we'll dive deep into the eigenvalue problem itself—understanding eigendecomposition, the spectral theorem, and the computational methods used to find principal components efficiently.