In the previous page, we derived PCA by asking: Which directions capture the most variance? This led us to eigenvectors of the covariance matrix. Now we'll approach PCA from a completely different angle—one that illuminates its nature as an optimal compression scheme.
Imagine you're an engineer tasked with compressing sensor data for transmission over a bandwidth-limited channel. You have 100 sensors, but can only transmit 10 numbers per time step. Your goal: choose those 10 numbers such that the original 100 sensor readings can be reconstructed as accurately as possible on the receiving end.
This is the minimum reconstruction error perspective on PCA. We'll see that minimizing reconstruction error and maximizing variance are two sides of the same coin—and this equivalence reveals deep insights about what PCA preserves and what it discards.
By the end of this page, you will understand the reconstruction error formulation of PCA, prove that it yields the same solution as variance maximization, develop geometric intuition for projections and reconstructions, and see how this view connects PCA to autoencoders and compression theory.
Let's formalize what it means to 'compress' and 'reconstruct' data using linear operations.
We have data points $\mathbf{x}^{(i)} \in \mathbb{R}^d$ (centered, as usual). We want to:

- Encode each $\mathbf{x}^{(i)}$ into a lower-dimensional code $\mathbf{z}^{(i)} \in \mathbb{R}^k$ with $k < d$
- Decode each code back into a reconstruction $\hat{\mathbf{x}}^{(i)} \in \mathbb{R}^d$ that is as close to the original as possible
We restrict ourselves to linear encoders and decoders:
$$\mathbf{z}^{(i)} = \mathbf{W}^T \mathbf{x}^{(i)} \quad (\text{Encoding: } d \to k)$$ $$\hat{\mathbf{x}}^{(i)} = \mathbf{W} \mathbf{z}^{(i)} \quad (\text{Decoding: } k \to d)$$
Here, $\mathbf{W}$ is a $d \times k$ matrix whose columns are the encoding directions. We'll require its columns to be orthonormal: $\mathbf{W}^T \mathbf{W} = \mathbf{I}_k$.
The orthonormality constraint serves two purposes: (1) it prevents redundancy—non-orthogonal columns would encode overlapping information, and (2) it makes the decoder uniquely determined by the encoder. With orthonormal $\mathbf{W}$, the encoding-decoding pair represents projection onto a $k$-dimensional subspace followed by embedding back into $\mathbb{R}^d$.
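As a minimal NumPy sketch of this encode/decode pipeline (the toy data and variable names are my own; rows are observations, so the per-vector encoding $\mathbf{W}^T\mathbf{x}$ becomes the matrix product `X @ W`):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy centered data: n = 200 observations, d = 5 features.
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)

# An arbitrary orthonormal encoder W (d x k), built by QR-orthonormalizing
# random columns. PCA will later tell us the *optimal* choice of W.
k = 2
W, _ = np.linalg.qr(rng.normal(size=(5, k)))

Z = X @ W        # encode: R^5 -> R^2 for each observation
X_hat = Z @ W.T  # decode: embed the codes back into R^5
```

Any orthonormal `W` gives a valid encoder/decoder pair; the question the rest of the page answers is which `W` reconstructs best.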
Combining encoding and decoding:
$$\hat{\mathbf{x}}^{(i)} = \mathbf{W} \mathbf{W}^T \mathbf{x}^{(i)} = \mathbf{P} \mathbf{x}^{(i)}$$
where $\mathbf{P} = \mathbf{W}\mathbf{W}^T$ is a projection matrix. It projects data onto the $k$-dimensional subspace spanned by the columns of $\mathbf{W}$.
Properties of $\mathbf{P}$:

- Symmetric: $\mathbf{P}^T = \mathbf{P}$
- Idempotent: $\mathbf{P}^2 = \mathbf{P}$ (projecting a second time changes nothing)
- Rank $k$: its eigenvalues are $k$ ones and $d - k$ zeros
The reconstruction $\hat{\mathbf{x}}^{(i)}$ is the orthogonal projection of $\mathbf{x}^{(i)}$ onto the subspace spanned by $\mathbf{W}$'s columns.
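These projection properties are easy to confirm numerically. A small sketch (dimensions chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Orthonormal columns: d = 6, k = 3.
W, _ = np.linalg.qr(rng.normal(size=(6, 3)))
P = W @ W.T  # projection matrix onto the column space of W

assert np.allclose(P, P.T)             # symmetric
assert np.allclose(P @ P, P)           # idempotent: P^2 = P
assert np.linalg.matrix_rank(P) == 3   # rank k
```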
How do we measure the quality of our compression? The natural choice is the reconstruction error—how different is $\hat{\mathbf{x}}$ from the original $\mathbf{x}$?
For a single observation, the reconstruction error is:
$$\mathbf{e}^{(i)} = \mathbf{x}^{(i)} - \hat{\mathbf{x}}^{(i)} = \mathbf{x}^{(i)} - \mathbf{W}\mathbf{W}^T\mathbf{x}^{(i)} = (\mathbf{I} - \mathbf{W}\mathbf{W}^T)\mathbf{x}^{(i)}$$
The squared error magnitude is:
$$\|\mathbf{e}^{(i)}\|^2 = \|\mathbf{x}^{(i)} - \hat{\mathbf{x}}^{(i)}\|^2$$
Averaging over all observations, the mean squared reconstruction error is:
$$\mathcal{J}(\mathbf{W}) = \frac{1}{n} \sum_{i=1}^{n} \|\mathbf{x}^{(i)} - \hat{\mathbf{x}}^{(i)}\|^2 = \frac{1}{n} \sum_{i=1}^{n} \|\mathbf{x}^{(i)} - \mathbf{W}\mathbf{W}^T\mathbf{x}^{(i)}\|^2$$
Our goal: find the orthonormal $\mathbf{W}$ that minimizes $\mathcal{J}(\mathbf{W})$.
The Minimum Reconstruction Error Problem:
$$\mathbf{W}^* = \arg\min_{\mathbf{W}} \frac{1}{n} \sum_{i=1}^{n} \|\mathbf{x}^{(i)} - \mathbf{W}\mathbf{W}^T\mathbf{x}^{(i)}\|^2 \quad \text{s.t.} \quad \mathbf{W}^T\mathbf{W} = \mathbf{I}_k$$
Geometrically, we're choosing a $k$-dimensional subspace (a hyperplane through the origin) such that the sum of squared distances from data points to the subspace is minimized. Each $\hat{\mathbf{x}}^{(i)}$ is the closest point on the subspace to $\mathbf{x}^{(i)}$—it's the orthogonal projection.
Now we'll prove the beautiful result that links variance maximization to error minimization. The key is a decomposition of total variance.
For any orthonormal $\mathbf{W}$, we can write:
$$\mathbf{x}^{(i)} = \hat{\mathbf{x}}^{(i)} + \mathbf{e}^{(i)} = \mathbf{W}\mathbf{W}^T\mathbf{x}^{(i)} + (\mathbf{I} - \mathbf{W}\mathbf{W}^T)\mathbf{x}^{(i)}$$
This splits each data point into:

- A reconstruction component $\hat{\mathbf{x}}^{(i)}$, lying in the subspace spanned by $\mathbf{W}$'s columns
- An error (residual) component $\mathbf{e}^{(i)}$, lying in the orthogonal complement of that subspace
Since $\hat{\mathbf{x}}^{(i)}$ and $\mathbf{e}^{(i)}$ are orthogonal (projection property), the Pythagorean theorem applies:
$$\|\mathbf{x}^{(i)}\|^2 = \|\hat{\mathbf{x}}^{(i)}\|^2 + \|\mathbf{e}^{(i)}\|^2$$
Summing over all observations and dividing by $n$:
$$\frac{1}{n}\sum_i \|\mathbf{x}^{(i)}\|^2 = \frac{1}{n}\sum_i \|\hat{\mathbf{x}}^{(i)}\|^2 + \frac{1}{n}\sum_i \|\mathbf{e}^{(i)}\|^2$$
In terms of variances (the data is centered, so the mean is the origin and $\frac{1}{n}\sum_i \|\mathbf{x}^{(i)}\|^2$ is the total variance):
$$\text{Total Variance} = \text{Preserved Variance} + \text{Reconstruction Error}$$
This is the key insight: Total variance is constant. It's a property of the data, independent of our choice of $\mathbf{W}$. Therefore:
$$\text{Minimize Error} \Leftrightarrow \text{Maximize Preserved Variance}$$
The two formulations are exactly equivalent!
Let's make this precise. The total variance is:
$$\text{Total} = \frac{1}{n}\sum_i \|\mathbf{x}^{(i)}\|^2 = \text{tr}(\mathbf{S}) = \sum_{j=1}^{d} \lambda_j$$
where $\text{tr}(\mathbf{S})$ is the trace of the covariance matrix and $\lambda_j$ are its eigenvalues.
The preserved variance (variance of reconstructed points) equals the variance of projections onto the subspace. If $\mathbf{W} = [\mathbf{w}_1 | \cdots | \mathbf{w}_k]$ with orthonormal columns:
$$\text{Preserved} = \frac{1}{n}\sum_i \|\hat{\mathbf{x}}^{(i)}\|^2 = \sum_{j=1}^{k} \mathbf{w}_j^T \mathbf{S} \mathbf{w}_j = \text{tr}(\mathbf{W}^T \mathbf{S} \mathbf{W})$$
Since Total is fixed:
$$\text{Error} = \text{Total} - \text{Preserved} = \text{tr}(\mathbf{S}) - \text{tr}(\mathbf{W}^T \mathbf{S} \mathbf{W})$$
Minimizing error ⟺ maximizing $\text{tr}(\mathbf{W}^T \mathbf{S} \mathbf{W})$, the preserved variance.
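This trace identity holds for *any* orthonormal $\mathbf{W}$, not just the optimal one, which makes it easy to sanity-check numerically. A sketch with toy data (names are my own):

```python
import numpy as np

rng = np.random.default_rng(2)

# Centered toy data and its covariance matrix S.
X = rng.normal(size=(500, 4))
X -= X.mean(axis=0)
n = X.shape[0]
S = X.T @ X / n

# Any orthonormal W (d = 4, k = 2) -- not necessarily the eigenvectors.
W, _ = np.linalg.qr(rng.normal(size=(4, 2)))

# Mean squared reconstruction error, computed directly from the data...
X_hat = X @ W @ W.T
J = np.mean(np.sum((X - X_hat) ** 2, axis=1))

# ...matches tr(S) - tr(W^T S W), computed purely from the covariance.
assert np.isclose(J, np.trace(S) - np.trace(W.T @ S @ W))
```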
We've shown that minimizing reconstruction error is equivalent to maximizing:
$$\text{tr}(\mathbf{W}^T \mathbf{S} \mathbf{W}) = \sum_{j=1}^{k} \mathbf{w}_j^T \mathbf{S} \mathbf{w}_j$$
subject to $\mathbf{W}^T\mathbf{W} = \mathbf{I}_k$ (orthonormal columns).
For each column $\mathbf{w}_j$, we're maximizing $\mathbf{w}_j^T \mathbf{S} \mathbf{w}_j$ subject to $\|\mathbf{w}_j\| = 1$. This is the Rayleigh quotient, and from our variance maximization derivation, we know the maximum is achieved when $\mathbf{w}_j$ is an eigenvector of $\mathbf{S}$.
With the orthogonality constraints between columns, the solution is:
$$\mathbf{W}^* = [\mathbf{w}_1 | \mathbf{w}_2 | \cdots | \mathbf{w}_k]$$
where $\mathbf{w}_j$ is the eigenvector of $\mathbf{S}$ with the $j$-th largest eigenvalue $\lambda_j$.
With the optimal $\mathbf{W}^*$, the minimum reconstruction error is:
$$\mathcal{J}^* = \text{tr}(\mathbf{S}) - \sum_{j=1}^{k} \lambda_j = \sum_{j=k+1}^{d} \lambda_j$$
The error equals the sum of the discarded eigenvalues—the eigenvalues corresponding to the principal components we didn't include.
Similarly, the preserved variance is:
$$\text{Preserved Variance} = \sum_{j=1}^{k} \lambda_j$$
The preserved variance equals the sum of the kept eigenvalues.
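A quick numerical check of both identities, using the eigendecomposition of a toy covariance matrix (data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Correlated centered toy data in d = 5 dimensions.
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
X -= X.mean(axis=0)
S = X.T @ X / X.shape[0]

# eigh returns ascending eigenvalues; sort descending.
lam, V = np.linalg.eigh(S)
lam, V = lam[::-1], V[:, ::-1]

# Optimal encoder: top-k eigenvectors of S.
k = 2
W = V[:, :k]
X_hat = X @ W @ W.T

J = np.mean(np.sum((X - X_hat) ** 2, axis=1))      # reconstruction error
preserved = np.mean(np.sum(X_hat ** 2, axis=1))    # preserved variance

assert np.isclose(J, lam[k:].sum())         # error = sum of discarded eigenvalues
assert np.isclose(preserved, lam[:k].sum()) # preserved = sum of kept eigenvalues
```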
Each eigenvalue $\lambda_j$ represents the variance along principal component $j$. Including the $j$-th PC in our encoding preserves $\lambda_j$ units of variance and prevents $\lambda_j$ units of reconstruction error. This makes choosing $k$ (how many components to keep) interpretable: we're doing a cost-benefit analysis where each additional component provides $\lambda_j$ benefit.
| Components $k$ | Preserved Variance | Reconstruction Error | % Variance Explained |
|---|---|---|---|
| $k = 0$ | 0 | $\sum_{j=1}^{d} \lambda_j$ | 0% |
| $k = 1$ | $\lambda_1$ | $\sum_{j=2}^{d} \lambda_j$ | $\lambda_1 / \sum \lambda$ |
| $k = 2$ | $\lambda_1 + \lambda_2$ | $\sum_{j=3}^{d} \lambda_j$ | $(\lambda_1 + \lambda_2) / \sum \lambda$ |
| $k = d$ | $\sum_{j=1}^{d} \lambda_j$ | 0 | 100% |
The reconstruction error perspective offers beautiful geometric intuition. Let's visualize what's happening in different scenarios.
Consider 2D data points and their projection onto a line through the origin. For each point:

- The reconstruction $\hat{\mathbf{x}}$ is the closest point on the line (the foot of the perpendicular)
- The error magnitude $\|\mathbf{e}\|$ is the perpendicular distance from the point to the line
The sum of squared distances from all points to the line is the reconstruction error. PCA finds the line that minimizes this total squared distance—equivalently, the line that passes through the 'middle' of the data cloud in a least-squares sense.
With 3D data projected to 2D, we're finding the best-fit plane through the origin. Each point's error is its perpendicular distance to the plane.
The optimal plane is spanned by the two eigenvectors with largest eigenvalues—the first two principal components. Points project onto this plane, and the distances from the original points to their projections are minimized.
PCA uses orthogonal projection—each point projects to its closest point on the subspace. This is different from oblique projections, which map points along non-perpendicular directions. The orthogonality is why the Pythagorean theorem applies and why each reconstruction is orthogonal to its residual.
Just as important as the $k$-dimensional subspace we project onto is the $(d-k)$-dimensional residual subspace we project away from. The residual subspace is spanned by the remaining eigenvectors $\mathbf{w}_{k+1}, \ldots, \mathbf{w}_d$.
The error vector $\mathbf{e}^{(i)}$ lies entirely in this residual subspace. If we wanted to perfectly reconstruct the data, we'd need both:
$$\mathbf{x}^{(i)} = \underbrace{\mathbf{W}_k \mathbf{W}_k^T \mathbf{x}^{(i)}}_{\text{kept}} + \underbrace{\mathbf{W}_{-k} \mathbf{W}_{-k}^T \mathbf{x}^{(i)}}_{\text{discarded}}$$
where $\mathbf{W}_{-k}$ contains the remaining $d-k$ eigenvectors.
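This two-part decomposition is exact, which a short NumPy sketch can confirm (toy data assumed):

```python
import numpy as np

rng = np.random.default_rng(7)

X = rng.normal(size=(50, 4))
X -= X.mean(axis=0)

# Full eigenvector basis of the covariance, sorted descending.
lam, V = np.linalg.eigh(X.T @ X / X.shape[0])
V = V[:, ::-1]

k = 2
Wk, Wrest = V[:, :k], V[:, k:]   # kept vs. discarded eigenvectors

kept = X @ Wk @ Wk.T             # projection onto the principal subspace
discarded = X @ Wrest @ Wrest.T  # projection onto the residual subspace

# The two projections are orthogonal and sum back to X exactly.
assert np.allclose(kept + discarded, X)
```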
The reconstruction error perspective reveals PCA as a special case of a broader class of models: autoencoders. Understanding this connection illuminates both what PCA does and what its limitations are.
An autoencoder consists of:

- An encoder $f: \mathbb{R}^d \to \mathbb{R}^k$ that compresses the input into a low-dimensional code
- A decoder $g: \mathbb{R}^k \to \mathbb{R}^d$ that maps the code back to a reconstruction
The model is trained to minimize reconstruction loss: $\|\mathbf{x} - g(f(\mathbf{x}))\|^2$
PCA is the specific case where both $f$ and $g$ are linear:
| Aspect | PCA | Neural Autoencoder |
|---|---|---|
| Encoder | Linear: $\mathbf{W}^T \mathbf{x}$ | Nonlinear: Neural network |
| Decoder | Linear: $\mathbf{W} \mathbf{z}$ | Nonlinear: Neural network |
| Training | Closed-form (eigendecomposition) | Gradient descent |
| Learned representation | Linear subspace | Nonlinear manifold |
| Optimality | Global optimum guaranteed | Local optima possible |
| Capacity | Linear relationships only | Arbitrary nonlinear relationships |
A remarkable result: if you train a neural-network autoencoder with linear activations (no nonlinearities) by gradient descent on the reconstruction error, the encoder weights converge to span the same subspace as the top $k$ principal components—though not necessarily the eigenvector basis itself, since any rotation of that basis within the subspace achieves the same loss. The linear reconstruction objective has no spurious local minima, so gradient descent reaches a global optimum.
Among all linear encoders/decoders, PCA minimizes reconstruction error. This is a strong optimality statement: no clever choice of linear encoding can do better than the eigenvector solution.
Theorem (Eckart-Young-Mirsky): The best rank-$k$ approximation to a matrix (in Frobenius norm) is given by keeping the top $k$ singular value components. For centered data, this is equivalent to the PCA solution.
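The equivalence can be checked directly: the truncated-SVD reconstruction of a centered data matrix coincides with the PCA reconstruction. A sketch (toy data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(4)

X = rng.normal(size=(100, 6))
X -= X.mean(axis=0)

k = 3
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Best rank-k approximation per Eckart-Young: keep top k singular triplets.
X_svd = (U[:, :k] * s[:k]) @ Vt[:k]

# PCA reconstruction: right singular vectors are the covariance eigenvectors.
W = Vt[:k].T
X_pca = X @ W @ W.T

assert np.allclose(X_svd, X_pca)
```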
This theorem tells us that:

- PCA's optimality is even stronger than stated so far: among *all* rank-$k$ approximations, not just orthogonal projections, none achieves lower Frobenius-norm error than keeping the top $k$ components.
- PCA can be computed directly from the SVD of the centered data matrix, without ever forming the covariance matrix.
Understanding the structure of reconstruction errors provides valuable diagnostic information and can reveal when PCA is appropriate or inappropriate for a dataset.
For each observation, we can compute:
$$e_i = \|\mathbf{x}^{(i)} - \hat{\mathbf{x}}^{(i)}\| = \sqrt{\sum_{j=k+1}^{d} (z_j^{(i)})^2}$$
where $z_j^{(i)} = \mathbf{w}_j^T \mathbf{x}^{(i)}$ are the scores on the discarded components.
Points with high reconstruction error are outliers with respect to the principal subspace—they have unusual values on the less important components.
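This makes per-point reconstruction error a simple outlier score. A sketch with a deliberately planted outlier (all data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Most variance lies in the first two dimensions; the third is nearly flat.
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])
X[0] += np.array([0.0, 0.0, 5.0])  # plant an outlier off the principal plane
X -= X.mean(axis=0)

# Top-2 principal directions from the covariance matrix.
S = X.T @ X / X.shape[0]
lam, V = np.linalg.eigh(S)
W = V[:, ::-1][:, :2]

# Per-point reconstruction error = distance to the principal plane.
residual = X - X @ W @ W.T
errs = np.linalg.norm(residual, axis=1)

assert errs.argmax() == 0  # the planted outlier scores highest
```

The outlier sits close to the data in the first two dimensions, so it would be invisible in the scores—only its reconstruction error gives it away.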
For the entire dataset, the total reconstruction error (residual sum of squares) is:
$$\text{RSS} = \sum_{i=1}^{n} \|\mathbf{e}^{(i)}\|^2 = n \sum_{j=k+1}^{d} \lambda_j$$
This can be normalized to a fraction of total variance:
$$\text{Fraction Unexplained} = \frac{\sum_{j=k+1}^{d} \lambda_j}{\sum_{j=1}^{d} \lambda_j}$$
The complement is the proportion of variance explained:
$$\text{Explained Variance Ratio} = 1 - \text{Fraction Unexplained} = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{d} \lambda_j}$$
This is the most common metric for choosing the number of components.
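A minimal sketch of this selection rule (the 95% threshold and toy spectrum are my own choices, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy data with a decaying variance spectrum.
X = rng.normal(size=(400, 6)) @ np.diag([4.0, 2.0, 1.0, 0.5, 0.2, 0.1])
X -= X.mean(axis=0)

# Covariance eigenvalues, sorted descending.
lam = np.linalg.eigvalsh(X.T @ X / X.shape[0])[::-1]

# Cumulative explained variance ratio for k = 1, ..., d.
evr = np.cumsum(lam) / lam.sum()

# Smallest k whose cumulative ratio reaches the threshold.
k = int(np.searchsorted(evr, 0.95) + 1)
```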
Low reconstruction error doesn't necessarily mean PCA has captured meaningful structure. If the true structure is nonlinear, PCA might achieve low error by capturing the linear component while missing the important nonlinear relationships. Always combine quantitative metrics with qualitative inspection of results.
We've developed the minimum reconstruction error formulation of PCA and shown its deep connection to variance maximization. Here are the essential insights:
| Variance Maximization | Reconstruction Minimization |
|---|---|
| Find directions of max spread | Find best approximating subspace |
| Preserve signal | Minimize distortion |
| Forward: what to keep | Backward: what we lose |
| Eigenvalue = variance along PC | Eigenvalue = error prevented by PC |
Both lead to the same solution: eigenvectors of the covariance matrix, ordered by eigenvalue.
We've now seen why eigenvectors of the covariance matrix are the answer from two perspectives. In the next page, we'll dive deep into the eigenvalue problem itself—understanding eigendecomposition, the spectral theorem, and the computational methods used to find principal components efficiently.