We've established that principal components are eigenvectors of the covariance matrix. But what exactly is an eigenvector? Why do eigenvectors of symmetric matrices have such beautiful properties? And how do we actually compute them for real-world datasets with thousands of dimensions?
This page dives deep into the eigenvalue problem—the mathematical structure that makes PCA tractable and elegant. Understanding eigendecomposition isn't just academic; it provides insight into why PCA works, when it fails, and how its results should be interpreted.
We'll explore the spectral theorem for symmetric matrices, understand the computational algorithms behind the scenes, and develop intuition for what eigenvalues and eigenvectors truly represent.
By the end of this page, you will understand eigenvalues and eigenvectors from multiple perspectives, grasp the spectral theorem and its implications for PCA, know the computational methods used to solve eigenvalue problems, and appreciate the numerical considerations that affect real-world PCA implementations.
Let's start with the fundamental definitions and build intuition for what these objects represent.
Given a square matrix $\mathbf{A} \in \mathbb{R}^{d \times d}$, a scalar $\lambda$ and a non-zero vector $\mathbf{v}$ are an eigenvalue-eigenvector pair if:
$$\mathbf{A}\mathbf{v} = \lambda \mathbf{v}$$
This seemingly simple equation says something profound: when $\mathbf{A}$ acts on $\mathbf{v}$, the result is just a scaled version of $\mathbf{v}$.
Most vectors get 'rotated' and 'stretched' by a matrix—their direction changes. Eigenvectors are special: they only get scaled, remaining along the same line through the origin.
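To make this concrete, here is a tiny NumPy check with a hand-picked matrix (the example values are my own):

```python
import numpy as np

# A simple symmetric matrix
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# [1, 1] is an eigenvector of A with eigenvalue 3
v = np.array([1.0, 1.0])
print(A @ v)          # [3. 3.] — same direction, scaled by 3

# A non-eigenvector changes direction under A
u = np.array([1.0, 0.0])
print(A @ u)          # [2. 1.] — no longer along [1, 0]
```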
Imagine a linear transformation represented by matrix $\mathbf{A}$. In general, it rotates and stretches vectors, changing both their direction and their length.

But along eigenvector directions, something simpler happens: the transformation is pure scaling.
'Eigen' is German for 'own' or 'characteristic.' An eigenvector is a vector that is 'characteristic' of the matrix—a direction that reveals the matrix's essential behavior. The eigenvalue tells you how strongly that characteristic direction is expressed.
To find eigenvalues, we rearrange:
$$\mathbf{A}\mathbf{v} = \lambda\mathbf{v} \implies (\mathbf{A} - \lambda\mathbf{I})\mathbf{v} = \mathbf{0}$$
For a non-zero solution $\mathbf{v}$ to exist, the matrix $(\mathbf{A} - \lambda\mathbf{I})$ must be singular (not invertible). This happens when:
$$\det(\mathbf{A} - \lambda\mathbf{I}) = 0$$
This is the characteristic equation. It's a polynomial of degree $d$ in $\lambda$, so it has exactly $d$ roots (counting multiplicity, in the complex numbers).
For a $d \times d$ matrix, there are $d$ eigenvalues (some may be repeated, and for general matrices, some may be complex).
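For small matrices we can see this directly: NumPy's `np.poly` returns the coefficients of the characteristic polynomial, and its roots are the eigenvalues (a sketch with a hand-picked $2 \times 2$ matrix):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Coefficients of det(A - λI) as a polynomial in λ (leading coefficient 1)
coeffs = np.poly(A)        # here: λ² - 4λ + 3
roots = np.roots(coeffs)   # its roots are the eigenvalues
print(np.sort(roots))      # [1. 3.]

# Matches the direct eigensolver
print(np.sort(np.linalg.eigvals(A)))   # [1. 3.]
```

As discussed below, this route is only for building intuition; real eigensolvers never form the characteristic polynomial.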
The covariance matrix has a special structure: it is symmetric ($\mathbf{S} = \mathbf{S}^T$). This symmetry grants us a powerful theorem that makes PCA particularly elegant.
Theorem (Spectral Theorem): Let $\mathbf{S}$ be a real symmetric matrix. Then all eigenvalues of $\mathbf{S}$ are real, and $\mathbf{S}$ has an orthonormal basis of real eigenvectors.
More precisely, there exists an orthogonal matrix $\mathbf{W}$ (meaning $\mathbf{W}^T = \mathbf{W}^{-1}$) such that:
$$\mathbf{S} = \mathbf{W} \mathbf{\Lambda} \mathbf{W}^T$$
where $\mathbf{\Lambda} = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$ is a diagonal matrix of eigenvalues.
The spectral theorem guarantees that the covariance matrix has an orthonormal basis of eigenvectors with real eigenvalues. This is exactly what PCA needs: real, interpretable variances (eigenvalues) along independent, orthogonal directions (eigenvectors). Without symmetry, we might get complex eigenvalues or non-orthogonal eigenvectors, making interpretation much harder.
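A quick numerical illustration of the spectral theorem, using `np.linalg.eigh` (NumPy's symmetric eigensolver) on a sample covariance matrix built from synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(Xc)        # sample covariance matrix: symmetric

# eigh is specialized for symmetric matrices: real eigenvalues, orthonormal eigenvectors
eigvals, W = np.linalg.eigh(S)

print(np.allclose(W.T @ W, np.eye(4)))             # True: columns of W are orthonormal
print(np.allclose(W @ np.diag(eigvals) @ W.T, S))  # True: S = W Λ Wᵀ
```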
Let $\mathbf{S}\mathbf{v} = \lambda\mathbf{v}$ where $\mathbf{v} \neq \mathbf{0}$. Consider:
$$\mathbf{v}^T \mathbf{S} \mathbf{v} = \mathbf{v}^T (\lambda \mathbf{v}) = \lambda \|\mathbf{v}\|^2$$
Also, since $\mathbf{S}$ is symmetric, taking the transpose of this scalar gives:
$$\mathbf{v}^T \mathbf{S} \mathbf{v} = (\mathbf{v}^T \mathbf{S} \mathbf{v})^T = \mathbf{v}^T \mathbf{S}^T \mathbf{v} = \mathbf{v}^T \mathbf{S} \mathbf{v}$$
A $1 \times 1$ quantity equals its own transpose, so the two expressions agree. Since $\|\mathbf{v}\|^2 > 0$, we can solve for $\lambda = \mathbf{v}^T \mathbf{S} \mathbf{v} / \|\mathbf{v}\|^2$, which is real for real $\mathbf{v}$.
(For complex eigenvectors, a more careful argument using Hermitian adjoints is needed, but the conclusion holds: symmetric real matrices have only real eigenvalues.)
Let $\mathbf{S}\mathbf{v}_i = \lambda_i \mathbf{v}_i$ and $\mathbf{S}\mathbf{v}_j = \lambda_j \mathbf{v}_j$ with $\lambda_i \neq \lambda_j$.
Compute $\mathbf{v}_j^T \mathbf{S} \mathbf{v}_i$ two ways:
$$\mathbf{v}_j^T \mathbf{S} \mathbf{v}_i = \mathbf{v}_j^T (\lambda_i \mathbf{v}_i) = \lambda_i (\mathbf{v}_j^T \mathbf{v}_i)$$
$$\mathbf{v}_j^T \mathbf{S} \mathbf{v}_i = (\mathbf{S} \mathbf{v}_j)^T \mathbf{v}_i = (\lambda_j \mathbf{v}_j)^T \mathbf{v}_i = \lambda_j (\mathbf{v}_j^T \mathbf{v}_i)$$
Equating: $\lambda_i (\mathbf{v}_j^T \mathbf{v}_i) = \lambda_j (\mathbf{v}_j^T \mathbf{v}_i)$
Since $\lambda_i \neq \lambda_j$, we must have $\mathbf{v}_j^T \mathbf{v}_i = 0$. QED.
The spectral theorem guarantees that the covariance matrix can be decomposed as:
$$\mathbf{S} = \mathbf{W} \mathbf{\Lambda} \mathbf{W}^T = \sum_{j=1}^{d} \lambda_j \mathbf{w}_j \mathbf{w}_j^T$$
Let's unpack this decomposition and understand what it tells us.
The sum form is particularly illuminating:
$$\mathbf{S} = \lambda_1 \mathbf{w}_1 \mathbf{w}_1^T + \lambda_2 \mathbf{w}_2 \mathbf{w}_2^T + \cdots + \lambda_d \mathbf{w}_d \mathbf{w}_d^T$$
Each term $\mathbf{w}_j \mathbf{w}_j^T$ is a projection matrix onto the line spanned by $\mathbf{w}_j$. The covariance matrix is a weighted sum of projection matrices, where the weights are the eigenvalues.
This decomposition shows that the covariance matrix 'packages' variance contributions from orthogonal directions. Direction $\mathbf{w}_j$ contributes $\lambda_j$ units of variance. When we do PCA with $k$ components, we keep the first $k$ terms: $\hat{\mathbf{S}} = \sum_{j=1}^{k} \lambda_j \mathbf{w}_j \mathbf{w}_j^T$, the best rank-$k$ approximation.
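This rank-1 view is easy to verify numerically; the following sketch (synthetic data, my variable names) rebuilds a covariance matrix from its eigenpairs:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
S = np.cov(X, rowvar=False)

eigvals, W = np.linalg.eigh(S)

# Rebuild S as a weighted sum of rank-1 projection matrices w_j w_jᵀ
S_rebuilt = sum(lam * np.outer(w, w) for lam, w in zip(eigvals, W.T))
print(np.allclose(S_rebuilt, S))   # True
```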
If we truncate the sum after $k$ terms:
$$\hat{\mathbf{S}}_k = \sum_{j=1}^{k} \lambda_j \mathbf{w}_j \mathbf{w}_j^T$$
This is the best rank-$k$ approximation to $\mathbf{S}$ in the Frobenius norm. The approximation error is:
$$\|\mathbf{S} - \hat{\mathbf{S}}_k\|_F = \sqrt{\sum_{j=k+1}^{d} \lambda_j^2}$$
This is the Eckart-Young-Mirsky theorem applied to the covariance matrix.
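The error formula can be checked directly against the truncated sum (a sketch on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
S = np.cov(X, rowvar=False)

eigvals, W = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]          # sort eigenpairs in descending order
eigvals, W = eigvals[order], W[:, order]

k = 2
S_k = sum(eigvals[j] * np.outer(W[:, j], W[:, j]) for j in range(k))

err = np.linalg.norm(S - S_k, 'fro')
predicted = np.sqrt(np.sum(eigvals[k:] ** 2))   # Eckart-Young-Mirsky formula
print(np.isclose(err, predicted))   # True
```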
The eigendecomposition also reveals matrix inverses:
$$\mathbf{S}^{-1} = \mathbf{W} \mathbf{\Lambda}^{-1} \mathbf{W}^T = \sum_{j=1}^{d} \frac{1}{\lambda_j} \mathbf{w}_j \mathbf{w}_j^T$$
This only works if all $\lambda_j > 0$. If some eigenvalues are zero (rank-deficient case), the matrix is singular and has no inverse.
The pseudoinverse handles this by only inverting non-zero eigenvalues:
$$\mathbf{S}^{+} = \sum_{j: \lambda_j > 0} \frac{1}{\lambda_j} \mathbf{w}_j \mathbf{w}_j^T$$
This is useful in high-dimensional settings where $n < d$.
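A sketch of the pseudoinverse built from eigenpairs, compared against NumPy's `pinv` (the tolerance `1e-10` is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 6))               # n < d: covariance is rank-deficient
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(Xc)

eigvals, W = np.linalg.eigh(S)

# Invert only eigenvalues above a numerical tolerance
tol = 1e-10
S_pinv = sum(np.outer(w, w) / lam
             for lam, w in zip(eigvals, W.T) if lam > tol)

print(np.allclose(S_pinv, np.linalg.pinv(S, hermitian=True)))   # True
```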
The covariance matrix has an additional special property beyond symmetry: it is positive semi-definite (PSD). This has important implications for its eigenvalues and for PCA.
A symmetric matrix $\mathbf{S}$ is positive semi-definite if:
$$\mathbf{v}^T \mathbf{S} \mathbf{v} \geq 0 \quad \text{for all } \mathbf{v} \in \mathbb{R}^d$$
For the covariance matrix of centered data $\mathbf{X}$, this is easily verified:
$$\mathbf{v}^T \mathbf{S} \mathbf{v} = \mathbf{v}^T \left(\frac{1}{n}\mathbf{X}^T\mathbf{X}\right) \mathbf{v} = \frac{1}{n}\|\mathbf{X}\mathbf{v}\|^2 \geq 0$$
The squared norm is always non-negative, so the covariance matrix is always PSD.
Positive semi-definiteness is equivalent to all eigenvalues being non-negative:
$$\lambda_j \geq 0 \quad \text{for all } j$$
Proof: If $\mathbf{S}\mathbf{v} = \lambda\mathbf{v}$ with $\|\mathbf{v}\| = 1$, then:
$$\lambda = \mathbf{v}^T \mathbf{S} \mathbf{v} \geq 0$$
by the PSD property.
This is crucial for PCA: eigenvalues represent variances, and variances cannot be negative. The mathematics confirms what intuition demands.
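This is easy to confirm empirically: the eigenvalues of any sample covariance matrix are non-negative up to rounding error (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 8))
S = np.cov(X, rowvar=False)

eigvals = np.linalg.eigvalsh(S)
# All eigenvalues are non-negative, modulo floating-point noise
print(np.all(eigvals >= -1e-12))   # True
```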
| Type | Condition | Eigenvalues | Example |
|---|---|---|---|
| Positive Definite (PD) | $\mathbf{v}^T\mathbf{S}\mathbf{v} > 0$ for all $\mathbf{v} \neq \mathbf{0}$ | All $\lambda_j > 0$ | Full-rank covariance matrix |
| Positive Semi-Definite (PSD) | $\mathbf{v}^T\mathbf{S}\mathbf{v} \geq 0$ for all $\mathbf{v}$ | All $\lambda_j \geq 0$ | Rank-deficient covariance |
| Indefinite | $\mathbf{v}^T\mathbf{S}\mathbf{v}$ takes both signs | Mixed signs | General symmetric matrix |
| Negative Semi-Definite | $\mathbf{v}^T\mathbf{S}\mathbf{v} \leq 0$ for all $\mathbf{v}$ | All $\lambda_j \leq 0$ | $-\mathbf{S}$ for any PSD $\mathbf{S}$ |
Zero eigenvalues indicate directions where the data has no variance. If $\lambda_k = 0$, the data lies in a subspace of dimension less than $d$. This happens when: (1) $n < d$ (more features than samples), (2) features are exact linear combinations of others, or (3) all observations have identical values in some direction.
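Case (1) is the easiest to demonstrate: with $n = 3$ samples in $d = 5$ dimensions, centering leaves at most $n - 1 = 2$ directions of variance (a sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(3, 5))               # n = 3 samples, d = 5 features
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(Xc)

eigvals = np.linalg.eigvalsh(S)
# Centering costs one degree of freedom: rank ≤ n - 1 = 2
print(np.sum(eigvals > 1e-10))    # 2 non-zero eigenvalues out of 5
```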
Computing the characteristic polynomial and finding its roots is mathematically exact but computationally impractical for large matrices. Real-world eigensolvers use iterative methods that are more stable and efficient.
The simplest iterative method for finding the largest eigenvalue and its eigenvector is power iteration: start from a random vector $\mathbf{v}^{(0)}$ and repeatedly apply the matrix and renormalize, $\mathbf{v}^{(k+1)} = \mathbf{S}\mathbf{v}^{(k)} / \|\mathbf{S}\mathbf{v}^{(k)}\|$, until the direction stops changing.

The eigenvalue is then $\lambda_1 = (\mathbf{v}^{(k)})^T \mathbf{S} \mathbf{v}^{(k)}$.
Why it works: Decompose $\mathbf{v}^{(0)} = \sum_j c_j \mathbf{w}_j$. After $k$ iterations:
$$\mathbf{S}^k \mathbf{v}^{(0)} = \sum_j c_j \lambda_j^k \mathbf{w}_j = \lambda_1^k \left(c_1 \mathbf{w}_1 + \sum_{j>1} c_j \left(\frac{\lambda_j}{\lambda_1}\right)^k \mathbf{w}_j\right)$$
Since $|\lambda_j / \lambda_1| < 1$ for $j > 1$, the other components decay exponentially.
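Putting the steps together, a minimal power-iteration sketch (the iteration count and seed are arbitrary choices):

```python
import numpy as np

def power_iteration(S, num_iters=500, seed=0):
    """Return the dominant eigenvalue/eigenvector of a symmetric matrix S."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=S.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = S @ v
        v /= np.linalg.norm(v)           # renormalize to avoid overflow
    return v @ S @ v, v                  # Rayleigh quotient gives λ₁

S = np.array([[4.0, 1.0],
              [1.0, 3.0]])
lam, v = power_iteration(S)
print(np.isclose(lam, np.linalg.eigvalsh(S).max()))   # True
```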
The workhorse algorithm for computing all eigenvalues simultaneously is the QR algorithm: set $\mathbf{A}_0 = \mathbf{S}$, then repeatedly factor $\mathbf{A}_k = \mathbf{Q}_k \mathbf{R}_k$ and recombine as $\mathbf{A}_{k+1} = \mathbf{R}_k \mathbf{Q}_k$.

Each step is a similarity transform ($\mathbf{A}_{k+1} = \mathbf{Q}_k^T \mathbf{A}_k \mathbf{Q}_k$), so eigenvalues are preserved, and the diagonal entries converge to the eigenvalues. For symmetric matrices, convergence is particularly fast.
With shifts: Adding shifts $\mathbf{A}_k - \mu_k \mathbf{I}$ before QR decomposition accelerates convergence dramatically. The Wilkinson shift strategy achieves cubic convergence.
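A bare-bones unshifted QR iteration, for illustration only (production eigensolvers add shifts and reduce to tridiagonal form first):

```python
import numpy as np

def qr_algorithm(A, num_iters=200):
    """Unshifted QR iteration: the diagonal of A_k converges to the eigenvalues."""
    Ak = A.copy()
    for _ in range(num_iters):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q                       # similarity transform: Qᵀ Ak Q
    return np.diag(Ak)

S = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
print(np.allclose(np.sort(qr_algorithm(S)), np.linalg.eigvalsh(S)))   # True
```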
| Method | Computes | Complexity | Best For |
|---|---|---|---|
| Power Iteration | Largest eigenvalue/vector | $O(d^2)$ per iteration | Only need largest |
| Inverse Iteration | Eigenvalue near target | $O(d^3)$ per iteration | Refine known eigenvalue |
| QR Algorithm | All eigenvalues | $O(d^3)$ total | Dense matrices, all eigenvalues |
| Divide-and-Conquer | All eigenvalues | $O(d^3)$ but faster constants | Symmetric matrices |
| Lanczos | Extreme eigenvalues | $O(dk \cdot \text{nnz})$ | Sparse matrices, few eigenvalues |
| Randomized SVD | Top-$k$ singular values | $O(d^2 k)$ | Very large/streaming data |
For most PCA applications, use SVD rather than explicit eigendecomposition. Computing the SVD of $\mathbf{X}$ directly is more numerically stable than forming $\mathbf{X}^T\mathbf{X}$ and then eigendecomposing. Libraries like NumPy, SciPy, and scikit-learn handle this automatically.
Singular Value Decomposition (SVD) is intimately related to PCA and often preferred computationally. Understanding this connection is essential for practical PCA implementation.
Any matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ can be decomposed as:
$$\mathbf{X} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$$
where:

- $\mathbf{U} \in \mathbb{R}^{n \times n}$ is orthogonal; its columns are the left singular vectors
- $\mathbf{\Sigma} \in \mathbb{R}^{n \times d}$ is diagonal with non-negative singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$
- $\mathbf{V} \in \mathbb{R}^{d \times d}$ is orthogonal; its columns are the right singular vectors
Consider the covariance matrix (assuming centered $\mathbf{X}$):
$$\mathbf{S} = \frac{1}{n}\mathbf{X}^T\mathbf{X} = \frac{1}{n}(\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T)^T(\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T)$$
$$= \frac{1}{n}\mathbf{V}\mathbf{\Sigma}^T\mathbf{U}^T\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T = \frac{1}{n}\mathbf{V}\mathbf{\Sigma}^T\mathbf{\Sigma}\mathbf{V}^T = \mathbf{V}\left(\frac{\mathbf{\Sigma}^2}{n}\right)\mathbf{V}^T$$
Comparing with $\mathbf{S} = \mathbf{W}\mathbf{\Lambda}\mathbf{W}^T$, we can read off the correspondence: the right singular vectors are the principal components ($\mathbf{W} = \mathbf{V}$), and the eigenvalues are scaled squared singular values ($\lambda_j = \sigma_j^2 / n$).
To compute PCA: (1) Center the data, (2) Compute SVD of centered data matrix, (3) Right singular vectors $\mathbf{V}$ are principal components, (4) Singular values give explained variances via $\lambda_j = \sigma_j^2/n$. This is more numerically stable than explicit covariance eigendecomposition.
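The recipe above, as a NumPy sketch on synthetic data, cross-checked against the explicit covariance eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)                   # step 1: center

U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)   # step 2: SVD

components = Vt                           # step 3: rows are principal components
explained_var = sigma ** 2 / len(X)       # step 4: λⱼ = σⱼ²/n

# Agrees with explicit eigendecomposition of the covariance matrix
S = Xc.T @ Xc / len(X)
print(np.allclose(np.sort(explained_var), np.sort(np.linalg.eigvalsh(S))))  # True
```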
Numerical Stability: Forming $\mathbf{X}^T\mathbf{X}$ squares the condition number, amplifying numerical errors. SVD works directly on $\mathbf{X}$.
Memory Efficiency: For $n < d$, we never need to form the $d \times d$ covariance matrix.
Works on Rectangular Matrices: SVD is defined for any matrix, not just square ones.
Truncated Computation: Efficient algorithms compute only the top $k$ singular values/vectors without computing all $d$.
Real-world PCA implementation must handle various numerical challenges. Understanding these helps you interpret results correctly and avoid common mistakes.
The condition number $\kappa(\mathbf{S}) = \lambda_{\max} / \lambda_{\min}$ measures sensitivity to numerical errors.
Ill-conditioning often occurs when features have vastly different scales or when features are highly correlated.
When eigenvalues are very small, practical approaches include adding a small ridge term $\epsilon\mathbf{I}$ to regularize, discarding components whose eigenvalues fall below a numerical threshold (as the pseudoinverse does), and standardizing features beforehand so that scale differences don't create artificial ill-conditioning.
Two runs of PCA may produce eigenvectors with flipped signs or (for equal eigenvalues) in different order. This is mathematically correct but can cause issues in applications—for example, tracking principal components over time. Establish conventions (e.g., largest component of first PC is positive) for reproducibility.
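One possible convention, sketched below: flip each component so that its largest-magnitude entry is positive (the function name is mine):

```python
import numpy as np

def fix_signs(components):
    """Flip each row so its largest-magnitude entry is positive."""
    comps = components.copy()
    for i, row in enumerate(comps):
        if row[np.argmax(np.abs(row))] < 0:
            comps[i] = -row
    return comps

# Two sign-flipped versions of the same component now agree
w = np.array([[0.6, -0.8]])
print(np.allclose(fix_signs(w), fix_signs(-w)))   # True
```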
We've explored the eigenvalue problem that lies at the heart of PCA. The essential concepts: eigenvectors are the directions a matrix merely scales, and eigenvalues are the scaling factors; the spectral theorem guarantees that symmetric matrices have real eigenvalues and an orthonormal basis of eigenvectors; positive semi-definiteness ensures eigenvalues are non-negative, as variances must be; and in practice, SVD of the centered data matrix is the numerically stable route to the principal components.
We've now understood the mathematics behind computing principal components. In the next page, we'll explore the proportion of variance explained—how to measure and interpret the amount of information captured by each component, and how to use this to choose the appropriate number of components for your application.