We've established that principal components are eigenvectors of the covariance matrix. But what exactly is an eigenvector? Why do eigenvectors of symmetric matrices have such beautiful properties? And how do we actually compute them for real-world datasets with thousands of dimensions?

This page dives deep into the eigenvalue problem: the mathematical structure that makes PCA tractable and elegant. Understanding eigendecomposition isn't just academic; it provides insight into why PCA works, when it fails, and how its results should be interpreted.

We'll explore the spectral theorem for symmetric matrices, understand the computational algorithms behind the scenes, and develop intuition for what eigenvalues and eigenvectors truly represent.
By the end of this page, you will understand eigenvalues and eigenvectors from multiple perspectives, grasp the spectral theorem and its implications for PCA, know the computational methods used to solve eigenvalue problems, and appreciate the numerical considerations that affect real-world PCA implementations.
Let's start with the fundamental definitions and build intuition for what these objects represent.

### The Definition

Given a square matrix $\mathbf{A} \in \mathbb{R}^{d \times d}$, a scalar $\lambda$ and a non-zero vector $\mathbf{v}$ are an eigenvalue-eigenvector pair if:

$$\mathbf{A}\mathbf{v} = \lambda \mathbf{v}$$

This seemingly simple equation says something profound: when $\mathbf{A}$ acts on $\mathbf{v}$, the result is just a scaled version of $\mathbf{v}$.

Most vectors get 'rotated' and 'stretched' by a matrix; their direction changes. Eigenvectors are special: they only get scaled, remaining along the same line through the origin.
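As a quick sanity check, here is a minimal NumPy sketch (the $2 \times 2$ matrix is a hypothetical example, not one from the text) verifying that the vectors returned by `np.linalg.eig` satisfy the defining equation:

```python
import numpy as np

# A small symmetric matrix (hypothetical example for illustration)
A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

# np.linalg.eig returns eigenvalues and eigenvectors (eigenvectors are columns)
eigenvalues, eigenvectors = np.linalg.eig(A)

# Verify the defining equation A v = lambda v for each pair
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))  # True: v is only scaled, not rotated
```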
'Eigen' is German for 'own' or 'characteristic.' An eigenvector is a vector that is 'characteristic' of the matrix—a direction that reveals the matrix's essential behavior. The eigenvalue tells you how strongly that characteristic direction is expressed.
The covariance matrix has a special structure: it is symmetric ($\mathbf{S} = \mathbf{S}^T$). This symmetry grants us a powerful theorem that makes PCA particularly elegant.

### The Spectral Theorem

Theorem: Let $\mathbf{S}$ be a real symmetric matrix. Then:

1. All eigenvalues are real (not complex)
2. Eigenvectors corresponding to distinct eigenvalues are orthogonal
3. $\mathbf{S}$ can be diagonalized by an orthogonal matrix

More precisely, there exists an orthogonal matrix $\mathbf{W}$ (meaning $\mathbf{W}^T = \mathbf{W}^{-1}$) such that:

$$\mathbf{S} = \mathbf{W} \mathbf{\Lambda} \mathbf{W}^T$$

where $\mathbf{\Lambda} = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$ is a diagonal matrix of eigenvalues.
The spectral theorem guarantees that the covariance matrix has an orthonormal basis of eigenvectors with real eigenvalues. This is exactly what PCA needs: real, interpretable variances (eigenvalues) along independent, orthogonal directions (eigenvectors). Without symmetry, we might get complex eigenvalues or non-orthogonal eigenvectors, making interpretation much harder.
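The short sketch below, using randomly generated data as a stand-in for a real dataset, verifies these guarantees numerically with `np.linalg.eigh`, NumPy's solver specialized for symmetric matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a symmetric covariance matrix from hypothetical random data
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)          # center the data
S = X.T @ X / X.shape[0]

# eigh assumes symmetry: real eigenvalues, orthonormal eigenvectors
eigvals, W = np.linalg.eigh(S)

print(np.all(np.isreal(eigvals)))                   # eigenvalues are real
print(np.allclose(W.T @ W, np.eye(5)))              # eigenvectors are orthonormal
print(np.allclose(W @ np.diag(eigvals) @ W.T, S))   # S = W Lambda W^T
```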
The spectral theorem guarantees that the covariance matrix can be decomposed as:

$$\mathbf{S} = \mathbf{W} \mathbf{\Lambda} \mathbf{W}^T = \sum_{j=1}^{d} \lambda_j \mathbf{w}_j \mathbf{w}_j^T$$

Let's unpack this decomposition and understand what it tells us.

### The Outer Product Form

The sum form is particularly illuminating:

$$\mathbf{S} = \lambda_1 \mathbf{w}_1 \mathbf{w}_1^T + \lambda_2 \mathbf{w}_2 \mathbf{w}_2^T + \cdots + \lambda_d \mathbf{w}_d \mathbf{w}_d^T$$

Each term $\mathbf{w}_j \mathbf{w}_j^T$ is a projection matrix onto the line spanned by $\mathbf{w}_j$. The covariance matrix is a weighted sum of projection matrices, where the weights are the eigenvalues.
This decomposition shows that the covariance matrix 'packages' variance contributions from orthogonal directions. Direction $\mathbf{w}_j$ contributes $\lambda_j$ units of variance. When we do PCA with $k$ components, we keep the first $k$ terms: $\hat{\mathbf{S}} = \sum_{j=1}^{k} \lambda_j \mathbf{w}_j \mathbf{w}_j^T$, the best rank-$k$ approximation.
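The following sketch (again on hypothetical random data) rebuilds the covariance matrix term by term from the outer products, and shows how truncating the sum at $k$ terms gives the rank-$k$ approximation:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
X = X - X.mean(axis=0)
S = X.T @ X / X.shape[0]

# Eigendecomposition, sorted so that lambda_1 >= lambda_2 >= ...
eigvals, W = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, W = eigvals[order], W[:, order]

# Rebuild S one rank-1 term at a time: S = sum_j lambda_j w_j w_j^T
S_hat = np.zeros_like(S)
for j in range(len(eigvals)):
    S_hat += eigvals[j] * np.outer(W[:, j], W[:, j])
print(np.allclose(S_hat, S))   # True once all d terms are included

# Keeping only the first k terms gives the best rank-k approximation
k = 2
S_k = sum(eigvals[j] * np.outer(W[:, j], W[:, j]) for j in range(k))
```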
The covariance matrix has an additional special property beyond symmetry: it is positive semi-definite (PSD). This has important implications for its eigenvalues and for PCA.

### Definition of PSD

A symmetric matrix $\mathbf{S}$ is positive semi-definite if:

$$\mathbf{v}^T \mathbf{S} \mathbf{v} \geq 0 \quad \text{for all } \mathbf{v} \in \mathbb{R}^d$$

For the covariance matrix of centered data, this is easily verified:

$$\mathbf{v}^T \mathbf{S} \mathbf{v} = \mathbf{v}^T \left(\frac{1}{n}\mathbf{X}^T\mathbf{X}\right) \mathbf{v} = \frac{1}{n}\|\mathbf{X}\mathbf{v}\|^2 \geq 0$$

The squared norm is always non-negative, so the covariance matrix is always PSD.
| Type | Condition | Eigenvalues | Example |
|---|---|---|---|
| Positive Definite (PD) | $\mathbf{v}^T\mathbf{S}\mathbf{v} > 0$ for all $\mathbf{v} \neq 0$ | All $\lambda_j > 0$ | Full-rank covariance matrix |
| Positive Semi-Definite (PSD) | $\mathbf{v}^T\mathbf{S}\mathbf{v} \geq 0$ for all $\mathbf{v}$ | All $\lambda_j \geq 0$ | Rank-deficient covariance |
| Indefinite | $\mathbf{v}^T\mathbf{S}\mathbf{v}$ takes both positive and negative values | Mixed signs | General symmetric matrix |
| Negative Semi-Definite | $\mathbf{v}^T\mathbf{S}\mathbf{v} \leq 0$ for all $\mathbf{v}$ | All $\lambda_j \leq 0$ | Negated covariance matrix ($-\mathbf{S}$) |
Zero eigenvalues indicate directions where the data has no variance. If $\lambda_k = 0$, the data lies in a subspace of dimension less than $d$. This happens when: (1) $n < d$ (more features than samples), (2) features are exact linear combinations of others, or (3) all observations have identical values in some direction.
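A small illustration of case (1), using hypothetical random data with more features than samples, shows the zero eigenvalues directly:

```python
import numpy as np

rng = np.random.default_rng(2)

# Fewer samples than features (n=5, d=10): the covariance must be rank-deficient
X = rng.normal(size=(5, 10))
X = X - X.mean(axis=0)
S = X.T @ X / X.shape[0]

eigvals = np.linalg.eigvalsh(S)
print(np.sum(eigvals > 1e-10))   # at most n-1 = 4 nonzero eigenvalues
print(np.sum(eigvals <= 1e-10))  # the remaining directions carry zero variance
```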
Computing the characteristic polynomial and finding its roots is mathematically exact but computationally impractical for large matrices. Real-world eigensolvers use iterative methods that are more stable and efficient.

### Power Iteration

The simplest iterative method for finding the largest eigenvalue and its eigenvector:

1. Start with a random vector $\mathbf{v}^{(0)}$
2. Repeat: $\mathbf{v}^{(k+1)} = \frac{\mathbf{S}\mathbf{v}^{(k)}}{\|\mathbf{S}\mathbf{v}^{(k)}\|}$
3. Under mild conditions (the starting vector has a nonzero component along $\mathbf{w}_1$, and $\lambda_1 > \lambda_2$), $\mathbf{v}^{(k)} \to \mathbf{w}_1$

The eigenvalue is then $\lambda_1 = (\mathbf{v}^{(k)})^T \mathbf{S} \mathbf{v}^{(k)}$.

Why it works: Decompose $\mathbf{v}^{(0)} = \sum_j c_j \mathbf{w}_j$. After $k$ iterations:

$$\mathbf{S}^k \mathbf{v}^{(0)} = \sum_j c_j \lambda_j^k \mathbf{w}_j = \lambda_1^k \left(c_1 \mathbf{w}_1 + \sum_{j>1} c_j \left(\frac{\lambda_j}{\lambda_1}\right)^k \mathbf{w}_j\right)$$

Since $|\lambda_j / \lambda_1| < 1$ for $j > 1$, the other components decay exponentially.
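Here is one possible NumPy implementation of power iteration as described above; the matrix, fixed iteration count, and lack of a convergence check are illustrative choices, not a production eigensolver:

```python
import numpy as np

def power_iteration(S, num_iters=1000, seed=0):
    """Estimate the largest eigenvalue/eigenvector of a symmetric matrix S."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=S.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = S @ v
        v /= np.linalg.norm(v)       # renormalize to avoid overflow/underflow
    lam = v @ S @ v                  # Rayleigh quotient gives the eigenvalue
    return lam, v

# Compare against a direct solver on a random covariance matrix
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 8))
S = np.cov(X, rowvar=False)

lam, v = power_iteration(S)
eigvals = np.linalg.eigvalsh(S)
print(np.isclose(lam, eigvals[-1]))  # matches the largest eigenvalue
```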
| Method | Computes | Complexity | Best For |
|---|---|---|---|
| Power Iteration | Largest eigenvalue/vector | $O(d^2)$ per iteration | Only need largest |
| Inverse Iteration | Eigenvalue near target | $O(d^3)$ per iteration | Refine known eigenvalue |
| QR Algorithm | All eigenvalues | $O(d^3)$ total | Dense matrices, all eigenvalues |
| Divide-and-Conquer | All eigenvalues | $O(d^3)$ but faster constants | Symmetric matrices |
| Lanczos | Extreme eigenvalues | $O(k \cdot \text{nnz} + dk^2)$ | Sparse matrices, few eigenvalues |
| Randomized SVD | Top-$k$ singular values | $O(d^2 k)$ | Very large/streaming data |
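For example, a Lanczos-type solver such as SciPy's `scipy.sparse.linalg.eigsh` can return just the top-$k$ eigenpairs without computing the full decomposition (the matrix below is dense and random purely for illustration; `eigsh` also accepts sparse matrices):

```python
import numpy as np
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 300))
S = np.cov(X, rowvar=False)      # 300 x 300 symmetric PSD matrix

# Lanczos-type solver: only the k largest-magnitude eigenpairs
k = 5
top_vals, top_vecs = eigsh(S, k=k, which='LM')

# Compare with the full dense solver (O(d^3)); both agree on the top k
full_vals = np.linalg.eigvalsh(S)
print(np.allclose(np.sort(top_vals), full_vals[-k:]))
```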
For most PCA applications, use SVD rather than explicit eigendecomposition. Computing the SVD of $\mathbf{X}$ directly is more numerically stable than forming $\mathbf{X}^T\mathbf{X}$ and then eigendecomposing. Libraries like NumPy, SciPy, and scikit-learn handle this automatically.
Singular Value Decomposition (SVD) is intimately related to PCA and often preferred computationally. Understanding this connection is essential for practical PCA implementation.

### SVD Definition

Any matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ can be decomposed as:

$$\mathbf{X} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$$

where:

- $\mathbf{U}$ is $n \times n$ orthogonal (left singular vectors)
- $\mathbf{\Sigma}$ is $n \times d$ diagonal (singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$)
- $\mathbf{V}$ is $d \times d$ orthogonal (right singular vectors)

Substituting this into the covariance of centered data gives $\mathbf{S} = \frac{1}{n}\mathbf{X}^T\mathbf{X} = \mathbf{V}\left(\frac{\mathbf{\Sigma}^T\mathbf{\Sigma}}{n}\right)\mathbf{V}^T$, so the right singular vectors are exactly the eigenvectors of $\mathbf{S}$, with eigenvalues $\lambda_j = \sigma_j^2 / n$.
To compute PCA: (1) Center the data, (2) Compute SVD of centered data matrix, (3) Right singular vectors $\mathbf{V}$ are principal components, (4) Singular values give explained variances via $\lambda_j = \sigma_j^2/n$. This is more numerically stable than explicit covariance eigendecomposition.
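A minimal sketch of this recipe on hypothetical random data, cross-checked against the explicit covariance eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))

# 1. Center the data
Xc = X - X.mean(axis=0)

# 2. SVD of the centered matrix (full_matrices=False gives the compact form)
U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)

# 3. Rows of Vt (columns of V) are the principal components
components = Vt

# 4. Explained variances from singular values: lambda_j = sigma_j^2 / n
explained_var = sigma**2 / X.shape[0]

# Cross-check against explicit covariance eigendecomposition
S = Xc.T @ Xc / X.shape[0]
eigvals = np.linalg.eigvalsh(S)[::-1]
print(np.allclose(explained_var, eigvals))
```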
Real-world PCA implementation must handle various numerical challenges. Understanding these helps you interpret results correctly and avoid common mistakes.

### Condition Number

The condition number $\kappa(\mathbf{S}) = \lambda_{\max} / \lambda_{\min}$ measures sensitivity to numerical errors.

- Small $\kappa$ (close to 1): well-conditioned, stable computation
- Large $\kappa$ (greater than about $10^6$): ill-conditioned, results may be unreliable
- $\kappa = \infty$: the matrix is singular (has zero eigenvalues)

Ill-conditioning often occurs when features have vastly different scales or when features are highly correlated.
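A brief illustration on synthetic data, with one feature on a much larger scale, of how the condition number blows up and how standardization tames it:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 4))
X[:, 0] *= 1000.0                  # one feature on a vastly larger scale

S = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(S)
print(eigvals[-1] / eigvals[0])    # condition number: huge due to scale mismatch

# Standardizing the features brings kappa back toward 1
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals_s = np.linalg.eigvalsh(np.cov(Xs, rowvar=False))
print(eigvals_s[-1] / eigvals_s[0])
```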
Two runs of PCA may produce eigenvectors with flipped signs or (for equal eigenvalues) in different order. This is mathematically correct but can cause issues in applications—for example, tracking principal components over time. Establish conventions (e.g., largest component of first PC is positive) for reproducibility.
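One possible convention (a sketch, not a standard API) is to flip each component so that its largest-magnitude entry is positive, similar in spirit to the sign-fixing some libraries apply internally:

```python
import numpy as np

def fix_signs(components):
    """Flip each principal component so its largest-magnitude entry is positive.

    `components` holds one component per row (as returned by SVD's V^T).
    This is just one possible convention; any consistent rule works.
    """
    idx = np.argmax(np.abs(components), axis=1)                    # largest entry per row
    signs = np.sign(components[np.arange(len(components)), idx])   # its sign
    return components * signs[:, None]                             # flip rows as needed
```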
We've explored the eigenvalue problem that lies at the heart of PCA. Here are the essential concepts:

- An eigenvector of $\mathbf{A}$ is a direction that $\mathbf{A}$ only scales, by the factor $\lambda$: $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$.
- The spectral theorem guarantees that a symmetric matrix like the covariance has real eigenvalues and an orthonormal basis of eigenvectors, so $\mathbf{S} = \mathbf{W}\mathbf{\Lambda}\mathbf{W}^T$.
- The covariance matrix is positive semi-definite: all eigenvalues are non-negative, and zero eigenvalues mark directions with no variance.
- Practical eigensolvers are iterative (power iteration, Lanczos, QR), and PCA is usually computed via the SVD of the centered data matrix for numerical stability.
- Numerical issues such as ill-conditioning and sign/order ambiguity affect how results should be interpreted and reproduced.
We've now understood the mathematics behind computing principal components. In the next page, we'll explore the proportion of variance explained—how to measure and interpret the amount of information captured by each component, and how to use this to choose the appropriate number of components for your application.