We've established that principal components are eigenvectors of the covariance matrix. But what exactly is an eigenvector? Why do eigenvectors of symmetric matrices have such beautiful properties? And how do we actually compute them for real-world datasets with thousands of dimensions?
This page dives deep into the eigenvalue problem—the mathematical structure that makes PCA tractable and elegant. Understanding eigendecomposition isn't just academic; it provides insight into why PCA works, when it fails, and how its results should be interpreted.
We'll explore the spectral theorem for symmetric matrices, understand the computational algorithms behind the scenes, and develop intuition for what eigenvalues and eigenvectors truly represent.
By the end of this page, you will understand eigenvalues and eigenvectors from multiple perspectives, grasp the spectral theorem and its implications for PCA, know the computational methods used to solve eigenvalue problems, and appreciate the numerical considerations that affect real-world PCA implementations.
Let's start with the fundamental definitions and build intuition for what these objects represent.
Given a square matrix $\mathbf{A} \in \mathbb{R}^{d \times d}$, a scalar $\lambda$ and a non-zero vector $\mathbf{v}$ are an eigenvalue-eigenvector pair if:
$$\mathbf{A}\mathbf{v} = \lambda \mathbf{v}$$
This seemingly simple equation says something profound: when $\mathbf{A}$ acts on $\mathbf{v}$, the result is just a scaled version of $\mathbf{v}$.
Most vectors get 'rotated' and 'stretched' by a matrix—their direction changes. Eigenvectors are special: they only get scaled, remaining along the same line through the origin.
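To make this concrete, here is a tiny NumPy check with a hand-picked matrix (the example values are my own):

```python
import numpy as np

# A simple symmetric matrix
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# [1, 1] is an eigenvector of A with eigenvalue 3
v = np.array([1.0, 1.0])
print(A @ v)          # [3. 3.] — same direction, scaled by 3

# A non-eigenvector changes direction under A
u = np.array([1.0, 0.0])
print(A @ u)          # [2. 1.] — no longer along [1, 0]
```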
Imagine a linear transformation represented by matrix $\mathbf{A}$. In general, it rotates and stretches vectors, changing both their direction and their length.

But along eigenvector directions, something simpler happens: the transformation is pure scaling.
'Eigen' is German for 'own' or 'characteristic.' An eigenvector is a vector that is 'characteristic' of the matrix—a direction that reveals the matrix's essential behavior. The eigenvalue tells you how strongly that characteristic direction is expressed.
To find eigenvalues, we rearrange:
$$\mathbf{A}\mathbf{v} = \lambda\mathbf{v} \implies (\mathbf{A} - \lambda\mathbf{I})\mathbf{v} = \mathbf{0}$$
For a non-zero solution $\mathbf{v}$ to exist, the matrix $(\mathbf{A} - \lambda\mathbf{I})$ must be singular (not invertible). This happens when:
$$\det(\mathbf{A} - \lambda\mathbf{I}) = 0$$
This is the characteristic equation. It's a polynomial of degree $d$ in $\lambda$, so it has exactly $d$ roots (counting multiplicity, in the complex numbers).
For a $d \times d$ matrix, there are $d$ eigenvalues (some may be repeated, and for general matrices, some may be complex).
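For small matrices we can see this directly: NumPy's `np.poly` returns the coefficients of the characteristic polynomial, and its roots are the eigenvalues (a sketch with a hand-picked $2 \times 2$ matrix):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Coefficients of det(A - λI) as a polynomial in λ (leading coefficient 1)
coeffs = np.poly(A)        # here: λ² - 4λ + 3
roots = np.roots(coeffs)   # its roots are the eigenvalues
print(np.sort(roots))      # [1. 3.]

# Matches the direct eigensolver
print(np.sort(np.linalg.eigvals(A)))   # [1. 3.]
```

As discussed below, this route is only for building intuition; real eigensolvers never form the characteristic polynomial.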
The covariance matrix has a special structure: it is symmetric ($\mathbf{S} = \mathbf{S}^T$). This symmetry grants us a powerful theorem that makes PCA particularly elegant.
Theorem (Spectral Theorem): Let $\mathbf{S}$ be a real symmetric matrix. Then all eigenvalues of $\mathbf{S}$ are real, and $\mathbf{S}$ has an orthonormal basis of real eigenvectors.
More precisely, there exists an orthogonal matrix $\mathbf{W}$ (meaning $\mathbf{W}^T = \mathbf{W}^{-1}$) such that:
$$\mathbf{S} = \mathbf{W} \mathbf{\Lambda} \mathbf{W}^T$$
where $\mathbf{\Lambda} = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$ is a diagonal matrix of eigenvalues.
The spectral theorem guarantees that the covariance matrix has an orthonormal basis of eigenvectors with real eigenvalues. This is exactly what PCA needs: real, interpretable variances (eigenvalues) along independent, orthogonal directions (eigenvectors). Without symmetry, we might get complex eigenvalues or non-orthogonal eigenvectors, making interpretation much harder.
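A quick numerical illustration of the spectral theorem, using `np.linalg.eigh` (NumPy's symmetric eigensolver) on a sample covariance matrix built from synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(Xc)        # sample covariance matrix: symmetric

# eigh is specialized for symmetric matrices: real eigenvalues, orthonormal eigenvectors
eigvals, W = np.linalg.eigh(S)

print(np.allclose(W.T @ W, np.eye(4)))             # True: columns of W are orthonormal
print(np.allclose(W @ np.diag(eigvals) @ W.T, S))  # True: S = W Λ Wᵀ
```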
Let $\mathbf{S}\mathbf{v} = \lambda\mathbf{v}$ where $\mathbf{v} \neq \mathbf{0}$. Consider:
$$\mathbf{v}^T \mathbf{S} \mathbf{v} = \mathbf{v}^T (\lambda \mathbf{v}) = \lambda \|\mathbf{v}\|^2$$
Also, since $\mathbf{S}$ is symmetric, taking the transpose of this scalar gives:
$$\mathbf{v}^T \mathbf{S} \mathbf{v} = (\mathbf{v}^T \mathbf{S} \mathbf{v})^T = \mathbf{v}^T \mathbf{S}^T \mathbf{v} = \mathbf{v}^T \mathbf{S} \mathbf{v}$$
A $1 \times 1$ quantity equals its own transpose, so the two expressions agree. Since $\|\mathbf{v}\|^2 > 0$, we can solve for $\lambda = \mathbf{v}^T \mathbf{S} \mathbf{v} / \|\mathbf{v}\|^2$, which is real for real $\mathbf{v}$.
(For complex eigenvectors, a more careful argument using Hermitian adjoints is needed, but the conclusion holds: symmetric real matrices have only real eigenvalues.)
Let $\mathbf{S}\mathbf{v}_i = \lambda_i \mathbf{v}_i$ and $\mathbf{S}\mathbf{v}_j = \lambda_j \mathbf{v}_j$ with $\lambda_i \neq \lambda_j$.
Compute $\mathbf{v}_j^T \mathbf{S} \mathbf{v}_i$ two ways:
$$\mathbf{v}_j^T \mathbf{S} \mathbf{v}_i = \mathbf{v}_j^T (\lambda_i \mathbf{v}_i) = \lambda_i (\mathbf{v}_j^T \mathbf{v}_i)$$
$$\mathbf{v}_j^T \mathbf{S} \mathbf{v}_i = (\mathbf{S} \mathbf{v}_j)^T \mathbf{v}_i = (\lambda_j \mathbf{v}_j)^T \mathbf{v}_i = \lambda_j (\mathbf{v}_j^T \mathbf{v}_i)$$
Equating: $\lambda_i (\mathbf{v}_j^T \mathbf{v}_i) = \lambda_j (\mathbf{v}_j^T \mathbf{v}_i)$
Since $\lambda_i \neq \lambda_j$, we must have $\mathbf{v}_j^T \mathbf{v}_i = 0$. QED.
The spectral theorem guarantees that the covariance matrix can be decomposed as:
$$\mathbf{S} = \mathbf{W} \mathbf{\Lambda} \mathbf{W}^T = \sum_{j=1}^{d} \lambda_j \mathbf{w}_j \mathbf{w}_j^T$$
Let's unpack this decomposition and understand what it tells us.
The sum form is particularly illuminating:
$$\mathbf{S} = \lambda_1 \mathbf{w}_1 \mathbf{w}_1^T + \lambda_2 \mathbf{w}_2 \mathbf{w}_2^T + \cdots + \lambda_d \mathbf{w}_d \mathbf{w}_d^T$$
Each term $\mathbf{w}_j \mathbf{w}_j^T$ is a projection matrix onto the line spanned by $\mathbf{w}_j$. The covariance matrix is a weighted sum of projection matrices, where the weights are the eigenvalues.
This decomposition shows that the covariance matrix 'packages' variance contributions from orthogonal directions. Direction $\mathbf{w}_j$ contributes $\lambda_j$ units of variance. When we do PCA with $k$ components, we keep the first $k$ terms: $\hat{\mathbf{S}} = \sum_{j=1}^{k} \lambda_j \mathbf{w}_j \mathbf{w}_j^T$, the best rank-$k$ approximation.
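This rank-1 view is easy to verify numerically; the following sketch (synthetic data, my variable names) rebuilds a covariance matrix from its eigenpairs:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
S = np.cov(X, rowvar=False)

eigvals, W = np.linalg.eigh(S)

# Rebuild S as a weighted sum of rank-1 projection matrices w_j w_jᵀ
S_rebuilt = sum(lam * np.outer(w, w) for lam, w in zip(eigvals, W.T))
print(np.allclose(S_rebuilt, S))   # True
```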
If we truncate the sum after $k$ terms:
$$\hat{\mathbf{S}}_k = \sum_{j=1}^{k} \lambda_j \mathbf{w}_j \mathbf{w}_j^T$$
This is the best rank-$k$ approximation to $\mathbf{S}$ in the Frobenius norm. The approximation error is:
$$\|\mathbf{S} - \hat{\mathbf{S}}_k\|_F = \sqrt{\sum_{j=k+1}^{d} \lambda_j^2}$$
This is the Eckart-Young-Mirsky theorem applied to the covariance matrix.
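The error formula can be checked directly against the truncated sum (a sketch on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
S = np.cov(X, rowvar=False)

eigvals, W = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]          # sort eigenpairs in descending order
eigvals, W = eigvals[order], W[:, order]

k = 2
S_k = sum(eigvals[j] * np.outer(W[:, j], W[:, j]) for j in range(k))

err = np.linalg.norm(S - S_k, 'fro')
predicted = np.sqrt(np.sum(eigvals[k:] ** 2))   # Eckart-Young-Mirsky formula
print(np.isclose(err, predicted))   # True
```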
The eigendecomposition also reveals matrix inverses:
$$\mathbf{S}^{-1} = \mathbf{W} \mathbf{\Lambda}^{-1} \mathbf{W}^T = \sum_{j=1}^{d} \frac{1}{\lambda_j} \mathbf{w}_j \mathbf{w}_j^T$$
This only works if all $\lambda_j > 0$. If some eigenvalues are zero (rank-deficient case), the matrix is singular and has no inverse.
The pseudoinverse handles this by only inverting non-zero eigenvalues:
$$\mathbf{S}^{+} = \sum_{j: \lambda_j > 0} \frac{1}{\lambda_j} \mathbf{w}_j \mathbf{w}_j^T$$
This is useful in high-dimensional settings where $n < d$.
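A sketch of the pseudoinverse built from eigenpairs, compared against NumPy's `pinv` (the tolerance `1e-10` is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 6))               # n < d: covariance is rank-deficient
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(Xc)

eigvals, W = np.linalg.eigh(S)

# Invert only eigenvalues above a numerical tolerance
tol = 1e-10
S_pinv = sum(np.outer(w, w) / lam
             for lam, w in zip(eigvals, W.T) if lam > tol)

print(np.allclose(S_pinv, np.linalg.pinv(S, hermitian=True)))   # True
```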
The covariance matrix has an additional special property beyond symmetry: it is positive semi-definite (PSD). This has important implications for its eigenvalues and for PCA.
A symmetric matrix $\mathbf{S}$ is positive semi-definite if:
$$\mathbf{v}^T \mathbf{S} \mathbf{v} \geq 0 \quad \text{for all } \mathbf{v} \in \mathbb{R}^d$$
For the covariance matrix of centered data $\mathbf{X}$, this is easily verified:
$$\mathbf{v}^T \mathbf{S} \mathbf{v} = \mathbf{v}^T \left(\frac{1}{n}\mathbf{X}^T\mathbf{X}\right) \mathbf{v} = \frac{1}{n}\|\mathbf{X}\mathbf{v}\|^2 \geq 0$$
The squared norm is always non-negative, so the covariance matrix is always PSD.
Positive semi-definiteness is equivalent to all eigenvalues being non-negative:
$$\lambda_j \geq 0 \quad \text{for all } j$$
Proof: If $\mathbf{S}\mathbf{v} = \lambda\mathbf{v}$ with $\|\mathbf{v}\| = 1$, then:
$$\lambda = \mathbf{v}^T \mathbf{S} \mathbf{v} \geq 0$$
by the PSD property.
This is crucial for PCA: eigenvalues represent variances, and variances cannot be negative. The mathematics confirms what intuition demands.
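This is easy to confirm empirically: the eigenvalues of any sample covariance matrix are non-negative up to rounding error (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 8))
S = np.cov(X, rowvar=False)

eigvals = np.linalg.eigvalsh(S)
# All eigenvalues are non-negative, modulo floating-point noise
print(np.all(eigvals >= -1e-12))   # True
```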
| Type | Condition | Eigenvalues | Example |
|---|---|---|---|
| Positive Definite (PD) | $\mathbf{v}^T\mathbf{S}\mathbf{v} > 0$ for all $\mathbf{v} \neq \mathbf{0}$ | All $\lambda_j > 0$ | Full-rank covariance matrix |
| Positive Semi-Definite (PSD) | $\mathbf{v}^T\mathbf{S}\mathbf{v} \geq 0$ for all $\mathbf{v}$ | All $\lambda_j \geq 0$ | Rank-deficient covariance |
| Indefinite | $\mathbf{v}^T\mathbf{S}\mathbf{v}$ takes both signs | Mixed signs | General symmetric matrix |
| Negative Semi-Definite | $\mathbf{v}^T\mathbf{S}\mathbf{v} \leq 0$ for all $\mathbf{v}$ | All $\lambda_j \leq 0$ | $-\mathbf{S}$ for any PSD $\mathbf{S}$ |
Zero eigenvalues indicate directions where the data has no variance. If $\lambda_k = 0$, the data lies in a subspace of dimension less than $d$. This happens when: (1) $n < d$ (more features than samples), (2) features are exact linear combinations of others, or (3) all observations have identical values in some direction.
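Case (1) is the easiest to demonstrate: with $n = 3$ samples in $d = 5$ dimensions, centering leaves at most $n - 1 = 2$ directions of variance (a sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(3, 5))               # n = 3 samples, d = 5 features
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(Xc)

eigvals = np.linalg.eigvalsh(S)
# Centering costs one degree of freedom: rank ≤ n - 1 = 2
print(np.sum(eigvals > 1e-10))    # 2 non-zero eigenvalues out of 5
```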
Computing the characteristic polynomial and finding its roots is mathematically exact but computationally impractical for large matrices. Real-world eigensolvers use iterative methods that are more stable and efficient.
The simplest iterative method for finding the largest eigenvalue and its eigenvector is power iteration: start from a random vector $\mathbf{v}^{(0)}$ and repeatedly apply the matrix and renormalize, $\mathbf{v}^{(k+1)} = \mathbf{S}\mathbf{v}^{(k)} / \|\mathbf{S}\mathbf{v}^{(k)}\|$, until the direction stops changing.

The eigenvalue is then $\lambda_1 = (\mathbf{v}^{(k)})^T \mathbf{S} \mathbf{v}^{(k)}$.
Why it works: Decompose $\mathbf{v}^{(0)} = \sum_j c_j \mathbf{w}_j$. After $k$ iterations:
$$\mathbf{S}^k \mathbf{v}^{(0)} = \sum_j c_j \lambda_j^k \mathbf{w}_j = \lambda_1^k \left(c_1 \mathbf{w}_1 + \sum_{j>1} c_j \left(\frac{\lambda_j}{\lambda_1}\right)^k \mathbf{w}_j\right)$$
Since $|\lambda_j / \lambda_1| < 1$ for $j > 1$, the other components decay exponentially.
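Putting the steps together, a minimal power-iteration sketch (the iteration count and seed are arbitrary choices):

```python
import numpy as np

def power_iteration(S, num_iters=500, seed=0):
    """Return the dominant eigenvalue/eigenvector of a symmetric matrix S."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=S.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = S @ v
        v /= np.linalg.norm(v)           # renormalize to avoid overflow
    return v @ S @ v, v                  # Rayleigh quotient gives λ₁

S = np.array([[4.0, 1.0],
              [1.0, 3.0]])
lam, v = power_iteration(S)
print(np.isclose(lam, np.linalg.eigvalsh(S).max()))   # True
```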
The workhorse algorithm for computing all eigenvalues simultaneously is the QR algorithm: set $\mathbf{A}_0 = \mathbf{S}$, then repeatedly factor $\mathbf{A}_k = \mathbf{Q}_k \mathbf{R}_k$ and recombine as $\mathbf{A}_{k+1} = \mathbf{R}_k \mathbf{Q}_k$.

Each step is a similarity transform ($\mathbf{A}_{k+1} = \mathbf{Q}_k^T \mathbf{A}_k \mathbf{Q}_k$), so eigenvalues are preserved, and the diagonal entries converge to the eigenvalues. For symmetric matrices, convergence is particularly fast.
With shifts: Adding shifts $\mathbf{A}_k - \mu_k \mathbf{I}$ before QR decomposition accelerates convergence dramatically. The Wilkinson shift strategy achieves cubic convergence.
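A bare-bones unshifted QR iteration, for illustration only (production eigensolvers add shifts and reduce to tridiagonal form first):

```python
import numpy as np

def qr_algorithm(A, num_iters=200):
    """Unshifted QR iteration: the diagonal of A_k converges to the eigenvalues."""
    Ak = A.copy()
    for _ in range(num_iters):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q                       # similarity transform: Qᵀ Ak Q
    return np.diag(Ak)

S = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
print(np.allclose(np.sort(qr_algorithm(S)), np.linalg.eigvalsh(S)))   # True
```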
| Method | Computes | Complexity | Best For |
|---|---|---|---|
| Power Iteration | Largest eigenvalue/vector | $O(d^2)$ per iteration | Only need largest |
| Inverse Iteration | Eigenvalue near target | $O(d^3)$ per iteration | Refine known eigenvalue |
| QR Algorithm | All eigenvalues | $O(d^3)$ total | Dense matrices, all eigenvalues |
| Divide-and-Conquer | All eigenvalues | $O(d^3)$ but faster constants | Symmetric matrices |
| Lanczos | Extreme eigenvalues | $O(dk \cdot \text{nnz})$ | Sparse matrices, few eigenvalues |
| Randomized SVD | Top-$k$ singular values | $O(d^2 k)$ | Very large/streaming data |
For most PCA applications, use SVD rather than explicit eigendecomposition. Computing the SVD of $\mathbf{X}$ directly is more numerically stable than forming $\mathbf{X}^T\mathbf{X}$ and then eigendecomposing. Libraries like NumPy, SciPy, and scikit-learn handle this automatically.
Singular Value Decomposition (SVD) is intimately related to PCA and often preferred computationally. Understanding this connection is essential for practical PCA implementation.
Any matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ can be decomposed as:
$$\mathbf{X} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$$
where:

- $\mathbf{U} \in \mathbb{R}^{n \times n}$ is orthogonal; its columns are the left singular vectors
- $\mathbf{\Sigma} \in \mathbb{R}^{n \times d}$ is diagonal with non-negative singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$
- $\mathbf{V} \in \mathbb{R}^{d \times d}$ is orthogonal; its columns are the right singular vectors
Consider the covariance matrix (assuming centered $\mathbf{X}$):
$$\mathbf{S} = \frac{1}{n}\mathbf{X}^T\mathbf{X} = \frac{1}{n}(\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T)^T(\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T)$$
$$= \frac{1}{n}\mathbf{V}\mathbf{\Sigma}^T\mathbf{U}^T\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T = \frac{1}{n}\mathbf{V}\mathbf{\Sigma}^T\mathbf{\Sigma}\mathbf{V}^T = \mathbf{V}\left(\frac{\mathbf{\Sigma}^2}{n}\right)\mathbf{V}^T$$
Comparing with $\mathbf{S} = \mathbf{W}\mathbf{\Lambda}\mathbf{W}^T$, we can read off the correspondence: the right singular vectors are the principal components ($\mathbf{W} = \mathbf{V}$), and the eigenvalues are scaled squared singular values ($\lambda_j = \sigma_j^2 / n$).
To compute PCA: (1) Center the data, (2) Compute SVD of centered data matrix, (3) Right singular vectors $\mathbf{V}$ are principal components, (4) Singular values give explained variances via $\lambda_j = \sigma_j^2/n$. This is more numerically stable than explicit covariance eigendecomposition.
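The recipe above, as a NumPy sketch on synthetic data, cross-checked against the explicit covariance eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)                   # step 1: center

U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)   # step 2: SVD

components = Vt                           # step 3: rows are principal components
explained_var = sigma ** 2 / len(X)       # step 4: λⱼ = σⱼ²/n

# Agrees with explicit eigendecomposition of the covariance matrix
S = Xc.T @ Xc / len(X)
print(np.allclose(np.sort(explained_var), np.sort(np.linalg.eigvalsh(S))))  # True
```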
Numerical Stability: Forming $\mathbf{X}^T\mathbf{X}$ squares the condition number, amplifying numerical errors. SVD works directly on $\mathbf{X}$.
Memory Efficiency: For $n < d$, we never need to form the $d \times d$ covariance matrix.
Works on Rectangular Matrices: SVD is defined for any matrix, not just square ones.
Truncated Computation: Efficient algorithms compute only the top $k$ singular values/vectors without computing all $d$.
Real-world PCA implementation must handle various numerical challenges. Understanding these helps you interpret results correctly and avoid common mistakes.
The condition number $\kappa(\mathbf{S}) = \lambda_{\max} / \lambda_{\min}$ measures sensitivity to numerical errors.
Ill-conditioning often occurs when features have vastly different scales or when features are highly correlated.
When eigenvalues are very small, practical approaches include adding a small ridge term $\epsilon\mathbf{I}$ to regularize, discarding components whose eigenvalues fall below a numerical threshold (as the pseudoinverse does), and standardizing features beforehand so that scale differences don't create artificial ill-conditioning.
Two runs of PCA may produce eigenvectors with flipped signs or (for equal eigenvalues) in different order. This is mathematically correct but can cause issues in applications—for example, tracking principal components over time. Establish conventions (e.g., largest component of first PC is positive) for reproducibility.
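One possible convention, sketched below: flip each component so that its largest-magnitude entry is positive (the function name is mine):

```python
import numpy as np

def fix_signs(components):
    """Flip each row so its largest-magnitude entry is positive."""
    comps = components.copy()
    for i, row in enumerate(comps):
        if row[np.argmax(np.abs(row))] < 0:
            comps[i] = -row
    return comps

# Two sign-flipped versions of the same component now agree
w = np.array([[0.6, -0.8]])
print(np.allclose(fix_signs(w), fix_signs(-w)))   # True
```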
We've explored the eigenvalue problem that lies at the heart of PCA. The essential concepts: eigenvectors are the directions a matrix merely scales, and eigenvalues are the scaling factors; the spectral theorem guarantees that symmetric matrices have real eigenvalues and an orthonormal basis of eigenvectors; positive semi-definiteness ensures eigenvalues are non-negative, as variances must be; and in practice, SVD of the centered data matrix is the numerically stable route to the principal components.
We've now understood the mathematics behind computing principal components. In the next page, we'll explore the proportion of variance explained—how to measure and interpret the amount of information captured by each component, and how to use this to choose the appropriate number of components for your application.