The theory of eigenvalues and eigenvectors isn't abstract mathematics confined to textbooks—it powers billion-dollar ML systems running in production today.
Google's original search engine was built on the PageRank algorithm, which computes the dominant eigenvector of the web's link matrix. Netflix's recommendation engine relies on matrix factorization, fundamentally enabled by eigendecomposition. Principal Component Analysis (PCA), perhaps the most widely used dimensionality reduction technique, is pure eigenanalysis of covariance matrices. And spectral clustering discovers communities in social networks by analyzing graph Laplacian eigenvectors.
This page brings together everything we've learned to show eigenanalysis in action. You'll understand not just how these algorithms work, but why eigenvalues and eigenvectors naturally solve these problems.
By the end of this page, you will:

- implement PCA from scratch and understand its eigenvalue interpretation,
- reproduce the core PageRank algorithm and see eigenvectors in action,
- grasp spectral clustering and why graph eigenvalues reveal clusters,
- recognize eigenanalysis patterns across diverse ML applications.
PCA is the crown jewel of eigenanalysis applications in ML. It finds the directions of maximum variance in high-dimensional data, enabling dimensionality reduction while preserving the most important information.
The problem: Given n data points in d dimensions, find a lower-dimensional representation that captures the most variance.
The eigenvalue solution: (1) center the data, (2) compute the covariance matrix Σ, (3) eigendecompose Σ, and (4) project the data onto the eigenvectors with the largest eigenvalues.
Why eigenvalues maximize variance:
The first principal component v₁ maximizes variance subject to ||v||=1:
$$v_1 = \arg\max_{||v||=1} \text{Var}(Xv) = \arg\max_{||v||=1} v^T \Sigma v$$
By the method of Lagrange multipliers, the solution satisfies Σv₁ = λ₁v₁ — the eigenvector with largest eigenvalue!
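Spelling out that Lagrange-multiplier step (a standard derivation, included for completeness): with the Lagrangian $\mathcal{L}(v, \lambda) = v^T \Sigma v - \lambda (v^T v - 1)$, stationarity gives

$$\nabla_v \mathcal{L} = 2\Sigma v - 2\lambda v = 0 \quad\Longrightarrow\quad \Sigma v = \lambda v,$$

and at any such stationary point the objective equals $v^T \Sigma v = \lambda v^T v = \lambda$, so the maximum is attained by the eigenvector with the largest eigenvalue.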
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris


class PCA:
    """
    Principal Component Analysis from scratch using eigendecomposition.
    """
    def __init__(self, n_components=None):
        self.n_components = n_components
        self.components_ = None  # Principal component directions
        self.explained_variance_ = None
        self.explained_variance_ratio_ = None
        self.mean_ = None

    def fit(self, X):
        """
        Fit PCA to data X.

        Steps:
        1. Center the data
        2. Compute covariance matrix
        3. Eigendecomposition (spectral decomposition)
        4. Sort by eigenvalue magnitude
        """
        n_samples, n_features = X.shape

        # Step 1: Center the data
        self.mean_ = X.mean(axis=0)
        X_centered = X - self.mean_

        # Step 2: Compute covariance matrix
        # Cov = X^T X / (n-1) for centered data
        covariance_matrix = X_centered.T @ X_centered / (n_samples - 1)

        # Step 3: Eigendecomposition (use eigh for symmetric matrix)
        eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)

        # Step 4: Sort by eigenvalue (descending)
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]

        # Total variance = sum of ALL eigenvalues (computed before truncation)
        total_variance = eigenvalues.sum()

        # Store results (keep only the requested number of components)
        if self.n_components is not None:
            eigenvalues = eigenvalues[:self.n_components]
            eigenvectors = eigenvectors[:, :self.n_components]

        self.components_ = eigenvectors.T  # Rows are components
        self.explained_variance_ = eigenvalues
        self.explained_variance_ratio_ = eigenvalues / total_variance

        return self

    def transform(self, X):
        """Project data onto principal components."""
        X_centered = X - self.mean_
        return X_centered @ self.components_.T

    def fit_transform(self, X):
        """Fit and transform in one step."""
        self.fit(X)
        return self.transform(X)

    def inverse_transform(self, X_reduced):
        """Reconstruct data from reduced representation."""
        return X_reduced @ self.components_ + self.mean_


# Load and visualize with Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

print("Iris Dataset: PCA Analysis")
print("=" * 60)
print(f"Original shape: {X.shape} (150 samples, 4 features)")

# Fit our PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(f"Reduced shape: {X_reduced.shape}")
print("\nExplained variance per component:")
for i, (var, ratio) in enumerate(zip(pca.explained_variance_, pca.explained_variance_ratio_)):
    print(f"  PC{i+1}: variance = {var:.4f}, ratio = {ratio*100:.2f}%")
print(f"\nTotal variance explained: {pca.explained_variance_ratio_.sum()*100:.2f}%")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Projected data
ax = axes[0]
colors = ['red', 'green', 'blue']
for i, color in enumerate(colors):
    mask = y == i
    ax.scatter(X_reduced[mask, 0], X_reduced[mask, 1], c=color,
               label=iris.target_names[i], alpha=0.7)
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% variance)')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% variance)')
ax.set_title('Iris Dataset: PCA Projection')
ax.legend()
ax.grid(True, alpha=0.3)

# Right: Scree plot
ax = axes[1]
full_pca = PCA().fit(X)
var_ratios = full_pca.explained_variance_ratio_
cumulative = np.cumsum(var_ratios)

ax.bar(range(1, len(var_ratios)+1), var_ratios, alpha=0.7, label='Individual')
ax.plot(range(1, len(var_ratios)+1), cumulative, 'ro-', label='Cumulative')
ax.axhline(y=0.95, color='green', linestyle='--', label='95% threshold')
ax.set_xlabel('Principal Component')
ax.set_ylabel('Variance Ratio')
ax.set_title('Scree Plot: How Many Components?')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('pca_analysis.png', dpi=150)
plt.show()

# Verify against sklearn
from sklearn.decomposition import PCA as SklearnPCA
sklearn_pca = SklearnPCA(n_components=2).fit(X)
print("\nVerification against sklearn PCA:")
print(f"Variance ratios match: {np.allclose(pca.explained_variance_ratio_, sklearn_pca.explained_variance_ratio_)}")
```

Eigenvalues tell you how much variance each PC captures. A steep drop in the scree plot suggests low intrinsic dimensionality. If the first few eigenvalues capture 95%+ of variance, the data lies near a low-dimensional subspace—PCA works beautifully. If eigenvalues decay slowly, the data is truly high-dimensional.
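As a quick way to act on the scree plot, here is a minimal sketch (the helper name and the example ratios are illustrative, not from the code above) that picks the smallest number of components reaching a 95% cumulative-variance target:

```python
import numpy as np

def components_for_variance(ratios, target=0.95):
    """Smallest number of components whose cumulative explained-variance ratio reaches `target`."""
    cumulative = np.cumsum(ratios)
    return int(np.searchsorted(cumulative, target)) + 1

# Illustrative ratios with a sharp drop after the second component
ratios = np.array([0.6, 0.3, 0.06, 0.04])
print(components_for_variance(ratios))  # 3 components reach the 95% threshold
```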
PageRank, developed by Larry Page and Sergey Brin at Stanford, revolutionized web search. It models a random web surfer who randomly clicks links, occasionally jumping to random pages. Page importance = probability of being on that page in the long run.
The mathematical model: build the link-following matrix M by normalizing each page's outgoing links by its out-degree (so M is column-stochastic), then mix in random jumps with damping factor α (typically 0.85): P = αM + (1 − α)(1/n)J, where J is the all-ones matrix and n is the number of pages.
Why eigenvectors?
The random surfer's distribution after k steps is π_k = π_0 · Pᵏ (row-vector convention). As k→∞, this converges to the unique stationary distribution π satisfying π = πP; transposing, Pᵀπᵀ = πᵀ, so πᵀ is an eigenvector of Pᵀ with eigenvalue 1. (The code below builds the column-stochastic matrix directly, so the PageRank vector is that matrix's eigenvector with eigenvalue 1.)
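To see why power iteration converges to that eigenvector, expand the starting distribution in the eigenbasis of the transition matrix (assumed diagonalizable for this sketch; written here in the column-stochastic form used by the code below):

$$\pi_0 = \sum_i c_i v_i \quad\Longrightarrow\quad P^k \pi_0 = \sum_i c_i \lambda_i^k v_i \;\xrightarrow{\;k\to\infty\;}\; c_1 v_1,$$

since λ₁ = 1 while every other eigenvalue of the damped matrix satisfies |λᵢ| ≤ α < 1. The gap 1 − |λ₂| controls how many iterations are needed.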
```python
import numpy as np
import matplotlib.pyplot as plt


def pagerank(adjacency_matrix, damping=0.85, max_iterations=100, tol=1e-8):
    """
    Compute PageRank using the power iteration method.

    Parameters:
        adjacency_matrix: A[i,j] = 1 if page j links to page i
        damping: Probability of following a link (vs random jump)
        max_iterations: Maximum number of iterations
        tol: Convergence tolerance

    Returns:
        PageRank scores (probability distribution)
    """
    n = adjacency_matrix.shape[0]

    # Step 1: Create column-stochastic transition matrix
    # Normalize by out-degree (column sums)
    out_degrees = adjacency_matrix.sum(axis=0)

    # Handle dangling nodes (no outgoing links)
    dangling = (out_degrees == 0)
    out_degrees[dangling] = 1  # Avoid division by zero
    M = adjacency_matrix / out_degrees
    M[:, dangling] = 1/n  # Dangling nodes link to everyone

    # Step 2: Add damping factor
    # P = α*M + (1-α) * (1/n) * ones
    teleport = np.ones((n, n)) / n
    P = damping * M + (1 - damping) * teleport

    # Step 3: Power iteration to find dominant eigenvector
    # Start with uniform distribution
    pagerank_scores = np.ones(n) / n

    for iteration in range(max_iterations):
        new_scores = P @ pagerank_scores

        # Check convergence
        diff = np.abs(new_scores - pagerank_scores).sum()
        pagerank_scores = new_scores

        if diff < tol:
            print(f"Converged after {iteration + 1} iterations")
            break

    return pagerank_scores


def pagerank_eigen(adjacency_matrix, damping=0.85):
    """
    Compute PageRank using explicit eigendecomposition.
    (Power iteration is preferred for large graphs, but this shows the connection.)
    """
    n = adjacency_matrix.shape[0]

    # Build transition matrix (same as above)
    out_degrees = adjacency_matrix.sum(axis=0)
    dangling = (out_degrees == 0)
    out_degrees[dangling] = 1
    M = adjacency_matrix / out_degrees
    M[:, dangling] = 1/n
    P = damping * M + (1 - damping) * np.ones((n, n)) / n

    # Find eigenvector with eigenvalue 1
    eigenvalues, eigenvectors = np.linalg.eig(P)

    # Find the eigenvalue closest to 1
    idx = np.argmin(np.abs(eigenvalues - 1))
    dominant_eigenvector = eigenvectors[:, idx].real

    # Normalize to be a probability distribution
    pagerank_scores = dominant_eigenvector / dominant_eigenvector.sum()

    return pagerank_scores, eigenvalues


# Example: Small web graph
# 0 -> 1 -> 3
# |    ^    |
# v    |    v
# 2 ---+    4
pages = ['Homepage', 'Products', 'About', 'Blog', 'Contact']
n = len(pages)

# Adjacency matrix: A[i,j] = 1 if j links to i
# (Column j shows where page j links TO)
links = np.array([
    [0, 0, 1, 0, 0],  # Links TO Homepage
    [1, 0, 1, 0, 0],  # Links TO Products
    [1, 0, 0, 0, 0],  # Links TO About
    [0, 1, 0, 0, 0],  # Links TO Blog
    [0, 0, 0, 1, 0],  # Links TO Contact
])

print("PageRank Analysis")
print("=" * 60)
print("\nLink structure:")
for i, page in enumerate(pages):
    outlinks = [pages[j] for j in range(n) if links[j, i] == 1]
    print(f"  {page} -> {outlinks if outlinks else '[none]'}")

# Compute PageRank
print("\n" + "-" * 40)
print("Power Iteration Method:")
pr_power = pagerank(links.astype(float), damping=0.85)
for page, score in zip(pages, pr_power):
    print(f"  {page}: {score:.4f}")

print("\n" + "-" * 40)
print("Eigendecomposition Method:")
pr_eigen, eigenvalues = pagerank_eigen(links.astype(float), damping=0.85)
for page, score in zip(pages, pr_eigen):
    print(f"  {page}: {score:.4f}")

print("\nEigenvalues of transition matrix:")
for i, ev in enumerate(sorted(np.abs(eigenvalues), reverse=True)[:5]):
    print(f"  |λ{i+1}| = {ev:.6f}")
print("(Largest eigenvalue = 1 corresponds to stationary distribution)")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: PageRank scores as bar chart
ax = axes[0]
colors = plt.cm.viridis(pr_power / pr_power.max())
ax.barh(pages, pr_power, color=colors)
ax.set_xlabel('PageRank Score')
ax.set_title('PageRank Scores')
ax.grid(True, alpha=0.3)

# Right: Eigenvalue spectrum
ax = axes[1]
theta = np.linspace(0, 2*np.pi, 100)
ax.plot(np.cos(theta), np.sin(theta), 'k--', alpha=0.3)
ax.scatter(eigenvalues.real, eigenvalues.imag, s=100, c='red', zorder=5)
ax.scatter([1], [0], s=200, c='green', marker='*', zorder=6, label='λ=1 (PageRank)')
ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-1.5, 1.5)
ax.set_aspect('equal')
ax.set_xlabel('Real')
ax.set_ylabel('Imaginary')
ax.set_title('Eigenvalues of Transition Matrix\n(Green star = PageRank eigenvector)')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('pagerank_analysis.png', dpi=150)
plt.show()
```

The actual web has billions of pages. Power iteration works because: (1) it only needs the dominant eigenvector, (2) the transition matrix is extremely sparse (each page links to few others), and (3) it can be parallelized using MapReduce. Early Google computed PageRank on clusters of commodity machines.
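To make the sparsity point concrete, here is a minimal sketch (not Google's production algorithm) of the power-iteration update using scipy.sparse, so the dense teleportation matrix is never materialized; the tiny 3-cycle graph, the function name, and the omission of dangling-node handling are illustrative simplifications:

```python
import numpy as np
import scipy.sparse as sp

def sparse_pagerank(M, damping=0.85, tol=1e-10, max_iter=100):
    """Power iteration on a column-stochastic sparse matrix M (no dangling nodes handled here).

    The teleportation term (1 - damping)/n is added to every entry implicitly,
    so each iteration costs only one sparse matrix-vector product.
    """
    n = M.shape[0]
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_scores = damping * (M @ scores) + (1 - damping) / n
        if np.abs(new_scores - scores).sum() < tol:
            return new_scores
        scores = new_scores
    return scores

# Tiny illustrative graph: 0 -> 1 -> 2 -> 0 (a 3-cycle, already column-stochastic)
M = sp.csc_matrix(np.array([[0., 0., 1.],
                            [1., 0., 0.],
                            [0., 1., 0.]]))
print(sparse_pagerank(M))  # By symmetry, all three pages end up equally important (~1/3 each)
```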
Spectral clustering discovers structure in graphs (and in data, via similarity graphs). The key insight: clusters correspond to near-zero eigenvalues of the graph Laplacian.
The Graph Laplacian:
For a graph with adjacency matrix W (weighted edges) and degree matrix D (diagonal, D_ii = sum of weights at node i), the unnormalized Laplacian is L = D − W.
Key properties (see the quadratic-form sketch below):

- L is symmetric and positive semidefinite.
- Its smallest eigenvalue is 0, with the constant vector as eigenvector.
- The multiplicity of eigenvalue 0 equals the number of connected components.
- When a graph has k loosely connected clusters, the k smallest eigenvalues are close to zero and their eigenvectors are approximately piecewise constant on the clusters.
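The smoothness story behind these properties comes from the Laplacian quadratic form, valid for any vector v (using the symmetry of W):

$$v^T L v = v^T D v - v^T W v = \frac{1}{2}\sum_{i,j} w_{ij}\,(v_i - v_j)^2 \;\ge\; 0.$$

A vector with a small value of vᵀLv barely changes across heavily weighted edges, which is exactly how a cluster-indicator vector behaves; that is why eigenvectors with near-zero eigenvalues encode cluster membership.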
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist


def build_similarity_graph(X, sigma=1.0):
    """
    Build a weighted adjacency matrix using Gaussian (RBF) similarity.

    W_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2))
    """
    distances = cdist(X, X, 'sqeuclidean')
    W = np.exp(-distances / (2 * sigma**2))
    np.fill_diagonal(W, 0)  # No self-loops
    return W


def spectral_clustering(X, n_clusters, sigma=1.0):
    """
    Spectral clustering from scratch.

    Steps:
    1. Build similarity graph
    2. Compute unnormalized Laplacian L = D - W
    3. Find k smallest eigenvectors of L
    4. Use these as features for k-means
    """
    n = X.shape[0]

    # Step 1: Build similarity graph
    W = build_similarity_graph(X, sigma)

    # Step 2: Compute Laplacian
    D = np.diag(W.sum(axis=1))  # Degree matrix
    L = D - W                   # Unnormalized Laplacian

    # Step 3: Eigendecomposition (using eigh for symmetric matrix)
    eigenvalues, eigenvectors = np.linalg.eigh(L)

    # The k smallest eigenvectors (this includes the trivial 0-eigenvalue
    # constant eigenvector for a connected graph)
    spectral_embedding = eigenvectors[:, :n_clusters]

    # Normalize rows (for normalized spectral clustering)
    spectral_embedding = spectral_embedding / np.linalg.norm(spectral_embedding, axis=1, keepdims=True)

    # Step 4: K-means in the spectral embedding
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    labels = kmeans.fit_predict(spectral_embedding)

    return labels, eigenvalues, spectral_embedding


# Generate synthetic data with clusters
np.random.seed(42)

# Two moons - classic test for spectral clustering
from sklearn.datasets import make_moons, make_circles

X_moons, y_moons = make_moons(n_samples=300, noise=0.05)
X_circles, y_circles = make_circles(n_samples=300, noise=0.05, factor=0.5)

# Figure
fig, axes = plt.subplots(2, 4, figsize=(18, 9))

for row, (X, y_true, name) in enumerate([(X_moons, y_moons, 'Two Moons'),
                                         (X_circles, y_circles, 'Concentric Circles')]):
    # Column 1: Original data
    ax = axes[row, 0]
    ax.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', alpha=0.7)
    ax.set_title(f'{name}\nTrue Labels')
    ax.set_aspect('equal')

    # Column 2: K-means (fails on non-convex)
    kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
    labels_kmeans = kmeans.fit_predict(X)
    ax = axes[row, 1]
    ax.scatter(X[:, 0], X[:, 1], c=labels_kmeans, cmap='viridis', alpha=0.7)
    ax.set_title('K-means Clustering\n(fails on non-convex)')
    ax.set_aspect('equal')

    # Column 3: Spectral clustering
    labels_spectral, eigenvalues, embedding = spectral_clustering(X, n_clusters=2, sigma=0.3)
    if name == 'Two Moons':
        eigenvalues_moons = eigenvalues  # Saved for the printout below (the loop overwrites `eigenvalues`)
    ax = axes[row, 2]
    ax.scatter(X[:, 0], X[:, 1], c=labels_spectral, cmap='viridis', alpha=0.7)
    ax.set_title('Spectral Clustering\n(handles non-convex!)')
    ax.set_aspect('equal')

    # Column 4: Eigenvalue spectrum
    ax = axes[row, 3]
    ax.plot(eigenvalues[:20], 'o-', markersize=8)
    ax.axhline(y=0, color='red', linestyle='--', alpha=0.5)
    ax.set_xlabel('Index')
    ax.set_ylabel('Eigenvalue')
    ax.set_title('Laplacian Eigenvalues\n(Gap reveals # clusters)')
    ax.grid(True, alpha=0.3)

    # Highlight the spectral gap
    ax.axvline(x=1.5, color='green', linestyle='--', alpha=0.7, label='Spectral gap')
    ax.legend()

plt.tight_layout()
plt.savefig('spectral_clustering_demo.png', dpi=150)
plt.show()

print("Spectral Clustering Analysis")
print("=" * 60)
print("\nKey insight: The SPECTRAL GAP (jump between eigenvalues)")
print("reveals the number of clusters!")
print("\nFor 'Two Moons':")
print(f"  First 5 eigenvalues: {eigenvalues_moons[:5].round(6)}")
print("  Notice the gap after k=2: eigenvalue 3 is much larger than eigenvalue 2")
```

The eigenvectors of the Laplacian represent the 'smoothest' functions over the graph—functions that vary slowly across edges. Cluster membership IS a smooth function (constant within clusters, different between). So the bottom eigenvectors naturally encode cluster structure!
The Singular Value Decomposition (SVD) extends eigenanalysis to rectangular matrices. Any m×n matrix A can be decomposed as:
$$\mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$$
where:

- U is an m×m orthogonal matrix whose columns are the left singular vectors,
- Σ is an m×n diagonal matrix holding the non-negative singular values σ₁ ≥ σ₂ ≥ … ≥ 0,
- V is an n×n orthogonal matrix whose columns are the right singular vectors.
Connection to eigenvalues: the right singular vectors (columns of V) are eigenvectors of AᵀA, the left singular vectors (columns of U) are eigenvectors of AAᵀ, and the singular values are the square roots of their shared non-zero eigenvalues; a one-line check follows.
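Substituting A = UΣVᵀ and using the orthogonality of U and V:

$$\mathbf{A}^T\mathbf{A} = \mathbf{V}\mathbf{\Sigma}^T\mathbf{U}^T\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T = \mathbf{V}(\mathbf{\Sigma}^T\mathbf{\Sigma})\mathbf{V}^T, \qquad \mathbf{A}\mathbf{A}^T = \mathbf{U}(\mathbf{\Sigma}\mathbf{\Sigma}^T)\mathbf{U}^T,$$

so the eigenvalues of AᵀA (and of AAᵀ) are exactly the squared singular values, λᵢ = σᵢ².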
Applications: low-rank approximation for image and data compression, latent semantic analysis (LSA) for text similarity, collaborative-filtering recommenders built on low-rank matrix factorization (the Netflix-style systems mentioned earlier), and a numerically stable route to PCA itself. The code below demonstrates the first two.
```python
import numpy as np
import matplotlib.pyplot as plt


def demonstrate_svd_compression():
    """
    Demonstrate low-rank approximation via SVD for image compression.
    """
    # Create a simple grayscale "image" (or use any image)
    # Using a gradient + pattern for illustration
    x = np.linspace(0, 4*np.pi, 100)
    y = np.linspace(0, 4*np.pi, 80)
    X, Y = np.meshgrid(x, y)
    image = np.sin(X) * np.cos(Y) + 0.5 * np.sin(2*X)

    print("SVD for Low-Rank Approximation (Image Compression)")
    print("=" * 60)
    print(f"Original image shape: {image.shape}")
    print(f"Original storage: {image.size} values")

    # Compute SVD
    U, singular_values, Vt = np.linalg.svd(image, full_matrices=False)

    print(f"\nSingular values (first 10): {singular_values[:10].round(4)}")
    print(f"Total singular values: {len(singular_values)}")

    # Reconstruct with different ranks
    ranks = [1, 5, 10, 20, 50, 80]

    fig, axes = plt.subplots(2, 4, figsize=(16, 8))

    # Original
    ax = axes[0, 0]
    ax.imshow(image, cmap='viridis')
    ax.set_title(f'Original\n{image.size} values')
    ax.axis('off')

    for idx, k in enumerate(ranks):
        # Low-rank approximation: A_k = U[:,:k] @ diag(s[:k]) @ Vt[:k,:]
        image_k = U[:, :k] @ np.diag(singular_values[:k]) @ Vt[:k, :]

        # Compression ratio
        storage = U.shape[0] * k + k + Vt.shape[1] * k
        compression = 100 * (1 - storage / image.size)

        # Reconstruction error
        error = np.linalg.norm(image - image_k) / np.linalg.norm(image)

        ax = axes.flat[idx + 1]
        ax.imshow(image_k, cmap='viridis')
        ax.set_title(f'Rank {k}\n{compression:.1f}% compression\nError: {error:.4f}')
        ax.axis('off')

    # Scree plot
    ax = axes[1, 3]
    ax.semilogy(singular_values, 'b-o', markersize=3)
    ax.set_xlabel('Index')
    ax.set_ylabel('Singular Value (log scale)')
    ax.set_title('Singular Value Decay')
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('svd_compression.png', dpi=150)
    plt.show()

    return singular_values


def demonstrate_latent_semantic_analysis():
    """
    Demonstrate LSA for text similarity using SVD.
    """
    print("\n" + "=" * 60)
    print("Latent Semantic Analysis (LSA)")
    print("=" * 60)

    # Toy document-term matrix
    # Documents: rows, Terms: columns
    # Each entry = term frequency
    documents = [
        "cat dog pet",
        "dog cat animal pet",
        "car truck vehicle",
        "truck car motor vehicle",
        "pet animal love",
    ]

    # Build vocabulary and term-document matrix
    vocab = sorted(set(' '.join(documents).split()))
    n_docs = len(documents)
    n_terms = len(vocab)
    term_idx = {term: i for i, term in enumerate(vocab)}

    A = np.zeros((n_docs, n_terms))
    for i, doc in enumerate(documents):
        for word in doc.split():
            A[i, term_idx[word]] += 1

    print(f"Documents: {documents}")
    print(f"Vocabulary: {vocab}")
    print(f"\nDocument-Term Matrix A ({n_docs} docs x {n_terms} terms):")
    print(A)

    # SVD
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    print(f"\nSingular values: {s.round(4)}")

    # Project documents to 2D
    k = 2
    doc_embeddings = U[:, :k] @ np.diag(s[:k])

    print("\nDocument embeddings (2D):")
    for i, doc in enumerate(documents):
        print(f"  Doc {i} ({doc[:20]}...): {doc_embeddings[i].round(3)}")

    # Compute document similarities in LSA space
    from scipy.spatial.distance import cosine

    print("\nDocument similarities in LSA space:")
    for i in range(n_docs):
        for j in range(i+1, n_docs):
            sim = 1 - cosine(doc_embeddings[i], doc_embeddings[j])
            if sim > 0.7:  # Show high similarities
                print(f"  Doc {i} <-> Doc {j}: {sim:.3f}")

    return doc_embeddings


singular_values = demonstrate_svd_compression()
doc_embeddings = demonstrate_latent_semantic_analysis()
```

Eigenvalues are crucial for understanding the training dynamics of neural networks, particularly recurrent neural networks (RNNs).
Gradient flow and eigenvalues:
In a deep network, gradients flow backward through the weight matrices. If W is a weight matrix, each backward pass multiplies the gradient by Wᵀ: components along eigenvectors with |λ| > 1 are amplified, components along eigenvectors with |λ| < 1 are damped, and these effects compound exponentially with depth.
RNN dynamics:
An RNN updates hidden states as h_t = tanh(W_h h_{t-1} + W_x x_t). Over T timesteps, the backpropagated gradient involves a product of T Jacobians, each containing W_h (roughly W_h raised to the T-th power). If the eigenvalues of W_h have magnitude below 1, gradients shrink exponentially toward zero; if above 1, they grow exponentially.
This is why standard RNNs struggle with long sequences—eigenvalue analysis explains the vanishing gradient problem!
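A minimal sketch of that argument (treating the tanh derivative factors as ≈ 1; in practice they only shrink gradients further):

$$\frac{\partial h_T}{\partial h_0} \;=\; \prod_{t=1}^{T} \frac{\partial h_t}{\partial h_{t-1}} \;=\; \prod_{t=1}^{T} \operatorname{diag}\!\left(1 - h_t^2\right) W_h \;\approx\; W_h^{\,T},$$

whose norm grows or decays roughly like ρ(W_h)ᵀ, with ρ the spectral radius: ρ < 1 gives exponentially vanishing gradients, ρ > 1 exponentially exploding ones.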
| Problem | Eigenvalue Diagnosis | Solution |
|---|---|---|
| Vanishing gradients in RNN | \|λ\| < 1 for W_h | LSTM/GRU (gating), orthogonal init |
| Exploding gradients | \|λ\| > 1 for W_h | Gradient clipping, spectral norm |
| Slow convergence | Large condition number | Batch normalization, Adam |
| Mode collapse in GANs | Singular Jacobian | Spectral normalization |
| Training instability | Complex eigenvalues | Careful initialization |
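As one concrete remedy from the table, here is a minimal sketch of spectral-norm control, assuming only NumPy: estimate the largest singular value of W by power iteration and rescale. This is the idea behind spectral normalization; real framework implementations keep running estimates of the singular vectors across training steps rather than iterating to convergence each time.

```python
import numpy as np

def spectral_normalize(W, n_iters=50):
    """Scale W so its largest singular value is approximately 1 (power-iteration estimate)."""
    v = np.random.randn(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    sigma_max = u @ W @ v  # Rayleigh-quotient estimate of the top singular value
    return W / sigma_max

np.random.seed(0)
W = np.random.randn(64, 64)
W_sn = spectral_normalize(W)
print(np.linalg.norm(W, 2), np.linalg.norm(W_sn, 2))  # Second value should be close to 1.0
```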
```python
import numpy as np
import matplotlib.pyplot as plt


def analyze_rnn_weight_matrix(W, name="W"):
    """
    Analyze an RNN weight matrix for gradient flow properties.
    """
    eigenvalues = np.linalg.eigvals(W)

    print(f"\n{name}: {W.shape[0]}x{W.shape[1]} matrix")
    print(f"  Spectral radius (max |λ|): {np.max(np.abs(eigenvalues)):.4f}")
    print(f"  Min |λ|: {np.min(np.abs(eigenvalues)):.4f}")

    # Classify gradient behavior
    max_abs = np.max(np.abs(eigenvalues))
    if max_abs > 1.01:
        status = "⚠️ EXPLODING gradients risk"
    elif max_abs < 0.99:
        status = "⚠️ VANISHING gradients risk"
    else:
        status = "✓ Stable gradient flow"
    print(f"  Status: {status}")

    return eigenvalues


def compare_initializations(hidden_size=100):
    """
    Compare different RNN weight initializations via eigenvalue analysis.
    """
    print("RNN Weight Matrix Initialization Analysis")
    print("=" * 60)

    fig, axes = plt.subplots(2, 2, figsize=(12, 12))

    initializations = {
        'Random Gaussian (σ=1.0)': np.random.randn(hidden_size, hidden_size),
        'Xavier/Glorot': np.random.randn(hidden_size, hidden_size) / np.sqrt(hidden_size),
        'Orthogonal': np.linalg.qr(np.random.randn(hidden_size, hidden_size))[0],
        'Identity + small noise': np.eye(hidden_size) + 0.01 * np.random.randn(hidden_size, hidden_size),
    }

    for ax, (name, W) in zip(axes.flat, initializations.items()):
        eigenvalues = analyze_rnn_weight_matrix(W, name)

        # Plot eigenvalues in complex plane
        theta = np.linspace(0, 2*np.pi, 100)
        ax.plot(np.cos(theta), np.sin(theta), 'g--', alpha=0.5, label='Unit circle')
        ax.scatter(eigenvalues.real, eigenvalues.imag, alpha=0.6, s=20)

        max_abs = np.max(np.abs(eigenvalues))
        ax.set_xlim(-max(2, max_abs*1.1), max(2, max_abs*1.1))
        ax.set_ylim(-max(2, max_abs*1.1), max(2, max_abs*1.1))
        ax.set_aspect('equal')
        ax.set_xlabel('Real')
        ax.set_ylabel('Imaginary')
        ax.set_title(f'{name}\nSpectral radius: {max_abs:.3f}')
        ax.grid(True, alpha=0.3)
        ax.legend(loc='upper right')

    plt.suptitle('Eigenvalue Distribution for Different RNN Initializations', fontsize=14)
    plt.tight_layout()
    plt.savefig('rnn_eigenvalue_init.png', dpi=150)
    plt.show()


def simulate_gradient_flow(W, T=50):
    """
    Simulate gradient magnitude over time in an RNN.

    For a simple RNN, the gradient at time 0 for a loss at time T involves W^T.
    """
    spectral_radius = np.max(np.abs(np.linalg.eigvals(W)))

    gradient_norms = []
    gradient = np.ones(W.shape[0])
    gradient = gradient / np.linalg.norm(gradient)

    for t in range(T):
        gradient = W.T @ gradient
        gradient_norms.append(np.linalg.norm(gradient))

    return gradient_norms, spectral_radius


# Run analysis
np.random.seed(42)
compare_initializations(hidden_size=100)

# Simulate gradient flow
print("\n" + "=" * 60)
print("Gradient Flow Simulation")
print("=" * 60)

fig, ax = plt.subplots(figsize=(10, 6))

hidden_size = 50
T = 30

for spectral_target, label in [(0.8, 'σ_r = 0.8 (vanishing)'),
                               (1.0, 'σ_r = 1.0 (stable)'),
                               (1.2, 'σ_r = 1.2 (exploding)')]:
    # Create orthogonal matrix scaled to target spectral radius
    Q = np.linalg.qr(np.random.randn(hidden_size, hidden_size))[0]
    W = spectral_target * Q

    gradient_norms, _ = simulate_gradient_flow(W, T)
    ax.semilogy(gradient_norms, label=label, linewidth=2)

ax.set_xlabel('Timesteps back')
ax.set_ylabel('Gradient norm (log scale)')
ax.set_title('Gradient Flow vs Spectral Radius\n(Why RNNs struggle with long sequences)')
ax.legend()
ax.grid(True, alpha=0.3)
ax.axhline(y=1, color='gray', linestyle='--', alpha=0.5)
plt.savefig('gradient_flow_simulation.png', dpi=150)
plt.show()
```

LSTMs solve the vanishing gradient problem by using gating mechanisms that effectively control eigenvalues during backpropagation. The cell state path has gradients that multiply by factors close to 1, maintaining |λ| ≈ 1 over long sequences. This is why LSTMs capture long-range dependencies!
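A minimal sketch of why the cell-state path behaves this way, using the standard LSTM cell update with forget gate f_t and input gate i_t, and ignoring the gates' own dependence on the hidden state:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \quad\Longrightarrow\quad \frac{\partial c_T}{\partial c_t} \approx \prod_{k=t+1}^{T} \operatorname{diag}(f_k),$$

so as long as the forget gates stay close to 1, the gradient along this path neither vanishes nor explodes, regardless of sequence length.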
We've seen eigenvalues and eigenvectors appear across diverse ML applications. Let's consolidate the patterns:
| Application | What's Diagonalized | Key Insight |
|---|---|---|
| PCA | Covariance matrix Σ | Eigenvectors = principal directions, eigenvalues = variance |
| PageRank | Transition matrix P | Dominant eigenvector (λ=1) = page importance |
| Spectral Clustering | Graph Laplacian L | Small eigenvalues = cluster indicators |
| SVD / LSA | Data matrix A (via AᵀA) | Low-rank structure for compression/similarity |
| RNN Analysis | Weight matrix W | Spectral radius determines gradient flow |
| Kernel PCA | Kernel matrix K | Nonlinear PCA in feature space |
| Markov Chains | Transition matrix P | Eigenvalue 1 = steady state; \|λ₂\| = mixing rate |
| Graph Embedding | Adjacency/Laplacian | Eigenvectors = node features |
| Optimization | Hessian H | Eigenvalues = curvature, condition number |
We've completed a comprehensive exploration of eigenvalues and eigenvectors—from formal definitions to production ML systems. Let's consolidate everything:
You now possess a deep understanding of eigenvalues and eigenvectors—the mathematical backbone of dimensionality reduction, graph analysis, and dynamical systems in machine learning. This knowledge will serve you across PCA, spectral methods, matrix factorization, and neural network analysis. The eigenvalue lens reveals structure hidden in matrices everywhere.
What's Next:
The next module in this chapter covers Matrix Decompositions—SVD, QR, Cholesky, and LU decompositions. These build on eigenanalysis to provide practical tools for solving linear systems, low-rank approximation, and numerical stability in ML algorithms.