You've computed principal components for a dataset with 1,000 features. The first component captures the most variance, the second captures the second most, and so on. But a crucial question remains: how many components should you actually keep?
Keeping all 1,000 defeats the purpose of dimensionality reduction. Keeping just one might discard too much information. The answer lies in understanding the proportion of variance explained by each component—a metric that quantifies how much 'information' each principal component captures.
This page develops the mathematical framework for variance explained, shows how to interpret and visualize it, and provides principled approaches for choosing the number of components to retain.
By the end of this page, you will understand variance explained and cumulative variance ratios, know how to interpret these metrics for your specific application, be able to apply multiple methods for choosing the number of components (threshold-based, elbow method, cross-validation), and appreciate the tradeoffs involved in this decision.
Let's establish the precise definitions and understand what variance explained really measures.
For centered data, the total variance is the sum of all eigenvalues:
$$\text{Total Variance} = \sum_{j=1}^{d} \lambda_j = \text{tr}(\mathbf{S}) = \sum_{j=1}^{d} \text{Var}(X_j)$$
This equals the trace of the covariance matrix, which also equals the sum of individual feature variances. It's a fixed property of the data—the 'total information' we start with.
Each principal component captures variance equal to its eigenvalue:
$$\text{Variance of PC}_j = \lambda_j$$
The proportion of variance explained by component $j$ is:
$$\text{PVE}_j = \frac{\lambda_j}{\sum_{k=1}^{d} \lambda_k}$$
This is always between 0 and 1, and all proportions sum to 1:
$$\sum_{j=1}^{d} \text{PVE}_j = 1$$
PVE$_j$ tells you what fraction of the data's total spread is captured by the $j$-th direction. If PVE$_1 = 0.80$, the first principal component alone accounts for 80% of the data's variation. The original features typically show no such concentration—variance is spread across many of them.
More practically, we care about how much total variance is explained by the first $k$ components:
$$\text{CVE}_k = \sum_{j=1}^{k} \text{PVE}_j = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{d} \lambda_j}$$
This tells us: if we keep $k$ components, we preserve CVE$_k$ fraction of the total variance.
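These definitions translate directly into a few lines of code. The sketch below (a minimal NumPy example on synthetic data, standing in for a real dataset) computes the eigenvalues of the covariance matrix, the PVE of each component, and the cumulative CVE:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data

Xc = X - X.mean(axis=0)                # center the data
S = np.cov(Xc, rowvar=False)           # sample covariance matrix
eigvals = np.linalg.eigvalsh(S)[::-1]  # eigenvalues, sorted descending

pve = eigvals / eigvals.sum()          # proportion of variance explained
cve = np.cumsum(pve)                   # cumulative variance explained

print(pve)       # each entry in [0, 1]; entries sum to 1
print(cve[-1])   # the full set of components explains all the variance
```

Note that `eigvalsh` returns eigenvalues in ascending order, hence the `[::-1]` reversal to match the convention $\lambda_1 \geq \lambda_2 \geq \cdots$.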
The cumulative variance sequence is non-decreasing: $\text{CVE}_1 \leq \text{CVE}_2 \leq \cdots \leq \text{CVE}_d = 1$. Because the eigenvalues are sorted in decreasing order, each additional component contributes a smaller (or equal) increment than the one before.
The distribution of eigenvalues—the eigenvalue spectrum—reveals important information about the data's structure. Different patterns have different implications for dimensionality reduction.
If $\lambda_1 \gg \lambda_2 \gg \cdots \gg \lambda_d$, variance is concentrated in a few directions: a small number of components captures most of the structure, and aggressive dimensionality reduction is possible.
Example: Stock returns often show this pattern—most variance is explained by market-wide factors.
If the eigenvalues are roughly equal, $\lambda_1 \approx \lambda_2 \approx \cdots \approx \lambda_d$, variance is spread evenly across directions: no small subset of components dominates, and PCA offers little compression.
Example: Well-designed experiments with orthogonal factors show this pattern.
If eigenvalues are large for the first few components and then drop sharply, the data is approximately low-rank: the components above the cliff capture signal, while those below it mostly capture noise.
Example: PCA on noisy observations of a truly low-dimensional phenomenon.
| Pattern | Eigenvalue Distribution | Interpretation | Typical Source |
|---|---|---|---|
| Power Law | $\lambda_j \propto j^{-\alpha}$ | Scale-free structure, few dominant modes | Natural images, language |
| Exponential Decay | $\lambda_j \propto e^{-\alpha j}$ | Rapidly diminishing importance | Smooth phenomena |
| Uniform (Flat) | $\lambda_j \approx \lambda_1$ | All directions equally important | Independent features, white noise |
| Step/Cliff | Large, then sudden drop | Low-rank structure + noise | Signal + noise model |
| Mixture | Multiple scale regimes | Hierarchical structure | Multi-scale data |
Always examine the eigenvalue spectrum before choosing $k$. It tells you whether PCA is appropriate (fast decay = good), how many components to consider (where decay slows), and whether the data has special structure (cliffs, regimes).
There is no single 'correct' number of components—the right choice depends on your application, goals, and constraints. Here are the main approaches:
Choose $k$ such that CVE$_k \geq \tau$ for some threshold $\tau$:
$$k^* = \min\{k : \text{CVE}_k \geq \tau\}$$
Common thresholds are $\tau = 0.90$, $0.95$, and $0.99$; higher values suit applications where fidelity matters more than compression.

Pros: Simple, interpretable, provides a guarantee on the variance preserved.
Cons: The threshold choice is arbitrary and may not align with downstream task needs.
Plot eigenvalues or PVE against component index. Look for an 'elbow'—a point where the rate of decrease changes sharply.
Procedure: plot $\lambda_j$ (or PVE$_j$) against the component index $j$, look for the point where the curve bends from steep decline to a flat plateau, and keep the components before the bend.

Pros: Data-driven, captures structure in the eigenvalue decay.
Cons: Subjective (no unique definition of 'elbow'), may be ambiguous.
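One common way to automate the elbow search (a heuristic sketch, not a standard library routine) is to pick the point on the scree curve farthest from the straight line joining the first and last eigenvalues:

```python
import numpy as np

def elbow_index(eigvals):
    """Heuristic elbow: the point with maximum perpendicular distance
    from the chord connecting the first and last eigenvalues."""
    y = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    x = np.arange(len(y), dtype=float)
    v = np.array([x[-1] - x[0], y[-1] - y[0]])  # chord direction
    v /= np.linalg.norm(v)
    pts = np.stack([x - x[0], y - y[0]], axis=1)
    dist = np.abs(pts[:, 0] * v[1] - pts[:, 1] * v[0])  # distance to chord
    return int(np.argmax(dist)) + 1  # 1-based component index

eigvals = [10.0, 4.0, 1.0, 0.8, 0.6, 0.5, 0.4]  # steep drop, then flat
print(elbow_index(eigvals))
```

This makes the elbow reproducible, but it inherits the method's subjectivity: a different geometric criterion can pick a different bend.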
Keep components with eigenvalue greater than the average:
$$k^* = |\{j : \lambda_j > \bar{\lambda}\}|$$
For standardized data (each feature has variance 1), $\bar{\lambda} = 1$, so this becomes: keep components with $\lambda_j > 1$.
Intuition: A component with $\lambda_j = 1$ explains the same variance as a single original (standardized) feature. If it explains less, why bother?
Pros: Objective, no parameter to choose.
Cons: Can over- or under-estimate, not connected to downstream task.
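The Kaiser rule is a one-liner in NumPy; the eigenvalues below are made up for illustration:

```python
import numpy as np

def kaiser_k(eigvals):
    """Number of components with eigenvalue above the average eigenvalue."""
    eigvals = np.asarray(eigvals, dtype=float)
    return int(np.sum(eigvals > eigvals.mean()))

eigvals = np.array([2.5, 1.4, 0.9, 0.7, 0.5])  # average eigenvalue is 1.2
print(kaiser_k(eigvals))  # components 1 and 2 exceed the average
```

For eigenvalues of a correlation matrix (standardized data) the average is exactly 1, recovering the classic "keep $\lambda_j > 1$" rule.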
Use the reconstruction error on held-out data to choose $k$: fit PCA on a training split, reconstruct a validation split with the top $k$ components for each candidate $k$, and pick the $k$ where validation error stops improving (or starts rising).
This is the most principled approach when reconstruction accuracy matters.
If PCA is a preprocessing step for classification, regression, or clustering, choose $k$ by the downstream metric: train the downstream model for each candidate $k$ and select the value with the best validation performance.
This directly optimizes what you care about.
Each method makes different assumptions and optimizes different objectives. The variance threshold doesn't know about your downstream task. The elbow method is subjective. Cross-validation adds computational cost. Choose based on your specific situation and validate empirically.
Let's develop a deeper mathematical understanding of how variance is distributed across components.
The eigenvalue distribution contains information about the data's 'effective' dimensionality. One formal measure is the participation ratio:
$$d_{\text{eff}} = \frac{\left(\sum_{j=1}^{d} \lambda_j\right)^2}{\sum_{j=1}^{d} \lambda_j^2}$$
This quantity equals $d$ when all eigenvalues are equal (a flat spectrum), equals 1 when a single eigenvalue dominates, and interpolates between these extremes otherwise.
Interpretation: $d_{\text{eff}}$ estimates how many 'effective' dimensions the data has, accounting for unequal importance of components.
Another measure treats the normalized eigenvalues as a probability distribution and computes entropy:
$$p_j = \frac{\lambda_j}{\sum_k \lambda_k}$$
$$H = -\sum_j p_j \log p_j$$
The effective dimensionality is then $d_{\text{entropy}} = e^H$.
This has information-theoretic motivation: it measures the 'surprise' in encountering variance in different directions.
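Both effective-dimension measures are easy to compute; the sketch below checks the two limiting cases (a flat spectrum and a strongly spiked one):

```python
import numpy as np

def participation_ratio(eigvals):
    """(sum of eigenvalues)^2 / sum of squared eigenvalues."""
    lam = np.asarray(eigvals, dtype=float)
    return lam.sum() ** 2 / np.sum(lam ** 2)

def entropy_dimension(eigvals):
    """exp of the entropy of the normalized eigenvalue distribution."""
    p = np.asarray(eigvals, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # avoid log(0)
    return np.exp(-np.sum(p * np.log(p)))

flat = np.ones(10)              # all directions equally important
spiked = np.array([9.0, 1.0])   # one dominant direction
print(participation_ratio(flat))    # equals d for a flat spectrum
print(entropy_dimension(flat))      # also equals d for a flat spectrum
print(participation_ratio(spiked))  # close to 1 when one eigenvalue dominates
```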
For the first $k$ components, we can establish bounds and approximations:
Lower bounds: With eigenvalues sorted so that $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$, each of the top $k$ eigenvalues is at least the average eigenvalue, so CVE$_k \geq k/d$, with equality when all eigenvalues are equal. Similarly, each of the top $k$ is at least $\lambda_k$, giving:

$$\text{CVE}_k \geq \frac{k \lambda_k}{\sum_j \lambda_j}$$

Upper bound: Each of the top $k$ eigenvalues is at most $\lambda_1$, so $\text{CVE}_k \leq \frac{k \lambda_1}{\sum_j \lambda_j}$.
Decay rate connection: If eigenvalues follow a power law $\lambda_j \sim j^{-\alpha}$, then for large $d$:

$$\text{CVE}_k \approx \begin{cases} (k/d)^{1-\alpha} & \alpha < 1 \\ \log(k)/\log(d) & \alpha = 1 \\ 1 - c\, k^{1-\alpha} & \alpha > 1 \end{cases}$$

where $c$ is a constant depending on $\alpha$: for $\alpha > 1$ the total variance converges, so the unexplained tail beyond $k$ shrinks like $k^{1-\alpha}$.
These asymptotics help predict how many components are needed for a given variance target.
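A quick numerical sanity check of the sub-critical regime: for $\alpha < 1$, partial sums of $j^{-\alpha}$ grow like $k^{1-\alpha}/(1-\alpha)$, so the cumulative variance of the top $k$ components behaves like $(k/d)^{1-\alpha}$ for large $d$ (the values of $d$, $\alpha$, and $k$ below are arbitrary choices for illustration):

```python
import numpy as np

d, alpha, k = 100_000, 0.5, 1_000
lam = np.arange(1, d + 1, dtype=float) ** -alpha  # power-law spectrum
cve_exact = lam[:k].sum() / lam.sum()             # exact cumulative variance
cve_approx = (k / d) ** (1 - alpha)               # asymptotic prediction
print(cve_exact, cve_approx)  # the two agree to within a few percent
```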
Understanding the theoretical distribution of eigenvalues helps set expectations. For truly high-dimensional data (flat spectrum), don't expect 95% variance from a few components. For structured data (steep spectrum), even 1-2 components might suffice.
Understanding variance explained in context is crucial for making good decisions. Here are guidelines for interpreting these metrics in practice.
A given PVE value means different things in different contexts:
The ratio of kept components to original features provides context for interpreting variance explained.
| Domain | Typical Features | Good #PCs | Typical CVE |
|---|---|---|---|
| Face Images | 10,000+ pixels | 100-500 | 95%+ with ~100 PCs |
| Genomics | 20,000+ genes | 10-100 | 50-80% (population structure) |
| Finance | ~100-1000 assets | 3-20 | 50-80% (market factors) |
| NLP (word embeddings) | 300-1000 dims | 50-200 | 80-95% |
| Sensor Data | ~10-100 sensors | 2-10 | 90%+ (correlated sensors) |
| Survey/Psychometrics | ~20-100 items | 3-10 | 60-80% (latent factors) |
Critical caveat: variance is not the same as task-relevant information.
Example: In a classification task, the class-discriminative information might be in the low-variance components. High-variance components might capture within-class variation that's irrelevant for classification.
PCA optimizes for variance, not for any specific downstream task. A component with 1% variance might be more important for your task than one with 30%.
If your goal is prediction or classification, variance explained is a crude proxy. Consider supervised dimensionality reduction (LDA, partial least squares) or validate your choice of $k$ against downstream performance. Never assume that 'more variance = more useful information.'
While variance explained is the most common criterion, alternative approaches may be more appropriate for specific applications.
Instead of training-set variance, measure reconstruction error on validation data:
$$\mathcal{E}_{\text{val}}(k) = \frac{1}{n_{\text{val}}} \sum_{i \in \text{val}} \|\mathbf{x}^{(i)} - \hat{\mathbf{x}}^{(i)}\|^2$$
This accounts for overfitting: more components always reduce training error but may not generalize.
Key insight: Very small eigenvalue components may capture noise rather than signal. On held-out data, reconstructing noise hurts.
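A sketch of this procedure in plain NumPy, using synthetic data with a true 2-dimensional signal (the helper name and data-generating setup are ours, not a standard API):

```python
import numpy as np

def holdout_reconstruction_error(X_train, X_val, k):
    """Mean squared error reconstructing X_val with the top-k PCs of X_train."""
    mu = X_train.mean(axis=0)
    S = np.cov(X_train - mu, rowvar=False)
    _, eigvecs = np.linalg.eigh(S)
    V = eigvecs[:, ::-1][:, :k]       # top-k eigenvectors (d x k)
    Z = (X_val - mu) @ V              # project held-out data
    X_hat = Z @ V.T + mu              # reconstruct from k components
    return np.mean(np.sum((X_val - X_hat) ** 2, axis=1))

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 10))          # 2-dimensional latent signal
X = rng.normal(size=(300, 2)) @ W + 0.1 * rng.normal(size=(300, 10))
errs = [holdout_reconstruction_error(X[:200], X[200:], k) for k in range(1, 6)]
print(errs)  # drops sharply up to k = 2, then flattens near the noise floor
```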
Model selection criteria like AIC and BIC can be applied:
$$\text{AIC}(k) = n \log(\mathcal{E}(k)) + 2 \cdot \text{params}(k)$$

$$\text{BIC}(k) = n \log(\mathcal{E}(k)) + \log(n) \cdot \text{params}(k)$$
where params$(k)$ is the effective number of parameters with $k$ components.
BIC typically selects fewer components (stronger complexity penalty), suitable when parsimony is valued.
Compare the observed eigenvalues to those obtained from random data with the same dimensions (Horn's parallel analysis): retain only the components whose eigenvalues exceed what chance alone would produce.
This accounts for 'spurious' eigenvalues that arise from finite sample effects.
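A compact sketch of this idea, essentially Horn's parallel analysis (the function name, simulation count, and quantile are illustrative choices):

```python
import numpy as np

def parallel_analysis_k(X, n_sims=50, quantile=0.95, seed=0):
    """Keep the leading components whose correlation-matrix eigenvalues
    exceed the chosen quantile of eigenvalues from random Gaussian data
    of the same shape."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    obs = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
    sims = np.empty((n_sims, d))
    for s in range(n_sims):
        R = rng.normal(size=(n, d))
        sims[s] = np.linalg.eigvalsh(np.corrcoef(R, rowvar=False))[::-1]
    thresh = np.quantile(sims, quantile, axis=0)
    keep = obs > thresh
    # count the leading run of components above the random threshold
    return int(np.argmin(keep)) if not keep.all() else d

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(500, 8))
print(parallel_analysis_k(X))  # typically recovers the 2 signal dimensions
```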
Effective visualization is essential for understanding variance distribution and communicating results. Here are the key visualization types.
The most common visualization: eigenvalues (or PVE) plotted against component index.
Construction: sort the eigenvalues in decreasing order, plot $\lambda_j$ (or PVE$_j$) on the y-axis against the component index $j$ on the x-axis, and connect the points with a line. A log-scale y-axis helps when eigenvalues span several orders of magnitude.
Interpretation: The 'elbow' indicates where additional components contribute diminishing returns. Components before the elbow capture structure; components after may capture noise.
Shows CVE$_k$ against $k$, revealing how variance accumulates.
Construction: plot CVE$_k$ on the y-axis against $k$ on the x-axis; horizontal reference lines at common thresholds (e.g., 90%, 95%) make it easy to read off the required number of components.
Reading the plot: Steep initial rise indicates concentrated variance; gradual rise indicates distributed variance. Intersections with threshold lines indicate how many components needed.
Best practice: show both individual and cumulative variance.
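A matplotlib sketch of this best practice, with an individual-variance panel next to a cumulative panel (the eigenvalues, labels, and 90% threshold line are illustrative):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

eigvals = np.array([4.2, 2.1, 1.1, 0.6, 0.4, 0.3, 0.2, 0.1])
pve = eigvals / eigvals.sum()
cve = np.cumsum(pve)
idx = np.arange(1, len(eigvals) + 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
ax1.bar(idx, pve)                  # individual variance (scree-style)
ax1.set(xlabel="Component", ylabel="PVE", title="Variance per component")
ax2.plot(idx, cve, marker="o")     # cumulative variance
ax2.axhline(0.90, linestyle="--")  # example 90% threshold line
ax2.set(xlabel="Components kept", ylabel="CVE", title="Cumulative variance")
fig.savefig("variance_explained.png", dpi=150)
```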
| Plot | Primary Information | Use For |
|---|---|---|
| Bar chart of PVE$_j$ | Contribution of each component | Identifying important components |
| Scree plot (log scale) | Eigenvalue decay pattern | Detecting structure type |
| Cumulative line | Total variance with $k$ components | Choosing number of components |
Always show a scree plot when reporting PCA results. Use log scale on y-axis if eigenvalues span many orders of magnitude. Include both individual and cumulative views. Annotate the chosen number of components and corresponding variance explained. Report exact numbers in a table for reproducibility.
We've explored how to measure, interpret, and use variance explained for choosing the number of principal components. The key insights: PVE$_j = \lambda_j / \sum_k \lambda_k$ quantifies each component's share of total variance; the shape of the eigenvalue spectrum determines whether PCA compresses well; no single rule for choosing $k$ is universally correct, so variance thresholds, elbow detection, the Kaiser criterion, and cross-validation each have their place; and variance explained is not the same as task-relevant information, so validate against downstream performance when prediction is the goal.
We've learned to quantify variance explained. The next page covers scree plots in greater depth—understanding their construction, interpretation subtleties, and formal methods for detecting elbow points. We'll also explore how scree plots can mislead and when to trust or distrust their suggestions.