You've computed principal components for a dataset with 1,000 features. The first component captures the most variance, the second captures the second most, and so on. But a crucial question remains: how many components should you actually keep?

Keeping all 1,000 defeats the purpose of dimensionality reduction. Keeping just one might discard too much information. The answer lies in understanding the proportion of variance explained by each component—a metric that quantifies how much 'information' each principal component captures.

This page develops the mathematical framework for variance explained, shows how to interpret and visualize it, and provides principled approaches for choosing the number of components to retain.
By the end of this page, you will understand variance explained and cumulative variance ratios, know how to interpret these metrics for your specific application, learn multiple methods for choosing the number of components (threshold-based, elbow method, cross-validation), and understand the tradeoffs involved in this decision.
Let's establish the precise definitions and understand what variance explained really measures.

### Total Variance

For centered data, the total variance is the sum of all eigenvalues:

$$\text{Total Variance} = \sum_{j=1}^{d} \lambda_j = \text{tr}(\mathbf{S}) = \sum_{j=1}^{d} \text{Var}(X_j)$$

This equals the trace of the covariance matrix, which also equals the sum of individual feature variances. It's a fixed property of the data—the 'total information' we start with.

### Variance Explained by Component $j$

Each principal component captures variance equal to its eigenvalue:

$$\text{Var}(\text{PC}_j) = \lambda_j$$

The proportion of variance explained by component $j$ is:

$$\text{PVE}_j = \frac{\lambda_j}{\sum_{k=1}^{d} \lambda_k}$$

This is always between 0 and 1, and all proportions sum to 1:

$$\sum_{j=1}^{d} \text{PVE}_j = 1$$
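As a concrete check on these definitions, here is a minimal NumPy sketch (using hypothetical random data) that computes the eigenvalues of the sample covariance matrix, verifies that the total variance equals the trace of $\mathbf{S}$, and confirms that the PVE values sum to 1:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))          # hypothetical data: 500 samples, 6 features
X = X - X.mean(axis=0)                 # center the data

S = np.cov(X, rowvar=False)            # sample covariance matrix
eigvals = np.linalg.eigvalsh(S)[::-1]  # eigenvalues, sorted largest first

total_var = eigvals.sum()              # equals trace(S) = sum of feature variances
pve = eigvals / total_var              # proportion of variance explained per component

print(np.allclose(total_var, np.trace(S)))  # True
print(pve, pve.sum())                       # PVE_j values; they sum to 1
```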
PVE$_j$ tells you what fraction of the data's total spread is captured by the $j$-th direction. If PVE$_1 = 0.80$, the first principal component alone accounts for 80% of how much the data varies. The original features typically don't show such concentration—variance is spread across many features.
The distribution of eigenvalues—the eigenvalue spectrum—reveals important information about the data's structure. Different patterns have different implications for dimensionality reduction.

### Pattern 1: Dominant Eigenvalue(s)

If $\lambda_1 \gg \lambda_2 \gg \cdots \gg \lambda_d$:

- The data is approximately low-dimensional
- A small number of components captures most variance
- Strong correlations exist between original features
- Dimensionality reduction is highly effective

Example: Stock returns often show this pattern—most variance is explained by market-wide factors.
| Pattern | Eigenvalue Distribution | Interpretation | Typical Source |
|---|---|---|---|
| Power Law | $\lambda_j \propto j^{-\alpha}$ | Scale-free structure, few dominant modes | Natural images, language |
| Exponential Decay | $\lambda_j \propto e^{-\alpha j}$ | Rapidly diminishing importance | Smooth phenomena |
| Uniform (Flat) | $\lambda_j \approx \lambda_1$ | All directions equally important | Independent features, white noise |
| Step/Cliff | Large, then sudden drop | Low-rank structure + noise | Signal + noise model |
| Mixture | Multiple scale regimes | Hierarchical structure | Multi-scale data |
Always examine the eigenvalue spectrum before choosing $k$. It tells you whether PCA is appropriate (fast decay = good), how many components to consider (where decay slows), and whether the data has special structure (cliffs, regimes).
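To make the step/cliff pattern from the table concrete, here is a small synthetic sketch (the data, rank, and noise level are illustrative assumptions, not from the source): a rank-3 signal plus isotropic noise produces a spectrum with three large eigenvalues followed by a sharp drop.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "signal + noise" data: 3 latent factors drive 50 observed features.
n, d, r = 1000, 50, 3
factors = rng.normal(size=(n, r))                 # low-rank signal
loadings = rng.normal(size=(r, d)) * 3.0          # strong factor loadings
X = factors @ loadings + rng.normal(size=(n, d))  # plus isotropic noise

eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
print(np.round(eigvals[:6], 2))  # first ~3 eigenvalues are large, then a sharp drop
```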
There is no single 'correct' number of components—the right choice depends on your application, goals, and constraints. Here are the main approaches:

### Method 1: Variance Threshold

Define the cumulative variance explained by the first $k$ components as

$$\text{CVE}_k = \sum_{j=1}^{k} \text{PVE}_j$$

Choose $k$ such that CVE$_k \geq \tau$ for some threshold $\tau$:

$$k^* = \min\{k : \text{CVE}_k \geq \tau\}$$

Common thresholds:
- $\tau = 0.80$: Keep components explaining 80% of variance
- $\tau = 0.90$: Keep components explaining 90% of variance
- $\tau = 0.95$: Keep components explaining 95% of variance

Pros: Simple, interpretable, provides a guarantee on information preserved
Cons: Arbitrary threshold choice, may not align with downstream task needs
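A minimal sketch of the threshold rule, assuming the eigenvalues are already sorted in descending order (the helper name and example spectrum are hypothetical):

```python
import numpy as np

def components_for_threshold(eigvals, tau=0.90):
    """Smallest k whose cumulative variance explained reaches tau.
    Assumes `eigvals` is sorted in descending order."""
    pve = eigvals / eigvals.sum()
    cve = np.cumsum(pve)                       # CVE_k for k = 1, ..., d
    return int(np.searchsorted(cve, tau) + 1), cve

# Example with a hypothetical, rapidly decaying spectrum
eigvals = np.array([5.0, 2.0, 1.0, 0.5, 0.3, 0.2])
k, cve = components_for_threshold(eigvals, tau=0.90)
print(k, round(cve[k - 1], 3))   # k = 4, CVE_4 ≈ 0.944
```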
Each method makes different assumptions and optimizes different objectives. The variance threshold doesn't know about your downstream task. The elbow method is subjective. Cross-validation adds computational cost. Choose based on your specific situation and validate empirically.
Let's develop a deeper mathematical understanding of how variance is distributed across components.

### Effective Dimensionality

The eigenvalue distribution contains information about the data's 'effective' dimensionality. One formal measure is the participation ratio:

$$d_{\text{eff}} = \frac{\left(\sum_{j=1}^{d} \lambda_j\right)^2}{\sum_{j=1}^{d} \lambda_j^2}$$

This quantity:
- Equals $d$ when all eigenvalues are equal (truly $d$-dimensional)
- Equals 1 when only one eigenvalue is non-zero (1-dimensional)
- Falls between 1 and $d$ in general, reflecting the extent of variance concentration

Interpretation: $d_{\text{eff}}$ estimates how many 'effective' dimensions the data has, accounting for unequal importance of components.
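A quick sketch of the participation ratio (the function name is mine), checking the two limiting cases and an intermediate spectrum:

```python
import numpy as np

def effective_dimensionality(eigvals):
    """Participation ratio: (sum of eigenvalues)^2 / sum of squared eigenvalues."""
    eigvals = np.asarray(eigvals, dtype=float)
    return eigvals.sum() ** 2 / np.sum(eigvals ** 2)

print(effective_dimensionality([1, 1, 1, 1]))    # 4.0  (flat spectrum -> d_eff = d)
print(effective_dimensionality([1, 0, 0, 0]))    # 1.0  (one dominant direction)
print(effective_dimensionality([4, 2, 1, 0.5]))  # ≈ 2.65 (in between)
```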
Understanding the theoretical distribution of eigenvalues helps set expectations. For truly high-dimensional data (flat spectrum), don't expect 95% variance from a few components. For structured data (steep spectrum), even 1-2 components might suffice.
Understanding variance explained in context is crucial for making good decisions. Here are guidelines for interpreting these metrics in practice.

### Context Matters

A given PVE value means different things in different contexts:

- Image data: 80% variance in 10 components (from 10,000 pixels) is impressive
- Survey data: 80% variance in 10 components (from 20 questions) is less impressive
- Genomics: 50% variance in 5 components (from 20,000 genes) is remarkable

The ratio of kept components to original features provides context for interpreting variance explained.
| Domain | Typical Features | Typical # PCs Retained | Typical CVE |
|---|---|---|---|
| Face Images | 10,000+ pixels | 100-500 | 95%+ with ~100 PCs |
| Genomics | 20,000+ genes | 10-100 | 50-80% (population structure) |
| Finance | ~100-1000 assets | 3-20 | 50-80% (market factors) |
| NLP (word embeddings) | 300-1000 dims | 50-200 | 80-95% |
| Sensor Data | ~10-100 sensors | 2-10 | 90%+ (correlated sensors) |
| Survey/Psychometrics | ~20-100 items | 3-10 | 60-80% (latent factors) |
If your goal is prediction or classification, variance explained is a crude proxy. Consider supervised dimensionality reduction (LDA, partial least squares) or validate your choice of $k$ against downstream performance. Never assume that 'more variance = more useful information.'
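For instance, with scikit-learn you can treat the number of components as a hyperparameter and let cross-validation pick it against the downstream task. A minimal sketch, with the dataset and parameter grid chosen purely for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=2000)),
])

# Treat the number of components as a hyperparameter and pick the value
# that maximizes cross-validated accuracy, not variance explained.
search = GridSearchCV(pipe, {"pca__n_components": [5, 10, 20, 30, 40]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```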
While variance explained is the most common criterion, alternative approaches may be more appropriate for specific applications.

### Reconstruction Error on Held-Out Data

Instead of training-set variance, measure reconstruction error on validation data:

$$\mathcal{E}_{\text{val}}(k) = \frac{1}{n_{\text{val}}} \sum_{i \in \text{val}} \|\mathbf{x}^{(i)} - \hat{\mathbf{x}}^{(i)}\|^2$$

This accounts for overfitting: more components always reduce training error but may not generalize.

Key insight: Very small eigenvalue components may capture noise rather than signal. On held-out data, reconstructing noise hurts.
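To illustrate that key insight in a setting where the clean signal is known, here is a synthetic sketch (the data and noise level are my assumptions): the reconstruction error against the observed data keeps shrinking as $k$ grows, while the error against the underlying signal is lowest near the true rank and rises again as extra components start reconstructing noise.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Synthetic setting where the "clean" signal is known: rank-3 signal + noise.
signal = rng.normal(size=(600, 3)) @ rng.normal(size=(3, 20))
X = signal + 0.5 * rng.normal(size=(600, 20))

for k in [1, 2, 3, 5, 10, 20]:
    pca = PCA(n_components=k).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    err_vs_data = np.mean(np.sum((X - X_hat) ** 2, axis=1))        # always shrinks with k
    err_vs_signal = np.mean(np.sum((signal - X_hat) ** 2, axis=1))  # rises again past k = 3
    print(k, round(err_vs_data, 2), round(err_vs_signal, 2))
```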
Effective visualization is essential for understanding variance distribution and communicating results. Here are the key visualization types.

### The Scree Plot

The most common visualization: eigenvalues (or PVE) plotted against component index.

Construction:
- X-axis: Component index ($j = 1, 2, \ldots, d$)
- Y-axis: Eigenvalue $\lambda_j$ (or PVE$_j$)
- Optional: Add horizontal line at mean eigenvalue (Kaiser criterion)
- Optional: Mark the elbow point

Interpretation: The 'elbow' indicates where additional components contribute diminishing returns. Components before the elbow capture structure; components after may capture noise.
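A minimal matplotlib sketch of a scree plot with both individual and cumulative views, using a hypothetical spectrum (the eigenvalues and the 90% threshold line are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt

eigvals = np.array([5.0, 2.0, 1.0, 0.5, 0.3, 0.2])   # hypothetical sorted eigenvalues
pve = eigvals / eigvals.sum()
cve = np.cumsum(pve)
idx = np.arange(1, len(eigvals) + 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))

ax1.plot(idx, pve, "o-")                              # individual view (scree plot)
ax1.axhline(eigvals.mean() / eigvals.sum(), ls="--", label="mean eigenvalue")
ax1.set(xlabel="Component index", ylabel="PVE", title="Scree plot")
ax1.legend()

ax2.plot(idx, cve, "o-")                              # cumulative view
ax2.axhline(0.90, ls="--", label="90% threshold")
ax2.set(xlabel="Component index", ylabel="Cumulative PVE", title="Cumulative variance")
ax2.legend()

plt.tight_layout()
plt.show()
```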
Always show a scree plot when reporting PCA results. Use log scale on y-axis if eigenvalues span many orders of magnitude. Include both individual and cumulative views. Annotate the chosen number of components and corresponding variance explained. Report exact numbers in a table for reproducibility.
We've explored how to measure, interpret, and use variance explained for choosing the number of principal components. Here are the key insights:

- PVE$_j = \lambda_j / \sum_k \lambda_k$ quantifies the fraction of total variance captured by component $j$, and CVE$_k$ accumulates it over the first $k$ components.
- The shape of the eigenvalue spectrum (steep decay, cliff, flat) tells you whether PCA will be effective and roughly how many components to consider.
- Threshold rules, the elbow method, and cross-validation each optimize different objectives; none is universally correct.
- Variance explained is only a proxy for usefulness; validate the choice of $k$ against your downstream task whenever possible.
We've learned to quantify variance explained. The next page covers scree plots in greater depth—understanding their construction, interpretation subtleties, and formal methods for detecting elbow points. We'll also explore how scree plots can mislead and when to trust or distrust their suggestions.