You've computed principal components for a dataset with 1,000 features. The first component captures the most variance, the second captures the second most, and so on. But a crucial question remains: how many components should you actually keep?
Keeping all 1,000 defeats the purpose of dimensionality reduction. Keeping just one might discard too much information. The answer lies in understanding the proportion of variance explained by each component—a metric that quantifies how much 'information' each principal component captures.
This page develops the mathematical framework for variance explained, shows how to interpret and visualize it, and provides principled approaches for choosing the number of components to retain.
By the end of this page, you will understand variance explained and cumulative variance ratios, know how to interpret these metrics for your specific application, be able to apply multiple methods for choosing the number of components (threshold-based, elbow method, cross-validation), and appreciate the tradeoffs involved in this decision.
Let's establish the precise definitions and understand what variance explained really measures.
For centered data, the total variance is the sum of all eigenvalues:
$$\text{Total Variance} = \sum_{j=1}^{d} \lambda_j = \text{tr}(\mathbf{S}) = \sum_{j=1}^{d} \text{Var}(X_j)$$
This equals the trace of the covariance matrix, which also equals the sum of individual feature variances. It's a fixed property of the data—the 'total information' we start with.
Each principal component captures variance equal to its eigenvalue:
$$\text{Variance of PC}_j = \lambda_j$$
The proportion of variance explained by component $j$ is:
$$\text{PVE}_j = \frac{\lambda_j}{\sum_{k=1}^{d} \lambda_k}$$
This is always between 0 and 1, and all proportions sum to 1:
$$\sum_{j=1}^{d} \text{PVE}_j = 1$$
PVE$_j$ tells you what fraction of the data's total spread is captured by the $j$-th direction. If PVE$_1 = 0.80$, the first principal component alone accounts for 80% of the data's variation. The original features typically show no such concentration—variance is spread across many of them.
More practically, we care about how much total variance is explained by the first $k$ components:
$$\text{CVE}_k = \sum_{j=1}^{k} \text{PVE}_j = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{d} \lambda_j}$$
This tells us: if we keep $k$ components, we preserve CVE$_k$ fraction of the total variance.
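These definitions translate directly into a few lines of code. The sketch below (a minimal NumPy example on synthetic data, standing in for a real dataset) computes the eigenvalues of the covariance matrix, the PVE of each component, and the cumulative CVE:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data

Xc = X - X.mean(axis=0)                # center the data
S = np.cov(Xc, rowvar=False)           # sample covariance matrix
eigvals = np.linalg.eigvalsh(S)[::-1]  # eigenvalues, sorted descending

pve = eigvals / eigvals.sum()          # proportion of variance explained
cve = np.cumsum(pve)                   # cumulative variance explained

print(pve)       # each entry in [0, 1]; entries sum to 1
print(cve[-1])   # the full set of components explains all the variance
```

Note that `eigvalsh` returns eigenvalues in ascending order, hence the `[::-1]` reversal to match the convention $\lambda_1 \geq \lambda_2 \geq \cdots$.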
The cumulative variance sequence is non-decreasing: $\text{CVE}_1 \leq \text{CVE}_2 \leq \cdots \leq \text{CVE}_d = 1$. Because the eigenvalues are sorted in decreasing order, each additional component contributes a smaller (or equal) increment than the one before.
The distribution of eigenvalues—the eigenvalue spectrum—reveals important information about the data's structure. Different patterns have different implications for dimensionality reduction.
If $\lambda_1 \gg \lambda_2 \gg \cdots \gg \lambda_d$, variance is concentrated in a few directions: a small number of components captures most of the structure, and aggressive dimensionality reduction is possible.
Example: Stock returns often show this pattern—most variance is explained by market-wide factors.
If the eigenvalues are roughly equal, $\lambda_1 \approx \lambda_2 \approx \cdots \approx \lambda_d$, variance is spread evenly across directions: no small subset of components dominates, and PCA offers little compression.
Example: Well-designed experiments with orthogonal factors show this pattern.
If eigenvalues are large for the first few components and then drop sharply, the data is approximately low-rank: the components above the cliff capture signal, while those below it mostly capture noise.
Example: PCA on noisy observations of a truly low-dimensional phenomenon.
| Pattern | Eigenvalue Distribution | Interpretation | Typical Source |
|---|---|---|---|
| Power Law | $\lambda_j \propto j^{-\alpha}$ | Scale-free structure, few dominant modes | Natural images, language |
| Exponential Decay | $\lambda_j \propto e^{-\alpha j}$ | Rapidly diminishing importance | Smooth phenomena |
| Uniform (Flat) | $\lambda_j \approx \lambda_1$ | All directions equally important | Independent features, white noise |
| Step/Cliff | Large, then sudden drop | Low-rank structure + noise | Signal + noise model |
| Mixture | Multiple scale regimes | Hierarchical structure | Multi-scale data |
Always examine the eigenvalue spectrum before choosing $k$. It tells you whether PCA is appropriate (fast decay = good), how many components to consider (where decay slows), and whether the data has special structure (cliffs, regimes).
There is no single 'correct' number of components—the right choice depends on your application, goals, and constraints. Here are the main approaches:
Choose $k$ such that CVE$_k \geq \tau$ for some threshold $\tau$:
$$k^* = \min\{k : \text{CVE}_k \geq \tau\}$$
Common thresholds are $\tau = 0.90$, $0.95$, and $0.99$; higher values suit applications where fidelity matters more than compression.

Pros: Simple, interpretable, provides a guarantee on the variance preserved.
Cons: The threshold choice is arbitrary and may not align with downstream task needs.
Plot eigenvalues or PVE against component index. Look for an 'elbow'—a point where the rate of decrease changes sharply.
Procedure: plot $\lambda_j$ (or PVE$_j$) against the component index $j$, look for the point where the curve bends from steep decline to a flat plateau, and keep the components before the bend.

Pros: Data-driven, captures structure in the eigenvalue decay.
Cons: Subjective (no unique definition of 'elbow'), may be ambiguous.
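One common way to automate the elbow search (a heuristic sketch, not a standard library routine) is to pick the point on the scree curve farthest from the straight line joining the first and last eigenvalues:

```python
import numpy as np

def elbow_index(eigvals):
    """Heuristic elbow: the point with maximum perpendicular distance
    from the chord connecting the first and last eigenvalues."""
    y = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    x = np.arange(len(y), dtype=float)
    v = np.array([x[-1] - x[0], y[-1] - y[0]])  # chord direction
    v /= np.linalg.norm(v)
    pts = np.stack([x - x[0], y - y[0]], axis=1)
    dist = np.abs(pts[:, 0] * v[1] - pts[:, 1] * v[0])  # distance to chord
    return int(np.argmax(dist)) + 1  # 1-based component index

eigvals = [10.0, 4.0, 1.0, 0.8, 0.6, 0.5, 0.4]  # steep drop, then flat
print(elbow_index(eigvals))
```

This makes the elbow reproducible, but it inherits the method's subjectivity: a different geometric criterion can pick a different bend.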
Keep components with eigenvalue greater than the average:
$$k^* = |\{j : \lambda_j > \bar{\lambda}\}|$$
For standardized data (each feature has variance 1), $\bar{\lambda} = 1$, so this becomes: keep components with $\lambda_j > 1$.
Intuition: A component with $\lambda_j = 1$ explains the same variance as a single original (standardized) feature. If it explains less, why bother?
Pros: Objective, no parameter to choose.
Cons: Can over- or under-estimate, not connected to downstream task.
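The Kaiser rule is a one-liner in NumPy; the eigenvalues below are made up for illustration:

```python
import numpy as np

def kaiser_k(eigvals):
    """Number of components with eigenvalue above the average eigenvalue."""
    eigvals = np.asarray(eigvals, dtype=float)
    return int(np.sum(eigvals > eigvals.mean()))

eigvals = np.array([2.5, 1.4, 0.9, 0.7, 0.5])  # average eigenvalue is 1.2
print(kaiser_k(eigvals))  # components 1 and 2 exceed the average
```

For eigenvalues of a correlation matrix (standardized data) the average is exactly 1, recovering the classic "keep $\lambda_j > 1$" rule.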
Use the reconstruction error on held-out data to choose $k$: fit PCA on a training split, reconstruct a validation split with the top $k$ components for each candidate $k$, and pick the $k$ where validation error stops improving (or starts rising).
This is the most principled approach when reconstruction accuracy matters.
If PCA is a preprocessing step for classification, regression, or clustering, choose $k$ by the downstream metric: train the downstream model for each candidate $k$ and select the value with the best validation performance.
This directly optimizes what you care about.
Each method makes different assumptions and optimizes different objectives. The variance threshold doesn't know about your downstream task. The elbow method is subjective. Cross-validation adds computational cost. Choose based on your specific situation and validate empirically.
Let's develop a deeper mathematical understanding of how variance is distributed across components.
The eigenvalue distribution contains information about the data's 'effective' dimensionality. One formal measure is the participation ratio:
$$d_{\text{eff}} = \frac{\left(\sum_{j=1}^{d} \lambda_j\right)^2}{\sum_{j=1}^{d} \lambda_j^2}$$
This quantity equals $d$ when all eigenvalues are equal (a flat spectrum), equals 1 when a single eigenvalue dominates, and interpolates between these extremes otherwise.
Interpretation: $d_{\text{eff}}$ estimates how many 'effective' dimensions the data has, accounting for unequal importance of components.
Another measure treats the normalized eigenvalues as a probability distribution and computes entropy:
$$p_j = \frac{\lambda_j}{\sum_k \lambda_k}$$
$$H = -\sum_j p_j \log p_j$$
The effective dimensionality is then $d_{\text{entropy}} = e^H$.
This has information-theoretic motivation: it measures the 'surprise' in encountering variance in different directions.
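Both effective-dimension measures are easy to compute; the sketch below checks the two limiting cases (a flat spectrum and a strongly spiked one):

```python
import numpy as np

def participation_ratio(eigvals):
    """(sum of eigenvalues)^2 / sum of squared eigenvalues."""
    lam = np.asarray(eigvals, dtype=float)
    return lam.sum() ** 2 / np.sum(lam ** 2)

def entropy_dimension(eigvals):
    """exp of the entropy of the normalized eigenvalue distribution."""
    p = np.asarray(eigvals, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # avoid log(0)
    return np.exp(-np.sum(p * np.log(p)))

flat = np.ones(10)              # all directions equally important
spiked = np.array([9.0, 1.0])   # one dominant direction
print(participation_ratio(flat))    # equals d for a flat spectrum
print(entropy_dimension(flat))      # also equals d for a flat spectrum
print(participation_ratio(spiked))  # close to 1 when one eigenvalue dominates
```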
For the first $k$ components, we can establish bounds and approximations:
Lower bounds: With eigenvalues sorted so that $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$, each of the top $k$ eigenvalues is at least the average eigenvalue, so CVE$_k \geq k/d$, with equality when all eigenvalues are equal. Similarly, each of the top $k$ is at least $\lambda_k$, giving:

$$\text{CVE}_k \geq \frac{k \lambda_k}{\sum_j \lambda_j}$$

Upper bound: Each of the top $k$ eigenvalues is at most $\lambda_1$, so $\text{CVE}_k \leq \frac{k \lambda_1}{\sum_j \lambda_j}$.
Decay rate connection: If eigenvalues follow a power law $\lambda_j \sim j^{-\alpha}$, then for large $d$:

$$\text{CVE}_k \approx \begin{cases} (k/d)^{1-\alpha} & \alpha < 1 \\ \log(k)/\log(d) & \alpha = 1 \\ 1 - c\, k^{1-\alpha} & \alpha > 1 \end{cases}$$

where $c$ is a constant depending on $\alpha$: for $\alpha > 1$ the total variance converges, so the unexplained tail beyond $k$ shrinks like $k^{1-\alpha}$.
These asymptotics help predict how many components are needed for a given variance target.
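A quick numerical sanity check of the sub-critical regime: for $\alpha < 1$, partial sums of $j^{-\alpha}$ grow like $k^{1-\alpha}/(1-\alpha)$, so the cumulative variance of the top $k$ components behaves like $(k/d)^{1-\alpha}$ for large $d$ (the values of $d$, $\alpha$, and $k$ below are arbitrary choices for illustration):

```python
import numpy as np

d, alpha, k = 100_000, 0.5, 1_000
lam = np.arange(1, d + 1, dtype=float) ** -alpha  # power-law spectrum
cve_exact = lam[:k].sum() / lam.sum()             # exact cumulative variance
cve_approx = (k / d) ** (1 - alpha)               # asymptotic prediction
print(cve_exact, cve_approx)  # the two agree to within a few percent
```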
Understanding the theoretical distribution of eigenvalues helps set expectations. For truly high-dimensional data (flat spectrum), don't expect 95% variance from a few components. For structured data (steep spectrum), even 1-2 components might suffice.
Understanding variance explained in context is crucial for making good decisions. Here are guidelines for interpreting these metrics in practice.
A given PVE value means different things in different contexts:
The ratio of kept components to original features provides context for interpreting variance explained.
| Domain | Typical Features | Good #PCs | Typical CVE |
|---|---|---|---|
| Face Images | 10,000+ pixels | 100-500 | 95%+ with ~100 PCs |
| Genomics | 20,000+ genes | 10-100 | 50-80% (population structure) |
| Finance | ~100-1000 assets | 3-20 | 50-80% (market factors) |
| NLP (word embeddings) | 300-1000 dims | 50-200 | 80-95% |
| Sensor Data | ~10-100 sensors | 2-10 | 90%+ (correlated sensors) |
| Survey/Psychometrics | ~20-100 items | 3-10 | 60-80% (latent factors) |
Critical caveat: variance is not the same as task-relevant information.
Example: In a classification task, the class-discriminative information might be in the low-variance components. High-variance components might capture within-class variation that's irrelevant for classification.
PCA optimizes for variance, not for any specific downstream task. A component with 1% variance might be more important for your task than one with 30%.
If your goal is prediction or classification, variance explained is a crude proxy. Consider supervised dimensionality reduction (LDA, partial least squares) or validate your choice of $k$ against downstream performance. Never assume that 'more variance = more useful information.'
While variance explained is the most common criterion, alternative approaches may be more appropriate for specific applications.
Instead of training-set variance, measure reconstruction error on validation data:
$$\mathcal{E}_{\text{val}}(k) = \frac{1}{n_{\text{val}}} \sum_{i \in \text{val}} \|\mathbf{x}^{(i)} - \hat{\mathbf{x}}^{(i)}\|^2$$
This accounts for overfitting: more components always reduce training error but may not generalize.
Key insight: Very small eigenvalue components may capture noise rather than signal. On held-out data, reconstructing noise hurts.
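A sketch of this procedure in plain NumPy, using synthetic data with a true 2-dimensional signal (the helper name and data-generating setup are ours, not a standard API):

```python
import numpy as np

def holdout_reconstruction_error(X_train, X_val, k):
    """Mean squared error reconstructing X_val with the top-k PCs of X_train."""
    mu = X_train.mean(axis=0)
    S = np.cov(X_train - mu, rowvar=False)
    _, eigvecs = np.linalg.eigh(S)
    V = eigvecs[:, ::-1][:, :k]       # top-k eigenvectors (d x k)
    Z = (X_val - mu) @ V              # project held-out data
    X_hat = Z @ V.T + mu              # reconstruct from k components
    return np.mean(np.sum((X_val - X_hat) ** 2, axis=1))

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 10))          # 2-dimensional latent signal
X = rng.normal(size=(300, 2)) @ W + 0.1 * rng.normal(size=(300, 10))
errs = [holdout_reconstruction_error(X[:200], X[200:], k) for k in range(1, 6)]
print(errs)  # drops sharply up to k = 2, then flattens near the noise floor
```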
Model selection criteria like AIC and BIC can be applied:
$$\text{AIC}(k) = n \log(\mathcal{E}(k)) + 2 \cdot \text{params}(k)$$

$$\text{BIC}(k) = n \log(\mathcal{E}(k)) + \log(n) \cdot \text{params}(k)$$
where params$(k)$ is the effective number of parameters with $k$ components.
BIC typically selects fewer components (stronger complexity penalty), suitable when parsimony is valued.
Compare the observed eigenvalues to those obtained from random data with the same dimensions (Horn's parallel analysis): retain only the components whose eigenvalues exceed what chance alone would produce.
This accounts for 'spurious' eigenvalues that arise from finite sample effects.
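A compact sketch of this idea, essentially Horn's parallel analysis (the function name, simulation count, and quantile are illustrative choices):

```python
import numpy as np

def parallel_analysis_k(X, n_sims=50, quantile=0.95, seed=0):
    """Keep the leading components whose correlation-matrix eigenvalues
    exceed the chosen quantile of eigenvalues from random Gaussian data
    of the same shape."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    obs = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
    sims = np.empty((n_sims, d))
    for s in range(n_sims):
        R = rng.normal(size=(n, d))
        sims[s] = np.linalg.eigvalsh(np.corrcoef(R, rowvar=False))[::-1]
    thresh = np.quantile(sims, quantile, axis=0)
    keep = obs > thresh
    # count the leading run of components above the random threshold
    return int(np.argmin(keep)) if not keep.all() else d

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(500, 8))
print(parallel_analysis_k(X))  # typically recovers the 2 signal dimensions
```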
Effective visualization is essential for understanding variance distribution and communicating results. Here are the key visualization types.
The most common visualization: eigenvalues (or PVE) plotted against component index.
Construction: sort the eigenvalues in decreasing order, plot $\lambda_j$ (or PVE$_j$) on the y-axis against the component index $j$ on the x-axis, and connect the points with a line. A log-scale y-axis helps when eigenvalues span several orders of magnitude.
Interpretation: The 'elbow' indicates where additional components contribute diminishing returns. Components before the elbow capture structure; components after may capture noise.
Shows CVE$_k$ against $k$, revealing how variance accumulates.
Construction: plot CVE$_k$ on the y-axis against $k$ on the x-axis; horizontal reference lines at common thresholds (e.g., 90%, 95%) make it easy to read off the required number of components.
Reading the plot: Steep initial rise indicates concentrated variance; gradual rise indicates distributed variance. Intersections with threshold lines indicate how many components needed.
Best practice: show both individual and cumulative variance.
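A matplotlib sketch of this best practice, with an individual-variance panel next to a cumulative panel (the eigenvalues, labels, and 90% threshold line are illustrative):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

eigvals = np.array([4.2, 2.1, 1.1, 0.6, 0.4, 0.3, 0.2, 0.1])
pve = eigvals / eigvals.sum()
cve = np.cumsum(pve)
idx = np.arange(1, len(eigvals) + 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
ax1.bar(idx, pve)                  # individual variance (scree-style)
ax1.set(xlabel="Component", ylabel="PVE", title="Variance per component")
ax2.plot(idx, cve, marker="o")     # cumulative variance
ax2.axhline(0.90, linestyle="--")  # example 90% threshold line
ax2.set(xlabel="Components kept", ylabel="CVE", title="Cumulative variance")
fig.savefig("variance_explained.png", dpi=150)
```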
| Plot | Primary Information | Use For |
|---|---|---|
| Bar chart of PVE$_j$ | Contribution of each component | Identifying important components |
| Scree plot (log scale) | Eigenvalue decay pattern | Detecting structure type |
| Cumulative line | Total variance with $k$ components | Choosing number of components |
Always show a scree plot when reporting PCA results. Use log scale on y-axis if eigenvalues span many orders of magnitude. Include both individual and cumulative views. Annotate the chosen number of components and corresponding variance explained. Report exact numbers in a table for reproducibility.
We've explored how to measure, interpret, and use variance explained for choosing the number of principal components. The key insights: PVE$_j = \lambda_j / \sum_k \lambda_k$ quantifies each component's share of total variance; the shape of the eigenvalue spectrum determines whether PCA compresses well; no single rule for choosing $k$ is universally correct, so variance thresholds, elbow detection, the Kaiser criterion, and cross-validation each have their place; and variance explained is not the same as task-relevant information, so validate against downstream performance when prediction is the goal.
We've learned to quantify variance explained. The next page covers scree plots in greater depth—understanding their construction, interpretation subtleties, and formal methods for detecting elbow points. We'll also explore how scree plots can mislead and when to trust or distrust their suggestions.